Invention Grant
US07680785B2 Systems and methods for inferring uniform resource locator (URL) normalization rules
Active
- Patent Title: Systems and methods for inferring uniform resource locator (URL) normalization rules
- Application No.: US11089988
- Application Date: 2005-03-25
- Publication No.: US07680785B2
- Publication Date: 2010-03-16
- Inventor: Marc Alexander Najork
- Applicant: Marc Alexander Najork
- Applicant Address: US WA Redmond
- Assignee: Microsoft Corporation
- Current Assignee: Microsoft Corporation
- Current Assignee Address: US WA Redmond
- Agency: Woodcock Washburn LLP
- Main IPC: G06F17/30
- IPC: G06F17/30 ; G06F17/00 ; G06F17/20

Abstract:
Different URLs that reference the same web page or other web resource are detected, and that information is used to download only one instance of the page or resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once substantially identical web pages or web resources with different URLs are found, those URLs are analyzed to identify which portions of a URL are essential for identifying a particular web page or web resource and which portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), the per-equivalence-class rules are generalized to trans-equivalence-class rules. There are thus two rule-learning steps: step (1) learns, for each equivalence class, which portions of the URLs in that class are relevant for selecting the page and which are not; step (2) generalizes the per-equivalence-class rules constructed during step (1) to rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler in future crawls to avoid downloading duplicate web pages or web resources.
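
The two rule-learning steps and the error check described in the abstract can be illustrated with a small sketch. This is not the patented method, only a simplified reading of it under loud assumptions: exact content hashing stands in for "substantially identical", only query-string parameters are treated as URL "portions", and every function name, the sample URLs, and the `sid` parameter are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def equivalence_classes(pages):
    """Group URLs by page content (assumption: an exact hash approximates
    the abstract's 'substantially identical' comparison)."""
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return list(groups.values())


def query_params(url):
    """Return the query string of a URL as a name -> value dict."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))


def irrelevant_params(urls):
    """Step (1): within one equivalence class, a parameter whose value
    differs between URLs cannot be what selects the page."""
    params = [query_params(u) for u in urls]
    names = set().union(*params)
    return {n for n in names if len({p.get(n) for p in params}) > 1}


def generalize(classes):
    """Step (2): merge the per-class findings into one candidate rule
    (assumption: a plain union over all multi-URL classes)."""
    candidate = set()
    for urls in classes:
        if len(urls) > 1:
            candidate |= irrelevant_params(urls)
    return candidate


def normalize(url, drop):
    """Apply a rule: remove the parameters named in `drop`."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))


def rule_has_errors(pages, drop):
    """A rule is in error if it maps two URLs with different content
    to the same normalized URL."""
    seen = {}
    for url, body in pages.items():
        if seen.setdefault(normalize(url, drop), body) != body:
            return True
    return False


# Hypothetical crawl result: `sid` behaves like a session identifier.
pages = {
    "http://example.com/item?id=1&sid=aaa": "Item one",
    "http://example.com/item?id=1&sid=bbb": "Item one",
    "http://example.com/item?id=2&sid=ccc": "Item two",
    "http://example.com/item?id=2&sid=ddd": "Item two",
}
rule = generalize(equivalence_classes(pages))
active = not rule_has_errors(pages, rule)  # activate only if error-free
```

In this sketch the rule is activated only after it has been checked against the downloaded pages, mirroring the abstract's validation step; a crawler could then call `normalize` on each newly discovered URL before deciding whether to download it.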
Public/Granted literature
- US20060218143A1 Systems and methods for inferring uniform resource locator (URL) normalization rules (Public/Granted day: 2006-09-28)