Systems and methods for inferring uniform resource locator (URL) normalization rules

Invention Grant

US07680785B2 Systems and methods for inferring uniform resource locator (URL) normalization rules (status: in force)

Patent Title: Systems and methods for inferring uniform resource locator (URL) normalization rules
Application No.: US11089988

Application Date: 2005-03-25
Publication No.: US07680785B2

Publication Date: 2010-03-16
Inventor: Marc Alexander Najork
Applicant: Marc Alexander Najork
Applicant Address: US WA Redmond
Assignee: Microsoft Corporation
Current Assignee: Microsoft Corporation
Current Assignee Address: US WA Redmond
Agency: Woodcock Washburn LLP
Main IPC: G06F17/30
IPC: G06F17/30 ; G06F17/00 ; G06F17/20

Abstract:

Different URLs that actually reference the same web page or other web resource are detected and that information is used to only download one instance of a web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once identical web pages or web resources with different URLs are found, the different URLs are then analyzed to identify what portions of the URL are essential for identifying a particular web page or web resource, and what portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), these per-equivalence-class rules are generalized to trans-equivalence-class rules. There are two rule-learning steps: step (1), where it is learned for each equivalence class what portions of the URLs in that class are relevant for selecting the page and what portions are not; and step (2), where the per-equivalence-class rules constructed during step (1) are generalized to rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler for future crawling to avoid the download of duplicative web pages or web resources.
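
The two rule-learning steps and the error check described in the abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes, purely for illustration, that the irrelevant portions of a URL are query parameters, and it uses an exact content hash as a stand-in for the "substantially identical" comparison. All function names and the example.com URLs are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def equivalence_classes(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose bodies hash identically (assumed stand-in for
    the 'substantially identical' test)."""
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return list(groups.values())


def varying_params(urls: list[str]) -> set[str]:
    """Step 1: query parameters whose value (or presence) differs
    between URLs of one equivalence class."""
    parsed = [dict(parse_qsl(urlsplit(u).query)) for u in urls]
    names = set().union(*parsed) if parsed else set()
    return {n for n in names if len({p.get(n) for p in parsed}) > 1}


def normalize(url: str, drop: set[str]) -> str:
    """Apply a rule: remove the parameters listed in `drop`."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def has_errors(classes: list[list[str]], drop: set[str]) -> bool:
    """A rule is in error if it maps URLs from different equivalence
    classes to the same normalized URL."""
    owner: dict[str, int] = {}
    for idx, urls in enumerate(classes):
        for url in urls:
            if owner.setdefault(normalize(url, drop), idx) != idx:
                return True
    return False


def infer_rules(pages: dict[str, str]) -> set[str]:
    """Step 2: pool the per-class candidates and keep only those that
    produce no errors across all classes."""
    classes = equivalence_classes(pages)
    candidates = set()
    for urls in classes:
        candidates |= varying_params(urls)
    accepted: set[str] = set()
    for name in sorted(candidates):
        if not has_errors(classes, accepted | {name}):
            accepted.add(name)
    return accepted


# Hypothetical crawl result: `sid` is a session id that does not affect content.
pages = {
    "http://example.com/item?id=1&sid=aaa": "Item one",
    "http://example.com/item?id=1&sid=bbb": "Item one",
    "http://example.com/item?id=2&sid=ccc": "Item two",
    "http://example.com/item?id=2&sid=ddd": "Item two",
}
print(infer_rules(pages))  # {'sid'}
```

In this sketch a crawler would call `normalize` with the accepted set before fetching, so that URLs differing only in `sid` are downloaded once. Testing candidates cumulatively is a design choice of the sketch, made so that the activated set as a whole never merges two distinct pages.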

Public/Granted literature

US20060218143A1 Systems and methods for inferring uniform resource locator (URL) normalization rules. Public/Granted day: 2006-09-28
