Method and apparatus for web crawling

Invention Grant

US08712992B2 Method and apparatus for web crawling (in force)

Patent Title: Method and apparatus for web crawling
Application No.: US12413528

Application Date: 2009-03-28
Publication No.: US08712992B2

Publication Date: 2014-04-29
Inventor: Alexey Maykov, Matthew F. Hurst
Applicant: Alexey Maykov, Matthew F. Hurst
Applicant Address: US WA Redmond
Assignee: Microsoft Corporation
Current Assignee: Microsoft Corporation
Current Assignee Address: US WA Redmond
Agent: Steve Spellman; Jim Ross; Micky Minhas
Main IPC: G06F7/00
IPC: G06F7/00; G06F7/08

Abstract:

A method and system for retrieving data from a webpage is described herein. A scheduler organizes, or rather orders, a group of webpage identifiers according to some predetermined criteria. Based upon this ordering, a fetcher may be configured to fetch data from webpages identified by the identifiers. To promote efficiency and reduce the latency between when a webpage is updated and when the fetcher retrieves data from the webpage, the scheduler may be configured to reorder the identifiers in such a manner that it causes an identifier that was less relevant, and would not have been sent to the fetcher, to become more relevant. In this way, the method and system may be particularly useful for retrieving data related to webpages that are updated frequently, such as social media webpages, for example.
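
The scheduler and fetcher interaction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the numeric relevance score, the `boost` parameter, the class and function names, and the example identifiers are all hypothetical, since the abstract only says that identifiers are ordered by "some predetermined criteria" and may be reordered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Scheduler:
    """Orders webpage identifiers by a relevance score (assumed criterion)."""
    scores: Dict[str, float] = field(default_factory=dict)

    def add(self, identifier: str, score: float) -> None:
        # Register an identifier with its initial relevance.
        self.scores[identifier] = score

    def ordered(self) -> List[str]:
        # Highest score first; ties are broken by identifier for determinism.
        return sorted(self.scores, key=lambda i: (-self.scores[i], i))

    def notify_update(self, identifier: str, boost: float = 10.0) -> None:
        # Reordering step: a page reported as updated becomes more relevant.
        # The additive boost is an assumption made for this sketch.
        self.scores[identifier] = self.scores.get(identifier, 0.0) + boost

    def next_batch(self, size: int) -> List[str]:
        # Only the top `size` identifiers are handed to the fetcher.
        return self.ordered()[:size]


def fetch(identifiers: List[str], download: Callable[[str], str]) -> Dict[str, str]:
    # The fetcher retrieves data for each identifier it was given.
    # `download` is injected so the sketch needs no network access.
    return {i: download(i) for i in identifiers}
```

In this sketch, an identifier with a low score is left out of `next_batch` until `notify_update` raises its score, which mirrors the abstract's point that a less relevant identifier, one that would not have been sent to the fetcher, can become more relevant after reordering.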

Public/Granted literature

US20100250516A1 METHOD AND APPARATUS FOR WEB CRAWLING Public/Granted day: 2010-09-30

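IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models, see G06N)
G06F7/00	Methods or arrangements for processing data by operating upon the order or content of the data handled (logic circuits, see H03K19/00)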