System and method for web content extraction

Invention Grant

US08819028B2 System and method for web content extraction (Active)

Patent Title: System and method for web content extraction
Application No.: US13258482

Application Date: 2009-12-14
Publication No.: US08819028B2

Publication Date: 2014-08-26
Inventor: Ping Luo, Jian Fan, Samson J. Liu, Yuhong Xiong, Jerry J. Liu
Applicant: Ping Luo, Jian Fan, Samson J. Liu, Yuhong Xiong, Jerry J. Liu
Applicant Address: US TX Houston
Assignee: Hewlett-Packard Development Company, L.P.
Current Assignee: Hewlett-Packard Development Company, L.P.
Current Assignee Address: US TX Houston
International Application: PCT/CN2009/075545 (WO), dated 2009-12-14
International Publication: WO2011/072434 (WO), dated 2011-06-23
Main IPC: G06F17/30
IPC: G06F17/30; G06F3/12

Abstract:

A method and system for extracting Web content are disclosed. In one embodiment, Web content in a Webpage is extracted by first identifying paragraphs in the Web content based on line-break node determination. The range of the text body associated with the identified paragraphs is then identified using a maximum scoring subsequence, and the identified text body is refined using a heuristic rule of substantially horizontal alignment. One or more titles and one or more images associated with the Web content are also extracted. Finally, the Web content, including the identified paragraphs, the one or more titles, and the one or more images, is output.
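
The abstract names a maximum scoring subsequence as the way the text-body range is located, but gives no scoring details. The following is a minimal sketch of that general technique only, not the patented method: it assumes each identified paragraph has already been given a numeric score (positive for body-like text, negative for navigation or boilerplate) and returns the contiguous run of paragraphs with the highest total. The function name `max_scoring_subsequence` and the sample scores are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive. If no score is positive, the empty run
    (0, 0, 0.0) is returned.
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        # A run whose total has dropped to zero or below cannot help any
        # later run, so start a fresh run at the current paragraph.
        if run_total <= 0:
            run_start, run_total = i, 0.0
        run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-paragraph scores: menu items and footers score low,
# long prose paragraphs score high. These values are assumptions.
paragraph_scores = [-8, -7, 25, 30, -5, 40, -9, -8]
start, end, total = max_scoring_subsequence(paragraph_scores)
print(f"text body spans paragraphs {start}..{end - 1}, total score {total}")
```

In this sketch the mildly negative paragraph at index 4 stays inside the selected range because the paragraphs around it score highly, which is the usual reason to prefer a subsequence search over a per-paragraph threshold.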

Public/Granted literature

US20120303636A1 System and Method for Web Content Extraction (Public/Granted day: 2012-11-29)
