Invention Grant
- Patent Title: System and method for web content extraction
- Patent Title (中): 网页内容提取的系统和方法
- Application No.: US13258482
- Application Date: 2009-12-14
- Publication No.: US08819028B2
- Publication Date: 2014-08-26
- Inventor: Ping Luo, Jian Fan, Samson J. Liu, Yuhong Xiong, Jerry J. Liu
- Applicant: Ping Luo, Jian Fan, Samson J. Liu, Yuhong Xiong, Jerry J. Liu
- Applicant Address: US TX Houston
- Assignee: Hewlett-Packard Development Company, L.P.
- Current Assignee: Hewlett-Packard Development Company, L.P.
- Current Assignee Address: US TX Houston
- International Application: PCT/CN2009/075545 WO 20091214
- International Announcement: WO2011/072434 WO 20110623
- Main IPC: G06F17/30
- IPC: G06F17/30 ; G06F3/12

Abstract:
A method and system for extracting Web content are disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. A range of text-body associated with the identified paragraphs is then identified using a maximum scoring subsequence. Further, the identified text-body is refined using a heuristic rule of substantially horizontal alignment. Furthermore, one or more titles and one or more images associated with the Web content are extracted. Moreover, the Web content, including the identified paragraphs, the one or more titles and the one or more images, is output.
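
The text-body step in the abstract relies on a maximum scoring subsequence. The following is a minimal sketch of that general technique, not the patented implementation: it assumes each identified paragraph has already been given a numeric score (positive for body-like text, negative for navigation or advertising), and it finds the contiguous run of paragraphs with the largest total. The abstract does not say how paragraphs are scored, so the function name `max_scoring_subsequence` and the sample scores are hypothetical.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive, so the selected paragraphs are scores[start:end].
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    best_start, best_end, best_total = 0, 1, scores[0]
    run_start, run_total = 0, 0.0

    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help; start a new run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total

    return best_start, best_end, best_total


# Hypothetical per-paragraph scores (assumed, not taken from the patent):
# negative values stand for menus, ads and footers; positive for article text.
paragraph_scores = [-2.0, -1.0, 3.0, 4.0, -1.0, 5.0, -3.0, -2.0]
start, end, total = max_scoring_subsequence(paragraph_scores)
print(start, end, total)
```

In this sketch a slightly negative paragraph in the middle of the article (index 4) is kept because the paragraphs around it outweigh it, while the negative runs at the top and bottom of the page are excluded from the selected range.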
Public/Granted literature
- US20120303636A1 System and Method for Web Content Extraction, Public/Granted day: 2012-11-29