Invention Grant
US08819028B2 System and method for web content extraction (Active)

System and method for web content extraction
Abstract:
A method and system for extracting Web content are disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. A range of text-body associated with the identified paragraphs is then identified using a maximum scoring subsequence. Further, the identified text-body is refined using a heuristic rule of substantially horizontal alignment. Furthermore, one or more titles and one or more images associated with the Web content are extracted. Moreover, the Web content, including the identified paragraphs, the one or more titles, and the one or more images, is output.
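The text-body step in the abstract can be illustrated with a minimal sketch of a maximum scoring subsequence search. It assumes each identified paragraph has already been given a numeric score (positive for paragraphs that look like body text, negative for navigation or link-heavy blocks). The abstract does not say how paragraphs are scored, so the scores and the function name `max_scoring_subsequence` below are hypothetical and used only for illustration.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Find the contiguous run of paragraphs with the highest total score.

    Returns (start, end, total) with `end` exclusive. If no run has a
    positive total, returns the empty range (0, 0, 0.0).
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive prefix can never help; start a new run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-paragraph scores, in document order.
paragraph_scores = [-3, -1, 4, 6, -2, 5, -8, 1]
start, end, total = max_scoring_subsequence(paragraph_scores)
# Paragraphs start..end-1 form the candidate text-body range.
```

With these sample scores, the selected range covers paragraphs 2 through 5: the single low-scoring paragraph inside the body is kept because the paragraphs around it outweigh it, while the leading and trailing low-scoring blocks are excluded.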