System and method for recognizing non-body text in webpage

Invention Grant

US10042827B2 System and method for recognizing non-body text in webpage (Active)

Patent Title: System and method for recognizing non-body text in webpage
Application No.: US14411013

Application Date: 2013-06-09
Publication No.: US10042827B2

Publication Date: 2018-08-07
Inventor: Zhigang Wang
Applicant: BEIJING QIHOO TECHNOLOGY COMPANY LIMITED
Applicant Address: CN Beijing
Assignee: Beijing Qihoo Technology Company Limited
Current Assignee: Beijing Qihoo Technology Company Limited
Current Assignee Address: CN Beijing
Agency: Polsinelli PC
Priority: CN201210214385 20120625
International Application: PCT/CN2013/077102 WO 20130609
International Publication: WO2014/000571 WO 20140103
Main IPC: G06F17/20
IPC: G06F17/20 ; G06F17/22 ; G06F17/27 ; G06F17/30

Abstract:

The invention discloses a system and method for recognizing non-body text in a webpage, and relates to the field of main body extraction. The system comprises: a webpage grabber configured to grab the data of all webpages of a target website; a DOM tree construction unit configured to construct a DOM tree for each webpage of the target website; a DOM tree analysis unit configured to find the unit text sections in each webpage according to its DOM tree; a text statistics unit configured to count the number of occurrences of each unit text section across all webpages of the target website; and a text recognition unit configured to recognize a unit text section as non-body text when its number of occurrences is greater than a predetermined threshold. The system and method overcome the recognition lag of prior-art methods for recognizing non-body text, and achieve high recognition accuracy.
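
The counting-and-threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not define a "unit text section", so the sketch assumes each whitespace-normalized text node is one section, and it assumes an occurrence is counted once per page. The names TextSectionParser, unit_text_sections and find_non_body_text are hypothetical, the pages are passed in as already-fetched HTML strings (the grabber is omitted), and Python's standard html.parser stands in for full DOM tree construction.

```python
from collections import Counter
from html.parser import HTMLParser


class TextSectionParser(HTMLParser):
    """Collects text nodes of a page (assumed stand-in for DOM tree analysis)."""

    SKIP_TAGS = {"script", "style"}  # assumption: their content is never text

    def __init__(self):
        super().__init__()
        self.sections = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace so identical sections compare equal.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.sections.append(text)


def unit_text_sections(html):
    """Return the unit text sections (here: text nodes) found in one page."""
    parser = TextSectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


def find_non_body_text(pages, threshold):
    """Return sections that occur in more than `threshold` pages of a site."""
    counts = Counter()
    for html in pages:
        # Assumption: a section is counted once per page it appears in.
        counts.update(set(unit_text_sections(html)))
    return {text for text, n in counts.items() if n > threshold}


if __name__ == "__main__":
    site = [
        "<html><body><div>Home | About</div><p>Story one</p>"
        "<div>Copyright Example</div></body></html>",
        "<html><body><div>Home | About</div><p>Story two</p>"
        "<div>Copyright Example</div></body></html>",
        "<html><body><div>Home | About</div><p>Story three</p>"
        "<div>Copyright Example</div></body></html>",
    ]
    print(sorted(find_non_body_text(site, threshold=2)))
```

In this toy site the navigation bar and copyright line appear on every page and exceed the threshold, while each story paragraph appears only once and is kept as body text. How the threshold is chosen is not stated in the abstract, so the value used here is purely illustrative.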

Public/Granted literature

US20150205769A1 System and method for recognizing non-body text in webpage; Public/Granted day: 2015-07-23
