Invention Grant
- Patent Title: System and method for recognizing non-body text in webpage
- Application No.: US14411013
- Application Date: 2013-06-09
- Publication No.: US10042827B2
- Publication Date: 2018-08-07
- Inventor: Zhigang Wang
- Applicant: BEIJING QIHOO TECHNOLOGY COMPANY LIMITED
- Applicant Address: CN Beijing
- Assignee: Beijing Qihoo Technology Company Limited
- Current Assignee: Beijing Qihoo Technology Company Limited
- Current Assignee Address: CN Beijing
- Agency: Polsinelli PC
- Priority: CN201210214385 20120625
- International Application: PCT/CN2013/077102 WO 20130609
- International Announcement: WO2014/000571 WO 20140103
- Main IPC: G06F17/20
- IPC: G06F17/20 ; G06F17/22 ; G06F17/27 ; G06F17/30

Abstract:
The invention discloses a system and method for recognizing non-body text in a webpage, and relates to the field of main body text extraction. The system comprises: a webpage grabber configured to grab the data of all webpages of a target website; a DOM tree construction unit configured to construct a DOM tree for each webpage of the target website; a DOM tree analysis unit configured to find the unit text sections in each webpage according to its DOM tree; a text statistics unit configured to count the number of occurrences of each unit text section across all webpages of the target website; and a text recognition unit configured to recognize a unit text section as non-body text when its number of occurrences is greater than a predetermined threshold. The system and method overcome the recognition lag of prior-art methods for non-body text and achieve high recognition accuracy.
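
The counting-and-threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the abstract does not settle: a "unit text section" is taken to be one whitespace-normalized text node, each section is counted at most once per page, the standard-library `html.parser` event stream stands in for a full DOM tree, and the webpage grabber is omitted (pages are passed in as already-fetched HTML strings). The names `TextSectionCollector`, `unit_text_sections`, and `find_non_body_text` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextSectionCollector(HTMLParser):
    """Collects whitespace-normalized text nodes, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Assumption: one text node == one "unit text section".
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.sections.append(text)


def unit_text_sections(html):
    """Return the text sections found in one page (stand-in for DOM tree analysis)."""
    collector = TextSectionCollector()
    collector.feed(html)
    collector.close()
    return collector.sections


def find_non_body_text(pages, threshold):
    """Return sections that occur on more than `threshold` pages of the site.

    `pages` maps a page identifier (for example a URL) to its HTML source.
    """
    counts = Counter()
    for html in pages.values():
        # Assumption: count each section at most once per page.
        counts.update(set(unit_text_sections(html)))
    return {text for text, n in counts.items() if n > threshold}
```

For example, given three pages that share a menu and a footer but have different article paragraphs, `find_non_body_text(pages, 2)` would return only the menu and footer strings, since each appears on more than two pages. Counting per page rather than per raw occurrence is a design choice of this sketch, made so that a phrase repeated many times inside a single article is not mistaken for site-wide boilerplate.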
Public/Granted literature
- US20150205769A1 System and method for recognizing non-body text in webpage. Public/Granted day: 2015-07-23