System and method for recognizing non-body text in webpage
Abstract:
The invention discloses a system and method for recognizing non-body text in a webpage, and relates to the field of webpage body-text extraction. The system comprises: a webpage grabber configured to fetch the data of all webpages of a target website; a DOM tree construction unit configured to construct a DOM tree for each webpage of the target website; a DOM tree analysis unit configured to identify the unit text sections in each webpage according to its DOM tree; a text statistics unit configured to count the number of occurrences of each unit text section across all webpages of the target website; and a text recognition unit configured to recognize a unit text section as non-body text when its number of occurrences is greater than a predetermined threshold. The system and method overcome the recognition lag of prior-art methods for non-body text and achieve high recognition accuracy.
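The counting and thresholding steps described above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the pages have already been fetched, it treats each whitespace-normalized text node as a "unit text section", it counts the number of pages in which a section appears, and it reads text nodes from the standard library's `html.parser` event stream rather than building a full DOM tree. The names `UnitTextCollector`, `unit_text_sections`, and `find_non_body_text` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class UnitTextCollector(HTMLParser):
    """Collects whitespace-normalized text nodes, skipping script/style.

    Assumption: one text node corresponds to one unit text section.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.sections.append(text)


def unit_text_sections(html):
    """Return the unit text sections found in one page's HTML."""
    collector = UnitTextCollector()
    collector.feed(html)
    collector.close()
    return collector.sections


def find_non_body_text(pages, threshold):
    """Return sections that occur on more than `threshold` pages.

    Assumption: a section is counted at most once per page.
    """
    page_counts = Counter()
    for html in pages:
        page_counts.update(set(unit_text_sections(html)))
    return {text for text, count in page_counts.items() if count > threshold}
```

Called with a list of HTML strings from one site and a threshold of, say, 2, `find_non_body_text` returns the text sections shared by more than two pages, which under the abstract's criterion would be navigation bars, footers, and similar repeated text. How the threshold should scale with the size of the site is not stated in the abstract and is left to the caller here.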