Invention Grant
US08196036B2 Method and system for converting hypertext markup language web page to plain text
Status: In force
- Patent Title: Method and system for converting hypertext markup language web page to plain text
- Application No.: US12031855
- Application Date: 2008-02-15
- Publication No.: US08196036B2
- Publication Date: 2012-06-05
- Inventor: Tzu-Kuei Huang, Hong-Yang Tsai
- Applicant: Tzu-Kuei Huang, Hong-Yang Tsai
- Applicant Address: KY George Town
- Assignee: Esobi, Inc.
- Current Assignee: Esobi, Inc.
- Current Assignee Address: KY George Town
- Agency: Fox Rothschild, LLP
- Agent: Robert J. Sacco
- Priority: TW96106121A 2007-02-16
- Main IPC: G06F17/00
- IPC: G06F17/00

Abstract:
A method for converting an HTML web page to plain text includes extracting from HTML source code of the HTML web page a portion containing a plurality of character strings and tags, calculating length and position of each character string in the extracted portion so as to find a first predetermined percentage of the character strings with the longest lengths, analyzing a number of position intervals between adjacent ones of the character strings belonging to the first predetermined percentage of the character strings with the longest lengths, labeling the corresponding character strings as belonging to a same block if the number of position intervals is not greater than a second predetermined value so as to find a largest character string block, and deleting the tags in the largest character string block so as to obtain main content of the HTML web page in plain text.
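The steps in the abstract can be sketched in Python as follows. This is only an illustration under several assumptions, not the patented implementation: the function name `extract_main_text` and the parameters `top_fraction` (standing in for the "first predetermined percentage") and `max_gap` (standing in for the "second predetermined value") are hypothetical, the "position" of a character string is taken to be its index in the sequence of tags and strings, the "largest" block is taken to be the one with the most text characters, and a simple regular expression stands in for a real HTML parser.

```python
import math
import re

# A token is either one tag ("<...>") or one run of text between tags.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def extract_main_text(html, top_fraction=0.2, max_gap=6):
    """Return the text of the densest block of long strings (illustrative sketch)."""
    # Step 1: extract a portion of the source (assumed here: the <body> content).
    match = re.search(r"<body\b[^>]*>(.*?)</body\s*>", html, re.I | re.S)
    portion = match.group(1) if match else html
    # Scripts and styles are not readable text, so drop them (an assumption).
    portion = re.sub(r"<(script|style)\b.*?</\1\s*>", "", portion, flags=re.I | re.S)

    # Step 2: record the position and content of each non-empty character string.
    tokens = TOKEN_RE.findall(portion)
    strings = [
        (pos, tok.strip())
        for pos, tok in enumerate(tokens)
        if not tok.startswith("<") and tok.strip()
    ]
    if not strings:
        return ""

    # Step 3: keep the given fraction of strings with the longest lengths.
    keep = max(1, math.ceil(len(strings) * top_fraction))
    longest = sorted(strings, key=lambda s: len(s[1]), reverse=True)[:keep]
    positions = sorted(pos for pos, _ in longest)

    # Step 4: adjacent long strings whose position interval is not greater
    # than max_gap are labeled as belonging to the same block.
    blocks = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev <= max_gap:
            blocks[-1].append(cur)
        else:
            blocks.append([cur])

    # Step 5: pick the largest block (assumed: most text characters).
    text_at = dict(strings)
    best = max(blocks, key=lambda block: sum(len(text_at[p]) for p in block))

    # Step 6: delete the tags inside the block, keeping only its strings.
    start, end = best[0], best[-1]
    return "\n".join(text for pos, text in strings if start <= pos <= end)
```

Calling `extract_main_text(page_source)` on a page with navigation links around a few long paragraphs should return the paragraphs only. Both thresholds would need tuning for real pages, and HTML entities are left undecoded in this sketch.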
Public/Granted literature
- US20080201633A1 Method and system for converting hypertext markup language web page to plain text (Public/Granted day: 2008-08-21)