Invention Grant
US09122898B2 Systems and methods for processing documents of unknown or unspecified format (in force)

Abstract:
A computer-implemented method for extracting meaningful text from a document of unknown or unspecified format. In a particular embodiment, the method includes reading the document, thereby to extract raw encoded text, analysing the raw encoded text, thereby to identify one or more text chunks, and, for a given chunk, performing compression identification analysis to determine whether compression is likely. The method can further include performing a decompression process, performing an encoding identification process, thereby to identify a likely character encoding protocol, and converting the chunk using the identified likely character encoding protocol, thereby to output the chunk as readable text.
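The per-chunk steps named in the abstract (compression identification, decompression, encoding identification, conversion) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the compression check assumes zlib/deflate only, and the encoding check assumes a simple order of byte-order marks, then UTF-8, then Latin-1 as a fallback. Chunk identification is left out because it depends on the container format.

```python
import codecs
import zlib


def looks_zlib_compressed(chunk: bytes) -> bool:
    """Heuristic only: does the chunk start with a plausible zlib header?"""
    if len(chunk) < 2:
        return False
    cmf, flg = chunk[0], chunk[1]
    # RFC 1950: low nibble 8 means deflate, and the two header bytes,
    # read as a big-endian number, must be a multiple of 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def maybe_decompress(chunk: bytes) -> bytes:
    """Decompress when compression looks likely; otherwise return the chunk."""
    if looks_zlib_compressed(chunk):
        try:
            return zlib.decompress(chunk)
        except zlib.error:
            pass  # the header matched by chance; treat as uncompressed
    return chunk


def guess_encoding(chunk: bytes) -> str:
    """Assumed detection order: byte-order marks, strict UTF-8, Latin-1."""
    if chunk.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if chunk.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        chunk.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # maps every byte, so decoding cannot fail


def chunk_to_text(chunk: bytes) -> str:
    """Run one chunk through decompression and decoding to readable text."""
    data = maybe_decompress(chunk)
    return data.decode(guess_encoding(data))
```

For example, `chunk_to_text(zlib.compress("héllo".encode("utf-8")))` recovers the original string, while a chunk whose first two bytes merely resemble a zlib header falls through to plain decoding. A production system would likely need more compression formats and a statistical encoding detector.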