Invention Grant
US08356045B2 Method to identify common structures in formatted text documents (status: no longer in force)

Method to identify common structures in formatted text documents
Abstract:
A computer-implemented method, computer program product, and data processing system for identifying common structures shared across a plurality of formatted text documents. The common structure is presented as a sequence of landmarks, each of which has a starting marker and an ending marker that delimit a span of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Adjacent segments that frequently co-occur become candidates for landmark markers. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which is an advantage over the current practice of concatenating content.
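The counting step described in the abstract can be illustrated with a minimal sketch. It is not the patented method itself: it assumes each document has already been split into a list of text segments, and the function name `candidate_markers`, the `min_docs` threshold, and the sample documents are all hypothetical choices made for illustration. The sketch counts, per document, each pair of adjacent segments and keeps the pairs that appear in at least `min_docs` documents as marker candidates.

```python
from collections import Counter


def candidate_markers(docs, min_docs=2):
    """Return adjacent segment pairs found in at least `min_docs` documents.

    `docs` is a list of documents, each a list of text segments.
    Both the name and the threshold are illustrative assumptions.
    """
    pair_counts = Counter()
    for segments in docs:
        # Use a set so a pair is counted once per document
        # (document frequency, not raw frequency).
        pairs = set(zip(segments, segments[1:]))
        pair_counts.update(pairs)
    return {pair for pair, n in pair_counts.items() if n >= min_docs}


# Hypothetical pre-segmented documents sharing the same layout.
docs = [
    ["<h1>", "Report A", "</h1>", "<p>", "alpha", "</p>", "<footer>", "</footer>"],
    ["<h1>", "Report B", "</h1>", "<p>", "beta", "</p>", "<footer>", "</footer>"],
    ["<h1>", "Report C", "</h1>", "<p>", "gamma", "</p>", "<footer>", "</footer>"],
]

markers = candidate_markers(docs, min_docs=3)
```

In this toy input, only the pairs made entirely of shared layout segments survive the threshold; pairs that include document-specific text such as "Report A" or "alpha" appear in a single document and are dropped. Pairing the surviving candidates into starting and ending markers of landmarks, and extracting styling rules, is outside the scope of this sketch.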