Invention Grant
- Patent Title: Method to identify common structures in formatted text documents
- Application No.: US12634176
- Application Date: 2009-12-09
- Publication No.: US08356045B2
- Publication Date: 2013-01-15
- Inventor: Yuan-chi Chang, Debdoot Mukherjee, Vibha Singhal Sinha, Biplav Srivastava
- Applicant: Yuan-chi Chang, Debdoot Mukherjee, Vibha Singhal Sinha, Biplav Srivastava
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agency: McGinn IP Law Group, PLLC
- Agent: Preston J. Young, Esq.
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
A computer-implemented method, computer program product, and data processing system for identifying common structures shared across a plurality of formatted text documents. The common structure is presented as a sequence of landmarks, each of which has a starting and an ending marker that delimit a span of text. The common structure is identified by counting the occurrences of repeating text segments across documents. Adjacent segments that frequently co-occur become candidates for landmark markers. In addition, styling information for the textual content within a landmark is extracted and mapped to rules. The rules are used to merge and summarize content from multiple documents, which is an advantage over the current practice of content concatenation.
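
The counting step described in the abstract can be illustrated with a minimal sketch. This is not the patented method itself. It assumes that each non-empty line of a document is one text segment and that a pair of adjacent segments is a marker candidate when it appears in at least a given fraction of the documents. The function name `candidate_markers`, the `min_support` parameter, and the sample documents are hypothetical and do not come from the patent.

```python
from collections import Counter


def candidate_markers(documents, min_support=0.6):
    """Return adjacent segment pairs found in at least min_support of the documents.

    Assumption (not from the patent): a segment is one non-empty,
    whitespace-stripped line of a document.
    """
    pair_counts = Counter()
    for doc in documents:
        segments = [line.strip() for line in doc.splitlines() if line.strip()]
        # Use a set so each adjacent pair is counted once per document.
        pairs = set(zip(segments, segments[1:]))
        pair_counts.update(pairs)
    threshold = min_support * len(documents)
    return sorted(pair for pair, count in pair_counts.items() if count >= threshold)


# Hypothetical sample documents that share a heading structure.
docs = [
    "<h1>\nReport A\n</h1>\n<h2>\nSummary\n</h2>\nAlpha text",
    "<h1>\nReport B\n</h1>\n<h2>\nSummary\n</h2>\nBeta text",
    "<h1>\nReport C\n</h1>\n<h2>\nSummary\n</h2>\nGamma text",
]

for first, second in candidate_markers(docs):
    print(repr(first), "->", repr(second))
```

In this sketch, pairs such as `<h2>` followed by `Summary` survive the threshold because they recur in every sample document, while pairs that involve document-specific text (for example `Report A`) are dropped. Building landmarks from these candidates and extracting styling rules are outside the scope of the sketch.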
Public/Granted literature
- US20110137900A1: METHOD TO IDENTIFY COMMON STRUCTURES IN FORMATTED TEXT DOCUMENTS (Public/Granted day: 2011-06-09)