Invention Grant
- Patent Title: Method and apparatus for structuring documents based on layout, content and collection
- Application No.: US11033016
- Application Date: 2005-01-10
- Publication No.: US07693848B2
- Publication Date: 2010-04-06
- Inventor: Hervé Déjean, Veronika Lux, Sandrine Ribeau
- Applicant: Hervé Déjean, Veronika Lux, Sandrine Ribeau
- Applicant Address: US CT Norwalk
- Assignee: Xerox Corporation
- Current Assignee: Xerox Corporation
- Current Assignee Address: US CT Norwalk
- Agency: Fay Sharpe LLP
- Main IPC: G06F17/00
- IPC: G06F17/00; G06F17/24

Abstract:
A method and apparatus are provided for converting a document in a first format, essentially comprising a flat layout structure, into a structured document in a hierarchical form in accordance with predetermined attributes identified from the input format. The process comprises fragmenting the input document into a plurality of document content elements in accordance with a predetermined set of document attributes identifiable from the input document format. The content elements are clustered into selective sets having similar document attributes. The clustered sets are validated with reference to textual properties and organizational content common to documents in the collection. The clustered sets are then categorized into predetermined categories comprising structured elements of the structured document format, and the document content elements are organized by hierarchical dependency from the predetermined categories, wherein the organized document elements comprise the desired structured document format.
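
The fragment, cluster, categorize and organize steps named in the abstract can be illustrated with a toy sketch. This is not the patented method: the `Fragment` and `Node` types, the choice of font size and boldness as the layout attributes, and the rule that a larger font means a higher level in the hierarchy are all assumptions made for illustration. The validation step against the document collection is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fragment:
    """One content element of the flat input (hypothetical attributes)."""
    text: str
    font_size: int
    bold: bool


@dataclass
class Node:
    """One element of the hierarchical output."""
    label: str
    text: str
    children: list = field(default_factory=list)


def layout_key(frag):
    # Assumption: two fragments are "similar" when these attributes match.
    return (frag.font_size, frag.bold)


def cluster_by_layout(fragments):
    """Group fragments that share the same layout attributes."""
    clusters = defaultdict(list)
    for frag in fragments:
        clusters[layout_key(frag)].append(frag)
    return dict(clusters)


def categorize(clusters):
    """Assign a level to each cluster: largest font gets level 0."""
    ordered = sorted(clusters, reverse=True)
    return {key: level for level, key in enumerate(ordered)}


def build_tree(fragments, levels):
    """Nest each fragment under the nearest preceding fragment of a higher level."""
    root = Node("document", "")
    stack = [(-1, root)]  # (level, node) pairs along the current path
    for frag in fragments:
        level = levels[layout_key(frag)]
        # Climb back up until the top of the stack is a strictly higher level.
        while stack[-1][0] >= level:
            stack.pop()
        node = Node(f"level{level}", frag.text)
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


# Usage: a flat sequence of fragments in reading order.
flat = [
    Fragment("1 Intro", 14, True),
    Fragment("Text a", 10, False),
    Fragment("1.1 Scope", 12, True),
    Fragment("Text b", 10, False),
    Fragment("2 Method", 14, True),
    Fragment("Text c", 10, False),
]
tree = build_tree(flat, categorize(cluster_by_layout(flat)))
```

Here the two 14-point bold fragments become top-level children of the document node, and "1.1 Scope" is nested under "1 Intro" together with the body text that precedes it. A real system would need a validation pass before trusting any cluster as a structural category, which this sketch does not attempt.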
Public/Granted literature
- US20060155700A1 Method and apparatus for structuring documents based on layout, content and collection. Public/Granted day: 2006-07-13