Invention Grant
- Patent Title: Table of contents extraction based on textual similarity and formal aspects
- Application No.: US11923904
- Application Date: 2007-10-25
- Publication No.: US09224041B2
- Publication Date: 2015-12-29
- Inventor: Herve Dejean, Jean-Luc Meunier
- Applicant: Herve Dejean, Jean-Luc Meunier
- Applicant Address: US CT Norwalk
- Assignee: XEROX CORPORATION
- Current Assignee: XEROX CORPORATION
- Current Assignee Address: US CT Norwalk
- Agency: Fay Sharpe LLP
- Main IPC: G06F17/27
- IPC: G06F17/27; G06K9/00

Abstract:
An initial organizational table for a document is determined based on textual similarity between entries of the organizational table and target text fragments, without taking text formatting into account. A classifier is then trained, based at least in part on text formatting features, to identify text fragment pairs consisting of an entry of the organizational table and its corresponding target text fragment. The training employs a training set of examples annotated based on the initial organizational table. The initial organizational table is updated using the trained classifier.
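
The two-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The `Fragment` type, the `bold` and `font_size` features, the 0.8 threshold, and the smoothed frequency-ratio scorer standing in for the classifier are all assumptions made for illustration. Textual similarity uses the standard library's `difflib.SequenceMatcher`.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Fragment:
    text: str
    bold: bool = False   # hypothetical formatting feature
    font_size: int = 10  # hypothetical formatting feature


def similarity(a, b):
    """Textual similarity only; formatting is deliberately ignored."""
    return SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()


def candidates(entry, body, threshold):
    """Indices of body fragments whose text is close enough to the entry."""
    return [j for j, frag in enumerate(body) if similarity(entry, frag) >= threshold]


def initial_table(entries, body, threshold=0.8):
    """Pass 1: link each entry to its most similar fragment (first one on ties)."""
    table = {}
    for i, entry in enumerate(entries):
        found = candidates(entry, body, threshold)
        if found:
            table[i] = max(found, key=lambda j: similarity(entry, body[j]))
    return table


def pair_features(entry, frag):
    """Formatting features of an (entry, target fragment) pair."""
    return (frag.bold, frag.font_size > entry.font_size)


def train(entries, body, table):
    """Count feature patterns over pairs labeled by the initial table."""
    pos, neg = Counter(), Counter()
    for i, entry in enumerate(entries):
        for j, frag in enumerate(body):
            bucket = pos if table.get(i) == j else neg
            bucket[pair_features(entry, frag)] += 1
    return pos, neg


def score(model, feats):
    """Smoothed ratio of how often a pattern occurs in linked vs. unlinked pairs."""
    pos, neg = model
    p = (pos[feats] + 1) / (sum(pos.values()) + 2)
    n = (neg[feats] + 1) / (sum(neg.values()) + 2)
    return p / n


def update_table(entries, body, model, threshold=0.8):
    """Pass 2: among textually plausible targets, keep the best-scoring one."""
    table = {}
    for i, entry in enumerate(entries):
        found = candidates(entry, body, threshold)
        if found:
            table[i] = max(found, key=lambda j: score(model, pair_features(entry, body[j])))
    return table


# Toy document: table entries followed by body fragments.
entries = [Fragment(t) for t in ("Introduction", "Methods", "Results", "Conclusion")]
body = [
    Fragment("Introduction", bold=True, font_size=14),
    Fragment("This paper studies table of contents extraction."),
    Fragment("Methods", bold=True, font_size=14),
    Fragment("Results"),  # stray plain-text line, e.g. a running header
    Fragment("We describe the approach."),
    Fragment("Results", bold=True, font_size=14),
    Fragment("Conclusion", bold=True, font_size=14),
]
initial = initial_table(entries, body)
model = train(entries, body, initial)
updated = update_table(entries, body, model)
print(initial)
print(updated)
```

In this toy document the text-only pass links "Results" to the stray plain-text line, because both candidates match the entry text equally well. The scorer trained on the mostly correct initial links favors bold, larger fragments, so the second pass re-links the entry to the actual heading.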
Public/Granted literature
- US20090110268A1 TABLE OF CONTENTS EXTRACTION BASED ON TEXTUAL SIMILARITY AND FORMAL ASPECTS (Public/Granted day: 2009-04-30)