Invention Grant
- Patent Title: Automated generation of structured training data from unstructured documents
- Application No.: US16784726
- Application Date: 2020-02-07
- Publication No.: US11244203B2
- Publication Date: 2022-02-08
- Inventor: Peter Zhong, Antonio Jose Jimeno Yepes, Jianbin Tang
- Applicant: International Business Machines Corporation
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agency: Cantor Colburn LLP
- Agent: Joseph Petrokaitis
- Main IPC: G06K9/62
- IPC: G06K9/62; G06F16/332; G06F16/35; G06F40/205; G06K9/32; G06F16/93

Abstract:
Methods, systems and computer program products for automatically generating structured training data based on an unstructured document are provided. Aspects include receiving an unstructured document and a corresponding structured document that includes labeled portions. Aspects also include generating a parsed document that has one or more extracted objects by applying a parsing tool to the unstructured document. Aspects also include identifying one or more matching extracted objects by applying a matching algorithm to the structured document and the parsed document. Each matching extracted object is an extracted object of the parsed document that corresponds to a labeled portion of the structured document. Aspects also include annotating a region of the unstructured document that corresponds to the bounding box of the respective matching extracted object with a respective label of the corresponding labeled portion of the structured document.
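The matching and annotation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name a parsing tool or a matching algorithm, so the sketch assumes the parser output is already available as text blocks with bounding boxes, and it substitutes a simple fuzzy string comparison from the standard library's `difflib` as the matcher. The names `ExtractedObject`, `LabeledPortion`, `Annotation`, `similarity`, `annotate`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class ExtractedObject:
    """Text block from a (hypothetical) parsing tool, with its bounding box."""
    text: str
    page: int
    bbox: tuple  # assumed layout: (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class LabeledPortion:
    """Labeled portion taken from the structured document."""
    label: str
    text: str


@dataclass(frozen=True)
class Annotation:
    """A labeled region of the unstructured document."""
    label: str
    page: int
    bbox: tuple


def _normalize(text):
    # Lowercase and collapse whitespace so layout differences do not block a match.
    return " ".join(text.lower().split())


def similarity(a, b):
    # Assumed stand-in for the matching algorithm: difflib's similarity ratio in [0, 1].
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def annotate(extracted, labeled, threshold=0.9):
    """Label each extracted object whose best-matching labeled portion clears the threshold."""
    annotations = []
    for obj in extracted:
        best = max(labeled, key=lambda p: similarity(obj.text, p.text), default=None)
        if best is not None and similarity(obj.text, best.text) >= threshold:
            # The bounding box of the matching object becomes the annotated region.
            annotations.append(Annotation(best.label, obj.page, obj.bbox))
    return annotations


# Illustrative data only; the texts and coordinates are made up for this sketch.
extracted = [
    ExtractedObject("Automated  Generation of Training Data", 1, (72, 700, 540, 730)),
    ExtractedObject("Page 1 of 12", 1, (280, 20, 330, 32)),
]
labeled = [
    LabeledPortion("title", "Automated generation of training data"),
    LabeledPortion("abstract", "Methods and systems are provided for ..."),
]
annotations = annotate(extracted, labeled)
```

In this sketch only the first extracted block is annotated (as `title`), because the page-number block has no sufficiently similar labeled portion. A threshold below 1.0 is an assumed design choice to tolerate small differences between parser output and the structured text.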
Public/Granted literature
- US20210248420A1 AUTOMATED GENERATION OF STRUCTURED TRAINING DATA FROM UNSTRUCTURED DOCUMENTS, Public/Granted day: 2021-08-12