System and method to extract information from unstructured image documents

Invention Grant

US11769341B2 System and method to extract information from unstructured image documents (in force)

Patent Title: System and method to extract information from unstructured image documents
Application No.: US17405964

Application Date: 2021-08-18
Publication No.: US11769341B2

Publication Date: 2023-09-26
Inventor: Yashu Seth, Ravil Kashyap, Shaik Kamran Moinuddin, Vijayendra Mysore Shamanna, Henry Thomas Peter, Simha Sadasiva
Applicant: Ushur, Inc.
Applicant Address: US CA Santa Clara
Assignee: Ushur, Inc.
Current Assignee: Ushur, Inc.
Current Assignee Address: US CA Santa Clara
Agency: Lowenstein Sandler LLP
Agent: Madhumita Datta
Main IPC: G06V30/40
IPC: G06V30/40; G06T7/10; G06V10/94; G06F18/24; G06V30/19; G06V30/148

Abstract:

The present disclosure relates to a system and method to extract information from unstructured image documents. The extraction technique is content-driven and not dependent on the layout of a particular image document type. The disclosed method breaks down an image document into smaller images using a text cluster detection algorithm. The smaller images are converted into text samples using optical character recognition (OCR). Each of the text samples is fed to a trained machine learning model. The model classifies each text sample into one of a plurality of pre-determined field types. The desired value extraction problem may be converted into a question-answering problem using a pre-trained model. A fixed question is formed on the basis of the classified field type. The output of the question-answering model may be passed through a rule-based post-processing step to obtain the final answer.
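
The stages that follow OCR in the abstract (classify each text sample, form a fixed question per field type, answer it, then post-process by rule) can be illustrated with a minimal sketch. It is not the patented implementation. The field types, questions, regular expressions, and the functions toy_classifier, toy_qa, post_process, and extract_fields are all hypothetical. The classifier and question-answering functions are simple stand-ins for the trained and pre-trained models the abstract mentions, and the input is assumed to be text already produced by OCR.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical field types and fixed questions; the abstract does not list any.
QUESTIONS: Dict[str, str] = {
    "invoice_number": "What is the invoice number?",
    "total_amount": "What is the total amount?",
}

# Hypothetical rule-based post-processing patterns, one per field type.
POST_RULES: Dict[str, "re.Pattern[str]"] = {
    "invoice_number": re.compile(r"[A-Z]{2,}-\d+"),
    "total_amount": re.compile(r"\d+(?:\.\d{2})?"),
}


@dataclass
class Extraction:
    field_type: str
    question: str
    answer: str


def toy_classifier(text: str) -> str:
    """Keyword stand-in for the trained classification model."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice_number"
    if "total" in lowered:
        return "total_amount"
    return "other"


def toy_qa(question: str, context: str) -> str:
    """Stand-in for a pre-trained QA model: returns the text after the first colon."""
    return context.split(":", 1)[-1].strip()


def post_process(field_type: str, raw_answer: str) -> str:
    """Keep only the part of the raw answer that matches the field's rule, if any."""
    pattern = POST_RULES.get(field_type)
    if pattern is None:
        return raw_answer
    match = pattern.search(raw_answer)
    return match.group(0) if match else raw_answer


def extract_fields(
    text_samples: List[str],
    classify: Callable[[str], str] = toy_classifier,
    answer: Callable[[str, str], str] = toy_qa,
) -> List[Extraction]:
    """Classify each OCR text sample, ask the fixed question, post-process the answer."""
    results: List[Extraction] = []
    for sample in text_samples:
        field_type = classify(sample)
        question = QUESTIONS.get(field_type)
        if question is None:
            continue  # sample does not belong to a field type of interest
        raw = answer(question, sample)
        results.append(Extraction(field_type, question, post_process(field_type, raw)))
    return results


if __name__ == "__main__":
    samples = ["Invoice No: INV-1042", "Total due: USD 310.50", "Thank you for your business"]
    for item in extract_fields(samples):
        print(item.field_type, "->", item.answer)
```

Because the classifier and the question-answering step are passed in as callables, real models could be substituted without changing the control flow. The order of the stages is the only part taken from the abstract.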

Public/Granted literature

US20220058383A1 SYSTEM AND METHOD TO EXTRACT INFORMATION FROM UNSTRUCTURED IMAGE DOCUMENTS Public/Granted day: 2022-02-24

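IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06V	Image or video recognition or understanding
G06V30/00	Character recognition; recognising digital ink; document-oriented image-based pattern recognition (scanning, transmission or reproduction of documents or the like H04N1/00)
G06V30/40	.Document-oriented image-based pattern recognition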