Systems and methods for automated end-to-end text extraction of electronic documents

Invention Grant

US12217524B2 Systems and methods for automated end-to-end text extraction of electronic documents (Active)


Patent Title: Systems and methods for automated end-to-end text extraction of electronic documents
Application No.: US17850618

Application Date: 2022-06-27
Publication No.: US12217524B2

Publication Date: 2025-02-04
Inventor: Keerthan Ramnath, Punitha Chandrasekar, Hui Su, Shyam Subramanian, Rachna Saxena, Mohamed Mahdi Alouane, Vinay Iyengar
Applicant: FMR LLC
Applicant Address: US MA Boston
Assignee: FMR LLC
Current Assignee: FMR LLC
Current Assignee Address: US MA Boston
Agency: Cesari and McKenna, LLP
Main IPC: G06V30/414
IPC: G06V30/414; G06F40/232; G06F40/263; G06F40/284


Abstract:

Systems and methods for extracting data from electronic documents using optical character recognition (OCR) and non-OCR based text extraction. A server computing device initiates non-OCR based text extraction for each page of an electronic document. The server calculates a document text coverage percentage corresponding to the non-OCR based text extraction for the whole document and, in response to determining that the document text coverage percentage is below a first threshold, initiates OCR for the document. The server calculates a page text coverage percentage corresponding to the non-OCR based text extraction for one or more pages of the electronic document and, in response to determining that the page text coverage percentage is below a second threshold, initiates OCR for the pages. The server combines first text extracted from the electronic document using non-OCR based text extraction and second text extracted from the electronic document using OCR.
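The two-threshold decision described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the `Page` type, the `ocr_page` callback, the default threshold values, and the choice to compute document coverage as the mean of per-page coverage are all assumptions, since the abstract does not define how coverage is calculated or what the thresholds are.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    number: int
    native_text: str     # text obtained by non-OCR based extraction
    coverage_pct: float  # assumed: text coverage of this page, 0-100


def extract_document(
    pages: List[Page],
    ocr_page: Callable[[int], str],  # hypothetical OCR hook: page number -> text
    doc_threshold: float = 50.0,     # hypothetical "first threshold"
    page_threshold: float = 30.0,    # hypothetical "second threshold"
) -> List[str]:
    """Return one text string per page, mixing non-OCR and OCR output."""
    if not pages:
        return []

    # Assumption: document coverage is the mean of the per-page values.
    doc_coverage = sum(p.coverage_pct for p in pages) / len(pages)

    # Document-level check: low overall coverage triggers OCR for the document.
    if doc_coverage < doc_threshold:
        return [ocr_page(p.number) for p in pages]

    # Page-level check: OCR only the pages whose own coverage is low,
    # and keep the non-OCR text for the rest.
    combined = []
    for p in pages:
        if p.coverage_pct < page_threshold:
            combined.append(ocr_page(p.number))
        else:
            combined.append(p.native_text)
    return combined
```

In this sketch the OCR output replaces the non-OCR text on any page that is sent to OCR; the abstract only says the two text sources are combined, so a real system might merge them differently.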

Public/Granted literature

US20230419711A1 SYSTEMS AND METHODS FOR AUTOMATED END-TO-END TEXT EXTRACTION OF ELECTRONIC DOCUMENTS Public/Granted day: 2023-12-28


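IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06V	Image or video recognition or understanding
G06V30/00	Character recognition; recognising digital ink; document-oriented image-based pattern recognition (scanning, transmission or reproduction of documents or the like H04N1/00)
G06V30/40	.Document-oriented image-based pattern recognition
G06V30/41	..Analysis of document content (recognition of printed characters based on code marks G06V30/224)
G06V30/414	...Extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text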