Invention Grant
- Patent Title: Training language models using text corpora comprising realistic optical character recognition (OCR) errors
- Application No.: US16375478
- Application Date: 2019-04-04
- Publication No.: US11341757B2
- Publication Date: 2022-05-24
- Inventor: Ivan Germanovich Zagaynov
- Applicant: ABBYY Production LLC
- Applicant Address: RU Moscow
- Assignee: ABBYY Production LLC
- Current Assignee: ABBYY Production LLC
- Current Assignee Address: RU Moscow
- Agency: Lowenstein Sandler LLP
- Priority: RU RU2019109198 2019-03-29
- Main IPC: G06K9/34
- IPC: G06K9/34 ; G06V30/148 ; G06N20/00 ; G06F40/20 ; G06T5/00 ; G06V30/10

Abstract:
Systems and methods are provided for generating text corpora comprising realistic optical character recognition (OCR) errors and for training language models using those text corpora. An example method comprises: generating, by a computer system, an initial set of images based on an input text corpus comprising text; overlaying, by the computer system, one or more simulated defects over the initial set of images to generate an augmented set of images; generating an output text corpus based on the augmented set of images; and training, using the output text corpus, a language model for optical character recognition.
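The four steps in the abstract (render text to images, overlay simulated defects, recognize the degraded images, train a language model on the result) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration and not the patented implementation: the 3x5 bitmap font `GLYPHS`, the random pixel-flip defect model, the nearest-template "OCR" step, and the character-bigram counter standing in for a language model are all hypothetical simplifications.

```python
import random
from collections import Counter

# Hypothetical 3x5 bitmap font for a tiny alphabet (illustrative assumption).
GLYPHS = {
    "a": ("010", "101", "111", "101", "101"),
    "b": ("110", "101", "110", "101", "110"),
    "c": ("011", "100", "100", "100", "011"),
    "d": ("110", "101", "101", "101", "110"),
    "e": ("111", "100", "110", "100", "111"),
    " ": ("000", "000", "000", "000", "000"),
}


def render(text):
    """Step 1: turn each character into a glyph image (flat list of 15 pixels)."""
    return [[int(p) for row in GLYPHS[ch] for p in row] for ch in text]


def overlay_defects(images, flip_prob, rng):
    """Step 2: simulate defects by flipping each pixel with probability flip_prob."""
    return [[p ^ (rng.random() < flip_prob) for p in img] for img in images]


def recognize(images):
    """Step 3: toy OCR that picks the glyph with the smallest Hamming distance."""
    templates = {ch: render(ch)[0] for ch in GLYPHS}

    def best(img):
        return min(
            templates,
            key=lambda ch: sum(a != b for a, b in zip(img, templates[ch])),
        )

    return "".join(best(img) for img in images)


def generate_ocr_corpus(text, flip_prob=0.15, seed=0):
    """Steps 1-3 chained: clean text in, text with OCR-like errors out."""
    rng = random.Random(seed)
    return recognize(overlay_defects(render(text), flip_prob, rng))


def train_bigram_model(corpus):
    """Step 4: count character bigrams as a stand-in for a language model."""
    return Counter(zip(corpus, corpus[1:]))


if __name__ == "__main__":
    clean = "abc de bad cab"
    noisy = generate_ocr_corpus(clean, flip_prob=0.2, seed=42)
    model = train_bigram_model(noisy)
    print(clean)
    print(noisy)
    print(model.most_common(3))
```

Because the errors in this sketch come from running a recognizer over degraded images rather than from random character substitution, the confusions that appear depend on which glyphs look alike under the chosen defect model. That mirrors the intent stated in the abstract, though a real system would use rendered page images, richer defects, and an actual OCR engine.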