Optical character recognition segmentation

    Publication number: GB2604970A

    Publication date: 2022-09-21

    Application number: GB202116057

    Filing date: 2021-11-09

    Applicant: IBM

    Abstract: An optical character recognition (OCR) segmentation and processing method comprising: receiving a document (502); detecting different types of text data in the document (502a-d); dividing the document into a plurality of text regions, each containing a single type of text (510, 514a-d); removing optical noise from each text region (512; 600, Fig. 6); and selecting a suitable OCR recognition model for each text region (704, 708, Fig. 7) to extract the text (710, Fig. 7). The types of text detected may be tables, watermarks, handwriting and rotated text. Removing optical noise, such as unnecessary background text and/or images, from each text region may comprise: encoding each text region as a total semantic vector; dividing each text region into multiple sub-regions and encoding each sub-region as a vector; generating a dot product for each sub-region; comparing the dot product to a threshold; and deleting all pixels of each sub-region whose dot product exceeds the threshold. Selection of the OCR code may be based on a classification of the text in a region (704, Fig. 7). The optical character recognition modules may comprise self-learning software such as neural networks.
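
The noise-removal steps in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the patented method: the abstract does not specify the encoder, so `encode` is a hypothetical stand-in (a normalized intensity histogram), the dot product is assumed to be taken between each sub-region vector and the region's total vector, and the names `remove_noise`, `tile`, `threshold` and `background` are illustrative only. "Deleting" pixels is modeled as overwriting them with a background value.

```python
import numpy as np


def encode(patch: np.ndarray, bins: int = 8) -> np.ndarray:
    """Hypothetical stand-in encoder: a unit-length intensity histogram.

    The abstract speaks of a "semantic vector" but does not define the
    encoder, so this is only a placeholder to make the sketch runnable.
    """
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    vec = hist.astype(float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def remove_noise(region: np.ndarray, tile: int = 4,
                 threshold: float = 0.5, background: int = 255) -> np.ndarray:
    """Blank out every sub-region whose dot product exceeds the threshold."""
    total = encode(region)        # vector for the whole text region
    cleaned = region.copy()       # leave the caller's array untouched
    height, width = region.shape
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            sub = region[y:y + tile, x:x + tile]
            # Assumption: the score is the dot product of the sub-region
            # vector with the region's total vector.
            score = float(np.dot(encode(sub), total))
            if score > threshold:
                cleaned[y:y + tile, x:x + tile] = background
    return cleaned


# Toy region: a gray, watermark-like background with one dark "text" tile.
region = np.full((8, 8), 200, dtype=np.uint8)
region[0:4, 0:4] = 0
cleaned = remove_noise(region)
```

With this toy encoder, the gray tiles resemble the region as a whole, score above the threshold, and are blanked, while the dark tile is kept. A learned encoder, as the abstract's mention of neural networks suggests, would presumably behave differently; the threshold and tile size would need tuning for real documents.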
