System to extract information from documents
Abstract:
A method of training a system to extract information from documents comprises feeding a digital form of training documents to an OCR module, which identifies multiple logical blocks in each document and the text present in those logical blocks. One or more tags for the whole of the document, for the logical blocks, and for word tokens in the document are received by a tagging module. A text input comprising the text identified in the document and the tags for the whole of the document is received by a machine learning module. A first image of the document, with the layout of one or more of the identified blocks superimposed, and the tags of the logical blocks in the document are also received by the machine learning module, wherein the received text input, first image, and tags for the logical blocks correspond to a plurality of the training documents.
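The two inputs described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the names `LogicalBlock`, `TrainingDocument`, `text_input`, `image_input`, and `build_training_set` are hypothetical, the OCR and tagging outputs are assumed to be already available, and the "first image" is approximated by block outlines drawn on a blank grid rather than on real page pixels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LogicalBlock:
    # Hypothetical container for one OCR-identified block.
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
    text: str
    tags: List[str] = field(default_factory=list)  # block-level tags from the tagging step


@dataclass
class TrainingDocument:
    # Hypothetical container for one training document after OCR and tagging.
    width: int
    height: int
    blocks: List[LogicalBlock]
    document_tags: List[str]  # tags for the whole of the document


def text_input(doc: TrainingDocument) -> Dict[str, object]:
    """Text input: the identified text plus the document-level tags."""
    return {
        "text": "\n".join(block.text for block in doc.blocks),
        "document_tags": list(doc.document_tags),
    }


def layout_image(doc: TrainingDocument) -> List[List[int]]:
    """Draw each block's outline (value 1) on a blank grid (value 0).

    Assumption: a blank grid stands in for the scanned page image.
    """
    grid = [[0] * doc.width for _ in range(doc.height)]
    for block in doc.blocks:
        left, top, right, bottom = block.bbox
        for y in range(max(top, 0), min(bottom, doc.height)):
            for x in range(max(left, 0), min(right, doc.width)):
                on_edge = y in (top, bottom - 1) or x in (left, right - 1)
                if on_edge:
                    grid[y][x] = 1
    return grid


def image_input(doc: TrainingDocument) -> Dict[str, object]:
    """Image input: the layout image plus the tags of the logical blocks."""
    return {
        "image": layout_image(doc),
        "block_tags": [list(block.tags) for block in doc.blocks],
    }


def build_training_set(docs: List[TrainingDocument]):
    """Pair the text input and image input for every training document."""
    return [(text_input(doc), image_input(doc)) for doc in docs]
```

In this sketch, each training document yields one pair: a text-side record carrying document-level tags and an image-side record carrying block-level tags, which mirrors the separation of inputs the abstract describes. How a machine learning module would consume these pairs is left open here.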