Method and system for tabular information extraction
Abstract:
A method and a system for extracting information from a table in a document are provided. The method includes: receiving a document that includes information arranged in a table; determining a first, a second, and a third set of coordinates that respectively relate to lines, words, and characters included in the document; extracting a list of lines based on the first set of coordinates; reconstructing the rows of the table based on the list of lines and the second set of coordinates; reconstructing the columns of the table based on the reconstructed rows and the third set of coordinates; and outputting a reconstruction of the table. The three sets of coordinates are expressible in an hOCR format, which is based on an open standard for representing scanned information obtained by using an optical character recognition (OCR) technique.
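The following is a minimal illustrative sketch, not the claimed method: it shows how bounding-box coordinates carried in hOCR `title` attributes might be used to regroup words into table rows and cells. It is simplified to word-level boxes only (the abstract also uses line-level and character-level coordinates), and the function names `group_rows`, `split_cells`, and `extract_table`, the thresholds `y_tol` and `x_gap`, and the sample markup are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 10 60 30; x_wconf 95".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every element of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or ""):
            m = BBOX_RE.search(a.get("title") or "")
            self._bbox = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), *self._bbox))
            self._bbox = None


def group_rows(words, y_tol=10):
    """Group words whose vertical centers lie within y_tol pixels (assumed threshold)."""
    rows = []
    for w in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        yc = (w[2] + w[4]) / 2
        if rows and abs(yc - rows[-1]["yc"]) <= y_tol:
            rows[-1]["words"].append(w)
        else:
            rows.append({"yc": yc, "words": [w]})
    # Order each row left to right by x0.
    return [sorted(r["words"], key=lambda w: w[1]) for r in rows]


def split_cells(row, x_gap=30):
    """Merge neighboring words into one cell unless the horizontal gap exceeds x_gap."""
    cells, last_x1 = [], None
    for text, x0, _, x1, _ in row:
        if last_x1 is not None and x0 - last_x1 <= x_gap:
            cells[-1] += " " + text
        else:
            cells.append(text)
        last_x1 = x1
    return cells


def extract_table(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    return [split_cells(row) for row in group_rows(parser.words)]


# Hypothetical hOCR fragment for a two-row, two-column table.
SAMPLE = """
<div class='ocr_page'>
 <span class='ocr_line' title='bbox 10 10 300 30'>
  <span class='ocrx_word' title='bbox 10 10 60 30'>Item</span>
  <span class='ocrx_word' title='bbox 200 10 260 30'>Price</span>
 </span>
 <span class='ocr_line' title='bbox 10 40 300 60'>
  <span class='ocrx_word' title='bbox 10 40 50 60'>Green</span>
  <span class='ocrx_word' title='bbox 58 40 100 60'>tea</span>
  <span class='ocrx_word' title='bbox 200 40 250 60'>4.50</span>
 </span>
</div>
"""

if __name__ == "__main__":
    for cells in extract_table(SAMPLE):
        print(cells)
```

In this sketch a fixed pixel gap decides where one cell ends and the next begins; a real document would likely need thresholds scaled to the scan resolution and font size, and character-level boxes could refine the column boundaries as the abstract describes.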