METHOD AND SYSTEM FOR PROCESSING DOCUMENT AND MEDIUM

    Publication No.: JP2002032770A

    Publication Date: 2002-01-31

    Application No.: JP2000190335

    Application Date: 2000-06-23

    Applicant: IBM

    Abstract: PROBLEM TO BE SOLVED: To extract meaningful text blocks from a document with an arbitrary layout, such as tables, itemized lists, and multicolumn composition. SOLUTION: A document laid out with blanks and similar spacing is input, and symbols associated with spatial coordinates in the document are acquired. Runs of characters of the same type are extracted from the symbols to generate tokens and spaces. Streams are generated from spaces that continue in the column direction, and text blocks are generated from the streams and the tokens. Links between text blocks are generated and defined as a document graph. The validity of each connection (link) between text blocks in the document graph is evaluated using a language model, and when a connection is valid, the linked text blocks are merged.
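The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes plain-text input where row and column indices stand in for spatial coordinates, it treats a "stream" as a column range that is blank on every row (the abstract only says spaces continuing in the column direction), and it replaces the language model with a hypothetical toy bigram lookup. All function names and the `min_width` and `threshold` parameters are illustrative assumptions.

```python
import re

def tokenize(lines):
    """Return (row, start_col, text) for each run of non-space characters."""
    return [(r, m.start(), m.group())
            for r, line in enumerate(lines)
            for m in re.finditer(r"\S+", line)]

def find_streams(lines, min_width=2):
    """Interior column ranges [start, end) that are blank on every row.
    Assumption: a stream spans the full height of the document."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[c] == " " for row in padded) for c in range(width)]
    streams, c = [], 0
    while c < width:
        if not blank[c]:
            c += 1
            continue
        start = c
        while c < width and blank[c]:
            c += 1
        # Keep only wide gaps that have text on both sides.
        if c - start >= min_width and start > 0 and c < width:
            streams.append((start, c))
    return streams

def build_blocks(lines, streams):
    """One text block per (row, column segment), segments being split by streams."""
    blocks = {}
    for row, start, text in tokenize(lines):
        seg = sum(1 for s_start, _ in streams if start >= s_start)
        blocks.setdefault((row, seg), []).append(text)
    return {key: " ".join(words) for key, words in blocks.items()}

def build_links(blocks):
    """Document graph: link each block to the block below it and to its right."""
    links = []
    for row, seg in blocks:
        for neighbor in ((row + 1, seg), (row, seg + 1)):
            if neighbor in blocks:
                links.append(((row, seg), neighbor))
    return links

def toy_link_score(left, right, bigrams):
    """Hypothetical stand-in for a language model: 1.0 if the word pair
    spanning the link is a known bigram, else 0.0."""
    pair = (left.split()[-1].lower(), right.split()[0].lower())
    return 1.0 if pair in bigrams else 0.0

def extract_text_blocks(lines, bigrams, threshold=0.5):
    """Merge linked blocks whose connection the toy model accepts."""
    blocks = build_blocks(lines, find_streams(lines))
    parent = {key: key for key in blocks}

    def find(key):
        while parent[key] != key:
            key = parent[key]
        return key

    for a, b in build_links(blocks):
        if toy_link_score(blocks[a], blocks[b], bigrams) >= threshold:
            parent[find(b)] = find(a)

    merged = {}
    for key in sorted(blocks, key=lambda k: (k[1], k[0])):  # column by column
        merged.setdefault(find(key), []).append(blocks[key])
    return [" ".join(parts) for parts in merged.values()]

if __name__ == "__main__":
    # Two-column sample; ljust keeps the right column aligned at column 18.
    sample = [
        "The quick brown".ljust(18) + "Stocks rose",
        "fox jumps over".ljust(18) + "sharply on",
        "the lazy dog.".ljust(18) + "Monday.",
    ]
    known = {("brown", "fox"), ("over", "the"),
             ("rose", "sharply"), ("on", "monday.")}
    for block in extract_text_blocks(sample, known):
        print(block)
```

In this sketch the horizontal links across the gutter are rejected because the toy model does not know the word pairs that span them, while the vertical links within each column are accepted, so each column is merged into one text block. A real system would score links with a trained language model and would need streams that cover only part of the page.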
