Invention Grant
- Patent Title: Method of identifying redundant text in an electronic document
- Patent Title (中): 识别电子文档中冗余文本的方法
- Application No.: US11405771
- Application Date: 2006-04-18
- Publication No.: US07643682B2
- Publication Date: 2010-01-05
- Inventor: Serge Bronstein
- Applicant: Serge Bronstein
- Applicant Address: DE
- Assignee: PDFlib GmbH
- Current Assignee: PDFlib GmbH
- Current Assignee Address: DE
- Agency: Jansson Shupe & Munger Ltd.
- Priority: EP05012452 20050609
- Main IPC: G06K9/34
- IPC: G06K9/34

Abstract:
A method of identifying redundant text fragments, which create artificial artifacts only, in an electronic page description language document includes a) providing a page having a plurality of text fragments, each text fragment comprising at least one glyph, the document including Unicode values for all glyphs and geometric information of all text fragments on the page and page description language parameters of all glyphs, b) identifying two text fragments as redundant candidates if the text fragments have identical corresponding Unicode sequences, c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics, d) calculating the overlapping area of the two bounding boxes, and e) determining whether the two candidates form redundant text fragments by comparing the ratio of the overlapping area to the area of the smaller bounding box of both text fragments with a predetermined threshold.
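Steps b) through e) of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It rests on several assumptions not stated in the abstract: bounding boxes are simplified to axis-aligned rectangles (the abstract only requires a quadrangular shape derived from font characteristics), a pair counts as redundant when the ratio is greater than or equal to the threshold, and the default threshold of 0.8 is purely hypothetical. The names `Fragment`, `overlap_area`, and `is_redundant` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A text fragment with its Unicode sequence and an axis-aligned box.

    Assumption: the box is an axis-aligned rectangle (x0, y0)-(x1, y1);
    the abstract allows any quadrangular shape.
    """
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Fragment, b: Fragment) -> float:
    """Step d): area of the intersection of the two bounding boxes."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def is_redundant(a: Fragment, b: Fragment, threshold: float = 0.8) -> bool:
    """Steps b) and e): candidate check, then overlap-ratio test.

    Assumptions: the threshold value and the ">=" comparison direction
    are hypothetical; the abstract only says the ratio is compared with
    a predetermined threshold.
    """
    # Step b): only fragments with identical Unicode sequences are candidates.
    if a.text != b.text:
        return False
    smaller = min(a.area(), b.area())
    if smaller <= 0:
        return False
    # Step e): ratio of the overlap to the area of the smaller box.
    return overlap_area(a, b) / smaller >= threshold


if __name__ == "__main__":
    original = Fragment("Title", 0, 0, 100, 10)
    shadow = Fragment("Title", 1, 1, 101, 11)    # slightly offset copy
    repeat = Fragment("Title", 0, 50, 100, 60)   # same text, elsewhere on the page
    print(is_redundant(original, shadow))
    print(is_redundant(original, repeat))
```

In this sketch, a slightly offset copy of the same text (as produced by simulated bold or shadow effects) overlaps most of the original box and is flagged, while the same text repeated elsewhere on the page is not.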
Public/Granted literature
- US20060282769A1 Method of identifying redundant text in an electronic document (Public/Granted day: 2006-12-14)