- Patent Title: System and method for near and exact de-duplication of documents
- Application No.: US13587597
- Application Date: 2012-08-16
- Publication No.: US08504578B2
- Publication Date: 2013-08-06
- Inventor: Johannes C. Scholtes, Siebe Bloembergen
- Applicant: Johannes C. Scholtes, Siebe Bloembergen
- Applicant Address: NL Amsterdam
- Assignee: MSC Intellectual Properties B.V.
- Current Assignee: MSC Intellectual Properties B.V.
- Current Assignee Address: NL Amsterdam
- Agency: The Villamar Firm PLLC
- Agent: Carlos R. Villamar
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
A system, method and computer program product for identifying near and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near and exact-duplicate documents are identified in the document collection.
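The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the stopword-based filter standing in for "user settings", the reading of a quorum search as "match at least M of the N query words", and the use of the hit count as the relevancy score are all hypothetical choices made for this example.

```python
import re
from collections import Counter


def filter_text(text, stopwords=frozenset(), min_len=1):
    """Lowercase, tokenize, and drop stopwords and short tokens.

    Assumption: this stands in for the 'user settings' filter in the abstract.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]


def top_n_words(tokens, n):
    """Return the n most frequent words; ties are broken alphabetically."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def quorum_search(query_words, index, m):
    """Return (doc_id, hits) for documents containing at least m query words.

    Assumption: relevancy is the number of matched query words, so results
    are sorted by hits in descending order (doc_id breaks ties).
    """
    results = []
    for doc_id, vocabulary in index.items():
        hits = sum(1 for word in query_words if word in vocabulary)
        if hits >= m:
            results.append((doc_id, hits))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


def find_duplicates(docs, n, m, stopwords=frozenset()):
    """For each document, list the other documents that pass the quorum search."""
    filtered = {doc_id: filter_text(text, stopwords) for doc_id, text in docs.items()}
    index = {doc_id: set(tokens) for doc_id, tokens in filtered.items()}
    candidates = {}
    for doc_id, tokens in filtered.items():
        query = top_n_words(tokens, n)
        matches = quorum_search(query, index, m)
        # A document always matches its own query, so exclude it.
        candidates[doc_id] = [(other, hits) for other, hits in matches if other != doc_id]
    return candidates
```

In this sketch, setting M equal to N requires every one of the N most frequent words to be present, which is the strictest setting, while lowering M relative to N admits looser, near-duplicate matches. Sharing the N most frequent words does not by itself prove two documents are identical, so a real system would likely confirm exact duplicates with a further comparison.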
Public/Granted literature
- US20120317126A1 SYSTEM AND METHOD FOR NEAR AND EXACT DE-DUPLICATION OF DOCUMENTS Public/Granted day: 2012-12-13