Invention Grant
- Patent Title: Optimized mapping of documents to candidate duplicate documents in a document corpus
- Application No.: US14573849
- Application Date: 2014-12-17
- Publication No.: US09607029B1
- Publication Date: 2017-03-28
- Inventor: Sivaranjini Dharmalingam, Nathan Thomas Close, Shantanu Shailendrakumar Fauji, Sean Gwizdak, Jiahui Jiang, Yohan Mammen, Roshan Rammohan
- Applicant: Amazon Technologies, Inc.
- Applicant Address: US WA Seattle
- Assignee: Amazon Technologies, Inc.
- Current Assignee: Amazon Technologies, Inc.
- Current Assignee Address: US WA Seattle
- Agency: Lee & Hayes, PLLC
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
Technologies are disclosed for mapping documents to candidate duplicate documents in a document corpus. A bitset optimized inverted index is created for a document corpus. A document is received for which candidate duplicate documents in the document corpus are to be identified. The document is tokenized using adaptive tokenization. A determination is made as to whether tokens in the document are represented in the bitset optimized inverted index. A list of candidate duplicate documents is created for tokens represented in the optimized inverted index utilizing in-memory bitsets that map tokens to documents that contain the tokens in the document corpus.
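The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a plain lowercase word tokenizer in place of the adaptive tokenization (which the abstract does not specify), uses Python integers as in-memory bitsets, and ranks candidates by the number of shared tokens, which is an illustrative choice. The function names `tokenize`, `build_index`, and `candidate_duplicates` and the sample corpus are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (assumed stand-in tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(corpus):
    """Map each token to an int bitset; bit i is set if document i has the token."""
    index = defaultdict(int)
    for doc_id, text in enumerate(corpus):
        for token in tokenize(text):
            index[token] |= 1 << doc_id
    return dict(index)


def candidate_duplicates(index, text):
    """Return ids of documents sharing tokens with text, most shared tokens first."""
    counts = defaultdict(int)
    for token in tokenize(text):
        bits = index.get(token)
        if bits is None:
            # Token is not represented in the index, so it is skipped.
            continue
        doc_id = 0
        while bits:
            if bits & 1:
                counts[doc_id] += 1
            bits >>= 1
            doc_id += 1
    # Ranking by shared-token count is an assumption made for this sketch.
    return sorted(counts, key=lambda d: (-counts[d], d))


# Hypothetical three-document corpus.
corpus = ["red cotton shirt large", "blue denim jeans", "red cotton shirt small"]
index = build_index(corpus)
candidates = candidate_duplicates(index, "large red shirt cotton")
```

Here `index["red"]` has bits 0 and 2 set because documents 0 and 2 contain that token, so a lookup for a token is a single dictionary access followed by a scan of the set bits.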