Invention Grant
- Patent Title: Estimating document similarity using bit-strings
- Application No.: US13031265
- Application Date: 2011-02-21
- Publication No.: US08594239B2
- Publication Date: 2013-11-26
- Inventor: Mark S. Manasse, Arnd Christian König
- Applicant: Mark S. Manasse, Arnd Christian König
- Applicant Address: US WA Redmond
- Assignee: Microsoft Corporation
- Current Assignee: Microsoft Corporation
- Current Assignee Address: US WA Redmond
- Agency: Microsoft Corporation
- Main IPC: H04L27/00
- IPC: H04L27/00

Abstract:
Each of a plurality of documents is divided into samples. Small bit-strings are generated for selected samples from each of the documents and used to create a sketch for each document. Because the bit-strings are small (e.g., only one, two, or three bits in length), the generated sketches are smaller than the sketches generated using previous methods for generating sketches, and therefore use less storage space. The generated sketches are compared to determine documents that are near-duplicates of one another.
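
The abstract's pipeline (divide into samples, keep a few bits per selected sample, compare sketches) can be illustrated with a minimal sketch. It is not the patented method itself: the record does not say how samples are selected or how bit-strings are derived, so the version below assumes word shingles as samples, min-hash selection, and the low-order bits of each minimum hash as the bit-string. All function names and parameters (`shingles`, `sketch`, `estimate_similarity`, `num_hashes`, `bits`, `k`) are hypothetical.

```python
import hashlib


def shingles(text, k=4):
    """Divide a document into samples: overlapping runs of k words (assumed scheme)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(sample, seed):
    """64-bit hash of a sample; the seed stands in for a family of hash functions."""
    data = f"{seed}:{sample}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, num_hashes=256, bits=1, k=4):
    """For each hash function, select the minimum-hash sample and keep only
    its lowest `bits` bits, giving a sketch of num_hashes * bits bits."""
    samples = shingles(text, k)
    if not samples:
        raise ValueError("document has no samples")
    mask = (1 << bits) - 1
    return [min(_hash(s, seed) for s in samples) & mask
            for seed in range(num_hashes)]


def estimate_similarity(a, b, bits=1):
    """Estimate Jaccard similarity from two sketches.

    Positions whose selected samples differ still agree by chance with
    probability about 2**-bits, so that baseline is subtracted out.
    """
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    match = sum(x == y for x, y in zip(a, b)) / len(a)
    chance = 2.0 ** -bits
    return max(0.0, (match - chance) / (1.0 - chance))


if __name__ == "__main__":
    original = " ".join(f"w{i}" for i in range(200))
    edited = original.replace("w100", "changed")
    print(estimate_similarity(sketch(original), sketch(edited)))
```

With `bits=1` and `num_hashes=256`, each document's sketch occupies 256 bits, versus 256 full-width hash values for a conventional min-hash sketch. The trade-off is that fewer bits per position raises the chance-agreement rate, so more positions are needed for the same estimation accuracy.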
Public/Granted literature
- US20120213313A1 ESTIMATING DOCUMENT SIMILARITY USING BIT-STRINGS, Public/Granted day: 2012-08-23