Invention Grant
- Patent Title: Detecting duplicate and near-duplicate files
- Patent Title (中): 检测重复和近似重复的文件
- Application No.: US11499260
- Application Date: 2006-08-04
- Publication No.: US08015162B2
- Publication Date: 2011-09-06
- Inventor: Monika H. Henzinger
- Applicant: Monika H. Henzinger
- Applicant Address: US CA Mountain View
- Assignee: Google Inc.
- Current Assignee: Google Inc.
- Current Assignee Address: US CA Mountain View
- Agency: Fish & Richardson P.C.
- Main IPC: G06F7/02
- IPC: G06F7/02

Abstract:
Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
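The two-stage filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the first technique is a Jaccard overlap of k-token shingle sets (order dependent, frequency independent, set intersection) and that the second is a sign-of-random-projection fingerprint over weighted token counts (order independent, frequency dependent). All function names, the thresholds `t1` and `t2`, the shingle size `k`, and the use of SHA-256 to derive pseudo-random ±1 vectors are illustrative assumptions. For brevity the first stage uses every shingle rather than a sampled subset.

```python
import hashlib
from collections import Counter
from itertools import combinations


def tokens(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def shingles(toks, k=3):
    # Set of k-token windows: sensitive to order, blind to repetition.
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    # Set-intersection similarity; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def token_signs(tok, nbits=64):
    # Deterministic pseudo-random +1/-1 vector per token.
    # Assumes nbits is a multiple of 8 and at most 256.
    digest = hashlib.sha256(tok.encode("utf-8")).digest()
    value = int.from_bytes(digest[:nbits // 8], "big")
    return [1 if (value >> i) & 1 else -1 for i in range(nbits)]


def projection_bits(toks, nbits=64):
    # Sum each token's sign vector weighted by its frequency, keep the signs.
    acc = [0] * nbits
    for tok, count in Counter(toks).items():
        for i, sign in enumerate(token_signs(tok, nbits)):
            acc[i] += count * sign
    return [x >= 0 for x in acc]


def bit_agreement(a, b):
    # Fraction of fingerprint positions on which two documents agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)


def near_duplicates(docs, t1=0.2, t2=0.8, k=3):
    # docs: mapping of document name -> text. Thresholds are illustrative.
    toks = {name: tokens(text) for name, text in docs.items()}
    sh = {name: shingles(t, k) for name, t in toks.items()}
    # Stage 1: candidate pairs by shingle-set overlap.
    first = [(a, b) for a, b in combinations(sorted(docs), 2)
             if jaccard(sh[a], sh[b]) >= t1]
    # Stage 2: keep only candidates whose projection fingerprints agree.
    sig = {name: projection_bits(t) for name, t in toks.items()}
    return [(a, b) for a, b in first
            if bit_agreement(sig[a], sig[b]) >= t2]
```

Calling `near_duplicates({"a": "x y z w", "b": "x y z w", "c": "p q r s"})` keeps only the pair `("a", "b")`: the other pairs share no shingles and are dropped in the first stage, so the second stage never examines them.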
Public/Granted literature
- US20080044016A1 Detecting duplicate and near-duplicate files (Public/Granted day: 2008-02-21)