Detecting duplicate and near-duplicate files

Invention Grant

US08015162B2 Detecting duplicate and near-duplicate files (In force)

Patent Title: Detecting duplicate and near-duplicate files
Application No.: US11499260

Application Date: 2006-08-04
Publication No.: US08015162B2

Publication Date: 2011-09-06
Inventor: Monika H. Henzinger
Applicant: Monika H. Henzinger
Applicant Address: US CA Mountain View
Assignee: Google Inc.
Current Assignee: Google Inc.
Current Assignee Address: US CA Mountain View
Agency: Fish & Richardson P.C.
Main IPC: G06F7/02
IPC: G06F7/02

Abstract:

Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
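
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the shingle length `k`, the 64-bit signature size, and both thresholds are assumptions chosen for illustration. The first stage compares sets of k-token shingles by set intersection (token order dependent, frequency independent); the second stage compares sign bits of pseudo-random projections of the token-frequency vector (order independent, frequency dependent). For simplicity the first stage here uses every shingle, whereas the abstract describes a representation based on a subset of the tokens.

```python
import hashlib
from collections import Counter


def shingles(tokens, k=3):
    """Set of k-token sequences: order dependent, frequency independent."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Set-intersection similarity |A & B| / |A | B| (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def projection_bits(tokens, n_bits=64):
    """Sign bits of pseudo-random +1/-1 projections of the token counts.

    Each token's projection signs come from a SHA-256 digest, so n_bits
    must not exceed 256 (an assumption of this sketch).
    """
    totals = [0] * n_bits
    for token, count in Counter(tokens).items():
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest(), "big")
        for i in range(n_bits):
            totals[i] += count if (h >> i) & 1 else -count
    return [t >= 0 for t in totals]


def bit_agreement(x, y):
    """Fraction of positions where two equal-length bit lists agree."""
    return sum(p == q for p, q in zip(x, y)) / len(x)


def near_duplicates(doc_a, doc_b, first_threshold=0.5, second_threshold=0.9):
    """Apply the second test only to pairs that pass the first test."""
    ta, tb = doc_a.lower().split(), doc_b.lower().split()
    if jaccard(shingles(ta), shingles(tb)) < first_threshold:
        return False
    agreement = bit_agreement(projection_bits(ta), projection_bits(tb))
    return agreement >= second_threshold
```

Calling `near_duplicates(text_a, text_b)` returns `True` only when a pair passes both tests, which mirrors the abstract's description of the second technique being applied to the set produced by the first. Suitable thresholds depend on the corpus and would need to be tuned.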

Public/Granted literature

US20080044016A1 Detecting duplicate and near-duplicate files Public/Granted day: 2008-02-21

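IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F7/00	Methods or arrangements for processing data by operating upon the order or content of the data handled (logic circuits: see H03K19/00)
G06F7/02	. Comparing digital values (G06F7/06, G06F7/38 take precedence)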