Efficient indexing of documents with similar content

Invention Grant

US08554561B2 Efficient indexing of documents with similar content (Status: In force)

Patent Title: Efficient indexing of documents with similar content
Application No.: US13571316

Application Date: 2012-08-09
Publication No.: US08554561B2

Publication Date: 2013-10-08
Inventor: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
Applicant: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
Applicant Address: US CA Mountain View
Assignee: Google Inc.
Current Assignee: Google Inc.
Current Assignee Address: US CA Mountain View
Agency: Morgan, Lewis & Bockius LLP
Main IPC: G10L15/06
IPC: G10L15/06

Abstract:

A computer system, comprising one or more processors and memory, groups a set of documents into a plurality of clusters. Each cluster includes one or more documents of the set of documents, and a respective cluster of documents of the plurality of clusters includes respective cluster data corresponding to a plurality of documents, including a first document and a second document. The computer system determines that the second document includes duplicate data that is duplicative of corresponding data in the first document, identifies a respective subset of the respective cluster data that excludes at least a subset of the duplicate data, and generates an index of the respective subset of the respective cluster data.
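
The flow described in the abstract (group documents into clusters, detect data in one document that duplicates data in another, index only the non-duplicated subset) can be illustrated with a small Python sketch. This is not the patented implementation: the function names (`group_into_clusters`, `index_cluster`, `search`), the caller-supplied `cluster_key`, the choice of one line of text as the unit of duplication, exact-match comparison, and whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict


def group_into_clusters(docs, cluster_key):
    """Group documents into clusters using a caller-supplied key function.

    `docs` maps a document id to its text. `cluster_key` is hypothetical:
    it stands in for whatever similarity grouping a real system would use.
    """
    clusters = defaultdict(dict)
    for doc_id, text in docs.items():
        clusters[cluster_key(doc_id)][doc_id] = text
    return clusters


def index_cluster(cluster):
    """Index one cluster while storing each duplicated chunk only once.

    A "chunk" is one line of text (an assumption of this sketch). A chunk
    that appears in several documents is kept once, together with the ids
    of every document that contains it.
    """
    chunk_owners = {}                 # chunk text -> ids of documents containing it
    for doc_id, text in cluster.items():
        for chunk in text.splitlines():
            owners = chunk_owners.setdefault(chunk, [])
            if doc_id not in owners:
                owners.append(doc_id)

    index = defaultdict(set)          # term -> numbers of the chunks containing it
    chunks = []                       # chunk number -> (chunk text, owner ids)
    for chunk, owners in chunk_owners.items():
        chunk_no = len(chunks)
        chunks.append((chunk, owners))
        for term in chunk.lower().split():
            index[term].add(chunk_no)
    return index, chunks


def search(index, chunks, term):
    """Return the sorted ids of documents whose text contains `term`."""
    hits = set()
    for chunk_no in index.get(term.lower(), ()):
        hits.update(chunks[chunk_no][1])
    return sorted(hits)


# Two versions of a page that share a header line.
docs = {
    "page/v1": "shared header\nfirst version body",
    "page/v2": "shared header\nsecond version body",
}
clusters = group_into_clusters(docs, cluster_key=lambda doc_id: doc_id.split("/")[0])
index, chunks = index_cluster(clusters["page"])
print(len(chunks))                       # 3 chunks indexed instead of 4
print(search(index, chunks, "header"))   # ['page/v1', 'page/v2']
```

In this sketch the shared header line is indexed once, yet a query for a term in it still resolves to both documents, because the stored chunk carries the ids of all documents that contain it.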

Public/Granted literature

US20120303622A1 Efficient Indexing of Documents with Similar Content Public/Granted day: 2012-11-29

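IPC Classification:

G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L15/00	Speech recognition (G10L17/00 takes precedence)
G10L15/06	. Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice (G10L15/14 takes precedence)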