Invention Grant
- Patent Title: Efficient indexing of documents with similar content
- Application No.: US11419423
- Application Date: 2006-05-19
- Publication No.: US08175875B1
- Publication Date: 2012-05-08
- Inventor: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
- Applicant: Jeffrey A. Dean, Sanjay Ghemawat, Gautham Thambidorai
- Applicant Address: US CA Mountain View
- Assignee: Google Inc.
- Current Assignee: Google Inc.
- Current Assignee Address: US CA Mountain View
- Agency: Morgan, Lewis & Bockius LLP
- Main IPC: G10L15/06
- IPC: G10L15/06

Abstract:
A set of documents may be stored and indexed as a compressed sequence of tokens. The documents are grouped into clusters. The sequences of tokens representing the clusters of documents are encoded to elide some repeating instances of tokens. A compressed sequence of tokens is generated from the compressed cluster sequences of tokens. Queries on the compressed sequence are performed by identifying cluster sequences within the compressed sequence that are likely to contain documents that satisfy the query, and then identifying, within those clusters, the documents that actually satisfy the query.
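
The two-phase query described in the abstract can be illustrated with a minimal sketch. It is not the patented encoding: it assumes the documents are already grouped into clusters and elides repeated tokens with a simple shared-prefix scheme, in which each document stores only the tokens that differ from the start of the previous document in its cluster. All names (`encode_cluster`, `decode_cluster`, `query`) and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Doc = List[str]
# Encoded document: (count of leading tokens shared with the previous
# document in the cluster, remaining tokens stored explicitly).
Encoded = Tuple[int, List[str]]


def encode_cluster(docs: List[Doc]) -> List[Encoded]:
    """Elide each document's leading tokens that repeat the previous document."""
    encoded: List[Encoded] = []
    prev: Doc = []
    for doc in docs:
        shared = 0
        while shared < min(len(prev), len(doc)) and prev[shared] == doc[shared]:
            shared += 1
        encoded.append((shared, doc[shared:]))
        prev = doc
    return encoded


def decode_cluster(encoded: List[Encoded]) -> List[Doc]:
    """Rebuild the full token sequence of every document in a cluster."""
    docs: List[Doc] = []
    prev: Doc = []
    for shared, rest in encoded:
        doc = prev[:shared] + rest
        docs.append(doc)
        prev = doc
    return docs


def query(clusters: Dict[str, List[Encoded]], terms: List[str]) -> List[Tuple[str, int]]:
    """Return (cluster id, document position) for documents containing all terms."""
    wanted = set(terms)
    hits: List[Tuple[str, int]] = []
    for cluster_id, encoded in clusters.items():
        # Phase 1: every token of every document appears at least once among
        # the explicitly stored tokens, so a cluster missing a term can be
        # skipped without decoding it.
        stored = {tok for _, rest in encoded for tok in rest}
        if not wanted <= stored:
            continue
        # Phase 2: decode the candidate cluster and check each document.
        for position, doc in enumerate(decode_cluster(encoded)):
            if wanted <= set(doc):
                hits.append((cluster_id, position))
    return hits


# Hypothetical sample data: two clusters of similar documents.
clusters = {
    "news": encode_cluster([
        ["google", "index", "fast", "search"],
        ["google", "index", "fast", "crawl"],
        ["google", "index", "slow", "crawl"],
    ]),
    "recipes": encode_cluster([
        ["bake", "bread", "slow"],
        ["bake", "bread", "fast"],
    ]),
}
print(query(clusters, ["index", "crawl"]))
print(query(clusters, ["fast", "slow"]))
```

In this sketch the second query passes the cluster-level check for both clusters, because each cluster contains both terms somewhere, but no single document contains both, so the document-level check returns no hits. This mirrors the distinction the abstract draws between clusters that are likely to contain matches and documents that actually satisfy the query.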