Detecting duplicate records in databases

Invention Grant

US07685090B2 Detecting duplicate records in databases (in force)

Patent Title: Detecting duplicate records in databases
Application No.: US11182590

Application Date: 2005-07-14
Publication No.: US07685090B2

Publication Date: 2010-03-23
Inventor: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
Applicant: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
Applicant Address: US WA Redmond
Assignee: Microsoft Corporation
Current Assignee: Microsoft Corporation
Current Assignee Address: US WA Redmond
Agency: Collins & Collins Incorporated
Agent: L. Alan Collins
Main IPC: G06F17/30
IPC: G06F17/30

Abstract:

The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior-art approaches produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a duplicate detection process is implemented by interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
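To illustrate the idea in the abstract, the following is a minimal sketch, not the patented process itself. It assumes a hypothetical two-level hierarchy (a country table and a state table linked by a foreign key) and compares two parent records both by a textual similarity and by the overlap of their child records. The table contents, the function names, the Jaccard overlap measure, the use of `max` to combine the two scores, and the 0.5 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical dimension tables (illustrative data, not from the patent).
# Country: id -> name. State: (name, country_id), i.e. a foreign key to Country.
countries = {1: "USA", 2: "United States", 3: "Canada"}
states = [
    ("Washington", 1), ("Oregon", 1),
    ("WA", 2), ("Washington", 2), ("Oregon", 2),
    ("Ontario", 3),
]

def text_sim(a, b):
    """Textual similarity in [0, 1], standing in for edit distance or cosine."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(s, t):
    """Overlap of two sets; 0.0 when both are empty."""
    if not s and not t:
        return 0.0
    return len(s & t) / len(s | t)

def children_of(country_id):
    """Names of the state records that reference the given country."""
    return {name.lower() for name, parent in states if parent == country_id}

def likely_duplicates(threshold=0.5):
    """Flag country pairs whose names OR child sets are similar enough."""
    ids = sorted(countries)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            textual = text_sim(countries[a], countries[b])
            cooccur = jaccard(children_of(a), children_of(b))
            # Assumed combination rule: take the stronger of the two signals.
            if max(textual, cooccur) >= threshold:
                pairs.append((a, b))
    return pairs

print(likely_duplicates())
```

In this toy data, "USA" and "United States" share few characters, so a purely textual comparison would miss them, while the states that reference them overlap heavily. That is the kind of extra evidence the table hierarchy can supply.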

Public/Granted literature

US20050262044A1 Detecting duplicate records in databases (Public/Granted day: 2005-11-24)
