Invention Grant
- Patent Title: Detecting duplicate records in databases
- Patent Title (中): 检测数据库中的重复记录
- Application No.: US11182590
- Application Date: 2005-07-14
- Publication No.: US07685090B2
- Publication Date: 2010-03-23
- Inventor: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
- Applicant: Surajit Chaudhuri, Venkatesh Ganti, Rohit Ananthakrishna
- Applicant Address: US WA Redmond
- Assignee: Microsoft Corporation
- Current Assignee: Microsoft Corporation
- Current Assignee Address: US WA Redmond
- Agency: Collins & Collins Incorporated
- Agent: L. Alan Collins
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior art approaches produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
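
The idea of using hierarchy information alongside textual similarity can be illustrated with a minimal sketch. It is not the patented process: the two-level state/city hierarchy, the sample data, the function names, and both thresholds are assumptions chosen only for illustration. The sketch flags two parent records as likely duplicates when either their names are textually close or the child records that reference them overlap heavily.

```python
from difflib import SequenceMatcher

# Hypothetical two-level hierarchy: each state name maps to the set of
# city names whose foreign key points at it. All data here is made up.
children = {
    "WA": {"Seattle", "Redmond", "Spokane"},
    "Washington": {"Seattle", "Redmond", "Tacoma"},
    "Wisconsin": {"Madison", "Milwaukee"},
}


def textual_similarity(a, b):
    """Similarity ratio in [0, 1] from difflib's matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def child_overlap(a, b):
    """Jaccard overlap of the child sets of two parent records."""
    ca, cb = children[a], children[b]
    return len(ca & cb) / len(ca | cb)


def likely_duplicates(a, b, text_threshold=0.8, overlap_threshold=0.4):
    """Flag a pair if the names OR the child sets are similar enough.

    Both thresholds are illustrative assumptions, not values from the patent.
    """
    return (textual_similarity(a, b) >= text_threshold
            or child_overlap(a, b) >= overlap_threshold)


if __name__ == "__main__":
    print(likely_duplicates("WA", "Washington"))
    print(likely_duplicates("Washington", "Wisconsin"))
```

In this toy data, "WA" and "Washington" fall well below the textual threshold, yet they share most of their cities, so the hierarchy signal catches the abbreviation without loosening the textual threshold for every pair.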
Public/Granted literature
- US20050262044A1 Detecting duplicate records in databases (Public/Granted day: 2005-11-24)