Quality-performance optimized identification of duplicate data

Invention Grant

US11573721B2 Quality-performance optimized identification of duplicate data (in force)

Patent Title: Quality-performance optimized identification of duplicate data
Application No.: US17357176

Application Date: 2021-06-24
Publication No.: US11573721B2

Publication Date: 2023-02-07
Inventor: Soma Shekar Naganna, Abhishek Seth, Neeraj Ramkrishna Singh
Applicant: International Business Machines Corporation
Applicant Address: US NY Armonk
Assignee: International Business Machines Corporation
Current Assignee: International Business Machines Corporation
Current Assignee Address: US NY Armonk
Agency: Keohane & D'Alessandro, PLLC
Agent: Rakesh Roy; Hunter E. Webb
Main IPC: G06F3/06
IPC: G06F3/06; G06F9/50

Abstract:

An approach is provided for optimized identification of duplicate data in a networked computing environment. An aggregate feature vector is created that is specific to an attribute of the data (e.g., a field that holds specific informational content). The aggregate feature vector has a set of dimensions, each of which defines a specific comparison function used to test for similarity between data entries in the attribute. Each dimension in the aggregate feature vector is assigned an effectiveness, and a cost is computed for each dimension. Based on the effectiveness and the cost, a subset of the dimensions is selected to form an optimized feature vector. This optimized feature vector can then be used to analyze a dataset to find matching data.
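The flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the abstract does not say how effectiveness is assigned, how cost is computed, or how the subset is chosen. Here the effectiveness and cost values are hard-coded, and the selection rule (greedy by effectiveness-to-cost ratio under a cost budget) is an assumed stand-in. All names (`Dimension`, `select_optimized`, `compare_entries`, `AGGREGATE`) are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class Dimension:
    """One dimension of a feature vector: a comparison function plus scores."""
    name: str
    compare: Callable[[str, str], float]  # returns a similarity in [0, 1]
    effectiveness: float                  # assumed to be supplied externally
    cost: float                           # assumed relative compute cost


def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def case_insensitive(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0


def fuzzy(a: str, b: str) -> float:
    # Character-level similarity ratio from the standard library.
    return SequenceMatcher(None, a, b).ratio()


# Aggregate feature vector for one attribute (e.g. a "city" field).
# The effectiveness and cost numbers are made up for illustration.
AGGREGATE: List[Dimension] = [
    Dimension("exact", exact, effectiveness=0.6, cost=1.0),
    Dimension("case_insensitive", case_insensitive, effectiveness=0.7, cost=2.0),
    Dimension("fuzzy", fuzzy, effectiveness=0.9, cost=10.0),
]


def select_optimized(dims: List[Dimension], budget: float) -> List[Dimension]:
    """Assumed selection rule: greedy by effectiveness/cost within a budget."""
    chosen: List[Dimension] = []
    spent = 0.0
    ranked = sorted(dims, key=lambda d: d.effectiveness / d.cost, reverse=True)
    for d in ranked:
        if spent + d.cost <= budget:
            chosen.append(d)
            spent += d.cost
    return chosen


def compare_entries(dims: List[Dimension], a: str, b: str) -> List[float]:
    """Apply each selected comparison function to a pair of entries."""
    return [d.compare(a, b) for d in dims]


optimized = select_optimized(AGGREGATE, budget=5.0)
scores = compare_entries(optimized, "Armonk", "armonk")
```

With the assumed budget of 5.0, the expensive `fuzzy` dimension is left out, and the pair `"Armonk"` / `"armonk"` is flagged as similar by the case-insensitive dimension only. A different budget or scoring scheme would change which dimensions survive.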

Public/Granted literature

US20220413727A1 QUALITY-PERFORMANCE OPTIMIZED IDENTIFICATION OF DUPLICATE DATA Public/Granted day: 2022-12-29

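IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F3/00	Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to an output unit, e.g. interface arrangements
G06F3/06	. Digital input from, or digital output to, record carriers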