Scalable identification of duplicate datasets in heterogeneous datasets

Invention Grant

US11886385B2 Scalable identification of duplicate datasets in heterogeneous datasets (Active)

Patent Title: Scalable identification of duplicate datasets in heterogeneous datasets
Application No.: US17805134

Application Date: 2022-06-02
Publication No.: US11886385B2

Publication Date: 2024-01-30
Inventor: Praduemn K. Goyal, Sandeep Hans, Samiulla Zakir Hussain Shaikh, Diptikalyan Saha
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Applicant Address: US NY Armonk
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee Address: US NY Armonk
Agent: Daniel G. DeLuca
Main IPC: G06F16/174
IPC: G06F16/174; G06F16/14; G06F16/16; G06F18/22

Abstract:

An embodiment for identifying and sorting duplicate datasets within a large pool of heterogeneous datasets may include receiving a plurality of heterogeneous datasets. The embodiment may automatically compare schema information and metadata within each of the received plurality of heterogeneous datasets to generate name-based similarity scores for each dataset. The embodiment may also automatically compare data distribution information within each of the received plurality of heterogeneous datasets to generate a plurality of data distribution similarity scores for each heterogeneous dataset. The embodiment may further include automatically calculating an overall distance metric using the name-based similarity scores and the plurality of data distribution similarity scores. The embodiment may also include, based on the calculated overall distance metric, automatically generating distance graphs that identify clusters of similar datasets and illustrate inferred lineage for the clusters of similar datasets.
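
The pipeline outlined in the abstract (name-based scores, data distribution scores, a combined distance, then clusters read off a distance graph) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the Jaccard measure on exact column names, the frequency-overlap measure for distributions, the equal weighting, the 0.3 threshold, and the function names `name_similarity`, `distribution_similarity`, `overall_distance`, and `cluster`. Lineage inference is not covered.

```python
from itertools import combinations

# A dataset is modeled as {column_name: list_of_values}. All measures below
# are illustrative assumptions, not the method claimed in the patent.

def name_similarity(ds_a, ds_b):
    """Jaccard overlap of the two datasets' column names, in [0, 1]."""
    a, b = set(ds_a), set(ds_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def distribution_similarity(values_a, values_b):
    """Overlap of relative value frequencies, in [0, 1]."""
    def freqs(values):
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        return {v: n / len(values) for v, n in counts.items()}
    fa, fb = freqs(values_a), freqs(values_b)
    return sum(min(fa.get(v, 0.0), fb.get(v, 0.0)) for v in set(fa) | set(fb))

def overall_distance(ds_a, ds_b, name_weight=0.5):
    """Weighted blend of both similarity scores, turned into a distance."""
    shared = set(ds_a) & set(ds_b)
    if shared:
        dist_score = sum(
            distribution_similarity(ds_a[c], ds_b[c]) for c in shared
        ) / len(shared)
    else:
        dist_score = 0.0  # no comparable columns
    similarity = name_weight * name_similarity(ds_a, ds_b) \
        + (1 - name_weight) * dist_score
    return 1.0 - similarity

def cluster(datasets, threshold=0.3):
    """Link datasets whose distance is within the threshold and return
    (clusters, edges); clusters are connected components of that graph."""
    names = sorted(datasets)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    edges = {}
    for a, b in combinations(names, 2):
        d = overall_distance(datasets[a], datasets[b])
        if d <= threshold:
            edges[(a, b)] = d
            parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values()), edges

# Hypothetical example data: two near-identical tables and one unrelated table.
datasets = {
    "sales": {"id": [1, 2, 3, 4], "region": ["east", "west", "east", "west"]},
    "sales_copy": {"id": [1, 2, 3, 4], "region": ["east", "west", "east", "east"]},
    "weather": {"station": ["a", "b"], "temp": [10, 12]},
}
clusters, edges = cluster(datasets)
```

In this toy run the two sales tables fall into one cluster and the weather table stays on its own. Comparing every pair is quadratic in the number of datasets, so a scalable system would presumably prune candidate pairs first; the sketch makes no attempt at that.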

Public/Granted literature

US20230394011A1 SCALABLE IDENTIFICATION OF DUPLICATE DATASETS IN HETEROGENEOUS DATASETS Public/Granted day: 2023-12-07


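IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F16/00	Information retrieval; database structures therefor; file system structures therefor
G06F16/10	. File systems; file servers
G06F16/17	.. Details of further file system functions
G06F16/174	... Redundancy elimination performed by the file system (management of the data involved in backup or backup restore using de-duplication of the data: G06F 11/14)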