Method and system for identifying duplicate columns using statistical, semantics and machine learning techniques

Invention Grant

US11561944B2 Method and system for identifying duplicate columns using statistical, semantics and machine learning techniques (Active)

Patent Title: Method and system for identifying duplicate columns using statistical, semantics and machine learning techniques
Application No.: US17136124

Application Date: 2020-12-29
Publication No.: US11561944B2

Publication Date: 2023-01-24
Inventor: Ganesh Prasath Ramani, Aasish Chandra, Jayanth Shenai, Raja Angamuthu, Pankaj Kumar Mishra
Applicant: Tata Consultancy Services Limited
Applicant Address: IN Mumbai
Assignee: Tata Consultancy Services Limited
Current Assignee: Tata Consultancy Services Limited
Current Assignee Address: IN Mumbai
Agency: Finnegan, Henderson, Farabow, Garrett & Dunner LLP
Priority: IN202021013506 20200327
Main IPC: G06F16/21
IPC: G06F16/21; G06F16/215; G06F16/22; G06N20/00; G06F40/211; G06F40/30

Abstract:

With the growing availability of large amounts of data, it has become difficult to identify and manage duplicate data, especially when the data is spread across a plurality of columns. A method and system for identifying duplicate columns using statistical, semantics and machine learning techniques are provided. The system provides a design framework to compare huge datasets at the column level and identify potential duplicate columns based not on the column title but on all of the column's values. The disclosure can compare values in multiple columns and identify potential duplicate columns, where values are compared not only for an exact match but also for a semantic match, smart match, fuzzy match, and match after unit of measure (UOM) conversion, using statistical, semantics and machine learning techniques.
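
The value-based comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: it assumes a table held as a dictionary of column name to list of values, and covers only the exact and fuzzy matching ideas using the standard library's `difflib`. The function names (`normalize`, `exact_overlap`, `fuzzy_overlap`, `find_duplicate_columns`) and both thresholds are hypothetical choices made for illustration; semantic matching, smart matching, and UOM conversion are left out.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lower-case and trim a cell value so trivial formatting differences vanish."""
    return str(value).strip().lower()


def exact_overlap(col_a, col_b):
    """Jaccard overlap of the distinct normalized values in two columns."""
    a = {normalize(v) for v in col_a}
    b = {normalize(v) for v in col_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fuzzy_overlap(col_a, col_b, threshold=0.85):
    """Fraction of distinct values in col_a with a close string match in col_b.

    The 0.85 similarity threshold is an illustrative assumption.
    """
    a = {normalize(v) for v in col_a}
    b = {normalize(v) for v in col_b}
    if not a or not b:
        return 0.0
    matched = sum(
        1 for x in a
        if any(SequenceMatcher(None, x, y).ratio() >= threshold for y in b)
    )
    return matched / len(a)


def find_duplicate_columns(table, min_score=0.8):
    """Return (name_a, name_b, score) for column pairs whose values look alike.

    `table` maps column name -> list of values. Column names are never
    compared; only the values are, as the abstract describes.
    """
    names = list(table)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            score = max(
                exact_overlap(table[name_a], table[name_b]),
                fuzzy_overlap(table[name_a], table[name_b]),
            )
            if score >= min_score:
                pairs.append((name_a, name_b, score))
    return pairs
```

Calling `find_duplicate_columns` on a table with a `city` column and a differently named `location` column holding the same values in different casing would flag the pair, while an unrelated numeric column would not be flagged. The pairwise loop is quadratic in the number of columns, so a real system working on huge datasets would likely need sampling or a cheaper statistical pre-filter first.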

Public/Granted literature

US20210342320A1 Method and system for identifying duplicate columns using statistical, semantics and machine learning techniques. Public/Granted date: 2021-11-04

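IPC classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F16/00	Information retrieval; database structures; file system structures
G06F16/20	. Structured data, e.g. relational data
G06F16/21	.. Design, administration or maintenance of databases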