Deriving a multi-pass matching algorithm for data de-duplication
Abstract:
Methods, systems, and computer program products for deriving a multi-pass matching algorithm for data de-duplication are provided herein. A method includes identifying multiple passes across multiple databases using a set of one or more blocking columns derived from a set of trained input data; identifying, in each of the multiple passes, one or more columns across the multiple databases that match one or more of the blocking columns; selecting a given pass from the multiple passes, wherein said given pass comprises a maximum number of matching columns within the multiple passes; determining, for the given pass, data that conform to the given pass comprising (i) a set of matching columns, (ii) one or more matching types and (iii) one or more weights; and determining one or more subsequent passes across the multiple databases iteratively by removing the data that conform to the given pass.
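The iterative scheme in the abstract can be illustrated with a minimal sketch. It is not the claimed method: it assumes exact equality as the only matching type, omits weights entirely, and takes the candidate blocking-column sets as given rather than deriving them from training data. The names `block`, `multi_pass_dedup`, `records`, and `candidate_passes` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def block(records, blocking_cols):
    """Group record ids by their values on the given blocking columns."""
    groups = defaultdict(list)
    for rid, rec in records.items():
        key = tuple(rec.get(col) for col in blocking_cols)
        if None not in key:  # skip records missing any blocking column
            groups[key].append(rid)
    return groups


def multi_pass_dedup(records, candidate_passes):
    """Run passes from most to fewest matching columns.

    Assumption: a record "conforms" to a pass when it agrees exactly with
    at least one other record on every column of that pass. Conforming
    records are removed before the next pass runs.
    """
    remaining = dict(records)
    # Select the pass with the maximum number of columns first.
    ordered = sorted(candidate_passes, key=len, reverse=True)
    results = []
    for cols in ordered:
        groups = block(remaining, cols)
        dup_groups = sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)
        if dup_groups:
            results.append((cols, dup_groups))
        # Remove conforming data so later passes only see unmatched records.
        for ids in dup_groups:
            for rid in ids:
                remaining.pop(rid)
    return results, remaining


# Illustrative data only; none of these values come from the source.
records = {
    1: {"name": "Ann Lee", "zip": "10001", "phone": "555-1"},
    2: {"name": "Ann Lee", "zip": "10001", "phone": "555-1"},
    3: {"name": "Bob Ray", "zip": "20002", "phone": "555-2"},
    4: {"name": "Bob Ray", "zip": "20002", "phone": "555-9"},
    5: {"name": "Cy Dee", "zip": "30003", "phone": "555-3"},
}
candidate_passes = [("name",), ("name", "zip", "phone"), ("name", "zip")]

results, remaining = multi_pass_dedup(records, candidate_passes)
for cols, dup_groups in results:
    print(cols, dup_groups)
print("unmatched:", sorted(remaining))
```

In this sketch the strictest pass (three columns) claims records 1 and 2, the two-column pass then claims records 3 and 4 from what is left, and record 5 stays unmatched. A fuller version would attach a matching type and a weight to each column, as the abstract describes.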