Invention Grant
- Patent Title: Identification of high deduplication data
- Application No.: US15954702
- Application Date: 2018-04-17
- Publication No.: US10255290B2
- Publication Date: 2019-04-09
- Inventor: Danny Harnik, Ety Khaitzin, Sergey Marenkov, Dmitry Sotnikov
- Applicant: International Business Machines Corporation
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agent: Aaron N. Pontikos
- Main IPC: G06F7/00
- IPC: G06F7/00; G06F17/30; G06F17/00

Abstract:
A computer-implemented method includes dividing a data set into a plurality of regions and dividing the plurality of regions into a plurality of chunks of fixed size. The computer-implemented method further includes determining a sample size of the plurality of chunks to be sampled for each region, wherein the sample size is determined based, at least in part, on an acceptance of a likelihood of identifying at least one collision between two regions corresponding to logical entities of a first cluster of logical entities. The computer-implemented method further includes sampling the plurality of chunks for each region based on the determined sample size. The computer-implemented method further includes generating a hash value for each chunk sampled and storing each hash value in an index. The computer-implemented method further includes identifying one or more collisions between the plurality of regions. A corresponding computer system and computer program product are also disclosed.
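The steps named in the abstract (regions, fixed-size chunks, per-region sampling, hashing into an index, collision identification) can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the use of uniform random sampling, the choice of SHA-256, and the treatment of `sample_size` as a caller-supplied parameter are all assumptions made for illustration; the abstract states only that the sample size is determined from an accepted likelihood of identifying at least one collision.

```python
import hashlib
import random
from collections import defaultdict


def sample_chunk_hashes(region, chunk_size, sample_size, rng):
    """Split one region into fixed-size chunks, sample some, and hash them.

    Uniform random sampling and SHA-256 are assumptions of this sketch.
    """
    # Only full chunks are kept; a trailing partial chunk is ignored.
    chunks = [
        region[i:i + chunk_size]
        for i in range(0, len(region) - chunk_size + 1, chunk_size)
    ]
    picked = rng.sample(chunks, min(sample_size, len(chunks)))
    return {hashlib.sha256(chunk).digest() for chunk in picked}


def find_collisions(data, region_size, chunk_size, sample_size, seed=0):
    """Return the set of region-id pairs that share at least one sampled hash."""
    rng = random.Random(seed)
    index = defaultdict(set)  # hash value -> ids of regions that produced it

    # Divide the data set into regions and index the sampled chunk hashes.
    for region_id, start in enumerate(range(0, len(data), region_size)):
        region = data[start:start + region_size]
        for digest in sample_chunk_hashes(region, chunk_size, sample_size, rng):
            index[digest].add(region_id)

    # A collision is any hash value recorded for more than one region.
    collisions = set()
    for region_ids in index.values():
        ordered = sorted(region_ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                collisions.add((first, second))
    return collisions


if __name__ == "__main__":
    # Regions 0 and 2 hold identical bytes; region 1 is unrelated.
    data = bytes(range(16)) + bytes(range(100, 116)) + bytes(range(16))
    print(find_collisions(data, region_size=16, chunk_size=4, sample_size=4))
```

In this example every chunk is sampled (`sample_size` equals the number of chunks per region), so the duplicate regions are always detected. With a smaller sample, detection becomes probabilistic, which is the trade-off the abstract's sample-size determination addresses.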
Public/Granted literature
- US20180225300A1 IDENTIFICATION OF HIGH DEDUPLICATION DATA Public/Granted day: 2018-08-09