Identification of high deduplication data

Invention Grant

US10795862B2 Identification of high deduplication data (Active)

Patent Title: Identification of high deduplication data
Application No.: US15364727

Application Date: 2016-11-30
Publication No.: US10795862B2

Publication Date: 2020-10-06
Inventor: Danny Harnik, Ety Khaitzin, Sergey Marenkov, Dmitry Sotnikov
Applicant: International Business Machines Corporation
Applicant Address: US NY Armonk
Assignee: International Business Machines Corporation
Current Assignee: International Business Machines Corporation
Current Assignee Address: US NY Armonk
Agent: Aaron N. Pontikos
Main IPC: G06F7/00
IPC: G06F7/00; G06F16/174; G06F17/00

Abstract:

A computer-implemented method includes dividing a data set into a plurality of regions and dividing the plurality of regions into a plurality of chunks of fixed size. The computer-implemented method further includes determining a sample size of the plurality of chunks to be sampled for each region, wherein the sample size is determined based, at least in part, on an acceptance of a likelihood of identifying at least one collision between two regions corresponding to logical entities of a first cluster of logical entities. The computer-implemented method further includes sampling the plurality of chunks for each region based on the determined sample size. The computer-implemented method further includes generating a hash value for each chunk sampled and storing each hash value in an index. The computer-implemented method further includes identifying one or more collisions between the plurality of regions. A corresponding computer system and computer program product are also disclosed.
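
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `find_region_collisions`, its parameters, the choice of SHA-256, uniform random sampling, and the dropping of a trailing partial chunk are all assumptions made for illustration. In particular, the abstract derives the sample size from an accepted likelihood of finding a collision, whereas this sketch simply takes `sample_size` as an input.

```python
import hashlib
import random
from collections import defaultdict


def find_region_collisions(data, region_size, chunk_size, sample_size, seed=0):
    """Return pairs of region numbers that share at least one sampled chunk hash.

    Hypothetical helper: names, SHA-256, and uniform sampling are assumptions.
    """
    rng = random.Random(seed)
    index = defaultdict(set)  # chunk hash -> set of region numbers

    # Divide the data set into regions, then each region into fixed-size chunks.
    for region_no, start in enumerate(range(0, len(data), region_size)):
        region = data[start:start + region_size]
        chunks = [
            region[i:i + chunk_size]
            # A trailing partial chunk is dropped (an assumption of this sketch).
            for i in range(0, len(region) - chunk_size + 1, chunk_size)
        ]

        # Sample chunks from the region; sample_size is supplied by the caller.
        sampled = rng.sample(chunks, min(sample_size, len(chunks)))

        # Hash each sampled chunk and store the hash in the index.
        for chunk in sampled:
            index[hashlib.sha256(chunk).hexdigest()].add(region_no)

    # A collision is any hash recorded for more than one region.
    collisions = set()
    for regions in index.values():
        ordered = sorted(regions)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                collisions.add((first, second))
    return collisions


# Regions 0 and 2 hold identical content; region 1 differs.
data = b"A" * 64 + b"B" * 64 + b"A" * 64
print(find_region_collisions(data, region_size=64, chunk_size=16, sample_size=2))
```

Because each region here consists of identical chunks, any sample from regions 0 and 2 yields the same hash, so the pair is always reported. With real data, whether a shared chunk is sampled in both regions depends on the sample size, which is why the abstract ties that size to an accepted collision likelihood.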

Public/Granted literature

US20180150473A1 IDENTIFICATION OF HIGH DEDUPLICATION DATA Public/Granted day: 2018-05-31

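IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F7/00	Methods or arrangements for processing data by operating upon the order or content of the data handled (logic circuits: H03K19/00)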