Invention Grant
- Patent Title: Sampling-based deduplication estimation
- Application No.: US14994161
- Application Date: 2016-01-13
- Publication No.: US10198455B2
- Publication Date: 2019-02-05
- Inventor: Danny Harnik, David Chambliss, Oded Margalit, Dmitry Sotnikov
- Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agent: Daniel Kligler
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
A method, including partitioning a dataset into a first number of data units, and selecting, based on a sampling ratio, a second number of the data units. A hash value is calculated for each of the selected data units, and a first histogram is computed indicating a first duplication count for each of the calculated hash values. Based on respective frequencies of the calculated hash values, a second histogram is computed indicating an observed frequency for each of the first duplication counts in the first histogram, and based on the sampling ratio and the second histogram, a target function is derived. A third histogram that minimizes the target function is derived, the third histogram including, for the first number of the storage units, second duplication counts and a respective predicted frequency for each of the second duplication counts. Finally, a deduplication ratio is determined based on the third histogram.
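The first steps of the abstract (partitioning, sampling, hashing, and the first two histograms) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, fixed-size units, SHA-256 hashing, and uniform random sampling are all assumptions chosen for illustration, and the deduplication ratio is assumed here to mean distinct units divided by total units. The derivation of the target function and of the third histogram is not shown.

```python
import hashlib
import random
from collections import Counter


def sample_histograms(data: bytes, unit_size: int, sampling_ratio: float, seed: int = 0):
    """Return (first_histogram, second_histogram) for a random sample of data units.

    Hypothetical helper: fixed-size units and SHA-256 are illustrative assumptions.
    """
    # Partition the dataset into data units (the "first number" of units).
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    if not units:
        raise ValueError("dataset is empty")
    # Select the "second number" of units according to the sampling ratio.
    k = min(len(units), max(1, round(len(units) * sampling_ratio)))
    sampled = random.Random(seed).sample(units, k)
    # First histogram: hash value -> duplication count within the sample.
    first = Counter(hashlib.sha256(u).hexdigest() for u in sampled)
    # Second histogram: duplication count -> number of hashes observed with that count.
    second = Counter(first.values())
    return first, second


def dedup_ratio(histogram) -> float:
    """Distinct units / total units for a {duplication count: frequency} histogram.

    Assumed definition of the ratio; the source does not spell it out.
    """
    distinct = sum(histogram.values())
    total = sum(count * freq for count, freq in histogram.items())
    return distinct / total


if __name__ == "__main__":
    # Five 4-byte units: one value appears three times, two values appear once.
    data = b"AAAA" * 3 + b"BBBB" + b"CCCC"
    first, second = sample_histograms(data, unit_size=4, sampling_ratio=1.0)
    print(dict(second))
    print(dedup_ratio(second))
```

With a sampling ratio below 1, applying `dedup_ratio` directly to the sample's second histogram generally does not match the ratio of the full dataset, which is why the abstract derives a third, predicted histogram for the full number of units before computing the ratio.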
Public/Granted literature
- US20170199895A1 SAMPLING-BASED DEDUPLICATION ESTIMATION Public/Granted day: 2017-07-13