Abstract:
The present disclosure provides for implementing a two-level fingerprint caching scheme for a client cache and a server cache. The client cache hit ratio can be improved by pre-populating the client cache with fingerprints that are relevant to the client. Relevant fingerprints include fingerprints used during a recent time period (e.g., fingerprints of segments that are included in the last full backup image and any following incremental backup images created for the client after the last full backup image), and thus are referred to as fingerprints with good temporal locality. Relevant fingerprints also include fingerprints associated with a storage container that has good spatial locality, and thus are referred to as fingerprints with good spatial locality. A pre-set threshold established for the client cache (e.g., threshold Tc) is used to determine whether a storage container (and thus fingerprints associated with the storage container) has good spatial locality.
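The pre-population step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the `containers` mapping, and in particular the rule "a container has good spatial locality when it holds at least Tc of the client's recently used fingerprints" are assumptions, since the abstract does not say how Tc is applied.

```python
def prepopulate_client_cache(recent_fps, containers, threshold_tc):
    """Build the initial client cache contents.

    recent_fps:   fingerprints from the last full backup image and the
                  incrementals that followed it (temporal locality).
    containers:   hypothetical map of container id -> fingerprints stored there.
    threshold_tc: assumed meaning: minimum number of recent fingerprints a
                  container must hold to count as having good spatial locality.
    """
    recent = set(recent_fps)
    cache = set(recent)  # fingerprints with good temporal locality
    for fps in containers.values():
        hits = len(recent & set(fps))
        if hits >= threshold_tc:
            # Container has good spatial locality: take all its fingerprints.
            cache.update(fps)
    return cache


containers = {"c1": ["a", "b", "c", "x"], "c2": ["d", "y"]}
cache = prepopulate_client_cache(["a", "b", "c", "d"], containers, threshold_tc=3)
```

In this example container c1 holds three recent fingerprints and so contributes its neighbor "x", while c2 holds only one and contributes nothing beyond the recent fingerprint "d".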
Abstract:
A method to partition a deduplication pool is provided. The method includes determining that an amount of data in a plurality of data containers of the deduplication pool has reached a data capacity threshold and comparing each data container of the plurality of data containers with at least one other of the plurality of data containers as to amount of shared data. The method includes grouping, based on results of the comparing, the plurality of data containers into a plurality of groups of data containers, with data sharing from each of the plurality of groups of data containers to each other of the plurality of groups of data containers less than a data sharing threshold and data sharing inside each of the plurality of groups of data containers greater than the data sharing threshold.
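One way to picture the grouping step is as connected components over a "shares enough data" graph. The sketch below is a simplification under stated assumptions: sharing is measured as the number of common segment fingerprints between two containers, containers are linked when that count meets the threshold, and the capacity check that triggers partitioning is omitted. The function name and data layout are hypothetical.

```python
def partition_pool(containers, sharing_threshold):
    """Group containers so that containers sharing at least
    `sharing_threshold` fingerprints end up in the same group.

    containers: hypothetical map of container id -> set of fingerprints.
    """
    ids = list(containers)
    parent = {c: c for c in ids}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Compare each container with every other container.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = len(containers[a] & containers[b])
            if shared >= sharing_threshold:
                parent[find(a)] = find(b)

    groups = {}
    for c in ids:
        groups.setdefault(find(c), set()).add(c)
    return sorted(groups.values(), key=sorted)


pool = {
    "c1": {1, 2, 3},
    "c2": {2, 3, 4},
    "c3": {9, 10},
    "c4": {9, 10, 11},
}
groups = partition_pool(pool, sharing_threshold=2)
```

With this pairwise rule, any two containers in different groups share less than the threshold. Sharing inside a group is only guaranteed along the links that joined it, so a production partitioner would likely need a stricter group-level measure.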
Abstract:
A method for data locality control in a deduplication system is provided. The method includes forming a fingerprint cache from a backup image corresponding to a first backup operation. The method includes removing one or more fingerprints from inclusion in the fingerprint cache, in response to the one or more fingerprints having a data segment locality, in a container, less than a threshold of data segment locality. The container has one or more data segments corresponding to the one or more fingerprints. The method includes applying the fingerprint cache, with the one or more fingerprints removed from inclusion therein, to a second backup operation, wherein at least one method operation is executed through a processor.
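The pruning step can be sketched as below. The abstract does not define how data segment locality is measured, so this example assumes it is the number of the backup image's segments that live in the same container. The names `prune_fingerprint_cache` and `container_of` are hypothetical.

```python
from collections import Counter


def prune_fingerprint_cache(image_fps, container_of, locality_threshold):
    """Drop fingerprints whose container holds too few of the image's segments.

    image_fps:    fingerprints taken from the first backup image.
    container_of: hypothetical map of fingerprint -> container id.
    """
    # Assumed locality measure: image segments per container.
    per_container = Counter(container_of[fp] for fp in image_fps)
    return [
        fp for fp in image_fps
        if per_container[container_of[fp]] >= locality_threshold
    ]


container_of = {"a": 1, "b": 1, "c": 1, "d": 2}
pruned = prune_fingerprint_cache(["a", "b", "c", "d"], container_of, 2)
```

The pruned list is what would be applied to the second backup operation. Here "d" is dropped because it is the only segment of the image stored in container 2.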
Abstract:
A method for data container group management in a deduplication system is provided. The method includes arranging a plurality of data container groups according to a plurality of file systems. A subset of the plurality of data container groups corresponds to each of the plurality of file systems, each of the plurality of data container groups having a reference database, a plurality of data containers, and a data container group identifier (ID). The method includes performing a first backup process for a first client-policy pair with deduplication via a first one of the plurality of data container groups and performing a second backup process for a second client-policy pair with deduplication via a second one of the plurality of data container groups.
Abstract:
A deduplication storage system and associated methods are described. The deduplication storage system may split data objects into segments and store the segments. A plurality of data segment containers may be maintained. Each of the containers may include two or more of the data segments. Maintaining the containers may include maintaining a respective logical size of each container. In response to detecting that the logical size of a particular container has fallen below a threshold level, the deduplication storage system may perform an operation to reclaim the storage space allocated to one or more of the data segments included in the particular container.
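A minimal sketch of logical-size tracking follows. It assumes that a container's logical size is the total size of its segments that are still referenced, and that reclamation frees the segments with no remaining references. The class, its methods, and the per-segment reference counts are illustrative assumptions, not details taken from the abstract.

```python
class Container:
    def __init__(self, reclaim_threshold):
        self.reclaim_threshold = reclaim_threshold
        self.sizes = {}  # fingerprint -> segment size in bytes
        self.refs = {}   # fingerprint -> number of objects referencing it

    def add(self, fp, size):
        """Store a segment, or add a reference to an existing one."""
        self.sizes[fp] = size
        self.refs[fp] = self.refs.get(fp, 0) + 1

    def logical_size(self):
        # Assumed definition: bytes of segments that are still referenced.
        return sum(self.sizes[fp] for fp, n in self.refs.items() if n > 0)

    def release(self, fp):
        """Drop one reference; reclaim space if the logical size is too low.

        Returns the number of bytes reclaimed (0 if nothing was reclaimed).
        """
        self.refs[fp] -= 1
        if self.logical_size() < self.reclaim_threshold:
            return self.reclaim()
        return 0

    def reclaim(self):
        dead = [fp for fp, n in self.refs.items() if n == 0]
        freed = sum(self.sizes.pop(fp) for fp in dead)
        for fp in dead:
            del self.refs[fp]
        return freed


c = Container(reclaim_threshold=100)
c.add("a", 60)
c.add("b", 60)
freed = c.release("a")  # logical size drops from 120 to 60, below threshold
```

Deferring reclamation until the logical size crosses the threshold avoids rewriting a container every time a single segment becomes unreferenced.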
Abstract:
In some embodiments, a method of maintaining a reference list for data deduplication is provided. The method includes discarding a newly arriving data segment in response to finding that a fingerprint of the newly arriving data segment matches an existing fingerprint in a plurality of fingerprints on a fingerprint-to-file reference list. The method includes adding a source for the newly arriving data segment to a list for the existing fingerprint in the fingerprint-to-file reference list, in response to the fingerprint-to-file reference list indicating that the existing fingerprint does not correspond to a hot data segment. The method also includes setting an indication in the fingerprint-to-file reference list that the existing fingerprint corresponds to the hot data segment, in response to the list for the existing fingerprint meeting or exceeding a predetermined number of entries. Other embodiments are included.
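The reference-list behavior above can be sketched as follows. The class and method names are hypothetical, and the handling of a fingerprint seen for the first time (store the segment and start its list) is an assumption that the abstract does not spell out.

```python
class ReferenceList:
    def __init__(self, hot_limit):
        self.hot_limit = hot_limit  # the predetermined number of entries
        self.sources = {}           # fingerprint -> list of sources (files)
        self.hot = set()            # fingerprints marked as hot segments

    def ingest(self, fp, source):
        """Return True if the segment must be stored, False if discarded."""
        if fp not in self.sources:
            # Assumed: a new fingerprint starts its own source list.
            self.sources[fp] = [source]
            return True
        if fp not in self.hot:
            # Not hot yet: record where this duplicate came from.
            self.sources[fp].append(source)
            if len(self.sources[fp]) >= self.hot_limit:
                self.hot.add(fp)  # stop tracking further sources
        return False  # duplicate segment is discarded either way


refs = ReferenceList(hot_limit=3)
results = [refs.ingest("f1", s) for s in ["a.txt", "b.txt", "c.txt", "d.txt"]]
```

Once a fingerprint is marked hot, its source list stops growing, which bounds the size of the reference list for very widely shared segments.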
Abstract:
A method for managing deduplication reference data may include (1) identifying multiple data containers configured to store a plurality of deduplicated data segments that are referenced by multiple data objects within a deduplicated data system, (2) maintaining multiple reference databases including (i) a first reference database corresponding to a first subset of the data containers and (ii) a second reference database corresponding to a second subset of the data containers, the second subset differing from the first subset, (3) determining that a data object references at least one data segment within a first data container within the first subset but does not reference any data segment within a second data container within the second subset, and (4) updating the first reference database with information specifying that the data object references at least one data segment within at least one data container within the first subset of data containers.
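Steps (3) and (4) can be sketched as a selective update: only the reference databases whose container subset the object actually touches are written. The function name and the dictionary layout are assumptions made for illustration.

```python
def update_reference_dbs(object_id, referenced_containers, db_containers, reference_dbs):
    """Record an object's references only in the databases it touches.

    referenced_containers: set of container ids the object references.
    db_containers: hypothetical map of database name -> its container subset.
    reference_dbs: map of database name -> {object id -> referenced containers}.
    """
    for db_name, subset in db_containers.items():
        touched = subset & referenced_containers
        if touched:
            entry = reference_dbs.setdefault(db_name, {})
            entry.setdefault(object_id, set()).update(touched)
    return reference_dbs


db_containers = {"db1": {"c1", "c2"}, "db2": {"c3"}}
dbs = update_reference_dbs("obj-1", {"c1"}, db_containers, {})
```

Because "obj-1" references nothing in the second subset, the second database is left untouched, which keeps each update local to one subset of containers.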