Jaccard similarity estimation of weighted samples: circular smearing with scaling and randomized rounding sample selection
Abstract:
The disclosed systems and methods include pre-calculation, per object, of object feature bin values, for identifying close matches between objects, such as text documents, that have numerous weighted features, such as specific-length word sequences. Predetermined feature weights get scaled with two or more selected adjacent scaling factors, and randomly rounded. The expanded set of weighted features of an object gets min-hashed into a predetermined number of feature bins. For each feature that qualifies to be inserted by min-hashing into a particular feature bin, and across successive feature bins, the expanded set of weighted features get min-hashed and circularly smeared into the predetermined number of feature bins. Completed pre-calculated sets of feature bin values for each scaling of the object, together with the scaling factor, are stored for use in comparing sampled features of the object with sampled features of other objects by calculating an estimated Jaccard similarity index.
Information query
Patent Agency Ranking
0/0