Invention Grant
- Patent Title: Techniques for estimating item frequencies in large data sets
- Application No.: US10950800
- Application Date: 2004-09-27
- Publication No.: US08489645B2
- Publication Date: 2013-07-16
- Inventor: George Andrei Mihaila, Min Wang
- Applicant: George Andrei Mihaila, Min Wang
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agency: Ryan, Mason & Lewis, LLP
- Agent: Anne V. Dougherty
- Main IPC: G06F7/00
- IPC: G06F7/00

Abstract:
Techniques for estimating the frequencies of items (e.g., data items or objects) in large data sets are disclosed. For example, a technique for determining items and their frequencies at multiple levels of interest in a collection of nested bags includes the following steps. A hierarchy of a plurality of levels of nested bags and the levels of interest are inputted. Among the plurality of levels, a subset of bags is sampled from at least one level. At each level of interest, the frequency of each distinct item in the bags obtained in the sampling step is counted. At each level of interest, the item frequencies obtained in the counting step are extrapolated based on sampling ratios associated with the sampling step. At each level of interest, the items are sorted according to their frequencies obtained from the extrapolating step, and those items with the highest frequencies are retained. A bag may refer to one or more subsets or groups of data items or objects. Also, a bag may itself contain one or more other bags.
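The steps in the abstract (sample, count, extrapolate, sort and retain) can be illustrated with a minimal Python sketch. It is not the patented method: the two-level hierarchy (a collection of outer bags, each holding inner bags of items), uniform random sampling of inner bags, the fixed `top_k` cutoff, and the names `estimate_frequencies` and `top_items` are all assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def top_items(estimates, top_k):
    # Sort by estimated frequency (descending); ties are broken by item
    # value, so items are assumed to be mutually comparable (e.g., strings).
    ranked = sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def estimate_frequencies(collection, sample_ratio, top_k, seed=0):
    """Hypothetical two-level example.

    collection: dict mapping an outer bag name to a list of inner bags,
    where each inner bag is a list of items.
    Returns (per-outer-bag top items, whole-collection top items), i.e.
    estimates at two assumed levels of interest.
    """
    rng = random.Random(seed)
    per_bag = {}
    overall = defaultdict(float)
    for name, inner_bags in collection.items():
        if not inner_bags:
            per_bag[name] = []
            continue
        # Sampling step: draw a uniform random subset of the inner bags.
        n = min(len(inner_bags), max(1, round(len(inner_bags) * sample_ratio)))
        sampled = rng.sample(inner_bags, n)
        # Counting step: frequency of each distinct item in the sampled bags.
        counts = Counter()
        for bag in sampled:
            counts.update(bag)
        # Extrapolation step: scale by the inverse of the realized sampling ratio.
        scale = len(inner_bags) / n
        estimates = {item: count * scale for item, count in counts.items()}
        for item, value in estimates.items():
            overall[item] += value
        # Sort-and-retain step at the outer-bag level.
        per_bag[name] = top_items(estimates, top_k)
    # Sort-and-retain step at the whole-collection level.
    return per_bag, top_items(overall, top_k)


if __name__ == "__main__":
    data = {
        "A": [["x", "y"], ["x"], ["x", "z"], ["y"]],
        "B": [["z"], ["z", "x"]],
    }
    print(estimate_frequencies(data, sample_ratio=0.5, top_k=2))
```

With `sample_ratio=1.0` every inner bag is sampled and the estimates reduce to exact counts; smaller ratios trade accuracy for less scanning, which is the motivation for sampling in large data sets.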
Public/Granted literature
- US20060074963A1 Techniques for estimating item frequencies in large data sets. Public/Granted day: 2006-04-06