Invention Grant
- Patent Title: Token matching in large document corpora
- Application No.: US16271839
- Application Date: 2019-02-10
- Publication No.: US10796092B2
- Publication Date: 2020-10-06
- Inventor: Guy Leibovitz
- Applicant: NETAPP, INC.
- Applicant Address: US CA Sunnyvale
- Assignee: NETAPP, INC.
- Current Assignee: NETAPP, INC.
- Current Assignee Address: US CA Sunnyvale
- Agency: Haynes and Boone, LLP
- Main IPC: G06F40/284
- IPC: G06F40/284 ; G06F17/18 ; G06F16/31 ; G06F16/33 ; G06F40/242

Abstract:
A method comprising receiving a dictionary comprising a plurality of entities, wherein each entity has a length of between 1 and n tokens; constructing a probabilistic data representation model comprising n Bloom filter (BF) pairs indexed from 1 to n; populating said probabilistic data representation model with a data representation of said entities, wherein, with respect to each BF pair indexed i: (i) a first BF is populated with the first i tokens of all said entities having at least i+1 tokens, and (ii) a second BF is populated with all said entities having exactly i tokens; receiving a text corpus, wherein said text corpus is segmented into tokens; and automatically matching each token in said text corpus against said populated probabilistic data representation model, wherein said matching comprises sequentially querying each said BF pair in the order of said indexing, to determine a match.
Public/Granted literature
- US20200065371A1 TOKEN MATCHING IN LARGE DOCUMENT CORPORA Public/Granted day: 2020-02-27