Token matching in large document corpora

Invention Grant

US10796092B2 Token matching in large document corpora (Active)


Patent Title: Token matching in large document corpora
Application No.: US16271839

Application Date: 2019-02-10
Publication No.: US10796092B2

Publication Date: 2020-10-06
Inventor: Guy Leibovitz
Applicant: NETAPP, INC.
Applicant Address: US CA Sunnyvale
Assignee: NETAPP, INC.
Current Assignee: NETAPP, INC.
Current Assignee Address: US CA Sunnyvale
Agency: Haynes and Boone, LLP
Main IPC: G06F40/284
IPC: G06F40/284 ; G06F17/18 ; G06F16/31 ; G06F16/33 ; G06F40/242


Abstract:

A method comprising receiving a dictionary comprising a plurality of entities, wherein each entity has a length of between 1 and n tokens; constructing a probabilistic data representation model comprising n Bloom filter (BF) pairs indexed from 1 to n; populating said probabilistic data representation model with a data representation of said entities, wherein, with respect to each BF pair indexed i: (i) a first BF is populated with the first i tokens of all said entities having at least i+1 tokens, and (ii) a second BF is populated with all said entities having exactly i tokens; receiving a text corpus, wherein said text corpus is segmented into tokens; and automatically matching each token in said text corpus against said populated probabilistic data representation model, wherein said matching comprises sequentially querying each said BF pair in the order of said indexing, to determine a match.
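
The scheme in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the patented implementation: the `BloomFilter` class, the names `build_model` and `find_matches`, the filter size, the hash scheme, whitespace tokenization, and the early-exit rule in the matching loop are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + j * h2) % self.size for j in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_model(entities, n):
    """Build n (prefix_bf, exact_bf) pairs; pair i is stored at index i - 1."""
    pairs = [(BloomFilter(), BloomFilter()) for _ in range(n)]
    for entity in entities:
        tokens = entity.split()  # assumption: whitespace tokenization
        if not 1 <= len(tokens) <= n:
            continue
        # Second BF of pair len(tokens): entities with exactly that many tokens.
        pairs[len(tokens) - 1][1].add(" ".join(tokens))
        # First BF of pair i: first i tokens of entities with at least i + 1 tokens.
        for i in range(1, len(tokens)):
            pairs[i - 1][0].add(" ".join(tokens[:i]))
    return pairs


def find_matches(tokens, pairs):
    """Query the pairs in index order from every start position in the corpus."""
    matches = []
    for start in range(len(tokens)):
        for i, (prefix_bf, exact_bf) in enumerate(pairs, start=1):
            if start + i > len(tokens):
                break
            candidate = " ".join(tokens[start:start + i])
            if candidate in exact_bf:
                matches.append((start, candidate))  # probable match
            if candidate not in prefix_bf:
                break  # no longer entity can begin with these i tokens
    return matches


dictionary = ["new york", "new york city", "boston"]
model = build_model(dictionary, n=3)
corpus = "i flew from boston to new york city".split()
matches = find_matches(corpus, model)
```

Because Bloom filters can return false positives but never false negatives, every dictionary entity present in the corpus is reported, while a reported match is only probable and would normally be confirmed against the dictionary itself. The prefix filter is what allows the loop to stop early: if the first i tokens are not a prefix of any longer entity, pairs i+1 through n need not be queried.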

Published/Granted documents

US20200065371A1 TOKEN MATCHING IN LARGE DOCUMENT CORPORA Publication date: 2020-02-27


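IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: see G10L)
G06F40/20	.Natural language analysis (semantic analysis of natural language: see G06F40/30)
G06F40/279	..Recognition of textual entities
G06F40/284	...Lexical analysis, e.g. tokenisation or collocates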