Invention Grant
- Patent Title: Extracting informative phrases from unstructured text
- Patent Title (中): 从非结构化文本中提取信息短语
- Application No.: US11231075
- Application Date: 2005-09-20
- Publication No.: US08209335B2
- Publication Date: 2012-06-26
- Inventor: Jasmine Novak
- Applicant: Jasmine Novak
- Applicant Address: US NY Armonk
- Assignee: International Business Machines Corporation
- Current Assignee: International Business Machines Corporation
- Current Assignee Address: US NY Armonk
- Agency: Gibb I.P. Law Firm, LLC
- Main IPC: G06F7/00
- IPC: G06F7/00 ; G06F17/30

Abstract:
Disclosed is a method of extracting informative phrases from a full corpus of documents. An index of phrases contained in the full corpus of documents is built. Then, a user specifies a subset of text to analyze. The subset may be defined as: (1) all paragraphs or sentences containing terms selected as defining a subject; (2) all documents in a category; (3) all documents written within a date range; and/or (4) all documents matching a Boolean query of terms. Once the subset is specified, it is analyzed to extract informative phrases. Specifically, the index is queried to retrieve all phrases within the subset. The number of times each of the phrases occurs in the subset and in the corpus is counted. Each phrase contained in the subset is scored according to informativeness based on a comparison of a likelihood that the phrase occurs in the subset and a likelihood that the phrase occurs in the corpus as a whole. Only those phrases having an informativeness score above a predetermined value are considered highly informative and extracted.
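
The steps in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the abstract does not give the scoring formula, so the log ratio of the two relative frequencies is an assumption, as are the whitespace tokenizer, the n-gram definition of a "phrase", and the hypothetical names `extract_phrases` and `informative_phrases`.

```python
import math
from collections import Counter


def extract_phrases(tokens, max_len=2):
    """Return every word n-gram of length 1..max_len as a space-joined phrase."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


def informative_phrases(corpus, subset_ids, threshold=0.5, max_len=2):
    """Score phrases in a subset against the full corpus.

    corpus:     dict mapping document id -> text
    subset_ids: ids of the documents that make up the user-chosen subset
    threshold:  stands in for the "predetermined value" (assumed here)
    """
    # Build the phrase index once for the full corpus.
    index = {
        doc_id: Counter(extract_phrases(text.lower().split(), max_len))
        for doc_id, text in corpus.items()
    }

    # Count phrase occurrences in the corpus and in the subset.
    corpus_counts = Counter()
    for counts in index.values():
        corpus_counts.update(counts)
    subset_counts = Counter()
    for doc_id in subset_ids:
        subset_counts.update(index[doc_id])

    corpus_total = sum(corpus_counts.values())
    subset_total = sum(subset_counts.values())
    if subset_total == 0:
        return {}

    # Compare the likelihood of each phrase in the subset with its
    # likelihood in the corpus (log ratio is an assumed scoring choice).
    scores = {}
    for phrase, n_subset in subset_counts.items():
        p_subset = n_subset / subset_total
        p_corpus = corpus_counts[phrase] / corpus_total
        score = math.log(p_subset / p_corpus)
        if score > threshold:
            scores[phrase] = score
    return scores
```

With a toy corpus in which "the" appears in every document and "solar" only in the subset, "the" scores 0 (equally likely in both) and is dropped, while "solar" scores above the threshold and is kept.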
Public/Granted literature
- US20070067289A1: Extracting informative phrases from unstructured text (Public/Granted day: 2007-03-22)