Category-based lemmatizing of a phrase in a document

Invention Grant

US09672278B2 Category-based lemmatizing of a phrase in a document (Active)

Patent Title: Category-based lemmatizing of a phrase in a document
Application No.: US14820601

Application Date: 2015-08-07
Publication No.: US09672278B2

Publication Date: 2017-06-06
Inventor: James E. Bostick; John M. Ganci, Jr.; John P. Kaemmerer; Craig M. Trim
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Applicant Address: US NY Armonk
Assignee: International Business Machines Corporation
Current Assignee: International Business Machines Corporation
Current Assignee Address: US NY Armonk
Agency: Law Office of Jim Boice
Agent: John R. Pivnichny
Main IPC: G06F17/30
IPC: G06F17/30; G06F17/27

Abstract:

A processor receives a string of binary data that represents an initial phrase that includes multiple words and is associated with a specific category. The processor removes one or more letters from an end of a word in the initial phrase to form an initial truncated version of the phrase. The processor runs a TF-IDF algorithm on the initial truncated version of the phrase, and lemmatizes subsequent truncated versions of the initial phrase by recursively removing remaining letters from the end of the word. The processor runs the TF-IDF algorithm on subsequent truncated versions of the initial truncated version of the initial phrase until a highest TF-IDF value is identified. The processor defines a breadth of a lemma for a lexeme based on the specific category of the phrase, and assigns the specific truncated version having the highest TF-IDF value to the specific category.
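The truncate-and-score loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names (tf_idf, truncations, best_truncation), the min_stem cutoff, the substring-based term counting, the plain log(N/df) weighting, the sample corpus, and the "job titles" category are all hypothetical choices made for the example. The sketch also iterates over truncations rather than recursing, and always truncates the last word of the phrase.

```python
import math


def tf_idf(term, doc, corpus):
    """Score `term` in `doc` against `corpus` (assumed: substring counts, log(N/df))."""
    tf = doc.lower().count(term)                       # raw term frequency in the document
    df = sum(1 for d in corpus if term in d.lower())   # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def truncations(phrase, min_stem=3):
    """Yield the phrase with its last word shortened one letter at a time.

    The full phrase comes first; `min_stem` (an assumption) is the shortest
    stem length of the last word that is still considered.
    """
    head, _, last = phrase.rpartition(" ")
    for end in range(len(last), min_stem - 1, -1):
        yield f"{head} {last[:end]}".strip()


def best_truncation(phrase, doc, corpus, min_stem=3):
    """Return (score, truncated_phrase) with the highest TF-IDF value.

    On ties, max() keeps the first candidate, which is the longest truncation.
    """
    scored = [(tf_idf(c, doc, corpus), c) for c in truncations(phrase.lower(), min_stem)]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    doc = ("The engineering manager met engineering management to discuss "
           "how engineering managers are hired.")
    corpus = [doc, "Engineering managers hire fast", "Cooking recipes", "Travel guide"]

    score, stem = best_truncation("engineering managers", doc, corpus)
    # Assign the winning truncation to the phrase's category (hypothetical category name).
    lemma_by_category = {"job titles": stem}
    print(lemma_by_category, round(score, 3))
```

In this toy corpus the truncation "engineering manage" wins because it matches "manager", "management", and "managers" in the document, while the longer forms match fewer occurrences. A real system would presumably use token-aware matching and a category-dependent rule for how far truncation may go, which the abstract refers to as the breadth of the lemma.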

Public/Granted literature

US20150347575A1 Category-based lemmatizing of a phrase in a document. Publication/grant date: 2015-12-03
