Category-based lemmatizing of a phrase in a document
Abstract:
A processor receives a string of binary data representing an initial phrase that includes multiple words and is associated with a specific category. The processor removes one or more letters from the end of a word in the initial phrase to form an initial truncated version of the phrase, and runs a TF-IDF (term frequency-inverse document frequency) algorithm on that version. The processor then lemmatizes the phrase by recursively removing the remaining letters from the end of the word, running the TF-IDF algorithm on each subsequent truncated version until the version with the highest TF-IDF value is identified. The processor defines a breadth of a lemma for a lexeme based on the specific category of the phrase, and assigns the truncated version having the highest TF-IDF value to the specific category.
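The truncate-and-score loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify how TF-IDF is computed for a truncated phrase, so the sketch assumes prefix matching: term frequency is the number of places in the category's documents where the truncated phrase begins at a word boundary, and a smoothed inverse document frequency is taken over a background corpus. The function names, the `min_len` floor on the truncated word, the tie-breaking rule (keep the longer candidate), and the sample documents are all hypothetical. The sketch uses a plain loop in place of the recursion the abstract mentions.

```python
import math


def count_prefix_matches(stem: str, text: str) -> int:
    """Count places where `stem` starts at a word boundary in `text` (case-insensitive)."""
    stem, text = stem.lower(), text.lower()
    count, start = 0, 0
    while True:
        idx = text.find(stem, start)
        if idx == -1:
            return count
        # Accept the hit only if it begins a word (assumed matching rule).
        if idx == 0 or not text[idx - 1].isalnum():
            count += 1
        start = idx + 1


def tf_idf(stem: str, category_docs: list[str], all_docs: list[str]) -> float:
    """Assumed scoring: TF over the category's documents, smoothed IDF over all documents."""
    tf = sum(count_prefix_matches(stem, d) for d in category_docs)
    df = sum(1 for d in all_docs if count_prefix_matches(stem, d) > 0)
    idf = math.log((1 + len(all_docs)) / (1 + df)) + 1.0
    return tf * idf


def best_truncation(phrase, category_docs, all_docs, min_len=3):
    """Strip letters from the end of the last word; return the best-scoring version.

    Returns (None, -1.0) if the last word is too short to truncate.
    """
    head, _, last = phrase.rpartition(" ")
    best_stem, best_score = None, -1.0
    # Walk from the longest truncation to the shortest (hypothetical `min_len` floor).
    for end in range(len(last) - 1, min_len - 1, -1):
        candidate = f"{head} {last[:end]}".strip()
        score = tf_idf(candidate, category_docs, all_docs)
        if score > best_score:  # strict '>' keeps the longer candidate on ties
            best_stem, best_score = candidate, score
    return best_stem, best_score


# Hypothetical sample data for a "cardiology" category.
cardiology_docs = [
    "The patient shows heart failure symptoms.",
    "Heart failing was noted; heart failures are common.",
]
background_docs = cardiology_docs + [
    "The bridge failure was investigated.",
    "Stock markets rallied today.",
]

stem, score = best_truncation("heart failure", cardiology_docs, background_docs)
print(stem)  # prints "heart fail"
```

In this toy corpus, "heart fail" matches "heart failure", "heart failing", and "heart failures", so it scores higher than the longer truncations "heart failur" and "heart failu". The shorter "heart fai" ties with it, and the tie is resolved in favor of the longer form.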