Recombining incorrectly separated tokens in natural language processing
Abstract:
To recombine incorrectly separated tokens in NLP, a determination is made whether a token from an ordered set of tokens is present in a dictionary related to a corpus from which the ordered set is extracted. When the token is not present in the dictionary, and when a compounding threshold has not been reached, the token is agglutinated with a next adjacent token in the ordered set to form a compound token. The compounding threshold limits the number of tokens that can be agglutinated to form a compound token. A determination is made whether the compound token is present in the dictionary. A weight is assigned to the compound token when the compound token is present in the dictionary, and a confidence rating of the compound token is computed as a function of the weight. The compound token and the confidence rating are used in NLP of the corpus.
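The procedure described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the function name `recombine`, the tuple output, and the choice of confidence function (here simply the weight, taken as the fraction of the threshold consumed) are all assumptions introduced for illustration.

```python
def recombine(tokens, dictionary, threshold):
    """Hypothetical sketch of the token-recombination procedure.

    tokens     -- ordered list of tokens extracted from a corpus
    dictionary -- set of known words related to that corpus
    threshold  -- max number of tokens that may form one compound
    """
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in dictionary:
            # Token is already a dictionary word; keep it as-is.
            result.append((token, 1.0))
            i += 1
            continue
        # Agglutinate with adjacent tokens until the compounding
        # threshold is reached or a dictionary match is found.
        compound = token
        count = 1
        j = i + 1
        matched = False
        while count < threshold and j < len(tokens):
            compound += tokens[j]
            count += 1
            j += 1
            if compound in dictionary:
                # Assumed weighting scheme: weight grows with the
                # number of agglutinated tokens; the confidence
                # rating is a (placeholder) function of that weight.
                weight = count / threshold
                confidence = weight
                result.append((compound, confidence))
                i = j
                matched = True
                break
        if not matched:
            # No dictionary match within the threshold; pass the
            # token through with zero confidence.
            result.append((token, 0.0))
            i += 1
    return result
```

For example, with the dictionary `{"natural", "language"}` and a threshold of 2, the split input `["natu", "ral", "language"]` is recombined into `natural` (confidence 1.0) while `language` passes through unchanged.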