Identifying codemixed text
Abstract:
A method for identifying codemixed text includes receiving codemixed text and segmenting the codemixed text into a plurality of tokens. Each token includes at least one character and is delineated from any adjacent tokens by a space. For each token of the codemixed text, the method also includes extracting features from the token and predicting a probability distribution over possible languages for the token using a language identifier model configured to receive the extracted features from the token as feature inputs. The method also includes assigning a language to each token of the codemixed text by executing a greedy search on the probability distribution over the possible languages predicted for each respective token.
Information query
Patent Agency Ranking
0/0