Invention Grant
- Patent Title: Identifying codemixed text
- Application No.: US15976647
- Application Date: 2018-05-10
- Publication No.: US10579733B2
- Publication Date: 2020-03-03
- Inventor: Jason Riesa, Daniel Gillick, Yuan Zhang, Anton Bakalov, Jason Baldridge, David Weiss
- Applicant: Google LLC
- Applicant Address: US CA Mountain View
- Assignee: Google LLC
- Current Assignee: Google LLC
- Current Assignee Address: US CA Mountain View
- Agency: Honigman LLP
- Agent: Brett A. Krueger
- Main IPC: G06F17/27
- IPC: G06F17/27; G06N7/00

Abstract:
A method for identifying codemixed text includes receiving codemixed text and segmenting the codemixed text into a plurality of tokens. Each token includes at least one character and is delineated from any adjacent tokens by a space. For each token of the codemixed text, the method also includes extracting features from the token and predicting a probability distribution over possible languages for the token using a language identifier model configured to receive the extracted features from the token as feature inputs. The method also includes assigning a language to each token of the codemixed text by executing a greedy search on the probability distribution over the possible languages predicted for each respective token.
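
The steps in the abstract (space-based segmentation, per-token feature extraction, a predicted probability distribution over languages, and a greedy assignment) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the character-bigram features, the toy weight table `TOY_WEIGHTS`, the softmax scoring, and all function names are hypothetical. The greedy search is read in its simplest form, picking the most probable language for each token independently; the claimed method may constrain the search further.

```python
import math
from collections import Counter

# Hypothetical toy "language identifier model": per-language weights for
# character bigrams. A real model would be learned from data.
TOY_WEIGHTS = {
    "en": {"th": 2.0, "he": 1.5, "is": 1.0, "oo": 1.0},
    "es": {"qu": 2.0, "ue": 1.5, "ie": 1.0, "os": 1.0},
}


def segment(text):
    """Split text on spaces; every token has at least one character."""
    return [tok for tok in text.split(" ") if tok]


def extract_features(token):
    """Count character bigrams of the lowercased token (assumed feature set)."""
    padded = f"^{token.lower()}$"  # mark token boundaries
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def predict_distribution(features):
    """Score each language and normalize the scores with a softmax."""
    scores = {
        lang: sum(weights.get(ngram, 0.0) * count
                  for ngram, count in features.items())
        for lang, weights in TOY_WEIGHTS.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(score - top) for lang, score in scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


def assign_languages(text):
    """Greedy assignment: take the most probable language per token."""
    assignments = []
    for token in segment(text):
        dist = predict_distribution(extract_features(token))
        assignments.append((token, max(dist, key=dist.get)))
    return assignments


if __name__ == "__main__":
    print(assign_languages("the queso"))
```

With the toy weights above, `assign_languages("the queso")` labels `the` as `en` and `queso` as `es`. Because each token is decided independently, the sketch ignores any sentence-level constraints on which languages may co-occur.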