Identifying codemixed text

Invention Grant

US10579733B2 Identifying codemixed text (Active)


Patent Title: Identifying codemixed text
Application No.: US15976647

Application Date: 2018-05-10
Publication No.: US10579733B2

Publication Date: 2020-03-03
Inventor: Jason Riesa, Daniel Gillick, Yuan Zhang, Anton Bakalov, Jason Baldridge, David Weiss
Applicant: Google LLC
Applicant Address: US CA Mountain View
Assignee: Google LLC
Current Assignee: Google LLC
Current Assignee Address: US CA Mountain View
Agency: Honigman LLP
Agent: Brett A. Krueger
Main IPC: G06F17/27
IPC: G06F17/27; G06N7/00

Abstract:

A method for identifying codemixed text includes receiving codemixed text and segmenting the codemixed text into a plurality of tokens. Each token includes at least one character and is delineated from any adjacent tokens by a space. For each token of the codemixed text, the method also includes extracting features from the token and predicting a probability distribution over possible languages for the token using a language identifier model configured to receive the extracted features from the token as feature inputs. The method also includes assigning a language to each token of the codemixed text by executing a greedy search on the probability distribution over the possible languages predicted for each respective token.
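
The pipeline described in the abstract (segment on spaces, extract features per token, predict a distribution over languages, assign greedily) can be sketched as follows. This is a minimal illustration, not the patented implementation: the character-bigram features, the `TOY_WEIGHTS` table, and every function name are assumptions made for the example, and the greedy step is shown in its simplest reading as an independent per-token argmax. The abstract does not specify the feature set or the model architecture.

```python
import math
from collections import Counter

# Hypothetical toy weights standing in for a trained language identifier model.
TOY_WEIGHTS = {
    "en": {"th": 2.0, "he": 1.0},
    "es": {"ca": 1.0, "sa": 1.0, "a$": 1.5},
}


def segment(text):
    """Split text into tokens delineated by spaces."""
    return text.split()


def extract_features(token, n=2):
    """Count character n-grams of the token (an assumed feature set)."""
    padded = f"^{token.lower()}$"  # mark token boundaries
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def predict_distribution(features, weights=TOY_WEIGHTS):
    """Score each language linearly, then normalize with a softmax."""
    scores = {
        lang: sum(w.get(feat, 0.0) * count for feat, count in features.items())
        for lang, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - top) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def greedy_assign(distributions):
    """Pick the most probable language for each token independently."""
    return [max(dist, key=dist.get) for dist in distributions]


def identify(text):
    """Run the full pipeline and pair each token with its assigned language."""
    tokens = segment(text)
    dists = [predict_distribution(extract_features(tok)) for tok in tokens]
    return list(zip(tokens, greedy_assign(dists)))
```

Calling `identify("the casa")` with these toy weights labels `the` as `en` and `casa` as `es`. A real system would replace `TOY_WEIGHTS` with a trained model and might constrain the greedy search across tokens rather than treating each token in isolation.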
