Invention Grant
- Patent Title: Language identification in multilingual text
- Patent Title (中): 多语言文字中的语言识别
- Application No.: US12904642
- Application Date: 2010-10-14
- Publication No.: US08635061B2
- Publication Date: 2014-01-21
- Inventor: Kang Li, Stephen Allen Kloder, Ian George Johnson, Siarhei Alonichau
- Applicant: Kang Li, Stephen Allen Kloder, Ian George Johnson, Siarhei Alonichau
- Applicant Address: US WA Redmond
- Assignee: Microsoft Corporation
- Current Assignee: Microsoft Corporation
- Current Assignee Address: US WA Redmond
- Agency: Shook, Hardy & Bacon L.L.P.
- Main IPC: G06F17/20
- IPC: G06F17/20; G06F17/27; G10L15/00

Abstract:
Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. Each section is identified and assigned a weight: more informative sections receive a higher weight and less informative sections a lower one. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed to obtain a total document score for each language, and the languages are ranked by these scores to determine the primary language of the document.
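
The scoring flow described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the tiny likelihood tables in `TOKEN_LIKELIHOODS`, the section weights in `SECTION_WEIGHTS`, the `FLOOR` value for unseen tokens, and the choice to combine token scores within a section as a mean log-likelihood. The sketch also assumes the document has already been decoded and split into labeled plain-text sections.

```python
import math
from collections import defaultdict

# Hypothetical per-language word likelihoods (illustrative values only).
TOKEN_LIKELIHOODS = {
    "en": {"the": 0.05, "language": 0.01, "is": 0.03},
    "fr": {"le": 0.05, "langue": 0.01, "est": 0.03},
}
FLOOR = 1e-6  # assumed likelihood for tokens missing from a table

# Hypothetical weights: more informative sections get a higher weight.
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0, "footer": 0.2}


def section_scores(text):
    """Combine token likelihoods within one section, per language.

    Assumption: the combination is the mean log-likelihood of the tokens.
    Returns an empty dict for a section with no tokens.
    """
    tokens = text.lower().split()
    if not tokens:
        return {}
    scores = {}
    for lang, table in TOKEN_LIKELIHOODS.items():
        total = sum(math.log(table.get(tok, FLOOR)) for tok in tokens)
        scores[lang] = total / len(tokens)
    return scores


def rank_languages(sections):
    """Sum weighted section scores per language and rank, best first.

    `sections` is a list of (section_kind, plain_text) pairs.
    """
    totals = defaultdict(float)
    for kind, text in sections:
        weight = SECTION_WEIGHTS.get(kind, 1.0)
        for lang, score in section_scores(text).items():
            totals[lang] += weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    document = [
        ("title", "the language"),
        ("body", "le langue est"),
        ("footer", ""),
    ]
    for lang, score in rank_languages(document):
        print(lang, round(score, 2))
```

In this toy document the English title outweighs the French body because of the assumed title weight, so English ranks first. With different weights or likelihood tables the ranking could change, which is the behavior the section weighting is meant to control.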
Public/Granted literature:
- US20120095748A1, Language Identification in Multilingual Text, Public/Granted day: 2012-04-19