Invention Grant
- Patent Title: Language identification for documents containing multiple languages
- Patent Title (中): 包含多种语言的文档的语言识别
- Application No.: US12274182
- Application Date: 2008-11-19
- Publication No.: US08224641B2
- Publication Date: 2012-07-17
- Inventor: Sauraj Goswami
- Applicant: Sauraj Goswami
- Applicant Address: US CA Mountain View
- Assignee: Stratify, Inc.
- Current Assignee: Stratify, Inc.
- Current Assignee Address: US CA Mountain View
- Main IPC: G06F17/20
- IPC: G06F17/20

Abstract:
Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified based on comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores.
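
The second embodiment (segmenting at character-set transitions, then scoring each segment against a subset of candidate languages) can be illustrated with a minimal Python sketch. This is not the patented implementation: the script detection via Unicode character names, the `CANDIDATES` table, the `COMMON_WORDS` lists, and the word-fraction score are all assumptions chosen only to keep the example small. A real system would presumably use statistical language models in place of the toy word lists.

```python
import unicodedata

# Hypothetical mapping: script -> candidate languages scored for that script.
CANDIDATES = {"LATIN": ["en", "fr"], "CYRILLIC": ["ru"]}

# Toy word lists standing in for real per-language models (assumption).
COMMON_WORDS = {
    "en": {"the", "is", "and", "of"},
    "fr": {"le", "est", "et", "de"},
    "ru": {"и", "в", "не", "это"},
}

def script_of(ch):
    """Script of a letter, taken from the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation carry no script
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def segment_by_script(text):
    """Split text into (script, segment) pairs at script transitions."""
    segments = []
    current, start = None, 0
    for i, ch in enumerate(text):
        s = script_of(ch)
        if s is None:
            continue  # neutral characters stay with the current segment
        if current is None:
            current = s
        elif s != current:
            segments.append((current, text[start:i].strip()))
            current, start = s, i
    if current is not None:
        segments.append((current, text[start:].strip()))
    return segments

def score_segment(script, segment):
    """Score a segment only for the candidate languages of its script."""
    words = [w.strip(".,;:!?") for w in segment.lower().split()]
    scores = {}
    for lang in CANDIDATES.get(script, []):
        hits = sum(1 for w in words if w in COMMON_WORDS[lang])
        scores[lang] = hits / len(words) if words else 0.0
    return scores

def identify_languages(text):
    """Return (segment, best_language) pairs; None if no candidate applies."""
    results = []
    for script, segment in segment_by_script(text):
        scores = score_segment(script, segment)
        best = max(scores, key=scores.get) if scores else None
        results.append((segment, best))
    return results

if __name__ == "__main__":
    for seg, lang in identify_languages(
        "The cat is on the mat. Это не кот и не собака."
    ):
        print(lang, "|", seg)
```

Restricting each segment to the candidates for its script means a Cyrillic segment is never scored against English or French, which mirrors the abstract's "subset of candidate languages" for each segment.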
Public/Granted literature
- US20100125447A1 LANGUAGE IDENTIFICATION FOR DOCUMENTS CONTAINING MULTIPLE LANGUAGES (Public/Granted day: 2010-05-20)