Invention Grant
US08224641B2 Language identification for documents containing multiple languages (Active)

  • Patent Title: Language identification for documents containing multiple languages
  • Application No.: US12274182
    Application Date: 2008-11-19
  • Publication No.: US08224641B2
    Publication Date: 2012-07-17
  • Inventor: Sauraj Goswami
  • Applicant: Sauraj Goswami
  • Applicant Address: US CA Mountain View
  • Assignee: Stratify, Inc.
  • Current Assignee: Stratify, Inc.
  • Current Assignee Address: US CA Mountain View
  • Main IPC: G06F17/20
  • IPC: G06F17/20
Abstract:
Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified based on comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores.
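The second embodiment (segmenting at character-set transitions, then scoring each segment against a subset of candidate languages) can be illustrated with a minimal sketch. This is not the patented implementation. It makes several assumptions that do not come from the abstract: the first word of each Unicode character name stands in for the "character set", the CANDIDATES table and the COMMON_WORDS lists are hypothetical toy data, and the common-word count is a placeholder for whatever scoring model a real system would use.

```python
import unicodedata

# Hypothetical mapping from character set to candidate languages (toy data).
CANDIDATES = {
    "LATIN": ["en", "fr"],
    "CYRILLIC": ["ru"],
    "GREEK": ["el"],
}

# Toy stand-in for a per-language model: a few common words per language.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of"},
    "fr": {"le", "la", "et", "est"},
    "ru": {"и", "в", "не", "это"},
    "el": {"και", "το", "είναι"},
}


def script_of(ch):
    """Rough character-set label: first word of the Unicode name (assumption)."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation carry no script information
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def segment_by_script(text):
    """Split text wherever the script of consecutive letters changes."""
    segments = []  # list of (script, segment_text)
    current, buf = None, []
    for ch in text:
        s = script_of(ch)
        if s is not None and current is not None and s != current:
            segments.append((current, "".join(buf)))
            buf = []
        if s is not None:
            current = s
        buf.append(ch)  # non-letters stay with the segment in progress
    if buf and current is not None:
        segments.append((current, "".join(buf)))
    return segments


def score_segment(script, segment_text):
    """Score a segment only against the languages that use its script."""
    words = [w.strip(".,;:!?") for w in segment_text.lower().split()]
    return {
        lang: sum(w in COMMON_WORDS[lang] for w in words)
        for lang in CANDIDATES.get(script, [])
    }


def identify_languages(text):
    """Return the best-scoring language of each segment, in order, no repeats."""
    found = []
    for script, seg in segment_by_script(text):
        scores = score_segment(script, seg)
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] > 0 and best not in found:
                found.append(best)
    return found
```

Restricting each segment to the candidates for its own script keeps the comparison small: a Cyrillic segment is never scored against English or French. Languages that share a script (for example English and French) are not separated by this segmentation step and would need the pairwise hypothesis comparison described in the first embodiment.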