Invention Grant
US08438009B2 Text categorization based on co-classification learning from multilingual corpora
In force
- Patent Title: Text categorization based on co-classification learning from multilingual corpora
- Application No.: US12909389
- Application Date: 2010-10-21
- Publication No.: US08438009B2
- Publication Date: 2013-05-07
- Inventor: Massih Amini, Cyril Goutte
- Applicant: Massih Amini, Cyril Goutte
- Applicant Address: CA Ottawa, Ontario
- Assignee: National Research Council of Canada
- Current Assignee: National Research Council of Canada
- Current Assignee Address: CA Ottawa, Ontario
- Agency: Benoît & Côté
- Main IPC: G06F17/20
- IPC: G06F17/20; G06F17/27

Abstract:
The present document describes a method and a system for generating classifiers from multilingual corpora including subsets of content-equivalent documents written in different languages. When the documents are translations of each other, their classifications must be substantially the same. Embodiments of the invention utilize this similarity in order to enhance the accuracy of the classification in one language based on the classification results in the other language, and vice versa. A system in accordance with the present embodiments implements a method which comprises generating a first classifier from a first subset of the corpora in a first language; generating a second classifier from a second subset of the corpora in a second language; and re-training each of the classifiers on its respective subset based on the classification results of the other classifier, until a training cost between the classification results produced by subsequent iterations reaches a local minimum.
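
The alternating re-training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented method itself: it assumes binary labels, two logistic-regression classifiers trained by plain gradient descent, and a cross-entropy agreement penalty weighted by a hypothetical parameter `lam`. The function names (`fit`, `co_train`, `total_cost`), the toy feature vectors, and the stopping tolerance are all invented for the example.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # Probability of the positive class; w[0] is the bias term.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def cross_entropy(t, p, eps=1e-12):
    # Cross-entropy of prediction p against a (possibly soft) target t.
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

def fit(X, y, peer, lam, w, steps=200, lr=0.5):
    # Minimizing CE(y, p) + lam * CE(peer, p) is equivalent, up to a constant
    # factor, to minimizing CE(t, p) with the blended target below.
    targets = [(yi + lam * qi) / (1.0 + lam) for yi, qi in zip(y, peer)]
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in zip(X, targets):
            err = predict(w, x) - t
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def total_cost(X1, X2, y, w1, w2, lam):
    # Supervised loss of both classifiers plus an (assumed) disagreement penalty.
    cost = 0.0
    for x1, x2, yi in zip(X1, X2, y):
        p1, p2 = predict(w1, x1), predict(w2, x2)
        cost += cross_entropy(yi, p1) + cross_entropy(yi, p2)
        cost += lam * (cross_entropy(p2, p1) + cross_entropy(p1, p2))
    return cost / len(y)

def co_train(X1, X2, y, lam=0.5, max_rounds=20, tol=1e-4):
    # Step 1: train one classifier per language, independently (lam = 0).
    w1 = fit(X1, y, y, 0.0, [0.0] * (len(X1[0]) + 1))
    w2 = fit(X2, y, y, 0.0, [0.0] * (len(X2[0]) + 1))
    prev = total_cost(X1, X2, y, w1, w2, lam)
    # Step 2: re-train each classifier using the other's outputs,
    # stopping once the cost no longer improves by at least tol.
    for _ in range(max_rounds):
        w1 = fit(X1, y, [predict(w2, x) for x in X2], lam, w1)
        w2 = fit(X2, y, [predict(w1, x) for x in X1], lam, w2)
        cost = total_cost(X1, X2, y, w1, w2, lam)
        if prev - cost < tol:
            break
        prev = cost
    return w1, w2

# Toy usage: row i of X1 and row i of X2 stand for the same document
# in two languages (made-up feature vectors, not real data).
X1 = [[2.0, 0.1], [1.5, 0.3], [1.8, 0.0], [0.1, 1.9], [0.2, 1.5], [0.0, 2.0]]
X2 = [[0.1, 1.8, 0.2], [0.3, 2.0, 0.1], [0.0, 1.6, 0.3],
      [1.9, 0.2, 0.1], [1.7, 0.0, 0.4], [2.0, 0.1, 0.2]]
y = [1, 1, 1, 0, 0, 0]
w1, w2 = co_train(X1, X2, y)
```

The blended target in `fit` is a convenience of this sketch: it lets each re-training step reuse ordinary logistic-regression gradients while still pulling a classifier toward its peer's predictions on the translated documents.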
Public/Granted literature
- US20110098999A1 TEXT CATEGORIZATION BASED ON CO-CLASSIFICATION LEARNING FROM MULTILINGUAL CORPORA (Public/Granted day: 2011-04-28)