Invention Grant
- Patent Title: Language modeling based on spoken and unspeakable corpuses
- Application No.: US14711447
- Application Date: 2015-05-13
- Publication No.: US09761220B2
- Publication Date: 2017-09-12
- Inventor: Michael Levit, Shuangyu Chang, Benoit Dumoulin
- Applicant: Microsoft Technology Licensing, LLC
- Applicant Address: US WA Redmond
- Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
- Current Assignee: MICROSOFT TECHNOLOGY LICENSING, LLC
- Current Assignee Address: US WA Redmond
- Main IPC: G10L15/06
- IPC: G10L15/06; G10L15/10; G10L15/14; G10L15/18; G10L15/19

Abstract:
A computer system for language modeling may collect training data from one or more information sources, generate a spoken corpus containing text of transcribed speech, and generate a typed corpus containing typed text. The computer system may derive feature vectors from the spoken corpus, analyze the typed corpus to determine feature vectors representing items of typed text, and generate an unspeakable corpus by filtering the typed corpus to remove each item of typed text represented by a feature vector that is within a similarity threshold of a feature vector derived from the spoken corpus. The computer system may derive feature vectors from the unspeakable corpus and train a classifier to perform discriminative data selection for language modeling based on the feature vectors derived from the spoken corpus and the feature vectors derived from the unspeakable corpus.
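The pipeline in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not something the abstract specifies: the bag-of-words `featurize` function, cosine similarity as the similarity measure, the `threshold=0.5` value, the perceptron used as the classifier, and the toy example strings. An item of typed text counts as "within the similarity threshold" when its cosine similarity to any spoken feature vector is at or above the threshold.

```python
import math
from collections import Counter


def featurize(text):
    """Hypothetical feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_unspeakable(typed, spoken, threshold=0.5):
    """Keep only typed items that are NOT similar to any spoken item."""
    spoken_vecs = [featurize(s) for s in spoken]
    kept = []
    for item in typed:
        vec = featurize(item)
        if all(cosine(vec, sv) < threshold for sv in spoken_vecs):
            kept.append(item)
    return kept


def train_perceptron(spoken, unspeakable, epochs=10):
    """Assumed classifier: a perceptron with label +1 (spoken), -1 (unspeakable)."""
    data = [(featurize(s), 1) for s in spoken]
    data += [(featurize(u), -1) for u in unspeakable]
    weights = Counter()
    for _ in range(epochs):
        for vec, label in data:
            score = sum(weights[t] * c for t, c in vec.items())
            if label * score <= 0:  # misclassified or undecided: update
                for t, c in vec.items():
                    weights[t] += label * c
    return weights


def looks_spoken(weights, text):
    """Select text for language modeling if the classifier scores it as spoken."""
    return sum(weights[t] * c for t, c in featurize(text).items()) > 0


# Toy corpora (invented for this example).
spoken = ["call mom on her cell phone", "what is the weather tomorrow"]
typed = [
    "call mom on her cell",
    "www.example.com/login?id=42",
    "weather tomorrow",
    "cheap flights nyc lax 2015",
]

unspeakable = build_unspeakable(typed, spoken)
weights = train_perceptron(spoken, unspeakable)
```

In this sketch the two typed items that closely resemble transcribed speech are filtered out, and the remaining items form the unspeakable corpus that supplies negative examples for the classifier. A real system would likely use richer features and a stronger classifier; the structure of the steps is the point here.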
Public/Granted literature
- US20160336006A1 DISCRIMINATIVE DATA SELECTION FOR LANGUAGE MODELING Public/Granted day: 2016-11-17