Generation of matched corpus for language model training

Invention Grant

US11276391B2 Generation of matched corpus for language model training (Active)

Patent Title: Generation of matched corpus for language model training
Application No.: US16783402

Application Date: 2020-02-06
Publication No.: US11276391B2

Publication Date: 2022-03-15
Inventor: Nobuyasu Itoh, Gakuto Kurata, Masayuki Suzuki
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Applicant Address: US NY Armonk
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee Address: US NY Armonk
Agency: Tutunjian & Bitetto, P.C.
Agent: Randall Bluestone
Main IPC: G10L15/183
IPC: G10L15/183; G10L15/06

Abstract:

A computer-implemented method for generating a text is disclosed. The method includes obtaining a first text collection matched with a target domain and a second text collection including a plurality of samples, each of which describes rewriting between a first text and a second text that has a style different from the first text. The method also includes training a text generation model with the first text collection and the second text collection, in which the text generation model has, in a vocabulary, one or more operation tokens indicating rewriting. The method further includes outputting a plurality of texts obtained from the text generation model.
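One way to picture the "operation tokens indicating rewriting" is to encode each rewrite sample as a single token sequence in which special tokens mark what changed between the first text and the second text, and then pool those sequences with the target-domain texts as training data. The following is a minimal sketch under stated assumptions, not the patented implementation: the token names (`<del>`, `<ins>`, `<sub>`, `<to>`, `</sub>`), the word-level alignment via `difflib`, and the helper names `encode_rewrite`, `resolve`, and `build_training_corpus` are all hypothetical and do not come from the abstract.

```python
from difflib import SequenceMatcher

# Hypothetical operation tokens. The abstract only states that the vocabulary
# contains "one or more operation tokens indicating rewriting".
OP_DEL, OP_INS, OP_SUB, OP_TO, OP_END = "<del>", "<ins>", "<sub>", "<to>", "</sub>"
OPERATION_TOKENS = [OP_DEL, OP_INS, OP_SUB, OP_TO, OP_END]


def encode_rewrite(first_text, second_text):
    """Encode a (first text, second text) pair as one sequence with operation tokens."""
    src, tgt = first_text.split(), second_text.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(src[i1:i2])
        elif tag == "delete":
            for word in src[i1:i2]:
                out.extend([OP_DEL, word])
        elif tag == "insert":
            for word in tgt[j1:j2]:
                out.extend([OP_INS, word])
        else:  # "replace": source span rewritten as target span
            out.extend([OP_SUB, *src[i1:i2], OP_TO, *tgt[j1:j2], OP_END])
    return out


def resolve(tokens):
    """Apply the operations in an encoded sequence to recover the second-style text."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == OP_DEL:
            i += 2  # skip the marker and the deleted word
        elif tok == OP_INS:
            out.append(tokens[i + 1])
            i += 2
        elif tok == OP_SUB:
            to_pos = tokens.index(OP_TO, i)
            end_pos = tokens.index(OP_END, to_pos)
            out.extend(tokens[to_pos + 1:end_pos])
            i = end_pos + 1
        else:
            out.append(tok)
            i += 1
    return out


def build_training_corpus(domain_texts, rewrite_pairs):
    """Pool target-domain texts and encoded rewrite samples; build a shared vocabulary."""
    sequences = [text.split() for text in domain_texts]
    sequences += [encode_rewrite(a, b) for a, b in rewrite_pairs]
    vocab = sorted({tok for seq in sequences for tok in seq} | set(OPERATION_TOKENS))
    return sequences, vocab


# Illustrative data only: a written-style sentence and a spoken-style rewrite.
pair = ("i want to check my balance", "um i wanna check my balance")
encoded = encode_rewrite(*pair)
sequences, vocab = build_training_corpus(["please transfer funds to savings"], [pair])
```

In this sketch a text generation model trained on `sequences` could emit operation tokens alongside ordinary words, and a step like `resolve` would turn a generated sequence into plain text in the second style. Whether the patented method resolves operations this way is an assumption.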

Public/Granted literature

US20210248996A1 GENERATION OF MATCHED CORPUS FOR LANGUAGE MODEL TRAINING Publication date: 2021-08-12

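IPC Classification:

G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L15/00	Speech recognition (G10L17/00 takes precedence)
G10L15/08	.Speech classification or search
G10L15/18	..using natural language modelling
G10L15/183	...using context dependencies, e.g. language models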