Universal language segment representations learning with conditional masked language model
Abstract:
The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that the learned multilingual representations carry a language bias, and that principal component removal (PCR) can eliminate this bias by separating language identity information from semantics.
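The PCR step named in the abstract amounts to projecting out the top principal components of the embedding matrix, which tend to encode language identity rather than semantics. Below is a minimal sketch of that general idea in Python; the function name, the number of removed components, and the stand-in embedding matrix are illustrative assumptions, not values or APIs taken from the disclosure.

```python
import numpy as np

def principal_component_removal(embeddings: np.ndarray, n_components: int = 1) -> np.ndarray:
    """Remove the top principal components from sentence embeddings.

    In multilingual sentence embeddings the leading components often
    capture language identity; removing them leaves a more
    language-agnostic semantic representation.
    """
    # Center the embeddings so the principal directions are well defined.
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean

    # Top principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                # shape: (n_components, dim)

    # Project every embedding onto the top components and subtract.
    return centered - centered @ top.T @ top

# Illustrative usage with random stand-in embeddings (hypothetical data).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 768))     # e.g. 1000 sentences, 768-dim
    cleaned = principal_component_removal(emb, n_components=1)
    print(cleaned.shape)                   # (1000, 768)
```

How many components to remove is a tuning choice; the sketch defaults to one, on the assumption that a single dominant direction carries most of the language-identity signal.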