Invention Grant
- Patent Title: Universal language segment representations learning with conditional masked language model
- Application No.: US17127734
- Application Date: 2020-12-18
- Publication No.: US11769011B2
- Publication Date: 2023-09-26
- Inventor: Yinfei Yang, Ziyi Yang, Daniel Matthew Cer
- Applicant: Google LLC
- Applicant Address: US CA Mountain View
- Assignee: GOOGLE LLC
- Current Assignee: GOOGLE LLC
- Current Assignee Address: US CA Mountain View
- Agency: Dority & Manning, P.A.
- Main IPC: G06F40/284
- IPC: G06F40/284 ; G06N3/04 ; G06N20/00

Abstract:
The present disclosure provides a novel sentence-level representation learning method, Conditional Masked Language Modeling (CMLM), for training on large-scale unlabeled corpora. CMLM outperforms the previous state-of-the-art English sentence embedding models, including those trained with (semi-)supervised signals. For multilingual representation learning, it is shown that co-training CMLM with bitext retrieval and cross-lingual natural language inference (NLI) fine-tuning achieves state-of-the-art performance. It is also shown that multilingual representations exhibit a same-language bias, and that principal component removal (PCR) can eliminate this bias by separating language identity information from semantics.
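The principal component removal (PCR) step named in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so the details here are assumptions: the function name `remove_principal_components`, the parameter `num_components`, and the choice to estimate the components from a single batch of embeddings (for example, all sentences of one language) via SVD are all hypothetical.

```python
import numpy as np


def remove_principal_components(embeddings, num_components=1):
    """Project out the top principal components of a set of embeddings.

    Assumption (not from the source): `embeddings` has shape
    (n_sentences, dim) and holds sentence vectors for one language, so the
    dominant directions are taken to carry language identity.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    # Center the data so the SVD yields principal directions.
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are unit-norm principal directions, by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:num_components]  # shape: (num_components, dim)
    # Subtract each embedding's projection onto the top directions.
    return x - (x @ top.T) @ top


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(50, 8))
    vectors[:, 0] += 10.0 * rng.normal(size=50)  # one dominant direction
    cleaned = remove_principal_components(vectors, num_components=1)
    print(cleaned.shape)
```

After the call, every returned vector is orthogonal to the removed directions. How many components to remove, and whether to fit them per language or over the whole corpus, are choices this sketch leaves open.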
Public/Granted literature
- US20220198144A1: Universal Language Segment Representations Learning with Conditional Masked Language Model. Public/Granted day: 2022-06-23