Invention Grant
- Patent Title: Self-supervised document-to-document similarity system
- Application No.: US17354333
- Application Date: 2021-06-22
- Publication No.: US11580764B2
- Publication Date: 2023-02-14
- Inventor: Itzik Malkiel, Dvir Ginzburg, Noam Koenigstein, Oren Barkan, Nir Nice
- Applicant: Microsoft Technology Licensing, LLC
- Applicant Address: US WA Redmond
- Assignee: Microsoft Technology Licensing, LLC
- Current Assignee: Microsoft Technology Licensing, LLC
- Current Assignee Address: US WA Redmond
- Main IPC: G06V30/418
- IPC: G06V30/418; G06K9/62; G06V10/75

Abstract:
Examples provide a self-supervised language model for document-to-document similarity scoring and for ranking long documents of arbitrary length in the absence of similarity labels. In the first stage of a two-stage hierarchical scoring, a sentence similarity matrix is created for each paragraph in a candidate document, and a sentence similarity score is calculated from that matrix. In the second stage, a paragraph similarity matrix is constructed from the aggregated sentence similarity scores associated with the candidate document. A total similarity score is calculated from the normalized paragraph similarity matrix for each candidate document in a collection of documents. The model is trained using a masked language model and intra- and inter-document sampling. The candidate documents are ranked based on their total similarity scores.
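
The two-stage scoring in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes each document is already a list of paragraphs, each paragraph a non-empty list of sentence embedding vectors produced by some language model. The cosine similarity, the max-then-mean aggregation, the z-score normalization across all candidates, and every function name here are illustrative assumptions; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sentence embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def paragraph_score(src_par, cand_par):
    """Stage 1: build a sentence similarity matrix for one paragraph pair
    and aggregate it into a single score (assumed: mean of row maxima)."""
    matrix = [[cosine(s, c) for c in cand_par] for s in src_par]
    return sum(max(row) for row in matrix) / len(matrix)


def paragraph_matrix(src_doc, cand_doc):
    """Stage 2 input: paragraph similarity matrix built from the
    aggregated sentence similarity scores."""
    return [[paragraph_score(p, q) for q in cand_doc] for p in src_doc]


def rank_documents(source, candidates):
    """Score every candidate against the source and rank them.

    `candidates` maps a document name to its list of paragraphs.
    Normalization (assumed): z-score over all paragraph-matrix entries
    of all candidates, so scores are comparable across documents.
    """
    matrices = {name: paragraph_matrix(source, doc)
                for name, doc in candidates.items()}
    entries = [x for m in matrices.values() for row in m for x in row]
    mean = sum(entries) / len(entries)
    std = math.sqrt(sum((x - mean) ** 2 for x in entries) / len(entries))
    std = std or 1.0  # avoid division by zero when all entries are equal

    scores = {}
    for name, m in matrices.items():
        normalized = [[(x - mean) / std for x in row] for row in m]
        # Total similarity score (assumed): mean of row maxima.
        scores[name] = sum(max(row) for row in normalized) / len(normalized)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_documents(source, {"a": doc_a, "b": doc_b})` returns `(name, score)` pairs sorted from most to least similar. Because no similarity labels are involved at scoring time, only the embeddings need to come from the self-supervised model.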
Public/Granted literature
- US20220405504A1 SELF-SUPERVISED DOCUMENT-TO-DOCUMENT SIMILARITY SYSTEM Public/Granted day: 2022-12-22