Self-supervised document-to-document similarity system

Invention Grant

US11580764B2 Self-supervised document-to-document similarity system (Active)

Patent Title: Self-supervised document-to-document similarity system
Application No.: US17354333

Application Date: 2021-06-22
Publication No.: US11580764B2

Publication Date: 2023-02-14
Inventor: Itzik Malkiel, Dvir Ginzburg, Noam Koenigstein, Oren Barkan, Nir Nice
Applicant: Microsoft Technology Licensing, LLC
Applicant Address: US WA Redmond
Assignee: Microsoft Technology Licensing, LLC
Current Assignee: Microsoft Technology Licensing, LLC
Current Assignee Address: US WA Redmond
Main IPC: G06V30/418
IPC: G06V30/418 ; G06K9/62 ; G06V10/75

Abstract:

Examples provide a self-supervised language model for document-to-document similarity scoring and for ranking long documents of arbitrary length in the absence of similarity labels. In the first stage of a two-stage hierarchical scoring, a sentence similarity matrix is created for each paragraph in the candidate document. A sentence similarity score is calculated based on the sentence similarity matrix. In the second stage, a paragraph similarity matrix is constructed based on aggregated sentence similarity scores associated with the first candidate document. A total similarity score is calculated for each candidate document in a collection of documents based on its normalized paragraph similarity matrix. The model is trained using a masked language model and intra- and inter-document sampling. The documents are ranked based on their similarity scores.
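
The two-stage scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes sentence embeddings are already available as plain lists of floats and that every document and paragraph is non-empty. The function names, the use of cosine similarity, the mean-of-row-maxima aggregation, and the per-source-paragraph z-score normalization across the candidate collection are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed sentence embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def paragraph_pair_score(src_par, cand_par):
    """Stage 1: sentence similarity matrix for one paragraph pair, reduced to one score.

    Assumed aggregation: average, over source sentences, of the best-matching
    candidate sentence.
    """
    matrix = [[cosine(s, c) for c in cand_par] for s in src_par]
    return sum(max(row) for row in matrix) / len(matrix)


def paragraph_matrix(src_doc, cand_doc):
    """Stage 2 input: paragraph similarity matrix (source paragraphs x candidate paragraphs)."""
    return [[paragraph_pair_score(sp, cp) for cp in cand_doc] for sp in src_doc]


def rank_candidates(src_doc, candidates):
    """Score each candidate against the source and return (name, score) pairs, best first.

    Assumed normalization: each source-paragraph row is z-scored using all entries
    of that row across every candidate in the collection.
    """
    matrices = {name: paragraph_matrix(src_doc, doc) for name, doc in candidates.items()}
    totals = {name: 0.0 for name in candidates}
    for i in range(len(src_doc)):
        row_values = [v for m in matrices.values() for v in m[i]]
        mean = sum(row_values) / len(row_values)
        var = sum((v - mean) ** 2 for v in row_values) / len(row_values)
        std = math.sqrt(var) or 1.0  # avoid division by zero when all values are equal
        for name, m in matrices.items():
            totals[name] += (max(m[i]) - mean) / std
    n = len(src_doc)
    return sorted(((name, t / n) for name, t in totals.items()),
                  key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings": a document is a list of paragraphs,
    # a paragraph is a list of sentence vectors.
    source = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
    collection = {
        "near": [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]],
        "far": [[[-1.0, 0.0]], [[0.0, -1.0]]],
    }
    for name, score in rank_candidates(source, collection):
        print(name, round(score, 3))
```

Because the candidates may have different numbers of paragraphs, the sketch normalizes per source paragraph rather than per matrix cell; other normalizations would fit the abstract's wording equally well.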

Public/Granted literature

US20220405504A1 SELF-SUPERVISED DOCUMENT-TO-DOCUMENT SIMILARITY SYSTEM Public/Granted day: 2022-12-22

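IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06V	Image or video recognition or understanding
G06V30/00	Character recognition; recognising digital ink; document-oriented image-based pattern recognition (scanning, transmission or reproduction of documents or the like H04N1/00)
G06V30/40	. Document-oriented image-based pattern recognition
G06V30/41	.. Analysis of document content (recognition of printed characters having code marks G06V30/224)
G06V30/418	... Document matching, e.g. of document images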