Document structure extraction using machine learning

Invention Grant

US11769072B2 Document structure extraction using machine learning (Active)

Patent Title: Document structure extraction using machine learning
Application No.: US15231294

Application Date: 2016-08-08
Publication No.: US11769072B2

Publication Date: 2023-09-26
Inventor: Michael Kraley
Applicant: Adobe Inc.
Applicant Address: US CA San Jose
Assignee: Adobe Inc.
Current Assignee: Adobe Inc.
Current Assignee Address: US CA San Jose
Agency: FINCH & MALONEY PLLC
Main IPC: G06N20/00
IPC: G06N20/00 ; G06N5/02 ; G06V30/414 ; G06F18/2413

Abstract:

The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions with respect to how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.
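
The train-then-predict flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract allows any suitable machine learning algorithm, and a small categorical naive Bayes classifier is used here only as a stand-in. The training data, the "size" feature, and the names TRAINING_PARTS, train, predict, and tag_document are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged corpus: one (category, feature-value pairs) entry
# per document part. Values are illustrative, not taken from the patent.
TRAINING_PARTS = [
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("body paragraph", {"font": "Times", "alignment": "left", "size": "normal"}),
    ("body paragraph", {"font": "Times", "alignment": "justified", "size": "normal"}),
    ("body paragraph", {"font": "Arial", "alignment": "left", "size": "normal"}),
]


def train(parts):
    """Count how often each feature-value pair occurs with each category."""
    category_counts = Counter()
    pair_counts = defaultdict(Counter)  # category -> Counter of (feature, value)
    values_seen = defaultdict(set)      # feature -> set of observed values
    for category, features in parts:
        category_counts[category] += 1
        for feature, value in features.items():
            pair_counts[category][(feature, value)] += 1
            values_seen[feature].add(value)
    return category_counts, pair_counts, values_seen


def predict(model, features):
    """Return the most probable category for one untagged document part."""
    category_counts, pair_counts, values_seen = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)  # log prior of the category
        for feature, value in features.items():
            # Laplace smoothing; the extra +1 reserves mass for unseen values.
            n_values = len(values_seen.get(feature, ())) + 1
            seen = pair_counts[category][(feature, value)]
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def tag_document(model, untagged_parts):
    """Generate tag metadata (one predicted tag per part) for a document."""
    return [{"tag": predict(model, f), "features": f} for f in untagged_parts]


model = train(TRAINING_PARTS)
untagged = [
    {"font": "Arial", "alignment": "centered", "size": "large"},
    {"font": "Times", "alignment": "left", "size": "normal"},
]
tags = [entry["tag"] for entry in tag_document(model, untagged)]
print(tags)
```

In this sketch each training entry plays the role of a document part feature vector, and the list returned by tag_document stands in for the generated tag metadata. A real system would extract the feature-value pairs from document layout rather than hard-code them.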

Public/Granted literature

US20180039907A1 Document structure extraction using machine learning (Public/Granted date: 2018-02-08)

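IPC classification:

G	Physics
G06	Computing; calculating or counting
G06N	Computing arrangements based on specific computational models
G06N20/00	Machine learning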