Invention Grant
- Patent Title: Document structure extraction using machine learning
- Application No.: US15231294
- Application Date: 2016-08-08
- Publication No.: US11769072B2
- Publication Date: 2023-09-26
- Inventor: Michael Kraley
- Applicant: Adobe Inc.
- Applicant Address: US CA San Jose
- Assignee: Adobe Inc.
- Current Assignee: Adobe Inc.
- Current Assignee Address: US CA San Jose
- Agency: FINCH & MALONEY PLLC
- Main IPC: G06N20/00
- IPC: G06N20/00 ; G06N5/02 ; G06V30/414 ; G06F18/2413

Abstract:
The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions with respect to how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.
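The workflow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract allows any suitable machine learning algorithm, and a small categorical naive Bayes classifier with add-one smoothing is assumed here purely for illustration. The function names (train, predict, tag_document), the "size" feature, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(tagged_parts):
    """Build a model from (category, {feature: value}) pairs.

    Each pair plays the role of one document part feature vector
    extracted from a tagged training document.
    """
    category_counts = Counter()
    pair_counts = defaultdict(Counter)  # category -> counts of (feature, value)
    values_seen = defaultdict(set)      # feature -> distinct values observed
    for category, features in tagged_parts:
        category_counts[category] += 1
        for feature, value in features.items():
            pair_counts[category][(feature, value)] += 1
            values_seen[feature].add(value)
    return category_counts, pair_counts, values_seen


def predict(model, features):
    """Return the most probable category for one untagged document part."""
    category_counts, pair_counts, values_seen = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)  # log prior of the category
        for feature, value in features.items():
            # Add-one (Laplace) smoothing so unseen pairs never score zero.
            n_values = len(values_seen.get(feature, set()) | {value})
            seen = pair_counts[category][(feature, value)]
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def tag_document(model, parts):
    """Generate simple tag metadata: one predicted tag per document part."""
    return [{"part": i, "tag": predict(model, f)} for i, f in enumerate(parts)]


# Hypothetical training corpus of tagged document parts.
training = [
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("body paragraph", {"font": "Times", "alignment": "left", "size": "normal"}),
    ("body paragraph", {"font": "Times", "alignment": "justified", "size": "normal"}),
    ("body paragraph", {"font": "Arial", "alignment": "left", "size": "normal"}),
]
model = train(training)

# Hypothetical untagged document described only by feature-value pairs.
untagged = [
    {"font": "Arial", "alignment": "centered", "size": "large"},
    {"font": "Times", "alignment": "left", "size": "normal"},
]
tags = tag_document(model, untagged)
print(tags)
```

In this toy setup the first part is tagged "title" and the second "body paragraph". A production system would presumably use richer features and a stronger learner; the sketch only shows how tagged feature vectors feed training and how predictions become tag metadata.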
Public/Granted literature
- US20180039907A1 DOCUMENT STRUCTURE EXTRACTION USING MACHINE LEARNING Public/Granted day: 2018-02-08