- Patent Title: Template-based structured document classification and extraction
- Application No.: US15360939
- Application Date: 2016-11-23
- Publication No.: US10657158B2
- Publication Date: 2020-05-19
- Inventor: Ying Sheng, Yifeng Lu, Jing Xie, Jie Yang, Luis Garcia Pueyo, Jinan Lou, James Wendt
- Applicant: Google Inc.
- Applicant Address: US CA Mountain View
- Assignee: GOOGLE LLC
- Current Assignee: GOOGLE LLC
- Current Assignee Address: US CA Mountain View
- Agency: Middleton Reutlinger
- Main IPC: G06F16/00
- IPC: G06F16/00; G06F16/28; G06N20/00; G06F16/93; G06Q10/10; G06N20/20; G06F40/174; G06F40/186

Abstract:
Techniques are described herein for automatically generating data extraction templates for structured documents (e.g., B2C emails, invoices, bills, invitations, etc.), and for assigning classifications to those data extraction templates to streamline data extraction from subsequent structured documents. In various implementations, a data extraction template generated from a cluster of structured documents that share fixed content may be identified. Features of the cluster of structured documents may be applied as input to extraction machine learning model(s) trained to provide location(s) of transient field(s) in structured documents, to determine location(s) of transient field(s) in the cluster of structured documents. An association between the data extraction template and the determined transient field location(s) may be stored. Based on the association, data point(s) may be extracted from a given structured document of a user that shares fixed content with the cluster of structured documents. The extracted data point(s) may be surfaced to the user.
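
The flow in the abstract (cluster documents that share fixed content, locate transient fields, store the association, then extract from a later document) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: documents are modeled as lists of `(path, text)` pairs, the template key is simply the tuple of paths, and the function names are hypothetical. In particular, `locate_transient_fields` is a simple "text varies across the cluster" heuristic standing in for the trained extraction machine learning model(s) the abstract describes.

```python
from collections import defaultdict


def template_key(doc):
    """Hypothetical template signature: the ordered paths of a document."""
    return tuple(path for path, _ in doc)


def cluster_by_template(docs):
    """Group documents that share the same signature into clusters."""
    clusters = defaultdict(list)
    for doc in docs:
        clusters[template_key(doc)].append(doc)
    return dict(clusters)


def locate_transient_fields(cluster):
    """Stand-in for the trained extraction model (an assumption).

    A location is treated as transient when its text differs between
    documents in the cluster; text that never changes is fixed content.
    """
    paths = [path for path, _ in cluster[0]]
    transient = []
    for index, path in enumerate(paths):
        values = {doc[index][1] for doc in cluster}
        if len(values) > 1:
            transient.append(path)
    return transient


def build_associations(docs):
    """Store template -> transient field locations for every cluster."""
    return {
        key: locate_transient_fields(cluster)
        for key, cluster in cluster_by_template(docs).items()
    }


def extract(doc, associations):
    """Extract data points from a document that matches a known template."""
    locations = associations.get(template_key(doc))
    if locations is None:
        return {}  # no matching template, nothing to extract
    wanted = set(locations)
    return {path: text for path, text in doc if path in wanted}


# Two shipping notices that share fixed content but differ in two fields.
corpus = [
    [("/h1", "Your order has shipped"), ("/p[1]", "Order #1001"), ("/p[2]", "Arrives Monday")],
    [("/h1", "Your order has shipped"), ("/p[1]", "Order #1002"), ("/p[2]", "Arrives Friday")],
]
associations = build_associations(corpus)

new_doc = [("/h1", "Your order has shipped"), ("/p[1]", "Order #1003"), ("/p[2]", "Arrives Tuesday")]
extracted = extract(new_doc, associations)
print(extracted)
```

Because the stored association is keyed by template rather than by document, the (potentially expensive) field-location step runs once per cluster, and later documents only need a lookup.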
Public/Granted literature
- US20180144042A1 TEMPLATE-BASED STRUCTURED DOCUMENT CLASSIFICATION AND EXTRACTION Public/Granted day: 2018-05-24