Document structure extraction using machine learning

    Publication No.: US11769072B2

    Publication date: 2023-09-26

    Application No.: US15231294

    Application date: 2016-08-08

    Applicant: Adobe Inc.

    Inventor: Michael Kraley

    CPC classification number: G06N20/00 G06F18/24133 G06N5/02 G06V30/414

    Abstract: The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions about how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.
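
As a rough illustration of the idea in this abstract, the sketch below trains a classifier on document part feature vectors and predicts a category for an untagged part. The abstract leaves the algorithm open ("any suitable machine learning algorithm"), so the naive Bayes model with add-one smoothing used here is an assumption, as are the names `train`, `predict`, and `TRAINING` and the toy training data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tagged corpus: (category, feature-value pairs) per document part.
TRAINING = [
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("title", {"font": "Arial", "alignment": "centered", "size": "large"}),
    ("body paragraph", {"font": "Times", "alignment": "left", "size": "normal"}),
    ("body paragraph", {"font": "Times", "alignment": "justified", "size": "normal"}),
]


def train(examples):
    """Count how often each feature-value pair occurs under each category."""
    category_counts = Counter()
    pair_counts = defaultdict(Counter)  # category -> Counter of (feature, value)
    vocabulary = set()
    for category, features in examples:
        category_counts[category] += 1
        for pair in features.items():
            pair_counts[category][pair] += 1
            vocabulary.add(pair)
    return category_counts, pair_counts, vocabulary


def predict(model, features):
    """Return the category with the highest smoothed log-probability."""
    category_counts, pair_counts, vocabulary = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, count in category_counts.items():
        score = math.log(count / total)  # prior for the category
        denominator = sum(pair_counts[category].values()) + len(vocabulary)
        for pair in features.items():
            # Add-one smoothing keeps unseen pairs from zeroing the score.
            score += math.log((pair_counts[category][pair] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Calling `predict(train(TRAINING), {"font": "Arial", "alignment": "centered"})` selects "title" on this toy data; the predicted category is what would be written out as tag metadata.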

    Privacy preserving document analysis

    Publication No.: US11689507B2

    Publication date: 2023-06-27

    Application No.: US16695636

    Application date: 2019-11-26

    Applicant: Adobe Inc.

    CPC classification number: H04L63/04 G06N5/04 G06N20/00 G06Q30/0202

    Abstract: Systems and techniques for privacy preserving document analysis are described that derive insights pertaining to a digital document without communication of the content of the digital document. To do so, the privacy preserving document analysis techniques described herein capture visual or contextual features of the digital document and create a stamp representation that represents these features without including the content of the digital document. The stamp representation is projected into a stamp embedding space based on a stamp encoding model generated through machine learning techniques capturing feature patterns and interactions in the stamp representations. The stamp encoding model exploits these feature interactions to define similarity of source documents based on location within the stamp embedding space. Accordingly, the techniques described herein can determine a similarity of documents without having access to the documents themselves.
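
A minimal sketch of the stamp idea follows, under loud assumptions: the stamp here is a handful of hand-picked layout ratios, and the learned stamp encoding model from the abstract is replaced by plain cosine similarity on those ratios. The function names `make_stamp` and `stamp_similarity` and the chosen features are hypothetical and are not taken from the patent.

```python
import math


def make_stamp(lines):
    """Summarize layout only; the returned numbers carry none of the text."""
    n = len(lines) or 1
    return [
        sum(1 for line in lines if not line.strip()) / n,         # blank-line ratio
        sum(1 for line in lines if line.isupper()) / n,           # all-caps line ratio
        sum(1 for line in lines if line[:1].isdigit()) / n,       # digit-led line ratio
        min(sum(len(line) for line in lines) / n / 80.0, 1.0),    # mean line length
    ]


def stamp_similarity(stamp_a, stamp_b):
    """Cosine similarity, standing in for distance in a learned embedding space."""
    dot = sum(a * b for a, b in zip(stamp_a, stamp_b))
    norm_a = math.sqrt(sum(a * a for a in stamp_a))
    norm_b = math.sqrt(sum(b * b for b in stamp_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two documents with the same layout but different content, and one letter.
invoice = ["INVOICE", "", "1. Widget   10.00", "2. Gadget   12.50"]
receipt = ["RECEIPT", "", "1. Sprock   11.00", "2. Flange   13.75"]
letter = [
    "Dear team,",
    "Thank you for the detailed report you sent last week about the project.",
    "We will review it and respond soon.",
]
```

Only the stamps would need to leave the device in this sketch: the invoice and receipt produce identical stamps despite different text, while the letter's stamp sits further away.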

    Privacy Preserving Document Analysis
    Invention publication

    Publication No.: US20230336532A1

    Publication date: 2023-10-19

    Application No.: US18317338

    Application date: 2023-05-15

    Applicant: Adobe Inc.

    CPC classification number: H04L63/04 G06Q30/0202 G06N5/04 G06N20/00

    Abstract: Systems and techniques for privacy preserving document analysis are described that derive insights pertaining to a digital document without communication of the content of the digital document. To do so, the privacy preserving document analysis techniques described herein capture visual or contextual features of the digital document and create a stamp representation that represents these features without including the content of the digital document. The stamp representation is projected into a stamp embedding space based on a stamp encoding model generated through machine learning techniques capturing feature patterns and interactions in the stamp representations. The stamp encoding model exploits these feature interactions to define similarity of source documents based on location within the stamp embedding space. Accordingly, the techniques described herein can determine a similarity of documents without having access to the documents themselves.

    Privacy Preserving Document Analysis

    Publication No.: US20210160221A1

    Publication date: 2021-05-27

    Application No.: US16695636

    Application date: 2019-11-26

    Applicant: Adobe Inc.

    Abstract: Systems and techniques for privacy preserving document analysis are described that derive insights pertaining to a digital document without communication of the content of the digital document. To do so, the privacy preserving document analysis techniques described herein capture visual or contextual features of the digital document and create a stamp representation that represents these features without including the content of the digital document. The stamp representation is projected into a stamp embedding space based on a stamp encoding model generated through machine learning techniques capturing feature patterns and interactions in the stamp representations. The stamp encoding model exploits these feature interactions to define similarity of source documents based on location within the stamp embedding space. Accordingly, the techniques described herein can determine a similarity of documents without having access to the documents themselves.

    Identification of reading order text segments with a probabilistic language model

    Publication No.: US10372821B2

    Publication date: 2019-08-06

    Application No.: US15462684

    Application date: 2017-03-17

    Applicant: Adobe Inc.

    Abstract: Certain embodiments identify a correct structured reading-order sequence of text segments extracted from a file. A probabilistic language model is generated from a large text corpus to comprise observed word sequence patterns for a given language. The language model measures whether splicing together a first text segment with another continuation text segment results in a phrase that is more likely than a phrase resulting from splicing together the first text segment with other continuation text segments. Sets of text segments, which include a first set with a first text segment and a first continuation text segment as well as a second set with the first text segment and a second continuation text segment, are provided to the probabilistic model. A score indicative of a likelihood of the set providing a correct structured reading-order sequence is obtained for each set of text segments.
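
The scoring step described in this abstract can be sketched with a small n-gram model. The abstract does not say which kind of probabilistic language model is used, so the add-one-smoothed bigram model below is an assumption, and the names `train_bigram_model`, `splice_score`, `best_continuation`, and the toy `CORPUS` are hypothetical.

```python
import math
from collections import Counter

# Toy stand-in for the "large text corpus" mentioned in the abstract.
CORPUS = [
    "the model predicts the reading order of text",
    "the reading order of text segments is scored",
    "figure one shows the page layout",
]


def train_bigram_model(sentences):
    """Collect unigram and bigram counts as the observed word sequence patterns."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def splice_score(model, first_segment, continuation):
    """Average smoothed log-probability of the phrase formed by the splice."""
    unigrams, bigrams = model
    vocab_size = len(unigrams) or 1
    words = (first_segment + " " + continuation).lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return -math.inf
    total = sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in pairs
    )
    return total / len(pairs)  # normalize so longer splices are not penalized


def best_continuation(model, first_segment, candidates):
    """Pick the candidate whose splice with the first segment scores highest."""
    return max(candidates, key=lambda c: splice_score(model, first_segment, c))
```

With this toy corpus, `best_continuation(train_bigram_model(CORPUS), "the reading order of", ["figure one", "text segments"])` prefers "text segments", because the splice yields word pairs that the corpus has actually seen.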

    Privacy preserving document analysis

    Publication No.: US12267305B2

    Publication date: 2025-04-01

    Application No.: US18317338

    Application date: 2023-05-15

    Applicant: Adobe Inc.

    Abstract: Systems and techniques for privacy preserving document analysis are described that derive insights pertaining to a digital document without communication of the content of the digital document. To do so, the privacy preserving document analysis techniques described herein capture visual or contextual features of the digital document and create a stamp representation that represents these features without including the content of the digital document. The stamp representation is projected into a stamp embedding space based on a stamp encoding model generated through machine learning techniques capturing feature patterns and interactions in the stamp representations. The stamp encoding model exploits these feature interactions to define similarity of source documents based on location within the stamp embedding space. Accordingly, the techniques described herein can determine a similarity of documents without having access to the documents themselves.
