CLUSTERING AND DYNAMIC RE-CLUSTERING OF SIMILAR TEXTUAL DOCUMENTS

    Publication number: US20230153342A1

    Publication date: 2023-05-18

    Application number: US18155553

    Filing date: 2023-01-17

    CPC classification number: G06F16/353 G06F40/30 G06N5/04 G06N20/00

    Abstract: A computer-implemented method includes obtaining a plurality of textual records divided into clusters and a residual set of the textual records, where a machine learning (ML) clustering model has divided the plurality of textual records into the clusters based on a similarity metric. The method also includes receiving, from a client device, a particular textual record representing a query and determining, by way of the ML clustering model and based on the similarity metric, that the particular textual record does not fit into any of the clusters. The method additionally includes, in response to determining that the particular textual record does not fit into any of the clusters, adding the particular textual record to the residual set of the textual records. The method can additionally include identifying, by way of the ML clustering model, that the residual set of the textual records contains a further cluster.
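The flow described in this abstract (try to place a query in an existing cluster, fall back to a residual set, then look for a new cluster inside the residual set) can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the bag-of-words vectors, cosine similarity, the `threshold` and `min_size` parameters, and all function names are assumptions introduced here, since the abstract does not specify the ML clustering model or the similarity metric.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Toy stand-in for a learned text representation: word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid(texts):
    """Summed word counts of a cluster (direction only; cosine ignores scale)."""
    total = Counter()
    for t in texts:
        total.update(vectorize(t))
    return total


def assign(record, clusters, residual, threshold=0.5):
    """Add record to the best-fitting cluster, or to the residual set.

    Returns the cluster index, or None if the record fit no cluster.
    """
    vec = vectorize(record)
    best, best_sim = None, 0.0
    for i, members in enumerate(clusters):
        sim = cosine(vec, centroid(members))
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= threshold:
        clusters[best].append(record)
        return best
    residual.append(record)
    return None


def recluster_residual(clusters, residual, threshold=0.5, min_size=3):
    """Promote a group of mutually similar residual records to a new cluster.

    Hypothetical rule: a seed record plus every residual record within
    `threshold` of it forms a cluster once the group reaches `min_size`.
    Returns the new cluster index, or None if no group is large enough.
    """
    for seed in residual:
        seed_vec = vectorize(seed)
        group = [r for r in residual if cosine(seed_vec, vectorize(r)) >= threshold]
        if len(group) >= min_size:
            clusters.append(group)
            residual[:] = [r for r in residual if r not in group]
            return len(clusters) - 1
    return None
```

In this sketch, calling `assign` for each incoming query and `recluster_residual` periodically mirrors the two stages in the abstract; a production system would presumably replace the word counts with the model's own representation.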

    CLUSTERING AND DYNAMIC RE-CLUSTERING OF SIMILAR TEXTUAL DOCUMENTS

    Publication number: US20200349183A1

    Publication date: 2020-11-05

    Application number: US16434888

    Filing date: 2019-06-07

    Abstract: A computer-implemented method includes obtaining a plurality of textual records divided into clusters and a residual set of the textual records, where a machine learning (ML) clustering model has divided the plurality of textual records into the clusters based on a similarity metric. The method also includes receiving, from a client device, a particular textual record representing a query and determining, by way of the ML clustering model and based on the similarity metric, that the particular textual record does not fit into any of the clusters. The method additionally includes, in response to determining that the particular textual record does not fit into any of the clusters, adding the particular textual record to the residual set of the textual records. The method can additionally include identifying, by way of the ML clustering model, that the residual set of the textual records contains a further cluster.

    Data structures for efficient storage and updating of paragraph vectors

    Publication number: US12141182B2

    Publication date: 2024-11-12

    Application number: US17885296

    Filing date: 2022-08-10

    Abstract: Systems and methods involving data structures for efficient management of paragraph vectors for textual searching are described. A database may contain records, each associated with an identifier and including a text string and timestamp. A look-up table may contain entries for text strings from the records, each entry associating: a paragraph vector for a respective unique text string, a hash of the respective unique text string, and a set of identifiers of records containing the respective unique text string. A server may receive from a client device an input string, compute a hash of the input string, and determine matching table entries, each containing a hash identical to that of the input string, or a paragraph vector similar to one calculated for the input string. A prioritized list of identifiers from the matching entries may be determined based on timestamps, and the prioritized list may be returned to the client.
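The look-up table in this abstract (one entry per unique text string, holding a paragraph vector, a hash, and the set of record identifiers) might look something like the sketch below. It is an assumption-laden illustration: `toy_vector` is a word-count stand-in for a real paragraph vector model, SHA-256 is an arbitrary choice of hash, the `threshold` value is hypothetical, and ordering results newest-first is one possible reading of "prioritized based on timestamps".

```python
import hashlib
from collections import Counter
from math import sqrt


def text_hash(text):
    """Hash of a text string (SHA-256 chosen here for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_vector(text):
    """Stand-in for a paragraph vector: word counts, not a trained embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_table(records):
    """Build the look-up table from records of the form {id: (text, timestamp)}.

    Each unique text string gets one entry, so its vector is computed once
    no matter how many records contain that string.
    """
    table = {}
    for rid, (text, _timestamp) in records.items():
        h = text_hash(text)
        if h not in table:
            table[h] = {"hash": h, "vector": toy_vector(text), "ids": set()}
        table[h]["ids"].add(rid)
    return table


def search(query, table, records, threshold=0.8):
    """Return record ids whose entry matches by hash or by vector similarity.

    Results are ordered newest-first by record timestamp (an assumed policy).
    """
    ids = set()
    exact = table.get(text_hash(query))  # constant-time exact match
    if exact is not None:
        ids |= exact["ids"]
    qvec = toy_vector(query)
    for entry in table.values():  # linear scan for similar vectors
        if cosine(qvec, entry["vector"]) >= threshold:
            ids |= entry["ids"]
    return sorted(ids, key=lambda rid: records[rid][1], reverse=True)
```

Keying the table by hash means an exact repeat of a stored string is found without computing or comparing any vectors; only the similarity path needs the scan.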

    Persistent word vector input to multiple machine learning models

    Publication number: US11238230B2

    Publication date: 2022-02-01

    Application number: US16135822

    Filing date: 2018-09-19

    Abstract: Word vectors are multi-dimensional vectors that represent words in a corpus of text and that are embedded in a semantically-encoded vector space. Word vectors can be used for sentiment analysis, comparison of the topic or content of sentences, paragraphs, or other passages of text or other natural language processing tasks. However, the generation of word vectors can be computationally expensive. Accordingly, when a set of word vectors is needed for a particular corpus of text, a set of word vectors previously generated from a corpus of text that is sufficiently similar to the particular corpus of text, with respect to some criteria, may be re-used for the particular corpus of text. Such similarity could include the two corpora of text containing the same or similar sets of words or containing incident reports or other time-coded sets of text from overlapping or otherwise similar periods of time.
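The re-use idea in this abstract can be illustrated with a small cache that returns previously generated word vectors when a new corpus is "sufficiently similar" to one already seen. The sketch below assumes vocabulary overlap (Jaccard similarity) as the criterion and a hypothetical `threshold` of 0.6; the abstract also mentions time-period overlap, which is not modeled here. `WordVectorStore` and `train_fn` are illustrative names, and `train_fn` stands for whatever expensive training routine would actually produce the vectors.

```python
def vocabulary(corpus):
    """Set of distinct lowercase words across all documents in the corpus."""
    return {word for doc in corpus for word in doc.lower().split()}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class WordVectorStore:
    """Caches word vectors per corpus vocabulary and re-uses them when possible."""

    def __init__(self, train_fn, threshold=0.6):
        self.train_fn = train_fn    # expensive: corpus -> {word: vector}
        self.threshold = threshold  # assumed similarity cut-off
        self.entries = []           # list of (vocabulary, vectors) pairs
        self.train_calls = 0        # how many times training actually ran

    def get_vectors(self, corpus):
        """Return cached vectors for a similar corpus, else train new ones."""
        vocab = vocabulary(corpus)
        for known_vocab, vectors in self.entries:
            if jaccard(vocab, known_vocab) >= self.threshold:
                return vectors
        vectors = self.train_fn(corpus)
        self.train_calls += 1
        self.entries.append((vocab, vectors))
        return vectors
```

Several downstream models can then share one store, so that each of them receives the same persisted vectors instead of triggering its own training run.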

    Clustering and dynamic re-clustering of similar textual documents

    Publication number: US11586659B2

    Publication date: 2023-02-21

    Application number: US16434888

    Filing date: 2019-06-07

    Abstract: A computer-implemented method includes obtaining a plurality of textual records divided into clusters and a residual set of the textual records, where a machine learning (ML) clustering model has divided the plurality of textual records into the clusters based on a similarity metric. The method also includes receiving, from a client device, a particular textual record representing a query and determining, by way of the ML clustering model and based on the similarity metric, that the particular textual record does not fit into any of the clusters. The method additionally includes, in response to determining that the particular textual record does not fit into any of the clusters, adding the particular textual record to the residual set of the textual records. The method can additionally include identifying, by way of the ML clustering model, that the residual set of the textual records contains a further cluster.

    DATA STRUCTURES FOR EFFICIENT STORAGE AND UPDATING OF PARAGRAPH VECTORS

    Publication number: US20220382792A1

    Publication date: 2022-12-01

    Application number: US17885296

    Filing date: 2022-08-10

    Abstract: Systems and methods involving data structures for efficient management of paragraph vectors for textual searching are described. A database may contain records, each associated with an identifier and including a text string and timestamp. A look-up table may contain entries for text strings from the records, each entry associating: a paragraph vector for a respective unique text string, a hash of the respective unique text string, and a set of identifiers of records containing the respective unique text string. A server may receive from a client device an input string, compute a hash of the input string, and determine matching table entries, each containing a hash identical to that of the input string, or a paragraph vector similar to one calculated for the input string. A prioritized list of identifiers from the matching entries may be determined based on timestamps, and the prioritized list may be returned to the client.

    PERSISTENT WORD VECTOR INPUT TO MULTIPLE MACHINE LEARNING MODELS

    Publication number: US20200089765A1

    Publication date: 2020-03-19

    Application number: US16135822

    Filing date: 2018-09-19

    Abstract: Word vectors are multi-dimensional vectors that represent words in a corpus of text and that are embedded in a semantically-encoded vector space. Word vectors can be used for sentiment analysis, comparison of the topic or content of sentences, paragraphs, or other passages of text or other natural language processing tasks. However, the generation of word vectors can be computationally expensive. Accordingly, when a set of word vectors is needed for a particular corpus of text, a set of word vectors previously generated from a corpus of text that is sufficiently similar to the particular corpus of text, with respect to some criteria, may be re-used for the particular corpus of text. Such similarity could include the two corpora of text containing the same or similar sets of words or containing incident reports or other time-coded sets of text from overlapping or otherwise similar periods of time.

    Data structures for efficient storage and updating of paragraph vectors

    Publication number: US11423069B2

    Publication date: 2022-08-23

    Application number: US16135891

    Filing date: 2018-09-19

    Abstract: Systems and methods involving data structures for efficient management of paragraph vectors for textual searching are described. A database may contain records, each associated with an identifier and including a text string and timestamp. A look-up table may contain entries for text strings from the records, each entry associating: a paragraph vector for a respective unique text string, a hash of the respective unique text string, and a set of identifiers of records containing the respective unique text string. A server may receive from a client device an input string, compute a hash of the input string, and determine matching table entries, each containing a hash identical to that of the input string, or a paragraph vector similar to one calculated for the input string. A prioritized list of identifiers from the matching entries may be determined based on timestamps, and the prioritized list may be returned to the client.

    DATA STRUCTURES FOR EFFICIENT STORAGE AND UPDATING OF PARAGRAPH VECTORS

    Publication number: US20200089781A1

    Publication date: 2020-03-19

    Application number: US16135891

    Filing date: 2018-09-19

    Abstract: Systems and methods involving data structures for efficient management of paragraph vectors for textual searching are described. A database may contain records, each associated with an identifier and including a text string and timestamp. A look-up table may contain entries for text strings from the records, each entry associating: a paragraph vector for a respective unique text string, a hash of the respective unique text string, and a set of identifiers of records containing the respective unique text string. A server may receive from a client device an input string, compute a hash of the input string, and determine matching table entries, each containing a hash identical to that of the input string, or a paragraph vector similar to one calculated for the input string. A prioritized list of identifiers from the matching entries may be determined based on timestamps, and the prioritized list may be returned to the client.
