Language Model Preprocessing with Weighted N-grams

    Publication number: US20240296284A1

    Publication date: 2024-09-05

    Application number: US18117304

    Application date: 2023-03-03

    CPC classification number: G06F40/284 G06F40/166 G06N20/00

    Abstract: An embodiment may involve: obtaining textual content including a plurality of token strings, wherein each of the plurality of token strings includes one or more tokens; determining, for the plurality of token strings, respectively corresponding sets of n-gram tuples; assigning respective weights to the plurality of token strings, wherein, for each of the plurality of token strings, the assignment is based on the respectively corresponding set of n-gram tuples; identifying a subset of the plurality of token strings, wherein each of the subset of the plurality of token strings is characterized by a respective weight that exceeds a predetermined threshold weight; and storing sets of n-gram tuples respectively corresponding to the subset of the plurality of token strings.
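The steps in this abstract (build n-gram tuples per token string, weight each string from its tuples, keep the strings above a threshold, store their tuples) can be illustrated with a minimal Python sketch. The abstract does not say how weights are computed, so the weighting here (mean corpus frequency of a string's n-grams) is an assumption made for illustration, as are the function names `ngram_tuples` and `select_weighted_ngrams` and the `max_n` parameter.

```python
from collections import Counter
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def ngram_tuples(tokens: List[str], max_n: int = 2) -> List[NGram]:
    """Return every n-gram tuple of length 1..max_n for one token string."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_weighted_ngrams(
    token_strings: List[List[str]],
    threshold: float,
    max_n: int = 2,
) -> Dict[NGram, List[NGram]]:
    """Keep the n-gram sets of token strings whose weight exceeds threshold."""
    per_string = [ngram_tuples(tokens, max_n) for tokens in token_strings]

    # Assumed weighting (not specified in the abstract): count how often each
    # n-gram occurs across all strings, then average over a string's n-grams.
    counts = Counter(gram for grams in per_string for gram in grams)

    kept: Dict[NGram, List[NGram]] = {}
    for tokens, grams in zip(token_strings, per_string):
        if not grams:
            continue  # an empty token string has no n-grams to weigh
        weight = sum(counts[gram] for gram in grams) / len(grams)
        if weight > threshold:
            kept[tuple(tokens)] = grams  # "store" step: retain the n-gram set
    return kept
```

Under this assumed weighting, a string that recurs in the corpus scores higher than one that appears once, so raising `threshold` filters out rare strings before their n-grams are stored.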

    Lookup source framework for a natural language understanding (NLU) framework

    Publication number: US12265796B2

    Publication date: 2025-04-01

    Application number: US17579028

    Application date: 2022-01-19

    Abstract: A natural language understanding (NLU) framework includes a lookup source framework, which enables a lookup source system to be defined having one or more lookup sources. Each lookup source of the lookup source system includes a respective source data representation that is compiled from respective source data. For example, a source data representation may include source data arranged in an inverse finite state transducer (IFST) structure as a set of finite-state automata (FSA) states, wherein each state is associated with a token that represents underlying source data. Different producers can be applied during compilation of a source data representation to derive additional states within the source data representation from the source data. Certain states of the source data representation that contain sensitive data can be selectively protected through encryption and/or obfuscation, while other portions of the source data representation that are not sensitive may remain in clear-text form.
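As a rough illustration of the ideas in this abstract, the Python sketch below compiles source data into token-keyed states, applies a producer to derive extra states, and stores sensitive entries under obfuscated keys while leaving the rest in clear text. It is a simplified token trie, not the patented IFST structure. Every name in it (`State`, `compile_source`, `lowercase_producer`, `lookup`, `obfuscate`) is hypothetical, and the unsalted SHA-256 digest is only a stand-in for whatever encryption or obfuscation scheme an implementation would use.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def obfuscate(token: str) -> str:
    """Assumed protection scheme: replace a token with its SHA-256 digest."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


@dataclass
class State:
    """One automaton state; outgoing edges are keyed by (possibly obfuscated) token."""
    edges: Dict[str, "State"] = field(default_factory=dict)
    accepting: bool = False


Producer = Callable[[List[str]], Iterable[List[str]]]


def lowercase_producer(tokens: List[str]) -> Iterable[List[str]]:
    """Hypothetical producer: derive a lowercase variant of the token sequence."""
    yield [t.lower() for t in tokens]


def compile_source(
    entries: Iterable[Tuple[str, bool]],
    producers: Sequence[Producer] = (),
) -> State:
    """Compile (text, is_sensitive) entries into a state graph."""
    root = State()
    for text, sensitive in entries:
        tokens = text.split()
        variants = [tokens]
        for producer in producers:
            variants.extend(producer(tokens))  # producers add derived states
        for variant in variants:
            state = root
            for tok in variant:
                key = obfuscate(tok) if sensitive else tok
                state = state.edges.setdefault(key, State())
            state.accepting = True
    return root


def lookup(root: State, tokens: List[str]) -> bool:
    """Match tokens against clear-text edges first, then obfuscated edges."""
    state = root
    for tok in tokens:
        nxt = state.edges.get(tok)
        if nxt is None:
            nxt = state.edges.get(obfuscate(tok))
        if nxt is None:
            return False
        state = nxt
    return state.accepting
```

For example, compiling `[("Jane Doe", True), ("New York", False)]` with `lowercase_producer` yields a graph in which `lookup(root, ["jane", "doe"])` succeeds even though no clear-text `"Jane"` edge is stored, while the `"New"` edge remains readable.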

    LOOKUP SOURCE FRAMEWORK FOR A NATURAL LANGUAGE UNDERSTANDING (NLU) FRAMEWORK

    Publication number: US20220229998A1

    Publication date: 2022-07-21

    Application number: US17579028

    Application date: 2022-01-19

    Abstract: A natural language understanding (NLU) framework includes a lookup source framework, which enables a lookup source system to be defined having one or more lookup sources. Each lookup source of the lookup source system includes a respective source data representation that is compiled from respective source data. For example, a source data representation may include source data arranged in an inverse finite state transducer (IFST) structure as a set of finite-state automata (FSA) states, wherein each state is associated with a token that represents underlying source data. Different producers can be applied during compilation of a source data representation to derive additional states within the source data representation from the source data. Certain states of the source data representation that contain sensitive data can be selectively protected through encryption and/or obfuscation, while other portions of the source data representation that are not sensitive may remain in clear-text form.
