K-D Tree Balanced Splitting
    121.
    Invention Application

    Publication Number: US20250086155A1

    Publication Date: 2025-03-13

    Application Number: US18772758

    Filing Date: 2024-07-15

    Abstract: A system for clustering data into corresponding files comprises one or more processors and a memory. The one or more processors is/are configured to: 1) determine to cluster a set of data into a set of files; 2) determine a set of split points in a corresponding set of dimensions of the set of data to determine the set of files, wherein each file of the set of files has an approximate target size; and 3) store one or more items of the set of data into a corresponding file of the set of files based at least in part on the set of split points. The memory is coupled to the one or more processors and configured to provide the processor with instructions.
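
    The abstract above describes choosing split points along dimensions of the data so that each output file lands near a target size. As a rough illustration of that idea only, the following sketch recursively splits rows at the median of rotating dimensions until each partition is at or below a target row count; the function and names are hypothetical and not taken from the patent.

    from typing import List, Sequence

    def kd_balanced_split(rows: List[Sequence[float]], target_size: int, dim: int = 0) -> List[List[Sequence[float]]]:
        """Recursively partition rows so each partition holds roughly target_size rows."""
        if len(rows) <= target_size:
            return [rows]
        d = dim % len(rows[0])                      # rotate through the dimensions
        ordered = sorted(rows, key=lambda r: r[d])  # order by the current dimension
        mid = len(ordered) // 2                     # median index acts as the split point
        left, right = ordered[:mid], ordered[mid:]
        return (kd_balanced_split(left, target_size, dim + 1)
                + kd_balanced_split(right, target_size, dim + 1))

    if __name__ == "__main__":
        data = [(x, y) for x in range(8) for y in range(8)]   # 64 toy rows
        files = kd_balanced_split(data, target_size=10)
        print([len(f) for f in files])   # each "file" holds roughly the target count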

    Clustering key selection based on machine-learned key selection models for data processing service

    Publication Number: US12229169B1

    Publication Date: 2025-02-18

    Application Number: US18501830

    Filing Date: 2023-11-03

    Abstract: The disclosed configurations provide a method (and/or a computer-readable medium or system) for determining, from a table schema describing keys of a data table, one or more clustering keys that can be used to cluster data files of a data table. The method includes generating features for the data table, generating tokens from the features, generating a prediction for each token by applying to the token a machine-learned transformer model trained to predict a likelihood that the key associated with the token is a clustering key for the data table, determining clustering keys based on the predictions, and clustering data records of the data table into data files based on key-values for the clustering keys.
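
    The abstract sketches a pipeline: featurize the table schema, turn the features into tokens, score each token with a learned model, and keep the highest-scoring keys. The snippet below only mirrors that pipeline shape under stated assumptions; the scoring function is a hand-written stand-in for the machine-learned transformer, and every name is hypothetical.

    from typing import Dict, List

    def featurize(table_schema: Dict[str, str]) -> List[Dict[str, str]]:
        # One feature record (a "token" here) per key: its name and declared type.
        return [{"key": k, "type": t} for k, t in table_schema.items()]

    def score_token(token: Dict[str, str]) -> float:
        # Stand-in for the transformer: favor types that often cluster well.
        return {"date": 0.9, "int": 0.7, "string": 0.4}.get(token["type"], 0.1)

    def select_clustering_keys(table_schema: Dict[str, str], threshold: float = 0.6) -> List[str]:
        tokens = featurize(table_schema)
        return [t["key"] for t in tokens if score_token(t) >= threshold]

    if __name__ == "__main__":
        schema = {"event_date": "date", "user_id": "int", "payload": "string"}
        print(select_clustering_keys(schema))  # ['event_date', 'user_id']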

    Checkpoint and restore based startup of executor nodes of a distributed computing engine for processing queries

    Publication Number: US12229137B1

    Publication Date: 2025-02-18

    Application Number: US18412438

    Filing Date: 2024-01-12

    Abstract: A system performs efficient startup of executors of a distributed computing engine used for processing queries, for example, database queries. The system starts an executor node and processes a set of queries using the executor node to warm up the executor node. The system performs a checkpoint of the warmed-up executor node to create an image. The image is restored in the target executor nodes. The system may store a checkpoint image for each configuration of an executor node. The configuration is determined based on various factors including the hardware of the executor node, memory allocation of the processes, and so on. The use of restore based on checkpoint images improves the efficiency of executor node startup.
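
    To make the warm-up, checkpoint, and restore flow concrete, here is a minimal sketch under assumed names: one executor is warmed with queries, its state is snapshotted keyed by its configuration, and later executors start by restoring that snapshot. The in-memory dictionaries stand in for real process checkpoint images; this is not the patented implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class Executor:
        config: Tuple[str, int]            # e.g. (hardware profile, memory in MB)
        warm_state: List[str] = field(default_factory=list)

        def run_query(self, q: str) -> None:
            self.warm_state.append(q)      # pretend caches and JIT state build up here

    checkpoint_images: Dict[Tuple[str, int], List[str]] = {}

    def warm_and_checkpoint(config: Tuple[str, int], warmup_queries: List[str]) -> None:
        ex = Executor(config)
        for q in warmup_queries:
            ex.run_query(q)
        checkpoint_images[config] = list(ex.warm_state)   # one snapshot per configuration

    def start_from_checkpoint(config: Tuple[str, int]) -> Executor:
        return Executor(config, warm_state=list(checkpoint_images[config]))

    if __name__ == "__main__":
        cfg = ("m5.xlarge", 16384)
        warm_and_checkpoint(cfg, ["SELECT 1", "SELECT count(*) FROM t"])
        fast_executor = start_from_checkpoint(cfg)        # skips repeating the warm-up
        print(len(fast_executor.warm_state))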

    Multi-cluster query result caching
    124.
    Invention Grant

    Publication Number: US12189625B2

    Publication Date: 2025-01-07

    Application Number: US18222343

    Filing Date: 2023-07-14

    Abstract: A multi-cluster computing system that includes a query result caching system is presented. The multi-cluster computing system may include a data processing service and client devices communicatively coupled over a network. The data processing service may include a control layer and a data layer. The control layer may be configured to receive and process requests from the client devices and manage resources in the data layer. The data layer may be configured to include instances of clusters of computing resources for executing jobs. The data layer may include a data storage system, which further includes a remote query result cache store. The query result cache store may include a cloud storage query result cache which stores data associated with results of previously executed requests. As such, when a cluster encounters a previously executed request, the cluster may efficiently retrieve the cached result of the request from the in-memory query result cache or the cloud storage query result cache.
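
    The two-tier lookup described above (a per-cluster in-memory cache backed by a shared cloud storage cache) can be sketched as follows. This is an illustration only, with dictionaries standing in for the in-memory cache and the cloud store; all names are hypothetical.

    from typing import Callable, Dict

    in_memory_cache: Dict[str, str] = {}          # local to one cluster
    cloud_storage_cache: Dict[str, str] = {}      # shared across clusters

    def get_query_result(query: str, execute: Callable[[str], str]) -> str:
        if query in in_memory_cache:
            return in_memory_cache[query]
        if query in cloud_storage_cache:          # another cluster may have run it
            result = cloud_storage_cache[query]
            in_memory_cache[query] = result
            return result
        result = execute(query)                   # cache miss: actually run the query
        in_memory_cache[query] = result
        cloud_storage_cache[query] = result
        return result

    if __name__ == "__main__":
        run = lambda q: f"rows for: {q}"
        print(get_query_result("SELECT 1", run))  # executed and cached
        in_memory_cache.clear()                   # simulate a different cluster
        print(get_query_result("SELECT 1", run))  # served from the cloud storage cache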

    FEATURE FUNCTION BASED COMPUTATION OF ON-DEMAND FEATURES OF MACHINE LEARNING MODELS

    Publication Number: US20240412095A1

    Publication Date: 2024-12-12

    Application Number: US18206460

    Filing Date: 2023-06-06

    Abstract: A system performs training and execution of machine learning models that use on-demand features using feature functions. The system receives commands for registering metadata associated with a machine learning model. The machine learning model may process a set of features including on-demand features as well as other features such as batch features. The system executes the command by storing an association between the machine learning model and the feature functions associated with any on-demand features processed by the machine learning model. The feature functions are executed using an end point of a data asset service. The use of the data asset service for invoking the feature functions ensures that the same set of instructions is executed during model training and model inferencing, thereby avoiding model skew.
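
    The key point in the abstract is that the same feature functions run at training and inference time. The sketch below shows only that shape: a registry maps each model to its on-demand feature functions, and one resolver computes features for both paths. The registry structure and all names here are assumptions made for illustration.

    from typing import Callable, Dict, List

    feature_functions: Dict[str, Callable[[dict], float]] = {}
    model_registry: Dict[str, List[str]] = {}     # model name -> on-demand feature names

    def register_feature_function(name: str, fn: Callable[[dict], float]) -> None:
        feature_functions[name] = fn

    def register_model(model_name: str, on_demand_features: List[str]) -> None:
        model_registry[model_name] = on_demand_features

    def compute_features(model_name: str, request: dict) -> Dict[str, float]:
        # The same code path serves training and inference, avoiding train/serve skew.
        return {f: feature_functions[f](request) for f in model_registry[model_name]}

    if __name__ == "__main__":
        register_feature_function("distance_km", lambda r: abs(r["dest"] - r["origin"]))
        register_model("eta_model", ["distance_km"])
        print(compute_features("eta_model", {"origin": 3.0, "dest": 10.5}))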

    RETRIEVAL AND CACHING OF OBJECT METADATA ACROSS DATA SOURCES AND STORAGE SYSTEMS

    Publication Number: US20240346007A1

    Publication Date: 2024-10-17

    Application Number: US18135078

    Filing Date: 2023-04-14

    CPC classification number: G06F16/2365 G06F16/24552

    Abstract: A system for retrieving and caching metadata from a remote data source is described. The system may receive a request from a client device. The request is to perform a query operation on a set of data objects stored in the remote data source. The system may access a metadata cache storing metadata information on one or more data objects of the remote data source and identify metadata corresponding to the set of data objects for the query operation in the metadata cache. The system may determine whether the identified metadata for the set of data objects meets an update condition. In response to the identified metadata meeting the update condition, the system may fetch updated metadata for at least the set of data objects from the remote data source, and store the updated metadata in the metadata cache.
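
    As a minimal illustration of the cache-then-refresh pattern described above, the sketch below serves metadata from a local cache and refetches from the remote source when an update condition is met; a simple time-to-live check is assumed here as that condition, and all names are hypothetical.

    import time
    from typing import Dict, Tuple

    metadata_cache: Dict[str, Tuple[dict, float]] = {}   # object id -> (metadata, fetched_at)
    TTL_SECONDS = 300.0

    def fetch_from_remote(object_id: str) -> dict:
        return {"object": object_id, "schema_version": 7}  # stand-in for a remote call

    def get_metadata(object_id: str) -> dict:
        cached = metadata_cache.get(object_id)
        stale = cached is None or (time.time() - cached[1]) > TTL_SECONDS
        if stale:                                          # the update condition is met
            meta = fetch_from_remote(object_id)
            metadata_cache[object_id] = (meta, time.time())
            return meta
        return cached[0]

    if __name__ == "__main__":
        print(get_metadata("catalog.db.table"))            # fetched and cached
        print(get_metadata("catalog.db.table"))            # served from the cache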

    Model ML registry and model serving
    128.
    Invention Grant

    Publication Number: US12117983B2

    Publication Date: 2024-10-15

    Application Number: US18512028

    Filing Date: 2023-11-17

    CPC classification number: G06F16/219 G06F16/955 G06N5/022

    Abstract: A system includes an interface, a processor, and a memory. The interface is configured to receive a version of a model from a model registry. The processor is configured to store the version of the model, start a process running the version of the model, and update a proxy with version information associated with the version of the model, wherein the updated proxy indicates to redirect an indication to invoke the version of the model to the process. The memory is coupled to the processor and configured to provide the processor with instructions.
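
    A rough sketch of the serving flow described above: a model version is pulled from a registry, a worker is started for it, and a proxy routing table is updated so invocations of that version reach the worker. The worker object and route table below are simplifications with hypothetical names, not the patented implementation.

    from typing import Dict, Tuple

    class ModelWorker:
        def __init__(self, name: str, version: int):
            self.name, self.version = name, version

        def invoke(self, payload: dict) -> dict:
            return {"model": self.name, "version": self.version, "input": payload}

    proxy_routes: Dict[Tuple[str, int], ModelWorker] = {}

    def deploy_version(name: str, version: int) -> None:
        worker = ModelWorker(name, version)        # start the process for this version
        proxy_routes[(name, version)] = worker     # the proxy now redirects to it

    def invoke(name: str, version: int, payload: dict) -> dict:
        return proxy_routes[(name, version)].invoke(payload)

    if __name__ == "__main__":
        deploy_version("churn_model", 3)
        print(invoke("churn_model", 3, {"user_id": 42}))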

    Scan parsing
    129.
    Invention Grant

    Publication Number: US12072880B2

    Publication Date: 2024-08-27

    Application Number: US17892376

    Filing Date: 2022-08-22

    CPC classification number: G06F16/24542 G06F16/285

    Abstract: The present application discloses a method, system, and computer system for parsing files. The method includes receiving an indication that a first file is to be processed, determining to begin processing the first file using a first processing engine based at least in part on one or more predefined heuristics, indicating to process the first file using the first processing engine, determining whether a particular error in processing the first file using the first processing engine has been detected, in response to determining that the particular error has been detected, indicating to stop processing the first file using the first processing engine and indicating to continue processing using a second processing engine, and storing in memory information obtained based on processing the first file by one or more of the first processing engine and the second processing engine.
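
    The engine-switching idea can be shown with a small sketch: pick a fast parser by a simple heuristic, and if it raises a specific error, continue with a more robust parser. The engines, the heuristic, and the error type below are all assumptions made for illustration.

    def fast_engine(text: str) -> list:
        if any(len(line) > 20 for line in text.splitlines()):
            raise ValueError("row too wide for the fast path")   # the "particular error"
        return [line.split(",") for line in text.splitlines()]

    def robust_engine(text: str) -> list:
        return [line.split(",") for line in text.splitlines()]

    def parse_file(text: str) -> list:
        use_fast_first = len(text) < 1_000_000   # predefined heuristic: try small files fast
        if use_fast_first:
            try:
                return fast_engine(text)
            except ValueError:
                pass                             # particular error detected: switch engines
        return robust_engine(text)

    if __name__ == "__main__":
        print(parse_file("a,b\nc,d"))                      # handled by the fast engine
        print(parse_file("a,b\n" + "x" * 30 + ",y"))       # falls back to the robust engine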

    Multi-Cluster Query Result Caching
    130.
    Invention Publication

    Publication Number: US20240265011A1

    Publication Date: 2024-08-08

    Application Number: US18222343

    Filing Date: 2023-07-14

    CPC classification number: G06F16/24539

    Abstract: A multi-cluster computing system that includes a query result caching system is presented. The multi-cluster computing system may include a data processing service and client devices communicatively coupled over a network. The data processing service may include a control layer and a data layer. The control layer may be configured to receive and process requests from the client devices and manage resources in the data layer. The data layer may be configured to include instances of clusters of computing resources for executing jobs. The data layer may include a data storage system, which further includes a remote query result cache store. The query result cache store may include a cloud storage query result cache which stores data associated with results of previously executed requests. As such, when a cluster encounters a previously executed request, the cluster may efficiently retrieve the cached result of the request from the in-memory query result cache or the cloud storage query result cache.
