-
Publication No.: US20250156448A1
Publication date: 2025-05-15
Application No.: US19022884
Filing date: 2025-01-15
Applicant: Databricks, Inc.
Inventor: Terry Kim, Lin Ma, Rahul Shivu Mahadev, Rahul Potharaju
Abstract: The disclosed configurations provide a method (and/or a computer-readable medium or system) for determining, from a table schema describing keys of a data table, one or more clustering keys that can be used to cluster data files of a data table. The method includes generating features for the data table, generating tokens from the features, generating a prediction for each token by applying to the token a machine-learned transformer model trained to predict a likelihood that the key associated with the token is a clustering key for the data table, determining clustering keys based on the predictions, and clustering data records of the data table into data files based on key-values for the clustering keys.
-
Publication No.: US20250156397A1
Publication date: 2025-05-15
Application No.: US18983280
Filing date: 2024-12-16
Applicant: Databricks, Inc.
Inventor: Zhaoxing Li, Rayman Preet Singh, Fuat Can Efeoglu, Daniel Tenedorio, Sarah Cai
IPC: G06F16/23, G06F16/2455
Abstract: A system for retrieving and caching metadata from a remote data source is described. The system may receive a request from a client device. The request is to perform a query operation on a set of data objects stored in the remote data source. The system may access a metadata cache storing metadata information on one or more data objects of the remote data source and identify metadata corresponding to the set of data objects for the query operation in the metadata cache. The system may determine whether the identified metadata for the set of data objects meets an update condition. In response to the identified metadata meeting the update condition, the system may fetch updated metadata for at least the set of data objects from the remote data source, and store the updated metadata in the metadata cache.
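The cache-then-refresh flow in this abstract can be illustrated with a minimal Python sketch. It is not taken from the application: the class name `MetadataCache`, the `fetch_remote` callback, and the choice of a maximum-age threshold as the "update condition" are all assumptions made for illustration, since the abstract does not say what the condition is.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


class MetadataCache:
    """Hypothetical metadata cache; the update condition is assumed to be an age limit."""

    def __init__(self, fetch_remote: Callable[[str], dict], max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_remote = fetch_remote   # assumed callback into the remote data source
        self._max_age_s = max_age_s         # assumed update condition: entry older than this
        self._clock = clock
        self._entries: Dict[str, Tuple[dict, float]] = {}

    def _needs_update(self, name: str) -> bool:
        # A missing entry or an entry older than max_age_s meets the update condition.
        entry = self._entries.get(name)
        if entry is None:
            return True
        _, fetched_at = entry
        return self._clock() - fetched_at > self._max_age_s

    def metadata_for_query(self, object_names: Iterable[str]) -> Dict[str, dict]:
        # Look up metadata for each object in the query, refreshing where needed.
        result: Dict[str, dict] = {}
        for name in object_names:
            if self._needs_update(name):
                self._entries[name] = (self._fetch_remote(name), self._clock())
            result[name] = self._entries[name][0]
        return result
```

Injecting the clock keeps the sketch testable; a real system could equally trigger refreshes on a version mismatch or an explicit invalidation rather than on age.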
-
Publication No.: US12277237B2
Publication date: 2025-04-15
Application No.: US17514982
Filing date: 2021-10-29
Applicant: Databricks, Inc.
Inventor: Matei Zaharia, David Lewis, Cheng Lian, Yuchen Huo, Ali Ghodsi
Abstract: The present application discloses a method, system, and computer system for providing access to information stored on a system for data storage. The method includes receiving a data request from a user, determining data corresponding to the data request, determining whether the user has requisite permissions to access the data, and in response to determining that the user has requisite permissions to access the data: determining a manner by which to provide access to the data, wherein the data comprises a filtered subset of stored data, and generating a token based at least in part on the user and the manner by which access to the data is to be provided.
-
Publication No.: US20250094195A1
Publication date: 2025-03-20
Application No.: US18368919
Filing date: 2023-09-15
Applicant: Databricks, Inc.
Inventor: Aaron Daniel Davidson, Thomas Garnier, Lin Guo, Zhe He, Manlin Li, Yang Liu, Feng Wang, Hong Zhang, Weirong Zhu
Abstract: A resource management configuration may receive an API request from an API server. The API request specifies task information from a plurality of tenants. The configuration transmits status information of a plurality of VMs to the API server to assign tasks to one or more VMs based on the task information and the status information. Tasks assigned to a VM of the plurality of VMs are for one tenant of the plurality of tenants. The configuration configures, on an untrusted network, network security groups for managing communications of tenants such that a network security group configured for a tenant permits communications between VMs assigned to the same tenant but prevents communications between VMs assigned to different tenants. The configuration pins each assigned VM of the one or more assigned VMs to perform the task based on the task information of the corresponding tenant.
-
Publication No.: US20250061378A1
Publication date: 2025-02-20
Application No.: US18738025
Filing date: 2024-06-09
Applicant: Databricks, Inc.
Inventor: Benjamin Thomas Wilson, Corey Zumar
IPC: G06N20/00, G06F18/20, G06F18/2132
Abstract: The present application discloses a method, system, and computer system for building a model associated with a dataset. The method includes receiving a dataset, the dataset comprising a plurality of keys and a plurality of key-value relationships, determining a plurality of models to build based at least in part on the dataset, wherein determining the plurality of models to build comprises using the dataset format information to identify the plurality of models, building the plurality of models, and optimizing at least one of the plurality of models.
-
Publication No.: US20250061132A1
Publication date: 2025-02-20
Application No.: US18822023
Filing date: 2024-08-30
Applicant: Databricks, Inc.
Inventor: Alexander Balikov, Tathagata Das, Karthikeyan Ramasamy
IPC: G06F16/27, G06F16/2455
Abstract: A data processing service performs a rebalancing process for rebalancing stateful tasks on a cluster computing system. In one instance, the method for rebalancing stateful tasks is performed such that the per-operator partitions are spread across available executors of a cluster of the cluster computing system with respect to one or more statistics of the tasks. In one instance, the method for rebalancing stateful tasks is also performed such that the total number of stateful tasks is balanced per executor as long as this rebalancing does not imbalance the per-operator placements. In this way, the processing of stateful tasks can be spread across multiple executors in a relatively uniform manner, even though there may be an upfront cost of breaking the local caching on an executor.
-
Publication No.: US20250028686A1
Publication date: 2025-01-23
Application No.: US18224981
Filing date: 2023-07-21
Applicant: Databricks, Inc.
Inventor: Pranav Anand, Praveen Gattu, Anish Shrigondekar, Huanli Wang
IPC: G06F16/174, G06F16/14, G06F16/16
Abstract: A device for using message identifiers for publish/subscribe messaging deduplication is described. The system may fetch one or more sets of data records from a data source, and each data record is associated with a message identifier. The system may store the one or more sets of data records in a data file, which is associated with metadata comprising the message identifier, a file path, and a row number for each data record. The system may determine whether one or more of the data records are duplicated based on the associated message identifiers. In response to determining that the one or more data records are duplicated, the system may generate second metadata comprising the file paths and row numbers associated with the duplicated data records.
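As a rough illustration of the deduplication step, the Python sketch below scans per-record metadata and collects the file paths and row numbers of records whose message identifier has already been seen. The names `RowMeta` and `find_duplicates` are hypothetical, and treating the first occurrence as the one to keep is an assumption; the abstract does not state which copy survives.

```python
from typing import Dict, List, NamedTuple, Set


class RowMeta(NamedTuple):
    """Per-record metadata: message identifier, file path, and row number."""
    message_id: str
    file_path: str
    row_number: int


def find_duplicates(metadata: List[RowMeta]) -> Dict[str, List[int]]:
    """Return the 'second metadata': file path -> row numbers of duplicated records.

    Assumption: the first record seen for a message identifier is kept,
    and every later record with the same identifier counts as a duplicate.
    """
    seen: Set[str] = set()
    duplicates: Dict[str, List[int]] = {}
    for row in metadata:
        if row.message_id in seen:
            duplicates.setdefault(row.file_path, []).append(row.row_number)
        else:
            seen.add(row.message_id)
    return duplicates
```

Recording duplicates as (file path, row number) pairs rather than rewriting the data files means the original files can stay untouched while readers skip the flagged rows.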
-
Publication No.: US20250013606A1
Publication date: 2025-01-09
Application No.: US18218410
Filing date: 2023-07-05
Applicant: Databricks, Inc.
Inventor: Prakhar Jain, Frederick Ryan Johnson, Terry Kim, Vijayan Prabhakaran, Bart Samwel
Abstract: A data processing service generates a data classifier tree for managing data files of a data table. The data classifier tree may be configured as a KD-classifier tree and includes a plurality of nodes and edges. A node of the data classifier tree may represent a splitting condition with respect to key-values for a respective key. A node of the data classifier tree may be associated with one or more data files assigned to the node. The data files assigned to the node each include a subset of records having key-values that satisfy the conditions represented by the node and parent nodes of the node. The data processing service may efficiently cluster the data in the data table while reducing the number of data files that are rewritten when data is modified or added to the data table.
-
Publication No.: US12147412B2
Publication date: 2024-11-19
Application No.: US18156109
Filing date: 2023-01-18
Applicant: Databricks, Inc.
Inventor: Bart Samwel, Christos Stavrakakis
Abstract: A disclosed configuration receives a first indication that a first transaction is committed to update a first subset of records in a data table at a first version to generate a second version of the data table, and receives a second indication to commit a second transaction to update a second subset of records in a data file of the data table at the first version. The configuration determines a logical prerequisite based on whether the first subset of records changes content of one or more records in the second subset of records, and determines a physical prerequisite based on whether the second subset of records corresponds to respective data records in data files of the second version of the data table. The configuration commits the second transaction to generate a third version of the data table by updating elements of a deletion vector if the prerequisites are satisfied.
-
Publication No.: US20240362215A1
Publication date: 2024-10-31
Application No.: US18140323
Filing date: 2023-04-27
Applicant: Databricks, Inc.
Inventor: Venkata Sai Akhil Gudesa, Herman Rudolf Petrus Catharina van Hövell tot Westerflier, Supun Chathuranga Nakandala
IPC: G06F16/2453, G06F9/48, G06F11/34, G06F16/28
CPC classification number: G06F16/2453, G06F9/4887, G06F11/3419, G06F16/285
Abstract: A cluster computing system maintains a first set of queues for short queries and a second set of queues for longer queries. The first set is allocated a majority of the cluster's processing resources and processes queries on a first-in, first-out basis. The second set is allocated a minority of the cluster's processing resources, which are shared among queries in the second set. Accordingly, the system assigns each query to the first set of queues for a fixed amount of resource time. While a query is processing, the system monitors the query's resource time and reassigns the query to the second set of queues if the query has not completed within the allotted amount of resource time. Thus, short queries receive the necessary resources to complete quickly without getting stuck behind longer queries while ensuring that longer queries continue to make progress.
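The demotion rule in this abstract can be sketched as a small Python simulation. It is heavily simplified and not the patented scheduler: the function name `schedule`, the integer work units, and the round-robin `time_slice` are assumptions, and the two queues are drained one after the other here, whereas the abstract describes both sets sharing the cluster's resources concurrently.

```python
from collections import deque
from typing import Deque, List, Tuple


def schedule(queries: List[Tuple[str, int]], budget: int, time_slice: int = 1) -> List[str]:
    """Return query names in completion order under a simplified two-queue policy.

    queries: (name, work units needed); budget: resource time granted in the fast queue.
    """
    fast: Deque[Tuple[str, int]] = deque(queries)   # first set: FIFO, fixed budget per query
    slow: Deque[Tuple[str, int]] = deque()          # second set: resources shared round-robin
    finished: List[str] = []

    # Fast queue: run each query for at most `budget` units, in arrival order.
    while fast:
        name, work = fast.popleft()
        if work <= budget:
            finished.append(name)
        else:
            slow.append((name, work - budget))      # not done in time: reassign to slow queue

    # Slow queue: share resources by giving each query one time slice per turn.
    while slow:
        name, remaining = slow.popleft()
        remaining -= time_slice
        if remaining <= 0:
            finished.append(name)
        else:
            slow.append((name, remaining))
    return finished
```

For example, with `budget=3`, a 10-unit query submitted ahead of two short queries is demoted after its budget runs out, so both short queries complete before it.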
-