-
Publication No.: US12222915B2
Publication date: 2025-02-11
Application No.: US17314813
Filing date: 2021-05-07
Applicant: Cloudera, Inc.
Inventor: Todd Lipcon
Abstract: Columnar storage provides many performance and space-saving benefits for analytic workloads, but previous mechanisms for handling single-row update transactions in column stores suffer from poor performance. A columnar data layout provides both low-latency random access and high-throughput analytical access, simplifying Hadoop architectures for use cases involving real-time data. In disclosed embodiments, mutations within a single row are executed atomically across columns and do not necessarily include the entirety of a row. This allows for faster updates without the overhead of reading or rewriting larger columns.
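The partial-row mutation described in this abstract can be illustrated with a toy example. This is only a sketch under stated assumptions: the `ColumnStore` class, its single lock, and the method names are hypothetical and are not taken from the patent. It shows an update that touches only the changed columns while all changes to the row become visible together.

```python
import threading


class ColumnStore:
    """Toy in-memory column store: one Python list per column (hypothetical layout)."""

    def __init__(self, columns):
        self._cols = {name: [] for name in columns}
        self._rows = 0
        self._lock = threading.Lock()  # assumption: one lock stands in for real concurrency control

    def insert(self, row):
        """Append a full row and return its row id."""
        with self._lock:
            for name, values in self._cols.items():
                values.append(row[name])
            self._rows += 1
            return self._rows - 1

    def update(self, row_id, changes):
        """Apply a partial-row mutation; untouched columns are neither read nor rewritten."""
        with self._lock:
            unknown = set(changes) - set(self._cols)
            if unknown:
                raise KeyError(f"unknown columns: {sorted(unknown)}")
            for name, value in changes.items():
                self._cols[name][row_id] = value

    def read(self, row_id):
        """Return the row as a dict, assembled from the individual columns."""
        with self._lock:
            return {name: values[row_id] for name, values in self._cols.items()}
```

Here an update such as `store.update(rid, {"name": "b"})` leaves a large column like a payload blob untouched, which is the saving the abstract points to.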
-
Publication No.: US20240015234A1
Publication date: 2024-01-11
Application No.: US18357021
Filing date: 2023-07-21
Applicant: Cloudera, Inc.
Inventor: David Alves , Todd Lipcon
CPC classification number: H04L69/28 , G06Q10/00 , H04L67/10 , G06Q50/26 , G06F1/14 , G06F11/0721 , G06F11/0772
Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
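The timestamp adjustment in this abstract can be sketched in a few lines. The function name `event_time` and the choice of "first time plus one tick" as the alternate time are assumptions made for illustration; the abstract only says the alternate time is based on the first time.

```python
def event_time(local_now, sender_time=None):
    """Return the timestamp to record for an event on this machine.

    local_now   -- reading of the local clock, in integer ticks
    sender_time -- timestamp carried by the message that triggered the event, if any

    Assumption: if the sender's time is not earlier than the local reading,
    the alternate time is sender_time + 1, so the caused event always sorts
    after the event that caused it.
    """
    if sender_time is not None and sender_time >= local_now:
        return sender_time + 1
    return local_now


# First computer's clock runs fast: its event is stamped 105.
first = event_time(105)
# Second computer's clock reads 100 when the message arrives.
second = event_time(100, sender_time=first)
# A third system comparing the two stored times recovers the causal order.
print(first < second)
```

Without the adjustment, the second event would be stamped 100 and would appear to precede the event that triggered it.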
-
23.
Publication No.: US11341134B2
Publication date: 2022-05-24
Application No.: US16989687
Filing date: 2020-08-10
Applicant: Cloudera, Inc.
Inventor: Anjali Betawadkar-Norwood , Priyank Patel
IPC: G06F16/2453 , G06F16/2458 , G06F16/27
Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.
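Eager aggregation, as named in this abstract, moves an aggregation ahead of a join. The sketch below uses made-up in-memory tables (`fact`, `dim`) and hypothetical function names to show that both plan shapes give the same totals for a simple sum; it does not model the analytical-view matching step.

```python
from collections import defaultdict

# Hypothetical star schema: fact rows are (store_id, amount); the dimension maps store_id -> region.
fact = [(1, 10), (1, 5), (2, 7), (3, 1), (2, 3)]
dim = {1: "east", 2: "west", 3: "east"}


def join_then_aggregate(fact, dim):
    """Baseline plan: join every fact row to its dimension row, then sum per region."""
    totals = defaultdict(int)
    for store_id, amount in fact:
        totals[dim[store_id]] += amount
    return dict(totals)


def aggregate_then_join(fact, dim):
    """Eager plan: pre-aggregate facts on the join key, then join the smaller result."""
    per_store = defaultdict(int)
    for store_id, amount in fact:
        per_store[store_id] += amount
    totals = defaultdict(int)
    for store_id, subtotal in per_store.items():
        totals[dim[store_id]] += subtotal
    return dict(totals)
```

The eager plan joins one row per store instead of one row per fact, which is where the saving comes from when the fact table is large.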
-
Publication No.: US20210349755A1
Publication date: 2021-11-11
Application No.: US17379742
Filing date: 2021-07-19
Applicant: Cloudera, Inc.
Inventor: Karthik Kambatla
Abstract: Embodiments are disclosed for a utilization-aware approach to cluster scheduling, to address resource fragmentation and to improve cluster utilization and job throughput. In some embodiments, a resource manager at a master node considers actual usage of running tasks and schedules opportunistic work on underutilized worker nodes. The resource manager monitors resource usage on these nodes and preempts opportunistic containers if the resulting over-subscription becomes untenable. In doing so, the resource manager effectively utilizes wasted resources, while minimizing adverse effects on regularly scheduled tasks.
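The scheduling and preemption loop in this abstract might look roughly like the following. Everything here is a hypothetical sketch: the `Worker` fields, the two thresholds, and the last-in-first-out preemption order are assumptions, not details from the application.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    capacity: float          # total resource units on the node
    allocated: float = 0.0   # units promised to regularly scheduled containers
    used: float = 0.0        # units those containers actually use
    opportunistic: list = field(default_factory=list)  # sizes of opportunistic containers

    def opportunistic_load(self):
        return sum(self.opportunistic)


def try_schedule_opportunistic(worker, demand, threshold=0.8):
    """Place opportunistic work if actual usage, not allocation, leaves room."""
    projected = worker.used + worker.opportunistic_load() + demand
    if projected <= threshold * worker.capacity:
        worker.opportunistic.append(demand)
        return True
    return False


def preempt_if_oversubscribed(worker, threshold=0.95):
    """Preempt the newest opportunistic containers until usage is tenable again."""
    preempted = []
    while worker.opportunistic and (
        worker.used + worker.opportunistic_load() > threshold * worker.capacity
    ):
        preempted.append(worker.opportunistic.pop())
    return preempted
```

A node that is fully allocated but only 30% used can still accept opportunistic work here, and gives it back once the regular tasks ramp up.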
-
25.
Publication No.: US20210334301A1
Publication date: 2021-10-28
Application No.: US17367194
Filing date: 2021-07-02
Applicant: Cloudera, Inc.
Inventor: Sudhanshu Arora , Mark Donsky , Guang Yao Leng , Naren Koneru , Chang She , Vikas Singh , Himabindu Vuppula
Abstract: Transient computing clusters can be temporarily provisioned in cloud-based infrastructure to run data processing tasks. Such tasks may be run by services operating in the clusters that consume and produce data including operational metadata. Techniques are introduced for tracking data lineage across multiple clusters, including transient computing clusters, based on the operational metadata. In some embodiments, operational metadata is extracted from the transient computing clusters and aggregated at a metadata system for analysis. Based on the analysis of the metadata, operations can be summarized at a cluster level even if the transient computing cluster no longer exists. Further relationships between workflows, such as dependencies or redundancies, can be identified and utilized to optimize the provisioning of computing clusters and tasks performed by the computing clusters.
-
26.
Publication No.: US11151135B1
Publication date: 2021-10-19
Application No.: US15230240
Filing date: 2016-08-05
Applicant: Cloudera, Inc.
Inventor: Douglas J. Cameron
IPC: G06F16/178 , G06F16/2453 , G06F16/951
Abstract: A pre-computed result module computes a result prior to receiving a query. The pre-computed result module includes instructions executed by a processor to assess a pre-computation query to designate each identified database source that contributes to the answer to the pre-computation query and corresponding database source metadata. A metadata signature is computed for each identified database source to create a store of identified database sources and corresponding metadata signatures. The query is evaluated to identify accessed database sources responsive to the query. A current metadata signature for each accessed database source is compared to the metadata signatures to identify each updated database source. Re-computed results are formed for each updated database source. Pre-computed results are utilized for each database source that is not updated. A response is supplied to the query using the re-computed results and the pre-computed results.
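The signature comparison in this abstract can be sketched as follows. The use of SHA-256 over canonical JSON and the metadata fields in the usage lines are assumptions chosen for illustration; the abstract does not say how a metadata signature is computed.

```python
import hashlib
import json


def metadata_signature(metadata):
    """Hash a source's metadata dict into a stable signature (assumed scheme)."""
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def updated_sources(stored_signatures, current_metadata):
    """Return sources whose current signature differs from the stored one."""
    return [
        source
        for source, metadata in current_metadata.items()
        if stored_signatures.get(source) != metadata_signature(metadata)
    ]


# Signatures captured when the result was pre-computed (hypothetical metadata fields).
at_precompute = {
    "orders": {"rows": 1000, "modified": "2016-08-01"},
    "customers": {"rows": 50, "modified": "2016-07-15"},
}
stored = {name: metadata_signature(meta) for name, meta in at_precompute.items()}

# At query time, only "orders" has changed, so only it needs re-computation.
at_query = dict(at_precompute, orders={"rows": 1010, "modified": "2016-08-04"})
stale = updated_sources(stored, at_query)
```

Sources not in `stale` can be answered from the pre-computed results, and the two partial answers are then combined.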
-
Publication No.: US11108661B1
Publication date: 2021-08-31
Application No.: US16280397
Filing date: 2019-02-20
Applicant: Cloudera, Inc.
Inventor: Charu Anchlia , Sushil Thomas
IPC: G06F15/173 , H04L12/26 , G06F16/9538 , H04L29/06 , H04L12/24
Abstract: A machine has a bus and a network interface circuit to receive different data streams from a network. The network interface circuit is connected to the network and the bus. A processor is connected to the bus. A memory is connected to the bus. The memory stores instructions executed by the processor to continuously increment aggregate functions associated with data parameters within the different data streams. Visualizations of the different data streams are periodically updated on different client devices connected to the network.
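Continuously incrementing aggregate functions, as this abstract puts it, means folding each arriving value into running state rather than rescanning the stream. A minimal sketch, with a hypothetical `RunningAggregate` class and an assumed choice of aggregates (count, sum, min, max, mean):

```python
class RunningAggregate:
    """Aggregates for one data parameter, updated in O(1) per incoming value."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = None
        self.maximum = None

    def update(self, value):
        """Fold one new value from the stream into the running state."""
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def mean(self):
        """Mean of the values seen so far, or None before the first value."""
        return self.total / self.count if self.count else None

    def snapshot(self):
        """State a visualization could poll on its periodic refresh."""
        return {
            "count": self.count,
            "sum": self.total,
            "min": self.minimum,
            "max": self.maximum,
            "mean": self.mean,
        }
```

A periodic dashboard refresh would then read `snapshot()` for each stream instead of recomputing over the raw data.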
-
Publication No.: US10514948B2
Publication date: 2019-12-24
Application No.: US15808805
Filing date: 2017-11-09
Applicant: Cloudera, Inc.
Inventor: Vikas Singh , Sudhanshu Arora , Philip Zeyliger , Marcelo Masiero Vanzin , Chang She
Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.
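Generating lineage from run-time artifacts can be pictured as building a graph from per-job records of inputs and outputs. The record shape (`job`, `inputs`, `outputs`) and the function names below are hypothetical; this is a sketch of the general idea, not the patented metadata system.

```python
from collections import defaultdict


def build_lineage(job_records):
    """Map each output dataset to the datasets it was directly derived from."""
    parents = defaultdict(set)
    for record in job_records:
        for output in record["outputs"]:
            parents[output].update(record["inputs"])
    return parents


def upstream(parents, dataset):
    """Return every dataset that the given dataset transitively depends on."""
    seen, stack = set(), [dataset]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical artifacts extracted from two jobs of one workflow.
records = [
    {"job": "etl", "inputs": ["raw"], "outputs": ["clean"]},
    {"job": "report", "inputs": ["clean", "dim"], "outputs": ["summary"]},
]
lineage = build_lineage(records)
```

Walking the graph backwards from `summary` recovers the chain of jobs and inputs, which is the kind of design-time information the abstract says can be inferred.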
-
Publication No.: US20190278783A1
Publication date: 2019-09-12
Application No.: US16424083
Filing date: 2019-05-28
Applicant: Cloudera, Inc.
Inventor: Todd Lipcon
IPC: G06F16/27
Abstract: A compaction policy imposing soft limits to optimize system efficiency is used to select various rowsets on which to perform compaction, each rowset storing keys within an interval called a keyspace. For example, the disclosed compaction policy results in a decrease in a height of the tablet, removes overlapping rowsets, and creates smaller sized rowsets. The compaction policy is based on the linear relationship shared between the keyspace height and the cost associated with performing an operation (e.g., an insert operation) in that keyspace. Accordingly, various factors determining which rowsets are to be compacted, how large the compacted rowsets are to be made, and when to perform the compaction, are considered within the disclosed compaction policy. Furthermore, a system and method for performing compaction on the selected datasets in a log-structured database is also provided.
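To make the notion of height concrete, the sketch below treats each rowset as a closed key interval and takes the height at a key to be the number of rowsets covering it. That definition, and the function name `max_height`, are assumptions for illustration; the abstract itself only states that cost grows linearly with keyspace height.

```python
def max_height(rowsets):
    """Return the largest number of rowsets overlapping at any single key.

    rowsets -- list of (min_key, max_key) closed intervals
    """
    events = []
    for low, high in rowsets:
        events.append((low, 0))   # 0 = interval opens; sorts before a close at the same key
        events.append((high, 1))  # 1 = interval closes
    events.sort()

    height = best = 0
    for _, kind in events:
        if kind == 0:
            height += 1
            best = max(best, height)
        else:
            height -= 1
    return best


before = [(0, 5), (3, 8), (6, 10)]   # overlapping rowsets
after = [(0, 4), (5, 10)]            # same key range after a hypothetical compaction
```

Under this reading, an insert into `before` may have to consult two rowsets for some keys, while `after` needs at most one, which is the effect of removing overlaps.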
-
Publication No.: US10171635B2
Publication date: 2019-01-01
Application No.: US14462445
Filing date: 2014-08-18
Applicant: Cloudera, Inc.
Inventor: David Alves , Todd Lipcon
Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.