Accumulating and flushing mutations in a column store

    Publication number: US12222915B2

    Publication date: 2025-02-11

    Application number: US17314813

    Filing date: 2021-05-07

    Applicant: Cloudera, Inc.

    Inventor: Todd Lipcon

    Abstract: Columnar storage provides many performance and space-saving benefits for analytic workloads, but previous mechanisms for handling single-row update transactions in column stores suffer from poor performance. A columnar data layout facilitates both low-latency random access and high-throughput analytical access, simplifying Hadoop architectures for use cases involving real-time data. In disclosed embodiments, mutations within a single row are executed atomically across columns and do not necessarily include the entirety of a row. This allows for faster updates without the overhead of reading or rewriting larger columns.

    Apparatus and method for accelerated query processing using eager aggregation and analytical view matching

    Publication number: US11341134B2

    Publication date: 2022-05-24

    Application number: US16989687

    Filing date: 2020-08-10

    Applicant: Cloudera, Inc.

    Abstract: A system comprises a computer network and worker machines connected to the computer network. The worker machines store partitions of a distributed database. A master machine is connected to the computer network. The master machine includes a query processor to identify a star query that references a fact table and related dimension tables that characterize attributes of facts in the fact table. Eager aggregation is applied to a query plan associated with the star query. The eager aggregation alters the query plan by moving an aggregation operation before a join operation to form an eager aggregated query plan. An analytical view with data responsive to the eager aggregated query plan is identified. The eager aggregated query plan is revised to form a final query plan. The final query plan references the analytical view. The final query plan is executed to produce query results.
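    The eager aggregation step described in this abstract (moving an aggregation before a join) can be illustrated with a minimal sketch. The table contents, the column names (store_id, amount, region), and both function names are hypothetical and are not taken from the patent; the sketch only shows that, for a sum grouped by a dimension attribute, aggregating the fact rows by join key first yields the same result as joining first.

```python
from collections import defaultdict

# Hypothetical star-schema data: fact rows are (store_id, amount),
# and the dimension table maps store_id -> region.
FACT = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0)]
DIM = {1: "west", 2: "west", 3: "east"}


def join_then_aggregate(fact, dim):
    """Baseline plan: join every fact row to its dimension row, then sum."""
    totals = defaultdict(float)
    for store_id, amount in fact:
        totals[dim[store_id]] += amount
    return dict(totals)


def aggregate_then_join(fact, dim):
    """Eager plan: pre-aggregate fact rows by join key, then join and sum."""
    per_store = defaultdict(float)
    for store_id, amount in fact:
        per_store[store_id] += amount
    totals = defaultdict(float)
    # Only one row per store reaches the join instead of one row per fact.
    for store_id, subtotal in per_store.items():
        totals[dim[store_id]] += subtotal
    return dict(totals)


baseline = join_then_aggregate(FACT, DIM)
eager = aggregate_then_join(FACT, DIM)
```

    In this sketch the per-store subtotals play the role of the data that a pre-built analytical view could supply, which is an interpretation of the abstract rather than a statement of the claimed method.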

    UTILIZATION-AWARE RESOURCE SCHEDULING IN A DISTRIBUTED COMPUTING CLUSTER

    Publication number: US20210349755A1

    Publication date: 2021-11-11

    Application number: US17379742

    Filing date: 2021-07-19

    Applicant: Cloudera, Inc.

    Inventor: Karthik Kambatla

    Abstract: Embodiments are disclosed for a utilization-aware approach to cluster scheduling that addresses resource fragmentation and improves cluster utilization and job throughput. In some embodiments, a resource manager at a master node considers actual usage of running tasks and schedules opportunistic work on underutilized worker nodes. The resource manager monitors resource usage on these nodes and preempts opportunistic containers in the event this over-subscription becomes untenable. In doing so, the resource manager effectively utilizes wasted resources, while minimizing adverse effects on regularly scheduled tasks.
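    The two decisions named in this abstract (scheduling opportunistic work on an underutilized node, and preempting it when over-subscription becomes untenable) might be sketched as follows. The function names, the threshold values, and the newest-first preemption order are all assumptions made for illustration; the abstract does not specify them.

```python
def can_schedule_opportunistic(capacity, actual_usage, request, threshold=0.8):
    """Admit opportunistic work only if measured usage plus the request
    stays within an assumed fraction of the node's capacity."""
    return actual_usage + request <= threshold * capacity


def containers_to_preempt(capacity, actual_usage, opportunistic, threshold=0.9):
    """Pick opportunistic containers to preempt, newest first, until the
    node's usage is back within the assumed limit.

    `opportunistic` lists container sizes in launch order (an assumption).
    """
    preempted = []
    usage = actual_usage
    for size in reversed(opportunistic):
        if usage <= threshold * capacity:
            break
        preempted.append(size)
        usage -= size
    return preempted


# A node with capacity 100 and measured usage 50 can take a request of 20.
admit = can_schedule_opportunistic(100, 50, 20)
# The same node at usage 98 sheds containers until usage is at most 90.
victims = containers_to_preempt(100, 98, [5, 5, 5])
```

    Regular containers are never candidates for preemption in this sketch, which reflects the abstract's stated goal of minimizing effects on regularly scheduled tasks.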

    Apparatus and method for utilizing pre-computed results for query processing in a distributed database

    Publication number: US11151135B1

    Publication date: 2021-10-19

    Application number: US15230240

    Filing date: 2016-08-05

    Applicant: Cloudera, Inc.

    Abstract: A pre-computed result module computes a result prior to receiving a query. The pre-computed result module includes instructions executed by a processor to assess a pre-computation query to designate each identified database source that contributes to the answer to the pre-computation query and corresponding database source metadata. A metadata signature is computed for each identified database source to create a store of identified database sources and corresponding metadata signatures. The query is evaluated to identify accessed database sources responsive to the query. A current metadata signature for each accessed database source is compared to the metadata signatures to identify each updated database source. Re-computed results are formed for each updated database source. Pre-computed results are utilized for each database source that is not updated. A response is supplied to the query using the re-computed results and the pre-computed results.
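    The signature comparison in this abstract can be sketched briefly. The choice of SHA-256 over canonical JSON, the metadata fields (row_count, modified), and the function names are assumptions for illustration only; the abstract does not say how a metadata signature is computed.

```python
import hashlib
import json


def metadata_signature(metadata):
    """Hash a source's metadata into a stable signature (assumed scheme)."""
    canonical = json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def split_sources(stored_signatures, current_metadata):
    """Compare current signatures with stored ones.

    Returns (updated, unchanged): sources needing re-computation and
    sources whose pre-computed results can be reused.
    """
    updated, unchanged = [], []
    for source, metadata in current_metadata.items():
        if stored_signatures.get(source) == metadata_signature(metadata):
            unchanged.append(source)
        else:
            updated.append(source)
    return updated, unchanged


# Signatures stored at pre-computation time (hypothetical sources).
at_precompute = {
    "part_2016_07": {"row_count": 100, "modified": "2016-07-31"},
    "part_2016_08": {"row_count": 40, "modified": "2016-08-04"},
}
stored = {name: metadata_signature(m) for name, m in at_precompute.items()}

# At query time, one source has received new rows.
at_query = {
    "part_2016_07": {"row_count": 100, "modified": "2016-07-31"},
    "part_2016_08": {"row_count": 55, "modified": "2016-08-05"},
}
updated, unchanged = split_sources(stored, at_query)
```

    Only the sources in `updated` would be re-computed; results for the sources in `unchanged` would be served from the pre-computed store.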

    Information based on run-time artifacts in a distributed computing cluster

    Publication number: US10514948B2

    Publication date: 2019-12-24

    Application number: US15808805

    Filing date: 2017-11-09

    Applicant: Cloudera, Inc.

    Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.

    COMPACTION POLICY

    Type: Invention application

    Status: Under examination (published)

    Publication number: US20190278783A1

    Publication date: 2019-09-12

    Application number: US16424083

    Filing date: 2019-05-28

    Applicant: Cloudera, Inc.

    Inventor: Todd Lipcon

    Abstract: A compaction policy imposing soft limits to optimize system efficiency is used to select various rowsets on which to perform compaction, each rowset storing keys within an interval called a keyspace. For example, the disclosed compaction policy results in a decrease in a height of the tablet, removes overlapping rowsets, and creates smaller sized rowsets. The compaction policy is based on the linear relationship shared between the keyspace height and the cost associated with performing an operation (e.g., an insert operation) in that keyspace. Accordingly, various factors determining which rowsets are to be compacted, how large the compacted rowsets are to be made, and when to perform the compaction, are considered within the disclosed compaction policy. Furthermore, a system and method for performing compaction on the selected datasets in a log-structured database is also provided.
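    The notion of keyspace height used in this abstract can be made concrete with a small sketch. It assumes that the height at a key is the number of rowsets whose key interval covers that key, and that rowsets are closed intervals given as (low, high) pairs; both the definition and the function names are assumptions, not text from the patent.

```python
def height_at(rowsets, key):
    """Count the rowsets whose closed key interval contains `key`."""
    return sum(1 for low, high in rowsets if low <= key <= high)


def max_height(rowsets):
    """Largest number of rowsets overlapping at any single key.

    For closed intervals the maximum is always reached at some rowset's
    lower bound, so only those keys need to be checked.
    """
    return max((height_at(rowsets, low) for low, _ in rowsets), default=0)


# Three overlapping rowsets and one disjoint rowset (hypothetical keys).
before = [("a", "f"), ("c", "h"), ("e", "k"), ("m", "p")]
# The same key range after the overlapping rowsets are compacted into
# non-overlapping ones.
after = [("a", "e"), ("f", "k"), ("m", "p")]

height_before = max_height(before)
height_after = max_height(after)
```

    Under the linear cost relationship the abstract mentions, an operation on a key covered by three rowsets would be expected to cost roughly three times as much as one covered by a single rowset, which is why reducing height is a goal of the policy.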

    Ensuring properly ordered events in a distributed computing environment

    Publication number: US10171635B2

    Publication date: 2019-01-01

    Application number: US14462445

    Filing date: 2014-08-18

    Applicant: Cloudera, Inc.

    Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
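    The timestamp adjustment in this abstract can be sketched as follows. The Node class, integer clock ticks, and the "sender time plus one" rule for the alternate time are assumptions for illustration; the abstract only states that the alternate second time is based on the first time.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A computer with its own, possibly skewed, local clock (hypothetical)."""
    name: str
    clock: int  # current local clock reading in arbitrary ticks

    def local_event(self):
        """Timestamp an event with the local clock reading."""
        return self.clock

    def receive(self, sender_time):
        """Timestamp an event triggered by a message carrying sender_time.

        If the sender's time is not earlier than the local reading, use an
        alternate time derived from the sender's time (assumed: plus one
        tick) so the caused event still orders after its cause.
        """
        local_time = self.clock
        if sender_time >= local_time:
            return sender_time + 1
        return local_time


first = Node("first", clock=105)    # this clock runs ahead
second = Node("second", clock=100)  # this clock lags

first_time = first.local_event()
second_time = second.receive(first_time)  # alternate time replaces 100

# A third system orders the two events by comparing the reported times.
ordered = sorted([("second event", second_time), ("first event", first_time)],
                 key=lambda pair: pair[1])
```

    When the receiver's clock is already ahead of the sender's timestamp, the local reading is used unchanged, so the adjustment only applies in the clock-error case the abstract describes.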
