-
Publication No.: US20250103229A1
Publication date: 2025-03-27
Application No.: US18436810
Filing date: 2024-02-08
Applicant: Cloudera, Inc.
Inventor: Yida Wu, Abhishek Rawat, Vincent Kulandaisamy
IPC: G06F3/06
Abstract: Examples disclosed herein include writing pages of data to blocks, the data associated with an operator; writing the blocks to a file based on a sequential arrangement of the data in the blocks; writing the file to a spill data store; and executing an instruction by programmable circuitry to batch read the blocks in sequential order from the spill data store to a local memory.
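The write path and batched sequential read described in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the fixed page size, the local temporary directory standing in for the spill data store, and the names `spill`, `batch_read`, and `round_trip` are all assumptions made for illustration.

```python
import os
import tempfile

PAGE_SIZE = 8  # bytes per page (hypothetical fixed size)


def spill(pages, path, pages_per_block=4):
    """Group pages into blocks and write the blocks to one file in order."""
    with open(path, "wb") as f:
        for start in range(0, len(pages), pages_per_block):
            block = b"".join(pages[start:start + pages_per_block])
            f.write(block)


def batch_read(path, pages_per_block=4, blocks_per_batch=2):
    """Yield pages back in order, reading several blocks per read call."""
    batch_bytes = PAGE_SIZE * pages_per_block * blocks_per_batch
    with open(path, "rb") as f:
        while True:
            chunk = f.read(batch_bytes)
            if not chunk:
                break
            for offset in range(0, len(chunk), PAGE_SIZE):
                yield chunk[offset:offset + PAGE_SIZE]


def round_trip(pages):
    """Spill pages to a temporary directory (assumed spill store) and read them back."""
    with tempfile.TemporaryDirectory() as spill_dir:
        path = os.path.join(spill_dir, "operator.spill")
        spill(pages, path)
        return list(batch_read(path))
```

Because the blocks are laid out in the same order as the data, each read call can fetch several blocks with one sequential request rather than one request per page.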
-
Publication No.: US12169507B2
Publication date: 2024-12-17
Application No.: US16424083
Filing date: 2019-05-28
Applicant: Cloudera, Inc.
Inventor: Todd Lipcon
IPC: G06F16/27
Abstract: A compaction policy imposing soft limits to optimize system efficiency is used to select various rowsets on which to perform compaction, each rowset storing keys within an interval called a keyspace. For example, the disclosed compaction policy results in a decrease in a height of the tablet, removes overlapping rowsets, and creates smaller sized rowsets. The compaction policy is based on the linear relationship shared between the keyspace height and the cost associated with performing an operation (e.g., an insert operation) in that keyspace. Accordingly, various factors determining which rowsets are to be compacted, how large the compacted rowsets are to be made, and when to perform the compaction, are considered within the disclosed compaction policy. Furthermore, a system and method for performing compaction on the selected datasets in a log-structured database is also provided.
-
Publication No.: US12105712B2
Publication date: 2024-10-01
Application No.: US18305715
Filing date: 2023-04-24
Applicant: Cloudera, Inc.
Inventor: Alexander Behm, Mostafa Mokhtar
IPC: G06F16/2453, G06F16/2458, G06F16/835
CPC classification number: G06F16/24545, G06F16/24547, G06F16/2471, G06F16/8373, G06F16/24549
Abstract: The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes 1) gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.
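The four steps in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: exact distinct counts on sample prefixes stand in for the intermediate probabilistic estimates, the fitted function is assumed to be a power law (a straight line in log-log space), and `estimate_ndv` and its parameters are hypothetical names.

```python
import math
import random


def estimate_ndv(values, total_rows, sample_sizes, seed=0):
    """Extrapolate a distinct-value count from samples of increasing size."""
    rng = random.Random(seed)
    # One random sample; each prefix of it is itself a random sample.
    sample = rng.sample(values, max(sample_sizes))

    # Steps 1 and 2: (sample size, distinct count) data points.
    # Exact counts are used here in place of probabilistic estimates (assumption).
    points = [(n, len(set(sample[:n]))) for n in sample_sizes]

    # Step 3: least-squares fit of log(distinct) = intercept + slope * log(n).
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(d) for _, d in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    # Step 4: evaluate the fitted function at the total row count.
    return math.exp(intercept + slope * math.log(total_rows))
```

The sketch needs at least two different sample sizes, each no larger than `len(values)`. A power law keeps growing with the row count, so it can overestimate when the column has a bounded set of values; a different function family may fit such data better.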
-
Publication No.: US20230350906A1
Publication date: 2023-11-02
Application No.: US18127322
Filing date: 2023-03-28
Applicant: Cloudera, Inc.
Inventor: Rituparna Agrawal, Anupam Singh, Prithviraj Pandian
IPC: G06F16/248, G06F16/84, G06F16/21, G06F16/28, G06F16/2455
CPC classification number: G06F16/248, G06F16/86, G06F16/211, G06F16/285, G06F16/2455
Abstract: Systems and methods for very fast grouping of “similar” SQL queries according to user-supplied similarity criteria. The user-supplied similarity criteria include a threshold quantifying the degree of similarity between SQL queries and common artifacts included in the queries. A similarity-characterizing data structure allows for the very fast grouping of “similar” SQL queries. Because the computation is distributed among multiple compute nodes, a small cluster of compute nodes takes a short time to compute the similarity-characterizing data on a workload of tens of millions of queries. The user can supply the similarity criteria through a UI or a command line tool. Furthermore, the user can adjust the degree of similarity by supplying new similarity criteria. Accordingly, the system can display in real time or near real time, updated SQL groupings corresponding to the newly supplied similarity criteria using the originally computed similarity-characterizing data structure.
-
Publication No.: US20230350894A1
Publication date: 2023-11-02
Application No.: US18305715
Filing date: 2023-04-24
Applicant: Cloudera, Inc.
Inventor: Alexander Behm, Mostafa Mokhtar
IPC: G06F16/2453, G06F16/2458, G06F16/835
CPC classification number: G06F16/24545, G06F16/2471, G06F16/8373, G06F16/24547, G06F16/24549
Abstract: The problem of distinct value estimation has many applications, but is particularly important in the field of database technology where such information is utilized by query planners to generate and optimize query plans. Introduced is a novel technique for estimating the number of distinct values in a given dataset without scanning all of the values in the dataset. In an example embodiment, the introduced technique includes 1) gathering multiple intermediate probabilistic estimates based on varying samples of the dataset, 2) plotting the multiple intermediate probabilistic estimates against indications of sample size, 3) fitting a function to the plotted data points, and 4) determining an overall distinct value estimate by extrapolating the objective function to an estimated or known total number of values in the dataset.
-
Publication No.: US11663033B2
Publication date: 2023-05-30
Application No.: US17179155
Filing date: 2021-02-18
Applicant: Cloudera, Inc.
Inventor: Vikas Singh, Sudhanshu Arora, Philip Zeyliger, Marcelo Masiero Vanzin, Chang She
CPC classification number: G06F9/46, G06F16/211, G06F16/14
Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.
-
Publication No.: US11301210B2
Publication date: 2022-04-12
Application No.: US16775141
Filing date: 2020-01-28
Applicant: Cloudera, Inc.
Inventor: Adar Lieber-Dembo, Todd Lipcon
Abstract: A technique is described for merging multiple lists of ordinal elements such as keys into a sorted output. In an example embodiment, a merge window is defined, based on the bounds of the multiple lists of ordinal elements, that is representative of a portion of an overall element space associated with the multiple lists. Lists of elements to be sorted can be placed into one of at least two different heaps based on whether they overlap the merge window. For example, lists that overlap the merge window may be placed into an active or “hot” heap, while lists that do not overlap the merge window may be placed into a separate inactive or “cold” heap. A sorted output can then be generated by iteratively processing the active heap. As the processing of the active heap progresses, the merge window advances, and lists may move between the active and inactive heaps.
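The hot and cold heap arrangement in this abstract can be sketched as below. It is a simplified illustration, not the patented implementation: the merge window is assumed to end at the smallest upper bound among the lists currently in the hot heap, each input list is assumed to be sorted already, and `merge_with_window` is a hypothetical name.

```python
import heapq


def merge_with_window(lists):
    """Merge sorted lists into one sorted list using hot and cold heaps."""
    # Cold heap: lists not yet overlapping the window, keyed by lower bound.
    cold = [(lst[0], idx) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(cold)
    hot = []  # entries: (current element, list index, position in list)
    out = []

    while hot or cold:
        if not hot:
            # Start a new window at the cold list with the smallest lower bound.
            _, idx = heapq.heappop(cold)
            heapq.heappush(hot, (lists[idx][0], idx, 0))

        # Assumed window end: smallest upper bound among the hot lists.
        window_end = min(lists[idx][-1] for _value, idx, _pos in hot)

        # Promote cold lists whose lower bound falls inside the window.
        while cold and cold[0][0] <= window_end:
            _, idx = heapq.heappop(cold)
            heapq.heappush(hot, (lists[idx][0], idx, 0))

        # Emit the smallest current element and advance that list.
        value, idx, pos = heapq.heappop(hot)
        out.append(value)
        if pos + 1 < len(lists[idx]):
            heapq.heappush(hot, (lists[idx][pos + 1], idx, pos + 1))
        # An exhausted list leaves the hot heap, so the window end is
        # recomputed on the next iteration and the window advances.

    return out
```

Every list left in the cold heap starts after the window end, so the smallest element in the hot heap is always safe to emit. The benefit is that comparisons are limited to the lists that can actually contribute the next element. Recomputing the window end by scanning the hot heap keeps the sketch short; a production version would track it incrementally.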
-
Publication No.: US20210141602A1
Publication date: 2021-05-13
Application No.: US16775141
Filing date: 2020-01-28
Applicant: Cloudera, Inc.
Inventor: Adar Lieber-Dembo, Todd Lipcon
Abstract: A technique is described for merging multiple lists of ordinal elements such as keys into a sorted output. In an example embodiment, a merge window is defined, based on the bounds of the multiple lists of ordinal elements, that is representative of a portion of an overall element space associated with the multiple lists. Lists of elements to be sorted can be placed into one of at least two different heaps based on whether they overlap the merge window. For example, lists that overlap the merge window may be placed into an active or “hot” heap, while lists that do not overlap the merge window may be placed into a separate inactive or “cold” heap. A sorted output can then be generated by iteratively processing the active heap. As the processing of the active heap progresses, the merge window advances, and lists may move between the active and inactive heaps.
-
Publication No.: US20200304610A1
Publication date: 2020-09-24
Application No.: US16895947
Filing date: 2020-06-08
Applicant: Cloudera, Inc.
Inventor: David Alves, Todd Lipcon
Abstract: A first event occurs at a first computer at a first time, as measured by a local clock. A second event is initiated at a second computer by sending a message that includes the first time. The second event occurs at a second time, as measured by a local clock. Because of clock error, the first time is later than the second time. Based on the first time being later than the second time, an alternate second time, that is based on the first time, is used as the time of the second event. When a third system determines the order of the two events, the first time is obtained from the first computer, and the alternate second time is obtained from the second computer, and the order of the events is determined based on a comparison of the two times.
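The timestamp adjustment in this abstract can be sketched in a few lines. This is an illustration only: the integer tick representation, the choice of "received time plus one tick" as the alternate time, the handling of equal times, and the name `assign_event_time` are assumptions not stated in the abstract.

```python
def assign_event_time(local_time, received_time=None):
    """Pick the timestamp for an event, given the time carried by the triggering message."""
    if received_time is not None and received_time >= local_time:
        # The sender's clock reads later than ours even though its event
        # happened first, so use an alternate time based on the received one.
        return received_time + 1
    return local_time
```

For example, if the first computer stamps its event at 150 because its clock runs fast, and the second computer's local clock reads 101 when the message arrives, the second event is recorded at 151 instead of 101. A third system comparing 150 and 151 then recovers the true order.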
-
Publication No.: US20200065136A1
Publication date: 2020-02-27
Application No.: US16667609
Filing date: 2019-10-29
Applicant: Cloudera, Inc.
Inventor: Vikas Singh, Sudhanshu Arora, Philip Zeyliger, Marcelo Masiero Vanzin, Chang She
Abstract: Techniques are disclosed for inferring design-time information based on run-time artifacts generated by services operating in a distributed computing cluster. In an embodiment, a metadata system extracts metadata including run-time artifacts generated by services in a distributed computing cluster while processing a workflow including multiple jobs. The extracted metadata is processed to identify entities and entity relationships which can then be used to generate lineage information. Using the lineage information, the metadata system can infer design-time information associated with the workflow. The inferred design-time information can then be utilized to, for example, recreate the workflow, recreate previous versions of the workflow, optimize the workflow, etc.