Manifest-based snapshots in distributed computing environments

    Publication No.: US09690671B2

    Publication date: 2017-06-27

    Application No.: US14527563

    Application date: 2014-10-29

    Applicant: Cloudera, Inc.

    CPC classification number: G06F11/1464 G06F11/1456 G06F17/30575 G06F2201/84

    Abstract: Scalable architectures, systems, and services are provided herein for creating manifest-based snapshots in distributed computing environments. In some embodiments, responsive to receiving a request to create a snapshot of a data object, a master node identifies multiple slave nodes on which a data object is stored in the cloud-computing platform and creates a snapshot manifest representing the snapshot of the data object. The snapshot manifest comprises a file including a listing of multiple file names in the snapshot manifest and reference information for locating the multiple files in the distributed database system. The snapshot can be created without disrupting I/O operations, e.g., in an online mode by various region servers as directed by the master node. Additionally, a log roll approach to creating the snapshot is also disclosed in which log files are marked. The replaying of log entries can reduce the probability of causal consistency in the snapshot.
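
    Illustrative sketch (not part of the patent text): the abstract describes a snapshot manifest as a file that lists file names together with reference information for locating those files. A minimal Python sketch of that idea follows, assuming a JSON-style layout; the field names (`name`, `node`, `path`), the helper names, and the example paths are all hypothetical and are not taken from the patent.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FileRef:
    """One manifest entry: a file name plus where to find it (hypothetical fields)."""
    name: str
    node: str  # slave node / region server that holds the file
    path: str  # location of the file on that node


def build_manifest(snapshot_name, files_by_node):
    """Build a manifest that references existing files instead of copying them."""
    entries = [
        FileRef(name=path.rsplit("/", 1)[-1], node=node, path=path)
        for node, paths in sorted(files_by_node.items())
        for path in paths
    ]
    return {"snapshot": snapshot_name, "files": [asdict(e) for e in entries]}


def locate(manifest, file_name):
    """Return (node, path) pairs for a file name listed in the manifest."""
    return [
        (f["node"], f["path"])
        for f in manifest["files"]
        if f["name"] == file_name
    ]


if __name__ == "__main__":
    manifest = build_manifest("snap-1", {
        "rs1": ["/data/t1/r1/hfile-a"],
        "rs2": ["/data/t1/r2/hfile-b", "/data/t1/r2/hfile-c"],
    })
    print(json.dumps(manifest, indent=2))
```

    Because the manifest only records references, creating it in this sketch does not touch the underlying data files, which is consistent with the abstract's claim that a snapshot can be taken without disrupting I/O.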

    COLLECTING AND AGGREGATING LOG DATA WITH FAULT TOLERANCE
    84. Patent application; legal status: granted

    Publication No.: US20160275136A1

    Publication date: 2016-09-22

    Application No.: US15170824

    Application date: 2016-06-01

    Applicant: Cloudera, Inc.

    Abstract: Systems and methods of collecting and aggregating log data with fault tolerance are disclosed. One embodiment includes, one or more devices that generate log data, the one or more machines each associated with an agent node to collect the log data, wherein, the agent node generates a batch comprising multiple messages from the log data and assigns a tag to the batch. In one embodiment, the agent node further computes a checksum for the batch of multiple messages. The system may further include a collector device, the collector device being associated with a collector tier having a collector node to which the agent sends the log data; wherein, the collector determines the checksum for the batch of multiple messages received from the agent node.
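
    Illustrative sketch (not part of the patent text): the abstract describes an agent that groups messages into a tagged batch and computes a checksum, and a collector that determines the checksum of the batch it receives. The Python sketch below shows one way this could look. The use of CRC32, the length-prefix framing, the random tag, and all class and function names are assumptions; the abstract does not name an algorithm or a tag format.

```python
import uuid
import zlib
from dataclasses import dataclass


@dataclass
class Batch:
    tag: str        # identifier the agent assigns to the batch
    messages: list  # log messages grouped by the agent
    checksum: int   # checksum computed by the agent


def _checksum(messages):
    """CRC32 over length-prefixed messages (algorithm choice is an assumption)."""
    crc = 0
    for message in messages:
        data = message.encode("utf-8")
        # The length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        crc = zlib.crc32(len(data).to_bytes(4, "big") + data, crc)
    return crc


def make_batch(messages, tag=None):
    """Agent side: group messages, assign a tag, compute the checksum."""
    msgs = list(messages)
    return Batch(tag=tag or uuid.uuid4().hex, messages=msgs, checksum=_checksum(msgs))


def collector_accepts(batch):
    """Collector side: recompute the checksum and compare with the agent's."""
    return _checksum(batch.messages) == batch.checksum
```

    In this sketch a mismatch would signal that the batch was damaged in transit, and the tag would let the agent know which batch to resend.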

    COMPACTION POLICY
    85. Patent application; legal status: under examination (published)

    Publication No.: US20160275094A1

    Publication date: 2016-09-22

    Application No.: US15073509

    Application date: 2016-03-17

    Applicant: Cloudera, Inc.

    Inventor: Todd Lipcon

    CPC classification number: G06F16/278

    Abstract: A compaction policy imposing soft limits to optimize system efficiency is used to select various rowsets on which to perform compaction, each rowset storing keys within an interval called a keyspace. For example, the disclosed compaction policy results in a decrease in a height of the tablet, removes overlapping rowsets, and creates smaller sized rowsets. The compaction policy is based on the linear relationship shared between the keyspace height and the cost associated with performing an operation (e.g., an insert operation) in that keyspace. Accordingly, various factors determining which rowsets are to be compacted, how large the compacted rowsets are to be made, and when to perform the compaction, are considered within the disclosed compaction policy. Furthermore, a system and method for performing compaction on the selected datasets in a log-structured database is also provided.
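
    Illustrative sketch (not part of the patent text): the abstract ties operation cost to the "height" of the keyspace and says compaction removes overlapping rowsets. The Python sketch below assumes that the height at a key is the number of rowsets whose closed key interval contains that key; this definition, the tuple representation of a rowset, and the function names are assumptions made for illustration.

```python
def height_at(rowsets, key):
    """Number of rowsets whose [lo, hi] key interval contains `key`."""
    return sum(1 for lo, hi in rowsets if lo <= key <= hi)


def max_height(rowsets):
    """Largest number of rowsets overlapping at any single key (sweep line)."""
    events = []
    for lo, hi in rowsets:
        events.append((lo, 0))  # 0 = open; sorts before a close at the same key
        events.append((hi, 1))  # 1 = close
    events.sort()
    depth = best = 0
    for _, kind in events:
        if kind == 0:
            depth += 1
            best = max(best, depth)
        else:
            depth -= 1
    return best


if __name__ == "__main__":
    before = [("a", "f"), ("d", "k"), ("e", "g"), ("m", "z")]
    after = [("a", "d"), ("e", "k"), ("m", "z")]  # overlapping rowsets merged
    print(max_height(before), max_height(after))
```

    Under this assumption, an insert at a key must consult every rowset covering that key, so lowering the height by compacting overlapping rowsets lowers the per-operation cost in proportion.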

    MANIFEST-BASED SNAPSHOTS IN DISTRIBUTED COMPUTING ENVIRONMENTS
    87. Patent application; legal status: granted

    Publication No.: US20150127608A1

    Publication date: 2015-05-07

    Application No.: US14527563

    Application date: 2014-10-29

    Applicant: Cloudera, Inc.

    CPC classification number: G06F11/1464 G06F11/1456 G06F17/30575 G06F2201/84

    Abstract: Scalable architectures, systems, and services are provided herein for creating manifest-based snapshots in distributed computing environments. In some embodiments, responsive to receiving a request to create a snapshot of a data object, a master node identifies multiple slave nodes on which a data object is stored in the cloud-computing platform and creates a snapshot manifest representing the snapshot of the data object. The snapshot manifest comprises a file including a listing of multiple file names in the snapshot manifest and reference information for locating the multiple files in the distributed database system. The snapshot can be created without disrupting I/O operations, e.g., in an online mode by various region servers as directed by the master node. Additionally, a log roll approach to creating the snapshot is also disclosed in which log files are marked. The replaying of log entries can reduce the probability of causal consistency in the snapshot.

    BACKGROUND FORMAT OPTIMIZATION FOR ENHANCED SQL-LIKE QUERIES IN HADOOP
    88. Patent application; legal status: granted

    Publication No.: US20150095308A1

    Publication date: 2015-04-02

    Application No.: US14043753

    Application date: 2013-10-01

    Applicant: Cloudera, Inc.

    Abstract: A format conversion engine for Apache Hadoop that converts data from its original format to a database-like format at certain time points for use by a low latency (LL) query engine. The format conversion engine comprises a daemon that is installed on each data node in a Hadoop cluster. The daemon comprises a scheduler and a converter. The scheduler determines when to perform the format conversion and notifies the converter when the time comes. The converter converts data on the data node from its original format to a database-like format for use by the low latency (LL) query engine.
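
    Illustrative sketch (not part of the patent text): the abstract describes a per-node daemon made of a scheduler, which decides when to convert, and a converter, which rewrites data into a database-like format. The Python sketch below models that split. The fixed-interval trigger, CSV as the "original format", a column-oriented dictionary as the "database-like format", and all names are assumptions.

```python
import csv
import io


class Converter:
    """Converts raw CSV text into a column-oriented dict (assumed target format)."""

    def __init__(self):
        self.converted = {}

    def convert(self, name, raw_text):
        rows = list(csv.DictReader(io.StringIO(raw_text)))
        fields = rows[0].keys() if rows else []
        columns = {field: [row[field] for row in rows] for field in fields}
        self.converted[name] = columns
        return columns


class Scheduler:
    """Decides when conversion should run and notifies the converter."""

    def __init__(self, converter, interval_s):
        self.converter = converter
        self.interval_s = interval_s
        self.last_run = None

    def tick(self, now, pending):
        """Convert all pending files if the interval has elapsed; return True if run."""
        if self.last_run is not None and now - self.last_run < self.interval_s:
            return False
        for name, raw_text in pending.items():
            self.converter.convert(name, raw_text)
        self.last_run = now
        return True
```

    Passing the current time into `tick` keeps the sketch deterministic; a real daemon would more likely base the decision on a clock, cluster load, or data arrival.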

    DATA NODE FENCING IN A DISTRIBUTED FILE SYSTEM
    89. Patent application; legal status: granted

    Publication No.: US20140081927A1

    Publication date: 2014-03-20

    Application No.: US14024585

    Application date: 2013-09-11

    Applicant: Cloudera, Inc.

    Abstract: Systems and methods for data node fencing in a distributed file system to prevent data inconsistencies and corruptions are disclosed. An embodiment includes implementing a protocol whereby data nodes detect a failover and determine an active name node based on transaction identifiers associated with transaction requests. The data nodes also provide to the active name node block location information and an acknowledgment. The embodiment further includes a protocol whereby a name node refrains from issuing invalidation requests to the data nodes until the name node receives acknowledgments from all data nodes that are functional.
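
    Illustrative sketch (not part of the patent text): the abstract describes data nodes that infer the active name node from transaction identifiers, and a name node that holds back invalidation requests until every functional data node has acknowledged it. The Python sketch below assumes that a higher transaction identifier marks the newer name node; that rule, the class names, and the method names are assumptions.

```python
class DataNode:
    """Tracks the newest transaction id seen and rejects stale name nodes."""

    def __init__(self, name):
        self.name = name
        self.highest_txid = -1
        self.active_name_node = None

    def handle_request(self, name_node_id, txid):
        """Return True (acknowledge) if the request comes from the active name node."""
        stale = txid < self.highest_txid or (
            txid == self.highest_txid and name_node_id != self.active_name_node
        )
        if stale:
            return False  # fence off the old name node
        self.highest_txid = txid
        self.active_name_node = name_node_id  # a newer txid signals a failover
        return True


class NameNode:
    """Defers invalidation requests until all functional data nodes have acked."""

    def __init__(self, node_id, functional_data_nodes):
        self.node_id = node_id
        self.awaiting = set(functional_data_nodes)

    def record_ack(self, data_node_name):
        self.awaiting.discard(data_node_name)

    def may_issue_invalidations(self):
        return not self.awaiting
```

    Holding back invalidations in this way means the new name node cannot order block deletions based on an incomplete view of block locations.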

    APPROACHES TO OPTIMIZING COMPUTE RESOURCE ALLOCATION FOR HEAVY WORKLOADS IN ELASTIC ENVIRONMENTS AND CLOUD DATA PLATFORM FOR IMPLEMENTING THE SAME

    Publication No.: US20240378088A1

    Publication date: 2024-11-14

    Application No.: US18450221

    Application date: 2023-08-15

    Applicant: Cloudera, Inc.

    Abstract: Introduced here is a resource management platform (also called a “resource manager”) that is able to dynamically allocate compute resources to workloads to accommodate resource requirements of a tenant in a more efficient and cost-effective manner, especially in scenarios where compute resource availability is elastic in nature. The resource manager can include a scheduling engine and a recommending engine that together are able to optimize the scaling up and down of compute resources in different scenarios. Normally, the resource manager can communicate with a resource-aware, external entity that may be responsible for implementing appropriate changes on a cloud infrastructure. For example, the external entity may be responsible for adding or removing nodes assigned to a given tenant, as well as obtaining relevant attributes of those compute resources from a provider of the cloud infrastructure.
