DISTRIBUTED, FAULT-TOLERANT AND HIGHLY AVAILABLE COMPUTING SYSTEM
    Invention publication
    DISTRIBUTED, FAULT-TOLERANT AND HIGHLY AVAILABLE COMPUTING SYSTEM (granted)

    Publication No.: EP2156307A4

    Publication date: 2010-10-13

    Application No.: EP08742917

    Filing date: 2008-04-15

    Applicant: IBM

    Abstract: A method and system for achieving highly available, fault-tolerant execution of components in a distributed computing system, without requiring the writer of these components to explicitly write code (such as entity beans or database transactions) to make component state persistent. This is achieved by converting the intrinsically non-deterministic behavior of the distributed system into deterministic behavior, which allows state to be recovered by efficient checkpoint-replay techniques. The method comprises: adapting the execution environment to enable message communication among the components; and automatically associating a deterministic timestamp with each message communicated from a sender component to a receiver component during program execution, the timestamp representing the estimated time of arrival of the message at the receiver component. Each component tracks its state during program execution and periodically checkpoints that state to a local storage device. Upon failure of a component, its state is restored by recovering a recent stored checkpoint and re-executing the events that occurred since that checkpoint. Execution is deterministic because the receiving component, when re-executed, processes messages in the same order as their associated timestamps.
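
The checkpoint-replay idea in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Component` class, its method names, the order-sensitive state update, and the in-memory `log` (standing in for whatever mechanism re-supplies messages after a failure) are all assumptions made for illustration. It also assumes `step` is only called once no message with an earlier timestamp can still arrive.

```python
import copy
import heapq


class Component:
    """Hypothetical component; names and state layout are illustrative only."""

    def __init__(self):
        self.state = {"total": 0, "history": []}
        self.inbox = []  # min-heap of (timestamp, sender, payload)
        self.log = []    # messages processed since the last checkpoint
        self.checkpoint = copy.deepcopy(self.state)

    def deliver(self, timestamp, sender, payload):
        # The timestamp is assumed to be assigned deterministically by the sender.
        heapq.heappush(self.inbox, (timestamp, sender, payload))

    def step(self):
        # Process pending messages in timestamp order, not arrival order.
        while self.inbox:
            msg = heapq.heappop(self.inbox)
            self.log.append(msg)
            self._apply(msg)

    def _apply(self, msg):
        timestamp, sender, payload = msg
        # Deliberately order-sensitive, so a wrong replay order would be visible.
        self.state["total"] = self.state["total"] * 2 + payload
        self.state["history"].append((timestamp, sender))

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self):
        # Restore the last checkpoint, then replay logged messages by timestamp.
        self.state = copy.deepcopy(self.checkpoint)
        for msg in sorted(self.log):
            self._apply(msg)
```

Because messages are applied in timestamp order both during normal execution and during replay, the recovered state matches the state before the failure, regardless of the order in which messages originally arrived.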

    Distributed, fault-tolerant and highly available computing system

    Publication No.: AU2008244623B2

    Publication date: 2013-03-28

    Application No.: AU2008244623

    Filing date: 2008-04-15

    Applicant: IBM

    Distributed, fault-tolerant and highly available computing system

    Publication No.: AU2008244623A1

    Publication date: 2008-11-06

    Application No.: AU2008244623

    Filing date: 2008-04-15

    Applicant: IBM

    OPTIMISTIC RECOVERY IN A DISTRIBUTED PROCESSING SYSTEM

    Publication No.: CA1223369A

    Publication date: 1987-06-23

    Application No.: CA485189

    Filing date: 1985-06-25

    Applicant: IBM

    Abstract: In a distributed system whose state space is partitioned into recovery units, wherein recovery units communicate by the exchange of messages and wherein a message received by a recovery unit may causally depend on other recovery units having received prior messages, a method of recovering from failure of any number of recovery units in the system comprising the steps of: (a) tracking the dependency of each message received by a recovery unit in terms of the causative messages received by other recovery units in the system; and (b) restoring all recovery units to a consistent system-wide state after recovery unit failure by means of the tracked message dependencies.
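
Steps (a) and (b) can be sketched in Python as follows. The abstract does not say how dependencies are represented; the per-unit dependency vector piggybacked on each message, the `RecoveryUnit` class, and all method names below are assumptions chosen for illustration, not the patent's stated mechanism.

```python
class RecoveryUnit:
    """Hypothetical recovery unit that tracks causal dependencies per message."""

    def __init__(self, name):
        self.name = name
        self.received = 0       # number of messages received so far
        self.deps = {name: 0}   # unit -> highest receive index this state depends on
        self.history = []       # dependency snapshot taken after each receive

    def send(self):
        # Piggyback the current dependency vector on an outgoing message.
        return dict(self.deps)

    def receive(self, sender_deps):
        # Step (a): merge the sender's dependencies into our own.
        self.received += 1
        for unit, idx in sender_deps.items():
            self.deps[unit] = max(self.deps.get(unit, 0), idx)
        self.deps[self.name] = self.received
        self.history.append(dict(self.deps))

    def rollback_to_consistent(self, failed, surviving_index):
        # Step (b): undo receives that depend on messages `failed` has lost,
        # i.e. on receive indices greater than `surviving_index`.
        while self.history and self.history[-1].get(failed, 0) > surviving_index:
            self.history.pop()
        self.received = len(self.history)
        self.deps = dict(self.history[-1]) if self.history else {self.name: 0}
```

Dependencies only grow over time, so discarding snapshots from the end of `history` is enough to reach the most recent state that does not depend on a lost message. A real system would also have to restore the application state that goes with each snapshot, which this sketch omits.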
