Abstract:
A method and apparatus for detecting and tolerating situations in which one or more processors in a multi-processor system cannot participate in timer-driven or timer-triggered protocols or event sequences. The multi-processor system includes multiple processors each having a respective memory. These processors are coupled by an inter-processor communication network (preferably consisting of redundant paths). Processors are suspected of having failed (ceased operations) outright or having a failed timer mechanism when the other processors detect the absence of periodic "IamAlive" messages from them. When this happens, all of the processors in the system are subjected to a series of stages in which they repeatedly broadcast their status and their connectivity to each other. During the first such stage, according to the present invention, a processor will not assert its ability to participate unless its timer mechanism is working. It arms a timer expiration event and does not assert its health until and unless that timer expiration event occurs.
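The timer-gated health assertion described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `StageOneParticipant`, the helper `run_stage_one`, and the delay and wait values are all hypothetical names and numbers chosen for illustration. The sketch only shows the idea that a processor arms a timer and reports itself healthy only if the expiration event actually fires.

```python
import threading


class StageOneParticipant:
    """Hypothetical model of one processor during the first regroup stage."""

    def __init__(self, timer_works=True, delay=0.01):
        self.timer_works = timer_works      # assumption: models a working or failed timer
        self.delay = delay                  # assumption: illustrative delay in seconds
        self._expired = threading.Event()   # set only when the armed timer fires

    def arm_timer(self):
        # A processor whose timer mechanism has failed never sees the expiration.
        if self.timer_works:
            timer = threading.Timer(self.delay, self._expired.set)
            timer.daemon = True
            timer.start()

    def asserts_health(self, wait):
        # Health is asserted only once the timer expiration event has occurred.
        return self._expired.wait(timeout=wait)


def run_stage_one(timer_works):
    participant = StageOneParticipant(timer_works=timer_works)
    participant.arm_timer()
    return participant.asserts_health(wait=0.2)
```

In this sketch, `run_stage_one(True)` reports health after the timer fires, while `run_stage_one(False)` never does, so a processor with a dead timer stays silent instead of claiming it can participate.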
Abstract:
A method and apparatus for achieving maximal, full connection in a multi-processor system having a plurality of processors. Each of the multiple processors has a respective memory. The invention includes communicatively connecting the processors. Following a disruption (805) in the communicative connection, the invention collects connectivity information of one of the processors (830) and selects certain of the processors to cease operations, based on the connectivity information collected. The invention further communicates the selection to each of the processors (850) communicatively coupled to the one processor.
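One way to picture the selection step is as a search for the largest group of processors in which every pair can communicate in both directions. The Python sketch below is only an illustration under stated assumptions: the abstract does not say how the selection is computed, so the brute-force search, the function names `select_survivors` and `select_to_halt`, the dictionary format for connectivity information, and the tie-break toward lower-numbered processors are all hypothetical.

```python
from itertools import combinations


def select_survivors(links):
    """Return a largest fully connected group of processors.

    links maps each processor id to the set of ids it reports it can reach
    (an assumed format for the collected connectivity information).
    """
    procs = sorted(links)

    def connected(a, b):
        # Require the link to be reported as working in both directions.
        return b in links[a] and a in links[b]

    # Try the largest group sizes first; the first fully connected group wins.
    # Ties are broken by lexicographic order of processor ids (an assumption).
    for size in range(len(procs), 0, -1):
        for group in combinations(procs, size):
            if all(connected(a, b) for a, b in combinations(group, 2)):
                return set(group)
    return set()


def select_to_halt(links):
    # Processors outside the surviving group are selected to cease operations.
    return set(links) - select_survivors(links)
```

For example, with `links = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}`, processors 1 and 3 cannot reach each other, so the sketch keeps `{0, 1, 2}` and selects processor 3 to halt. A brute-force search is exponential in the number of processors and is used here only because it is easy to read.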
Abstract:
A method and apparatus for detecting and tolerating situations in which one or more processors (112a, b, ..., n) in a multi-processor system cannot participate in timer-driven or timer-triggered protocols or event sequences. The multi-processor system includes multiple processors each having a respective memory (118a, b, ..., n). These processors are coupled by an interprocessor communication network (114) (preferably consisting of redundant paths). Processors are suspected of having failed (ceased operations) outright or having a failed timer mechanism when the other processors detect the absence of periodic "IamAlive" messages from them. When this happens, all of the processors in the system are subjected to a series of stages in which they repeatedly broadcast their status and their connectivity to each other. During the first such stage, according to the present invention, a processor will not assert its ability to participate unless its timer mechanism is working. It arms a timer expiration event and does not assert its health until and unless that timer expiration event occurs.
Abstract:
A split brain avoidance protocol to determine the group of processors (112) that will survive a complete partitioning (disconnection) in the interprocessor communications (114) paths connecting processors (112) in a multiprocessor system (100). Processors (112) embodying the invention detect that the set of processors (112) with which they can communicate has changed. They then choose either to halt or to continue operations, guided by the goal of minimizing the possibility that multiple disconnected groups of processors (112) continue to operate as independent systems, each group having determined (incorrectly) that the processors (112) of the other groups have failed.
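The halt-or-continue decision can be sketched with a simple majority rule. This abstract does not state the rule actually used, so everything in the Python sketch below is an assumption made for illustration: the function name `should_continue`, the majority threshold, and the use of a designated tie-breaker processor when exactly half of the processors remain reachable.

```python
def should_continue(reachable, total, tie_breaker):
    """Decide whether a processor's group keeps running after a partition.

    reachable   -- set of processor ids this processor can still communicate
                   with, including itself
    total       -- number of processors in the system before the disruption
    tie_breaker -- id of a designated processor (hypothetical) that decides
                   exact half-and-half splits
    """
    group_size = len(reachable)
    if 2 * group_size > total:
        # A strict majority can exist on only one side of a partition.
        return True
    if 2 * group_size == total:
        # At most one of two equal halves contains the tie-breaker.
        return tie_breaker in reachable
    # A minority group halts rather than risk running as a second system.
    return False
```

Under these assumptions, in a four-processor system split into `{0, 1, 2}` and `{3}`, the first group continues and processor 3 halts. If the split is `{0, 1}` and `{2, 3}` with processor 0 as tie-breaker, only the group containing processor 0 continues, so two disconnected groups never both survive.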
Abstract:
A method and apparatus for achieving maximal, full connection in a multiprocessor system having a plurality of processors. Each of the multiple processors has a respective memory. The invention includes communicatively connecting the processors. Following a disruption (805) in the communicative connection, the invention collects connectivity information of one of the processors (830) and selects certain of the processors to cease operations, based on the connectivity information collected. The invention further communicates the selection to each of the processors (850) communicatively coupled to the one processor.
Abstract:
A system to determine the group of processors that will survive communications faults and/or timed-event failures in a multi-processor system (100). The processors (112), each having a memory (118) and connected to an interprocessor communication network (114), detect that the set of processors with which they can communicate has changed. They then choose to halt or continue operations based on minimizing the likelihood that disconnected groups of processors will continue to operate as independent systems on the initiation of a regroup operation (622b). A processor is suspected of having failed when other processors detect the absence of a periodic message from the processor (682). When this happens, all of the processors are subjected to a series of stages in which they repeatedly broadcast their status and connectivity to each other (830). The suspected processor does not advance through the stages to regroup if it has ceased operations or if its timer mechanism has failed.
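The suspicion step, in which a processor is suspected once its periodic message stops arriving, can be sketched as a check on the time of the last message heard from each peer. The Python sketch below is illustrative only: the function name `suspected_processors`, the `last_heard` table, and the `max_silence` threshold are hypothetical, and the abstract does not specify how the absence of a message is measured.

```python
def suspected_processors(last_heard, now, max_silence):
    """Return the ids of processors whose periodic message is overdue.

    last_heard  -- dict mapping processor id to the time its last periodic
                   message was received (assumed bookkeeping)
    now         -- current time, in the same units
    max_silence -- longest tolerated gap between messages (assumed threshold)
    """
    return {proc for proc, heard_at in last_heard.items()
            if now - heard_at > max_silence}
```

For instance, with `last_heard = {0: 9.0, 1: 4.0, 2: 9.5}`, `now = 10.0`, and `max_silence = 2.0`, only processor 1 has been silent for too long, so it alone is suspected and would trigger the regroup stages described above.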