Abstract:
Systems, methods, apparatuses, and software for data storage systems are provided herein. In one example, a data storage system is provided that includes a plurality of storage drives each comprising a Peripheral Component Interconnect Express (PCIe) interface, and configured to store data and retrieve the data stored on associated storage media responsive to storage operations. The data storage system includes one or more processing modules comprising one or more processors communicatively coupled to the plurality of storage drives over a PCIe fabric comprised of one or more PCIe switches. The processors are configured to share a PCIe address space associated with the PCIe fabric for transfer of the storage operations to appropriate ones of the processors that manage ones of the plurality of storage drives.
Abstract:
In a system, a first status of a first ESP engine (ESPE) executing at a first computing device is determined as newly active; a last published event block object identifier is determined as an identifier uniquely identifying a last event block object published to an out-messaging network device; a next event block object having an event block object identifier greater than the determined last published event block object identifier is selected from a first computer-readable medium; and the selected next event block object is published to the out-messaging network device. A first event block object is received from a second ESPE executing at a second computing device. A first status of the second ESPE is determined as standby by the second computing device. The received first event block object is stored by the second computing device in a second non-transitory computer-readable medium.
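The resume-after-failover step described above can be sketched as follows. This is a minimal illustration, assuming event block object identifiers are integers that increase monotonically; the names `EventBlock` and `blocks_to_publish` are hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventBlock:
    block_id: int   # assumption: identifiers increase monotonically
    payload: str


def blocks_to_publish(stored, last_published_id):
    """Return stored blocks with an ID greater than the last published ID, in ID order."""
    pending = [b for b in stored if b.block_id > last_published_id]
    return sorted(pending, key=lambda b: b.block_id)


# Blocks the engine buffered while it was in standby (arrival order may vary).
stored = [EventBlock(3, "c"), EventBlock(1, "a"), EventBlock(2, "b"), EventBlock(4, "d")]

# On becoming newly active, publish only what the out-messaging device has not seen.
resumed = blocks_to_publish(stored, last_published_id=2)
print([b.block_id for b in resumed])
```

Filtering on the identifier, rather than on arrival time, is what lets the newly active engine avoid republishing blocks the out-messaging network device already received.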
Abstract:
A selection device is selected from a first group in a first cluster. The node devices of the first group can communicate with a node device in a second cluster adjacent to the first cluster. The selection device performs transmission and reception of a report frame that reports an identifier of a node device included in the first group. The selection device selects from the first group a first relay device that relays a relay frame used for a communication between a node device in the first cluster and a node device in the second cluster. The first relay device determines a node device adjacent to the first relay device to be a second relay device that relays the relay frame to a node device in the second cluster.
Abstract:
Technology is disclosed for recovering I/O modules in a storage system using in-band alternate control path (ACP) architecture (“the technology”). The technology enables a storage server to transmit control commands, e.g., for recovering an I/O module, to the I/O module over a data path that is typically used to transmit data commands. The control commands are typically transmitted using an ACP that is separate from the data path. By enabling transmission of control commands over the data path, the technology eliminates, at least in part, the need for a separate ACP medium to transmit the control commands. The technology can be implemented in a pure in-band ACP mode, which supports recovering an I/O module of a storage shelf in which at least one I/O module is responsive, and/or in a mixed in-band ACP mode, which supports recovery of I/O modules of a storage shelf in which all I/O modules are non-responsive.
Abstract:
According to an example, an Edge Virtual Bridging (EVB) station is configured with a VM, an ER and multiple physical network cards. The VM is configured with multiple virtual network cards and each virtual network card has one VSI. Each VSI is connected with one of the physical network cards via the ER. One of the physical network cards is configured as a primary physical network card, and another is configured as a secondary physical network card. A VSI corresponding to the primary physical network card is configured as a primary virtual interface, and a VSI corresponding to the secondary physical network card is configured as a secondary virtual interface. After determining that the primary physical network card has failed, the secondary physical network card is configured as a new primary physical network card, and the secondary virtual interface is configured as a new primary virtual interface.
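The role swap on failure can be sketched as below. This is an illustrative model only: the `Link` record, the interface names, and the `"failed"` role assigned to the old primary are assumptions, since the abstract does not say what happens to the failed card.

```python
from dataclasses import dataclass


@dataclass
class Link:
    nic: str             # physical network card (hypothetical name)
    vsi: str             # VSI connected to that card via the ER
    role: str            # "primary" or "secondary"
    healthy: bool = True


def fail_over(links):
    """Return the link that should carry traffic, promoting the secondary if needed."""
    primary = next(l for l in links if l.role == "primary")
    if primary.healthy:
        return primary
    standby = next((l for l in links if l.role == "secondary" and l.healthy), None)
    if standby is None:
        raise RuntimeError("no healthy secondary network card available")
    primary.role = "failed"    # assumption: the abstract leaves this unspecified
    standby.role = "primary"   # card and its VSI are promoted together
    return standby


links = [Link("nic0", "vsi0", "primary"), Link("nic1", "vsi1", "secondary")]
links[0].healthy = False       # simulate the primary card failing
active = fail_over(links)
print(active.nic, active.vsi, active.role)
```

Keeping the card and its VSI in one record reflects the pairing in the abstract: promoting the physical card also promotes the corresponding virtual interface.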
Abstract:
Embodiments of the present invention disclose a method, computer program product, and system for memory replication. In one embodiment, in accordance with the present invention, the computer implemented method includes the steps of executing a mobile agent on a server node, wherein the server node is within a cluster of server nodes connected via network communications, capturing a memory state of the server node during operation of the server node, wherein the memory state includes session information stored on computer memory of the server node, which is captured and stored by the mobile agent, monitoring the server node to determine whether the server node has failed, and responsive to determining that the server node has failed, migrating the mobile agent to an active server node within the cluster of server nodes, wherein the mobile agent carries the captured memory state.
Abstract:
A method, non-transitory computer readable medium, and host device that receives one or more transactions. A state is stored in a transaction log in a volatile memory wherein the state includes information associated with the one or more transactions. The transaction log is stored in a stable storage device when a failure is determined to have occurred. The transaction log can then be retrieved and replayed subsequent to a reboot. Thereby, state can be preserved and transactions pending, but not yet committed to storage server devices, can be replayed and proceed with minimal or no impact on the client devices originating the write transactions.
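The log lifecycle (record in volatile memory, persist on failure, replay after reboot) might look like the following sketch. The `TransactionLog` class, the JSON file format, and the key/value transactions are assumptions made for illustration, not details from the abstract.

```python
import json
import os
import tempfile


class TransactionLog:
    def __init__(self, stable_path):
        self.stable_path = stable_path
        self.entries = []                 # volatile, in-memory state

    def record(self, txn):
        """Track a pending transaction in volatile memory."""
        self.entries.append(txn)

    def persist_on_failure(self):
        """Copy the in-memory log to stable storage once a failure is detected."""
        with open(self.stable_path, "w") as f:
            json.dump(self.entries, f)
            f.flush()
            os.fsync(f.fileno())          # make sure the data reaches the device

    @classmethod
    def recover(cls, stable_path):
        """After a reboot, rebuild the log from stable storage."""
        log = cls(stable_path)
        with open(stable_path) as f:
            log.entries = json.load(f)
        return log

    def replay(self, apply):
        """Re-apply each pending transaction in its original order."""
        for txn in self.entries:
            apply(txn)


path = os.path.join(tempfile.mkdtemp(), "txn.log")
log = TransactionLog(path)
log.record({"key": "a", "value": 1})
log.record({"key": "b", "value": 2})
log.persist_on_failure()

store = {}                                # stands in for the storage server devices
TransactionLog.recover(path).replay(lambda t: store.update({t["key"]: t["value"]}))
print(store)
```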
Abstract:
A method of providing failure recovery capabilities to a cloud environment for scientific HPC applications. An HPC application with MPI implementation extends the class of MPI programs to embed the HPC application with various degrees of fault tolerance. An MPI fault tolerance mechanism realizes a recover-and-continue solution. If an error occurs, only failed processes re-spawn, the remaining living processes remain in their original processors/nodes, and system recovery costs are thus minimized.
Abstract:
A processor includes a plurality of processing sections, each of which executes a predetermined process. A plurality of fault detecting circuits are respectively provided for the plurality of processing sections, to detect a fault in one of the plurality of processing sections as a fault processing section to generate a fault detection signal. A fault monitoring and control section controls a normal processing section as at least one of the plurality of processing sections other than the fault processing section to execute a relieving process in response to the fault detection signal. The relieving process is determined based on a process load of the fault processing section, a process load of the normal processing section, and priority levels of processes to be executed by the fault processing section and the normal processing section.
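The abstract names the inputs to the relieving decision (two process loads and two priority levels) but not the rule itself. The sketch below shows one possible rule, purely as an assumption: run both processes when their combined load fits within a hypothetical `capacity`, otherwise run whichever has the higher priority. The function and return labels are invented for illustration.

```python
def choose_relief(fault_load, normal_load, fault_priority, normal_priority, capacity=1.0):
    """Pick a relieving process for the normal section (assumed rule, not from the source)."""
    if fault_load + normal_load <= capacity:
        # Enough headroom: the normal section absorbs the faulty section's process.
        return "run_both"
    if fault_priority > normal_priority:
        # Overloaded: the more important process displaces the less important one.
        return "take_over_fault_process"
    return "keep_normal_process"


print(choose_relief(0.25, 0.5, fault_priority=1, normal_priority=2))
print(choose_relief(0.75, 0.5, fault_priority=3, normal_priority=2))
print(choose_relief(0.75, 0.5, fault_priority=1, normal_priority=2))
```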
Abstract:
An SVC cluster manages a plurality of storage devices and includes a plurality of SVCs interconnected via a network, each SVC acting as a separate node. A new configuration node is activated in response to configuration node failures. The new configuration node retrieves client subscription information about events occurring in storage devices managed by the SVC cluster from the storage devices. In response to events occurring in the storage device managed by the SVC cluster, the new configuration node obtains storage device event information from a storage device event monitoring unit. The new configuration node sends storage device events to clients who have subscribed to this information according to subscription information obtained. The storage device is not installed in the original configuration node.
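The key design point is that subscription information lives on the managed storage devices rather than in the configuration node, so it survives the node's failure. A minimal sketch, assuming a hypothetical `StorageDevice` record that holds subscriptions as a mapping from event type to client names:

```python
class StorageDevice:
    def __init__(self, name, subscriptions):
        self.name = name
        # Assumption: subscriptions are persisted on the device as {event_type: [client, ...]}.
        self.subscriptions = subscriptions


def load_subscriptions(devices):
    """New configuration node rebuilds the subscription table from the devices."""
    merged = {}
    for dev in devices:
        for event_type, clients in dev.subscriptions.items():
            merged.setdefault(event_type, set()).update(clients)
    return merged


def notify(subscriptions, event_type, detail):
    """Return the notifications to send for one storage device event."""
    return [(client, event_type, detail) for client in sorted(subscriptions.get(event_type, ()))]


devices = [
    StorageDevice("disk0", {"disk_failure": ["alice"]}),
    StorageDevice("disk1", {"disk_failure": ["bob"], "capacity_low": ["alice"]}),
]
subs = load_subscriptions(devices)      # performed when the new node is activated
print(notify(subs, "disk_failure", "disk0 offline"))
```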