Abstract:
Nodes in a distributed node system are configured to support memory corruption detection when memory is shared between the nodes. Nodes in the distributed node system share data in units of memory referred to herein as “shared cache lines.” A node associates a version value with data in a shared cache line. The version value and data may be stored in a shared cache line in the node's main memory. When the node performs a memory operation, it can use the version value to determine whether memory corruption has occurred. For example, a pointer may be associated with a version value. When the pointer is used to access memory, the version value of the pointer may indicate the expected version value at the memory location. If the version values do not match, then memory corruption has occurred.
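The version check described above can be sketched roughly as follows. This is a minimal illustration, not the patented mechanism: the class names, the 4-bit version width, and the choice to bump the stored version when memory is freed are all assumptions made for the example.

```python
from dataclasses import dataclass

VERSION_BITS = 4  # assumed width of the version field (not stated in the abstract)
VERSION_MASK = (1 << VERSION_BITS) - 1


class MemoryCorruptionError(Exception):
    """Raised when a pointer's version differs from the version stored in memory."""


@dataclass
class SharedCacheLine:
    version: int  # version value stored alongside the data
    data: bytes


@dataclass(frozen=True)
class VersionedPointer:
    address: int
    version: int  # version the holder expects to find at `address`


class Memory:
    """Hypothetical model of a node's main memory, keyed by cache-line address."""

    def __init__(self):
        self._lines = {}

    def allocate(self, address, data, version):
        version &= VERSION_MASK
        self._lines[address] = SharedCacheLine(version, data)
        return VersionedPointer(address, version)

    def free(self, ptr):
        # Assumption: freeing bumps the stored version so stale pointers stop matching.
        line = self._lines[ptr.address]
        line.version = (line.version + 1) & VERSION_MASK

    def load(self, ptr):
        line = self._lines[ptr.address]
        if line.version != ptr.version:
            raise MemoryCorruptionError(
                f"expected version {ptr.version}, found {line.version}"
            )
        return line.data
```

In this sketch, a load through a pointer obtained from `allocate` succeeds, while a load through the same pointer after `free` raises `MemoryCorruptionError`, because the stored version no longer matches the pointer's expected version.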
Abstract:
A method and apparatus for sending and receiving messages between nodes on a compute cluster are provided. Communication between nodes on a compute cluster, which do not share physical memory, is performed by passing messages over an I/O subsystem. Typically, each node includes a synchronization mechanism, a thread ready to receive connections, and other threads to process and reassemble messages. Frequently, a separate queue is maintained in memory for each node on the I/O subsystem that sends messages to the receiving node. Such overhead increases latency and limits message throughput. With a specialized coprocessor running on each node, messages on an I/O subsystem are sent, received, authenticated, synchronized, and reassembled at a faster rate and with lower latency. Additionally, the memory structure used may reduce memory consumption by storing messages from multiple sources in the same memory structure, eliminating the need for per-source queues.
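The idea of one memory structure holding messages from many sources can be illustrated with a small sketch. It models only the bookkeeping, in software; the coprocessor, authentication, and synchronization are not modeled, and the class name, the fragment fields (`message_id`, `index`, `total`), and the dictionary layout are hypothetical.

```python
class SharedReceiveBuffer:
    """One structure for fragments from every source, instead of a queue per source."""

    def __init__(self):
        # (source, message_id) -> {fragment index: payload}
        self._partial = {}

    def receive(self, source, message_id, index, total, payload):
        """Store one fragment; return the whole message once all fragments arrived."""
        key = (source, message_id)
        fragments = self._partial.setdefault(key, {})
        fragments[index] = payload
        if len(fragments) < total:
            return None  # message still incomplete
        del self._partial[key]  # release the space once reassembled
        return b"".join(fragments[i] for i in range(total))

    def pending(self):
        """Number of messages that are only partially received."""
        return len(self._partial)
```

Fragments from different senders can arrive interleaved and out of order; each message is returned only when its last missing fragment is stored.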
Abstract:
A method and apparatus are disclosed for enabling nodes in a distributed system to share one or more memory portions. A home node makes a portion of its main memory available for sharing, and one or more sharer nodes mirror that shared portion of the home node's main memory in their own main memory. To maintain memory coherency, a memory coherence protocol is implemented. Under this protocol, load and store instructions that target the mirrored memory portion of a sharer node are trapped, and store instructions that target the shared memory portion of a home node are trapped. With this protocol, valid data is obtained from the home node and updates are propagated to the home node. Thus, no “dirty” data is transferred between sharer nodes. As a result, the failure of one node will not cause the failure of another node or the failure of the entire system.
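A toy model of the protocol described above may help. It is a sketch under stated assumptions: the trap handlers are ordinary method calls, validity is tracked per address, and a store at the home node invalidates the sharers' mirrored copies. The abstract does not specify that invalidation step, so treat it and all names here as hypothetical.

```python
class HomeNode:
    """Owns the shared memory portion; all valid data lives here."""

    def __init__(self, size):
        self.memory = [0] * size
        self.sharers = []

    def store(self, addr, value, writer=None):
        # Trapped store on the home node: update memory, then (assumed)
        # invalidate every mirrored copy except the writer's own.
        self.memory[addr] = value
        for sharer in self.sharers:
            if sharer is not writer:
                sharer.invalidate(addr)


class SharerNode:
    """Mirrors the home node's shared portion in its own memory."""

    def __init__(self, home):
        self.home = home
        self.mirror = [0] * len(home.memory)
        self.valid = [False] * len(home.memory)
        home.sharers.append(self)

    def invalidate(self, addr):
        self.valid[addr] = False

    def load(self, addr):
        # Trapped load: fetch valid data from the home node, never from a peer.
        if not self.valid[addr]:
            self.mirror[addr] = self.home.memory[addr]
            self.valid[addr] = True
        return self.mirror[addr]

    def store(self, addr, value):
        # Trapped store: propagate the update to the home node first.
        self.home.store(addr, value, writer=self)
        self.mirror[addr] = value
        self.valid[addr] = True
```

Because sharers only ever read from and write to the home node, no sharer depends on another sharer's copy, which is the property the abstract credits for failure isolation.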
Abstract:
A system and method implementing revocable secure remote keys are disclosed. A plurality of indexed base secrets is stored in a register of a coprocessor of a local node coupled with a local memory. When it is determined that a selected base secret has expired, the base secret stored in the register at the corresponding base secret index is changed, thereby invalidating remote keys generated based on the expired base secret. A remote key with validation data and a base secret index is received from a node requesting access to the local memory. A validation base secret is obtained from the register based on the base secret index. The coprocessor performs hardware validation on the validation data based on the validation base secret. Hardware validation fails if the base secret associated with the base secret index has been changed in the register of the coprocessor.
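The revocation scheme above can be sketched in software. The abstract does not say how validation data is computed, so this example assumes an HMAC-SHA256 over a region descriptor keyed by the indexed base secret; the class names, the `region` field, and the slot count are likewise hypothetical, and real hardware validation would differ.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteKey:
    secret_index: int  # which base secret the key was derived from
    region: bytes      # hypothetical descriptor of the memory being shared
    validation: bytes  # validation data checked by the coprocessor


class Coprocessor:
    """Software stand-in for the register of indexed base secrets."""

    def __init__(self, slots=4):
        self._base_secrets = [secrets.token_bytes(32) for _ in range(slots)]

    def _mac(self, index, region):
        # Assumption: validation data is an HMAC-SHA256 under the base secret.
        return hmac.new(self._base_secrets[index], region, hashlib.sha256).digest()

    def issue_key(self, index, region):
        return RemoteKey(index, region, self._mac(index, region))

    def expire(self, index):
        # Changing the base secret invalidates every key derived from it.
        self._base_secrets[index] = secrets.token_bytes(32)

    def validate(self, key):
        expected = self._mac(key.secret_index, key.region)
        return hmac.compare_digest(key.validation, expected)
```

Replacing the secret in one slot revokes all keys issued under that index at once, while keys issued under other indexes keep validating.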