Abstract:
A method performed by a multi-core processor is described. The method includes, while a core is executing program code, reading a dirty cache line from the core's last-level cache and sending the dirty cache line from the core for storage external to the core, where the dirty cache line has been neither evicted from the cache nor requested by another core or processor.
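As a rough illustration, here is a minimal C sketch of the idea in software terms (the cache_line structure, the llc array, and writeback_external are our inventions, not the claimed hardware interface): while the core keeps running, dirty lines are copied out to external storage and marked clean, without being evicted or handed to another core.

#include <stdbool.h>
#include <stdio.h>

#define LLC_LINES 8

/* Simulated last-level-cache line: valid/dirty state plus payload. */
struct cache_line {
    bool valid;
    bool dirty;
    unsigned long tag;
    unsigned char data[64];
};

static struct cache_line llc[LLC_LINES];

/* Stand-in for sending a line to storage outside the core. */
static void writeback_external(const struct cache_line *line)
{
    printf("writeback tag=%lx\n", line->tag);
}

/* Proactive writeback: copy dirty lines out while the core keeps running.
 * Each line stays valid in the cache; only its dirty bit is cleared, so
 * it is neither evicted nor shared with another core. */
static void proactive_writeback(void)
{
    for (int i = 0; i < LLC_LINES; i++) {
        if (llc[i].valid && llc[i].dirty) {
            writeback_external(&llc[i]);
            llc[i].dirty = false;   /* now clean; a later eviction is silent */
        }
    }
}

int main(void)
{
    llc[3] = (struct cache_line){ .valid = true, .dirty = true, .tag = 0x1234 };
    proactive_writeback();
    return 0;
}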
Abstract:
In an embodiment, a processor includes logic to cause at least one core to operate with a power control cycle including a plurality of on times and a plurality of off times according to an ON-OFF keying protocol, where the on and off times vary depending on whether and when an interrupt is incurred. Other embodiments are described and claimed.
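One way to read this, sketched in C with made-up names and time units (interrupt_pending, on_time, and off_time are illustrative assumptions, not the claimed logic): the core alternates on and off periods, and an interrupt cuts the current off period short, with the skipped off time folded back into the next on period.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative ON-OFF keying cycle: the core alternates on and off
 * periods; an interrupt ends the off period early and stretches the
 * following on period. All durations are arbitrary ticks. */
static bool interrupt_pending(int tick)
{
    return tick == 2;   /* pretend an interrupt arrives at tick 2 */
}

int main(void)
{
    int on_time = 4, off_time = 4;

    for (int cycle = 0; cycle < 3; cycle++) {
        printf("cycle %d: on for %d ticks\n", cycle, on_time);

        int off_spent = 0;
        for (int t = 0; t < off_time; t++) {
            if (interrupt_pending(t)) {
                on_time += off_time - t;   /* give back the skipped off time */
                break;
            }
            off_spent++;
        }
        printf("cycle %d: off for %d ticks\n", cycle, off_spent);
    }
    return 0;
}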
Abstract:
Technologies for aggregation-based message processing include multiple computing nodes in communication over a network. A computing node receives a message from a remote computing node, increments an event counter in response to receiving the message, determines whether an event trigger is satisfied in response to incrementing the counter, and writes a completion event to an event queue if the event trigger is satisfied. An application of the computing node monitors the event queue for the completion event. The application may be executed by a processor core of the computing node, and the other operations may be performed by a host fabric interface of the computing node. The computing node may be a target node and count one-sided messages received from an initiator node, or the computing node may be an initiator node and count acknowledgement messages received from a target node. Other embodiments are described and claimed.
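A minimal C sketch of the counting path, assuming invented names (event_counter, TRIGGER_COUNT, write_completion_event) and a simple fixed-threshold trigger: each received message bumps the counter, and only when the trigger condition holds is a completion event posted to the queue the application polls.

#include <stdio.h>

#define QUEUE_LEN 16
#define TRIGGER_COUNT 4   /* fire a completion event every 4 messages */

/* Illustrative host-fabric-interface state: an event counter plus a
 * completion-event queue polled by the application. */
static unsigned long event_counter;
static int queue[QUEUE_LEN];
static int queue_head, queue_tail;

static void write_completion_event(int event)
{
    queue[queue_tail] = event;
    queue_tail = (queue_tail + 1) % QUEUE_LEN;
}

/* Called per received message: count it, test the trigger, and post a
 * completion event only when the threshold is met. */
static void on_message_received(void)
{
    event_counter++;
    if (event_counter % TRIGGER_COUNT == 0)
        write_completion_event((int)event_counter);
}

int main(void)
{
    for (int msg = 0; msg < 10; msg++)
        on_message_received();

    /* Application side: drain the completion events. */
    while (queue_head != queue_tail) {
        printf("completion after %d messages\n", queue[queue_head]);
        queue_head = (queue_head + 1) % QUEUE_LEN;
    }
    return 0;
}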
Abstract:
Methods and apparatus for reduced network load with receiver-managed offset (RMO) PUT or GET messages. An RMO PUT message including an RMO key, data, and a length is sent from an initiator to a target, where the RMO key is extracted by a Network Interface Controller (NIC), SmartNIC, or Infrastructure Processing Unit and used to identify an address or address offset of a memory buffer in a target memory at which to write the data. An RMO GET message is sent from an initiator to a target and includes an RMO key, a source buffer on the target, and a length. The target processes the RMO GET, reads the length of data from its source buffer, and returns a message to the initiator including the RMO key, the read data, and the length. The RMO key is extracted and used to identify an address or address offset of a memory buffer in a memory on the initiator in which to write the read data.
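A hedged C sketch of target-side PUT handling, under the assumption that the RMO key simply indexes a table of registered buffers (the rmo_put structure, the key-to-buffer table, and handle_rmo_put are our names, not the wire format): the sender carries no address, and the receiver resolves the key to a buffer locally.

#include <stdio.h>
#include <string.h>

#define MAX_KEYS 8
#define BUF_SIZE 256

/* Illustrative RMO PUT message: key, payload, and length only. */
struct rmo_put {
    unsigned key;
    const unsigned char *data;
    size_t length;
};

/* Receiver-managed mapping from RMO key to a registered buffer. */
static unsigned char buffers[MAX_KEYS][BUF_SIZE];

static void handle_rmo_put(const struct rmo_put *msg)
{
    if (msg->key >= MAX_KEYS || msg->length > BUF_SIZE)
        return;                       /* unknown key or oversized payload */
    /* NIC-side write: key resolves to the buffer address locally. */
    memcpy(buffers[msg->key], msg->data, msg->length);
}

int main(void)
{
    const unsigned char payload[] = "hello";
    struct rmo_put msg = { .key = 3, .data = payload, .length = sizeof payload };
    handle_rmo_put(&msg);
    printf("buffer 3: %s\n", buffers[3]);
    return 0;
}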
Abstract:
Examples described herein relate to a computing system supporting custom page-size ranges, allowing an application to map contiguous memory regions instead of many smaller pages. An application can request a custom range size. An operating system can allocate a contiguous physical memory region to a virtual address range by specifying custom range sizes that are larger or smaller than the general page sizes. Virtual-to-physical address translation can occur using address range circuitry and a translation lookaside buffer in parallel. The address range circuitry can determine whether a custom entry is available to identify a physical address translation for the virtual address. Physical address translation can be performed by transforming the virtual address in some examples.
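A C sketch of the range lookup, assuming a range entry stores a virtual base, a custom size, and a physical base (the range_entry layout and range_translate are illustrative): because the span is contiguous, translation is an address transform rather than a per-page walk, and a miss falls back to the conventional TLB path.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative custom-range entry: maps a whole contiguous virtual span
 * to contiguous physical memory. */
struct range_entry {
    bool valid;
    uint64_t va_base;
    uint64_t size;       /* custom size, not tied to 4K/2M/1G pages */
    uint64_t pa_base;
};

static struct range_entry range_table[4];

static bool range_translate(uint64_t va, uint64_t *pa)
{
    for (int i = 0; i < 4; i++) {
        const struct range_entry *e = &range_table[i];
        if (e->valid && va >= e->va_base && va < e->va_base + e->size) {
            *pa = e->pa_base + (va - e->va_base);   /* address transform */
            return true;
        }
    }
    return false;   /* fall back to the conventional TLB/page walk */
}

int main(void)
{
    range_table[0] = (struct range_entry){
        .valid = true, .va_base = 0x10000000, .size = 3u << 20, /* 3 MiB */
        .pa_base = 0x80000000,
    };

    uint64_t pa;
    if (range_translate(0x10000040, &pa))
        printf("pa = 0x%llx\n", (unsigned long long)pa);
    return 0;
}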
Abstract:
Various different embodiments of the invention are described, including: (1) a method and apparatus for intelligently allocating threads within a binary translation system; (2) data cache way prediction guided by binary translation code-morphing software; (3) fast interpreter hardware support on the data side; (4) out-of-order retirement; (5) decoupled load retirement in an atomic OOO processor; (6) handling transactional and atomic memory in an out-of-order binary-translation-based processor; and (7) speculative memory management in a binary-translation-based out-of-order processor.
Abstract:
In an example, there is disclosed a compute node, comprising: first one or more logic elements comprising a data producer engine to produce a datum; and a host fabric interface to communicatively couple the compute node to a fabric, the host fabric interface comprising second one or more logic elements comprising a data pulling engine, the data pulling engine to: publish the datum as available; receive a pull request for the datum, the pull request comprising a node identifier for a data consumer; and send the datum to the data consumer via the fabric. There is also disclosed a method of providing a data pulling engine.
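A minimal C sketch of the pull flow, with invented structures (pull_request, send_to_node) standing in for the fabric: the engine publishes that the datum is available, then answers each pull request by sending the datum to the node named in the request, rather than pushing it unsolicited.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative pull request: carries the consumer's node identifier. */
struct pull_request {
    int consumer_node_id;
};

static bool datum_published;
static int datum = 42;

/* Stand-in for a fabric send to a specific node. */
static void send_to_node(int node_id, int value)
{
    printf("send %d to node %d over fabric\n", value, node_id);
}

static void publish_datum(void)
{
    datum_published = true;   /* advertise availability on the fabric */
}

static void handle_pull_request(const struct pull_request *req)
{
    if (datum_published)
        send_to_node(req->consumer_node_id, datum);
}

int main(void)
{
    publish_datum();
    struct pull_request req = { .consumer_node_id = 7 };
    handle_pull_request(&req);
    return 0;
}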
Abstract:
Technologies for monitoring communication performance of a high performance computing (HPC) network include a performance probing engine of a source endpoint node of the HPC network. The performance probing engine is configured to generate a probe request that includes a timestamp of the probe request and transmit the probe request to a destination endpoint node of the HPC network communicatively coupled to the source endpoint node via the HPC network. The performance probing engine is additionally configured to receive a probe response from the destination endpoint node via the HPC network and to generate another timestamp that corresponds to the probe request having been received. Further, the performance probing engine is configured to determine a round-trip latency as a function of the probe request and probe response timestamps. Other embodiments are described and claimed.
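A C sketch of the timestamp arithmetic, using clock_gettime(CLOCK_MONOTONIC, ...) as a stand-in for the probing engine's clock and eliding the probe exchange itself: stamp the request on send, stamp again when the response arrives, and take the difference as the round-trip latency.

#include <stdio.h>
#include <time.h>

/* Monotonic clock read, as a stand-in for the probing engine's clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double t_request = now_seconds();   /* timestamp carried in the probe */

    /* ... probe request travels to the destination; response returns ... */

    double t_response = now_seconds();  /* timestamp at response receipt */
    printf("round-trip latency: %.9f s\n", t_response - t_request);
    return 0;
}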