Abstract:
A processor employs multiple prefetchers to identify patterns in memory accesses to different memory modules. The memory accesses can include transfers between the memory modules, and the prefetchers can prefetch data directly from one memory module to another based on patterns in those transfers. This allows the processor to organize data efficiently across the memory modules without direct intervention by software or by a processor core, thereby improving processing efficiency.
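As a rough behavioral sketch of the idea (not the patented hardware), the following Python model shows how a per-module-pair stride detector might recognize a repeating pattern in transfers and issue a direct module-to-module prefetch; the class and method names are illustrative assumptions.

# Illustrative behavioral model only; the actual prefetchers are hardware.
class TransferPrefetcher:
    """Tracks transfers between one pair of memory modules and prefetches
    the next address once a constant stride is observed twice in a row."""
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride:
                # Pattern confirmed: prefetch the next line directly into
                # the destination module, without involving a core.
                prefetch = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = TransferPrefetcher()
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    hint = pf.observe(a)
    if hint is not None:
        print(f"prefetch module A -> module B @ {hint:#x}")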
Abstract:
A cloud computing system includes cloud orchestrator circuitry and fabric manager circuitry. The cloud orchestrator circuitry receives an input application and determines a task graph, a data graph, and a function popularity heap parameter for the input application. The task graph comprises an indication of function interdependency of functions of the input application, the data graph comprises an indication of data interdependency of the functions, and the function popularity heap parameter corresponds to a re-usability index for the functions. The fabric manager circuitry allocates a first programmable integrated circuit (IC) device to perform a first function of the input application based on the task graph, the data graph, and the function popularity heap parameter.
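A minimal sketch of how the three inputs might drive allocation, assuming simple dictionary encodings for the task and data graphs and a max-heap keyed on the re-usability index; all structures, names, and values here are hypothetical, not the patent's actual encodings.

import heapq

# Hypothetical inputs: the abstract names the structures, not their encoding.
task_graph = {"f1": [], "f2": ["f1"], "f3": ["f1"]}   # function -> prerequisite functions
data_graph = {"f2": ["f1"], "f3": ["f1"]}             # function -> data producers
popularity = [(-5, "f1"), (-2, "f2"), (-1, "f3")]     # max-heap via negated re-usability index
heapq.heapify(popularity)

free_ics = ["ic0", "ic1"]
allocation = {}

while popularity and free_ics:
    _, fn = heapq.heappop(popularity)
    # Allocate a function only once both its task and data
    # dependencies have already been placed on a device.
    deps = set(task_graph[fn]) | set(data_graph.get(fn, []))
    if all(d in allocation for d in deps):
        allocation[fn] = free_ics.pop(0)

print(allocation)   # e.g. {'f1': 'ic0', 'f2': 'ic1'}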
Abstract:
A technique for synchronizing workgroups is provided. Multiple workgroups execute a wait instruction that specifies a condition variable and a condition. A workgroup scheduler stops execution of a workgroup that executes the wait instruction, and an advanced controller begins monitoring the condition variable. In response to the advanced controller detecting that the condition is met, the workgroup scheduler determines whether there is a high contention scenario. A high contention scenario occurs when the wait instruction is part of a mutual exclusion synchronization primitive, and is detected by observing a low number of updates to the condition variable before the condition is met. In a high contention scenario, the workgroup scheduler wakes up one workgroup and schedules another workgroup to be woken up at a future time. In a non-contention scenario, more than one workgroup can be woken up at the same time.
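A behavioral sketch of the wake-up policy described above; the update-count threshold, the delay, and the function names are assumptions introduced for illustration.

# Behavioral sketch only; thresholds and names are assumptions.
LOW_UPDATE_THRESHOLD = 2   # few updates before the condition holds => likely a mutex

def wake_waiters(waiters, updates_before_condition_met, now, delay=10):
    schedule = []
    if updates_before_condition_met <= LOW_UPDATE_THRESHOLD:
        # High contention (mutex-like): wake one workgroup now and
        # schedule one more for a future time, avoiding a thundering herd.
        if waiters:
            schedule.append((now, waiters[0]))
        if len(waiters) > 1:
            schedule.append((now + delay, waiters[1]))
    else:
        # Non-contention: all waiters can be woken at the same time.
        schedule = [(now, wg) for wg in waiters]
    return schedule

print(wake_waiters(["wg0", "wg1", "wg2"], updates_before_condition_met=1, now=0))
# -> [(0, 'wg0'), (10, 'wg1')]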
Abstract:
The disclosed computing device can include super flow control unit (flit) generation circuitry configured to generate a super flit containing two or more flits having two or more requests embedded therein, wherein the two or more requests have the same destination node identifiers and the super flit has a variable size based on a flit size and a number of existing requests in a source node that target a same destination node. The device can additionally include authentication circuitry configured to append a message authentication code to a last flit of the super flit. The device can also include communication circuitry configured to send the super flit to a network switch configured to route the super flit to a destination node corresponding to the same destination node identifiers. Various other methods, systems, and computer-readable media are also disclosed.
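One way to picture the packing and authentication steps, as a hedged software sketch: requests bound for the same destination are concatenated, split into fixed-size flits, and a truncated HMAC tag stands in for the message authentication code appended to the last flit. The flit size, key, and tag length are illustrative assumptions, not values fixed by the abstract.

import hmac, hashlib

FLIT_BYTES = 32     # assumed flit size; the abstract leaves the size variable
KEY = b"link-key"   # illustrative per-link authentication key

def build_super_flit(requests, dest_id):
    """Pack all pending requests for one destination into back-to-back
    flits and append a message authentication code to the last flit."""
    payload = b"".join(requests)
    flits = [payload[i:i + FLIT_BYTES] for i in range(0, len(payload), FLIT_BYTES)]
    mac = hmac.new(KEY, payload, hashlib.sha256).digest()[:8]  # truncated tag
    flits[-1] = flits[-1] + mac
    return {"dest": dest_id, "flits": flits}

sf = build_super_flit([b"read:0x100", b"write:0x200"], dest_id=7)
print(len(sf["flits"]), "flit(s) to node", sf["dest"])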
Abstract:
One or both of read and write accesses to a fabric-attached memory module via a fabric interconnect are monitored. In one or more implementations, offloading of one or more tasks accessing the fabric-attached memory module to a processor of a routing system associated with the fabric-attached memory module is initiated based on the read and write accesses to the fabric-attached memory module. Additionally or alternatively, replicating memory of the fabric-attached memory module to a cache memory of a computing node of a disaggregated memory system executing one or more tasks of a host application is initiated based on the write accesses to the fabric-attached memory module.
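A toy decision function capturing the spirit of the monitoring: heavy total traffic suggests offloading work to the routing system's processor, while read-dominated traffic suggests replication into the node's cache. The thresholds and names are invented for illustration.

# Sketch of the monitoring decision; thresholds are illustrative assumptions.
OFFLOAD_THRESHOLD = 1000        # total accesses that justify near-memory offload
WRITE_RATIO_FOR_REPLICA = 0.1   # mostly-read pages are safe to replicate

def decide(reads, writes):
    total = reads + writes
    if total >= OFFLOAD_THRESHOLD:
        return "offload task to routing-system processor"
    if total and writes / total <= WRITE_RATIO_FOR_REPLICA:
        return "replicate memory to the computing node's cache"
    return "no action"

print(decide(reads=950, writes=50))   # -> offload task to routing-system processor
print(decide(reads=900, writes=10))   # -> replicate memory to the computing node's cache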
Abstract:
Data are serially communicated over an interconnect between an encoder and a decoder. The encoder includes a first training unit to count a frequency of symbol values in symbol blocks of a set of N number of symbol blocks in an epoch. A first circular shift unit of the encoder stores a set of most-recently-used (MRU) amplitude values. An XOR unit is coupled to the first training unit and the first circular shift unit as inputs and to the interconnect as output. A transmitter is coupled to the encoder XOR unit and the interconnect and thereby contemporaneously sends symbols and trains on the symbols. In a system, a device includes a receiver and decoder that receive, from the encoder, symbols over the interconnect. The decoder includes its own training unit for decoding the transmitted symbols.
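The abstract leaves the coding details open, but the shared-training idea can be illustrated with a toy model in which encoder and decoder each maintain an identical most-recently-used (MRU) list and XOR every symbol against the current prediction; the MRU depth of 4 and all function names are assumptions.

# Toy model of the shared training idea: both sides update an identical
# MRU list, so the XOR of each symbol with the MRU head stays small for
# repetitive streams; this is an illustration, not the patented circuit.
def mru_update(mru, value):
    if value in mru:
        mru.remove(value)
    mru.insert(0, value)
    del mru[4:]                     # assumed MRU depth of 4

def encode(symbols):
    mru, out = [0], []
    for s in symbols:
        out.append(s ^ mru[0])      # XOR against the current prediction
        mru_update(mru, s)
    return out

def decode(codes):
    mru, out = [0], []
    for c in codes:
        s = c ^ mru[0]
        out.append(s)
        mru_update(mru, s)          # decoder trains on the same stream
    return out

data = [7, 7, 7, 3, 7]
assert decode(encode(data)) == data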
Abstract:
Arbitrating atomic memory operations, including: receiving, by a media controller, a plurality of atomic memory operations; determining, by an atomics controller associated with the media controller, based on one or more arbitration rules, an ordering for issuing the plurality of atomic memory operations; and issuing the plurality of atomic memory operations to a memory module according to the ordering.
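A sketch of one plausible arbitration rule set (the abstract does not fix the rules): same-address atomics are kept in arrival order and grouped together, and groups are issued oldest-first. The tuple encoding is an assumption made for illustration.

from collections import OrderedDict

def arbitrate(ops):
    """Group same-address atomics together (preserving their arrival
    order) and issue address groups oldest-first."""
    # ops: list of (arrival_seq, address, kind), e.g. (3, 0x40, "swap")
    groups = OrderedDict()
    for op in sorted(ops, key=lambda op: op[0]):   # oldest-first, stable
        groups.setdefault(op[1], []).append(op)
    order = []
    for addr_ops in groups.values():
        order.extend(addr_ops)
    return order

pending = [(2, 0x40, "fetch_add"), (1, 0x80, "cas"), (3, 0x40, "swap")]
for seq, addr, kind in arbitrate(pending):
    print(f"issue #{seq}: {kind} @ {addr:#x}")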
Abstract:
Systems, apparatuses, and methods for dynamically selecting between wired and wireless interconnects for sending packets are disclosed. A system includes at least a hybrid communication engine and a plurality of interconnects for connecting to various end-points. The communication engine dynamically discovers and utilizes the best interconnect technology available between given end-points, choosing the physical interconnect best suited at any given time to send data from one source to one or multiple destinations. This communication can be either on-chip or across nodes. The communication engine makes its decision based on a set of predetermined parameters that can be re-adjusted by the application layer, such as transmission latency, message data size, physical distance from source to destination, energy cost, and current congestion on the alternative interconnects.
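A hedged sketch of such a selection: each candidate interconnect is scored by a weighted sum over the listed parameters, with the weights standing in for the application-adjustable knobs. All weights and field values below are illustrative.

# Illustrative cost model; an application layer could re-tune the weights.
weights = {"latency": 1.0, "size": 0.01, "distance": 0.5,
           "energy": 0.2, "congestion": 2.0}

def score(link):
    return sum(weights[k] * link[k] for k in weights)

links = [
    {"name": "wired",    "latency": 5, "size": 4096, "distance": 2,
     "energy": 1, "congestion": 8},
    {"name": "wireless", "latency": 9, "size": 4096, "distance": 2,
     "energy": 3, "congestion": 0},
]
best = min(links, key=score)     # lowest cost wins at this instant
print("send via", best["name"])  # -> send via wireless (congestion dominates)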
Abstract:
Methods and apparatus provide monitoring of memory access traffic in a data processing system by tracking, such as by data fabric hardware control logic, a number of cache line accesses to a page of memory associated with one or more memory devices, and producing spike indication data that indicates a spike in cache line accesses to a given page of memory. Pages are moved from a slower memory to a faster memory based on the spike indication data. In some implementations, the tracking is done by updating a cache directory with data representing the tracked number of cache line accesses.
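A minimal model of the spike-tracking idea, assuming a per-window counter per page and a fixed spike threshold; both values are invented here, and the real tracking lives in data fabric hardware (for example, in a cache directory).

# Sketch of spike detection over per-page cache line access counts.
SPIKE_THRESHOLD = 64     # accesses per sampling window that count as a spike

counts = {}              # page -> accesses in the current window

def record_access(page):
    counts[page] = counts.get(page, 0) + 1

def end_window(move_page):
    # At the end of each window, migrate spiking pages and reset counts.
    for page, n in counts.items():
        if n >= SPIKE_THRESHOLD:
            move_page(page)          # slower memory -> faster memory
    counts.clear()

for _ in range(80):
    record_access(page=0x7F000)
end_window(lambda p: print(f"migrate page {p:#x} to fast memory"))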
Abstract:
Compacted addressing for transaction layer packets, including: determining, for a first epoch, one or more low entropy address bits in a plurality of first transaction layer packets; removing, from one or more memory addresses of one or more second transaction layer packets, the one or more low entropy address bits; and sending the one or more second transaction layer packets.
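To make the compaction concrete, here is a hedged sketch assuming 16-bit addresses: bits that never vary across the first epoch's packets form a stable (low entropy) mask, and later packets send only the remaining varying bits. The width and addresses are illustrative.

def low_entropy_mask(addresses, width=16):
    """Bits that never vary across the sampled packets are low entropy."""
    varying = 0
    base = addresses[0]
    for a in addresses[1:]:
        varying |= base ^ a          # set bits that differ somewhere
    return (~varying) & ((1 << width) - 1)

def compact(addr, stable_mask, width=16):
    """Drop the stable bits, keeping only the varying ones, LSB first."""
    out, pos = 0, 0
    for i in range(width):
        if not (stable_mask >> i) & 1:
            out |= ((addr >> i) & 1) << pos
            pos += 1
    return out

epoch_one = [0xA010, 0xA020, 0xA030]        # first-epoch packet addresses
mask = low_entropy_mask(epoch_one)
print(f"stable bit mask: {mask:#06x}")      # 0xffcf: only bits 4-5 vary
print(f"0xA030 compacts to {compact(0xA030, mask):#x}")   # -> 0x3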