Abstract:
Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including and L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture system, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosure for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.
Abstract:
An apparatus is described that includes a memory controller to interface to a multi-level system memory. The memory controller includes least recently used (LRU) circuitry to keep track of least recently used cache lines kept in a higher level of the multi-level system memory. The memory controller also includes idle time predictor circuitry to predict idle times of a lower level of the multi-level system memory. The memory controller is to write one or more lesser used cache lines from the higher level of the multi-level system memory to the lower level of the multi-level system memory in response to the idle time predictor circuitry indicating that an observed idle time of the lower level of the multi-level system memory is expected to be long enough to accommodate the write of the one or more lesser used cache lines from the higher level of the multi-level system memory to the lower level of the multi-level system memory.
Abstract:
In accordance with embodiments disclosed herein, there is provided systems and methods for providing a virtual shared cache mechanism. A processing device includes a plurality of clusters allocated into a virtual private shared cache. Each of the clusters includes a plurality of cores and a plurality of cache slices co-located within the plurality of cores. The processing device also includes a virtual shared cache including the plurality of clusters such that the cache data in the plurality of cache slices is shared among the plurality of clusters.
Abstract:
Systems and methods for managing shared cache by multi-core processor. An example processing system comprises: a plurality of processing cores, each processing core communicatively coupled to a last level cache (LLC) slice; and a cache control logic coupled to the plurality of processing cores, the cache control logic configured to perform one of: making an LLC slice of an inactive processing core available to an active processing core or power gating the LLC slice, based on estimating cache requirements by active processing cores.
Abstract:
A processor includes a cache, a prefetcher module to select information according to a prefetcher algorithm, and a prefetcher algorithm selection module. The prefetcher algorithm selection module includes logic to select a candidate prefetcher algorithm determine and store memory addresses of predicted memory accesses of the candidate prefetcher algorithm when performed by the prefetcher module, determine cache lines accessed during memory operations, and evaluate whether the determined cache lines match the stored memory addresses. The prefetcher algorithm selection module further includes logic to adjust an accuracy ratio of the candidate prefetcher algorithm, compare the accuracy ratio with a threshold accuracy ratio, and determine whether to apply the first candidate prefetcher algorithm to the prefetcher module.
Abstract:
An apparatus is described. The apparatus includes a last level cache and a memory controller to interface to a multi-level system memory. The multi-level system memory has a caching level. The apparatus includes a first prediction unit to predict unneeded blocks in the last level cache. The apparatus includes a second prediction unit to predict unneeded blocks in the caching level of the multi-level system memory.
Abstract:
An embodiment of a memory apparatus may include a tag cache to cache tag information, and a memory controller communicatively coupled to the tag cache to determine if a request for a memory line results in a tag cache miss, bring tag information for the missed memory line into the tag cache if the request results in a cache miss, and bring tag information for at least one additional memory line adjacent to the missed memory line into the tag cache if the request results in a cache miss. Additional embodiments are disclosed and claimed.
Abstract:
A multi-level memory management circuit can remap data between near and far memory. In one embodiment, a register array stores near memory addresses and far memory addresses mapped to the near memory addresses. The number of entries in the register array is less than the number of pages in near memory. Remapping logic determines that a far memory address of the requested data is absent from the register array and selects an available near memory address from the register array. Remapping logic also initiates writing of the requested data at the far memory address to the selected near memory address. Remapping logic further writes the far memory address to an entry of the register array corresponding to the selected near memory address.
Abstract:
Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including and L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture system, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosure for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.
Abstract:
An apparatus is described that includes a memory controller to couple to a multi-level memory characterized by a faster higher level and a slower lower level. The memory controller having early demotion logic circuitry to demote a page from the higher level to the lower level without system software having to instruct the memory controller to demote the page and before the system software promotes another page from the lower level to the higher level.