Abstract:
Three-dimensional (3D) DRAM integrated in the same package as compute logic enable forming high-density caches. In one example, an integrated 3D DRAM includes a large on-de cache (such as a level 4 (L4) cache), a large on-die memory-side cache, or both an L4 cache and a memory-side cache. One or more tag caches cache recently accessed tags from the L4 cache, the memory-side cache, or both. A cache controller in the compute logic is to receive a request from one of the processor cores to access an address and compare tags in the tag cache with the address. In response to a hit in the tag cache, the cache controller accesses data from the cache at a location indicated by an entry in the tag cache, without performing a tag lookup in the cache.
Abstract:
A processor may include a memory controller to interface with a system memory having a near memory and a far memory. The processor may include logic circuitry to cause memory controller to determine whether a write request is generated remotely or locally, and when the write request is generated remotely to instruct the memory controller to perform a read of near memory before performing a write, when the write request is generated locally and a cache line targeted by the write request is in the inclusive state to instruct the memory controller to perform the write without performing a read of near memory, and when the write request is generated locally and the cache line targeted by the write request is in the non-inclusive state to instruct the memory controller to read near memory before performing the write.
Abstract:
In accordance with embodiments disclosed herein, there is provided systems and methods for providing a virtual shared cache mechanism. A processing device includes a plurality of clusters allocated into a virtual private shared cache. Each of the clusters includes a plurality of cores and a plurality of cache slices co-located within the plurality of cores. The processing device also includes a virtual shared cache including the plurality of clusters such that the cache data in the plurality of cache slices is shared among the plurality of clusters.
Abstract:
An apparatus and method for reducing or eliminating writeback operations. For example, one embodiment of a method comprises: detecting a first operation associated with a cache line at a first requestor cache; detecting that the cache line exists in a first cache in a modified (M) state; forwarding the cache line from the first cache to the first requestor cache and storing the cache line in the first requestor cache in a second modified (M′) state; detecting a second operation associated with the cache line at a second requestor; responsively forwarding the cache line from the first requestor cache to the second requestor cache and storing the cache line in the second requestor cache in an owned (O) state if the cache line has not been modified in the first requestor cache; and setting the cache line to a shared (S) state in the first requestor cache.
Abstract:
Methods and apparatus relating to allocation and/or write policy for a glueless area-efficient directory cache for hotly contested cache lines are described. In one embodiment, a directory cache stores data corresponding to a caching status of a cache line. The caching status of the cache line is stored for each of a plurality of caching agents in the system. An write-on-allocate policy is used for the directory cache by using a special state (e.g., snoop-all state) that indicates one or more snoops are to be broadcasted to all agents in the system. Other embodiments are also disclosed.
Abstract:
Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including and L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture system, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosure for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.
Abstract:
A processor of an aspect includes a plurality of logical processors each having one or more corresponding lower level caches. A shared higher level cache is shared by the plurality of logical processors. The shared higher level cache includes a distributed cache slice for each of the logical processors. The processor includes logic to direct an access that misses in one or more lower level caches of a corresponding logical processor to a subset of the distributed cache slices in a virtual cluster that corresponds to the logical processor. Other processors, methods, and systems are also disclosed.
Abstract:
In accordance with embodiments disclosed herein, there is provided systems and methods for providing a virtual shared cache mechanism. A processing device includes a plurality of clusters allocated into a virtual private shared cache. Each of the clusters includes a plurality of cores and a plurality of cache slices co-located within the plurality of cores. The processing device also includes a virtual shared cache including the plurality of clusters such that the cache data in the plurality of cache slices is shared among the plurality of clusters.
Abstract:
A processor of an aspect includes a plurality of logical processors each having one or more corresponding lower level caches. A shared higher level cache is shared by the plurality of logical processors. The shared higher level cache includes a distributed cache slice for each of the logical processors. The processor includes logic to direct an access that misses in one or more lower level caches of a corresponding logical processor to a subset of the distributed cache slices in a virtual cluster that corresponds to the logical processor. Other processors, methods, and systems are also disclosed.
Abstract:
Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including and L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture system, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosure for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.