Abstract:
A mechanism is described for facilitating the use of a shared local memory for register spilling and filling relating to graphics processors at computing devices. A method of embodiments, as described herein, includes reserving one or more spaces of a shared local memory (SLM) to perform one or more of spilling and filling relating to registers associated with a graphics processor of a computing device.
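As a rough illustration, the sketch below models the mechanism in C++ with hypothetical names and sizes (SlmSpillRegion, kRegBytes, kSpillSlots): a reserved SLM region is partitioned into per-thread slots, spill copies a register into the owning thread's slot, and fill copies it back. This is a minimal software model, not the hardware implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sizing for the reserved SLM spill/fill region.
constexpr std::size_t kRegBytes      = 32;   // assumed register width in bytes
constexpr std::size_t kSpillSlots    = 16;   // assumed spill slots per thread
constexpr std::size_t kThreadsPerGrp = 64;   // threads sharing the SLM

struct SlmSpillRegion {
    // Backing store standing in for the reserved SLM space.
    uint8_t bytes[kThreadsPerGrp * kSpillSlots * kRegBytes];

    uint8_t* slot(unsigned tid, unsigned s) {
        return &bytes[(tid * kSpillSlots + s) * kRegBytes];
    }
    // Spill: register contents -> this thread's reserved SLM slot.
    void spill(unsigned tid, unsigned s, const uint8_t reg[kRegBytes]) {
        std::memcpy(slot(tid, s), reg, kRegBytes);
    }
    // Fill: reserved SLM slot -> register contents.
    void fill(unsigned tid, unsigned s, uint8_t reg[kRegBytes]) {
        std::memcpy(reg, slot(tid, s), kRegBytes);
    }
};
```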
Abstract:
A mechanism is described for facilitating memory address compression at computing devices. A method of embodiments, as described herein, includes coalescing slot addresses across multiple messages received from an execution unit, where the slot addresses are coalesced in groups based on memory cacheline addresses such that the slot addresses in each group share a common memory cacheline address. The method may further include outputting the memory cacheline addresses.
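A minimal sketch of the coalescing step, assuming 64-byte cachelines; the function and constant names (coalesce, kLineShift) are hypothetical. Slot addresses are bucketed by the cacheline they fall in, and one cacheline address is output per bucket.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t kLineShift = 6;  // assumed 64-byte cachelines

// Group slot addresses from multiple messages by common cacheline address,
// then output one memory cacheline address per group.
std::vector<uint64_t> coalesce(const std::vector<uint64_t>& slotAddrs) {
    std::unordered_map<uint64_t, std::vector<uint64_t>> groups;
    for (uint64_t a : slotAddrs)
        groups[a >> kLineShift].push_back(a);  // shared cacheline -> one group

    std::vector<uint64_t> out;
    out.reserve(groups.size());
    for (const auto& g : groups)
        out.push_back(g.first << kLineShift);  // one address per cacheline
    return out;
}
```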
Abstract:
One embodiment provides for a general-purpose graphics processing unit comprising a processing array including multiple compute blocks, each compute block including multiple processing clusters, and a thread dispatch unit to dispatch threads of a workload to the multiple compute blocks based on a parallelism metric, wherein the thread dispatch unit, based on the parallelism metric, is to perform one of a first operation and a second operation, the first operation to distribute threads across the multiple compute blocks and the second operation to concentrate threads within one of the multiple compute blocks.
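The choice between the two operations can be sketched as a simple threshold test; the dispatch function, Thread type, and threshold below are hypothetical, and a real parallelism metric would be derived from the workload.

```cpp
#include <cstddef>
#include <vector>

struct Thread { int id; };

// High metric: distribute threads round-robin across compute blocks.
// Low metric: concentrate threads within one compute block.
std::vector<std::vector<Thread>> dispatch(const std::vector<Thread>& threads,
                                          std::size_t numBlocks,
                                          double parallelismMetric,
                                          double threshold = 0.5) {
    std::vector<std::vector<Thread>> blocks(numBlocks);
    if (parallelismMetric >= threshold) {
        for (std::size_t i = 0; i < threads.size(); ++i)   // first operation
            blocks[i % numBlocks].push_back(threads[i]);
    } else {
        for (const Thread& t : threads)                    // second operation
            blocks[0].push_back(t);
    }
    return blocks;
}
```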
Abstract:
One embodiment provides for a general-purpose graphics processing device comprising a general-purpose graphics processing compute block to process a workload including graphics or compute operations, a first cache memory, and a coherency module to enable the first cache memory to coherently cache data for the workload, the data stored in memory within a virtual address space, wherein the virtual address space is shared with a separate general-purpose processor including a second cache memory that is coherent with the first cache memory.
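The invariant such a coherency module enforces can be sketched as follows; the Cache structure and its invalidate-on-write policy are illustrative assumptions standing in for a real snoop- or directory-based protocol. Wiring two instances' peer pointers to each other gives both caches a coherent view of the shared virtual address space after any write.

```cpp
#include <cstdint>
#include <unordered_map>

struct Cache {
    std::unordered_map<uint64_t, uint64_t> lines;  // virtual address -> data
    Cache* peer = nullptr;                         // the other coherent cache

    void write(uint64_t vaddr, uint64_t data) {
        if (peer) peer->lines.erase(vaddr);  // invalidate the peer's copy
        lines[vaddr] = data;
    }
    bool read(uint64_t vaddr, uint64_t& data) const {
        auto it = lines.find(vaddr);
        if (it == lines.end()) return false;  // miss: would fetch from memory
        data = it->second;
        return true;
    }
};
```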
Abstract:
An apparatus to facilitate intelligent data dispatching is disclosed. The apparatus includes one or more processing units including a plurality of execution units (EUs) to execute a plurality of processing threads, collection logic to collect statistics data for threads executed at the one or more processing units during execution of an application, and dispatch logic to dispatch the threads to be executed at a subset of the plurality of EUs during a subsequent execution of the application based on the statistics data.
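A minimal sketch of how the statistics might steer a later run, assuming a hypothetical per-EU cycle count as the collected statistic: the EUs most utilized during the profiled run form the subset targeted on the subsequent run.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Pick the subset of EUs to dispatch to next time, based on profiled cycles.
// Assumes subsetSize <= cyclesPerEu.size().
std::vector<std::size_t> chooseEuSubset(const std::vector<uint64_t>& cyclesPerEu,
                                        std::size_t subsetSize) {
    std::vector<std::size_t> ids(cyclesPerEu.size());
    std::iota(ids.begin(), ids.end(), 0);
    // Keep the EUs that retired the most cycles in the profiled run.
    std::partial_sort(ids.begin(), ids.begin() + subsetSize, ids.end(),
                      [&](std::size_t a, std::size_t b) {
                          return cyclesPerEu[a] > cyclesPerEu[b];
                      });
    ids.resize(subsetSize);
    return ids;  // dispatch logic targets only these EUs on the next run
}
```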
Abstract:
A mechanism is described for facilitating dynamic runtime transformation of graphics processing commands for improved graphics performance on computing devices. A method of embodiments, as described herein, includes detecting a command stream associated with an application, where the command stream includes dispatches. The method may further include evaluating processing parameters relating to each of the dispatches, where evaluating further includes associating a first plan with one or more of the dispatches to transform the command stream into a transformed command stream. The method may further include associating, based on the first plan, a second plan with the one or more of the dispatches, where the second plan represents the transformed command stream. The method may further include executing the second plan, where execution of the second plan includes processing the transformed command stream in lieu of the command stream.
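One possible shape of the two plans, sketched with hypothetical structures (Dispatch, Stream) and a purely illustrative transformation that fuses small dispatch dimensions: the first plan marks which dispatches to transform, and the second plan is the transformed stream executed in lieu of the original.

```cpp
#include <functional>
#include <vector>

struct Dispatch { int groupsX, groupsY, groupsZ; };
using Stream = std::vector<Dispatch>;

// First plan: decide, per dispatch, whether and how to transform it.
// Second plan: the resulting transformed command stream.
Stream transform(const Stream& original) {
    Stream transformed;
    for (const Dispatch& d : original) {
        Dispatch t = d;
        if (d.groupsY > 1 && d.groupsX * d.groupsY <= 1024) {
            t.groupsX = d.groupsX * d.groupsY;  // illustrative fusion only
            t.groupsY = 1;
        }
        transformed.push_back(t);
    }
    return transformed;
}

// Execute the second plan: process the transformed stream instead.
void execute(const Stream& s, const std::function<void(const Dispatch&)>& run) {
    for (const Dispatch& d : s) run(d);
}
```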
Abstract:
A hybrid hierarchical cache is implemented at the same level in the access pipeline to get the faster access behavior of a smaller cache and, at the same time, the higher hit rate at lower power of a larger cache, in some embodiments. A split cache at the same level in the access pipeline includes two caches that work together. In the hybrid, split, low-level cache (e.g., L1), evictions are coordinated locally between the two L1 portions, and on a miss to both L1 portions, a line is allocated from the larger L2 cache into the smaller L1 portion.
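A toy model of the split L1 behavior, with assumed capacities and arbitrary victim selection: both portions are probed together, a miss in both allocates the line from L2 into the small portion, and local eviction demotes a victim from the small portion into the large one.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct HybridL1 {
    std::unordered_set<uint64_t> smallPart;  // small, fast portion
    std::unordered_set<uint64_t> largePart;  // larger, higher-hit-rate portion
    std::size_t smallCap = 64, largeCap = 512;  // assumed line capacities

    bool access(uint64_t line) {
        if (smallPart.count(line) || largePart.count(line)) return true;  // hit
        // Miss in both portions: allocate from L2 into the small portion,
        // coordinating the eviction locally between the two portions.
        if (smallPart.size() == smallCap) {
            uint64_t victim = *smallPart.begin();   // arbitrary victim choice
            smallPart.erase(smallPart.begin());
            if (largePart.size() == largeCap) largePart.erase(largePart.begin());
            largePart.insert(victim);               // demote, don't discard
        }
        smallPart.insert(line);
        return false;  // serviced from the L2 cache
    }
};
```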
Abstract:
A shared local memory data crossbar may be implemented in multiple stages. With this approach, the number of multiplexer cells can be reduced by fifty percent (50%) or more in some embodiments.
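A worked count under an assumed cost model, where an N-input mux costs roughly N-1 two-input mux cells: a flat 16x16 crossbar is compared with a two-stage version built from eight 4x4 blocks, which saves 60% of the cells in this sizing (the staged form trades some routing generality for fewer cells).

```cpp
#include <cstdio>

// Approximate cost: an N-input mux built from two-input cells needs N-1 cells.
int muxCells(int inputs, int outputs) { return outputs * (inputs - 1); }

int main() {
    int flat   = muxCells(16, 16);      // 16 muxes * 15 cells = 240 cells
    int staged = 8 * muxCells(4, 4);    // 8 blocks * 12 cells  =  96 cells
    std::printf("flat=%d staged=%d saving=%.0f%%\n",
                flat, staged, 100.0 * (flat - staged) / flat);  // ~60% fewer
    return 0;
}
```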