Abstract:
Technologies for dynamic home tile mapping are described. an address request can be received from a processing core, the processing core being associated with a home tile table, the home tile table including respective mappings of one or more directory addresses to one or more home tiles. A buffer can be scanned to identify a presence of the address within the buffer. Based on an identification of the presence of the address within the buffer, a home tile identifier corresponding to the address can be provided from the buffer.
Abstract:
Loop vectorization methods and apparatus are disclosed. An example method includes prior to executing an original loop having iterations, analyzing, via a processor, the iterations of the original loop, identifying a dependency between a first one of the iterations of the original loop and a second one of the iterations of the original loop, after identifying the dependency, vectorizing a first group of the iterations of the original loop based on the identified dependency to form a vectorization loop, and setting a dynamic adjustment value of the vectorization loop based on the identified dependency.
Abstract:
A scatter/gather technique optimizes unstructured streaming memory accesses, providing off-chip bandwidth efficiency by accessing only useful data at a fine granularity, and off-loading memory access overhead by supporting address calculation, data shuffling, and format conversion.
Abstract:
Technologies for migration of dynamic home tile mapping are described. An apparatus includes means for receiving coherence messages from other processor cores on the die, means for recording locations from which the coherence messages originate and means for determining distances between the requested home tiles and the locations from which the coherence messages originate. The apparatus includes means for determining whether an average distance between a particular home tile, whose identifier is stored in the home tile table, exceeds a threshold. When the average distance exceeds the defined threshold, the apparatus includes means for migrating the particular home tile to another location.
Abstract:
Instructions and logic provide pushing buffer copy and store functionality. Some embodiments include a first hardware thread or processing core, and a second hardware thread or processing core, a cache to store cache coherent data in a cache line for a shared memory address accessible by the second hardware thread or processing core. Responsive to decoding an instruction specifying a source data operand, said shared memory address as a destination operand, and one or more owner of said shared memory address, one or more execution units copy data from the source data operand to the cache coherent data in the cache line for said shared memory address accessible by said second hardware thread or processing core in the cache when said one or more owner includes said second hardware thread or processing core.
Abstract:
In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed.
Abstract:
Technologies for optimizing complex exponential calculations include a computing device with optimizing compiler. The compiler parses source code, optimizes the parsed representation of the source code, and generates output code. During optimization, the compiler identifies a loop in the source code including a call to the exponential function having an argument that is a loop-invariant complex number multiplied by the loop index variable. The compiler tiles the loop to generate a pair of nested loops. The compiler generates code to pre-compute the exponential function and store the resulting values in a pair of coefficient arrays. The size of each coefficient array may be equal to the square root of the number of loop iterations. The compiler applies rewrite rules to replace the exponential function call with a multiplicative expression of one element from each of the coefficient arrays. Other embodiments are described and claimed.
Abstract:
A scatter/gather technique optimizes unstructured streaming memory accesses, providing off-chip bandwidth efficiency by accessing only useful data at a fine granularity, and off-loading memory access overhead by supporting address calculation, data shuffling, and format conversion.
Abstract:
Loop vectorization methods and apparatus are disclosed. An example method includes setting a dynamic adjustment value of a vectorization loop; executing the vectorization loop to vectorize a loop by grouping iterations of the loop into one or more vectors; identifying a dependency between iterations of the loop as; and setting the dynamic adjustment value based on the identified dependency.
Abstract:
An apparatus and method for implementing a scratchpad memory within a cache using priority hints. For example, a method according to one embodiment comprises: providing a priority hint for a scratchpad memory implemented using a portion of a cache; determining a page replacement priority based on the priority hint; storing the page replacement priority in a page table entry (PTE) associated with the page; and using the page replacement priority to determine whether to evict one or more cache lines associated with the scratchpad memory from the cache.