Abstract:
Techniques and mechanisms for configuring a memory device to perform a sequence of in-memory computations. In an embodiment, a memory device includes a memory array and circuitry, coupled thereto, to perform data computations based on the data stored at the memory array. Based on instructions received at the memory device, control circuitry is configured to enable automatic performance of a sequence of operations. In another embodiment, the memory device is coupled in an in-series arrangement with other memory devices to provide a pipeline circuit architecture. The memory devices each function as a respective stage of the pipeline circuit architecture, where the stages each perform respective in-memory computations. Some or all such stages each provide a different respective layer of a neural network.
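For illustration only, the C sketch below models the in-series arrangement in software: each "device" is a pipeline stage holding one layer's weights, and data produced by one stage feeds the next. The dense-layer math, ReLU activation, and all names (stage_t, stage_compute) are assumptions for the sketch, not details of the patented hardware.

```c
#include <stddef.h>
#include <stdio.h>

/* One pipeline stage: a hypothetical in-memory compute device that applies
 * a single dense layer (weights resident in the device's array) to the data
 * arriving at its input. */
typedef struct {
    size_t in_dim, out_dim;
    const float *weights;   /* out_dim x in_dim, stored "in" the device */
} stage_t;

/* Apply one stage: y = relu(W * x). */
static void stage_compute(const stage_t *s, const float *x, float *y) {
    for (size_t i = 0; i < s->out_dim; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < s->in_dim; j++)
            acc += s->weights[i * s->in_dim + j] * x[j];
        y[i] = acc > 0.0f ? acc : 0.0f;   /* ReLU activation */
    }
}

int main(void) {
    /* Two devices chained in series, each holding one layer's weights. */
    const float w0[2 * 3] = { 1, 0, -1,  0.5f, 0.5f, 0.5f };
    const float w1[1 * 2] = { 1, -1 };
    stage_t pipeline[2] = { {3, 2, w0}, {2, 1, w1} };

    float x[3] = { 1.0f, 2.0f, 3.0f }, h[2], out[1];
    stage_compute(&pipeline[0], x, h);    /* stage 0: first layer */
    stage_compute(&pipeline[1], h, out);  /* stage 1: second layer */
    printf("pipeline output: %f\n", out[0]);
    return 0;
}
```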
Abstract:
Embodiments of systems, apparatuses, and methods for performing in a computer processor generation of a predicate mask based on vector comparison in response to a single instruction are described.
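As a rough software rendering of the described single instruction, the sketch below compares two 8-element vectors and packs the per-element results into a predicate mask. The greater-than comparison, the 8-lane width, and the name vcmp_gt_mask are illustrative assumptions, not details from the abstract.

```c
#include <stdint.h>
#include <stdio.h>

/* Emulate a vector comparison that produces a predicate mask:
 * one mask bit per element, set when the comparison holds. */
static uint8_t vcmp_gt_mask(const int32_t a[8], const int32_t b[8]) {
    uint8_t mask = 0;
    for (int i = 0; i < 8; i++)
        if (a[i] > b[i])
            mask |= (uint8_t)(1u << i);   /* set bit i when a[i] > b[i] */
    return mask;
}

int main(void) {
    int32_t a[8] = { 5, 1, 7, 0, 9, 3, 2, 8 };
    int32_t b[8] = { 4, 2, 7, -1, 1, 3, 6, 6 };
    printf("mask = 0x%02x\n", vcmp_gt_mask(a, b));  /* bits 0,3,4,7 -> 0x99 */
    return 0;
}
```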
Abstract:
A method and system to perform stream buffer management instructions in a processor. The stream buffer management instructions facilitate the creation and usage of a dedicated memory space or stream buffer of the processor in one embodiment of the invention. The dedicated memory space is a contiguous memory space and has a sequential or linear addressing scheme in one embodiment of the invention. The processor has logic to execute a stream buffer management instruction to copy data from a source memory address to a destination memory address that is specified with a desired level of the memory hierarchy.
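The sketch below is only an analogy for the described copy-with-hierarchy-hint behavior: it copies a buffer line by line while issuing a standard x86 prefetch hint so the source data is staged near a chosen cache level. The fixed 64-byte line size, the _MM_HINT_T1 choice, and the name stream_copy_l2 are assumptions; the patented instruction itself is not shown here.

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* */
#include <stddef.h>
#include <string.h>

/* Copy a block while steering the source toward an L2-like cache level. */
static void stream_copy_l2(void *dst, const void *src, size_t bytes) {
    const char *s = (const char *)src;
    char *d = (char *)dst;
    const size_t line = 64;                      /* typical cache-line size */
    for (size_t off = 0; off < bytes; off += line) {
        _mm_prefetch(s + off, _MM_HINT_T1);      /* stage source near L2 */
        size_t chunk = bytes - off < line ? bytes - off : line;
        memcpy(d + off, s + off, chunk);
    }
}

int main(void) {
    char src[4096], dst[4096];
    memset(src, 0xAB, sizeof src);
    stream_copy_l2(dst, src, sizeof src);
    return dst[100] == (char)0xAB ? 0 : 1;
}
```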
Abstract:
In an embodiment, a processor includes: a plurality of first cores to independently execute instructions, each of the plurality of first cores including a plurality of counters to store performance information; at least one second core to perform memory operations; and a power controller to receive performance information from at least some of the plurality of counters, determine a workload type executed on the processor based at least in part on the performance information, and based on the workload type dynamically migrate one or more threads from one or more of the plurality of first cores to the at least one second core for execution during a next operation interval. Other embodiments are described and claimed.
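A minimal sketch of the kind of counter-driven decision the abstract describes, assuming a misses-per-kilo-instruction heuristic and a single threshold; the counter fields, the 20 MPKI cutoff, and the classification names are all hypothetical.

```c
#include <stdio.h>

/* Per-core performance counters sampled over an operation interval. */
typedef struct {
    unsigned long long instructions;
    unsigned long long llc_misses;     /* last-level cache misses */
} core_counters_t;

typedef enum { WORKLOAD_COMPUTE_BOUND, WORKLOAD_MEMORY_BOUND } workload_t;

static workload_t classify(const core_counters_t *c) {
    /* misses-per-kilo-instruction heuristic: high MPKI => memory bound */
    double mpki = 1000.0 * (double)c->llc_misses / (double)c->instructions;
    return mpki > 20.0 ? WORKLOAD_MEMORY_BOUND : WORKLOAD_COMPUTE_BOUND;
}

int main(void) {
    core_counters_t sample = { .instructions = 1000000, .llc_misses = 45000 };
    if (classify(&sample) == WORKLOAD_MEMORY_BOUND)
        printf("next interval: migrate threads to the memory core\n");
    else
        printf("next interval: keep threads on the compute cores\n");
    return 0;
}
```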
Abstract:
According to one embodiment, a processor includes an instruction decoder to decode an instruction to read a plurality of data elements from memory, the instruction having a first operand specifying a storage location, a second operand specifying a bitmask having one or more bits, each bit corresponding to one of the data elements, and a third operand specifying a memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the instruction, to read one or more data elements speculatively, based on the bitmask specified by the second operand, from a memory location based on the memory address indicated by the third operand, and to store the one or more data elements in the storage location indicated by the first operand.
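The following scalar sketch emulates the described masked read for eight 32-bit elements: for each set bit in the bitmask an element is read from memory into the destination, and unselected lanes are never touched in memory. The element width, lane count, and zeroing of unselected destination lanes are assumptions rather than details from the claims.

```c
#include <stdint.h>
#include <stdio.h>

/* Masked load of 8 elements under a per-element bitmask. */
static void masked_load8(int32_t dst[8], uint8_t mask, const int32_t *mem) {
    for (int i = 0; i < 8; i++)
        dst[i] = (mask >> i) & 1 ? mem[i] : 0;   /* skip unselected lanes */
}

int main(void) {
    int32_t mem[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
    int32_t dst[8];
    masked_load8(dst, 0xA5, mem);                /* bits 0,2,5,7 selected */
    for (int i = 0; i < 8; i++) printf("%d ", dst[i]);
    printf("\n");
    return 0;
}
```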
Abstract:
An apparatus, system and method are described for identifying identical elements in a vector register. For example, a computer implemented method according to one embodiment comprises the operations of: reading each active element from a first vector register, each active element having a defined bit position within the first vector register; reading each element from a second vector register, each element having a defined bit position within the second vector register corresponding to a bit position of a current active element in the first vector register; reading an input mask register, the input mask register identifying active bit positions in the second vector register for which comparisons are to be made with values in the first vector register, the comparison operations comprising: comparing each active element in the second vector register with elements in the first vector register having bit positions preceding the bit position of the current active element in the second vector register; and setting a bit position in an output mask register to a true value if all of the preceding bit positions in the first vector register are equal to the value in the current active bit position in the second vector register.
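A simplified scalar rendering of the comparison the abstract describes, assuming eight lanes and plain integer equality: for each active lane of the second vector, the value is compared against the first vector's preceding lanes, and the output mask bit is set only when all of them match. The lane width, element type, and the name match_preceding are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* For each lane i active in in_mask, set bit i of the output mask when
 * second[i] equals every element of first[] at lanes preceding i. */
static uint8_t match_preceding(const int32_t first[8], const int32_t second[8],
                               uint8_t in_mask) {
    uint8_t out_mask = 0;
    for (int i = 0; i < 8; i++) {
        if (!((in_mask >> i) & 1))
            continue;                      /* inactive lane: leave bit clear */
        bool all_equal = true;
        for (int j = 0; j < i; j++)        /* lanes preceding lane i */
            if (first[j] != second[i]) { all_equal = false; break; }
        if (all_equal)
            out_mask |= (uint8_t)(1u << i);
    }
    return out_mask;
}

int main(void) {
    int32_t first[8]  = { 7, 7, 7, 3, 3, 3, 3, 3 };
    int32_t second[8] = { 7, 7, 5, 7, 3, 3, 3, 3 };
    printf("out mask = 0x%02x\n", match_preceding(first, second, 0xFF));
    return 0;
}
```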
Abstract:
In some embodiments, the invention involves utilizing a tree merge sort in a platform to minimize cache reads/writes when sorting large amounts of data. An embodiment uses blocks of pre-sorted data held in “leaf nodes” residing in memory storage. A pre-sorted block of data from each leaf node is read from memory and stored in faster cache memory. A tree merge sort is performed on the cache-resident nodes until a block of data migrates to a root node. Sorted blocks reaching the root node are written to memory storage in an output list until all pre-sorted data blocks have been moved to cache and merged upward to the root. The completed output list in memory storage is a list of the fully sorted data. Other embodiments are described and claimed.
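The block-merge step can be pictured with the small sketch below, which repeatedly takes the smallest head element across several pre-sorted leaf blocks and appends it to the output list. It omits the cache-resident merge tree and block-sized transfers the embodiment relies on; the leaf_t structure and the linear scan over leaf heads are assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const int *data;
    size_t len, pos;          /* pos = next unconsumed element in this leaf */
} leaf_t;

/* Merge several sorted leaf blocks into one sorted output list. */
static size_t kway_merge(leaf_t *leaves, size_t nleaves, int *out) {
    size_t written = 0;
    for (;;) {
        size_t best = nleaves;                 /* index of leaf with min head */
        for (size_t i = 0; i < nleaves; i++) {
            if (leaves[i].pos >= leaves[i].len) continue;   /* leaf exhausted */
            if (best == nleaves ||
                leaves[i].data[leaves[i].pos] < leaves[best].data[leaves[best].pos])
                best = i;
        }
        if (best == nleaves) return written;   /* all leaves drained */
        out[written++] = leaves[best].data[leaves[best].pos++];
    }
}

int main(void) {
    const int a[] = { 1, 4, 9 }, b[] = { 2, 3, 8 }, c[] = { 0, 5, 7 };
    leaf_t leaves[3] = { {a, 3, 0}, {b, 3, 0}, {c, 3, 0} };
    int out[9];
    size_t n = kway_merge(leaves, 3, out);
    for (size_t i = 0; i < n; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```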
Abstract:
Embodiments of systems, apparatuses, and methods for performing an align instruction in a computer processor are described. In some embodiments, the execution of an align instruction causes selected data elements of two concatenated sources to be stored in a destination.
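As a scalar picture of an align-style operation, the sketch below conceptually concatenates two 8-element sources and copies a window of eight consecutive elements, starting at a given offset, into the destination. The lane count, element ordering, and wrap-around behavior are assumptions, not the instruction's defined semantics.

```c
#include <stdint.h>
#include <stdio.h>

/* Select 8 consecutive elements from the 16-element concatenation
 * src_hi:src_lo, starting at `shift`, wrapping modulo 16. */
static void align8(int32_t dst[8], const int32_t src_lo[8],
                   const int32_t src_hi[8], unsigned shift) {
    for (int i = 0; i < 8; i++) {
        unsigned idx = (shift + (unsigned)i) & 15;           /* 0..15 */
        dst[i] = idx < 8 ? src_lo[idx] : src_hi[idx - 8];
    }
}

int main(void) {
    int32_t lo[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    int32_t hi[8] = { 8, 9, 10, 11, 12, 13, 14, 15 };
    int32_t dst[8];
    align8(dst, lo, hi, 3);            /* selects elements 3..10 */
    for (int i = 0; i < 8; i++) printf("%d ", dst[i]);
    printf("\n");
    return 0;
}
```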
Abstract:
A region or group of pixels may be textured as a unit, using a range specifier and one or more anchor pixels to define the group. In some embodiments, processing grouped pixels improves efficiency.
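One way to picture group-based texturing, purely as an assumption-laden sketch: sample the texture once at an anchor pixel and reuse that sample for every pixel in the specified range. The 4x4 group size, the stand-in texture function, and the flat-fill policy are not taken from the abstract.

```c
#include <stdio.h>

/* Stand-in procedural texture; a real renderer would sample a texture map. */
static float sample_texture(int x, int y) {
    return (float)((x * 31 + y * 17) % 256) / 255.0f;
}

/* Texture a square group of pixels as one unit: one sample at the anchor,
 * applied across the whole range. */
static void texture_group(float *frame, int width,
                          int anchor_x, int anchor_y, int range) {
    float texel = sample_texture(anchor_x, anchor_y);    /* single sample */
    for (int dy = 0; dy < range; dy++)
        for (int dx = 0; dx < range; dx++)
            frame[(anchor_y + dy) * width + (anchor_x + dx)] = texel;
}

int main(void) {
    enum { W = 8, H = 8 };
    float frame[W * H] = { 0 };
    texture_group(frame, W, 2, 2, 4);          /* 4x4 group anchored at (2,2) */
    printf("pixel (3,3) = %f\n", frame[3 * W + 3]);
    return 0;
}
```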