Abstract:
A programmable processor and method for improving the performance of processors by expanding at least two source operands, or a source and a result operand, to a width greater than the width of either the general purpose register or the data path width. The present invention provides operands which are substantially larger than the data path width of the processor by using the contents of a general purpose register to specify a memory address at which a plurality of data path widths of data can be read or written, as well as the size and shape of the operand. In addition, several instructions and apparatus for implementing these instructions are described which obtain performance advantages if the operands are not limited to the width and accessible number of general purpose registers.
Abstract:
A programmable processor and method for improving the performance of processors by expanding at least two source operands, or a source and a result operand, to a width greater than the width of either the general purpose register or the data path width. The present invention provides operands which are substantially larger than the data path width of the processor by using the contents of a general purpose register to specify a memory address at which a plurality of data path widths of data can be read or written, as well as the size and shape of the operand. In addition, several instructions and apparatus for implementing these instructions are described which obtain performance advantages if the operands are not limited to the width and accessible number of general purpose registers.
Abstract:
The present invention provides a system and method for improving the performance of general-purpose processors by implementing a functional unit that computes the product of a matrix operand with a vector operand, producing a vector result. The functional unit fully utilizes the entire resources of a 128b by 128b multiplier regardless of the operand size, as the number of elements of the matrix and vector operands increase as operand size is reduced. The unit performs both fixed-point and floating-point multiplications and additions with the highest-possible intermediate accuracy with modest resources.
Abstract:
A system and software for improving the performance of processors by incorporating an execution unit configurable to execute a plurality of instruction streams from the plurality of threads, wherein each instruction stream includes a group instruction that operates on a plurality of data elements in partitioned fields of at least one of the registers to produce a catenated result.
Abstract:
Systems and apparatuses are presented relating a programmable processor comprising an execution unit that is operable to decode and execute instructions received from an instruction path and partition data stored in registers in the register file into multiple data elements, the execution unit capable of executing a plurality of different group floating-point and group integer arithmetic operations that each arithmetically operates on multiple data elements stored registers in a register file to produce a catenated result that is returned to a register in the register file, wherein the catenated result comprises a plurality of individual results, wherein the execution unit is capable of executing group data handling operations that re-arrange data elements in different ways in response to data handling instructions.
Abstract:
Systems and apparatuses are presented relating a programmable processor comprising an execution unit that is operable to decode and execute instructions received from an instruction path and partition data stored in registers in the register file into multiple data elements, the execution unit capable of executing a plurality of different group floating-point and group integer arithmetic operations that each arithmetically operates on multiple data elements stored registers in a register file to produce a catenated result that is returned to a register in the register file, wherein the catenated result comprises a plurality of individual results, wherein the execution unit is capable of executing group data handling operations that re-arrange data elements in different ways in response to data handling instructions.
Abstract:
A programmable processor and method for improving the performance of processors by incorporating an execution unit operable to decode and execute single instructions in an instruction set comprising (a) group instructions that operate on a plurality of data elements in partitioned fields of a register to produce a catenated result, (b) aligned memory operations that move data between memory and register where the memory operand is aligned, and (c) unaligned memory operations where the memory operand is unaligned.
Abstract:
A programmable processor and method for improving the performance of processors by incorporating an execution unit configurable to execute a plurality of instruction streams from the plurality of threads, wherein each instruction stream includes a group instruction that operates on a plurality of data elements in partitioned fields of at least one of the registers to produce a catenated result.
Abstract:
A system and software for improving the performance of processors by incorporating an execution unit operable to decode and execute single instructions specifying both a mask and a register containing data, the mask comprising fields that each correspond to a field of the data contained in the register, the execution unit is operable to detect some of the fields of the mask as having a predetermined value and identifying corresponding fields of the data contained in the register as write-enabled data fields; and cause the write-enabled data fields to be written to a specified memory location.
Abstract:
A method and software for improving the performance of processors by incorporating an execution unit operable to decode and execute single instructions specifying three registers each containing a plurality of data elements, the execution unit operable to multiply the first and second registers and add the third register to produce a catenated result containing a plurality of data elements. Additional instructions provide group floating-point subtract, add, multiply, set less, and set greater equal operations. The set less and set greater equal operations produce alternatively zero or an identity element for each element of a catenated result, the result facilitating alternative selection of individual data elements using bitwise Boolean operations and without requiring conditional branch operations.