Abstract:
One embodiment provides for a graphics processing unit to accelerate machine-learning operations, the graphics processing unit comprising a multiprocessor having a single instruction, multiple thread (SIMT) architecture, the multiprocessor to execute at least one single instruction; and a first compute unit included within the multiprocessor, the at least one single instruction to cause the first compute unit to perform a two-dimensional matrix multiply and accumulate operation, wherein to perform the two-dimensional matrix multiply and accumulate operation includes to compute an intermediate product of 16-bit operands and to compute a 32-bit sum based on the intermediate product.
Abstract:
An apparatus includes a plurality of delay generators, a first plurality of flip-flop circuits, a second plurality of flip-flop circuits, and a third plurality of flip-flop circuits. The plurality of delay generators includes a data delay generator, an enable delay generator, and a reference delay generator. The first plurality of flip-flop circuits is coupled to the data delay generator to receive a delayed data input signal, and provide the delayed data input signal to a plurality of data input terminals of a memory circuit. The second plurality of flip-flop circuits is coupled to the enable delay generator to receive a delayed enable signal and provide the delayed enable signal to a plurality of enable terminals of the memory circuit. The third plurality of flip-flop circuits is coupled to an output terminal of the memory circuit. The reference delay generator provides a synchronized clock signal to the flip-flop circuits.
Abstract:
One embodiment provides a graphics processor comprising a memory controller and a graphics processing resource coupled with the memory controller. The graphics processing resource includes circuitry configured to execute an instruction to perform a matrix operation on first input including weight data and second input including input activation data, generate intermediate data based on a result of the matrix operation, quantize the intermediate data to a floating-point format determined based on a statistical distribution of first output data, and output, as second output data, quantized intermediate data in a determined floating-point format.
Abstract:
A FPMAC operation has two operands: an input operand and a weight operand. The operands may have a format of FP16, BF16, or INT8. Each operand is split into two portions. The two portions are stored in separate storage units. Then operands are transferred to register files of a PE, with each register file storing bits of an operand sequentially. The PE performs the FPMAC operation based on the operands. The PE may include an FPMAC unit configured to compute an individual partial sum of the PE. The PE may also include an FP adder to accumulate the individual partial sum with other data, such as an output from another PE or an output form another PE array. The FP adder may be fused with the FPMAC unit in a single circuit that can do speculative alignment and has separate critical paths for alignment and normalization.
Abstract:
Systems, apparatuses and methods may provide for multi-precision multiply-accumulate (MAC) technology that includes a plurality of arithmetic blocks, wherein the plurality of arithmetic blocks each contain multiple multipliers, and wherein the logic is to combine multipliers one or more of within each arithmetic block or across multiple arithmetic blocks. In one example, one or more intermediate multipliers are of a size that is less than precisions supported by arithmetic blocks containing the one or more intermediate multipliers.
Abstract:
A first packet and a first direction associated with the first packet are received. The first packet is forwarded to an output port of a plurality of output ports of the first router based on the first direction associated with the first packet. A second direction associated with the first packet is determined. The second direction is based at least on an address of the first packet. The first packet and the second direction are forwarded through the output port of the first router to a second router.
Abstract:
An apparatus may comprise a plurality of ports and a plurality of channel reservation banks. A channel reservation bank is to be associated with a port of the plurality of ports. The channel reservation bank is to comprise a plurality of channel reservation slots. The port of the plurality of ports is to comprise a plurality of circuit-switched channels through the port. The configuration of each of the plurality of circuit-switched channels to be based on information stored in a channel reservation slot of the channel reservation bank to be associated with the port.
Abstract:
Described is an apparatus (e.g., a router) which comprises: multiple ports; and a plurality of crossbar circuits arranged such that at least one crossbar circuit receives all interconnects associated with a data bit of the multiple ports and is operable to re-route signals on those interconnects.
Abstract:
A first packet and a first direction associated with the first packet are received. The first packet is forwarded to an output port of a plurality of output ports of the first router based on the first direction associated with the first packet. A second direction associated with the first packet is determined. The second direction is based at least on an address of the first packet. The first packet and the second direction are forwarded through the output port of the first router to a second router.
Abstract:
One embodiment provides for a graphics processing unit to accelerate machine-learning operations, the graphics processing unit comprising a multiprocessor having a single instruction, multiple thread (SIMT) architecture, the multiprocessor to execute at least one single instruction; and a first compute unit included within the multiprocessor, the at least one single instruction to cause the first compute unit to perform a two-dimensional matrix multiply and accumulate operation, wherein to perform the two-dimensional matrix multiply and accumulate operation includes to compute a 32-bit intermediate product of 16-bit operands and to compute a 32-bit sum based on the 32-bit intermediate product.