Apparatus and method for vector multiply and accumulate of packed bytes

    Publication Number: US11768681B2

    Publication Date: 2023-09-26

    Application Number: US15879419

    Filing Date: 2018-01-24

    Abstract: An apparatus and method for performing multiply-accumulate operations. For example, one embodiment of a processor comprises: a decoder to decode instructions; a first source register to store a first plurality of packed bytes; a second source register to store a second plurality of packed bytes; a third source register to store a plurality of packed doublewords; execution circuitry to execute a first instruction, the execution circuitry comprising: extension circuitry to sign-extend or zero-extend the first and second plurality of packed bytes to generate a first and second plurality of words corresponding to the first and second plurality of packed bytes; multiplier circuitry to multiply each of the first plurality of words with a corresponding one of the second plurality of words to generate a plurality of temporary products; adder circuitry to add at least a first set of the temporary products to generate a first temporary sum; accumulation circuitry to combine the first temporary sum with a first packed doubleword value from a first doubleword location in the third source register to generate a first accumulated doubleword result; a destination register to store the first accumulated doubleword result in the first doubleword location.
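
    A minimal NumPy sketch of the operation this abstract describes, modeling the extension, multiply, add, and accumulate stages in software. The 4-bytes-per-doubleword grouping, the sign/zero-extension defaults, and the wrap-on-overflow behavior are assumptions drawn from the abstract, not a definitive model of the hardware:

        import numpy as np

        def packed_byte_multiply_accumulate(src1, src2, src3,
                                            src1_signed=True, src2_signed=False):
            # Extension circuitry: sign-extend or zero-extend each byte to a word.
            a = src1.astype(np.int32) if src1_signed else src1.view(np.uint8).astype(np.int32)
            b = src2.astype(np.int32) if src2_signed else src2.view(np.uint8).astype(np.int32)
            # Multiplier circuitry: elementwise temporary products.
            products = a * b
            # Adder circuitry: sum each group of 4 temporary products (assumed grouping).
            sums = products.reshape(-1, 4).sum(axis=1)
            # Accumulation circuitry: add each sum to the doubleword already held
            # in the corresponding lane of the third source; wraps on overflow.
            return (src3.astype(np.int64) + sums).astype(np.int32)

        a = np.arange(-8, 8, dtype=np.int8)       # 16 packed signed bytes
        b = np.full(16, 2, dtype=np.int8)         # 16 packed unsigned bytes
        acc = np.zeros(4, dtype=np.int32)         # 4 packed doublewords
        packed_byte_multiply_accumulate(a, b, acc)  # -> [-52, -20, 12, 44]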

    Scaling half-precision floating point tensors for training deep neural networks

    Publication Number: US11468303B2

    Publication Date: 2022-10-11

    Application Number: US16526376

    Filing Date: 2019-07-30

    Abstract: A graphics processor is described that includes a multiprocessor with a single instruction, multiple thread (SIMT) architecture and hardware multithreading. The multiprocessor can execute parallel threads of instructions associated with a command stream, and includes a set of functional units to execute at least one of the parallel threads of the instructions. The set of functional units can include a mixed-precision tensor processor to perform tensor computations to generate loss data. The loss data is stored as a floating-point data type and scaled by a scaling factor to enable the data distribution of a gradient tensor generated from the loss data to be represented by a 16-bit floating-point data type.
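
    A short illustrative sketch (in NumPy, with hypothetical names) of the loss-scaling idea the abstract describes: the loss is multiplied by a scaling factor before gradients are computed, so gradient values that would underflow in a 16-bit float remain representable, and the gradients are unscaled before the weight update:

        import numpy as np

        def scaled_gradients(loss, grad_fn, scale=1024.0):
            # Scale the loss so the gradient distribution shifts into the
            # range representable by float16.
            grads_fp16 = np.float16(grad_fn(loss * scale))
            # Unscale in float32 before the weight update.
            return np.float32(grads_fp16) / scale

        # Toy gradient function: the true gradient (1e-8) underflows to zero
        # in raw float16 but survives when the loss is scaled first.
        grad_fn = lambda scaled_loss: scaled_loss * 1e-8
        np.float16(grad_fn(1.0))        # -> 0.0 (underflow)
        scaled_gradients(1.0, grad_fn)  # -> ~1e-8 (recovered)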

    Communication optimizations for distributed machine learning

    Publication Number: US20190205745A1

    Publication Date: 2019-07-04

    Application Number: US15859180

    Filing Date: 2017-12-29

    CPC classification number: G06F9/5061 G06F9/5077

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library, the instructions to cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network.
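
    A hypothetical sketch of the grouping step the abstract describes: a library routine partitions worker nodes into groups according to the communication pattern of the messages they will exchange during distributed training. The pattern names, function name, and group size below are illustrative assumptions, not the patent's actual interface:

        def create_worker_groups(worker_ids, pattern, group_size=4):
            if pattern == "allreduce":
                # Group neighboring ranks so intra-group reduction traffic
                # stays on the fastest local links.
                return [worker_ids[i:i + group_size]
                        for i in range(0, len(worker_ids), group_size)]
            if pattern == "parameter_server":
                # Pair every worker with the server rank it exchanges
                # gradient and parameter messages with.
                server, workers = worker_ids[0], worker_ids[1:]
                return [[server, w] for w in workers]
            raise ValueError(f"unknown communication pattern: {pattern}")

        create_worker_groups(list(range(8)), "allreduce")
        # -> [[0, 1, 2, 3], [4, 5, 6, 7]]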
