Scaling half-precision floating point tensors for training deep neural networks

    公开(公告)号:US11507815B2

    公开(公告)日:2022-11-22

    申请号:US17742138

    申请日:2022-05-11

    Abstract: A graphics processor is described that includes a single instruction, multiple thread (SIMT) architecture including hardware multithreading. The multiprocessor can execute parallel threads of instructions associated with a command stream, where the multiprocessor includes a set of functional units to execute at least one of the parallel threads of the instructions. The set of functional units can include a mixed precision tensor processor to perform tensor computations. The functional units can also include circuitry to analyze statistics for output values of the tensor computations, determine a target format to convert the output values, the target format determined based on the statistics for the output values and a precision associated with a second layer of the neural network, and convert the output values to the target format.

    Instructions for fused multiply-add operations with variable precision input operands

    公开(公告)号:US11321086B2

    公开(公告)日:2022-05-03

    申请号:US16735381

    申请日:2020-01-06

    Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.

    Communication optimizations for distributed machine learning

    公开(公告)号:US11270201B2

    公开(公告)日:2022-03-08

    申请号:US15859180

    申请日:2017-12-29

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library, the instructions to cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network.

    HARDWARE IMPLEMENTED POINT TO POINT COMMUNICATION PRIMITIVES FOR MACHINE LEARNING

    公开(公告)号:US20180322387A1

    公开(公告)日:2018-11-08

    申请号:US15869510

    申请日:2018-01-12

    CPC classification number: G06N3/08 G06F9/547 G06N3/04 G06N3/063

    Abstract: One embodiment provides for a system to compute and distribute data for distributed training of a neural network, the system including first memory to store a first set of instructions including a machine learning framework; a fabric interface to enable transmission and receipt of data associated with the set of trainable machine learning parameters; a first set of general-purpose processor cores to execute the first set of instructions, the first set of instructions to provide a training workflow for computation of gradients for the trainable machine learning parameters and to communicate with a second set of instructions, the second set of instructions facilitate transmission and receipt of the gradients via the fabric interface; and a graphics processor to perform compute operations associated with the training workflow to generate the gradients for the trainable machine learning parameters.

Patent Agency Ranking