Abstraction layers for scalable distributed machine learning

    Publication No.: US11094029B2

    Publication Date: 2021-08-17

    Application No.: US15482953

    Filing Date: 2017-04-10

    Abstract: One embodiment provides for a method of transmitting data between multiple compute nodes of a distributed compute system, the method comprising creating a global view of communication operations to be performed between the multiple compute nodes of the distributed compute system, the global view created using information specific to a machine learning model associated with the distributed compute system; using the global view to determine a communication cost of the communication operations; and automatically determining a number of network endpoints for use in transmitting the data between the multiple compute nodes of the distributed compute system.
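The cost-driven endpoint selection this abstract describes can be sketched as follows. This is a minimal illustration under assumed names and an assumed alpha-beta cost model (`CommOp`, `op_cost`, `choose_endpoints` are all hypothetical), not the patented implementation:

```python
# Hypothetical sketch: estimate communication cost from a "global view" of
# per-layer gradient exchanges, then pick the endpoint count that minimizes
# total cost. The alpha-beta model and all names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CommOp:
    message_bytes: int   # gradient size exchanged for one model layer
    fanout: int          # number of peer nodes involved in the exchange

def op_cost(op: CommOp, endpoints: int, latency_s=5e-6, bytes_per_s=12.5e9):
    # Alpha-beta model: a per-message setup cost for each endpoint, plus the
    # transfer time of the payload striped across the available endpoints.
    per_endpoint_bytes = op.message_bytes / endpoints
    return op.fanout * (endpoints * latency_s + per_endpoint_bytes / bytes_per_s)

def choose_endpoints(global_view, max_endpoints=8):
    # Evaluate the whole communication schedule for each candidate endpoint
    # count and keep the cheapest: more endpoints shorten transfers but add
    # per-endpoint setup cost, so the optimum is workload dependent.
    return min(range(1, max_endpoints + 1),
               key=lambda n: sum(op_cost(op, n) for op in global_view))

view = [CommOp(message_bytes=64 << 20, fanout=4),
        CommOp(message_bytes=1 << 20, fanout=4)]
print(choose_endpoints(view))
```

Because the model is machine-learning specific (one `CommOp` per layer), the schedule can be costed before any data moves, which is the "global view" idea in the abstract.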

    COMMUNICATION OPTIMIZATIONS FOR DISTRIBUTED MACHINE LEARNING

    Publication No.: US20190205745A1

    Publication Date: 2019-07-04

    Application No.: US15859180

    Filing Date: 2017-12-29

    CPC classification number: G06F9/5061 G06F9/5077

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library, the instructions to cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network.
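The grouping step described above can be sketched as a small placement routine. This is an assumed illustration (the function name, host-based heuristic, and fixed group size are not from the patent), showing how ranks that communicate most cheaply, here those sharing a host, might be grouped first:

```python
# Illustrative sketch (all names assumed): build worker-node groups from a
# communication pattern so ranks co-located on the same host, which can use
# cheap shared-memory transport, land in the same group.
from collections import defaultdict

def group_workers(ranks, hosts, group_size):
    # Bucket ranks by host, then fill fixed-size groups host by host so
    # intra-group traffic stays as local as possible.
    by_host = defaultdict(list)
    for rank, host in zip(ranks, hosts):
        by_host[host].append(rank)
    groups, current = [], []
    for host_ranks in by_host.values():
        for r in host_ranks:
            current.append(r)
            if len(current) == group_size:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups

print(group_workers(range(8), ["a", "a", "a", "a", "b", "b", "b", "b"], 4))
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A library configured this way could then run hierarchical collectives: reduce within each group over the fast local transport, then across groups over the network.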

    Communication optimizations for distributed machine learning

    Publication No.: US11270201B2

    Publication Date: 2022-03-08

    Application No.: US15859180

    Filing Date: 2017-12-29

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library, the instructions to cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network.

    HARDWARE IMPLEMENTED POINT TO POINT COMMUNICATION PRIMITIVES FOR MACHINE LEARNING

    Publication No.: US20180322387A1

    Publication Date: 2018-11-08

    Application No.: US15869510

    Filing Date: 2018-01-12

    CPC classification number: G06N3/08 G06F9/547 G06N3/04 G06N3/063

    Abstract: One embodiment provides for a system to compute and distribute data for distributed training of a neural network, the system including first memory to store a first set of instructions including a machine learning framework; a fabric interface to enable transmission and receipt of data associated with a set of trainable machine learning parameters; a first set of general-purpose processor cores to execute the first set of instructions, the first set of instructions to provide a training workflow for computation of gradients for the trainable machine learning parameters and to communicate with a second set of instructions, the second set of instructions to facilitate transmission and receipt of the gradients via the fabric interface; and a graphics processor to perform compute operations associated with the training workflow to generate the gradients for the trainable machine learning parameters.
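The split between a gradient-producing instruction set and a fabric-facing one can be sketched with a producer/consumer pair. The thread-and-queue structure below is an assumption chosen for illustration; the patent describes a hardware fabric path, which this stand-in only mimics:

```python
# Hedged sketch of the two-instruction-set split described above: a
# "framework" loop produces per-layer gradients while a separate "fabric"
# thread drains a queue and simulates transmitting each gradient.
import queue
import threading

def fabric_sender(q, sent):
    # Second instruction set: forward each gradient chunk over the fabric.
    # Here, "sending" is simulated by recording the gradient's element sum.
    while True:
        grad = q.get()
        if grad is None:        # sentinel: training workflow finished
            break
        sent.append(sum(grad))  # stand-in for a fabric write / RDMA put

def training_workflow(num_layers=4):
    q, sent = queue.Queue(), []
    t = threading.Thread(target=fabric_sender, args=(q, sent))
    t.start()
    for layer in range(num_layers):
        grad = [layer] * 8      # stand-in for a backprop gradient tensor
        q.put(grad)             # hand off so compute and comms can overlap
    q.put(None)
    t.join()
    return sent

print(training_workflow())
# → [0, 8, 16, 24]
```

Handing gradients off through a queue lets the compute side start the next layer's work while earlier gradients are still in flight, which is the overlap the abstract's two-instruction-set structure enables.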

    COMMUNICATION OPTIMIZATIONS FOR DISTRIBUTED MACHINE LEARNING

    Publication No.: US20220245454A1

    Publication Date: 2022-08-04

    Application No.: US17685462

    Filing Date: 2022-03-03

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library. The instructions cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network. The processor can transparently adjust communication paths between worker nodes based on the communication pattern.

    PARALLEL PROCESSING BASED ON INJECTION NODE BANDWIDTH

    Publication No.: US20210109888A1

    Publication Date: 2021-04-15

    Application No.: US16642483

    Filing Date: 2017-09-30

    Abstract: A technique includes performing a collective operation among multiple nodes of a parallel processing computer system using multiple parallel processing stages. The technique includes regulating an ordering of the parallel processing stages so that an initial stage of the plurality of parallel processing stages is associated with a higher node injection bandwidth than a subsequent stage of the plurality of parallel processing stages.
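The stage-ordering rule in this abstract, initial stages paired with higher node injection bandwidth than later ones, can be sketched as a simple sort. The stage names and bandwidth figures below are assumptions for illustration:

```python
# Illustrative sketch (names and numbers assumed): order the stages of a
# multi-stage collective so the stage with the highest aggregate node
# injection bandwidth runs first, as the abstract above describes.
def order_stages(stages):
    # Each stage is (name, injection_bandwidth_GBps). Running the
    # best-fed stage first keeps the early, typically larger exchanges
    # from being bottlenecked by slower injection points.
    return sorted(stages, key=lambda s: s[1], reverse=True)

stages = [("inter-rack", 12.5), ("intra-node", 50.0), ("intra-rack", 25.0)]
print([name for name, _ in order_stages(stages)])
# → ['intra-node', 'intra-rack', 'inter-rack']
```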
