-
Publication No.: US12211117B2
Publication Date: 2025-01-28
Application No.: US17849968
Filing Date: 2022-06-27
Applicant: Intel Corporation
Inventor: Dipankar Das , Karthikeyan Vaidyanathan , Srinivas Sridharan
Abstract: One embodiment provides for a method of transmitting data between multiple compute nodes of a distributed compute system, the method comprising multi-dimensionally partitioning data of a feature map across multiple nodes for distributed training of a convolutional neural network; performing a parallel convolution operation on the multiple partitions to train weight data of the neural network; and exchanging data between nodes to enable computation of halo regions, the halo regions having dependencies on data processed by a different node.
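The partitioning-with-halo-exchange idea in this abstract can be sketched in a few lines. This is an illustrative toy model, not Intel's implementation: a 1-D feature map is split across two "nodes", each node borrows the border elements it depends on from its neighbour (the halo regions), and the per-node convolutions then reproduce the single-node result.

```python
# Toy sketch of multi-node feature-map partitioning with halo exchange
# for a size-3 convolution kernel. Function names are illustrative.

def partition(feature_map, num_nodes):
    """Split a 1-D feature map into num_nodes contiguous chunks."""
    size = len(feature_map) // num_nodes
    return [feature_map[i * size:(i + 1) * size] for i in range(num_nodes)]

def exchange_halos(chunks, radius=1):
    """Attach `radius` border elements from each neighbour to every chunk."""
    padded = []
    for i, chunk in enumerate(chunks):
        left = chunks[i - 1][-radius:] if i > 0 else []
        right = chunks[i + 1][:radius] if i < len(chunks) - 1 else []
        padded.append(left + chunk + right)
    return padded

def conv1d_valid(xs, kernel):
    """'Valid' 1-D convolution (correlation) of xs with kernel."""
    k = len(kernel)
    return [sum(xs[i + j] * kernel[j] for j in range(k))
            for i in range(len(xs) - k + 1)]

feature_map = list(range(8))          # toy 8-element feature map
chunks = partition(feature_map, 2)    # two "nodes"
padded = exchange_halos(chunks)       # halo exchange between neighbours
kernel = [1, 1, 1]
distributed = sum((conv1d_valid(p, kernel) for p in padded), [])
reference = conv1d_valid(feature_map, kernel)
# distributed == reference: the halo data is exactly what each node
# needs to compute its border outputs independently.
```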
-
Publication No.: US11704565B2
Publication Date: 2023-07-18
Application No.: US17685462
Filing Date: 2022-03-03
Applicant: Intel Corporation
Inventor: Srinivas Sridharan , Karthikeyan Vaidyanathan , Dipankar Das , Chandrasekaran Sakthivel , Mikhail E. Smorkalov
IPC: G06N3/08 , G06N3/088 , G06F9/50 , G06N3/084 , G06N3/044 , G06N3/045 , G06N3/04 , G06N3/063 , G06N3/048 , G06N7/01
CPC classification number: G06N3/08 , G06F9/50 , G06F9/5061 , G06F9/5077 , G06N3/04 , G06N3/044 , G06N3/045 , G06N3/063 , G06N3/084 , G06N3/088 , G06N3/048 , G06N7/01
Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library. The instructions cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network. The processor can transparently adjust communication paths between worker nodes based on the communication pattern.
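The grouping behaviour this abstract describes — worker groups derived from a communication pattern — can be illustrated with a small sketch. The pattern names, the per-host group size, and the function shape are all assumptions for illustration, not the patented library's API.

```python
# Hypothetical sketch: partition worker ranks into communication groups
# based on the intended message pattern during distributed training.

def make_groups(num_workers, pattern, host_size=4):
    """Return lists of worker ranks grouped for `pattern`.

    'ring'         -> one group containing every rank, in ring order.
    'hierarchical' -> per-host groups of `host_size` ranks, plus one
                      group of the per-host leaders (rank 0 of each host).
    """
    ranks = list(range(num_workers))
    if pattern == "ring":
        return [ranks]
    if pattern == "hierarchical":
        hosts = [ranks[i:i + host_size]
                 for i in range(0, num_workers, host_size)]
        leaders = [h[0] for h in hosts]
        return hosts + [leaders]
    raise ValueError(f"unknown pattern: {pattern}")

ring_groups = make_groups(8, "ring")
hier_groups = make_groups(8, "hierarchical")
# ring:         [[0, 1, 2, 3, 4, 5, 6, 7]]
# hierarchical: [[0, 1, 2, 3], [4, 5, 6, 7], [0, 4]]
```

Adjusting communication paths "transparently", as the abstract puts it, would then amount to swapping the group layout without the training code changing.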
-
Publication No.: US11681529B2
Publication Date: 2023-06-20
Application No.: US17410934
Filing Date: 2021-08-24
Applicant: Intel Corporation
Inventor: Swagath Venkataramani , Dipankar Das , Sasikanth Avancha , Ashish Ranjan , Subarno Banerjee , Bharat Kaul , Anand Raghunathan
CPC classification number: G06F9/30145 , G06F9/3004 , G06F9/30043 , G06F9/30087 , G06F9/3834 , G06F9/52 , G06N3/04 , G06N3/063 , G06N3/084
Abstract: Systems, methods, and apparatuses relating to access synchronization in a shared memory are described. In one embodiment, a processor includes a decoder to decode an instruction into a decoded instruction, and an execution unit to execute the decoded instruction to: receive a first input operand of a memory address to be tracked and a second input operand of an allowed sequence of memory accesses to the memory address, and cause a block of a memory access that violates the allowed sequence of memory accesses to the memory address. In one embodiment, a circuit separate from the execution unit compares a memory address for a memory access request to one or more memory addresses in a tracking table, and blocks a memory access for the memory access request when a type of access violates a corresponding allowed sequence of memory accesses to the memory address for the memory access request.
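A small software model makes the tracking-table mechanism concrete. This is a toy stand-in for the patented hardware: each tracked address carries an allowed sequence of access types, and an access that arrives out of order is blocked.

```python
# Toy model of the access-sequence tracking table from the abstract.
# Class and method names are illustrative, not from the patent.

class TrackingTable:
    def __init__(self):
        self._allowed = {}   # address -> allowed access types, in order
        self._pos = {}       # address -> index of next expected access

    def track(self, address, allowed_sequence):
        """Start tracking `address` with a sequence like ['W', 'R']."""
        self._allowed[address] = list(allowed_sequence)
        self._pos[address] = 0

    def access(self, address, kind):
        """Return True if the access proceeds, False if it is blocked."""
        if address not in self._allowed:
            return True  # untracked addresses are unrestricted
        seq, i = self._allowed[address], self._pos[address]
        if i < len(seq) and seq[i] == kind:
            self._pos[address] += 1
            return True
        return False  # violates the allowed sequence: block

table = TrackingTable()
table.track(0x1000, ["W", "R"])           # write must precede the read
early_read = table.access(0x1000, "R")    # blocked: no write yet
write_ok = table.access(0x1000, "W")      # proceeds
late_read = table.access(0x1000, "R")     # now allowed
```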
-
Publication No.: US20230177328A1
Publication Date: 2023-06-08
Application No.: US17972832
Filing Date: 2022-10-25
Applicant: Intel Corporation
Inventor: Srinivas Sridharan , Karthikeyan Vaidyanathan , Dipankar Das
Abstract: One embodiment provides for a graphics processing unit including a fabric interface configured to transmit gradient data stored in a memory device of the graphics processing unit according to a pre-defined communication operation. The memory device is a physical memory device shared with a compute block of the graphics processing unit and the fabric interface. The fabric interface automatically transmits the gradient data stored in memory to a second distributed training node based on an address of the gradient data in the memory device.
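The address-triggered transmission described here can be modelled in a few lines. This is an assumption-laden sketch, not the actual fabric-interface behaviour: a registered gradient buffer is watched, and writes landing inside it are forwarded to a peer training node while other writes are ignored.

```python
# Illustrative model of a fabric interface that forwards writes hitting a
# registered gradient buffer to a second training node. The callback
# shape and names are assumptions for illustration.

class FabricInterface:
    def __init__(self, send_fn):
        self._send = send_fn      # transport to the peer node (assumed)
        self._ranges = []         # registered (start, length) buffers

    def register_gradient_buffer(self, start, length):
        self._ranges.append((start, length))

    def on_memory_write(self, address, value):
        """Forward the write to the peer iff it hits a gradient buffer."""
        for start, length in self._ranges:
            if start <= address < start + length:
                self._send(address, value)
                return True
        return False

sent = []
fabric = FabricInterface(lambda addr, val: sent.append((addr, val)))
fabric.register_gradient_buffer(0x4000, 16)
hit = fabric.on_memory_write(0x4004, 0.5)    # inside the buffer: sent
miss = fabric.on_memory_write(0x9000, 1.0)   # outside: ignored
```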
-
Publication No.: US11494163B2
Publication Date: 2022-11-08
Application No.: US16562979
Filing Date: 2019-09-06
Applicant: Intel Corporation
Inventor: Naveen Mellempudi , Dipankar Das , Chunhui Mei , Kristopher Wong , Dhiraj D. Kalamkar , Hong H. Jiang , Subramaniam Maiyuran , Varghese George
Abstract: An apparatus to facilitate computer number format conversion is disclosed. The apparatus comprises a control unit to receive data format information indicating a first precision data format in which input data is to be received, and converter hardware to receive the input data and convert it from the first precision data format to a second precision data format based on the data format information.
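A minimal software sketch of such a converter, assuming fp32-to-bfloat16 as the format pair (one plausible instance; the patent is format-agnostic). Conversion here truncates the low 16 mantissa bits; real hardware would typically round.

```python
# Software sketch of precision-format conversion dispatched on data
# format information, as in the abstract. fp32 -> bf16 by truncation.

import struct

def fp32_to_bf16_bits(x):
    """Return the 16-bit bfloat16 pattern for float x (truncating)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

def convert(value, src="fp32", dst="bf16"):
    """Dispatch on the data-format information (illustrative)."""
    if (src, dst) == ("fp32", "bf16"):
        return bf16_bits_to_fp32(fp32_to_bf16_bits(value))
    raise NotImplementedError(f"{src} -> {dst}")

approx = convert(3.14159)
# bf16 keeps only 8 mantissa bits, so approx == 3.140625
```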
-
Publication No.: US11488008B2
Publication Date: 2022-11-01
Application No.: US15869510
Filing Date: 2018-01-12
Applicant: Intel Corporation
Inventor: Srinivas Sridharan , Karthikeyan Vaidyanathan , Dipankar Das
Abstract: One embodiment provides for a system to compute and distribute data for distributed training of a neural network, the system including first memory to store a first set of instructions including a machine learning framework; a fabric interface to enable transmission and receipt of data associated with a set of trainable machine learning parameters; a first set of general-purpose processor cores to execute the first set of instructions, the first set of instructions to provide a training workflow for computation of gradients for the trainable machine learning parameters and to communicate with a second set of instructions, the second set of instructions to facilitate transmission and receipt of the gradients via the fabric interface; and a graphics processor to perform compute operations associated with the training workflow to generate the gradients for the trainable machine learning parameters.
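The division of labour in this abstract — a framework computes per-node gradients, a communication layer exchanges them over the fabric, then every node applies the same update — can be sketched with an in-process allreduce. Function names are illustrative assumptions.

```python
# Hedged sketch: per-node gradients are averaged across nodes (the
# communication layer's job) before an identical SGD update on each node.

def allreduce_mean(per_node_grads):
    """Average gradients elementwise across nodes."""
    n = len(per_node_grads)
    return [sum(g[i] for g in per_node_grads) / n
            for i in range(len(per_node_grads[0]))]

def sgd_step(weights, grads, lr=0.1):
    """The framework's update, applied identically on every node."""
    return [w - lr * g for w, g in zip(weights, grads)]

node_grads = [[1.0, 2.0], [3.0, 4.0]]   # gradients from two worker nodes
avg = allreduce_mean(node_grads)        # [2.0, 3.0]
weights = sgd_step([0.5, 0.5], avg)
```

Because every node sees the same averaged gradients, the replicas stay in lockstep without any explicit weight broadcast.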
-
Publication No.: US20220343174A1
Publication Date: 2022-10-27
Application No.: US17742581
Filing Date: 2022-05-12
Applicant: Intel Corporation
Inventor: Dipankar Das , Roger Gramunt , Mikhail Smelyanskiy , Jesus Corbal , Dheevatsa Mudigere , Naveen K. Mellempudi , Alexander F. Heinecke
Abstract: Described herein is a graphics processor including a processing resource with a multiplier configured to multiply input associated with an instruction at one of a first plurality of bit widths, an adder configured to add a product output from the multiplier to an accumulator value at one of a second plurality of bit widths, and circuitry to select a first bit width of the first plurality of bit widths for the multiplier and a second bit width of the second plurality of bit widths for the adder.
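Independently selectable multiplier and adder widths can be modelled with a small integer sketch. This is an assumption-heavy toy (the patent also covers floating-point widths): the product is wrapped to the multiplier's width and the accumulation to the adder's width.

```python
# Toy two's-complement model of a multiply-accumulate whose multiplier
# and adder run at independently selected bit widths.

def wrap(value, bits):
    """Truncate a two's-complement value to `bits` bits."""
    mask = (1 << bits) - 1
    v = value & mask
    return v - (1 << bits) if v >= (1 << (bits - 1)) else v

def mul_add(a, b, acc, mul_bits=8, add_bits=16):
    """acc + a*b, product at mul_bits and sum at add_bits."""
    product = wrap(a * b, mul_bits)
    return wrap(acc + product, add_bits)

exact = mul_add(5, 6, 100)      # 5*6 = 30 fits in 8 bits -> 130
overflow = mul_add(16, 16, 0)   # 16*16 = 256 wraps to 0 at 8 bits
```

A wider accumulator than multiplier, as the defaults here assume, is a common choice for keeping rounding error out of long dot-product chains.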
-
Publication No.: US11373266B2
Publication Date: 2022-06-28
Application No.: US15869551
Filing Date: 2018-01-12
Applicant: Intel Corporation
Inventor: Dipankar Das , Karthikeyan Vaidyanathan , Srinivas Sridharan
Abstract: One embodiment provides for a method of transmitting data between multiple compute nodes of a distributed compute system, the method comprising multi-dimensionally partitioning data of a feature map across multiple nodes for distributed training of a convolutional neural network; performing a parallel convolution operation on the multiple partitions to train weight data of the neural network; and exchanging data between nodes to enable computation of halo regions, the halo regions having dependencies on data processed by a different node.
-
Publication No.: US11314515B2
Publication Date: 2022-04-26
Application No.: US16724831
Filing Date: 2019-12-23
Applicant: Intel Corporation
Inventor: Supratim Pal , Sasikanth Avancha , Ishwar Bhati , Wei-Yu Chen , Dipankar Das , Ashutosh Garg , Chandra S. Gurram , Junjie Gu , Guei-Yuan Lueh , Subramaniam Maiyuran , Jorge E. Parra , Sudarshan Srinivasan , Varghese George
Abstract: Embodiments described herein provide for an instruction and associated logic to enable vector multiply-add instructions with automatic zero skipping for sparse input. One embodiment provides for a general-purpose graphics processor comprising logic to perform operations comprising fetching a hardware macro instruction having a predicate mask, a repeat count, and a set of initial operands, where the initial operands include a destination operand and multiple source operands. The hardware macro instruction is configured to perform one or more multiply/add operations on input data associated with a set of matrices.
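The zero-skipping behaviour can be made visible with a scalar expansion of the idea. This is an illustrative sketch, not the actual ISA semantics: lanes disabled by the predicate mask, or whose source operand is zero, issue no multiply at all.

```python
# Illustrative predicated vector multiply-add with automatic zero
# skipping. The multiply counter exists only to show the skipping.

def sparse_vector_madd(dst, src0, src1, predicate_mask):
    """dst[i] += src0[i] * src1[i] for enabled, nonzero lanes."""
    multiplies = 0
    out = list(dst)
    for i, (a, b) in enumerate(zip(src0, src1)):
        lane_enabled = (predicate_mask >> i) & 1
        if not lane_enabled or a == 0 or b == 0:
            continue  # zero-skip: no multiply issued for this lane
        out[i] += a * b
        multiplies += 1
    return out, multiplies

dst = [1, 1, 1, 1]
src0 = [0, 2, 0, 3]          # sparse source operand
src1 = [5, 5, 5, 5]
result, mults = sparse_vector_madd(dst, src0, src1, predicate_mask=0b1111)
# result == [1, 11, 1, 16]; only 2 of the 4 lanes multiplied
```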
-
Publication No.: US11106464B2
Publication Date: 2021-08-31
Application No.: US16317501
Filing Date: 2016-09-27
Applicant: Intel Corporation
Inventor: Swagath Venkataramani , Dipankar Das , Sasikanth Avancha , Ashish Ranjan , Subarno Banerjee , Bharat Kaul , Anand Raghunathan
Abstract: Systems, methods, and apparatuses relating to access synchronization in a shared memory are described. In one embodiment, a processor includes a decoder to decode an instruction into a decoded instruction, and an execution unit to execute the decoded instruction to: receive a first input operand of a memory address to be tracked and a second input operand of an allowed sequence of memory accesses to the memory address, and cause a block of a memory access that violates the allowed sequence of memory accesses to the memory address. In one embodiment, a circuit separate from the execution unit compares a memory address for a memory access request to one or more memory addresses in a tracking table, and blocks a memory access for the memory access request when a type of access violates a corresponding allowed sequence of memory accesses to the memory address for the memory access request.