Apparatus and method for vector multiply and accumulate of packed bytes

    Publication Number: US11768681B2

    Publication Date: 2023-09-26

    Application Number: US15879419

    Filing Date: 2018-01-24

    Abstract: An apparatus and method for performing multiply-accumulate operations. For example, one embodiment of a processor comprises: a decoder to decode instructions; a first source register to store a first plurality of packed bytes; a second source register to store a second plurality of packed bytes; a third source register to store a plurality of packed doublewords; execution circuitry to execute a first instruction, the execution circuitry comprising: extension circuitry to sign-extend or zero-extend the first and second plurality of packed bytes to generate a first and second plurality of words corresponding to the first and second plurality of packed bytes; multiplier circuitry to multiply each of the first plurality of words with a corresponding one of the second plurality of words to generate a plurality of temporary products; adder circuitry to add at least a first set of the temporary products to generate a first temporary sum; accumulation circuitry to combine the first temporary sum with a first packed doubleword value from a first doubleword location in the third source register to generate a first accumulated doubleword result; a destination register to store the first accumulated doubleword result in the first doubleword location.
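
    A minimal NumPy sketch of the operation this abstract describes, modeling the extension, multiply, add, and accumulate stages in software. The 4-bytes-per-doubleword grouping, the sign/zero-extension defaults, and the wrap-on-overflow behavior are assumptions drawn from the abstract, not a definitive model of the hardware:

        import numpy as np

        def packed_byte_multiply_accumulate(src1, src2, src3,
                                            src1_signed=True, src2_signed=False):
            # Extension circuitry: sign-extend or zero-extend each byte to a word.
            a = src1.astype(np.int32) if src1_signed else src1.view(np.uint8).astype(np.int32)
            b = src2.astype(np.int32) if src2_signed else src2.view(np.uint8).astype(np.int32)
            # Multiplier circuitry: elementwise temporary products.
            products = a * b
            # Adder circuitry: sum each group of 4 temporary products (assumed grouping).
            sums = products.reshape(-1, 4).sum(axis=1)
            # Accumulation circuitry: add each sum to the doubleword already held
            # in the corresponding lane of the third source; wraps on overflow.
            return (src3.astype(np.int64) + sums).astype(np.int32)

        a = np.arange(-8, 8, dtype=np.int8)       # 16 packed signed bytes
        b = np.full(16, 2, dtype=np.int8)         # 16 packed unsigned bytes
        acc = np.zeros(4, dtype=np.int32)         # 4 packed doublewords
        packed_byte_multiply_accumulate(a, b, acc)  # -> [-52, -20, 12, 44]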

    Scaling half-precision floating point tensors for training deep neural networks

    Publication Number: US11468303B2

    Publication Date: 2022-10-11

    Application Number: US16526376

    Filing Date: 2019-07-30

    Abstract: A graphics processor is described that includes a multiprocessor with a single instruction, multiple thread (SIMT) architecture and hardware multithreading. The multiprocessor can execute parallel threads of instructions associated with a command stream, and includes a set of functional units to execute at least one of the parallel threads of the instructions. The set of functional units can include a mixed-precision tensor processor to perform tensor computations to generate loss data. The loss data is stored as a floating-point data type and scaled by a scaling factor to enable the data distribution of a gradient tensor generated from the loss data to be represented by a 16-bit floating-point data type.
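
    A short illustrative sketch (in NumPy, with hypothetical names) of the loss-scaling idea the abstract describes: the loss is multiplied by a scaling factor before gradients are computed, so gradient values that would underflow in a 16-bit float remain representable, and the gradients are unscaled before the weight update:

        import numpy as np

        def scaled_gradients(loss, grad_fn, scale=1024.0):
            # Scale the loss so the gradient distribution shifts into the
            # range representable by float16.
            grads_fp16 = np.float16(grad_fn(loss * scale))
            # Unscale in float32 before the weight update.
            return np.float32(grads_fp16) / scale

        # Toy gradient function: the true gradient (1e-8) underflows to zero
        # in raw float16 but survives when the loss is scaled first.
        grad_fn = lambda scaled_loss: scaled_loss * 1e-8
        np.float16(grad_fn(1.0))        # -> 0.0 (underflow)
        scaled_gradients(1.0, grad_fn)  # -> ~1e-8 (recovered)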

    Communication optimizations for distributed machine learning

    Publication Number: US20190205745A1

    Publication Date: 2019-07-04

    Application Number: US15859180

    Filing Date: 2017-12-29

    CPC classification number: G06F9/5061 G06F9/5077

    Abstract: Embodiments described herein provide a system to configure distributed training of a neural network, the system comprising memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to enable transmission and receipt of configuration data associated with a set of worker nodes, the worker nodes configured to perform distributed training of the neural network; and a processor to execute instructions provided by the library, the instructions to cause the processor to create one or more groups of the worker nodes, the one or more groups of worker nodes to be created based on a communication pattern for messages to be transmitted between the worker nodes during distributed training of the neural network.
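
    A hypothetical sketch of the grouping step the abstract describes: a library routine partitions worker nodes into groups according to the communication pattern of the messages they will exchange during distributed training. The pattern names, function name, and group size below are illustrative assumptions, not the patent's actual interface:

        def create_worker_groups(worker_ids, pattern, group_size=4):
            if pattern == "allreduce":
                # Group neighboring ranks so intra-group reduction traffic
                # stays on the fastest local links.
                return [worker_ids[i:i + group_size]
                        for i in range(0, len(worker_ids), group_size)]
            if pattern == "parameter_server":
                # Pair every worker with the server rank it exchanges
                # gradient and parameter messages with.
                server, workers = worker_ids[0], worker_ids[1:]
                return [[server, w] for w in workers]
            raise ValueError(f"unknown communication pattern: {pattern}")

        create_worker_groups(list(range(8)), "allreduce")
        # -> [[0, 1, 2, 3], [4, 5, 6, 7]]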
