-
公开(公告)号:US11748606B2
公开(公告)日:2023-09-05
申请号:US17317857
申请日:2021-05-11
Applicant: INTEL CORPORATION
Inventor: Kamal Sinha , Balaji Vembu , Eriko Nurvitadhi , Nicolas C. Galoppo Von Borries , Rajkishore Barik , Tsung-Han Lin , Joydeep Ray , Ping T. Tang , Michael S. Strickland , Xiaoming Chen , Anbang Yao , Tatiana Shpeisman , Abhishek R. Appu , Altug Koker , Farshad Akhbari , Narayan Srinivasa , Feng Chen , Dukhwan Kim , Nadathur Rajagopalan Satish , John C. Weast , Mike B. MacPherson , Linda L. Hurd , Vasanth Ranganathan , Sanjeev S. Jahagirdar
IPC: G06F7/50 , G06N3/063 , G06N3/08 , G06N3/04 , G06T1/20 , G06F9/30 , G06T15/00 , G06F15/78 , G06F15/76 , G06F1/3287 , G06F1/3293 , G06N3/084 , G06N3/044 , G06N3/045 , G06T1/60
CPC classification number: G06N3/063 , G06F1/3287 , G06F1/3293 , G06F9/30014 , G06F9/30036 , G06F15/76 , G06F15/78 , G06N3/04 , G06N3/044 , G06N3/045 , G06N3/08 , G06N3/084 , G06T1/20 , G06T15/005 , G06T1/60
Abstract: In an example, an apparatus comprises a compute engine comprising a high precision component and a low precision component; and logic, at least partially including hardware logic, to receive instructions in the compute engine; select at least one of the high precision component or the low precision component to execute the instructions; and apply a gate to at least one of the high precision component or the low precision component to execute the instructions. Other embodiments are also disclosed and claimed.
-
公开(公告)号:US20230040631A1
公开(公告)日:2023-02-09
申请号:US17881720
申请日:2022-08-05
Applicant: Intel Corporation
Inventor: Eriko Nurvitadhi , Balaji Vembu , Tsung-Han Lin , Kamal Sinha , Rajkishore Barik , Nicolas C. Galoppo Von Borries
IPC: G06T1/20 , G06F9/30 , G06F9/38 , G06F12/0811 , G06F12/0815 , G06F12/0831 , G06F12/0888 , H03M7/30 , G06K9/62 , G06N20/00 , G06F12/02 , G06F9/48 , G06F17/16 , G06N3/04 , G06N3/08 , G06T1/60 , G06T15/00
Abstract: Techniques to improve performance of matrix multiply operations are described in which a compute kernel can specify one or more element-wise operations to perform on output of the compute kernel before the output is transferred to higher levels of a processor memory hierarchy.
-
73.
公开(公告)号:US20220357945A1
公开(公告)日:2022-11-10
申请号:US17834482
申请日:2022-06-07
Applicant: Intel Corporation
Inventor: Himanshu Kaul , Mark A. Anders , Sanu K. Mathew , Anbang Yao , Joydeep Ray , Ping T. Tang , Michael S. Strickland , Xiaoming Chen , Tatiana Shpeisman , Abhishek R. Appu , Altug Koker , Kamal Sinha , Balaji Vembu , Nicolas C. Galoppo Von Borries , Eriko Nurvitadhi , Rajkishore Barik , Tsung-Han Lin , Vasanth Ranganathan , Sanjeev Jahagirdar
Abstract: One embodiment provides a graphics processor comprising a memory controller and a graphics processing resource coupled with the memory controller. The graphics processing resource includes circuitry configured to execute an instruction to perform a matrix operation on first input including weight data and second input including input activation data, generate intermediate data based on a result of the matrix operation, quantize the intermediate data to a floating-point format determined based on a statistical distribution of first output data, and output, as second output data, quantized intermediate data in a determined floating-point format.
-
公开(公告)号:US20220164916A1
公开(公告)日:2022-05-26
申请号:US17541413
申请日:2021-12-03
Applicant: Intel Corporation
Inventor: Eriko Nurvitadhi , Balaji Vembu , Nicolas C. Galoppo Von Borries , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha , Nadathur Rajagopalan Satish , Jeremy Bottleson , Farshad Akhbari , Altug Koker , Narayan Srinivasa , Dukhwan Kim , Sara S. Baghsorkhi , Justin E. Gottschlich , Feng Chen , Elmoustapha Ould-Ahmed-Vall , Kevin Nealis , Xiaoming Chen , Anbang Yao
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex compute operation.
-
公开(公告)号:US20200349670A1
公开(公告)日:2020-11-05
申请号:US16930841
申请日:2020-07-16
Applicant: Intel Corporation
Inventor: Eriko Nurvitadhi , Balaji Vembu , Tsung-Han Lin , Kamal Sinha , Rajkishore Barik , Nicolas C. Galoppo Von Borries
IPC: G06T1/20 , G06F17/16 , G06T1/60 , G06F9/30 , G06K9/62 , G06F12/0888 , G06F12/0815 , H03M7/30 , G06F9/48 , G06T15/00 , G06N3/04 , G06F9/38 , G06F12/0831 , G06F12/0811 , G06N3/08
Abstract: Techniques to improve performance of matrix multiply operations are described in which a compute kernel can specify one or more element-wise operations to perform on output of the compute kernel before the output is transferred to higher levels of a processor memory hierarchy.
-
公开(公告)号:US10769748B2
公开(公告)日:2020-09-08
申请号:US16197783
申请日:2018-11-21
Applicant: Intel Corporation
Inventor: Eriko Nurvitadhi , Balaji Vembu , Nicolas C. Galoppo Von Borries , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha , Nadathur Rajagopalan Satish , Jeremy Bottleson , Farshad Akhbari , Altug Koker , Narayan Srinivasa , Dukhwan Kim , Sara S. Baghsorkhi , Justin E. Gottschlich , Feng Chen , Elmoustapha Ould-Ahmed-Vall , Kevin Nealis , Xiaoming Chen , Anbang Yao
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex machine learning compute operation.
-
公开(公告)号:US10748237B2
公开(公告)日:2020-08-18
申请号:US16185965
申请日:2018-11-09
Applicant: Intel Corporation
Inventor: Rajkishore Barik , Tatiana Shpeisman , Brian T. Lewis , Rashid Kaleem
Abstract: Generally, this disclosure provides systems, devices, methods and computer readable media for adaptive scheduling of task assignment among heterogeneous processor cores. The system may include any number of CPUs, a graphics processing unit (GPU) and memory configured to store a pool of work items to be shared by the CPUs and GPU. The system may also include a GPU proxy profiling module associated with one of the CPUs to profile execution of a first portion of the work items on the GPU. The system may further include profiling modules, each associated with one of the CPUs, to profile execution of a second portion of the work items on each of the CPUs. The measured profiling information from the CPU profiling modules and the GPU proxy profiling module is used to calculate a distribution ratio for execution of a remaining portion of the work items between the CPUs and the GPU.
-
公开(公告)号:US10521349B2
公开(公告)日:2019-12-31
申请号:US16277267
申请日:2019-02-15
Applicant: Intel Corporation
Inventor: Chandrasekaran Sakthivel , Prasoonkumar Surti , John C. Weast , Sara S. Baghsorkhi , Justin E. Gottschlich , Abhishek R. Appu , Nicolas C. Galoppo Von Borries , Joydeep Ray , Narayan Srinivasa , Feng Chen , Ben J. Ashbaugh , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha , Eriko Nurvitadhi , Balaji Vembu , Altug Koker
IPC: G06F12/0837 , G06N3/08 , G06N20/00 , G06T1/20 , G06F12/0815 , G06N3/04 , G06N3/063
Abstract: In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. Other embodiments are also disclosed and claimed.
-
公开(公告)号:US20190332903A1
公开(公告)日:2019-10-31
申请号:US16505012
申请日:2019-07-08
Applicant: Intel Corporation
Inventor: Kevin Nealis , Anbang Yao , Xiaoming Chen , Elmoustapha Ould-Ahmed-Vall , Sara S. Baghsorkhi , Eriko Nurvitadhi , Balaji Vembu , Nicolas C. Galoppo Von Borries , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha
Abstract: One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including a multi-bit input value and a bipolar binary weight associated with a neural network and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the bipolar binary weight to generate an intermediate product and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.
-
公开(公告)号:US10410098B2
公开(公告)日:2019-09-10
申请号:US15494710
申请日:2017-04-24
Applicant: Intel Corporation
Inventor: Kevin Nealis , Anbang Yao , Xiaoming Chen , Elmoustapha Ould-Ahmed-Vall , Sara S. Baghsorkhi , Eriko Nurvitadhi , Balaji Vembu , Nicolas C. Galoppo Von Borries , Rajkishore Barik , Tsung-Han Lin , Kamal Sinha
IPC: G06F9/30 , G06K9/66 , G06K9/00 , G06F9/38 , G06F9/46 , G06T1/20 , G06N3/04 , G06N3/063 , G06N3/08
Abstract: One embodiment provides for a compute apparatus to perform machine learning operations, the apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including an input value and a quantized weight value associated with a neural network and an arithmetic logic unit including a barrel shifter, an adder, and an accumulator register, wherein to execute the decoded instruction, the barrel shifter is to shift the input value by the quantized weight value to generate a shifted input value and the adder is to add the shifted input value to a value stored in the accumulator register and update the value stored in the accumulator register.
-
-
-
-
-
-
-
-
-