-
Publication No.: US20220413854A1
Publication Date: 2022-12-29
Application No.: US17358859
Filing Date: 2021-06-25
Applicant: Intel Corporation
Inventor: Joydeep Ray , Supratim Pal , Prathamesh Raghunath Shinde , Ben J. Ashbaugh , Changwon Rhee , Hong Jiang , FangWen Fu
Abstract: An apparatus to facilitate 64-bit two-dimensional (2D) block load with transpose is disclosed. The apparatus includes a processor comprising processing resources; and load store pipeline hardware circuitry coupled to the processing resources, the load store pipeline hardware circuitry to receive a 64-bit two-dimensional (2D) block load message with transpose from the processing resources. The load store pipeline hardware circuitry comprising a load store pipeline sequencer to map rows of a block of memory corresponding to the 64-bit 2D block load message with transpose to 64-bit standard load messages; and load store pipeline return circuitry to: sequentially number general register files (GRFs) used for returning elements of the block of memory accessed by the 64-bit standard load messages to the processing resources; and return, to the processing resources, the sequentially numbered GRFs in response to the 64-bit 2D block load message with transpose.
-
Publication No.: US20220413851A1
Publication Date: 2022-12-29
Application No.: US17304794
Filing Date: 2021-06-25
Applicant: Intel Corporation
Inventor: Chandra Gurram , Wei-yu Chen , Fangwen Fu , Sabareesh Ganapathy , Varghese George , Guei-Yuan Lueh , Subramaniam Maiyuran , Mike Macpherson , Supratim Pal , Jorge Parra
Abstract: A processing apparatus includes a general-purpose parallel processing engine including a set of multiple processing elements including a single precision floating-point unit, a double precision floating point unit, and an integer unit; a matrix accelerator including one or more systolic arrays; a first register file coupled with a first read control circuit, wherein the first read control circuit couples with the set of multiple processing elements and the matrix accelerator to arbitrate read requests to the first register file from the set of multiple processing elements and the matrix accelerator; and a second register file coupled with a second read control circuit, wherein the second read control circuit couples with the matrix accelerator to arbitrate read requests to the second register file from the matrix accelerator and limit access to the second register file by the set of multiple processing elements.
-
Publication No.: US20220413848A1
Publication Date: 2022-12-29
Application No.: US17358867
Filing Date: 2021-06-25
Applicant: Intel Corporation
Inventor: Supratim Pal , Li-An Tang , Changwon Rhee , Timothy R. Bauer , Alexander Lyashevsky , Jiasheng Chen
Abstract: An apparatus to facilitate large integer multiplication enhancements in a graphics environment is disclosed. The apparatus includes a processor comprising processing resources, the processing resources comprising multiplier circuitry to: receive operands for a multiplication operation, wherein the multiplication operation is part of a chain of multiplication operations for a large integer multiplication; and issue a multiply and add (MAD) instruction for the multiplication operation utilizing at least one of a double precision multiplier or a 48 bit output, wherein the MAD instruction to generate an output in a single clock cycle of the processor.
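As a rough illustration of how a large integer multiplication decomposes into a chain of multiply-and-add steps, the sketch below performs schoolbook multiplication over fixed-width limbs. It is not taken from the patent: the 32-bit limb width and the helper names `to_limbs`, `from_limbs`, `mad`, and `big_mul` are assumptions made for this example, and it models only the arithmetic, not the hardware multiplier or its timing.

```python
LIMB_BITS = 32                      # assumed limb width, chosen for illustration
MASK = (1 << LIMB_BITS) - 1


def to_limbs(n: int, count: int) -> list[int]:
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (LIMB_BITS * i)) & MASK for i in range(count)]


def from_limbs(limbs: list[int]) -> int:
    """Reassemble little-endian limbs into a single integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mad(a: int, b: int, c: int) -> int:
    """One multiply-and-add step: a * b + c (hypothetical stand-in for a MAD)."""
    return a * b + c


def big_mul(x: list[int], y: list[int]) -> list[int]:
    """Schoolbook product of two limb vectors, built from chained MAD steps."""
    result = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            # Each partial product is folded into the running result with one MAD.
            t = mad(xi, yj, result[i + j] + carry)
            result[i + j] = t & MASK
            carry = t >> LIMB_BITS
        result[i + len(y)] = carry
    return result


a = 2**100 + 12345
b = 2**90 + 678
product = from_limbs(big_mul(to_limbs(a, 4), to_limbs(b, 3)))
```

The inner loop shows why the operations form a chain: every MAD consumes the carry produced by the previous one.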
-
Publication No.: US20220365901A1
Publication Date: 2022-11-17
Application No.: US17827067
Filing Date: 2022-05-27
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Shubra Marwaha , Ashutosh Garg , Supratim Pal , Jorge Parra , Chandra Gurram , Varghese George , Darin Starkey , Guei-Yuan Lueh
IPC: G06F15/78 , G06F9/30 , G06F9/38 , G06F17/18 , G06F12/0802 , G06F7/544 , G06F7/575 , G06F12/02 , G06F12/0866 , G06F12/0875 , G06F12/0895 , G06F12/128 , G06F12/06 , G06F12/1009 , G06T1/20 , G06T1/60 , H03M7/46 , G06F12/0811 , G06F15/80 , G06F17/16 , G06F7/58 , G06F12/0871 , G06F12/0862 , G06F12/0897 , G06F9/50 , G06F12/0804 , G06F12/0882 , G06F12/0891 , G06F12/0893
Abstract: Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
-
Publication No.: US20220129266A1
Publication Date: 2022-04-28
Application No.: US17428523
Filing Date: 2020-03-14
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Shubra Marwaha , Ashutosh Garg , Supratim Pal , Jorge Parra , Chandra Gurram , Varghese George , Darin Starkey , Guei-Yuan Lueh
IPC: G06F9/30 , G06F7/544 , G06F12/02 , G06F12/0811 , G06F12/0875
Abstract: Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
-
Publication No.: US20220058158A1
Publication Date: 2022-02-24
Application No.: US17518202
Filing Date: 2021-11-03
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Jorge Parra , Supratim Pal , Chandra Gurram
IPC: G06F15/80
Abstract: An apparatus to facilitate computing efficient cross channel operations in parallel computing machines using systolic arrays is disclosed. The apparatus includes a plurality of registers and one or more processing elements communicably coupled to the plurality of registers. The one or more processing elements include a systolic array circuit to perform cross-channel operations on source data received from a single source register of the plurality of registers, wherein the systolic array circuit is modified to: receive inputs from the single source register at different stages of the systolic array circuit; perform cross-channel operations at channels of the systolic array circuit; bypass disabled channels of the systolic array circuit, the disabled channels not used to compute the cross-channel operations; and broadcast a final result of a final stage of the systolic array circuit to all channels of a destination register.
-
Publication No.: US20210349717A1
Publication Date: 2021-11-11
Application No.: US16914030
Filing Date: 2020-06-26
Applicant: Intel Corporation
Inventor: Chandra Gurram , Subramaniam Maiyuran , Supratim Pal , Saurabh Sharma , Aditya Navale
Abstract: Described herein is an accelerator device in which compaction of diverged lanes of a parallel processor is enabled to increase the efficiency of ALU utilization. One embodiment provides an accelerator device comprising a host interface, a fabric interconnect coupled with the host interface, and one or more hardware tiles coupled with the fabric interconnect, the one or more hardware tiles including a parallel processing architecture configured to enable compaction of diverged lanes.
-
Publication No.: US20210349715A1
Publication Date: 2021-11-11
Application No.: US17319056
Filing Date: 2021-05-12
Applicant: Intel Corporation
Inventor: Abhishek R. Appu , Altug Koker , Joydeep Ray , Kamal Sinha , Kiran C. Veernapu , Subramaniam Maiyuran , Prasoonkumar Surti , Guei-Yuan Lueh , David Puffer , Supratim Pal , Eric J. Hoekstra , Travis T. Schluessler , Linda L. Hurd
Abstract: In an example, an apparatus comprises a plurality of execution units, and a first general register file (GRF) communicatively coupled to the plurality of execution units, wherein the first GRF is shared by the plurality of execution units. Other embodiments are also disclosed and claimed.
-
Publication No.: US20210312697A1
Publication Date: 2021-10-07
Application No.: US17304092
Filing Date: 2021-06-14
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Shubra Marwaha , Ashutosh Garg , Supratim Pal , Jorge Parra , Chandra Gurram , Varghese George , Darin Starkey , Guei-Yuan Lueh
Abstract: Described herein is a graphics processing unit (GPU) comprising a single instruction, multiple thread (SIMT) multiprocessor comprising an instruction cache, a shared memory coupled with the instruction cache, and circuitry coupled with the shared memory and the instruction cache, the circuitry including multiple texture units, a first core including hardware to accelerate matrix operations, and a second core configured to receive an instruction having multiple operands in a bfloat16 (BF16) number format, wherein the multiple operands include a first source operand, a second source operand, and a third source operand, and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent and process the instruction, wherein to process the instruction includes to multiply the second source operand by the third source operand and add a first source operand to a result of the multiply.
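To make the BF16 layout and the multiply-add described in this abstract concrete, here is a minimal software sketch. BF16 corresponds to the upper 16 bits of an IEEE 754 binary32 value (1 sign bit, 8 exponent bits, 7 mantissa bits). Several things here are assumptions for illustration and are not stated in the abstract: the helper names `float_to_bf16_bits`, `bf16_bits_to_float`, and `bf16_mad`, the round-to-nearest-even rounding, and the choice to accumulate in Python's double precision. The sketch also expects inputs that fit in binary32.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x, using round-to-nearest-even."""
    if x != x:                       # NaN: return a quiet NaN pattern
        return 0x7FC0
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit rounds ties toward an even result.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16: int) -> float:
    """Expand a BF16 pattern to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def bf16_mad(src0: float, src1: float, src2: float) -> float:
    """dst = src0 + src1 * src2, with each operand first rounded to BF16.

    Assumption: the sum is formed in Python double precision; the
    accumulator width of the real hardware is not given in the abstract.
    """
    a = bf16_bits_to_float(float_to_bf16_bits(src0))
    b = bf16_bits_to_float(float_to_bf16_bits(src1))
    c = bf16_bits_to_float(float_to_bf16_bits(src2))
    return a + b * c


result = bf16_mad(1.0, 2.0, 3.0)
```

The operand order follows the abstract: the second and third source operands are multiplied, and the first source operand is added to that product.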
-
Publication No.: US20190265973A1
Publication Date: 2019-08-29
Application No.: US15903283
Filing Date: 2018-02-23
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Supratim Pal , Ashutosh Garg , Darin M. Starkey , Guei-Yuan Lueh , Jorge E. Parra , Shubh B. Shah , Wei-Yu Chen , Vikranth Vemulapalli , Narsim Krishna , Brent A. Schwartz , Chandra S. Gurram , Wei Pan , Ashwin J. Shivani
Abstract: Methods and apparatus relating to techniques for fusing SIMD processing units. In an example, an apparatus comprises logic, at least partially comprising hardware logic, to receive an instruction set for execution on at least two graphics processing execution units, determine whether the instruction set requires data dependent addressing, and select between a synchronized execution environment for the at least two graphics processing units and an unsynchronized execution environment for the at least two graphics processing units based at least in part on the determination whether the instruction set requires data dependent addressing. Other embodiments are also disclosed and claimed.