-
Publication No.: US12189571B2
Publication date: 2025-01-07
Application No.: US17304797
Filing date: 2021-06-25
Applicant: Intel Corporation
Inventor: Jorge Parra , Jiasheng Chen , Supratim Pal , Fangwen Fu , Sabareesh Ganapathy , Chandra Gurram , Chunhui Mei , Yue Qi
Abstract: A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.
-
32.
Publication No.: US12093213B2
Publication date: 2024-09-17
Application No.: US18310129
Filing date: 2023-05-01
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Jorge Parra , Supratim Pal , Chandra Gurram
CPC classification number: G06F15/8046 , G06F15/8007 , G06F17/16 , G06N20/00
Abstract: An apparatus to facilitate computing efficient cross channel operations in parallel computing machines using systolic arrays is disclosed. The apparatus includes a plurality of registers and one or more processing elements communicably coupled to the plurality of registers. The one or more processing elements include a systolic array circuit to perform cross-channel operations on source data received from a single source register of the plurality of registers, wherein the systolic array circuit is modified to: receive inputs from the single source register at different stages of the systolic array circuit; perform cross-channel operations at channels of the systolic array circuit; bypass disabled channels of the systolic array circuit, the disabled channels not used to compute the cross-channel operations; and broadcast a final result of a final stage of the systolic array circuit to all channels of a destination register.
-
Publication No.: US12007935B2
Publication date: 2024-06-11
Application No.: US17428523
Filing date: 2020-03-14
Applicant: INTEL CORPORATION
Inventor: Subramaniam Maiyuran , Shubra Marwaha , Ashutosh Garg , Supratim Pal , Jorge Parra , Chandra Gurram , Varghese George , Darin Starkey , Guei-Yuan Lueh
IPC: G06F9/30 , G06F7/544 , G06F7/575 , G06F7/58 , G06F9/38 , G06F9/50 , G06F12/02 , G06F12/06 , G06F12/0802 , G06F12/0804 , G06F12/0811 , G06F12/0862 , G06F12/0866 , G06F12/0871 , G06F12/0875 , G06F12/0882 , G06F12/0888 , G06F12/0891 , G06F12/0893 , G06F12/0895 , G06F12/0897 , G06F12/1009 , G06F12/128 , G06F15/78 , G06F15/80 , G06F17/16 , G06F17/18 , G06T1/20 , G06T1/60 , H03M7/46 , G06N3/08 , G06T15/06
CPC classification number: G06F15/7839 , G06F7/5443 , G06F7/575 , G06F7/588 , G06F9/3001 , G06F9/30014 , G06F9/30036 , G06F9/3004 , G06F9/30043 , G06F9/30047 , G06F9/30065 , G06F9/30079 , G06F9/3887 , G06F9/5011 , G06F9/5077 , G06F12/0215 , G06F12/0238 , G06F12/0246 , G06F12/0607 , G06F12/0802 , G06F12/0804 , G06F12/0811 , G06F12/0862 , G06F12/0866 , G06F12/0871 , G06F12/0875 , G06F12/0882 , G06F12/0888 , G06F12/0891 , G06F12/0893 , G06F12/0895 , G06F12/0897 , G06F12/1009 , G06F12/128 , G06F15/8046 , G06F17/16 , G06F17/18 , G06T1/20 , G06T1/60 , H03M7/46 , G06F9/3802 , G06F9/3818 , G06F9/3867 , G06F2212/1008 , G06F2212/1021 , G06F2212/1044 , G06F2212/302 , G06F2212/401 , G06F2212/455 , G06F2212/60 , G06N3/08 , G06T15/06
Abstract: Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
-
34.
Publication No.: US20240169021A1
Publication date: 2024-05-23
Application No.: US18056930
Filing date: 2022-11-18
Applicant: Intel Corporation
Inventor: Jorge Eduardo Parra Osorio , Supratim Pal , Fangwen Fu , Guei-Yuan Lueh , Po-Yu Chen , Jiasheng Chen
CPC classification number: G06F17/16 , G06F7/5443
Abstract: An apparatus to facilitate enhancements for accumulator usage and instruction forwarding in matrix multiply pipeline in graphics environment is disclosed. The apparatus includes matrix acceleration hardware comprising a plurality of data processing units, wherein the respective plurality of data processing units comprise: multiply-accumulate hardware to generate intermediate results of a matrix multiplication operation; intermediate accumulation hardware to store the intermediate results of the matrix multiplication operation and accumulate with other intermediate results generated by the multiply-accumulate hardware; a bypass data structure to cause a source operand to bypass the multiply-accumulate hardware; and an adder circuit to add an output from the multiply-accumulate hardware with at least one of the source operand or an output of the intermediate accumulation hardware to generate a final output.
-
35.
Publication No.: US20240168764A1
Publication date: 2024-05-23
Application No.: US18056820
Filing date: 2022-11-18
Applicant: Intel Corporation
Inventor: Supratim Pal , Jiasheng Chen , Vikranth Vemulapalli , Subramaniam Maiyuran
CPC classification number: G06F9/30014 , G06F9/3867
Abstract: An apparatus to facilitate supporting and load balancing multiple double precision pipelines in a graphics environment is disclosed. The apparatus includes a processing core having at least one processing resource comprising: a first double precision (DP) pipeline to support double float operations, the first DP pipeline comprising a first set of floating point units (FPUs) configured in a pipelined configuration to enable new instructions to be issued to the first DP pipeline before previous instructions are complete; and a second DP pipeline to support the double float operations, wherein the second DP pipeline comprises a second set of FPUs configured in a pipelined configuration to enable new instructions to be issued to the first DP pipeline before previous instructions are complete.
-
Publication No.: US20240168723A1
Publication date: 2024-05-23
Application No.: US18056822
Filing date: 2022-11-18
Applicant: Intel Corporation
Inventor: Jorge Eduardo Parra Osorio , Supratim Pal , Jiasheng Chen
Abstract: An apparatus to facilitate matrix transposition in matrix multiplication array circuitry is disclosed. The apparatus includes a processor comprising matrix acceleration hardware comprising storage buffers and an array of data processing units (DPUs), wherein the matrix acceleration hardware is to: load data for a source matrix to the storage buffers; generate a transposed matrix comprising transposed elements of the source matrix; and input the transposed matrix to the array of DPUs for a matrix multiplication operation.
-
Publication No.: US11954063B2
Publication date: 2024-04-09
Application No.: US18170900
Filing date: 2023-02-17
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Shubra Marwaha , Ashutosh Garg , Supratim Pal , Jorge Parra , Chandra Gurram , Varghese George , Darin Starkey , Guei-Yuan Lueh
IPC: G06T15/06 , G06F7/544 , G06F7/575 , G06F7/58 , G06F9/30 , G06F9/38 , G06F9/50 , G06F12/02 , G06F12/06 , G06F12/0802 , G06F12/0804 , G06F12/0811 , G06F12/0862 , G06F12/0866 , G06F12/0871 , G06F12/0875 , G06F12/0882 , G06F12/0888 , G06F12/0891 , G06F12/0893 , G06F12/0895 , G06F12/0897 , G06F12/1009 , G06F12/128 , G06F15/78 , G06F15/80 , G06F17/16 , G06F17/18 , G06T1/20 , G06T1/60 , H03M7/46 , G06N3/08
CPC classification number: G06F15/7839 , G06F7/5443 , G06F7/575 , G06F7/588 , G06F9/3001 , G06F9/30014 , G06F9/30036 , G06F9/3004 , G06F9/30043 , G06F9/30047 , G06F9/30065 , G06F9/30079 , G06F9/3887 , G06F9/5011 , G06F9/5077 , G06F12/0215 , G06F12/0238 , G06F12/0246 , G06F12/0607 , G06F12/0802 , G06F12/0804 , G06F12/0811 , G06F12/0862 , G06F12/0866 , G06F12/0871 , G06F12/0875 , G06F12/0882 , G06F12/0888 , G06F12/0891 , G06F12/0893 , G06F12/0895 , G06F12/0897 , G06F12/1009 , G06F12/128 , G06F15/8046 , G06F17/16 , G06F17/18 , G06T1/20 , G06T1/60 , H03M7/46 , G06F9/3802 , G06F9/3818 , G06F9/3867 , G06F2212/1008 , G06F2212/1021 , G06F2212/1044 , G06F2212/302 , G06F2212/401 , G06F2212/455 , G06F2212/60 , G06N3/08 , G06T15/06
Abstract: Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.
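The number format and operation sequence named in this abstract can be illustrated with a minimal Python sketch. It assumes BF16 is obtained by keeping the upper 16 bits of an IEEE 754 binary32 value (1 sign bit, 8 exponent bits, 7 fraction bits) with round-to-nearest-even. The function names, the per-element loop, and the points at which rounding is applied are illustrative assumptions, not details taken from the patent.

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (assumes x fits in float32 range)."""
    if x != x:  # NaN: return a canonical quiet NaN pattern
        return 0x7FC0
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BF16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def bf16_round(x: float) -> float:
    """Round x to the nearest BF16-representable value."""
    return bf16_bits_to_float(float_to_bf16_bits(x))

def bf16_exponent_field(b: int) -> int:
    """Extract the eight-bit exponent field from a BF16 pattern."""
    return (b >> 7) & 0xFF

def mad_relu_bf16(a, b, c):
    """Per element: multiply, add, then apply ReLU (hypothetical helper).

    Inputs and the result are rounded to BF16; where real hardware
    rounds is not stated in the abstract, so this is an assumption.
    """
    out = []
    for x, y, z in zip(a, b, c):
        acc = bf16_round(x) * bf16_round(y) + bf16_round(z)
        out.append(max(bf16_round(acc), 0.0))
    return out
```

For example, `mad_relu_bf16([1.0, 2.0], [3.0, -4.0], [0.5, 1.0])` clamps the negative second lane to zero while passing the positive first lane through unchanged.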
-
Publication No.: US11669329B2
Publication date: 2023-06-06
Application No.: US17723312
Filing date: 2022-04-18
Applicant: Intel Corporation
Inventor: Supratim Pal , Sasikanth Avancha , Ishwar Bhati , Wei-Yu Chen , Dipankar Das , Ashutosh Garg , Chandra S. Gurram , Junjie Gu , Guei-Yuan Lueh , Subramaniam Maiyuran , Jorge E. Parra , Sudarshan Srinivasan , Varghese George
CPC classification number: G06F9/3802 , G06F9/3001 , G06F9/30018 , G06F9/30145
Abstract: Embodiments described herein provide for an instruction and associated logic to enable vector multiply add instructions with automatic zero skipping for sparse input. One embodiment provides for a general-purpose graphics processor comprising logic to perform operations comprising fetching a hardware macro instruction having a predicate mask, a repeat count, and a set of initial operands, where the initial operands include a destination operand and multiple source operands. The hardware macro instruction is configured to perform one or more multiply/add operations on input data associated with a set of matrices.
-
Publication No.: US20220147316A1
Publication date: 2022-05-12
Application No.: US17477939
Filing date: 2021-09-17
Applicant: Intel Corporation
Abstract: Embodiments described herein are generally directed to an improved vector normalization instruction. An embodiment of a method includes, responsive to receipt by a GPU of a single instruction specifying a vector normalization operation to be performed on V vectors: (i) generating V squared length values, N at a time, by a first processing unit, by, for each N sets of inputs, each representing multiple component vectors for N of the vectors, performing N parallel dot product operations on the N sets of inputs; and (ii) generating V sets of outputs representing multiple normalized component vectors of the V vectors, N at a time, by a second processing unit, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implements a combination of a reciprocal square root function and a vector scaling function.
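The two-stage flow in this abstract can be sketched in plain Python as follows. This is only an illustration: the function name `normalize_vectors`, the default batch width `n=4`, and the sequential loops standing in for the N parallel lanes are assumptions, and the sketch assumes every input vector has non-zero length.

```python
import math

def normalize_vectors(vectors, n=4):
    """Normalize V vectors, processing them N at a time (hypothetical helper)."""
    normalized = []
    for start in range(0, len(vectors), n):
        batch = vectors[start:start + n]
        # Stage 1 (first processing unit in the abstract): one dot product
        # of each vector with itself gives its squared length.
        squared_lengths = [sum(c * c for c in v) for v in batch]
        # Stage 2 (second processing unit): reciprocal square root of each
        # squared length, combined with scaling of the vector's components.
        for v, sq in zip(batch, squared_lengths):
            inv_len = 1.0 / math.sqrt(sq)
            normalized.append([c * inv_len for c in v])
    return normalized
```

Calling `normalize_vectors([[3.0, 4.0, 0.0]])` yields a single vector of approximately unit length pointing in the same direction as the input.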
-
Publication No.: US11204977B2
Publication date: 2021-12-21
Application No.: US16913800
Filing date: 2020-06-26
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran , Jorge Parra , Supratim Pal , Ashutosh Garg , Shubra Marwaha , Chandra Gurram , Darin Starkey , Durgesh Borkar , Varghese George
Abstract: Described herein is an accelerator device including a host interface, a fabric interconnect coupled with the host interface, and one or more hardware tiles coupled with the fabric interconnect, the one or more hardware tiles including sparse matrix multiply acceleration hardware including a systolic array with feedback inputs.
-