Computing efficient cross channel operations in parallel computing machines using systolic arrays

    Publication number: US12093213B2

    Publication date: 2024-09-17

    Application number: US18310129

    Filing date: 2023-05-01

    CPC classification number: G06F15/8046 G06F15/8007 G06F17/16 G06N20/00

    Abstract: An apparatus to facilitate computing efficient cross channel operations in parallel computing machines using systolic arrays is disclosed. The apparatus includes a plurality of registers and one or more processing elements communicably coupled to the plurality of registers. The one or more processing elements include a systolic array circuit to perform cross-channel operations on source data received from a single source register of the plurality of registers, wherein the systolic array circuit is modified to: receive inputs from the single source register at different stages of the systolic array circuit; perform cross-channel operations at channels of the systolic array circuit; bypass disabled channels of the systolic array circuit, the disabled channels not used to compute the cross-channel operations; and broadcast a final result of a final stage of the systolic array circuit to all channels of a destination register.
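The data flow in this abstract can be modeled in software as a chain of stages, one per channel. This is a minimal sketch and not the patented circuit: the function name `cross_channel_reduce`, the boolean `enabled` mask, and the choice of a generic binary operator are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def cross_channel_reduce(
    src: Sequence[float],
    enabled: Sequence[bool],
    op: Callable[[float, float], float],
    identity: float,
) -> List[float]:
    """Software model of a cross-channel reduction (hypothetical helper).

    Each stage consumes one channel of the single source register.
    Disabled channels are bypassed, and the result of the final stage
    is broadcast to every channel of the destination register.
    """
    if len(src) != len(enabled):
        raise ValueError("src and enabled must have the same number of channels")

    acc = identity
    for value, is_enabled in zip(src, enabled):
        if not is_enabled:
            continue  # bypass: this stage passes the running value through unchanged
        acc = op(acc, value)

    # Broadcast the final-stage result to all destination channels.
    return [acc] * len(src)


if __name__ == "__main__":
    # Sum of channels 0, 2 and 3; channel 1 is disabled.
    print(cross_channel_reduce([1.0, 2.0, 3.0, 4.0],
                               [True, False, True, True],
                               lambda a, b: a + b, 0.0))
```

The same model covers other cross-channel operations (for example a maximum) by swapping `op` and `identity`.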

    ENHANCEMENTS FOR ACCUMULATOR USAGE AND INSTRUCTION FORWARDING IN MATRIX MULTIPLY PIPELINE IN GRAPHICS ENVIRONMENT

    Publication number: US20240169021A1

    Publication date: 2024-05-23

    Application number: US18056930

    Filing date: 2022-11-18

    CPC classification number: G06F17/16 G06F7/5443

    Abstract: An apparatus to facilitate enhancements for accumulator usage and instruction forwarding in matrix multiply pipeline in graphics environment is disclosed. The apparatus includes matrix acceleration hardware comprising a plurality of data processing units, wherein the respective plurality of data processing units comprise: multiply-accumulate hardware to generate intermediate results of a matrix multiplication operation; intermediate accumulation hardware to store the intermediate results of the matrix multiplication operation and accumulate with other intermediate results generated by the multiply-accumulate hardware; a bypass data structure to cause a source operand to bypass the multiply-accumulate hardware; and an adder circuit to add an output from the multiply-accumulate hardware with at least one of the source operand or an output of the intermediate accumulation hardware to generate a final output.
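The four components named in this abstract can be illustrated with a small behavioral model. It is a sketch under stated assumptions: the class `DataProcessingUnit`, the `step` method, and the `last` flag that marks the final pass are hypothetical names, and the model adds both the accumulator and the bypassed operand when both are present.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DataProcessingUnit:
    """Behavioral model of one data processing unit (hypothetical)."""

    accumulator: float = 0.0  # intermediate accumulation storage

    @staticmethod
    def mac(a: Sequence[float], b: Sequence[float]) -> float:
        """Multiply-accumulate: dot product of two operand slices."""
        return sum(x * y for x, y in zip(a, b))

    def step(
        self,
        a: Sequence[float],
        b: Sequence[float],
        src0: Optional[float] = None,
        last: bool = False,
    ) -> Optional[float]:
        """Run one pass; return the final output only on the last pass."""
        product = self.mac(a, b)
        if not last:
            # Keep the partial sum locally instead of writing it back out.
            self.accumulator += product
            return None

        # Final adder: MAC output plus the accumulated partial sums, plus
        # the source operand that bypassed the MAC hardware (if supplied).
        result = product + self.accumulator
        if src0 is not None:
            result += src0
        self.accumulator = 0.0
        return result


if __name__ == "__main__":
    dpu = DataProcessingUnit()
    dpu.step([1.0, 2.0], [3.0, 4.0])                          # partial sum 11.0
    print(dpu.step([1.0, 1.0], [1.0, 1.0], src0=10.0, last=True))
```

Holding partial sums in the unit means the source operand only has to be read once, at the final addition.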

    SUPPORTING AND LOAD BALANCING MULTIPLE DOUBLE PRECISION PIPELINES IN A GRAPHICS ENVIRONMENT

    Publication number: US20240168764A1

    Publication date: 2024-05-23

    Application number: US18056820

    Filing date: 2022-11-18

    CPC classification number: G06F9/30014 G06F9/3867

    Abstract: An apparatus to facilitate supporting and load balancing multiple double precision pipelines in a graphics environment is disclosed. The apparatus includes a processing core having at least one processing resource comprising: a first double precision (DP) pipeline to support double float operations, the first DP pipeline comprising a first set of floating point units (FPUs) configured in a pipelined configuration to enable new instructions to be issued to the first DP pipeline before previous instructions are complete; and a second DP pipeline to support the double float operations, wherein the second DP pipeline comprises a second set of FPUs configured in a pipelined configuration to enable new instructions to be issued to the second DP pipeline before previous instructions are complete.

    MATRIX TRANSPOSITION IN MATRIX MULTIPLICATION ARRAY CIRCUITRY

    Publication number: US20240168723A1

    Publication date: 2024-05-23

    Application number: US18056822

    Filing date: 2022-11-18

    CPC classification number: G06F7/78 G06F17/16

    Abstract: An apparatus to facilitate matrix transposition in matrix multiplication array circuitry is disclosed. The apparatus includes a processor comprising matrix acceleration hardware comprising storage buffers and an array of data processing units (DPUs), wherein the matrix acceleration hardware is to: load data for a source matrix to the storage buffers; generate a transposed matrix comprising transposed elements of the source matrix; and input the transposed matrix to the array of DPUs for a matrix multiplication operation.
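The load, transpose, multiply sequence can be written out in plain Python. This is only a functional sketch: the helper names are hypothetical, nested lists stand in for the storage buffers, and an ordinary triple loop stands in for the DPU array.

```python
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    """Return a new matrix whose rows are the columns of m."""
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; stands in for the DPU array."""
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matmul_with_transposed_source(source: Matrix, other: Matrix) -> Matrix:
    """Transpose the buffered source matrix, then feed it to the multiply."""
    buffered = [row[:] for row in source]  # "load" into a storage buffer
    return matmul(transpose(buffered), other)


if __name__ == "__main__":
    # source is 1x3, so its transpose is 3x1; other is 1x2; the result is 3x2.
    print(matmul_with_transposed_source([[1.0, 2.0, 3.0]], [[4.0, 5.0]]))
```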

    USE OF A SINGLE INSTRUCTION SET ARCHITECTURE (ISA) INSTRUCTION FOR VECTOR NORMALIZATION

    Publication number: US20220147316A1

    Publication date: 2022-05-12

    Application number: US17477939

    Filing date: 2021-09-17

    Abstract: Embodiments described herein are generally directed to an improved vector normalization instruction. An embodiment of a method includes, responsive to receipt by a GPU of a single instruction specifying a vector normalization operation to be performed on V vectors: (i) generating V squared length values, N at a time, by a first processing unit, by, for each N sets of inputs, each representing multiple component vectors for N of the vectors, performing N parallel dot product operations on the N sets of inputs; and (ii) generating V sets of outputs representing multiple normalized component vectors of the V vectors, N at a time, by a second processing unit, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implements a combination of a reciprocal square root function and a vector scaling function.
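The two phases of the normalization can be sketched as two batched loops. This is an illustrative model only: `normalize_vectors` and its parameter `n` are hypothetical names, the batches run sequentially rather than on two hardware units, and zero-length vectors are not handled.

```python
import math
from typing import List, Sequence


def normalize_vectors(vectors: Sequence[Sequence[float]], n: int) -> List[List[float]]:
    """Normalize V vectors, processing N of them per batch (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")

    # Phase 1 (first processing unit): squared lengths, N at a time.
    # Each squared length is the dot product of a vector with itself.
    squared_lengths: List[float] = []
    for start in range(0, len(vectors), n):
        batch = vectors[start:start + n]
        squared_lengths.extend(sum(c * c for c in v) for v in batch)

    # Phase 2 (second processing unit): reciprocal square root combined
    # with vector scaling, again N at a time.
    normalized: List[List[float]] = []
    for start in range(0, len(vectors), n):
        batch = vectors[start:start + n]
        lengths = squared_lengths[start:start + n]
        for v, sq in zip(batch, lengths):
            inv_len = 1.0 / math.sqrt(sq)  # raises ZeroDivisionError for a zero vector
            normalized.append([c * inv_len for c in v])
    return normalized


if __name__ == "__main__":
    for row in normalize_vectors([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]], n=2):
        print(row)
```

Splitting the work this way mirrors the abstract's division between a dot-product stage and a reciprocal-square-root-and-scale stage.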
