-
1.
Publication number: US12229554B2
Publication date: 2025-02-18
Application number: US17463405
Application date: 2021-08-31
Applicant: Intel Corporation
Inventor: Alexander Heinecke , Menachem Adelman , Robert Valentine , Zeev Sperber , Amit Gradstein , Mark Charney , Evangelos Georganas , Dhiraj Kalamkar , Christopher Hughes , Cristina Anderson
Abstract: Techniques for performing BF16 FMA in response to an instruction are described. In some examples, an instruction has fields for an opcode, an identification of a location of a packed data source/destination operand (a first source), an identification of a location of a second packed data source operand, an identification of a location of a third packed data source operand, and an identification of a location of the packed data source/destination operand, wherein the opcode is to indicate operand ordering and that execution circuitry is to, per data element position, perform a BF16 value fused multiply-accumulate operation using the first, second, and third source operands and store a result in a corresponding data element position of the source/destination operand.
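As a rough illustration of the per-element behavior described above, a minimal Python sketch might model BF16 as FP32 truncated to its upper 16 bits and assume the multiply-accumulate is evaluated at FP32 precision before being rounded back; the names, the truncating round, and the operand ordering are illustrative assumptions, not the architectural definition:

import numpy as np

def bf16_trunc(x):
    # Model a BF16 value by keeping only the upper 16 bits of its FP32 encoding.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def bf16_fma(src1, src2, src3):
    # Per data element position: src1*src2 + src3, accumulated in FP32 (assumed).
    a, b, c = bf16_trunc(src1), bf16_trunc(src2), bf16_trunc(src3)
    return bf16_trunc(a * b + c)

src1 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
src2 = np.array([0.5, 0.25, 4.0], dtype=np.float32)
src3 = np.array([1.0, 1.0, 1.0], dtype=np.float32)
print(bf16_fma(src1, src2, src3))   # [ 1.5  1.5 13. ]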
-
2.
Publication number: US12204903B2
Publication date: 2025-01-21
Application number: US17359522
Application date: 2021-06-26
Applicant: Intel Corporation
Inventor: Venkateswara Madduri , Cristina Anderson , Robert Valentine , Mark Charney , Vedvyas Shanbhogue
IPC: G06F9/30
Abstract: Techniques for matrix multiplication are described. In some examples, a single instruction having a format of fields for an opcode, one or more fields to indicate a location of a source/destination operand, one or more fields to indicate a location of a first source operand, and one or more fields to indicate a location of a second source operand is used, wherein the opcode is to indicate that execution circuitry is to: multiply values from corresponding data elements of the first and second sources, add a first subset of the multiplied values to a first value from the source/destination operand and store the result in a first data element position of the source/destination operand, and add a second subset of the multiplied values to a second value from the source/destination operand and store the result in a second data element position of the source/destination operand.
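A minimal Python sketch of the described data flow, assuming (this is not stated in the abstract) that the two "subsets" are the lower and upper halves of the element-wise products and that only the first two destination element positions are written:

import numpy as np

def multiply_accumulate_subsets(srcdst, src1, src2):
    # Element-wise products of the two sources.
    products = src1.astype(np.int64) * src2.astype(np.int64)
    half = len(products) // 2
    out = srcdst.copy()
    out[0] = srcdst[0] + products[:half].sum()   # first subset into element 0
    out[1] = srcdst[1] + products[half:].sum()   # second subset into element 1
    return out

srcdst = np.array([10, 20], dtype=np.int64)
src1 = np.array([1, 2, 3, 4], dtype=np.int64)
src2 = np.array([5, 6, 7, 8], dtype=np.int64)
print(multiply_accumulate_subsets(srcdst, src1, src2))   # [27 73]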
-
3.
Publication number: US12204898B2
Publication date: 2025-01-21
Application number: US18240287
Application date: 2023-08-30
Applicant: Intel Corporation
Inventor: Edward T. Grochowski , Asit K. Mishra , Robert Valentine , Mark J. Charney , Simon C. Steely, Jr.
Abstract: A processor of an aspect includes a decode unit to decode a matrix multiplication instruction. The matrix multiplication instruction is to indicate a first memory location of a first source matrix, is to indicate a second memory location of a second source matrix, and is to indicate a third memory location where a result matrix is to be stored. The processor also includes an execution unit coupled with the decode unit. The execution unit, in response to the matrix multiplication instruction, is to multiply a portion of the first and second source matrices prior to an interruption, and store a completion progress indicator in response to the interruption. The completion progress indicator is to indicate an amount of progress in multiplying the first and second source matrices, and storing corresponding result data to the third memory location, that is to have been completed prior to the interruption.
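The interrupt-and-resume behavior can be sketched in Python as a resumable loop; the row-at-a-time granularity and the way the progress indicator is represented here are assumptions for illustration:

import numpy as np

def interruptible_matmul(a, b, result, progress=0, interrupt_at=None):
    # Multiply rows of `a` by `b`, starting from the saved progress indicator.
    rows = a.shape[0]
    for i in range(progress, rows):
        if interrupt_at is not None and i == interrupt_at:
            return result, i          # interruption: partial result plus progress indicator
        result[i, :] = a[i, :] @ b
    return result, rows               # completed

a = np.arange(6.0).reshape(3, 2)
b = np.ones((2, 4))
c = np.zeros((3, 4))
c, done = interruptible_matmul(a, b, c, interrupt_at=2)      # "interrupted" after two rows
c, done = interruptible_matmul(a, b, c, progress=done)       # resume and finish
print(done, c[2])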
-
4.
Publication number: US12124846B2
Publication date: 2024-10-22
Application number: US18456699
Application date: 2023-08-28
Applicant: INTEL CORPORATION
Inventor: Robert Valentine , Galina Ryvchin , Piotr Majcher , Mark J. Charney , Elmoustapha Ould-Ahmed-Vall , Jesus Corbal , Milind B. Girkar , Zeev Sperber , Simon Rubanovich , Amit Gradstein
CPC classification number: G06F9/30014 , G06F7/5443 , G06F9/30018 , G06F9/30036 , G06F9/30105 , G06F9/3818
Abstract: Embodiments of systems, apparatuses, and methods for fused multiply add. In some embodiments, a decoder decodes a single instruction having an opcode, a destination field representing a destination operand, and fields for a first, second, and third packed data source operand, wherein the packed data elements of the first and second packed data source operands are of a first size that differs from the second size of the packed data elements of the third packed data source operand. Execution circuitry then executes the decoded single instruction to perform, for each packed data element position of the destination operand, a multiplication of M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, an addition of the results of these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in the destination packed data element position corresponding to that packed data element position of the third packed data source, wherein M is equal to the size of the full-sized packed data element divided by N.
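One concrete instance of this pattern (an illustrative assumption, not taken from the abstract) is N = 8 bits with a 32-bit third source, so M = 32/8 = 4 products are accumulated into each full-sized element; a Python sketch:

import numpy as np

def dot_accumulate_8to32(src1, src2, src3):
    # Multiply corresponding 8-bit elements, then sum each group of four
    # products into the matching 32-bit element of the third source.
    products = src1.astype(np.int32) * src2.astype(np.int32)
    return src3 + products.reshape(-1, 4).sum(axis=1, dtype=np.int32)

src1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.int8)
src2 = np.array([1, 1, 1, 1, 2, 2, 2, 2], dtype=np.int8)
src3 = np.array([100, 200], dtype=np.int32)
print(dot_accumulate_8to32(src1, src2, src3))   # [110 252]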
-
5.
Publication number: US12112178B2
Publication date: 2024-10-08
Application number: US17134322
Application date: 2020-12-26
Applicant: Intel Corporation
Inventor: Abhimanyu Kanaiya Varde , Karuna Ramkumar , Robert Valentine
CPC classification number: G06F9/44505 , G06F9/30098 , G06F9/30145
Abstract: Systems or methods of the present disclosure may provide an initialization technique that enables the initialization of multiple states in an efficient manner. The initialization technique includes a register to track usage of state components of the processor and a decode unit to decode a state initialization instruction. The state initialization instruction indicates which of the state components are to be initialized. The initialization technique also includes an execution unit coupled with the decode unit. The execution unit, in response to the state initialization instruction, is to initialize the state components without reading another state component from memory as part of the initialization.
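A hypothetical software model of this behavior, sketched in Python; the component names, the bitmask encoding, and the "dirty"/"init" states below are illustrative, not taken from the patent:

class StateInitModel:
    COMPONENTS = ["x87", "sse", "avx", "mask"]           # illustrative component names

    def __init__(self):
        self.state = {c: "dirty" for c in self.COMPONENTS}
        self.in_use = (1 << len(self.COMPONENTS)) - 1    # usage-tracking register

    def init_components(self, mask):
        # Initialize the selected components to architectural defaults
        # without reading a saved state image from memory.
        for bit, name in enumerate(self.COMPONENTS):
            if mask & (1 << bit):
                self.state[name] = "init"
                self.in_use &= ~(1 << bit)                # mark component as not in use

m = StateInitModel()
m.init_components(0b0110)        # initialize only the "sse" and "avx" components
print(m.state, bin(m.in_use))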
-
6.
Publication number: US20240329938A1
Publication date: 2024-10-03
Application number: US18607024
Application date: 2024-03-15
Applicant: Intel Corporation
Inventor: Menachem Adelman , Robert Valentine , Barukh Ziv , Amit Gradstein , Simon Rubanovich , Zeev Sperber , Mark J. Charney , Christopher J. Hughes , Alexander F. Heinecke , Evangelos Georganas , Binh Pham
CPC classification number: G06F7/78 , G06F9/3001 , G06F9/3016 , G06F17/16
Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, a processor includes a decoder and execution circuitry. The decoder is to decode an instruction having a format including an opcode field to specify an opcode, a first destination operand field to specify a destination matrix location, a first source operand field to specify a first source matrix location, and a second source operand field to specify a second source matrix location. The execution circuitry is to, in response to the decoded instruction, transpose the first source matrix to generate a transposed first source matrix, perform a matrix multiplication using the transposed first source matrix and the second source matrix to generate a result, and store the result in a destination matrix location.
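The net effect of the described operation is straightforward to express with NumPy; the tile shapes below are arbitrary examples:

import numpy as np

def transpose_multiply(src1, src2):
    # Transpose the first source tile, then multiply it by the second source tile.
    return src1.T @ src2

a = np.arange(6.0).reshape(2, 3)      # first source: 2x3
b = np.arange(8.0).reshape(2, 4)      # second source: 2x4
print(transpose_multiply(a, b))       # destination: 3x4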
-
7.
Publication number: US12056489B2
Publication date: 2024-08-06
Application number: US18313026
Application date: 2023-05-05
Applicant: Intel Corporation
Inventor: Naveen Mellempudi , Alexander F. Heinecke , Robert Valentine , Mark J. Charney , Christopher J. Hughes , Evangelos Georganas , Zeev Sperber , Amit Gradstein , Simon Rubanovich
CPC classification number: G06F9/30036 , G06F7/49915 , G06F9/30196 , G06F9/3887
Abstract: Systems, methods, and apparatuses relating to 8-bit floating-point matrix dot product instructions are described. A processor embodiment includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a destination matrix having single-precision elements, a first source matrix, and a second source matrix, the source matrices having elements that each comprise a quadruple of 8-bit floating-point values, the opcode to indicate execution circuitry is to cause, for each element of the first source matrix and corresponding element of the second source matrix, a conversion of the 8-bit floating-point values to single-precision values, a multiplication of different pairs of converted single-precision values to generate a plurality of results, and an accumulation of the results with previous contents of a corresponding element of the destination matrix, decode circuitry to decode the fetched instruction, and the execution circuitry to respond to the decoded instruction as specified by the opcode.
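A simplified Python sketch of the per-element data flow; the FP8 decoding itself is not modeled (the quadruples are shown as already-decoded values), and rounding and saturation details are omitted:

import numpy as np

def fp8_quad_dot_accumulate(dst, a_quads, b_quads):
    # Convert each quadruple of FP8 values to FP32 (modeled as a plain cast of
    # pre-decoded values), multiply pairwise, and accumulate the four products
    # into the corresponding FP32 destination element.
    a32 = a_quads.astype(np.float32)
    b32 = b_quads.astype(np.float32)
    return dst + (a32 * b32).sum(axis=-1, dtype=np.float32)

a = np.array([[0.5, 1.0, 2.0, 4.0]])      # one element = a quadruple of FP8 values
b = np.array([[2.0, 2.0, 2.0, 2.0]])
dst = np.array([1.0], dtype=np.float32)
print(fp8_quad_dot_accumulate(dst, a, b))  # [16.]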
-
8.
Publication number: US11954490B2
Publication date: 2024-04-09
Application number: US18309469
Application date: 2023-04-28
Applicant: Intel Corporation
Inventor: Raanan Sade , Robert Valentine , Bret Toll , Christopher J. Hughes , Alexander F. Heinecke , Elmoustapha Ould-Ahmed-Vall , Mark J. Charney
CPC classification number: G06F9/30167 , G06F9/30101 , G06F9/30149
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, a processor includes fetch and decode circuitry to fetch and decode an instruction having fields to specify an opcode and locations of source and destination matrices, wherein the opcode indicates that the processor is to transform the specified source matrix into the specified destination matrix having the row-interleaved format; and execution circuitry to respond to the decoded instruction by transforming the specified source matrix into the specified RowInt-formatted destination matrix by interleaving J elements of each J-element sub-column of the specified source matrix in either row-major or column-major order into a K-wide submatrix of the specified destination matrix, the K-wide submatrix having K columns and enough rows to hold the J elements.
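For J = 2 the transform can be sketched as follows; the exact destination layout shown here (the J elements of each sub-column become J adjacent elements of one destination row) is a common VNNI-style packing and is assumed for illustration:

import numpy as np

def to_row_interleaved(src, j):
    # Pack each j-element sub-column of `src` into j adjacent elements
    # of one destination row.
    rows, cols = src.shape
    assert rows % j == 0
    dst = np.empty((rows // j, cols * j), dtype=src.dtype)
    for r in range(rows):
        for c in range(cols):
            dst[r // j, c * j + r % j] = src[r, c]
    return dst

src = np.arange(8, dtype=np.int16).reshape(4, 2)   # 4x2 source
print(to_row_interleaved(src, j=2))                # 2x4 row-interleaved destination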
-
9.
Publication number: US20240045684A1
Publication date: 2024-02-08
Application number: US17958380
Application date: 2022-10-01
Applicant: Intel Corporation
Inventor: Alexander Heinecke , Menachem Adelman , Mark Charney , Evangelos Georganas , Amit Gradstein , Christopher Hughes , Naveen Mellempudi , Simon Rubanovich , Uri Sherman , Zeev Sperber , Robert Valentine
IPC: G06F9/30
CPC classification number: G06F9/30145 , G06F9/30036 , G06F9/30018
Abstract: Techniques for converting FP16 to BF8 using bias are described. An example embodiment utilizes decoder circuitry to decode a single instruction, the single instruction to include one or more fields to identify a first source operand, one or more fields to identify a second source operand, one or more fields to identify a source/destination operand, and one or more fields for an opcode, wherein the opcode is to indicate that execution circuitry is to convert packed half-precision data from the identified first and second sources to packed FP8 data using bias terms from the identified source/destination operand and store the packed FP8 data into corresponding data element positions of the identified source/destination operand; and execution circuitry to execute the decoded instruction according to the opcode to convert packed half-precision data from the identified first and second sources to packed FP8 data using bias terms from the identified source/destination operand and store the packed FP8 data into corresponding data element positions of the identified source/destination operand.
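One plausible reading of the bias-assisted conversion, sketched in Python: the bias term is added to the FP16 bit pattern before the low eight mantissa bits are truncated to form an E5M2 ("BF8") encoding. The E5M2 target format, this bias-then-truncate handling, and the omission of NaN/Inf special cases are assumptions for illustration:

import numpy as np

def fp16_to_bf8_with_bias(src, bias):
    # Add an 8-bit bias term to each FP16 encoding, then keep the upper 8 bits.
    bits = src.astype(np.float16).view(np.uint16).astype(np.uint32)
    bits = bits + (bias.astype(np.uint32) & 0xFF)
    return (bits >> 8).astype(np.uint8)

src = np.array([1.0, 1.5, 3.140625], dtype=np.float16)
bias = np.array([0x80, 0x80, 0x80], dtype=np.uint8)         # a round-to-nearest-like bias
print([hex(v) for v in fp16_to_bf8_with_bias(src, bias)])   # ['0x3c', '0x3e', '0x42']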
-
10.
Publication number: US11782709B2
Publication date: 2023-10-10
Application number: US17964964
Application date: 2022-10-13
Applicant: Intel Corporation
Inventor: Robert Valentine , Galina Ryvchin , Piotr Majcher , Mark J. Charney , Elmoustapha Ould-Ahmed-Vall , Jesus Corbal , Milind B. Girkar , Zeev Sperber , Simon Rubanovich , Amit Gradstein
CPC classification number: G06F9/30014 , G06F7/5443 , G06F9/30018 , G06F9/30036 , G06F9/30105 , G06F9/3818
Abstract: Embodiments of systems, apparatuses, and methods for fused multiply add. In some embodiments, a decoder decodes a single instruction having an opcode, a destination field representing a destination operand, and fields for a first, second, and third packed data source operand, wherein the packed data elements of the first and second packed data source operands are of a first size that differs from the second size of the packed data elements of the third packed data source operand. Execution circuitry then executes the decoded single instruction to perform, for each packed data element position of the destination operand, a multiplication of M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, an addition of the results of these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in the destination packed data element position corresponding to that packed data element position of the third packed data source, wherein M is equal to the size of the full-sized packed data element divided by N.
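The same pattern with 16-bit source elements and a 32-bit third source gives M = 32/16 = 2; as above, the element widths are an illustrative assumption rather than part of the record:

import numpy as np

def dot_accumulate_16to32(src1, src2, src3):
    # Each 32-bit element of the third source absorbs two 16-bit products.
    products = src1.astype(np.int32) * src2.astype(np.int32)
    return src3 + products.reshape(-1, 2).sum(axis=1, dtype=np.int32)

src1 = np.array([3, 4, 5, 6], dtype=np.int16)
src2 = np.array([2, 2, 10, 10], dtype=np.int16)
src3 = np.array([0, 1], dtype=np.int32)
print(dot_accumulate_16to32(src1, src2, src3))   # [ 14 111]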