-
Publication No.: US12242846B2
Publication date: 2025-03-04
Application No.: US18618648
Filing date: 2024-03-27
Applicant: Intel Corporation
Inventor: Naveen Mellempudi, Subramaniam Maiyuran, Varghese George, Fangwen Fu, Shuai Mu, Supratim Pal, Wei Xiong
Abstract: An apparatus to facilitate supporting 8-bit floating point format operands in a computing architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that operates on 8-bit floating point operands to cause the processor to perform a parallel dot product operation; a controller to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and systolic dot product circuitry to execute the decoded instruction using systolic layers, each systolic layer comprises one or more sets of interconnected multipliers, shifters, and adders, each set of multipliers, shifters, and adders to generate a dot product of the 8-bit floating point operands.
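As a rough software illustration of the dot product described in this abstract, the sketch below decodes 8-bit patterns into floats and accumulates their products. The abstract does not name a bit layout, so the default split (1 sign, 4 exponent, 3 mantissa bits with an IEEE-style bias) and the function names `decode_fp8` and `fp8_dot` are assumptions made for illustration; the hardware's multiplier, shifter, and adder structure and its rounding behavior are not modeled.

```python
def decode_fp8(byte, exp_bits=4, man_bits=3):
    """Decode an 8-bit pattern as sign / exponent / mantissa.

    Assumed layout: 1 sign bit, exp_bits exponent bits, man_bits mantissa
    bits, bias = 2**(exp_bits - 1) - 1. Inf/NaN encodings are not modeled.
    """
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def fp8_dot(a_bytes, b_bytes, acc=0.0, exp_bits=4, man_bits=3):
    """Accumulate the products of two vectors of 8-bit float patterns."""
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_fp8(a, exp_bits, man_bits) * decode_fp8(b, exp_bits, man_bits)
    return acc


# 0x38 -> 1.0, 0x40 -> 2.0, 0x3C -> 1.5 under the assumed 4/3 split.
result = fp8_dot([0x38, 0x40], [0x40, 0x3C])  # 1.0*2.0 + 2.0*1.5
```

Passing `exp_bits=5, man_bits=2` selects a different assumed split, which mirrors the idea of the instruction indicating which 8-bit data format the operands use.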
-
Publication No.: US20240403044A1
Publication date: 2024-12-05
Application No.: US18677140
Filing date: 2024-05-29
Applicant: Intel Corporation
Inventor: Shuai Mu, Cristina S. Anderson, Subramaniam Maiyuran
Abstract: Embodiments are directed to systems and methods for reuse of FMA execution unit hardware logic to provide native support for execution of get exponent, get mantissa, and/or scale instructions within a GPU. These new instructions may be used to implement branch-free emulation algorithms for mathematical functions and analytic functions (e.g., transcendental functions) by detecting and handling various special case inputs within a pre-processing stage of the FMA execution unit, which allows the main dataflow of the FMA execution unit to be bypassed for such special cases. Since special cases are handled by the FMA execution unit, library functions emulating various functions, including, but not limited to logarithm, exponential, and division operations may be implemented with significantly fewer lines of machine-level code, thereby providing improved performance for HPC applications.
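To make the role of get exponent, get mantissa, and scale operations concrete, the sketch below emulates them in software with `math.frexp` and `math.ldexp`, then uses them to split a logarithm into an exponent term and a mantissa term. The function names and the mantissa range [1, 2) are assumptions for illustration, not the instruction semantics defined by the patent. The special-case handling the abstract attributes to the hardware pre-processing stage (zero, infinity, NaN) is omitted, so inputs are assumed finite and nonzero (and positive for `log_via_split`).

```python
import math


def get_exp(x):
    """Unbiased exponent e such that 1 <= |x| / 2**e < 2 (finite, nonzero x)."""
    _, e = math.frexp(x)  # frexp returns m in [0.5, 1) with x == m * 2**e
    return e - 1


def get_mant(x):
    """Signed mantissa in [1, 2) (finite, nonzero x)."""
    m, _ = math.frexp(x)
    return m * 2.0


def scale(x, n):
    """Return x * 2**n."""
    return math.ldexp(x, n)


def log_via_split(x):
    """log(x) = get_exp(x) * ln(2) + log(get_mant(x)), for x > 0."""
    return get_exp(x) * math.log(2.0) + math.log(get_mant(x))


x = 10.0
parts = (get_exp(x), get_mant(x))          # (3, 1.25)
roundtrip = scale(get_mant(x), get_exp(x))  # 10.0
```

In a real library routine the `math.log(get_mant(x))` term would be replaced by a polynomial over the reduced range; the split itself is what the three operations provide.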
-
Publication No.: US20230315447A1
Publication date: 2023-10-05
Application No.: US18170696
Filing date: 2023-02-17
Applicant: Intel Corporation
Inventor: Shuai Mu, Cristina S. Anderson, Subramaniam Maiyuran
CPC classification number: G06F9/3001, G06T1/20, G06F7/5443
Abstract: Embodiments are directed to systems and methods for reuse of FMA execution unit hardware logic to provide native support for execution of get exponent, get mantissa, and/or scale instructions within a GPU. These new instructions may be used to implement branch-free emulation algorithms for mathematical functions and analytic functions (e.g., transcendental functions) by detecting and handling various special case inputs within a pre-processing stage of the FMA execution unit, which allows the main dataflow of the FMA execution unit to be bypassed for such special cases. Since special cases are handled by the FMA execution unit, library functions emulating various functions, including, but not limited to logarithm, exponential, and division operations may be implemented with significantly fewer lines of machine-level code, thereby providing improved performance for HPC applications.
-
Publication No.: US11625244B2
Publication date: 2023-04-11
Application No.: US17353984
Filing date: 2021-06-22
Applicant: Intel Corporation
Inventor: Shuai Mu, Cristina S. Anderson, Subramaniam Maiyuran
Abstract: Embodiments are directed to systems and methods for reuse of FMA execution unit hardware logic to provide native support for execution of get exponent, get mantissa, and/or scale instructions within a GPU. These new instructions may be used to implement branch-free emulation algorithms for mathematical functions and analytic functions (e.g., transcendental functions) by detecting and handling various special case inputs within a pre-processing stage of the FMA execution unit, which allows the main dataflow of the FMA execution unit to be bypassed for such special cases. Since special cases are handled by the FMA execution unit, library functions emulating various functions, including, but not limited to logarithm, exponential, and division operations may be implemented with significantly fewer lines of machine-level code, thereby providing improved performance for HPC applications.
-
Publication No.: US20220413916A1
Publication date: 2022-12-29
Application No.: US17358650
Filing date: 2021-06-25
Applicant: Intel Corporation
Inventor: Chandra Gurram, Wei-Yu Chen, Vikranth Vemulapalli, Subramaniam Maiyuran, Jorge Eduardo Parra Osorio, Shuai Mu, Guei-Yuan Lueh, Supratim Pal
Abstract: Provision of multiple register allocation sizes for threads is described. An example of a system includes one or more processors including a graphics processor, the graphics processor including at least a first local thread dispatcher (TDL) and multiple processing resources, each processing resource including a plurality of registers; and memory for storage of data for processing, wherein the one or more processors are to determine a register size for a first thread; identify one or more processing resources having sufficient register space for the first thread; select a processing resource of the one or more processing resources having sufficient register space to assign the first thread; select an available thread slot of the selected processing resource for the first thread; and allocate registers of the selected processing resource for the first thread.
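The sequence of steps in this abstract (determine a register size, find processing resources with enough register space, pick one, pick a free thread slot, allocate registers) can be sketched as follows. The class `ProcessingResource`, the function `dispatch`, and the tie-breaking rule (prefer the resource with the most free registers) are hypothetical choices made for illustration; the abstract does not specify a selection policy.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingResource:
    name: str
    free_registers: int
    free_slots: list = field(default_factory=list)


def dispatch(thread_regs, resources):
    """Assign a thread needing thread_regs registers; return (name, slot) or None."""
    # Identify resources with sufficient register space and an open thread slot.
    candidates = [
        r for r in resources
        if r.free_registers >= thread_regs and r.free_slots
    ]
    if not candidates:
        return None
    # Assumed policy: choose the candidate with the most free registers.
    chosen = max(candidates, key=lambda r: r.free_registers)
    slot = chosen.free_slots.pop(0)        # select an available thread slot
    chosen.free_registers -= thread_regs   # allocate registers for the thread
    return chosen.name, slot


resources = [
    ProcessingResource("pr0", free_registers=64, free_slots=[0, 1]),
    ProcessingResource("pr1", free_registers=256, free_slots=[0]),
]
first = dispatch(128, resources)   # only pr1 has enough registers
second = dispatch(128, resources)  # pr1 has no slot left, pr0 is too small
third = dispatch(64, resources)    # fits on pr0
```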
-
Publication No.: US20220318013A1
Publication date: 2022-10-06
Application No.: US17212588
Filing date: 2021-03-25
Applicant: Intel Corporation
Inventor: Naveen Mellempudi, Subramaniam Maiyuran, Varghese George, Fangwen Fu, Shuai Mu, Supratim Pal, Wei Xiong
Abstract: An apparatus to facilitate supporting 8-bit floating point format operands in a computing architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that operates on 8-bit floating point operands to cause the processor to perform a parallel dot product operation; a controller to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and systolic dot product circuitry to execute the decoded instruction using systolic layers, each systolic layer comprises one or more sets of interconnected multipliers, shifters, and adders, each set of multipliers, shifters, and adders to generate a dot product of the 8-bit floating point operands.