-
Publication No.: US11361496B2
Publication date: 2022-06-14
Application No.: US17304092
Filing date: 2021-06-14
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran, Shubra Marwaha, Ashutosh Garg, Supratim Pal, Jorge Parra, Chandra Gurram, Varghese George, Darin Starkey, Guei-Yuan Lueh
Abstract: Described herein is a graphics processing unit (GPU) comprising a single instruction, multiple thread (SIMT) multiprocessor comprising an instruction cache, a shared memory coupled with the instruction cache, and circuitry coupled with the shared memory and the instruction cache. The circuitry includes multiple texture units, a first core including hardware to accelerate matrix operations, and a second core configured to receive and process an instruction having multiple operands in a bfloat16 (BF16) number format. The multiple operands include a first source operand, a second source operand, and a third source operand, and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. To process the instruction, the second core multiplies the second source operand by the third source operand and adds the first source operand to the result of the multiply.
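The multiply-add described in this abstract can be illustrated in software. The following is a minimal sketch, not the patented hardware: the helper names `to_bf16` and `bf16_mad` are hypothetical, inputs are assumed finite, and rounding the final result back to BF16 with round-to-nearest-even is an assumption the abstract does not state.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the sign bit, the 8 exponent bits and the top 7
    mantissa bits of an IEEE 754 float32, i.e. its upper 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_mad(src0: float, src1: float, src2: float) -> float:
    """Compute src0 + src1 * src2 on BF16 operands.

    Assumption: the result is rounded back to BF16; real hardware may
    instead keep a wider accumulator.
    """
    a, b, c = to_bf16(src0), to_bf16(src1), to_bf16(src2)
    return to_bf16(a + b * c)


if __name__ == "__main__":
    print(bf16_mad(1.0, 2.0, 3.0))
```

The second source operand is multiplied by the third, and the first source operand is added to the product, matching the operand order given in the abstract.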
-
62.
Publication No.: US20210286626A1
Publication date: 2021-09-16
Application No.: US17213453
Filing date: 2021-03-26
Applicant: Intel Corporation
Inventor: Subramaniam M. Maiyuran, Guei-Yuan Lueh, Supratim Pal, Gang Chen, Ananda V. Kommaraju, Joy Chandra, Altug Koker, Prasoonkumar Surti, David Puffer, Hong Bin Liao, Joydeep Ray, Abhishek R. Appu, Ankur N. Shah, Travis T. Schluessler, Jonathan Kennedy, Devan Burke
Abstract: An apparatus to facilitate control flow in a graphics processing system is disclosed. The apparatus includes a plurality of execution units to execute single instruction, multiple data (SIMD) instructions, and flow control logic to detect a diverging control flow in a plurality of SIMD channels and reduce the execution of the control flow to a subset of the SIMD channels.
-
Publication No.: US20210149635A1
Publication date: 2021-05-20
Application No.: US16685561
Filing date: 2019-11-15
Applicant: Intel Corporation
Abstract: Embodiments described herein are generally directed to an improved vector normalization instruction. An embodiment of a method includes, responsive to receipt by a GPU of a single instruction specifying a vector normalization operation to be performed on V vectors: (i) generating V squared length values, N at a time, by a first processing unit, by, for each N sets of inputs, each representing multiple component vectors for N of the vectors, performing N parallel dot product operations on the N sets of inputs; and (ii) generating V sets of outputs representing multiple normalized component vectors of the V vectors, N at a time, by a second processing unit, by, for each N squared length values of the V squared length values, performing N parallel operations on the N squared length values, wherein each of the N parallel operations implements a combination of a reciprocal square root function and a vector scaling function.
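The two stages named in this abstract (squared lengths via dot products, then reciprocal square root combined with scaling) can be sketched as ordinary sequential code. This is an illustrative sketch only: the function name `normalize_batch` and the default group size `n=4` are assumptions, the groups are processed one after another rather than in parallel hardware, and all vectors are assumed to be non-zero.

```python
import math


def normalize_batch(vectors, n=4):
    """Normalize V vectors, handling them N at a time in two stages."""
    normalized = []
    for start in range(0, len(vectors), n):
        group = vectors[start:start + n]
        # Stage 1: one dot product per vector gives its squared length.
        squared_lengths = [sum(c * c for c in v) for v in group]
        # Stage 2: reciprocal square root of the squared length,
        # then scale every component by it.
        for v, sq in zip(group, squared_lengths):
            inv_len = 1.0 / math.sqrt(sq)
            normalized.append([c * inv_len for c in v])
    return normalized


if __name__ == "__main__":
    print(normalize_batch([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]], n=2))
```

Working from the squared length avoids a separate square root followed by a division: a single reciprocal square root yields the scale factor directly.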
-
Publication No.: US10789071B2
Publication date: 2020-09-29
Application No.: US14794521
Filing date: 2015-07-08
Applicant: Intel Corporation
Inventor: Hema C. Nalluri, Supratim Pal, Subramaniam Maiyuran, Joy Chandra
Abstract: Systems, apparatuses and methods may provide for associating a first instruction pointer with an IF block of a primary IF-ELSE conditional construct associated with a thread and activating a second instruction pointer in response to a dependency associated with the IF block. Additionally, the second instruction pointer may be associated with an ELSE block of the primary IF-ELSE conditional construct. In one example, the IF block and the ELSE block are executed, via the first instruction pointer and the second instruction pointer, one or more of independently from or parallel to one another.
-
Publication No.: US10698689B2
Publication date: 2020-06-30
Application No.: US16120226
Filing date: 2018-09-01
Applicant: Intel Corporation
Inventor: Pratik J. Ashar, Supratim Pal, Subramaniam Maiyuran, Wei-Yu Chen, Guei-Yuan Lueh
Abstract: An apparatus to facilitate register sharing is disclosed. The apparatus includes one or more processors to generate first machine code having a first General Purpose Register (GRF) per thread ratio, detect an occurrence of one or more spill/fill instructions in the first machine code, and generate second machine code having a second GRF per thread ratio upon a detection of one or more spill/fill instructions in the first machine code, wherein the second GRF per thread ratio is based on a disabling of a first of a plurality of hardware threads.
-
Publication No.: US10360654B1
Publication date: 2019-07-23
Application No.: US15990328
Filing date: 2018-05-25
Applicant: Intel Corporation
Inventor: Subramaniam Maiyuran, Supratim Pal, Jorge E. Parra, Chandra S. Gurram, Ashwin J. Shivani, Ashutosh Garg, Brent A. Schwartz, Jorge F. Garcia Pabon, Darin M. Starkey, Shubh B. Shah, Guei-Yuan Lueh, Kaiyu Chen, Konrad Trifunovic, Buqi Cheng, Weiyu Chen
Abstract: Embodiments described herein provide a graphics processor in which dependency tracking hardware is simplified via the use of compiler provided software scoreboard information. In one embodiment the shader compiler for shader programs is configured to encode software scoreboard information into each instruction. Dependencies can be evaluated by the shader compiler and provided as scoreboard information with each instruction. The hardware can then use the provided information when scheduling instructions. In one embodiment, a software scoreboard synchronization instruction is provided to facilitate software dependency handling within a shader program. Using software to facilitate software dependency handling and synchronization can simplify hardware design, reducing the area consumed by the hardware. In one embodiment, dependencies can be evaluated by the shader compiler instead of the GPU hardware. The compiler can then insert a software scoreboard sync immediate instruction into compiled program code to manage instruction dependencies and prevent data hazards from occurring.
-
Publication No.: US10152452B2
Publication date: 2018-12-11
Application No.: US14726349
Filing date: 2015-05-29
Applicant: Intel Corporation
Inventor: Supratim Pal, Subramaniam Maiyuran, Mark C. Davis
Abstract: Techniques to suppress redundant reads to register addresses and to replicate read data are disclosed. The redundant reads are suppressed when multiple source operands specify the same register address to read. Additionally, the read data is replicated to a data stream or data location corresponding to the source operands where the data read was suppressed.
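The idea in this abstract can be modeled in a few lines: read each distinct register address once, then copy the value to every source operand that named it. This is a software sketch under assumptions, not the disclosed circuitry; `read_operands` and the dictionary-based register file are hypothetical names chosen for illustration.

```python
def read_operands(register_file, source_addresses):
    """Return operand values, issuing one read per distinct address.

    register_file: mapping from register address to value.
    source_addresses: the address named by each source operand, in order.
    """
    fetched = {}       # address -> value already read for this instruction
    reads_issued = 0
    values = []
    for addr in source_addresses:
        if addr not in fetched:
            # First use of this address: perform the actual read.
            fetched[addr] = register_file[addr]
            reads_issued += 1
        # Repeated use: the read is suppressed and the data replicated.
        values.append(fetched[addr])
    return values, reads_issued


if __name__ == "__main__":
    regs = {"r1": 5, "r2": 7}
    print(read_operands(regs, ["r1", "r2", "r1"]))
```

In the usage example, three source operands name only two distinct registers, so two reads are issued and the value of `r1` is replicated to the third operand.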
-
Publication No.: US09632801B2
Publication date: 2017-04-25
Application No.: US14249154
Filing date: 2014-04-09
Applicant: Intel Corporation
Inventor: Supratim Pal, Murali Sundaresan
IPC: G06T1/60, G06F9/445, G06F12/08, G06F12/084
CPC classification number: G06F9/445, G06F12/0207, G06F12/0607, G06F12/08, G06F12/0811, G06F12/084, G06F12/0851, G06F12/0893, Y02D10/13
Abstract: Conversion of an array of structures (AOS) to a structure of arrays (SOA) improves the efficiency of transfer from the AOS to the SOA. A similar technique can be used to convert efficiently from an SOA to an AOS. The controller performing the conversion computes a partition size as the highest common factor between the structure size of structures in AOS and the number of banks in a first memory device, and transfers data based on the partition size, rather than on the structure size. The controller can read a partition size number of elements from multiple different structures to ensure that full data transfer bandwidth is used for each transfer.
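The partition-size computation in this abstract is the greatest common divisor of the structure size and the bank count. The sketch below shows that computation and one possible way to use it; the names `partition_size` and `aos_to_soa`, the flat-list layout, and the grouping of structures per transfer are illustrative assumptions, not the controller's actual schedule.

```python
from math import gcd


def partition_size(struct_size, num_banks):
    """Highest common factor of structure size and number of banks."""
    return gcd(struct_size, num_banks)


def aos_to_soa(aos, struct_size, num_banks):
    """Convert a flat AOS list to SOA, counting partition-sized transfers.

    Structure s occupies aos[s * struct_size:(s + 1) * struct_size].
    Assumption: each transfer takes `p` elements from each of
    `num_banks // p` structures, so a full transfer fills every bank.
    """
    num_structs = len(aos) // struct_size
    p = partition_size(struct_size, num_banks)
    structs_per_transfer = num_banks // p
    soa = [[None] * num_structs for _ in range(struct_size)]
    transfers = 0
    for first in range(0, num_structs, structs_per_transfer):
        last = min(first + structs_per_transfer, num_structs)
        for offset in range(0, struct_size, p):
            # One transfer: the same p-element partition of each structure.
            for s in range(first, last):
                for field in range(offset, offset + p):
                    soa[field][s] = aos[s * struct_size + field]
            transfers += 1
    return soa, transfers


if __name__ == "__main__":
    print(partition_size(6, 4))
    print(aos_to_soa(list(range(12)), struct_size=6, num_banks=4))
```

With six-element structures and four banks the partition size is two, so in this sketch each transfer carries two elements from each of two structures and occupies all four banks.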
-
69.
Publication No.: US20150294435A1
Publication date: 2015-10-15
Application No.: US14249154
Filing date: 2014-04-09
Applicant: Intel Corporation
Inventor: Supratim Pal, Murali Sundaresan
CPC classification number: G06F9/445, G06F12/0207, G06F12/0607, G06F12/08, G06F12/0811, G06F12/084, G06F12/0851, G06F12/0893, Y02D10/13
Abstract: Conversion of an array of structures (AOS) to a structure of arrays (SOA) improves the efficiency of transfer from the AOS to the SOA. A similar technique can be used to convert efficiently from an SOA to an AOS. The controller performing the conversion computes a partition size as the highest common factor between the structure size of structures in AOS and the number of banks in a first memory device, and transfers data based on the partition size, rather than on the structure size. The controller can read a partition size number of elements from multiple different structures to ensure that full data transfer bandwidth is used for each transfer.
-
Publication No.: US20250147762A1
Publication date: 2025-05-08
Application No.: US18504407
Filing date: 2023-11-08
Applicant: Intel Corporation
Inventor: Vasanth Ranganathan, Gang Chen, Supratim Pal, Jorge Eduardo Parra Osorio, Arthur Hunter, Boris Kuznetsov, Deepak N K, Siva Kumar Seemakurthi, James Valerio, Shubham Dinesh Chavan, Abhishek Kumar Singh, Samir Pandya, Sandeep Tippannanavar Niranjan, Alan Curtis, Jain Philip, Maltesh Kulkarni, Fangwen Fu, John Wiegert, Brent Schwartz
Abstract: Described herein is a graphics processor having processing resources with configurable thread and register configurations. Program code can configure a number of registers and accumulators that will be used by hardware threads during execution of the program code by the graphics processor. Processing resources within the graphics processor can be configured to assign different numbers of registers and accumulators to hardware threads based on the configuration requested by program code to be executed by the processing resource.