Matrix operation optimization mechanism

    Publication number: US12039000B2

    Publication date: 2024-07-16

    Application number: US18163418

    Filing date: 2023-02-02

    CPC classification number: G06F17/16 G06F7/78 G06N3/044 G06N3/084

    Abstract: An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data; one or more processors to execute an instruction to examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed, examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved, and retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters; and a register file including a plurality of registers, wherein the one or more blocks of the matrix data are stored within a first set of the plurality of registers.
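The block retrieval described in this abstract can be illustrated with a minimal software sketch. It is not the patented hardware mechanism: the `SurfaceHeader` fields, the `transpose` flag standing in for the message descriptor, and the zero fill for out-of-bounds elements are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SurfaceHeader:
    """Hypothetical stand-in for the message header parameters."""
    width: int    # surface width in elements
    height: int   # surface height in rows
    pitch: int    # elements between the starts of consecutive rows
    block_x: int  # column of the block's top-left element
    block_y: int  # row of the block's top-left element
    block_w: int  # block width in elements
    block_h: int  # block height in rows


def load_block(memory, hdr, transpose=False):
    """Copy one 2D block out of linear memory into a list of rows.

    `transpose` plays the role of a layout manipulation selected by the
    descriptor. Elements outside the surface are filled with 0, which is
    an assumption of this sketch rather than something the abstract states.
    """
    rows = []
    for r in range(hdr.block_y, hdr.block_y + hdr.block_h):
        row = []
        for c in range(hdr.block_x, hdr.block_x + hdr.block_w):
            inside = 0 <= r < hdr.height and 0 <= c < hdr.width
            row.append(memory[r * hdr.pitch + c] if inside else 0)
        rows.append(row)
    if transpose:
        rows = [list(col) for col in zip(*rows)]
    return rows


# A 4-row surface stored with a pitch of 8 elements, 6 of which are valid.
memory = list(range(32))
hdr = SurfaceHeader(width=6, height=4, pitch=8,
                    block_x=1, block_y=1, block_w=2, block_h=2)
block = load_block(memory, hdr)
```

The returned rows model the set of registers that receive the block; a real register file would pack elements by data type and register width.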

    NAMED AND CLUSTER BARRIERS
    Invention publication

    Publication number: US20240231957A9

    Publication date: 2024-07-11

    Application number: US17973234

    Filing date: 2022-10-25

    CPC classification number: G06F9/522 G06F9/4881

    Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics core including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
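As a rough software analogy for re-usable named barriers, the sketch below lets groups of threads synchronize on independently numbered barrier slots. The class `NamedBarriers` and its methods `init` and `signal_and_wait` are hypothetical names, and Python's `threading.Barrier` stands in for the barrier circuitry; none of this reflects the actual hardware interface.

```python
import threading


class NamedBarriers:
    """A fixed pool of barrier slots addressed by integer ID (hypothetical)."""

    def __init__(self, count):
        self._slots = [None] * count
        self._lock = threading.Lock()

    def init(self, barrier_id, participants):
        # Configure a slot for a given number of participating threads.
        with self._lock:
            self._slots[barrier_id] = threading.Barrier(participants)

    def signal_and_wait(self, barrier_id):
        # Block until every participant of this slot has arrived.
        # threading.Barrier resets itself afterwards, so the slot is reusable.
        self._slots[barrier_id].wait()


def run_demo():
    barriers = NamedBarriers(count=4)
    barriers.init(0, participants=2)  # group "a" uses barrier 0
    barriers.init(1, participants=2)  # group "b" uses barrier 1
    log = []
    log_lock = threading.Lock()

    def worker(name, barrier_id):
        for phase in range(2):
            with log_lock:
                log.append((name, phase))
            # The same named barrier is reused at the end of each phase.
            barriers.signal_and_wait(barrier_id)

    threads = [
        threading.Thread(target=worker, args=(name, barrier_id))
        for name, barrier_id in [("a0", 0), ("a1", 0), ("b0", 1), ("b1", 1)]
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Within each group, no thread starts phase 1 until both members have recorded phase 0, while the two groups proceed independently of each other.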

    Hierarchical thread scheduling based on multiple barriers

    Publication number: US11977895B2

    Publication date: 2024-05-07

    Application number: US17131647

    Filing date: 2020-12-22

    CPC classification number: G06F9/3838 G06F9/4881 G06F9/544 G06T1/20

    Abstract: Examples described herein relate to a graphics processing unit (GPU) coupled to the memory device, the GPU configured to: execute an instruction thread; determine if a dual directional signal barrier is associated with the instruction thread; and based on clearance of the dual directional signal barrier for a particular signal barrier identifier and a mode of operation, indicate a clearance of the dual directional signal barrier for the mode of operation, wherein the dual directional signal barrier is to provide a single barrier to gate activity of one or more producers based on activity of one or more consumers or gate activity of one or more consumers based on activity of one or more producers.

    SHARED LOCAL REGISTERS FOR THREAD TEAM PROCESSING

    Publication number: US20240112295A1

    Publication date: 2024-04-04

    Application number: US17958216

    Filing date: 2022-09-30

    CPC classification number: G06T1/20 G06F9/30098 G06F9/3836

    Abstract: Shared local registers for thread team processing are described. An example of an apparatus includes one or more processors including a graphics processor having multiple processing resources; and memory for storage of data, the graphics processor to allocate a first thread team to a first processing resource, the first thread team including hardware threads to be executed solely by the first processing resource; allocate a shared local register (SLR) space that may be directly referenced in the ISA instructions to the first processing resource, the SLR space being accessible to the threads of the thread team and being inaccessible to threads outside of the thread team; and allocate individual register spaces to the thread team, each of the individual register spaces being accessible to a respective thread of the thread team.
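The access rules in this abstract can be modeled with a small sketch: one register space shared by a team, plus one private space per thread. The `ThreadTeam` class, its method names, and the use of `PermissionError` to represent inaccessibility are assumptions chosen for illustration, not the allocation scheme of the patent.

```python
class ThreadTeam:
    """Toy model of a thread team with shared and per-thread registers."""

    def __init__(self, thread_ids, slr_size, private_size):
        self._members = frozenset(thread_ids)
        # Shared local register (SLR) space, visible to team members only.
        self._slr = [0] * slr_size
        # One individual register space per thread of the team.
        self._private = {tid: [0] * private_size for tid in thread_ids}

    def _check_member(self, tid):
        if tid not in self._members:
            raise PermissionError(f"thread {tid} is outside this team")

    def slr_write(self, tid, index, value):
        self._check_member(tid)
        self._slr[index] = value

    def slr_read(self, tid, index):
        self._check_member(tid)
        return self._slr[index]

    def private_write(self, tid, index, value):
        self._check_member(tid)
        self._private[tid][index] = value

    def private_read(self, tid, index):
        self._check_member(tid)
        # Each thread only ever sees its own individual register space.
        return self._private[tid][index]


team = ThreadTeam(thread_ids=[0, 1], slr_size=4, private_size=4)
team.slr_write(0, 2, 99)      # thread 0 publishes a value to the team
team.private_write(0, 1, 7)   # thread 0 keeps a value to itself
```

Here thread 1 can read the value that thread 0 placed in the shared space, while a thread ID that does not belong to the team is rejected.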
