-
Publication Number: US20250110741A1
Publication Date: 2025-04-03
Application Number: US18477790
Application Date: 2023-09-29
Applicant: Intel Corporation
Inventor: Jorge Eduardo Parra Osorio , Fangwen Fu , Guei-Yuan Lueh , Hong Jiang , Jiasheng Chen , Naveen K. Mellempudi , Kevin Hurd , Chunhui Mei , Alexandre Hadj-Chaib , Elliot Taylor , Shuai Mu
Abstract: An apparatus to facilitate supporting 8-bit floating point format for parallel computing and stochastic rounding operations in a graphics architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that is to operate on 8-bit floating point operands to perform a parallel dot product operation; a scheduler to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and circuitry to execute the decoded instruction to perform a 32-way dot product using 8-bit wide dot-product layers, each 8-bit wide dot-product layer comprising one or more sets of interconnected multipliers, shifters, and adders, wherein each set of multipliers, shifters, and adders is to generate a dot product of the 8-bit floating point operands.
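As a rough illustration of the 32-way FP8 dot product the abstract describes, here is a minimal Python sketch. The E4M3 encoding (1 sign, 4 exponent, 3 mantissa bits, bias 7) and the 8-lane layer width are assumptions chosen for the example, not details taken from the patent, and host floating point stands in for the multiplier/shifter/adder networks of the real circuitry.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode one assumed E4M3 FP8 value (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    s = (byte >> 7) & 0x1
    e = (byte >> 3) & 0xF
    m = byte & 0x7
    sign = -1.0 if s else 1.0
    if e == 0:                        # subnormal: exponent fixed at 1 - bias
        return sign * (m / 8.0) * 2.0 ** -6
    if e == 0xF and m == 0x7:         # the single NaN encoding in this format
        return math.nan
    return sign * (1.0 + m / 8.0) * 2.0 ** (e - 7)

def dot32_fp8(a_bytes, b_bytes, acc: float = 0.0, lanes_per_layer: int = 8) -> float:
    """32-way dot product built from narrower 'layers', each reducing a slice of lanes."""
    assert len(a_bytes) == len(b_bytes) == 32
    for base in range(0, 32, lanes_per_layer):            # one pass per dot-product layer
        partial = sum(decode_e4m3(a) * decode_e4m3(b)     # multiplier stage
                      for a, b in zip(a_bytes[base:base + lanes_per_layer],
                                      b_bytes[base:base + lanes_per_layer]))
        acc += partial                                    # adder stage folds into the accumulator
    return acc

if __name__ == "__main__":
    a = [0x38] * 32          # 0x38 encodes 1.0 in E4M3 (e=7, m=0)
    b = [0x30] * 32          # 0x30 encodes 0.5 in E4M3 (e=6, m=0)
    print(dot32_fp8(a, b))   # expect 16.0
```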
-
Publication Number: US20240256274A1
Publication Date: 2024-08-01
Application Number: US18618648
Application Date: 2024-03-27
Applicant: Intel Corporation
Inventor: Naveen Mellempudi , Subramaniam Maiyuran , Varghese George , Fangwen Fu , Shuai Mu , Supratim Pal , Wei Xiong
CPC classification number: G06F9/30014 , G06F9/3818 , G06F9/4843 , G06F17/16 , G06N20/00
Abstract: An apparatus to facilitate supporting 8-bit floating point format operands in a computing architecture is disclosed. The apparatus includes a processor comprising: a decoder to decode an instruction fetched for execution into a decoded instruction, wherein the decoded instruction is a matrix instruction that operates on 8-bit floating point operands to cause the processor to perform a parallel dot product operation; a controller to schedule the decoded instruction and provide input data for the 8-bit floating point operands in accordance with an 8-bit floating point data format indicated by the decoded instruction; and systolic dot product circuitry to execute the decoded instruction using systolic layers, each systolic layer comprising one or more sets of interconnected multipliers, shifters, and adders, with each set of multipliers, shifters, and adders to generate a dot product of the 8-bit floating point operands.
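To illustrate how a partial sum can flow through a stack of systolic layers, here is a minimal sketch. The number of layers, the slice width per layer, and the use of plain Python floats in place of FP8 operands are all assumptions made for clarity.

```python
from typing import Sequence

def systolic_layer(a_slice: Sequence[float], b_slice: Sequence[float], acc_in: float) -> float:
    """One systolic layer: multiply its slice of operands and add into the incoming accumulator."""
    return acc_in + sum(x * y for x, y in zip(a_slice, b_slice))

def pipelined_dot(a: Sequence[float], b: Sequence[float], num_layers: int = 4) -> float:
    """Split the reduction across 'num_layers' layers and let the accumulator pass from
    layer to layer, the way a partial sum propagates stage to stage in a systolic pipeline."""
    assert len(a) == len(b) and len(a) % num_layers == 0
    step = len(a) // num_layers
    acc = 0.0
    for i in range(num_layers):                      # accumulator travels down the pipeline
        acc = systolic_layer(a[i * step:(i + 1) * step], b[i * step:(i + 1) * step], acc)
    return acc

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [0.5] * 8
    print(pipelined_dot(a, b))   # expect 18.0
```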
-
Publication Number: US12039000B2
Publication Date: 2024-07-16
Application Number: US18163418
Application Date: 2023-02-02
Applicant: Intel Corporation
Inventor: Joydeep Ray , Fangwen Fu , Dhiraj D. Kalamkar , Sasikanth Avancha
Abstract: An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data, one or more processors to execute an instruction to examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed, examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved, and retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters, and a register file including a plurality of registers, wherein the one or more blocks of the matrix data are stored within a first set of the plurality of registers.
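A minimal sketch of retrieving a block from a 2D memory surface into register-sized rows follows. The parameter names (pitch, base_x, base_y, block_w, block_h) are hypothetical stand-ins for whatever the message header actually carries.

```python
def load_2d_block(surface, pitch, base_x, base_y, block_w, block_h):
    """Gather a block_w x block_h tile from a row-major 2D surface described by (pitch, origin).
    Each returned row stands in for one destination register in the real operation."""
    regs = []
    for row in range(block_h):
        start = (base_y + row) * pitch + base_x
        regs.append(surface[start:start + block_w])   # one 'register' worth of contiguous data
    return regs

if __name__ == "__main__":
    pitch = 8                                 # elements per surface row
    surface = list(range(8 * 8))              # an 8x8 surface stored row-major
    for reg in load_2d_block(surface, pitch, base_x=2, base_y=1, block_w=4, block_h=3):
        print(reg)
```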
-
Publication Number: US20240232088A9
Publication Date: 2024-07-11
Application Number: US17973203
Application Date: 2022-10-25
Applicant: Intel Corporation
Inventor: John A. Wiegert , Joydeep Ray , Vasanth Ranganathan , Biju George , Fangwen Fu , Abhishek R. Appu , Chunhui Mei , Changwon Rhee
IPC: G06F12/0855
CPC classification number: G06F12/0857 , G06F2212/1016
Abstract: Embodiments described herein provide a technique to facilitate the broadcast or multicast of asynchronous loads to shared local memory of a plurality of graphics cores within a graphics core cluster. One embodiment provides a graphics processor including a cache memory and a graphics core cluster coupled with the cache memory. The graphics core cluster includes a plurality of graphics cores. The plurality of graphics cores includes a graphics core configured to receive a designation as a producer graphics core for a multicast load, read data from the cache memory, and transmit the data read from the cache memory to a consumer graphics core of the plurality of graphics cores.
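A minimal sketch of the producer/consumer multicast pattern, with Python threads standing in for graphics cores and a list standing in for each core's shared local memory; the signalling mechanism is an assumption for illustration only.

```python
import threading

# Stand-ins: Python threads play the role of graphics cores; each "core" has a slot of
# shared local memory (slm) and an event that signals the multicast copy has arrived.
NUM_CONSUMERS = 3
cache = {0x100: [1.0, 2.0, 3.0, 4.0]}                 # pretend cache line keyed by address
slm = [None] * (NUM_CONSUMERS + 1)                    # one SLM slot per core
ready = [threading.Event() for _ in range(NUM_CONSUMERS + 1)]

def producer_core(addr: int) -> None:
    """Designated producer: read the cache once, then multicast the data to every core's SLM."""
    data = cache[addr]
    for core in range(NUM_CONSUMERS + 1):             # includes the producer's own SLM
        slm[core] = list(data)
        ready[core].set()                             # tell that core its copy has landed

def consumer_core(core_id: int) -> None:
    """Consumer: wait for the multicast copy, then use it from shared local memory."""
    ready[core_id].wait()
    print(f"core {core_id} sees {slm[core_id]}")

threads = [threading.Thread(target=producer_core, args=(0x100,))]
threads += [threading.Thread(target=consumer_core, args=(i,)) for i in range(1, NUM_CONSUMERS + 1)]
for t in threads: t.start()
for t in threads: t.join()
```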
-
Publication Number: US20240231957A9
Publication Date: 2024-07-11
Application Number: US17973234
Application Date: 2022-10-25
Applicant: Intel Corporation
Inventor: Fangwen Fu , Chunhui Mei , John A. Wiegert , Yongsheng Liu , Ben J. Ashbaugh
CPC classification number: G06F9/522 , G06F9/4881
Abstract: Embodiments described herein provide a technique to facilitate the synchronization of workgroups executed on multiple graphics cores of a graphics core cluster. One embodiment provides a graphics processor including a cache memory and a graphics core coupled with the cache memory. The graphics core includes execution resources to execute an instruction via a plurality of hardware threads and barrier circuitry to synchronize execution of the plurality of hardware threads, wherein the barrier circuitry is configured to provide a plurality of re-usable named barriers.
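A minimal sketch of re-usable named barriers, assuming each name maps to a barrier shared by a fixed number of threads; Python threads stand in for hardware threads and the pool abstraction is illustrative, not the patent's mechanism.

```python
import threading

class NamedBarrierPool:
    """A pool of re-usable named barriers; each name maps to one barrier that a fixed
    number of hardware threads (here, Python threads) must reach before any proceed."""
    def __init__(self, parties_per_barrier: int):
        self._parties = parties_per_barrier
        self._barriers = {}
        self._lock = threading.Lock()

    def arrive_and_wait(self, name: str) -> None:
        with self._lock:                          # create the named barrier on first use
            bar = self._barriers.setdefault(name, threading.Barrier(self._parties))
        bar.wait()                                # threading.Barrier resets itself, so it is re-usable

def worker(pool: NamedBarrierPool, tid: int) -> None:
    print(f"thread {tid}: phase 1")
    pool.arrive_and_wait("nb0")                   # all threads meet at named barrier "nb0"
    print(f"thread {tid}: phase 2")
    pool.arrive_and_wait("nb0")                   # the same name is re-used for the next phase

pool = NamedBarrierPool(parties_per_barrier=4)
threads = [threading.Thread(target=worker, args=(pool, t)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```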
-
Publication Number: US20240220448A1
Publication Date: 2024-07-04
Application Number: US18148998
Application Date: 2022-12-30
Applicant: Intel Corporation
Inventor: Chunhui Mei , Jiasheng Chen , Ben J. Ashbaugh , Fangwen Fu , Hong Jiang , Guei-Yuan Lueh , Rama S.B. Harihara , Maxim Kazakov
CPC classification number: G06F15/8046 , G06F13/1668
Abstract: A scalable and configurable clustered systolic array is described. An example of an apparatus includes a cluster including a plurality of cores and a cache memory coupled with the cluster, wherein each core includes a plurality of processing resources, a memory coupled with the plurality of processing resources, a systolic array coupled with the memory, and one or more interconnects with one or more other cores of the plurality of cores; and wherein the systolic arrays of the cores are configurable by the apparatus to form a logically combined systolic array for processing of an operation by a cooperative group of threads running on one or more of the plurality of cores in the cluster.
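A minimal sketch of how per-core systolic arrays could be combined logically: each core's array is modeled as a plain matrix multiply over a slice of the reduction dimension, and the cluster sums the partial results. The slicing strategy and core count are assumptions for the example.

```python
def core_systolic_matmul(a_tile, b_tile):
    """One core's systolic array: a plain matrix multiply standing in for the hardware pipeline."""
    rows, inner, cols = len(a_tile), len(b_tile), len(b_tile[0])
    return [[sum(a_tile[i][k] * b_tile[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def clustered_matmul(a, b, num_cores=4):
    """Logically combined systolic array: each core takes a slice of the reduction (K) dimension,
    and the cluster sums the per-core partial results into one output."""
    k = len(b)
    step = k // num_cores
    partials = [core_systolic_matmul([row[c * step:(c + 1) * step] for row in a],
                                     b[c * step:(c + 1) * step])
                for c in range(num_cores)]
    # Combine the partial results as if the per-core arrays formed one larger array.
    return [[sum(p[i][j] for p in partials) for j in range(len(b[0]))] for i in range(len(a))]

if __name__ == "__main__":
    a = [[1.0] * 8 for _ in range(2)]            # 2x8 input
    b = [[2.0] * 3 for _ in range(8)]            # 8x3 input
    print(clustered_matmul(a, b))                # each entry should be 16.0
```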
-
Publication Number: US20240220335A1
Publication Date: 2024-07-04
Application Number: US18148993
Application Date: 2022-12-30
Applicant: Intel Corporation
Inventor: Chunhui Mei , Yongsheng Liu , John A. Wiegert , Vasanth Ranganathan , Ben J. Ashbaugh , Fangwen Fu , Hong Jiang , Guei-Yuan Lueh , James Valerio , Alan M. Curtis , Maxim Kazakov
CPC classification number: G06F9/522 , G06F9/3877 , G06F9/5072 , G06F9/3887
Abstract: Synchronization for data multicast in compute core clusters is described. An example of an apparatus includes one or more processors including at least a graphics processing unit (GPU), the GPU including one or more clusters of cores and a memory, wherein each cluster of cores includes a plurality of cores, each core including one or more processing resources, shared local memory, and gateway circuitry, wherein the GPU is to initiate broadcast of a data element from a producer core to one or more consumer cores, and synchronize the broadcast of the data element utilizing the gateway circuitry of the producer core and the one or more consumer cores, and wherein synchronizing the broadcast of the data element includes establishing a multi-core barrier for broadcast of the data element.
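A minimal sketch of barrier-synchronized broadcast, assuming a gateway object that pairs the published data element with a multi-core barrier; the class and its methods are illustrative, not the patent's gateway circuitry.

```python
import threading

class MulticastGateway:
    """Gateway sketch: a producer deposits a broadcast value, and a multi-core barrier ensures
    every participating core has arrived before any consumer reads the value."""
    def __init__(self, num_cores: int):
        self._barrier = threading.Barrier(num_cores)   # multi-core barrier across the cluster
        self.value = None

    def produce(self, value) -> None:
        self.value = value            # producer publishes the data element first
        self._barrier.wait()          # then joins the multi-core barrier

    def consume(self):
        self._barrier.wait()          # consumers wait at the same barrier before reading
        return self.value

gw = MulticastGateway(num_cores=4)

def producer_core():
    gw.produce([1, 2, 3, 4])

def consumer_core(core_id: int):
    print(f"core {core_id} received {gw.consume()}")

threads = [threading.Thread(target=producer_core)]
threads += [threading.Thread(target=consumer_core, args=(i,)) for i in range(1, 4)]
for t in threads: t.start()
for t in threads: t.join()
```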
-
Publication Number: US11977895B2
Publication Date: 2024-05-07
Application Number: US17131647
Application Date: 2020-12-22
Applicant: Intel Corporation
Inventor: Sabareesh Ganapathy , Fangwen Fu , Hong Jiang , James Valerio
CPC classification number: G06F9/3838 , G06F9/4881 , G06F9/544 , G06T1/20
Abstract: Examples described herein relate to a graphics processing unit (GPU) coupled to a memory device, the GPU configured to: execute an instruction thread; determine if a dual directional signal barrier is associated with the instruction thread; and based on clearance of the dual directional signal barrier for a particular signal barrier identifier and a mode of operation, indicate a clearance of the dual directional signal barrier for the mode of operation, wherein the dual directional signal barrier is to provide a single barrier to gate activity of one or more producers based on activity of one or more consumers or gate activity of one or more consumers based on activity of one or more producers.
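A minimal sketch of a single barrier that gates in either direction depending on a mode. The mode names, counters, and event-based clearing are assumptions made to illustrate the gating behavior, not the hardware mechanism.

```python
import threading

class DualDirectionalBarrier:
    """Sketch of one barrier that can gate in either direction, selected by mode:
    'producers_gate_consumers' -> consumers wait until all producers have signalled;
    'consumers_gate_producers' -> producers wait until all consumers have signalled."""
    def __init__(self, num_producers: int, num_consumers: int, mode: str):
        self.mode = mode
        self._count = num_producers if mode == "producers_gate_consumers" else num_consumers
        self._lock = threading.Lock()
        self._cleared = threading.Event()

    def signal(self) -> None:
        """Called by the gating side (producers or consumers, per mode)."""
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._cleared.set()          # barrier clears once every gating party has signalled

    def wait(self) -> None:
        """Called by the gated side; returns once the barrier has cleared."""
        self._cleared.wait()

barrier = DualDirectionalBarrier(num_producers=2, num_consumers=3,
                                 mode="producers_gate_consumers")

def producer(pid):
    print(f"producer {pid} done")
    barrier.signal()

def consumer(cid):
    barrier.wait()
    print(f"consumer {cid} proceeds")

threads = [threading.Thread(target=producer, args=(p,)) for p in range(2)]
threads += [threading.Thread(target=consumer, args=(c,)) for c in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```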
-
Publication Number: US20240134797A1
Publication Date: 2024-04-25
Application Number: US17973203
Application Date: 2022-10-24
Applicant: Intel Corporation
Inventor: John A. Wiegert , Joydeep Ray , Vasanth Ranganathan , Biju George , Fangwen Fu , Abhishek R. Appu , Chunhui Mei , Changwon Rhee
IPC: G06F12/0855
CPC classification number: G06F12/0857 , G06F2212/1016
Abstract: Embodiments described herein provide a technique to facilitate the broadcast or multicast of asynchronous loads to shared local memory of a plurality of graphics cores within a graphics core cluster. One embodiment provides a graphics processor including a cache memory and a graphics core cluster coupled with the cache memory. The graphics core cluster includes a plurality of graphics cores. The plurality of graphics cores includes a graphics core configured to receive a designation as a producer graphics core for a multicast load, read data from the cache memory, and transmit the data read from the cache memory to a consumer graphics core of the plurality of graphics cores.
-
Publication Number: US20240112295A1
Publication Date: 2024-04-04
Application Number: US17958216
Application Date: 2022-09-30
Applicant: Intel Corporation
Inventor: Biju George , Fangwen Fu , Supratim Pal , Jorge Parra , Chunhui Mei , Maxim Kazakov , Joydeep Ray
CPC classification number: G06T1/20 , G06F9/30098 , G06F9/3836
Abstract: Shared local registers for thread team processing are described. An example of an apparatus includes one or more processors including a graphics processor having multiple processing resources, and memory for storage of data, the graphics processor to: allocate a first thread team to a first processing resource, the first thread team including hardware threads to be executed solely by the first processing resource; allocate, to the first processing resource, a shared local register (SLR) space that may be directly referenced in ISA instructions, the SLR space being accessible to the threads of the thread team and being inaccessible to threads outside of the thread team; and allocate individual register spaces to the thread team, each of the individual register spaces being accessible to a respective thread of the thread team.
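A minimal sketch of the allocation split the abstract describes: per-thread private register spaces plus one SLR space that only team members may touch. The class, sizes, and access checks are illustrative assumptions, not the hardware register file design.

```python
class ThreadTeamRegisters:
    """Sketch of register allocation for a thread team: each thread gets a private register
    space, and the team shares an SLR space that threads outside the team cannot access."""
    def __init__(self, team_threads, slr_size: int, regs_per_thread: int):
        self._team = set(team_threads)
        self._slr = [0] * slr_size                                 # shared local registers
        self._private = {t: [0] * regs_per_thread for t in team_threads}

    def write_slr(self, thread_id: int, index: int, value) -> None:
        if thread_id not in self._team:                            # enforce team-only access
            raise PermissionError(f"thread {thread_id} is outside the team")
        self._slr[index] = value

    def read_slr(self, thread_id: int, index: int):
        if thread_id not in self._team:
            raise PermissionError(f"thread {thread_id} is outside the team")
        return self._slr[index]

    def private_regs(self, thread_id: int):
        return self._private[thread_id]                            # per-thread register space

if __name__ == "__main__":
    team = ThreadTeamRegisters(team_threads=[0, 1, 2, 3], slr_size=16, regs_per_thread=8)
    team.write_slr(0, 0, 42)          # one team thread publishes a value
    print(team.read_slr(3, 0))        # another team thread can read it -> 42
    try:
        team.read_slr(7, 0)           # a thread outside the team is rejected
    except PermissionError as err:
        print(err)
```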