-
Publication number: US20240329998A1
Publication date: 2024-10-03
Application number: US18619392
Filing date: 2024-03-28
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He , Michael J. Mantor , Brian D. Emberling
CPC classification number: G06F9/3802 , G06F9/3001 , G06F9/30098 , G06F9/3867
Abstract: An apparatus and method for efficiently processing multiply and accumulate operations for matrices in applications. In various implementations, a computing system includes a parallel data processing circuit and a memory. The memory stores the instructions (or translated commands) of a parallel data application. The circuitry of the parallel data processing circuit performs a matrix multiplication operation using source operands accessed only once from a vector register file and multiple instantiations of a vector processing circuit capable of performing multiple matrix multiplication operations corresponding to multiple different types of instructions. The multiplier circuit and the adder circuit of the vector processing circuit perform both the fused multiply add (FMA) operation and the dot product (inner product) operation, without requiring independent, dedicated execution pipelines (one execution pipeline for the FMA operation and a separate execution pipeline for the dot product operation).
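The shared-datapath idea in this abstract can be illustrated with a small sketch. This is only a software analogy, not the patented circuit: the function names `multiply_stage` and `vector_unit` and the `"fma"`/`"dot"` mode strings are hypothetical, and the single-rounding behavior of a true fused multiply add is not modeled.

```python
def multiply_stage(a, b):
    """Shared multiplier stage: one product per pair of source elements."""
    return [x * y for x, y in zip(a, b)]


def vector_unit(op, a, b, c):
    """Run either operation through the same multiply stage, then add.

    op == "fma": c is a list; returns a[i] * b[i] + c[i] for every element.
    op == "dot": c is a scalar accumulator; returns sum(a[i] * b[i]) + c.
    """
    products = multiply_stage(a, b)  # reused by both operation types
    if op == "fma":
        return [p + z for p, z in zip(products, c)]
    if op == "dot":
        return sum(products) + c
    raise ValueError(f"unknown operation: {op}")


fma_result = vector_unit("fma", [1, 2], [3, 4], [5, 6])
dot_result = vector_unit("dot", [1, 2], [3, 4], 5)
print(fma_result, dot_result)
```

Both modes reuse the same products; only the way the adders combine them differs, which is the property the abstract attributes to the shared multiplier and adder circuits.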
-
Publication number: US12033238B2
Publication date: 2024-07-09
Application number: US17030852
Filing date: 2020-09-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Brian D. Emberling , Joseph Lee Greathouse , Anthony Thomas Gutierrez
Abstract: Systems, apparatuses, and methods for implementing register compaction with early release are disclosed. A processor includes at least a command processor, a plurality of compute units, a plurality of registers, and a control unit. Registers are statically allocated to wavefronts by the control unit when wavefronts are launched by the command processor on the compute units. In response to determining that a first set of registers, previously allocated to a first wavefront, are no longer needed, the first wavefront executes an instruction to release the first set of registers. The control unit detects the executed instruction and releases the first set of registers to the available pool of registers to potentially be used by other wavefronts. Then, the control unit can allocate the first set of registers to a second wavefront for use by threads of the second wavefront while the first wavefront is still active.
-
Publication number: US20190129718A1
Publication date: 2019-05-02
Application number: US15799560
Filing date: 2017-10-31
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen , Bin He , Yunxiao Zou , Michael J. Mantor , Radhakrishna Giduthuri , Eric J. Finger , Brian D. Emberling
Abstract: Systems, apparatuses, and methods for routing traffic between clients and system memory are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.
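The operand selection described in this abstract can be sketched as follows, assuming M = 32 and N = 16 (the abstract gives only the relation N < M). The helper names and the boolean selector arguments are hypothetical stand-ins for the per-operand bits in the instruction encoding.

```python
import struct


def select_half(reg32, high):
    """Return the high or low 16-bit portion of a 32-bit register value."""
    return (reg32 >> 16) & 0xFFFF if high else reg32 & 0xFFFF


def half_to_float(bits16):
    """Interpret 16 bits as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits16))[0]


def float_to_half(value):
    """Encode a Python float as 16 half-precision bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def packed_add(src0, src1, sel0_high, sel1_high):
    """Add one selected half of each 32-bit source; selectors mimic instruction bits."""
    a = half_to_float(select_half(src0, sel0_high))
    b = half_to_float(select_half(src1, sel1_high))
    return float_to_half(a + b)


# Pack two half-precision values into each 32-bit source register.
src0 = (float_to_half(1.5) << 16) | float_to_half(0.25)
src1 = (float_to_half(2.0) << 16) | float_to_half(4.0)

# Use the high half of src0 (1.5) and the low half of src1 (4.0).
result = packed_add(src0, src1, sel0_high=True, sel1_high=False)
print(half_to_float(result))
```

Both full-width operands are read once; the selector bits then decide which half of each participates, so no extra register-file read is needed to reach the other half.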
-
Publication number: US20180246724A1
Publication date: 2018-08-30
Application number: US15442412
Filing date: 2017-02-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Mark Fowler , Brian D. Emberling
IPC: G06F9/30 , G06F12/0875
Abstract: Systems, apparatuses, and methods for maintaining separate pending load and store counters are disclosed herein. In one embodiment, a system includes at least one execution unit, a memory subsystem, and a pair of counters for each thread of execution. In one embodiment, the system implements a software based approach for managing dependencies between instructions. In one embodiment, the execution unit(s) maintains counters to support the software-based approach for managing dependencies between instructions. The execution unit(s) are configured to execute instructions that are used to manage the dependencies during run-time. In one embodiment, the execution unit(s) execute wait instructions to wait until a given counter is equal to a specified value before continuing to execute the instruction sequence.
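A minimal sketch of the counter scheme in this abstract is shown below. The class `ThreadState`, its method names, and the in-order, one-operation-per-step memory model are assumptions for illustration only; the abstract does not specify completion order or timing.

```python
from collections import deque


class ThreadState:
    """Per-thread pair of counters for pending loads and pending stores."""

    def __init__(self):
        self.pending = {"load": 0, "store": 0}  # one counter per operation type
        self.in_flight = deque()                # operations not yet completed

    def issue(self, kind):
        """Issue a memory operation and increment the matching counter."""
        self.pending[kind] += 1
        self.in_flight.append(kind)

    def memory_step(self):
        """Complete the oldest outstanding operation and decrement its counter."""
        if self.in_flight:
            self.pending[self.in_flight.popleft()] -= 1

    def wait(self, kind, value):
        """Model a wait instruction: stall until the counter equals `value`."""
        cycles = 0
        while self.pending[kind] != value:
            if not self.in_flight:
                raise RuntimeError("counter can never reach the requested value")
            self.memory_step()
            cycles += 1
        return cycles


t = ThreadState()
t.issue("load")
t.issue("load")
t.issue("store")
stall = t.wait("load", 0)  # wait only for the loads; the store may stay pending
print(stall, t.pending)
```

Because loads and stores are counted separately, the wait on the load counter returns while a store is still outstanding, which a single combined counter could not express.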
-
Publication number: US20220092725A1
Publication date: 2022-03-24
Application number: US17030852
Filing date: 2020-09-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Brian D. Emberling , Joseph Lee Greathouse , Anthony Thomas Gutierrez
Abstract: Systems, apparatuses, and methods for implementing register compaction with early release are disclosed. A processor includes at least a command processor, a plurality of compute units, a plurality of registers, and a control unit. Registers are statically allocated to wavefronts by the control unit when wavefronts are launched by the command processor on the compute units. In response to determining that a first set of registers, previously allocated to a first wavefront, are no longer needed, the first wavefront executes an instruction to release the first set of registers. The control unit detects the executed instruction and releases the first set of registers to the available pool of registers to potentially be used by other wavefronts. Then, the control unit can allocate the first set of registers to a second wavefront for use by threads of the second wavefront while the first wavefront is still active.
-
Publication number: US10474468B2
Publication date: 2019-11-12
Application number: US15439540
Filing date: 2017-02-22
Applicant: Advanced Micro Devices, Inc.
Inventor: Michael J. Mantor , Brian D. Emberling , Mark Fowler , Mark M. Leather
Abstract: Systems, apparatuses, and methods for processing variable wavefront sizes on a processor are disclosed. In one embodiment, a processor includes at least a scheduler, cache, and multiple execution units. When operating in a first mode, the processor executes the same instruction on multiple portions of a wavefront before proceeding to the next instruction of the shader program. When operating in a second mode, the processor executes a set of instructions on a first portion of a wavefront. In the second mode, when the processor finishes executing the set of instructions on the first portion of the wavefront, the processor executes the set of instructions on a second portion of the wavefront, and so on until all portions of the wavefront have been processed. The processor determines the operating mode based on one or more conditions.
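The difference between the two operating modes in this abstract comes down to loop order, which a short sketch can make concrete. The function `execution_order` and the mode labels `"first"` and `"second"` are hypothetical names; the conditions the processor uses to pick a mode are not modeled.

```python
def execution_order(instructions, portions, mode):
    """Return (instruction, portion) pairs in the order they would execute."""
    if mode == "first":
        # Same instruction on every portion before moving to the next instruction.
        return [(inst, p) for inst in instructions for p in range(portions)]
    if mode == "second":
        # Whole instruction set on one portion, then on the next portion.
        return [(inst, p) for p in range(portions) for inst in instructions]
    raise ValueError(f"unknown mode: {mode}")


shader = ["load", "mul"]
first_mode = execution_order(shader, portions=2, mode="first")
second_mode = execution_order(shader, portions=2, mode="second")
print(first_mode)
print(second_mode)
```

Both modes perform the same work; only the interleaving of instructions and wavefront portions changes.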
-
Publication number: US20190004807A1
Publication date: 2019-01-03
Application number: US15657478
Filing date: 2017-07-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen , Qingcheng Wang , Yunxiao Zou , Bin He , Jian Yang , Michael J. Mantor , Brian D. Emberling
IPC: G06F9/38
Abstract: Systems, apparatuses, and methods for implementing a stream processor with overlapping execution are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The processing throughput of the parallel processing unit is increased by overlapping execution of multi-pass instructions with single pass instructions without increasing the instruction issue rate. A first plurality of operands of a first vector instruction are read from a shared vector register file in a single clock cycle and stored in temporary storage. The first plurality of operands are accessed and utilized to initiate multiple instructions on individual vector elements on a first execution pipeline in subsequent clock cycles. A second plurality of operands are read from the shared vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on the second execution pipeline.
-
Publication number: US09311205B2
Publication date: 2016-04-12
Application number: US13840154
Filing date: 2013-03-15
Applicant: Advanced Micro Devices, Inc.
Inventor: Brian D. Emberling
CPC classification number: G06F11/302 , G06F11/3072 , G06F11/3476 , G06F11/348 , G06F11/3636 , G06F11/3648 , G06F2201/86
Abstract: An apparatus and methods for hardware-based performance monitoring of a computer system are presented. The apparatus includes: processing units; a memory; a connector device connecting the processing units and the memory; probes inserted in the processing units, the probes generating probe signals when selected processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface to receive probe signals, and an event memory controller to send probe event messages to the memory, where probe event messages are based on probe signals. The probe event messages transferred to memory can be subsequently analyzed using a software program to determine, for example, thread-to-thread interactions.
-
Publication number: US11074075B2
Publication date: 2021-07-27
Application number: US15442412
Filing date: 2017-02-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Mark Fowler , Brian D. Emberling
Abstract: Systems, apparatuses, and methods for maintaining separate pending load and store counters are disclosed herein. In one embodiment, a system includes at least one execution unit, a memory subsystem, and a pair of counters for each thread of execution. In one embodiment, the system implements a software based approach for managing dependencies between instructions. In one embodiment, the execution unit(s) maintains counters to support the software-based approach for managing dependencies between instructions. The execution unit(s) are configured to execute instructions that are used to manage the dependencies during run-time. In one embodiment, the execution unit(s) execute wait instructions to wait until a given counter is equal to a specified value before continuing to execute the instruction sequence.
-
Publication number: US20250005705A1
Publication date: 2025-01-02
Application number: US18764603
Filing date: 2024-07-05
Applicant: Advanced Micro Devices, Inc.
Inventor: Brian D. Emberling , Joseph Lee Greathouse , Anthony Thomas Gutierrez
Abstract: Systems, apparatuses, and methods for implementing register compaction with early release are disclosed. A processor includes at least a command processor, a plurality of compute units, a plurality of registers, and a control unit. Registers are statically allocated to wavefronts by the control unit when wavefronts are launched by the command processor on the compute units. In response to determining that a first set of registers, previously allocated to a first wavefront, are no longer needed, the first wavefront executes an instruction to release the first set of registers. The control unit detects the executed instruction and releases the first set of registers to the available pool of registers to potentially be used by other wavefronts. Then, the control unit can allocate the first set of registers to a second wavefront for use by threads of the second wavefront while the first wavefront is still active.