Apparatus and method for efficient prefix sum operation

    Publication No.: US09632979B2

    Publication date: 2017-04-25

    Application No.: US14727826

    Filing date: 2015-06-01

    Abstract: An apparatus and method are described for performing a prefix sum. For example, one embodiment of an apparatus comprises: a graphics processor unit comprising one or more execution units to execute single instruction multiple data (SIMD) instructions, the GPU to be provided with a plurality of data elements as input for a prefix sum operation; a first register of the GPU to store the plurality of data elements in specified data element positions; and the one or more execution units to perform a series of single instruction multiple data (SIMD) operations using the plurality of data elements, the SIMD operations performed using regioning techniques to generate the prefix sum, the SIMD operations including a first plurality of simultaneous addition operations to add specified data elements to generate intermediate results and further including a second plurality of simultaneous addition operations to add the intermediate results to other intermediate results to generate the prefix sum.
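
The two rounds of simultaneous additions described in the abstract resemble a log-step parallel scan. The sketch below is a plain-Python illustration of that general idea, not the patented method: the function name `simd_style_prefix_sum` is hypothetical, and a list comprehension over a snapshot of the previous round stands in for SIMD lanes and register regioning.

```python
def simd_style_prefix_sum(values):
    """Inclusive prefix sum computed in about log2(n) rounds of lane-wise additions."""
    lanes = list(values)
    stride = 1
    while stride < len(lanes):
        # One simulated SIMD round: every lane i >= stride adds the lane that
        # sits `stride` positions to its left. Reading only from the previous
        # round's snapshot models all additions happening at the same time.
        previous = lanes
        lanes = [
            previous[i] + previous[i - stride] if i >= stride else previous[i]
            for i in range(len(previous))
        ]
        stride *= 2
    return lanes
```

For eight elements this takes three rounds (strides 1, 2 and 4): the first round adds neighbouring input elements, and the later rounds add intermediate results to other intermediate results.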

    Method and apparatus for subdividing shader workloads in a graphics processor for efficient machine configuration

    Publication No.: US10360717B1

    Publication date: 2019-07-23

    Application No.: US15858396

    Filing date: 2017-12-29

    Abstract: An apparatus and method for splitting shaders. For example, one embodiment of a method comprises: receiving a request for compilation of a shader in a graphics processing environment; determining whether there is sufficient work associated with the shader to justify splitting the shader into two or more blocks of program code; evaluating the program code of the shader to identify dependencies between the blocks of program code if there is sufficient work; subdividing the shader into the two or more blocks in accordance with the identified dependencies; and individually executing the two or more blocks of code on a graphics processor. In addition, one embodiment includes the operations of determining whether any of the regions that can be subdivided are likely to run faster with different machine configurations than if the shader is executed without being subdivided, and subdividing the shader only for those regions that are likely to run faster with different machine configurations.

    Hardware instruction set to replace a plurality of atomic operations with a single atomic operation

    Publication No.: US10318292B2

    Publication date: 2019-06-11

    Application No.: US14543027

    Filing date: 2014-11-17

    Abstract: Systems and methods may process a single atomic operation. An instruction set may be generated to replace a plurality of atomic operations with a single atomic operation. The instruction set may include an accumulation instruction to compute a prefix sum for a plurality of initial values associated with a plurality of processing lanes to generate a plurality of accumulated values. The instruction set may also include a broadcast instruction to return a pre-existing value to be added with each of the plurality of accumulated values to generate a plurality of intermediate accumulated values. In one example, a graphics processor may execute the instruction set to process the single atomic operation.
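
A minimal sketch of the idea in this abstract, under stated assumptions: `Counter`, `atomic_add` and `combined_atomic_add` are hypothetical names, the shared memory location is modelled as a plain Python object, and an inclusive prefix sum is assumed (the abstract does not say whether the sum is inclusive or exclusive). The per-lane values are accumulated first, a single atomic add is issued for the total, and the returned pre-existing value is broadcast and added to each accumulated value.

```python
from itertools import accumulate


class Counter:
    """Hypothetical stand-in for a shared memory location updated atomically."""

    def __init__(self, value=0):
        self.value = value
        self.atomic_ops = 0  # how many atomic operations were issued

    def atomic_add(self, amount):
        """Add `amount` and return the value held before the addition."""
        old = self.value
        self.value += amount
        self.atomic_ops += 1
        return old


def combined_atomic_add(counter, lane_values):
    # Accumulation step: inclusive prefix sum across the lanes.
    accumulated = list(accumulate(lane_values))
    total = accumulated[-1] if accumulated else 0
    # One atomic operation replaces one atomic per lane.
    pre_existing = counter.atomic_add(total)
    # Broadcast step: every lane adds the pre-existing value to its own sum.
    return [pre_existing + partial for partial in accumulated]
```

With a counter starting at 10 and lane values `[1, 2, 3, 4]`, each lane receives the value the counter would hold after its own addition had the lanes issued their atomics one by one in lane order, yet only one atomic operation is performed.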

    FACILITATING DYNAMIC RUNTIME TRANSFORMATION OF GRAPHICS PROCESSING COMMANDS FOR IMPROVED GRAPHICS PERFORMANCE AT COMPUTING DEVICES

    Type: Invention application (under examination, published)

    Publication No.: US20160364828A1

    Publication date: 2016-12-15

    Application No.: US14738679

    Filing date: 2015-06-12

    Abstract: A mechanism is described for facilitating dynamic runtime transformation of graphics processing commands for improved graphics performance on computing devices. A method of embodiments, as described herein, includes detecting a command stream associated with an application, where the command stream includes dispatches. The method may further include evaluating processing parameters relating to each of the dispatches, where evaluating further includes associating a first plan with one or more of the dispatches to transform the command stream into a transformed command stream. The method may further include associating, based on the first plan, a second plan to the one or more of the dispatches, where the second plan represents the transformed command stream. The method may further include executing the second plan, where execution of the second plan includes processing the transformed command stream in lieu of the command stream.

    Techniques to manage execution of divergent shaders

    Publication No.: US11776195B2

    Publication date: 2023-10-03

    Application No.: US17463320

    Filing date: 2021-08-31

    CPC classification number: G06T15/005 G06F9/4887

    Abstract: Examples are described here that can be used to enable a main routine to request subroutines or other related code to be executed with other instantiations of the same subroutine or other related code for parallel execution. A sorting unit can be used to accumulate requests to execute instantiations of the subroutine. The sorting unit can request execution of a number of multiple instantiations of the subroutine corresponding to a number of lanes in a SIMD unit. A call stack can be used to share information to be accessed by a main routine after execution of the subroutine completes.
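
The batching behaviour attributed to the sorting unit can be sketched roughly as follows. This is an illustration only: the class `SortingUnit`, its `request` method and the `dispatched` list are hypothetical, and the sketch releases a batch only when the number of pending requests for one subroutine equals the SIMD width (the call stack and any flushing of partial batches are left out).

```python
from collections import defaultdict


class SortingUnit:
    """Groups requests for the same subroutine into SIMD-width batches."""

    def __init__(self, simd_width):
        self.simd_width = simd_width
        self.pending = defaultdict(list)  # subroutine name -> waiting callers
        self.dispatched = []              # (subroutine, batch) pairs, in release order

    def request(self, subroutine, caller_state):
        queue = self.pending[subroutine]
        queue.append(caller_state)
        # Once enough instantiations are waiting to fill every lane,
        # release them together so the SIMD unit runs without divergence.
        if len(queue) == self.simd_width:
            self.dispatched.append((subroutine, list(queue)))
            queue.clear()
```

Requests for different subroutines wait in separate queues, so a batch handed to the SIMD unit always contains instantiations of a single subroutine.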

    Techniques to manage execution of divergent shaders

    Publication No.: US11107263B2

    Publication date: 2021-08-31

    Application No.: US16190021

    Filing date: 2018-11-13

    Abstract: Examples are described here that can be used to enable a main routine to request subroutines or other related code to be executed with other instantiations of the same subroutine or other related code for parallel execution. A sorting unit can be used to accumulate requests to execute instantiations of the subroutine. The sorting unit can request execution of a number of multiple instantiations of the subroutine corresponding to a number of lanes in a SIMD unit. A call stack can be used to share information to be accessed by a main routine after execution of the subroutine completes.

    Method and apparatus for efficient processing of derived uniform values in a graphics processor

    Publication No.: US10726605B2

    Publication date: 2020-07-28

    Application No.: US15705530

    Filing date: 2017-09-15

    Abstract: Various embodiments enable low frequency calculation of derived uniform values. A compiler can identify one or more portions of a shader that calculate a derived value based on an input value. For example, this portion may include instructions that use constant values, or the results of prior functions that used constant values. The constant values may include hardcoded values provided by the program (e.g., immediates) and/or other constant values. This portion of the shader can be extracted by the compiler and compiled into a first program. The compiler can compile the remainder of the shader into a second program that receives the derived uniform values from the first program. By extracting the portion(s) of the program that calculates a derived value into a separate program, the derived uniform value or values can be calculated at a lower frequency than if they were calculated for each pixel.
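
The split described in this abstract can be illustrated with a small sketch. Everything here is assumed for illustration: the functions `derive_uniforms`, `shade_pixel` and `draw` are hypothetical, and a 2D rotation stands in for whatever a real shader would compute. The trigonometric terms depend only on a uniform input, so they are computed once per draw in a "first program" and passed to the per-pixel "second program".

```python
import math


def derive_uniforms(uniforms):
    """First program: runs once per draw and depends only on uniform inputs."""
    return {
        "cos_angle": math.cos(uniforms["angle"]),
        "sin_angle": math.sin(uniforms["angle"]),
    }


def shade_pixel(x, y, derived):
    """Second program: runs per pixel and reuses the derived uniform values."""
    return (
        x * derived["cos_angle"] - y * derived["sin_angle"],
        x * derived["sin_angle"] + y * derived["cos_angle"],
    )


def draw(uniforms, pixels):
    derived = derive_uniforms(uniforms)  # computed once, not once per pixel
    return [shade_pixel(x, y, derived) for x, y in pixels]
```

Without the split, `math.cos` and `math.sin` would be evaluated for every pixel even though their argument never changes within a draw.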
