DIRECT-CONNECTED MACHINE LEARNING ACCELERATOR

    Publication number: US20250086515A1

    Publication date: 2025-03-13

    Application number: US18954763

    Application date: 2024-11-21

    Inventor: Maxim V. Kazakov

    Abstract: Techniques are disclosed for communicating between a machine learning accelerator and one or more processing cores. The techniques include obtaining data at the machine learning accelerator via an input/output die; processing the data at the machine learning accelerator to generate machine learning processing results; and exporting the machine learning processing results via the input/output die, wherein the input/output die is coupled to one or more processor chiplets via one or more processor ports, and wherein the input/output die is coupled to the machine learning accelerator via an accelerator port.
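The topology described above can be sketched as follows. This is a minimal illustrative model, not the patented implementation: the port names and the `route` helper are hypothetical, standing in for an I/O die that couples processor chiplets via processor ports and the machine learning accelerator via an accelerator port.

```python
# Hypothetical sketch (not the patented design): an I/O die modeled as a
# mapping from port names to attached units; all traffic between chiplets
# and the ML accelerator flows through it.
def route(io_die, message):
    """io_die: dict mapping port name -> attached unit.
    message: (source_port, destination_port, payload)."""
    src, dst, data = message
    # Both endpoints must hang off the same I/O die.
    assert src in io_die and dst in io_die
    return f"{io_die[src]} -> {io_die[dst]}: {data}"
```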

    Techniques for reducing serialization in divergent control flow

    Publication number: US12014208B2

    Publication date: 2024-06-18

    Application number: US16023897

    Application date: 2018-06-29

    Abstract: Techniques for executing shader programs with divergent control flow on a single instruction multiple data (“SIMD”) processor are disclosed. These techniques include detecting entry into a divergent section of a shader program and, for the work-items that enter the divergent section, placing a task entry into a task queue associated with the target of each work-item. The target is the destination, in code, of any particular work-item, and is also referred to as a code segment herein. The task queues store task entries for code segments generated by different (or the same) wavefronts. A command processor examines the task queues and schedules wavefronts for execution by grouping together tasks in the same task queue into wavefronts and launching those wavefronts. By grouping tasks from different wavefronts together for execution in the same wavefront, serialization of execution is greatly reduced or eliminated.
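The regrouping scheme described above can be sketched in a few lines. This is a simplified host-side model under assumed names (`regroup`, `WAVE_SIZE` are illustrative, and a real SIMD width would be 32 or 64): work-items are queued per branch target, then same-target tasks are packed into uniform wavefronts.

```python
# Hypothetical sketch of the task-queue regrouping: work-items diverging
# to different code segments ("targets") are enqueued per target, and the
# scheduler packs same-target tasks into wavefronts of WAVE_SIZE lanes so
# each launched wavefront executes uniformly.
from collections import defaultdict

WAVE_SIZE = 4  # illustrative; real SIMD widths are e.g. 32 or 64

def regroup(work_items):
    """work_items: list of (work_item_id, target_segment) pairs."""
    queues = defaultdict(list)          # one task queue per code segment
    for item_id, target in work_items:  # divergent entry: enqueue by target
        queues[target].append(item_id)
    wavefronts = []
    for target, tasks in queues.items():  # group same-target tasks together
        for i in range(0, len(tasks), WAVE_SIZE):
            wavefronts.append((target, tasks[i:i + WAVE_SIZE]))
    return wavefronts
```

Because every launched wavefront contains only work-items headed to the same code segment, no lane is masked off for the divergent section, which is the serialization reduction the abstract describes.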

    WAVEFRONT SELECTION AND EXECUTION
    Invention publication

    Publication number: US20230266975A1

    Publication date: 2023-08-24

    Application number: US18309536

    Application date: 2023-04-28

    Inventor: Maxim V. Kazakov

    CPC classification number: G06F9/3885 G06F9/3869 G06F9/3851 G06F9/30152

    Abstract: Techniques are provided for executing wavefronts. The techniques include at a first time for issuing instructions for execution, performing first identifying, including identifying that sufficient processing resources exist to execute a first set of instructions together within a processing lane; in response to the first identifying, executing the first set of instructions together; at a second time for issuing instructions for execution, performing second identifying, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and in response to the second identifying, executing an instruction independently of any other instruction.
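The two issue-time cases in the abstract can be sketched as a single decision function. This is a toy model under assumed names (`select_issue` and the abstract "resource cost" units are hypothetical): pair two pending instructions when the lane's resources cover both, otherwise issue one alone.

```python
# Hypothetical sketch of the issue-time decision: at each issue slot, try
# to find two pending instructions whose combined resource cost fits the
# processing lane ("first identifying"); if none, fall back to issuing a
# single instruction independently ("second identifying").
def select_issue(pending, lane_resources):
    """pending: list of (name, resource_cost); returns names to issue."""
    for i, (a, cost_a) in enumerate(pending):
        for b, cost_b in pending[i + 1:]:
            if cost_a + cost_b <= lane_resources:
                return [a, b]            # execute the pair together
    return [pending[0][0]] if pending else []  # execute one independently
```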

    PROCESSING DEVICE AND METHOD OF SHARING STORAGE BETWEEN CACHE MEMORY, LOCAL DATA STORAGE AND REGISTER FILES

    Publication number: US20230069890A1

    Publication date: 2023-03-09

    Application number: US17467104

    Application date: 2021-09-03

    Inventor: Maxim V. Kazakov

    Abstract: An accelerated processing device is provided which comprises a plurality of compute units, each including a plurality of SIMD units, and each SIMD unit comprises a register file. The accelerated processing device also comprises local data storage (LDS) in communication with each of the SIMD units. The accelerated processing device also comprises a first portion of cache memory in communication with each of the SIMD units and a second portion of cache memory shared by the compute units. The compute units are configured to execute a program in which a storage portion of at least one of the register file of a SIMD unit, the first portion of cache memory, and the LDS is reserved as part of another of the register file, the first portion of cache memory, and the LDS.
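The reservation idea above can be shown with a trivial sketch. The function name, sizes, and the specific choice of carving register-file space into LDS are all hypothetical; the abstract allows any of the three stores (register file, first cache portion, LDS) to donate capacity to another.

```python
# Hypothetical sketch of per-program storage sharing: reserve a portion of
# one backing store (here the register file) so it can be used as part of
# another (here extra LDS capacity).
def partition(total_regfile, reserved_for_lds):
    assert reserved_for_lds <= total_regfile
    return {"register_file": total_regfile - reserved_for_lds,
            "lds_extra": reserved_for_lds}
```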

    MULTI-ACCELERATOR COMPUTE DISPATCH
    Invention application

    Publication number: US20220319089A1

    Publication date: 2022-10-06

    Application number: US17218421

    Application date: 2021-03-31

    Abstract: Techniques for executing computing work by a plurality of chiplets are provided. The techniques include assigning workgroups of a kernel dispatch packet to the chiplets; by each chiplet, executing the workgroups assigned to that chiplet; for each chiplet, upon completion of all workgroups assigned to that chiplet for the kernel dispatch packet, notifying the other chiplets of such completion; and upon completion of all workgroups of the kernel dispatch packet, notifying a client of such completion and proceeding to a subsequent kernel dispatch packet.
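The dispatch-and-notify flow above can be sketched as a small host-side simulation. The round-robin assignment, the `dispatch` helper, and the event strings are all assumptions for illustration; the abstract does not specify how workgroups are divided among chiplets.

```python
# Hypothetical sketch of the chiplet completion protocol: workgroups of a
# kernel dispatch packet are assigned (here round-robin) to chiplets; each
# chiplet notifies its peers when its share completes, and once every
# chiplet is done the client is notified and the next packet can proceed.
def dispatch(num_workgroups, num_chiplets):
    assignments = {c: [] for c in range(num_chiplets)}
    for wg in range(num_workgroups):          # assign workgroups to chiplets
        assignments[wg % num_chiplets].append(wg)
    done, events = set(), []
    for chiplet in assignments:               # each chiplet runs its share
        done.add(chiplet)
        events.append(f"chiplet {chiplet} notified peers")
    if len(done) == num_chiplets:             # all shares complete
        events.append("client notified; advance to next dispatch packet")
    return assignments, events
```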

    DATA COMPRESSOR FOR APPROXIMATION OF MATRICES FOR MATRIX MULTIPLY OPERATIONS

    Publication number: US20220309125A1

    Publication date: 2022-09-29

    Application number: US17214779

    Application date: 2021-03-26

    Abstract: A processing device is provided which comprises memory configured to store data and a processor. The processor comprises a plurality of multiplier-accumulators (MACs) configured to perform matrix multiplication of elements of a first matrix and elements of a second matrix. The processor also comprises a plurality of logic devices configured to sum the exponent values of products of elements of the first matrix and the second matrix and to determine keep bit values indicating which product exponent values are to be kept for the matrix multiplication. The processor also comprises a plurality of multiplexor arrays, each configured to receive bits of the elements of the first matrix and the second matrix along with the keep bit values and to provide data for selecting which elements of the first matrix and the second matrix are provided to the MACs for matrix multiplication.
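The exponent-based selection can be illustrated for a single dot product. This is a software approximation of the hardware idea, under assumed names (`approx_dot`, `keep`): a product's magnitude is estimated from the sum of its operands' floating-point exponents, and only the largest-exponent products are multiplied and accumulated.

```python
# Hypothetical sketch of the approximation: estimate each product's size
# from the sum of its operands' exponents, set a "keep bit" for the K
# largest, and multiply-accumulate only the kept products.
import math

def approx_dot(a, b, keep):
    exps = [math.frexp(x)[1] + math.frexp(y)[1]   # sum of operand exponents
            for x, y in zip(a, b)]
    order = sorted(range(len(a)), key=lambda i: exps[i], reverse=True)
    kept = set(order[:keep])                      # "keep bit" per product
    return sum(a[i] * b[i] for i in range(len(a)) if i in kept)
```

Dropping the small-exponent products skips their MAC operations entirely, which is the compression the title refers to.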

    Matrix multiplier with submatrix sequencing

    Publication number: US11093580B2

    Publication date: 2021-08-17

    Application number: US16176449

    Application date: 2018-10-31

    Abstract: A processor sequences the application of submatrices at a matrix multiplier to reduce the number of input changes at an input register of the matrix multiplier. The matrix multiplier is configured to perform a matrix multiplication for a relatively small matrix. To multiply two larger matrices, the GPU decomposes the larger matrices into smaller submatrices and stores the submatrices at input registers of the matrix multiplier in a sequence, thereby calculating each column of a result matrix. The GPU sequences the storage of the submatrices at the input registers to maintain input data at one of the input registers over multiple calculation cycles of the matrix multiplier, thereby reducing power consumption at the GPU.
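The sequencing can be shown with a tile-schedule sketch. The loop nest and the choice to hold the B-operand register fixed are assumptions for illustration (the abstract only says one input register is kept stable across cycles); `sequence_tiles` is a hypothetical name.

```python
# Hypothetical sketch of submatrix sequencing for C = A @ B: iterate tiles
# so the B tile stays in its input register across consecutive cycles
# while A tiles stream through, and count how often the B register's
# contents actually change.
def sequence_tiles(m_tiles, n_tiles, k_tiles):
    schedule = []
    for n in range(n_tiles):          # one result column at a time
        for k in range(k_tiles):      # B tile B[k][n] held fixed...
            for m in range(m_tiles):  # ...while A tiles stream through
                schedule.append((("A", m, k), ("B", k, n)))
    changes = sum(1 for i in range(1, len(schedule))
                  if schedule[i][1] != schedule[i - 1][1])
    return schedule, changes          # B register changes n_tiles*k_tiles - 1 times
```

Fewer input-register transitions means fewer toggling bits per cycle, which is where the power saving comes from.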

    Texture residency checks using compression metadata

    Publication number: US10783694B2

    Publication date: 2020-09-22

    Application number: US15687108

    Application date: 2017-08-25

    Abstract: A pipeline is configured to access a memory that stores a texture block and metadata that encodes compression parameters of the texture block and a residency status of the texture block. A processor requests access to the metadata in conjunction with requesting data in the texture block to perform a shading operation. The pipeline selectively returns the data in the texture block to the processor depending on whether the metadata indicates that the texture block is resident in the memory. A cache can also be included to store a copy of the metadata that encodes the compression parameters of the texture block. The residency status and the metadata stored in the cache can be modified in response to requests to access the metadata stored in the cache.
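The residency gate can be sketched in a few lines. This is a deliberately simplified model (the `fetch_texel` name and the metadata dictionary layout are hypothetical): the per-block metadata is consulted first, and texel data is returned only when the block is marked resident.

```python
# Hypothetical sketch of a metadata-gated texture fetch: metadata carries
# both the block's compression parameters and its residency status, and
# the pipeline returns data only for resident blocks.
def fetch_texel(block_id, metadata, memory):
    meta = metadata[block_id]             # compression params + residency
    if not meta["resident"]:
        return None, "not resident"       # shader observes a residency miss
    return memory[block_id], meta["compression"]
```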

    Wave creation control with dynamic resource allocation

    Publication number: US10558499B2

    Publication date: 2020-02-11

    Application number: US15794593

    Application date: 2017-10-26

    Abstract: Footprints, or resource allocations, of waves within resources that are shared by processor cores in a multithreaded processor are measured concurrently with the waves executing on the processor cores. The footprints are averaged over a time interval. A number of waves are spawned and dispatched for execution in the multithreaded processor based on the average footprint. In some cases, the waves are spawned at a rate that is determined based on the average value of the footprints of waves within the resources. The rate of spawning waves is modified in response to a change in the average value of the footprints of the waves within the resources.
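The control rule above reduces to a small calculation. The exact policy is an assumption (the abstract only says the spawn rate is derived from, and modified with, the average footprint); `spawn_rate` and the inverse-proportional form are illustrative.

```python
# Hypothetical sketch of footprint-based wave creation control: average the
# measured per-wave footprints over an interval, then size the number of
# in-flight waves to what the shared resource capacity can hold.
def spawn_rate(footprints, shared_capacity):
    avg = sum(footprints) / len(footprints)     # average over the interval
    return max(1, int(shared_capacity // avg))  # waves to keep in flight
```

A rising average footprint lowers the rate and a falling one raises it, matching the modification-in-response-to-change behavior the abstract describes.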
