Abstract:
A method and system that allow a user level application executing on a first processing unit to safely enqueue work or tasks for a second processing unit without performing any ring transition. For example, in one embodiment of the invention, the first processing unit executes one or more user level applications, each of which has a task to be offloaded to the second processing unit. In one embodiment of the invention, the first processing unit signals the second processing unit to handle the task from each user level application without performing any ring transition.
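As a concrete illustration, below is a minimal C sketch of how such user-level submission might look, assuming the descriptor queue and the device doorbell register are both mapped into the application's address space; every name here (task_desc, submit_queue, enqueue_task) is a hypothetical placeholder, not something taken from the abstract:

    /* Hypothetical user-level submission path: the queue and doorbell are
     * user-mapped, so enqueueing needs no syscall and no ring transition.
     * Flow control (queue-full handling) is omitted for brevity. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define QUEUE_DEPTH 256

    struct task_desc {
        uint64_t code_ptr;           /* work for the second processing unit */
        uint64_t args_ptr;
    };

    struct submit_queue {
        struct task_desc slots[QUEUE_DEPTH];
        _Atomic uint32_t tail;       /* producer index, written from ring 3 */
        volatile uint32_t *doorbell; /* device MMIO page mapped to user space */
    };

    void enqueue_task(struct submit_queue *q, const struct task_desc *d)
    {
        uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        q->slots[t % QUEUE_DEPTH] = *d;
        /* Publish the descriptor before ringing the doorbell. */
        atomic_store_explicit(&q->tail, t + 1, memory_order_release);
        *q->doorbell = t + 1;        /* plain user-space store: no syscall */
    }

The doorbell write is an ordinary store to a memory-mapped register, so the entire submission path stays at user privilege.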
Abstract:
A technique to enable information sharing among agents within different cache coherency domains. In one embodiment, a graphics device may store or read information in one or more caches used by one or more processing cores, and that information may be accessed in a manner that does not affect the programming and coherency rules pertaining to the graphics device.
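As a loose illustration of the idea, here is a toy C model in which each cache line carries a domain tag, so a graphics access can hit data the cores placed in a shared cache while being tracked under its own domain's rules; the enums and fields are assumptions for exposition, not the patented hardware design:

    #include <stdbool.h>
    #include <stdint.h>

    enum domain { DOM_CORE, DOM_GFX };
    enum state  { INVALID, SHARED, MODIFIED };

    struct line {
        uint64_t    tag;
        enum state  state;
        enum domain owner;          /* which coherency rules govern this line */
        uint8_t     data[64];
    };

    /* A graphics read may hit a line the cores cached; the line is tracked
     * under the graphics domain, leaving the cores' snoop rules untouched. */
    bool gfx_read(struct line *l, uint64_t tag, uint8_t *out)
    {
        if (l->state != INVALID && l->tag == tag) {
            for (int i = 0; i < 64; i++)
                out[i] = l->data[i];
            return true;            /* hit: no main-memory traffic needed */
        }
        return false;               /* miss: fill from memory as DOM_GFX  */
    }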
Abstract:
In position-only shading, two geometry pipes exist: a trimmed-down version called the Cull Pipe and a full version called the Replay Pipe. The Cull Pipe executes the position shaders in parallel with the main application, but typically generates the critical results much faster because it fetches and shades only the position attribute of the vertices and avoids rasterization as well as the rendering of pixels to the frame buffer. The Cull Pipe uses these critical results to compute visibility information for all the triangles, whether they are culled or not. The Replay Pipe, in turn, consumes the visibility information to skip the culled triangles and shades only the visible triangles, which are finally passed to the rasterization phase. Together, the two pipes can hide the long cull runs of discarded triangles and can complete the work faster in some embodiments.
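The division of labor between the two pipes can be sketched in C as two passes over the same triangle list that share a visibility array; triangle_visible below is a stand-in for the real culling tests (frustum, backface, zero coverage), and the position fetch stands in for running the position-only vertex shader:

    #include <stdbool.h>
    #include <stddef.h>

    struct vec4 { float x, y, z, w; };

    /* Placeholder visibility test: trivial reject behind the eye plus a
     * zero-area check; real hardware applies more cull tests than this. */
    static bool triangle_visible(struct vec4 a, struct vec4 b, struct vec4 c)
    {
        if (a.w <= 0 && b.w <= 0 && c.w <= 0)
            return false;
        float area = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        return area != 0.0f;
    }

    /* Cull Pipe: shades only positions, records per-triangle visibility. */
    void cull_pipe(const struct vec4 *pos, const unsigned *idx,
                   size_t tris, bool *vis)
    {
        for (size_t t = 0; t < tris; t++)
            vis[t] = triangle_visible(pos[idx[3*t]], pos[idx[3*t+1]],
                                      pos[idx[3*t+2]]);
    }

    /* Replay Pipe: consumes the visibility bits and skips culled triangles. */
    void replay_pipe(size_t tris, const bool *vis)
    {
        for (size_t t = 0; t < tris; t++)
            if (vis[t]) {
                /* full vertex shading and rasterization of triangle t */
            }
    }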
Abstract:
Page faults arising in a graphics processing unit may be handled by an operating system running on the central processing unit. In some embodiments, this means that unpinned memory can be used for the graphics processing unit. Using unpinned memory in the graphics processing unit may expand the capabilities of the graphics processing unit in some cases.
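A sketch of that fault-servicing flow in C, assuming a driver-level interrupt handler that defers to the OS paging path; os_resolve_fault, gpu_update_page_table, and the other hooks are hypothetical placeholders, not real driver APIs:

    #include <stdbool.h>
    #include <stdint.h>

    struct gpu_fault { uint64_t va; bool write; uint32_t context_id; };

    /* Placeholder hooks; the first stands in for the OS page-fault path,
     * which can fault in an unpinned page exactly as for a CPU access. */
    extern bool os_resolve_fault(uint64_t va, bool write);
    extern void gpu_update_page_table(uint32_t ctx, uint64_t va);
    extern void gpu_resume(uint32_t ctx);
    extern void gpu_kill_context(uint32_t ctx);

    void on_gpu_fault_interrupt(struct gpu_fault f)
    {
        if (os_resolve_fault(f.va, f.write)) {
            gpu_update_page_table(f.context_id, f.va); /* map the new frame */
            gpu_resume(f.context_id);                  /* replay the access */
        } else {
            gpu_kill_context(f.context_id);            /* unrecoverable */
        }
    }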
Abstract:
An integrated circuit, for use as a cache subsystem, implements a cache static random access memory (SRAM) storage array, a central processing unit (CPU) bus interface, and a main memory bus interface. The CPU bus and main memory bus interfaces include multiplexers, buffers, and local control for optimizing burst read and write operations to and from the CPU bus. These circuits allow a full cache line to be read or written in a single access of the SRAM array. Control logic is utilized within the CPU bus interface for controlling CPU bursts in the order defined by the CPU. The memory bus interface includes internal buffers used in performing memory bus reads, write-throughs, write-backs and snoops. Tracking logic is employed for determining the appropriate internal buffer to be utilized for a particular memory bus cycle. Additionally, a data path is included for transparently passing data between the CPU and memory bus interfaces without disturbance of the SRAM array.
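For a four-beat burst, the interleaved order used by Intel CPUs of that era is simply start XOR beat (0-1-2-3, 1-0-3-2, 2-3-0-1, 3-2-1-0), so the control logic can derive each beat's slot with one XOR; a small runnable C illustration, on the assumption that this is the burst order the abstract refers to:

    #include <stdio.h>

    /* Slot within the cache line supplied on a given beat of the burst. */
    static unsigned burst_slot(unsigned start, unsigned beat)
    {
        return (start ^ beat) & 3;  /* interleaved burst sequence */
    }

    int main(void)
    {
        for (unsigned start = 0; start < 4; start++) {
            printf("start %u:", start);
            for (unsigned beat = 0; beat < 4; beat++)
                printf(" %u", burst_slot(start, beat));
            printf("\n");
        }
        return 0;
    }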
Abstract:
A second level cache memory controller, implemented as an integrated circuit unit, operates in conjunction with a secondary random access cache memory and a main memory (system) bus controller to form a second level cache memory subsystem. The subsystem is interfaced to the local processor (CPU) bus and to the main memory bus, providing independent access by both buses and thereby reducing traffic on the main memory bus when the data required by the CPU is located in the secondary cache. Similarly, CPU bus traffic is minimized when the secondary cache is accessed by the main memory bus for snoops and write-backs to main memory. Snoop latches interfaced with the main memory bus provide snoop access to the cache memory via the cache directory in the secondary cache controller unit. The controller also supports parallel look-up in the controller tag array and the secondary cache using most-recently-used (MRU) prediction, main memory write-through, and pipelining of memory bus cycle requests.
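The snoop path can be pictured in C as a latched memory-bus address looked up in the controller's tag array, with a dirty hit scheduling a write-back and a miss leaving the CPU bus undisturbed; the structures below are illustrative assumptions, not the controller's actual design:

    #include <stdbool.h>
    #include <stdint.h>

    enum state { INVALID, SHARED, MODIFIED };

    struct tag_entry   { uint64_t tag; enum state state; };
    struct snoop_latch { uint64_t addr_tag; bool valid; };

    /* Returns true if the snooped line is dirty and must be written back. */
    bool handle_snoop(struct snoop_latch *l, struct tag_entry *set,
                      unsigned ways, bool invalidating)
    {
        if (!l->valid)
            return false;
        l->valid = false;
        for (unsigned w = 0; w < ways; w++) {
            if (set[w].state != INVALID && set[w].tag == l->addr_tag) {
                bool dirty = (set[w].state == MODIFIED);
                set[w].state = invalidating ? INVALID : SHARED;
                return dirty;       /* dirty hit -> write-back to memory */
            }
        }
        return false;               /* miss: CPU bus is not disturbed */
    }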
Abstract:
By packing the depth data in a way that is independent of the number of samples, memory bandwidth remains the same regardless of the sample count, so higher numbers of samples per pixel may be used without adversely affecting buffer cost. In some embodiments, the number of pixels per clock in a first level depth test may be increased by operating in the pixel domain, whereas previous solutions operated at the sample level.
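One well-known packing with this property is a per-pixel plane equation, from which the depth of any sample can be derived on the fly; a C sketch under that assumption (the layout and names are illustrative, not necessarily the patented format):

    /* Fixed-size depth representation: the footprint does not grow with
     * the number of samples per pixel. */
    struct depth_plane {
        float z0;                   /* depth at the pixel origin */
        float dzdx, dzdy;           /* depth gradients across the pixel */
    };

    /* Depth of one sample, derived from the plane instead of being stored. */
    float sample_depth(const struct depth_plane *p, float sx, float sy)
    {
        return p->z0 + p->dzdx * sx + p->dzdy * sy;
    }

    /* First-level depth test in the pixel domain: one conservative test
     * over the whole pixel replaces one test per sample (LESS compare). */
    int pixel_may_pass(const struct depth_plane *p, float zmax_buf)
    {
        float zmin = p->z0;
        if (p->dzdx < 0) zmin += p->dzdx;   /* min over the unit pixel */
        if (p->dzdy < 0) zmin += p->dzdy;
        return zmin <= zmax_buf;
    }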