Abstract:
PROBLEM TO BE SOLVED: To provide a method and a device for reducing memory latency in a software application. SOLUTION: A performance analysis tool 208 profiles the resource usage of the software application 210 and identifies regions of the software application 210 that experience performance bottlenecks. Compiler-runtime instructions are generated within the software application to create and manage a helper thread. The helper thread prefetches data in the identified regions of the software application experiencing performance bottlenecks. A counting mechanism is inserted into the helper thread and into the main thread to help ensure that the prefetched data is not evicted from the cache before the main thread can take advantage of it.
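As a rough illustration of the prefetching idea (not the patented implementation), the C++ sketch below runs a helper thread a fixed distance ahead of the main thread; main_pos, run_ahead, and the use of the GCC/Clang __builtin_prefetch builtin are all illustrative assumptions.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<size_t> main_pos{0};  // index the main thread is currently consuming

// Helper thread: stays at most run_ahead elements in front of the main thread
// so prefetched cache lines are still resident when the main thread arrives.
void helper_prefetch(const std::vector<int>& data, size_t run_ahead) {
    for (size_t i = 0; i < data.size(); ++i) {
        while (i > main_pos.load(std::memory_order_relaxed) + run_ahead)
            std::this_thread::yield();
        __builtin_prefetch(&data[i]);  // GCC/Clang builtin
    }
}

// Main thread: does the memory-bound work and publishes its position.
long consume(const std::vector<int>& data) {
    long sum = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        sum += data[i];
        main_pos.store(i, std::memory_order_relaxed);
    }
    return sum;
}
```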
Abstract:
PROBLEM TO BE SOLVED: To provide a mechanism for scheduling user-level threads so that they can execute on a processor that is not directly managed by an OS. SOLUTION: User-level threads on a first instruction sequencer are managed in response to executing user-level instructions on a second instruction sequencer that is under the control of an application-level program. A first user-level thread runs on the second instruction sequencer and contains one or more user-level instructions. A first user-level instruction has at least (1) a field that references one or more instruction sequencers, or (2) an implicit reference via a pointer to code that specifically addresses one or more instruction sequencers when the code is executed.
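The mechanism described is a hardware/ISA feature, so real code cannot reproduce it; the hedged C++ model below merely mimics a user-level instruction carrying a field that references target instruction sequencers. UserLevelInstr, target_sequencers, and UserLevelScheduler are invented stand-ins, not the patented encoding.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

struct UserLevelInstr {
    uint16_t target_sequencers;  // bit mask: the field referencing sequencers
    std::function<void()> code;  // work the user-level thread should run
};

// Runs under application control, not the OS, echoing the abstract's scheme.
class UserLevelScheduler {
    std::map<int, std::vector<UserLevelInstr>> queues_;
public:
    void dispatch(const UserLevelInstr& in) {
        for (int s = 0; s < 16; ++s)
            if (in.target_sequencers & (1u << s))
                queues_[s].push_back(in);  // hand the work to that sequencer
    }
};
```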
Abstract:
Methods and apparatus to provide loop parallelization based on loop splitting and/or an index array are described. In one embodiment, one or more split loops corresponding to an original loop are generated based on mis-speculation information. In another embodiment, a plurality of subloops is generated from an original loop based on an index array. Other embodiments are also described.
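A hedged sketch of the index-array variant, under the assumption that the compiler has grouped iteration indices into conflict-free sets so each subloop can run in parallel; the functions original and split and the grouping itself are illustrative, not the claimed transformation.

```cpp
#include <vector>

// Original loop: x[idx[i]] may alias across iterations, blocking parallelism.
void original(std::vector<double>& x, const std::vector<int>& idx) {
    for (size_t i = 0; i < idx.size(); ++i)
        x[idx[i]] += 1.0;
}

// Split form: each group of iteration indices is internally conflict-free,
// so every subloop can execute as a parallel loop (compile with -fopenmp).
void split(std::vector<double>& x, const std::vector<std::vector<int>>& groups) {
    for (const auto& g : groups) {
        #pragma omp parallel for
        for (long j = 0; j < static_cast<long>(g.size()); ++j)
            x[g[j]] += 1.0;
    }
}
```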
Abstract:
Methods to improve compiler optimization are presented. In one embodiment, a method includes identifying one or more optimization speculations with respect to a code region and speculatively performing a transformation on an intermediate representation of the code region in accordance with an optimization speculation. The method further includes generating an advice message corresponding to the optimization speculation and displaying the advice message if the optimization speculation results in an improved compilation result.
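A minimal sketch of the described advice flow, with a toy stand-in for the intermediate representation; IR, speculate_transform, and the cost field are assumptions for illustration, not a real compiler API.

```cpp
#include <iostream>

struct IR { int cost; };  // stand-in intermediate representation with a cost model

// Speculative transformation, e.g. assuming two pointers never alias.
IR speculate_transform(const IR& ir) {
    return IR{ir.cost - 10};
}

// Emit advice only when the speculation improved the compilation result.
void compile_with_advice(const IR& ir) {
    IR transformed = speculate_transform(ir);
    if (transformed.cost < ir.cost)
        std::cout << "advice: annotate pointers 'restrict' to enable "
                     "the speculated optimization\n";
}
```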
Abstract:
Methods and apparatus for reducing memory latency in a software application are disclosed. A disclosed system uses one or more helper threads to prefetch variables for a main thread to reduce performance bottlenecks due to memory latency and/or cache misses. A performance analysis tool profiles the software application's resource usage and identifies areas in the software application that experience performance bottlenecks. Compiler-runtime instructions are generated in the software application to create and manage the helper thread. The helper thread prefetches data in the identified areas. A counting mechanism is inserted into the helper thread and another into the main thread to coordinate their execution and to help ensure the prefetched data is not evicted from the cache before the main thread can take advantage of it.
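One plausible reading of the counting mechanism, sketched with two atomic counters that keep the helper thread within a bounded window of the main thread; helper_count, main_count, and kWindow are illustrative names and values, not the patented scheme.

```cpp
#include <atomic>
#include <thread>

std::atomic<long> helper_count{0}, main_count{0};
constexpr long kWindow = 64;  // assumed cache-friendly run-ahead distance

// Called once per prefetched element by the helper thread.
void helper_step() {
    while (helper_count.load() - main_count.load() >= kWindow)
        std::this_thread::yield();  // too far ahead: wait so lines aren't evicted
    helper_count.fetch_add(1);
}

// Called once per consumed element by the main thread.
void main_step() {
    main_count.fetch_add(1);
}
```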
Abstract:
Embodiments of the present invention provide a method, apparatus and system which may include splitting a dependency chain into a set of reduced-width dependency chains; mapping one or more dependency chains onto one or more clustered dependency chain processors, wherein an issue-width of one or more of the clusters is adapted to accommodate a size of the dependency chains; and/or processing in parallel a plurality of dependency chains of a trace. Other embodiments are described and claimed.
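Dependency-chain mapping is a microarchitectural technique, so the C++ below is only an analogy: a greedy first-fit heuristic (an assumption, not the claimed method) assigns each chain of a trace to the narrowest cluster whose issue width accommodates it.

```cpp
#include <algorithm>
#include <vector>

struct Chain   { int width; };                  // max instructions live per cycle
struct Cluster { int issue_width; std::vector<Chain> mapped; };

void map_chains(const std::vector<Chain>& chains, std::vector<Cluster>& clusters) {
    // Try narrow clusters first so wide ones stay free for wide chains.
    std::sort(clusters.begin(), clusters.end(),
              [](const Cluster& a, const Cluster& b) {
                  return a.issue_width < b.issue_width;
              });
    for (const Chain& c : chains)
        for (Cluster& cl : clusters)
            if (c.width <= cl.issue_width) {    // first (narrowest) fit
                cl.mapped.push_back(c);
                break;
            }
}
```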
Abstract:
Methods and apparatuses for compiler-created helper threads for multithreading are described herein. In one embodiment, an exemplary process includes identifying a region of a main thread that likely has one or more delinquent loads (loads that are likely to suffer cache misses during execution of the main thread), analyzing the region for one or more helper threads with respect to the main thread, and generating code for the one or more helper threads, which are speculatively executed in parallel with the main thread to perform one or more tasks for that region. Other methods and apparatuses are also described.
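A small sketch of how the delinquent-load identification step might look given profile data; LoadSite, the 20% miss-rate threshold, and the profile format are all assumptions rather than the described process.

```cpp
#include <string>
#include <vector>

struct LoadSite { std::string label; long misses, executions; };

// Flag loads whose miss rate exceeds an assumed cutoff: these are the
// "delinquent loads" that would trigger helper-thread generation.
std::vector<LoadSite> delinquent(const std::vector<LoadSite>& profile,
                                 double threshold = 0.2) {
    std::vector<LoadSite> out;
    for (const auto& s : profile)
        if (s.executions > 0 &&
            static_cast<double>(s.misses) / s.executions > threshold)
            out.push_back(s);  // candidate region for a helper thread
    return out;
}
```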
Abstract:
A method and apparatus for compiling a source program are described. Multiple predetermined sequences within the source program are located. A start code is inserted into the source program before the first instruction of each predetermined sequence. An invocation code is inserted into the source program before the start code; the invocation code addresses the start code and transfers each sequence to a system for execution. Finally, a stop code is inserted into the source program after the last instruction of each sequence, the stop code signaling the system to stop execution of the sequence.
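To make the placement concrete, the runnable sketch below marks where the three codes would land around one located sequence; invoke_accelerator, seq_start, and seq_stop are hypothetical markers standing in for the actual inserted code.

```cpp
#include <iostream>

void invoke_accelerator() { std::cout << "invocation code: hand off\n"; }
void seq_start()          { std::cout << "start code\n"; }
void seq_stop()           { std::cout << "stop code: halt sequence\n"; }

int main() {
    invoke_accelerator();  // inserted before the start code
    seq_start();           // inserted before the sequence's first instruction
    /* ... predetermined instruction sequence ... */
    seq_stop();            // inserted after the sequence's last instruction
    return 0;
}
```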
Abstract:
The embodiments described herein are generally directed to improvements addressing power, latency, bandwidth, and/or performance issues related to GPU processing/caching. According to one embodiment, a system includes a producer intellectual property (IP) (e.g., a media IP), a compute core (e.g., a GPU or an AI-specific core of the GPU), and a streaming buffer logically interposed between the producer IP and the compute core. The producer IP is operable to consume data from memory and output the results to the streaming buffer. The compute core is operable to perform AI inference processing based on the data from the streaming buffer and to output the results of the AI inference processing to memory.
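As a software analogy only (the patent describes hardware), a bounded producer/consumer queue below stands in for the streaming buffer between the producer IP and the compute core; StreamingBuffer, its tile type, and its depth are assumptions.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

class StreamingBuffer {
    std::queue<std::vector<float>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    const size_t capacity_ = 8;  // assumed buffer depth
public:
    // Producer IP path: block when the buffer is full, then publish a tile.
    void produce(std::vector<float> tile) {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return q_.size() < capacity_; });
        q_.push(std::move(tile));
        cv_.notify_all();
    }
    // Compute core path: block until data arrives, then consume it for inference.
    std::vector<float> consume() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        auto t = std::move(q_.front());
        q_.pop();
        cv_.notify_all();
        return t;
    }
};
```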