Abstract:
Disclosed is an apparatus and method to manage prefetching for an instruction cache. A processor may comprise: a prefetch engine; a branch prediction engine to predict the outcome of a branch; and a dynamic optimizer. The dynamic optimizer may be used to identify common instruction cache misses and to insert a prefetch instruction from the prefetch engine into the instruction cache.
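As a rough illustration of the mechanism this abstract describes, the C sketch below keeps a software table of instruction cache miss counts and emits a prefetch once a line has missed repeatedly. The table layout, the MISS_THRESHOLD value, and the function names are assumptions made for illustration, not details taken from the disclosure.

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE     256
    #define MISS_THRESHOLD 4    /* assumed: misses seen before a prefetch is inserted */

    /* Per-line miss counters maintained by the dynamic optimizer. */
    static struct { uint64_t addr; unsigned misses; } miss_table[TABLE_SIZE];

    /* Stand-in for emitting a prefetch instruction that directs the
       prefetch engine to bring this line into the instruction cache. */
    static void insert_prefetch(uint64_t addr) {
        printf("prefetch I-cache line at %#llx\n", (unsigned long long)addr);
    }

    /* Hypothetical hook invoked on each instruction cache miss. */
    static void record_icache_miss(uint64_t addr) {
        unsigned idx = (unsigned)((addr >> 6) % TABLE_SIZE);  /* index by line */
        if (miss_table[idx].addr != addr) {   /* simple direct-mapped table */
            miss_table[idx].addr = addr;
            miss_table[idx].misses = 0;
        }
        if (++miss_table[idx].misses == MISS_THRESHOLD)
            insert_prefetch(addr);            /* common miss: prefetch it */
    }

    int main(void) {
        for (int i = 0; i < 5; i++)
            record_icache_miss(0x400040);     /* same line misses repeatedly */
        return 0;
    }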
Abstract:
Described herein are technologies for optimizing computer code. A code generator can optimize a portion of original code to create optimized code. The code generator can create a partial commit point to indicate that execution of the optimized code produces an invalid architectural state. The code generator can create recovery information to recover a valid architectural state at a recovery point. The code generator can associate the partial commit point and the recovery information with the optimized code.
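A minimal sketch of how a code generator might associate a partial commit point and recovery information with an optimized region, assuming a 16-register architectural state; the struct layouts and names are illustrative, not specified by the disclosure.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed layout: an optimized region tagged with a partial commit
       point plus the recovery information needed to rebuild a valid
       architectural state if execution stops there.                    */
    typedef struct {
        uint64_t reg_values[16];  /* register snapshot taken at region entry */
        uint64_t recovery_ip;     /* instruction pointer of the recovery point */
    } recovery_info_t;

    typedef struct {
        uint64_t start_ip, end_ip;   /* range covered by the optimized code */
        uint64_t partial_commit_ip;  /* state here is not architecturally valid */
        recovery_info_t recovery;    /* associated recovery information */
    } optimized_region_t;

    /* If execution stops at the partial commit point, restore the snapshot
       and resume at the recovery point; otherwise state is already valid. */
    static uint64_t handle_exit(const optimized_region_t *r, uint64_t exit_ip,
                                uint64_t regs[16]) {
        if (exit_ip == r->partial_commit_ip) {
            for (int i = 0; i < 16; i++)
                regs[i] = r->recovery.reg_values[i];
            return r->recovery.recovery_ip;
        }
        return exit_ip;
    }

    int main(void) {
        optimized_region_t r = { 0x1000, 0x1100, 0x1040,
                                 { { [0] = 42 }, 0x1000 } };
        uint64_t regs[16] = { 7 };
        uint64_t resume = handle_exit(&r, 0x1040, regs);
        printf("resume at %#llx, r0=%llu\n",
               (unsigned long long)resume, (unsigned long long)regs[0]);
        return 0;
    }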
Abstract:
A computer-readable storage medium, method, and system for optimization-level-aware branch prediction are described. A gear level is assigned to a set of application instructions that have been optimized. The gear level is also stored in a register of a branch prediction unit of a processor. Branch prediction is then performed by the processor based upon the gear level.
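The toy model below illustrates gear-level-aware prediction under stated assumptions: a gear value held in a branch prediction unit register alters the prediction threshold of a two-bit counter. The register name and the thresholds are hypothetical, chosen only to show the idea.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Assumed model: the gear level assigned to a set of optimized
       instructions is written into a branch-prediction-unit register
       and steers how the predictor behaves.                           */
    static uint8_t bpu_gear_register;

    static void set_gear_level(uint8_t gear) { bpu_gear_register = gear; }

    /* Toy two-bit-counter predictor (0..3): at higher gear levels the
       predictor speculates more eagerly on weakly-taken counters.     */
    static bool predict_taken(uint8_t two_bit_counter) {
        uint8_t threshold = bpu_gear_register > 0 ? 1 : 2;
        return two_bit_counter >= threshold;
    }

    int main(void) {
        set_gear_level(0);
        printf("gear 0, counter 1 -> %d\n", predict_taken(1));  /* not taken */
        set_gear_level(2);
        printf("gear 2, counter 1 -> %d\n", predict_taken(1));  /* taken */
        return 0;
    }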
Abstract:
A combination of hardware and software collects profile data for asynchronous events, at code region granularity. An exemplary embodiment is directed to collecting metrics for prefetching events, which are asynchronous in nature. Instructions that belong to a code region are identified using one of several alternative techniques, causing a profile bit to be set for the instruction, as a marker. Each line of a data block that is prefetched is similarly marked. Events corresponding to the profile data being collected and resulting from instructions within the code region are then identified. Each time one of the different types of events is identified, a corresponding counter is incremented. Following execution of the instructions within the code region, the profile data accumulated in the counters is collected, and the counters are reset for use with a new code region.
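A simplified software rendering of that flow: a profile bit marks instructions (and prefetched lines) of the region, events from marked items increment per-type counters, and the counters are collected and reset after the region runs. The tag encoding and the number of event types are assumptions.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_EVENT_TYPES 4  /* number of asynchronous event kinds (assumed) */

    static uint64_t region_counters[NUM_EVENT_TYPES];

    /* Mark an instruction (or a prefetched line) as belonging to the
       profiled code region by setting its profile bit; the tag format
       here is illustrative.                                            */
    static uint32_t set_profile_bit(uint32_t tag) { return tag | 0x1u; }
    static int      has_profile_bit(uint32_t tag) { return tag & 0x1u; }

    /* Asynchronous event handler: only events caused by marked
       instructions or marked prefetched lines update the counters. */
    static void on_event(uint32_t tag, unsigned event_type) {
        if (has_profile_bit(tag) && event_type < NUM_EVENT_TYPES)
            region_counters[event_type]++;
    }

    /* After the region executes, collect the accumulated profile data
       and reset the counters for use with a new code region.          */
    static void collect_and_reset(uint64_t out[NUM_EVENT_TYPES]) {
        memcpy(out, region_counters, sizeof region_counters);
        memset(region_counters, 0, sizeof region_counters);
    }

    int main(void) {
        uint32_t marked = set_profile_bit(0x1000);
        on_event(marked, 0);       /* counted: instruction is marked   */
        on_event(0x2000, 0);       /* ignored: outside the code region */
        uint64_t out[NUM_EVENT_TYPES];
        collect_and_reset(out);
        printf("event 0 count: %llu\n", (unsigned long long)out[0]);
        return 0;
    }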
Abstract:
Systems, methods, and apparatuses for decomposing a sequential program into multiple threads, executing these threads, and reconstructing the sequential execution of the threads are described. A plurality of data cache units (DCUs) store locally retired instructions of speculatively executed threads. A merging level cache (MLC) merges data from the lines of the DCUs. An inter-core memory coherency module (ICMC) globally retires instructions of the speculatively executed threads in the MLC.
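The merge step can be pictured as below: each DCU line carries per-byte written bits from locally retired stores, and the MLC folds the written bytes together. The fixed DCU-0-then-DCU-1 order stands in for the sequential order the ICMC reconstructs; the real hardware protocol is more involved, so this is only a sketch.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LINE_BYTES 8   /* shortened cache line for the example */
    #define NUM_DCUS   2

    /* One speculative line per DCU: data plus per-byte bits recording
       which bytes the thread's locally retired stores wrote (assumed). */
    typedef struct {
        uint8_t data[LINE_BYTES];
        bool    written[LINE_BYTES];
    } dcu_line_t;

    /* Sketch of the MLC merge: bytes written by either speculative
       thread are folded into the merged line in reconstructed order. */
    static void mlc_merge(uint8_t merged[LINE_BYTES],
                          const dcu_line_t dcus[NUM_DCUS]) {
        for (int d = 0; d < NUM_DCUS; d++)
            for (int b = 0; b < LINE_BYTES; b++)
                if (dcus[d].written[b])
                    merged[b] = dcus[d].data[b];
    }

    int main(void) {
        dcu_line_t dcus[NUM_DCUS] = {0};
        dcus[0].data[0] = 0xAA; dcus[0].written[0] = true;  /* thread 0 store */
        dcus[1].data[1] = 0xBB; dcus[1].written[1] = true;  /* thread 1 store */
        uint8_t merged[LINE_BYTES] = {0};
        mlc_merge(merged, dcus);
        printf("merged[0]=%#x merged[1]=%#x\n", merged[0], merged[1]);
        return 0;
    }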
Abstract:
Methods and apparatus are disclosed for using a register checkpointing mechanism to resolve multithreading mis-speculations: a valid architectural state is recovered and execution is rolled back. Some embodiments include memory to store checkpoint data. Multiple thread units concurrently execute threads. They execute a checkpoint mask instruction to initialize memory to store active checkpoint data, including register contents and a checkpoint mask indicating the validity of stored register contents. As register contents change, threads execute checkpoint write instructions to store register contents and update the checkpoint mask. Threads also execute a recovery function instruction to store a pointer to a checkpoint recovery function and, in response to mis-speculation among the threads, branch to the checkpoint recovery function. Threads then execute one or more checkpoint read instructions to copy data from a valid checkpoint storage area into the registers necessary to recover a valid architectural state, from which execution may resume.
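A software model of the checkpoint instructions the abstract names, assuming 16 registers and a 32-bit validity mask; the real mechanism operates on hardware checkpoint storage, so this sketches only the semantics of mask, write, read, and the recovery-function hook.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_REGS 16

    /* Assumed checkpoint storage: saved register contents, a mask
       recording which saved registers are valid, and a pointer to
       the checkpoint recovery function.                            */
    typedef struct {
        uint64_t regs[NUM_REGS];
        uint32_t valid_mask;          /* bit i set => regs[i] holds valid data */
        void   (*recovery_fn)(void);  /* set by the recovery-function instruction */
    } checkpoint_t;

    /* checkpoint-mask instruction: initialize storage, clear the mask. */
    static void ckpt_mask(checkpoint_t *c) { memset(c, 0, sizeof *c); }

    /* checkpoint-write instruction: save a register, mark it valid. */
    static void ckpt_write(checkpoint_t *c, int r, uint64_t value) {
        c->regs[r] = value;
        c->valid_mask |= 1u << r;
    }

    /* checkpoint-read instruction: restore a register if it was saved. */
    static int ckpt_read(const checkpoint_t *c, int r, uint64_t *out) {
        if (!(c->valid_mask & (1u << r)))
            return -1;                /* register not present in checkpoint */
        *out = c->regs[r];
        return 0;
    }

    static void recover(void) { printf("recovery function entered\n"); }

    int main(void) {
        checkpoint_t c;
        ckpt_mask(&c);
        c.recovery_fn = recover;      /* recovery-function instruction */
        ckpt_write(&c, 3, 0xDEAD);    /* register 3 changed, so checkpoint it */
        c.recovery_fn();              /* mis-speculation: branch to recovery */
        uint64_t v;
        if (ckpt_read(&c, 3, &v) == 0)
            printf("restored r3 = %#llx\n", (unsigned long long)v);
        return 0;
    }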
Abstract:
An apparatus and method are described for improved thread selection. For example, one embodiment of a processor comprises: first logic to maintain a history table comprising a plurality of entries, each entry in the table associated with an instruction and including history data indicating prior hits and/or misses to a cache level and/or a translation lookaside buffer (TLB) for that instruction; and second logic to select a particular thread for execution at a particular processor pipeline stage based on the history data.
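One way to picture the two pieces of logic, as a sketch: a saturating hit/miss history indexed by instruction address, and a selector that favors the thread least likely to stall the stage. The table size, counter width, and indexing scheme are illustrative assumptions, not taken from the disclosure.

    #include <stdint.h>
    #include <stdio.h>

    #define HIST_ENTRIES 256
    #define NUM_THREADS  4

    /* Per-instruction history: a saturating counter of recent cache/TLB
       hits, indexed by instruction address (indexing is assumed).      */
    static uint8_t hit_history[HIST_ENTRIES];

    /* First logic: update the history table on each access outcome. */
    static void record_outcome(uint64_t ip, int hit) {
        uint8_t *h = &hit_history[ip % HIST_ENTRIES];
        if (hit) { if (*h < 3) (*h)++; }
        else     { if (*h > 0) (*h)--; }
    }

    /* Second logic: pick the thread whose next instruction has the best
       hit history, i.e. is least likely to stall the pipeline stage.   */
    static int select_thread(const uint64_t next_ip[NUM_THREADS]) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (hit_history[next_ip[t] % HIST_ENTRIES] >
                hit_history[next_ip[best] % HIST_ENTRIES])
                best = t;
        return best;
    }

    int main(void) {
        record_outcome(0x20, 1);  /* thread 2's next instruction keeps hitting */
        record_outcome(0x20, 1);
        uint64_t next_ip[NUM_THREADS] = { 0x10, 0x14, 0x20, 0x24 };
        printf("selected thread %d\n", select_thread(next_ip));
        return 0;
    }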