Abstract:
Techniques and methods are used to reduce allocations, in a higher level cache, of cache lines displaced from a lower level cache. When a displaced line is determined to already be allocated in the higher level cache, allocation of the displaced cache line in the next level cache is prevented, thereby reducing castouts. To this end, a line is selected to be displaced from a lower level cache. Information associated with the selected line is identified which indicates that the selected line is present in a higher level cache. Based on the identified information, an allocation of the selected line in the higher level cache is prevented, saving the power that the allocation would otherwise consume.
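The following is a minimal C++ sketch of the idea, not the patented implementation: each line evicted from a toy L1 carries a flag recording whether it is already present in the next level, and a clean victim with that flag set skips the castout allocation. The structure names and the fully associative maps are illustrative assumptions.

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct L1Line {
    uint64_t tag;
    bool     valid;
    bool     dirty;
    bool     in_l2;   // identified information: line is already allocated in L2
};

struct TwoLevelCache {
    std::unordered_map<uint64_t, L1Line> l1;   // toy fully associative L1
    std::unordered_map<uint64_t, bool>   l2;   // tag -> valid
    unsigned castouts = 0;

    void displace(uint64_t victim_tag) {
        auto it = l1.find(victim_tag);
        if (it == l1.end() || !it->second.valid) return;
        const L1Line& v = it->second;
        // A clean victim already present in L2 needs no castout; skipping the
        // allocation saves the power the redundant L2 write would consume.
        if (!(v.in_l2 && !v.dirty)) {
            l2[victim_tag] = true;   // allocate (or update) the line in L2
            ++castouts;
        }
        l1.erase(it);
    }
};

int main() {
    TwoLevelCache c;
    c.l1[0x40] = {0x40, true, false, true};    // clean, known to be present in L2
    c.l1[0x80] = {0x80, true, true,  false};   // dirty, not present in L2
    c.displace(0x40);                          // allocation suppressed
    c.displace(0x80);                          // genuine castout
    std::cout << "castouts: " << c.castouts << "\n";   // prints 1
}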
Abstract:
A processor pipeline is segmented into an upper portion, prior to instructions going out of program order, and one or more lower portions beyond the upper portion. The upper pipeline is flushed as soon as a branch misprediction is detected, minimizing the delay in fetching instructions from the correct branch target address. The lower pipelines may continue execution until the mispredicted branch instruction confirms, at which time all uncommitted instructions are flushed from the lower pipelines. Existing exception pipeline flushing mechanisms may be reused by adding a mispredicted-branch identifier, reducing the complexity and hardware cost of flushing the lower pipelines.
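As a rough illustration of the two-phase flush, the C++ sketch below models the upper and lower pipeline portions as queues: the upper portion is emptied as soon as the misprediction is detected, while the lower portions are flushed later, when the branch confirms, by matching a mispredicted-branch identifier much as an exception flush would. The field names and the ordering of identifiers are assumptions made only for illustration.

#include <cstdint>
#include <deque>
#include <iostream>

struct Instr {
    uint64_t pc;
    uint64_t branch_id;   // identifier of the youngest prior branch this instruction follows
};

struct Pipeline {
    std::deque<Instr> upper;   // fetch/decode stages, still in program order
    std::deque<Instr> lower;   // stages beyond the in-order point

    // Phase 1: misprediction detected -> flush only the upper pipeline and
    // redirect fetch to the correct target at once.
    void on_mispredict_detected(uint64_t /*correct_target*/) {
        upper.clear();
    }

    // Phase 2: branch confirms -> flush uncommitted younger instructions from
    // the lower pipelines, selected by the mispredicted-branch identifier,
    // just as an exception flush would.
    void on_branch_confirmed(uint64_t mispredicted_id) {
        std::deque<Instr> keep;
        for (const Instr& i : lower)
            if (i.branch_id < mispredicted_id)   // older than the bad branch
                keep.push_back(i);
        lower.swap(keep);
    }
};

int main() {
    Pipeline p;
    p.lower = {{0x100, 1}, {0x104, 2}, {0x108, 2}};
    p.upper = {{0x10c, 2}};
    p.on_mispredict_detected(0x200);   // upper emptied, fetch redirected
    p.on_branch_confirmed(2);          // everything tagged with id >= 2 is removed
    std::cout << "lower size after confirm: " << p.lower.size() << "\n";   // prints 1
}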
Abstract:
A processor includes a common instruction decode front end, e.g., fetch and decode stages, and a heterogeneous set of processing pipelines. A lower performance pipeline has fewer stages and may use lower speed, lower power circuitry. A higher performance pipeline has more stages and uses faster circuitry. The pipelines share other processor resources, such as an instruction cache, a register file stack, a data cache, a memory interface, and other architected registers within the system. In disclosed examples, the processor is controlled such that processes requiring higher performance run in the higher performance pipeline, whereas those requiring lower performance use the lower performance pipeline, in at least some instances while the higher performance pipeline is effectively inactive or even shut off to minimize power consumption. The configuration of the processor at any given time, that is, which pipeline(s) are currently operating, may be controlled via several different techniques.
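The sketch below illustrates only the steering decision in C++, under the assumption that a per-process flag marks whether high performance is required; how that flag is actually derived (operating system hints, performance counters, and so on) is outside the sketch and left as an assumption.

#include <iostream>
#include <string>
#include <vector>

enum class Pipe { LowPower, HighPerf };

struct Process {
    std::string name;
    bool needs_high_performance;   // assumed to be set by the OS or by profiling
};

struct Core {
    bool high_perf_powered = false;

    Pipe steer(const Process& p) {
        if (p.needs_high_performance) {
            high_perf_powered = true;   // wake the longer, faster pipeline
            return Pipe::HighPerf;
        }
        return Pipe::LowPower;          // fewer stages, lower speed/power circuitry
    }
};

int main() {
    Core core;
    std::vector<Process> procs = {{"background_sync", false}, {"video_decode", true}};
    for (const auto& p : procs) {
        Pipe target = core.steer(p);
        std::cout << p.name << " -> "
                  << (target == Pipe::HighPerf ? "high-performance" : "low-power")
                  << " pipeline\n";
    }
    std::cout << "high-performance pipeline powered: " << std::boolalpha
              << core.high_perf_powered << "\n";
}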
Abstract:
Techniques for ensuring synchronized predecoding of an instruction string are disclosed. The instruction string contains instructions from a variable length instruction set and embedded data. One technique includes defining a granule to be equal in length to the shortest instruction in the instruction set and defining MAX to be the number of granules that compose the longest instruction in the instruction set. The technique further includes determining the end of an embedded data segment when a program is compiled or assembled into the instruction string, and inserting padding of length MAX - 1 granules into the instruction string at the end of the embedded data. When the padded instruction string is predecoded, the predecoder maintains synchronization with the instructions in the padded instruction string even if the embedded data coincidentally resembles an existing instruction in the variable length instruction set.
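A small C++ sketch of the padding step follows, assuming a toy variable length ISA with a 2-byte granule and a longest instruction of 3 granules (MAX = 3); the padding encoding and the assembler interface are illustrative assumptions. The assembler appends MAX - 1 granules of padding after the embedded data so the predecoder cannot be left straddling the data/instruction boundary when it resumes on real instructions.

#include <cstddef>
#include <iostream>
#include <vector>

constexpr std::size_t kGranule = 2;   // bytes; length of the shortest instruction
constexpr std::size_t kMax     = 3;   // granules in the longest instruction

// Build the image for one data segment followed by code: MAX - 1 granules of
// padding (a NOP-like encoding, assumed here to be 0x00 bytes) are appended to
// the end of the embedded data before the next real instruction.
std::vector<unsigned char> emit_with_padding(const std::vector<unsigned char>& data,
                                             const std::vector<unsigned char>& next_insns) {
    std::vector<unsigned char> image(data);
    image.insert(image.end(), (kMax - 1) * kGranule, 0x00);
    image.insert(image.end(), next_insns.begin(), next_insns.end());
    return image;
}

int main() {
    std::vector<unsigned char> embedded_data = {0xDE, 0xAD, 0xBE, 0xEF};   // 2 granules
    std::vector<unsigned char> code          = {0x11, 0x22};               // one short instruction
    std::vector<unsigned char> image = emit_with_padding(embedded_data, code);
    std::cout << "image length in granules: " << image.size() / kGranule << "\n";   // prints 5
}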
Abstract:
A method of resolving simultaneous branch predictions prior to validation of the predicted branch instructions is disclosed. The method includes processing two or more predicted branch instructions, each predicted branch instruction having a predicted state and a corrected state. The method further includes selecting one of the corrected states. Should one of the predicted branch instructions be mispredicted, the selected corrected state is used to direct future instruction fetches.
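The C++ sketch below shows one plausible reading of the selection step, assuming the older branch's corrected state takes priority: a single corrected state is chosen before validation completes, and it redirects fetch only if either prediction turns out to be wrong. The priority rule and field names are assumptions, not details taken from the abstract.

#include <cstdint>
#include <iostream>
#include <optional>

struct PredictedBranch {
    uint64_t predicted_target;   // predicted state: where fetch was steered
    uint64_t corrected_target;   // corrected state: where fetch should go if wrong
    bool     mispredicted;       // known once the branch executes
};

// Select one corrected state up front (the older branch takes priority) and use
// it only if either branch turns out to have been mispredicted.
std::optional<uint64_t> resolve(const PredictedBranch& older, const PredictedBranch& younger) {
    uint64_t selected = older.mispredicted ? older.corrected_target
                                           : younger.corrected_target;
    if (older.mispredicted || younger.mispredicted)
        return selected;   // direct future instruction fetches here
    return std::nullopt;   // both predictions were correct
}

int main() {
    PredictedBranch b0{0x1000, 0x2000, false};
    PredictedBranch b1{0x3000, 0x4000, true};
    if (std::optional<uint64_t> redirect = resolve(b0, b1))
        std::cout << "redirect fetch to 0x" << std::hex << *redirect << "\n";   // 0x4000
}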
Abstract:
An apparatus for emulating the branch prediction behavior of an explicit subroutine call is disclosed. The apparatus includes a first input configured to receive an instruction address and a second input configured to receive predecode information which describes the instruction address as being related to an implicit subroutine call. The apparatus also includes an adder configured, in response to the predecode information, to add a constant to the instruction address to define a return address, which is stored to an explicit subroutine resource, thereby facilitating subsequent branch prediction of a return call instruction.
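Below is a hedged C++ sketch of the behavior, with a return-address stack standing in for the explicit subroutine resource and an assumed constant offset of 8 bytes for the implicit call sequence; both are illustrative choices rather than details from the abstract.

#include <cstdint>
#include <iostream>
#include <stack>

constexpr uint64_t kImplicitCallLength = 8;   // assumed size of the implicit call sequence

struct ReturnStack {                          // stands in for the explicit subroutine resource
    std::stack<uint64_t> addrs;
    void push(uint64_t a) { addrs.push(a); }
    uint64_t predict_return() { uint64_t a = addrs.top(); addrs.pop(); return a; }
};

// When predecode bits mark the fetched instruction as an implicit subroutine
// call, the adder forms the return address and it is pushed exactly as an
// explicit call would push it, so the later return branch predicts correctly.
void on_fetch(uint64_t instr_addr, bool predecode_implicit_call, ReturnStack& ras) {
    if (predecode_implicit_call)
        ras.push(instr_addr + kImplicitCallLength);   // adder output = return address
}

int main() {
    ReturnStack ras;
    on_fetch(0x4000, /*predecode_implicit_call=*/true, ras);
    std::cout << "predicted return: 0x" << std::hex << ras.predict_return() << "\n";   // 0x4008
}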
Abstract:
A processor includes a hierarchical Translation Lookaside Buffer (TLB) comprising a Level-1 (L1) TLB and a small, high-speed Level-0 (L0) TLB. Entries in the L0 TLB replicate entries in the L1 TLB. During an address translation, the processor first accesses the L0 TLB, and accesses the L1 TLB if the virtual address misses in the L0 TLB. When the virtual address hits in the L1 TLB, the virtual address, physical address, and page attributes are written to the L0 TLB, replacing an existing entry if the L0 TLB is full. The entry may be locked against replacement in the L0 TLB in response to an L0 Lock (L0L) indicator in the L1 TLB entry. Similarly, in a hardware-managed L1 TLB, entries may be locked against replacement in response to an L1 Lock (L1L) indicator in the corresponding page table entry.
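A simplified C++ model of the L0/L1 lookup path and the L0 Lock bit follows; the four-entry L0 TLB, the round-robin victim choice, and the stubbed-out page walk are assumptions made only to keep the sketch self-contained.

#include <array>
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

struct Translation {
    uint64_t phys_page;
    uint32_t attrs;
    bool     l0_lock;   // L0L indicator carried in the L1 TLB entry
};

struct L0Entry {
    uint64_t    virt_page = 0;
    Translation t{};
    bool        valid = false;
};

struct TlbHierarchy {
    std::array<L0Entry, 4> l0{};                      // tiny, fast L0 TLB
    std::unordered_map<uint64_t, Translation> l1;     // larger L1 TLB
    std::size_t next_victim = 0;

    std::optional<Translation> translate(uint64_t virt_page) {
        for (const L0Entry& e : l0)                   // probe the L0 TLB first
            if (e.valid && e.virt_page == virt_page) return e.t;
        auto it = l1.find(virt_page);                 // then the L1 TLB
        if (it == l1.end()) return std::nullopt;      // a real design would walk page tables here
        refill_l0(virt_page, it->second);             // replicate the L1 entry into the L0 TLB
        return it->second;
    }

    void refill_l0(uint64_t virt_page, const Translation& t) {
        // Round-robin victim choice, skipping entries whose L0L bit locks them in place.
        for (std::size_t tries = 0; tries < l0.size(); ++tries) {
            L0Entry& cand = l0[next_victim];
            next_victim = (next_victim + 1) % l0.size();
            if (!cand.valid || !cand.t.l0_lock) {
                cand = {virt_page, t, true};
                return;
            }
        }
        // Every L0 entry is locked: leave the L0 TLB unchanged.
    }
};

int main() {
    TlbHierarchy tlb;
    tlb.l1[0x12345] = {0xABCDE, 0x7, true};                   // locked against L0 replacement
    std::optional<Translation> t = tlb.translate(0x12345);    // L0 miss, L1 hit, L0 refill
    std::cout << (t ? "hit" : "miss") << "\n";
}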
Abstract:
A system for optimizing translation lookaside buffer entries is provided. The system includes a translation lookaside buffer configured to store a number of entries, each entry having a size attribute and referencing a corresponding page, and control logic configured to modify the size attribute of an existing entry in the translation lookaside buffer if a new page is contiguous with the existing page referenced by that entry. After its size attribute is modified, the existing entry references a consolidated page comprising the existing page and the new page.
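The C++ sketch below shows one way the coalescing check could look, restricted for simplicity to merging a single new page into a one-page entry whose doubled region stays naturally aligned; a real design could iterate to larger consolidated sizes. The field names and the alignment rule are illustrative assumptions.

#include <cstdint>
#include <iostream>
#include <vector>

struct TlbEntry {
    uint64_t virt_base;   // virtual base of the region the entry maps
    uint64_t phys_base;   // physical base of the region
    uint64_t size;        // size attribute, in bytes
};

constexpr uint64_t kPage = 4096;

// Try to fold the new page into an existing entry; returns true if it merged.
bool try_coalesce(std::vector<TlbEntry>& tlb, uint64_t new_virt, uint64_t new_phys) {
    for (TlbEntry& e : tlb) {
        bool virt_contig = (new_virt == e.virt_base + e.size);
        bool phys_contig = (new_phys == e.phys_base + e.size);
        bool aligned     = (e.virt_base % (2 * e.size) == 0);   // doubled region stays naturally aligned
        if (e.size == kPage && virt_contig && phys_contig && aligned) {
            e.size *= 2;   // the entry now references the consolidated page
            return true;
        }
    }
    return false;
}

int main() {
    std::vector<TlbEntry> tlb = {{0x10000, 0x90000, kPage}};
    bool merged = try_coalesce(tlb, 0x11000, 0x91000);   // the next contiguous page
    std::cout << "merged: " << std::boolalpha << merged
              << ", size is now " << tlb[0].size << " bytes\n";   // 8192 bytes
}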
Abstract:
Systems and methods for managing access to a cache relate to determining one or more execute permissions associated with the write-address of a write-request to the cache. The cache may be a unified cache that stores data as well as instructions. If the write-request misses in the cache, a cache controller may determine, based on the one or more execute permissions, whether to implement a write-allocate policy or a write-no-allocate policy for servicing the write-miss. The one or more execute permissions can relate to a privilege level associated with the write-address. An execute permission of a producing agent which generated the write-request and an execute permission of a consuming agent which can execute from the write-address may be based on the privilege levels of the producing agent and the consuming agent, respectively.
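As a rough C++ sketch of the policy selection only: the rule that an executable write-address leads to write-allocate (for example, code produced by a JIT that a consumer will soon fetch) is an assumption made for illustration; the abstract states only that the execute permissions drive the choice.

#include <iostream>

enum class Privilege { User, Kernel };
enum class WritePolicy { Allocate, NoAllocate };

struct ExecutePermissions {
    bool user_exec;     // write-address is executable at user privilege
    bool kernel_exec;   // write-address is executable at kernel privilege
};

WritePolicy select_policy(const ExecutePermissions& perms, Privilege consumer) {
    bool executable = (consumer == Privilege::Kernel) ? perms.kernel_exec
                                                      : perms.user_exec;
    // Assumed policy: if the consuming agent can execute from this address,
    // allocate the line so upcoming instruction fetches hit in the unified
    // cache; otherwise service the write-miss without allocating.
    return executable ? WritePolicy::Allocate : WritePolicy::NoAllocate;
}

int main() {
    ExecutePermissions code_page{true, true};
    ExecutePermissions data_page{false, false};
    std::cout << std::boolalpha
              << (select_policy(code_page, Privilege::User) == WritePolicy::Allocate) << " "
              << (select_policy(data_page, Privilege::User) == WritePolicy::Allocate) << "\n";
    // prints: true false
}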
Abstract:
Systems and methods relate to memory management units (MMUs) configured to automatically pre-fill a translation lookaside buffer (TLB) (206, 208) with address translation entries (202-204) expected to be used in the future, thereby reducing the TLB miss rate and improving performance. The TLB may be pre-filled with translation entries whose addresses are selected based on predictions. Predictions may be derived from external devices (214) or based on stride values, wherein the stride values may be a predetermined constant or dynamically altered based on access patterns (216). Pre-filling the TLB may effectively remove the latency of determining address translations for TLB misses from the critical path.
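A minimal C++ sketch of stride-based pre-fill follows, with the page-table walk stubbed out and the stride adapted from the last two accesses; the adaptation rule and the structure names are assumptions made for illustration.

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct Tlb {
    std::unordered_map<uint64_t, uint64_t> entries;   // virtual page -> physical page
    uint64_t last_page = 0;
    int64_t  stride    = 1;                           // could also be a fixed constant

    uint64_t walk(uint64_t virt_page) { return virt_page ^ 0xF000; }   // stand-in page walker

    uint64_t translate(uint64_t virt_page) {
        uint64_t phys;
        auto it = entries.find(virt_page);
        if (it != entries.end()) phys = it->second;
        else phys = entries[virt_page] = walk(virt_page);
        // Adapt the stride from the observed access pattern, then pre-fill the
        // predicted next page so a later access finds it already resident.
        if (last_page != 0)
            stride = static_cast<int64_t>(virt_page) - static_cast<int64_t>(last_page);
        last_page = virt_page;
        uint64_t predicted = virt_page + static_cast<uint64_t>(stride);
        if (entries.find(predicted) == entries.end())
            entries[predicted] = walk(predicted);     // pre-fill, off the miss critical path
        return phys;
    }
};

int main() {
    Tlb tlb;
    tlb.translate(0x100);
    tlb.translate(0x102);                             // stride adapts to 2
    std::cout << "0x104 pre-filled: " << tlb.entries.count(0x104) << "\n";   // prints 1
}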