Abstract:
Techniques and mechanisms for determining a latency event to be represented in performance monitoring information. In an embodiment, circuit blocks of a pipeline experience respective latency events at various times during tasks performed by the pipeline to service a workload. The circuit blocks send to an evaluation circuit of the pipeline respective event signals which each indicate whether a respective latency event has been detected. The event signals are communicated in parallel with at least a portion of the pipeline. In response to a trigger event in the pipeline, the evaluation circuit selects an event signal, based on relative priorities of the event signals, which provides a sample indicating a detected latency event. Based on the selected event signal, a representation of the indicated latency event is provided to a latency event count or other value of the performance monitoring information. In another embodiment, different time delays are applied to various event signals.
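Below is a minimal behavioral sketch of the priority-based event selection this abstract describes, assuming a fixed priority order and one sample taken per trigger event; all identifiers (EventSignal, select_event, the event names) are illustrative assumptions, not drawn from the patent.

```python
from dataclasses import dataclass

@dataclass
class EventSignal:
    name: str        # which circuit block raised the signal
    priority: int    # lower value = higher relative priority
    asserted: bool   # whether a latency event was detected this cycle

def select_event(signals):
    """On a trigger event, pick the highest-priority asserted signal."""
    asserted = [s for s in signals if s.asserted]
    return min(asserted, key=lambda s: s.priority) if asserted else None

# Performance monitoring side: count occurrences per latency event type.
counts = {}
signals = [
    EventSignal("icache_miss", priority=0, asserted=False),
    EventSignal("dtlb_miss",   priority=1, asserted=True),
    EventSignal("l2_miss",     priority=2, asserted=True),
]
chosen = select_event(signals)  # a trigger event fires in the pipeline
if chosen:
    counts[chosen.name] = counts.get(chosen.name, 0) + 1
print(counts)  # {'dtlb_miss': 1}: the higher-priority asserted event wins
```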
Abstract:
Systems, methods, and apparatuses relating to circuitry to implement precise last branch record event logging in a processor are described. In one embodiment, a hardware processor core includes an execution circuit to execute instructions, a retirement circuit to retire executed instructions, a status register, and a last branch record circuit to, in response to retirement by the retirement circuit of a first taken branch instruction, start a cycle timer and a performance monitoring event counter, and in response to retirement by the retirement circuit of a second taken branch instruction that is the next taken branch instruction in program order after the first taken branch instruction, write values from the cycle timer and the performance monitoring event counter into a first entry in the status register and clear the values from the cycle timer and the performance monitoring event counter.
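A behavioral sketch of this timed last-branch-record logging follows, assuming the counters re-arm after each logged entry so every taken-branch-to-taken-branch interval is captured; class and field names are illustrative, not from the patent.

```python
class LastBranchRecord:
    def __init__(self):
        self.cycle_timer = None    # armed at first taken-branch retirement
        self.event_counter = None
        self.entries = []          # stands in for the status register

    def on_taken_branch_retired(self, now_cycles, now_events):
        if self.cycle_timer is None:
            # First taken branch: start the cycle timer and event counter.
            self.cycle_timer = now_cycles
            self.event_counter = now_events
        else:
            # Next taken branch in program order: write elapsed values
            # into an entry, then clear (re-arm) the counters.
            self.entries.append({
                "cycles": now_cycles - self.cycle_timer,
                "events": now_events - self.event_counter,
            })
            self.cycle_timer = now_cycles
            self.event_counter = now_events

lbr = LastBranchRecord()
lbr.on_taken_branch_retired(now_cycles=100, now_events=40)
lbr.on_taken_branch_retired(now_cycles=180, now_events=65)
print(lbr.entries)  # [{'cycles': 80, 'events': 25}]
```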
Abstract:
Systems, methods, and apparatuses relating to circuitry to implement out-of-order access to a shared microcode sequencer by a clustered decode pipeline are described. In one embodiment, a hardware processor core includes a first decode cluster comprising a plurality of decoder circuits, a second decode cluster comprising a plurality of decoder circuits, a fetch circuit to fetch a first block of instructions and send the first block of instructions to the first decode cluster for decoding, and fetch a second block of instructions younger in program order than the first block of instructions and send the second block of instructions to the second decode cluster for decoding, a microcode sequencer comprising a memory that stores a plurality of micro-operations, and an arbitration circuit to arbitrate access by the first decode cluster and the second decode cluster to a shared read port of the memory, wherein the arbitration circuit is to allow the second decode cluster decoding the second block of instructions access to the shared read port of the memory instead of the first decode cluster decoding the first block of instructions when an instruction of the second block of instructions has a number of corresponding micro-operations in the microcode sequencer below an arbitration threshold.
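The arbitration rule can be sketched as follows, assuming each cluster's request carries the micro-op count of its instruction; the threshold value and all names are assumptions for illustration only.

```python
ARBITRATION_THRESHOLD = 4  # assumed: max micro-ops for an out-of-order grant

def grant_read_port(older_request, younger_request):
    """Each request is (cluster_id, num_uops_for_instruction), or None."""
    if younger_request is None:
        return older_request[0]
    if older_request is None:
        return younger_request[0]
    # A short microcode flow from the younger cluster may bypass the older
    # cluster at the shared read port; long flows wait their turn so the
    # program-order decode is not starved.
    if younger_request[1] < ARBITRATION_THRESHOLD:
        return younger_request[0]
    return older_request[0]

print(grant_read_port(("cluster0", 12), ("cluster1", 2)))  # cluster1
print(grant_read_port(("cluster0", 12), ("cluster1", 8)))  # cluster0
```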
Abstract:
In one embodiment, an apparatus includes: at least one core to execute instructions; and a plurality of fixed counters coupled to the at least one core, the plurality of fixed counters to count events during execution on the at least one core, at least some of the plurality of fixed counters to count event information of a highest level of a hierarchical performance monitoring organization. Other embodiments are described and claimed.
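One way to picture fixed counters dedicated to the highest level of a performance monitoring hierarchy is the sketch below; the four top-level category names follow the common top-down analysis breakdown and are an assumption, not taken from the abstract.

```python
# Assumed top-level categories of the hierarchy; one fixed counter each.
fixed_counters = {
    "retiring": 0,
    "bad_speculation": 0,
    "frontend_bound": 0,
    "backend_bound": 0,
}

def count_slot(category):
    """Attribute one issue slot to a top-level category."""
    fixed_counters[category] += 1

for cat in ["retiring"] * 6 + ["backend_bound"] * 3 + ["frontend_bound"]:
    count_slot(cat)

# The highest-level breakdown falls directly out of the fixed counters.
total = sum(fixed_counters.values())
for name, slots in fixed_counters.items():
    print(f"{name}: {100 * slots / total:.0f}% of slots")
```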
Abstract:
Systems, methods, and apparatuses relating to circuitry to implement toggle point insertion for a clustered decode pipeline are described. In one example, a hardware processor core includes a first decode cluster comprising a plurality of decoder circuits, a second decode cluster comprising a plurality of decoder circuits, and a toggle point control circuit to toggle the sending of instructions requested for decoding between the first decode cluster and the second decode cluster, wherein the toggle point control circuit is to: determine a location in an instruction stream as a candidate toggle point to switch the sending of the instructions requested for decoding between the first decode cluster and the second decode cluster, track a number of times a characteristic of multiple previous decodes of the instruction stream is present for the location, and cause insertion of a toggle point at the location, based on the number of times, to switch the sending of the instructions requested for decoding between the first decode cluster and the second decode cluster.
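A sketch of this history-based insertion follows, assuming a per-location counter and a fixed insertion threshold; the threshold, the nature of the tracked characteristic, and all names are illustrative assumptions.

```python
from collections import defaultdict

INSERT_THRESHOLD = 3           # assumed: observations before inserting
candidate_counts = defaultdict(int)
toggle_points = set()

def observe_decode(location, has_characteristic):
    """Track how often a candidate location exhibits the characteristic
    (e.g. a suitable switch boundary) across repeated decodes."""
    if location in toggle_points or not has_characteristic:
        return
    candidate_counts[location] += 1
    if candidate_counts[location] >= INSERT_THRESHOLD:
        # Enough history: insert a toggle point so later passes over the
        # instruction stream switch decode clusters at this location.
        toggle_points.add(location)

for _ in range(3):
    observe_decode(location=0x400010, has_characteristic=True)
print(0x400010 in toggle_points)  # True
```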
Abstract:
In one embodiment, an apparatus comprises: a branch prediction circuit to predict whether a branch is to be taken; a fetch circuit, in a single fetch cycle, to send a first portion of a fetch region of instructions to a first decode cluster and send a second portion of the fetch region to a second decode cluster; the first decode cluster comprising a first plurality of decode circuits to decode one or more instructions in the first portion of the fetch region; and the second decode cluster comprising a second plurality of decode circuits to decode one or more other instructions in the second portion of the fetch region. Other embodiments are described and claimed.
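A toy model of the single-cycle split fetch is shown below, assuming the fetch region is divided at a predicted-taken-branch boundary; the split rule and names are assumptions for illustration.

```python
def fetch_and_distribute(fetch_region, split_index):
    """Divide one fetch region and return (portion for the first decode
    cluster, portion for the second decode cluster), sent the same cycle."""
    first_portion = fetch_region[:split_index]
    second_portion = fetch_region[split_index:]
    return first_portion, second_portion

region = ["add", "cmp", "jne", "mov", "sub"]  # one fetch region
to_cluster0, to_cluster1 = fetch_and_distribute(region, split_index=3)
print(to_cluster0)  # ['add', 'cmp', 'jne'] -> first decode cluster
print(to_cluster1)  # ['mov', 'sub']        -> second decode cluster
```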
Abstract:
Techniques and mechanisms for providing branch prediction information to facilitate instruction decoding by a processor. In an embodiment, entries of a branch target buffer (BTB) each identify, for a corresponding instruction, whether a prediction based on the instruction (if any) is eligible to be communicated, with another prediction, in a single fetch cycle. A branch prediction unit of the processor determines a linear address of a fetch region which is under consideration, and performs a search of the BTB based on the linear address. A result of the search is evaluated to detect for any hit entry which indicates a double prediction eligibility. In another embodiment, where it is determined that double prediction eligibility is indicated for an earliest one of the instructions represented by the hit entries, multiple predictions are communicated in a single fetch cycle.
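The lookup can be sketched as below, assuming a BTB keyed by the linear fetch-region address whose entries carry a per-branch eligibility bit; the structure, field layout, and addresses are illustrative assumptions.

```python
btb = {
    # linear fetch-region address -> [(offset, target, double_eligible), ...]
    0x1000: [(0x4, 0x2000, True), (0xC, 0x3000, False)],
}

def predict(fetch_region_addr):
    """Return up to two predictions for one fetch cycle."""
    hits = sorted(btb.get(fetch_region_addr, []))  # order hits by offset
    if not hits:
        return []
    first = hits[0]
    # Two predictions leave in a single fetch cycle only when the earliest
    # hit entry is marked double-prediction eligible.
    if first[2] and len(hits) > 1:
        return [first, hits[1]]
    return [first]

print(predict(0x1000))  # both predictions communicated in one fetch cycle
```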
Abstract:
A method and apparatus for disabling ways of a cache memory in response to history based usage patterns is herein described. Way predicting logic is to keep track of cache accesses to the ways and determine whether accesses to some ways are to be disabled to save power, based upon way power signals having a logical state representing a predicted miss to the way. One or more counters associated with the ways count accesses, wherein a power signal is set to the logical state representing a predicted miss when one of said one or more counters reaches a saturation value. Control logic adjusts said one or more counters associated with the ways according to the accesses.
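A sketch of this mechanism follows, assuming one saturating counter per way that tracks consecutive accesses missing that way; the saturation value and all names are illustrative assumptions.

```python
SATURATION = 8  # assumed: consecutive misses before a way is powered down

class WayPredictor:
    def __init__(self, num_ways):
        self.miss_counters = [0] * num_ways
        self.power_on = [True] * num_ways  # stands in for way power signals

    def on_access(self, way, hit):
        if hit:
            self.miss_counters[way] = 0      # recent use: keep way powered
            self.power_on[way] = True
        else:
            self.miss_counters[way] = min(self.miss_counters[way] + 1,
                                          SATURATION)
            if self.miss_counters[way] == SATURATION:
                # Counter saturated: predict future misses, gate the way.
                self.power_on[way] = False

wp = WayPredictor(num_ways=4)
for _ in range(SATURATION):
    wp.on_access(way=2, hit=False)
print(wp.power_on)  # [True, True, False, True]
```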
Abstract:
Systems, methods, and apparatuses relating to circuitry to implement scalable port-binding for asymmetric execution ports and allocation widths of a processor are described. In one embodiment, a hardware processor core includes a decoder circuit to decode instructions into sets of one or more micro-operations, an instruction decode queue to store the sets of one or more micro-operations, a plurality of different types of execution circuits that each comprise a respective input port and a respective input queue, and an allocation circuit comprising a plurality of allocation lanes coupled to the instruction decode queue and to the input ports of the plurality of different types of execution circuits, wherein the allocation circuit is to, for an input of micro-operations on the plurality of allocation lanes, generate a sorted list of occupancy of the input queues of each input port, generate a pre-binding mapping of the input ports of the plurality of different types of execution circuits to the plurality of allocation lanes in a circular order according to the sorted list, when a type of micro-operation from an allocation lane does not match a type of execution circuit of an input port in the pre-binding mapping, slide the pre-binding mapping so that the input port maps to a next allocation lane having a matching type of micro-operation to generate a final mapping of the input ports of the plurality of different types of execution circuits to the plurality of allocation lanes, and bind the input ports of the plurality of different types of execution circuits to the plurality of allocation lanes according to the final mapping.
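A simplified sketch of this port-binding flow follows: sort ports by input-queue occupancy, pre-bind them to allocation lanes in circular order, then slide any type-mismatched binding to the next lane carrying a matching micro-op type. All structures, types, and values are illustrative assumptions.

```python
def bind_ports(ports, lanes):
    """ports: list of (port_id, type, queue_occupancy);
    lanes: list of micro-op types, one per allocation lane."""
    # Sorted list of occupancy: least-occupied ports are bound first.
    ordered = sorted(ports, key=lambda p: p[2])
    binding = {}
    for i, (port_id, port_type, _) in enumerate(ordered):
        lane = i % len(lanes)              # pre-binding in circular order
        # Slide until the lane's micro-op type matches this port's type
        # and the lane is still free; an unmatched port stays unbound.
        for step in range(len(lanes)):
            candidate = (lane + step) % len(lanes)
            if lanes[candidate] == port_type and candidate not in binding.values():
                binding[port_id] = candidate
                break
    return binding

ports = [("p0", "alu", 5), ("p1", "mem", 1), ("p2", "alu", 3)]
lanes = ["alu", "alu", "mem"]              # micro-op type on each lane
print(bind_ports(ports, lanes))            # {'p1': 2, 'p2': 1, 'p0': 0}
```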