Abstract:
Techniques are described for providing an enhanced cache coherency protocol for a multi-core processor that includes a Speculative Request For Ownership Without Data (SRFOWD) for a portion of cache memory. With a SRFOWD, only an acknowledgement message may be provided as an answer to a requesting core. The contents of the affected cache line are not required to be a part of the answer. The enhanced cache coherency protocol may assure that a valid copy of the current cache line exists in case of misspeculation by the requesting core. Thus, an owner of the current copy of the cache line may maintain a copy of the old contents of the cache line. The old contents of the cache line may be discarded if speculation by the requesting core turns out to be correct. Otherwise, in case of misspeculation by the requesting core, the old contents of the cache line may be set back to a valid state.
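The owner-side behavior described above can be sketched as a small state model. This is only an illustration under stated assumptions: the class `OwnerLine`, its fields, and the method names are hypothetical and do not come from the abstract, which does not specify message formats or state encodings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OwnerLine:
    """Hypothetical owner-side view of one cache line."""
    data: bytes
    valid: bool = True
    retained: Optional[bytes] = None  # old contents kept while speculation is pending

    def handle_srfowd(self) -> str:
        """Give up ownership without sending data, keeping the old contents aside."""
        self.retained = self.data
        self.valid = False
        return "ACK"  # acknowledgement only, no cache-line payload

    def on_commit(self) -> None:
        """Speculation was correct: the old contents may be discarded."""
        self.retained = None

    def on_misspeculation(self) -> None:
        """Speculation failed: set the old contents back to a valid state."""
        if self.retained is not None:
            self.data = self.retained
            self.valid = True
            self.retained = None


line = OwnerLine(b"old contents")
reply = line.handle_srfowd()
line.on_misspeculation()
```

In this sketch the reply never carries the line contents, which is the point of the request type; the cost is that the owner must hold the old copy until the requester's speculation resolves.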
Abstract:
A hardware/software co-design for an optimized dynamic out-of-order very long instruction word (VLIW) pipeline is described. For example, one embodiment of an apparatus comprises: an instruction fetch unit to fetch very long instruction words (VLIWs) in their program order from memory, each of the VLIWs comprising a plurality of reduced instruction set computing (RISC) instruction syllables grouped into the VLIWs in an order which removes data-flow dependencies and false output dependencies between the syllables; a decode unit to decode the VLIWs in their program order and output the syllables of each decoded VLIW in parallel; and an out-of-order execution engine to execute the syllables, preferably in parallel with other syllables, wherein at least some of the syllables are to be executed in a different order than the order in which they are received from the decode unit, the out-of-order execution engine having one or more processing stages which do not check for data-flow dependencies and false output dependencies between the syllables when performing operations.
Abstract:
An apparatus comprises: an instruction fetch unit to fetch very long instruction words (VLIWs) in program order from memory, each of the VLIWs comprising a plurality of reduced instruction set computing (RISC) instruction syllables grouped into the VLIWs in an order which removes data-flow dependencies (read-after-write hazards) and false output dependencies (write-after-write hazards) between the syllables; a decode unit 1501 to decode the VLIWs in program order and output the syllables of each decoded VLIW in parallel; and an out-of-order execution engine 1502-1507 to execute at least some of the syllables in parallel with other syllables, where at least some of the syllables are to be executed in a different order than the order in which they are received from the decode unit, the out-of-order execution engine having one or more processing stages which do not check for data-flow dependencies and false output dependencies between the syllables when performing operations.
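One way to picture the grouping constraint is a greedy packer that closes the current VLIW whenever the next syllable would read or rewrite a register already written inside that word. This is a minimal sketch, not the method of the abstract: the greedy policy, the `Syllable` type, the `pack_vliws` function, and the `width` parameter are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Set, Tuple


class Syllable(NamedTuple):
    """Hypothetical RISC syllable: one destination register, any number of sources."""
    dst: str
    srcs: Tuple[str, ...]


def pack_vliws(program: List[Syllable], width: int = 4) -> List[List[Syllable]]:
    """Pack syllables in program order so that no word contains a RAW or WAW pair."""
    words: List[List[Syllable]] = []
    current: List[Syllable] = []
    written: Set[str] = set()  # registers written by syllables in the current word
    for s in program:
        raw = any(r in written for r in s.srcs)  # data-flow dependency inside the word
        waw = s.dst in written                   # false output dependency inside the word
        if current and (raw or waw or len(current) == width):
            words.append(current)
            current, written = [], set()
        current.append(s)
        written.add(s.dst)
    if current:
        words.append(current)
    return words


program = [
    Syllable("r1", ("r2", "r3")),
    Syllable("r4", ("r5",)),
    Syllable("r6", ("r1",)),   # reads r1, so it starts a new word
    Syllable("r6", ("r7",)),   # rewrites r6, so it starts another word
]
words = pack_vliws(program)
```

Because every word produced this way is free of read-after-write and write-after-write pairs, a stage that handles the syllables of a single word would not need to check for them, which is the property the abstract relies on.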
Abstract:
Embodiments of a method and apparatus are described for implementing and maintaining a stack of predicate values with stack synchronization instructions. In one embodiment, the apparatus is an out-of-order hardware/software co-designed processor including instructions to explicitly manage the predicate register stack to maintain stack consistency across branches of execution that push a variable number of predicate values onto the predicate stack. In one embodiment, the stack-based predicate register implementation enables early branch calculation and early branch misprediction recovery via early renaming of predicate registers.
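The consistency problem can be illustrated with a toy stack in which two paths push different numbers of predicates and a synchronization step restores a common depth at the merge point. This is one possible reading, sketched under assumptions: `PredicateStack`, its `sync` method, and `run_path` are hypothetical names, and the abstract does not define the actual instruction semantics.

```python
from typing import List


class PredicateStack:
    """Toy predicate stack; all method names are hypothetical."""

    def __init__(self) -> None:
        self._values: List[bool] = []

    def push(self, value: bool) -> None:
        self._values.append(value)

    def top(self) -> bool:
        return self._values[-1]

    def depth(self) -> int:
        return len(self._values)

    def sync(self, depth: int) -> None:
        """Stack synchronization: drop every entry above a known depth."""
        if depth > len(self._values):
            raise ValueError("cannot synchronize to a depth above the current top")
        del self._values[depth:]


def run_path(taken: bool) -> PredicateStack:
    stack = PredicateStack()
    stack.push(True)             # predicate that is live across the branch
    merge_depth = stack.depth()  # depth both paths must agree on at the merge
    if taken:
        stack.push(False)
        stack.push(True)         # taken path pushes two predicates
    else:
        stack.push(True)         # fall-through path pushes one
    stack.sync(merge_depth)      # explicit synchronization before the merge
    return stack
```

Without the `sync` call the two paths would reach the merge point with depths 3 and 2, so code after the merge could not address stack entries consistently.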
Abstract:
In one embodiment, binary translation is used to fuse multiple macroinstructions of an instruction set architecture into a single macroinstruction. Fusible instruction sequences include a sequence of increment, compare, and jump instructions. In one embodiment, a processing device provides support for the fused macroinstruction. In one embodiment, the processing device executes the fused macroinstruction within a single execution stage of a processor pipeline. In one embodiment, the fused macroinstruction is performed within a single execution cycle.
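The fusion step can be pictured as a peephole pass over a translated instruction list. The sketch below is illustrative only: the tuple encoding, the mnemonics `inc`, `cmp`, and `jl`, and the fused opcode name `inc_cmp_jl` are assumptions, and a real translator would also have to confirm that no branch targets the middle of the sequence.

```python
from typing import List, Tuple

Instr = Tuple[str, ...]  # hypothetical encoding: (mnemonic, operand, ...)


def fuse_inc_cmp_jump(code: List[Instr]) -> List[Instr]:
    """Replace each inc/cmp/jl triple on the same register with one fused instruction."""
    out: List[Instr] = []
    i = 0
    while i < len(code):
        window = code[i:i + 3]
        if (len(window) == 3
                and window[0][0] == "inc"
                and window[1][0] == "cmp" and window[1][1] == window[0][1]
                and window[2][0] == "jl"):
            reg, bound, target = window[0][1], window[1][2], window[2][1]
            out.append(("inc_cmp_jl", reg, bound, target))
            i += 3  # the three original macroinstructions are consumed
        else:
            out.append(code[i])
            i += 1
    return out


loop_tail = [
    ("mov", "rax", "0"),
    ("inc", "rcx"),
    ("cmp", "rcx", "10"),
    ("jl", "loop"),
    ("ret",),
]
fused = fuse_inc_cmp_jump(loop_tail)
```

The pass leaves non-matching instructions untouched, so it can be run on any block without changing code that does not contain the fusible pattern.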
Abstract:
Disclosed are an apparatus and method to manage instruction cache prefetching from an instruction cache. A processor may comprise: a prefetch engine; a branch prediction engine to predict the outcome of a branch; and a dynamic optimizer. The dynamic optimizer may be used to control identifying common instruction cache misses and inserting a prefetch instruction from the prefetch engine to the instruction cache.
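The two optimizer tasks named above, identifying common misses and inserting prefetches, can be sketched as a pass over a miss trace and an instruction list. Everything specific here is an assumption made for illustration: the function `insert_prefetches`, the `threshold` and `distance` parameters, and the use of string labels to stand in for instruction addresses.

```python
from collections import Counter
from typing import Dict, List


def insert_prefetches(code: List[str], miss_trace: List[str],
                      threshold: int = 2, distance: int = 2) -> List[str]:
    """Insert a prefetch a few instructions ahead of each frequently missing label."""
    counts = Counter(miss_trace)
    hot = {label for label, n in counts.items() if n >= threshold}  # common misses

    # Map each insertion position to the labels that should be prefetched there.
    inserts: Dict[int, List[str]] = {}
    for idx, label in enumerate(code):
        if label in hot:
            inserts.setdefault(max(0, idx - distance), []).append(label)

    out: List[str] = []
    for idx, label in enumerate(code):
        for target in inserts.get(idx, []):
            out.append(f"prefetch {target}")
        out.append(label)
    return out


code = ["a", "b", "c", "d"]
miss_trace = ["d", "a", "d", "c"]  # "d" misses twice, the others once
optimized = insert_prefetches(code, miss_trace)
```

The `distance` parameter stands in for the lead time a prefetch needs to complete before the target is fetched; the abstract does not say how that placement is chosen.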