Abstract:
A system for supporting software pipelining using a shifting register queue is provided. The system includes a register file that comprises a plurality of registers. The register file is operable to receive a shift mask signal and a shift signal and to identify a shifting register queue within the register file based on the shift mask signal. The shifting register queue comprises a plurality of queue registers. The register file is further operable to shift the contents of the queue registers based on the shift signal.
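The masking and shifting behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `shift_queue`, the shift direction (toward higher register numbers), and the `fill` value loaded into the first queue register are all hypothetical and are not specified by the abstract.

```python
def shift_queue(registers, shift_mask, shift, fill=0):
    """Model one cycle of a register file with a maskable shifting queue.

    registers  : list of register values
    shift_mask : list of bools; True marks a register as part of the queue
    shift      : bool; when False the register file is left unchanged
    fill       : value loaded into the first queue register (assumption)
    """
    if len(registers) != len(shift_mask):
        raise ValueError("registers and shift_mask must have equal length")
    if not shift:
        return list(registers)

    # The shift mask identifies which registers form the queue.
    queue = [i for i, selected in enumerate(shift_mask) if selected]

    result = list(registers)
    # Each queue register takes the old value of the previous queue register.
    for prev, cur in zip(queue, queue[1:]):
        result[cur] = registers[prev]
    if queue:
        result[queue[0]] = fill
    return result


regs = [10, 20, 30, 40, 50]
mask = [False, True, False, True, True]   # registers 1, 3 and 4 form the queue
shifted = shift_queue(regs, mask, shift=True)
```

Registers outside the mask keep their contents, so the same register file can serve both as ordinary storage and as a queue for values that live across software-pipelined loop iterations.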
Abstract:
A coprocessor executing one among a set of candidate kernel loops within an application operates at the minimal clock frequency satisfying schedule constraints imposed by the compiler and data bandwidth constraints. The optimal clock frequency is statically determined by the compiler and enforced at runtime by software-controlled clock circuitry. Power dissipation savings and optimal resource usage are therefore achieved by the adaptation at runtime of the coprocessor clock rate for each of the various kernel loop implementations.
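One way to picture the static frequency selection is as a search over a discrete set of clock settings. The sketch below is an illustration only: it assumes the schedule constraint can be expressed as cycles per loop iteration and the bandwidth constraint as a required number of iterations per second. The function name, parameters, and the list of available frequencies are hypothetical and do not come from the abstract.

```python
def minimal_clock_hz(cycles_per_iteration, iterations_per_second, available_hz):
    """Pick the lowest available clock that sustains the required loop rate.

    cycles_per_iteration  : cycles the compiler's schedule needs per iteration
    iterations_per_second : rate at which data must be consumed (assumption)
    available_hz          : clock settings the clock circuitry can produce
    """
    required_hz = cycles_per_iteration * iterations_per_second
    candidates = [f for f in sorted(available_hz) if f >= required_hz]
    if not candidates:
        raise ValueError("no available clock satisfies the constraints")
    return candidates[0]


# A kernel loop scheduled at 4 cycles per iteration that must keep up with
# 25 million iterations per second needs at least 100 MHz.
clock = minimal_clock_hz(4, 25_000_000, [50_000_000, 100_000_000, 200_000_000])
```

Running each kernel loop at its own lowest sufficient setting, instead of at the maximum clock, is what yields the power savings the abstract refers to.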
Abstract:
A processor executes one or more prefetch threads and one or more main computing threads. Each prefetch thread executes instructions ahead of a main computing thread to retrieve data for the main computing thread, such as data that the main computing thread may use in the immediate future. Data is retrieved for the prefetch thread and stored in a memory, such as data fetched from an external memory and stored in a buffer. A prefetch controller determines whether the memory is full. If the memory is full, a cache controller stalls at least one prefetch thread. The stall may continue until at least some of the data is transferred from the memory to a cache for use by at least one main computing thread. The stalled prefetch thread or threads are then reactivated.
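The stall and reactivation behavior can be sketched as a bounded buffer with a stall flag. This is a simplified software model, not the hardware described: the class `PrefetchBuffer`, its method names, and the use of a plain list as the cache are hypothetical choices made for illustration.

```python
from collections import deque


class PrefetchBuffer:
    """Bounded buffer between a prefetch thread and the cache (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.prefetch_stalled = False

    def is_full(self):
        return len(self.entries) >= self.capacity

    def prefetch(self, item):
        """Store one prefetched item, or stall the prefetch thread if full."""
        if self.is_full():
            self.prefetch_stalled = True
            return False
        self.entries.append(item)
        return True

    def transfer_to_cache(self, cache, count=1):
        """Move up to `count` items to the cache and reactivate the thread."""
        moved = 0
        while self.entries and moved < count:
            cache.append(self.entries.popleft())
            moved += 1
        if moved:
            # Space was freed, so the stalled prefetch thread may resume.
            self.prefetch_stalled = False
        return moved


buf = PrefetchBuffer(capacity=2)
cache = []
accepted = [buf.prefetch(x) for x in ("a", "b", "c")]  # third request stalls
stalled_before = buf.prefetch_stalled
buf.transfer_to_cache(cache)
stalled_after = buf.prefetch_stalled
```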
Abstract:
A data sorting apparatus comprising 1) a storage sorter that sorts a data set according to a defined criterion; and 2) a query mechanism that receives intermediate sorted data values from the storage sorter and compares the intermediate sorted data values to at least one key value. The storage sorter comprises a priority queue for sorting the data set, wherein the priority queue comprises M processing elements. The query mechanism receives the intermediate sorted data values from the M processing elements. The query mechanism comprises a plurality of comparison circuits, each of the comparison circuits capable of detecting if one of the intermediate sorted data values is equal to the at least one key value or, if no match exists, extracting the minimal value greater than (or less than, according to a defined criterion) the at least one key value.
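The query step can be modeled in software as a lookup over the values held by the M processing elements. The sketch below is an assumption-laden illustration: the function name `query` and its return convention are hypothetical, it handles a single key, and it models only the "minimal value greater than the key" mode, not the less-than variant.

```python
def query(values, key):
    """Return (matched, value) for a key against intermediate sorted values.

    matched is True when some value equals the key. Otherwise the value is
    the smallest element greater than the key, or None if none exists.
    """
    if key in values:
        return True, key
    # Each comparison circuit is modeled as one element-wise comparison.
    greater = [v for v in values if v > key]
    return False, (min(greater) if greater else None)


elements = [7, 3, 9, 5]      # values currently held by the processing elements
hit = query(elements, 5)
miss = query(elements, 6)
none_above = query(elements, 10)
```

In hardware the comparisons would run in parallel across the processing elements; the list comprehension here only reproduces the result, not the timing.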
Abstract:
Full predication of instruction execution is provided by operand predicates, where each operand has an associated predicate bit intuitively indicating the validity of the operand value. In a programmable processor supporting operand predication, an instruction will execute only if the predicate bit of every register containing a source operand is true. The predicate bit, if any, of the destination register is set to the logical AND of the source registers' predicates. Similarly, in a non-programmable processor synthesized with predicated operand support, an operator will perform the associated function depending on the state of inputs' predicates. The output predicate is evaluated as the logical AND of the inputs' predicates. An additional bit for each data register, a change in the semantics of the instructions to include predication, and a few additional instructions to save and restore register predicate bits and to specifically set or reset a register's predicate bit are required.
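The predication rule stated above (execute only if every source predicate is true, and set the destination predicate to the logical AND of the source predicates) can be sketched as follows. The `Reg` class and the `predicated_add` function are hypothetical names used only for illustration, and an add stands in for any instruction.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    """A data register with one extra predicate bit marking its value valid."""
    value: int = 0
    pred: bool = False


def predicated_add(dst, a, b):
    """Add a and b into dst under operand predication."""
    # The destination predicate is the AND of the source predicates.
    dst.pred = a.pred and b.pred
    # The operation itself executes only when every source operand is valid.
    if dst.pred:
        dst.value = a.value + b.value


a = Reg(2, True)
b = Reg(3, True)
d = Reg()
predicated_add(d, a, b)          # both sources valid: d becomes 5, valid

b.pred = False
e = Reg(99, True)
predicated_add(e, a, b)          # one source invalid: e keeps 99, marked invalid
```

Because invalidity propagates through the destination predicate, any later instruction that reads `e` is suppressed in turn without a separate guard.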
Abstract:
Clustered VLIW processing elements, each preferably simple and identical, are coupled by a runtime reconfigurable inter-cluster interconnect to form a coprocessor executing only those portions of a program having high instruction level parallelism. The initial portion of each program segment executed by the coprocessor reconfigures the interconnect, if necessary, or is skipped. Clusters may be directly connected to a subset of neighboring clusters, or indirectly connected to any other cluster, a hierarchy exposed to the programming model and enabling a larger number of clusters to be employed. The coprocessor is idled during remaining portions of the program to reduce power dissipation.