Abstract:
PROBLEM TO BE SOLVED: To prevent performance degradation or increased power consumption by switching threads at an optimal time or with an optimal frequency. SOLUTION: A pre-fetch buffer bank 101 includes a plurality of buffer entries and stores a plurality of instructions for respective threads. The instructions fetched for each thread are transmitted to a multiplexer (mux) 105 via interconnects T0, T1, T2, T3. The multiplexer 105 selects the instruction corresponding to the selected thread on the basis of a selection line from thread picker logic 110. The thread picker logic may determine which thread is to be selected, i.e., which thread's instruction is to be performed, on the basis of an indication about the fetched instruction provided by a thread block indicator 115 and/or a decoded indication from a decoder 120.
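As an illustration of the selection step, the following Python sketch (all names and the round-robin policy are hypothetical, not taken from the abstract) models a picker that skips threads flagged by a block indicator and muxes out the next instruction of the selected thread:

    from collections import deque

    class ThreadPicker:
        def __init__(self, num_threads):
            self.buffers = [deque() for _ in range(num_threads)]  # pre-fetch buffer bank, one per thread
            self.blocked = [False] * num_threads                  # thread block indicator
            self.last = -1

        def fetch(self, tid, instruction):
            self.buffers[tid].append(instruction)

        def pick(self):
            # Select the next thread (round-robin) that has a fetched
            # instruction and is not blocked; return (tid, instruction).
            n = len(self.buffers)
            for i in range(1, n + 1):
                tid = (self.last + i) % n
                if not self.blocked[tid] and self.buffers[tid]:
                    self.last = tid
                    return tid, self.buffers[tid].popleft()  # mux output for selected thread
            return None

    picker = ThreadPicker(4)
    picker.fetch(0, "load r1, [r2]")
    picker.fetch(1, "add r3, r4, r5")
    picker.blocked[0] = True      # e.g., thread 0 waits on a cache miss
    print(picker.pick())          # -> (1, 'add r3, r4, r5')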
Abstract:
PROBLEM TO BE SOLVED: To provide a technique for sharing execution resources. SOLUTION: A CPU and a GPU share resources according to workload, power considerations, or available resources by scheduling or transferring instructions and information between the CPU and the GPU.
Abstract:
PROBLEM TO BE SOLVED: To provide a technique to increase the memory bandwidth for applications. SOLUTION: An apparatus comprises at least two processors coupled to at least two memories. A first processor 200 of the at least two processors is configured to read a first portion of data stored in a first memory 225 of the at least two memories and a second portion of data stored in a second memory 220 of the at least two memories within a first portion of a clock signal period. A second processor 205 of the at least two processors is configured to read a third portion of data stored in the first memory 225 of the at least two memories and a fourth portion of data stored in the second memory 220 of the at least two memories within the first portion of the clock signal period.
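The claimed access pattern can be illustrated in software. The following Python sketch (data values and port layout are assumptions) shows two processors each reading one portion from each of two memories within the same clock phase, so both memories' ports are busy every phase and per-processor bandwidth is effectively doubled:

    mem_a = [0xAA, 0xBB]   # first memory (225 in the abstract)
    mem_b = [0xCC, 0xDD]   # second memory (220 in the abstract)

    def first_phase_reads():
        # Processor 0 reads portions 1 and 2; processor 1 reads portions
        # 3 and 4. The two processors address different slots, so the
        # reads do not collide within the same portion of the clock period.
        proc0 = (mem_a[0], mem_b[0])
        proc1 = (mem_a[1], mem_b[1])
        return proc0, proc1

    print(first_phase_reads())   # both processors finish within the same phase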
Abstract:
PROBLEM TO BE SOLVED: To share computing resources between a CPU and a GPU according to the type of workload being processed. SOLUTION: A processor 100 includes a plurality of processing cores 100-1 to 100-N, dedicated throughput application hardware 110 (for example, graphics texture sampling hardware), and memory interface logic 120, arranged along a ring interconnect. The CPU may execute some operations scheduled for the GPU hardware by having them transferred over a shared memory or a direct (ring) link. Conversely, operations scheduled on the graphics hardware can be transferred to an available CPU using a similar mechanism.
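A minimal Python sketch of the transfer mechanism (the queue-based API is an assumption standing in for the shared memory or direct ring link the abstract describes) shows GPU-scheduled work spilling to an available CPU and vice versa:

    from queue import Queue

    cpu_queue, gpu_queue = Queue(), Queue()

    def schedule(task, prefer_gpu, gpu_busy, cpu_busy):
        # Place a task on the preferred unit, spilling to the other side
        # (modeling the shared-memory transfer) when the preferred unit
        # is saturated and the other side is idle.
        if prefer_gpu:
            target = cpu_queue if (gpu_busy and not cpu_busy) else gpu_queue
        else:
            target = gpu_queue if (cpu_busy and not gpu_busy) else cpu_queue
        target.put(task)

    schedule("texture_sample(tile=3)", prefer_gpu=True, gpu_busy=True, cpu_busy=False)
    print(cpu_queue.get())   # GPU-scheduled work picked up by an available CPU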
Abstract:
PROBLEM TO BE SOLVED: To provide a technique for expanding the memory bandwidth available to an application. SOLUTION: In a device having at least two processors coupled to at least two memories, a first processor of the at least two processors reads a first portion of data stored in a first memory of the at least two memories, and a second portion of data stored in a second memory of the at least two memories, within a first portion of a clock signal period; a second processor of the at least two processors reads a third portion of the data stored in the first memory, and a fourth portion of the data stored in the second memory, within the same first portion of the clock signal period.
Abstract:
A system and method for a processor to determine the memory page management implementation used by a memory controller, without necessarily having direct access to the memory controller's circuits or registers, is disclosed. In one embodiment, a matrix of counters corresponds to potential page management implementations and numbers of pages per block. The counters may be incremented or decremented depending upon whether the corresponding page management implementation and number of pages predict a page boundary whenever a long access latency is observed. The counter with the largest value after a period of time may correspond to the actual page management implementation and number of pages per block.
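The counting scheme can be sketched as follows in Python; the boundary predicate and the policy/pages candidates are stand-ins for whatever a real memory controller might use, not details from the disclosure:

    POLICIES = ["open_page", "closed_page"]
    PAGES_PER_BLOCK = [4, 8, 16]
    PAGE_SIZE = 4096

    counters = {(p, n): 0 for p in POLICIES for n in PAGES_PER_BLOCK}

    def predicts_boundary(policy, pages, prev_addr, addr):
        # Hypothetical stand-in predicate: does this (policy, pages)
        # candidate place the two accesses in different blocks, and would
        # that explain a long latency under the candidate policy?
        crossed = prev_addr // (pages * PAGE_SIZE) != addr // (pages * PAGE_SIZE)
        return crossed if policy == "open_page" else not crossed

    def observe(prev_addr, addr, long_latency):
        # On each observed long latency, reward candidates that predicted
        # a boundary at this address and penalize those that did not.
        if not long_latency:
            return
        for policy, pages in counters:
            delta = 1 if predicts_boundary(policy, pages, prev_addr, addr) else -1
            counters[(policy, pages)] += delta

    observe(0x3FF0, 0x4010, long_latency=True)
    best = max(counters, key=counters.get)
    print(best)   # after many observations, the largest counter suggests the real configuration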
Abstract:
A method and apparatus for accessing memory, comprising: monitoring memory accesses from a hardware prefetcher; determining whether the memory accesses from the hardware prefetcher are used by an out-of-order core; and switching memory accesses from a first mode to a second mode if a percentage of the memory accesses generated by the hardware prefetcher is used by the out-of-order core.
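A Python sketch of the mode switch (the 50% threshold and the mode names are assumptions; the abstract only says "a percentage") might look like this:

    USEFUL_THRESHOLD = 0.5   # assumed cutoff, not specified by the abstract

    class PrefetchMonitor:
        def __init__(self):
            self.issued = 0
            self.used = 0
            self.mode = "first"

        def record(self, was_used):
            # Track how many prefetched lines the out-of-order core consumed.
            self.issued += 1
            self.used += was_used

        def update_mode(self):
            if self.issued == 0:
                return self.mode
            ratio = self.used / self.issued
            # Switch to the second mode only when enough prefetches are
            # demonstrably used by the out-of-order core.
            self.mode = "second" if ratio >= USEFUL_THRESHOLD else "first"
            return self.mode

    m = PrefetchMonitor()
    for hit in [True, True, False, True]:
        m.record(hit)
    print(m.update_mode())   # -> 'second' (75% of prefetches were used)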
Abstract:
A processing core is described having execution unit logic circuitry with a first register to store a first vector input operand, a second register to store a second vector input operand, and a third register to store a packed data structure containing scalar input operands a, b, c. The execution unit logic circuitry further includes a multiplier to perform the operation (a*(first vector input operand)) + (b*(second vector input operand)) + c.
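The arithmetic itself is straightforward; a scalar Python sketch (elementwise application is an assumption about the vector semantics) is:

    def fused_op(vec1, vec2, packed_scalars):
        # a, b, c arrive packed in one register (the third register above)
        a, b, c = packed_scalars
        return [a * x + b * y + c for x, y in zip(vec1, vec2)]

    print(fused_op([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], (2.0, 0.5, 1.0)))
    # -> [5.0, 7.5, 10.0]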
Abstract:
A method and apparatus to detect and filter out redundant cache line addresses in a prefetch input queue, and to adjust the detector window size dynamically according to the number of detector entries in the queue for the cache-to-memory controller bus. Detectors correspond to cache line addresses that may represent cache misses in various levels of cache memory.
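A Python sketch of such a filter (the LRU window and the resize policy are assumptions) follows; it drops duplicate line addresses and scales the detector window with queue occupancy:

    from collections import OrderedDict

    class RedundancyFilter:
        def __init__(self, min_window=4, max_window=64):
            self.window = OrderedDict()   # recently seen line addresses, LRU order
            self.min_window, self.max_window = min_window, max_window
            self.capacity = min_window

        def resize(self, queue_depth):
            # Grow the detector window as the queue fills so more duplicates
            # are caught before reaching the cache-to-memory-controller bus.
            self.capacity = max(self.min_window, min(self.max_window, queue_depth * 2))
            while len(self.window) > self.capacity:
                self.window.popitem(last=False)

        def admit(self, line_addr):
            # Return True if the address should enter the prefetch input
            # queue (i.e., it is not redundant with a recent entry).
            if line_addr in self.window:
                self.window.move_to_end(line_addr)
                return False                  # filtered: duplicate request
            self.window[line_addr] = True
            if len(self.window) > self.capacity:
                self.window.popitem(last=False)
            return True

    f = RedundancyFilter()
    f.resize(10)
    print(f.admit(0x1000), f.admit(0x1000), f.admit(0x1040))   # True False True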