Abstract:
A parallelization assistant tool that helps a programmer parallelize a computer program is disclosed. The system directs the execution of instrumented code of the computer program to collect performance statistics relating to the execution of loops within the computer program. The system provides a user interface that presents to a programmer the performance statistics collected for a loop so that the programmer can prioritize efforts to parallelize the computer program. The system generates inlined source code of a loop by aggressively inlining functions substantially without regard to compilation performance, execution performance, or both. The system analyzes the inlined source code to determine the data-sharing attributes of the variables of the loop. The system may generate compiler directives to specify the data-sharing attributes of the variables.
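The two user-facing steps in this abstract, ranking loops by collected statistics and emitting a directive that records data-sharing attributes, can be illustrated with a minimal Python sketch. It is not the disclosed system: the `LoopStats` record, the `prioritize` and `omp_directive` helpers, and the choice of OpenMP `parallel for` clauses as the directive format are all assumptions made for illustration, since the abstract does not name a directive syntax.

```python
from dataclasses import dataclass


@dataclass
class LoopStats:
    """Hypothetical per-loop record; the field names are assumptions."""
    name: str        # label for the loop, e.g. a source location
    seconds: float   # time attributed to the loop by the instrumented run


def prioritize(loops):
    """Order loops so the most time-consuming candidates come first."""
    return sorted(loops, key=lambda s: s.seconds, reverse=True)


def omp_directive(private=(), shared=(), reductions=()):
    """Build an OpenMP-style directive string from data-sharing attributes.

    `reductions` is a sequence of (operator, variable) pairs.
    """
    clauses = []
    if private:
        clauses.append("private(" + ", ".join(private) + ")")
    if shared:
        clauses.append("shared(" + ", ".join(shared) + ")")
    for op, var in reductions:
        clauses.append(f"reduction({op}:{var})")
    return " ".join(["#pragma omp parallel for"] + clauses)
```

In this sketch the attribute lists are supplied by hand. In the system described above they would come from analyzing the inlined source code of the loop.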
Abstract:
A system and method are disclosed for cooling a plurality of connectors that interface electrical and optical signals to circuit boards in an electronics cabinet, such as backplane connectors routing signals to circuit boards housed in card cage assemblies. Heat pipes coupled to the connectors efficiently remove heat from the connectors and sink it to a cold junction of a liquid cooling system. The liquid cooling system may also extract heat from the air flow cooling the circuit boards such that the system is room neutral, meaning that the ambient temperature remains constant during operation of the system. The connector cooling system is effective where the connectors are outside of the air flow cooling envelope that may cool the circuit boards.
Abstract:
A system implementing a method for generating code for execution based on a single instruction, multiple threads (SIMT) model with parallel units of threads is provided. The system identifies a loop within a program that includes vector processing. The system generates instructions for a thread that include an instruction to set a predicate based on whether the thread of a parallel unit corresponds to a vector element. The system also generates instructions to perform the vector processing via scalar operations predicated on the predicate. As a result, the generated instructions perform the vector processing while avoiding the branch divergence within the parallel unit of threads that would otherwise result from checking whether a thread corresponds to a vector element.
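The effect of predication can be illustrated with a small Python simulation. It is a sketch under stated assumptions, not generated SIMT code: the parallel-unit width of 32 and the function name `predicated_vector_add` are hypothetical, and the Python `if` only models a predicated scalar instruction whose result is discarded when the predicate is false.

```python
PARALLEL_UNIT_SIZE = 32  # assumed width of one parallel unit of threads


def predicated_vector_add(a, b, unit_size=PARALLEL_UNIT_SIZE):
    """Simulate lanes of a parallel unit adding two vectors element-wise."""
    n = len(a)
    out = [None] * n
    # Each iteration of the outer loop models one parallel unit of threads.
    for base in range(0, n, unit_size):
        for lane in range(unit_size):
            idx = base + lane
            # Predicate: does this lane correspond to a vector element?
            pred = idx < n
            # Models a scalar add predicated on `pred`. Every lane follows
            # the same instruction stream; lanes with a false predicate
            # simply have no effect.
            if pred:
                out[idx] = a[idx] + b[idx]
    return out
```

When the vector length is not a multiple of the unit size, the trailing lanes of the last unit carry a false predicate and contribute nothing, which is the case the predicate exists to handle.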
Abstract:
A system and method are disclosed for cooling a plurality of electronics cabinets having horizontally positioned electronics assemblies. The system includes at least one blower configured to direct air horizontally across the electronics assemblies, and at least one intercooler configured to extract heat from the air flow such that the system is room neutral, meaning that the ambient temperature remains constant during operation of the system. A plurality of chassis backplanes and power supplies may also include an intercooler, wherein the intercoolers are electronically controlled such that the system is room neutral.
Abstract:
A method and apparatus are disclosed for precharging data and/or address lines, each having a large number of loads, to a voltage midway between high and low using a source-follower configuration, and for optionally driving only one-half of the precharge circuit based on the previous logical value on the line being precharged. In some embodiments, a driver circuit drives an output node either high or low during a first phase of each clock cycle, and a precharge circuit then precharges the output node to an intermediate voltage during a second phase of the clock cycle in preparation for the following clock cycle. Some embodiments include source-follower-configured FETs for precharging, wherein these FETs turn off once the output voltage reaches the intermediate value.
Abstract:
A system for processing gather and scatter instructions can implement a front-end subsystem, a back-end subsystem, or both. The front-end subsystem includes a prediction unit configured to determine a predicted quantity of coalesced memory access operations required by an instruction. A decode unit converts the instruction into a plurality of access operations based on the predicted quantity, and transmits the plurality of access operations and an indication of the predicted quantity to an issue queue. The back-end subsystem includes a load-store unit that receives a plurality of access operations corresponding to an instruction, determines a subset of the plurality of access operations that can be coalesced, and forms a coalesced memory access operation from the subset. A queue stores multiple memory addresses for a given load-store entry to provide for execution of coalesced memory accesses.
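To make the idea of coalescing concrete, the following Python sketch groups the element addresses of one gather instruction into memory accesses. The abstract does not state the coalescing criterion, so this sketch assumes that accesses falling in the same 64-byte line can be merged; the function name `coalesce` and the `line_bytes` parameter are hypothetical.

```python
def coalesce(addresses, line_bytes=64):
    """Group element addresses by the aligned line that contains them.

    Returns a dict mapping each line base address to the list of element
    indices it serves. Each dict entry models one coalesced memory access.
    """
    groups = {}
    for element, addr in enumerate(addresses):
        line_base = (addr // line_bytes) * line_bytes
        groups.setdefault(line_base, []).append(element)
    return groups


def coalesced_count(addresses, line_bytes=64):
    """Number of coalesced accesses the instruction needs in this model."""
    return len(coalesce(addresses, line_bytes))
```

In this model, `coalesced_count` is the quantity a front-end prediction unit would try to predict before the addresses are known, and `coalesce` corresponds to the grouping a load-store unit would perform once they are.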
Abstract:
Signal transmission structures within a printed circuit are formed to have reduced loss by making specific accommodations to reduce the surface roughness of an adjacent power plane, thereby reducing the effects of magnetically induced currents. The power plane structure retains sufficient surface roughness to accommodate manufacturing operations while also contributing to reduced signal transmission losses in the adjacent signal transmission structure. The transmission structures are thereby capable of more efficiently transmitting high-speed signals without undesired attenuation and loss.
Abstract:
A method for prefetching data into a cache is provided. The method allocates an outstanding request buffer (“ORB”). The method stores in an address field of the ORB an address and a number of blocks. The method issues prefetch requests for a degree number of blocks starting at the address. When a prefetch response is received for all the prefetch requests, the method adjusts the address of the next block to prefetch and adjusts the number of blocks remaining to be retrieved and then issues prefetch requests for a degree number of blocks starting at the adjusted address. The prefetching pauses when a maximum distance between the reads of the prefetched blocks and the last prefetched block is reached. When a read request for a prefetched block is received, the method resumes prefetching when a resume criterion is satisfied.
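The issue, pause, and resume behavior can be sketched as a small state machine in Python. This is a simplified model under several assumptions: blocks are identified by consecutive integers rather than addresses, all responses for a batch are treated as having arrived by the time the next batch is requested, and the resume criterion (not specified in the abstract) is taken to be the distance dropping below the maximum. The class and method names are hypothetical.

```python
class StreamPrefetcher:
    """Toy model of the ORB state: next block, blocks remaining, degree."""

    def __init__(self, start, num_blocks, degree, max_distance):
        self.next_block = start        # next block to prefetch
        self.remaining = num_blocks    # blocks still to be prefetched
        self.degree = degree           # blocks requested per batch
        self.max_distance = max_distance
        self.last_read = start - 1     # no prefetched block read yet

    def distance(self):
        """Blocks between the last block read and the last one prefetched."""
        return (self.next_block - 1) - self.last_read

    def issue_batch(self):
        """Issue up to `degree` prefetches; return [] if paused or done."""
        if self.remaining == 0 or self.distance() >= self.max_distance:
            return []
        count = min(self.degree, self.remaining)
        batch = list(range(self.next_block, self.next_block + count))
        # Adjust the next address and the remaining count, as in the ORB.
        self.next_block += count
        self.remaining -= count
        return batch

    def on_read(self, block):
        """Record a read of a prefetched block and resume if allowed."""
        self.last_read = max(self.last_read, block)
        return self.issue_batch()
```

Because the distance is checked only before a batch is issued, a batch can overshoot the maximum distance by up to `degree - 1` blocks in this sketch; a real design could check per block instead.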
Abstract:
A system is provided for allocating memory for data of a program for execution by a computer system with a multi-tier memory that includes low-bandwidth memory (LBM) and high-bandwidth memory (HBM). The system accesses a data structure map that maps data structures of the program to the memory addresses within an address space of the program to which the data structures are initially allocated. The system executes the program to collect statistics relating to memory requests and memory bandwidth utilization of the program. The system determines an extent to which each data structure is used by a high memory utilization portion of the program based on the data structure map and the collected statistics. The system generates a memory allocation plan that favors allocating data structures in HBM based on the extent to which the data structures are used by a high memory utilization portion of the program.
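One way such an allocation plan could be formed is a greedy pass over the data structures, sketched below in Python. The abstract does not describe the planning algorithm, so the greedy strategy, the ranking by accesses per byte, the fixed HBM capacity, and the name `plan_allocation` are all assumptions made for illustration.

```python
def plan_allocation(structures, hbm_capacity):
    """Assign each data structure to "HBM" or "LBM".

    `structures` is a list of (name, size_bytes, hot_accesses) tuples, where
    hot_accesses counts memory requests made to the structure by the high
    memory utilization portion of the program. Sizes must be positive.
    """
    # Favor structures with the most hot accesses per byte of HBM consumed.
    ranked = sorted(structures, key=lambda s: s[2] / s[1], reverse=True)
    plan = {}
    free = hbm_capacity
    for name, size, hot_accesses in ranked:
        if hot_accesses > 0 and size <= free:
            plan[name] = "HBM"
            free -= size
        else:
            plan[name] = "LBM"
    return plan
```

For example, given structures of 400, 100, and 300 bytes with 9000, 5000, and 10 hot accesses and 500 bytes of HBM, this sketch places the first two in HBM and leaves the third in LBM.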