Abstract:
A system implementing a method for generating code for execution based on a SIMT model with parallel units of threads is provided. The system identifies a loop within a program that includes vector processing. The system generates instructions for a thread that include an instruction to set a predicate based on whether the thread of a parallel unit corresponds to a vector element. The system also generates instructions to perform the vector processing via scalar operations predicated on the predicate. As a result, the system generates instructions to perform the vector processing but to avoid branch divergence within the parallel unit of threads that would be needed to check whether a thread corresponds to a vector element.
Abstract:
An optimization system to apply directives to a computer program without having to perform repeated front-end compilations of source code of the computer program is provided. In some embodiments, the optimization system performs a first compilation of the source code of the program to generate first front-end code and first back-end code of the computer program. The compilation includes a first front-end compilation and a first back-end compilation. The optimization system identifies a compiler directive to apply to a location within the first front-end code. The optimization system then performs a second back-end compilation of the first front-end code factoring in the compiler directive to generate second back-end code affected by the compiler directive.
Abstract:
A system implementing a method for generating code for execution based on a SIMT model with parallel units of threads is provided. The system identifies a loop within a program that includes vector processing. The system generates instructions for a thread that include an instruction to set a predicate based on whether the thread of a parallel unit corresponds to a vector element. The system also generates instructions to perform the vector processing via scalar operations predicated on the predicate. As a result, the system generates instructions to perform the vector processing but to avoid branch divergence within the parallel unit of threads that would be needed to check whether a thread corresponds to a vector element.