Abstract:
Systems and methods may configure a programmable logic device to efficiently run a deep learning (DL) network. Architecture code and algorithmic code may be generated. The architecture code may define convolutional and fully connected processor cores structured to run the layers of a Deep Neural Network (DNN). The processor cores may be interconnected by a First In First Out (FIFO) memory. The architecture code may also define stride-efficient memories for implementing convolution. The algorithmic code may include configuration instructions for running the DNN's layers at the processor cores. The algorithmic code may also include a schedule for executing the configuration instructions on the processor cores, for moving network parameters to the processor cores, and for transferring outputs between the layers.
Abstract:
A system and method optimizes hardware description generated from a graphical program or model having oversampling constraints automatically. The system may include a streaming optimizer, a resource sharing optimizer, a delay balancing engine, and a global scheduler. The streaming optimizer may transform vector data paths to scalar or smaller-sized vector data paths. The resource sharing optimizer may replace multiple, functionally equivalent blocks with a single shared block. The delay balancing may insert one or more elements to correct for data path misalignment. The global scheduler may place portions of the program or model into conditional execution sections and create control logic that controls the model sample times or steps that the portions are enabled. A validation model, a report, or hardware description code that utilizes fewer hardware resources may be generated from a modified version of the model that is created.
Abstract:
A system and method tests for functional equivalence prior to automatically retiming a high-level specification. An Intermediate Representation (IR) includes one or more graphs or trees based on the high-level specification. A functional equivalence (FE) analyzer determines whether one or more components in the graph meet certain value and state conditions and thus is a candidate for retiming. A scheduler can use components that fail FE as a retiming boundary.
Abstract:
A system and method generates optimized code for a source model. The system may include a resource sharing optimizer that evaluates the source model and replaces multiple model elements of the source model that are functionally equivalent with a single shared model element. The model elements replaced with the single shared model element may have different fixed point data types. The resource sharing optimizer may convert some of the fixed point data types to a common fixed point data type.
Abstract:
Systems and methods automatically generate code from an executable model. The code may be generated from one or more in-memory representations constructed for the model. The in-memory representations may be analyzed, and portions that can be mapped to DSP slices of a programmable logic device may be identified. The portions may be modified based on information for a particular programmable logic device, such as the structure of the device's DSP slices. The modifications may ensure that elements of the generated code get mapped to DSP slices, when the generated code is used to synthesize the programmable logic device.
Abstract:
A system and method optimizes hardware description generated from a graphical program or model automatically. The system may include a streaming optimizer, a resource sharing optimizer and a delay balancing engine. The streaming optimizer transforms one or more vector data paths in the source model to scalar data paths or to a smaller-sized vector data paths. The resource sharing optimizer may replace multiple blocks of the source model that are functionally equivalent with a single shared block. The streaming and resource sharing optimizers may also configure portions of the modified model to execute at a faster rate. The delay balancing engine may examine the modified model to determine whether any delays or latencies have been introduced. If so, the delay balancing engine may insert one or more blocks into the modified model to correct for any data path misalignment caused by the introduction of the delays or latencies. A validation model, a report, or hardware description code that utilizes fewer hardware resources may be generated from the modified model.
Abstract:
Systems and methods generate code from a source program where the generated code may be compiled and executed on a Graphics Processing Unit (GPU). A parallel loop analysis check may be performed on regions of the source program identified for parallelization. One or more optimizations also may be applied to the source program that convert mathematical operations into a parallel form. The source program may be partitioned into segments for execution on a host and a device. Kernels may be created for the segments to be executed on the device. The size of the kernels may be determined, and memory transfers between the host and device may be optimized.
Abstract:
Systems and methods automatically generate optimized hardware description language code for a model created in a modeling environment. A training tool selects and provides scripts to a hardware synthesis tool chain that direct the tool chain to synthesize hardware components for core components of the modeling environment. A report generated by the tool chain is evaluated to extract performance data for the core components, and the performance data is stored in a library. An optimization tool estimates the performance of the model using the performance data in the library. Based on the performance estimate and an analysis of the model, the optimization tool selects an optimization technique which it applies to the model generating a revised. Estimating performance, and selecting and applying optimizations may be repeated until a performance constraint is satisfied or a termination criterion is met.
Abstract:
Systems and methods may automatically generate code for deep learning networks. The systems methods may provide a code generation framework for generating target specific code. The code generation framework may include one or more predefined class hierarchies for constructing objects of the generated code. The objects of the class hierarchies may provide an interface to predefined libraries of deep learning functions optimized for use on a target platform. The systems and methods may perform one or more optimizations on the code being generated.
Abstract:
A device generates a model associated with a multi-rate system. The multi-rate system includes a system associated with a clock rate and a sample rate, and the clock rate is greater than the sample rate. The device identifies the clock rate of the multi-rate system based on the model, and identifies a portion, of the model, associated with the sample rate. The device applies clock rate pipelining to adjust the sample rate associated with the portion of the model so that the sample rate substantially equals the clock rate, and generates code associated with the model and the applied clock rate pipelining.