The following tables and figures are taken from Xilinx official documentation:
- UG473 7 Series FPGAs Memory Resources (v1.14)
- UG474 7 Series FPGAs Configurable Logic Block (v1.8)
- UG479 7 Series DSP48E1 Slice (v1.10)
- UG573 UltraScale Architecture Memory Resources (v1.12)
- UG574 UltraScale Architecture Configurable Logic Block (v1.5)
- UG579 UltraScale Architecture DSP Slice (v1.10)
- UG871 Vivado Design Suite Tutorial: High-Level Synthesis (v2020.1)
- UG872 Large FPGA Methodology Guide (v14.3)
- UG902 Vivado Design Suite User Guide: High-Level Synthesis (v2020.1)
- UG998 Introduction to FPGA Design with Vivado High-Level Synthesis (v1.1)
- UG1197 UltraFast Vivado HLS Methodology Guide (v2020.1)
- UG1270 Vivado HLS Optimization Methodology Guide (v2018.1)
Source: UG1197 Figure 4-5
Source: UG902 Figure 4
Type | Attributes |
---|---|
Kernel Optimization | |
Function Inlining | |
Interface Synthesis | |
Task-level Pipeline | |
Pipeline | |
Loop Unrolling | |
Loop Optimization | |
Array Optimization | |
Structure Packing | |
Source: UG902 Table 11
Loop pipelining: |
Dataflow optimization: |
Array reshaping: |
Array partitioning: |
Source: SDAccel Development Environment Help
Source: UG902 Table 12
The header file `ap_int.h` defines the following arbitrary precision integer data types:
- `ap_int<W>`
- `ap_uint<W>`

where `W` is the number of bits. For example, `ap_int<8>` represents an 8-bit signed integer data type; `ap_uint<234>` represents a 234-bit unsigned integer type.
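
To make the effect of the bit width `W` concrete, the sketch below emulates in plain C++ the value a W-bit integer would hold after an assignment, assuming the extra high-order bits are simply dropped (wrap-around). The helpers `wrap_unsigned` and `wrap_signed` are hypothetical names for illustration and are not part of `ap_int.h`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of ap_int.h): emulate how a W-bit
// integer wraps an assigned value, for 1 <= W <= 63.

// Value an unsigned W-bit integer would hold: keep only the low W bits.
inline uint64_t wrap_unsigned(uint64_t v, int W) {
    assert(W >= 1 && W <= 63);
    return v & ((uint64_t{1} << W) - 1);
}

// Value a signed W-bit integer would hold (two's complement).
inline int64_t wrap_signed(int64_t v, int W) {
    assert(W >= 1 && W <= 63);
    const uint64_t low = wrap_unsigned(static_cast<uint64_t>(v), W);
    const uint64_t sign = uint64_t{1} << (W - 1);
    // Flipping and then subtracting the sign bit sign-extends the W-bit value.
    return static_cast<int64_t>(low ^ sign) - static_cast<int64_t>(sign);
}
```

Under these assumptions, `wrap_signed(200, 8)` gives -56 and `wrap_unsigned(300, 8)` gives 44, which is what an 8-bit signed or unsigned variable can represent of those inputs.
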
The header file `ap_fixed.h` defines the following arbitrary precision fixed-point data types:
- `ap_fixed<W,I,Q,O,N>`
- `ap_ufixed<W,I,Q,O,N>`

where `W` is the total number of bits, `I` is the number of integer bits, `W-I` is the number of fractional bits, `Q` specifies the type of rounding, and `O` and `N` specify the overflow behavior. For example, `ap_fixed<6,3>` represents a 6-bit signed value with 3 integer bits and 3 fractional bits, where the MSB is the sign bit, followed by the 2<sup>1</sup>, 2<sup>0</sup>, 2<sup>-1</sup>, 2<sup>-2</sup>, and 2<sup>-3</sup> bits. `ap_ufixed<10,8>` represents a 10-bit unsigned value with 8 integer bits and 2 fractional bits.
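
The bit weights described above can be checked with a small plain C++ sketch that interprets a raw bit pattern as a fixed-point number. The functions `fixed_value` and `ufixed_value` are hypothetical helpers written for this illustration; they are not provided by `ap_fixed.h`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: interpret the low W bits of `raw` as a signed
// fixed-point number with I integer bits (sign bit included) and
// W-I fractional bits.
inline double fixed_value(uint32_t raw, int W, int I) {
    assert(W >= 1 && W <= 30 && I >= 0 && I <= W);
    const uint32_t mask = (uint32_t{1} << W) - 1;
    int32_t v = static_cast<int32_t>(raw & mask);
    if (v & (int32_t{1} << (W - 1))) v -= (int32_t{1} << W);  // apply the sign bit
    return static_cast<double>(v) / static_cast<double>(1 << (W - I));
}

// Hypothetical helper: the same interpretation for an unsigned format.
inline double ufixed_value(uint32_t raw, int W, int I) {
    assert(W >= 1 && W <= 30 && I >= 0 && I <= W);
    const uint32_t mask = (uint32_t{1} << W) - 1;
    return static_cast<double>(raw & mask) / static_cast<double>(1 << (W - I));
}
```

For the 6-bit format with 3 integer bits, the pattern `101100` (0x2C) evaluates to -4 + 1 + 0.5 = -2.5, and the representable range is -4 to 3.875. For the 10-bit unsigned format with 8 integer bits, the all-ones pattern evaluates to 255.75.
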
Identifier | Mode | Description |
---|---|---|
W | | Word length in bits. |
I | | The number of bits used to represent the integer value (the number of bits above the decimal point). |
Q | | Quantization mode: dictates the behavior when greater precision is generated than can be defined by the smallest fractional bit in the variable used to store the result. |
Q | AP_RND | Rounding to plus infinity. |
Q | AP_RND_ZERO | Rounding to zero. |
Q | AP_RND_MIN_INF | Rounding to minus infinity. |
Q | AP_RND_INF | Rounding to infinity. |
Q | AP_RND_CONV | Convergent rounding. |
Q | AP_TRN | Truncation to minus infinity (default). |
Q | AP_TRN_ZERO | Truncation to zero. |
O | | Overflow mode: dictates the behavior when more bits are generated than the variable used to store the result contains. |
O | AP_SAT | Saturation. |
O | AP_SAT_ZERO | Saturation to zero. |
O | AP_SAT_SYM | Symmetrical saturation. |
O | AP_WRAP | Wrap around (default). |
O | AP_WRAP_SM | Sign magnitude wrap around. |
N | | The number of saturation bits used in wrap around overflow modes. The default value is zero. |
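
The difference between some of these modes is easiest to see numerically. The following plain C++ sketch emulates three quantization modes and two overflow modes for a signed format. It is an illustration under stated assumptions, not the `ap_fixed.h` implementation: `to_fixed`, `Quant`, and `Ovf` are hypothetical names, the remaining modes are omitted, and `N` is taken to be zero.

```cpp
#include <cassert>
#include <cmath>

enum class Quant { Trn, TrnZero, Rnd };  // stand-ins for AP_TRN, AP_TRN_ZERO, AP_RND
enum class Ovf { Wrap, Sat };            // stand-ins for AP_WRAP, AP_SAT

// Convert x to a signed W-bit fixed-point value with I integer bits and
// return the real number the stored bits would represent.
inline double to_fixed(double x, int W, int I, Quant q, Ovf o) {
    assert(W >= 2 && W <= 30 && I >= 1 && I <= W);
    const double scale = std::ldexp(1.0, W - I);  // 2^(W-I)
    const double scaled = x * scale;
    double n;
    switch (q) {
        case Quant::Trn:     n = std::floor(scaled); break;        // toward minus infinity
        case Quant::TrnZero: n = std::trunc(scaled); break;        // toward zero
        default:             n = std::floor(scaled + 0.5); break;  // nearest, ties toward plus infinity
    }
    const double lo = -std::ldexp(1.0, W - 1);       // most negative raw value
    const double hi = std::ldexp(1.0, W - 1) - 1.0;  // most positive raw value
    if (n < lo || n > hi) {
        if (o == Ovf::Sat) {
            n = (n < lo) ? lo : hi;                  // clamp to the nearest limit
        } else {
            const double m = std::ldexp(1.0, W);     // 2^W
            n -= m * std::floor((n - lo) / m);       // wrap into [lo, hi]
        }
    }
    return n / scale;
}
```

With W = 6 and I = 3, the input 1.3125 becomes 1.25 under truncation and 1.375 under rounding to plus infinity, while the out-of-range input 5.0 becomes 3.875 with saturation and -3.0 with wrap around.
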
- For C and C++ designs only a single clock is supported. The same clock is applied to all functions in the design.
- When using Stacked Silicon Interconnect (SSI) technology devices, it is important to ensure that the logic created by Vivado HLS fits within a single Super Logic Region (SLR).
The LUTs can be configured as a 6-input LUT with one output or two 5-input LUTs with separate outputs but common addresses or logic inputs. Eight 6-input LUTs and their sixteen storage elements, as well as the multiplexers and arithmetic carry logic, form a slice.
Source: UG579 Figure 1-1
Source: UG579 Figure 2-1
Source: UG579 Figure 3-1
The DSP48E2 slice consists of a 27-bit pre-adder, a 27 x 18 multiplier, a second-stage adder/subtracter/logic unit, and a pattern detector. It produces a 48-bit output. If the multiplier is not used, the DSP slice can also be used as a full 48-bit adder/subtracter and AND/OR/NOT/NAND/NOR/XOR/XNOR logic unit. The pattern detector provides support for convergent rounding, overflow/underflow detection, and counter auto-reset.
The typical use of the slice is to calculate P = (D ± A) * B + C. If the multiplier is not used, A and B can be concatenated as A:B to calculate P = A:B + C. Multiple DSP slices can be cascaded to perform accumulation PCOUT = (D ± A) * B + PCIN.
The A, B, C, D input ports have the following bit widths:
Port | Bit Width | Description |
---|---|---|
A | 30 | A[26:0] is the A input of the multiplier or the pre-adder. A[29:0] are the upper bits of the A:B concatenated input. |
B | 18 | The B input of the multiplier. B[17:0] are the lower bits of the A:B concatenated input. |
C | 48 | The C input to the second-stage adder/subtracter, pattern detector, or logic function. |
D | 27 | The D input to the pre-adder or alternative input to the multiplier. |
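
As a rough behavioral sketch of P = (D ± A) * B + C with these port widths, the plain C++ model below truncates each operand to its port width and wraps the result to 48 bits. It is an assumption-laden illustration rather than a model of the actual primitive: the names `sext`, `dsp_mac`, and `dsp_dot` are hypothetical, pipeline registers and opmodes are ignored, and the cascade input PCIN is modeled by passing it through the `c` argument.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of v (two's complement), 1 <= bits <= 63.
inline int64_t sext(int64_t v, int bits) {
    assert(bits >= 1 && bits <= 63);
    const uint64_t mask = (uint64_t{1} << bits) - 1;
    const uint64_t low = static_cast<uint64_t>(v) & mask;
    const uint64_t sign = uint64_t{1} << (bits - 1);
    return static_cast<int64_t>(low ^ sign) - static_cast<int64_t>(sign);
}

// Sketch of P = (D +/- A) * B + C using the 27/18/48/27-bit port widths.
inline int64_t dsp_mac(int64_t a, int64_t b, int64_t c, int64_t d,
                       bool subtract = false) {
    const int64_t a27 = sext(a, 27);  // only A[26:0] feeds the pre-adder
    const int64_t d27 = sext(d, 27);
    // Assumption: the pre-adder result is kept to 27 bits.
    const int64_t pre = sext(subtract ? d27 - a27 : d27 + a27, 27);
    const int64_t prod = pre * sext(b, 18);  // 27 x 18 product fits in 45 bits
    return sext(prod + sext(c, 48), 48);     // 48-bit second-stage adder
}

// Cascade sketch: each stage adds its product to the previous stage's output,
// mirroring PCOUT = (D +/- A) * B + PCIN with D = 0.
inline int64_t dsp_dot(const int32_t* x, const int32_t* w, int n) {
    int64_t pcin = 0;
    for (int i = 0; i < n; ++i) pcin = dsp_mac(x[i], w[i], pcin, 0);
    return pcin;
}
```

For example, `dsp_mac(3, 4, 5, 2)` evaluates (2 + 3) * 4 + 5 = 25, and chaining three stages over {1, 2, 3} and {4, 5, 6} yields the dot product 32.
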
The P, PATTERNDETECT, and PATTERNBDETECT output ports have the following bit widths:
Port | Bit Width | Description |
---|---|---|
P | 48 | The P output from the second-stage adder/subtracter or logic function. |
PATTERNBDETECT | 1 | Match indicator between P[47:0] and the complement of the 48-bit pattern. |
PATTERNDETECT | 1 | Match indicator between P[47:0] and the 48-bit pattern. |
The DSP slices in the same column can be cascaded to form accumulators, adders, counters, and other more sophisticated operations. The ability is provided by the cascade input ports (ACIN, BCIN, PCIN, CARRYCASCIN, and MULTSIGNIN) and the cascade output ports (ACOUT, BCOUT, PCOUT, CARRYCASCOUT, and MULTSIGNOUT).
Number of DSP slices on Xilinx FPGAs:
Device | # of DSPs |
---|---|
Kintex-7 325T | 840 |
Virtex-7 690T | 3,600 |
Kintex UltraScale KU115 | 5,520 |
Virtex UltraScale+ VU9P | 6,840 |
Virtex UltraScale+ VU13P | 12,288 |
Note that Kintex-7 and Virtex-7 FPGAs have DSP48E1 slices, whereas UltraScale and UltraScale+ FPGAs have DSP48E2 slices.
HLS considers one block RAM to be 18K bits. A block RAM has two ports which can each be 1, 2, 4, 9, or 18 bits wide (with depths of 16K, 8K, 4K, 2K, and 1K respectively).
Each UltraRAM stores 4096*72 bits (288K bits), which is 16 times the size of an 18K-bit block RAM. The port width is always 72 bits.
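
The sizes above allow a back-of-the-envelope estimate of how many memory primitives an array might need. The plain C++ sketch below tries each block RAM aspect ratio listed above and keeps the smallest count. It is only a rough estimate under simplifying assumptions, not the mapping algorithm used by the tools, and `bram18_estimate` and `uram_estimate` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kBram18Bits = 18 * 1024;  // one 18K-bit block RAM
constexpr int64_t kUramBits = 4096 * 72;    // one UltraRAM
static_assert(kUramBits == 16 * kBram18Bits, "UltraRAM is 16x an 18K block RAM");

constexpr int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Rough count of 18K block RAMs for an array of `depth` words of `width` bits:
// tile the array with one aspect ratio at a time and keep the best result.
inline int64_t bram18_estimate(int64_t depth, int64_t width) {
    assert(depth > 0 && width > 0);
    const int64_t widths[] = {1, 2, 4, 9, 18};
    const int64_t depths[] = {16384, 8192, 4096, 2048, 1024};
    int64_t best = -1;
    for (int i = 0; i < 5; ++i) {
        const int64_t n = ceil_div(width, widths[i]) * ceil_div(depth, depths[i]);
        if (best < 0 || n < best) best = n;
    }
    return best;
}

// Rough count of UltraRAMs: the shape is fixed at 4096 words of 72 bits.
inline int64_t uram_estimate(int64_t depth, int64_t width) {
    assert(depth > 0 && width > 0);
    return ceil_div(width, 72) * ceil_div(depth, 4096);
}
```

Under this estimate, a 1024 x 32-bit array needs two 18K block RAMs (two 1K x 18 blocks side by side), and an 8192 x 64-bit array needs two UltraRAMs.
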
- Resource utilization
- Design performance
- Power consumption
- Software runtime
- Debugging capability
- Portability
Area : Amount of hardware resources required to implement the design based on the resources available in the FPGA, including look-up tables (LUTs), registers, block RAMs, and DSP48s.
Latency : Number of clock cycles required for the function to compute all output values.
Initiation interval (II) : Number of clock cycles before the function can accept new input data.
Loop iteration latency : Number of clock cycles it takes to complete one iteration of the loop.
Loop initiation interval : Number of clock cycles before the next iteration of the loop starts to process data.
Loop latency : Number of cycles to execute all iterations of the loop.
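
The loop metrics above are related by a commonly used approximation: without pipelining, iterations run back to back, and with pipelining a new iteration starts every II cycles. The plain C++ sketch below encodes that approximation. The function names are hypothetical, and the cycle counts reported by the tools may differ slightly, for example because of loop entry and exit overhead.

```cpp
#include <cassert>
#include <cstdint>

// Approximate loop latency when iterations execute one after another.
inline int64_t loop_latency_sequential(int64_t trip_count, int64_t iter_latency) {
    assert(trip_count >= 0 && iter_latency >= 0);
    return trip_count * iter_latency;
}

// Approximate loop latency when the loop is pipelined with initiation
// interval `ii`: the last iteration starts after (trip_count - 1) * ii cycles
// and then needs one iteration latency to finish.
inline int64_t loop_latency_pipelined(int64_t trip_count, int64_t iter_latency,
                                      int64_t ii) {
    assert(trip_count >= 0 && iter_latency >= 0 && ii >= 1);
    return trip_count == 0 ? 0 : (trip_count - 1) * ii + iter_latency;
}
```

For a loop of 100 iterations with an iteration latency of 3 cycles, this gives 300 cycles without pipelining, 102 cycles with II = 1, and 201 cycles with II = 2.
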