vivado-hls-nota

The following tables and figures are taken from official Xilinx documentation:

HLS Optimization Methodology

Source: UG1197 Figure 4-5

Vivado HLS Design Flow

Source: UG902 Figure 4

Vivado HLS Pragmas By Type

Type	Pragmas
Kernel Optimization	pragma HLS allocation, pragma HLS expression_balance, pragma HLS latency, pragma HLS reset, pragma HLS resource, pragma HLS stable
Function Inlining	pragma HLS inline, pragma HLS function_instantiate
Interface Synthesis	pragma HLS interface
Task-level Pipeline	pragma HLS dataflow, pragma HLS stream
Pipeline	pragma HLS pipeline, pragma HLS occurrence
Loop Unrolling	pragma HLS unroll, pragma HLS dependence
Loop Optimization	pragma HLS loop_flatten, pragma HLS loop_merge, pragma HLS loop_tripcount
Array Optimization	pragma HLS array_map, pragma HLS array_partition, pragma HLS array_reshape
Structure Packing	pragma HLS data_pack
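
As a rough illustration of where such pragmas are placed, here is a minimal sketch of a kernel that combines an array optimization with loop pipelining. The function name dot8, the constant N_TAPS, and the array sizes are made up for this example. A standard C++ compiler ignores the unknown #pragma HLS lines (possibly with a warning), so the same source can be compiled for C simulation.

```cpp
// Hypothetical 8-tap dot product; names and sizes are illustrative only.
const int N_TAPS = 8;

int dot8(const int x[N_TAPS], const int coeff[N_TAPS]) {
// Array optimization: split coeff into individual elements so that
// all of them can be read in the same cycle.
#pragma HLS array_partition variable=coeff complete dim=1
    int acc = 0;
    for (int i = 0; i < N_TAPS; ++i) {
// Pipeline: ask the tool to start a new loop iteration every clock cycle.
#pragma HLS pipeline II=1
        acc += x[i] * coeff[i];
    }
    return acc;
}
```

Whether the requested II=1 is actually achieved depends on the target device and clock, so the synthesis report should be checked.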

Vivado HLS Optimization Directives

Source: UG902 Table 11

Loop pipelining:

Dataflow optimization:

Array reshaping:

Array partitioning:

Source: SDAccel Development Environment Help

Vivado HLS Configurations

Source: UG902 Table 12

C++ Arbitrary Precision Integer Types

The header file ap_int.h defines the following arbitrary precision integer data types:

ap_int<W>
ap_uint<W>

where W is the number of bits. For example, ap_int<8> represents an 8-bit signed integer data type; ap_uint<234> represents a 234-bit unsigned integer type.

C++ Arbitrary Precision Fixed-Point Types

The header file ap_fixed.h defines the following arbitrary precision fixed-point data types:

ap_fixed<W,I,Q,O,N>
ap_ufixed<W,I,Q,O,N>

where W is the total number of bits, I is the number of integer bits, W-I is the number of fractional bits, Q specifies the quantization (rounding) mode, and O and N specify the overflow behavior. For example, ap_fixed<6,3> represents a 6-bit signed value with 3 integer bits and 3 fractional bits, where the MSB is the sign bit, followed by the 2¹, 2⁰, 2⁻¹, 2⁻², 2⁻³ bits. ap_ufixed<10,8> represents a 10-bit unsigned value with 8 integer bits and 2 fractional bits.
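
The bit weights described above can be checked in plain C++ without ap_fixed.h. The helper fixed_value below is hypothetical (it is not part of the Xilinx headers): it reads the low W bits of an integer as a two's complement number and scales it by 2^-(W-I).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper, not part of ap_fixed.h: value of the low W bits of
// `bits`, read as a signed fixed-point number with I integer bits.
double fixed_value(std::uint32_t bits, int W, int I) {
    assert(W >= 1 && W <= 31);
    std::int64_t raw = bits & ((1u << W) - 1u);      // keep the low W bits
    if (raw & (std::int64_t{1} << (W - 1)))           // sign bit set?
        raw -= (std::int64_t{1} << W);                // two's complement value
    return std::ldexp(static_cast<double>(raw), -(W - I));  // raw * 2^-(W-I)
}
```

For the ap_fixed<6,3> layout, the pattern 011111 gives 3.875 (the largest value) and 100000 gives -4.0 (the smallest value).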

Identifier	Description
W	Word length in bits.
I	The number of bits used to represent the integer value (the number of bits above the binary point).
Q	Quantization mode dictates the behavior when greater precision is generated than can be defined by the smallest fractional bit in the variable used to store the result.
	Mode	Description
	AP_RND	Rounding to plus infinity.
	AP_RND_ZERO	Rounding to zero.
	AP_RND_MIN_INF	Rounding to minus infinity.
	AP_RND_INF	Rounding to infinity.
	AP_RND_CONV	Convergent rounding.
	AP_TRN	Truncation to minus infinity (default).
	AP_TRN_ZERO	Truncation to zero.
O	Overflow mode dictates the behavior when more bits are generated than the variable to store the result contains.
	Mode	Description
	AP_SAT	Saturation.
	AP_SAT_ZERO	Saturation to zero.
	AP_SAT_SYM	Symmetrical saturation.
	AP_WRAP	Wrap around (default).
	AP_WRAP_SM	Sign magnitude wrap around.
N	The number of saturation bits used in wrap around overflow modes. The default value is zero.
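
To make the difference between the default modes and their common alternatives concrete, here is a minimal sketch that emulates storing a real number into a signed W-bit fixed-point variable. It is an assumption-laden model in plain C++, not the ap_fixed implementation: the enums Quant and Ovf and the function to_fixed are made up for this example, only AP_TRN, AP_RND, AP_WRAP (with N = 0), and AP_SAT are modeled, and the scaled input is assumed to fit in 64 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

enum class Quant { TRN, RND };   // stand-ins for AP_TRN and AP_RND
enum class Ovf { WRAP, SAT };    // stand-ins for AP_WRAP (N = 0) and AP_SAT

// Hypothetical model: value held by a signed W-bit variable with I integer
// bits after x is quantized with mode q and overflow is handled with mode o.
double to_fixed(double x, int W, int I, Quant q, Ovf o) {
    assert(W >= 1 && W <= 32);
    const int F = W - I;                                   // fractional bits
    const double scaled = std::ldexp(x, F);                // x * 2^F
    const double r = (q == Quant::RND) ? std::floor(scaled + 0.5)  // to +inf
                                       : std::floor(scaled);       // to -inf
    std::int64_t raw = static_cast<std::int64_t>(r);
    const std::int64_t lo = -(std::int64_t{1} << (W - 1));
    const std::int64_t hi = (std::int64_t{1} << (W - 1)) - 1;
    if (o == Ovf::SAT) {
        if (raw > hi) raw = hi;                            // clamp to maximum
        if (raw < lo) raw = lo;                            // clamp to minimum
    } else {
        const std::int64_t mod = std::int64_t{1} << W;
        raw = ((raw % mod) + mod) % mod;                   // keep the low W bits
        if (raw > hi) raw -= mod;                          // reinterpret as signed
    }
    return std::ldexp(static_cast<double>(raw), -F);       // back to real value
}
```

In this model, storing 1.25 into a <3,2> variable gives 1.0 with truncation and 1.5 with rounding, and storing 19.0 into a <4,4> variable gives 7.0 with saturation and 3.0 with wrap around.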

Vivado HLS limitations

For C and C++ designs only a single clock is supported. The same clock is applied to all functions in the design.
When using Stacked Silicon Interconnect (SSI) technology devices, it is important to ensure that the logic created by Vivado HLS fits within a single Super Logic Region (SLR).

Vivado HLS examples

FPGA resources

Look Up Table (LUT)

The LUTs can be configured as a 6-input LUT with one output or two 5-input LUTs with separate outputs but common addresses or logic inputs. Eight 6-input LUTs and their sixteen storage elements, as well as the multiplexers and arithmetic carry logic, form a slice.

Flip Flop (FF)

DSP Slice

Source: UG579 Figure 1-1

Source: UG579 Figure 2-1

Source: UG579 Figure 3-1

The DSP48E2 slice consists of a 27-bit pre-adder, a 27 x 18 multiplier, a second-stage adder/subtracter/logic unit, and a pattern detector. It produces a 48-bit output. If the multiplier is not used, the DSP slice can also be used as a full 48-bit adder/subtracter and AND/OR/NOT/NAND/NOR/XOR/XNOR logic unit. It also includes a pattern detector that provides support for convergent rounding, overflow/underflow, and counter auto-reset.

The typical use of the slice is to calculate P = (D ± A) * B + C. If the multiplier is not used, A and B can be concatenated as A:B to calculate P = A:B + C. Multiple DSP slices can be cascaded to perform accumulation PCOUT = (D ± A) * B + PCIN.
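
The arithmetic of the typical use case can be sketched as a small behavioural model in plain C++. This is only an illustration under simplifying assumptions: the function names sext and dsp_mac are made up, all operands are treated as signed, the operand widths (27-bit pre-adder, 18-bit B, 48-bit result) follow the description above, and pipeline registers and the other operating modes of the slice are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of v (two's complement).
std::int64_t sext(std::int64_t v, int bits) {
    assert(bits >= 1 && bits <= 62);
    const std::int64_t span = std::int64_t{1} << bits;
    std::int64_t r = v & (span - 1);      // keep the low bits (non-negative)
    if (r >= span / 2) r -= span;         // top bit set: negative value
    return r;
}

// Hypothetical model of P = (D +/- A) * B + C with a 48-bit result.
std::int64_t dsp_mac(std::int64_t a, std::int64_t b, std::int64_t c,
                     std::int64_t d, bool subtract = false) {
    const std::int64_t a27 = sext(a, 27);
    const std::int64_t d27 = sext(d, 27);
    const std::int64_t pre = sext(subtract ? d27 - a27 : d27 + a27, 27);
    const std::int64_t prod = pre * sext(b, 18);   // 27 x 18 multiplier
    return sext(prod + sext(c, 48), 48);           // wraps at 48 bits
}
```

For example, dsp_mac(3, 4, 5, 2) evaluates (2 + 3) * 4 + 5 = 25, and results that exceed the 48-bit range wrap around.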

The A, B, C, D input ports have the following bit widths:

Port	Bit Width	Description
A	30	A[26:0] is the A input of the multiplier or the pre-adder. A[29:0] are the upper bits of the A:B concatenated input.
B	18	The B input of the multiplier. B[17:0] are the lower bits of the A:B concatenated input.
C	48	The C input to the second-stage adder/subtracter, pattern detector, or logic function.
D	27	The D input to the pre-adder or alternative input to the multiplier.

The P, PATTERNDETECT, and PATTERNBDETECT output ports have the following bit widths:

Port	Bit Width	Description
P	48	The P output from the second-stage adder/subtracter or logic function.
PATTERNBDETECT	1	Match indicator between P[47:0] and the complement of the 48-bit pattern.
PATTERNDETECT	1	Match indicator between P[47:0] and the 48-bit pattern.

The DSP slices in the same column can be cascaded to form accumulators, adders, counters, and other more sophisticated operations. The ability is provided by the cascade input ports (ACIN, BCIN, PCIN, CARRYCASCIN, and MULTSIGNIN) and the cascade output ports (ACOUT, BCOUT, PCOUT, CARRYCASCOUT, and MULTSIGNOUT).

Number of DSP slices on Xilinx FPGAs:

Device	# of DSPs
Kintex-7 325T	840
Virtex-7 690T	3,600
Kintex UltraScale KU115	5,520
Virtex UltraScale+ VU9P	6,840
Virtex UltraScale+ VU13P	12,288

Note that Kintex-7 and Virtex-7 FPGAs have DSP48E1 slices, whereas UltraScale and UltraScale+ FPGAs have DSP48E2 slices.

Block RAM

HLS considers one block RAM to be 18K bits. A block RAM has two ports which can each be 1, 2, 4, 9, or 18 bits wide (with depths of 16K, 8K, 4K, 2K, and 1K respectively).
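
Using only the width and depth combinations listed above, a naive count of 18K block RAMs for an array can be sketched as follows. The function bram18k_estimate is hypothetical and is not the algorithm Vivado HLS uses: it simply tiles the array with each configuration and keeps the smallest count, so the number in the synthesis report may differ.

```cpp
#include <cassert>

// Integer division rounded up.
long ceil_div(long a, long b) { return (a + b - 1) / b; }

// Hypothetical estimate: fewest 18K block RAMs needed to hold `depth` words
// of `width_bits` bits, trying each width x depth configuration in turn.
long bram18k_estimate(long width_bits, long depth) {
    assert(width_bits > 0 && depth > 0);
    static const long cfg[5][2] = {
        {1, 16384}, {2, 8192}, {4, 4096}, {9, 2048}, {18, 1024}};
    long best = -1;
    for (const auto& c : cfg) {
        // Blocks side by side to cover the width, stacked to cover the depth.
        const long n = ceil_div(width_bits, c[0]) * ceil_div(depth, c[1]);
        if (best < 0 || n < best) best = n;
    }
    return best;
}
```

Under this model, an array of 1024 32-bit words needs two block RAMs (two 1K x 18 blocks side by side).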

Ultra RAM

Each UltraRAM stores 4096*72 bits, which is 16 times the size of a block RAM. The port width is always 72 bits.

FPGA design considerations

Resource utilization
Design performance
Power consumption
Software runtime
Debugging capability
Portability

FPGA performance metrics

Area : Amount of hardware resources required to implement the design based on the resources available in the FPGA, including look-up tables (LUTs), registers, block RAMs, and DSP48s.

Latency : Number of clock cycles required for the function to compute all output values.

Initiation interval (II) : Number of clock cycles before the function can accept new input data.

Loop iteration latency : Number of clock cycles it takes to complete one iteration of the loop.

Loop initiation interval : Number of clock cycles before the next iteration of the loop starts to process data.

Loop latency : Number of cycles to execute all iterations of the loop.
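
The loop metrics are related by a commonly used first-order model, sketched below. The function names are made up for this example, and the model ignores any extra cycles the tool may add for entering and exiting the loop, so reported latencies can be slightly larger.

```cpp
#include <cassert>

// Pipelined loop: the first iteration takes iteration_latency cycles, and
// each further iteration starts ii cycles after the previous one.
long pipelined_loop_latency(long trip_count, long iteration_latency, long ii) {
    assert(trip_count >= 1 && iteration_latency >= 1 && ii >= 1);
    return iteration_latency + ii * (trip_count - 1);
}

// Non-pipelined loop: iterations run strictly one after another.
long sequential_loop_latency(long trip_count, long iteration_latency) {
    assert(trip_count >= 1 && iteration_latency >= 1);
    return trip_count * iteration_latency;
}
```

For a loop with 100 iterations and an iteration latency of 4 cycles, this model gives 400 cycles without pipelining and 103 cycles when pipelined with II = 1.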
