FPGA and Synthesis Flow
2022-2024
FPGAs are configurable (programmable) integrated circuits that can be used to implement hardware digital functions.
An FPGA contains programmable logic blocks, arranged in an array, and configurable routing elements to connect them. FPGAs can be reconfigured (reprogrammed) in the field.
In addition to the logic blocks, modern FPGAs may contain memory blocks (RAM) and dedicated hard-wired blocks such as multipliers or DSPs, PLLs, special IOs and transceivers.
The best-known FPGA vendors are Xilinx (an AMD division) and Altera (an Intel division). Their FPGAs target a wide range of applications, from embedded systems to large data-center and network appliances. Other manufacturers, such as Microchip/Microsemi or QuickLogic, also produce FPGAs.
A programmable table to implement combinational functions.
A  | I_3 | I_2 | I_1 | I_0 | O
---|-----|-----|-----|-----|-------
0  | 0   | 0   | 0   | 0   | init0
1  | 0   | 0   | 0   | 1   | init1
2  | 0   | 0   | 1   | 0   | init2
3  | 0   | 0   | 1   | 1   | init3
4  | 0   | 1   | 0   | 0   | init4
5  | 0   | 1   | 0   | 1   | init5
6  | 0   | 1   | 1   | 0   | init6
7  | 0   | 1   | 1   | 1   | init7
8  | 1   | 0   | 0   | 0   | init8
9  | 1   | 0   | 0   | 1   | init9
10 | 1   | 0   | 1   | 0   | init10
11 | 1   | 0   | 1   | 1   | init11
12 | 1   | 1   | 0   | 0   | init12
13 | 1   | 1   | 0   | 1   | init13
14 | 1   | 1   | 1   | 0   | init14
15 | 1   | 1   | 1   | 1   | init15
In the FPGA context, a Look-Up Table (LUT) is a configurable digital block that allows the implementation of arbitrary combinational functions.
As shown in the previous figure, it can be seen as a small memory where the input is used as an address (or index) to access one of the configuration memory bits.
Here we have an example of a 4-input LUT. The configuration is a 16-bit word (16 = 2^4) where each bit corresponds to one line in the truth table of the combinational function. We can implement arbitrary combinational 4-bit input functions by changing this configuration.
For example:
- 1000000000000000 (or 0x8000) is the configuration for a 4-input AND,
- 0110011001100110 (or 0x6666) is the configuration for a 2-input XOR (I_0 ^ I_1).
With a 4-input LUT we can implement all (65536 = 2^{2^4}) 4-input combinational functions.
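The "small memory addressed by the inputs" view can be written as a short behavioural model. This is only a minimal sketch and not a vendor primitive: the module names `lut4` and `lut4_examples`, the `INIT` parameter and the port names are assumptions chosen for illustration.

```systemverilog
// Behavioural model of a 4-input LUT (illustrative, not a vendor primitive).
module lut4 #(parameter logic [15:0] INIT = 16'h0000)  // configuration word
            (input  logic [3:0] I,                     // I[3] = I_3 ... I[0] = I_0
             output logic       O);
  // The inputs are used as an address selecting one configuration bit.
  assign O = INIT[I];
endmodule

// The two configurations quoted in the text.
module lut4_examples (input  logic [3:0] I,
                      output logic       and4,
                      output logic       xor2);
  lut4 #(.INIT(16'h8000)) and4_i (.I(I), .O(and4)); // 1 only when I == 4'b1111
  lut4 #(.INIT(16'h6666)) xor2_i (.I(I), .O(xor2)); // I[0] ^ I[1]; I[3:2] ignored
endmodule
```

Only the `INIT` value differs between the two instances: the hardware is the same and the function is defined entirely by the configuration.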
For FPGAs, each configuration bit will be implemented as an SRAM cell (Static RAM). The configuration memory is generally programmed serially within a LUT and from one LUT to the other. This is why configuration files for FPGAs are called bitstreams.
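As a rough illustration of serial programming, the 16 configuration bits of a LUT can be modelled as a shift register whose last bit feeds the next LUT in the chain. This is a sketch under simplifying assumptions: real devices use SRAM cells and vendor-specific configuration logic, and the names `lut4_serial_cfg`, `cfg_clk`, `cfg_en`, `cfg_in` and `cfg_out` are hypothetical.

```systemverilog
module lut4_serial_cfg (
  input  logic       cfg_clk,  // configuration clock (hypothetical)
  input  logic       cfg_en,   // high while the bitstream is shifted in
  input  logic       cfg_in,   // serial configuration input
  output logic       cfg_out,  // goes to cfg_in of the next LUT in the chain
  input  logic [3:0] I,        // functional inputs
  output logic       O         // functional output
);
  logic [15:0] cfg;            // the 16 configuration bits

  // Shift one configuration bit in per configuration clock cycle.
  always_ff @(posedge cfg_clk)
    if (cfg_en) cfg <= {cfg[14:0], cfg_in};

  assign cfg_out = cfg[15];    // the oldest bit moves on to the next LUT
  assign O       = cfg[I];     // normal operation: the inputs address the table
endmodule
```

In this model, shifting a mask in most significant bit first leaves bit 15 of the mask in `cfg[15]` after 16 cycles.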
In modern FPGAs, we often have more complex LUTs, but the principles are the same.
To be able to implement sequential logic, in addition to LUTs, logic cells will include flip-flops and some local configurable routing.
The previous figure shows a simple example of a logic cell. The output of the logic cell can be either the combinational output of the LUT or the output of the flip-flop, depending on a configuration bit.
In modern FPGAs, the structure of the logic element is often more complex, with several flip-flops and even several outputs.
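The simple logic cell described here can be sketched as a LUT followed by a flip-flop and an output multiplexer. The module name `logic_cell` and the parameters `LUT_INIT` and `USE_FF` are assumptions made for this illustration; in a real device they would be bits of the configuration memory.

```systemverilog
module logic_cell #(parameter logic [15:0] LUT_INIT = 16'h0000, // LUT configuration
                    parameter logic        USE_FF   = 1'b0)     // output select bit
                   (input  logic       clk,
                    input  logic [3:0] I,
                    output logic       O);
  logic lut_o;   // combinational output of the LUT
  logic ff_q;    // registered copy of the LUT output

  assign lut_o = LUT_INIT[I];

  always_ff @(posedge clk)
    ff_q <= lut_o;

  // One configuration bit chooses between the two possible outputs.
  assign O = USE_FF ? ff_q : lut_o;
endmodule
```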
To interconnect the logic cells, FPGAs include a configurable interconnect. It connects the different elements (logic cells, IOs,…) through switch boxes that can be programmed to route signals from one cell to another.
For practical reasons, this interconnect is often hierarchical, with different local and global architectures and routing densities.
The FPGA layout is organized in columns. These columns contain different functional cells.
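Conceptually, each programmable connection in a switch box behaves like a multiplexer whose select inputs are configuration bits. The sketch below assumes this simplified view; the module name `routing_mux` and its ports are hypothetical, and actual switch boxes are implemented in vendor-specific ways.

```systemverilog
module routing_mux #(parameter int N = 4)    // number of candidate source wires (power of two)
  (input  logic [N-1:0]         src,         // wires arriving at the switch box
   input  logic [$clog2(N)-1:0] cfg_sel,     // configuration bits, fixed after programming
   output logic                 dst);        // wire leaving towards a logic cell or an IO
  // The configuration selects which source drives the destination.
  assign dst = src[cfg_sel];
endmodule
```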
For this particular FPGA model we can see the following zones:
Surrounding the FPGA cells, we have the input and output cells and the PLL (Phase-Locked Loop) cells for clock management.
The number of blocks and columns depends on the FPGA model (its size), but the architecture of the cells is the same across the FPGA family (Cyclone II here).
The logic cells are organized in local clusters (LABs, for Logic Array Blocks) with several levels of local and global routing cells.
The architecture of the logic element is very standard (simple) for this “old” family of FPGAs.
This family of FPGAs embeds 4-Kbit configurable dual-port synchronous memory blocks. They can be used as single-port or dual-port memories with configurable data widths.
Here we have 18x18 multipliers with configurable input and output registers. They can be used as combinational multipliers or coupled with the registers to implement Digital Signal Processing primitives. A feedback path exists to implement accumulators.
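A multiply-accumulate written at the operator level might look like the following sketch. The module name `mac18`, the `clear` input and the 48-bit accumulator width are assumptions made for the example. Whether a given tool maps this onto an embedded multiplier, and where it places the accumulator, depends on the device and on the tool settings.

```systemverilog
module mac18 (
  input  logic               clk,
  input  logic               clear,   // synchronous clear of the accumulator
  input  logic signed [17:0] a, b,
  output logic signed [47:0] acc      // accumulator width is an assumption
);
  logic signed [17:0] a_r, b_r;       // input registers
  logic signed [35:0] p_r;            // registered 18x18 product

  always_ff @(posedge clk) begin
    a_r <= a;
    b_r <= b;
    p_r <= a_r * b_r;
    if (clear) acc <= '0;
    else       acc <= acc + p_r;      // feedback path: accumulate the products
  end
endmodule
```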
A PLL is an analog component that can be used to generate, from a reference input clock, multiple internal clocks with different frequencies and phase offsets.
RTL stands for Register Transfer Level and refers to the subset of the HDL that describes synchronous and combinational logic.
Synthesis is the process by which the RTL code is transformed into a structural gate-level representation, the netlist. It is generally performed in two phases:
To achieve this process, some constraints must be given to the tools. These constraints will contain information about:
Those constraints can be given either using a dedicated graphical tool or by scripts.
Detailed synthesis flow:
This flow is just an illustration of the synthesis process. Real tools will have more complex flows.
Let us consider the following sequential block, where A is a 4-bit signal (A \in [0:15]).
always_ff @(posedge clk)
  if (A/2) B <= 1;
  else     B <= 0;
The first step is to separate the logic into combinational logic and pure registers.
always_ff @(posedge clk)
  B <= B_c;
always_comb
  if (A/2) B_c = 1;
  else     B_c = 0;
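A  | A_3 | A_2 | A_1 | A_0 | B_c
---|-----|-----|-----|-----|----
0  | 0   | 0   | 0   | 0   | 0
1  | 0   | 0   | 0   | 1   | 0
2  | 0   | 0   | 1   | 0   | 1
3  | 0   | 0   | 1   | 1   | 1
4  | 0   | 1   | 0   | 0   | 1
5  | 0   | 1   | 0   | 1   | 1
6  | 0   | 1   | 1   | 0   | 1
7  | 0   | 1   | 1   | 1   | 1
8  | 1   | 0   | 0   | 0   | 1
9  | 1   | 0   | 0   | 1   | 1
10 | 1   | 0   | 1   | 0   | 1
11 | 1   | 0   | 1   | 1   | 1
12 | 1   | 1   | 0   | 0   | 1
13 | 1   | 1   | 0   | 1   | 1
14 | 1   | 1   | 1   | 0   | 1
15 | 1   | 1   | 1   | 1   | 1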
We get the following boolean equation:
B_c = A_3 | A_2 | A_1;
For the combinational part, logic optimization is performed. In this simple example, we can enumerate all the input values and have a simple hand-made optimization algorithm. For more realistic designs, heuristics and more complex algorithms are used.
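For a function this small, the enumeration of all input values can also be done in simulation. The following sketch checks that the condition `A/2` is non-zero exactly when `A_3 | A_2 | A_1` is 1; the module name `check_b_c` is hypothetical, and a simulator supporting immediate assertions is assumed.

```systemverilog
module check_b_c;
  logic [3:0] A;

  initial begin
    // Enumerate the 16 possible values of the 4-bit input.
    for (int a = 0; a < 16; a++) begin
      A = a[3:0];
      // The RTL condition (A/2 non-zero) against the optimized equation.
      assert ((A/2 != 0) == (A[3] | A[2] | A[1]))
        else $error("mismatch for A = %0d", A);
    end
    $display("all 16 input values checked");
  end
endmodule
```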
For an ASIC target, the mapping depends on the standard cells available in the technology library.
For an FPGA target, combinational functions are mapped to LUTs:
The important information here is the LUT mask that will be used for the configuration.
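The mask is simply the truth table of the function read as a binary word, with the LUT inputs used as the bit index. As a small sketch (the module name `lut3_mask_demo` is hypothetical), the mask of a 3-input OR can be built by enumeration in simulation:

```systemverilog
module lut3_mask_demo;
  logic [7:0] mask;

  initial begin
    for (int idx = 0; idx < 8; idx++)
      // idx[0], idx[1] and idx[2] play the role of the LUT inputs i0, i1, i2.
      mask[idx] = idx[2] | idx[1] | idx[0];
    $display("LUT3 mask = 8'h%h", mask);
  end
endmodule
```

Only index 0 yields a 0, so the mask is 11111110 in binary, that is FE in hexadecimal.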
The final netlist for an FPGA target:
module foo(
  input        clk,
  input  [3:0] A,
  output       B );

  wire A_1 = A[1];
  wire A_2 = A[2];
  wire A_3 = A[3];
  wire B_c;

  // 3-input LUT configured as an OR (mask 8'hFE)
  LUT3 #(.conf(8'hFE))
    lut3_i (.i0(A_1), .i1(A_2), .i2(A_3), .o(B_c));

  // Output register
  REG1 reg_i (.clk(clk), .d(B_c), .Q(B));
endmodule
Once we have generated this netlist, the synthesis process is completed.
The next step in the flow is placement (choosing which cells to use in the FPGA array) and routing, followed by the generation of the programming file (the bitstream). Deep knowledge of the FPGA architecture and programming mechanism is needed for this step, which is generally performed by a tool provided by the FPGA vendor.
The synthesis tool will detect some logic in the RTL code and infer optimized structures.
This applies to:
assign S = A + B;
For arithmetic operators, synthesis does not go through the standard logic optimization phase. The operation is detected in the RTL code, and an optimized architecture for the targeted FPGA is inferred from a technology library.
The figure shows the Cyclone II logic element configured in arithmetic mode. In this mode, the LUT is configured to implement a full adder and generates both the sum and the output carry. Also, the output carry is connected directly to an adjacent logic element, without going through the interconnect, allowing fast carry propagation.
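To make the carry chain concrete, the structure built by the tool can be pictured as a chain of full adders, one per logic element. The sketch below is for illustration only, with hypothetical module names `full_adder` and `ripple_adder`. In RTL one would normally just write `A + B` and let the tool infer the structure.

```systemverilog
// One bit of the chain: what a logic element computes in arithmetic mode.
module full_adder (input  logic a, b, cin,
                   output logic sum, cout);
  assign sum  = a ^ b ^ cin;
  assign cout = (a & b) | (cin & (a ^ b));
endmodule

// N full adders linked by the carry chain.
module ripple_adder #(parameter int N = 4)
  (input  logic [N-1:0] A, B,
   output logic [N-1:0] S,
   output logic         cout);
  logic [N:0] c;                   // c[i] is the carry into bit i
  assign c[0] = 1'b0;

  for (genvar i = 0; i < N; i++) begin : bit_slice
    // The carry output of bit i goes directly to bit i+1.
    full_adder fa (.a(A[i]), .b(B[i]), .cin(c[i]),
                   .sum(S[i]), .cout(c[i+1]));
  end

  assign cout = c[N];
endmodule
```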
The same applies to multiplication: the multiplication operator (* in SystemVerilog) is detected, and a hard-wired multiplier/DSP block is inferred.
This kind of optimization is hard to achieve without detailed knowledge of the FPGA target and is specific to each FPGA. Thus, it is counter-productive to describe an addition at a lower structural level; it is more efficient to keep arithmetic operators at the operator level.
More examples can be found in the FPGA vendors' design guidelines and coding style recommendations:
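module sram(input clk, wr,
            input [7:0] Addr,
            input [7:0] Di,
            output logic [7:0] Do );

  logic [7:0] mem [0:255];

  always_ff @(posedge clk)
  begin
    // Synchronous write
    if (wr)
      mem[Addr] <= Di;
    // Synchronous read (returns the old value when writing to the same address)
    Do <= mem[Addr];
  end
endmodule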
Here is a more complex example of inference. The RTL code describes a table that is accessed synchronously. Also, at each clock cycle, only one element of the table is accessed (read or written).
This behaviour corresponds to that of a synchronous memory. As the size of the table is relatively large, the synthesis tool will automatically infer an embedded memory block. Those memory blocks have a higher density and integrate address decoding and control signals; thus, using them will produce a more efficient design.
Depending on the type of embedded memory blocks available in a family of FPGAs, dual-port memory can also be inferred.
Note that the embedded memory blocks are often synchronous, and inference will not work if the RTL description is not synchronous.
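As a sketch of a description that tools commonly recognise as a simple dual-port memory (one write port and one read port, each with its own address), assuming a single clock; the module name `sdp_ram` and its port names are hypothetical. Whether an embedded block is actually inferred depends on the tool and on the blocks available in the device.

```systemverilog
module sdp_ram #(parameter int AW = 8,          // address width
                 parameter int DW = 8)          // data width
  (input  logic          clk,
   input  logic          we,                    // write enable
   input  logic [AW-1:0] waddr,                 // write port address
   input  logic [AW-1:0] raddr,                 // read port address
   input  logic [DW-1:0] di,
   output logic [DW-1:0] dout);

  logic [DW-1:0] mem [0:2**AW-1];

  always_ff @(posedge clk) begin
    if (we)
      mem[waddr] <= di;
    // The read is registered: a read in a combinational block
    // would not match a synchronous memory block.
    dout <= mem[raddr];
  end
endmodule
```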
© Copyright 2022-2024, Tarik Graba, Télécom Paris.
The content of this page is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence.