Digital Design Introduction

FPGA and Synthesis Flow

Tarik Graba

2022-2024

Field Programmable Gate Arrays (FPGA)

FPGA

FPGAs are configurable (programmable) integrated circuits that can be used to implement digital hardware functions.

An FPGA contains programmable logic blocks, arranged in an array, and configurable routing elements to connect them. An FPGA can be reconfigured (reprogrammed) in the field.

In addition to the logic blocks, modern FPGAs may contain memory blocks (RAM) and dedicated hard-wired blocks such as multipliers or DSPs, PLLs, special IOs and transceivers.

The best-known FPGA manufacturers are Xilinx (AMD division) and Altera (Intel division). Their FPGAs target a wide range of applications, from embedded systems to large data-center and network appliances. Other manufacturers, such as Microchip/Microsemi or QuickLogic, also produce FPGAs.

Look-Up Tables (LUTs)

A programmable table to implement combinational functions.

Look-Up Table (LUT)
 A   | I_3 | I_2 | I_1 | I_0 |    O
-----|-----|-----|-----|-----|--------
 0   |  0  |  0  |  0  |  0  |  init0
 1   |  0  |  0  |  0  |  1  |  init1
 2   |  0  |  0  |  1  |  0  |  init2
 3   |  0  |  0  |  1  |  1  |  init3
 4   |  0  |  1  |  0  |  0  |  init4
 5   |  0  |  1  |  0  |  1  |  init5
 6   |  0  |  1  |  1  |  0  |  init6
 7   |  0  |  1  |  1  |  1  |  init7
 8   |  1  |  0  |  0  |  0  |  init8
 9   |  1  |  0  |  0  |  1  |  init9
 10  |  1  |  0  |  1  |  0  |  init10
 11  |  1  |  0  |  1  |  1  |  init11
 12  |  1  |  1  |  0  |  0  |  init12
 13  |  1  |  1  |  0  |  1  |  init13
 14  |  1  |  1  |  1  |  0  |  init14
 15  |  1  |  1  |  1  |  1  |  init15

In the FPGA context, a Look-Up Table (LUT) is a configurable digital block that allows the implementation of arbitrary combinational functions.

As shown in the previous figure, it can be seen as a small memory where the input is used as an address (or index) to access one of the configuration memory bits.

Here we have an example of a 4-input LUT. The configuration is a 16-bit word (16 = 2^4) where each bit corresponds to one line in the truth table of the combinational function. We can implement arbitrary combinational 4-bit input functions by changing this configuration.

For example:

With a 4-input LUT we can implement all (65536=2^{2^4}) 4-input combinational functions.
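The table lookup described above can be sketched as a small behavioural model. This is only an illustration: the module name `lut4_model`, the parameter name `INIT` and the three example masks are assumptions chosen here, not a vendor primitive.

```systemverilog
// Behavioural model of a 4-input LUT (hypothetical module, not a vendor primitive).
module lut4_model #(
    parameter logic [15:0] INIT = 16'h0000  // one configuration bit per truth-table line
) (
    input  logic [3:0] I,  // {I_3, I_2, I_1, I_0}, used as the address A
    output logic       O
);
    assign O = INIT[I];    // select the configuration bit at address I
endmodule

// Three example configurations, checked against the operators they implement.
module lut4_examples_tb;
    logic [3:0] I;
    logic       o_and, o_or, o_xor;

    lut4_model #(.INIT(16'h8000)) u_and (.I(I), .O(o_and)); // only line 15 is 1
    lut4_model #(.INIT(16'hFFFE)) u_or  (.I(I), .O(o_or));  // every line except line 0 is 1
    lut4_model #(.INIT(16'h6996)) u_xor (.I(I), .O(o_xor)); // lines with an odd number of 1s

    initial begin
        for (int a = 0; a < 16; a++) begin
            I = a[3:0];
            #1;
            assert (o_and == &I) else $error("AND mismatch at %0d", a);
            assert (o_or  == |I) else $error("OR mismatch at %0d", a);
            assert (o_xor == ^I) else $error("XOR mismatch at %0d", a);
        end
        $display("Checked all 16 input values");
        $finish;
    end
endmodule
```

Changing only the `INIT` value changes the implemented function; the structure of the block stays the same.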

For FPGAs, each configuration bit will be implemented as an SRAM cell (Static RAM). The configuration memory is generally programmed serially within a LUT and from one LUT to the other. This is why configuration files for FPGAs are called bitstreams.

In modern FPGAs, we often have more complex LUTs, but the principles are the same.

The Logic Cell

Typical logic cell

To be able to implement sequential logic, in addition to LUTs, logic cells will include flip-flops and some local configurable routing.

The previous figure shows a simple example of a logic cell. The output of the logic cell can be either the combinational output of the LUT or the output of the flip-flop, depending on a configuration bit.

In modern FPGAs, the structure of the logic element is often more complex, with several flip-flops and even several outputs.
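The simple logic cell described above can be sketched as follows. The parameters `LUT_INIT` and `USE_FF` are hypothetical names standing in for the configuration bits; in a real device they are part of the bitstream, not RTL parameters.

```systemverilog
// Sketch of a simple logic cell: 4-input LUT + flip-flop + output selection.
module logic_cell_model #(
    parameter logic [15:0] LUT_INIT = 16'h0000, // LUT configuration word
    parameter bit          USE_FF   = 1'b0      // 0: combinational output, 1: registered output
) (
    input  logic       clk,
    input  logic [3:0] I,
    output logic       O
);
    logic lut_o;  // combinational output of the LUT
    logic ff_q;   // output of the flip-flop

    assign lut_o = LUT_INIT[I];

    always_ff @(posedge clk)
        ff_q <= lut_o;

    // Output multiplexer controlled by the configuration bit
    assign O = USE_FF ? ff_q : lut_o;
endmodule
```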

Interconnect

Configurable interconnect

To interconnect the logic cells, FPGAs include a configurable interconnect. It connects the different elements (logic cells, IOs,…) through switch boxes that can be programmed to route signals from one cell to another.

For practical reasons, this interconnect is often hierarchical, with different local and global architectures and routing densities.

Example Altera/Intel Cyclone II

Global organization

The FPGA layout is organized in columns. These columns contain different functional cells.

For this particular FPGA model we can see the following zones:

Surrounding the FPGA cells we have the input and output cells and the PLL (Phase Locked Loop) cells for clock management.

The number of blocks and columns depends on the FPGA model (its size), but the architecture of the cells is the same across the FPGA family (Cyclone II here).


(LAB) Logic Array Block structure

The logic cells are organized in local clusters (LABs, for Logic Array Blocks) with several levels of local and global routing cells.


Logic Element

The architecture of the logic element is very standard (simple) for this “old” family of FPGAs.


Embedded RAM block

This FPGA family embeds 4 Kbit configurable dual-port synchronous memory blocks. They can be used in single-port or dual-port mode with configurable data widths.


18x18 Multiplier

Here we have 18x18 multipliers with configurable input and output registers. They can be used as combinational multipliers or coupled with the registers to implement Digital Signal Processing primitives. A feedback path exists to implement accumulators.
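As an illustration, a registered multiply-accumulate can be described at the RTL level as below. The module name `mac18`, the port names and the 48-bit accumulator width are assumptions made for this sketch. Whether the registers and the accumulator are absorbed into the dedicated block or built from logic elements depends on the target and on the synthesis tool.

```systemverilog
// Hypothetical multiply-accumulate block sized for an 18x18 multiplier.
module mac18 (
    input  logic               clk,
    input  logic               clear,  // synchronous clear of the accumulator
    input  logic signed [17:0] a, b,
    output logic signed [47:0] acc
);
    logic signed [17:0] a_r, b_r;  // input registers
    logic signed [35:0] p;         // full-precision 18x18 product

    assign p = a_r * b_r;

    always_ff @(posedge clk) begin
        a_r <= a;
        b_r <= b;
        if (clear) acc <= '0;
        else       acc <= acc + p;  // feedback path: accumulate the product
    end
endmodule
```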


Phase Locked Loop

A PLL is an analog component that can be used to generate, from a reference input clock, multiple internal clocks with different frequencies and phase offsets.

Synthesis Flow

FPGA Flow

RTL stands for Register Transfer Level and denotes the subset of the HDL that describes synchronous and combinational logic.

Synthesis is the process where the RTL code is transformed into a structural gate-level representation, the netlist. It is generally performed in two phases:

To achieve this process, some constraints must be given to the tools. These constraints will contain information about:

Those constraints can be given either using a dedicated graphical tool or by scripts.

Synthesis

Detailed synthesis flow:

 

This flow is just an illustration of the synthesis process. Real tools will have more complex flows.


Example:

Let us consider the following sequential block, where A is a 4-bit signal (A \in [0:15]).

always_ff @(posedge clk)
  if(A/2) B <= 1;
  else    B <= 0;

Combinational/sequential separation

always_ff @(posedge clk)
  B <= B_c;

always_comb
  if(A/2) B_c = 1;
  else    B_c = 0;

The first step is to separate the description into purely combinational logic and pure registers.


Logic optimization:

   A    A_3   A_2   A_1   A_0   B_c
  -----------------------------------
   0     0     0     0     0     0
   1     0     0     0     1     0
   2     0     0     1     0     1
   3     0     0     1     1     1
   4     0     1     0     0     1
   5     0     1     0     1     1
   6     0     1     1     0     1
   7     0     1     1     1     1
   8     1     0     0     0     1
   9     1     0     0     1     1
   10    1     0     1     0     1
   11    1     0     1     1     1
   12    1     1     0     0     1
   13    1     1     0     1     1
   14    1     1     1     0     1
   15    1     1     1     1     1

We get the following boolean equation:

  B_c = A_3 | A_2 | A_1;

For the combinational part, logic optimization is performed. In this simple example, we can enumerate all the input values and have a simple hand-made optimization algorithm. For more realistic designs, heuristics and more complex algorithms are used.
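The enumeration can also be left to a simulator. The sketch below checks that the equation matches the truth table above, that is, that `A_3 | A_2 | A_1` is 1 exactly when `A/2` is non-zero. The testbench name and signal names are chosen here for illustration.

```systemverilog
module optim_check_tb;
    logic [3:0] A;
    logic       from_table, from_equation;

    assign from_table    = (A / 2) != 0;        // truth table: 1 for every A >= 2
    assign from_equation = A[3] | A[2] | A[1];  // optimized equation

    initial begin
        for (int i = 0; i < 16; i++) begin
            A = i[3:0];
            #1;
            assert (from_table == from_equation)
                else $error("Mismatch for A = %0d", i);
        end
        $display("Checked all 16 input values");
        $finish;
    end
endmodule
```

Exhaustive enumeration is only practical for a handful of inputs, which is why real tools rely on heuristics.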


Technology Mapping:

For an ASIC, the mapping depends on the standard cells available in the technology library.

 
 

For an FPGA target, combinational functions are mapped to LUTs:

 

The important information here is the LUT mask that will be used for the configuration.
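As a sketch of how such a mask can be obtained, the function below fills an 8-bit mask for a 3-input LUT by evaluating the OR of the three inputs on each truth-table line. The function name `or3_mask` is hypothetical, and real tools derive the mask from their internal logic representation rather than from code like this.

```systemverilog
module lut_mask_tb;
    // Build the 8-bit mask of a 3-input LUT implementing i2 | i1 | i0.
    function automatic logic [7:0] or3_mask();
        logic [7:0] mask;
        for (int addr = 0; addr < 8; addr++)
            mask[addr] = |addr[2:0];  // value of the function on truth-table line 'addr'
        return mask;
    endfunction

    initial begin
        // Only line 0 (all inputs at 0) gives 0, so the mask is 8'b1111_1110.
        assert (or3_mask() == 8'hFE) else $error("Unexpected mask");
        $finish;
    end
endmodule
```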


The final Netlist for an FPGA target:

 
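module foo(
   input      clk,
   input[3:0] A,
   output     B  );

wire A_1 = A[1];
wire A_2 = A[2];
wire A_3 = A[3];
wire B_c;

// 3-input LUT configured as an OR: mask 8'hFE is 1 on every line except line 0
LUT3 #(.conf(8'hFE))
  lut3_i (.i0(A_1),.i1(A_2),.i2(A_3),
          .o(B_c));
// Register sampling the LUT output
REG1
  reg_i (.clk(clk), .d(B_c), .Q(B));

endmodule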

Once we have generated this netlist, the synthesis process is completed.

The next step in the flow is the placement (which cells to use in the FPGA array) and the routing, before the generation of the programming file (the bitstream). Deep knowledge of the FPGA architecture and programming mechanism is needed for this step, which is usually performed by a tool provided by the FPGA vendor.

Inference

The synthesis tool will detect some logic in the RTL code and infer optimized structures.

This applies to:


Example: Addition

assign S = A + B;
Logic element in arithmetic mode

For arithmetic operators, the synthesis will not go through the standard logic optimization phase. The operation is detected in the RTL code and an architecture optimized for the targeted FPGA is inferred from a technology library.

The figure shows the Cyclone II logic element configured in arithmetic mode. In this mode, the LUT is configured to implement a full adder and generates both the sum and the output carry. Also, the output carry is connected directly to an adjacent logic element, without going through the interconnect, allowing fast carry propagation.

The same applies to multiplication: the multiplication operator (* in SystemVerilog) is detected, and a hard-wired multiplier/DSP block is inferred.

This kind of optimization is hard to achieve without a detailed knowledge of the FPGA target and is specific to each FPGA. Thus, it is counter-productive to describe an addition at a lower structural level, and it is more efficient to keep the representation of arithmetic operators at the operator level.
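A minimal sketch of an operator-level adder that also exposes the carry out is given below. The module name `adder` and the default width are arbitrary choices for the example.

```systemverilog
module adder #(
    parameter int W = 8  // operand width, arbitrary choice for the example
) (
    input  logic [W-1:0] A, B,
    output logic [W-1:0] S,
    output logic         Cout
);
    // The left-hand side is W+1 bits wide, so the addition is evaluated
    // on W+1 bits and the carry out is kept.
    assign {Cout, S} = A + B;
endmodule
```

Nothing in this description mentions full adders or carry chains; choosing them is left to the synthesis tool.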

More examples can be found in the FPGA vendors' design guidelines and coding style recommendations:


Example: Synchronous Memory Block

module sram(input clk, wr,
            input  [7:0] Addr,
            input  [7:0] Di,
            output logic [7:0] Do );


logic[7:0] mem [0:255];

always_ff @(posedge clk)
begin
   if (wr)
      mem[Addr] <= Di;
   Do <= mem[Addr];
end

endmodule
Simple Synchronous RAM

Here is a more complex example of inference. The RTL code describes a table that is accessed in a synchronous manner. Also, at each clock cycle only one element of the table is accessed (read or written).

This behaviour corresponds to the behaviour of a synchronous memory. As the size of the table is relatively large, the synthesis tool will automatically infer an embedded memory block. Those memory blocks have a higher density and integrate address decoding and control signals; thus, using them will produce a more efficient design.

Depending on the type of embedded memory blocks available in a family of FPGAs, dual-port memory can also be inferred.
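As an illustration, a simple dual-port memory (one write port, one read port) could be described as follows. The module and port names are assumptions made for this sketch, and the read-during-write behaviour that a tool accepts for inference varies between FPGA families, so the vendor coding guidelines remain the reference.

```systemverilog
module sdp_ram (
    input  logic       clk,
    input  logic       wr,
    input  logic [7:0] WAddr,  // write port address
    input  logic [7:0] Di,
    input  logic [7:0] RAddr,  // read port address
    output logic [7:0] Do
);
    logic [7:0] mem [0:255];

    always_ff @(posedge clk) begin
        if (wr)
            mem[WAddr] <= Di;
        // Synchronous read: on a write to the same address, the old data is read
        Do <= mem[RAddr];
    end
endmodule
```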

Note that the embedded memory blocks are often synchronous, and inference will not work if the RTL description is not synchronous.
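For contrast, here is a sketch of a description with a combinational (asynchronous) read. The module name `async_read_ram` is hypothetical. With such a description, a tool targeting purely synchronous memory blocks would typically fall back to registers and multiplexers, or to LUT-based memory where the family provides it; the exact outcome is tool and device dependent.

```systemverilog
module async_read_ram (
    input  logic       clk, wr,
    input  logic [7:0] Addr,
    input  logic [7:0] Di,
    output logic [7:0] Do
);
    logic [7:0] mem [0:255];

    always_ff @(posedge clk)
        if (wr) mem[Addr] <= Di;

    assign Do = mem[Addr];  // read is not clocked: data follows Addr immediately
endmodule
```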
