Digital Design Introduction

Binary Arithmetics (Additional notes)

Tarik Graba

2022-2024

Binary representation of Natural Numbers

Non-signed numbers

A natural number N is represented in binary by a vector (a_{n-1}, \dots , a_1, a_0).

Each element a_i can have the value 0 or 1 and is called bit for “binary element

The value of this number is:

N = \sum_{i=0}^{n-1} a_i 2^i

With a fixed number of bits n, we can only represent the numbers from 0 to 2^n-1.

Examples

Binary representation of 42:

\begin{aligned} 42 & = 32 + 8 + 2 \\ & = 1\cdot 32 + 0\cdot 16 + 1\cdot 8 + 0\cdot 4 + 1\cdot 2 + 0\cdot 1\\ & = 1\cdot 2^5 + 0\cdot 2^4 + 1\cdot 2^3 + 0\cdot 2^2 + 1\cdot 2^1 + 0\cdot 2^0\\ \end{aligned}

(42)_{10} = (101010)_2

What number is represented in binary by 11001101?

\begin{array}{l c l c l c l c l c l c l c l c l} N & = & 1\cdot 2^0 & + & 0\cdot 2^1 & + & 1\cdot 2^2 & + & 1\cdot 2^3 & + & 0\cdot 2^4 & + & 0\cdot 2^5 & + & 1\cdot 2^6 & + & 1\cdot 2^7 \\ & = & 1\cdot 1 & + & 0\cdot 2 & + & 1\cdot 4 & + & 1\cdot 8 & + & 0\cdot 16 & + & 0\cdot 32 & + & 1\cdot 64 & + & 1\cdot 128 \\ & = & 1 & + & 0 & + & 4 & + & 8 & + & 0 & + & 0 & + & 64 & + & 128 \\ & = & 205 &&&&&&&&&&&&&& \end{array}

To find the binary representation of a decimal number, we need to decompose it into the sum of powers of 2. This can be achieved by consecutively dividing (euclidean division) the number by 2 and keeping the reminder of each step.

Example

What is the representation of 42 in binary?

\begin{array}{r c l c c c c c} & & & & quotient & & reminder & bit index\\ \\ 42 &=& 2 &\times& 21 &+& 0 & 0 \\ 21 &=& 2 &\times& 10 &+& 1 & 1 \\ 10 &=& 2 &\times& 5 &+& 0 & 2 \\ 5 &=& 2 &\times& 2 &+& 1 & 3 \\ 2 &=& 2 &\times& 1 &+& 0 & 4 \\ 1 &=& & & & & 1 & 5 \end{array}

Example

How many bits do we need to represent 113 in binary?

\begin{array}{r c l c l } 64 &<& 113 &<& 128\\ 2^6 &<& 113 &<& 2^7\\ \end{array}

At least 7 bits are needed.

(113)_{10} = (1110001)_2

Two’s complement signed representation

The value of signed number N represented by a vector of bits (a_{n-1}, a_{n-2}, \dots , a_1, a_0), is:

N = - a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i

The most significant bit (or msb) (a_{n-1}) holds the sign information.

The largest positive value that we can represent with n bits is 2^{n-1}-1.

The smallest negative value that we can represent with n bits is -2^{n-1}.

Example

4 bits signed numbers:

7 0111
6 0110
5 0101
4 0100
3 0011
2 0010
1 0001
0 0000
-1 1111 (-8+7)
-2 1110 (-8+6)
-3 1101 (-8+5)
-4 1100 (-8+4)
-5 1011 (-8+3)
-6 1010 (-8+2)
-7 1001 (-8+1)
-8 1000 (-8+0)

4 bits signed numbers range is [-2^3 : 2^3-1] (or [-8 : 7]).

-N?

For a signed number N = (a_{n-1}, a_{n-2}, \dots a_0) we can show that:

-N = \overline{N} + 1

i.e. The bitwise boolean invert plus one.

Why?

\begin{split} -N & = -(-a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i) \\ & = - (-a_{n-1}) 2^{n-1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (-a_{n-1} +1 - 1) 2^{n-1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 2^{n-1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 2^{n-1} + (1 - 1 ) + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 1 +\frac{(2^{n-1} -1)}{2 -1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 1 + \sum_{i=0}^{n-2} 2^i + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = 1 + - (1-a_{n-1}) 2^{n-1} + \sum_{i=0}^{n-2} (1 - a_i) 2^i \\ \end{split}

Also, for one bit number, we have \forall x \in\left\{0,1\right\} : 1-x = \overline{x}, indeed:

\begin{array}{r c c c l} 1 - 1 & = & 0 & = & \overline{1}\\ 1 - 0 & = & 1 & = & \overline{0}\\ \end{array}

And so:

-N = 1 + - (\overline{a_{n-1}}) 2^{n-1} + \sum_{i=0}^{n-2} \overline{a_i} 2^i

Example

-7 on four bits knowing that +7 = (0111):

(-7)_{10} = \overline{(0111)} + 1 = (1000) + 1 = (1001)

+1 on four bits knowing that -1 = (1111):

(1)_{10} = \overline{(1111)} + 1 = (0000) + 1 = (0001)

Sign extension

How to extend the representation of a signed number from n bits to a larger m bits representation?

Let’s start with N = (a_{n-1},a_{n-2}, \dots, a_0)_{n} a signed number represented with n bits.

\begin{split} N & = - a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i \\ & = - a_{n-1} 2^{n-1} \cdot (2-1) + \sum_{i=0}^{n-2} a_i 2^i \\ & = - a_{n-1} 2^{n} + a_{n-1} 2^{n-1}+ \sum_{i=0}^{n-2} a_i 2^i \\ N & = - a_{n-1} 2^{n} + \sum_{i=0}^{n-1} a_i 2^i \\ \end{split}

N = (a_{n-1},a_{n-1},a_{n-2}, \dots, a_0)_{n+1} is the same number represented with n+1 bits.

We went from n to n+1 bits by duplicating the most significant bit. As this bit contains the sign information, the extension of signed integer is called “sign extension”.

By induction, we can generalize this extension. Extending a signed number form n to m bits can be done by duplicating the most significant bit m-n times.

Example

4 bits 8 bits
7 0111 00000111
6 0110 00000110
5 0101 00000101
4 0100 00000100
3 0011 00000011
2 0010 00000010
1 0001 00000001
0 0000 00000000
-1 1111 11111111
-2 1110 11111110
-3 1101 11111101
-4 1100 11111100
-5 1011 11111011
-6 1010 11111010
-7 1001 11111001
-8 1000 11111000

Addition

1 bit addiion

\begin{array}{r c l c l c l} 0+0 &=& 0 &\rightarrow & 0 & & \text{no carry} \\ 0+1 &=& 1 &\rightarrow & 1 & & \text{no carry} \\ 1+0 &=& 1 &\rightarrow & 1 & & \text{no carry} \\ 1+1 &=& 2 &\rightarrow & 0+1\times 2 & & \text{a carry is generated} \\ \end{array}

The addition of two bits can not always be represented on one bit. When a carry is produced, 2 bits are needed to represent the result.

More than 1 bit

For multi-bit binary addition, the carry must be propagated from one column to the next (next power of two).

As the addition can produce a final carry, the result of the addition should be represented with an additional bit.

Example

\begin{array}{l c c c c } & & 1 & 1 & 1 \\ + & & 0 & 0 & 1 \\ \hline = & 1 & 0 & 0 & 0 \\ \end{array}

Overflow?

To avoid addition overflow, we must represent the result of the addition of two n bits numbers on n+1 bits.

In deed, pour two numbers A and B represented on n bits:

\begin{array}{r c l} A & \leq & 2^{n}-1 \\ B & \leq & 2^{n}-1 \\ A+B & \leq & 2^{n+1}-2 < 2^{n+1} \\ \end{array}

A+B is always valid with n+1 bits \forall (A,B)

Signed addition

Binary signed addition (Two’s complement) is similar to unsigned addition. The sole differences will be in the overflow and the final carry interpretation.

Example 1

\begin{array}{l c c c c | r | r} & & & & & unsigned & signed\\ \hline & & 1 & 1 & 1 & 7 & -1\\ + & & 0 & 0 & 1 & 1 & 1\\ \hline = & \textcolor{red}{1} & 0 & 0 & 0 & 8 & 0 \text{ or } -8? \end{array}

In this example, how to interpret the final carry?

Example 2

\begin{array}{l c c c c | r | r} & & & & & unsigned & signed\\ \hline & & 0 & 1 & 1 & 3 & 3\\ + & & 0 & 0 & 1 & 1 & 1\\ \hline = & \textcolor{red}{0} & 1 & 0 & 0 & 4 & -4 \end{array}

Here the final carry is null, but the signed result is incorrect.

\begin{array}{l c c c c | r | r} & & & & & unsigned & signed\\ \hline & & 1 & 1 & 1 & 7 & -1\\ + & & 1 & 0 & 0 & 4 & -4\\ \hline = & \textcolor{red}{1} & 0 & 1 & 1 & 11 & +3 + \text{carry?} \end{array}

Here the signed result is incorrect when representing the result on only 3 bits and the carry must be considered.

The rules to detect an overflow and interpret correctly the final carry in a signed addition are:

General purpose processors that have fixed width arithmetic operators generally provide means to detect overflow.

On the opposite, when building a hardware arithmetic operator for signal processing, we want to avoid overflow. The width of addition result will always be one bit more than the width of the operators.

In deed, pour two numbers A and B represented on n bits:

\begin{array}{r c c l} -2^{n-1} \leq & A & \leq & 2^{n-1}-1 \\ -2^{n-1} \leq & B & \leq & 2^{n-1}-1 \\ -2^{n-1} \leq & A+B & \leq & 2^{n-1}-2 < 2^{n-1} \\ \end{array}

A+B is here also always valid with n+1 bits \forall (A,B)

In that case, there is a simple method to always yield a correct result. The operators must be extended before performing the addition, and, as we know that the result will be correctly represented on that size, we can safely ignore the final carry.

This recommendation will work for both signed and unsigned additions. The only difference will be how to perform the extension, sign extended if the numbers are signed, 0 extended otherwise.

Examples for signed numbers

\begin{array}{l c c c c c | r } & & \textcolor{blue}{1} & 1 & 1 & 1 & -1 \\ + & & \textcolor{blue}{1} & 1 & 0 & 0 & -4 \\ \hline = &\textcolor{red}{\cancel{1}} & 1 & 0 & 1 & 1 & -5 \end{array}

\begin{array}{l c c c c c | r } & & \textcolor{blue}{1} & 1 & 1 & 1 & -1 \\ + & & \textcolor{blue}{0} & 0 & 0 & 1 & 1 \\ \hline = &\textcolor{red}{\cancel{1}} & 0 & 0 & 0 & 0 & 0 \end{array}

\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 0 & 1 & 1 & 3\\ + & & \textcolor{blue}{0} & 1 & 0 & 0 & 4 \\ \hline = &\textcolor{red}{\cancel{0}} & 0 & 1 & 1 & 1 & 7 \end{array}

Examples for unsigned numbers

\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 1 & 1 & 1 & 7\\ + & & \textcolor{blue}{0} & 1 & 0 & 0 & 4 \\ \hline = &\textcolor{red}{\cancel{0}} & 1 & 0 & 1 & 1 & 11 \end{array}

\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 1 & 1 & 1 & 7\\ + & & \textcolor{blue}{0} & 0 & 0 & 1 & 1 \\ \hline = &\textcolor{red}{\cancel{0}} & 1 & 0 & 0 & 0 & 8 \end{array}

\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 0 & 1 & 1 & 3\\ + & & \textcolor{blue}{0} & 1 & 0 & 0 & 4 \\ \hline = &\textcolor{red}{\cancel{0}} & 0 & 1 & 1 & 1 & 7 \end{array}

Back to the index