Binary Arithmetics (Additional notes)
2022-2024
A natural number N is represented in binary by a vector (a_{n-1}, \dots , a_1, a_0).
Each element a_i can have the value 0 or 1 and is called bit for “binary element”
The value of this number is:
N = \sum_{i=0}^{n-1} a_i 2^i
With a fixed number of bits n, we can only represent the numbers from 0 to 2^n-1.
Binary representation of 42:
\begin{aligned} 42 & = 32 + 8 + 2 \\ & = 1\cdot 32 + 0\cdot 16 + 1\cdot 8 + 0\cdot 4 + 1\cdot 2 + 0\cdot 1\\ & = 1\cdot 2^5 + 0\cdot 2^4 + 1\cdot 2^3 + 0\cdot 2^2 + 1\cdot 2^1 + 0\cdot 2^0\\ \end{aligned}
(42)_{10} = (101010)_2
What number is represented in binary by 11001101?
\begin{array}{l c l c l c l c l c l c l c l c l} N & = & 1\cdot 2^0 & + & 0\cdot 2^1 & + & 1\cdot 2^2 & + & 1\cdot 2^3 & + & 0\cdot 2^4 & + & 0\cdot 2^5 & + & 1\cdot 2^6 & + & 1\cdot 2^7 \\ & = & 1\cdot 1 & + & 0\cdot 2 & + & 1\cdot 4 & + & 1\cdot 8 & + & 0\cdot 16 & + & 0\cdot 32 & + & 1\cdot 64 & + & 1\cdot 128 \\ & = & 1 & + & 0 & + & 4 & + & 8 & + & 0 & + & 0 & + & 64 & + & 128 \\ & = & 205 &&&&&&&&&&&&&& \end{array}
To find the binary representation of a decimal number, we need to decompose it into the sum of powers of 2. This can be achieved by consecutively dividing (euclidean division) the number by 2 and keeping the reminder of each step.
What is the representation of 42 in binary?
\begin{array}{r c l c c c c c} & & & & quotient & & reminder & bit index\\ \\ 42 &=& 2 &\times& 21 &+& 0 & 0 \\ 21 &=& 2 &\times& 10 &+& 1 & 1 \\ 10 &=& 2 &\times& 5 &+& 0 & 2 \\ 5 &=& 2 &\times& 2 &+& 1 & 3 \\ 2 &=& 2 &\times& 1 &+& 0 & 4 \\ 1 &=& & & & & 1 & 5 \end{array}
How many bits do we need to represent 113 in binary?
\begin{array}{r c l c l } 64 &<& 113 &<& 128\\ 2^6 &<& 113 &<& 2^7\\ \end{array}
At least 7 bits are needed.
(113)_{10} = (1110001)_2
The value of signed number N represented by a vector of bits (a_{n-1}, a_{n-2}, \dots , a_1, a_0), is:
N = - a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i
The most significant bit (or msb) (a_{n-1}) holds the sign information.
The largest positive value that we can represent with n bits is 2^{n-1}-1.
The smallest negative value that we can represent with n bits is -2^{n-1}.
4 bits signed numbers:
7 | 0111 | ||
6 | 0110 | ||
5 | 0101 | ||
4 | 0100 | ||
3 | 0011 | ||
2 | 0010 | ||
1 | 0001 | ||
0 | 0000 | ||
-1 | 1111 | (-8+7) | |
-2 | 1110 | (-8+6) | |
-3 | 1101 | (-8+5) | |
-4 | 1100 | (-8+4) | |
-5 | 1011 | (-8+3) | |
-6 | 1010 | (-8+2) | |
-7 | 1001 | (-8+1) | |
-8 | 1000 | (-8+0) |
4 bits signed numbers range is [-2^3 : 2^3-1] (or [-8 : 7]).
For a signed number N = (a_{n-1}, a_{n-2}, \dots a_0) we can show that:
-N = \overline{N} + 1
i.e. The bitwise boolean invert plus one.
\begin{split} -N & = -(-a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i) \\ & = - (-a_{n-1}) 2^{n-1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (-a_{n-1} +1 - 1) 2^{n-1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 2^{n-1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 2^{n-1} + (1 - 1 ) + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 1 +\frac{(2^{n-1} -1)}{2 -1} + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = - (1-a_{n-1}) 2^{n-1} + 1 + \sum_{i=0}^{n-2} 2^i + \sum_{i=0}^{n-2} (-a_i) 2^i \\ & = 1 + - (1-a_{n-1}) 2^{n-1} + \sum_{i=0}^{n-2} (1 - a_i) 2^i \\ \end{split}
Also, for one bit number, we have \forall x \in\left\{0,1\right\} : 1-x = \overline{x}, indeed:
\begin{array}{r c c c l} 1 - 1 & = & 0 & = & \overline{1}\\ 1 - 0 & = & 1 & = & \overline{0}\\ \end{array}
And so:
-N = 1 + - (\overline{a_{n-1}}) 2^{n-1} + \sum_{i=0}^{n-2} \overline{a_i} 2^i
-7 on four bits knowing that +7 = (0111):
(-7)_{10} = \overline{(0111)} + 1 = (1000) + 1 = (1001)
+1 on four bits knowing that -1 = (1111):
(1)_{10} = \overline{(1111)} + 1 = (0000) + 1 = (0001)
How to extend the representation of a signed number from n bits to a larger m bits representation?
Let’s start with N = (a_{n-1},a_{n-2}, \dots, a_0)_{n} a signed number represented with n bits.
\begin{split} N & = - a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i \\ & = - a_{n-1} 2^{n-1} \cdot (2-1) + \sum_{i=0}^{n-2} a_i 2^i \\ & = - a_{n-1} 2^{n} + a_{n-1} 2^{n-1}+ \sum_{i=0}^{n-2} a_i 2^i \\ N & = - a_{n-1} 2^{n} + \sum_{i=0}^{n-1} a_i 2^i \\ \end{split}
N = (a_{n-1},a_{n-1},a_{n-2}, \dots, a_0)_{n+1} is the same number represented with n+1 bits.
We went from n to n+1 bits by duplicating the most significant bit. As this bit contains the sign information, the extension of signed integer is called “sign extension”.
By induction, we can generalize this extension. Extending a signed number form n to m bits can be done by duplicating the most significant bit m-n times.
4 bits | 8 bits | |||
---|---|---|---|---|
7 | 0111 | 00000111 | ||
6 | 0110 | 00000110 | ||
5 | 0101 | 00000101 | ||
4 | 0100 | 00000100 | ||
3 | 0011 | 00000011 | ||
2 | 0010 | 00000010 | ||
1 | 0001 | 00000001 | ||
0 | 0000 | 00000000 | ||
-1 | 1111 | 11111111 | ||
-2 | 1110 | 11111110 | ||
-3 | 1101 | 11111101 | ||
-4 | 1100 | 11111100 | ||
-5 | 1011 | 11111011 | ||
-6 | 1010 | 11111010 | ||
-7 | 1001 | 11111001 | ||
-8 | 1000 | 11111000 |
\begin{array}{r c l c l c l} 0+0 &=& 0 &\rightarrow & 0 & & \text{no carry} \\ 0+1 &=& 1 &\rightarrow & 1 & & \text{no carry} \\ 1+0 &=& 1 &\rightarrow & 1 & & \text{no carry} \\ 1+1 &=& 2 &\rightarrow & 0+1\times 2 & & \text{a carry is generated} \\ \end{array}
The addition of two bits can not always be represented on one bit. When a carry is produced, 2 bits are needed to represent the result.
For multi-bit binary addition, the carry must be propagated from one column to the next (next power of two).
As the addition can produce a final carry, the result of the addition should be represented with an additional bit.
\begin{array}{l c c c c } & & 1 & 1 & 1 \\ + & & 0 & 0 & 1 \\ \hline = & 1 & 0 & 0 & 0 \\ \end{array}
To avoid addition overflow, we must represent the result of the addition of two n bits numbers on n+1 bits.
In deed, pour two numbers A and B represented on n bits:
\begin{array}{r c l} A & \leq & 2^{n}-1 \\ B & \leq & 2^{n}-1 \\ A+B & \leq & 2^{n+1}-2 < 2^{n+1} \\ \end{array}
A+B is always valid with n+1 bits \forall (A,B)
Binary signed addition (Two’s complement) is similar to unsigned addition. The sole differences will be in the overflow and the final carry interpretation.
\begin{array}{l c c c c | r | r} & & & & & unsigned & signed\\ \hline & & 1 & 1 & 1 & 7 & -1\\ + & & 0 & 0 & 1 & 1 & 1\\ \hline = & \textcolor{red}{1} & 0 & 0 & 0 & 8 & 0 \text{ or } -8? \end{array}
In this example, how to interpret the final carry?
If the inputs are considered unsigned, the carry must be kept.
If the inputs are considered signed, the carry must be removed.
\begin{array}{l c c c c | r | r} & & & & & unsigned & signed\\ \hline & & 0 & 1 & 1 & 3 & 3\\ + & & 0 & 0 & 1 & 1 & 1\\ \hline = & \textcolor{red}{0} & 1 & 0 & 0 & 4 & -4 \end{array}
Here the final carry is null, but the signed result is incorrect.
\begin{array}{l c c c c | r | r} & & & & & unsigned & signed\\ \hline & & 1 & 1 & 1 & 7 & -1\\ + & & 1 & 0 & 0 & 4 & -4\\ \hline = & \textcolor{red}{1} & 0 & 1 & 1 & 11 & +3 + \text{carry?} \end{array}
Here the signed result is incorrect when representing the result on only 3 bits and the carry must be considered.
The rules to detect an overflow and interpret correctly the final carry in a signed addition are:
If the two operands have different signs, we have the guaranty that the result will not overflow. The carry can be ignored.
If the two operands have the same signs, and the result have the same sign, it means that there is no overflow. The final carry can be ignored.
If the two operands have the same signs, and the result have a different sign, it means that there is an overflow. The final carry must be considered.
General purpose processors that have fixed width arithmetic operators generally provide means to detect overflow.
On the opposite, when building a hardware arithmetic operator for signal processing, we want to avoid overflow. The width of addition result will always be one bit more than the width of the operators.
In deed, pour two numbers A and B represented on n bits:
\begin{array}{r c c l} -2^{n-1} \leq & A & \leq & 2^{n-1}-1 \\ -2^{n-1} \leq & B & \leq & 2^{n-1}-1 \\ -2^{n-1} \leq & A+B & \leq & 2^{n-1}-2 < 2^{n-1} \\ \end{array}
A+B is here also always valid with n+1 bits \forall (A,B)
In that case, there is a simple method to always yield a correct result. The operators must be extended before performing the addition, and, as we know that the result will be correctly represented on that size, we can safely ignore the final carry.
This recommendation will work for both signed and unsigned additions. The only difference will be how to perform the extension, sign extended if the numbers are signed, 0 extended otherwise.
\begin{array}{l c c c c c | r } & & \textcolor{blue}{1} & 1 & 1 & 1 & -1 \\ + & & \textcolor{blue}{1} & 1 & 0 & 0 & -4 \\ \hline = &\textcolor{red}{\cancel{1}} & 1 & 0 & 1 & 1 & -5 \end{array}
\begin{array}{l c c c c c | r } & & \textcolor{blue}{1} & 1 & 1 & 1 & -1 \\ + & & \textcolor{blue}{0} & 0 & 0 & 1 & 1 \\ \hline = &\textcolor{red}{\cancel{1}} & 0 & 0 & 0 & 0 & 0 \end{array}
\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 0 & 1 & 1 & 3\\ + & & \textcolor{blue}{0} & 1 & 0 & 0 & 4 \\ \hline = &\textcolor{red}{\cancel{0}} & 0 & 1 & 1 & 1 & 7 \end{array}
\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 1 & 1 & 1 & 7\\ + & & \textcolor{blue}{0} & 1 & 0 & 0 & 4 \\ \hline = &\textcolor{red}{\cancel{0}} & 1 & 0 & 1 & 1 & 11 \end{array}
\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 1 & 1 & 1 & 7\\ + & & \textcolor{blue}{0} & 0 & 0 & 1 & 1 \\ \hline = &\textcolor{red}{\cancel{0}} & 1 & 0 & 0 & 0 & 8 \end{array}
\begin{array}{l c c c c c | r } & & \textcolor{blue}{0} & 0 & 1 & 1 & 3\\ + & & \textcolor{blue}{0} & 1 & 0 & 0 & 4 \\ \hline = &\textcolor{red}{\cancel{0}} & 0 & 1 & 1 & 1 & 7 \end{array}
© Copyright 2022-2024, Tarik Graba, Télécom Paris. | |
Le contenu de cette page est mis à disposition selon les termes de la licence Creative Commons Attribution - Partage dans les Mêmes Conditions 4.0 International. | |
The content of this page is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence. |