Hello, this is Duck.
Last time, I showed you how to represent signed numbers. This time, we will talk about floating point.
A floating-point number is a number whose decimal point is not held at a fixed position. This lets it represent a wide range of values, from very large to very small.
There are many ways to represent floating-point numbers, but here we will focus on 32-bit single-precision floating-point numbers that conform to the IEEE 754 standard. Throughout this article, "floating point" refers to this type of number.
Structure of floating point
Since the position of the decimal point is not fixed, the representation is similar to exponent (scientific) notation for decimal numbers.
In that notation, a number is written as a sign, a mantissa, and an exponent; for example, -10.25 is -1.025 × 10^1.
Floating point is likewise represented by "sign," "mantissa," and "exponent" fields. The 32 bits are divided into three parts: 1 sign bit, followed by 8 exponent bits, followed by 23 mantissa bits.
Now, let's see how each part is represented by filling these fields with actual numbers.
As an example, let us represent -10.25 in bits.
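To make the three-part split concrete, here is a minimal Python sketch that takes a value stored as an IEEE 754 single-precision number and cuts its 32 bits into a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa. The helper name `float32_fields` is made up for this illustration. The printed pattern is the one that the following sections work out by hand.

```python
import struct

def float32_fields(value):
    """Return the (sign, exponent, mantissa) bit fields of value stored as float32.

    float32_fields is a hypothetical helper name used only for this sketch.
    """
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # top 1 bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    mantissa = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(-10.25)
print(f"{sign:01b} {exponent:08b} {mantissa:023b}")
# prints: 1 10000010 01001000000000000000000
```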
Representation of the mantissa part
The mantissa part is the part that represents the actual digits of the number. In binary, each digit to the left of the point is worth a higher power of 2 (1, 2, 4, ...), while each digit to the right of the point is worth a higher power of 0.5, that is, a negative power of 2 (0.5, 0.25, 0.125, ...).
Since the fractional part is built up as a sum of negative powers of 2, some values cannot be expressed exactly. In such cases, the number is approximated by the nearest representable value.
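The approximation can be observed directly. The sketch below rounds a value to single precision and back, using Python's standard `struct` module; `to_float32` is a hypothetical helper name. It compares a value that is a negative power of 2 (0.25) with one that is not (0.1).

```python
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest float32 and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

print(to_float32(0.25) == 0.25)  # True: 0.25 is exactly 2**-2
print(to_float32(0.1) == 0.1)    # False: 0.1 is not a finite sum of powers of 2
print(f"{to_float32(0.1):.12f}") # close to 0.1, but not equal to it
```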
Let's look at the case of -10.25. First of all, how do we express the 0.25 portion after the decimal point? Since 0.25 is 2^-2, it is 0.01 in binary.
The mantissa part is obtained by combining the integer part and the fractional part. The integer part is 10, which converts easily to binary as 1010. Combining the two parts gives 1010.01.
In other words, the integer and fractional portions are each converted to bits and then joined. Next, the point is moved so that the integer portion is a single digit "1": 1010.01 becomes 1.01001 × 2^3.
After this step the integer portion is always 1, so it does not need to be stored. The digits after the point, 01001, become the mantissa part, and the remaining bits are filled with zeros: 0100 1000 0000 0000 0000 000.
With most of the bits now filled in, we are almost there.
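The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the integer part is at least 1, extra fraction bits are simply cut off rather than rounded, and the function names `to_binary_parts` and `mantissa_field` are made up for this example.

```python
def to_binary_parts(integer_part, fraction_part, max_fraction_bits=23):
    """Convert the integer and fractional parts to binary digit strings."""
    int_bits = bin(integer_part)[2:]  # 10 -> "1010"
    frac_bits = ""
    # Repeatedly double the fraction; whatever carries into the ones place is the next bit.
    while fraction_part and len(frac_bits) < max_fraction_bits:
        fraction_part *= 2
        bit = int(fraction_part)
        frac_bits += str(bit)
        fraction_part -= bit
    return int_bits, frac_bits

def mantissa_field(int_bits, frac_bits):
    """Normalize to 1.xxx and return (23-bit mantissa string, places the point moved).

    Assumes int_bits starts with "1" (integer part >= 1); truncates instead of rounding.
    """
    shift = len(int_bits) - 1          # the point moves left past all but the leading 1
    digits = int_bits[1:] + frac_bits  # drop the leading 1, which is not stored
    return digits[:23].ljust(23, "0"), shift

int_bits, frac_bits = to_binary_parts(10, 0.25)
mantissa, shift = mantissa_field(int_bits, frac_bits)
print(f"{int_bits}.{frac_bits}")  # 1010.01
print(mantissa, shift)            # 01001000000000000000000 3
```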
How to find the exponent part
As with decimal numbers, the exponent part represents how many places the point was moved. Since we moved the point three places to the left when obtaining the mantissa part, the exponent is 3.
We might therefore want to set the exponent part to 0000 0011, but that is not how it is stored. The exponent part holds the number in offset binary format with a bias value of 127, so that negative exponents (used for numbers smaller than 1) can also be stored as unsigned values.
Therefore, adding the bias value gives 3 + 127 = 130, and the exponent part is 1000 0010.
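The bias calculation is small enough to check in code. The following is a minimal sketch; `exponent_field` is a hypothetical helper, and the range check reflects the assumption that the all-zeros and all-ones exponent patterns are reserved for special values and not used by ordinary (normalized) numbers.

```python
BIAS = 127  # bias value for the 8-bit exponent of single precision

def exponent_field(shift):
    """Return the 8-bit offset-binary exponent string for a given power of 2."""
    biased = shift + BIAS
    # Assumption: 0 and 255 are reserved, so a normalized number uses 1..254.
    if not 0 < biased < 255:
        raise ValueError("exponent out of range for a normalized float32")
    return f"{biased:08b}"

print(exponent_field(3))   # 10000010  (3 + 127 = 130)
print(exponent_field(-2))  # 01111101  (-2 + 127 = 125), e.g. for 0.25 = 1.0 × 2^-2
```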
Finally, the sign bit is 1 because the number is negative. We can now express -10.25 in floating point: 1 10000010 01001000000000000000000.
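As a cross-check, the three fields can be assembled into one 32-bit word and decoded back into a number. This sketch uses only Python's standard `struct` module; the variable names are chosen for this example.

```python
import struct

sign = 1                  # 1 because the number is negative
exponent = 0b10000010     # 3 + 127 = 130
mantissa = 0b01001 << 18  # 01001 followed by 18 zero bits (23 bits in total)

# Place each field at its position: sign in bit 31, exponent in bits 30-23, mantissa in bits 22-0.
bits = (sign << 31) | (exponent << 23) | mantissa
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]

print(f"{bits:032b}")   # 11000001001001000000000000000000
print(f"0x{bits:08X}")  # 0xC1240000
print(value)            # -10.25
```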
As a side note, the explanation of the mantissa part said that the highest-order digit is always 1. However, there is an important exception: the value 0, which has no "1" digit to normalize to. Therefore, when the value is 0, all bits are 0, even in floating point.
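This special case is easy to confirm with a short sketch that packs 0.0 as a single-precision number and prints its bits.

```python
import struct

# Reinterpret the 4 bytes of float32 0.0 as an unsigned 32-bit integer.
(zero_bits,) = struct.unpack(">I", struct.pack(">f", 0.0))
print(f"{zero_bits:032b}")  # 00000000000000000000000000000000
```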
+ α to keep in mind
With what we have covered this time, we are now able to represent positive numbers, negative numbers, fractions, and exponents. However, arithmetic on floating-point numbers is complicated: for example, the decimal points of the operands must be aligned before adding, and the result must be checked for overflow. For this reason, dedicated arithmetic hardware is often incorporated.
Intel® FPGAs incorporate a digital signal processing (DSP) block to perform floating-point operations. Also, Arria® V SoC, Cyclone® V SoC, and later SoC devices and FPGAs have hardware implementations of floating-point operations.