Fixed and Floating-Point Number

In digital technology, data is stored in memory registers as binary bits (0s and 1s) because the computer only understands binary. When we enter data into the system, it is converted into binary bits and then processed by the CPU in different ways. Memory registers have a fixed format and a specific range for the data they can store. Methods have therefore been designed for representing real numbers in memory registers of 8, 16, and 32 bits.

Two approaches have been developed to store real numbers: fixed-point representation and floating-point representation.
In computing, fixed-point number representation is a real data type for a number. With fixed-point representation, data is converted into binary form and then processed, stored, and used by the system.
Fixed point representation of data
Sign bit – Fixed-point numbers in binary use a sign bit. A positive number has sign bit 0, while a negative number has sign bit 1.
Integral part – The length of the integral part depends on the register's size; for example, in an 8-bit register the integral part is 4 bits.
Fractional part – The length of the fractional part also depends on the register's size; for example, in an 8-bit register the fractional part is 3 bits.
8 bits = 1 sign bit + 4 bits (integral part) + 3 bits (fractional part)
16 bits = 1 sign bit + 9 bits (integral part) + 6 bits (fractional part)
32 bits = 1 sign bit + 15 bits (integral part) + 16 bits (fractional part)
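Example: represent the number 4.5 in an 8-bit register.
Step 1: Convert the number into binary form.
                 4.5 = 100.1
Step 2: Represent the binary number in fixed-point notation. With sign bit 0, integral part 0100, and fractional part 100, the register holds 0 0100 100.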
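The two steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name to_fixed_point and its parameters are made up for this example, and it assumes the sign-magnitude layout described above (1 sign bit, then the integral bits, then the fractional bits) with rounding to the nearest representable step.

```python
def to_fixed_point(value, int_bits=4, frac_bits=3):
    """Encode value as sign | integral part | fractional part (sign-magnitude)."""
    sign = 1 if value < 0 else 0
    # Scaling by 2**frac_bits turns the fraction into a whole number of steps.
    scaled = round(abs(value) * (1 << frac_bits))
    if scaled >= 1 << (int_bits + frac_bits):
        raise OverflowError("value does not fit in the register")
    magnitude = format(scaled, f"0{int_bits + frac_bits}b")
    return f"{sign} {magnitude[:int_bits]} {magnitude[int_bits:]}"

print(to_fixed_point(4.5))    # 0 0100 100
print(to_fixed_point(-4.5))   # 1 0100 100
```

Values that need more than 4 integral bits, such as 16.0, raise OverflowError in this sketch, which is the range limit of the 8-bit format in practice.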
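The smallest (most negative) number in 8-bit fixed-point representation has sign bit 1 and every other bit set to 1, that is, 1 1111 111.
Smallest negative number = -15.875
The largest number has sign bit 0 and every other bit set to 1, that is, 0 1111 111.
Largest number = +15.875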
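Both limits follow directly from the field widths. As a rough sketch (fixed_point_range is a hypothetical helper, and a sign-magnitude layout is assumed), the largest magnitude is all integral and fractional bits set to 1, which is one step below 2 raised to the number of integral bits:

```python
def fixed_point_range(int_bits, frac_bits):
    """Return (smallest, largest, step) for a sign-magnitude fixed-point format."""
    step = 2.0 ** -frac_bits          # value of the last fractional bit
    largest = 2 ** int_bits - step    # all magnitude bits set to 1
    return -largest, largest, step

print(fixed_point_range(4, 3))   # (-15.875, 15.875, 0.125)
print(fixed_point_range(9, 6))   # (-511.984375, 511.984375, 0.015625)
```

The step value shows the other side of the limitation: an 8-bit register of this kind cannot distinguish numbers closer together than 0.125.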
Note: The range of 8-bit fixed-point notation is from -15.875 to +15.875. The range is very small because only numbers within this set limit can be represented. This makes fixed-point notation unsuitable for data that spans a wide range of magnitudes, so general-purpose computing relies mostly on floating-point numbers instead.
A new representation format with a much wider range was therefore needed, because the data to be handled keeps growing. So, floating-point representation came into existence.
Floating-point number representation was developed to overcome the limitation of fixed-point notation. The computer system uses floating-point representation to convert input data into binary form. The binary number is first written in scientific notation, and this scientific notation is then converted into floating-point representation.
Floating-point representation builds on two notations:
Scientific notation – A method of representing numbers in the form a x b^e. Scientific notation is the starting point because floating-point notation only accepts numbers in this form. For example:
Number = 376.423 (this is not scientific notation)
Number in scientific notation = 37.6423 x 10^1 or 3.76423 x 10^2
For example, in decimal:    32.625 x 10^3
and in binary:    1101.101 * 2^101
where 1101.101 is the mantissa part.
2^101 is the base part with the exponent written in binary, (101)₂ = (5)₁₀. The radix or base need not be represented explicitly because the binary base is always 2.
Note: The major problem with this notation is that, when storing the mantissa, the processor has to be told the position of the point every time. Normalized notation was introduced to overcome this problem.
Normalized notation – It is a special case of scientific notation. Normalized means that the first digit after the point is non-zero and nothing is written before the point.
A number in normalized notation has the form
                                           ± m * b^(±e)
where 0.1 ≤ m < 1 (written in base b), b = base, and e = integer exponent. In binary this becomes
                                           ± 0.1bbbb…b * 2^(±e)
If the mantissa is 101, the processor interprets it as 0.101, so it is not necessary to tell the processor the position of the point every time.
For example, .36 x 10^35 is in normalized notation because the value of m lies in the range 0.1 ≤ m < 1.
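For example, convert 1101.101 * 2^101 into normalized form, where the exponent is (101)₂ = (5)₁₀:
1101.101 * 2^5 = 0.1101101 * 2^(5+4) = 0.1101101 * 2^9, and the exponent is (9)₁₀ = (1001)₂
The point moves four places to the left, so the exponent grows by 4. There is now no need to tell the processor the position of the point every time.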
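The normalization step can be checked with Python's standard math.frexp, which splits a number into a mantissa m with 0.5 ≤ |m| < 1 and a power of two. In binary, 0.5 is 0.1, so this is the same 0.1bbbb…b * 2^e form described above. The helper fraction_bits below is made up for this sketch; it simply prints the first few bits after the point.

```python
import math

def fraction_bits(mantissa, width):
    """Return the first `width` binary digits after the point of 0 <= mantissa < 1."""
    bits = ""
    for _ in range(width):
        mantissa *= 2
        bit = int(mantissa)      # 1 if the doubled value reached 1, else 0
        bits += str(bit)
        mantissa -= bit
    return bits

# 1101.101 in binary is 0b1101101 / 2**3 = 13.625; multiply by 2**5 as in the example.
value = (0b1101101 / 2**3) * 2**5
m, e = math.frexp(value)
print(m, e)                                  # 0.8515625 9
print(f"0.{fraction_bits(m, 7)} * 2^{e}")    # 0.1101101 * 2^9
```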
So, four things are used to represent a floating-point number: –
Floating-point representation of data in a 16-bit register.
Sign bit – Floating-point numbers in binary use a sign bit. A positive number has sign bit 0, while a negative number has sign bit 1. In floating-point representation, the sign of a number always depends on the mantissa, not on the exponent. Hence the sign bit in the format always belongs to the mantissa and not to the exponent.
Mantissa part – The length of the mantissa depends on the size of the register; for example, in a 16-bit register the mantissa part is 8 bits.

Exponent part – The exponent is the power to which the base is raised. Its length depends on the register's size; for example, in a 16-bit register the exponent part is 7 bits. The exponent is stored in excess (biased) form, such as excess-16, excess-64, excess-128, or excess-512.
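Steps for representing a number in floating-point format
Step 1: Convert the given number into binary.
6.25 = 110.01
Step 2: Normalize the number: 110.01 = .11001 * 2^3 (base is 2)
Step 3: Represent the number in a 16-bit register in floating-point notation. The sign bit is 0, the 8-bit mantissa is 11001000, and the 7-bit exponent in excess-64 form is 3 + 64 = 67 = 1000011.
Stored as sign, mantissa, exponent, the register holds 0 11001000 1000011, which represents the value 6443H.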
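The three steps can be sketched as a small Python function. The name encode_16bit is hypothetical, and the bit order (sign, then the 8-bit mantissa, then the 7-bit excess-64 exponent) is an assumption inferred from the result 6443H; a different field order would give a different hexadecimal value. Mantissa bits beyond the eighth are simply truncated here.

```python
import math

def encode_16bit(value, excess=64):
    """Pack value as sign (1 bit) | mantissa (8 bits) | exponent (7 bits, excess-64)."""
    sign = 1 if value < 0 else 0
    mantissa, exponent = math.frexp(abs(value))   # 0.5 <= mantissa < 1 for nonzero input
    mantissa_field = int(mantissa * 2**8)         # first 8 bits after the point
    exponent_field = exponent + excess            # biased exponent
    if not 0 <= exponent_field <= 127:
        raise OverflowError("exponent does not fit in 7 bits")
    return (sign << 15) | (mantissa_field << 7) | exponent_field

word = encode_16bit(6.25)
print(format(word, "016b"))        # 0110010001000011
print(format(word, "04X") + "H")   # 6443H
```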
Largest normalized number in a 16-bit register with excess-64
    =   .11111111 * 2^(127-64)   (excess-64 is used to store the exponent in this format)
    =   .11111111 * 2^63
    =   (1 - 2^-8) * 2^63
Smallest normalized number in a 16-bit register with excess-64
    =   .1 * 2^(0-64)   (excess-64 is used to store the exponent in this format)
    =   .1 * 2^-64
    =   .5 * 2^-64   (binary .1 is decimal .5)
    =   2^-65
De-normalized notation is just the reverse of normalized notation. In normalized notation the first digit after the point is 1, but in de-normalized notation the first digit after the point is 0. For example:
Largest de-normalized number with excess-64
      Sign = 0, mantissa = 01111111, exponent = 1111111
= .01111111 * 2^(127-64)
= .01111111 * 2^63
= .1111111 * 2^62
= (1 - 2^-7) * 2^62
Smallest de-normalized number with excess-64
      Sign = 0, mantissa = 00000001, exponent = 0000000
= .00000001 * 2^(0-64)
= .00000001 * 2^-64
= 2^-8 * 2^-64
= 2^-72
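The four limits can be verified with exact rational arithmetic. In this sketch, register_value is a made-up helper that takes the mantissa as a string of bits after the point and the stored exponent field, and assumes excess-64 as in the examples.

```python
from fractions import Fraction

def register_value(mantissa_bits, exponent_field, excess=64):
    """Value of .mantissa_bits * 2^(exponent_field - excess), computed exactly."""
    mantissa = Fraction(int(mantissa_bits, 2), 2 ** len(mantissa_bits))
    return mantissa * Fraction(2) ** (exponent_field - excess)

# Normalized: first mantissa bit is 1.
assert register_value("11111111", 127) == (1 - Fraction(1, 2**8)) * 2**63
assert register_value("10000000", 0) == Fraction(1, 2**65)

# De-normalized: first mantissa bit is 0.
assert register_value("01111111", 127) == (1 - Fraction(1, 2**7)) * 2**62
assert register_value("00000001", 0) == Fraction(1, 2**72)

print("all four limits match")
```

Fractions are used instead of floats only so that the comparisons are exact; all four values are powers of two or short binary fractions.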

Published by hadi on December 8, 2020
