
My Floating Point adventures





What is it?

This is a story about how I managed to deal with Floating Point numbers in two of my projects.


Background

For a long time now I have been working, among other things, on three software projects:
- My own OS, CBM-DOS, meant to run on 8088 PC compatibles. My primary targets: C= PC10 series. Main hurdle at this moment: directories.
- My own bootable RB-BASIC, same targets. Shares a lot of routines with the OS.
- My own Pascal, written in Turbo Pascal, which outputs macros. My own assembler should turn the macros into binaries for the various target machines. For the moment these are only the PC and the Commodore 64.

The last two projects have one thing in common: they need Floating Point (FP) routines for the mathematical part. For the C64 I can use the onboard FP routines, so I consider that problem solved. But I have to create new FP routines for the 8088. And once the Pascal version has been created, I will create an ASM one for my BASIC and park the routines in the Common file of the OS.


Searching for ideas

So far I have only found the 8088 sources of the original and of a hacked Microsoft GW-BASIC, plus the so-called URC library. The problem: the code of both is very hard to understand, and I'm not the only one with this opinion. What doesn't help either: I'm a NASM man and the last time I really worked with MASM/TASM was thirty years ago.

That raised the idea to convert the FP routines of the C64 to the 8088 and use them for my Pascal and BASIC as well. OK, it is "only" 40 bits, but that is good enough for me. Once I understand things better, I can always extend the number of bits. The main advantage: to check if things work OK, I can just run a small BASIC/ML program and see what happens with the Floating Point registers FAC1 and FAC2. Then the "only" thing I have to do is to achieve the same result with my Pascal/8088 routines.
But that didn't work out well. The main reasons were that 1) the way the integer part and the fraction part were calculated was rather slow, and 2) the FP routines of the C64 use FP division to calculate various steps. So I ran into a kind of "chicken and egg" problem: I cannot calculate smaller FP numbers because I haven't programmed division yet, which in turn means I cannot develop things because I haven't found a way to create FP numbers. In the end I decided to work things out completely by myself, using one idea I found on the internet: use tables.


Using tables

The main idea is to generate a table for fractions with the values 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, etc., called a01, and one for integers with the values 10, 100, 1000, 10000, etc., called a10. The question is: how many rows do we need and how many bits will each record contain?

I already told you that I would start programming in Pascal. The idea arose to start with double-precision, for the simple reason that Pascal was using it as well. In this way I could see how Pascal stored the numbers in FP format, and if my routine produced a different set of bytes, I knew I did something wrong.
Then another idea arose, my own this time: create an array of words that could hold the FP number in full form. But I soon found out that that was completely undoable for double-precision, yet doable for single-precision: only 23 words. Single-precision can "only" handle numbers from roughly 1E-38 to 1E+38, compared to the roughly 1E-308 to 1E+308 of double-precision, but I can live with that. And that's how the next idea arose, to use my own format: single-precision but with an extra 32 bits for the mantissa. In fact it meant that I had 3 bits more precision than double-precision.

It was only after some experiments that I found out that I needed an array of 23 words to store the original number: 9 words for the integer part and 14 for the fraction part. As for the tables, I need one with 40 rows with 9 words each for the integer part and one table with 51 rows of 14 words each for the fraction part.
I first used "constant" tables: the values are stored in the program itself. Then the idea arose to calculate the values when starting a program that needs them (a sketch follows after this list). Advantages:
- It is stored in RAM. If I don't need these values, the RAM is free for other uses.
- The program to calculate the values is smaller than the set of hard coded values.
- During the development I had to change the number of rows and the number of words per row several times. If I have to do it again for some reason, it means I only have to change four bytes at most.
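To give an idea, here is a minimal sketch in Free Pascal of how such tables can be generated at startup. The names and the use of LongWord are just for illustration, not my actual routines. Note that deriving each fraction row from the previous one truncates a tiny bit at the bottom on every step.

program GenTables;

const
  IntWords  = 9;     { words per row of the integer table a10 }
  FracWords = 14;    { words per row of the fraction table a01 }
  IntRows   = 40;
  FracRows  = 51;

type
  TIntRow  = array[0..IntWords-1]  of Word;   { least significant word first }
  TFracRow = array[0..FracWords-1] of Word;   { most significant word last }

var
  a10 : array[1..IntRows]  of TIntRow;    { 10, 100, 1000, ... }
  a01 : array[1..FracRows] of TFracRow;   { 0.1, 0.01, 0.001, ... }

{ Multiply a multi-word integer by ten, propagating the carry upwards. }
procedure MulRow10(var R: TIntRow);
var
  i: Integer;
  Product, Carry: LongWord;
begin
  Carry := 0;
  for i := 0 to IntWords-1 do
  begin
    Product := LongWord(R[i]) * 10 + Carry;
    R[i]    := Word(Product and $FFFF);
    Carry   := Product shr 16;
  end;
end;

{ Divide a multi-word binary fraction by ten. Rest starts as the integer
  part to be spread out over the fraction words; the final remainder is
  simply truncated. }
procedure DivRow10(var R: TFracRow; Rest: LongWord);
var
  i: Integer;
  Part: LongWord;
begin
  for i := FracWords-1 downto 0 do
  begin
    Part := (Rest shl 16) or R[i];
    R[i] := Word(Part div 10);
    Rest := Part mod 10;
  end;
end;

var
  r: Integer;
begin
  { a10: 10, then each next row is the previous one times ten }
  FillChar(a10, SizeOf(a10), 0);
  a10[1][0] := 10;
  for r := 2 to IntRows do
  begin
    a10[r] := a10[r-1];
    MulRow10(a10[r]);
  end;

  { a01: 0.1 = 1.0 / 10, then each next row is the previous one divided
    by ten; the first row indeed comes out as 1999 9999 ... }
  FillChar(a01, SizeOf(a01), 0);
  DivRow10(a01[1], 1);
  for r := 2 to FracRows do
  begin
    a01[r] := a01[r-1];
    DivRow10(a01[r], 0);
  end;
end.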



How it works: ASCII to FP number

First we check if exponential notation like 1.234 E 12 is present. This is converted into its full text, 1,234,000,000,000 but without the commas. The value 1.2345678 E 3 will end up as 1234.5678.
I have thought about mathematical solutions like calculating the first value and multiplying/dividing the result by ten, but the textual approach turned out to be much smaller and faster.
Another advantage: handling a number like 1.234567 E -50 would cause a "lot" of work only to find out, after the multiple multiplications by ten, that the number is out of range. With the textual solution it soon becomes clear, because of all the leading zeros, that the number is out of range.
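A sketch of this textual expansion, again in Pascal and again with names chosen just for this example; sign and error handling are left out and an 'E' is assumed to be present:

{ Expand '1.2345678E3' into '1234.5678': collect the bare digits,
  remember where the decimal point was, and let the exponent simply
  move that point, padding with zeros where needed. }
function ExpandENotation(const S: string): string;
var
  i, EPos, Exp, Code, DotPos: Integer;
  Digits: string;
begin
  EPos := Pos('E', S);
  if EPos = 0 then EPos := Pos('e', S);
  Val(Copy(S, EPos+1, Length(S)), Exp, Code);

  Digits := '';
  DotPos := 0;
  for i := 1 to EPos-1 do
    if S[i] = '.' then DotPos := Length(Digits)
    else Digits := Digits + S[i];
  if DotPos = 0 then DotPos := Length(Digits);

  { the exponent moves the decimal point }
  DotPos := DotPos + Exp;

  while DotPos > Length(Digits) do Digits := Digits + '0';
  while DotPos < 1 do
  begin
    Digits := '0' + Digits;
    DotPos := DotPos + 1;
  end;
  if DotPos < Length(Digits) then
    Insert('.', Digits, DotPos+1);
  ExpandENotation := Digits;
end;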

The next step is reading the ASCII string from left to right and placing the first digit in position 14 of the array. After reading the next digit, I multiply the already stored words by ten and add the new digit to position 14, with corrections for possible carries. This sounds simple, and if you have a look at the Pascal code, it IS simple. In the meantime the number of processed digits is counted. The conversion stops when the string ends or a decimal point is encountered.
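A sketch of that inner step, with TNumber being the 23-word array (words 14..22 for the integer part, 0..13 for the fraction part); the names are for illustration only:

type
  TNumber = array[0..22] of Word;

{ Fold the next decimal digit into the integer part: multiply the words
  stored so far by ten and add the digit, carrying into the higher
  words where needed. }
procedure MulAdd10(var A: TNumber; Digit: Word);
var
  i: Integer;
  Product, Carry: LongWord;
begin
  Carry := Digit;
  for i := 14 to 22 do
  begin
    Product := LongWord(A[i]) * 10 + Carry;
    A[i]    := Word(Product and $FFFF);
    Carry   := Product shr 16;
  end;
end;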
In case of a decimal point, the way of conversion changes. Positions 13 down to 0 are reserved for the fraction part and are cleared at the start. Then the value of the first digit after the decimal point is determined. The whole row of words for 0.1 is multiplied by this digit and the result is added to the words of the array. The value of the next digit is determined, multiplied with the row for 0.01, and the result is added to the array again. This continues until we would have to use the row for 1E-51. It doesn't exist, so we have to stop and disregard all possible further digits.
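A sketch of this step, reusing TNumber from above and TFracRow from the table sketch:

{ Fold the next fraction digit into words 0..13 by adding Digit times
  the matching a01 row (0.1, 0.01, ...) to the array. }
procedure AddFracDigit(var A: TNumber; const Row: TFracRow; Digit: Word);
var
  i: Integer;
  Sum, Carry: LongWord;
begin
  Carry := 0;
  for i := 0 to 13 do
  begin
    Sum   := LongWord(A[i]) + LongWord(Row[i]) * Digit + Carry;
    A[i]  := Word(Sum and $FFFF);
    Carry := Sum shr 16;
  end;
  { the total fraction always stays below 1, so no carry can spill
    into the integer word 14 }
end;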

Whatever the start was, an integer, a fraction or an integer plus a fraction, the result has to be "normalized". In this case that means we have to shift all bits to the left or right, or maybe not at all, until positions 22 down to 15 are zero and position 14 is one. First, if possible, in steps of whole words, then bitwise. The total number of steps is stored. Shifts to the right, needed for integers, count as positive; shifts to the left, needed for fractions, count as negative.
The first 55 bits of positions 13 down to 10 form the mantissa. Now the number 127 is added to the number of shifts and the result forms bits 56..63 of the FP number. Bit 64, the sign bit, becomes zero if the original number was positive, one if it was negative.
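A sketch of the final packing, using Free Pascal's QWord purely for illustration; bits are counted from 0 here, so the sign ends up in bit 63 and the biased shift count in bits 55..62:

{ Pack the normalized number into the 64-bit format described above.
  Words 13..10 hold 64 bits; only the upper 55 are kept, the leading
  one in position 14 stays implicit. }
function PackFP(const A: TNumber; Shifts: Integer; Negative: Boolean): QWord;
var
  Mant: QWord;
begin
  Mant := (QWord(A[13]) shl 48) or (QWord(A[12]) shl 32) or
          (QWord(A[11]) shl 16) or  QWord(A[10]);
  Mant := Mant shr 9;                        { 64 - 55 = 9 bits dropped }
  PackFP := Mant or (QWord(Shifts + 127) shl 55);
  if Negative then
    PackFP := PackFP or (QWord(1) shl 63);   { sign bit }
end;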


How it works: FP number to ASCII

Converting the FP number to ASCII is more or less the opposite of the previous conversion. The sign bit is extracted and stored, the number of shifts is stored, and the last 55 bits, the mantissa, are stored from position 13 downwards. Position 14 becomes one. Then the value 127 is subtracted from the stored number of shifts. If the result is negative, the words from position 14 down to 10 are shifted to the right. If the result is positive, the words from position 14 down to 10 are shifted to the left.
If the number of shifts was positive, we know that the original number is bigger than one. We take the binary version of the biggest number of the a10 table, 1E+38, and compare it with the words in the array. If those of the array are greater, we subtract the words of the a10 row. We repeat this until the words of the a10 row are bigger. The number of subtractions forms the first digit. Now we take the next lower value of the a10 table, 1E+37, and repeat the subtractions. Now we have the second digit.
If no subtraction is possible at all, the digit will be zero and we go to the next row. In case of the integer part, the zeros at the start of the string will be omitted of course.
This process is repeated until position 14 is zero. If any of the positions lower than 14 is not zero, that means there is a fraction part as well. A dot must be added to the end of the string and the rest is described below.
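A sketch of the digit extraction for the integer part; CompareGE and SubRow are illustration names and walk the integer words 14..22:

{ True when the integer words 14..22 of A are >= the a10 row. }
function CompareGE(const A: TNumber; const Row: TIntRow): Boolean;
var
  i: Integer;
begin
  for i := 8 downto 0 do
    if A[14+i] <> Row[i] then
    begin
      CompareGE := A[14+i] > Row[i];
      Exit;
    end;
  CompareGE := True;    { equal }
end;

{ Subtract the a10 row from the integer words, with borrow. }
procedure SubRow(var A: TNumber; const Row: TIntRow);
var
  i: Integer;
  Diff: LongInt;
begin
  Diff := 0;
  for i := 0 to 8 do
  begin
    Diff := LongInt(A[14+i]) - LongInt(Row[i]) + Diff;
    if Diff < 0 then
    begin
      A[14+i] := Word(Diff + $10000);
      Diff := -1;       { borrow from the next word }
    end
    else
    begin
      A[14+i] := Word(Diff);
      Diff := 0;
    end;
  end;
end;

{ One decimal digit: subtract the row as often as possible. }
function ExtractDigit(var A: TNumber; const Row: TIntRow): Char;
var
  Digit: Integer;
begin
  Digit := 0;
  while CompareGE(A, Row) do
  begin
    SubRow(A, Row);
    Inc(Digit);
  end;
  ExtractDigit := Chr(Ord('0') + Digit);
end;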

In case of a fraction, positions 22 down to 14 are zero and one or more of positions 13 down to 0 are not. If this was the situation right at the start of the conversion, the ASCII string starts with "0.". Now we try to subtract the binary value of 0.1. If that is not possible, we add a zero to the end of the string. If it is possible, we try again as many times as possible and add the resulting digit to the end of the string. Then we repeat this for the values 0.01, 0.001, etc. Unlike the integer part, this could keep going on forever. The main reason: the conversion from our decimal system to binary does not end up in nicely rounded numbers:
0.1 : 1999 9999 9999 9999 9999 9999 9999 9......
0.01 : 028F 5C28 F5C2 8F5C 28F5 C28F 5C28 F......
0.001: 0041 8937 4BC6 A7EF 9DB2 2D0E 5604 7......
The result is that, with whatever number you started, you most probably will end up with a number that isn't exactly like the original one. The cause is that the mantissa is limited and cannot contain all bits of the binary representation of the original number. The difference, which can be positive or negative, also depends on how many bits you reserved for the table. I chose 14 words because that covers numbers like 3.345678E-38.

But where should the conversion stop? If the starting value 0.1 ends up as 0.100000000000000001, we can be sure that the original value was 0.1, because the limited mantissa isn't able to represent that longer number. But how can we program that? I created a mask that at first covered all bits from position 13 down to the last bit of the original number, and then I removed the last 16 bits:
 13   12   11   10    9    8    7     row
0000 0012 3456 7890 1234 8000 0000    original number
FFFF FFFF FFFF FFFF FFFF 8000 0000    original mask
FFFF FFFF FFFF FFFF 8000 0000 0000    shortened mask
0000 0000 0000 0000 3596 2541 8739    rest after subtractions
If the rest after the subtractions ends up completely outside that mask, then I think I can be sure that the original value has been found.
Why did I remove the last 16 bits? The original idea was to use the original mask, but during testing I ran into situations where the result didn't match the original number. After removing those 16 bits, the results were all OK (so far). I don't have an explanation for this, but so far it is OK for me.
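My reading of "outside the mask" as a sketch: the conversion can stop once the rest has no bits in common with the shortened mask anymore. Only the fraction words 13..0 matter here.

{ True when the rest falls entirely below the shortened mask. }
function RestOutsideMask(const Rest, Mask: TNumber): Boolean;
var
  i: Integer;
begin
  for i := 13 downto 0 do
    if (Rest[i] and Mask[i]) <> 0 then
    begin
      RestOutsideMask := False;
      Exit;
    end;
  RestOutsideMask := True;
end;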

Remark: even the 55 bits of precision that my format supports cannot be enough in some cases. The value 567890123456789123 ends up after the two conversions as 567890123456789120. The first 17 digits of an integer are correct, but from the 18th digit on there will most probably be differences. The only thing that helps to decrease that difference is to increase the number of bits; I have seen formats supporting 128-bit FP numbers.
I have used Turbo Pascal and Free Pascal for years and I'm still using them. So far I have never run into any trouble regarding the precision of a number, so I stick to my own 64-bit format.


Addition, subtraction, multiplication and division

These four operations have one thing in common: first I created a 24-word array for each of the two values and converted the two given FP numbers into these arrays. In case of addition or subtraction I performed the operation and stored the result in a third array. Converting this array back to FP gave me the wanted result.
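A sketch of the addition; subtraction works the same way, with a borrow instead of a carry, just like SubRow above. TWords24 stands for those 24-word arrays:

type
  TWords24 = array[0..23] of Word;

{ Add the 24-word arrays X and Y into Z, word by word with carry. }
procedure AddWords(const X, Y: TWords24; var Z: TWords24);
var
  i: Integer;
  Sum, Carry: LongWord;
begin
  Carry := 0;
  for i := 0 to 23 do
  begin
    Sum   := LongWord(X[i]) + LongWord(Y[i]) + Carry;
    Z[i]  := Word(Sum and $FFFF);
    Carry := Sum shr 16;
  end;
end;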

In case of a multiplication I multiply all words with each other, just like a normal long multiplication. That works fine in Pascal and on the 8088, but in case of a 6502 or Z80 I will have to use multiple additions.
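A sketch of that schoolbook multiplication, reusing TWords24 from above; every word pair gives a 32-bit partial product, and two 24-word values need a 48-word result. The fixed-point alignment of the result is left out here:

type
  TWords48 = array[0..47] of Word;

{ Multiply X and Y into Z: for every word of X, accumulate the partial
  products into the matching positions of Z, carrying upwards. }
procedure MulWords(const X, Y: TWords24; var Z: TWords48);
var
  i, j: Integer;
  Product, Carry: LongWord;
begin
  FillChar(Z, SizeOf(Z), 0);
  for i := 0 to 23 do
  begin
    Carry := 0;
    for j := 0 to 23 do
    begin
      Product := LongWord(X[i]) * Y[j] + Z[i+j] + Carry;
      Z[i+j]  := Word(Product and $FFFF);
      Carry   := Product shr 16;
    end;
    Z[i+24] := Word(Carry);
  end;
end;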

In case of division I used multiple subtractions.
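One way to realize those multiple subtractions is the classic shift-and-subtract division, sketched here on plain integers to keep it short; the real version compares and subtracts word arrays just like SubRow above. Den must not be zero.

{ Restoring division: for every result bit, bring down one bit of the
  numerator and subtract the denominator where possible. }
function DivBits(Num, Den: LongWord): LongWord;
var
  Bit: Integer;
  Quot, Rest: LongWord;
begin
  Quot := 0;
  Rest := 0;
  for Bit := 31 downto 0 do
  begin
    Rest := (Rest shl 1) or ((Num shr Bit) and 1);   { bring down a bit }
    Quot := Quot shl 1;
    if Rest >= Den then
    begin
      Rest := Rest - Den;                            { the subtraction step }
      Quot := Quot or 1;
    end;
  end;
  DivBits := Quot;
end;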


Other mathematical functions

For the moment I think I will use the so-called Taylor series (https://en.wikipedia.org/wiki/Taylor_series), with the exception of the square root. The reason: AFAIK a square root cannot be calculated using Taylor, otherwise I most probably would have used this method here as well.
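As an illustration, the Taylor series for sin(x), sketched here with Pascal's Double; the real routines would of course run on the word arrays, and x is assumed to be reduced to a small range first. Each term is derived from the previous one, so no separate factorials are needed:

{ sin(x) = x - x^3/3! + x^5/5! - ... until the terms become negligible. }
function SinTaylor(x: Double): Double;
var
  Term, Sum: Double;
  n: Integer;
begin
  Term := x;
  Sum  := x;
  n    := 1;
  while Abs(Term) > 1E-17 do
  begin
    { next term: multiply by -x^2 / ((n+1)(n+2)) }
    Term := -Term * x * x / ((n + 1) * (n + 2));
    Sum  := Sum + Term;
    n    := n + 2;
  end;
  SinTaylor := Sum;
end;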


To be continued.....





Having questions or comments? Do you want more information?
You can email me here.