RB-BASIC

What is it?

RB-BASIC is a BASIC variant that should run on a computer equipped with an 8088 (or better), either stand-alone or as a program under CBM-DOS, my own operating system.
One of the ideas is that it should be able to handle original Commodore PRGs. Therefore I will use, for example, the same codes for the BASIC tokens as used by the C128.

The name

I first wanted to call it "CBM-BASIC" but realized that people could confuse it with the BASIC running on various Commodore computers. So I decided to give it my own initials. As the whole program is written from scratch by me, I don't see any reason why I cannot do that :)

Where is it meant to run?

It is meant to be run on my own operating system: CBM-DOS. The idea I have is to let it either run as a separate program like GWBASIC under MS-DOS or as the main control program, more or less like the ROMBASIC of IBM. "More or less" because in this case RB-BASIC is loaded from disk as well. A ROM version is not out of the question.

Realization

I had two ideas: to start from scratch or to convert the code of the C64 ROMs into 8088 code. I chose the first but did the second anyway, just for fun. The main reason for thinking about converting the 6502 code into 8088 code was that I had the vague idea that things could be done, more or less, in a quick and dirty way. I certainly was wrong about that.

So I decided to start from scratch. The first thing I needed was a screen editor that acted more or less like the one of the C64. Already then I made an important decision. A BASIC line on the C64 can be two screen lines minus one character = 79 characters long. I decided to use the same length. The advantage: a line fits on one 80-column screen line and I don't need to maintain a table, like the C64 does, that tells the interpreter which screen line is part of a BASIC line made out of two screen lines. This simply means that RB-BASIC won't run in the 40-column mode of a CGA equipped computer. But honestly, I don't know of any person who really uses this mode except out of curiosity.

The next step was to make a decision how to store data and variables. My main target is the IBM-PC/XT (compatible) computer with at least 256 KB of memory. I wanted to reserve one segment for the code and one segment for the variables.
Having "such an amount" of memory triggered the idea to reserve a fixed number of bytes for every variable. Another idea is to allow names made out of more than two characters. For your information: you can use names longer than two characters on a C64, but only the first two characters are used to identify the stored variable. So although VAR1 and VAR2 are different names, both are identified as VA and therefore seen as the same variable by the C64. But be aware, the idea is that a program created under RB-BASIC should run on a C64. RB-BASIC will see VAR1 and VAR2 as two different variables but, as said, the C64 won't. So be careful in the choice of the names for your variables.
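
The difference in identification can be sketched in a few lines of Python. This is only an illustration: the function names c64_key and rb_key are made up here, and the assumption is that RB-BASIC simply uses the whole name.

```python
def c64_key(name: str) -> str:
    """Identification as on a C64: first two characters plus the type suffix."""
    suffix = name[-1] if name[-1] in "%$" else ""
    base = name[: len(name) - len(suffix)]
    return base[:2] + suffix


def rb_key(name: str) -> str:
    """Assumed RB-BASIC behaviour: the whole name identifies the variable."""
    return name


names = ["VAR1", "VAR2"]
print([c64_key(n) for n in names])  # both names end up with the same key
print([rb_key(n) for n in names])   # two different keys
```

A program that has to run on both machines should therefore only use names whose C64 keys differ.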

Choosing a fixed length for a variable, specifically for strings, has one main advantage: I don't have to worry about garbage collection (see later why). The disadvantage is that I will certainly waste memory. The main waster will be the string. Choosing too long a fixed value will certainly waste memory, but choosing too short a value will hinder the user. Unfortunately a string on a C64 can contain up to 255 characters, which means that if I want to be compatible, I have to use this number as the fixed size as well.
Is there a way to avoid the waste other than by using garbage collection? The original idea was to store variables starting from address zero, in the order they are noticed by the interpreter and with a fixed size. The new idea is to store strings with the size they have the moment they are found. When a string is altered and the new string doesn't fit into the reserved memory, the string is moved to the end of the line and its size is expanded. Then the whole block of memory that was on top of the original string is moved so it occupies the original space.
But what if there is still a shortage of memory? The idea would be to move those strings which occupy more memory than needed at that moment up to the top of the memory and shorten them at the same time. Hey, wait a minute, isn't that what "garbage collection" is about, freeing memory by shortening strings that are too long? Hmmm, it seems I invented the wheel again. But in this case I see that as something positive. So far I avoided the subject for the simple reason that I didn't really understand what "garbage collection" did. And now I know.
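
The scheme described above can be tried out with a toy model in Python. It is a sketch under assumptions, not the 8088 implementation: the class StringArea, its methods and the byte counts are invented for the illustration, and names and headers are not counted.

```python
class StringArea:
    """Toy model of the variable segment: strings stored back to back."""

    def __init__(self, size: int):
        self.size = size      # bytes available for string data (assumed)
        self.slots = []       # [name, reserved bytes, text], in memory order

    def used(self) -> int:
        return sum(reserved for _, reserved, _ in self.slots)

    def assign(self, name: str, text: str) -> None:
        for i, slot in enumerate(self.slots):
            if slot[0] == name:
                if len(text) <= slot[1]:
                    slot[2] = text        # fits: overwrite in place
                    return
                del self.slots[i]         # everything above slides down
                break
        if self.used() + len(text) > self.size:
            self.compact()                # the "garbage collection"
        if self.used() + len(text) > self.size:
            raise MemoryError("OUT OF MEMORY")
        self.slots.append([name, len(text), text])   # stored at the end

    def compact(self) -> None:
        """Shrink every string to the size it needs at this moment."""
        for slot in self.slots:
            slot[1] = len(slot[2])


area = StringArea(16)
area.assign("A$", "HELLO")
area.assign("B$", "ABC")
area.assign("A$", "HELLO WORLD")   # does not fit: A$ moves behind B$
area.assign("A$", "HI")            # fits, but 9 reserved bytes are now unused
area.assign("C$", "XYZ")           # no room left until A$ is shortened
print([slot[0] for slot in area.slots], area.used())
```

The model only shortens strings when an assignment would otherwise fail, which matches the idea of collecting garbage as late as possible.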

Variables

AFAIK BASIC only supports three types of variables:
- integer
- floating point
- string
These variables can be arranged in arrays, if needed. I don't know how many dimensions BASIC should support but I know the C64 only supports up to three so I will stick to that number as well.

Integer

- type                   1 byte     ( = 1 )
                         = zero if previous line was the last variable
- length name            1 byte     incl. zero byte
- name                   max. 17 bytes incl. zero end
- data                   2 bytes

I first thought of using a fixed length of 17 bytes for the name. But using the byte "length name" enables us to save space. Having a fixed length and using a name like "I" would waste 15 bytes. Now the variable takes six bytes versus the original twenty. Also, when comparing two names, if their lengths aren't the same then any further comparison can be skipped.
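
As a check on the byte counts, here is a minimal Python sketch that builds such a record. The function name pack_integer is made up, and storing the data as a signed little-endian word is my assumption (natural on an 8088, but not stated above).

```python
import struct

TYPE_INTEGER = 1
MAX_NAME_BYTES = 17          # name including its zero end byte


def pack_integer(name: str, value: int) -> bytes:
    """Build the record: type, length of name, name + zero byte, data."""
    raw_name = name.encode("ascii") + b"\x00"
    if len(raw_name) > MAX_NAME_BYTES:
        raise ValueError("name too long")
    # Data as a signed 16-bit little-endian word (assumption).
    return bytes([TYPE_INTEGER, len(raw_name)]) + raw_name + struct.pack("<h", value)


record = pack_integer("I", 10)
print(len(record), record.hex(" "))   # six bytes in total
```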

Floating point

- type                   1 byte     ( = 2 / 6 )
                         = zero if previous line was the last variable
- length name            1 byte     incl. zero byte
- name                   max. 17 bytes incl. zero end
- data                   5 bytes

I'm using 40 bits, just like the C64 does. If needed I can always switch to 64 bits.
The type byte for an FP can have two values: 2 or 6. If it is 6, it means that the FP is stored as an integer. Why? Let's have a look at the line 'FOR I = 1 to 8'. A very normal and legal line, nothing wrong with it. But on a C64 the variable 'I' is stored as an FP and treated as such. That means that when going through the loop, all increments of 'I' are done using floating point arithmetic. One can improve the speed by using an integer as counter, like 'I%'. But most people don't because, when learning BASIC, they learned about the existence of integers much later. In most cases too late to change the habit of using 'I' instead of 'I%'. So my idea is to treat an FP as an integer as long as possible. During this time the interpreter can use the much quicker integer arithmetic. When running into a division or going outside the integer boundary, the variable has to be converted to a real floating point.
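
The promotion rule can be sketched in Python as follows. This is an illustration only: the 16-bit signed range and the helper names new_fp, add and divide are assumptions, and a Python float stands in for the 40-bit format.

```python
TYPE_FLOAT = 2           # real floating point
TYPE_FLOAT_AS_INT = 6    # floating point variable still held as an integer

INT_MIN, INT_MAX = -32768, 32767   # assumed 16-bit signed integer range


def new_fp(value):
    """Create a floating point variable, kept as an integer when possible."""
    if isinstance(value, int) and INT_MIN <= value <= INT_MAX:
        return (TYPE_FLOAT_AS_INT, value)
    return (TYPE_FLOAT, float(value))


def add(a, b):
    (type_a, value_a), (type_b, value_b) = a, b
    if type_a == type_b == TYPE_FLOAT_AS_INT:
        result = value_a + value_b
        if INT_MIN <= result <= INT_MAX:
            return (TYPE_FLOAT_AS_INT, result)        # quick integer path
    # Outside the integer boundary, or one side is already a real FP.
    return (TYPE_FLOAT, float(value_a) + float(value_b))


def divide(a, b):
    # In this sketch a division always converts to a real floating point.
    return (TYPE_FLOAT, float(a[1]) / float(b[1]))


counter = new_fp(1)
for _ in range(7):                 # FOR I = 1 TO 8 stays on the integer path
    counter = add(counter, new_fp(1))
print(counter)
```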

String

- type                   1 byte     ( = 3 )
                         = zero if previous line was the last variable
- length name            1 byte     incl. zero byte
- length of data field   1 byte     max. 255 bytes w/o zero-end
- name                   max. 17 bytes incl. zero end
- actual length          1 byte     length of the string incl. zero end byte
- data                   x bytes    max. 255 bytes plus a zero end byte

Why the "actual length" byte? When a string variable is created, the "actual length" byte will have the same value as the "length of data field" byte. When a garbage collection is needed, it is much faster to compare these two bytes than to determine the actual length by scanning the string every time. And when comparing two strings, just like with the "length name" byte, if their lengths aren't the same then any further comparison can be skipped.

Array of integers, floating point

- type                   1 byte     ( = 9 )
                         = zero if previous line was the last variable
- length name            1 byte     incl. zero byte
- size of array          2 bytes
- dimensions             3 bytes
- name                   max. 17 bytes incl. zero end
- data                   x bytes

The size of the array could be calculated using a mathematical formula, but these two bytes are just there to speed things up.
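
The formula meant here is presumably the product of the dimensions times the size of one element. A small Python sketch, assuming one byte per dimension, element sizes of 2 bytes (integer) and 5 bytes (floating point) as in the records above, and a made-up function name:

```python
from math import prod

ELEMENT_SIZE = {"integer": 2, "float": 5}   # data bytes per element


def array_size(dims, kind: str) -> int:
    """Data size in bytes of an array with up to three dimensions."""
    if not 1 <= len(dims) <= 3:
        raise ValueError("one to three dimensions expected")
    size = prod(dims) * ELEMENT_SIZE[kind]
    if size > 0xFFFF:                        # must fit the two size bytes
        raise MemoryError("array does not fit in two size bytes")
    return size


print(array_size((10,), "integer"))
print(array_size((4, 5, 6), "float"))
```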

Array of strings

- type                   1 byte     ( = 11 )
                         = zero if previous line was the last variable
- length name            1 byte     incl. zero byte
- size of array          2 bytes
- dimensions             3 bytes
- name                   max. 17 bytes incl. zero end
- data                   x bytes

The data is organised by arranging the whole as a number of blocks where each block is defined as:

- length of data field   1 byte
- actual length          1 byte     length of the string w/o zero end byte
- data                   x bytes    max. 255 bytes plus a zero end byte

The order of the blocks is (0,0,0), (0,0,1), (0,0,2), ... (0,1,0), (0,1,1), ... (1,0,0), (1,0,1) ... (x,y,z). 'x', 'y' and 'z' are the given dimensions minus 1.
The size of the array has to be calculated by adding the size of each individual block.
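
Both points, the order of the blocks and adding up their sizes, can be sketched in Python. The helper names are made up, and I assume here that every block consists of the two length bytes, the reserved data field and one zero end byte.

```python
def block_index(index, dims) -> int:
    """Position of element (i, j, k) in the order (0,0,0), (0,0,1), ..."""
    i, j, k = index
    _, size_y, size_z = dims
    return (i * size_y + j) * size_z + k


def make_block(text: str, field_len: int) -> bytes:
    data = text.encode("ascii")
    if not len(data) <= field_len <= 255:
        raise ValueError("string does not fit in its data field")
    # length of data field, actual length, padded data, zero end byte
    return bytes([field_len, len(data)]) + data.ljust(field_len, b"\x00") + b"\x00"


def total_size(area: bytes, count: int) -> int:
    """Walk the blocks one by one and add up their sizes."""
    offset = 0
    for _ in range(count):
        offset += 2 + area[offset] + 1
    return offset


blocks = make_block("AB", 4) + make_block("", 0) + make_block("XYZ", 3)
print(total_size(blocks, 3))
print(block_index((1, 0, 0), (2, 3, 4)))
```

Because the blocks differ in size, finding block n also means walking over the n blocks before it; that is the price for not reserving 255 bytes per element.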

The use of the available memory

00000 - 003FF     INT vectors
00400 - 00501     BIOS variables, temporary Stack
00502 - 005FF     stack
00600 - 0....     various tables for Common of CBM-DOS
0.... - 0....     BASIC
10000 - 1FFFF     tokenized program
20000 - 2FFFF     variables
30000 - 9FFFF     future use

The above is for the stand-alone version. In case RB-BASIC is running as a program under CBM-DOS, it is possible that other 64 KB segments are chosen for the tokenized program and variables. And if the single segment version works fine, I'm going to work on a version where multiple segments are used for storage.

The structure of the tokenized program

- pointer to the next line: 2 bytes, "00 00" is end of program
- line number: 2 bytes
- tokenized text
- line ends with zero

The program is stored starting from 0002h. First, this simplifies saving programs: I want to store the starting address, and the two bytes in front of the program are used for this purpose. A Commodore sends the bytes one by one to a drive, but I want to do it sector by sector using DMA.
An extra byte before the program also simplifies the way I can handle the RESTORE command. Maybe Commodore/Microsoft did it for the same reason and... yep, it seems they did :)
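
To make the structure concrete, here is a Python sketch that stores a few lines and walks the chain of pointers again. The function names are invented, and two things are assumptions: the pointer is an offset inside the program segment, and words are stored low byte first.

```python
import struct

PROGRAM_START = 0x0002    # bytes 0000h-0001h are kept free for the start address


def store_program(lines) -> bytes:
    """lines: list of (line number, tokenized text as bytes)."""
    image = bytearray(PROGRAM_START)
    for number, text in lines:
        next_line = len(image) + 4 + len(text) + 1
        image += struct.pack("<HH", next_line, number) + text + b"\x00"
    image += b"\x00\x00"                  # pointer "00 00": end of program
    return bytes(image)


def list_lines(image: bytes):
    """Follow the pointers and yield (line number, tokenized text)."""
    offset = PROGRAM_START
    while True:
        (next_line,) = struct.unpack_from("<H", image, offset)
        if next_line == 0:
            return
        (number,) = struct.unpack_from("<H", image, offset + 2)
        end = image.index(0, offset + 4)  # the zero that ends the line
        yield number, image[offset + 4:end]
        offset = next_line


image = store_program([(10, b"\x99 1"), (20, b"\x89 10")])
for number, text in list_lines(image):
    print(number, text)
```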

Reading a program line

After the "Enter" key has been pressed, the content of the screen line where the cursor is at that moment is copied into the string variable sCommandLine. A BASIC line can be 80 characters long. And 80 characters is what fits on a screen line, so that simplifies things (on purpose).
First it is checked whether the line starts with a number or not. If it does, the line number is calculated and checked for validity. Next the text is encoded, if possible, and the binary line number, the encoded text and an end zero are stored in the string variable sCodedLine.
If there is no number at the start of sCommandLine, we are dealing with a direct command. In this case the line number zero plus three end zeros (= end of program) are added to the variable sCodedLine and fed to the interpreter. The interpreter now jumps to the routine bcRUN2, which executes the former direct command as a one-line program.
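
The two paths can be sketched in Python. Everything here is illustrative: the three token codes are the ones I know from Commodore BASIC, the table is far from complete, the function names are made up, and the upper limit for line numbers is simply taken over from the C64.

```python
import struct

TOKENS = {"PRINT": 0x99, "GOTO": 0x89, "LET": 0x88}   # tiny illustrative subset
MAX_LINE_NUMBER = 63999                               # assumed, the C64 limit


def encode_text(text: str) -> bytes:
    """Replace keywords outside quotes by their one-byte token."""
    out = bytearray()
    in_quotes = False
    i = 0
    while i < len(text):
        if text[i] == '"':
            in_quotes = not in_quotes
        word = None
        if not in_quotes:
            word = next((w for w in TOKENS if text.startswith(w, i)), None)
        if word:
            out.append(TOKENS[word])
            i += len(word)
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def read_line(command_line: str):
    stripped = command_line.strip()
    digits = len(stripped) - len(stripped.lstrip("0123456789"))
    if digits:                                    # program line
        number = int(stripped[:digits])
        if number > MAX_LINE_NUMBER:
            raise ValueError("illegal line number")
        text = encode_text(stripped[digits:].lstrip())
        return "program", struct.pack("<H", number) + text + b"\x00"
    # Direct command: line number zero plus three end zeros.
    return "direct", struct.pack("<H", 0) + encode_text(stripped) + b"\x00\x00\x00"


print(read_line('10 PRINT "HI"'))
print(read_line("GOTO 10"))
```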

##### TO DO: A BASIC line on the C128 can only be 160 characters long on the screen. That does include the line numbers. Which is weird IMHO because I thought the length of the encoded line would be subject to a limitation. The CBM and CBM-II have the same limit. I don't see any reason why my lines cannot be longer. For various practical reasons, for example to limit the length of the various buffers, I limit it to three screen lines.
I don't want to exceed 252 bytes for the code anyway, but when choosing four screen lines, I easily run the risk of exceeding that number. How should the user know when he exceeds that limit? With only three screen lines he cannot.

But how to program that all? Although I want to invent my own wheel, I know the C64 uses a table. How exactly, I don't know. But I will use a table as well.
Another thing: when entering a new line, how does the C64 know which typed-in text is all part of one line? I don't know, but here I will use my own method. My definition of "one line" is a piece of text that has been typed in without pressing the "Enter" key or using cursor movement outside the already typed line. Unless the next line is empty, typing past the end of a screen line will cause the next lines to be moved down. Entering the fourth line will also cause an error. Another table will keep track of typed lines that haven't been entered as a BASIC line yet.

Interpreting a BASIC line

First the line number is read. If it is zero, it means this is the end of the program. If not, the length of the line is skipped and the next byte is read. FYI: bcRUN2 skips the line number.
If bit 7 of this byte is set, we are dealing with a token. If it isn't set, we assume we are dealing with a variable and treat the line as a LET command. A program line with LET can look like "10 let A=10" but also like "10 A=10"; LET can be omitted. A more complicated line will look like:

10 A=((b+c)*(d+e))

The above is an arithmetical expression. Another type of expression is the logical expression which looks like:

[expression1] [logical operator] [expression2]

where the logical operator looks like "=", ">" or "<=".

A simple expression can be:
- a constant, like "A=10"
- a variable, like "A=B"
- a function, like "A=log(expression)"

An arithmetical expression can be brought back to:

[-][expression1] [arithmetical operator] [-][expression2]

In my opinion EVERY expression can be brought back to the above basic expression. Let's have a look at various examples:

A + B + C == [ [A + B] + C]

A + B * C == [A + [B * C] ]

The reason for the above construction is that multiplication is of a higher order than addition and therefore has to be performed first. But the above operation incorporates a difficulty. Please have a look at the next one, an expanded version of the above one:

A + B * (C + D * E) == [A + [B * [C + [D * E] ] ] ]

The operation has been brought back to pairs of expressions, as it should. The difficulty is that the operation "D * E" has to be calculated before we can do anything else. A human being can do that almost without thinking. A computer can do that as well, but it has to be programmed to do so and it will take time. (*)
Another solution is to simply start and, when running into another arithmetical expression, to save the result so far somehow. A logical place to store it is the Stack. Certainly in the case of the Commodore and other 6502 equipped computers, the Stack is not endless. In case there is not enough room on the Stack, the C64 stops the program with an "OUT OF MEMORY" error.
In the above case the values of the variables A, B and C have to be stored on the Stack.

(*) On second thoughts, if we take the expression "(A * B) + (C * D)" as an example, whether we start with "(A * B)" or "(C * D)", we have to remember something and that means we still have to make use of the Stack. So let us save ourselves some trouble and start using this mechanism right away.
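
The mechanism can be tried out in Python with a list as the Stack. This is a sketch of the general technique, not the RB-BASIC routine: it only knows single-letter variables, the four basic operators and brackets, it has no unary minus, and all names in it are made up.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def apply(operator, left, right):
    if operator == "+":
        return left + right
    if operator == "-":
        return left - right
    if operator == "*":
        return left * right
    return left / right


def evaluate(expression: str, variables: dict):
    stack = []        # saved (result so far, pending operator) pairs
    current = None    # the value we are working on
    for token in expression.replace(" ", ""):
        if token == "(":
            stack.append("(")
        elif token == ")":
            while stack[-1] != "(":           # finish everything in the brackets
                left, operator = stack.pop()
                current = apply(operator, left, current)
            stack.pop()
        elif token in PRECEDENCE:
            # First finish saved operations of the same or a higher order.
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1][1]] >= PRECEDENCE[token]):
                left, operator = stack.pop()
                current = apply(operator, left, current)
            stack.append((current, token))    # save the result so far
        else:
            current = variables[token]
    while stack:
        left, operator = stack.pop()
        current = apply(operator, left, current)
    return current


values = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
print(evaluate("A + B * (C + D * E)", values))
print(evaluate("(A * B) + (C * D)", values))
```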

What to store on the Stack?

As said before, IMHO we have to store the result found so far on the Stack. But a computer is just a dumb machine so we first have to tell it WHAT has to be stored. The moment it has to retrieve the value it first has to know what kind of value it will be retrieving; an integer has another size than a floating point or a string.
The operator has to be stored as well, otherwise it will be lost when interpreting a completely new expression. IMHO it doesn't matter if it is pushed before or after the type and value. But since I have to push the type anyway, the type being a byte, the operator being a byte and the processor being able to push a word register, I'll combine these two and push both in one go.
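
A small Python sketch of this idea, with a bytearray as the Stack. Which half of the word holds the type, using the plain character as operator byte, and the names push and pop are all assumptions for the illustration; strings are left out because their size varies.

```python
import struct

TYPE_INTEGER, TYPE_FLOAT = 1, 2
VALUE_SIZE = {TYPE_INTEGER: 2, TYPE_FLOAT: 5}    # bytes of data per type


def push(stack: bytearray, var_type: int, operator: str, value: bytes) -> None:
    if len(value) != VALUE_SIZE[var_type]:
        raise ValueError("value size does not match its type")
    stack += value
    # Type and operator combined in one word, pushed last so it comes off first.
    stack += struct.pack("<H", (var_type << 8) | ord(operator))


def pop(stack: bytearray):
    (word,) = struct.unpack("<H", stack[-2:])
    del stack[-2:]
    var_type, operator = word >> 8, chr(word & 0xFF)
    size = VALUE_SIZE[var_type]                  # the type tells how much to fetch
    value = bytes(stack[-size:])
    del stack[-size:]
    return var_type, operator, value


stack = bytearray()
push(stack, TYPE_INTEGER, "+", b"\x0a\x00")
push(stack, TYPE_FLOAT, "*", b"\x81\x00\x00\x00\x00")
print(pop(stack))
print(pop(stack))
```

Pushing the combined word after the value means the interpreter reads the type first when retrieving, so it knows how many bytes belong to the value.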

How to start interpreting the expressions?

In case of the expression "A + B" we could start with getting A as first parameter, B as second parameter and "+" as operator. But what to do with the expression "-A + B" or something like "-(A * (B + C))"? The easiest way is to start with the first parameter set to zero. In case of "-A + B" the second parameter will be A and "-" will be the operator.
But what "zero" value will the first parameter contain? When seeing a LET statement like "A$=...." we should expect that A is a string. A% points to an integer and A to a floating point.
But in the case of a floating point we make an exception: we start by treating it as an integer. I know from my own experience and from reading programs made by others that floating points are often used as integers: "FOR I=0 to 10". Why? I was taught to do it this way, not knowing that 'I' was a floating point. And when programming a quick and dirty test program, I still more often use I than I%.
So my idea is to treat floating points as integers until I run into something where I have to treat them as floating points, like a division or a logarithm.
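
The zero start can be sketched in Python like this. The type numbers are the ones from the variable records above; the function names and the flat left-to-right evaluation (only + and -, no brackets) are simplifications made up for the example.

```python
TYPE_INTEGER, TYPE_STRING, TYPE_FLOAT_AS_INT = 1, 3, 6


def first_parameter(target: str):
    """The 'zero' to start with, chosen from the name left of the '='."""
    if target.endswith("$"):
        return TYPE_STRING, ""
    if target.endswith("%"):
        return TYPE_INTEGER, 0
    return TYPE_FLOAT_AS_INT, 0      # floating point, treated as integer for now


def evaluate_flat(tokens, variables, target: str):
    """Left to right, only + and -: enough to show the zero start."""
    _, result = first_parameter(target)
    operator = "+"
    for token in tokens:
        if token in ("+", "-"):
            operator = token
        else:
            value = variables[token]
            result = result + value if operator == "+" else result - value
    return result


# "X = -A + B" is handled as "0 - A + B"
print(evaluate_flat(["-", "A", "+", "B"], {"A": 3, "B": 5}, "X"))
```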

Do you have questions or comments? Do you want more information?
You can email me here.