Building a BASIC Interpreter, '80s style

Following this thread with interest - mostly because I wrote my own BASIC about 10 years back, tweaked it a little (a lot!) when the Raspberry Pi came out, and it’s still being used by a few people today. (There was even a commercial spin-off called FUZE Basic - am I the only person to have sold a new BASIC in the 00s?)

I decided to make a few changes (from a “classic” BASIC) when I was implementing mine - mostly for efficiency and also to make my life easier… It was also (originally) “my” perfect BASIC - a vanity project, if you like. I also had the luxury of writing it in C rather than some assembler and designing it to run under a modern OS, so I was able to use the operating system’s dynamic memory routines rather than try to maintain my own and “garbage collect” (i.e. I use malloc/free).

One change was that I’d not allow multiple statements per line… In the old days we’d cram as many statements per line as we could, to make programs run faster and take up less RAM. I didn’t need either of those constraints, so I dropped that idea. And on the tokenisation front: everything got tokenised. And I mean everything. Even comments. Numbers were stored in their native binary format. Line numbers are also optional, but are always there, so when you load a line-number-less program it’s numbered from line 1, incrementing by 1. (There is a comprehensive renumber command.)
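
The loading side of that is nothing more than handing out numbers as lines come in - here’s a simplified C sketch of the idea, with made-up names, not the actual loader:

#include <stdio.h>
#include <string.h>

/* Hypothetical loader sketch (not the real interpreter's code): a program
   saved without line numbers gets numbered 1, 2, 3, ... as it is read in;
   tokenisation of each line would happen at this point too. */
int main(void) {
    char line[1024];
    int number = 1;

    while (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        printf("%d %s\n", number++, line);  /* stand-in for "tokenise and store" */
    }
    return 0;
}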

The LIST command (for those using the traditional interactive line-number interface) then becomes a de-compiler. And here is a problem with tokenising and evaluating everything. Consider

10 A = 1e6

and you type LIST and you get

10 A = 1000000.00004

which isn’t what you typed (I’ve made up the rounding here - you might actually get 1000000, but the important bit is that it’s not 1e6).
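
The root of it is easy to show in a couple of lines of C - once the text has been reduced to a double, LIST can only print some rendering of the binary value, never the exact spelling that was typed (a sketch, not the real code):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *typed = "1e6";             /* what the user entered */
    double value = strtod(typed, NULL);    /* what the tokeniser stores */

    /* What a LIST-style de-compiler can print back from the binary form: */
    printf("10 A = %g\n", value);      /* -> 10 A = 1e+06          */
    printf("10 A = %f\n", value);      /* -> 10 A = 1000000.000000 */
    /* Neither is the literal "1e6" the user typed. */
    return 0;
}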

So what do you do? Well, RAM is cheap, so I store the textual part of what was typed along with the binary form of the number… Similarly for strings and comments:

20 REM This is a comment
30 A$ = "Fred"

each of those is replaced with a single token - an index into the symbol table entry that holds the textual value of the comment (or string).
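
In other words, each symbol table entry keeps both forms - something along these lines, with the field names made up for the illustration:

/* Illustrative only: one symbol-table entry carries both the original
   spelling and the evaluated form, so LIST can reproduce what was typed. */
typedef enum { SYM_NUMBER, SYM_STRING, SYM_COMMENT } SymKind;

typedef struct {
    SymKind kind;
    char   *text;    /* exactly what was typed: "1e6", "Fred", "This is a comment" */
    double  value;   /* binary form, only meaningful for SYM_NUMBER */
} SymEntry;

/* In the token stream, the whole literal or comment is then just one token:
   an index into the table of SymEntry records. */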

I removed any limitation on variable name length. After all, a name gets replaced by a single token no matter how long it is, so:

40 theLoopCounter = 42

is stored as 3 tokens. The first is an index into the symbol table entry for the variable - that entry contains its name, type and value (and a flag to indicate used or not used). The 2nd is the = symbol, which is replaced by a token that indicates a function, and the last one is a token that represents a constant.
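
So the in-memory picture is roughly this (field names invented for the sketch, not taken from the actual source):

#include <stdint.h>

/* A symbol-table slot for a variable, as described above: */
typedef struct {
    char   *name;    /* "theLoopCounter" - any length */
    int     type;    /* number, string, ... */
    double  value;   /* current value */
    int     used;    /* flag: referenced anywhere or not */
} Variable;

/* "40 theLoopCounter = 42" then crunches to three fixed-size tokens:
   roughly (symbol-table index, assignment-function token, constant token). */
typedef uint32_t Token;
Token line40[3];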

So what I effectively did was to write a one-pass compiler and a virtual machine to execute the resulting stream of fixed-length (32-bit) tokens. There is a run-time fixup which scans the tokenised code for GOTO, GOSUB and calls to functions and procedures, and fills in pointers in their symbol table entries pointing at their target lines. (Remember putting common functions/subroutines at the start to make them run faster because of the linear search? I didn’t want any of that old nonsense!)
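
The fixup itself is just one linear walk over the token stream before RUN. A sketch of the idea - the token layout and names here are assumptions for illustration, not the real format:

#include <stdint.h>
#include <stddef.h>

/* Assumed encoding, for illustration only: top 8 bits are the operation,
   low 24 bits index the symbol table. Not the real token format. */
#define OP(t)  ((t) >> 24)
#define SYM(t) ((t) & 0xFFFFFF)
enum { OP_GOTO = 1, OP_GOSUB, OP_PROC_CALL };

typedef struct {
    char     *name;     /* e.g. "100" or "myProcedure" */
    uint32_t *target;   /* filled in by the fixup: first token of the target line */
} SymSlot;

/* Pre-RUN fixup: one linear scan, after which every jump already knows
   where it lands - no searching for line numbers while the program runs. */
void fixup(uint32_t *code, size_t ntok, SymSlot *symtab,
           uint32_t *(*findTarget)(const char *name)) {
    for (size_t i = 0; i < ntok; i++) {
        switch (OP(code[i])) {
        case OP_GOTO:
        case OP_GOSUB:
        case OP_PROC_CALL:
            symtab[SYM(code[i])].target = findTarget(symtab[SYM(code[i])].name);
            break;
        }
    }
}

After that pass a GOTO is a pointer chase rather than a search, no matter where the target line sits in the program.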

Unlike the traditional BASICs, I never stored the binary form to disk - mostly because, as I was developing, I was changing the token values on a daily basis, and also because the program became 2 parts - the tokenised code and the symbol table - and, being a lazy programmer, it was just easier to store the textual part. Loading a test 10,000 line program on a 900MHz Raspberry Pi 1 caused no noticeable delay while it was loaded and tokenised.

Anyway, I love old BASICs, but a BASIC of the 80’s… That’s (to me) BBC Basic, and not MS Basic, which is a product of the 70’s. BBC Basic has long variable names (with all characters significant) and a proper integer data type. It was also faster than all other interpreted BASICs of the time (for the same CPU configuration), but then it came a few years after the MS Basics, so it had a lot to learn from.

So, keep going - always nice to see someone else’s interpretation and what they do. I think there is still a place for BASIC - especially in the interactive versions, but there is an avalanche of dislike for it these days…

Cheers,

-Gordon

OS-9 (6809) has a very nice version of BASIC. It could compile to P-code or to native code. At the time I had a CoCo II with dual 360k floppies - not a great development system.

Another language that uses line numbers is APL, though they are per function definition rather than global.

I am always interested in how small a system can be. VTL-2 (Very Tiny Language) can be considered a “BASIC Jr” and took up 3 PROM sockets on the 6800-based Altair - just 768 bytes of code. It uses the APL trick of jumping to a line number calculated from an expression, or falling through to the next line if the expression evaluates to 0.
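
The control flow behind that trick is easy to picture - here it is as a plain C sketch of the idea (not VTL-2 source):

/* Computed goto, VTL-2/APL style: the branch target is the value of an
   expression, and a value of 0 means "fall through to the next line". */
int next_line(int current_line, int branch_value) {
    if (branch_value == 0)
        return current_line + 1;   /* expression was 0: just continue */
    return branch_value;           /* otherwise jump to that line number */
}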

Tiny BASIC is written in just 120 lines (443 bytes) of code for a virtual machine, which is itself implemented in very little machine language. That includes all the resources necessary to edit the BASIC code.

Forth, on the other hand, had screen numbers (instead of a file system) and a visual editor (not counting very early versions based on punch cards).

PDP-8 fans have FOCAL, with decimal line numbers if I remember right. Your 2K gets you floating point as well.

Well, that sounds like it would cause trouble for those of us who use REM statements as a handy place to store machine-language routines. :-)

I’m not convinced that showing you the number actually being used, rather than the number you typed in (which is not what will be used), is such a big problem, actually.

Which is fair enough; that technique was really a product of the environment in which things ran.

I agree with your taxonomy here, but it wasn’t just a matter of learning from contemporary computing science (where the authors of MS BASIC did a pretty dismal job; the state of the art in interpreters was significantly better in 1975, or even 1965, than MS BASIC), but also very much a matter of machine resources. The original MS BASIC could be loaded into 4 KB of RAM (typical for an “expanded” Altair 8800) and still leave room for a small program; by the time BBC BASIC came around they were looking at having 8 KB or more of RAM plus 8-16 KB of ROM, two or three times the space available even for Microsoft’s 8 KB Altair BASIC.

Some may dump on the subsequent MS programmers for not fixing a whole lot of the, uh, infelicities in the original BASIC code, but I’m inclined to give them a pass on that. We also have to remember that at the time people weren’t generally using extensive automated test frameworks that provide safety when doing major code redesigns. And even if they had been, it was not unusual for clients to depend on what were pretty inarguably bugs in implementations.

What better kind of runtime, one that would work in such a constrained environment, are you thinking of?

There’s “state of the art” and “what can we do with 8K of RAM”. Those are not necessarily congruent.

Agreed that BBC Basic existed in a much larger playing field: 16k+16k ROM and 16 or 32k of RAM, whereas my UK101 was I think 10k ROM and 4k or 8k of RAM.

It turns out Basic is a broad church indeed, from Tiny Basic to GW-Basic and so on and on. (The present BBC Basic offering uses no line numbers.)

Did I mention vanity project? :-)

At the time, I felt I’d done enough assembler to last me a lifetime (famous last words), and the target was (initially) Linux. But yes, stuffing code into a REM was not uncommon on some systems…

This happened during a live demo I was giving - one of the people I was showing it to was horrified that the computer changed his code…

Subsequent to this, I added a built-in editor to (make it easier to) write programs without line numbers - that’s just a nano-like text editor, so for the most part it’s (now) not an issue, but there are still a few people doing it the older way with line numbers, so I try to cater for them by storing what they type…

One of the most surprising moments for me was during another demo, when one person walked out as soon as I said line numbers were optional.

So you can’t please everyone…

-Gordon

What, so he was perfectly comfortable with the computer changing his code so long as he wasn’t told it was changed?

Yeah, well. It may not be generally true that, as Dijkstra says, being taught BASIC “mutilates the mind beyond recovery” (it arguably didn’t for me) but I too have certainly met people who were so affected by their initial experience with BASIC that it blinded them to any other way of programming.

One guy I met recently claimed that (+ 2 3) was impossible to understand, and nobody would ever get that, but PRINTA$;B$ was perfectly naturally understood as a concatenation of string variables, even by someone with no computer programming experience at all. He also claimed that functions were quite unimportant for program organization.

Have a look at any mid-60s LISP interpreter. Up through then, four kilowords of memory was considered a pretty reasonable amount to have.

There are plenty of parts of MS BASIC one could use as an example of poor design, but one that comes to mind in particular, since I happened to have written an MS BASIC detokenizer in the past week, is their tokenization/interpretation split. Tokenization does some of the lexical analysis and some of the parsing, and then the interpreter does the remainder of the lexical analysis and parsing on the tokenized code. It’s split in a very weird and non-intuitive way for anybody with a basic understanding of parsing, and leads to all sorts of interesting problems and bugs. A particularly egregious example I found when testing the behaviour of MSX-BASIC is:

10 a=1234
20 data ",",a"b,"c,d:printa
30 read x$:print x$:goto 30

Before looking at the actual output below, see if you can guess what this program prints (always a fun game with BASIC!) before it terminates with the expected Out of DATA in 30 error.

The output:
 1234
,
a"b
c,d:æA

In other words, various parts of the interpreter disagree on which parts of line 20 are data and which parts are code to be executed, and even on how string literals are to be read.

(If you’re wanting to reproduce this, it was run on openMSX emulating a Sony HB-F500P with ROM images from here. That’s a western character set MSX machine with MSX BASIC version 2.0, but I suspect you can reproduce it in other MS BASICs, too.)


If you’re going to store your program in a processed form, it would make a lot more sense to simply do a proper parse of each line and store the AST. (@Kerri_Shotts may want to consider doing this.) Not only would this let you have very compact program text in memory and on disk without having to have source code that looks like this (taken from an actual program):

30A=1:FORJ=0TO199:D=1:READA$:_KLEN(B,A$,0):IFB>0THENFORI=1TOB:LOCATED,13:_KMID(B$,A$,I,1)ELSE60

but it would also make it a lot easier to avoid “the program can’t agree with itself on what the code means” problems like the one I gave as an example above.
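
For concreteness, “store the AST” can be as modest as a small tagged-union node per statement and expression - a generic sketch, not a prescription for any particular interpreter:

/* Generic AST-node sketch: each line is parsed once into a tree like this,
   and LIST and RUN both walk the same tree - so they cannot disagree about
   which bytes are DATA and which are code. */
typedef enum {
    N_NUMBER, N_STRING, N_VARIABLE,                    /* leaves */
    N_BINOP, N_ASSIGN, N_PRINT, N_DATA, N_FOR, N_GOTO  /* interior nodes */
} NodeKind;

typedef struct Node {
    NodeKind kind;
    union {
        double number;                 /* N_NUMBER */
        char  *text;                   /* N_STRING, N_VARIABLE, raw N_DATA items */
        struct {                       /* N_BINOP, N_ASSIGN and statements */
            int op;
            struct Node *left, *right;
        } pair;
    } u;
    struct Node *next;                 /* next statement on the same line */
} Node;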

When Apple announced the Swift programming language, one of its main features was that it had “an intuitive syntax”. I was very interested until I saw it was just another C. Just like until the late 1980s every new language was just another Pascal. I think people don’t know the difference between “intuitive” and “familiar”.

Well, they’re all just another Algol, aren’t they?

I’m willing to give them a pass on “intuitive syntax”; that just means that the people to whom it’s intuitive don’t have to think too hard about it; they have already internalized an understanding of how it works. And that’s probably true of most programmers in the world today when it comes to Swift syntax. (That of course doesn’t make it a good syntax, one in which you can clearly and concisely express what you’re trying to say.)

Now, ‘call by name’ is an Algol feature not found in C. It made sense when subroutine calls could be expanded as inline code.

Mind that BASIC lines are just a linked list (forward linked only) with line numbers as a unique key. There’s no need to store them in sequence or to maintain a buffer of the entire file for editing.
(In theory, you could store the individual lines in a db table on your host system, using a schema like “line-no/ID, text, next-line-ID” and a trigger to update the next-line links on any insert/delete.)

Interesting possibility… on the 6502 Basic I’m familiar with, it’s not a linked list, but a concatenation of variable length records. The 6502 likes to have one-byte numbers to add. It could be that in other instruction sets a two-byte address record would be used instead. Seems like it might be worth studying some implementations.

E.g., in Commodore BASIC it’s a plain linked list record structure:

2 bytes … memory address of next line (link, 00 00 indicates EOF)
2 bytes … line number (binary, max. line number is decimal 63999)
(…) variable length tokenized BASIC text (max length 80 bytes)
1 byte … binary zero as EOL
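
Given that layout, walking the program is just chasing the links. A small C sketch over a raw memory image (start address and bounds checking left out, function name made up):

#include <stdint.h>
#include <stdio.h>

/* Walk a Commodore BASIC program image: each line is
   [2-byte link][2-byte line number][tokenised text][0x00],
   and a link of 0x0000 ends the chain. mem is the 64K address space. */
void list_line_numbers(const uint8_t *mem, uint16_t start) {
    uint16_t addr = start;                           /* e.g. 0x0801 on a C64 */
    for (;;) {
        uint16_t link = mem[addr] | (mem[addr + 1] << 8);
        if (link == 0)
            break;                                   /* 00 00: end of program */
        uint16_t lineno = mem[addr + 2] | (mem[addr + 3] << 8);
        printf("%u\n", (unsigned)lineno);
        addr = link;                                 /* follow the forward link */
    }
}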

Compare: <Now Go Bang!> BASIC (Re)Numbering - with Commodore


P.S.: Since Commodore/MS BASIC owes some to DEC BASIC, it may well be that it’s about the same in the earlier DEC implementations. (However, I haven’t had any reason yet to read up on this.)

Well, yes, but once you have a linked list, assigning permanent line numbers to lines adds more state, not less.

As for there being no need to store them in sequence, this is true in both cases. MS BASIC still stores them in sequence anyway (in exactly the way you showed in another answer), though, compacting the entire program text heap every time you edit the program so that it can keep a separate scalar variable heap immediately above it. (That’s followed by an array variable heap, the stack and yet another heap for string variables.) That’s part of what I’m talking about when I mention that the implementation is not really using great techniques; the five-heap system may have struck them as easy to implement at the time, but it’s probably not, and it’s certainly not as flexible as having a single heap for all of that stuff and just compacting as you need to. There’s nothing like a program failing even though you have plenty of memory available because you chose the wrong value for a CLEAR statement (or didn’t do one at all).

So I am still finding more fun stuff here. It turns out that ON INTERVAL=30 is tokenized in MSX BASIC not as

[ ON, " ", INTERVAL, EQ, 30 ]

as you’d expect, but as

[ ON, " ", INT, "E", "R", VAL, EQ, 30 ]

Yes, it is indeed two tokens used for other purposes (the entirely unrelated INT() and VAL() functions) with a couple of ASCII characters dropped in between. And it ends up using three times the space it should (and almost as much as just spelling it out) because both INT and VAL are two-byte tokens.

Not to mention making life interesting when you’re trying to slightly prettify things by adding appropriate spaces to ONINTERVAL=30 on detokenization.
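
Just to illustrate why it comes out that way, here’s a toy cruncher doing first-prefix-match over a keyword table that has no INTERVAL entry - the table contents and token values are made up, and the real MSX cruncher is of course more involved:

#include <stdio.h>
#include <string.h>

/* Toy keyword cruncher: first-prefix-match over the table, in table order.
   Table contents and token values are made up for the illustration. */
static const char *kw[]  = { "INT", "VAL", "ON", "=" };
static const int   tok[] = { 0x85, 0x94, 0x95, 0xEF };

int main(void) {
    const char *src = "INTERVAL";
    while (*src) {
        int matched = 0;
        for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++) {
            size_t n = strlen(kw[i]);
            if (strncmp(src, kw[i], n) == 0) {     /* first match wins */
                printf("[token 0x%02X %s] ", tok[i], kw[i]);
                src += n;
                matched = 1;
                break;
            }
        }
        if (!matched)
            printf("'%c' ", *src++);               /* kept as a plain character */
    }
    printf("\n");  /* prints: [token 0x85 INT] 'E' 'R' [token 0x94 VAL] */
    return 0;
}

Put the longer keyword ahead of its prefix in the table and the split disappears - the INPUT#-before-INPUT ordering that comes up below.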

Good heavens. What curious design.

That better explains the insanity of that DATA statement thing you had earlier.

OK, so “INT” triggers as a token, so does “E” as in “1E6”, “VAL” is a token already, and they didn’t care about introducing “RVAL”. - Or they could have put “INTERVAL” in the token list before “INT”, like the longer “INPUT#” comes before “INPUT”. (Which is, BTW, why with Commodore abbreviations “iN” gives you “INPUT#” and not “INPUT”, as might be expected.) Seems to be collateral damage of expanding the vocabulary while wanting to keep the token list compatible at the same time. (And the poor interpreter now has to carry the extra load.)
Does “ON INT E R VAL” also work as a command?