Building a BASIC Interpreter, '80s style

Did I mention vanity project? :slight_smile:

At the time, I felt I’d done enough assembler to last me a lifetime (famous last words), and the target was (initially) Linux. But yes, stuffing code into a REM was not uncommon on some systems …

This happened during a live demo I was giving - one of the people I was showing it to was horrified that the computer changed his code…

Subsequent to this, I added a built-in editor to (make it easier to) write programs without line numbers - that’s just a nano-like text editor, so for the most part it’s (now) not an issue, but there are still a few people doing it the older way with line numbers, so I try to cater for them by storing what they type…

One of the most surprising times for me was when I was doing another demo and one person walked out when I said line numbers are optional.

So you can’t please everyone…

-Gordon

What, so he was perfectly comfortable with the computer changing his code so long as he wasn’t told it was changed?

Yeah, well. It may not be generally true that, as Dijkstra says, being taught BASIC “mutilates the mind beyond recovery” (it arguably didn’t for me) but I too have certainly met people who were so affected by their initial experience with BASIC that it blinded them to any other way of programming.

One guy I met recently claimed that (+ 2 3) was impossible to understand, and nobody would ever get that, but PRINTA$;B$ was perfectly naturally understood as a concatenation of string variables even by someone with no computer programming experience at all. He also claimed that functions were quite unimportant for program organization.

Have a look at any mid-60s LISP interpreter. Up through then, four kilowords of memory was considered a pretty reasonable amount to have.

There are plenty of parts of MS BASIC one could use as an example of poor design, but one that comes to mind in particular, since I happened to have written an MS BASIC detokenizer in the past week, is their tokenization/interpretation split. Tokenization does some of the lexical analysis and some of the parsing, and then the interpreter does the remainder of the lexical analysis and parsing on the tokenized code. It’s split in a very weird and non-intuitive way for anybody with a basic understanding of parsing, and leads to all sorts of interesting problems and bugs. A particularly egregious example I found when testing the behaviour of MSX-BASIC is:

10 a=1234
20 data ",",a"b,"c,d:printa
30 read x$:print x$:goto 30

Before looking at the actual output below, see if you can guess what this program does (always a fun game with BASIC!) before it terminates with the expected Out of DATA in 30 error.

 1234
,
a"b
c,d:æA

In other words, various parts of the interpreter disagree on which parts of line 20 are data and which parts are code to be executed, and even on how string literals are to be read.

(If you’re wanting to reproduce this, it was run on openMSX emulating a Sony HB-F500P with ROM images from here. That’s a western character set MSX machine with MSX BASIC version 2.0, but I suspect you can reproduce it in other MS BASICs, too.)


If you’re going to store your program in a processed form, it would make a lot more sense to simply do a proper parse of each line and store the AST. (@Kerri_Shotts may want to consider doing this.) Not only would this give you very compact program text in memory and on disk without forcing you to keep source code that looks like this (taken from an actual program):

30A=1:FORJ=0TO199:D=1:READA$:_KLEN(B,A$,0):IFB>0THENFORI=1TOB:LOCATED,13:_KMID(B$,A$,I,1)ELSE60

but it would also make it a lot easier to avoid “the program can’t agree with itself on what the code means” problems like the one I gave as an example above.
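
For flavour, here’s a minimal sketch in Python of the “parse once, store the AST” idea, using a made-up micro-grammar for a single assignment; none of this is any real BASIC’s grammar or anyone’s actual implementation, the names (tokenize, parse_assign) are invented for illustration:

import re
# One-shot lexer for a made-up micro-grammar (NOT any real BASIC's rules).
TOKEN = re.compile(r'\s*(?:(\d+)|([A-Za-z][A-Za-z0-9]*\$?)|(.))')
def tokenize(src):
    for num, name, ch in TOKEN.findall(src):
        if num:
            yield ('num', int(num))
        elif name:
            yield ('name', name.upper())
        elif ch.strip():
            yield ('op', ch)
# Parse "name = (number | name)" once, at entry time, into a tiny AST node
# that the runtime can walk directly instead of re-scanning text.
def parse_assign(src):
    (k1, target), (_, eq), (k2, value) = list(tokenize(src))
    assert k1 == 'name' and eq == '=', 'not an assignment'
    return ('assign', target, (k2, value))
print(parse_assign('a=1234'))   # ('assign', 'A', ('num', 1234))

With nodes like that stored instead of half-tokenized text, lexing and parsing happen in exactly one place, so DATA, READ and the tokenizer can’t disagree about where a statement ends, and the detokenizer becomes a plain pretty-printer over the stored tree.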

1 Like

When Apple announced the Swift programming language, one of its main features was that it had “an intuitive syntax”. I was very interested until I saw it was just another C. Just like until the late 1980s every new language was just another Pascal. I think people don’t know the difference between “intuitive” and “familiar”.

2 Likes

Well, they’re all just another Algol, aren’t they?

I’m willing to give them a pass on “intuitive syntax”; that just means that the people to whom it’s intuitive don’t have to think too hard about it; they have already internalized an understanding of how it works. And that’s probably true of most programmers in the world today when it comes to Swift syntax. (That of course doesn’t make it a good syntax, one in which you can clearly and concisely express what you’re trying to say.)

1 Like

Now ‘call by name’ is an Algol feature, not found in C. It made sense when a subroutine call could be expanded as inline code.
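
For anyone who hasn’t met it: a call-by-name parameter re-evaluates the argument expression at every use inside the callee. A rough Python approximation using thunks (just the effect, not Algol’s actual mechanism; all names here are invented):

# Call by name approximated with thunks: the argument expression is
# re-evaluated at every use inside the callee (roughly Jensen's device).
def sum_by_name(set_index, lo, hi, expr):
    total = 0
    for v in range(lo, hi + 1):
        set_index(v)       # assign to the "name" parameter i
        total += expr()    # re-evaluate the passed-in expression each time
    return total
i = 0
def set_i(v):
    global i
    i = v
# Sums i*i for i = 1..5, i.e. 55: the expression was passed, not its value.
print(sum_by_name(set_i, 1, 5, lambda: i * i))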

Mind that BASIC lines are just a linked list (forward linked only) with line numbers as a unique key. There’s no need to store them in sequence or to maintain a buffer of the entire file for editing.
(In theory, you could store the individual lines in a db table on your host system, using a schema like “line-no/ID, text, next-line-ID” and a trigger to update the next-line links on any insert/delete.)
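
As a rough illustration of that editing model (a Python sketch, in-memory rather than the db-table variant, and not any particular interpreter’s actual code), everything is just relinking:

# Forward-linked line store: a line is (text, number of the next line).
# Inserting or deleting only touches the links, never the other lines.
lines = {}       # line number -> (text, next line number or None)
first = None     # number of the first line, or None for an empty program
def store(num, text):
    global first
    prev, cur = None, first
    while cur is not None and cur < num:          # walk to the insertion point
        prev, cur = cur, lines[cur][1]
    succ = lines[cur][1] if cur == num else cur   # replacing vs. inserting
    lines[num] = (text, succ)
    if prev is None:
        first = num
    else:
        lines[prev] = (lines[prev][0], num)
def listing():
    cur = first
    while cur is not None:
        text, nxt = lines[cur]
        print(cur, text)
        cur = nxt
store(10, 'PRINT "HELLO"')
store(30, 'GOTO 10')
store(20, 'PRINT "WORLD"')    # lands between 10 and 30 just by relinking
listing()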

1 Like

Interesting possibility… on the 6502 Basic I’m familiar with, it’s not a linked list, but a concatenation of variable length records. The 6502 likes to have one-byte numbers to add. It could be that in other instruction sets a two-byte address record would be used instead. Seems like it might be worth studying some implementations.

2 Likes

E.g., in Commodore BASIC it’s a plain linked list record structure:

2 bytes … memory address of next line (link, 00 00 indicates EOF)
2 bytes … line number (binary, max. line number is decimal 63999)
(…) variable length tokenized BASIC text (max length 80 bytes)
1 byte … binary zero as EOL

Compare: <Now Go Bang!> BASIC (Re)Numbering - with Commodore


P.S.: Since Commodore/MS BASIC owes some to DEC BASIC, it may well be that it’s about the same in the earlier DEC implementations. (However, I haven’t had any reason to read up on this yet.)

1 Like

Well, yes, but once you have a linked list, assigning permanent line numbers to lines adds more state, not less.

As for there being no need to store them in sequence, this is true in both cases, though MS BASIC still stores them in sequence anyway (in exactly the layout you showed in another answer), compacting the entire program text heap every time you edit the program so that it can keep a separate scalar variable heap immediately above it. (That’s followed by an array variable heap, the stack and yet another heap for string variables.) That’s part of what I’m talking about when I mention that the implementation is not really using great techniques; the five-heap system may have struck them as easy to implement at the time, but it probably wasn’t, and it’s certainly not as flexible as having a single heap for all of that and compacting as needed. There’s nothing like a program failing even though you have plenty of memory available because you chose the wrong value for a CLEAR statement (or didn’t do one at all).

So I am still finding more fun stuff here. It turns out that ON INTERVAL=30 is tokenized in MSX BASIC not as

[ ON, " ", INTERVAL, EQ, 30 ]

as you’d expect, but as

[ ON, " ", INT, "E", "R", VAL, EQ, 30 ]

Yes, it is indeed two tokens used for other purposes (the entirely unrelated INT() and VAL() functions) with a couple of ASCII characters dropped in between. And it ends up using three times the space it should (and almost as much as just spelling it out) because both INT and VAL are two-byte tokens.

Not to mention making life interesting when you’re trying to slightly prettify things by adding appropriate spaces to ONINTERVAL=30 on detokenization.

Good heavens. What curious design.

That better explains the insanity of that DATA statement thing you had earlier.

OK, so “INT” triggers as a token, so does “E” as in “1E6”, “VAL” seems to be a token already and they didn’t care about introducing “RVAL”. – Or they could have “INTERVAL” in the token list before “INT”, like the longer “INPUT#” comes before “INPUT”. (Which is, BTW, why with Commodore abbreviations “iN” gives you “INPUT#” and not “INPUT” as may be expected.) Seems to be collateral damage of expanding the vocabulary, but wanting to keep the token list compatible at the same time. (And the poor interpreter has now to carry the extra load.)
Does “ON INT E R VAL” also work as a command?

That’s not even the worst of it: since the heaps for program text, simple variables and arrays adjoin one another, defining/introducing a simple variable moves the entire array heap, which is a complex linked list of its own, at runtime!

(So, for best performance in MS or Commodore BASIC, first have all subroutines at the beginning of the program – lines are searched from top to bottom –, then define all your simple variables in the order of frequency of use, then define [DIM] all your arrays, and then start your program.)
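
To put rough numbers on that, here’s a toy model in Python of the stacked heaps; the region sizes are invented and the layout is simplified, so this is only meant to show where the block moves come from:

# Toy model of the stacked heaps (invented sizes, not a real memory map):
# program text, then scalars, then arrays, packed end to end; growing any
# lower region means block-moving everything that sits above it.
layout = {'program': 2000, 'scalars': 0, 'arrays': 0}
order = ['program', 'scalars', 'arrays']
def grow(region, nbytes):
    above = sum(layout[r] for r in order[order.index(region) + 1:])
    layout[region] += nbytes
    return above            # bytes that have to be copied upward
grow('arrays', 2500)        # DIM a few arrays early: nothing above them yet
print(grow('scalars', 7))   # later, first use of a new scalar variable:
                            # 2500 bytes of array records get moved at runtime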

Actually, at least in the later MS BASICs they tokenize numbers, so 1.23E-4 will be a token followed by the binary floating point value.

Yeah, PEG parsing is the obvious and easy way to go with things like this; I discovered the idea on my own before I’d ever heard of parsing expression grammars. It’s a great failure of computing science education that, then as now, it turns out graduates who have no idea of even the basics of parsing, which you’d think (or at least I do) is a core part of computing science.

And that’s how you end up with the utter “tokenization” vs. “interpreter parser” mess that creates far more problems than it solves.

Oh, I forgot to mention that. No, ON INT ERVAL=30, which is what my detokenizer spit out in “expansion” (add spaces) mode, is a syntax error.

So PEGs are context-free grammars (à la Chomsky), but with internal context provided by the order of selective choices?

BTW, I wrote a few blog posts on tokenizing, memory allocation, variable and string representation in Commodore BASIC (which is just a “lean” version of MS BASIC 1.0). The pièce de résistance is an exploit of string pointers for a fast, binary FIFO queue: Now Go Bang! – mass:werk blog

1 Like

Well, yes, if I’m reading you right, but I’d be careful with the terminology there; the “internal context” is the context of the parser itself (things like, “what’s the remaining input?”), and that’s not related to “context” in the Chomsky grammar classification.

The core thing about PEGs is, as Wikipedia states, “the choice operator selects the first match in PEG.” This makes parsing something like a keyword very easy: just sort your list of keywords by length and then go through them from longest to shortest, comparing each with the start of what you’re currently parsing, and take the first match. PEG parsers are in fact so easy to write that you often don’t even need a framework because writing it out from scratch is easier than finding and downloading a library.
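
A minimal sketch of that in Python, with an abridged, invented keyword list; the point is just the ordered choice:

# Ordered choice over a keyword list, PEG style: longest keywords first,
# first match wins, no backtracking.
KEYWORDS = sorted(['INTERVAL', 'INPUT#', 'INPUT', 'INT', 'VAL', 'ON'],
                  key=len, reverse=True)
def match_keyword(src, pos):
    for kw in KEYWORDS:
        if src.upper().startswith(kw, pos):
            return kw, pos + len(kw)
    return None, pos        # no keyword here; caller treats it as plain text
print(match_keyword('ONINTERVAL=30', 0))   # ('ON', 2)
print(match_keyword('ONINTERVAL=30', 2))   # ('INTERVAL', 10), not INT + 'E' 'R' + VAL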

Yes, I read those when they came out and they were wonderful. The excellent coloured hex dumps were a particularly beautiful touch.

2 Likes

Yes, we tend to forget that path selection is always unbalanced and that this is a powerful tool.
In the case of INTERVAL or similar, if you want to augment your vocabulary but maintain your existing list of tokens (for whatever reasons, like backward compatibility, internal standardization, not wanting to change the interpreter too much, etc.), a second lookup table does the job and separates concerns. The first table defines your path of parsing, resulting in an intermediate token, which is then translated into the effective token. In that first table, INTERVAL comes before INT, and the second lookup table then maps this index to a token at the very end of the list. (In all probability, your token list index is already in some index register, so the additional lookup is just a few processor cycles at edit time and saves a lot of complexity at runtime. Which may be worth the ROM space.)
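
A sketch of that two-table idea in Python; the keyword list is abridged and all the token byte values here are placeholders for illustration, not necessarily anything a real MS BASIC emits:

# Two tables: the first is ordered for parsing (longest first), the second
# maps the match to whatever byte(s) the interpreter already expects.
PARSE_ORDER = ['INTERVAL', 'INT', 'VAL']
EMIT = {'INTERVAL': b'\xff\xc6', 'INT': b'\xff\x85', 'VAL': b'\xff\x94'}
def tokenize_keyword(src):
    for kw in PARSE_ORDER:                    # first table fixes the parse path
        if src.upper().startswith(kw):
            return EMIT[kw], src[len(kw):]    # second table supplies the bytes
    return None, src
print(tokenize_keyword('INTERVAL=30'))   # (b'\xff\xc6', '=30'): one token, not INT+'E'+'R'+VAL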

I am not totally clear on what you’re saying there, but I strongly suspect that the tokenizer was simply left unchanged and the interpreter parsing the tokenized data was just tweaked to match FF 85 45 52 FF 94 before it matched FF 85.
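
If that’s what happened, it’s the same ordered-choice trick one level down, on the token bytes rather than the keywords; roughly this (a Python sketch using the byte values above, certainly not the actual ROM logic):

# Ordered choice on tokenized bytes: try the longer sequence first.
SEQUENCES = [(b'\xff\x85ER\xff\x94', 'INTERVAL'),   # FF 85 'E' 'R' FF 94
             (b'\xff\x85', 'INT'),
             (b'\xff\x94', 'VAL')]
def next_symbol(code, pos):
    for seq, name in SEQUENCES:                     # longest match wins
        if code.startswith(seq, pos):
            return name, pos + len(seq)
    return chr(code[pos]), pos + 1                  # any other byte: plain ASCII
print(next_symbol(b'\xff\x85ER\xff\x94=30', 0))     # ('INTERVAL', 6)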

To clarify, this was a suggested strategy (to optimize both on parsing paths and order of tokens presented to the interpreter), but not an assessment of what really happened.