Building a BASIC Interpreter, '80s style

mauve · August 22, 2020, 11:12am

I was spoiled by my first real computer (dated from 1982, got it in 1984). The Sord M68 offered several Basics, but the best was called Basic II; while the source was tokenised to some extent, it had no line numbers, but it had named subroutines, a full screen editor, and a real compiler that first converted programs to “relocatable binaries” (similar to .obj files). The output modules were thereafter linked with a runtime to form an executable in native machine language. The ultimate refinement was that you had the “RB” files to recreate the runtime module, so you could tailor the memory footprint to your needs by adding or removing “libraries”; for instance, a program not using ISAM files wouldn’t need that module in the runtime consuming RAM.
To say the least, after being exposed to such a futuristic dev environment, I was rather non-plussed by friends’ C64s and what-have-you toy computers.

NoLand · August 22, 2020, 1:10pm

Commodore BASIC had a token for “GO” to be used for a compound “GO TO” synonymous to “GOTO” – this is pure luxury! However, support for ISAM files? What’s next? Graphics commands, subroutines,…?

Honestly, I didn’t know the Sord had such a powerful BASIC.

mauve · August 22, 2020, 2:43pm

It was. Moreover the graphics commands you mention were a distinct language (called SGL, Sord Graphic Language) with calls from all programming languages available on the machine, by simply printing a string after switching the computer to graphic mode. From Basic II you’d issue a Print statement like (syntax may not be entirely accurate after 35 years):

Print “CIRCLE <X> <Y> <RADIUS> [<BorderColor> <FillColor> <Pattern>]”

jecel · August 22, 2020, 4:06pm

That depends on when you encountered minicomputers. Many early installations had no storage and you would just read in paper tapes either from a dedicated reader or using a teletype terminal. You could punch another tape for the result.

As time went on it became common to have disks for minicomputers except for embedded applications. Mainframes had gone through a similar evolution years before and microcomputers did it again a decade later. We call this the “Law of computer recapitulation”.

About line numbers, they are the only practical way to edit text using a teletype. With a video terminal they can still save a little memory, but a simple screen editor can be pretty small.

whartung · August 22, 2020, 7:02pm

Naturally the line numbers weren’t solely used for editing, but you can see that as soon as the editing requirement vanished (i.e. using a separate, dedicated editor), the line numbers pretty much vanished as well. The primary legacy being used for ON ERROR handling, since once line numbers went away, symbolic labels started showing up in their stead (GOSUB POSTINVOICE), but symbolic labels didn’t work well with the legacy ON ERROR RESUME style of error handling. (Notably checking the ERL error line for the source of the error.)

When I was doing BASIC-PLUS on the VAX, the only reason I had line numbers was specifically for error handling, as I could not find a way to simply get status codes from calls, they would fault using the ON ERROR mechanism. Otherwise I would not have used them at all.

And any need for line numbers specifically to facilitate quality of life issues with punch cards (as in “oops, I dropped the deck”) vanished quite early.

cjs · August 23, 2020, 9:30am

If you’re talking about having line numbers in the program text, or even assigning line numbers to specific program text in some other way, no, not at all. Plenty of editors in the 70s, on both minis and micros, let you edit languages without line numbers. Three examples that come to mind would be editing C program text in ed on Unix (often enough done on an actual ASR-33, I’m sure), and assembly language program text in ED.COM on CP/M or EDASM on an Apple II.

I used ed extensively for C and other code in the early '80s, and to find a specific line I used searches far more than I used line numbers.

NoLand · August 23, 2020, 11:46am

But it simplifies things a lot and you can get away with minimal state for a session.
(I somewhat recall an editor from the 1980s on IBM VM/CMS using line numbers for any kind of text. So you could edit your C-files using line numbers… In my case, it was SPSS/X.)

whartung · August 23, 2020, 5:40pm

A couple of distinctions between line-number-based editors, especially simple ones, and something like ed are that there’s no concept of a current line, or even a current place in the file. And there’s no searching. Searching can be particularly problematic on a token based system like BASIC.

Most line editors indeed have “line numbers”, but not in the same sense as a system like BASIC. ed knows what “lines 5-10” are, for example. But BASIC line numbers are subtly different in that they don’t represent position, per se; rather they represent sequence and order. You can enter line 10, then line 100, then line 500, which are quite different from the 10th line, 100th line, and 500th line.
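To make the label-versus-position distinction concrete, here is a minimal sketch of a BASIC-style program store. The `Program` class and its method names are made up for illustration and do not come from any particular interpreter; the only state kept is a sorted list of numbers and the text for each.

```python
import bisect

class Program:
    """Program text keyed by BASIC-style line numbers (labels, not positions)."""

    def __init__(self):
        self.numbers = []   # line numbers, kept sorted
        self.text = {}      # line number -> statement text

    def enter(self, line):
        """Handle one typed line such as '100 PRINT A' or a bare '100'."""
        head, _, rest = line.strip().partition(" ")
        number = int(head)
        rest = rest.strip()
        if rest:
            if number not in self.text:
                bisect.insort(self.numbers, number)
            self.text[number] = rest          # insert or overwrite
        elif number in self.text:             # a bare number deletes the line
            self.numbers.remove(number)
            del self.text[number]

    def listing(self):
        return [f"{n} {self.text[n]}" for n in self.numbers]

prog = Program()
for typed in ["100 PRINT A", "10 A = 1", "500 END", "100 PRINT A * 2"]:
    prog.enter(typed)
print(prog.listing())
```

Lines arrive in any order and end up sorted by label; line 100 here is the second line of the program, not the hundredth.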

Maintaining the state of the current line is hardly a large burden on an editor.

On the Atari 800, the assembler MAC-65 was much like BASIC. It used line numbers, its source was tokenized. I don’t recall if it used the line numbers as labels for the assembly.

cjs · August 24, 2020, 11:16am

I don’t really see how it simplifies things or reduces in any significant way the amount of state you need to hold.

From what Microsoft calls TXTTAB (the pointer to the first line of the program text) onwards, you have a fairly huge amount of state that seems to me more or less the same state as a standard line editor would hold, including a “current line” for use when the interpreter is running even if you don’t support the CONTINUE statement.
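As a rough illustration of the state that hangs off TXTTAB, the sketch below walks a program stored as a chain of lines, each assumed to be a 2-byte little-endian link to the next line, a 2-byte line number, the line body, and a terminating zero byte, with a zero link marking the end of the program. That layout is my understanding of the common 8-bit MS BASICs rather than something stated in this thread; the address 0x8001, the helper names, and the untokenised ASCII bodies are all assumptions made for the example.

```python
import struct

def store_program(memory, txttab, lines):
    """Write (number, body) pairs into memory using the assumed layout."""
    addr = txttab
    for number, body in lines:
        next_addr = addr + 4 + len(body) + 1       # link + number + body + 0
        struct.pack_into("<HH", memory, addr, next_addr, number)
        memory[addr + 4:next_addr - 1] = body
        memory[next_addr - 1] = 0                  # end-of-line marker
        addr = next_addr
    struct.pack_into("<H", memory, addr, 0)        # zero link: end of program

def walk_program(memory, txttab):
    """Yield (line_number, body_bytes) by following the next-line links."""
    addr = txttab
    while True:
        (next_addr,) = struct.unpack_from("<H", memory, addr)
        if next_addr == 0:
            return
        (number,) = struct.unpack_from("<H", memory, addr + 2)
        yield number, bytes(memory[addr + 4:next_addr - 1])
        addr = next_addr

memory = bytearray(0x10000)    # a pretend 64 KB address space
TXTTAB = 0x8001                # hypothetical start of program text
store_program(memory, TXTTAB, [(10, b"A=1"), (20, b"PRINT A")])
lines = list(walk_program(memory, TXTTAB))
print(lines)
```

Finding a line by number means walking this chain from the start, which is much the same bookkeeping a simple line editor would need.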

Having each line labeled with its own number and then using that for editing purposes as well may be a reasonably intuitive way to do things for BASIC, but that is very much about the design of the language. In any of the many languages where you organize things by functions (Logo, Lisp/Scheme, most modern languages), rather than by a massive sequence of lines, you can well end up with more or less the same thing except that functions are labeled by names instead of having lines labeled by numbers.

drogon · August 24, 2020, 5:34pm

Following this thread with interest - mostly because I wrote my own BASIC about 10 years back, tweaked it a little (a lot!) when the Raspberry Pi came out and it’s still being used by a few people today (there was even a commercial spin-off called FUZE Basic too - am I the only person to have sold a new BASIC in the 00’s ???)

I decided to make a few changes (from a “classic” BASIC) when I was implementing mine - mostly for efficiency and also to make my life easier… It was also (originally) “My” perfect BASIC - a vanity project if you like. I also had the luxury of writing it in C rather than some assembler and, designing it to run under a modern OS, I was able to use the operating system’s dynamic memory routines rather than try to maintain my own and “garbage collect”. (ie. I use malloc/free)

One change was that I’d not allow multiple statements per line… In the old days we’d cram as many statements as possible onto a line to make it run faster and to take up less RAM. I didn’t need either of those constraints so I dropped that idea. And on the tokenisation front: Everything got tokenised. And I mean everything. Even comments. Numbers were stored in their native binary format. Line numbers are also optional, but are always there, so when you load a line-number-less program it’s loaded starting at line number 1, increment by 1. (There is a comprehensive renumber command).

The LIST command (for those using the traditional interactive line-number interface) then becomes a de-compiler. And here is a problem with tokenising and evaluating everything. Consider

10 A = 1e6

and you type LIST and you get

10 A = 1000000.00004

which isn’t what you typed (I’ve made-up the rounding here - you might actually get 1000000 but the important bit is that it’s not 1e6)

So what do you do? Well, RAM is cheap, so I store the textual part of what was typed along with the binary form of the number… Similarly for strings and comments

20 REM This is a comment
30 A$ = "Fred"

that’s replaced with a single token which is an index into the symbol table which contains the textual value of the comment (or string)
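A minimal sketch of the "keep the typed text next to the binary value" idea follows. It is not the FUZE/RTB code (which is written in C); `NumberToken`, `tokenise` and `list_line` are hypothetical names, and only numeric literals get special treatment here.

```python
import re
from dataclasses import dataclass

@dataclass
class NumberToken:
    value: float   # binary form, used when the program runs
    source: str    # exactly what was typed, used by LIST

# A numeric literal, or an identifier, or any other single character.
TOKEN = re.compile(
    r"(?P<num>\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)"
    r"|(?P<other>[A-Za-z_][A-Za-z_0-9$]*|.)"
)

def tokenise(statement):
    """Split a statement into strings and NumberToken objects."""
    tokens = []
    for m in TOKEN.finditer(statement):
        if m.lastgroup == "num":
            tokens.append(NumberToken(float(m.group()), m.group()))
        else:
            tokens.append(m.group())
    return tokens

def list_line(tokens, keep_source=True):
    """De-compile tokens back to text, as LIST would."""
    parts = []
    for t in tokens:
        if isinstance(t, NumberToken):
            parts.append(t.source if keep_source else repr(t.value))
        else:
            parts.append(t)
    return "".join(parts)

tokens = tokenise("A = 1e6")
print(list_line(tokens))                     # listing from the stored text
print(list_line(tokens, keep_source=False))  # listing from the binary value
```

With the source text kept, LIST shows `A = 1e6` again; regenerating from the binary value alone gives `A = 1000000.0` in Python's formatting, which is the kind of surprise described above.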

I removed any limitation on variable name length. After all, it gets replaced by a single token, no matter how long, so:

40 theLoopCounter = 42

is stored as 3 tokens. The first is an index into the symbol table with the variable - that entry contains its name, type and value (and a flag to indicate used or not used). The 2nd token is the = symbol; this is replaced by a token that indicates a function. The last one is a token that represents a constant.
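Here is one way that three-token encoding could look. The bit layout (8 bits of token kind, 24 bits of symbol-table index), the kind codes and the names `SymbolTable` and `compile_assignment` are all assumptions for the sketch; the post only says the tokens are fixed-length 32-bit values.

```python
import struct

VAR, FUNC, CONST = 1, 2, 3       # hypothetical token kinds
ASSIGN = 0                       # hypothetical function number for "="

class SymbolTable:
    def __init__(self):
        self.entries = []        # one dict per symbol
        self.index = {}          # (kind, name) -> position in entries

    def intern(self, kind, name, value=None):
        """Return the index for a symbol, adding it on first sight."""
        key = (kind, name)
        if key not in self.index:
            self.index[key] = len(self.entries)
            self.entries.append(
                {"kind": kind, "name": name, "value": value, "used": False}
            )
        return self.index[key]

def pack(kind, payload):
    """Build one 32-bit token: kind in the top 8 bits, payload below."""
    return (kind << 24) | payload

def compile_assignment(statement, symbols):
    """Compile 'name = literal' into three packed 32-bit tokens."""
    name, _, literal = (part.strip() for part in statement.partition("="))
    tokens = (
        pack(VAR, symbols.intern(VAR, name)),
        pack(FUNC, ASSIGN),
        pack(CONST, symbols.intern(CONST, literal, float(literal))),
    )
    return struct.pack("<3I", *tokens)

symbols = SymbolTable()
code = compile_assignment("theLoopCounter = 42", symbols)
print(len(code), symbols.entries[0]["name"])
```

The compiled statement is 12 bytes whether the variable is called `x` or `theLoopCounter`; the name lives once in the symbol table.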

So what I effectively did was to write a one-pass compiler and a virtual machine to execute the resulting stream of fixed-length (32-bit) tokens. There is a run-time fixup which scans the tokenised code for GOTO, GOSUB and calls to functions and procedures to insert the pointers into their entries in the symbol table to point to the line of their target. (Remember putting common functions/subroutines at the start to make them run faster due to a linear search? I didn’t want any of that old nonsense!)
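The fixup pass can be sketched roughly like this. The program representation (a list of `(line_number, (op, arg))` pairs) and the tiny `run` loop are invented for the example; the point is only that line-number targets are resolved once, before execution, so GOTO and GOSUB never search at run time.

```python
def fixup(program):
    """Replace GOTO/GOSUB line numbers with direct indexes into program."""
    where = {number: index for index, (number, _) in enumerate(program)}
    resolved = []
    for number, (op, arg) in program:
        if op in ("GOTO", "GOSUB"):
            if arg not in where:
                raise ValueError(f"line {number}: undefined line {arg}")
            arg = where[arg]
        resolved.append((number, (op, arg)))
    return resolved

def run(program):
    """Execute a fixed-up program and return what it printed."""
    output, stack, pc = [], [], 0
    while pc < len(program):
        _, (op, arg) = program[pc]
        pc += 1
        if op == "PRINT":
            output.append(arg)
        elif op == "GOTO":
            pc = arg                 # already an index, no lookup needed
        elif op == "GOSUB":
            stack.append(pc)
            pc = arg
        elif op == "RETURN":
            pc = stack.pop()
        elif op == "END":
            break
    return output

source = [
    (10, ("GOSUB", 100)),
    (20, ("PRINT", "back")),
    (30, ("END", None)),
    (100, ("PRINT", "in sub")),
    (110, ("RETURN", None)),
]
print(run(fixup(source)))
```

Because the targets become indexes, a subroutine at the end of the program costs the same to call as one at the start.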

Unlike the traditional BASICs, I never stored the binary form to disk - mostly because as I was developing I was changing the token values on a daily basis, and also because the program became 2 parts (the tokenised code and the symbol table) and, being a lazy programmer, it was just easier to store the textual part. Loading a test 10,000 line program on a 900MHz Raspberry Pi 1 did not cause any noticeable slowing of the program load and tokenisation as it loaded.

Anyway, I love old BASICs, but a BASIC of the 80’s … That’s (to me) BBC Basic and not MS Basic which is a product of the 70’s. BBC Basic has long variable names (with all characters significant) and a proper integer data type. It was also faster than all other interpreted BASICs of the time (for the same CPU configuration), but it was also a few years after the MS Basics, so they had a lot to learn from.

So, keep going - always nice to see someone else’s interpretation and what they do. I think there is still a place for BASIC - especially in the interactive versions, but there is an avalanche of dislike for it these days…

Cheers,

-Gordon

oldben · August 24, 2020, 6:55pm

OS-9 (6809) has a very nice version of BASIC.
It could compile to P-code or native mode.
At the time I had a COCO II with dual 360k floppies,
not a great development system.

jecel · August 24, 2020, 9:46pm

Another language that uses line numbers is APL, though they are per definition and not global numbers.

I am always interested in how small a system can be. VTL-2 (very tiny language) can be considered a “Basic Jr” and took up 3 PROM sockets on the Altair 6800. That is just 768 bytes of code. It uses the APL trick of jumping to a line number calculated by an expression or the next line if the expression is 0.
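The computed-jump rule described here can be sketched as follows. This is not VTL-2 syntax or a VTL-2 implementation; statements are modelled as Python tuples and lambdas, and the sketch insists that a non-zero target matches an existing line number exactly.

```python
def run(lines):
    """Run (number, statement) pairs using a computed-jump rule."""
    numbers = [number for number, _ in lines]
    variables, output, pc = {}, [], 0
    while pc < len(lines):
        _, stmt = lines[pc]
        pc += 1
        if stmt[0] == "let":
            variables[stmt[1]] = stmt[2](variables)
        elif stmt[0] == "print":
            output.append(stmt[1](variables))
        elif stmt[0] == "jump":
            target = stmt[1](variables)
            if target != 0:              # 0 means: just fall through
                pc = numbers.index(target)
    return output

# Count down from 3. The comparison yields 0 or 1, so the product is
# either 0 (carry on to the next line) or 20 (loop back to line 20).
countdown = [
    (10, ("let", "N", lambda v: 3)),
    (20, ("print", lambda v: v["N"])),
    (30, ("let", "N", lambda v: v["N"] - 1)),
    (40, ("jump", lambda v: (v["N"] > 0) * 20)),
]
print(run(countdown))
```

One jump statement covers GOTO, IF and computed GOTO, which is part of how such a language stays so small.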

Tiny BASIC is written in just 120 lines (443 bytes) of a virtual machine, which is then implemented in very little machine language code. That includes all the resources necessary to edit the Basic code.

Forth, on the other hand, had screen numbers (instead of a file system) and a visual editor (not counting very early versions based on punch cards).

oldben · August 24, 2020, 11:35pm

PDP 8 fans have FOCAL with decimal line numbers
if I remember right. Your 2K gets you floating point
as well.

cjs · August 25, 2020, 12:36pm

Well, that sounds like it would cause trouble for those of us who use REM statements as a handy place to store machine-language routines. :-)

I’m not convinced that showing you the number actually being used, rather than the number you typed in that is not what will be used, is such a big problem, actually.

Which is fair enough; that technique was really a product of the environment in which things ran.

I agree with your taxonomy here, but it wasn’t just a matter of learning from contemporary computing science (where the authors of MS BASIC did a pretty dismal job; the state of the art in interpreters was significantly better in 1975, or even 1965, than MS BASIC), but also very much a matter of machine resources. The original MS BASIC could be loaded into 4 KB of RAM (typical for an “expanded” Altair 8800) and still leave room for a small program; by the time BBC BASIC came around they were looking at having 8 KB or more of RAM plus 8-16 KB of ROM, two or three times the space available even for Microsoft’s 8 KB Altair BASIC.

Some may dump on the subsequent MS programmers for not better fixing a whole lot of the, uh, infelicities in the original BASIC code, but I’m inclined to give them a pass on that. We also have to remember that at the time people weren’t generally using extensive automated test frameworks that provided safety when doing major code redesigns. And even if they were, it was not unusual for clients to depend on what were pretty inarguably bugs in implementations.

whartung · August 25, 2020, 1:44pm

What better kind of runtime that would work in such a constrained environment are you thinking of?

There’s “state of the art” and “what can we do with 8K of RAM”. Those are not necessarily congruent.

EdS · August 25, 2020, 2:12pm

Agreed that BBC Basic existed in a much larger playing field: 16k+16k ROM and 16 or 32k of RAM, whereas my UK101 was I think 10k ROM and 4k or 8k of RAM.

It turns out Basic is a broad church indeed, from Tiny Basic to GW-Basic and so on and on. (The present BBC Basic offering uses no line numbers.)

drogon · August 25, 2020, 2:37pm

Did I mention vanity project?

At the time, I felt I’d done enough assembler to last me a lifetime (famous last words), and the target was (initially) Linux. But yes, stuffing code into a REM was not uncommon on some systems …

This happened during a live demo I was giving - one of the people I was showing it to was horrified that the computer changed his code…

Subsequent to this, I added a built-in editor to (make it easier to) write programs without line numbers - that’s just a nano-like text editor, so for the most part it’s (now) not an issue, but there are still a few people doing it the older way with line numbers, so I try to cater for them by storing what they type…

One of the most surprising times for me was when I was doing another demo and one person walked out when I said line numbers are optional.

So you can’t please everyone…

-Gordon

cjs · August 26, 2020, 7:50am

What, so he was perfectly comfortable with the computer changing his code so long as he wasn’t told it was changed?

Yeah, well. It may not be generally true that, as Dijkstra says, being taught BASIC “mutilates the mind beyond recovery” (it arguably didn’t for me) but I too have certainly met people who were so affected by their initial experience with BASIC that it blinded them to any other way of programming.

One guy I met recently claimed that (+ 2 3) was impossible to understand, and nobody would ever get that, but PRINTA$;B$ was perfectly naturally understood as a concatenation of string variables even by someone with no computer programming experience at all. He also claimed that functions were quite unimportant for program organization.

cjs · August 26, 2020, 8:26am

Have a look at any mid-60s LISP interpreter. Up through then, four kilowords of memory was considered a pretty reasonable amount to have.

There are plenty of parts of MS BASIC one could use as an example of poor design, but one that comes to mind in particular, since I happened to have written an MS BASIC detokenizer in the past week, is their tokenization/interpretation split. Tokenization does some of the lexical analysis and some of the parsing, and then the interpreter does the remainder of the lexical analysis and parsing on the tokenized code. It’s split in a very weird and non-intuitive way for anybody with a basic understanding of parsing, and leads to all sorts of interesting problems and bugs. A particularly egregious example I found when testing the behaviour of MSX-BASIC is:

10 a=1234
20 data ",",a"b,"c,d:printa
30 read x$:print x$:goto 30

Before opening up the actual output below, see if you can guess what this program does (always a fun game with BASIC!) before terminating with the expected Out of DATA in 30 error.


 1234
,
a"b
c,d:æA

In other words, various parts of the interpreter disagree on which parts of line 20 are data and which parts are code to be executed, and even on how string literals are to be read.

(If you’re wanting to reproduce this, it was run on openMSX emulating a Sony HB-F500P with ROM images from here. That’s a western character set MSX machine with MSX BASIC version 2.0, but I suspect you can reproduce it in other MS BASICs, too.)

If you’re going to store your program in a processed form, it would make a lot more sense to simply do a proper parse of each line and store the AST. (@Kerri_Shotts may want to consider doing this.) Not only would this let you have very compact program text in memory and on disk without having to have source code that looks like this (taken from an actual program):

30A=1:FORJ=0TO199:D=1:READA$:_KLEN(B,A$,0):IFB>0THENFORI=1TOB:LOCATED,13:_KMID(B$,A$,I,1)ELSE60

but it would also make it a lot easier to avoid “the program can’t agree with itself on what the code means” problems like the one I gave as an example above.
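As a very small sketch of the "parse once, store the AST" approach, the code below handles only assignments with +, -, * and / (no keywords, strings or DATA), using a precedence-climbing parser. The function names and the tuple-based AST are choices made for this example, not anything from MS BASIC or another interpreter in this thread.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z][A-Za-z0-9]*\$?|[-+*/()=])")
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def lex(text):
    """Turn source text into a list of tokens, ignoring whitespace."""
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at column {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_atom(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of line")
    tok = tokens[0]
    if tok == "(":
        inner, rest = parse_expr(tokens[1:], 1)
        if not rest or rest[0] != ")":
            raise SyntaxError("missing )")
        return inner, rest[1:]
    if tok in PREC or tok in (")", "="):
        raise SyntaxError(f"unexpected {tok}")
    return tok, tokens[1:]

def parse_expr(tokens, min_prec):
    """Precedence climbing; all four operators are left-associative."""
    left, tokens = parse_atom(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens[0]
        right, tokens = parse_expr(tokens[1:], PREC[op] + 1)
        left = (op, left, right)
    return left, tokens

def parse_assignment(tokens):
    if len(tokens) < 3 or tokens[1] != "=":
        raise SyntaxError("expected: name = expression")
    expr, rest = parse_expr(tokens[2:], 1)
    if rest:
        raise SyntaxError(f"unexpected {rest[0]}")
    return ("let", tokens[0], expr)

def unparse(node, parent_prec=0, right_side=False):
    """LIST: regenerate readable text from the stored AST."""
    if isinstance(node, str):
        return node
    if node[0] == "let":
        return f"{node[1]} = {unparse(node[2])}"
    op, left, right = node
    text = f"{unparse(left, PREC[op])} {op} {unparse(right, PREC[op], True)}"
    needs = PREC[op] < parent_prec or (right_side and PREC[op] == parent_prec)
    return f"({text})" if needs else text

ast = parse_assignment(lex("A=B*(C+2)-D"))
print(ast)
print(unparse(ast))
```

The crammed input `A=B*(C+2)-D` is stored as a tree and listed as `A = B * (C + 2) - D`, so compact storage no longer requires unreadable source, and there is only one place where the meaning of a line is decided.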

jecel · August 26, 2020, 3:05pm

When Apple announced the Swift programming language, one of its main features was that it had “an intuitive syntax”. I was very interested until I saw it was just another C. Just like until the late 1980s every new language was just another Pascal. I think people don’t know the difference between “intuitive” and “familiar”.