Building a BASIC Interpreter, '80s style

5 Likes

Well that’s good fun! And an emulated machine to run the Basic, running in the browser. Running an invented instruction set, fit for the purpose. And with a kernel called KERNEL, which is refreshing!

Homage to 1980 & 1990 Microcomputers

Retroputer is an emulator for a computer system that resembles machines in the 80’s and 90’s. However, the machine as presented did not exist in this form – rather, the emulator seeks only to emulate the look and feel of these machines, not replicate an actual machine from the era. There are plenty of emulators that do so excellently. Instead, Retroputer has these goals:

  • Be simple to learn
  • Encourage programming
  • Work in a web browser
  • Resemble 8 and 16-bit machines
  • Have a completely understandable architecture from low-level “hardware” to high-level software
1 Like

Hey there - author of Retroputer here! SO COOL to see this pop up here :slight_smile: I’m having a blast build this (although it’s happening verrrry slowly… !)

I’ve posted part two of this series here, where I go into more detail about Retroputer BASIC’s tokenization process, and compare it to how one might accomplish something similar in a high level language (JavaScript) today.

Part 2: https://able.bio/kerrishotts/building-a-basic-interpreter-80s-style-part-2--e91250f1

5 Likes

Unfortunately, it all comes crashing down when you realize that we can’t go about replacing tokens inside of string literals.

Right. Also, remember that typically unquoted non-numeric values in DATA statements are still considered string literals. Here are three DATA items in MS-BASIC (followed by another statement), one of which contains a colon (and MS-BASIC is fine with not having a closing quote):

DATA 001,"002","3:4" : L=3

There’s of course an implication here that you can’t even tokenize numbers in data statements because, as you can see above, if one uses READ S$ to read the first value you need to get back 001, not 1. Of course, you could define the syntax of your BASIC to require quoting of all string vaules in DATA statements, probably avoiding the whole problem but introducing an incompatibility with many other versions of BASIC. (On the other hand, that might also avoid some…interesting things.¹)

…but instead of us having to do the work of checking character by character, we let JS do it instead with match . We know we’ll get a match (we’re in a string)…

So something in previous processing of the input already made sure that all string literals have a closing quote?

Retroputer BASIC (and a few other BASICs, like Atari BASIC) goes one step further: it converts the number into the corresponding binary format.

As do various Microsoft BASICs from the early '80s onward. You can see the details of various formats for one version in Figure 2.12 of the MSX2 Technical Handbook.

Retroputer BASIC also goes another step further: it’ll recognize hexadecimal numbers as well and convert them to their binary equivalent. It uses 0x (just like JS) as the signifier…

MS used an &H prefix for this, which was fairly widespread (at least in Japanese microcomputers), and some versions had &O for octal and &B for binary as well. This was by no means universal, though. National (Panasonic) used a $ prefix in the JR-200, perhaps because that matched the assembly language convention for the 6800 CPU they used.

But variables also start with an alphabetical character. So we can’t immediately tell if what we’re about to crunch is a keyword or a variable. That’s OK—if we don’t find a match in our token list, we can assume it’s a variable instead.

The Array#findIndex method will iterate over the tokens array, and we can test to see if inStr (at the current idx ) starts with the token we’re checking.

This is pretty standard PEG approach, and as usual with that you need to ensure that the alternatives are correctly ordered. E.g., you need to make sure you’re checking for a DEFINT match before you check for a DEF match.

But this runs into problems unless you start requiring spaces in your code to separate lexical tokens. Consider:

IFX=YTHENPRINT"yes"

Your tokenizer has no problem seeing that Y starts a variable name, but where is the end of that variable name?

MS-BASIC kinda half-way bails on this problem at the tokenization stage; it checks at each position for a keyword, consuming it if present, and if that position is not the start of a complete keyword it consumes only that one character, looking again for a keyword at the next character. Yes, that means that LET ION=7 is actually tokenized as LET I ON = 7 (where LET and ON are single-byte tokens) and you later get a syntax error if you try to execute it.

Retroputer BASIC does one more step, which is to tokenize the variables…
But it also comes at a huge cost: we have to store the pointer and the variable name together, and so we tack on an extra four bytes in the output for every variable we see.

Early '80s MS BASIC doesn’t tokenize variable references, but does tokenize line numbers (I am guessing after THEN, GOTO and GOSUB). It uses an interesting trick for this: there are actually two tokens, as you can see from Table 2.12: one for a line number as an unsigned integer in the two bytes following the token, and the other for an address of the line in the two bytes following the token. When you RUN the program, at some point it will replace the first with the second, presumably undoing that change for the entire program text any time the program text is changed. This uses no extra space.


¹ Do not even think about how the following is tokenzed and executed; you will regret it.

10 A=1234
20 DATA ",",a"b,"c,d:printa
30 READ X$:PRINT X$:GOTO 30

I quite like the idea of taking lots of space-time tradeoffs in the direction of better performance: a tokenised Basic program no longer needs to fit in and run in, say, a 4k computer such as my first. With maybe 32k, 64k or in this case maybe even 512k of RAM to play with, lots of preprocessing and lots of pointers in the tokenised program could lead to rather good results.

1 Like

Sure, but while you may feel now that 64 KB is an enormous amount of memory and most programs are never likely to need more than that, things seem to continue to grow, and we may one day reach a stage where 512 KB, or even a megabyte (!) of RAM is considered too small for a typical end-user.

Hee hee, I can’t imagine we’ll ever have more RAM than we have space on our floppy drives…

… But those of us who’ve rigged up a Pi with the PiTubeDirect firmware, to act as a second processor connected to our 8-bit Acorn micros, have now experienced massive amounts of RAM accessible to Basic, running on 32016 and on ARM, as have some lucky few with the corresponding genuine second processors from the mid-80s. I do wonder how many people have run Basic programs which use all that space. I have run the famous 800-line ARM model written in Basic, which just about fits in the 6502 second processor’s 64k of RAM. Its memory use, of course, is mostly for data rather than for the program.

We also know that Acorn rigged up a special 6502 with extended addressing, with 256k RAM, to make it easier to assemble the sources for their 16k ROMs.

The style and design of the MS-BASIC interpreters is focused more on RAM efficiency than runtime efficiency. The whole environment is RAM conscious.

When you eliminate the RAM burden, then “it’s just another language”, and you can make all sorts of assumptions and optimizations.

The primary premise behind line numbers is simply to facilitate very simple editing. No need for a complicated editing tool set (and the memory requirements of such). And the RAM efficient tokenized form that can be rendered back in to text saves a bunch of memory as well.

That’s why it’s curious that the original BASICs on DEC did not do this. They had the line numbers, but the text was stored as text. The text was compiled in to an internal runtime that was not like the tokenize representation that MS chose. So, on DEC BASIC, you had two copies of the code – the source, and the compiled form. (It was not compiled to machine language.)

But then the DEC machine was not necessarily as constrained as the MS micros were. There was no consideration of have to run in 4K or 8K of RAM, and even with the dual representation, you could write “large enough” programs to be useful in the 32K that RSTS provided.

(And, I have written large enough BASIC-PLUS programs to run out of RAM. When we ran out of RAM was started converting the entire thing to Pascal.)

128Kb (64K code/64K data)+ OS(64K) did alot of computing.,
Word processing, spread sheet, a few games, covered most
users, once you got away from Basic. Thinking of DOS here.
I felt BASIC in rom was just a way to sell cheap hardware (but not for the user) until better 16 bit cpu’s came out.
Basic to me means game consoles, not computing.

I’d certainly agree that memory efficiency was more important than runtime efficiency, but of course tokenization did both.

Well, I think that’s a reasonable simplification of what’s going on there, but there are a number of other influences, from the limits imposed by the historical design of BASIC and other factors, that also had people seeing line numbers as the solution. I’m not sure I want to get too deep into that right now, however.

I think I sort of remember the epiphany when I first understood the difference between source code and object. Basic, and other interpreted languages, rather lack the distinction: the in-memory contents is very close to the keystrokes typed in, and can be listed or printed very much as if it is just that. So there was a point when I hadn’t grasped the nature of machine code, or interpreters, still less compilers. I might even have been under the impression, by default, that Fortran was interpreted. Or not even that, because I didn’t quite get that interpretation and execution are different ideas.

It’s not so easy to reconstruct the mental furniture in the mind of a beginner.

One thing about MS Basic and other early microcomputer Basics is that they don’t need a filing system and you can almost get away without storage, when typing in trivial programs. (There are stories of people not understanding the idea of saving to tape.) Whereas, I suspect in the minicomputer world storage and a filing system were always there: having the source form and the ‘compiled’ form as two things would be natural.

I was spoiled by my first real computer (dated from 1982, got it in 1984). The Sord M68 offered several Basics, but the best was called Basic II ; while the source was tokenised to some extent, it had no line numbers, but it had named subroutines, full screen editor, and a real compiler that converted programs first to “relocatable binaries” (similar to .obj files). The output modules were thereafter linked with a runtime to form an executable in native machine language. The ultimate refinement being you had the “RB” files to recreate the runtime module to tailor the memory footprint to your needs by adding or removing “libraries” ; for instance a program not using ISAM files wouldn’t need that module in the runtime consuming RAM.
To say the least, after being exposed to such a futuristic dev environment, I was rather non-plussed by friends C64s and what-have-you toy computers.

3 Likes

Commodore BASIC had a token for “GO” to be used for a compound “GO TO” synonymous to “GOTO” – this is pure luxury! However, support for ISAM files? What’s next? Graphics commands, subroutines,…? :wink:

Honestly, I didn’t know the Sword had such a powerful BASIC.

1 Like

It was. Moreover the graphics commands you mention were a distinct language (called SGL, Sord Graphic Language) with calls from all programming languages available on the machine, by simply printing a string after switching the computer to graphic mode. From Basic II you’d issue a Print statement like (syntax may not be entirely accurate after 35 years):

Print “CIRCLE <X> <Y> <RADIUS> [<BorderColor> <FillColor> <Pattern>]”

1 Like

That depends on when you encountered minicomputers. Many early installations had no storage and you would just read in paper tapes either from a dedicated reader or using a teletype terminal. You could punch another tape for the result.

As time went on it became common to have disks for minicomputers except for embedded applications. Mainframes had gone through a similar evolution years before and microcomputers did it again a decade later. We call this the “Law of computer recapitulation”.

About line numbers, they are the only practical way to edit text using a teletype. With a video terminal they can still save a little memory, but a simple screen editor can be pretty small.

1 Like

Naturally the line numbers weren’t solely used for editing, but you can see that as soon as the editing requirement vanished (i.e. using a separate, dedicated editor), the line numbers pretty much vanished as well. The primary legacy being used for ON ERROR handling, since once line numbers went away, symbolic labels started showing up in their stead (GOSUB POSTINVOICE), but symbolic labels didn’t work well with the legacy ON ERROR RESUME style of error handling. (Notably checking the ERL error line for the source of the error.)

When I was doing BASIC-PLUS on the VAX, the only reason I had line numbers was specifically for error handling, as I could not find a way to simply get status codes from calls, they would fault using the ON ERROR mechanism. Otherwise I would not have used them at all.

And any need for line numbers specifically to facilitate quality of life issues with punch cards (as in “oops, I dropped the deck”) vanished quite early.

If you’re talking about having line numbers in the program text, or even assigning line numbers to specific program text in some other way, no, not at all. Plenty of editors in the 70s, on both minis and micros, let you edit languages without line numbers. Three examples that come to mind would be editing C program text in ed on Unix (often enough done on an actual ASR-33, I’m sure), and assembly language program text in ED.COM on CP/M or EDASM on an Apple II.

I used ed extensively for C and other code in the early '80s, and to find a specific line I used searches far more than I used line numbers.

But it simplifies things a lot and you can get away with minimal state for a session.
(I somewhat recall an editor from the 1980s on IBM VM/CMS using line numbers for any kind of text. So you could edit your C-files using line numbers… In my case, it was SPSS/X.)

The a couple of distinctions of line number based editors, especially simple editors, vs something like ed is that there’s no concept of a current line, or even current place in the file. And there’s no searching. Searching can be particularly problematic on a token based system like BASIC.

Most line editors indeed have “line numbers”, but not in the same sense as a system like BASIC. ed knows what “lines 5-10” are, for example. But line numbers are subtly different in they don’t represent position, per se, rather they represent sequence and order. You can enter line 10, then line 100, then line 500 which are quite different from the 10th line, 100th line, and 500th line.

Maintaining the state of the current line is hardly a large burden on an editor.

On the Atari 800, the assembler MAC-65 was much like BASIC. It used line numbers, its source was tokenized. I don’t recall if it used the line numbers as labels for the assembly.

I don’t really see how it simplifies things or reduces in any significant way the amount of state you need to hold.

From what Microsoft calls TXTTAB (the pointer to the first line of the program text) onwards, you have a fairly huge amount of state that seems to me more or less the same state as a standard line editor would hold, including a “current line” for use when the interpreter is running even if you don’t support the CONTINUE statement.

Having each line labeled with its own number and then using that for editing purposes as well may is a reasonably intuitive way to do things for BASIC, but that is very much about the design of the language. In any of the many languages where you organize things by functions (Logo, Lisp/Scheme, most modern languages), rather than by a massive sequence of lines, you can well end up with more or less the same thing except that functions are labeled by names instead of having lines labeled by numbers.