Building a BASIC Interpreter, '80s style

That’s not even the worst of it: since the heaps for program text, simple variables, and arrays adjoin one another, defining/introducing a simple variable moves the entire array heap, which is a complex linked list of its own, at runtime!

(So, for best performance in MS or Commodore BASIC: first put all subroutines at the beginning of the program, since lines are searched from top to bottom; then define all your simple variables in the order of frequency of use; then define [DIM] all your arrays; and then start your program.)

Actually, at least in the later MS BASICs they tokenize numbers, so 1.23E-4 will be a token followed by the binary floating point value.

Yeah, PEG parsing is the obvious and easy way to go with things like this; I discovered the idea on my own before I’d ever heard of parsing expression grammars. It’s a great failure of computing science education that, both then and to this day, it turns out graduates who have no idea of even the basics of parsing, which you’d think (or at least I do) is a core part of computing science.

And that’s how you end up with the utter “tokenization” vs. “interpreter parser” mess that creates far more problems than it solves.

Oh, I forgot to mention that. No, ON INT ERVAL=30, which is what my detokenizer spit out in “expansion” (add spaces) mode, is a syntax error.

So PEGs are context free grammars (à la Chomsky), but with internal context provided by the order of selective choices?

BTW, I wrote a few blog posts on tokenizing, memory allocation, variable and string representation in Commodore BASIC (which is just a “lean” version of MS BASIC 1.0). The pièce de résistance is an exploit of string pointers for a fast, binary FIFO queue: Now Go Bang! – mass:werk blog


Well, yes, if I’m reading you right, but I’d be careful with the terminology there; the “internal context” is the context of the parser itself (things like, “what’s the remaining input?”), and that’s not related to “context” in the Chomsky grammar classification.

The core thing about PEGs is, as Wikipedia states, “the choice operator selects the first match in PEG.” This makes parsing something like a keyword very easy: just sort your list of keywords by length and then go through them from longest to shortest, comparing each with the start of what you’re currently parsing, and take the first match. PEG parsers are in fact so easy to write that you often don’t even need a framework because writing it out from scratch is easier than finding and downloading a library.
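
For instance, a minimal sketch of that longest-first match in Python (the keyword list is obviously abbreviated):

KEYWORDS = sorted(["INT", "INTERVAL", "INPUT", "IF", "ON"], key=len, reverse=True)

def match_keyword(text, pos):
    # Try the alternatives in order; the first (longest) match wins,
    # so "INTERVAL" is found before its prefix "INT".
    for kw in KEYWORDS:
        if text.startswith(kw, pos):
            return kw, pos + len(kw)
    return None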

Yes, I read those when they came out and they were wonderful. The excellent coloured hex dumps were a particularly beautiful touch.


Yes, we tend to forget that path selection is always unbalanced and that this is a powerful tool.
In the case of INTERVAL or similar: if you want to augment your vocabulary but maintain your existing list of tokens (for whatever reasons, like backward compatibility, internal standardization, not wanting to change the interpreter too much, etc.), a second lookup table does the job and separates concerns. The first table defines your path of parsing and yields an intermediate token, which is then translated into the effective token. In the first table, INTERVAL comes before INT, while its index is then transformed into a token at the very end of the list by the second lookup table. (In all probability, your token list index is already in some index register, so the additional lookup is just a few processor cycles in edit state and saves a lot of complexity in the runtime. Which may be worth the ROM space.)
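
A rough sketch of that two-table arrangement (all token values here are invented):

PARSE_TABLE = ["INTERVAL", "INT", "IF", "ON"]   # parse order: longer words before their prefixes

EFFECTIVE = {        # intermediate index -> effective token; values made up
    0: 0xC7,         # INTERVAL, appended at the very end of the existing token list
    1: 0xA2,         # INT keeps its original value
    2: 0x8B,         # IF
    3: 0x91,         # ON
}

def crunch_keyword(text, pos):
    # Match in parse order, then translate the intermediate index into
    # the effective token the runtime already knows about.
    for intermediate, word in enumerate(PARSE_TABLE):
        if text.startswith(word, pos):
            return EFFECTIVE[intermediate], pos + len(word)
    return None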

I am not totally clear on what you’re saying there, but I strongly suspect that the tokenizer was simply left unchanged and the interpreter parsing the tokenized data just tweaked to match FF 85 45 52 FF 94 before it matched FF 85.

To clarify, this was a suggested strategy (to optimize both on parsing paths and order of tokens presented to the interpreter), but not an assessment of what really happened.

It may have been like that when memory was tight, but it doesn’t have to be… My RTB Basic has an array of pointers to a line structure which contains the line number, a pointer to the tokenised version of that line, and a length word. The line number is not actually used at run-time (there is a pre-run pass to fix up GOTO and GOSUB).
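
Modelled very roughly in Python (RTB itself obviously isn’t Python; this is just the shape of the thing):

from dataclasses import dataclass

@dataclass
class Line:
    number: int    # kept for LIST and editing; not consulted at run time
    tokens: bytes  # the tokenised form of this line
    length: int    # the length word

program = []       # an array of Line records, rather than a linked list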

However, I did once write a simple BASIC-like thing in Pascal that did use a linked list, but that was a programming exercise while at uni…

-Gordon

Well, the fun continues as I continue working on my MS-BASIC detokenizer. After the whole issue with INTERVAL I had kind of brought my mind around to the view that tokenization was really about compression, not about lexical analysis, until a failing unit test reminded me that = is “crunched” (to use the term used in Microsoft’s source code) as '\xEF', not as '=', though this offers no savings in length. This does mean, however, that you should never see an '=' in BASIC program text (outside of string literals). So it is lexical analysis, except when it isn’t, say in the case of '(' and ')', which are passed through like any other character by the CRUNCH routine.
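
To make that concrete, my mental model of the crunch step at this point is roughly the following sketch (only the 0xEF value for '=' comes from the real thing; the keyword values are invented and plenty of special cases are omitted):

TOKEN_TABLE = [("PRINT", b"\x99"), ("THEN", b"\xA7"), ("=", b"\xEF")]  # longest-first; values mostly invented

def crunch(line: str) -> bytes:
    out, i = bytearray(), 0
    while i < len(line):
        if line[i] == '"':                    # string literals pass through untouched
            end = line.find('"', i + 1)
            end = len(line) if end < 0 else end + 1
            out += line[i:end].encode("ascii")
            i = end
            continue
        for word, tok in TOKEN_TABLE:         # first match wins
            if line.startswith(word, i):
                out += tok
                i += len(word)
                break
        else:                                 # '(' , ')', digits, letters, etc. pass straight through
            out.append(ord(line[i]))
            i += 1
    return bytes(out)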

Agh! My head! It hurts!

Personally, I regard tokenizing rather as converting to an ENUM. Regarding lexical analysis, it’s tricky to say where this is really located in classic BASIC, since it’s the runtime/interpreter that throws the error, not the tokenizing/crunch routine. However, whether an error will be thrown is already decided by the tokenizing routine.

This actually gave me some headaches when I played around with my own implementation. I did something like splitting the input into objects, which should move most of the load to the parser while providing a classic listing experience, like

A= 22.00

{
    type: "identifier",
    value: <string>
},
{
    type: "token",
    value: <enum/int>
},
{
    type: "whitespace",  // ignore
    representation: <original-input-string>
},
{
    type: "number",
    value: <number>,
    representation: <original-input-string>
}
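
A rough sketch of how such a lexer might look (keyword list and details invented; dicts stand in for whatever object type you actually use):

import re

TOKEN_WORDS = sorted(["PRINT", "GOTO", "THEN", "FOR", "TO", "IF", "="], key=len, reverse=True)

def lex(line):
    out, i = [], 0
    while i < len(line):
        ch = line[i]
        if ch.isspace():
            j = i
            while j < len(line) and line[j].isspace():
                j += 1
            out.append({"type": "whitespace", "representation": line[i:j]})  # kept for LIST
            i = j
        elif (m := re.match(r"\d+(\.\d+)?", line[i:])):
            out.append({"type": "number", "value": float(m.group()),
                        "representation": m.group()})  # keeps "22.00" as typed
            i += len(m.group())
        else:
            for word in TOKEN_WORDS:                   # longest keyword first
                if line.startswith(word, i):
                    out.append({"type": "token", "value": TOKEN_WORDS.index(word)})
                    i += len(word)
                    break
            else:
                m = re.match(r"[A-Za-z][A-Za-z0-9]*\$?", line[i:])
                if m:                                  # variable name
                    out.append({"type": "identifier", "value": m.group()})
                    i += len(m.group())
                else:                                  # operators, punctuation, ...
                    out.append({"type": "other", "representation": ch})
                    i += 1
    return out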

Now, you clearly have a full lexical analysis in the parser, but, in order to maintain the experience, you want to throw any errors in the runtime. What to do? Store the error and the remaining text in some kind of blob, to be accessed by the runtime for listing, editing, etc?

(BTW, Sinclair computers did a full analysis on line submit. I guess there were a few others that did this as well.)

Regarding dual lookup tables, I’m rather fond of my idea. E.g., if you have “++” and “--” operators, you want to have “++” immediately before “+” in your parsing table, but it may be nice to have “++” and “--” grouped together and separated from binary “+” and “-” in your effective token list. (Also, if you want to check for ranges of certain related commands, like in “if (token >= 0xa2 && token <= 0xb0) …”.)

That doesn’t seem so to me. The tokenizer will just as happily convert 'FNOR' to b'\xDE\xF7' as it will LEN to b'\xFF\x92'. Whether either has any meaning is entirely up to later analysis.

Uh, stop trying to maintain a bad experience? :-)

Right. There’s no particular reason to have MS-BASIC’s “tokenization” stage at all once it hits the point that the tokens are no longer saving significant space, when most of your actual lexical analysis is happening at a later stage, anyway.

Which is absolutely no problem. In my code I list the tokens as (token, value) pairs in the order I find convenient. But the parser doesn’t use that directly; it uses:

#   Tokens sorted by descending length, so that we can walk through this
#   as a list of alternatives, ensuring that a word will match before
#   its prefix.
DETOKENS = sorted(TOKENS, key=lambda t: len(t[0]), reverse=True)

In fact, at the moment the first match it tries is listed last in my list of token/value pairs.

I admit, much of this is to be seen in the light of a formative exposure to Commodore BASIC in my youth. Meaning, all the bad behavior is to be replicated for the “true” experience. :wink:
Which is also why I dropped the project once I had a PET emulator to play with and started to “commodify” this instead. Because you can’t be truer to the original bad experience than with what you get from the (almost) real thing…


An alternative to Basic developed in the same time frame was JOSS. It used two-part line numbers.


Wow - did this ever blow up with some amazing comments :slight_smile: I’ve been wanting to reply and participate in the back-and-forth, but life has been in the way, in particular, going through the purchase of a house and getting it ready for move-in. That’s not going to go away for at least the next month (+ work stuff).

A few quick thoughts from some of the previous discussion:

  • My parser definitely doesn’t know what to do with “YTHEN” and the like. Right now it’s a bit of a middle ground: somewhat permissive with spacing, while still not matching a lot of BASICs from the era either. I may yet decide to require spacing so that syntax becomes less ambiguous… we’ll see.
  • I don’t have support for DATA yet (or INPUT, for that matter), so haven’t yet had to figure out how I’m going to deal with it. I suspect I’ll end up stopping further tokenization once DATA is seen, and then reuse some of the work I’ll have to do for INPUT – and essentially stream things in as input. That would let you do READ S and READ S$ and still support quote-less strings in DATA… on the other hand, I may decide that DATA should be a little more regular, and require that strings have quotes and numbers don’t. Would be easier to deal with, but not quite as true to the experience of the time.
  • Once I’ve got this version of BASIC done, I’d very much like to convert it to an AST, and then also add on a compiler (for which the AST work will be needed anyway). I’ve considered that for such an environment, you could just maintain the raw BASIC text in a separate bank of RAM (since Retroputer has more than enough), and then the environment becomes a bit more like QBasic and the like. But my first experience with BASIC was of the tokenized variety, so that’s where I’ve started.
  • The PRINT command being able to generate graphics is a really interesting idea. I almost considered it for Retroputer BASIC since I was running out of token space (trying to stay to a single byte token). I think I came to a reasonable use of tokens, but I may keep this in the back pocket. The other thing I do like about it is that it means that the OS/KERNEL is the one doing the work, which makes it that much easier for other programs to use it that aren’t dependent on BASIC.
  • Multiple statements per line are definitely a little more challenging than they would have to be otherwise (with an AST, for example). I haven’t yet got RETURN working, or FOR…NEXT, but that means I’ll have to record not just the source line location, but also the offset within the token list, so that the return address is correctly handled. A single command per line would make that bit much easier. Retroputer BASIC will also have some more structured-like commands as well (multi-line IF…THEN…ELSE, for example), but that calls into question how you mix them with multi-command lines. My current thought is that generally, BASIC will expect various tokens to exist only at the beginning of the line in order to simplify scanning ahead. So IF ... THEN: ELSE won’t be valid, but IF... THEN followed by ELSE on the next line would be. We’ll see… lots to do before then.

I’d love to write some more, but back to the other pressing business at hand… but I’m so happy there’s so much discussion and interest around this kind of topic. I love it! :slight_smile:


Micro-soft’s was not the only full-sized BASIC out there for the 8080. I think there were two others.
Ben.

Just so we’re clear on terminology here, I’m using “token” and “tokenisation” to mean the MS BASIC CRUNCH routine, not tokenisation as done during typical lexical analysis. When I need to, I’ll refer to the latter as “lexical analysis” and what it produces as “lexical tokens.”

I’d strongly encourage you just to move to a per-line AST now. Having written code for detokenisation and tokenisation myself recently, it’s become clear to me that having two stages of lexical analysis and an intermediate format between them is a recipe for confusion and bugs, particularly as the language gets extended over time. The only potential advantage I can see that it offers over a real AST is preservation of source code formatting in listings, if you’re saving programs as an AST rather than text (e.g., it will not turn FOR I=1 TO 10 into FOR I = 1 TO 10, or whatever), and that’s not even really relevant when, to save space, most programmers format their code as FORI=1TO10 in MS BASICs anyway.
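
By a per-line AST I mean something roughly like this (node types purely illustrative):

from dataclasses import dataclass

@dataclass
class Goto:
    target: int        # line number to jump to

@dataclass
class Print:
    args: list         # expression nodes

@dataclass
class ProgramLine:
    number: int
    statements: list   # e.g. [Print([...]), Goto(100)] for PRINT ...:GOTO 100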

The per-line AST obviously does not include proper syntax trees for FOR/NEXT and the like, since these may be on different lines. But a BASIC with a full-program AST, rather than per-line AST, is effectively a different language. The post A twisted question about BASIC in the 6502.org forums demonstrates some ill-defined language constructs that would probably have to work differently between the two.

Well, you must already be recording that as you parse and execute a line, so there’s nothing extra to add but a way of saving a copy of this state. I think that MS-BASIC just pushed it on to the stack, at least in 8080 BASIC. (The stack there was placed below the string heap, which in turn was just below HIMEM, the top of BASIC’s RAM.)
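
A sketch of what saving that state might look like in a Python-style interpreter loop (my guess at a reasonable structure, not MS’s actual stack layout):

gosub_stack = []                         # saved (line_index, statement_offset) pairs

def do_gosub(line_index, statement_offset, target_index):
    gosub_stack.append((line_index, statement_offset))   # resume point after the GOSUB
    return target_index, 0                               # jump to the start of the target line

def do_return():
    if not gosub_stack:
        raise RuntimeError("RETURN WITHOUT GOSUB")
    return gosub_stack.pop()                             # resume at the saved line and offset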

DATA arguments are not parsed at all the same as INPUT arguments. (Take the following with a grain of salt on the exact details, but it’s generally correct.)

The grammar for INPUT is roughly INPUT ⟨"text";⟩? var ⟨,var⟩*: there is only one constant allowed at the start of the arguments, which must be a quoted string followed by a semicolon, and the remainder is always comma-separated variable names.

DATA on the other hand takes a comma-separated list of constants (numeric or string) and (fairly reasonably, to my mind) allows unquoted string constants so long as they don’t have commas in them. I don’t see any issues with doing that; it’s very easy to implement and relatively easy to define if you’re careful about it and don’t just slam together pieces of incompatible implementation and let them make a mess, as we saw MS did earlier in this thread. (Hint: work out your grammar for both DATA and string constants before you implement either.)
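
Something along these lines would do (a Python sketch of one reasonable definition, not MS’s exact behaviour):

def read_data_items(text):
    """Split the text after DATA: quoted strings keep everything between
    the quotes; unquoted items run up to the next comma."""
    items, i = [], 0
    while i < len(text):
        while i < len(text) and text[i] == ' ':  # skip leading blanks
            i += 1
        if i < len(text) and text[i] == '"':
            end = text.index('"', i + 1)         # assumes the closing quote is present
            items.append(text[i + 1:end])
            nxt = text.find(',', end)
        else:
            nxt = text.find(',', i)
            end = len(text) if nxt < 0 else nxt
            items.append(text[i:end].rstrip())
        i = len(text) if nxt < 0 else nxt + 1
    return items

READ would then convert an item to a number or keep it as a string, depending on the target variable.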

So the way that MS BASIC handles this seems to be that the lexical analyzer (which is split between CRUNCH and the interpreter that works on those CRUNCHed lines, remember) has single lexical tokens for keywords etc., and variables are a sequence of letters, terminated by a keyword token, space, or whatever else is not a letter. The lexical analyzer then does a PEG-style match at each parse point looking for keywords before letters.

Obviously it’s nicer to do your lexical analysis all in one step, which is not too hard: just make sure you try for keyword etc. matches at every parse point. I.e., if you’ve just consumed the ABC of ABCTHEN as char, char, char, you see that you have THEN in your lookahead, and that terminates the collection of the variable name.
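
Sketched in Python (keyword set abbreviated and illustrative):

KEYWORDS = sorted(["THEN", "TO", "GOTO", "PRINT"], key=len, reverse=True)

def keyword_at(text, pos):
    return next((kw for kw in KEYWORDS if text.startswith(kw, pos)), None)

def read_variable(text, pos):
    # Consume letters one at a time, stopping as soon as a keyword (or any
    # non-letter) shows up in the lookahead, so "ABCTHEN" yields "ABC".
    start = pos
    while pos < len(text) and text[pos].isalpha() and not keyword_at(text, pos):
        pos += 1
    return text[start:pos], pos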

Well, now you’re effectively producing a different language, I think. Nothing wrong with that, but once you start down that road, you may want to consider basing it on a language better designed for what you want to do. In particular, a language based around functions, rather than “lines” is not only considerably simpler, but avoids many of these problems. The extended LISP family is probably the most obvious example of that, and was quite popular on early microcomputers in the form of Logo.

Also remember that some computers often found as UK door stops tokenized keys from the keyboard: the LET key, IF key, GOTO key, etc. The UK also had a FORTH computer that did it the same way, I think. Ben.
(checks the web) Forth was not tokenized; clones may be available on the internet.

That would be the Jupiter Ace. I’m only passingly familiar with it, but it sounds like a pretty cool machine.

However, Forth is in a somewhat different category of language because of the lack of automatic memory management (beyond the stack), among other things. It’s a good language, but whether it’s what you’re looking for for this particular situation is another question entirely.

Clearly marked as an opinion, @cjs, so thanks for that, but my opinion would certainly be different. Of course adding some block structure does make a big difference, but I’d still call it Basic. Later versions of BBC Basic took the same excursion, even so far as making line numbering optional. BBC Basic is still actively developed - @Kerri_Shotts you are in good company!


It may be worth mentioning that ANSI Standard BASIC defines a multi-line IF/THEN/ELSE clause.

Sadly you need to spend money to get the ANSI standard, ANSI X3.113-1987 (R1998), but you can get the older standard, which the ANSI standard replaced, here: BASIC - EDM2. (See page 59.)

My RTB Basic supports multi-line IF/THEN/ELSE as well as the SWITCH/CASE construct (but it gets a little lengthy, and internally it’s currently no more efficient than chained IF/THEN/ELSE…)

-Gordon