Grundy’s NewBrain has an interesting feature. When you run a BASIC program, the system compiles it first. This is not really done to speed it up, but to save memory (the stock computer comes with 32 kB of RAM). If the program is re-run unchanged, the compiled version already in memory is run directly (which may in some cases actually be faster than the first run). There is also a kind of “garbage collection”: once the system is short on memory, the compiled version of the program is deleted. Although it is not quite that simple (the system has to be in exactly the same state), the compiled version can be re-used.
I find this a pretty interesting idea and I am curious whether a similar approach was used on some other micros too. I am not very familiar with other systems, but it occurred to me that this might be a reasonable thing to do on a minicomputer, perhaps?
There’s nothing particularly “educational” about the Newbrain, and a little more speed is always useful.
Most (?) BASICs of the 80s stored their programs in tokenized form, to save memory and gain a little performance. I think that HP had some BASICs that did even more, like storing the memory address of GOTO/GOSUB targets and FOR/NEXT statements. The same thing could be applied to variables and literals (especially floats).
I think the idea of compiling - or semi-compiling - is an interesting one. Yes, almost all Basics will tokenise. Some, but not many, as noted, will store binary forms of decimal constants, or line pointers for lines which are targets of GOTO or GOSUB. One could go further, and store some more efficient form of expressions for evaluation, even in the form of code: a string of calls to the appropriate routines.
So, there are lots of stopping-off points, and it sounds like the Newbrain went further than most. More interesting, I think, is the idea of the “compiled” and the source version co-existing, until memory pressure forbids it.
Is there anything online, perhaps a scan of a book, which explains these internals in detail?
The problem with extending tokenization to actual byte code and internal representations of literals and targets is always about editing.
Maybe this could be overcome by some kind of reference table, where we still store the input values (constant literals, variable names, line numbers) and keep them ready for editing, listing, etc.? Of course, this would add to the memory footprint, but nowadays that may be a viable trade-off. (Think of source maps.)
So, about 12 years back, I wrote what I considered to be my “ideal” BASIC - I wrote it in C. There, I tokenise everything. Not really to save space (which it does), but to hopefully make it more efficient, faster and so on.
It wasn’t without its problems… Take the following line:
10 LET a = 50.1
LET is an easy one to tokenise, then the variable a - this is also tokenised with a flag to indicate it’s held in the symbol table. = is another token, then comes the number 50.1.
That’s where the issue is - I wanted to pre-convert it to binary, so I do the conversion at line entry time, create a token that’s a symbol table entry and insert the token into the ‘compiled’ text.
Then you type LIST and get:
10 LET a = 50.099999999
Oops. So I ended up storing the textual part of the number as well as the binary part just to make LIST work.
It actually ends up taking more storage for both the tokenised version of the program and the symbol table of variables (with their names as well as their textual values), etc., but memory is cheap these days…
Thinking of one other microcomputer “back then” that compiled what you entered: the Jupiter Ace. Unlike all the others, it was programmed in Forth. The system was such that when it compiled a new word, it threw away the original text of the definition… because the de-compiler was good enough.
There’s even some more detailed information on the process on pages 165-170.
Apparently, the compilation is run as each line is encountered on execution (as opposed to compiling the entire program at once), more like caching the parse result for reuse. So every line has a line number, a pointer to a source area and a pointer to the object code (which is initially zero). As this is done per line, there isn’t much of a penalty for caching the parsed representation and purging object code on memory pressure should be uncritical. A nice system!
The source representation is actually more faithful to the original input than with most versions of BASIC, as, e.g., “?” and “PRINT” are represented by different tokens while pointing to the same routines. (On the other hand, there seems to be some positional logic involved, as well, as “LET” and “NOT” share the same token.)
That said, my built-in parser in my not-so-new-brain has difficulties parsing the following (from p. 165):
There are two sets of tokens; one set refers to initial keywords, the second set to other keywords.
Is this second set meant to address the two tokens used for “?” and “PRINT” (where one is the original token and the other one is understood as an alias), or is there more to this?
As I understand it, an initial keyword is the one right after a line number or right after a colon. That would make sense given their example of the code 128 being both LET and NOT (NOT as an initial keyword does not make much sense).
Slightly related: one of the 4th-year students in the year before me did his final-year project on modifying our main system compiler to embed the source code along with each statement, both for use with a debugger and for recovery in case the source was lost. I thought it was a great idea, but ironically no copies of it have been preserved.
BASIC, however, could pack n characters per word on the older computers. The PDP-8 BASIC was a good example of packing big things into small spaces. BASIC and FOCAL were meant to be the programmable calculators of that era. BASIC was designed for timesharing, where only the data segment needed to be swapped out, and I/O was never faster than 110 baud. You could run, say, 5 users while your “hello world” program printed “hello world”.
Understanding this about the old Basics was a little disorienting. When I was learning the language as a kid, it was described as an interpreted language, but the explanation was bonkers, saying that your code was translated to machine code “line by line,” before it was executed. It was only many years later, when I read up on how to look at memory, and looked where the tokenized code was located, that I realized it was as you describe, and not that different from what’s done with languages that have run on VMs (compiled to bytecode, then interpreted).
It was confusing, because “machine code” was also used to describe programming in hexadecimal with opcodes executed directly by a hardware processor.
I think the distinction that’s interesting to me is how much work is done once, and how much work is done many times, when for example there’s a loop in the program. Tokenisation as found in most 8 bit microcomputer Basics is a bit of a saving in the parsing: once the keywords are tokenised, it’s straightforward to distinguish variable names and constant values from the various meaningful single characters. But the finding of variable storage, or the evaluation of expressions, or of constants, or the finding of line numbers, is still done (in those basics) each time some code is encountered.
There could I think be an intermediate form which is pure byte code, or direct or indirect threaded code, as we find in Forths. There are implementation choices here, as to whether the digested form sits alongside the original source, or after it, or interleaved within it, or replaces it. (As noted, replacing the source is memory efficient but may be lossy.)
There is often the “usual” time-vs-effort trade-off, or time-to-market pressure, and adding pressures like that often results in short-cuts…
Also the extra effort of writing a compiler vs. interpreter - even if that interpreter uses some sort of tokenisation.
My own experience suggests that if you are well versed in the art of compiler writing, then writing a BASIC compiler ought to be fairly straightforward - even compiling on a line-by-line basis ought to be relatively easy. I’m not a natural compiler writer, so when I looked at this, I decided to opt for the “traditional” tokenisation approach, but took it to the limit. I’m fairly sure this is how some FORTRAN compilers worked way back - e.g. on the PDP-8. One compiler I tried to use produced a 2-part output: one part was a form of tokenised code, the other was the symbol table; you then needed a separate run-time tape/library loaded before you ran your code.
Latterly, in my application, I decided to try to cache some of the intermediate steps at run time. Each expression, even though it’s tokenised, is parsed at run time by a fairly traditional “shunting yard” approach into an RPN stream which is fed into the RPN evaluator. I decided to cache the shunted RPN - created a token for it and stuffed it into the symbol table - so subsequent passes over that line of code used the already-RPN’d version. It did improve execution speed, for a slight overhead the very first time a line was executed - so big linear functions were a bit slower, but loops, etc. much faster.
Where do you draw the line and make the jump to a proper compiler though?
Going back to the micros of the late 70s/early 80s… It’s amazing a lot of the micros of that day actually made it to market - although there were a lot of good (and probably overworked!) engineers on the case…
In the end, it’s probably about whether we store the evaluated path in some kind of internal representation for later use, or not, regardless of the level or at what time we do it.
Something I was somewhat stunned to see, though, was how much of the time, at the machine-code level, is lost not in the parsing process proper, but in how often values residing in registers have to be stored away and retrieved again as we switch between parsing and execution. This is maybe even more true the finer the granularity of the clause/instruction level, e.g., with byte code on a processor with just a couple of registers.
(We must keep track of the code we’re inspecting, maybe stored inside a larger word as a vector, we must extract it, evaluate it, then jump to the execution, just to pick up where we left the main parsing loop, again. Each of these steps involves dropping and restoring registers, which easily amounts to more operations than the evaluation (probably just a jump table) and/or the execution itself. The smaller the granular units, the more often we have to switch, the higher the penalty.)
So byte code becomes really a viable concept as the number of registers increases – and I wouldn’t recommend it for, say, a 6502. With a small processor, you either want large atomic units or proper machine code.
There are really no good compiler books from before the 1980s, for the simple reason that most computers were too small for compilers to run on them; 32 kB is needed for a good-sized symbol table. Most were language-specific, like ALGOL Z ported to the HAL 7000 computer. Most books give a great introduction but never seem to cover the little run-time details. I have yet to see a good example of how displays work for ALGOL-like languages. I rather think nested variables are more problems than they are worth.
I use abebooks https://www.abebooks.com/ a used online bookstore for all my computer science books.
I like “BCPL: The Language and Its Compiler”, “Assemblers, Compilers, and Program Translation”, and “Compiler Design in C”.