Microcomputers with built-in compilers?

ellvis · May 3, 2023, 10:31pm

Grundy’s NewBrain have an interesting feature. Once you run a program in BASIC, it will compile it before it’s run. This is not really to speed it up, but to save a memory (stock computer comes with 32kB of RAM). If the program is re-run unchanged, it will run directly the compiled version already in the memory (which may in some cases actually run faster then the first time). There is also a kind of “garbage collection” - once the system is short on memory, the compiled version of the program is deleted. Although it is not that simple (the system should be in the exact state), compiled version could be re-used.

I find this pretty interesting idea and I am curious if similar approach was used on some other micros too. I am not much familiar with other systems, but it come to me that this would be a reasonable thing to do on a minicomputer maybe?

I am not attaching any tech details in this post, but I can provide those if there is an interest. As this is quite unknown system, the Wikipedia page is here: https://en.wikipedia.org/wiki/Grundy_NewBrain and software and documentation is here: https://newbrainemu.eu/ (note that although it is free, you have to log in for downloads).

mainframetom · May 4, 2023, 4:58pm

This is an educational computer.
I think most users don’t need and don’t want that, especially not with BASIC.
Who needs a compiler can buy one.

rwiker · May 4, 2023, 5:21pm

There’s nothing particularly “educational” about the Newbrain, and a little more speed is always useful.

Most (?) BASICs of the 80s stored their programs in tokenized form, to save memory and gain a little performance. I think that HP had some BASICs that did even more, like storing the memory address of GOTO/GOSUB targets and FOR/NEXT statements. The same thing could be applied to variables and literals (especially floats).

EdS · May 4, 2023, 5:28pm

I think the idea of compiling - or semi-compiling - is an interesting one. Yes, almost all Basics will tokenise. Some, but not many, as noted, will store binary forms of decimal constants, or line pointers for lines which are targets of GOTO or GOSUB. One could go further, and store some more efficient form of expressions for evaluation, even in the form of code: a string of calls to the appropriate routines.

So, there are lots of stopping-off points, and it sounds like the Newbrain went further than most. More interesting, I think, is the idea of the “compiled” and the source version co-existing, until memory pressure forbids it.

Is there anything online, perhaps a scan of a book, which explains these internals in detail?

NoLand · May 4, 2023, 5:32pm

The problem with extending tokenization to actual byte code and internal representations of literals and targets is always about editing.
Maybe, this could be overcome by some kind of reference table, where we still store the input values (like constant literals, variable names, line numbers) and have them ready for editing, listing, etc? Of course, this would have added to the memory footprint, but nowadays this may be a viable concept. (Think of source maps.)

P.S.: The idea of coexisting compiled versions and purges on memory pressure is pretty much alive, e.g., with JIT code in JavaScript.

drogon · May 4, 2023, 6:04pm

Try telling that to those who bought a BBC Micro (Or Electron) - with BASIC and supported by a national TV program and educational material for schools…

Lets not forget both Microsoft and Apple both started with BASIC too…

As for the Newbrain? It could have been the BBC Micro… Didn’t quite make it though, but did seem some traction in the home market.

-Gordon

drogon · May 4, 2023, 6:12pm

So about 12 years back, or so, I wrote what I considered to be my “ideal” BASIC - I wrote it in C. There, I tokenise everything. Not really to save space (which it does) but to hopefully make it more efficient, faster and so on.

It wasn’t without its problems… Take the following line:

10 LET a = 50.1

LET is an easy one to tokenise, then the variable a - this is also tokenised with a flag to indicate it’s held in the symbol table. = is another token then the number 50.1.

That’s where the issue is - I wanted to pre-convert it to binary, so I do the conversion at line entry time, create a token that’s a symbol table entry and insert the token into the ‘compiled’ text.

Then you type LIST and get:

10 LET a = 50.099999999

Oops. So I ended up storing the textual part of the number as well as the binary part just to make LIST work.

It actually ends up taking more storage for both the tokenised version of the program and the symbol table of variables, (with their names as well as textual values too), etc. but memory is cheap these days…

-Gordon

drogon · May 4, 2023, 6:15pm

Thinking of one other Microcomputer “back then” that compiled what you entered, the Jupiter Ace. Unlike all others this was programmed in Forth. The system was such that when it compiled a new word it threw away the original text definition… Because the de-compiler was good enough.

-Gordon

ellvis · May 4, 2023, 6:19pm

Here is the Software Technical Manual: https://archive.org/details/grundy-newbrain-software-technical-manual/page/158/mode/2up Page 159 is briefly introducing BASIC and mention the compilation. I will go through some magazines I own and will come back here with some more details about the compiler.

Also, the computer was originally designed with just 2kB or RAM (the project started back in 1978). At the end it shipped with 32kB (and is directly expendable up to 2MB).

NoLand · May 4, 2023, 7:07pm

There’s even some more detailed information on the process on pages 165-170.

Apparently, the compilation is run as each line is encountered on execution (as opposed to compiling the entire program at once), more like caching the parse result for reuse. So every line has a line number, a pointer to a source area and a pointer to the object code (which is initially zero). As this is done per line, there isn’t much of a penalty for caching the parsed representation and purging object code on memory pressure should be uncritical. A nice system!

The source representation is actually more faithful to the original input than with most versions of BASIC, as, e.g., “?” and “PRINT” are represented by different tokens while pointing to the same routines. (On the other hand, there seems to be some positional logic involved, as well, as “LET” and “NOT” share the same token.)

That said, my built-in parser in my not-so-new-brain has difficulties parsing the following (from p. 165):

There are two sets of tokens; one set refers to initial keywords, the second set to other keywords.

Is this second set meant to address the two tokens used for “?” and “PRINT” (where one is the original token and the other one is understood as an alias), or is there more to this?

ellvis · May 4, 2023, 7:27pm

As I understand it, initial keyword is the one right after a line number or right after a colon. That would make sense using their example of the code 128 being both, LET and NOT (NOT as an initial keyword does not make much sense).

gtoal · May 4, 2023, 8:05pm

Slightly related: one of the 4th yr students in the year before me did his final year project on modifying our main system compiler to embed the source code along with each statement, both for use with a debugger, and for recovery in case the source was lost. I thought it was a great idea but ironically no copies of it have been preserved

EdS · May 4, 2023, 9:06pm

Thanks! Somehow I just remembered another example - the Basic of the ABC80. Perhaps see this thread
Luxor ABC series of Swedish computers

Nice story!

It’s an odd thing, that 32k is really a lot of RAM, in a context where machines might have 4k or so. But then along came 48k and 64k machines and it doesn’t seem so much.

oldben · May 4, 2023, 11:55pm

Basic how ever could pack n characters per word the older computers. The PDP 8
Basic was good example of packing big things into small spaces. Basic and Focal were
ment to be the programable calculators of that era. Basic was designed for timesharing
where only the data segment needed to swapped out, and I/O was never faster than 110 baud. You could run say 5 users while your “hello world” program printed “hello world”.

mmiler · May 5, 2023, 2:15am

Understanding this about the old Basics was a little disorienting. When I was learning the language as a kid, it was described as an interpreted language, but the explanation was bonkers, saying that your code was translated to machine code “line by line,” before it was executed. It was only many years later, when I read up on how to look at memory, and looked where the tokenized code was located, that I realized it was as you describe, and not that different from what’s done with languages that have run on VMs (compiled to bytecode, then interpreted).

It was confusing, because “machine code” was also used to describe programming in hexadecimal with opcodes executed directly by a hardware processor.

EdS · May 5, 2023, 8:29am

I think the distinction that’s interesting to me is how much work is done once, and how much work is done many times, when for example there’s a loop in the program. Tokenisation as found in most 8 bit microcomputer Basics is a bit of a saving in the parsing: once the keywords are tokenised, it’s straightforward to distinguish variable names and constant values from the various meaningful single characters. But the finding of variable storage, or the evaluation of expressions, or of constants, or the finding of line numbers, is still done (in those basics) each time some code is encountered.

There could I think be an intermediate form which is pure byte code, or direct or indirect threaded code, as we find in Forths. There are implementation choices here, as to whether the digested form sits alongside the original source, or after it, or interleaved within it, or replaces it. (As noted, replacing the source is memory efficient but may be lossy.)

drogon · May 5, 2023, 9:46am

There is often the “usual” time vs. effort, or Time to Market so adding pressures like that often results in short-cuts…

Also the extra effort of writing a compiler vs. interpreter - even if that interpreter uses some sort of tokenisation.

My own experiences suggest that if you are well versed in the art of compiler writing then writing a BASIC compile ought to be fairly straightforward - even compiling it on a line by line basis ought to be relatively easy. I’m not a natural compiler writer so when I looked at this, I decided to opt for the “traditional” tokenisation approach, but took it to the limit. I’m fairly sure this is how some FORTRAN compilers worked way back - e.g. on the PDP-8 - One compiler I tried to use produced a 2-part output, one was a form of tokenised code, the other was the symbol table, you then needed a separate run-time tape/library loaded before you ran your code.

Latterly, in my application I decided to try to cache some of the intermediate steps at run-time - so each expression, even though it’s tokenised is parsed at run time by a fairly traditional “Shunting Yard” approach into an RPN stream which is fed-into the RPN evaluator - I decided to cache the shunted RPN - created a token for it and stuffed it into the symbol table - so subsequent passes over that line of code took the already RPNd version. It did improve execution speed for a slight overhead the very first time that line was executed - so big linear functions were a bit slower, but loops, etc. much faster.

Where do you draw the line and make the jump to a proper compiler though?

Going back to the micros of the late 70s/early 80s… It’s amazing a lot of the micros of that day actually made it to market - although there were a lot of good (and probably overworked!) engineers on the case…

-Gordon

EzikeChristopher · May 5, 2023, 2:46pm

Could suggest a number of good compiler writing book used during your time, 60,70,80 or you know about

NoLand · May 5, 2023, 3:39pm

In the end, it’s probably about whether we store the evaluated path in some kind of internal representation for later use, or not, regardless of the level or at what time we do it.
Something I was somewhat stunned to see, though, was, on a machine code level, how much of the time is lost not in the parsing process proper, but in how often values residing in registers have to be stored away and retrieved again, as we switch between parsing and execution. This is maybe even more true, the finer the granularity of the clause/instruction level, e.g., with byte code on a processor with just a couple of registers.
(We must keep track of the code we’re inspecting, maybe stored inside a larger word as a vector, we must extract it, evaluate it, then jump to the execution, just to pick up where we left the main parsing loop, again. Each of these steps involves dropping and restoring registers, which easily amounts to more operations than the evaluation (probably just a jump table) and/or the execution itself. The smaller the granular units, the more often we have to switch, the higher the penalty.)
So byte code becomes really a viable concept as the number of registers increases – and I wouldn’t recommend it for, say, a 6502. With a small processor, you either want large atomic units or proper machine code.

oldben · May 5, 2023, 6:17pm

There are really no good Compiler Books before the 1980’s, for the simple reason
most computers where too small for compilers to run on them. 32Kb is needed for a
good sized symble table, Most where language
specific like ALGOL Z ported to the HAL 7000 computer. Most books give a great introduction but never seem to cover the little run time details.
I have yet to see
a good example of how displays work yet for ALGOL like laguages. I rather think ,nested
variables are more problems than they are worth.
I use abebooks https://www.abebooks.com/ a used online bookstore for all my computer science books.
I like “BCPL the language and its compiler” , “Assemblers,Compilers, and Program Translation”, “Compiler Design In C”.