BCPL - an even simpler compiler?

Rising to Ed’s challenge over here: B -- A Simple Interpreter Compiler - #2 by EdS

I thought I’d say something about BCPL …

It’s recorded that BCPL is the forerunner to C, but the recent B thread suggests there was an intermediate step: B, probably initially written in BCPL, then in B itself. Whatever the case, the resurrected B is so close to BCPL as to make little difference, so who knows.

BCPL (Basic Combined Programming Language) emerged in Cambridge in the mid-60s, but the first implementation was at MIT by Martin Richards, the language’s creator. Then (and now), it compiles into an intermediate code which is relatively efficient to interpret on different systems, although in recent years there has been another intermediate code it can compile to which is easier to translate to native machine code.

Think of BCPL as C without types. There is just one type: the word. You can declare words, or vectors (arrays) of words, and you can word- or byte-address these vectors. Initially a 16-bit system, today it’s a 32 (and 64) bit system, but still more or less compatible with the old 16-bit systems.

My experience is with the BBC Micro in the early 80s - I wanted something a bit better than Basic (even though BBC Basic is the best old-school 8-bit Basic there is - I’d been doing a lot of C by then), so BCPL it was. It’s also about 3 times faster than Basic and compiles to denser code, so programs could be a little bigger. My project was a flexible manufacturing system with independent/autonomous stations, each controlled by a BBC Micro using a network (Econet) for communications.

BCPL is still in use today (a lot of Ford’s manufacturing systems in Europe are written in it, apparently) and Martin Richards got the whole thing running on the Raspberry Pi a few years back in an effort to kindle interest. (I think the Python brigade won that battle, though.)

The compiler is of course written in BCPL and takes a second or three to compile itself on my desktop i3 Linux system (yes, really, 2-3 seconds). Compare that to gcc, etc. …

Personally I feel it’s an ideal higher level language than Basic for those who want a bit more out of their 8-bit micro - it’s also very capable of being self-hosting. I can compile, edit and debug BCPL programs directly on a BBC Micro, or equivalent.

It was also used for the early OS on the Amiga - in the form of Tripos (another Cambridge University project), which is a single-user, cooperatively multitasking operating system.

Current BCPL systems use 32- or 64-bit words - which poses a small problem with floating point… It was never really designed for floating point use. The BBC Micro implementation did come with the “calculations package”, which used small vectors of 6 bytes to hold an FP number, with library calls to do things like floating point add, multiply and so on. The current versions do support floating point - but only as big as the underlying word, so if you want double precision, 64-bit numbers, then you need to run your code on a 64-bit (Linux) system…

Ever wonder where the byte and word indirection operators in BBC Basic come from (rather than peek and poke)? Well, that’s the product of the computing science and BCPL taught at Cambridge University, as they’re right out of BCPL…

Hello world in BCPL:

get "libhdr"
let start() be
  writes ("Hello, world*n")

Personally I think BCPL is an ideal language for less able systems - retro 8 (and maybe 16) bit computers. The compiler can easily be self-hosting on a modest 32KB BBC Micro, the code efficient and compact, but on today’s higher-speed systems? C…



BCPL was, and still is, mostly unknown in the US. Outside its use in Multics it was hardly used. I believe Richards brought it to the Multics project directly, which may explain its use there. Apparently Ken Thompson liked it enough to copy it for B. To paraphrase Dennis Ritchie, B was BCPL stripped down to bare essentials and filtered through Thompson’s brain.
BCPL became a bit more known and got something of a boost in the US when the Amiga came out. But it quickly faded away again. It’s too bad. BCPL is a rather nice language from what I’ve seen.


That is indeed what Ritchie said, but Thompson himself told a different story at VCF East this year, surprising us all. It’s here:


Note that though BCPL was ported to Multics and used for some applications at Bell Labs, Multics itself was implemented in PL/I, which was also used for most of its applications.

CPL (known initially officially as “Cambridge Programming Language” and to Christopher Strachey’s students as “Christopher’s Programming Language”, later renamed to “Combined Programming Language” when Manchester joined the project) was a paper design that was never implemented. When one of Christopher’s students implemented a practical subset it was naturally called Basic CPL.


That’s interesting! Thanks for the details.

And here is a bunch of info, including several compilers and compiler kits.


That’s a good page of links there. I’ve always stuck with the Martin Richards stuff directly off his Cambridge uni pages. I regard those as the canonical sources for my own work.

One thing that is missing, though, is the sources for the BCPL implementation on the BBC Micro that was produced in the early 80s. I have spoken to Martin about this, but it appears to all have been lost. I was particularly interested in the assembler to see if I could upgrade it to handle the 65816, but I’ve started to write one from scratch to fill that gap.



Here is a link to Martin Richards’ page:


Back then you had a small machine like a PDP-8 or a big timesharing computer like an IBM 360 or a PDP-10 - two totally different designs. (I just had TV shows like Star Trek back then.) 12-bit words are too small, so let’s use 32/36 bits. A nice programming model, but not very compact: TAD Z FOO on the 8 is now LD R3 #FOO; ADD R4 R3 )0.

Byte addressing was the big feature in the early 70s that led to the PDP-11 and the 8-bit microprocessors. C and Pascal came along, but 8-bit computers could just fit p-code in a 48K address space. The 8088/8086 needed a complex compiler, and the 11s all turned into VAXes with complex compilers. So a simpler compiler really depends on the hardware model. Stack-based compilers like Forth are really simple and self-hosting.

I am working on a TTL (FPGA prototyping) CPU from the mid-1970s with a 9/18-bit word length. Right now I am writing a primitive compiler in C to cross-compile to the hardware and be self-compiling. The goal is to fit in 24KB of memory, with 8KB for a primitive OS. Version #2 will have structures and be able to do real computing. C has proven that structures are needed to write more flexible programs than just having indexed arrays of data. I see most languages as too general - they cannot be simplified to fit the hardware better. At the same time, hardware may not be advancing to make software better.

Trial and error has given me this model: a single AC, two index registers, stack, and PC. This is ample for small systems with floating point subroutines (not written). Byte and word addressing, with conditional branches and logic operations like “set AC to 1 if AC is less than 0”. 9-bit bytes are needed, with immediate and 9-bit indexing (Z X Y S). 18 bits gives an ample address space, as I/O is a TTY and programs are modest. 10/20 bits would be needed to handle byte/word/half-words, at about a 50% increase in PROM address space. Floating point and multiply/divide as subroutines.


Looking at BCPL again: I now have a simple 32-bit CPU, but it looks like BCPL has grown - all singing, all dancing, all written in C. I just have 512KB total with this guy, and I suspect that will not be ample to even boot BCPL with a BCPL OS. Am I looking in the right place? Any BCPL books are expensive and costly to ship to Canada. Ben.


Let me congratulate you on being the 4th BCPL programmer left in the world :wink:

I have a retro 65816 system that runs BCPL in 512KB. Other than overall speed, there is more than enough RAM to run my minimal command-line OS and the compiler. I can compile the compiler, but it takes a long time. Compiling “Hello World” is about 20 seconds, though, and most of that time is the compiler loading from the filing system.

I decided to create a new Cintcode VM from scratch in 65816 assembler. The ’816 is a 16-bit CPU limited by a 24-bit address bus, an 8-bit data bus and some other coding challenges. It’s not going to be fast, but it’s fast enough. So far I’ve written a simple CLI and a nano-like editor (which I ported from a C version I’ve been writing/using for some years now) as well as some utilities - I’m finding it hard to escape the Unix world of ls, rm, mv and so on, though. Currently working on multi-processing so I can have pipes.

The other (non-Linux) BCPL I’m aware of can be found over on the anycpu site. I think that’s just a 16-bit system, and he has the compiler produce the SIAL intermediate code rather than Cintcode, which he then translates into native code for his CPU. I stuck with Cintcode as I felt it was more flexible for me. It’s (arguably) slower, but the code density is very high - a typical 2-byte Cintcode instruction may need 20 or more bytes of 65816 code to execute.

I started by getting the BCPL system going on my (Linux) desktop and using that to initially compile and edit programs. I use the ‘bin’ flag to the compiler to produce more compact object files which load faster on my system. I do have the luxury of a separate filing-system CPU (an ATmega with SD card), so the ’816 side just has an open/read/write/close POSIX-style API to it, but if I ever moved to alternative/better hardware I’d be looking to re-write it all in BCPL. (My filing system has a similar structure to Apple ProDOS, with a 32MB volume limit for now.)

There are some details here: The 6502 | Gordons Projects and a video or 2 here:

Hope that gives you some encouragement!

I may be interested in your CPU - I have a few frustrations with my system right now, mainly speed (I estimate it’s running at the equivalent of a 200-500 kHz 32-bit system - which also means that the compiled code is very efficient). However, picking a suitably retro system is tricky - it’s the 68K or - well, not a lot else back in the late 70s or early 80s (VAX, Interdata?), so who knows.




The CPU is rather crude, living in a DE1 FPGA development card, so it really does nothing fancy or fast. Covid stopped development on real hardware - front panel, logic and ALU cards. This project started out as an 18-bit CPU in TTL but grew to a 20-bit CPU to have all the features I wanted: BCD math (excess-3) and about 512KB of memory. Memory cycle is 1.8432 MHz with MOS memory (fast/PAL is 3.2 MHz).

A similar design, with two 16-bit ALU cards, is now an 8/16/32-bit CPU. The design goals are a mid-1970s time frame and memory sizes and speeds. The prime number sieve from Byte runs in about 10 seconds. 1975 to 1990 looks to be a valid ballpark for this instruction set layout. Both designs are still in a state of flux, and don’t support stack operations, just simple indexing.


FWIW: My Cintcode interpreter written in 65816 assembler doesn’t use the hardware stack at all. (Although it’s true there are 2 instances of JSR/RTS, these could be expanded with macros if required).

My “Byte Sieve” benchmark written in BCPL takes about 13 seconds, which is OK given the limitations I’m working with.

The thing (one thing) that lets the ’816 down is the inability to fetch (or store) a single byte efficiently. My dispatcher works by fetching the byte-code, multiplying by 2 and using that as an index into a 16-bit-wide jump table. I have to fetch the byte as a 16-bit value, mask off the top 8 bits, multiply by 2, then do the indexed jump. If I put the CPU into 8-bit mode, then I save a cycle on the fetch, but it takes 2 cycles to get into 8-bit mode and 2 to get back to 16-bit mode. I still have to do the mask, as I need to double the byte fetched, and to do that I need to be in 16-bit mode, which means making sure the top byte of the 16-bit accumulator is zero before the shift instruction. So an efficient way to fetch an 8-bit value into a register, zeroing the top byte(s), would be a boon… My dispatcher code takes 29 cycles (at best), which in a 16MHz system is almost 2µs, so the maximum effective instruction rate is never going to be more than 500kHz.

This sequence would be eliminated if I chose to compile into native code, but while faster, the code would be considerably larger, and then the issues concerning code execution over different 64K banks in the 65816 rear their ugly heads.

The segmented/banked nature of the 65816 is also a hindrance here - I wanted my BCPL programs to be oblivious to it, so I do a lot in the Cintcode VM to make that issue go away. I can allocate an array of 200KB in a BCPL program and iterate over it as I might do on any other system.

So once I’ve fully explored the 65816 solution, I feel that anything newer will be better, but the question is where to go - the 68K is a 32-bit CPU that existed at that time (my timeframe is sort of up to pre-ARM, so mid-80s), but there really weren’t any other (32-bit) CPUs at that time that might also be hobby friendly. I did use an Interdata 32-bit “mini” in ’78/’79, and VAX was also a thing then, as was Prime, but I have horrible memories of the Prime, so discounting that! The NS32016 is a bit of a non-starter too (not to mention availability). So the option is looking at some other “core” that might be in the spirit of that era (Sweet32 is one I found), or perhaps the Inmos Transputer, which I also used in the mid-late 80s and into the 90s… Possibly too many choices, possibly veering a little OT here too!




Perhaps modify Cintcode from a byte format to a short-int format with the upper byte zero, and have data packed to even locations - 0 for short, x for char. Code size will grow, but you save now that you have less bit fiddling. Not sure how it will affect the compiler.

How can you say the Prime was BAD? It had Dr Who promoting it. :slight_smile:

The hardest part for me in designing a retro computer is remembering not to save on memory and pack everything into 64KB and 16-bit ints.

My computer design has two data buses - n bits data in, n bits data out - plus some control logic. Memory cards latch the MAR and data out. Timing is simple but has 3-phase clocks. The addressing modes are indexing or registers. Thus building an N-bit-wide CPU just means making the ALU/memory cards wider: 32 bits is just two standard 16-bit ALU cards (high and low). Three 44(?)-pin card edge connectors would make up the bus. Just a bit more work than a 16-bit CPU with standard busing. I can get 32 bits for the same cost as 16 bits. Timing is based on 250ns 4K Mostek DRAMs.

My biggest regret is only having 20 switches for the front panel, making panel access 16 bits rather than 20 bits for addressing.

Byte packing and 16-bit CPUs seem ‘penny wise and pound foolish’ to me if you need data space/size larger than 8-bit micros, or general-purpose computing.

In hindsight, after 1976 the 8-bitters were the easy construction projects because they used little RAM; they were affordable personal computers. Having memory and disk I/O, and timesharing on a real computer like a PDP-11, seemed so complex that the average person could not build a simple version of such a computer out of TTL. That is why you have so few TTL computer designs out there bigger than 8 or 12 bits.

My FPGA hardware emulates a machine with 2 hard discs (RK05s, ~1.6MB each), 48KB+32KB DRAM memory and 2400-baud serial I/O. Not affordable for the average person in 1976, but the ALU/control/front-panel display cards are simple designs.

When shifting a byte, the most significant bit is shifted into carry. Why not use that within your interpreter to distinguish whether the carry is set or clear? That’s what ProDOS HyperC does.

If a 65802/816 accumulator is 8 bits and the index registers are 16 bits, any transfer from A to X would automatically zero the high byte, no?


It’s all about saving precious cycles. I could fetch in 8-bit mode, shift, then branch on carry, but then I have the overhead of the switch to 8-bit mode, the branch, then the switch back to 16-bit mode.

I think so, but I still need to double the fetched byte as the jump table of opcodes is 16 bits wide.



is the current fetch/decode/dispatch code. It’s a macro and is included in-line in every single opcode execution. It runs in full 16-bit mode because almost all the opcodes are interpreted in 16-bit mode. Numbers in ()s are cycle counts. It embiggens code size but saves 3 cycles per opcode decoded (a JMP back to the dispatcher).

An optimisation I have thought of would be to move the PC increment into the start of each decoded instruction - except those that modify the PC, i.e. the call/return/branch/goto/switch instructions. That would save a good number of cycles, but only on those instructions. It might speed up recursion and anything with a function call in a loop, but I’m not sure it’s worth it - for now…




Hi Gordon,

From your opcode.txt file I see your point. So there’s a tradeoff as usual (speed, in this case) in emulating a VM. On the other hand, you seem to have broken the 64K barrier for program code on the 65816.



That’s at the expense of VM execution speed. As a 16-bit CPU, you can run code in a 64K bank and access another 64K data bank with a good deal of speed, but trying to have an array (or other data structure, or even a program) span more than 64K starts to become problematic - especially if your code flow demands ‘random access’ to the data. You can put in special cases which can be faster but need more careful coding, restrict data structures (and program segments) to 64K, or use a generic “access data anywhere in the address space” mechanism, which is slower. I took the slower route just so I did not have to deal with it in the high-level language. (65816 specific)

Not an insurmountable problem and other systems with segmented memory have this issue more or less and various ways round it. I threw CPU cycles at the problem and tried to save them elsewhere.

In these enlightened days of course, no-one believes a word of it…


All I have here is a Windows 10 machine, and at a quick glance BCPL needs the GCC C compiler. I’ll have to dig around and see if the compiler is self-hosting once built on Windows. I use Pelles C for what little programming I do.

The machine I plan to run BCPL on just has a cross assembler, basic serial I/O, routines in ROM to read/write a block from an SD card, and 256KB of memory.

The goal of this design is a simple computer, with about 20-bit addressing and 8/16/32 data sizes, for the DOS-command-line era of computing. Dual floppy or a small HD (16MB or less). No virtual memory or complex math like * / % or floating point. At this point I am doing a review of the hardware features, and looking at other software languages for a good general-purpose instruction set.

It is rather weird looking at history - how BCPL and other languages were popular in the 1970s and vanished as the PC took over home computing, with Unix for the commercial platforms, in about 5 years, 1983 to 1987. Unix, Apple and MS really killed off all other programming ideas, as older computers with punched cards were being replaced in the early 1980s.


Even the 8-bit micros had a great variety of programming languages available.