Stack machines vs. RISC, a modern comparison?

I was looking for some more modern sources about the AT&T Hobbit, and came across this article on The Chip Letter:

Looking at the benchmarks got me thinking… I realize there’s first-mover advantage and market forces and all that, but is there any theoretical reason we didn’t move forward with stack machines? In this case a system of complexity similar to the R2000 outperforms it (at least at the micro level), it’s not obvious that a cached stack would perform worse than a comparable number of register-renaming slots, and the instruction set would be more compact.

So is there any good articles on any theoretical reasons a stack machine might not have scaled the same way the RISC systems ultimately did? The article mentions “stack vs register based architectures needs a full post of its own”, but I did not find such a thing, and would like to read about it!

2 Likes

For the last 30 years the question has not been how good the CPU is, but how cheaply we can make it, to get the most profit. For the last 40 years, the brand of CPU has mattered less than whose C compiler we are using.
Stack machines imply FORTH-style operands. A real stack machine (I have no knowledge of the Hobbit) tends to imply an ALGOL-style stack frame, more complex than standard C.
A stack machine is a better load/store architecture than RISC because you abstract at a higher level of design. You need a large microcode, and that could slow you down, except that any modern machine is tied to raw memory speed, and waiting for a cache line happens more often than marketing will admit.
With the more abstract logic, you have fewer branches to break the pipeline, like the branches to sign-extend a number on a RISC machine.
One-pass compilers often generate stack-based code, but that can be poor code simply because multi-pass scanning is needed to re-arrange the source code. Code in parentheses needs to be done first, and that means buffering and re-ordering at some level. Only with ample memory can a compiler do that, and by the time the RISC machines came out, one had ample memory. The compiler was better, not the RISC CPU.
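
To make the one-pass point concrete, here is a minimal sketch of my own (not taken from any real compiler): a recursive-descent expression parser that emits stack code the moment each sub-expression is finished, with no buffering or re-ordering. Emitting register code instead would require temporaries and exactly the kind of re-arrangement described above.

    /* One-pass recursive descent emitting stack code as it parses.
     * Grammar: expr = term {'+' term}; term = factor {'*' factor};
     * factor = letter | '(' expr ')'. */
    #include <stdio.h>

    static const char *p;                 /* cursor into the source text */
    static void expr(void);

    static void factor(void) {
        if (*p == '(') { p++; expr(); p++; /* skip ')' */ }
        else            printf("PUSH %c\n", *p++);   /* emit operand at once */
    }

    static void term(void) {
        factor();
        while (*p == '*') { p++; factor(); puts("MUL"); }
    }

    static void expr(void) {
        term();
        while (*p == '+') { p++; term(); puts("ADD"); }
    }

    int main(void) {
        p = "(a+b)*c";
        expr();          /* prints: PUSH a, PUSH b, ADD, PUSH c, MUL */
        return 0;
    }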

Remember the Fat Mac: upgrade your Mac from 128KB to 512KB.
The first Linux needed a 386 with 4MB of RAM. RISC was better than the 386 only because it was newer.
Had the Hobbit been developed by anyone other than AT&T, it might have been able to find a market before Windows 95 cornered the consumer market.

I think a stack machine stands a good chance of competing with a RISC machine, provided we don’t compare apples and oranges: same clock speed, same bus speed, and the same ALU design.

I gather the difficulty with stack machines is in scaling up performance: too many data dependencies on the stack, difficult to go superscalar, something like that. It’s maybe like the opposite of a large register file and a huge reorder buffer - everything gets bottlenecked. It allows for dense instruction encoding, but otherwise not great. This is hearsay, mind, from working in transputer and post-transputer design teams.
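
Hand-waving that “too many data dependencies” claim into something concrete: below is a toy C model, entirely my own invention and not anything from a real design, that computes the earliest issue cycle for each instruction of e = a*b + c*d. Each stack instruction is treated as dependent on the one before it, because both implicitly touch the top of stack; the 3-address form exposes the two independent multiplies directly.

    #include <stdio.h>

    /* Each entry records which earlier instructions it must wait for. */
    struct insn { const char *text; int ndep; int dep[2]; };

    static void schedule(const char *name, const struct insn *code, int n) {
        int ready[16];                      /* earliest issue cycle per insn */
        printf("%s:\n", name);
        for (int i = 0; i < n; i++) {
            int c = 0;
            for (int d = 0; d < code[i].ndep; d++)
                if (ready[code[i].dep[d]] + 1 > c)
                    c = ready[code[i].dep[d]] + 1;
            ready[i] = c;
            printf("  cycle %d: %s\n", c, code[i].text);
        }
    }

    int main(void) {
        /* e = a*b + c*d on a stack machine: every op implicitly reads and
         * writes the top of stack, so each one waits for the one before it. */
        const struct insn stack_code[] = {
            {"push a", 0, {0}}, {"push b", 1, {0}}, {"mul",    1, {1}},
            {"push c", 1, {2}}, {"push d", 1, {3}}, {"mul",    1, {4}},
            {"add",    1, {5}},
        };
        /* The same computation in 3-address form: the two muls are independent. */
        const struct insn reg_code[] = {
            {"mul t1,a,b", 0, {0}}, {"mul t2,c,d", 0, {0}},
            {"add e,t1,t2", 2, {0, 1}},
        };
        schedule("stack (serial chain)", stack_code, 7);
        schedule("3-address (parallel muls)", reg_code, 3);
        return 0;
    }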

Edit: amusingly, Tony Finch seems to take away a different lesson from T9000 in this HN thread on stack machines.

These two HN threads have some relevant comments - and links to previous discussions.
Stack Computers: the new wave (1989)
Ask HN: Why did stack-based CPUs lose out?

1 Like

Perhaps the change from floating-point ALUs to bitmapped graphics was the more important design issue, as the mix of software changed: from TEMP_A[i,j] := (A[i+k,j+l]+…A[i-k,j-l]) / 8.0; to macro XorSprite(dest,r1,r2,source,r3,r4);. Oberon moved from CISC (in its early version) to RISC hardware that includes bitmapped graphics.

Think about expressing an algorithm. Do you think of pushing the operands first and then doing each operation? If you’re like most people, you don’t. The evidence comes from the calculator and programming language marketplaces, e.g. Hewlett-Packard and Forth: some people got very good at it, but most people didn’t prefer it. Today most calculators and programming languages have parentheses, not reverse Polish.
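
A tiny illustration of the operand-first point (nothing here beyond standard C and an invented comment notation):

    #include <stdio.h>

    int main(void) {
        /* How most people (and most languages) write it: operators in the
         * middle, parentheses to control the order. */
        double infix = (3.0 + 4.0) * 5.0;

        /* The same thing in operand-first, reverse-Polish order, as a stack
         * machine (or an HP calculator, or Forth) would consume it:
         *     3 4 + 5 *
         * i.e. push 3; push 4; add; push 5; mul */
        printf("%g\n", infix);   /* 35 */
        return 0;
    }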

Now think about building a CPU. If you want it to run as fast as possible, you give it multiple functional units and build a dataflow engine that pulls up as many instructions as possible to keep them busy. In other words, the CPU hunts forward for work until and unless (a) all the function units are busy or (b) it comes to a place where it has to wait for a result that hasn’t yet been computed.

The resulting sequence of operations ends up looking nothing like a stack; it has addressing computations intermixed with data computations, it reuses operands heavily, etc.

So if people don’t think like stack machines, and CPUs aren’t best designed as stack machines, why did we ever dream up stack machines in the first place? We did it because stack machines optimize one thing: instruction bandwidth. “Add” is just “Add”. There’s no need to specify operand locations or where to put the result; that’s all implied. So the instructions can be very small.
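
A rough sketch of that bandwidth argument, with encodings invented purely for the comparison: a zero-address “add” needs only its opcode, while a three-register add also has to name a destination and two sources, so it naturally costs a full instruction word.

    #include <stdint.h>
    #include <stdio.h>

    /* Invented encodings, for illustration only. */
    enum { OP_PUSH = 0x01, OP_ADD = 0x02, OP_MUL = 0x03 };
    typedef uint8_t  stack_insn;     /* "add" is just the opcode: 1 byte       */
    typedef uint32_t reg_insn;       /* opcode + rd + rs1 + rs2: a 32-bit word */

    static reg_insn encode_add(unsigned rd, unsigned rs1, unsigned rs2) {
        return ((unsigned)OP_ADD << 15) | (rd << 10) | (rs1 << 5) | rs2; /* 5 bits each */
    }

    int main(void) {
        stack_insn s = OP_ADD;
        reg_insn   r = encode_add(3, 1, 2);
        printf("stack add: %zu byte, register add: %zu bytes (encoding 0x%08x)\n",
               sizeof s, sizeof r, (unsigned)r);
        return 0;
    }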

In earlier times, this made programs smaller and also faster, because not as much instruction “data” had to be fetched. But as time marched on, we figured out how to build caches. Caches require a huge amount of extra storage (cache tags) and logic (hardware invalidation logic, etc.) So they were basically impossible in the early days. But caches work particularly well for instructions, because the instructions generally don’t change during execution.

Instruction caches removed the last good reason for building stack machines.

1 Like

Algol and friends are why we have stack machines, I think. The only way to parse with 1 or 2 tokens of lookahead is with a stack, and recursion requires a stack. Computer science did not stop with FORTRAN II in the late 50s. Only when SSI chips came out did we get index registers. Time marches on to the newer and better…

“people don’t think like stack machines” — however, compiler writers do. And they’re the people who matter in this context.

1 Like

Note that the CRISP wasn’t a normal stack architecture. It had a stack cache instead of a register bank (MIPS) or register windows (SPARC) but you could access local variables as if they were RISC registers instead of having to shuffle everything through the top of the stack (with DROP, DUP, SWAP, etc).

One very modern feature was that the instruction read from memory was expanded into a more RISC-like format before being stored in the instruction cache. And this expanded format did both control flow and data processing for each instruction for zero overhead loops.
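
The stack-cache idea can be pictured roughly like this (a toy model of my own, not CRISP’s actual hardware): keep the top few words of the memory stack in an on-chip array, so a read of “local at sp+offset” behaves like a register access whenever the offset lands inside the cached window.

    #include <stdint.h>

    #define CACHE_WORDS 32

    typedef struct {
        uint32_t entry[CACHE_WORDS]; /* on-chip copies of the top of the stack */
        uint32_t sp;                 /* stack pointer, as a word index          */
        uint32_t base;               /* lowest stacked word currently on chip   */
    } stack_cache;

    /* Read the local variable at sp+offset: register-file speed on a hit,
     * ordinary memory access on a miss. */
    static uint32_t read_local(const stack_cache *sc, const uint32_t *mem,
                               uint32_t offset) {
        uint32_t addr = sc->sp + offset;
        if (addr >= sc->base && addr < sc->base + CACHE_WORDS)
            return sc->entry[addr - sc->base];
        return mem[addr];
    }

    int main(void) {
        uint32_t memory[1024] = {0};
        stack_cache sc = { .sp = 100, .base = 100 };
        sc.entry[5] = 42;                     /* pretend local #5 was pushed */
        return read_local(&sc, memory, 5) == 42 ? 0 : 1;
    }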

2 Likes

I ended up not addressing your main question: stack architectures are very competitive for low end designs (just compare the many Forth processors with tiny RISCs) but fall behind as implementations get fancier due to the bottleneck of the top of the stack.

For really modern and high end implementations (with register renaming and out of order execution) the problems with stack ISAs would go away, but I have not yet seen anybody try this.
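
One way to picture what renaming would buy a stack ISA (a hand-wavy sketch of my own, not a description of any real design): the front end keeps a map from stack depth to physical registers, every push allocates a fresh register, and each “mul” or “add” becomes a three-address micro-op on whatever registers currently hold the top two entries. After renaming, the two halves of a*b + c*d no longer collide on an architectural top-of-stack.

    #include <stdio.h>

    static int next_phys = 0;   /* trivial physical-register allocator      */
    static int map[64];         /* stack depth -> physical register number  */
    static int depth = 0;

    static void push(const char *src) {
        map[depth++] = next_phys++;
        printf("mov  p%d, %s\n", map[depth - 1], src);
    }

    static void binop(const char *op) {      /* pops two entries, pushes one */
        int b = map[--depth];
        int a = map[--depth];
        map[depth++] = next_phys++;
        printf("%-4s p%d, p%d, p%d\n", op, map[depth - 1], a, b);
    }

    int main(void) {
        /* stack code for e = a*b + c*d, renamed on the fly */
        push("a"); push("b"); binop("mul");
        push("c"); push("d"); binop("mul");
        binop("add");          /* the two muls end up on disjoint registers */
        return 0;
    }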

1 Like

Or … (waving my hands wildly)… maybe multiple stacks, that could operate in parallel?

Bernd Paysan cooked up just such an idea - not sure it was entirely finished:
4stack Processor

1 Like

There are some excellent resources in here, thanks everyone!

Long and short: stacks fall apart when you have conditionals that aren’t idealized, and are difficult to turn into superscalar versions because every instruction is inherently dependent on the previous ones and you can’t really scoreboard the entries.

1 Like

I might say a little more about T9000 - it was intended as a better faster transputer than the first-generation T4/T8 series. (T2 was a proof of concept, more or less, and T8 was T4+FPU. All were simple scalar machines, I think, with no cache of any sort.)

Here’s his post from HN, posting as fanf2:

In 1993-4 I was an intern at Inmos, when they were trying to get the T9000 transputer to work.

The transputer was a stack architecture, with a bytecode-style instruction set. The stack was shallow, just big enough to evaluate a typical expression you would write in a high-level language. It also had a local addressing mode, relative to a “workspace” pointer register, which was used for local variables.

To make the T9 go faster, they gave it a “workspace cache”, which was effectively a register file. The instruction decoder would collect up sequences of bytecodes and turn them into RISC-style ops that worked directly on the registers, so the stack was in effect JITted away by the CPU’s front end.

A really cool way to revamp an old design; a pity that the T9 was horribly buggy and never reached its performance goals :-(

The official T9000 Hardware Reference Manual has some things to say about the pipeline and the instruction grouper, and shows how several instructions can be launched in each cycle. See p10 (p33) and also p74 (p96) of the PDF.

Since the processor can fetch one word, containing four bytes of instructions and data, in each cycle it is possible to achieve a continuous execution rate of four instructions per cycle (200 MIPS). However, if any of the instructions require more than one cycle to execute, then the instruction fetch mechanism can continue to fetch instructions so that larger groups can be built up. Up to 8 instructions can be put into one group and there may be five groups in the pipeline at any time.
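
A loose illustration of what that grouping could look like (invented encoding and pattern, nothing like the real grouper): a run of one-byte stack instructions such as ldl a; ldl b; add; stl c gets rewritten by the front end into a single three-address operation on workspace slots.

    #include <stdio.h>

    enum op { LDL, ADD, STL };                  /* tiny invented subset        */
    struct insn { enum op op; int operand; };   /* operand = workspace offset  */

    static void group(const struct insn *c) {
        if (c[0].op == LDL && c[1].op == LDL && c[2].op == ADD && c[3].op == STL)
            printf("grouped: add  w[%d], w[%d], w[%d]\n",
                   c[3].operand, c[0].operand, c[1].operand);
    }

    int main(void) {
        struct insn code[] = { {LDL, 1}, {LDL, 2}, {ADD, 0}, {STL, 3} };
        group(code);                            /* one issue instead of four   */
        return 0;
    }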

During my tenure, some preliminary work started on a successor to T9000, which as I recall was to involve out-of-order machinery (Tomasulo’s algorithm as “first implemented in the IBM System/360 Model 91’s floating point unit”)

The major innovations of Tomasulo’s algorithm include register renaming in hardware, reservation stations for all execution units, and a common data bus (CDB) on which computed values broadcast to all reservation stations that may need them. These developments allow for improved parallel execution of instructions that would otherwise stall under the use of scoreboarding or other earlier algorithms.

(One of the principals of that work is now a Fellow and Chief Architect at ARM.)

I think it was understood by engineers, if not by management, that there was no hope of implementing a successor machine: we didn’t have the skills, techniques, or resources. We barely got the T9000 working, had no chance of reaching a useful clock speed, and had already suffered an exodus. Indeed, most of the T9000 team was hired after a post-T800 exodus. We, as engineers, could do interesting work, and learn things, with some small hope of making a success of the project.

While we rightly praise Moore’s Law for making computing available to the masses, it is only about having newer products cheaper and more powerful than the previous ones. It is not what allowed the exact same $100 68000 processor to be sold for just $5 a few years later.

The two related things that bring down the price of the exact same chip over time are the learning curve in manufacturing (yield goes up, processes get automated, higher volume means bigger discounts for incoming material, etc) and the fact that after a while the NRE (non recurring engineering) costs will have been paid off and won’t have to be included in the price anymore.

While Inmos started out with a reasonably competitive chip node, as far as I know they sat on it for years and years and used the same process for all their Transputer products meaning they did not benefit from Moore’s law to keep up with their competitors. I think the T9000 was their first attempt to move to a newer node (BiCMOS if I remember correctly) but their problems with that hurt the project as much as anything else.

The original goal of the Transputer was to have processors be as common as transistors were at the time. For that to happen the processors would have to have been really cheap (eventually). But with Transputers being a very niche product they didn’t ride the learning curve and were being sold for roughly the same price years after introduction. They did try to create some cut down products like the T400 and T222 to allow cheaper applications, but it was too little, too late. The T9000 went in the other direction trying to compete with the most expensive processors.

2 Likes

The T9000 idea of merging two or more stack instructions into a more RISC-like 3 address instruction was revisited in Sun’s PicoJava processor.

One interesting T9000 feature was that the internal memory could be used as simple RAM, as in previous Transputers, or configured as a cache.

The biggest problems with the transputers-as-implemented were, I think:

  • new parent company had no idea how to sell microprocessors.
  • all the layout was custom, there was no easy process node migration, only very difficult migration.

Another problem within INMOS was that the initial memory offering was too successful and made the memory business seem more important, whereas the initial idea was to have memory fund processors.

T9000 was only ever NMOS. [Hmm - see downthread!] I think we had two and a half layers of metal. There was an effort to convert the layout automatically to some newer process but it fell flat: some clever small company had a tool, but wanted seven figures (IIRC), and that was considered too much for mere software. Of course that’s an incorrect judgement - the value of software is in what it can do for you.

I liked the discussion about renaming registers, and “moving registers.” The latter is how I’ve described working with the Forth stack, “a moving array of registers.”

I was puzzled reading Torvalds saying that stack machines “don’t work well with compilers,” describing the behavior of optimizers. He made it sound like he’d had experience with this, but I had to wonder if he really had. I got the impression he was talking about taking an existing C compiler and just substituting some equivalent opcodes in the back end, swapping a register machine for a stack machine, without changing the code-generation algorithms to account for the fact that it was targeting a different architecture. I don’t see how the problem he described has to exist.

Some of the discussions focused on out-of-order execution and pipelining, and how it’s extremely difficult with a stack. I grant that, but I go back to a discussion about the 6502: you plan out this stuff. Pipelining is a compensation for slow instructions. I’m not saying the programmer should necessarily have to worry about this; I’m saying the compiler should figure out how to sequence instructions for optimal performance.

My sentiments were echoed by CHY872 on Jan 1, 2023 in “Ask HN: Why did stack-based CPUs lose out?”

Building a solid computer stack (hardware through software) is an exercise in mechanical sympathy, understanding the traits of the underlying components which enable the practitioner to get as much out of it as possible. The software should do the thing that the software is good at, and the hardware should do the things that the hardware is good at, and neither exists in a vacuum, both should be aware of the constraints of the other.

For one example, about 25 years ago Intel made a processor called Itanium which used ‘VLIW’ (very long instruction word) instructions as opposed to x86. In this case, a lot of the complexity of operating a processor is shifted into the compiler, since individual CPU instructions may perform complex operations. In a modern processor, instructions are ‘simple’, and the processor is responsible for finding the data dependencies that allow multiple instructions to be executed in parallel (this is called ‘superscalar’ processing). With Itanium, the compiler author can encode such individual operations into more complex instructions which can be executed in parallel, all at once. So, where in x86 I might have in 3 different instructions (add r1, r2, sub r3, r4, mul r5, r6), with Itanium I might be able to combine all of those into a single instruction at compile time. By doing this, in theory we can remove a lot of the complex superscalar logic from the processor, devote more of the chip to execution units and achieve more overall performance.

Well, the best I can see for superscalar is 3 per clock:
1. generate the effective address,
2. use that effective address to access memory,
3. complete the previous memory op.
Point 2 is always the bottleneck.
I think the DG Nova computers (1970s) had the first multi-port RAM, so it was close to being a RISC architecture even if it had core memory.

In the T9000 reference manual we read:

INMOS has used advanced CMOS technology to integrate a 32-bit integer processor, a 64-bit floating point processor, 16 Kbytes of cache memory, a communications processor and four high bandwidth serial communications links on a single IMS T9000 chip

I don’t know why I thought they had tried BiCMOS.