I might say a little more about the T9000: it was intended as a better, faster transputer than the first-generation T4/T8 series. (The T2 was a proof of concept, more or less, and the T8 was a T4 plus an FPU. All were simple scalar machines, I think, with no cache of any sort.)
Here’s his post from HN, posting as fanf2:
In 1993-4 I was an intern at Inmos, when they were trying to get the T9000 transputer to work.
The transputer was a stack architecture, with a bytecode-style instruction set. The stack was shallow, just big enough to evaluate a typical expression you would write in a high-level language. It also had a local addressing mode, relative to a “workspace” pointer register, which was used for local variables.
To make the T9 go faster, they gave it a “workspace cache”, which was effectively a register file. The instruction decoder would collect up sequences of bytecodes and turn them into RISC-style ops that worked directly on the registers, so the stack was in effect JITted away by the CPU’s front end.
A really cool way to revamp an old design; a pity that the T9 was horribly buggy and never reached its performance goals.
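The grouping idea described in that post can be illustrated with a toy model. This is only a sketch under stated assumptions: the mnemonics ldl, stl and add are real transputer instructions, but the function name `group`, the single fusion pattern it recognises, and the three-address output format (with `w1`, `w2`, ... standing for workspace-cache slots) are invented here for illustration and are not taken from the T9000 design.

```python
# Toy model: fuse a run of stack bytecodes into one register-style operation.
# The fusion rule and the output tuple format are assumptions for illustration.

BINOPS = {"add", "sub", "mul"}

def group(bytecodes):
    """Fuse 'ldl a; ldl b; <binop>; stl c' into (binop, wc, wa, wb)."""
    out = []
    i = 0
    while i < len(bytecodes):
        window = bytecodes[i:i + 4]
        if (len(window) == 4
                and window[0][0] == "ldl"
                and window[1][0] == "ldl"
                and window[2][0] in BINOPS
                and window[3][0] == "stl"):
            a, b, c = window[0][1], window[1][1], window[3][1]
            # The evaluation stack never appears in the fused op: the operands
            # are named directly as workspace slots.
            out.append((window[2][0], f"w{c}", f"w{a}", f"w{b}"))
            i += 4
        else:
            # Anything that does not match the pattern is passed through as-is.
            out.append(bytecodes[i])
            i += 1
    return out

# c := a + b, with a, b, c at workspace offsets 1, 2, 3
program = [("ldl", 1), ("ldl", 2), ("add",), ("stl", 3)]
print(group(program))  # [('add', 'w3', 'w1', 'w2')]
```

The point of the sketch is only that four stack-machine bytecodes collapse into one operation on named slots, which is the sense in which the stack is "JITted away".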
The official T9000 Hardware Reference Manual has some things to say about the pipeline and the instruction grouper, and shows how several instructions can be launched in each cycle. See p10 (p33) and also p74 (p96) of the PDF.
Since the processor can fetch one word, containing four bytes of instructions and data, in each cycle it is possible to achieve a continuous execution rate of four instructions per cycle (200 MIPS). However, if any of the instructions require more than one cycle to execute, then the instruction fetch mechanism can continue to fetch instructions so that larger groups can be built up. Up to 8 instructions can be put into one group and there may be five groups in the pipeline at any time.
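The figures in that passage can be checked with a little arithmetic. The sketch below assumes the best case the manual describes, where each of the four bytes fetched per cycle is a complete instruction; the variable names are made up for this example, and the clock rate is derived from the quoted numbers rather than stated in the passage.

```python
# Back-of-envelope check of the quoted figures (best case only).
INSTRUCTIONS_PER_CYCLE = 4   # one word per cycle, four one-byte instructions
PEAK_MIPS = 200              # peak rate quoted in the manual
MAX_GROUP_SIZE = 8           # instructions per group
MAX_GROUPS_IN_PIPELINE = 5   # groups in flight at once

# Clock rate implied by 200 MIPS at 4 instructions per cycle.
implied_clock_mhz = PEAK_MIPS / INSTRUCTIONS_PER_CYCLE

# Upper bound on instructions in the pipeline at any moment.
max_in_flight = MAX_GROUP_SIZE * MAX_GROUPS_IN_PIPELINE

print(implied_clock_mhz)  # 50.0
print(max_in_flight)      # 40
```

So the 200 MIPS figure corresponds to a 50 MHz clock with every fetched byte doing useful work, and at most 40 instructions can be in the pipeline at once.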
During my tenure, some preliminary work started on a successor to the T9000, which, as I recall, was to involve out-of-order machinery (Tomasulo’s algorithm, as “first implemented in the IBM System/360 Model 91’s floating point unit”).
The major innovations of Tomasulo’s algorithm include register renaming in hardware, reservation stations for all execution units, and a common data bus (CDB) on which computed values broadcast to all reservation stations that may need them. These developments allow for improved parallel execution of instructions that would otherwise stall under the use of scoreboarding or other earlier algorithms.
(One of the principals of that work is now a Fellow and Chief Architect at ARM.)
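For readers who have not met Tomasulo’s algorithm, the three ingredients named in the quoted description (register renaming, reservation stations, and a common data bus) can be shown in a very small simulation. This is a teaching sketch only: the class name `Tomasulo`, its methods, and the register names are all invented here, timing and functional units are not modelled, and it says nothing about how the planned T9000 successor would actually have been built.

```python
import operator

# Toy Tomasulo model. All names are hypothetical; no timing is modelled.
OPS = {"add": operator.add, "mul": operator.mul, "div": operator.floordiv}

class Tomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)   # architectural register values
        self.producer = {}       # register -> tag of the station that will write it
        self.stations = {}       # tag -> reservation station contents
        self.next_tag = 0

    def issue(self, op, dest, src1, src2):
        """Allocate a reservation station and rename the destination."""
        tag = f"RS{self.next_tag}"
        self.next_tag += 1
        vals, waits = [], []
        for src in (src1, src2):
            if src in self.producer:          # not ready: wait on producer's tag
                vals.append(None)
                waits.append(self.producer[src])
            else:                             # ready: capture the value now
                vals.append(self.regs[src])
                waits.append(None)
        self.stations[tag] = {"op": op, "vals": vals, "waits": waits, "dest": dest}
        self.producer[dest] = tag             # later readers of dest wait on this tag
        return tag

    def ready(self):
        """Tags of stations whose operands have all arrived."""
        return [t for t, s in self.stations.items()
                if all(w is None for w in s["waits"])]

    def execute(self, tag):
        """Run one ready station and broadcast its result on the 'CDB'."""
        s = self.stations.pop(tag)
        assert all(w is None for w in s["waits"]), "operands not ready"
        result = OPS[s["op"]](*s["vals"])
        for other in self.stations.values():  # common data bus broadcast
            for i, w in enumerate(other["waits"]):
                if w == tag:
                    other["vals"][i] = result
                    other["waits"][i] = None
        if self.producer.get(s["dest"]) == tag:  # still the latest writer?
            self.regs[s["dest"]] = result
            del self.producer[s["dest"]]
        return result

cpu = Tomasulo({"r1": 6, "r2": 3, "r3": 10, "r4": 2})
i1 = cpu.issue("div", "r5", "r1", "r2")   # slow producer of r5
i2 = cpu.issue("add", "r6", "r5", "r3")   # needs r5; captures r3 = 10 at issue
i3 = cpu.issue("mul", "r3", "r4", "r4")   # overwrites r3 (write-after-read vs i2)

cpu.execute(i3)   # runs out of order, ahead of i1 and i2
cpu.execute(i1)   # result is broadcast to i2's station
cpu.execute(i2)   # uses the old r3 (10), not the new one (4)
print(cpu.regs["r3"], cpu.regs["r5"], cpu.regs["r6"])  # 4 2 12
```

The third instruction overwrites r3 before the second has run, yet the second still computes with the old value because it captured it at issue. That is the kind of stall a simple scoreboard would have imposed and that renaming removes.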
I think it was understood by engineers, if not by management, that there was no hope of implementing a successor machine: we didn’t have the skills, techniques, or resources. We barely got the T9000 working, had no chance of reaching a useful clock speed, and had already suffered an exodus. Indeed, most of the T9000 team was hired after a post-T800 exodus. We, as engineers, could do interesting work, and learn things, with some small hope of making a success of the project.