Yes. They are called micro-ops. Some conflate that with microcode, but I think that’s doing a disservice. I don’t know whether modern x86 processors also have microcode; it could well be.
I’d say it’s anything but simple!
There’s an interesting five-page article by Robert P. Colwell (it’s behind a paywall), The Origin of Intel’s Micro-Ops, about what was P6 and became the Pentium Pro. Strong opinions in here!
Some excerpts:
Many if not most researchers in the computer architecture area had become convinced that complex instruction sets such as the Intel x86 were doomed in light of the many advantages promised by reduced instruction set architecture publications. There were many voices within Intel urging upper management to abandon x86 and get started on some alternative.
…
Meanwhile, in 1990, Dave Papworth and I were recent refugees from defunct startup Multiflow Computer and its Very Long Instruction Word (VLIW) machine, so we knew there was plenty of intrinsic parallelism available among a wide variety of operation sequences, even in fairly unpromising instruction sequences. But attempting to apply Multiflow’s compiler-based approach to what was already a large legacy of existing x86 code was simply a nonstarter.
…
Multiflow’s best trick, its compiler, was not going to directly help us with the new x86 chip we had been tasked with conceiving. But we still wanted to get that instruction-level parallelism.
…
the notion of turning complicated instructions into independent tasks, like little white paper boats (Figure 1) that would float down a stream, go where they needed to go for execution, observe machine interlocks for data dependencies and functional unit latencies, find their way home, and get reordered into a final queue for retirement as a group …that idea just took root in the project.
…
Was it possible that there was something about the microdataflow idea that would turn out to be fundamentally incompatible with the x86 instruction set? Back then, we did not know. Our collective best engineering judgment was that we’d be ok, but I worried about this a lot until a great deal of presilicon validation testing had passed.
…
In practical terms, this meant that micro-ops would carry a lot of bits — 100+ each in the machines I worked on. Some μops performed very simple operations, while others did complicated things.
…
It has now been about 30 years since we first came up with this micro-op approach to x86 design. Over that time period, I have seen dozens, possibly hundreds, of papers that refer to micro-ops as “converting CISC instructions into RISC instructions.” […] A common phrasing goes like this: “The x86 ISA implements a variety of complex instructions that are internally broken down into RISC-like micro-ops…” with references to “…the micro-op ISA…”
This is not just wrong. It is wrong-headed.
(I realised this discussion is a bit of a new adventure in the original thread, so I’ve rethreaded)
This article discusses AMD’s K6, which is of 1996 vintage and therefore some 30 years old.
Internally, the K6 translates the multimedia instructions into RISC86 operations — the RISC-like primitives that were the most innovative feature of the Nx586 when it appeared in 1994. The Nx586 was the first x86 processor to introduce this concept of a decoupled CISC/RISC microarchitecture. On the outside, to x86 software, the chip behaves like a normal x86 CPU. But inside, special decoders translate the variable-length CISC instructions into fixed-length (albeit long) RISC-like operations that execute in a RISC-like core.
NexGen, Intel, and AMD are now using decoupled microarchitectures in all their latest CPUs. They believe it’s a better approach than trying to execute multiple CISC instructions in parallel and out of order. The only holdout is Cyrix; engineers there say a decoupled microarchitecture will become too difficult to manage in wider superscalar designs.
Edit: AMD’s K6 technical brief might also be of interest:
The scheduler is the heart of the AMD-K6 processor. It contains the logic needed to manage out-of-order execution, data forwarding, register renaming, simultaneous issue and retirement of multiple RISC86 operations, and speculative execution. The scheduler’s RISC86 operation buffer can hold up to 24 operations. The scheduler can simultaneously issue a RISC86 operation to any available execution unit (store, load, branch, integer, integer/multimedia, or floating point). In total, the scheduler can issue up to six and retire up to four RISC86 operations per clock.
The scheduler and its operation buffer can examine an x86 instruction window equal to 12 x86 instructions at one time. This advantage stems from the fact that the scheduler operates on the RISC86 operations in parallel and allows the AMD-K6 processor to perform dynamic on-the-fly instruction code scheduling for optimized execution. Although the scheduler can issue RISC86 operations for out-of-order execution, it always retires x86 instructions in order.
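The discipline the brief describes (issue out of order, retire strictly in order) can be sketched in a few lines of Python. The 24-entry buffer size comes from the brief; everything else here, including the op names and the single-pass issue loop, is a toy model, not the K6’s actual logic (per-clock issue and retire limits are not modeled):

```python
from collections import deque

class Op:
    def __init__(self, name, deps):
        self.name = name       # label for a RISC86-like operation
        self.deps = set(deps)  # names of ops whose results we need
        self.done = False

def run(ops, buffer_size=24):
    """Toy scheduler: execute any op whose inputs are ready (out of
    order), but retire only from the head of the buffer (in order)."""
    buffer = deque(ops[:buffer_size])  # in-flight window, program order
    retired, completed = [], set()
    while buffer:
        # Issue phase: any op with all dependencies satisfied executes.
        for op in buffer:
            if not op.done and op.deps <= completed:
                op.done = True
                completed.add(op.name)
        # Retire phase: only the oldest finished ops leave the buffer,
        # so architectural state always advances in program order.
        while buffer and buffer[0].done:
            retired.append(buffer.popleft().name)
    return retired

ops = [Op("load_a", []), Op("load_b", []),
       Op("add", ["load_a", "load_b"]), Op("store", ["add"])]
print(run(ops))  # retirement order matches program order
```

Even if `add` finished before `load_b` in a real machine, it could not retire first; that in-order retirement is what makes precise x86 exceptions possible on top of an out-of-order core.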
I was involved in code optimization for x86 processors ever since I read The Zen of Assembly Language.
Early models of the processor family were heavily microcoded. The very CISCy instructions such as the string instructions along with XLAT and effective address calculations were known to be very slow. Even so, they were often faster than implementing the equivalent using simpler instructions.
The situation slowly improved with the 286 and 386 with the complex operations becoming very competitive.
The 486 was a major leap ahead. Intel made many of the simple instructions very fast - able to execute at a rate of one per clock cycle if there were no data dependencies or register conflicts. The optimization advice was to code string operations using separate simple instructions instead of the special dedicated ones.
The Pentium took it one step further by having two integer execution units. It can execute two instructions per clock cycle if they do not interfere with each other. I had real fun with this one; some of my fastest code looked like two independent programs interleaved together because that was essentially what it was.
Somewhere along the way was a company called Transmeta with its Crusoe processor. It executed the x86 instruction set by translating instructions on the fly to its internal native instruction set. An x86 emulator implemented in hardware, if you will. Its claim was low power consumption but it was not as fast as the competition.
My recollection is hazy here, but I had thought I remembered Intel acquiring a company in Israel called NexGen. I am wrong, but the Pentium Pro was the first member of the family to do out-of-order execution. The problem was that some instructions were slower on the Pentium Pro than the Pentium.
Newer members of the family perfected the technique of decomposing x86 instructions into micro-ops which were then executed in the most efficient order possible.
For the record, there is a book about the K6: The Anatomy of a High-Performance Microprocessor: A Systems Perspective, by Shriver and Smith. The authors were two of the chip’s architects. Personally I think the K6 was important: it came along at the time when the technology first permitted a single modern core to be integrated onto a single chip (350nm, 5 metal layers, 8.8M transistors). Unfortunately the book is very heavy going, rich in details like bitfield layouts but impoverished on overviews.

But it’s obvious that the “RISC opquads” are nothing like any RISC instruction set anyone would ever design for standalone use. For example, there are four “weird” branch conditions dedicated to escaping from REP OPCODE sequences, and there are many similar examples. With respect to microcode, these sequences do come from a ROM, but to save space they are heavily merged together and then fleshed out at runtime with values from registers. The entire process is extremely complex, and to further complicate matters, there are parallel hardware decoders that act as optimizations, recognizing and directly decoding many of the simpler instructions. I describe them as optimizations because the authors actually recommend in the text thinking about them this way.
It was AMD who bought NexGen, but your history is pretty good.
The Transmeta chips had a VLIW (very long instruction word) internal instruction set and a key feature is that it could save all registers at the start of an execution block and either commit them at the end of the block or discard them if any exception happened. This would roll back the processor state to the start of the block and then x86 instructions could be interpreted one by one so the exception would be taken in the correct block. This was important because it allowed the compiler to reorder the code inside a block which greatly increased performance.
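That commit-or-discard scheme can be sketched as a shadow copy of the register file. This is a toy model of the idea only, not Transmeta’s actual mechanism; the class and register names are made up for illustration:

```python
class ShadowRegs:
    """Toy model of Crusoe-style speculation: work on a copy of the
    registers, commit at block end, discard on any exception."""
    def __init__(self, regs):
        self.committed = dict(regs)  # architectural state
        self.working = dict(regs)    # speculative copy

    def execute_block(self, instructions):
        self.working = dict(self.committed)  # snapshot at block start
        try:
            for insn in instructions:
                insn(self.working)  # may raise (e.g. divide by zero)
        except Exception:
            # Roll back: the reordered block never visibly happened,
            # so the instructions can be re-run one by one to take the
            # exception at the architecturally correct point.
            self.working = dict(self.committed)
            return False
        self.committed = dict(self.working)  # commit at block end
        return True

regs = ShadowRegs({"eax": 6, "ebx": 0})
ok = regs.execute_block([
    lambda r: r.update(eax=r["eax"] * 7),
    lambda r: r.update(ecx=r["eax"] // r["ebx"]),  # raises: ebx is 0
])
print(ok, regs.committed)  # rollback leaves committed state untouched
```

The point of the snapshot is exactly what the post describes: the translator is free to reorder aggressively inside a block, because any trap simply discards the block’s effects and falls back to slow, precise interpretation.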
NexGen was essentially set up to commercialize the early research work on micro-ops, and was thus a big influence on the AMD K6 (after the K5 wasn’t that successful). It’s a weird processor where some ops are actually executed as x86 subroutines.
The other player was IDT (mostly famous for MIPS), which built an x86 frontend onto its CPU designs for the IDT WinChip (which became the VIA CPU line).
Yes: after I posted, I realized the two authors of the book I mentioned must have been working for NexGen when they did the work that led to the book. I don’t think there is any mention of this, or of the AMD acquisition, in the book. I’ve been looking at the book more closely since I posted, and it really does describe pretty well how the x86 instruction set is translated into these “opquads” of “RISCops” and how they are then executed. The K6 “Scheduler” works somewhat like the “reorder buffer” found in classical out-of-order designs. By the way, if you are curious about this type of technology, I recommend reading this: https://user.eng.umd.edu/~blj/RiSC/RiSC-oo.1.pdf - it’s the best overview of out-of-order CPU technology I’ve ever found on the Internet. The author has left UMD and is now at the Naval Academy. I guess I am off topic here because this is not exactly retro, sorry.
The references in the paper are from 1967, 1985, and 1987, and the paper itself is over a quarter of a century old. That seems like retro to me. Probably the most retro thing is on page 11: “This corresponds roughly to the design of today’s DRAM architectures (e.g. Direct Rambus) that allow pipelined requests to memory but can handle a maximum of three requests in the pipe simultaneously.” Just the other day I saw a video about the history of computing which explained what Rambus was, since many viewers were born after it had failed.
The Centaur Technology VIA C3 family of processors also has an internal 32-bit micro-op instruction set. It has a capability, called the Alternative Instruction Set, to execute those directly.
Fair. Thank you. I guess part of my issue with it is that building such computers with TTL seems pretty much impractical because of the amount of state that has to be kept around during the execution of these small pieces of instructions. In the PDF I linked above, I think it’s about 80 bits per operation, which would be tough on a homebrew; just the bus transceivers or multiplexers for all those busses would be an insane amount of wiring. There is a guy named Fabian Schuiki who is trying to do it, though: https://www.youtube.com/@fabianschuiki. He’s made 52 videos so far and is just getting to the superscalar part, which sort of makes my point.
What happened to the idea with RISC, that the compiler would handle all that nasty stuff?
It is only at the higher level of program code that you can see the logic. Perhaps a different computing model is needed, to hint at what things need to be done and when.
All this reminds me of the Wheel of Reincarnation from computer graphics.
Just my rambling thoughts on the post.
It wasn’t true. Compilers can do data flow analysis and can pre-load data in some cases. But if the data layout is bad, they cannot change it in ways that make more efficient use of caches, which turns out to be the critical factor in many applications. Watch https://www.youtube.com/watch?v=rX0ItVEVjHc for more (Mike Acton on data-oriented design).
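The layout problem is the classic array-of-structs versus struct-of-arrays choice. A hypothetical sketch (Python hides real memory layout, so this only illustrates the two shapes of the data; the cache effect shows up in languages like C, where a scan over one AoS field drags every neighboring field through the cache):

```python
N = 4

# Array-of-structs: each record bundles the hot field 'x' with cold
# fields, so in a flat memory layout a scan over 'x' would also pull
# the cold bytes into every cache line it touches.
aos = [{"x": i, "y": 2 * i, "cold": b"\x00" * 56} for i in range(N)]
sum_aos = sum(rec["x"] for rec in aos)

# Struct-of-arrays: each field is stored contiguously on its own, so a
# scan over 'x' touches only the bytes it actually needs. This is the
# kind of transformation a compiler cannot do for you.
soa = {"x": list(range(N)),
       "y": [2 * i for i in range(N)],
       "cold": [b"\x00" * 56] * N}
sum_soa = sum(soa["x"])

print(sum_aos, sum_soa)  # same result either way; only layout differs
```

Both layouts compute the same answer, which is exactly why the compiler is stuck: changing from AoS to SoA alters the program’s data structures, not just its instruction schedule.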