Measuring CPU speed in a fantasy retro emulator

Hey everyone,

I developed a fantasy retro computer emulator a while back. The CPU in it uses a custom instruction set inspired by the Z80 and other CPUs from that era. The architecture is fully custom, and the CPU is “unlocked,” meaning it runs as fast as the host system allows, with no artificial timing constraints.

Now, I’d like to show some CPU speed information to users (something like a “MHz” display). Here’s the method I came up with:

  • I picked the simplest instruction, NOP, as the baseline.
  • I assigned NOP a theoretical execution cost of 3 cycles:
    1. Incrementing the instruction pointer
    2. Fetching the next byte
    3. Decoding and performing the null operation
  • I treat these 3 cycles as the base unit of measurement, so one NOP per second would correspond to 3 Hz.
  • Then, I benchmark the actual time it takes to execute a large number of NOPs and measure the average time per NOP.
  • I do similar benchmarking for other instructions. Since they will naturally take longer, I calculate their cycle-equivalents proportionally, even if it results in fractional cycles (e.g., 7.73 cycles for a more complex instruction).
  • During runtime, I keep track of total “computed cycles” to estimate a dynamic CPU speed (like MHz); a rough sketch of this bookkeeping is right below.
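
In code, that bookkeeping looks roughly like this (a simplified C sketch of the idea, not the emulator's actual code; the per-opcode costs and the instruction stream are made up):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Per-opcode cycle costs derived from the NOP baseline (values made up). */
static const double cycle_cost[3] = { 3.0, 6.9, 7.73 };

static double total_cycles;   /* accumulated "computed cycles"       */
static double start_time;     /* wall-clock time when counting began */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void execute(int opcode)
{
    /* ...the real emulator would perform the instruction here... */
    total_cycles += cycle_cost[opcode];
}

static double estimated_mhz(void)
{
    return total_cycles / (now() - start_time) / 1e6;
}

int main(void)
{
    start_time = now();
    for (long i = 0; i < 50000000; i++)   /* stand-in instruction stream */
        execute((int)(i % 3));
    printf("~%.0f MHz\n", estimated_mhz());
    return 0;
}

The displayed “MHz” is then simply accumulated cycles divided by elapsed host time.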

My questions:

  • Is this a reasonable approach for measuring CPU speed in an unlocked fantasy CPU?
  • Am I missing something important, conceptually or practically?

The reason for my question is that, when benchmarking the way I described above, my emulator reaches around 578 MHz on a host with an AMD Ryzen 9 7940HS. This would place my CPU close to a Pentium III or AMD K6-2/K6-3, which seems a bit hard to believe.

Thanks in advance for your insights!

Define your memory timing cycle first, for program and data.
Static RAM, dynamic RAM, fast static RAM or cache.
Around the time of the 386 you started to get DRAM off the bus and straight onto the CPU.
What do you use for an MMU with the Z80?
Ben.

1 Like

How about simply running some benchmarks on it? Do you have a BASIC for it? If so, there are many old benchmarks - run them on a ‘real’ Z80 and compare…

-Gordon

1 Like

I agree, some kind of benchmarking would be called for, since there can’t be an exact correspondence to a particular clock speed.

Some years back I made a 6502 emulator for ARM and got a very good clock rate for NOP which - it turned out - wasn’t at all representative of real instruction mixes!

It would be relatively easy to measure and report instructions-per-second, which you could then perhaps compare with the same measure for a real Z80 running at a known clock rate - but running, again, some real-world mix of instructions.

All that said, you’ve already measured a variety of instructions, and it sounds like your measurements take account of the mix of executed instructions, so perhaps you’re already doing fine. Remember that the original Z80 clock speeds were a healthy multiple of the memory speed, so Z80 clock speeds aren’t comparable to 6502 clock speeds, even without taking notice of the different instruction sets.

1 Like

Hey guys,

Thank you for your inputs!

I have to apologize first: I was not clear enough to begin with. The CPU I created was “inspired” by the Z80, the Motorola 68000 and some other architectures of that era, but it’s completely different. It’s totally “fantasy”.

The inspiration from the Z80 is limited to how some mnemonics are written (LD for load, SR and SL for shift right and left, DJNZ, etc.), and to how the registers can have compositional forms:
i.e.

A, B, C, ... Z but also:
AB, BC, CD, ..., ZA (colliding 16 bit)
AB, CD, EF, ... (non-colliding 16 bit)
...
ABCD, BCDE, and so on.

And finally, to how the registers work with instructions. Basically, all instructions can use all registers in any form, so there’s no specific “accumulator” or “memory incremental” register or anything like that.

@drogon
So I can’t reference anything against the Z80, since it’s a totally different CPU. I am implementing a higher-level language compiler, which I call MOSAIC: a cross between BASIC and some simple C#, with the ability to have images as variables you can work with much like ints, for example. But I feel the benchmarking should be done at assembly level, or even at machine level, so I can get as accurate as possible (per host machine).

One other thing that would taint the results is the fact that most of the older hardware (as far as I am aware) implemented BASIC as an interpreter, not as a compiler (which is what I am doing). So, right from the start, whatever you write in MOSAIC would be way faster than the interpreted equivalent.

However, I do think it’s an interesting idea to “port” some old benchmark programs and see how they run in comparison (with the notes above made very transparent). I think it’s a great idea! Thanks!

@oldben
I am rather ashamed to admit that in the two and a half years since I wrote the RAM controller, I never considered its nature until you asked these questions.
My RAM is completely inside the CPU, next to the registers. Fetching data from memory or registers is the same thing in terms of speed, or very close, depending on how many bits you fetch (8, 16, 24 or 32).
I don’t have a data bus, nor anything required to sync with the RAM. It simply fetches from the CPU RAM in a single cycle. And I think that answers my initial question, since real hardware (for cost and engineering reasons) does not typically do this.
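
To give a feel for what I mean, the access path has roughly this shape (a simplified C sketch, not the real implementation; the sizes, register file and cycle charge are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Registers and RAM sit side by side inside the CPU object, and a fetch
   is just an array index, charged at one cycle regardless of source
   (wider fetches simply read more bytes). */
typedef struct {
    uint32_t reg[26];        /* A..Z                                  */
    uint8_t  ram[1 << 20];   /* 1 MB of "internal" RAM (size made up) */
    uint64_t cycles;
} Cpu;

static uint8_t read8(Cpu *c, uint32_t addr)
{
    c->cycles += 1;          /* one cycle, same cost as a register read */
    return c->ram[addr];
}

static uint16_t read16(Cpu *c, uint32_t addr)
{
    c->cycles += 1;          /* still one "cycle" in this simplified model */
    return (uint16_t)(c->ram[addr] | (c->ram[addr + 1] << 8));
}

int main(void)
{
    static Cpu cpu;          /* static so the big RAM array isn't on the stack */
    cpu.ram[0x100] = 0x34;
    cpu.ram[0x101] = 0x12;
    uint8_t  b = read8(&cpu, 0x100);
    uint16_t w = read16(&cpu, 0x100);
    printf("byte 0x%02X, word 0x%04X, total cycles: %llu\n",
           (unsigned)b, (unsigned)w, (unsigned long long)cpu.cycles);
    return 0;
}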

@EdS
Indeed, I thought about the bias that testing one single instruction can introduce, but I also think it should not matter much. Say we write a very basic emulator with six instructions:

LD r, immediate
ADD r, r
LD r, (r)
LD (r), r
JZ r, immediate	(if r == 0, PC = immediate)
HLT

Probably the “fastest” instruction would be HLT, which should take 3 cycles:

  • increment PC
  • read next instruction from memory
  • halt

Now, let’s benchmark at CPU level and run 1000 HLT instructions, which takes a second. This means (I think) a theoretical 3 kHz CPU so far.
But what about the other instructions? Let’s say we benchmark ADD r, r and find out that 1000 of them take 2.3 seconds. Longer, but that’s fine. Using the same logic, it simply means that internally it uses 6.9 cycles to reach the end of that addition. The CPU frequency stays the same; only the number of cycles increases because the instruction does more internally.
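
The arithmetic I have in mind, with those made-up numbers, is just this (a small C sketch):

#include <stdio.h>

int main(void)
{
    /* Baseline: 1000 HLTs take 1.0 s and HLT is defined as 3 cycles,
       so one "cycle" of this hypothetical CPU costs 1.0 / (1000 * 3) s. */
    double hlt_time      = 1.0;                  /* seconds for 1000 HLTs */
    double hlt_cycles    = 3.0;
    double sec_per_cycle = hlt_time / (1000.0 * hlt_cycles);

    /* ADD r, r: 1000 of them measured at 2.3 s -> derived (fractional) cycles. */
    double add_time   = 2.3;
    double add_cycles = (add_time / 1000.0) / sec_per_cycle;

    printf("ADD r, r ~= %.1f cycles\n", add_cycles);   /* prints 6.9  */
    printf("clock ~= %.0f Hz\n", 1.0 / sec_per_cycle); /* prints 3000 */
    return 0;
}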

Below is a sort of benchmark I made on all instructions except “WAIT” and “INT”.

I excluded WAIT for obvious reasons (it simply waits and taints the result), and INT because it’s actually a system function call. For example, I draw rectangles with such an INT, and depending on the rectangle size it takes more or less time.

I also noticed that in that benchmark my NOP instruction takes more time than some more complex instructions. I fail to understand why, since there’s no way the runtime execution can be optimized: it’s tightly bound to the RAM, so there’s no way for the CPU to know what the next instruction is until it actually reads it.

Maybe the execution as a function is somewhat cached somewhere with some lookups, but that would be the execution interpreter, not anything I did. I still need to figure that out.

Ah, I see… I think… and I think I don’t quite agree.

I would break down each instruction type into a series of nominal cycles, one for each memory access, and one for any necessary internal operations (such as the calculation of an effective address). That would give me the cycle count for each instruction type. I think that’s more meaningful than back-calculating it from the present emulation performance.

In the case of the 6502, there’s an additional aspect: every instruction takes a second cycle, which can fetch an operand even if none is needed. But there’s no final cycle to write back any result to a register, because that’s overlapped with the following instruction.

So we see that the internal organisation makes a difference to the cycle counts - and your fantasy machine doesn’t have an internal organisation, so you have to make some assumption.

You might for example assume that your fantasy machine has excellent pipelining which hides the cost of all internal operations, so you only need to count the memory accesses.
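
To make that concrete, here’s the kind of breakdown I have in mind, sketched in C (the mnemonics and cycle assignments are purely illustrative, not taken from your instruction set or from any real one):

#include <stdio.h>

/* Sketch: assign each instruction a nominal cycle count from first
   principles (memory accesses plus internal operations) rather than
   back-calculating it from measured host timings. */
typedef struct {
    const char *mnemonic;
    int opcode_fetches;   /* memory accesses to read the instruction itself */
    int operand_accesses; /* memory reads/writes for operands               */
    int internal_ops;     /* e.g. effective-address calculation, ALU work   */
} InsnCost;

static int nominal_cycles(InsnCost c)
{
    return c.opcode_fetches + c.operand_accesses + c.internal_ops;
}

int main(void)
{
    InsnCost table[] = {
        { "NOP",             1, 0, 0 },
        { "LD r, immediate", 1, 1, 0 },   /* fetch opcode + fetch immediate  */
        { "LD r, (r)",       1, 1, 0 },   /* fetch opcode + read from memory */
        { "ADD r, r",        1, 0, 1 },   /* fetch opcode + one ALU step     */
    };
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        printf("%-16s %d cycle(s)\n", table[i].mnemonic, nominal_cycles(table[i]));
    return 0;
}

Note that NOP comes out differently here than in your three-step breakdown; that’s rather the point - the count follows from what you decide the machine actually does, not from the host’s timing.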

1 Like

Ah, I see where you’re going with this. Of course, this makes more sense and also solves the problem of non-integer cycles. Practically, I need to define each instruction starting from the actual operations it’s expected to perform within the CPU.

The only problem I have with this is that it may report different frequencies depending on which instructions are more prevalent within a test program meant for benchmarking.

But maybe I can take the best of both worlds and use the existing benchmark I shared earlier to get closer to an accurate frequency.

I may also have to consider that on some hosts, certain Continuum instructions would perform better than others, while on another host it could be the other way around. One thing that comes to mind is floating-point register operations: I implemented them, and I think I “can” expect them to perform considerably faster on some newer CPUs.

I’ll see how I can draw a line through all this, this is very helpful for me, thank you!

Byte magazine’s prime number sieve is still a valid benchmark for timing, other than for slow CPUs.

1 Like

Yes, I think it’s inevitable that different code mixes will result in different speed estimations. The only alternative is to make a paced emulator, which makes an effort to keep to a particular tempo. (PiTubeDirect has both fast-as-possible and paced models for the 6502. The paced model runs one instruction at a time and then delays a variable amount according to the cycle count of that instruction.)
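
Schematically, a paced main loop looks something like this (a C sketch of the idea only, not PiTubeDirect’s actual code; it re-syncs in chunks rather than after every single instruction, and the cycle cost is a made-up constant):

#include <time.h>

/* Stub standing in for the emulator core: run one instruction and
   return its nominal cycle count (a made-up constant here). */
static int execute_one_instruction(void)
{
    return 4;
}

int main(void)
{
    const double target_hz = 3500000.0;   /* pace the core to a 3.5 MHz "clock" */
    const long   chunk     = 3500;        /* re-sync roughly every millisecond  */
    long cycles = 0;
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (long i = 0; i < 1000000; i++) {  /* about a second of emulated time */
        cycles += execute_one_instruction();
        if (cycles >= chunk) {
            /* We've spent a chunk of emulated cycles; sleep until the wall
               clock catches up with all of them, then start a new chunk. */
            long ns = (long)(cycles * 1e9 / target_hz);
            next.tv_nsec += ns;
            while (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            cycles = 0;
        }
    }
    return 0;
}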

Interesting read. It seems that test was primarily designed to benchmark different computer languages/compilers against each other, not necessarily to compare CPUs. But I don’t see why it can’t be used this way as well.

For me it opens a new dimension, since I can implement it in assembly, and when MOSAIC is ready, I can implement it there too and see what the penalty is.

Also, I managed to find some benchmark results for it:


Source and context

And, very recently, someone did benchmark quite a few languages on a TI-99 using the same Byte Sieve:

Source

That thread also continues with a lot of folks doing several other benchmarks on different systems.

Yeah, I’m thinking of putting together a procedure to determine all that. Once I figure out the algorithm I would follow manually to assign cycles and measure, I can automate it, make it part of Continuum’s “BIOS” and get as close as possible to a rational/fair representation of the CPU’s frequency.

I did struggle a bit with whether to introduce some delays to match up cycles, but I eventually decided against it. Practical implementations proved difficult and not that accurate/reliable. To maximize the accuracy I had to reduce the CPU speed significantly, to about 20% of what it is today, and I didn’t want that.

But a separate build with a paced implementation sounds interesting. I never thought of separating them. I’ll put this in the backlog for more concept exploration.

Alright, I got curious and actually wrote the Byte Sieve in Continuum’s assembly. I raised the number of iterations to 200 for a more reliable result, and I implemented the algorithm without any optimizations, to stay as close as possible to the spirit of the original implementation, as that’s the actual point of the benchmark.

So, scaled to the original benchmark’s 10 iterations, Continuum finished in 28.3 ms.

Looking at the TI-99’s assembly result listed above, which states 9.3 seconds, I then noticed that “Strange Cart BASIC” actually has a faster result of 8.7 seconds. I can’t figure out which BASIC on this Earth surpasses the assembly implementation, unless it’s a BASIC compiler with a very aggressive optimizing architecture, but I decided to take that as the reference instead.

So, 8700 ms vs 23.3 ms gives a ratio of roughly 373.

The TI-99 has a TI TMS9900 at 3 MHz. So, with some naively ignorant math, I’ll multiply 3 by 373, which would theoretically put Continuum at 1119 MHz, or about 1.1 GHz.
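
In code form, the naive estimate is just this (numbers copied from above):

#include <stdio.h>

int main(void)
{
    double reference_mhz = 3.0;      /* TMS9900 clock                   */
    double reference_ms  = 8700.0;   /* Strange Cart BASIC sieve time   */
    double continuum_ms  = 23.3;     /* Continuum assembly sieve time   */

    double ratio = reference_ms / continuum_ms;              /* roughly 373 */
    printf("estimated Continuum clock: %.0f MHz\n", reference_mhz * ratio);
    return 0;
}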

I was unable (so far) to find other assembly tests performed on other hardware to make parallel calculations and draw an average, though I did find some folks running tests on a 1.77-MHz MOS 6502 versus a 4.77-MHz Intel 8088 with a highly optimized version of the sieve. But I have no actual numbers, nor any certainty that they used the same algorithm.

So, that 1.1 GHz above is highly unreliable, IMHO. But it was fun to try out.

Well done! In my experience, every discussion of sieve benchmarks turns into a discussion about optimising the implementation - which, as you say, moves the result away from being a measure of the CPU performance.

Well, I think that depends a bit on how one uses this particular test. My brief read through the article OldBen provided revealed that the initial scope seems to have been to benchmark compilers/languages against each other. So, this code has two separate possible uses:

  1. Benchmark how a language behaves versus another on the same machine;

or

  2. Benchmark how the exact same implementation on one machine measures against the same implementation on a different machine.

For instance, let’s say I use my future MOSAIC language to compile an implementation of this Byte Sieve and I want to compare it with my assembly implementation of the same Byte Sieve on the same Continuum “hardware”.

The very first thing that would happen would be this part of the code:

for (i = 0; i <= size; i++) {
    flags[i] = true;
}

being immediately optimized away into this single instruction:

MEMF (.flags), 8191, 1

that, compared to the assembly implementation:

	LD XYZ, 8191
	LD BCD, .flags

	; Fill the sieve with 1s
.populateNext
	LD (BCD), 1
	INC BCD
	DEC XYZ
	CP XYZ, 0
	JR NE, .populateNext

is blazingly fast. Namely, one instruction (albeit one that could take a bit longer, since it has a loop behind it, but not by much) versus the approximately 40957 instruction executions that the loop produces.

I am respecting the “rules” in both situations and I write the implementation correctly in both cases, but in one of them an optimization occurs, and that is basically legal since we’re measuring the languages.

Moreover, some advanced version of MOSAIC with aggressive optimization flags enabled might conclude that the whole block produces a finite, predictable result that is reasonably sized to be cached, and simply replace it with the results, thereby reducing the “execution time” to a microscopic fraction of the original. It’s still legal.

However, indeed, when you try to judge fairly how long one piece of hardware takes to execute compared to another, you need a uniform implementation, preferably in assembly, and you must not skip any of the steps, even if you are able to fill memory, do advanced arithmetic in a single instruction, and so on.

… and I think with that I may have also answered my own question of “Why was Strange Cart BASIC faster than assembly on that same TI-99?” Maybe due to such an optimization.

1 Like

The TI99/4 and /4A computers had a meager 256 bytes of memory, mostly used for storing the virtual registers for the TMS9900. If you wanted to store a “flags” array with more than a thousand elements then the only option would be to use the video memory. That meant accessing the registers in the 9918 chip to set up the desired address and then some more register accesses to actually read or write the data. I can easily imagine a carefully crafted BASIC outperforming a sloppy assembly version of this operation.
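
Roughly, every access to a flags element kept in VRAM costs something like this (a C sketch of the 9918 access pattern as I remember it; the simulated ports, base address and helper names are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the TI-99's memory-mapped VDP ports. */
static uint8_t  vram[16384];   /* simulated 16 KB of 9918 VRAM    */
static uint16_t vdp_addr;      /* current auto-incrementing address */

static void vdp_ctrl(uint8_t b)    /* two writes here set the address */
{
    static int     phase = 0;
    static uint8_t low;
    if (phase == 0) { low = b; phase = 1; }
    else { vdp_addr = (uint16_t)(((b & 0x3F) << 8) | low); phase = 0; }
}

static void vdp_data(uint8_t b)    /* each write lands in VRAM, address++ */
{
    vram[vdp_addr++ & 0x3FFF] = b;
}

/* Setting flags[i] = 1 when the array lives in VRAM: two control-port
   writes to aim the address (0x40 = "this is a write"), then the data. */
static void set_flag(uint16_t base, uint16_t i)
{
    uint16_t a = base + i;
    vdp_ctrl(a & 0xFF);
    vdp_ctrl(((a >> 8) & 0x3F) | 0x40);
    vdp_data(1);
}

int main(void)
{
    for (uint16_t i = 0; i <= 8190; i++)
        set_flag(0x1000, i);        /* the 0x1000 base address is made up */
    printf("%d\n", vram[0x1000]);   /* prints 1 */
    return 0;
}

If I remember correctly, the VDP auto-increments the address after each data access, so a sequential fill only has to aim once, but the sieve’s scattered strike-outs pay the full address setup every time.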

About clock speeds, before the 1990s microprocessors had significant differences in their implementations such that direct comparisons didn’t make sense (the famous Z80 @ 3.6 MHz vs 6502 @ 1 MHz discussions were a great example).

Though it is still a work in progress, you can see the BASIC Sieve translated to RISC-V assembly and used as a benchmark on a 95 MHz processor. Besides the Sieve, I have a trivial sine wave generator and a Mandelbrot generator. To make it easier to compare different processors, I use the number of times per second each can execute the benchmarks.

2 Likes

Ah, I did not know that about the TI-99. I remember checking the specs and noticing it has an optional block of 24K of RAM, so I naturally presumed that the test in question took advantage of it.

But yes, the CPU <=> RAM collaboration seems to be a heavy factor here. The CPU in my emulator does not suffer such penalties, since it’s as fast to access any RAM address as it is to access a register. So I think the fair parallel benchmark in my case would be to write code that does not touch RAM and benchmark that, for instance on a Z80 versus my Continuum, to get a more accurate representation of my CPU frequency, which I expect to be much less than that 1.1 GHz. At least on the computer I am using for testing.

I did it differently now, using a ZX Spectrum 48K+ as the reference. I wrote a simple looping program that tinkers with some registers, the stack and some calls, and I wrote it in such a way that I could write it almost identically for Continuum. Functionally and procedurally, both are indistinguishable.

Spectrum runs it in 211250 ms (about 3 and a half minutes)
Continuum runs it in 906 ms (less than a second)

So, Continuum finishes 233 times faster.
The Spectrum runs at 3.54 MHz, so Continuum is calculated at 824.82 MHz.

Then, fearing the bias of the PUSH/POP instructions which DO touch memory, I removed the PUSH/POP from the “somework” subroutine, thereby removing 6553500 executions of them.

New benchmark data:
Spectrum runs it in 171200 ms
Continuum runs in 641 ms

This time, Continuum finishes 267 times faster, contrary to what I expected.
The Spectrum still runs at 3.54 MHz, and Continuum has changed its mind to 945.18 MHz.

First conclusion seems to be that Continuum’s push/pop is more expensive, proportionally speaking.

However, given all this “benchmarking”, I’d put the error margin at the difference between the two results, 945.18 - 824.82, which is 120.36 MHz.

Since it’s an error margin, I’ll now subtract it from the lower value, yielding 704.46 MHz. This is the value I consider (so far) rational and safe, as close as possible to the realistic theoretical frequency Continuum reaches relative to a Z80 on my AMD Ryzen 9 7940HS. Incidentally, it’s not very far from what my current Continuum benchmark yields: 578 MHz.

So, I’ll posit this (which will probably be tweaked later after many more tests): “Continuum runs at about 17.6% of the frequency of a single core of the host CPU it runs on”. Maybe 20%.
It remains to be seen how it fares on SBCs and other small/older hardware.

For reference, the programs run on the Z80 and on Continuum:

Z80 code

	ORG 40000

	LD BC, 100
loopMain:
	CALL mainLoop
	DEC BC
	LD A, B
	OR C
	JR NZ,loopMain
	RET

mainLoop:
	PUSH BC
	LD   BC,0FFFFh    ; initialize 16-bit counter to 65535

loop:

	CALL somework

	DEC  BC           ; BC = BC - 1
	LD   A,B          ; bring high byte into A
	OR   C            ; Z=1 iff B=0 AND C=0 (i.e. BC==0)
	JR   NZ,loop      ; repeat until BC==0
	POP BC
	RET

somework:
	PUSH AF
	LD D, 0
	LD E, 12
	LD H, D
	LD L, E
	ADD HL, DE
	XOR A
	POP AF
	RET

Continuum 93 code:


	#ORG 0x80000

	LD BC, 100
.loopMain

	CALL .mainLoop
	DEC BC
	LD A, B
	OR A, C
	JR NZ, .loopMain

.mainLoop
	PUSH BC
	LD BC, 0xFFFF

.loop
	CALL .somework

	DEC BC
	LD A, B
	OR A, C
	JR NZ, .loop
	POP BC
	RET

.somework
	PUSH AB
	LD D, 0
	LD E, 12
	LD H, D
	LD I, E
	ADD HI, DE
	XOR A, A
	POP AB
	RET

If you have an 8080 version, then you could compare the Z80 and Continuum for hardware improvements over time.

1 Like