Retro8 - what would have been the ultimate 8 bit micro?

On June 21, 2013 Keith Clark wrote in the Retro Computing community in Google Plus:

What about some Retro Computing community project ideas here.

1.) Design/build the ultimate 8 bit cpu.
2.) Design/build the ultimate 8 bit computer

Both would be emulated in software, of course, so the sky would be the limit. Pick open source, multi platform software to build with so everyone can participate. A custom OS could be made for it, or several of them.

I don’t know, just toying with ideas. I know I had ideas of what the ultimate system would be in the late 70’s, early 80’s. Maybe time to make them come to life!

Any interest in such an idea?

retro8 is my contribution, but it would be interesting to hear other people’s suggestions. There are actual projects to build something like this, such as the C256 Foenix or the Commander X16.

These use processors that were available back then and at the time I considered the Motorola 6809 to be one of the best 8 bit processors (it was probably the last to come out, not counting microcontrollers). But since item 1 wonders about the ultimate 8 bit cpu I am proposing a new design but very much in the style of the era (CISC, for example).

VIDEO and AUDIO:

Each of these chips would have its own 64KB of DRAM. The video chip can have up
to 128KB plus an extra 64KB, but even the highest resolution modes fit in
64KB, though that leaves no room for multiple screens to do double buffering.

Originally I thought about a completely new design for the video part with fancy features like the original Gameduino. But the update to the classic Texas Instruments 9918 is more than reasonable and the sound chip from the GS can be considered a worthy successor to the famous SID chip from the Commodore 64.

retro8 PROCESSOR:

12 registers in four triples

Each of the 8 bit registers is indicated by a 4 bit code in the instruction:

      11xx  10xx  01xx
xx00  AH    AM    AL
xx01  BH    BM    BL
xx10  CH    CM    CL
xx11  DH    DM    DL

The remaining codes indicate a byte in memory pointed to by a triple of registers (unless modified by a prefix). This is a bit like the old (HL) “register” in the 8080.

0000 *A
0001 *B
0010 *C
0011 *D

Some instructions operate on register triples. In those instructions fewer bits are needed to indicate the operands:

00 A
01 B
10 C
11 D

Some instructions (pushalt and popalt) use these encodings to indicate alternative registers:

00 PC      24 bits
01 SP      24 bits
10 Status  8 bits: IL2 IL1 IL0 EI Z N V C
11         (undefined)

instruction formats

retro8 is a two address machine, with one operand indicated by “s” in the following instruction formats and a second operand, which is also the destination, indicated by “d”. The other bits in the instructions are “o” for the operation codes and “i” for immediate values.

hex    binary format        type                                               mnemonics
A0-AF  oooo ssss oooo dddd  combine source with destination                    add, sub, addc, subb, mov, xor, or, and, lsl, asr, shr, rot, mul, muh, mus, rotc
00-7F  oooo dddd iiii iiii  combine immediate with destination                 addi, subi, addci, subbi, movi, xori, ori, andi
C0-EF  oooo ssdd            24 bit operations, transfer source to destination  mov24 (s=>d), st (s=>*d), ld (*s=>d)
80-9F  oooo oodd            24 bit operations, change destination              inc24, dec24, callr, jumpr, push, pop, pushalt, popalt
B0-BF  oooo cccc iiii iiii  branch on condition                                call (coded as brnv which has cccc=0000), br??
F0-FB  oooo oooo            implied operands                                   cc, sc, cv, sv, cn, sn, cz, sz, di, ei, rti, brk
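
Since the first byte alone selects the format, a decoder only needs a handful of range checks. The following Python sketch illustrates this for the base retro8; the name `classify` and the format labels are made up for the example, and the lengths simply count the bytes shown in the binary column:

```python
def classify(first_byte):
    """Return (format label, length in bytes) for a base retro8 instruction."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first byte must fit in 8 bits")
    if first_byte <= 0x7F:
        return ("immediate", 2)        # oooo dddd iiii iiii
    if first_byte <= 0x9F:
        return ("change 24 bit", 1)    # oooo oodd
    if first_byte <= 0xAF:
        return ("two address", 2)      # oooo ssss oooo dddd
    if first_byte <= 0xBF:
        return ("branch", 2)           # oooo cccc iiii iiii
    if first_byte <= 0xEF:
        return ("transfer 24 bit", 1)  # oooo ssdd
    if first_byte <= 0xFB:
        return ("implied", 1)          # oooo oooo
    return ("undefined", None)         # FC-FF are not assigned in the base design
```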

Interrupts

The first 32 bytes in RAM are the interrupt table. Each entry has three bytes to be loaded into the PC and one into STATUS. There is no interrupt level 0, so the first four bytes are used by the brk instruction. The last 7 bytes in memory are loaded into the PC, STATUS and SP after a reset and are normally in ROM. An interrupt pushes the PC and STATUS to the stack and if the newly loaded STATUS doesn’t have EI set then further interrupts will be ignored. When EI is set, interrupts at a lower or equal level than IL[2:0] are ignored.

24 bit values are stored in memory in little endian order, though that makes hex dumps harder to read.
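
To make the table layout and the masking rule concrete, here is a small Python sketch. It assumes each entry holds the PC in its first three bytes (little endian) and STATUS in the fourth, and that the status bits are numbered from bit 7 (IL2) down to bit 0 (C) in the order listed earlier; the function names are hypothetical:

```python
def read_vector(memory, level):
    """Return (pc, status) from interrupt table entry `level` (entry 0 is brk)."""
    base = 4 * level
    pc = memory[base] | (memory[base + 1] << 8) | (memory[base + 2] << 16)
    return pc, memory[base + 3]

def interrupt_accepted(status, level):
    """EI must be set and the request must be above the current level IL[2:0]."""
    ei = (status >> 4) & 1        # assumed position of EI
    il = (status >> 5) & 0b111    # assumed position of IL2..IL0
    return bool(ei) and level > il
```

For example, with bytes 56 34 12 10 stored at addresses 4 to 7, a level 1 interrupt would load PC = 0x123456 and a STATUS with EI set and level 0.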

retro8+ PROCESSOR

The instruction set described for retro8 is complete and makes for a very
nice 8 bit processor that can elegantly handle 16MB of memory. 10 of the
possible 256 first bytes are not used, so for the enhanced retro8+ they
are defined as prefixes that modify the behavior of the following instruction.
From an interrupt point of view they are part of that instruction. With the
prefix that adds a 3 byte displacement, the longest instruction can have 48 bits
(6 bytes: prefix, two bytes of instruction, 3 bytes of displacement).

C0, C5, CA, CF would be a 24 bit move from a register group to itself, so they
are instead defined to add A, B, C or D respectively as an offset to the
address. This can be either the source or the destination or even both,
depending on which are in memory and not a register.

9B and 9F would be push/pop an undefined alternate register so they are instead
defined to add 1 or 3, respectively, bytes that follow the instruction as an
offset to the address. Again, this might be the source, destination or both.

FC, FD, FE, FF would be implied operand instructions that haven’t been defined,
so instead they modify the address to increment before, increment after,
decrement before and decrement after, respectively. Since the inc and dec
instructions are a single byte long, these prefixes are mainly interesting when
both the source and the destination are addresses to be changed.
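
The ten prefixes can be summarized as a lookup table. The Python sketch below computes the total instruction length under the assumption that at most one prefix precedes an instruction and that displacement bytes come after the instruction bytes; the names `PREFIXES`, `base_length` and `instruction_length` are invented for this illustration:

```python
PREFIXES = {
    0xC0: ("index", "A"), 0xC5: ("index", "B"),
    0xCA: ("index", "C"), 0xCF: ("index", "D"),
    0x9B: ("displacement", 1), 0x9F: ("displacement", 3),
    0xFC: ("increment before", None), 0xFD: ("increment after", None),
    0xFE: ("decrement before", None), 0xFF: ("decrement after", None),
}

def base_length(first_byte):
    """Unprefixed length: 2 bytes for the immediate (00-7F), two address
    (A0-AF) and branch (B0-BF) formats, 1 byte for everything else."""
    return 2 if first_byte <= 0x7F or 0xA0 <= first_byte <= 0xBF else 1

def instruction_length(code):
    """Length in bytes of the instruction at the start of `code`."""
    kind, arg = PREFIXES.get(code[0], (None, None))
    if kind is None:
        return base_length(code[0])
    extra = arg if kind == "displacement" else 0
    return 1 + base_length(code[1]) + extra
```

A 3 byte displacement prefix in front of a two byte instruction gives the 6 byte maximum mentioned above.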

PACKAGE/PINOUT

Most 8 bit microprocessors used 40 pin dual inline packages (DIP), either
ceramic or plastic. With 24 address lines instead of the usual 16 it is a bit
harder to fit retro8 there, but a very frugal interface should be possible:

D0 to D7
A0 to A23
Read, Write and Wait
Power and Ground
Clock and Reset
Interrupt

Missing is any way to allow other devices to take over the address and data
busses, so no DMA in this system.

It would be amusing to use an even smaller package like the EPROMs did. 24 pins,
for example. An option optimized for DRAMs could be:

A16 to A23 / D0 to D7
A8 to A15 / A0 to A7
!RAS, !CAS and R!W
Power and Ground
Clock and Reset
Interrupt

The MOS technologies of the 1970s could not switch their pins fast enough to
make this work, but in the mid 1980s this would be a practical option. The
!RAS signal should use external latches to save the top 16 address bits so
EPROMs and other non DRAM devices could be used. This is missing the Wait or
equivalent signal, so the Clock would have to be stretched to achieve the
same result.
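
A rough Python sketch of how one access would be split over the 16 shared pins of the 24 pin option. It assumes the top 16 address bits are presented while !RAS is active (as the external latches suggest) and the low 8 address bits share the second phase with the data; the function name `bus_phases` and the dictionary keys are only for illustration:

```python
def bus_phases(address, data=None):
    """Split a 24 bit access into the values carried in the RAS and CAS phases.
    `data` is None for a read, where the memory drives D0-D7 instead."""
    if not 0 <= address < 1 << 24:
        raise ValueError("address must fit in 24 bits")
    ras = {"A16-A23": (address >> 16) & 0xFF, "A8-A15": (address >> 8) & 0xFF}
    cas = {"A0-A7": address & 0xFF, "D0-D7": data}
    return ras, cas
```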

LANGUAGE

By the mid 1980s people were used to full screen editors (GW Basic is from
1983, for example), but a minimal teletype style system where the only editing
is entering a new line with the same number has its merits.

Several different implementation styles of Basic were done: pure text
interpreters, tokenized interpreters, two level interpreters and compilers.

A new option would be an incremental compiler. In such a system the memory
would hold a machine language version of the user’s program plus some helper
tables mapping line numbers and variable names to memory positions.

When listing the program it would be decompiled as needed. When a new line
is entered, it is translated to machine language and replaces the old
code (if any) and all the rest of the program is patched to take into
account changed addresses as everything after the new code would have to
be slightly moved.
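
The bookkeeping can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the compiled bytes for a line are supplied by the caller, and the class `IncrementalProgram` with its methods is a hypothetical name. It rebuilds the line number table and reports which lines moved, which is the information a real system would need to patch jump targets:

```python
class IncrementalProgram:
    """Toy model: compiled code per line plus a line number to address table."""

    def __init__(self, base=0):
        self.base = base
        self.code = {}   # line number -> compiled bytes

    def addresses(self):
        """Recompute the helper table mapping line numbers to addresses."""
        table, addr = {}, self.base
        for line in sorted(self.code):
            table[line] = addr
            addr += len(self.code[line])
        return table

    def enter(self, line, compiled):
        """Insert or replace a line; return how far each existing line moved."""
        before = self.addresses()
        self.code[line] = bytes(compiled)
        after = self.addresses()
        return {n: after[n] - before[n] for n in before if after[n] != before[n]}
```

Replacing a one byte line 20 with a three byte version reports that every later line moved up by two bytes, so references to those lines have to be patched.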

Each line could be a Basic statement (or more than one if colons are used)
or an assembly language instruction (also allowing more than one), in which
case the “compilation” is trivial.

Several Basic dialects added named subroutines with actual arguments and that
made a huge difference compared to having only GOSUB. They often added more
control structures and made line numbers optional, but the standard IF and FOR
are enough for most cases.

An alphabetical list of assembly instructions

As 0d    add d,s   ;  d += s
As 2d    addc d,s  ;  d += s + carry
2d ii    addci d,#i;  d += i + carry
0d ii    addi d,#i ;  d += i
As 7d    and d,s   ;  d &= s
7d ii    andi d,#i ;  d &= i
As 9d    asr d,s   ;  d >>= s
B3 ii    bc        ;  if c = 1 then pc += i
B1 ii    beq       ;  if z = 1 then pc += i
BC ii    bhi       ;  higher unsigned if !(z=1 | c=0) then pc += i
B8 ii    bge       ;  if n^v = 0 then pc += i
BA ii    bgt       ;  if !(z=1 | n^v) then pc += i
BB ii    ble       ;  if z=1 | n^v then pc += i
BD ii    bls       ;  lower or same unsigned if z=1 | c=0 then pc += i
B9 ii    blt       ;  if n^v = 1 then pc += i
B5 ii    bn        ;  if n = 1 then pc += i
B2 ii    bnc       ;  if c = 0 then pc += i
B0 ii    bne       ;  if z = 0 then pc += i
B6 ii    bnv       ;  if v = 0 then pc += i
B4 ii    bp        ;  if n = 0 then pc += i
BF ii    br ii     ;  pc += i
FB       brk       ;  push PC, push Status, PC = *0 , Status = *3
B7 ii    bv        ;  if v = 1 then pc += i
BE ii    call ii   ;  push pc , pc += i
88-8B    callr A-D ;  push pc , pc = d
F0       cc        ;  c = 0
F4       cn        ;  n = 0
F2       cv        ;  v = 0
F6       cz        ;  z = 0
84-87    dec A-D   ;  d -= 1 (24 bits)
F8       di        ;  ei = 0
F9       ei        ;  ei = 1
80-83    inc A-D   ;  d += 1 (24 bits)
8C-8F    jumpr A-D ;  pc = d
E?       ld d,s    ;  d = *s (24 bits)
As 8d    lsl d,s   ;  d <<= s
As 4d    mov d,s   ;  d = s
4d ii    movi d,#i ;  d = i
C?       movl d,s  ;  d = s (24 bits)
As Dd    muh d,s   ;  d = ((unsigned) d * (unsigned) s) >> 8
As Cd    mul d,s   ;  d = (d * s) & 0x00FF
As Ed    mus d,s   ;  d = (d * s) >> 8
As 6d    or d,s    ;  d |= s
6d ii    ori d,#i  ;  d |= i
94-97    pop A-D
9C-9E    pop PC, SP, Status
90-93    push A-D
98-9A    push PC, SP, Status
As Bd    rot d,s   ;  d = (d << s ) | (d >> (8 - s))
As Fd    rotc d,s  ;  t = (d >> (8 - s)) & 0x01
                   ;  d = (carry << (s - 1)) | (d << s) | (d >> (9 - s))
                   ;  carry = t
FA       rti       ;  pop Status , pop PC
F1       sc        ;  c = 1
As Ad    shr d,s   ;  (unsigned) d >>= s
F5       sn        ;  n = 1
D?       st d,s    ;  *d = s (24 bits)
As 1d    sub d,s   ;  d -= s
As 3d    subb d,s  ;  d -= s - carry
1d ii    subi d,#i ;  d -= i
3d ii    subbi d,#i;  d -= i - carry
F3       sv        ;  v = 1
F7       sz        ;  z = 1
As 5d    xor d,s   ;  d ^= s
5d ii    xori d,#i ;  d ^= i
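
The three line formula for rotc is the least obvious entry in the list. The Python sketch below transcribes it and compares it with a bit by bit rotate of the 9 bit value formed by carry and d. It assumes a shift count s from 1 to 8, since the formula is not meaningful for s = 0; both function names are made up for the example:

```python
def rotc(d, carry, s):
    """rotc as given in the list: returns (new d, new carry) for 1 <= s <= 8."""
    t = (d >> (8 - s)) & 0x01
    new_d = ((carry << (s - 1)) | (d << s) | (d >> (9 - s))) & 0xFF
    return new_d, t

def rotc_reference(d, carry, s):
    """Rotate the 9 bit value carry:d left one position at a time."""
    v = (carry << 8) | d
    for _ in range(s):
        v = ((v << 1) | (v >> 8)) & 0x1FF
    return v & 0xFF, v >> 8
```

Both functions agree for every 8 bit value, either carry and every shift count in that range.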

Why 24 bits? 20 bits was all you needed in the era of 64K DRAMs.
The undeveloped 65e4 claimed to be the ultimate micro, but I am having
problems reading the pdf. Ben.
PS: R65F11 (6502 ish) had a FORTH kernel built into ROM. Is that a Forth
cpu? or 6502?

Indeed, even with 256Kbit DRAM chips you need more than 32 of them before 20 bits becomes a limit. But 24 bits fit nicely in 3 bytes, which works well both with registers and with pointers in memory, so why not? An application or hardware can always ignore address bits it doesn’t need.

I haven’t seen it, so can’t compare it. Let me check… ah, the datasheet is from 1982. At first glance it seems like MOS’s answer to Intel’s iAPX432 and given the time frame it is not hard to imagine why they abandoned it. At least it could also execute 6502 code. There have been more modern attempts to extend the 6502 to 32 bits but I think the Acorn people were right: you might as well just do a RISC.

There are also processors with TinyBASIC in ROM. A conventional processor with an internal Forth interpreter certainly looks like a Forth processor to the user.

A cpu with 8 bits of data and bit 9 as a flag bit is still an 8 bit cpu data wise.
Set, clear and test flag would only affect bit #9. Then you could be like the
early computers that could process data of variable length.
RISC is not the answer, because when IBM forced 8 bit bytes on the
world, characters were accessed string style: set pointers up and
process a string. Data types were half and full words, with an odd floating
point number here or there. The IBM 360 is CISC because of all the
weird instructions needed to fit in a 16 bit opcode. The same goes for the PDP 11. You need a few more bits for your opcodes to have a simple machine, if you have lots of memory access like arrays or records.

A 24 bit machine is the cleanest design I have so far, and that has just
characters (12 bits) and ints (24). Real numbers are 48 bits, as a
software trap.
Ben.

Very nice machine @jecel! (@Keith_Clark is a member here but not a frequent visitor.)

I think I will reference this machine over on anycpu. Done!

A post was split to a new topic: A 32-bit successor to 6502: the MCS65E4 (1982 design doc)

The idea behind retro8 is to have some shock value, to make people rethink some of their ideas about 8 bit microprocessors.

For an actual project I would prefer a RISC like my own baby42 (4 bytes data, 2 byte instruction).

One example I used of baby42 assembly is the pointer chasing code fragment Jan Gray used to illustrate his own XR16 processor:

typedef struct TN {
  char k;
  struct TN *left, *right;
} *T;

T search(int key, T p) {
  while (p && p->k != key)
    if (p->k < key)
      p = p->right;
    else
      p = p->left;
  return p;
}

What would this look like in retro8+ assembly?

            ; offsets: k = 0, left = 1, right = 4
            ; p in A, key in BL
        search:
80          inc A
84          dec A   ; set z flag
        for:
B1 12       beq done ; null pointer
A0 49       mov BM,*A ; p->k
A5 19       sub BM,BL  ; compare key and p->k
B1 0C       beq done ; found key
B8 05       bge else
9B E0 04    ld A,*(A+4)  ; p = p->right
BF F1       br for
        else:
9B E0 01    ld A,*(A+1)  ; p = p->left
BF EC       br for
        done:
9C          pop PC ; return

In a regular retro8 the “ld A, *(A+4)” would be more awkward:

04 04    addi AL,#4
28 00    addci AM,#0
2C 00    addci AH,#0
E0       ld A, *A
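
The addi/addci sequence above is an ordinary ripple of the carry through the three bytes of the triple. As a sanity check, here is a Python sketch of the same arithmetic; the function name `add_immediate_24` is hypothetical:

```python
def add_immediate_24(triple, value):
    """Add an 8 bit immediate to a (high, middle, low) register triple
    using one addi step and two addci steps, each 8 bits wide."""
    high, middle, low = triple
    total = low + value                  # addi AL,#value
    low, carry = total & 0xFF, total >> 8
    total = middle + 0 + carry           # addci AM,#0
    middle, carry = total & 0xFF, total >> 8
    high = (high + 0 + carry) & 0xFF     # addci AH,#0
    return high, middle, low
```

Starting from A = 0x12FFFE, adding 4 carries through both upper bytes and gives 0x130002.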

I’m of the opinion that the plain old CMOS Z80 can be reliably overclocked to 25MHz and, with a fresh design approach, made small and cheap, so that the resulting product can do amazing things that will change people’s perception of retro 8-bit processors. Zilog also has several products with integrated I/O built around the Z80 core, such as the Z84C15, that may only be spec’d to 16MHz but in fact can run at 25-30MHz.

A plain old Z80 today can be an order of magnitude smaller, faster and cheaper than it was 40 years ago.
Bill

Rabbit Semiconductor was based on the idea that there was a market for an improved Z80.

You can get Z80 compatible open source cores that can be used in an FPGA or implemented as a dedicated integrated circuit:

core name  language  Z80 cycles  FPGA                            ASIC
A-Z80      Verilog   yes         2000 LUTs, 18MHz in Cyclone II  -
T80        VHDL      yes         35MHz in Spartan 2              100Kgates, 100MHz in 180nm
TV80       Verilog   yes         -                               20Kgates, 250MHz in 130nm, 125MHz in 65nm
Y80e       Verilog   -           2557 cells in Cyclone III       -
NextZ80    Verilog   faster      40MHz in Spartan 3              -

I remember reading about a Z80 that ran at 100MHz in an FPGA and was able to execute most instructions in just one clock cycle. The NextZ80 seems to be the closest to that, though I am pretty sure that wasn’t the name of the core I saw.

In any case, there are currently many resources for anyone who is interested in a modern Z80.

In 1983 I manually translated my Super Logo interpreter to Z80 assembly so we could test it on my friend’s TRS-80 Model I. I then translated it again but to 6809 assembly since that was the processor in our prototype children’s computer. For this pointer chasing intensive application the difference was amazing. On average I was getting one 6809 instruction for every three Z80 instructions. Given the two index registers in the Z80 I was not expecting anything like that. This is the reason I said at the beginning of this thread that the 6809 was the best 8 bit micro.

Hi Jecel,

I also recently designed an 8-bit processor, but my goals were different. Primarily I wanted to minimize and simplify the Verilog code. This processor (which I’ve been calling mc8) takes only about 1000 lines of Verilog code (including comments). I currently have this processor running on an FPGA dev board. The purpose of this processor is just moving bits and writing to control registers, so it is not a fully general purpose CPU. It has a 6502-ish feel to it but it has full 16-bit index registers (X and Y) and a 16-bit stack pointer, and I put the stack in its own address space. Architecturally the stack has a 64k address space but in the FPGA implementation I gave it 2k bytes of RAM. I/O also has its own address space via IN and OUT instructions like the Z80, but the I/O address space is a full 64k locations and not limited to 256 device registers. I also included a djnz instruction like the Z80, but with a full 16-bit loop counter register (W). I created an assembler for it (based on an open-source 68xx assembler). The architecture still may change. For example, I would like to add the capability to address more than 64k, perhaps by adding a “long jump” instruction that would update a segment register, with all other instructions operating relative to the current 64k sector. I don’t know how well this would work. If you have any suggestions, let me know.

Here is the current instruction set:

                        ;==========================================================
                        ;
                        ;  Assembler test for all mc8 instructions
                        ;
                        ;  Instruction format
                        ;
                        ;      7       ..      5 | 4     ..      0
                        ;     -------------------------------------
                        ;     | register select  | opcode         |
                        ;     -------------------------------------
                        ;
                        ;==========================================================

0100                               org    $100

0100 00                 start:     czf               ; opcode = 0x00
0101 01                            szf               ; opcode = 0x01
0102 02                            ret               ; opcode = 0x02
0103 03                            halt              ; opcode = 0x03
0104 04                            inx               ; opcode = 0x04
0105 05                            dex               ; opcode = 0x05
0106 06                            iny               ; opcode = 0x06
0107 07                            dey               ; opcode = 0x07
0108 08                            sla               ; opcode = 0x08
0109 09                            sra               ; opcode = 0x09

010a 8a                            ldxw              ; opcode = 0x0a | regsel = 0x80
010b aa                            ldyw              ; opcode = 0x0a | regsel = 0xa0
010c ca                            ldax              ; opcode = 0x0a | regsel = 0xc0
010d ea                            lday              ; opcode = 0x0a | regsel = 0xe0

010e 8b                            stxw              ; opcode = 0x0b | regsel = 0x80
010f ab                            styw              ; opcode = 0x0b | regsel = 0xa0
0110 cb                            stax              ; opcode = 0x0b | regsel = 0xc0
0111 eb                            stay              ; opcode = 0x0b | regsel = 0xe0

0112 0c                            push  a           ; opcode = 0x0c | regsel = 0x00
0113 ac                            push  w           ; opcode = 0x0c | regsel = 0xa0
0114 cc                            push  x           ; opcode = 0x0c | regsel = 0xc0
0115 ec                            push  y           ; opcode = 0x0c | regsel = 0xe0

0116 0d                            pop   a           ; opcode = 0x0d | regsel = 0x00
0117 ad                            pop   w           ; opcode = 0x0d | regsel = 0xa0
0118 cd                            pop   x           ; opcode = 0x0d | regsel = 0xc0
0119 ed                            pop   y           ; opcode = 0x0d | regsel = 0xe0

011a 0e 01 23                      in    port1       ; opcode = 0x0e
011d 0f 03 45                      out   port2       ; opcode = 0x0f

0120 10 01 68                      lda   dat1        ; opcode = 0x10
0123 11 01 69                      sta   dat2        ; opcode = 0x11
0126 12 01 69                      add   dat2        ; opcode = 0x12
0129 13 01 68                      sub   dat1        ; opcode = 0x13
012c 14 01 6a                      cmp   dat3        ; opcode = 0x14

012f 15 10                         ldi   a, $10      ; opcode = 0x15 | regsel = 0x00
0131 b5 01 23                      ldi   w, $0123    ; opcode = 0x15 | regsel = 0xa0
0134 d5 12 34                      ldi   x, $1234    ; opcode = 0x15 | regsel = 0xc0
0137 f5 56 78                      ldi   y, $5678    ; opcode = 0x15 | regsel = 0xe0

013a 16 12                         adi   $12         ; opcode = 0x16
013c 17 34                         sbi   $34         ; opcode = 0x17
013e 18 56                         cpi   $56         ; opcode = 0x18
0140 19 12                         ani   $12         ; opcode = 0x19
0142 1a 34                         ori   $34         ; opcode = 0x1a
0144 1b 56                         xri   $56         ; opcode = 0x1b

0146 1c 01 67                      jsr   sub1        ; opcode = 0x1c
0149 1d 01 64                      jmp   loop        ; opcode = 0x1d
014c 3d 01 64                      jeq   loop        ; opcode = 0x1d | ccsel = 0x20
014f 5d 01 64                      jne   loop        ; opcode = 0x1d | ccsel = 0x40
0152 7d 01 64                      jgt   loop        ; opcode = 0x1d | ccsel = 0x60
0155 9d 01 64                      jlt   loop        ; opcode = 0x1d | ccsel = 0x80
0158 bd 01 64                      jge   loop        ; opcode = 0x1d | ccsel = 0xa0
015b dd 01 64                      jle   loop        ; opcode = 0x1d | ccsel = 0xc0
015e fd 01 64                      jnv   loop        ; opcode = 0x1d | ccsel = 0xe0
0161 1e 01 6b                      jpi   dat4        ; opcode = 0x1e

0164 1f 01 64           loop:      djnz  loop        ; opcode = 0x1f

0167 02                 sub1:      ret

0123                    port1:     equ  $0123
0345                    port2:     equ  $0345

0168 aa                 dat1:      db   $aa
0169 bb                 dat2:      db   $bb
016a 12                 dat3:      db   $12
016b 12 34              dat4:      dw   $1234

                                   end

It seems like a clean and regular design, congratulations!

You don’t explain register W, so I would guess it is the stack pointer. Register S only appears once - status, perhaps?

As you program more you might find things you would like to add. Having only the option of constant i/o addresses makes generic drivers harder to write, for example. The 8080 had that limitation so they added instructions to the Z80 which used the C register to indicate the port.

Unless you have a lot of comments, 1000 lines of Verilog seems a bit large for this design. Several simpler RISC-V implementations are smaller (half the size in two cases) than that.

I don’t understand the part about it not being a general purpose CPU. That made me think about something like the HP Nanoprocessor, but this seems reasonably complete.

Hi Jecel,

=> You don’t explain register W, so I would guess it is the stack pointer.
The W register is the loop counter register used exclusively by the djnz instruction.

=> Register S only appears once
Oh… that’s a typo… a remainder from an older version… S is the stack pointer, but when I moved the stack to its own address space I removed any instructions that explicitly modify S.

=> added instructions to the Z80 which used the C register to indicate the port
Nice, I like this change.

=> 1000 lines of Verilog seems a bit large for this design.
Maybe my coding style is too verbose ??
Well at least I think the code is simple for a CISC :wink:

=> I don’t understand the part about it not being a general purpose CPU.
It’s limited to 8-bit arithmetic. I didn’t include any opcode that would make it possible to do multi-precision arithmetic… and good luck trying to implement a multiply or divide. :wink:

Do you have any ideas of a clean-ish way to expand the addressing range to > 64K?

Regards,
Scott

Any reason why X or Y couldn’t be used as a counter? I can imagine you might want both as pointers inside a loop (a source and destination when copying bytes, for example).

How does having a separate address space free you from manipulating the stack pointer? In retro8 it is awkward, but still possible: the reset vector has the initial values for PC, SP and status and you can push/pop SP.

I did notice that you had a zero flag but no carry. In a Turing complete processor (an extremely low bar) multi-precision arithmetic must be possible, even if very awkward:

addc:     lda op1
          ani $01
          sta carry   ; used as scratch for now
          lda op2
          ani $01
          add carry
          sra
          sta carry   ; now holds carry from bit 0 to bit 1
          lda op1
          sra
          ani $7F     ; we wanted shift right logical
          sta sum     ; also used as scratch for now
          lda op2
          sra
          ani $7F
          add sum     ; (op1 + op2) / 2
          add carry   ; take ninth bit into account
          jnv setcarry;  nv = negative value? so bit 7 = 1
          ldi a,0
          jmp savecarry
setcarry: ldi a,1
savecarry:sta carry   ; final value of carry
          lda op1
          add op2
          sta sum     ; final value of the addition
          ret

op1:      db $F3
op2:      db $46
sum:      db $00
carry:    db $00
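
The trick in the routine above is that (op1 + op2) / 2 always fits in 8 bits, so its top bit is the carry out of the full addition. A short Python sketch of the same steps, with the hypothetical name `add_with_carry_out`, makes it easy to check against ordinary arithmetic:

```python
def add_with_carry_out(op1, op2):
    """8 bit add that derives the carry without a carry flag."""
    low_carry = ((op1 & 1) + (op2 & 1)) >> 1          # carry from bit 0 to bit 1
    half = ((op1 >> 1) & 0x7F) + ((op2 >> 1) & 0x7F)  # logical shifts of both operands
    half += low_carry                                 # now (op1 + op2) / 2
    carry = 1 if half & 0x80 else 0                   # bit 7 is the ninth bit of the sum
    return (op1 + op2) & 0xFF, carry
```

With the values from the listing, $F3 + $46 gives a sum of $39 and a carry of 1.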

That was the whole point of my design. There is no relation at all between the size of your A register and ALU (8 bits) and the size of your pointer registers PC, S, X, Y and W. You made these 16 bits, but what would change if they were 24 bits instead? Or 20 bits, other than 4 bits wasted in every three bytes?

I bet with a little effort you could make this a parameter in your Verilog (though it would change the number of memory accesses of many instructions).

=> Any reason why X or Y couldn’t be used as a counter?
X and Y are absolute addresses and would not necessarily count down to zero.
And I thought it would be cleaner to just have a dedicated djnz loop counter. It’s CISC after all.
None of this general purpose any register can do anything RISC philosophy :wink:

=> How does having a separate address space free you from manipulating the stack pointer?
I didn’t say that. I said there are no opcodes that explicitly modify the stack pointer.
It is only implicitly manipulated by push/pop opcodes. There are no explicit (load or store) stack pointer opcodes in the current architecture.

=> multi-precision arithmetic must be possible, even if very awkward: adc: …
Yikes!! That is an heroic adc coding !! :slight_smile: It makes me think I should stop with the 8-bit-operand-only purist idea and just add ADC and SBC opcodes.

=> but what would change if they were 24 bits instead?
I think you’re right that the most straightforward solution would be to increase the width of all 16-bit registers to 24-bits (and change all the direct-addressing mode opcodes to use 3-byte addresses). Another idea would be to add a 24-bit segment register where all the 16-bit memory addresses of the current architecture are added to the segment register to create a 24-bit memory address.
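
The segment register idea amounts to a single addition. A minimal Python sketch, assuming the result wraps around at 16MB (the post does not say what happens on overflow) and using the made-up name `segmented_address`:

```python
def segmented_address(segment, offset):
    """Add a 16 bit address to a 24 bit segment register."""
    if not 0 <= segment < 1 << 24:
        raise ValueError("segment must fit in 24 bits")
    if not 0 <= offset < 1 << 16:
        raise ValueError("offset must fit in 16 bits")
    return (segment + offset) & 0xFFFFFF   # assumed wraparound at 16MB
```

Because the segment is a full 24 bit value rather than a shifted one, any byte in memory can be the start of a 64k window.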

It depends on how much you use it. If it is rare in your code the subroutine I showed is not too painful.

Segments are certainly a solution, but you asked for a simple one. There is a reason why most people preferred to program the 68000 rather than the 8086 if they had a choice.

That said, several of my embedded processor projects were 16 bit Forth chips with segments in the form of object tables. Since you switched segments automatically on calls and returns they were mostly invisible.