One Page Computing - Suite16

monsonite · September 29, 2020, 11:59am

A couple of years ago I was inspired by Ed and Hoglet’s One Page Computing (OPC) project - and so in the course of the last year I have been slowly putting together my own version which hopefully will meet the criteria of the OPC project.

Suite16 is an experimental 16-bit cpu, inspired by Steve Wozniak’s Sweet-16 virtual machine written for the 6502 and Apple II.

It uses an accumulator R0 and fifteen general purpose registers R1 to R15. The registers R12 to R15 have secondary functions such as PC, SP, IP and W (for a future Forth or other VM implementation).

This architecture was dictated by the availability of the 74F219 16 x 4-bit TTL RAM - as ultimately I wish to implement the architecture in real TTL devices.

The instruction set is split roughly into 2 groups - those that involve the accumulator and another register - which includes the arithmetic and logical instructions.

The other group are branches and calls, which affect the PC plus some miscellaneous operations that act only on the accumulator.

The instruction is 16-bits wide, with the 4-bit operation and 4-bit register operand contained in the upper byte.

The lower byte, which I call the “payload”, can hold an 8-bit literal or an 8-bit address offset, or be decoded directly to perform operations on the accumulator, such as shifts, complement, clear, set carry, clear carry, and setting it to frequently used constants.
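As a rough illustration of that layout, here is a minimal sketch in plain C++ that splits a 16-bit word into its three fields and packs them back again. The names `Fields`, `decode` and `encode` are invented for this example and are not part of the simulator.

```cpp
#include <cassert>   // only needed if you add assert-based checks
#include <cstdint>

// Field layout as described above:
//   bits 15:12 = operation, bits 11:8 = register, bits 7:0 = payload.
// All names here are hypothetical helpers, not part of the simulator.
struct Fields {
    uint8_t op;       // 4-bit operation
    uint8_t n;        // 4-bit register operand (or sub-operation when op == 0)
    uint8_t payload;  // 8-bit literal, address offset, or further-decoded bits
};

// Split a 16-bit instruction word into its three fields.
inline Fields decode(uint16_t ir) {
    Fields f;
    f.op      = static_cast<uint8_t>(ir >> 12);
    f.n       = static_cast<uint8_t>((ir >> 8) & 0x0F);
    f.payload = static_cast<uint8_t>(ir & 0xFF);
    return f;
}

// Pack the three fields back into a word, as one would when hand-assembling.
inline uint16_t encode(uint8_t op, uint8_t n, uint8_t payload) {
    return static_cast<uint16_t>(((op & 0x0F) << 12) | ((n & 0x0F) << 8) | payload);
}
```

For instance, `encode(0x0, 0xA, 0x30)` packs to `0x0A30`, which is the word for ADI 0x30 (op 0, sub-operation A, payload 0x30).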

Currently instruction 0Fxx has been assigned to NOP, but further decoding of xx provides a mechanism for adding additional instructions to this group. This was inspired by the “microcoded” instructions of the PDP-8.
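To make the PDP-8 style idea concrete, here is a hedged sketch of how the payload of 0Fxx might be decoded bit by bit. The bit assignments, the order in which the bits are applied, and the names `AccState` and `microOp` are all assumptions made for illustration; nothing here has been fixed in the actual instruction set.

```cpp
#include <cassert>   // only needed if you add assert-based checks
#include <cstdint>

// Accumulator plus a carry flag (the carry flag is assumed for this sketch).
struct AccState {
    uint16_t ac;
    bool     carry;
};

// Hypothetical payload bit assignments, one bit per micro-operation.
enum : uint8_t {
    U_CLA = 0x01,  // clear accumulator
    U_CMA = 0x02,  // complement accumulator
    U_CLC = 0x04,  // clear carry
    U_STC = 0x08,  // set carry
    U_SHL = 0x10,  // shift left one bit, bit 15 goes to carry
    U_SHR = 0x20   // shift right one bit, bit 0 goes to carry
};

// Apply every micro-operation whose bit is set, in a fixed order
// (clears first, then complement/set, then shifts). Payload 0x00 is a true NOP.
inline AccState microOp(AccState s, uint8_t payload) {
    if (payload & U_CLA) s.ac = 0;
    if (payload & U_CLC) s.carry = false;
    if (payload & U_CMA) s.ac = static_cast<uint16_t>(~s.ac);
    if (payload & U_STC) s.carry = true;
    if (payload & U_SHL) {
        s.carry = (s.ac & 0x8000) != 0;
        s.ac = static_cast<uint16_t>(s.ac << 1);
    }
    if (payload & U_SHR) {
        s.carry = (s.ac & 0x0001) != 0;
        s.ac = static_cast<uint16_t>(s.ac >> 1);
    }
    return s;
}
```

Because the bits are applied in a fixed order, combinations come for free: with these assumed assignments, `U_CLA | U_CMA` clears and then complements, which sets the accumulator to 0xFFFF in one instruction.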

The instruction set is still being finalised, in particular those that utilise the lower payload byte. Having written some assembly routines for numerical input and output and base conversions between decimal and hex, I notice that the lower byte is very under utilised.

This suggests that a better code density or instruction flexibility might be achievable by making more creative uses of the payload byte.

However, I don’t want to make the ISA so complex that I will never succeed in implementing it in TTL.

I have a cpu simulation in 60 lines of C++ that runs on any Arduino compatible board. My current favourite is the 600MHz Teensy 4.x, which executes my instruction set at about 20 million simulated instructions per second.

This is the sort of performance that could be achieved in a cpu implemented in 74F series TTL.

A further aim of this project is to port the architecture to an FPGA using Verilog - again a learning exercise for me.

Fortunately there is a close correlation between a cpu simulator written in C as a large switch-case statement, and the description of the cpu in Verilog.

For the moment I am happy with the simulation running on the Teensy 4.x, and the next plan is to use James Bowman’s “Dazzler” module to generate 1280 x 720p, 24-bit HDMI video output.

For an outlay of about $75 the combination of the Teensy 4.1 and Dazzler should provide a very capable platform for further retrocomputer explorations.

Below I reproduce the instruction set as it stands at the moment.

/* Suite-16 Instructions

Register OPS-
     0n        ---       --     Non-Register Ops
     1n        SET       Rn     Constant  (Set)        Rn = @(PC+1)
     2n        LD        Rn     (Load)                 AC = Rn
     3n        ST        Rn     (Store)                Rn = AC
     4n        LD        @Rn    (Load Indirect)        AC = @Rn
     5n        ST        @Rn    (Store Indirect)       @Rn = AC
     6n        POP       @Rn    Pop  AC                AC = @Rn  Rn = Rn + 1
     7n        PUSH      @Rn    Push AC                Rn = Rn - 1  @Rn = AC
     8n        AND       Rn     (AND)                  AC = AC & Rn 
     9n        OR        Rn     (OR)                   AC = AC | Rn 
     An        ADD       Rn     (Add)                  AC = AC + Rn
     Bn        SUB       Rn     (Sub)                  AC = AC - Rn
     Cn        INV       Rn     (Invert)               Rn = ~Rn
     Dn        DCR       Rn     (Decrement)            Rn = Rn - 1
     En        INR       Rn     (Increment)            Rn = Rn + 1
     Fn        XOR       Rn     (XOR)                  AC = AC ^ Rn
    
Non-register OPS-
     00        BRA       Always                        Target = IR7:0
     01        BGT       AC>0                          Target = IR7:0
     02        BLT       AC<0                          Target = IR7:0
     03        BGE       AC>=0                         Target = IR7:0
     04        BLE       AC<=0                         Target = IR7:0 
     05        BNE       AC!=0                         Target = IR7:0
     06        BEQ       AC=0                          Target = IR7:0     
     07        JMP       16-bit                        Target = @(PC+1)
     08        CALL      8-bit (zero page)             Target = IR7:0
     09        RET       Return
     0A        ADI       Add R0 8-bit Immediate        Immediate = IR7:0
     0B        SBI       Subtract R0 8-bit Immediate   Immediate = IR7:0
     0C        OUT                                     putchar(AC), port = IR7:0
     0D        IN                                      AC = getchar(), port = IR7:0
     0E        JP@                                     BRA (R0)
     0F        NOP                                     AC &= AC; IR7:0 can be decoded to give
                                                       "microcoded" instructions

And here is a listing of the simulator code - condensed to about 64 lines of code:

    unsigned long count = 0 ;
    int A = 0 ;    int  i=0 ; 
    unsigned int IR = 0  ; unsigned int PC = 0  ;
    int R[16]= {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0x0200} ;  // Zero the Registers, Set the top of the return stack at 0x0200
     
    void setup() {  Serial.begin(115200);   }
    void loop() {  fetch() ; execute() ;  }     
    void fetch() {  IR = M[PC] ;    PC ++ ;  PC &= (MEMSIZE-1) ; }    
    void execute()   {
      
    /* Instruction Decoder */
      
      unsigned int op = IR >> 12 ;            // op = IR 15:12      Opcode             
      unsigned int n  = (IR & 0xf00) >> 8 ;   // n = IR 11:8        Register or Condition
      int addr = IR & 0x0ff ;                 // addr = IR 7:0      Address or Displacement
      int PCM  = PC & 0xff00 ;                // Modified version of PC containing page byte only, lower byte cleared
      
      /* Opcode Execute */
       
        switch (op) {
          
        case 0x0:   break ; 
        case 0x1:   R[n]= M[PC] ; PC++  ; break ;  /* SET */      
        case 0x2:   R[0] = R[n]         ; break ;  /* LD */
        case 0x3:   R[n] = R[0]         ; break ;  /* ST */
        case 0x4:   R[0] = M[R[n]]      ; break ;  /* LD@ */
        case 0x5:   M[R[n]] = R[0]      ; break ;  /* ST@ */
        case 0x6:   R[0] = M[R[n]]      ; R[n]= R[n]+1   ;  break ; /* POP with post-increment of pointer Rn  */
        case 0x7:   R[n]= R[n]-1        ; M[R[n]] = R[0] ;  break ; /* PSH with pre-decrement of pointer Rn */  
        case 0x8:   R[0] &= R[n]        ; break ;  /* AND */
        case 0x9:   R[0] |= R[n]        ; break ;  /* OR  */
        case 0xA:   R[0] += R[n]        ; break ;  /* ADD */     
        case 0xB:   R[0] -= R[n]        ; break ;  /* SUB */
        case 0xC:   R[n] = ~R[n]        ; break ;  /* INV */
        case 0xD:   R[n] =  R[n]-1      ; break ;  /* DEC */
        case 0xE:   R[n] =  R[n]+1      ; break ;  /* INC */
        case 0xF:   R[0] ^= R[n]        ; break ;  /* XOR */        
        }

        /* Conditional Branches and I/O Group */
        
        A = R[0] ;
        if (op == 0) {      // opcode 0: decode n as a branch, call, immediate or I/O operation

        switch (n) {

           case 0x0:  PC = PCM + addr ;              break ;   // BRA Branch Always
           case 0x1:  if(A>0)  { PC = PCM + addr ; } break ;   // BGT Branch if Greater
           case 0x2:  if(A<0)  { PC = PCM + addr ; } break ;   // BLT Branch if Less Than
           case 0x3:  if(A>=0) { PC = PCM + addr ; } break ;   // BGE Branch if Greater or Equal
           case 0x4:  if(A<=0) { PC = PCM + addr ; } break ;   // BLE Branch if Less Than or Equal
           case 0x5:  if(A!=0) { PC = PCM + addr ; } break ;   // BNE Branch if Not Equal to zero
           case 0x6:  if(A==0) { PC = PCM + addr ; } break ;   // BEQ Branch if Equal to zero
           case 0x7:  addr = M[PC] ;  PC = addr ; break ;  // 16-bit JMP - fetch() has already advanced PC to the target word
           case 0x8:  R[15]= R[15]-1 ; M[R[15]] = PC ; PC = addr ; break ;   // CALL (zero page) use R15 as Return Stack Pointer 
           case 0x9:  PC = M[R[15]]      ; R[15]= R[15]+1   ; break ;        // RET
           case 0xA:  R[0]= R[0] + addr ; break ;        // ADI add the immediate 8-bit contained in the address field
           case 0xB:  R[0]= R[0] - addr ; break ;        // SBI subtract the immediate 8-bit contained in the address field
           case 0xC:  Serial.write(R[0]) ;  break ;                            // OUT  - output a character to the Serial port
           case 0xD:  i =0 ; while (i < (63)) {          // GETCHAR - get '0' terminated character string into buffer
                      while (!Serial.available());
                      char ch = Serial.read();
                      if (ch == '\r' || ch == '\n') break;
                      if (ch >= ' ' && ch <= '~') {  M[512 + i] = ch;  i++;  }  
                      }  
                      M[512+i] = '\n' ; M[512+1+i] =   0 ; // Terminate the buffer with zero
                      break ;
           case 0xE:  PC = R[0] ; R[0]= M[PC] ; break ;  // JMP @R0   - useful for indexing and table look-up 
                                                         // ( curious but useful pipeline effect here)
           case 0xF:  R[0] &= R[0] ; break ;             // NOP   AND accumulator with itself
    }    
        } 
           }

EdS · September 29, 2020, 3:52pm

Very nice Ken! We should note that @Revaldinho was also at the curry night in question (and One Page Computing might even have been his idea, come to think of it.)

A laudable restraint! I think there’s always a temptation to fill out the opcode map, as if that were primary. But what’s really important is making the thing work, and for that, the less machinery the better.

It would be very nice to see the same thing done as an embedded emulator, an FPGA design, and a board of TTL.

oldben · September 29, 2020, 4:52pm

I consider a 2901 bitslice TTL, as it is an ALU unit.
Prom decoding may need to be a bit wider over a 74181
alu setup. For things like R++ the 2901 is very nice.
Ben.

Michael_Barry · September 30, 2020, 4:39am

Re: The underutilized payload byte.

Yeah, that sorta goes with the territory when you have a fixed length instruction. My “under-implemented” 65m32 has a 14-bit inherent payload and its bigger brother 65m36 has an 18-bit inherent. Both have a 4-bit conditional predicate that could also be seen as underutilized … until those elements come into play at just the right moment. It’s a balancing act of machine code bit density vs. instruction versatility.

For me personally, it’s all about a simple assembly language and a straightforward instruction decoder … memory and secondary storage bits are far less expensive than they used to be, and I’m not particularly off-put when I see lots and lots of 0s and Fs populating my hand-assembled .lst files.

monsonite · September 30, 2020, 10:04am

Yes, that was what first appealed when I saw Woz’s Sweet16 instruction set, there were only 31 instructions to learn, and it seemed to have a regular structure to it.

I have not copied it verbatim, because I had the need for more logical operations, increment, decrement and some means to implement input and output.

I still have some reservations about the accumulator becoming a bottleneck, and the destination of all ALU operations, but I felt that this simplification was necessary to keep the TTL hardware as simple as possible.

I have given some thought to register to register moves eg. MOV R3, R4 and this could be done with a little further hardware complexity having an additional latch to store the output of the source register before writing it back to the destination register.

As it stands, register to register moves are done through the accumulator in two instructions eg.

LD R3
ST R4
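A minimal sketch of what that two-instruction move does, assuming the encodings from the instruction table (LD R3 = 0x2300, ST R4 = 0x3400). The `Regs` struct and the `step` and `moveR3toR4` functions are made up for this example and only handle LD and ST.

```cpp
#include <cassert>   // only needed if you add assert-based checks
#include <cstdint>

// Register file only; r[0] is the accumulator. Hypothetical helper type.
struct Regs {
    uint16_t r[16] = {};
};

// Execute a single LD (opcode 2) or ST (opcode 3) instruction word.
// Every other opcode is ignored in this cut-down sketch.
inline void step(Regs& cpu, uint16_t ir) {
    unsigned op = ir >> 12;
    unsigned n  = (ir >> 8) & 0x0F;
    if (op == 0x2)      cpu.r[0] = cpu.r[n];   // LD Rn : AC = Rn
    else if (op == 0x3) cpu.r[n] = cpu.r[0];   // ST Rn : Rn = AC
}

// Copy R3 into R4 by going through the accumulator.
inline void moveR3toR4(Regs& cpu) {
    step(cpu, 0x2300);   // LD R3
    step(cpu, 0x3400);   // ST R4
}
```

The sketch also shows the cost of the approach: whatever was in the accumulator before the move is overwritten, so it has to be saved first if it is still needed.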

Additionally, the instruction set loses its uniformity with INC, DEC and INV, as the result of these operations is written back to the source register in a read-modify-write sequence. This is trivial to include in a simulation program, but takes a bit more head scratching when it comes to the design of the data paths and control unit in TTL.

Keeping the opcodes simple means that hand assembly is fairly straightforward, and I wrote all of the character input, output and hex and decimal conversion and printing routines without the use of an assembler. Often just working out the mechanics of the routine would take significantly longer than the actual assembly.

Writing these elementary routines was a good way to gain experience in the “alien” instruction set and it also helped point out some deficiencies in the original draft - so the add/subtract 8-bit immediate ADI SBI, and JP@ a table look-up instruction were added. These extra instructions vastly simplified the mechanics of the number conversion routines.

As an example of hand assembled code I include the PRINTHEX routine which prints the contents of the accumulator to the terminal in the form of a 4-digit hex number.

As Suite-16 has no shift left or shift right instructions, the usual method of isolating the nybbles and assigning each an ASCII character is somewhat awkward. Instead, I do successive reduction of the number by powers of 16, starting with 4096, then 256, then 16, then 1. It’s slightly more involved, but it is based on the code for my decimal printing routine.
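The same idea can be sketched in ordinary C++ for readers who want to see it without the hand-assembled opcodes. This is an illustration of the technique rather than a translation of the routine: the function name `hexBySubtraction` is invented, and it compares before subtracting, whereas the assembly subtracts until the result goes negative and then adds the power back.

```cpp
#include <cassert>   // only needed if you add assert-based checks
#include <cstdint>
#include <string>

// Convert a 16-bit value to four hex digits using only compare, subtract
// and add, i.e. without any shift or mask operations.
inline std::string hexBySubtraction(uint16_t value) {
    static const uint16_t powers[4] = {4096, 256, 16, 1};
    std::string out;
    uint32_t rest = value;
    for (uint16_t p : powers) {
        int digit = 0;
        // Count how many times this power of 16 fits into what is left.
        while (rest >= p) {
            rest -= p;
            ++digit;
        }
        // 0..9 map to '0'..'9', 10..15 map to 'A'..'F'.
        out += static_cast<char>(digit < 10 ? '0' + digit : 'A' + (digit - 10));
    }
    return out;
}
```

Each digit costs at most 15 subtractions, so a full 4-digit conversion needs no more than 60 passes through the inner loop.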

It helps illustrate the relative code size compared to other cpus that share a 16-bit wide instruction. With PRINTHEX working and the equivalent GETHEX hexadecimal entry routine working it is a simple task to get a monitor program operational with a hex-dump format output.

I should point out that the hand assembled code was effectively placed into the RAM of the Arduino/Teensy by initialising an integer array M as below.

Keeping track of addresses was done manually by writing a comment every 16-instructions to remind oneself regularly of the address!

#define MEMSIZE  2048
unsigned int M[MEMSIZE] = {

     // Hexadecimal number entry, converted to integer and printed out in decimal
 
     // 0x0000 -----------------------------GET-TEXT----------------------------- 

        0x1B00,     // SET R11, 0x0000     memory buffer start
        0x0000,
        0x08A0,     // CALL Hexdump
        0x1000,     // SET R0, 0x0000
        0x0000,

After about a month of part-time hand assembly, a friend, Frank Eggink, provided me with a customised version of TASM32 and a hex loader program that made the whole process somewhat easier.

    // 0x0070 ---------------------------PRINTHEX-------------------------------- 
        
    //  Prints out contents of R0 as a 4 digit hexadecimal number to the terminal 
    //  Leading zeroes are not suppressed
    //  R1 = Heximation Value
    //  R2 = digit 
    //  R3 = 0x30
    //  R4 = temporary storage for accumulator (Heximated value)
    //  R6 = temporary store for output character
        
        0x1200,     // SET R2, 0x0000
        0x0000,
        0x1300,     // SET R3, 0x0030
        0x0030,
                   
        0x1100,     // R1 = 4096
        0x1000,
        0x088C,     // CALL Heximate
        
        0x1100,     // R1 = 256
        0x0100,
        0x088C,     // CALL Heximate
        
        0x1100,     // R1 = 16
        0x0010,
        0x088C,     // CALL Heximate

        0x0A30,     // ADI 0x30 to make a number
        0x3600,     // Store in R6
        0x0B3A,     // SBI  0x3A  - is it bigger than ascii 9

     // 0x0080 ---------------------------------------------------------
        
        0x0284,     // BLT 0x84 -  Print decimal digit
        0x0A41,     // ADI 0x41 - make it a hex digit 
        0x0C00,     // putchar R0
        0x0086,     // BRA CRLF

        0x2600,     // LD from R6
        0x0C00,     // putchar R0
        0x1000,     // SET R0, CR
        0x000D,
        
        0x0C00,     // putchar R0, CR
        0x0B03,     // SBI 0x03 Set R0, LF   
        0x0C00,     // putchar R0, LF
        0x0003,     // BRA START     
        
        
     // 0x008C ------------------------Heximate--------------------------------
      
        0xB100,     // SUB R1,     :Heximate 
        0x0290,     // BLT 0x90    
        0xE200,     // INC R2
        0x008C,     // BRA 0x08C

     // 0x0090 ---------------------------------------------------------    

        0x3400,     // Store R0 in R4   temporary store the remainder
        0x2200,     // LD R2  get the count from R2 into R0
        0x0A30,     // ADI 0x30 to make a number
        0x3600,     // ST R6  - temporary save of R0 to R6
        
        0x0B3A,     // SBI  0x3A  - is it bigger than ascii 9
        0x0299,     // BLT 0x99 Print decimal digit
        0x0A41,     // ADI 0x41 - make it a hex digit 
        0x0C00,     // putchar R0

        0x009B,     // BRA 0x9B   Restore R0
        0x2600,     // Get R0 back from R6
        0x0C00,     // putchar R0 Print it as a decimal digit
        0x2400,     // Get R0 back from R4

        0xA100,     // ADD R1 adds DEC value to restore R0       
        0x1200,     // SET R2,0    Reset R2
        0x0000,
        0x0900,     // RET

monsonite · September 30, 2020, 10:32am

Ed - that was the initial plan, and as a means of gaining new experience in those 3 design areas.

jecel · September 30, 2020, 3:37pm

I think I have mentioned the ZPU before. Though it has a poor DMIPS/MHz ratio, I would seriously consider it as an option if I were building a TTL computer.

The only visible registers are the program counter and the stack pointer.