6502/65C02/65CE02 assembly "updater"

I’m wondering if anyone has come across an assembler or related utility that reads 6502 assembler and/or machine code and suggests ways to implement that code using the new opcodes in the 65C02 and 65CE02?

Nothing too advanced. I’m not expecting it to re-arrange register usage or change existing code into base-page form, but things like looking for CLC/ADC #1, which can become INC A, or TXA/PHA patterns → PHX, or a JMP to a nearby address → BRA, and so forth, seem like things that could be automated.
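A minimal sketch of the pattern-matching half of such a tool follows. It assumes plain-text source with labels in column one and `;` comments. The pattern table and the `instructions`/`scan` functions are hypothetical names for illustration, and every suggestion carries the caveat that makes it unsafe to apply blindly. For example, INC A does not update C or V and ignores decimal mode, so CLC/ADC #1 is only equivalent when neither matters.

```python
import re

# Hypothetical pattern table: (regexes for consecutive instructions,
# function building the 65C02 suggestion from the matched text, caveat).
PATTERNS = [
    ([r"TXA", r"PHA"], lambda w: "PHX", "A and N/Z no longer hold X afterwards"),
    ([r"TYA", r"PHA"], lambda w: "PHY", "A and N/Z no longer hold Y afterwards"),
    ([r"CLC", r"ADC #\$?0*1"], lambda w: "INC A",
     "INC A leaves C and V alone and ignores decimal mode"),
    ([r"LDA #\$?0+", r"STA [$\w]+(,X)?"], lambda w: "STZ " + w[1].split(" ", 1)[1],
     "A is no longer zero and Z is not set afterwards"),
]


def instructions(lines):
    """Yield (line_number, instruction) pairs; column-one labels become barriers."""
    for num, line in enumerate(lines, 1):
        code = line.split(";", 1)[0].rstrip()   # drop ';' comments
        if not code.strip():
            continue                            # blank or comment-only line
        if not code[0].isspace():
            # A label may be a branch target, so no pattern may span it.
            yield num, "<label>"
            parts = code.split(None, 1)
            if len(parts) == 1:
                continue
            code = parts[1]
        yield num, " ".join(code.split()).upper()


def scan(lines):
    """Return (line_number, suggestion, caveat) for every pattern match."""
    insns = list(instructions(lines))
    found = []
    for i in range(len(insns)):
        for regexes, build, caveat in PATTERNS:
            window = insns[i:i + len(regexes)]
            texts = [text for _, text in window]
            if len(window) == len(regexes) and all(
                    re.fullmatch(rx, t) for rx, t in zip(regexes, texts)):
                found.append((window[0][0], build(texts), caveat))
    return found


source = """\
        TXA
        PHA        ; save X
LOOP    LDA #0
        STA $D020
        CLC
        ADC #1
        TYA
SKIP    PHA
""".splitlines()
for num, suggestion, caveat in scan(source):
    print(num, suggestion, "--", caveat)
```

This reports PHX at line 1, STZ $D020 at line 3 and INC A at line 5. The TYA/PHA at lines 7 and 8 is deliberately skipped because a label sits between them, so something may jump straight to the PHA.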

I have the source for the Atari OS and original BASIC, and I’m curious about how much of a difference it would have made if they had moved to the 65C02. The last time I got curious I wrote RetroBASIC, so I’m hoping in this case someone has already done it :)


Sounds like a peephole optimiser! But not the kind attached to the back end of a compiler, more of an after-the-fact analysis.

I vaguely recollected some such thing, and sure enough, there was one, over in the land of Stardot: back in 2016, SteveF wrote a Python script, which he has kindly shared with permission to post here. It is minimal but perhaps illustrative of an approach:

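from __future__ import print_function
import sys

# Load address of the image (0x8000 is the base of a BBC Micro sideways ROM).
start = 0x8000
with open(sys.argv[1], "rb") as f:
    image = bytearray(f.read())

opcode_jmp_abs = 0x4c

# Stop two bytes short of the end so image[i + 2] is always in range.
# Every byte is checked, so operand or data bytes that happen to equal
# 0x4C are reported too; the output still needs checking by hand.
for i in range(0, len(image) - 2):
    if image[i] == opcode_jmp_abs:
        target = image[i + 1] + (image[i + 2] << 8)
        # If we replace this JMP with BRA, its displacement will be calculated
        # from the address just after the two-byte BRA.
        branch_base = start + i + 2
        branch_displacement = target - branch_base
        if -128 <= branch_displacement <= 127:
            print("&%04X JMP &%04X -> BRA (%d)" % (start + i, target, branch_displacement))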

I see also there’s a program opt65 by Daniel Dallmann which might be worth a look. There’s a version (V0.12) here
and one here

The problem with combinations of opcodes is that either it’s extremely simple, so that it’s little more than a regular expression searching for a line with TXA followed by one with PHA, or it requires the kind of contextual awareness you’d expect from an optimizing compiler. Most constructs can be written in several ways, and placed at various points in the instruction order. Experienced programmers will choose the variant that provides the most side effects to exploit. (Yes, we may have to do a TXA, but we can win this extra instruction back, or even more than one, by exploiting a side effect. The same is probably true for many instances that seem to call for a STZ instruction. Speaking of BASIC and OS code, MS/Commodore BASIC is full of this, because it tries hard to be dense and make the most of the available ROM.)

E.g., take TXA/PHA: now we have three copies of that value (in A, in X, and on the stack), we have set the N and Z flags as well, and chances are that we use more than one of these downstream. That may be the very raison d’être of the construct, and, since flags were set, various conditions downstream may depend on them. The only way to find out is to trace the code for state changes and see whether any of this is used, until the context is entirely lost. Otherwise, a utility like this may be more of a trap than a helping hand.
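That trace can be sketched as a forward check over straight-line code. After TXA/PHA becomes PHX, A still holds its old value and N/Z are unchanged, so the rewrite is only safe if the following code overwrites both A and N/Z before reading either. This is a minimal sketch, assuming the instructions after the PHA are given as a list of source strings. The effects table and `phx_rewrite_is_safe` are hypothetical and deliberately partial. Anything not modelled, including every branch, jump, JSR and RTS, makes it give up and answer "unsafe".

```python
# Hypothetical, deliberately partial effects table: mnemonic -> (reads, writes),
# tracking only the accumulator ("A") and the N/Z flags ("NZ").
EFFECTS = {
    "LDA": (set(), {"A", "NZ"}),
    "PLA": (set(), {"A", "NZ"}),
    "TXA": (set(), {"A", "NZ"}),
    "TYA": (set(), {"A", "NZ"}),
    "LDX": (set(), {"NZ"}),
    "LDY": (set(), {"NZ"}),
    "INX": (set(), {"NZ"}),
    "INY": (set(), {"NZ"}),
    "DEX": (set(), {"NZ"}),
    "DEY": (set(), {"NZ"}),
    "TAX": ({"A"}, {"NZ"}),
    "TAY": ({"A"}, {"NZ"}),
    "CMP": ({"A"}, {"NZ"}),
    "ADC": ({"A"}, {"A", "NZ"}),
    "SBC": ({"A"}, {"A", "NZ"}),
    "AND": ({"A"}, {"A", "NZ"}),
    "ORA": ({"A"}, {"A", "NZ"}),
    "EOR": ({"A"}, {"A", "NZ"}),
    "STA": ({"A"}, set()),
    "PHA": ({"A"}, set()),
    "STX": (set(), set()),
    "STY": (set(), set()),
    "CLC": (set(), set()),
    "SEC": (set(), set()),
}


def phx_rewrite_is_safe(following):
    """Can TXA/PHA become PHX, given the instructions that follow the PHA?"""
    pending = {"A", "NZ"}  # state TXA/PHA leaves behind that PHX would not
    for insn in following:
        mnemonic = insn.split()[0].upper()
        if mnemonic not in EFFECTS:
            return False            # control flow or unmodelled opcode: give up
        reads, writes = EFFECTS[mnemonic]
        if reads & pending:
            return False            # downstream code uses the copy of X in A or N/Z
        pending -= writes
        if not pending:
            return True             # both overwritten before any use
    return False                    # ran out of known code with state still pending
```

For instance, TXA/PHA followed by LDA #$20 / STA $D020 is reported safe, while TXA/PHA followed by STA $D020, or by LDY #0 / BEQ, is not. Being conservative means many genuinely safe rewrites are rejected, which seems the right trade-off for a tool that only makes suggestions.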

(E.g., I’ve seen code, esp. patch code, which performs an initial PHP and then a PLP at the end, again, just to preserve flag state and to minimize side effects.)

And I really think there isn’t much middle ground here, with the notable exception of subroutines (esp. interrupt routines) pushing the current state to the stack for backup.


Yeah, too many side effects with too many potential downstream consequences.

Another thing is semantics. E.g., in the tokenizer routine of Commodore BASIC (which I had an eye on recently), there’s a mode flag which receives all kinds of values, but the only aspect we care about is bit 6 (as tested by a BIT instruction and BVS). It is even initialized to 0x04, because that is what happens to be available in the registers, and the important part is that bit 6 is not set.

Now, if we start using STZ to initialize or reset this mode flag, we lose context for what else is going on in the code. If further down it receives 0x22, a reader would assume this means something specific, since the flag now looks like it is either zero or some deliberately chosen value! But that is not actually the case: we have lost the meaning conveyed by the peculiar non-zero reset, which set the theme and focused our attention on which bits of this mode flag actually matter.

I can imagine a tool that’s somewhat iterative or interactive, like a modern code coverage tool, where you get to annotate things which you can’t change, to avoid false positives. But because of side-effects, you’d need to be figuring out the call sites and jump targets, so it might help to have an interactive disassembly too.

I can imagine low-hanging fruit in the sense of “interesting” patterns: the above-mentioned TXA:PHA, for example, or LDY #0 ahead of (zp),Y accesses, which the 65C02 can replace with its non-indexed (zp) mode.
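The LDY #0 case can be found at the byte level, in the same style as SteveF’s JMP scanner. It relies on a regularity of the 65C02 opcode map: each (zp) opcode is the matching (zp),Y opcode plus one (LDA (zp),Y is $B1, LDA (zp) is $B2). This is a sketch under the same caveats as that script, since every byte is examined and data can produce false hits. The rewrite is also only valid if nothing downstream relies on Y being zero or, after a STA, on the flags LDY #0 set. On the 65CE02 the $x2 opcodes are (zp),Z instead, so there it only holds while Z is zero. The function name and the default 0x8000 load address are assumptions.

```python
# (zp),Y opcodes on the 6502; the 65C02 (zp) form of each is opcode + 1.
INDIRECT_Y = {0x11: "ORA", 0x31: "AND", 0x51: "EOR", 0x71: "ADC",
              0x91: "STA", 0xB1: "LDA", 0xD1: "CMP", 0xF1: "SBC"}
OPCODE_LDY_IMM = 0xA0


def find_ldy0_indirect(image, start=0x8000):
    """List (address, mnemonic, zp, new_opcode) for LDY #0 then op (zp),Y."""
    hits = []
    # Stop three bytes short of the end so image[i + 3] is always in range.
    for i in range(len(image) - 3):
        if (image[i] == OPCODE_LDY_IMM and image[i + 1] == 0x00
                and image[i + 2] in INDIRECT_Y):
            opcode = image[i + 2]
            hits.append((start + i, INDIRECT_Y[opcode], image[i + 3], opcode + 1))
    return hits


sample = bytes([0xA0, 0x00, 0xB1, 0xFB,   # LDY #0 / LDA ($FB),Y
                0xEA,                     # NOP
                0xA0, 0x00, 0x91, 0x80,   # LDY #0 / STA ($80),Y
                0xA0, 0x01, 0xB1, 0xFB])  # LDY #1 / LDA ($FB),Y: not a candidate
for addr, name, zp, new_op in find_ldy0_indirect(sample):
    print("&%04X LDY #0 / %s (&%02X),Y -> %s (&%02X) [opcode &%02X]"
          % (addr, name, zp, name, zp, new_op))
```

Each hit saves the two bytes of the LDY #0. Whether Y is still needed afterwards has to be checked separately, for example with a liveness trace like the one suggested for TXA/PHA.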

Okay, but … why? What is the point? If it’s some sort of simple automated utility, it’s kinda fun to see how much of a speed difference it could make.

But if we’re talking manual effort, then … what is the point? No amount of hand optimization is going to match the impact of cranking up an emulator’s speed a little bit.

Of course, some of us find it amusing to hand optimize 6502 code just to see what we can do within the limits of, for example, a C64’s hardware. But that’s different, I think.

It might be fun, and one might care. Case in point, see
Saving bytes, byte by byte - an example
which links to and aims to explain the long stardot thread which (as it happens) SteveF posted his optimiser to.

(I’m tempted to say that asking “what’s the point” is never a good thing to do here. People have their motivations and interests, and if those are not shared, or not understood, that’s not for them to defend, but for the critic to understand or ignore.)

Recently, I spent a lot of time optimising my 6502 TinyBasic interpreter to add more commands and functionality - “Code Golfing” as it is. I succeeded and was very pleased with the end results - but it took weeks of part-time effort. At some point, I think you have to stop and call it a day.

… but it is somewhat addictive, and I’ve been looking at it again with a view to implementing a machine code monitor as well as the TinyBasic interpreter. I have 4KB of ROM…

-Gordon


I’m actually playing with doing this in Python using Lark as a parser. Lark solves a lot of problems I had in flex/bison, like early matching, and Python largely eliminates the cross-platform and building issues. So I’m using this project as a stepping-stone to a possible reimplementation of RetroBASIC in Python.

But jebus, I’m going absolutely nuts with the caching. I change code and run it and nothing changes, because it’s cached somewhere and I can’t find it (nope, not __pycache__, no .pyc files either!), so debugging is driving me up the wall.

I’m on the Mac, has anyone else done Python work there and have any advice? I’m using VSC as my IDE, but I have no better luck in terminal and a text editor.

I’m using BBEdit, which is just a text editor, but it has a “run” menu and you can run any scripts (in Python, Perl, etc.) as text filters. The latter may be an idea for a project like this…

(Just put a script in a dedicated folder and it will be available in the text editor, where you can run this on any text selection.)

BBEdit is a paid product, but it has a free mode in which most functionality remains available after the evaluation period. If you don’t use it for HTML, you probably won’t miss much in free mode. Also, BBEdit is about the best you can get on the Mac, or on almost any platform, for running Perl-style regular expressions over large files or across the file system, which is really its selling point for me.