It’s a little odd that OCR has trouble. This is a fixed-spacing 7x9 dot matrix font that uses only upper case, numerals, and a few symbols. What could be easier? I feel like the early 1980s OCR we did at Cognex could handle it.
Microsoft Copilot has given me a workflow using third-party tools designed specifically for converting legacy source code listings to machine-readable text. I may try it out, unless Lars has really gone and typed it all in by hand, in which case I’m amazed at his work ethic. I’m way too lazy to do such a thing.
I’m only at page 8, so at 30% or so. I have done a lot of listing transcriptions over the years, and I actually enjoy it. It’s not overly hard work; even if it’s 100 pages, I’ll just do a few every day and eventually it will all be done. In the process I will have learned something about the code.
In some cases, I have recruited volunteers to ensure all pages were independently typed twice to catch typos.
Modern OCR seems to lean heavily on machine learning trained on regular prose, so it does not do well with typical computer listings. People say a new batch of LLM-based OCR is doing better.
I’ve had a long and fascinating conversation with Copilot about this. I’ve uploaded the PDF scans and it’s apparent that Copilot is quite capable of directly producing perfect or nearly perfect machine-readable text. It understands that it’s looking at a fixed-spacing dot matrix font, limited to upper case. It further understands that it’s looking at a PAL8 listing file, and what the organization and syntax of such a file is. It has produced perfect transcriptions of small sections of the code for me to look at, and asked me how I’d like the text output to be formatted. There is one serious problem, however. It is not allowed to simply transcribe the entire scan because it cannot verify the copyright. It can only do small pieces that I identify. I’m trying to develop a non-tedious workflow that would get the whole thing. We could then compare to what Lars is producing for confirmation.
The problem, with both Copilot and Lars’ effort, is that I don’t see how it scales to the full listing, which is ten times larger than what I’ve scanned so far. Since scanning the first 24 pages was much smoother than I had feared, I certainly can and probably will scan the rest.
Copilot has also offered to help me use third-party tools designed for legacy listings, or even to create a C# program to do the OCR. Both of those options seem fairly complex, however, from the brief descriptions Copilot has given me. I’ve even considered writing my own C# program, mostly because I like writing code and C# is my second favorite language (after C++). I’ve made no decisions so far, just kicking around ideas.
(I’ve split this OCR discussion off from the parent, with permission, but as ever there could be rough edges, so please refer to both topics for full understanding.)
I notice that we have several previous threads touching on OCR difficulties:
Custom OCR for printer listings
Some tips for LLM OCR code scanning
OCRing, a nice surprise
I’ll just add as a footnote the same advice I gave when we started the Edinburgh Computer History Project 30+ years ago: scan at as high a resolution as you can manage, in colour or greyscale. Future OCR will be better than current OCR, and the more data it has to work with, the better. You can always down-res from a higher-resolution master if needed.
I’ve written a custom OCR program designed to read fixed spacing, dot matrix fonts that have been scanned at much higher resolution than the dots. The Weather Radar kernel, PMIO, and TI-980 Emacs listings I’ve posted use 7x12 dot matrix fonts printed at 100 DPI and scanned at 600 DPI.
The program is under development. The results I have so far are excellent. Using the kernel listing, I’ve compared Lars’s hand-typed file with my OCR. The OCR got a total of 6 characters wrong, which I intend to fix, and 2 cases where it detected a period but there was just a small smudge on the page.
Lars’s hand-typing is remarkable. There are just 11 significant errors in the entire listing, mostly in comments. There are also 10 trivial mismatches involving extra or missing spaces, missing periods at the end of a comment, and the like. These are often just reasonable editorial choices.
It would have been extremely difficult to debug the OCR without reference to Lars’s work.
Reading all 24 pages of the kernel currently takes 40 seconds. I’m working on improvements that I expect will eliminate the 6 wrong characters before I try the rest of PMIO. I’ll post the OCR algorithms at some point.
Seven days so far to write the code, 40 seconds to run. I had enormous help from Microsoft Copilot. Not to write code, but to act as a super C# reference. I’ve written tons of C#, but not recently. Who can remember every damn thing about the language, and who has time to look it up? Copilot just knows every obscure detail.
Good work.
Another OCR project might be ASR 33 printouts.
Thinking here of small BASIC programs like HUNT THE WUMPUS.
I took the liberty of dropping the extra characters at the end of wrapped lines, which seem to have been introduced by some kind of bug. Other than that, I aim to match 100% of the text (including your original typos) and formatting. I verify the code by assembling it and fixing errors like mistyped symbols, and also checking against the octal data in the listing. I’m curious what typos you found.
Agreed. I bet there’s a large number of listings out there that could benefit.
This is a solved problem. All the old BASIC games in Creative Computing, 101 BASIC Computer Games, More BASIC Computer Games, etc. were transcribed years ago and can be freely found online.
Wumpus?
https://unicorn.drogon.net/wumpus.rtb
-Gordon
Your handling of wrapped lines is correct; I don’t count those as mismatches. Here is the corrected kernel listing. A file diff will show you what I found. Note that I’m working with kernel.lst from your repository as of March 23. Most mismatches are in comments and wouldn’t be caught by assembling, but a few would be, I think.
The methods I’m using wouldn’t work for an ASR33. It’s got to be a simple dot matrix font. Not that ASR33 OCR couldn’t be done, but not with my current methods. I’ve got some listings from a late 1970s Centronics printer that would be suitable, for example.
I’m taking advantage of every constraint on the text that I can, which is why it is so accurate and relatively quick to write the code.
Here’s a picture of a portion of this particular font as defined in C#
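In sketch form, it’s along these lines (the container name and the ‘A’ pattern here are illustrative, not the real table):

```csharp
using System.Collections.Generic;

// Sketch only: one dot pattern per character, rows as strings,
// '#' = dot present, '.' = dot absent. Shown here as a 7x9 glyph;
// the names and the 'A' pattern are illustrative.
static class DotMatrixFont
{
    public static readonly Dictionary<char, string[]> Glyphs = new()
    {
        ['A'] = new[]
        {
            "..###..",
            ".#...#.",
            "#.....#",
            "#.....#",
            "#######",
            "#.....#",
            "#.....#",
            "#.....#",
            "#.....#",
        },
        // ... remaining characters defined the same way
    };
}
```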
Other dot matrix fonts can be defined as needed.
Very encouraging new tool! It’s an oddity that normal OCR has to be very generous and fuzzy, while these cases are looking for something very constrained.
Dot matrix OCR has two phases, character locating and character recognition. Recognition is under development, but locating is solid. Here is how it works.
I assume fixed character and line spacing, and that each page was scanned at some small angle due to manual placement on the scanner. At a selectable page angle I generate X and Y 1D projections, which look like this:
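In sketch form, a projection is just a sum of grey values along lines tilted by the trial angle; something like this, much simplified (row-major byte image and nearest-pixel skew are assumptions):

```csharp
using System;

static class Projections
{
    // Simplified sketch: sum grey values along lines tilted by angleDeg
    // to build the Y projection; the X projection is done the same way
    // with the roles of rows and columns swapped.
    public static double[] ProjectY(byte[] grey, int width, int height, double angleDeg)
    {
        var proj = new double[height];
        double slope = Math.Tan(angleDeg * Math.PI / 180.0);
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
            {
                int ySkew = y + (int)Math.Round(x * slope); // follow the tilt
                if (ySkew >= 0 && ySkew < height)
                    proj[y] += grey[ySkew * width + x];
            }
        return proj;
    }
}
```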
This is the most time-consuming part of character locating because it has to look at the entire 20 megapixel image. The projections are filtered by vaguely Laplacian filters that are roughly tuned to the character and line spacing. The filter kernels look like this:
The exact filter kernel shape is not critical, just these rough dimensions. These are constant-time filters: execution time is independent of the width of the kernel. The Y kernel at 56 pixels runs in the same time as the X kernel at 18 pixels.
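The constant time comes from the usual running-sum trick: build the kernel out of box segments and evaluate each box with two prefix-sum lookups, so width is free. A 1D sketch (the widths and weighting are illustrative):

```csharp
using System;

static class Filters
{
    // Sketch of a box-built, roughly Laplacian kernel: a positive center box
    // flanked by two negative lobes, each box evaluated in O(1) from a
    // prefix sum. 'center' and 'lobe' widths are illustrative parameters.
    public static double[] BoxLaplacian(double[] signal, int center, int lobe)
    {
        int n = signal.Length;
        var prefix = new double[n + 1];
        for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + signal[i];

        // Sum of signal[lo..hi), clamped at the ends.
        double Sum(int lo, int hi) =>
            prefix[Math.Clamp(hi, 0, n)] - prefix[Math.Clamp(lo, 0, n)];

        int half = center / 2;
        var result = new double[n];
        for (int i = 0; i < n; i++)
        {
            double mid   = Sum(i - half, i + half + 1);
            double left  = Sum(i - half - lobe, i - half);
            double right = Sum(i + half + 1, i + half + 1 + lobe);
            // Weight the lobes so the kernel sums to roughly zero.
            result[i] = mid - (left + right) * (2.0 * half + 1) / (2.0 * lobe);
        }
        return result;
    }
}
```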
The results of the filtering look like this:
Notice the very sharp peaks separating the lines and characters. A simple Fourier-like analysis gets the period, phase, and total amplitude of the peaks, yielding a grid like this:
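The “Fourier-like” part can be as simple as correlating the filtered projection against sine and cosine over a range of candidate periods and keeping the strongest; amplitude picks the period, and atan2 gives the phase. A sketch (the parameter names are mine):

```csharp
using System;

static class GridFinder
{
    // Sketch: scan candidate periods, correlate against cos/sin at each,
    // and keep the one with the largest amplitude. Phase locates the grid.
    public static (double Period, double Phase, double Amplitude) Find(
        double[] proj, double minPeriod, double maxPeriod, double step)
    {
        double bestP = 0, bestPhase = 0, bestAmp = -1;
        for (double p = minPeriod; p <= maxPeriod; p += step)
        {
            double re = 0, im = 0, w = 2 * Math.PI / p;
            for (int i = 0; i < proj.Length; i++)
            {
                re += proj[i] * Math.Cos(w * i);
                im += proj[i] * Math.Sin(w * i);
            }
            double amp = Math.Sqrt(re * re + im * im);
            if (amp > bestAmp)
            {
                bestAmp = amp; bestP = p; bestPhase = Math.Atan2(im, re);
            }
        }
        return (bestP, bestPhase, bestAmp);
    }
}
```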
Remember the selectable projection angle? A dual-resolution hill-climbing search runs the above at various angles and finds the angle with the strongest total amplitude. I have found angle variations of up to ±0.3 degrees. Not much, but enough to throw off the grid over 5000 pixels, so finding the right angle is both important and easy.
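The search itself is just a coarse scan followed by a fine scan around the coarse winner; in sketch form (the ±0.5 degree coarse range is illustrative; 0.10 and 0.02 degrees are the two resolutions I mention below):

```csharp
using System;

static class AngleSearch
{
    // Sketch of the dual-resolution search. gridAmplitude(angle) would redo
    // the projection + filter + period analysis at that angle and return the
    // total peak amplitude.
    public static double FindPageAngle(Func<double, double> gridAmplitude)
    {
        double best = 0, bestScore = double.MinValue;
        for (double a = -0.5; a <= 0.5; a += 0.10)                // coarse pass
        {
            double s = gridAmplitude(a);
            if (s > bestScore) { bestScore = s; best = a; }
        }
        for (double a = best - 0.10; a <= best + 0.10; a += 0.02) // fine pass
        {
            double s = gridAmplitude(a);
            if (s > bestScore) { bestScore = s; best = a; }
        }
        return best;
    }
}
```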
Now the problem reduces to recognizing a dot matrix inside a small box.
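One natural shape for that, sketched here only as a direction and not the finished method, is to sample the box at the dot positions and score each glyph against the raw grey values:

```csharp
using System;
using System.Collections.Generic;

static class Recognizer
{
    // Direction sketch only (recognition is still under development):
    // sample the character box at each dot position and pick the glyph
    // whose ideal pattern best matches the raw grey values (0 = black).
    public static char Recognize(double[,] box, Dictionary<char, string[]> glyphs)
    {
        char best = '?';
        double bestScore = double.MinValue;
        int boxRows = box.GetLength(0), boxCols = box.GetLength(1);
        foreach (var (ch, rows) in glyphs)
        {
            double score = 0;
            for (int r = 0; r < rows.Length; r++)
                for (int c = 0; c < rows[r].Length; c++)
                {
                    // Map dot position (r, c) to a sample inside the box.
                    double v = box[r * boxRows / rows.Length,
                                   c * boxCols / rows[r].Length];
                    score += rows[r][c] == '#' ? (255 - v) : v; // dark where dots should be
                }
            if (score > bestScore) { bestScore = score; best = ch; }
        }
        return best;
    }
}
```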
One thing that makes this method robust is that there are no thresholds of any kind, only the raw grey values are used. There are hardly any parameters, just a period range for the Fourier analysis, and the two angle search resolutions (I use 0.10 and 0.02 degrees). No parameter should need to be carefully tuned. My rule of thumb is that you should be able to set a parameter by intuition once and never need to change it. If you have a fussy parameter, you have the wrong algorithm.
“Algorithm” is so quaint. All this rules-based vision methodology that was my career is totally obsolete. It just can’t compete with AI. (Mostly but not entirely obsolete today, but “totally” is coming.)