Custom OCR for printer listings

MauryMarkowitz · April 1, 2024, 2:18pm

I have been looking for such a project for some time; I can’t wait to try this out.

I came across a commercial dot-matrix OCR program a few months back that had magazine-quality text as its demo inputs. I would have been perfectly happy to pay for a copy if it worked. Sadly, while the product page is still online, emails to the contact dropbox went unanswered.

May I be so bold as to suggest re-hosting the code on your GitHub account? I found it difficult to navigate the FTP site via a browser, and downloading with SiteSucker was slow. Or perhaps I missed a ZIPped version of the whole build somewhere?

I also did not see a license anywhere in the code.

EnthusiastGuy · April 1, 2024, 8:52pm

Indeed. What worries me is that the raw text you have in mind will probably cover several different flavors of computer syntax.
Let me ask a blunt, cruel question. What in the raw library you are pointing to is worth preserving as text? Are there any situations where the scanned code will suffice if simply left as an image? Such as (inventing here) an algorithm for machine X that sorts something? If so, it might be sufficient to make the original raw material public and let anyone willing to reuse it sort out the code. Anyone with an interest will certainly reach their objective if the raw material is worth it to them.
What comes to my mind here is a stupid analogy: going to a museum of literature and expecting every text there to be copy-paste-able, while interested people will take phone photos anyway.
I think that if you are already sitting on scans of texts, you already own an invaluable wealth of knowledge that does not need reworking.
A great first step (in my humble opinion) would be to index all of it in such a way that anyone needing that information can get to it fast.

scruss · April 1, 2024, 9:20pm

Yes, monospaced. Quite nicely printed, too.

I’ll point you at this: A contribution to computer typesetting techniques : tables of coordinates for Hershey’s repertory of occidental type fonts and graphic symbols : Wolcott, Norman M. : Free Download, Borrow, and Streaming : Internet Archive

That page (A-45) contains character 3010. When I last looked at the data (2016), this character had an error in the Usenet Font Consortium release.

oldben · April 1, 2024, 9:24pm

And make backups!
9-track tape stored underground in a salt mine would be nice, but is not practical any more.
The UK seems to be leading in having lost important information from Dr Who to the BBC Domesday Project.

MauryMarkowitz · April 2, 2024, 12:19pm

I’m sure there are. But in my particular case, I’m trying to make a single BASIC that will run all the early dialects - Dartmouth-clones, HP-clones and DEC-clones (which turns out to be a Tymshare clone).

I’ve been trying to find an OCR that will read 101 Computer Games and What To Do After You Hit Return. If this works it will solve a long-standing problem for me.

drogon · April 2, 2024, 1:10pm

101 Computer Games is a “solved problem”.

I think the link to a ZIP file containing them (and the next book, Beyond?) is on archive.org, which I can’t access right now due to being away from home and my mobile provider blocking it as “adult content”, but it is out there, somewhere - I know this as I have it on my system at home…

But searching should find it fairly easily.

-Gordon

drogon · April 2, 2024, 1:17pm

And the hardware to read those tapes - the old machines with the rubber bands that perish, the tapes where the magnetic coating “prints through” or just delaminates, or the video disc players failing in obscure ways, or the video discs themselves failing because they were just not made to last (cf. the Domesday Project).

You (whoever) need to make a conscientious effort to copy and maintain archives at each new generation of technology, and it just takes one change of management and a generation of forgetting and it’s all but gone.

At least with paper tape/listings something remains.

An interesting example recently was the restoration of a 6502 version of Focal from a paper listing of the source code and a binary file to run on (I think) the KIM-1. A small team took some OCRd versions of the paper sources, re-created the source, and assembled the result to ensure the resulting binary matched the original binary version… That’s possible for larger projects but takes time and the good-will of enthusiasts to donate their time and energy to the project.
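The "assemble it and check it matches the original binary" step can be illustrated with a minimal sketch. This is not the Focal team's actual tooling; the function name and the sample bytes below are made up for illustration. It assumes both the original and the rebuilt binary are already available as raw bytes.

```python
def first_mismatch(original: bytes, rebuilt: bytes):
    """Return the offset of the first differing byte, or None if identical.

    Hypothetical helper: a mismatch offset points back at the region of the
    listing where an OCR or transcription error is likely to be.
    """
    for offset, (a, b) in enumerate(zip(original, rebuilt)):
        if a != b:
            return offset
    # All shared bytes match; a length difference is still a mismatch.
    if len(original) != len(rebuilt):
        return min(len(original), len(rebuilt))
    return None


# Made-up sample bytes, for illustration only.
original = bytes([0xA9, 0x00, 0x8D, 0x00, 0x17])
rebuilt = bytes([0xA9, 0x00, 0x8D, 0x08, 0x17])  # one misread byte
print(first_mismatch(original, rebuilt))
```

A result of None is the strong check described above: the re-typed or OCRd source assembles to exactly the bytes that were originally distributed.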

Hats off to @gtoal for doing this - I appreciate reading some of it as, while I was never at Edinburgh Uni, I “touched” it in several places using Imp at Moray House and a summer placement in George Square where I had the chance to meet and work with some of the good folks there…

-Gordon

gtoal · April 2, 2024, 1:45pm

I’ll put it on GitHub when it’s more usable. I was going to create a zip, but even with just one scanned page of data it’s quite large, and I’ve been having hassles from my hosting service about how much disc space I’m using, so I was reluctant to double the space used for this project by zipping the scan data. Maybe I’ll just zip up the source code. Give me an hour or two to work something out.
Just be aware it’s still very much at the ‘R&D in progress’ stage and not usable for anything practical yet. I’m right in the middle of working on getting the recognition accuracy up.

gtoal · April 2, 2024, 1:50pm

Those pages look very recoverable. I’ll extract a page image and add it to the demos.

MauryMarkowitz · April 2, 2024, 1:57pm

The PDF of the 101 BASIC Games is available on archive.org, but I cannot find the files themselves?

I believe you may be referring to the later BASIC Computer Games, which has a similar collection, but converted to MS/DEC. It is easy to find the source for these online.

There are many differences between the two collections, including one game in the Dartmouth dialect, which is my primary interest. I have been looking for any collection of Dartmouth code in source form, but so far I have failed. This seems like a real shame. Can-Am is the only long-form program I have found in that format.

It should be stated loudly and constantly that 101 and BCG are NOT the same, no matter what people say.

I believe by “Beyond” you may be referring to “More BASIC Computer Games”, circa 1979?

gtoal · April 2, 2024, 1:59pm

GitHub - GReaperEx/bcg: Original programs of the book "BASIC Computer Games"
Vintage BASIC - Games
GitHub - coding-horror/basic-computer-games: An updated version of the classic "Basic Computer Games" book, with well-written examples in a variety of common MEMORY SAFE, SCRIPTING programming languages. See https://coding-horror.github.io/basic-computer-games/
Table of Contents: BASIC Computer Games
One of the guys in the Vectrex group ported some of the games to vector format.

larsbrinkhoff · April 2, 2024, 2:04pm

Sign me up as very interested. Just the other day I photographed 600 pages of PDP-11 listings that I’d like OCR’d. Can I join your beta testing programme?

gtoal · April 2, 2024, 2:06pm

I can’t access the site just now, but a pal from my Acorn days has every variant of the old Star Trek game at http://www.dunnington.u-net.com/public/startrek/ and maybe there’s a Dartmouth BASIC version in there? (If it’s not coming up for you either, then maybe via Wayback?)

gtoal · April 2, 2024, 2:09pm

You can certainly use it once it’s usable, and I try to keep a relatively up-to-date copy online. Nothing as organised as formal testing (it would be pre-alpha at this point!). I guess I had better start adding a version number to the code. The most useful thing you can do to help at the moment is supply an image of one typical page for testing.

gtoal · April 2, 2024, 2:16pm

You can’t index scans without OCRing them first.

I’m not sitting on anything offline. Everything I have is on the Edinburgh history archive or other public sites like bitsavers. https://gtoal.com/history.dcs.ed.ac.uk/archive/

Priority for OCRing is software that can be recompiled and run again, either under emulation or with a new compiler etc.

MauryMarkowitz · April 2, 2024, 2:18pm

101 and BCG are separate collections. I’ve updated the en.wiki article to make the distinction between the two clear.

Dunnington’s page quotes mine. I’ve yet to find a Dartmouth version; the earliest that I have found is the HP version and the 101 version of that, for the PDP-8, spcwr.

I’m wondering if I might convince you to try your OCR on a couple of listings? I have one good quality one from 101 that seems like a good target, and another from What to Do that is lower quality and a better stress test.

gtoal · April 2, 2024, 2:19pm

PS I hope you didn’t literally mean photographed? A flatbed scanner is essential to automate accurate segmentation. Photos will most likely have too much distortion even after image correction tools. I need to do subpixel resolution imposition of a grid for this to work.
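To see why the grid needs subpixel resolution, here is a rough sketch. It is not taken from the OCR code under discussion; the function names and the 12.4 x 20 pixel character pitch are assumptions chosen for illustration. The idea is that a monospaced listing rarely has a character pitch that is a whole number of pixels, so cell positions are computed in floating point and rounded only at the end.

```python
def cell_box(col, row, origin=(0.0, 0.0), pitch=(12.4, 20.0)):
    """Return the (left, top, right, bottom) pixel box of one character cell.

    origin and pitch are hypothetical values measured from a flatbed scan:
    the top-left corner of the text grid and the character spacing in pixels.
    """
    ox, oy = origin
    px, py = pitch
    left = ox + col * px
    top = oy + row * py
    # Round only at the very end so the fractional part never accumulates.
    return (round(left), round(top), round(left + px), round(top + py))


def drift_with_integer_pitch(cols, pitch_x=12.4):
    """Horizontal error, in pixels, after `cols` columns if the pitch is
    rounded to a whole pixel before being multiplied out."""
    return abs(cols * pitch_x - cols * round(pitch_x))


print(cell_box(79, 0))               # box of the 80th character on line 1
print(drift_with_integer_pitch(80))  # error from using a 12 px pitch instead
```

With these assumed numbers, a whole-pixel pitch is off by more than two character widths by the end of an 80-column line. The same reasoning explains why lens distortion in a photograph is a problem: the true pitch is no longer constant across the page, so a single regular grid cannot fit it.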

gtoal · April 2, 2024, 2:20pm

Maury, just point me at a sample page, I’ll add it to a set of test examples I’m assembling.

MauryMarkowitz · April 2, 2024, 2:26pm

Easy enough!

This is a good one from 101. The source is in column 1 of a 2-column page. I assume this will require a manual clip to extract the code?

For WTD, I think this one might be good, it’s typical of the listings in this book and many others from the era including practically everything in Creative Computing, where it was common for the editors to write notes in the code. This, I expect, will be a challenge without simply removing the bits by hand.

I have always wanted to get a better scan of WTD, but I have yet to find one on Abebooks. Which reminds me, I will ask about that…

gtoal · April 2, 2024, 3:31pm

Those pages look like they were not level when they were photographed. This code relies entirely on scans having been made with a flatbed scanner. I’m not hopeful, but I’ll try anyway. But what I need you to do is extract a single image (png or whatever you can manage) of each relevant page.