Preservation: scanning and OCR tactics for old printouts

The whole thing takes about one second per page for a 20-megapixel image, in C# compiled for an x64 target on a 2 GHz laptop. It could be a lot faster, but there’s no need.

2 Likes

The complete RADAR field 0 listing is 92 pages. Lars and I have what should be a perfect text version of the first 41 pages, perfect because every case where what Lars typed disagrees with OCR results has been manually checked.

For each character position on a page, OCR produces a best guess and a warning if it is not sufficiently confident in the guess. I have a GUI that shows every mismatch or warning for manual inspection and correction.

The OCR results for these 41 pages can be summarized as follows:

30,512 non-space characters:
  5 where the best guess is wrong, all with warnings
  55 where the best guess is right but warned

167,775 spaces (character positions where there should be nothing):
  8 where the best guess, always a “.”, is wrong, all with warnings
  1 where the best guess is right but warned

So in 41 pages, manual checking would be required for only 69 cases, and OCR never made a mistake without warning.

Next step is to run the remaining 51 pages. Since I have nothing yet from Lars for comparison, I’ll be relying on OCR confidence only. Assembling the result, as Lars does, should find any errors that matter.

At some point I’ll post details of the recognition algorithm.

1 Like

Here is all of RADAR field 0.

I haven’t checked it with the assembler.

5 posts were split to a new topic: Getting the RADAR code to assemble

Good work. On similar data I found that imposing a grid over an entire page required sub-pixel resolution on both the x and y axes; otherwise the characters drifted slightly off-center from their grid locations at the extremes. Have you noticed anything like that happening?

Yes, extreme subpixel and angle accuracy is required. The images I’m working with are around 5000x4000 pixels, so tiny errors accumulate. The method I posted earlier gets a grid period accuracy of around ±0.05 pixels and angle accuracy around ±0.02 degrees. The key is the sharp peaks produced by the filters.

1 Like

Due to a bug, a handful of characters in the radar.lst link previously posted are wrong. The link now points to the corrected version. All of these characters are in comments and were not in the font definition until I encountered them for the first time. They are all recognized correctly by the OCR; the bug had to do with characters that were previously unrecognized and then added to the font definition.

That matches pretty well with my observations. My solution isn’t quite as computer-sciency; I just did a quick hack to get something going and planned to think about efficiency only after I’d confirmed the accuracy of the character matching. Looks like you have a good handle on everything, and I doubt there’s anything in my own little experiment that you haven’t already considered, but if you haven’t already had a look at the code, at least have a quick read of the writeup at Custom OCR for mono-spaced line-printer listings in case there’s anything we can share, since we appear to be heading in similar directions. (And yes, I’m afraid I haven’t done much since I last posted on the OCR stuff; I’ve been working full time on writing a compiler, and it’s not the sort of project that rewards split focus… but the OCR is something I’d like to get back to once my current compiler project is done, as our history project has a lot of listings still to convert…)

1 Like

Update on OCR results. Lars has assembled the originally posted radar.lst and found two OCR errors on the 51 pages where I had nothing to compare against and relied only on OCR confidence. Both were without warnings and were only found by checking the binary output of the assembler. One was a dot (“.”) that was actually a smudge in just the right position; it should have been a space. The other was a badly degraded 6 that was misread as a 0. The originally posted link has been corrected.

1 Like

Good to see the process converge! My only similar experience is transcribing Acorn’s 800-line ARM model in BBC Basic, from a shaky video of the program scrolling on a small monitor. I did a first pass personally, then recruited my partner for a second pass: one of us read the transcribed code aloud while the other checked against the video.

That’s really interesting work, thanks for sharing. There are some similarities with what I’ve done and some significant differences.

Perhaps the most significant difference is that you’re working on a much more difficult problem than I am. My OCR only works for dot matrix printers, an assumption built into the algorithms at all levels, as you’ll see when I post details.

We handle what you call image straightening differently. As I understand what you wrote, you determine the angle manually and then deskew the image. I determine the angle automatically from the Fourier analysis and I leave the image alone.

We both compute a grid to segment the image into individual character positions, and we both compute X and Y projections. You wrote, “The computer science way to infer the spacing from this is to apply an FFT and look for the strongest signal. However there’s an easier way! Simply loop over the array of row totals, using different strides (eg every 20 pixels, every 21 pixels, every 22 pixels etc) and find the row spacing that gives the largest total amount of whitespace.” This is very close to what I’m doing.

You don’t want an FFT; it doesn’t have enough grid-period resolution. You are, however, doing a Fourier analysis, just not the n*log(n) FFT version. You’ve upgraded from a 0.1 pixel period step size to 0.01; I use 0.04 pixels. It’s not clear what you mean by “largest total amount of whitespace”: the number of binary-thresholded pixels, or the sum of grey-scale values? Using raw grey scale is much more accurate, and the filter I described makes a huge difference. You wrote that segmenting 300 pages took days; my version needs about 1.25 seconds per page, including finding the angle, and it has never needed any manual correction.
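
In outline, that style of brute-force period search might look something like this sketch, using raw grey-scale row sums rather than thresholded pixel counts (the period range, step sizes, and names are illustrative, and the sharpening filter is omitted):

  // Sketch only: find the row period whose grid lines land on the whitest
  // rows. yProj[r] is the sum of raw grey-scale values across image row r.
  static double FindRowPeriod(double[] yProj)
  {
    double bestPeriod = 0, bestScore = double.MinValue;
    for (double period = 18.0; period <= 24.0; period += 0.04)   // 0.04 px steps
      for (double phase = 0; phase < period; phase += 0.25)      // grid offset
      {
        double sum = 0; int n = 0;
        for (double y = phase; y + 0.5 < yProj.Length; y += period, n++)
          sum += yProj[(int)Math.Round(y)];                      // raw grey, no threshold
        double score = sum / n;                                  // normalize by sample count
        if (score > bestScore) { bestScore = score; bestPeriod = period; }
      }
    return bestPeriod;
  }

The same search run on the X projection gives the column period.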

I don’t have any complex font preparation to do; I just type in pictures of the dot matrix for each character. You can see examples in a previous post. Note that dots are not pixels: with the images I’ve been using, dots are about six pixels in diameter.

One big difference for character recognition is that you’re working with a binary image using a hard threshold. I detest binary thresholds and never use them on images. Instead I use a soft threshold that produces a value between 0 and 1 representing the likelihood that a pixel is ink, and then I apply a dot detector that likewise produces a value between 0 and 1 representing the likelihood that a dot exists at a given position. All of my recognition work uses floating point likelihood values, never Boolean yes/no values.

We both do spiral scans around a target location, and both have special processing for spaces since they are the majority of grid positions.

You consider missing and extra pixels; I consider missing and extra dots. As you suggested, I weigh missing and extra differently: extra dots are weighted more heavily than missing ones, something you proposed but apparently didn’t implement.

You hint at adjusting pixel weights to handle confusing character pairs. I analyze the font automatically to find potential confusions and do special processing to recognize them.

My recognition speed is under one millisecond per character, but I’m working with dots, not pixels.

Everything I do with dots is much faster and easier than what you’re doing with pixels. The characters I’ve been working with have only 84 dots; your characters have what looks like a thousand pixels.

1 Like

That’s a pretty accurate summary. Two minor points: the deskewing is automatic (that program I pointed to is pretty good); the manual tweaking is only needed in a small number of cases where a tiny bit of skew remains, maybe 5 to 10 pixels across a 5000 pixel wide page. And I did implement various tweaks to adjust the weights for extra/missing pixels, but I haven’t been too fastidious about updating the doc on the web page to match! Anyway, it’s good to see that we’ve come to similar conclusions about quite a bit of the process; it tells me we’re both on the right track. I wish more of my group’s saved listings were dot matrix, as then we could take advantage of your much faster code. Unfortunately the majority of our listings are band printer and daisy wheel, so I’ll have to keep plodding along when I pick this project up again. But maybe at some time in the distant future we can merge our efforts and have a suite that automatically uses whichever approach fits best with the type of listing.

G

1 Like

Having a known golden transcription makes training the models much easier.

To make the OCR algorithms easier to describe and understand, I’m going to use the specific example of the Weather Radar 7x12 dot matrix font. Keep in mind that nearly all of the concrete numbers in the descriptions that follow are parameters that can be different for other dot matrix fonts.

The input to the OCR is a set of high-resolution grey-scale images, one per listing page. As previously described, a precise character grid is superimposed on each image, reducing the problem to recognizing a single dot matrix inside a small box. The goal is to produce, for each box, a best guess and a second-best guess, with confidence scores in the range [0 .. 1] for both. These scores determine when confidence is insufficient and human verification is required.
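
In sketch form, the per-box output might be represented like this (names illustrative):

  // Sketch of the per-box result: the two guesses and their
  // confidence scores in [0 .. 1].
  struct BoxResult
  {
    public char Best, Second;               // best and second-best guesses
    public double BestScore, SecondScore;   // confidence scores
  }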

To be useful the best guess should be correct nearly always, and where incorrect the confidence should nearly always call for human verification. An incorrect guess with no confidence warning (false positive) is bad and should be exceedingly rare. A correct guess but with insufficient confidence (false negative) causes unnecessary human verification—it is fine, even desirable, as long as there are not too many.

A font character includes a character code and, in our example, an 84-element vector of Booleans indicating whether a dot is expected for each of the 7x12 dot positions. All of the character recognition processing operates on 84-element vectors of various values. There is no 2D structure to the vectors—each dot position is treated independently. There are ways in which the 2D structure of a character could be used advantageously, but I have kept it simple and not done so. I’ll suggest one such use later.
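
In sketch form, such a font character might look like this (names illustrative, not actual code):

  // A font character: a character code plus an 84-element Boolean vector,
  // one element per 7x12 dot position, filled from a typed-in dot picture.
  class FontChar
  {
    public char Code;
    public bool[] Dots = new bool[84];              // row-major, 7 columns x 12 rows
    public List<Delta> Deltas = new List<Delta>();  // similar characters (see deltas below)

    public FontChar(char code, string[] picture)    // 12 strings of 7 chars, '*' = dot
    {
      Code = code;
      for (int row = 0; row < 12; row++)
        for (int col = 0; col < 7; col++)
          Dots[row * 7 + col] = picture[row][col] == '*';
    }
  }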

A font character also includes a list of deltas, one for each font character that is similar in appearance. Each delta includes a list of the dots that are different, and for each such dot which character expects it to be present. Deltas are produced automatically at font initiation time for every character pair that differs in <= 4 dots. I chose the parameter 4 by intuition and have never changed it, but I suspect a higher value might be better.
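
Delta generation at font initiation time might then look like this sketch (only the <= 4 dot rule is a real parameter; the names are illustrative):

  // A delta records, for a pair of similar characters, which dots differ.
  // For each differing dot, the owning character's Dots[] tells which of
  // the pair expects that dot to be present.
  class Delta { public FontChar Other; public List<int> DiffDots; }

  static void BuildDeltas(List<FontChar> font, int maxDiff = 4)
  {
    foreach (FontChar a in font)
      foreach (FontChar b in font)
      {
        if (a == b) continue;
        var diff = new List<int>();
        for (int i = 0; i < 84; i++)
          if (a.Dots[i] != b.Dots[i]) diff.Add(i);
        if (diff.Count <= maxDiff)                  // differs in <= 4 dots
          a.Deltas.Add(new Delta { Other = b, DiffDots = diff });
      }
  }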

Almost all of the characters in our example font lie in the upper nine rows (a 7x9 sub-grid). The bottom 3 rows are for lower-case letters and punctuation marks that extend below the baseline. The listings I have don’t use lower case; only comma and semicolon fall below the line. At font initiation time a dot use matrix is computed showing how many characters use each dot:

 20 25 33 36 31 27 16
 31 12  8 10  7  8 30
 29 10  7  9  7 11 20
 28  9  9 16 13  7 20
 25 20 22 33 22 17 17
 25  9 14 18 10  8 24
 22 13  6  9  9  8 19
 30  9  7 11  8  8 23
 17 24 33 37 26 23 14
  .  1  3  1  .  .  .
  .  .  2  .  .  .  .
  .  2  .  .  .  .  .

The dots with non-zero use counts are called active dots. Inactive dots are ignored. The use counts are not otherwise used at present, but there may be ways to use them productively.
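
Computing the use counts is straightforward; in sketch form (names illustrative):

  // Count how many font characters use each of the 84 dot positions;
  // the positions with non-zero counts are the active dots.
  int[] useCount = new int[84];
  foreach (FontChar c in font)
    for (int i = 0; i < 84; i++)
      if (c.Dots[i]) useCount[i]++;
  bool[] active = Array.ConvertAll(useCount, n => n > 0);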

OCR would be simple if every character were nearly perfectly printed, but many are far from that. For example:

[image: a badly degraded “7” that is barely readable]
[image: a badly degraded “S” that is barely readable]

To be continued…

1 Like

The confidence scores mentioned in the previous post represent the likelihood that a given character exists in a grid box. I use the term likelihood frequently. These are always floats in the range [0 .. 1]. They are not probabilities; they are a way of avoiding yes/no decisions, which throw away important information when computed values are near whatever hard threshold is used to decide yes/no. Using likelihood instead of yes/no is significantly less brittle and more robust. It satisfies the principle of graceful degradation: small changes in input should lead to small changes in output.

Character recognition includes a function called recOne that determines, for a given position in the image, the likelihood scores for every character in the font. Starting at the center of a grid box, recOne is called in a spiral scan pattern to determine the character with the highest score. The spiral continues out to a radius of 5 pixels in X and 8 pixels in Y. However, once a radius of 2 pixels in X and Y has been reached, the scan ends if the best score exceeds 0.6. All of these numbers are settable parameters that satisfy my non-fussy rule: choose by intuition, no fine-tuning needed. The spiral scan radii parameters are called capture radii.
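
In outline, the scan might be structured like this sketch (SpiralOffsets and the recOne signature shown here are illustrative, not the actual code):

  // Spiral outward from the box center, recording each font character's
  // best score; stop early once radius 2 is covered and a score is high.
  var charBest = new Dictionary<char, double>();
  double bestSoFar = 0;
  foreach (var (dx, dy) in SpiralOffsets(maxDx: 5, maxDy: 8))  // capture radii
  {
    foreach (var (ch, score) in recOne(image, x0 + dx, y0 + dy))
    {
      if (!charBest.TryGetValue(ch, out double s) || score > s)
        charBest[ch] = score;                                  // per-character best
      if (score > bestSoFar) bestSoFar = score;
    }
    // Once the radius-2 neighbourhood has been covered, stop if confident.
    if (Math.Max(Math.Abs(dx), Math.Abs(dy)) >= 2 && bestSoFar > 0.6)
      break;
  }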

During the spiral scan each character in the font records the highest score it found and where in the scan it was found. When the spiral scan ends, the character with the second-highest score is identified from the recorded bests. If the best overall score does not exceed 0.5, and does not exceed the second best by 0.2, it is judged to be of insufficient confidence and will be displayed for human judgement.
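
Taking that sentence literally, the confidence test might look like this sketch, where a guess goes to a human only when it fails both tests (charBest is the per-character record from the scan sketch above):

  // Find the best and second-best scores among the per-character records,
  // then flag for human judgement if neither confidence test passes.
  char bestChar = ' ', secondChar = ' ';
  double best = 0, second = 0;
  foreach (var kv in charBest)
    if (kv.Value > best) { second = best; secondChar = bestChar; best = kv.Value; bestChar = kv.Key; }
    else if (kv.Value > second) { second = kv.Value; secondChar = kv.Key; }
  bool needsHuman = best <= 0.5 && best - second <= 0.2;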

The 92-page PMIO listing has 67,177 non-space characters and 368,900 spaces. With the parameters as described, the non-space characters produced 121 false negatives and 12 true negatives, light work for human review. There was one false positive, found only by assembling the OCR output and checking the binary output against what’s in the listing. For the spaces there were 6 false negatives, 13 true negatives, and 1 false positive that was found the same way. Any false positives in comments would only be found by a tedious human review, but they don’t affect the integrity of what the assembler sees.

I think the 0.5 accept threshold was too low, and should be more like 0.7 to avoid the false positives. This would give too many false negatives as the algorithm currently exists. I’ve got a plan to fix that which I’ll describe later, but we already have an OCR that has produced a 100% binary match with the original listing so I’m in no hurry to continue development.

In recOne, a 7x12 array of dot detectors is applied to the grey scale image at the given position to estimate the likelihood that a dot is present in each of the active dot positions. The array looks like this:

[image: the 7x12 array of dot detectors]

Each individual dot detector examines an aperture of pixels roughly 4 pixels in diameter. Its position on the image is computed to floating point precision using the grid period determined by the Fourier analysis, and then rounded off to integer pixel coordinates. The exact shape and diameter of the dot detectors are settable non-fussy parameters.

The result of dot detection is an 84-element vector of floats, each indicating the likelihood that a dot is present. The values for inactive dots don’t matter. The dot likelihood vector will be used to get scores for each character in the font.

The dot detector estimates the likelihood that each pixel in its aperture is ink. The estimate is a soft threshold on grey level.

The likelihood of a dot is just the average ink likelihood over the aperture. The soft threshold is not very fussy; I did some rough adjustment based on looking at actual grey levels in various images, but it was not fine-tuned.

The relative pixel positions in the aperture of a dot detector are stored as X and Y offsets from an anchor point. For efficient pixel access, these relative positions are combined with the Y stride of the given image to produce relative 1D offsets into the raw 1D image array. This is done once per image and avoids all of the (X,Y) to array index calculations that would otherwise be done to access image pixels. So the dot detection loop in C# is trivial:

  float score = 0;                          // ink likelihood accumulator
  foreach (int offset in pixelOffsets_)
    score += image[p0 + offset] < inkThr_;

Here inkThr_ is the soft threshold, and “<” is an overloaded operator that produces the [0 .. 1] floating point likelihood value.
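
In sketch form, assuming a linear ramp for the soft threshold (the ramp endpoints and all names are illustrative), the pieces might look like this:

  // Soft threshold: maps a grey level to an ink likelihood in [0 .. 1],
  // ramping linearly between "certainly ink" and "certainly paper".
  struct SoftThreshold
  {
    public float Lo, Hi;   // grey <= Lo reads as ink (1), grey >= Hi as paper (0)

    public static float operator <(float grey, SoftThreshold t) =>
      grey <= t.Lo ? 1f : grey >= t.Hi ? 0f : (t.Hi - grey) / (t.Hi - t.Lo);

    public static float operator >(float grey, SoftThreshold t) =>
      1f - (grey < t);     // C# requires '<' and '>' to be declared as a pair
  }

  // Once per image: fold the Y stride into the aperture's relative (X, Y)
  // positions so the inner loop needs only one addition per pixel. The
  // anchor p0 for a detector is then just anchorX + anchorY * imageStride.
  pixelOffsets_ = new int[aperture_.Length];
  for (int i = 0; i < aperture_.Length; i++)
    pixelOffsets_[i] = aperture_[i].X + aperture_[i].Y * imageStride;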

Here is an example of dot detector output. For display purposes the dot likelihoods are printed in a [0 .. 100] range, with “.” as usual meaning 0.

   .  96 100 100 100  98   .
 100   1   .   .   2   3 100
 100   .   .   .   . 100 100
 100   .   .   . 100   . 100
 100   .   .  47   .   . 100
 100   .   .   .   .   .  15
  86   .   .  15   .   .  59
 100   .   .   6   .   .  76
   .  96   .   .   7  99   .
   .   .   .   .   .   .   .
   .   .   .   .   .   .   .
   .   .   .   .   .   .   .

Straying from Bill’s particular use case with dot matrix printed material, I wanted to point out another fairly common technology: line printers. They tend to produce listings where columns are slightly misaligned compared to each other. See the example below.

[image: sample line printer listing]

Just to expand on that: on line printers the alignment of the columns themselves is excellent, since it is physically determined by the construction of the machine. It is the vertical alignment of the characters relative to the baseline that varies, due to timing differences. It also varies between different characters in the same column, but tends to be consistent for a given character/column combination (see the "T" in "CATALOG" and then the "T" in "THE" under it, or the "T" in "GET" and in "TEMPO" under it, for two examples).

If anyone’s looking for a big OCR project, I scanned all 78,424 pages of N50CH printouts from the Ohio State University Radio Observatory. It’s an early SETI study. The best-known of these pages contains what pop culture calls the Wow! signal. Anyway, it’s a horrendous OCR challenge due to variations in paper type, ribbon consumption, overprinting of numerals with hyphens to represent negative numbers, and more. And I fear the files are not being well stewarded after all the work I did to produce them.

I did write some backpropagation code and scaffolding to get a start on this specific OCR problem, but there would be more to do.

Nothing from me here is an endorsement of the Wow! cult. My best hypothesis is it was intermodulation distortion between a human-flown aircraft and an earth-based source.

Marc

2 Likes

Welcome, @marc! Great work on the scanning. OCR technology will, hopefully, always be improving, so preservation must be the first concern. But it’s somewhat interesting that machine output, with its own idiosyncrasies, might be an ever smaller area of attention for OCR.

I’m mildly surprised we haven’t had any previous discussion of the different print mechanisms: chain, train, band, drum, and then ball, dot matrix, wheel… each of which has its own imperfections of alignment and impression. It seems likely that the best OCR solution might recognise and be specific to the kind of printer used.