Text analysis. Find unique values and uneven duplicates in sections

Does anybody know a good text editor or tool that can find unique words (octal words in my case), words that are repeated an odd number of times, or even logical patterns?

The best I’ve found is the online “Duplicate Word finder”. But it can’t find unique words, and it only finds duplicates that occur a multiple of 2 times.

Excel should work (at least for unique values).
Even better would be an “AI tool”.
I want to easily find or highlight unique values, or e.g. 9 sections that start with the same (unknown) value within one PDP-8 memory page, if possible ignoring the address column.

Even with all the tools and disassemblies, I still can’t separate PDP-8 code and data sections well enough. Obviously most of it is data.

Edit: I now see that the online tool also displays odd-numbered hits on the right (although with the 8x setting). I have to recheck that. It’s still not perfect and it doesn’t show unique values, but this way I would have found the table in the last 4 lines.

It may not be the best tool but for me awk would be something to try. You can use it like a C with string functions - that is, it doesn’t have to be used as a pattern-action language.
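For what it’s worth, the counting itself is only a few lines in any scripting language. Here is a minimal sketch in Python rather than awk. It assumes a plain-text dump where every line is an address followed by octal words; that layout and the names `word_counts` and `words_with_count` are made up for the example, not taken from any particular tool.

```python
from collections import Counter


def word_counts(dump_text):
    """Count every word on a page, ignoring the first (address) column."""
    counts = Counter()
    for line in dump_text.splitlines():
        fields = line.split()
        counts.update(fields[1:])  # fields[0] is assumed to be the address
    return counts


def words_with_count(dump_text, n):
    """Return the words that occur exactly n times, sorted as text."""
    counts = word_counts(dump_text)
    return sorted(word for word, hits in counts.items() if hits == n)
```

Calling `words_with_count(page, 1)` lists the unique words of a page, and `words_with_count(page, 9)` lists candidates for a separator word that starts 9 sections.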

Thanks, but that doesn’t seem easy enough for me.

The online tool is great and can be set to 9.
So by dragging and dropping one page after another, it’s very easy and fast to find sections of 9 (for the 9 keys I have).

Edit: I found 3 pages (one of them I had found myself). Very helpful.
The 2 new pages have shorter sections, and one obviously has another data table below it with many words with 7. I might now find the keyboard representations and their error codes.

(I wouldn’t have found the mnemonics this way).

I’m having a little trouble following what you’re saying, but if the underlying problem is disassembling PDP-8 code, there’s some disassembly code at https://history.dcs.ed.ac.uk/archive/pdp8/d8tape/dasm.c. A technique I’ve always found useful in disassembly is a tree-walk that starts at a known entry point and follows all branches. This usually finds all code and ignores data, unless your binary includes a lot of jumps with calculated target addresses (or jump tables where the limits of the table are not explicit). If you find yourself thinking of writing a tree-walking disassembler for the PDP-8, you might get some hints from this project, although personally I’ve only done a 6809 module so far; the overall design is intended to be generic. Then of course there’s that NSA disassembler, Ghidra, which probably does the same thing better.
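A rough sketch of the tree-walk idea for the PDP-8, in Python. This is not how d8tape or Ghidra do it, just the bare technique under some loud assumptions: `mem` is a dict from address to 12-bit word, indirect JMP/JMS targets are not followed, every IOT is treated as if it might skip, and a JMS is assumed to return to the word right after it (which is wrong for subroutines that take inline arguments). The names `effective_address` and `walk` are mine.

```python
def effective_address(pc, word):
    """Direct operand address of a memory-reference instruction."""
    offset = word & 0o177
    if word & 0o200:                      # page bit set: current page
        return (pc & 0o7600) | offset
    return offset                         # page bit clear: page zero


def walk(mem, entry):
    """Return the set of addresses reachable as code from entry."""
    code, todo = set(), [entry]
    while todo:
        pc = todo.pop()
        if pc in code or pc not in mem:
            continue
        code.add(pc)
        word = mem[pc]
        opcode = word >> 9
        indirect = bool(word & 0o400)
        if opcode == 5:                   # JMP: no fall-through
            if not indirect:
                todo.append(effective_address(pc, word))
            continue                      # indirect target is unknown here
        if opcode == 4 and not indirect:  # JMS: code starts after the return slot
            todo.append((effective_address(pc, word) + 1) & 0o7777)
        todo.append((pc + 1) & 0o7777)    # normal fall-through
        group2_skip = (opcode == 7 and (word & 0o401) == 0o400
                       and (word & 0o170) != 0)
        if opcode in (2, 6) or group2_skip:   # ISZ, IOT, or an OPR skip
            todo.append((pc + 2) & 0o7777)
    return code
```

Everything in the ROM that `walk` never reaches is then a data candidate. With calculated or indirect targets it will miss code, so the result is a lower bound on what is code, not a proof that the rest is data.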

I’ve tried many emulators, disassemblers and tools.

There’s no really confirmed entry point (also because of unknown or differing Intersil 6100 behaviour and unknown IOTs). I assume(d) that the first word is the reset vector. Many ROM contents are also overwritten very soon (especially on the zero page).
I currently assume that 80% or more is data. Maybe there’s even no PDP-8 code at all (for that data). At least, after a long time, I found the correct ROM combination and finally the text in an unusual encoding. I’m currently finding more and more tables and I hope to find some code that calls them. The device is one of a set of two and I don’t have the main part.

Please read my earlier Festo posts, mainly

I think I found some keyboard representations (maybe for error codes). They are obviously also stored in half words. Maybe they are pointers for the other PLC. The order is different again and they are split across 2 pages.

Here is a part of the new table; the separating word is 4023. Some entries have just one word (or rather 2 half words). In 2 related pages I have four of them. These are probably the 4 keys with special, fixed instructions (3x brackets + OR). I don’t know what these words stand for: 6241, 6246, 6252, 5563. I assume the 3 starting with 62 are the 3 brackets, which share one section on the panel. So 62xx would be the section/column and xx46 the key/row. Or maybe they encode which key is allowed after another key.

4023 NSB?
6075 7330 3077 3103

4023 ESB
2407 5215

4023 ODR
6246 (also at 3537 KLZ, ODR)

4023 del?
5563 2112
1104 7710 5230
1107 7640 5230 2113 5271
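Cutting a page into records at a separator word is easy to script. A minimal sketch in Python, assuming the page is already available as a list of integers; the function name `split_records` is made up, and words before the first separator are simply dropped:

```python
def split_records(words, separator):
    """Split a list of 12-bit words into records that each follow a separator."""
    records = []
    current = None                 # stays None until the first separator is seen
    for word in words:
        if word == separator:
            current = []
            records.append(current)
        elif current is not None:
            current.append(word)
    return records


# The first words of the listing above, with 4023 as the separator.
table = [0o4023, 0o6075, 0o7330, 0o3077, 0o3103,
         0o4023, 0o2407, 0o5215,
         0o4023, 0o6246,
         0o4023, 0o5563, 0o2112]
for record in split_records(table, 0o4023):
    print(" ".join(f"{word:04o}" for word in record))
```

The same call with 3137 or 4017 as `separator` would cut up the other tables, if those words really are separators there.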

Some progress: an important table, and I think I found out how a table is read.

I found a table with 7 sections (I have 7 operation mode keys). The table is directly after the mnemonics table.
I’m confident that this is indeed the table for these keys, each one coming after the separator word (here 3137, which might be the temp return address).

I noticed 7344, which is the same word as at 0. I first thought that this is an address, but here it’s probably a PDP-8 instruction, a common clear instruction. Earlier words (1166) look more like data, bit patterns or half-word values. Note 1105 down to 1101.

Sim and test both start with 7344. Control starts with 7346. These 3 need interaction with the PLC. Sim and control have the longest sections, delete the shortest. That all makes sense. Palbart identifies most 3137 as labels.
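As a cross-check, 7344 and 7346 can be decoded by hand as group 1 operate instructions: 7344 is CLA CLL CMA RAL and 7346 is CLA CLL CMA RTL, which leave 7776 and 7775 in the AC. A small sketch of that decoding in Python, assuming the usual PDP-8/E order of events (clear, complement, increment, rotate) and BSW for 7002; the function name `operate_group1` is mine:

```python
def operate_group1(word, ac=0, link=0):
    """Apply a group 1 operate instruction (70xx-73xx); return (link, ac)."""
    if word & 0o7400 != 0o7000:
        raise ValueError("not a group 1 operate instruction")
    if word & 0o200:                       # CLA
        ac = 0
    if word & 0o100:                       # CLL
        link = 0
    if word & 0o040:                       # CMA
        ac ^= 0o7777
    if word & 0o020:                       # CML
        link ^= 1
    combined = (link << 12) | ac           # 13-bit L:AC
    if word & 0o001:                       # IAC, a carry complements the link
        combined = (combined + 1) & 0o17777
    times = 2 if word & 0o002 else 1
    if word & 0o010:                       # RAR / RTR
        for _ in range(times):
            combined = (combined >> 1) | ((combined & 1) << 12)
    elif word & 0o004:                     # RAL / RTL
        for _ in range(times):
            combined = ((combined << 1) | (combined >> 12)) & 0o17777
    elif word & 0o002:                     # BSW: swap the two 6-bit halves of AC
        low, high = combined & 0o77, (combined >> 6) & 0o77
        combined = (combined & 0o10000) | (low << 6) | high
    return combined >> 12, combined & 0o7777
```

`operate_group1(0o7344)` returns link 1 and AC 7776, which is what the single-step trace further down shows after the instruction at 6246.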

previous page after mnemonics
7340
1102 3102
4023
4532 3142 0000 0000

1105 4172 --------(new page 6200)
1104

3137 ------------write, 1st op. mode key ?
1103 4172 1356
3136 1137 7006 7004

3137 ------------read, 2nd key
1137 7004 0035 4172
2136 5210 5020
1122 5226
1102

3137 ------------delete, 3rd key? (at 6226)
1101 5205 1104

3137 ------------copy, 4th key
5206
1166 5232
1167 5232
1165 5317
7330 1137 1137

3137 ------------sim, 5th key
7344 ------ x at 6246
3136 1537 7002 …

Some of it appears to be plausible PDP-8 code. But there are also JMP and JMS instructions, and some would enter other sections.
Starting with the sim key, it seems to check for 2 keys and even load another table.

[6246] IRQ,DLY,IE=0,1,0 L/AC:0/0000 MQ:0000 IR:7344 CLL CLA CMA RAL;Clear L, Set AC to 7777, rotate AC & L left
[6247] IRQ,DLY,IE=0,1,0 L/AC:1/7776 MQ:0000 IR:3136 DCA 0136   ;Deposit AC to memory then clear AC, ZP 0136
[6250] IRQ,DLY,IE=0,1,0 L/AC:1/0000 MQ:0000 IR:1537 TAD I 0137 ;Add operand to AC, Indexed ZP 0137
[6251] IRQ,DLY,IE=0,1,0 L/AC:1/5777 MQ:0000 IR:7002 BSW        ;Byte Swap AC
[6252] IRQ,DLY,IE=0,1,0 L/AC:1/7757 MQ:0000 IR:0357 AND @@57   ;AND operand with AC, Current page @@57
[6253] IRQ,DLY,IE=0,1,0 L/AC:1/0057 MQ:0000 IR:7450 SNA        ;Skip on AC <> 0
[6255] IRQ,DLY,IE=0,1,0 L/AC:1/0057 MQ:0000 IR:7110 CLL RAR    ;Clear L, Rotate AC & L right
[6256] IRQ,DLY,IE=0,1,0 L/AC:1/0027 MQ:0000 IR:7420 SNL        ;Skip on L <> 0
[6260] IRQ,DLY,IE=0,1,0 L/AC:1/0027 MQ:0000 IR:1360 TAD @@60   ;Add operand to AC, Current page @@60
[6261] IRQ,DLY,IE=0,1,0 L/AC:1/0067 MQ:0000 IR:4175 JMS 0175   ;Jump to subroutine ZP 0175

With a byte swap there are checks for 57 and 27, probably keys. I then changed the AC to 0 and continued without skipping:

[6254] IRQ,DLY,IE=0,1,0 L/AC:0/0000 MQ:0000 IR:5771 JMP I @@71 ;Jump Indexed Current page @@71
[3561] IRQ,DLY,IE=0,0,0 L/AC:0/0000 MQ:0000 IR:1547 TAD I 0147 ;Add operand to AC, Indexed ZP 0147 =1000 = funct. unit list 2
[3562] IRQ,DLY,IE=0,0,0 L/AC:0/4017 MQ:0000 IR:1602 TAD I @@02 ;Add operand to AC, Indexed Current page @@02
[3563] IRQ,DLY,IE=0,0,0 L/AC:0/6157 MQ:0000 IR:7071 CML        ;Complement L
[3564] IRQ,DLY,IE=0,0,0 L/AC:1/4710 MQ:0000 IR:0000 AND 0000   ;AND operand with AC, ZP 0000 
[3565] IRQ,DLY,IE=0,0,0 L/AC:1/4300 MQ:0000 IR:2400 ISZ I 0000 ;Increment operand and skip if zero, Indexed ZP 0000
[3566] IRQ,DLY,IE=0,0,0 L/AC:1/4300 MQ:0000 IR:1125 TAD 0125   ;Add operand to AC, ZP 0125
[3567] IRQ,DLY,IE=0,0,0 L/AC:0/3700 MQ:0000 IR:7450 SNA        ;Skip on AC <> 0
[3571] IRQ,DLY,IE=0,0,0 L/AC:0/3700 MQ:0000 IR:7002 BSW        ;Byte Swap AC
[3572] IRQ,DLY,IE=0,0,0 L/AC:0/0037 MQ:0000 IR:7110 CLL RAR    ;Clear L, Rotate AC & L right
[3573] IRQ,DLY,IE=0,0,0 L/AC:1/0017 MQ:0000 IR:1212 TAD @@12   ;Add operand to AC, Current page @@12
[3574] IRQ,DLY,IE=0,0,0 L/AC:1/0417 MQ:0000 IR:3060 DCA 0060   ;Deposit AC to memory then clear AC, ZP 0060
[3575] IRQ,DLY,IE=0,0,0 L/AC:1/0000 MQ:0000 IR:5020 JMP 0020   ;Jump ZP 0020
[0020] IRQ,DLY,IE=0,0,0 L/AC:1/0000 MQ:0000 IR:1126 TAD 0126   ;Add operand to AC, ZP 0126

The index at 147 is 1000, which is the start of the 2nd table with units, containing the separator word 4017. I haven’t found more values being loaded, though. I don’t know what 3700 could be. And maybe I’m completely wrong.

Edit: The copy key is used in 3 ways
F0 input from TTY or cassette
F1 output to TTY or cassette (incl. print disassembly with mnemonics)
F2 CRC check.
And there are 3 lines. Output (5317?) needs more operands: besides the baud rate also the output format (disassembly, 8 or 2 instructions per line, or paper tape).
1166 5232
1167 5232
1165 5317

Another mystery solved.
I have inspected every page and grouped the words by similar values. At the end of page 0 (which I checked very late), I found:

7775 7760 0014
0020 0000 0100
0756 2132 1000
7356 2135 2000
7356 2135 3000
0356 2132 3400
2356 2132 0031
0061 2140 0032
0062 2140 0033
0063 2140 0010
0020 2140 0000

I had noticed that table before. 7356 is (also) the fixed value for 3 underlines of one LCD position. I wondered what 1000 and especially 3400 are. Knowing that 0000 is used to fill up a page and that the next page starts with 1014, and in order to match 31 with 61, I did some regrouping, and I’m sure I have found out some of the meaning.

 0014 0020 0000  
 0100 0756 2132       SAZ unit pointer, key 3                            
 1000 7356 2135       EAS (I/O cards) key 1            
 2000 7356 2135       MER marker key 2           
 3000 0356 2132       ZAL counter key 5                                
 3400 2356 2132       VOW preselect key 6
 0031 0061 2140       ZSK  .1 sec key 7               
 0032 0062 2140       SEK         key 8                     
 0033 0063 2140       MIN         key 9                      
 0010 0020 2140 0000  PTE  key 4

The first line probably belongs to the previous data.
The first column, 0100-3400, is the rest of the “functional units”. The others, with 2 digits, I already had in table 1. The ones here are also the keys on the panel (in a different order). 31-33 are the timers: .1 sec, sec, min. 10 must be PTE, the section end. That results in 9 keys.

The 2nd column is BCD encoded and gives the length, or rather the end value, of these units. 0756 is 1 11101110 = 1-7-7- = 177. So the start is 100 and the end is 177. These are the 77 section pointers I have.
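To make that reading easier to check against the other rows, here is a small sketch in Python that prints each second-column word as 12 bits in groups of four. It only reformats the values from the table above; how the groups map to digits remains the interpretation in the text and is not something the code decides. The function name `nibbles` is made up.

```python
def nibbles(word):
    """Format a 12-bit word as three space-separated 4-bit groups."""
    bits = format(word & 0o7777, "012b")
    return " ".join(bits[i:i + 4] for i in range(0, 12, 4))


# Second-column values of the table, given in octal.
for value in (0o0756, 0o7356, 0o0356, 0o2356):
    print(f"{value:04o}  {nibbles(value)}")
```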

I’m not sure what the 3rd column is. The ones with 2140 don’t allow more operands. 21xx might be the numpad and xx40 the enter key. xx32 allows 2 numbers, xx35 allows 3 numbers. Maybe BCD.

It’s still very hard to detect PDP-8 code. I think it starts at 0 (not at a reset vector). I still have to find the keyboard values and simulate them in single-step mode.

I found very plausible code calling the recent table (units/keys of units) and another piece calling the start of the mnemonics table.

Later I also removed the data for the mnemonics (->NOP) to get far fewer wrong labels and pointers. I now have many lines of code and can thus exclude many words from being data.

Going back from there to see how that code is reached, I soon came to 0 with just one branch, or to 20/24, which is often a jump target (after the auto-index registers). Even sooner when starting at 200 (the usual start).

It’s still extremely hard to follow: an index of an index, and these are calculated and often changing.
And there’s still the unknown subroutine return behaviour, the overwritten ROM and the unknown keyboard mapping.

Maybe I can find some keyboard tables. d8tape currently shows just 20% data, but probably with some misinterpretations.

I have now found code that reads a table from ROM and copies it to another RAM location. But everything is indexed, and it’s still unknown where the start is and what value the AC has. By default it reads 12 (dec) words and later 61 words. The word count is indexed as well, like the start and target addresses.

I can’t identify what these words stand for. I also searched for other possible word counts for my other tables.
At least I now know how it should work: it uses the auto-index registers as counters and 2’s complement for the word count.
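To show what such a copy loop does, here is a minimal sketch in Python. It is not the ROM code, only the usual PDP-8 pattern as I understand it: two pointers that are incremented before each use (like auto-index locations 0010-0017) and a counter preloaded with the negative word count that is stepped ISZ-style until it wraps to 0. The names `neg12` and `copy_words` and the addresses in the usage line are made up.

```python
def neg12(n):
    """12-bit two's complement of n, e.g. the preload for a word counter."""
    return (-n) & 0o7777


def copy_words(mem, source, target, count):
    """Copy count words from source to target; return a new memory dict."""
    mem = dict(mem)
    src = (source - 1) & 0o7777        # auto-index pointers start one below
    dst = (target - 1) & 0o7777
    counter = neg12(count)
    while True:
        src = (src + 1) & 0o7777       # pre-increment, then fetch
        dst = (dst + 1) & 0o7777       # pre-increment, then store
        mem[dst] = mem.get(src, 0)
        counter = (counter + 1) & 0o7777   # ISZ: stop when it reaches 0
        if counter == 0:
            return mem


print(f"{neg12(12):04o} {neg12(61):04o}")  # counter preloads for 12 and 61 words
```

So a count of 12 (dec) would appear as 7764 and a count of 61 as 7703, which may be worth searching for next to the pointer values.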

As said, one of my tables has 2 bytes with different functions packed into 1 word. I think there should be an AND 0077 (0077 should be nearby) and a byte swap instruction. I have some of them but none with plausible code.
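What I would expect that code to compute, as a small Python sketch: AND 0077 keeps the low half word, and a byte swap followed by AND 0077 gives the high half. The function names `bsw` and `halves` are made up for the example.

```python
def bsw(word):
    """Swap the two 6-bit halves of a 12-bit word, like the BSW instruction."""
    return ((word & 0o77) << 6) | ((word >> 6) & 0o77)


def halves(word):
    """Return (high, low) half words of a 12-bit word."""
    low = word & 0o77              # AND 0077
    high = bsw(word) & 0o77        # BSW, then AND 0077
    return high, low


for word in (0o6241, 0o6246, 0o6252, 0o5563):
    high, low = halves(word)
    print(f"{word:04o} -> {high:02o} {low:02o}")
```

For the three 62xx words this separates the common 62 from 41, 46 and 52.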

I think there are lookup tables and calculated target addresses, or maybe some other strange encodings. Some word patterns rather look like single digits, which would match the machine’s dedicated code.
Keyboard values are still unknown. The same goes for custom IOTs and the printout logic (although I found plausible ASCII values for CR+LF).

Some interesting data (with which I found the code):

 1367                                                          
 1360 4175 1537 2136 5252 2137 5246                                                            
 1361 7104 7421 1115 3135 6002 1116 3134 2134 5277 7501 7110 7421 7010                          
 1362 6415 2135 5275 6001 5575                                                                  
 1363 4175 5572                                                                                 
 
 1104 7002 3137 7344 5207        (3137 read 12 values)                                        
 1104 7004 7006 3137 7346 5207                       x    x   x    x   read                         
 1104 7012 7010 0035 5345 1052 7006 7006 3052 1052 7004 0364 4172 5020                               
 
 1365 4175 1141 4172 5771                                                           
 1366 5245 7774 0077 0040 1600 2100 0060 0017 0105 4163 0040 4017 3561                                             
 1357 7744 7155 2333 0600 0000

7344 is also at 0 and, like 7346, probably a jump target.
The 1104 lines look like a printout mask with 70xx for addresses: 2x 7006 for 0000 and 6 words as placeholder for the CRC. But I could be wrong.