Thoughts on characters and strings

cjs · March 19, 2020, 2:06am

One issue I’ve thought about a bit but not really talked about here is the handling of characters and character strings. However, a question just posted to the Retrocomputing Stack Exchange, “Could we have avoided the whole UTF-16 fiasco?”, just crystallized one thing in my mind: I would completely drop the idea of eight- or even sixteen-bit characters, and make all characters 32 bits.

(32 bits is selected based on the assumption that “most modern computers” are using Unicode, ISO/IEC 10646, or something substantially similar, as they do in our current world. The actual width can be tweaked as necessary to accommodate other systems.)

This has an immediate implication for the word width of our simple/verifiable computing systems: it must be at least 32 bits. Any less leads to the same problems we were trying to get rid of by putting (almost) all numbers into single words, but probably even worse.

I’m still contemplating how good or bad an idea it is to store multiple characters in a single word (e.g., 2 characters per word in a 64-bit system). I still for some reason find attractive the idea of sub-word values (e.g., as used in the CDC 6600 instruction set, which fills a 60-bit word with a mix of 15- and 30-bit instructions), but I also have this feeling that that might lead in the end to problems bigger than the space savings justify. Is there any justification for this technique other than saving space?
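To make the trade-off concrete, here is a minimal sketch of what two 32-bit characters per 64-bit word might look like. The names `pack_pair`, `unpack_pair` and `char_at` are made up for illustration, and putting the first character in the high half of the word is an arbitrary assumption, not something any particular machine does.

```python
from typing import List, Tuple

CHAR_BITS = 32                      # assumed character width
CHAR_MASK = (1 << CHAR_BITS) - 1    # low 32 bits of a word


def pack_pair(first: str, second: str) -> int:
    """Pack two characters into one 64-bit word, first character in the high half."""
    return (ord(first) << CHAR_BITS) | ord(second)


def unpack_pair(word: int) -> Tuple[str, str]:
    """Split a 64-bit word back into its two characters."""
    return chr(word >> CHAR_BITS), chr(word & CHAR_MASK)


def char_at(words: List[int], index: int) -> str:
    """Fetch the character at a given index from a packed string."""
    word = words[index // 2]                    # which word holds the character
    shift = CHAR_BITS if index % 2 == 0 else 0  # high half for even indices
    return chr((word >> shift) & CHAR_MASK)
```

Every character access now needs a divide, a shift and a mask, where one character per word needs only an index. That is the kind of extra complexity that has to be weighed against the space saved.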

oldben · March 19, 2020, 2:31am

Reminds me of the line “The good thing about standards is that there are so many to choose from.” I see Unicode as “let’s add more features so your web page has more crap to display.” I think if one created a standard from the printing of books, we would have a cleaner standard.

cjs · March 19, 2020, 2:46am

If you’re talking about stuff like emoji, that’s a relatively recent move in the world of Unicode. Originally it was indeed based on printing of books (well, newspapers and magazines, actually):

Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988)…Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.

But for me, a broad character set with many symbols and graphics is important so that I can do as much as possible in text, rather than resorting to images. Given that for this system we’re looking for simplicity over all else, I would anticipate an expansion of symbols if anything. (E.g., I’d really like to be able to do basic electronics schematics as “ASCII graphics,” but there seems to be no good way to draw an op-amp. Just adding a set of schematic symbols for gates to the existing line-drawing characters in Unicode would go a long way to reducing the amount of graphics I generate.)

oldben · March 19, 2020, 2:56am

But the whole point of Unicode is wrong; a 1:1 mapping of letters to icons is the wrong way to go. If I want a bold gold Chinese ‘tree’ icon, do I need a whole new Chinese font? Text and layout NEED to be two different things, something the WEB never had.
(Grumble – the Kindle can’t even do 80 characters per line.)

cjs · March 19, 2020, 3:16am

Unicode fully agrees with the latter statement, which is why in principle it has been not a mapping to glyphs but a mapping to characters from the very start:

A clear and all-important distinction is made between characters, which are abstract text content-bearing entities, and glyphs, which are visible graphic forms. This model permits the resolution of many problems regarding variant forms, ligatures, and so on. (p. 3)

So no, you don’t need a “whole new Chinese font.” Unicode isn’t even concerned with which font you use.

oldben · March 19, 2020, 4:00am

The My Text editor has got me thinking that way.
Got a new printer and spent the whole day fighting Windows just to print a text file. Had to download
a print-to-printer program to get 66 lines per page.
For now I will stick to stuff before 1990 for computers
and 1980 for audio equipment. If I do a 32 bit cpu,
it might be like this.

     3332 2222 2222
     2109 8765 4321
    +----+----+----+----+----+----+----+----+
    |COOO:MAAA:0XXX|####|####:####:####|####| BYTE
    +----+----+----+----+----+----+----+----+
    |COOO:MAAA:1XXX|####|####:####:####|###H| WORD/HALF
    +----+----+----+----+----+----+----+----+

    OP        aa   ix
    0  sub/c  a    a
    1  add/c  b    b
    2  ld     c    c
    3  axw/h  d    d
    4  and    x    x
    5  or     y    y
    6  xor    s    s
    7  st     p    p

cjs · March 19, 2020, 4:49am

That brings in its own set of problems as soon as you need characters outside the ASCII character set. The hacks for various logic and mathematical characters from other character sets (e.g., logical not ¬ from EBCDIC) and region-specific characters (e.g., ñ and the Euro sign €) are already problematic, and as soon as you get into something like Japanese you end up with a lot of problems on your hands. That’s why I suggested just making all characters 32 bits and sticking with that. (Further replies about this should probably go on the Thoughts on characters and strings thread.)

BTW, your 32-bit CPU format is unreadable. It would be worth editing your post to put triple-backticks (```) before and after it, or indent all those lines by four spaces, to make the columns line up.

jhi · March 23, 2020, 6:15am

I see cjs already replied but I feel obliged to, too.

I am sorry but you seem to have gotten Unicode all wrong. It requires none of the things you say, and fixes the problems you claim.

Most importantly: Unicode is mostly about “data processing”, the text, the numbers behind the strings. The “layout” is discussed, too, it is unavoidable, especially in more complex writing systems, and yes, Unicode coordinates the “emojis”. But those are in the periphery. I suggest e.g.

And more relevantly to this forum: Symbols for Legacy Computing - Wikipedia (if you don’t see the characters, well, that’s the font issue… every platform must provide their own fonts). But the proposal shows them: https://www.unicode.org/L2/L2018/18235-terminals-prop.pdf

(EBCDIC was taken care of a long time ago, IBM being one of the founding members.)

NoLand · March 23, 2020, 3:57pm

I think Unicode is great for most purposes: if you have ever lived in a country with non-US-ASCII characters in the alphabet, it’s the best invention since sliced bread. No more choosing code pages and lossy conversion of text. (E.g., previously, for multilingual text, you had to pick a certain code page, but probably some punctuation, etc., commonly found in some language wasn’t in it. It might have umlauts and basic accents, but no French quotes, etc. – and this while just using European languages!)
Still, it’s compatible with ASCII and in most cases produces fairly compact data.

On the other hand, it’s somewhat non-deterministic. E.g., the only way to reverse a Unicode string, I can think of, is by crafting a special font with a dot-matrix of 0xFFF x 0xFFF, featuring non-overlapping, unique patterns for each code, print this, parse the output from right to left, and recompile it.

And it’s somewhat the PHP of type setting mechanisms. You can do awful things with it. E.g., some of the newer installations of MS Word have a fault, where it works around Unicode by using Unicode. Meaning, it transforms accent characters into atoms of a base character, a negative space and any modifiers, like accent, umlaut, etc, in overprint. However, this isn’t text anymore. It doesn’t parse, search doesn’t work anymore, locales and collations do not work anymore, as well as sorting. Moreover, the accents and umlauts will be off in print, and accent characters aren’t just combinations of these atoms, they are specially crafted glyphs. And as a bonus, in a browser and in many other display systems, these combined characters won’t show in the chosen font, but in a fall-back font. I’ve seen this happening both on Mac and Windows versions of Word. One of my clients is exporting PDFs that way (preferably press releases in French) and I had to write a program to convert this peculiar amalgam back to text so that you can actually copy and paste and process this as text. (Fine achievement for a word processor! This is probably some kind of fall-back to provide a “reasonable” representation in a basic character set, and Word fails to recognize that it is installed on a system with Unicode support. However, it provides a glimpse into a world without Unicode.)

oldben · March 23, 2020, 5:58pm

Well, ASCII was meant for the printed page. For anything fancy you backspaced and printed over that character.
It is too bad they could not do that with common terminals. As for code pages, I saw that as a marketing plot… need an accent mark… new keyboard and printer and terminal just for your country. And the same price as in the US. Just convert the $ to a pound sign or the Euro sign, and that is the price. Please wait 6 to 8 months for delivery from China.

cjs · March 23, 2020, 11:20pm

There seems to be a lot of confusion on this point, but coding systems along the lines of ASCII, EBCDIC and Unicode were specifically not designed to represent glyphs (i.e., here’s what the letter “a” should look like) but characters (i.e., the idea of the letter “a”.)

Well, no. Character encodings such as ASCII and EBCDIC (both the same general idea: assign code numbers to characters) were used not just with printers and printing terminals, but also for data storage in memory and I/O on punched cards, paper tape, magnetic tape, drums and disks. With cards and paper tape backspacing and overpunching would produce nonsense (or in one special case for paper tape, a “DEL” character whose meaning was “ignore this character”, used for corrections) and of course in memory, on magnetic tape and on drum or disk “going back” and writing another character would entirely replace the character previously there, as on a video terminal.

Err…no. Unicode is explicitly designed for easy processing in memory (or relatively easy, given the issues it addresses). So long as you have a list of the character classes so you can identify combining characters, reversing a string is not hard, though of course you should use a decent Unicode library that provides this if you want to do it right. Same goes for case conversion and the like.

If you’re doing something that you think is about Unicode and you ever find yourself concerned with how a character prints, you’ve gone off the rails somewhere. Unicode is about code points and characters, not about glyphs.

NoLand · March 23, 2020, 11:26pm

The problem here is that while you can handle Unicode in a safe way, it also allows you to do some rather insane things, or things that should be simple in insane ways. To the extent that “how do you reverse a Unicode string?” has become one of the most famous interview trick questions. (No, there’s no universal way guaranteed to give a canonical result. There isn’t even a definition for this.)

Regarding code points versus glyphs: I think, first, it’s about representing written language. Both are aspects of the same thing. If your code points break both the ability to represent proper human readable glyphs and the ability to process strings, you’re probably not on the correctest of all considerable paths. (Compare the MS Word anecdote.)

oldben · March 23, 2020, 11:52pm

APL comes to mind for use with over strike characters. IMP 77 used underlining to display keywords.
And shift in/out could give red or black text. ALGOL often had strange printed letters too. And let’s not forget the 1977 Playboy bunnies, umm, Snoopies that showed up as text art. In 1977 we had 16k x 1 DRAMs.
Today we have 8 meg x 16 DRAMs. A bigger character code makes sense today. I still think NAPLPS was a better encoding system for LATIN based languages.

cjs · March 24, 2020, 12:46am

I disagree: what Unicode has done is merely expose many areas where people without a fairly full understanding of something say “X should be simple” when it is not. Digging further will usually show that they cannot even say what “X” really is in general, just what they think the result of doing X should be in a narrow set of specific cases.

“[H]ow do you reverse a Unicode string?” has become one of the most famous interview trick questions. (No, there’s no universal way, guaranteed to give a canonical result. There isn’t even a definition for this.)

Well, if you can’t provide a clear definition of what “reversing a string” is, one that everybody can agree on (which nobody can), how can you write a program to implement that? And how can you say “X should be simple” when you can’t even say what X is? This is a classic example of a terrible specification: you ask someone, “Do something like this,” which could be any of a dozen different things, and then complain when you get results you don’t like from your poorly defined problem.

A clearer example of this, if you come across someone who can’t understand that “reversing a string” can have many different meanings, is to ask what it means to “reverse a number.” Take the number 4. Is the reversal “4”, since that representation is a single “point”? Perhaps “6” is the reversal, as reading “IV” right to left gives “VI”? Or perhaps “1”, since binary “100” read right to left is “001”?

It’s not as if you can reasonably argue that those different representations of the number 4 have different meanings, when talking about numbers in any normal sense of the word, unless you’re going to start to argue that, when you see Alice, Bob and Charles saying the following:

Alice: I have 4 apples.
Bob: I have Ⅳ apples.
Charles: I have 四 apples.

you claim that they have different numbers of apples.

All that said, there are two fairly simple definitions of “reverse” in Unicode that serve the most common purposes served by “reversing” an ASCII string.

The first starts with disallowing combining characters. (Or even disallowing all non-ASCII characters.) Then simply make a list of each code point in that string, reverse the list, and generate a new string from that list of code points.

The second, if you’re going to allow all of the Unicode code points, is as follows. 1. Ensure that every group of combining characters in your string has a base character by inserting a space in front of every group of isolated combining characters. (This is not necessary if your output is to be a list of character groups rather than a string.) 2. Group sets of characters as each base character followed by all its combining characters and make a list of these groups. 3. Reverse the list. 4. Generate the output string as a sequence of all the characters in this list.
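Both definitions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, a character counts as “combining” when `unicodedata.combining()` reports a nonzero canonical combining class, and a group of combining characters counts as “isolated” only when it sits at the very start of the string. A decent Unicode library would use full grapheme-cluster rules instead.

```python
import unicodedata


def is_combining(ch):
    # Simplifying assumption: a nonzero canonical combining class marks a
    # combining character. Full grapheme-cluster rules cover more cases.
    return unicodedata.combining(ch) != 0


def reverse_code_points(s):
    """First definition: disallow combining characters, reverse the code points."""
    if any(is_combining(ch) for ch in s):
        raise ValueError("combining characters are not allowed")
    return "".join(reversed(s))


def reverse_groups(s):
    """Second definition: reverse base characters together with their marks."""
    groups = []
    for ch in s:
        if not is_combining(ch):
            groups.append(ch)        # step 2: a base character starts a new group
        elif groups:
            groups[-1] += ch         # step 2: attach the mark to its base character
        else:
            groups.append(" " + ch)  # step 1: give an isolated mark a space as base
    groups.reverse()                 # step 3: reverse the list of groups
    return "".join(groups)           # step 4: flatten the groups back into a string
```

For example, `reverse_groups("e\u0301a")` keeps the acute accent attached to the “e” and returns `"ae\u0301"`, where a plain code-point reversal would move the accent onto the “a”.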

Note that this reliably reproduces the same problems with reversal as produced when “reversing” ASCII strings. E.g., “ABC” produces “CBA”, now “shifting out” B rather than “shifting in” B. (Hmm! It seems that “reverse” a string can have the same problems with definition in ASCII as it does in Unicode!)

Sure, the second form is more complex than its ASCII equivalent, but that’s just the nature of a system that allows you to do perfectly sensible things, such as write German¹, that the designers of ASCII never imagined.

¹ Yes, I’m aware that some people would claim that writing German is never a sensible thing to do. :-)

cjs · March 24, 2020, 1:04am

APL does not need to use “over strike characters” and does not in Unicode and several other representations, particularly those commonly used on video terminals.

Do not confuse “my input method has me enter character X as character Y followed by an overstrike command followed by character Z” with “that’s how it’s internally represented.” Both lexically in APL itself and in Unicode (and most other encodings that can represent APL), quote quad is a single character, and not a separate quote and quad, though a combining form may be an alternate representation of that single character.

I still think NAPLPS was a better encoding system for LATIN based languages.

NAPLPS is a graphics language, not a character set representation.

NoLand · March 24, 2020, 1:56am

On the topic of reversal: mind that this is both a rather academic question and, at the same time, historically a real thing, hinting at the core of many writing systems. For example, Latin and Greek script didn’t have a preferred writing direction originally (and for quite a long time, right into the Middle Ages). Writing directions were chosen for aesthetic reasons and context and would often change from line to line (which is where verse comes from). The same is true for many other scripts. So direction and position are inherently relative.

However, my criticism would be about something completely different, namely, what’s actually in a given character class. E.g., the block elements enumerate eighths of a full block (which should be a square, but most likely isn’t in practice) in a manner increasing upwards and decreasing from right to left. However, there is no concept of a reverse image, so half of the class is missing and the entire class is therefore pretty useless. There are, however, two complementary half block elements as an exception to the rule and a bit of a teaser. (Moreover, we may ask again whether these elements should better be defined relative to the writing direction or as absolute directional entities, e.g., in order to use them for dynamic display elements, like signal bars. Which are, of course, again subject to culture and convention. What does a reversal mean here?)
I’m not sure why you would come up with such a merely halfway definition. (Especially given that there were many mature definitions for block elements before, as in Prestel/Minitel/Teletext. And, as a result, you can’t represent texts in any of those formats.) It’s more like someone looked at a Commodore keyboard and decided to implement a few of those keys, regardless of the underlying functionality (namely, complementary glyphs by reverse video). And there are a number of other classes which provide a start, but not a suitable implementation.

cjs · March 24, 2020, 4:40am

The class may be “useless” for one particular application of many that you’re thinking of, but for other applications, it’s just the thing. I’ve used those characters for bar-graph displays and they did a fine job. (The complementary half-block was useful to avoid having to use a full-character space at the “start end” of a bar.)
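As a rough illustration of that bar-graph use (a sketch, not the code I actually used), the left-aligned eighth blocks give eight steps of resolution per character cell. The function name `bar` and its parameters are made up for this example, and the characters are looked up by their Unicode names so that no code point has to be typed by hand.

```python
import unicodedata

FULL = unicodedata.lookup("FULL BLOCK")

# PARTIAL[n] fills n eighths of a cell, starting from the left edge.
PARTIAL = [""] + [
    unicodedata.lookup(name)
    for name in (
        "LEFT ONE EIGHTH BLOCK",
        "LEFT ONE QUARTER BLOCK",
        "LEFT THREE EIGHTHS BLOCK",
        "LEFT HALF BLOCK",
        "LEFT FIVE EIGHTHS BLOCK",
        "LEFT THREE QUARTERS BLOCK",
        "LEFT SEVEN EIGHTHS BLOCK",
    )
]


def bar(value, maximum, width):
    """Render value/maximum as a bar that is exactly `width` characters wide."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    eighths = round(width * 8 * value / maximum)
    eighths = max(0, min(width * 8, eighths))  # clamp to the available cells
    full, rest = divmod(eighths, 8)
    return (FULL * full + PARTIAL[rest]).ljust(width)
```

A bar that grows from the right edge instead would need the right-hand complements of these characters, of which Unicode provides only the right half block, which is where the missing “inverse” set starts to matter.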

If you want sub-character block graphics, that is in Unicode, and in fact in the very Unicode block you’re speaking of. The quadrant characters, together with the half blocks, the full block and the space, support all possible 2×2 graphic layouts within a character, without inverse.
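Here is a small sketch of how the sixteen on/off patterns of a cell’s four quadrants map onto existing characters. The bit order (upper left, upper right, lower left, lower right, most significant first) and the name `quadrant_char` are my own choices for illustration; the character names are the standard Unicode ones.

```python
import unicodedata

# Index bits: 8 = upper left, 4 = upper right, 2 = lower left, 1 = lower right.
_NAMES = [
    "SPACE",
    "QUADRANT LOWER RIGHT",
    "QUADRANT LOWER LEFT",
    "LOWER HALF BLOCK",
    "QUADRANT UPPER RIGHT",
    "RIGHT HALF BLOCK",
    "QUADRANT UPPER RIGHT AND LOWER LEFT",
    "QUADRANT UPPER RIGHT AND LOWER LEFT AND LOWER RIGHT",
    "QUADRANT UPPER LEFT",
    "QUADRANT UPPER LEFT AND LOWER RIGHT",
    "LEFT HALF BLOCK",
    "QUADRANT UPPER LEFT AND LOWER LEFT AND LOWER RIGHT",
    "UPPER HALF BLOCK",
    "QUADRANT UPPER LEFT AND UPPER RIGHT AND LOWER RIGHT",
    "QUADRANT UPPER LEFT AND UPPER RIGHT AND LOWER LEFT",
    "FULL BLOCK",
]
QUADRANTS = [unicodedata.lookup(name) for name in _NAMES]


def quadrant_char(upper_left, upper_right, lower_left, lower_right):
    """Return the character whose filled quadrants match the four flags."""
    index = (
        (8 if upper_left else 0)
        | (4 if upper_right else 0)
        | (2 if lower_left else 0)
        | (1 if lower_right else 0)
    )
    return QUADRANTS[index]
```

Because every pattern and its complement are both in the table, no reverse-video attribute is needed to draw with these.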

But in the end, Unicode is not about glyphs. That’s what everybody seems to keep missing here. It provides a few friendly hacks towards semi-graphical displays that are basically a minor improvement on “ASCII art,” which I think is reasonable to do, but demanding it go beyond that is just a bad idea, at least according to me and all the people who designed Unicode. If you want NAPLPS, go ahead and use that, rather than asking Unicode to solve all the problems in the world.

That’s a good point about writing directions, but just makes it even more clear that “reversing a string” is not a thing, but many different things. I had considered another answer to the “how do you reverse a string in Unicode” question, which was “insert a right-to-left mark at the front of the string.” But of course, that changes only the display of the string (and even then only if it wasn’t already right-to-left and the display subsystem is interpreting that code and there is a display of it at all), while leaving the logical order in memory as “start to end,” which Unicode strings always are, regardless of writing direction.

And keep in mind that writing directions are another one of those compromises that wouldn’t be in “hardcore” Unicode, since that would suggest you use separate markup for language and other things of that nature to give the display system the information it needs to do the rendering. And in fact, Unicode is that hardcore when it comes to horizontal vs. vertical text direction: there is no “top-to-bottom mark” to indicate that change of direction.

oldben · March 24, 2020, 5:25pm

Read the standard, how it encodes text and graphics.
NAPLPS was from the same era as Videotex so low res TV graphics were normal.
As for APL, I think of the IBM 1130 version, I/O from the console typewriter.
Then you had the TV pretending to be a computer,
now you have the computer pretending to be a computer and not a TV. Computer displays are all landscape mode, not portrait mode like Xerox Alto.

EdS · March 24, 2020, 7:20pm

Going back to the head post:

I notice in that stackexchange post a link to an interesting document from 1998 - a ten year retrospective of Unicode. And in there I see the confident expectation that 16 bits would be enough for “the world’s living languages”. That’s an interesting piece of history, I think. Extra interesting perhaps because UTF-8 dates from 92/93.

cjs · March 25, 2020, 12:16am

As I note in my RC.SE answer, that confident expectation wasn’t shared by everyone. ISO 10646 started design in 1989 and the designers at that point felt that 16 bits would be nowhere near enough. (They provided for 31 bits.) The debate between the two started in 1990, when the Unicode and ISO 10646 proposals were both on the table. Unicode’s 16-bit proposal won that round, but that didn’t last; within a few years Unicode 2.0 expanded the number of character codes to just over 20 bits.
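The “just over 20 bits” figure falls out of the surrogate mechanism: two 10-bit halves address 16 supplementary planes on top of the original 16-bit plane. A minimal sketch of that arithmetic follows; the names `CODE_SPACE` and `to_surrogates` are mine, chosen for illustration.

```python
import math

# The original 16-bit plane plus 16 supplementary planes of 65,536 code points.
CODE_SPACE = 17 * 0x10000


def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


# Bits needed to number every code point in the expanded code space.
BITS_NEEDED = math.log2(CODE_SPACE)
```

`BITS_NEEDED` comes out slightly above 20, so a 21-bit field is the smallest that holds any code point.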

Extra interesting perhaps because UTF-8 dates from 92/93.

UTF-8 isn’t actually relevant to 16 vs. 32 bits of character code. It was based on UTF-1, an earlier encoding from the ISO 10646 draft, and like UTF-1 it could from the start encode 31 bits of code points, far more than Unicode has ever offered.
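For the curious, here is a sketch of that original 31-bit scheme, with sequences of up to six bytes. The function name `encode_utf8_31` is made up; this follows the pre-2003 definition of UTF-8, whereas the current definition (RFC 3629) stops at four bytes and U+10FFFF. It does no checking for surrogates or other code points a modern encoder would reject.

```python
def encode_utf8_31(code):
    """Encode an integer of up to 31 bits using the original UTF-8 scheme."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code must fit in 31 bits")
    if code < 0x80:
        return bytes([code])  # ASCII range: a single byte, unchanged

    # (sequence length, exclusive upper limit, lead-byte prefix)
    for length, limit, lead in (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    ):
        if code < limit:
            break

    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))  # continuation byte: 10xxxxxx
        code >>= 6
    # The remaining high bits go into the lead byte; the tail was built
    # from the low end, so it is reversed before use.
    return bytes([lead | code] + tail[::-1])
```

For code points up to U+10FFFF the output matches what a modern encoder produces; above that, the five- and six-byte forms take over.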