Thoughts on characters and strings

Sorry, but this is nonsense. There is only one definition that makes sense. (Reversing the extended grapheme clusters, in case you care.) In case there are some overly clever interview hazing inquisitors, they can all go sit in the corner and be ashamed.

And I sense another disturbance in the Force here: “Unicode string” is meaninglessly vague and under-defined. What is missing is the encoding.


Well, the question said “Unicode” rather than “UTF-8” or any other encoding name, so I think it’s reasonable for this question to assume you are using a language/library that represents Unicode strings as a sequence of code points, without you having to worry about the underlying representation. (Or just assume you have or can convert to/from UTF-32, a fixed-width encoding of code points.)
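For what it’s worth, here is a quick Java sketch of my own (not part of the original question) showing the difference between the underlying UTF-16 representation and the code-point view the question presumably intends:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1F600 GRINNING FACE lies outside the Basic Multilingual Plane,
        // so Java's internal UTF-16 stores it as a surrogate pair.
        String s = "A\uD83D\uDE00B";

        System.out.println(s.length());                        // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));   // 3 code points

        // Iterating by code points (rather than by char) never splits the pair:
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
        // prints 41, 1f600, 42
    }
}
```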

(Reversing the extended grapheme clusters, in case you care.)

That sounds like what I was proposing, and “grapheme cluster” was the term I needed to search on to find the details of what I had in mind. Thanks for that.

There is only one definition [of “reversing a Unicode string”] that makes sense.

I wouldn’t say that this is the only definition that makes sense, though I think it’s the most reasonable default choice if you have no further information beyond “reverse a string.”

I still see the real problem as one of under-specification. It’s just not as obvious that it’s under-specified as if someone asked you to “rotate a string,” where most good developers would immediately ask, “what the heck does that mean?”
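To make the two readings concrete, here is a rough Java sketch of my own, with the caveat that java.text.BreakIterator’s “character” boundaries only approximate extended grapheme clusters; ICU4J’s BreakIterator tracks UAX #29 more closely.

```java
import java.text.BreakIterator;

public class ReverseDemo {
    // Reverse a string by (approximate) grapheme clusters: walk the cluster
    // boundaries backwards and copy each cluster out in turn.
    static String reverseGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        StringBuilder out = new StringBuilder(s.length());
        int end = it.last();
        for (int start = it.previous(); start != BreakIterator.DONE;
             end = start, start = it.previous()) {
            out.append(s, start, end);   // append the cluster [start, end)
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "noe\u0308l";  // "noël" written with a combining diaeresis

        // StringBuilder.reverse() works on UTF-16 code units (it keeps surrogate
        // pairs together, but not base characters and their combining marks):
        System.out.println(new StringBuilder(s).reverse());  // "l̈eon": the dots land on the l
        System.out.println(reverseGraphemes(s));             // "lëon": clusters stay intact
    }
}
```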

In case there are some overly clever interview hazing inquisitors, they can all go sit in the corner and be ashamed.

Indeed!

I’m with you - it reminds me of a comment I made on 6502.org a couple of years ago: I wish every “byte” machine used 32 bits as a unit instead of 8 bits. Everything 32 bits, including characters. Wishful thinking, but it would have made everything simpler, except performance/storage for earlier systems. My office desktop PC has a million times more RAM than the first IBM PC… it should be good for more than being able to keep a few Chrome tabs open, I should hope.

As for Unicode - I used to be a sceptic back in the previous century, but now I couldn’t be without it. It does create some problems though - one is that different code points can render as characters that look identical. This is used to spoof e.g. web links to fool people into clicking on them. Other problems, or at least controversies, are caused by Unicode policy decisions (e.g. Han unification; also see “I can’t write my name in Unicode”). And, as a programmer said on HN, “Han unification makes things really hard for programmers. You end up with code that tries to guess what language a string is in to pick out which character set should be used!”
The various European languages aren’t much affected by that though. Unfortunately not everything supports Unicode, which is more and more of an issue. It’s bad not being able to write a name.
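To make the spoofing point concrete, here is a small Java sketch of my own: the two strings below render identically in most fonts yet compare unequal, and even a crude mixed-script check flags the fake. (A real check would use the confusables data from Unicode TR #39, e.g. via ICU’s SpoofChecker.)

```java
public class SpoofDemo {
    public static void main(String[] args) {
        String latin = "paypal";
        String spoof = "p\u0430ypal";   // U+0430 CYRILLIC SMALL LETTER A looks like Latin 'a'

        System.out.println(latin.equals(spoof));   // false, despite identical rendering

        // Very rough heuristic: list the scripts used in the suspicious string.
        spoof.codePoints()
             .mapToObj(Character.UnicodeScript::of)
             .distinct()
             .forEach(System.out::println);        // prints LATIN and CYRILLIC
    }
}
```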

I can’t speak to Mukerjee’s complaint about Unicode’s handling of Bengali; it sounds bad. But having spent some time in China as a child, and come to Japan almost twenty years ago to do internationalization of computer software, I do know something about Han unification.

This type of complaint is common, but is caused by confusion about what Unicode does and does not encode by design: in this case, encoding characters rather than glyphs.

Western languages are indeed affected by the same problem; the issue with certain Cyrillic forms mentioned on Wikipedia is almost exactly parallel.

But this affects even western languages written entirely in the Roman alphabet: consider the sentence, “I admire her sophistication and savoir faire.” This cannot be rendered with Unicode alone; you must know that this is an English sentence with a French phrase embedded in it, and use non-Unicode markup, in order to render savoir faire properly in italics; Unicode, for good reasons, has “unified” the roman and italic versions of characters.

This is not the place to explain and argue the design decisions behind Unicode, but I hope that this at least gives anybody interested a flavour of what people claim are problems with Unicode. It’s interesting work to more deeply investigate and understand the above example.

I say it is all IBM’s and Microsoft’s fault. ASCII was defined in 1963, and they fucked it up after that. Where is my right arrow and up arrow? Europe had no right to ask for accent marks and other letters. Why? Because ASCII, the American Standard Code for Information Interchange, is a 7-bit code. In the 1970s they could have gone to an 8-bit code with Greek, accented characters, and computer symbols: an ISCII, International Standard Code for Information Interchange. What happened instead is that IBM and Microsoft both tweaked ASCII for different users: French keys are different from English keys, and there were different printers and other American stuff. Unicode is a mess because of the attitude that “we can include everything.”
Ben.

But everybody left MS behind a long time ago, when indeed 8-bit took over. It’s understandable that they allocated one bit for parity for ASCII at the time - unreliable transmission and all - but when that wasn’t needed anymore all eight bits were used, and we got the ISO character sets instead of plain ASCII: ISO-8859-1 and all that. But that’s not good enough, not even for Europe - you can’t write an article in one language and include certain words from certain other languages, because whichever ISO variant you choose won’t cover both. Unicode solved that, and despite the Han unification disagreements, which are quite academic for most of us, it solves all of those problems. I can do my multi-language writeups, I can include my wife’s name while writing e.g. English - sometimes that’s a good thing…
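To illustrate (a quick Java sketch of my own, with made-up sample text): a single ISO 8859 variant silently drops whatever it can’t represent, while UTF-8 round-trips the mixed-language text intact.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // Norwegian plus a Greek word ("kaliméra"); no single ISO 8859 part covers both.
        String mixed = "blåbærsyltetøy og \u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1";

        // getBytes(Charset) substitutes '?' for anything the charset cannot encode.
        String viaLatin1 = new String(mixed.getBytes(StandardCharsets.ISO_8859_1),
                                      StandardCharsets.ISO_8859_1);
        String viaUtf8   = new String(mixed.getBytes(StandardCharsets.UTF_8),
                                      StandardCharsets.UTF_8);

        System.out.println(viaLatin1);  // Greek letters are gone, replaced by '?'
        System.out.println(viaUtf8);    // identical to the original
    }
}
```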
Now, the OP started this thread with the suggestion that characters should simply have been 32 bits from the start, and then none of the trouble you mention about IBM and Microsoft and right/up arrows would have happened. Why not? Because a 32-bit Unicode would have included all of those. No need to switch to a different mapping for French, and no need for me to choose between writing [|] or ÆØÅ (there was no way to include both in the same paper back then).
This would have worked fine back then too, if one could have ignored the size issue: 4 times as much memory, 4 times as much bandwidth (and waaaay more storage for character sets) - all real issues back in '63, and for quite a while after.

(We should thank the Japanese though - early in the microcomputer revolution they saw that they would need more memory and more ROM space in order to write (even simplified, kana-based) Japanese on the computers, so they focused on improving that situation. Which helped everybody in the end.)

I have a term for this. It’s a credo.

“Code has momentum.”

However “bad” the decision to extend the ASCII set may or may not have been, it was far better than having to toss out or rewrite years and years of accumulated software.

One may argue that a singular benefit of UTF-8 is that, fresh out of the box, plain ASCII is already valid UTF-8, so it’s implicitly backward compatible in the base sense and works with legacy software and tools.

Java is much easier to work with as a language environment because Unicode support was built in from the get-go, in contrast to languages like C and C++, where working with wider characters is much, much more complicated.

Yet I interoperate with ASCII trivially, day in and day out, in Java because of that built-in Unicode support. (Java uses UTF-16 internally, but readily encodes out to UTF-8.)
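A minimal sketch of that compatibility, using nothing beyond the standard library:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatDemo {
    public static void main(String[] args) {
        String ascii = "plain old ASCII text";

        // For pure ASCII, the UTF-8 bytes are identical to the ASCII bytes,
        // which is why legacy ASCII tools keep working on such files.
        byte[] asAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8  = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asAscii, asUtf8));   // true

        // Non-ASCII characters become multi-byte sequences in UTF-8,
        // even though the String itself is UTF-16 internally.
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2
    }
}
```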

Yup, although the ASCII compatibility you’re talking about is more a UTF-8 encoding thing than a Unicode character set thing. But one of the design goals of Unicode itself was to have full round-trip compatibility between Unicode and the various national character sets, which is why you will find some code points violate the unification rules: even if unification would put two characters that are technically the same at a single code point, they cannot be unified and two code points must be used if a national character set distinguishes them.
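One concrete example of such a deliberate duplicate (my choice of illustration, not one mentioned above): U+212B ANGSTROM SIGN sits alongside U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, reportedly because legacy East Asian character sets encoded the angstrom sign separately; normalization folds the duplicate away.

```java
import java.text.Normalizer;

public class RoundTripDemo {
    public static void main(String[] args) {
        String angstromSign = "\u212B";  // ANGSTROM SIGN, kept for round-trip compatibility
        String aWithRing    = "\u00C5";  // LATIN CAPITAL LETTER A WITH RING ABOVE

        System.out.println(angstromSign.equals(aWithRing));   // false: distinct code points

        // NFC normalization maps U+212B to U+00C5.
        String normalized = Normalizer.normalize(angstromSign, Normalizer.Form.NFC);
        System.out.println(normalized.equals(aWithRing));     // true
    }
}
```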

To bring things back more towards retrocomputing, there’s also been some work on adding legacy microcomputer character sets (PETSCII, ATASCII, etc.) to the list of fully supported character sets in Unicode: the Symbols for Legacy Computing block. (Not a single character in it displays in my current web browser font, but this excerpt from the Unicode 13.0 standard shows suggested glyphs for that block.)
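For anyone who wants to poke at that block programmatically, a tiny Java sketch of my own: the block starts at U+1FB00, above the BMP, so each character needs a surrogate pair in Java’s UTF-16 strings, and the UnicodeBlock constant only exists on JDKs new enough to know Unicode 13.0.

```java
public class LegacyBlockDemo {
    public static void main(String[] args) {
        int cp = 0x1FB00;  // first code point of Symbols for Legacy Computing
        String s = new String(Character.toChars(cp));

        System.out.println(s + " length=" + s.length());   // length=2: a surrogate pair
        System.out.println(Character.UnicodeBlock.of(cp)); // SYMBOLS_FOR_LEGACY_COMPUTING on Java 15+,
                                                           // null on older JDKs
    }
}
```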

Wikipedia still claims on its PETSCII page that “Not all of the characters encoded by PETSCII…have a corresponding Unicode representation.” I’m not sure if that’s still true now that we have the Legacy Computing block; the designer of The Ultimate Commodore Font seems to say that they are now all there.