I disagree: what Unicode has done is merely expose many areas where people without a fairly full understanding of something say “X should be simple” when it is not. Digging further will usually show that they cannot even say what “X” really is, in general, just what they think the result of doing X should be in a narrow set of specific cases.
“[H]ow do you reverse a Unicode string?” has become one of the most famous interview trick questions. (No, there’s no universal way, guaranteed to give a canonical result. There isn’t even a definition for this.)
Well, if you can’t provide a clear definition of what “reversing a string” is, one that everybody can agree on (which nobody can), how can you write a program to implement that? And how can you say “X should be simple” when you can’t even say what X is? This is a classic example of a terrible specification: you ask someone, “Do something like this,” which could be any of a dozen different things, and then complain when you get results you don’t like from your poorly defined problem.
A clearer example, if you come across someone who can’t understand that “reversing a string” can have many different meanings, is to ask them what it means to “reverse a number.” Take the number 4. Is the reversal “4”, since that representation is a single “point”? Perhaps “6” is the reversal, as reading “IV” right to left gives “VI”? Or perhaps “1”, since “100” (4 in binary) read right to left is “001”?
It’s not as if you can reasonably argue that those different representations of the number 4 have different meanings, in any normal sense of the word “number,” unless you’re also going to argue, on hearing Alice, Bob and Charles say the following:
- Alice: I have 4 apples.
- Bob: I have Ⅳ apples.
- Charles: I have 四 apples.
that they have different numbers of apples.
All that said, there are two fairly simple definitions of “reverse” in Unicode that serve the most common purposes served by “reversing” an ASCII string.
The first starts with disallowing combining characters. (Or even disallowing all non-ASCII characters.) Then simply make a list of each code point in that string, reverse the list, and generate a new string from that list of code points.
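A minimal Python sketch of this first definition. The function name `reverse_code_points` is made up for illustration, and I'm assuming “combining character” means any code point whose general category is a mark (Mn, Mc or Me):

```python
import unicodedata


def reverse_code_points(s: str) -> str:
    """Reverse a string code point by code point.

    Assumption: a "combining character" is any code point whose
    Unicode general category starts with "M" (Mn, Mc, Me).
    Such input is rejected rather than silently mangled.
    """
    for ch in s:
        if unicodedata.category(ch).startswith("M"):
            raise ValueError(
                f"combining character U+{ord(ch):04X} is not allowed"
            )
    # A Python str is already a sequence of code points, so reversing
    # the list of its elements reverses the code points.
    return "".join(reversed(list(s)))
```

Precomposed characters such as “ß” or “ä” are single code points and pass through fine; only decomposed forms (a base letter followed by a separate combining mark) are refused.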
The second, if you’re going to allow all of the Unicode code points, is as follows.
1. Ensure that every group of combining characters in your string has a base character by inserting a space in front of every group of isolated combining characters. (This is not necessary if your output is to be a list of character groups rather than a string.)
2. Group sets of characters as each base character followed by all its combining characters and make a list of these groups.
3. Reverse the list.
4. Generate the output string as a sequence of all the characters in this list.
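The four steps above can be sketched in Python roughly as follows. The names `group_characters` and `reverse_grouped` are hypothetical, and I'm again assuming that “combining character” means general category Mn, Mc or Me and that anything else counts as a base character. Under that assumption the only place an isolated group of combining characters can occur is at the very start of the string. This is deliberately not the full grapheme cluster segmentation from the Unicode standard, just the simple definition given here.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: "combining" means general category Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")


def group_characters(s: str) -> list[str]:
    """Steps 1 and 2: split s into base-plus-combining groups."""
    groups: list[str] = []
    for ch in s:
        if is_combining(ch):
            if not groups:
                # Step 1: leading combining characters have no base,
                # so give them a space to attach to.
                groups.append(" ")
            groups[-1] += ch
        else:
            # Any non-combining code point starts a new group.
            groups.append(ch)
    return groups


def reverse_grouped(s: str) -> str:
    """Steps 3 and 4: reverse the groups and join them back up."""
    return "".join(reversed(group_characters(s)))
```

For example, `reverse_grouped("e\u0301a")` keeps the acute accent on the “e” and returns `"ae\u0301"`, where a plain code point reversal would move the accent onto the “a”.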
Note that this reliably reproduces the same problems with reversal as produced when “reversing” ASCII strings. E.g., “ABC” produces “CBA”, now “shifting out” B rather than “shifting in” B. (Hmm! It seems that “reversing” a string can have the same problems with definition in ASCII as it does in Unicode!)
Sure, the second form is more complex than its ASCII equivalent, but that’s just the nature of a system that allows you to do perfectly sensible things, such as write German¹, that the designers of ASCII never imagined.
¹ Yes, I’m aware that some people would claim that writing German is never a sensible thing to do. :-)