Forum Discussion
EnTerr
Roku Guru · 10 years ago
Looking for something unrelated, I ran into this: http://utf8everywhere.org/
Executive summary:
- "UTF-16 is the worst of both worlds, being both variable length^ and too wide. It exists only for historical reasons and creates a lot of confusion."
- Always use UTF-8 for the external representation of text (wire protocols, files).
- For historical reasons^^, the internal representation is more complicated. Use UTF-8 if you can, but beware that the API might be dictated by the host platform (e.g. Java, .NET, Qt).

(^) e.g. the Koala emoji U+1F428 will be represented as 2 UTF-16 "characters" (0xD83D 0xDC28, or 0x3DD8 0x28DC depending on byte-sexuality, yet another PITA) and string length() will return 2. Asking for the 1st or 2nd character will return a high or low surrogate half, .reverse() creates an "invalid" string, sorting is not lexicographic... things are not rosy.

(^^) Under the (wrong) early belief that all Unicode characters would fit in 16 bits, the early adopters of Unicode - the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995) - started using a 2-byte encoding, UCS-2. A couple of years later, the dream of a fixed-width encoding was shattered.
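The koala example is easy to verify. Here is a minimal Python 3 sketch (Python strings count code points rather than UTF-16 units, so the surrogate pair only shows up once you encode):

```python
# U+1F428 KOALA lies outside the Basic Multilingual Plane,
# so UTF-16 must split it into a surrogate pair.
koala = "\U0001F428"

# One Unicode code point...
assert len(koala) == 1

# ...but two UTF-16 code units: high surrogate 0xD83D, low surrogate 0xDC28.
# Big-endian encoding makes the unit boundaries visible in the hex dump.
utf16_be = koala.encode("utf-16-be")
assert utf16_be.hex() == "d83ddc28"   # 2 units x 2 bytes = 4 bytes

# Little-endian swaps the bytes within each unit (the 0x3DD8 0x28DC form).
assert koala.encode("utf-16-le").hex() == "3dd828dc"

# UTF-8 takes 4 bytes here too, but with no surrogate machinery.
assert koala.encode("utf-8").hex() == "f09f90a8"
```

A UTF-16-based language (Java, JavaScript) would report length 2 for this same string, which is exactly the complaint in footnote (^).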