The majority of this is not specific to Rust and not exactly groundbreaking stuf...

jmcomets · on June 15, 2023

+1, the link to Joel Spolsky's post on Unicode is probably the most interesting read I've found this year: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

chrismorgan · on June 15, 2023

I was never much impressed with that article (given its title, I say—given its title) from the first time I saw it, probably around 2010, and it has aged poorly. Some of my complaints about it:

• It tells a verbose story, rather than just telling you what you need to know succinctly. (Seriously, the 3,600-word article could be condensed to under a thousand words with no meaningful loss—and the parts that I consider useful could be better expressed in well under 500 words. As a reminder of where I’m coming from in this criticism: the title of the article is “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, so I expect it to be, y’know, absolute-minimumy.)

• It spends way too much time on history that even in 2003 wasn’t particularly relevant to a lot of people (though yeah, if you were working in the Windows ecosystems he was working in, you would benefit from knowing about some of it), and which is now (twenty years later) completely irrelevant to almost everyone.

• It portrays UCS-2 and UTF-16 as equivalent, which is disastrously wrong. (As in, “if you do that, you stand a good chance of destroying anything beyond U+FFFF”.)

• As far as I can tell (I was a young child at the time), its chronology around UTF-8 and UCS-2/UTF-16 is… well, dubious at best, playing fast and loose with ordering and causality.

• Really, the whole article looks at things from roughly the Windows API perspective, which, although what you’d expect from him (as a developer living in Microsoft ecosystems), just isn’t very balanced or relevant any more, since by now UTF-8 has more or less won.

• It doesn’t talk about the different ways of looking at Unicode text: code units (mostly important because of the UTF-16 mess), scalar values, extended grapheme clusters, that kind of thing. A brief bit on the interactions with font shaping would also be rather useful: a little bidirectional text, a little Indic script, and by now even a bit of emoji composition. All of this is far more useful today than ancient/pre-Unicode history.

• The stuff about browsers was right at the time, but that’s been fixed for, I think, over a decade now (browser makers agreed on the precise algorithms to use in sniffing). (He’s absolutely right that Postel’s Law was a terrible idea.)
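
To make the UCS-2/UTF-16 and code-unit points above concrete, here is a minimal Rust sketch using only the standard library. It is an illustration, not something from the article or the thread: the helper names `unit_counts`, `ucs2_style` and `utf16_decode` are made up for this example. It counts the same text as UTF-8 code units, UTF-16 code units and scalar values, then shows what happens when each 16-bit unit is read as a complete character, which is roughly what UCS-2-only code does. Extended grapheme clusters are left out because the standard library does not segment them; that needs a third-party crate.

```rust
/// Size of the same text at three levels:
/// (UTF-8 code units, UTF-16 code units, Unicode scalar values).
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

/// Reads every 16-bit unit as a whole character, the way code that
/// assumes UCS-2 would. Surrogate halves are not scalar values, so
/// they come back as `None`.
fn ucs2_style(units: &[u16]) -> Vec<Option<char>> {
    units.iter().map(|&u| char::from_u32(u32::from(u))).collect()
}

/// Decodes the units as UTF-16, combining surrogate pairs. Unpaired
/// surrogates are replaced with U+FFFD.
fn utf16_decode(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // 'a' followed by U+1F600, a scalar value beyond U+FFFF.
    let s = "a\u{1F600}";
    let units: Vec<u16> = s.encode_utf16().collect();

    println!("counts (utf8, utf16, scalars): {:?}", unit_counts(s));
    println!("utf16 units: {:X?}", units);
    println!("read as UCS-2: {:?}", ucs2_style(&units));
    println!("decoded as UTF-16 round-trips: {}", utf16_decode(&units) == s);
}
```

The two-character string takes five UTF-8 code units and three UTF-16 code units. The UCS-2-style reading recovers only the `a`, because the two surrogate halves of U+1F600 mean nothing on their own. That is the "destroying anything beyond U+FFFF" failure in miniature.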