[Info-vax] 8-bit characters
Lawrence D’Oliveiro
lawrencedo99 at gmail.com
Thu Nov 11 17:23:11 EST 2021
On Friday, November 12, 2021 at 8:36:00 AM UTC+13, Craig A. Berry wrote:
> For some reason I had thought they'd blown the 4-byte limit with emojis ...
Nowhere near. Unicode officially has room for only about 1.1 million “code points” (not the same as “characters”), which fits comfortably in 4 bytes, and the emojis, I think, number only a few thousand at most.
Also, they did a clever thing with the representation of ISO 3166 national/regional flag codes: a flag is a pair of “regional indicator” letters spelling the two-letter code, so the whole scheme uses just 26 code points.
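To make the flag trick concrete, here is a minimal Python sketch. The function name flag() and the constant name are my own, purely for illustration; the only facts assumed are that the 26 regional indicator symbols are contiguous and start at U+1F1E6.

```python
# REGIONAL INDICATOR SYMBOL LETTER A; the other 25 letters follow contiguously.
REGIONAL_INDICATOR_A = 0x1F1E6


def flag(country_code: str) -> str:
    """Map a two-letter ISO 3166 code to its pair of regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter ASCII country code")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


nz = flag("NZ")
print([f"U+{ord(c):04X}" for c in nz])  # two code points, one per letter
print(len(nz.encode("utf-8")))          # 4 bytes per code point in UTF-8
```

Whether the pair actually displays as a flag is up to the font and renderer; the encoding itself only ever sees two ordinary code points.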
> but it doesn't seem UTF-32 has any provision for surrogate pairs.
Surrogates were a hack to turn UCS-2 into UTF-16; they exist only in UTF-16, which is why UTF-32 has no need of them. Remember, back when Unicode was young, it was only a fixed-width 16-bit code, and I’m pretty sure there were even assurances given that it would remain that way. So Microsoft took them at their word when incorporating Unicode into the heart of Windows NT, and so did Sun with Java.
So now, they have this horrible “UTF-16” thing baked into them. It’s not an encoding anybody would adopt voluntarily.
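For anyone who hasn't seen the hack up close, here is a rough Python sketch of how a code point above U+FFFF gets split into two 16-bit units. The function name to_surrogate_pair() is made up for this example; the arithmetic is the standard UTF-16 scheme.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600, an emoji outside the 16-bit range
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
print("\U0001F600".encode("utf-16-be").hex())

# 16-bit range plus 20 bits' worth of surrogate pairs: where "about a million" comes from.
print(0x10000 + (1 << 20))
```

Two 10-bit halves give 2^20 extra values on top of the original 2^16, which is why the code space stops at U+10FFFF.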