[Info-vax] Character sets
Johnny Billquist
bqt at softjar.se
Wed Sep 7 09:08:49 EDT 2022
On 2022-09-06 20:42, Arne Vajhøj wrote:
> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>> which is... a mess, and is also ill-suited for UTF-8. Probably better
>> to use char16_t and char32_t, if you do need fixed-width wide
>> character storage.
>
> wchar_t has a typically vague C definition, whereas char16_t and
> char32_t are much more clearly defined.
wchar_t was an invention from before Unicode came about. And it's fairly
incompatible with the ideas in Unicode.
> But wchar_t got runtime support.
For some definition of runtime support, sure...
> C (and for that matter also C++) IO functions do not
> make writing/reading UTF-8 easy.
Looking at the follow-up comments here, what you mean is that string
processing functions lack UTF-8 variants, which is true, especially if
we talk about the standards. As far as I/O goes, C has no problem at
all. It can read/write UTF-8 without any problems at all.
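
A minimal sketch of that distinction, not anything from the standard
library beyond fputs() and strlen(): the names put_utf8, utf8_length and
name_utf8 are made up for illustration. Writing UTF-8 is just writing
bytes, but strlen() counts bytes, so counting characters needs a helper
of our own. The sketch assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* "Vajhøj" as UTF-8 bytes; ø (U+00F8) is the two bytes 0xC3 0xB8.
   The literal is split so the escape clearly ends before the 'j'. */
static const char name_utf8[] = "Vajh\xC3\xB8" "j";

/* I/O: UTF-8 is just bytes, so fputs() passes it through unchanged.
   Returns a nonnegative value on success, EOF on error. */
int put_utf8(FILE *out, const char *s)
{
    assert(out != NULL && s != NULL);
    return fputs(s, out);
}

/* String processing: count code points by skipping continuation
   bytes (those of the form 10xxxxxx). Assumes valid UTF-8 input. */
size_t utf8_length(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Calling put_utf8(stdout, name_utf8) emits the seven bytes as they are;
whether they show up as "Vajhøj" depends on the terminal, not on C.
strlen(name_utf8) reports the byte count, utf8_length(name_utf8) the
number of code points.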
However, UTF-8 is actually not a character set. UTF-8 is an *encoding*
of Unicode. And if you were to do this properly, the canonical format is
just Unicode code points, which need 21 bits each. Which probably means
you'd like to store them as arrays of 32-bit values. And then you should
have functions that take a Unicode string and convert it to its UTF-8
representation and back if needed.
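
Such conversion functions could look roughly like the sketch below. The
names utf8_encode and utf8_decode are hypothetical, not part of any C
standard, and the sketch handles one code point at a time; a string
version would simply loop over an array of 32-bit values. It rejects
surrogates, values above U+10FFFF and overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogate range */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode one code point from the first n bytes at in.
   Returns the number of bytes consumed, or 0 if the sequence is
   truncated, malformed, overlong, or decodes to an invalid value. */
size_t utf8_decode(const unsigned char *in, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs len bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    assert(in != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (in[0] < 0x80) {
        *cp = in[0];
        return 1;
    } else if ((in[0] & 0xE0) == 0xC0) {
        len = 2; v = in[0] & 0x1F;
    } else if ((in[0] & 0xF0) == 0xE0) {
        len = 3; v = in[0] & 0x0F;
    } else if ((in[0] & 0xF8) == 0xF0) {
        len = 4; v = in[0] & 0x07;
    } else {
        return 0;                           /* stray continuation byte */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (in[i] & 0x3F);
    }
    if (v < min_cp[len] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;
    *cp = v;
    return len;
}
```

With this, U+00F8 (ø) encodes to the two bytes C3 B8 and decodes back to
0xF8. Note that this only converts between representations; it says
nothing about which code point sequences mean the same thing.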
But the problem is uglier than that, since Unicode handling also means
you should know/handle multiple code points that should be considered
equivalent, and for some you have combinations of characters that are
equivalent to another single character. And of course the actual
collation of it all is also language dependent, so it's not even
possible to do without some additional information. Unicode is a mess,
and we are now stuck with it, just as we're pretty stuck with x86. Not
because it's good, but because everyone uses it.
Even trying to think how ustrcmp() should be implemented makes me sick...
Or we could do as other languages do, and pretend the problem doesn't exist...
> Newer languages do much better.
Sort of.
Johnny