[Info-vax] Character sets

Johnny Billquist bqt at softjar.se
Wed Sep 7 09:08:49 EDT 2022


On 2022-09-06 20:42, Arne Vajhøj wrote:
> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>> Pedant notes: yes, I do know about wchar_t and friends in C and C++, 
>> which is... a mess, and is also ill-suited for UTF-8.  Probably better 
>> to use char16_t and char32_t, if you do need fixed-width wide 
>> character storage.
> 
> wchar_t is a typical C vague definition where char16_t and char32_t are
> much more clearly defined.

wchar_t was an invention from before Unicode came about. And it's fairly 
incompatible with the ideas in Unicode.

> But wchar_t got runtime support.

For some definition of runtime support, sure...

> C (and for that matter also C++) IO functions do not
> make writing/reading UTF-8 easy.

Looking at the follow-up comments here, what you mean is that the string
processing functions lack UTF-8 variants, which is true, especially if
we talk about the standards. As far as I/O goes, C has no problem at
all. It can read and write UTF-8 just fine, since to stdio it is only
bytes.
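
To make the I/O versus string-processing distinction concrete, here is
a minimal sketch in C. The names utf8_put() and utf8_codepoints() are
made up for illustration and are not part of any standard library, and
the counting function assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write a UTF-8 string unchanged. stdio only moves bytes, so no
   special handling is needed. Returns 0 on success, -1 on error. */
int utf8_put(const char *s, FILE *out)
{
    return fputs(s, out) == EOF ? -1 : 0;
}

/* Count code points by skipping continuation bytes (10xxxxxx).
   Assumes the input is valid UTF-8; this is what strlen() does not
   do, since strlen() counts bytes. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "Vajh\xC3\xB8j" (the UTF-8 bytes of "Vajhøj"), strlen()
reports 7 bytes while utf8_codepoints() reports 6 code points. Writing
it with utf8_put() needs nothing beyond plain fputs().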

However, UTF-8 is actually not a character set. UTF-8 is an *encoding*
of Unicode. And if you were to do this properly, the canonical format is
just Unicode code points, each of which needs 21 bits. Which probably
means you'd like to store them as arrays of 32-bit values. And then you
should have functions that take a Unicode string and convert it to its
UTF-8 representation, and back, as needed.
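
A rough sketch of such a pair of conversion functions, for a single
code point held in a 32-bit value, could look like the following. The
names utf8_encode() and utf8_decode() are hypothetical, not standard C,
and a real implementation would loop these over whole strings and
decide how to report errors.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes needed).
   Returns bytes written, or 0 for surrogates and values > U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode one code point from at most len bytes at s. Returns bytes
   consumed, or 0 on truncated, malformed, overlong, or out-of-range
   input. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t v;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; v = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; v = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; v = s[0] & 0x07;
    } else {
        return 0;                   /* stray continuation or bad lead */
    }
    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (v < min_for_len[need] || v > 0x10FFFF ||
        (v >= 0xD800 && v <= 0xDFFF))
        return 0;
    *cp = v;
    return need;
}
```

With these, U+00F8 (ø) encodes to the two bytes C3 B8, and decoding
those two bytes gives back 0xF8.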

But the problem is uglier than that. Unicode handling also means you
have to know about and handle multiple code points that should be
considered equivalent, and for some characters a combination of code
points is equivalent to another single code point. And of course the
actual collation of it all is also language dependent, so it's not even
possible to do without some additional information. Unicode is a mess,
and we are now stuck with it, just as we're pretty stuck with x86. Not
because it's good, but because everyone uses it.
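
To illustrate the equivalence problem only, here is a toy sketch. The
three-entry table and the names toy_compose() and cp_equal() are
invented for this example. Real normalization needs the full Unicode
data tables, handles several marks in a row and their ordering, and
says nothing yet about language-dependent collation.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy table: base, combining mark, precomposed form. Three entries
   only, for illustration; the real data is in the Unicode Character
   Database. */
static const uint32_t compose_table[][3] = {
    {0x0065, 0x0301, 0x00E9},   /* e + combining acute     -> é */
    {0x0061, 0x030A, 0x00E5},   /* a + combining ring      -> å */
    {0x006F, 0x0308, 0x00F6},   /* o + combining diaeresis -> ö */
};

/* Replace known base+mark pairs in place; returns the new length. */
size_t toy_compose(uint32_t *s, size_t n)
{
    size_t in = 0, out = 0;
    while (in < n) {
        uint32_t c = s[in++];
        if (in < n) {
            size_t k;
            size_t rows = sizeof compose_table / sizeof compose_table[0];
            for (k = 0; k < rows; k++) {
                if (compose_table[k][0] == c &&
                    compose_table[k][1] == s[in]) {
                    c = compose_table[k][2];
                    in++;           /* consume the combining mark */
                    break;
                }
            }
        }
        s[out++] = c;
    }
    return out;
}

/* Code-point-wise equality, which is all a naive comparison does. */
int cp_equal(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    size_t i;
    if (na != nb)
        return 0;
    for (i = 0; i < na; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

Given "cafe" followed by U+0301 in one array and "caf" followed by
U+00E9 in another, cp_equal() says they differ. After toy_compose() on
the first array, they compare equal.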

Even trying to think how ustrcmp() should be implemented makes me sick...

Or we could do as other languages do, and pretend the problem doesn't
exist...

> Newer languages do much better.

Sort of.

   Johnny


