[Info-vax] Character sets

Tue Sep 6 20:14:28 EDT 2022

On 2022-09-06 23:32:46 +0000, Arne Vajhøj said:

> On 9/6/2022 4:31 PM, Stephen Hoffman wrote:
>> On 2022-09-06 18:42:53 +0000, Arne Vajhøj said:
>> 
>>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++, 
>>>> which is... a mess, and is also ill-suited for UTF-8.  Probably better 
>>>> to use char16_t and char32_t, if you do need fixed-width wide character 
>>>> storage.
>>> 
>>> wchar_t is a typically vague C definition, whereas char16_t and 
>>> char32_t are much more clearly defined.
>>> 
>>> But wchar_t got runtime support.
>> 
>> Run-time support which is less than useful for most purposes, 
>> particularly given the definition and the ~portability issues.
> 
> I think I would miss wcs*, isw*, w versions of IO functions.

Other than that those functions use wchar_t and are thus still 
problematic at best, sure.

Something akin to u8_strtok or u_strtok_r (UTF-8 variants of strtok or 
wcstok) works rather better for most uses I have, though.
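
As a rough illustration of why a UTF-8-aware tokenizer matters: plain 
strtok treats its delimiter argument as a set of individual bytes, so a 
multibyte delimiter such as U+2022 gets matched one byte at a time. The 
sketch below is standard C only and is not the libunistring or ICU API; 
utf8_next_token is a hypothetical helper that splits on one whole UTF-8 
sequence. It assumes valid UTF-8 input, where a byte-wise substring search 
for a complete sequence cannot match in the middle of a character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not libunistring, not ICU).
 * Splits the string at *cursor on the next occurrence of delim, where
 * delim is one complete UTF-8 sequence. The token is NUL-terminated in
 * place and *cursor is advanced past the delimiter. Returns the token,
 * or NULL once the input is exhausted. Like strsep, it returns empty
 * tokens for adjacent delimiters. Assumes valid UTF-8 input. */
char *utf8_next_token(char **cursor, const char *delim)
{
    char *start = *cursor;
    char *hit;

    if (start == NULL)
        return NULL;

    /* An empty delimiter would match everywhere; treat it as "no split". */
    hit = (delim[0] != '\0') ? strstr(start, delim) : NULL;

    if (hit != NULL) {
        *hit = '\0';                    /* terminate this token */
        *cursor = hit + strlen(delim);  /* resume after the delimiter */
    } else {
        *cursor = NULL;                 /* last token */
    }
    return start;
}
```

The caller keeps a char * cursor pointing at a writable buffer and calls 
utf8_next_token(&cursor, delim) until it returns NULL.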

Yeah; those particular libunistring and ICU calls are not part of the C 
standard.

C23 does add null-terminated multibyte calls (mbrtoc8 and c8rtomb, 
alongside char8_t), but the existing selection of standard 
string-handling calls for UTF-8 is just... bad.  But then C string 
handling is bad.  OpenVMS itself is also bad at UTF-8.

>>> C (and for that matter also C++) IO functions do not make 
>>> writing/reading UTF-8 easy.
>> 
>> The C I/O functions do ~mostly fine.
>> 
>> Semi-recent Clang, else-platform:
>> 
>> 
>> $ cc x.c -o x
>> $ ~/x
>> hello 🗺
>> $ cat x.c
>> #include <stdio.h>
>> #include <stdlib.h>
>> 
>> int  main(void)
>> {
>>  printf("hello 🗺\n");
>>  exit(EXIT_SUCCESS);
>> }
> 
> That is C IO processing bytes where the application has put UTF-8 in.

char (or also soon char8_t), yes. Which holds UTF-8 strings just fine.
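
A minimal sketch of what "holds UTF-8 just fine" means in practice: the 
bytes pass through char storage untouched, but strlen reports bytes, not 
characters. The helper name utf8_count is made up for this example, and 
it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string by counting every byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the "hello 🗺" string from the quoted program, strlen sees 10 bytes 
(the map emoji U+1F5FA takes four) while utf8_count sees 7 code points. 
Code points are still not user-perceived characters, so this does not 
replace proper grapheme handling.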

Both the typical C string stuff and OpenVMS string descriptors can need 
to carry the language and encoding separately.

> What is needed is something where the application passes Unicode 
> (wchar_t* or char16_t* or char32_t*) to an IO function and it converts 
> it to a specified encoding, UTF-8 or otherwise.

Which would usually be the C character functions handling UTF-8, and 
which would preferably seldom involve wchar_t, and probably not all 
that much of char16_t or char32_t more generally.
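
For the cases where fixed-width code points do need to be written out as 
UTF-8, the conversion itself is small. The following is a sketch only, 
with a hypothetical helper named utf8_encode; it uses uint_least32_t 
(the underlying type of char32_t) so it does not depend on <uchar.h> 
being available, and it is independent of the current locale, unlike 
c32rtomb.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the number written.
 * Returns 0 for surrogates (U+D800..U+DFFF) and for values above
 * U+10FFFF, neither of which is encodable in UTF-8. */
size_t utf8_encode(uint_least32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogate: reject */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + 3 continuation */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* out of Unicode range */
}
```

The resulting bytes can go straight to fwrite or into a char buffer, 
which is the point made above: once the data is UTF-8, the ordinary 
byte-oriented C I/O calls carry it without further help.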

Objective C and Swift are just vastly better at this stuff.

-- 
Pure Personal Opinion | HoffmanLabs LLC