[Info-vax] Character sets
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Tue Sep 6 20:14:28 EDT 2022
On 2022-09-06 23:32:46 +0000, Arne Vajhøj said:
> On 9/6/2022 4:31 PM, Stephen Hoffman wrote:
>> On 2022-09-06 18:42:53 +0000, Arne Vajhøj said:
>>
>>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>>>> which is... a mess, and is also ill-suited for UTF-8. Probably better
>>>> to use char16_t and char32_t, if you do need fixed-width wide character
>>>> storage.
>>>
>>> wchar_t is a typical C vague definition where char16_t and char32_t are
>>> much more clearly defined.
>>>
>>> But wchar_t got runtime support.
>>
>> Run-time support which is less than useful for most purposes,
>> particularly given the definition and the ~portability issues.
>
> I think I would miss wcs*, isw*, w versions of IO functions.
Other than that those functions all take wchar_t and are thus still
problematic at best, sure.
Something akin to u8_strtok or u_strtok_r (UTF-8 variants of strtok or
wcstok) works rather better for most uses I have, though.
Yeah; those particular libunistring and ICU calls are not part of the C
standard.
C23 does add null-terminated multibyte calls (mbrtoc8 and c8rtomb), but
the existing selection of standard string-handling calls for UTF-8 is
just... bad.
But then C string handling is bad. OpenVMS itself is also bad at UTF-8.
>>> C (and for that matter also C++) IO functions do not make
>>> writing/reading UTF-8 easy.
>>
>> The C I/O functions do ~mostly fine.
>>
>> Semi-recent Clang, else-platform:
>>
>>
>> $ cc x.c -o x
>> $ ~/x
>> hello 🗺
>> $ cat x.c
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(void)
>> {
>>     printf("hello 🗺\n");
>>     exit(EXIT_SUCCESS);
>> }
>
> That is C IO processing bytes where the application has put UTF-8 in.
char (or also soon char8_t), yes. Which holds UTF-8 strings just fine.
Both the typical C string stuff and OpenVMS string descriptors can need
to carry the language and encoding separately.
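A minimal sketch of what "holds UTF-8 just fine" means in practice, assuming the input is valid UTF-8. The helper name utf8_count_code_points is hypothetical and not part of any standard library. strlen() reports bytes, while counting every byte that is not a continuation byte (bit pattern 10xxxxxx) gives code points:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a standard call: counts the code points in a
 * NUL-terminated string that is assumed to contain valid UTF-8.
 * UTF-8 continuation bytes look like 10xxxxxx, so every byte that does
 * not match that pattern starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the "hello " plus U+1F5FA string from the quoted example, strlen() returns 10 (six ASCII bytes plus a four-byte sequence) and this helper returns 7. Neither number says anything about displayed width or grapheme clusters, which is where the library support gets thin.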
> What is needed is something where the application passes Unicode
> (wchar_t* or char16_t* or char32_t*) to an IO function and it converts
> to a specified encoding, UTF-8 or otherwise.
Which would usually be the C character functions handling UTF-8, and
which would preferably seldom involve wchar_t, and probably not all
that much of char16_t or char32_t more generally.
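For the cases where char32_t does show up, the conversion to UTF-8 is small enough to sketch by hand. This is an illustration only, assuming a C11 <uchar.h> is available for the char32_t type. The function name utf8_encode is hypothetical, and this is not how any particular C library implements c32rtomb:

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* char32_t (C11); assumed to be available */

/* Hypothetical helper: encodes one Unicode code point as UTF-8.
 * out must have room for 4 bytes. Returns the number of bytes written,
 * or 0 for surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
size_t utf8_encode(char32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling utf8_encode(0x1F5FA, buf) writes the four bytes F0 9F 97 BA, which can then go through fwrite() or printf("%.*s", ...) like any other char data.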
Objective C and Swift are just vastly better at this stuff.
--
Pure Personal Opinion | HoffmanLabs LLC