[Info-vax] Character sets

Arne Vajhøj arne at vajhoej.dk
Wed Sep 7 19:19:56 EDT 2022


On 9/7/2022 9:08 AM, Johnny Billquist wrote:
> On 2022-09-06 20:42, Arne Vajhøj wrote:
>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++, 
>>> which is... a mess, and is also ill-suited for UTF-8.  Probably 
>>> better to use char16_t and char32_t, if you do need fixed-width wide 
>>> character storage.
>>
>> wchar_t is a typical C vague definition where char16_t and char32_t are
>> much more clearly defined.
> 
> wchar_t was an invention from before Unicode came about. And it's fairly 
> incompatible with the ideas in Unicode.

The C standard leaves it very vague: neither the size nor the encoding is fixed.

But on common platforms it is just UTF-16 (Windows) or UTF-32 (Linux, macOS).
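
A quick way to check what a given platform does, without writing C, is to
ask Python's ctypes for the size of wchar_t. This is only a rough sketch:
the mapping from size to encoding is my assumption based on the common
platforms, not something the C standard promises.

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as seen through ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Assumption: 2 bytes means UTF-16 code units (e.g. Windows),
# 4 bytes means UTF-32 code units (e.g. Linux with glibc).
encoding_guess = {2: "UTF-16", 4: "UTF-32"}.get(wchar_size, "something else")

print(f"wchar_t is {wchar_size} bytes here, so probably {encoding_guess}")
```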

>> But wchar_t got runtime support.
> 
> For some definition of runtime support, sure...

There are a bunch of w functions, including the wcs* string functions and
the w IO functions (fgetws, fputws, wprintf etc.).

One may not like the API, but wchar_t* is approx. as supported as char*.

>> C (and for that matter also C++) IO functions do not
>> make writing/reading UTF-8 easy.
> 
> Looking at the follow up comments here, what you mean is that string 
> processing functions lack UTF-8 variants, which is true. Especially if 
> we talk about the standards. As far as I/O goes, C has no problem at 
> all. It can read/write UTF-8 without any problems at all.

I must be bad at explaining what I mean.

I know that C char* IO can read and write bytes containing UTF-8 - those
functions just pass the bytes on.
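
The pass-through case looks roughly like this. It is sketched in Python
binary mode rather than C, as a stand-in for fwrite/fread; the file name
and temp directory are made up for the illustration. The program has to
encode and decode by hand, and the IO layer never looks at the bytes.

```python
import os
import tempfile

text = "ÆØÅæøå"
raw = text.encode("utf-8")   # encode by hand; the IO layer will not do it

# Hypothetical scratch file in a fresh temp directory.
path = os.path.join(tempfile.mkdtemp(), "raw.txt")

# Binary mode: bytes go out and come back untouched.
with open(path, "wb") as f:
    f.write(raw)
with open(path, "rb") as f:
    data = f.read()

print(len(text), "characters,", len(data), "bytes")
print("bytes unchanged by IO:", data == raw)
print("decoded by hand:", data.decode("utf-8") == text)
```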

I am looking for a decoupling between internal Unicode representation
and external encoding - aka a transparent encode/decode.

That may still sound confusing.

But it should become clearer with a few examples.

Java:

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

public class J {
     public static void main(String[] args) throws IOException {
         String s1 = "ÆØÅæøå";
         String s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
         try(PrintWriter pw = new PrintWriter(new File("j1.txt"), "iso-8859-1")) {
             pw.printf("%s = %s\n", s1, s2);
         }
         try(PrintWriter pw = new PrintWriter(new File("j2.txt"), "utf-8")) {
             pw.printf("%s = %s\n", s1, s2);
         }
     }
}

C#:

using System;
using System.IO;
using System.Text;

public class N
{
     public static void Main(string[] args)
     {
         string s1 = "ÆØÅæøå";
         string s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
         using(StreamWriter sw = new StreamWriter("n1.txt", false, Encoding.GetEncoding("iso-8859-1")))
         {
             sw.WriteLine("{0} = {1}", s1, s2);
         }
         using(StreamWriter sw = new StreamWriter("n2.txt", false, Encoding.GetEncoding("utf-8")))
         {
             sw.WriteLine("{0} = {1}", s1, s2);
         }
     }
}

Python:

s1 = "ÆØÅæøå"
s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5"
with open("p1.txt", "w", encoding="iso-8859-1") as f:
     f.write("%s = %s\n" % (s1, s2))
with open("p2.txt", "w", encoding="utf-8") as f:
     f.write("%s = %s\n" % (s1, s2))

One simply specifies what encoding the file should be
in, and the IO code handles the encoding.
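
The same works in the other direction: name the encoding when reading and
the IO code decodes. A minimal sketch follows, writing to a temp directory
(my choice, so it does not clobber anything) and leaving out the trailing
newline so the byte counts do not depend on the platform's newline
translation.

```python
import os
import tempfile

s = "ÆØÅæøå = \u00C6\u00D8\u00C5\u00E6\u00F8\u00E5"
tmpdir = tempfile.mkdtemp()   # scratch directory; file names are made up

results = {}
for name, enc in [("p1.txt", "iso-8859-1"), ("p2.txt", "utf-8")]:
    path = os.path.join(tmpdir, name)
    # The IO layer encodes on write ...
    with open(path, "w", encoding=enc) as f:
        f.write(s)
    # ... and decodes on read, given the same encoding name.
    with open(path, "r", encoding=enc) as f:
        text = f.read()
    results[enc] = (os.path.getsize(path), text)

for enc, (size, text) in results.items():
    print(enc, ":", size, "bytes on disk,", len(text), "characters read back")
```

The two files differ in size on disk, but the program sees the same
string from both.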

And that Java and .NET use UTF-16 internally while Python uses
something else (CPython picks 1, 2 or 4 bytes per character
depending on the string) does not matter.

Arne


