[Info-vax] Character sets
Arne Vajhøj
arne at vajhoej.dk
Wed Sep 7 19:19:56 EDT 2022
On 9/7/2022 9:08 AM, Johnny Billquist wrote:
> On 2022-09-06 20:42, Arne Vajhøj wrote:
>> On 9/3/2022 3:30 PM, Stephen Hoffman wrote:
>>> Pedant notes: yes, I do know about wchar_t and friends in C and C++,
>>> which is... a mess, and is also ill-suited for UTF-8. Probably
>>> better to use char16_t and char32_t, if you do need fixed-width wide
>>> character storage.
>>
>> wchar_t is a typical C vague definition where char16_t and char32_t are
>> much more clearly defined.
>
> wchar_t was an invention from before Unicode came about. And it's fairly
> incompatible with the ideas in Unicode.
It is crazy vague in the C standard, which fixes neither its width nor its encoding.
But on common platforms it is just UTF-16 or UTF-32.
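A quick way to see which of the two a given platform picked is to ask for the size of wchar_t. The sketch below uses Python's ctypes for that; mapping 2 bytes to UTF-16 and 4 bytes to UTF-32 is an assumption that holds on the usual Windows and Linux setups, not something the C standard guarantees.

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as exposed by ctypes.
width = ctypes.sizeof(ctypes.c_wchar)

# Assumption: 2 bytes means UTF-16 code units, 4 bytes means UTF-32.
guess = {2: "UTF-16", 4: "UTF-32"}.get(width, "something else")

print("wchar_t is %d bytes, probably %s" % (width, guess))
```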
>> But wchar_t got runtime support.
>
> For some definition of runtime support, sure...
There are a bunch of w functions including wcs* and the w IO functions
(fwprintf, fgetws and so on).
One may not like the API, but wchar_t* is approx. as supported as char*.
>> C (and for that matter also C++) IO functions do not
>> make writing/reading UTF-8 easy.
>
> Looking at the follow up comments here, what you mean is that string
> processing functions lack UTF-8 variants, which is true. Especially if
> we talk about the standards. As far as I/O goes, C has no problem at
> all. It can read/write UTF-8 without any problems at all.
I must be bad at explaining what I mean.
I know that C char* IO can read and write bytes containing UTF-8 - those
functions just pass the bytes on.
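To make the "just pass the bytes on" case concrete, here is a minimal sketch in Python using binary-mode IO, which behaves like C char* IO in this respect. The file name raw.txt and the temporary directory are only for illustration. The caller has to do the encode by hand, and the IO layer knows nothing about the encoding.

```python
import os
import tempfile

# The caller encodes explicitly; the IO layer only ever sees bytes.
data = "\u00C6\u00D8\u00C5".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "raw.txt")  # illustrative file name
with open(path, "wb") as f:
    f.write(data)  # bytes are written unchanged

with open(path, "rb") as f:
    raw = f.read()  # bytes come back unchanged, still not decoded

print(raw)  # b'\xc3\x86\xc3\x98\xc3\x85'
```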
I am looking for a decoupling between internal Unicode representation
and external encoding - aka a transparent encode/decode.
That may still sound confusing.
But it should become more clear with a few examples.
Java:
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

public class J {
    public static void main(String[] args) throws IOException {
        // same string, written literally and with Unicode escapes
        String s1 = "ÆØÅæøå";
        String s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
        // the writer encodes to the charset named at open time
        try(PrintWriter pw = new PrintWriter(new File("j1.txt"),
                                             "iso-8859-1")) {
            pw.printf("%s = %s\n", s1, s2);
        }
        try(PrintWriter pw = new PrintWriter(new File("j2.txt"),
                                             "utf-8")) {
            pw.printf("%s = %s\n", s1, s2);
        }
    }
}
C#:
using System;
using System.IO;
using System.Text;

public class N
{
    public static void Main(string[] args)
    {
        // same string, written literally and with Unicode escapes
        string s1 = "ÆØÅæøå";
        string s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5";
        // the writer encodes to the encoding given at open time
        using(StreamWriter sw = new StreamWriter("n1.txt", false,
                                                 Encoding.GetEncoding("iso-8859-1")))
        {
            sw.WriteLine("{0} = {1}", s1, s2);
        }
        using(StreamWriter sw = new StreamWriter("n2.txt", false,
                                                 Encoding.GetEncoding("utf-8")))
        {
            sw.WriteLine("{0} = {1}", s1, s2);
        }
    }
}
Python:
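# same string, written literally and with Unicode escapes
s1 = "ÆØÅæøå"
s2 = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5"
# the file object encodes to the encoding given at open time
with open("p1.txt", "w", encoding="iso-8859-1") as f:
    f.write("%s = %s\n" % (s1, s2))
with open("p2.txt", "w", encoding="utf-8") as f:
    f.write("%s = %s\n" % (s1, s2))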
One simply specifies what encoding the file should be
in and the IO code handles the encode.
And that Java and .NET use UTF-16 internally while Python uses
a different internal representation does not matter.
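The effect can be checked by looking at the bytes that end up on disk. The following is a rough sketch in Python, not part of the examples above: it writes the same string with both encodings to temporary files (the file names are made up), then compares byte counts and reads the text back with the matching encoding. The newline="\n" argument is only there so the line ending is one byte on every platform.

```python
import os
import tempfile

s = "\u00C6\u00D8\u00C5\u00E6\u00F8\u00E5"
line = "%s = %s\n" % (s, s)

tmpdir = tempfile.mkdtemp()
sizes = {}       # encoding -> number of bytes on disk
round_trip = {}  # encoding -> text read back through the same encoding

for enc in ("iso-8859-1", "utf-8"):
    path = os.path.join(tmpdir, enc + ".txt")  # illustrative file name
    # text-mode write: the IO layer does the encode
    with open(path, "w", encoding=enc, newline="\n") as f:
        f.write(line)
    # binary-mode read: see what was actually written
    with open(path, "rb") as f:
        sizes[enc] = len(f.read())
    # text-mode read: the IO layer does the decode
    with open(path, "r", encoding=enc, newline="\n") as f:
        round_trip[enc] = f.read()

print(len(line), sizes)  # 16 {'iso-8859-1': 16, 'utf-8': 28}
```

The program sees the same 16 character string in both cases; only the external byte sequence differs (one byte per character in ISO-8859-1, two bytes for each of the 12 non-ASCII characters in UTF-8).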
Arne