[Info-vax] 8-bit characters
Arne Vajhøj
arne at vajhoej.dk
Thu Nov 11 13:53:11 EST 2021
On 11/11/2021 1:17 PM, Craig A. Berry wrote:
> On 11/11/21 10:21 AM, Arne Vajhøj wrote:
>> On 11/10/2021 11:48 PM, Lawrence D’Oliveiro wrote:
>>> On Thursday, November 11, 2021 at 3:33:33 PM UTC+13, Arne Vajhøj wrote:
>>>> The biggest problem with UTF-8 is that the byte length is not
>>>> necessarily the character length ...
>>>
>>> That would be true of any Unicode encoding, even UCS-4.
>>
>> No.
>>
>> It is a practical problem in UTF-8 as everything not in ASCII is more
>> than 1 byte.
>>
>> It is a theoretical problem in UTF-16 because there are defined Unicode
>> code points that become more than 2 bytes (they are just extremely
>> rare).
>>
>> It is not a problem for UTF-32 as everything is 4 bytes.
>
> Back when it was called UCS-4, I think that was true. But as far as I
> know, all the ones with UTF in the name are varying width. I think
> there are a couple of emojis that take more than 4 bytes and would need
> two UTF-32 chunks to represent a single character.
A few quotes from the standard:
<quote>
In the Unicode Standard, the codespace consists of the integers from 0
to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
the repertoire of abstract characters.
</quote>
<quote>
Each Unicode code point is represented directly by a single 32-bit
code unit. Because of this, UTF-32 has a one-to-one relationship
between encoded character and code unit; it is a fixed-width character
encoding form.
</quote>
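A minimal Python sketch of what that means in practice. It is not from the standard; the helper name widths and the sample characters are just picked for illustration. It counts the code units needed per encoding form for one code point at a time:

```python
# Hypothetical helper: how many units does a string take in each encoding form?
def widths(s):
    return {
        "code_points": len(s),                            # Python str is indexed by code point
        "utf8_bytes": len(s.encode("utf-8")),             # 8-bit code units
        "utf16_units": len(s.encode("utf-16-le")) // 2,   # 16-bit code units
        "utf32_units": len(s.encode("utf-32-le")) // 4,   # 32-bit code units
    }

# ASCII letter, Latin-1 letter, BMP symbol, and an emoji outside the BMP
for ch in ("A", "\u00f8", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", widths(ch))
```

Under these assumptions the emoji U+1F600 should come out as 4 bytes in UTF-8, 2 code units (a surrogate pair) in UTF-16 and exactly 1 code unit in UTF-32, which is the fixed-width property the quote describes.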
> But even if the
> encoding is not varying width, the number of characters displayed might
> not match the number of code points because of things like combining
> characters.
Display is another issue, and a far more complex one.
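A small Python sketch of the combining character case, assuming the standard unicodedata module; the helper utf32_units is a made-up name for illustration. The same displayed "é" is either two code points or one, and UTF-32 stays one code unit per code point in both cases:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # precomposed U+00E9

# Hypothetical helper: number of 32-bit code units in the UTF-32 form
def utf32_units(s):
    return len(s.encode("utf-32-le")) // 4

for label, s in (("decomposed", decomposed), ("composed", composed)):
    print(label, "code points:", len(s), "UTF-32 code units:", utf32_units(s))
```

Both strings display as one character, but the code point count differs. That is a property of the text, not of the encoding form.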
Arne