[Info-vax] 8-bit characters

Thu Nov 11 13:53:11 EST 2021

On 11/11/2021 1:17 PM, Craig A. Berry wrote:
> On 11/11/21 10:21 AM, Arne Vajhøj wrote:
>> On 11/10/2021 11:48 PM, Lawrence D’Oliveiro wrote:
>>> On Thursday, November 11, 2021 at 3:33:33 PM UTC+13, Arne Vajhøj wrote:
>>>> The biggest problem with UTF-8 is that the byte length is not
>>>> necessarily the character length ...
>>>
>>> That would be true of any Unicode encoding, even UCS-4.
>>
>> No.
>>
>> It is a practical problem in UTF-8 as everything not in ASCII is more 
>> than 1 byte.
>>
>> It is a theoretical problem in UTF-16 because there are defined Unicode
>> code points that take more than 2 bytes (they are just extremely
>> rare).
>>
>> It is not a problem for UTF-32 as everything is 4 bytes.
> 
> Back when it was called UCS-4, I think that was true.  But as far as I
> know, all the ones with UTF in the name are varying width.  I think
> there are a couple of emojis that take more than 4 bytes and would need
> two UTF-32 chunks to represent a single character.

A few quotes from the standard:

<quote>
In the Unicode Standard, the codespace consists of the integers from 0
to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
the repertoire of abstract characters.
</quote>

<quote>
Each Unicode code point is represented directly by a single 32-bit
code unit. Because of this, UTF-32 has a one-to-one relationship
between encoded character and code unit; it is a fixed-width character
encoding form.
</quote>
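
A quick way to see the difference between the three encoding forms is to count the bytes each one produces per code point. The following is only a minimal Python sketch: the sample characters and the helper name encoded_sizes are illustrative choices, not something from the standard or from this thread. The "-le" codec variants are used so that no byte order mark is added to the counts.

```python
# Illustrative samples: ASCII letter, Latin-1 letter, BMP symbol, non-BMP emoji.
samples = ["A", "\u00f8", "\u20ac", "\U0001F600"]

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes, UTF-32 bytes) for text."""
    return (
        len(text),                      # Python 3 str length counts code points
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # "-le" variant: no byte order mark
        len(text.encode("utf-32-le")),
    )

for s in samples:
    print(repr(s), encoded_sizes(s))
```

Under these assumptions only the UTF-32 column stays at exactly 4 bytes for every single code point; the UTF-8 column varies from 1 to 4, and the UTF-16 column is 2 except for the code point outside the BMP, where it is 4.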

>                                                  But even if the
> encoding is not varying width, the number of characters displayed might
> not match the number of code points because of things like combining
> characters.

Display is another issue, and a far more complex one.
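
To illustrate the combining character point, here is a small Python sketch using the standard unicodedata module. The helper count_base_code_points is a hypothetical name, and skipping combining marks is only a rough approximation of what gets displayed; it is not the full grapheme cluster segmentation a real text renderer would need.

```python
import unicodedata

def count_base_code_points(text):
    """Count code points that are not combining marks (rough approximation only)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
decomposed = "e\u0301"

code_points = len(decomposed)                       # number of code points
utf32_bytes = len(decomposed.encode("utf-32-le"))   # 4 bytes per code point
base_count = count_base_code_points(decomposed)     # combining mark skipped
composed = unicodedata.normalize("NFC", decomposed) # precomposed form, U+00E9

print(code_points, utf32_bytes, base_count, len(composed))
```

So even in a fixed-width encoding form, the same displayed character can be one code point or two depending on normalization.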

Arne