[Info-vax] 8-bit characters

Craig A. Berry craigberry at nospam.mac.com
Thu Nov 11 14:35:55 EST 2021


On 11/11/21 12:53 PM, Arne Vajhøj wrote:
> On 11/11/2021 1:17 PM, Craig A. Berry wrote:
>> On 11/11/21 10:21 AM, Arne Vajhøj wrote:
>>> On 11/10/2021 11:48 PM, Lawrence D’Oliveiro wrote:
>>>> On Thursday, November 11, 2021 at 3:33:33 PM UTC+13, Arne Vajhøj wrote:
>>>>> The biggest problem with UTF-8 is that the byte length is not
>>>>> necessarily the character length ...
>>>>
>>>> That would be true of any Unicode encoding, even UCS-4.
>>>
>>> No.
>>>
>>> It is a practical problem in UTF-8 as everything not in ASCII is more 
>>> than 1 byte.
>>>
>>> It is a theoretical problem in UTF-16 because there are defined Unicode
>>> code points that become more than 2 bytes (they are just extremely
>>> rare).
>>>
>>> It is not a problem for UTF-32 as everything is 4 bytes.
>>
>> Back when it was called UCS-4, I think that was true.  But as far as I
>> know, all the ones with UTF in the name are varying width.  I think
>> there are a couple of emojis that take more than 4 bytes and would need
>> two UTF-32 chunks to represent a single character.
> 
> A few quotes from the standard:
> 
> <quote>
> In the Unicode Standard, the codespace consists of the integers from 0
> to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
> the repertoire of abstract characters.
> </quote>
> 
> <quote>
> Each Unicode code point is represented directly by a single 32-bit
> code unit. Because of this, UTF-32 has a one-to-one relationship
> between encoded character and code unit; it is a fixed-width character
> encoding form.
> </quote>

Hmm.  You're right.  For some reason I had thought they'd blown the
4-byte limit with emojis, but it doesn't seem UTF-32 has any provision
for surrogate pairs.
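
For anyone who wants to check the sizes directly, here is a minimal
sketch in Python (my choice of language, nothing VMS-specific). The
sample code points and the helper name encoded_sizes are just
illustrations; the "-le" codec variants are used so that Python does
not prepend a byte order mark to the output.

```python
# Encoded size in bytes of single code points under each encoding form.
# The "-le" codecs avoid the byte order mark that plain "utf-16" and
# "utf-32" would add in Python.
CODECS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes, utf32_bytes) for the string ch."""
    return tuple(len(ch.encode(codec)) for codec in CODECS)

samples = [
    "A",            # U+0041, ASCII
    "\u00f8",       # U+00F8, LATIN SMALL LETTER O WITH STROKE
    "\u20ac",       # U+20AC, EURO SIGN (inside the BMP)
    "\U0001F600",   # U+1F600, an emoji outside the BMP
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The emoji needs two 16-bit code units (a surrogate pair) in UTF-16 but
still fits in one 32-bit code unit in UTF-32, since the codespace stops
at 10FFFF.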

>>                                                  But even if the
>> encoding is not varying width, the number of characters displayed might
>> not match the number of code points because of things like combining
>> characters.
> 
> Display is another issue - a way more complex issue.
> 
> Arne
> 
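
The combining-character point can be seen with a small Python sketch.
The helper displayed_estimate is a hypothetical name and only a rough
approximation: it skips code points with a nonzero combining class,
which is not the full grapheme cluster segmentation that real display
code needs.

```python
import unicodedata

composed = "\u00e9"      # e with acute as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def displayed_estimate(s):
    """Rough count of displayed characters: skip combining marks.

    This is an approximation for illustration only, not proper
    grapheme cluster segmentation.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

for s in (composed, decomposed):
    print(len(s),                        # number of code points
          len(s.encode("utf-32-le")),    # bytes in UTF-32
          displayed_estimate(s))         # estimated displayed characters
```

Both strings render as a single character, yet the second one is two
code points and eight bytes even in fixed-width UTF-32.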



