[Info-vax] 8-bit characters
Craig A. Berry
craigberry at nospam.mac.com
Thu Nov 11 14:35:55 EST 2021
On 11/11/21 12:53 PM, Arne Vajhøj wrote:
> On 11/11/2021 1:17 PM, Craig A. Berry wrote:
>> On 11/11/21 10:21 AM, Arne Vajhøj wrote:
>>> On 11/10/2021 11:48 PM, Lawrence D’Oliveiro wrote:
>>>> On Thursday, November 11, 2021 at 3:33:33 PM UTC+13, Arne Vajhøj wrote:
>>>>> The biggest problem with UTF-8 is that the byte length is not
>>>>> necessarily the character length ...
>>>>
>>>> That would be true of any Unicode encoding, even UCS-4.
>>>
>>> No.
>>>
>>> It is a practical problem in UTF-8 as everything not in ASCII is more
>>> than 1 byte.
>>>
>>> It is a theoretical problem in UTF-16 because there are defined Unicode
>>> code points that become more than 2 bytes (they are just extremely
>>> rare).
>>>
>>> It is not a problem for UTF-32 as everything is 4 bytes.
>>
>> Back when it was called UCS-4, I think that was true. But as far as I
>> know, all the ones with UTF in the name are varying width. I think
>> there are a couple of emojis that take more than 4 bytes and would need
>> two UTF-32 chunks to represent a single character.
>
> A few quotes from the standard:
>
> <quote>
> In the Unicode Standard, the codespace consists of the integers from 0
> to 10FFFF₁₆, comprising 1,114,112 code points available for assigning
> the repertoire of abstract characters.
> </quote>
>
> <quote>
> Each Unicode code point is represented directly by a single 32-bit
> code unit. Because of this, UTF-32 has a one-to-one relationship
> between encoded character and code unit; it is a fixed-width character
> encoding form.
> </quote>
Hmm. You're right. For some reason I had thought they'd blown the
4-byte limit with emojis, but it doesn't seem UTF-32 has any provision
for surrogate pairs.
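
A minimal sketch in Python (not something posted in this thread) of how
the three encoding forms differ in code units per code point. The helper
name code_units is made up for illustration; only the standard str.encode
codecs are used.

```python
def code_units(text):
    """Count code points and code units of text in each encoding form."""
    return {
        "code points": len(text),
        "UTF-8 bytes": len(text.encode("utf-8")),
        # The -le codecs emit no byte order mark, so the division is exact.
        "UTF-16 units": len(text.encode("utf-16-le")) // 2,
        "UTF-32 units": len(text.encode("utf-32-le")) // 4,
    }

# ASCII letter, Latin-1 letter, BMP symbol, and an emoji outside the BMP.
for sample in ("A", "\u00f8", "\u20ac", "\U0001F600"):
    print(f"U+{ord(sample):04X}", code_units(sample))
```

The emoji U+1F600 should come out as 4 UTF-8 bytes, 2 UTF-16 units (a
surrogate pair), and a single UTF-32 unit, which matches the quoted
definition of UTF-32 as fixed width.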
>> But even if the
>> encoding is not varying width, the number of characters displayed might
>> not match the number of code points because of things like combining
>> characters.
>
> Display is another issue - a way more complex issue.
>
> Arne
>
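
For the combining-character point, a rough sketch in Python, again not
from the thread. The helper approx_display_chars is hypothetical and only
skips code points with a nonzero combining class; real grapheme cluster
segmentation is more involved than this.

```python
import unicodedata

def approx_display_chars(text):
    """Rough count of displayed characters: skip combining marks.

    This is an approximation, not full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single U+00E9

for label, s in (("decomposed", decomposed), ("composed", composed)):
    utf32_units = len(s.encode("utf-32-le")) // 4
    print(label, "UTF-32 units:", utf32_units,
          "approx. displayed:", approx_display_chars(s))
```

Even in UTF-32 the decomposed form takes two code units for what displays
as one character, so fixed width per code point does not give fixed width
per displayed character.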