[Info-vax] 8-bit characters

Craig A. Berry craigberry at nospam.mac.com
Thu Nov 11 13:17:53 EST 2021


On 11/11/21 10:21 AM, Arne Vajhøj wrote:
> On 11/10/2021 11:48 PM, Lawrence D’Oliveiro wrote:
>> On Thursday, November 11, 2021 at 3:33:33 PM UTC+13, Arne Vajhøj wrote:
>>> The biggest problem with UTF-8 is that the byte length is not
>>> necessarily the character length ...
>>
>> That would be true of any Unicode encoding, even UCS-4.
> 
> No.
> 
> It is a practical problem in UTF-8 as everything not in ASCII is more 
> than 1 byte.
> 
> It is a theoretical problem in UTF-16 because there are defined Unicode
> code points that become more than 2 bytes (they are just extremely
> rare).
> 
> It is not a problem for UTF-32 as everything is 4 bytes.

Back when it was called UCS-4, that was true, and as far as I know it
still holds for UTF-32: Unicode stops at U+10FFFF, so every code point
fits in one 4-byte unit.  Of the ones with UTF in the name, UTF-8 and
UTF-16 are varying width.  But a number of emojis (flags, skin-tone
variants) are sequences of more than one code point and so need two or
more UTF-32 units to represent a single displayed character.  So even
if the encoding is not varying width, the number of characters
displayed might not match the number of code points because of things
like combining characters.
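
A rough Python sketch of the byte-length versus code-point versus
displayed-character distinction follows.  The helper names
(encoded_sizes, count_without_combining_marks) are made up for
illustration, and skipping combining marks is only an approximation:
proper counting of displayed characters needs grapheme cluster
segmentation, which the standard library does not provide.

```python
import unicodedata


def encoded_sizes(text):
    """Return the code point count and the byte length in each encoding.

    The "-le" codec variants are used so that no byte order mark is
    added to the output and the byte counts reflect the text alone.
    """
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


def count_without_combining_marks(text):
    """Approximate the displayed length by skipping combining marks.

    This is an assumption-laden shortcut: it handles 'e' + COMBINING
    ACUTE ACCENT, but not flags or other multi-code-point emoji.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


samples = {
    "ascii letter": "a",
    "o with stroke": "\u00f8",
    "emoji, one code point": "\U0001F600",
    "e + combining acute": "e\u0301",
    "flag, two regional indicators": "\U0001F1E9\U0001F1F0",
}

for label, text in samples.items():
    print(label, encoded_sizes(text), count_without_combining_marks(text))
```

In every row utf32_bytes is exactly 4 times code_points, while the
last two samples each display as one character but hold two code
points.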


