[Info-vax] Does OpenVMS Use Unicode?
Johnny Billquist
bqt at softjar.se
Mon Jun 13 07:33:35 EDT 2016
On 2016-06-13 13:04, Jan-Erik Soderholm wrote:
> On 2016-06-13 at 12:30, lawrencedo99 at gmail.com wrote:
>> On Monday, June 13, 2016 at 10:15:54 PM UTC+12, Jan-Erik Soderholm wrote:
>>
>>> Python uses 7-bit for its basic "string" data type.
>>
>> Python strings are Unicode
>> <https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals>.
>>
>>
>
> Yes, but in Unicode, you can not encode single byte characters in the
> "upper" half of the 8 bit space. Anything in the "extended ASCII" part
> will be multi byte characters in Unicode. So characters like those in
> the example (that you snipped) will create errors in (some parts of)
> Python.
Uh. You are confusing Unicode with the encoding of Unicode characters in
UTF-8. The first 256 code points in Unicode are identical to ISO 8859-1.
However, if you choose to encode your string using UTF-8 then yes, UTF-8
can only encode the low 128 code points as a single byte.

The rest will require multiple bytes. But that is not Unicode itself,
just the UTF-8 encoding. If you instead use UTF-16, for example, then you
can definitely encode each of the first 256 characters as a single 16-bit
word, along with a whole lot more characters.
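
To make the difference between a code point and its encodings concrete,
here is a minimal sketch, assuming Python 3. The variable names are only
for illustration; the codec names ("latin-1", "utf-8", "utf-16-be") are
standard Python codecs.

```python
# U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE) is a single code point,
# whatever bytes end up representing it.
ch = "\u00e5"

code_point = ord(ch)                   # 0xE5, the same value as in ISO 8859-1
latin1_bytes = ch.encode("latin-1")    # one byte: b'\xe5'
utf8_bytes = ch.encode("utf-8")        # two bytes: b'\xc3\xa5'
utf16_bytes = ch.encode("utf-16-be")   # one 16-bit word, no BOM: b'\x00\xe5'
ascii_utf8 = "a".encode("utf-8")       # code points below 128 stay one byte

print(hex(code_point), latin1_bytes, utf8_bytes, utf16_bytes, ascii_utf8)
```

The string is the same one-character string in every case; only the
byte representation chosen at encode time differs.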
Unicode itself does not have encodings. It's a code set with lots and lots
of characters. As mentioned, the code points range from 0x0 to 0x10FFFF.
Exactly how you represent them in memory is a different story. There
have been many different encoding schemes... The most obvious and easy
one is to just use 32 bits for each character (UTF-32). But that is a bit
wasteful most of the time...
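
The size trade-off can be checked with a short sketch, assuming Python 3.3
or later (where sys.maxunicode is always 0x10FFFF). The sample text and
variable names are made up for illustration.

```python
import sys

MAX_CODE_POINT = 0x10FFFF
# On Python 3.3+ the interpreter covers the full Unicode range.
full_range = (sys.maxunicode == MAX_CODE_POINT)

text = "abc"
utf32_size = len(text.encode("utf-32-be"))  # fixed 4 bytes per character
utf8_size = len(text.encode("utf-8"))       # 1 byte each for plain ASCII

# The highest code point still fits in 32 bits; the other encodings
# need a multi-unit sequence for it.
top = chr(MAX_CODE_POINT)
top_utf32 = top.encode("utf-32-be")
top_utf16 = top.encode("utf-16-be")         # surrogate pair, two 16-bit words
top_utf8 = top.encode("utf-8")              # four bytes

print(full_range, utf32_size, utf8_size, top_utf32, top_utf16, top_utf8)
```

For mostly-ASCII text the fixed-width form uses four times the space of
UTF-8, which is the waste referred to above.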
And Python is using Unicode, as illustrated by your code. What do you
think u'\xe5' means?
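
For reference, a small sketch of what that literal is, assuming Python 3
(the u prefix is accepted there from 3.3 on). It also shows one place an
error can come from: decoding the raw byte 0xE5 as if it were UTF-8.

```python
import unicodedata

s = u'\xe5'                      # a text string holding one code point
length = len(s)                  # 1 character, not 1 byte
name = unicodedata.name(s)       # official Unicode character name

raw = b'\xe5'                    # a byte string: one byte, no character yet
as_latin1 = raw.decode("latin-1")  # interpreted as ISO 8859-1, gives U+00E5

# The same byte on its own is not valid UTF-8, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(length, hex(ord(s)), name, as_latin1 == s, utf8_failed)
```

The failure is about picking the wrong encoding for the bytes, not about
the string type being unable to hold the character.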
Johnny