[Info-vax] Does OpenVMS Use Unicode?
Johnny Billquist
bqt at softjar.se
Mon Jun 13 05:58:01 EDT 2016
On 2016-06-13 11:27, Neil Rieck wrote:
> On Monday, June 13, 2016 at 3:49:32 AM UTC-4, Johnny Billquist wrote:
>> On 2016-06-13 03:58, lawrencedo99 at gmail.com wrote:
>>> Funny story about Unicode: initially it was going to be a 16-bit code, sufficient to cover all the world’s *current* writing systems. Then I guess its architects got ambitious, and decided to add in all the *historical* writing systems as well. So nowadays it is officially a 20-bit code. But who knows how much more it might grow in future?
>>
>> No, it's not officially a 20 bit character coding.
>> Its range is actually 0x0 to 0x10FFFF. It's rather weird, but it's
>> fixed, and I don't think it will be extended from that.
>> But there is plenty of free space left, so they can continue to make
>> life miserable for many years to come even with the current definition.
>>
>> Fun story #2: The Unicode book has hieroglyphs on its cover, but
>> it wasn't until Unicode V5.2 in 2009 that you finally had a Unicode
>> encoding of Egyptian hieroglyphs. So for a long time, the book about
>> Unicode, which aimed at encoding all types of writing, used a text on the
>> front page that was not possible to encode in Unicode.
>>
>> Johnny
>>
>> --
>> Johnny Billquist || "I'm on a bus
>> || on a psychedelic trip
>> email: bqt at softjar.se || Reading murder books
>> pdp is alive! || tryin' to stay hip" - B. Idol
>
> Not wanting to be pedantic, but we programmers must always remember that there is a huge difference between Unicode and UTF-8 (which is one "Unicode encoding" of many). IIRC, UTF-8 contains the 0x10FFFF limit just mentioned and that is what Johnny was referring to.
Unicode defines the range 0x0 to 0x10FFFF. It has nothing to do with
the encoding. UTF-8 could in theory encode a bit string of any size.
So no, I was not referring to any limit in UTF-8, but the actual
definition of Unicode.
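
A minimal Python sketch of that distinction, assuming a stock Python 3
interpreter (the helper name is_valid_code_point is made up for
illustration): the upper bound is enforced on the code point itself, and
the same code point can then be written out in several encodings.

```python
# The limit lives in Unicode, not in any particular encoding.
MAX_CODE_POINT = 0x10FFFF

def is_valid_code_point(n):
    """Return True if Python accepts n as a Unicode code point."""
    try:
        chr(n)
        return True
    except ValueError:
        return False

top = chr(MAX_CODE_POINT)  # the highest code point Unicode defines

# One code point, three different byte representations.
encoded = {name: top.encode(name)
           for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(name, data.hex(), len(data), "bytes")

print(is_valid_code_point(0x10FFFF))  # inside the range
print(is_valid_code_point(0x110000))  # one past the end
```

All three encodings need four bytes for this code point, but the byte
values differ, which is the point: the range is a property of Unicode,
the bytes are a property of the encoding.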
> What is worse is this: in the OpenVMS world you will find many instances of people storing ASCII, ISO-8859-1, and Windows-1252 (sometimes called ANSI; it puts printable characters in the 0x80-0x9F range that ISO-8859-1 leaves to control codes; the Euro symbol first springs to mind). This all works properly until you start programming for browsers or email. In fact, you might think things are working properly for years until you try to run the stuff through an XML parser which will stop dead in its tracks whenever you declare UTF-8 like this:
> <?xml version="1.0" encoding="UTF-8"?>
> but send any unencoded 8-bit data from ISO-8859-1 or Windows-1252.
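
The failure Neil describes can be reproduced with a short Python sketch.
It assumes the standard library's xml.etree.ElementTree parser; the
element name "name", the sample text and the helper try_parse are made
up for illustration.

```python
import xml.etree.ElementTree as ET

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'
TEXT = "caf\u00e9"  # "cafe" with an acute accent on the e

# Same declaration, two different byte representations of the text.
latin1_doc = HEADER + b"<name>" + TEXT.encode("iso-8859-1") + b"</name>"
utf8_doc = HEADER + b"<name>" + TEXT.encode("utf-8") + b"</name>"

def try_parse(doc):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as exc:
        print("parse failed:", exc)
        return None

print(try_parse(utf8_doc))    # parses: the bytes match the declaration
print(try_parse(latin1_doc))  # rejected: a lone 0xE9 byte is not valid UTF-8
```

The only difference between the two documents is one byte sequence
(0xE9 versus 0xC3 0xA9), which is why such data can sit unnoticed for
years until a strict parser sees it.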
This is partly because of broken HTTP standards. If you say that a page
is encoded in ISO-8859-1, in reality it means that the browser should
interpret it as Windows-1252. This is incredibly stupid, and happened
because so many Windows developers abused the system in this way that
someone felt compelled to make it standard.
(It's all in some RFC that I could dig up if needed. I discovered this
while writing the web server for RSX.)
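
To see what that substitution does to actual bytes, here is a small
Python sketch. It uses Python's own "iso-8859-1" and "cp1252" codecs as
stand-ins for what a browser does, which is an assumption; the three
sample byte values are chosen only for illustration.

```python
# Three bytes from the 0x80-0x9F range, where the two tables disagree.
raw = bytes([0x80, 0x93, 0x94])

as_latin1 = raw.decode("iso-8859-1")  # C1 control characters
as_cp1252 = raw.decode("cp1252")      # Euro sign and curly quotes

print([hex(ord(c)) for c in as_latin1])
print([hex(ord(c)) for c in as_cp1252])

# From 0xA0 upwards both tables give identical results.
same_upper_half = all(
    bytes([b]).decode("iso-8859-1") == bytes([b]).decode("cp1252")
    for b in range(0xA0, 0x100)
)
print(same_upper_half)
```

So a page that really is ISO-8859-1 and never uses the 0x80-0x9F bytes
renders the same either way; the trouble only shows up for bytes in that
range.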
So, if you have a web page that is encoded in ISO-8859-1, I suggest you
instead claim that it's using ISO-8859-15, which gets things right again.
The main difference between 8859-1 and 8859-15 is that the generic
currency symbol in 8859-1 was replaced by the Euro symbol in 8859-15.
Seven other rarely used symbols (¦ ¨ ´ ¸ ¼ ½ ¾) were also replaced by
the letters Š š Ž ž Œ œ Ÿ. Everything else is the same.
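
One way to check how the two tables differ is to decode every byte of
the upper half with both codecs and list the disagreements. This is a
sketch using Python's built-in "iso-8859-1" and "iso-8859-15" codecs;
the function name codec_differences is made up for illustration.

```python
def codec_differences(a, b, byte_values=range(0xA0, 0x100)):
    """Map each byte that decodes differently under a and b to both results."""
    diffs = {}
    for value in byte_values:
        char_a = bytes([value]).decode(a)
        char_b = bytes([value]).decode(b)
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs

diffs = codec_differences("iso-8859-1", "iso-8859-15")

for value, (old, new) in sorted(diffs.items()):
    print(f"0x{value:02X}: U+{ord(old):04X} -> U+{ord(new):04X}")
```

Byte 0xA4 is the currency sign position that became the Euro sign.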
> p.s. note that XML properly calls this "encoding" but HTML refers to this as "chrset" which can be misleading. BTW, whenever any popular browser encounters an ISO-8859 declaration it pretends it saw Windows-1252.
Right. Well, HTTP calls it "charset", not "chrset", but otherwise you
got it.
See above for the ISO-8859-1, Windows-1252...
Johnny