[Info-vax] Does OpenVMS Use Unicode?

Mon Jun 13 05:27:01 EDT 2016

On Monday, June 13, 2016 at 3:49:32 AM UTC-4, Johnny Billquist wrote:
> On 2016-06-13 03:58, lawrencedo99 at gmail.com wrote:
> > Funny story about Unicode: initially it was going to be a 16-bit code, sufficient to cover all the world’s *current* writing systems. Then I guess its architects got ambitious, and decided to add in all the *historical* writing systems as well. So nowadays it is officially a 20-bit code. But who knows how much more it might grow in future?
> 
> No, it's not officially a 20 bit character coding.
> It's range is actually 0x0 to 0x10FFFF. It's rather weird, but it's 
> fixed, and I don't think it will be extended from that.
> But there are plenty of free space left, so they can continue to make 
> life miserable for many years to come even with the current definition.
> 
> Fun story #2: The Unicode book have Hieroglyphs on the face of it, but 
> it wasn't until Unicode V5.2 in 2009 that you finally had Unicode 
> encoding of Egyptian Hieroglyphs. So for a long time, the book about 
> Unicode, which aimed at encoding all type of writing, used a text on the 
> front page that was not possible to encode in Unicode.
> 
> 	Johnny
> 
> -- 
> Johnny Billquist                  || "I'm on a bus
>                                    ||  on a psychedelic trip
> email: bqt at softjar.se             ||  Reading murder books
> pdp is alive!                     ||  tryin' to stay hip" - B. Idol

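Not wanting to be pedantic, but we programmers must always remember that there is a huge difference between Unicode (the character set, which assigns a numeric code point to each character) and UTF-8 (which is one "Unicode encoding" of many). IIRC, the 0x10FFFF limit Johnny mentioned belongs to Unicode itself: it is the highest code point that UTF-16 surrogate pairs can reach, and UTF-8 has been restricted to the same range since RFC 3629.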
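
To make the distinction concrete, here is a minimal sketch in Python (my choice for illustration only; nothing in it is OpenVMS-specific, and the variable names are made up). It shows one code point serialized three different ways, and the top of the code space at U+10FFFF:

```python
# One code point (Unicode), three different byte encodings of it.
euro = "\u20ac"                      # EURO SIGN, code point U+20AC

utf8 = euro.encode("utf-8")          # 3 bytes: e2 82 ac
utf16 = euro.encode("utf-16-be")     # 2 bytes: 20 ac
utf32 = euro.encode("utf-32-be")     # 4 bytes: 00 00 20 ac

# The ceiling of the code space is U+10FFFF, whatever the encoding.
top = chr(0x10FFFF)
top_utf8 = top.encode("utf-8")       # 4 bytes: f4 8f bf bf
top_utf16 = top.encode("utf-16-be")  # surrogate pair: db ff df ff

try:
    chr(0x110000)                    # one past the limit
    beyond_limit_ok = True
except ValueError:
    beyond_limit_ok = False          # Python refuses: not a Unicode code point
```

The code point is the same in every case; only the bytes on the wire differ.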

There are internationalization routines built into OpenVMS which can do character conversions as well as timezone stuff (something else which was tacked onto OpenVMS rather than being built in), but I have found them woefully inadequate. Check out this internationalization demo:
http://www3.sympatico.ca/n.rieck/demo_vms_html/internationalization_demo_101_c.html

What is worse is this: in the OpenVMS world you will find many instances of people storing ASCII, ISO-8859-1, and Windows-1252 (sometimes called ANSI; it assigns printable characters to most of the 32 positions 0x80-0x9F, which ISO-8859-1 reserves for control codes; the Euro symbol first springs to mind). This all works properly until you start programming for browsers or email. In fact, you might think things are working properly for years until you try to run the stuff through an XML parser, which will stop dead in its tracks whenever you declare UTF-8 like this
<?xml version="1.0" encoding="UTF-8"?>
but then send raw 8-bit data (any byte above 0x7F) from ISO-8859-1 or Windows-1252 without converting it first.
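
The failure is easy to reproduce. This is only a sketch using Python's standard-library parser, not the parser from my OpenVMS code; the helper name `parses` and the element name are invented for the example:

```python
import xml.etree.ElementTree as ET

PROLOG = b'<?xml version="1.0" encoding="UTF-8"?>'

def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

text = "caf\u00e9"   # "cafe" with an acute accent on the e

# Raw ISO-8859-1 bytes behind a UTF-8 declaration: the lone 0xE9 byte
# is not a valid UTF-8 sequence, so the parser gives up.
bad = PROLOG + b"<name>" + text.encode("iso-8859-1") + b"</name>"

# Same text really encoded as UTF-8, so it matches the declaration.
good = PROLOG + b"<name>" + text.encode("utf-8") + b"</name>"
```

Here `parses(bad)` is False and `parses(good)` is True, even though both documents "look" identical when viewed on a terminal set to the matching character set.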

p.s. Note that XML properly calls this "encoding" but HTML refers to it as "charset", which can be misleading. BTW, whenever any popular browser encounters an ISO-8859-1 declaration it pretends it saw Windows-1252.
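
The two character sets only disagree in the 0x80-0x9F range, which a short sketch can show. This assumes Python's `cp1252` codec as the stand-in for Windows-1252; other implementations may treat the unassigned bytes differently, and the helper name is hypothetical:

```python
# Bytes 0x80-0x9F: control codes in ISO-8859-1, mostly printable in Windows-1252.
euro_byte = b"\x80"

as_latin1 = euro_byte.decode("iso-8859-1")   # U+0080, an invisible C1 control
as_cp1252 = euro_byte.decode("cp1252")       # U+20AC, the Euro sign

def cp1252_assigned(byte_value: int) -> bool:
    """True if Python's cp1252 codec maps this byte to a character."""
    try:
        bytes([byte_value]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False

# Split the disputed range into bytes the codec knows and bytes it rejects.
assigned = [b for b in range(0x80, 0xA0) if cp1252_assigned(b)]
unassigned = [b for b in range(0x80, 0xA0) if not cp1252_assigned(b)]
```

With this codec, 27 of the 32 positions are assigned and five (0x81, 0x8D, 0x8F, 0x90, 0x9D) are not.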

Back to internationalization routines: many programmers rely upon their own functions to first convert from ISO-8859-1 or Windows-1252 to Unicode code points, then convert again to UTF-8 before transmission, which is what I had to do here:

http://www3.sympatico.ca/n.rieck/demo_vms_html/openvms_demo_index.html#utf-8

Of course you have to reverse everything whenever you wish to store single-byte data.
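
A rough sketch of that round trip, again in Python rather than the C used in my demo. The function names and the choice of "?" as a replacement for characters that do not fit are assumptions made for the example:

```python
def single_byte_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Stored single-byte data -> Unicode code points -> UTF-8 for the wire."""
    text = data.decode(source_encoding)      # step 1: bytes to code points
    return text.encode("utf-8")              # step 2: code points to UTF-8

def utf8_to_single_byte(data: bytes, target_encoding: str = "cp1252") -> bytes:
    """The reverse trip; characters with no single-byte form become '?'."""
    text = data.decode("utf-8")
    return text.encode(target_encoding, errors="replace")

stored = b"Price: 10\x80"                    # 0x80 is the Euro sign in Windows-1252
wire = single_byte_to_utf8(stored)           # Euro becomes e2 82 ac
back = utf8_to_single_byte(wire)             # identical to what was stored

# Two CJK characters cannot be stored in Windows-1252, so the trip is lossy.
lossy = utf8_to_single_byte("\u65e5\u672c".encode("utf-8"))
```

The reverse direction is where data gets lost: anything outside the single-byte repertoire has to be replaced, rejected, or escaped, and that is a policy decision the application must make.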

Neil Rieck
Waterloo, Ontario, Canada.