[Info-vax] Does OpenVMS Use Unicode?

Jan-Erik Soderholm jan-erik.soderholm at telia.com
Mon Jun 13 06:15:53 EDT 2016


On 2016-06-13 at 11:27, Neil Rieck wrote:
> On Monday, June 13, 2016 at 3:49:32 AM UTC-4, Johnny Billquist wrote:
>> On 2016-06-13 03:58, lawrencedo99 at gmail.com wrote:
>>> Funny story about Unicode: initially it was going to be a 16-bit
>>> code, sufficient to cover all the world’s *current* writing systems.
>>> Then I guess its architects got ambitious, and decided to add in all
>>> the *historical* writing systems as well. So nowadays it is
>>> officially a 20-bit code. But who knows how much more it might grow
>>> in future?
>>
>> No, it's not officially a 20-bit character coding. Its range is
>> actually 0x0 to 0x10FFFF. It's rather weird, but it's fixed, and I
>> don't think it will be extended from that. But there is plenty of
>> free space left, so they can continue to make life miserable for many
>> years to come even with the current definition.
>>
>> Fun story #2: The Unicode book has hieroglyphs on its cover, but
>> it wasn't until Unicode V5.2 in 2009 that you finally had a Unicode
>> encoding of Egyptian hieroglyphs. So for a long time, the book about
>> Unicode, which aimed at encoding all types of writing, used a text on
>> the front cover that was not possible to encode in Unicode.
>>
>> Johnny
>>
>> --
>> Johnny Billquist                  || "I'm on a bus
>> email: bqt at softjar.se             ||  on a psychedelic trip
>> pdp is alive!                     ||  Reading murder books
>>                                   ||  tryin' to stay hip" - B. Idol
>
> Not wanting to be pedantic, but we programmers must always remember that
> there is a huge difference between Unicode and UTF-8 (which is one
> "Unicode encoding" of many). IIRC, UTF-8 contains the 0x10FFFF limit
> just mentioned and that is what Johnny was referring to.
>
> There are internationalization routines built into OpenVMS which can do
> character conversions as well as timezone stuff (something else which
> was tacked onto OpenVMS rather than having built-in support) but I have
> found them woefully inadequate. Check out this internationalization
> demo:
> http://www3.sympatico.ca/n.rieck/demo_vms_html/internationalization_demo_101_c.html
>
> What is worse is this: in the OpenVMS world you will find many
> instances of people storing ASCII, ISO-8859-1, and Windows-1252
> (sometimes called ANSI; it implements 32 more characters than
> ISO-8859-1; the Euro symbol first springs to mind). This all works
> properly until you start programming for browsers or email. In fact, you
> might think things are working properly for years until you try to run
> the stuff through an XML parser which will stop dead in its tracks
> whenever you declare UTF-8 like this <?xml version="1.0"
> encoding="UTF-8"?> but send any unencoded 8-bit data from ISO-8859-1 or
> Windows-1252.
>
> P.S. Note that XML properly calls this "encoding" but HTML refers to
> this as "charset", which can be misleading. BTW, whenever any popular
> browser encounters an ISO-8859-1 declaration it pretends it saw
> Windows-1252.

We use the Python port to run our web applications. And Python
uses 7-bit for its basic "string" data type. So I simply made
a short function to change into the HTML variants like:

def html_esc(string):
   # 8-bit characters to HTML entities
   tmpx1 = string.replace(u'\xe5','&aring;')
   tmpx1 = tmpx1.replace(u'\xe4','&auml;')
   tmpx1 = tmpx1.replace(u'\xf6','&ouml;')
   tmpx1 = tmpx1.replace(u'\xf8','&oslash;')
   tmpx1 = tmpx1.replace(u'\xd8','&Auml;')
   tmpx1 = tmpx1.replace(u'\xc7','&nbsp;')
   # 7-bit Swedish national characters ([, ], \) to HTML entities
   tmpx1 = tmpx1.replace(u'[','&Auml;')
   tmpx1 = tmpx1.replace(u']','&Aring;')
   tmpx1 = tmpx1.replace(u'\\','&Ouml;')
   return tmpx1

Maybe there is something built into Python for this as well;
I do not know, and I never looked for it. This works OK.
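
For what it is worth, the standard library does seem to cover the
generic part of this. The sketch below is Python 3 and untested on the
OpenVMS port; html_esc_builtin is a hypothetical name, and it relies on
html.entities.codepoint2name (the same table lives in the
htmlentitydefs module in Python 2). It does not handle the 7-bit
national characters ([, ], \), and it leaves &, < and > alone.

```python
from html.entities import codepoint2name


def html_esc_builtin(text):
    """Replace every non-ASCII character with an HTML entity.

    Named entities are used where the standard table has one;
    anything else falls back to a numeric character reference.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                            # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # named entity
        else:
            out.append("&#%d;" % cp)                  # numeric reference
    return "".join(out)


# Shorter alternative when numeric references are acceptable:
numeric = "\u00e5\u00e4\u00f6".encode("ascii", "xmlcharrefreplace")
```

The encode(..., "xmlcharrefreplace") form needs no table at all, at the
cost of less readable output.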

Jan-Erik.



