[Info-vax] Does OpenVMS Use Unicode?

Johnny Billquist bqt at softjar.se
Wed Jun 15 06:59:34 EDT 2016


On 2016-06-15 12:15, Jan-Erik Soderholm wrote:
> Den 2016-06-15 kl. 11:42, skrev Johnny Billquist:
>> On 2016-06-15 08:38, lawrencedo99 at gmail.com wrote:
>>> On Tuesday, June 14, 2016 at 12:37:54 AM UTC+12, Stephen Hoffman wrote:
>>>
>>>> String descriptors, a very primitive and limited form of an object,
>>>> lack any sort of character encoding tag, and the file system similarly
>>>> lacks encoding-related metadata mechanisms.
>>>
>>> That’s not fatal. Anything that was formerly defined to hold ASCII bytes
>>> can simply be redefined to be UTF-8. Lots of things on other platforms
>>> have done that.
>>>
>>> Just so long as the code for handling it is 8-bit clean. :)
>>
>> Mostly true. Almost all string handling will keep working, with no
>> changes, if you suddenly decide that you use UTF-8.
>>
>> There are only a couple of cases when things break:
>> 1) Figuring out string lengths. The old assumption that one byte is one
>> character is no longer true.
>
> OK.
>
>> 2) String collating. The sort order of strings suddenly becomes very
>> complex, and you can no longer depend on just sorting by byte values.
>
> That has never worked for Swedish with "åäöÅÄÖ" anyway, even with
> 7-bit ASCII or DEC-MCS. Not unless the sort routines were specifically
> written to deal with it. The same goes for (simple) upper() and lower()
> functions.
>
> Hm, just tested and f$edit with "upcase" or "lowercase" *does* handle
> åäöÅÄÖ correctly, I didn't think so before... :-)
>
> But a simple SORT of a text file gets it wrong. It sorts Ä->Å->Ö
> while the correct order is Å->Ä->Ö.

Yeah... I know. :-)
And if you're in Germany, Ä and Ö are sorted differently than if you are
in Sweden. And Unicode makes it even more interesting, as it offers you
several ways to create an Ö (for example): as a single precomposed code
point, or as an O followed by a combining diaeresis. They end up as
different Unicode code point sequences, but they should be considered
equivalent. That is actually also true for simple ASCII characters.
Unicode has several code points that mean the same character (the
fullwidth forms, for example), and they should be considered equal from
a collating point of view.
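
To make that concrete, here is a small Python sketch using only the
standard unicodedata module. The helper names (canonical_equal,
compat_equal, swedish_key) are made up for illustration, and the Swedish
sort key is a deliberately simplified toy that only knows the 29 letters
of the Swedish alphabet. Real collation would use proper locale-aware
rules, so treat this as a sketch and not as how any particular system
does it.

```python
import unicodedata

# Two ways to write "Ö": one precomposed code point, or a plain "O"
# followed by a combining diaeresis.
precomposed = "\u00D6"
decomposed = "O\u0308"


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def compat_equal(a: str, b: str) -> bool:
    """Compare after compatibility normalization (NFKC), which also folds
    variants such as fullwidth letters onto their plain ASCII forms."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Toy Swedish alphabet (an assumption for this sketch): Å, Ä, Ö come
# after Z, in that order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(s: str):
    """Simplified sort key: normalize, fold case, then rank each character
    by its position in the Swedish alphabet. Unknown characters sort last,
    ordered by code point."""
    folded = unicodedata.normalize("NFC", s).casefold()
    key = []
    for ch in folded:
        pos = SWEDISH_ALPHABET.find(ch)
        key.append((pos, 0) if pos >= 0 else (len(SWEDISH_ALPHABET), ord(ch)))
    return key


if __name__ == "__main__":
    # Same character, different code point sequences and byte lengths.
    print(precomposed == decomposed)                  # raw comparison
    print(canonical_equal(precomposed, decomposed))   # after NFC
    print(len(precomposed), len(decomposed))          # code points
    print(len(precomposed.encode("utf-8")),
          len(decomposed.encode("utf-8")))            # bytes

    # Fullwidth "A" (U+FF21) versus plain ASCII "A".
    print(compat_equal("\uFF21", "A"))

    words = ["Ö", "Ä", "Å", "Z"]
    print(sorted(words))                   # plain code point order
    print(sorted(words, key=swedish_key))  # toy Swedish order
```

Note that the fullwidth case only compares equal under the compatibility
forms (NFKC/NFKD), not under plain canonical normalization, so which
normalization you pick matters.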

	Johnny



