[Info-vax] Does OpenVMS Use Unicode?
Jan-Erik Soderholm
jan-erik.soderholm at telia.com
Wed Jun 15 06:15:06 EDT 2016
On 2016-06-15 at 11:42, Johnny Billquist wrote:
> On 2016-06-15 08:38, lawrencedo99 at gmail.com wrote:
>> On Tuesday, June 14, 2016 at 12:37:54 AM UTC+12, Stephen Hoffman wrote:
>>
>>> String descriptors (a very primitive and limited form of an object)
>>> lack any sort of character encoding tag, and the file system similarly
>>> lacks encoding-related metadata mechanisms.
>>
>> That’s not fatal. Anything that was formerly defined to hold ASCII
>> bytes can simply be redefined to be UTF-8. Lots of things on other
>> platforms have done that.
>>
>> Just so long as the code for handling it is 8-bit clean. :)
>
> Mostly true. Almost all string handling will work just as well, with no
> changes, if you suddenly decide that you use UTF-8.
>
> There are only a couple of cases when things break:
> 1) Figuring out string lengths. The old assumption that one byte is one
> character is no longer true.
OK.
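To make the length point concrete, here is a small Python sketch (Python is just my choice for illustration, and the helper name is made up). It counts bytes and code points for the same text in Latin-1 and in UTF-8. Note that a code point count is still not always what a user would call a "character".

```python
def byte_and_char_counts(text: str, encoding: str) -> tuple[int, int]:
    """Return (bytes used in the given encoding, number of code points)."""
    return len(text.encode(encoding)), len(text)


sample = "åäöÅÄÖ"

# Latin-1 stores each of these letters in one byte.
print(byte_and_char_counts(sample, "latin-1"))

# UTF-8 needs two bytes for each of these letters, so the byte count
# no longer matches the character count.
print(byte_and_char_counts(sample, "utf-8"))
```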
> 2) String collating. The sorting order of strings suddenly becomes very
> complex, and you cannot depend on just sorting by byte values any more.
That has never worked for Swedish with "åäöÅÄÖ" anyway, even with
7-bit ASCII or DEC-MCS. Not unless the sort routines were specifically
written to deal with it. The same goes for (simple) upper() and lower()
functions.
Hm, just tested, and f$edit with "upcase" or "lowercase" *does* handle
åäöÅÄÖ correctly. I didn't expect that... :-)
But a simple SORT of a textfile gets it wrong. It sorts Ä->Å->Ö
while the correct order is Å->Ä->Ö.
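The SORT result can be reproduced outside VMS. The Python sketch below is only an illustration: the word list, the alphabet table and the swedish_key helper are my own inventions, and the key is a simplification that ignores the finer rules of real Swedish collation. It compares a plain byte-value sort with a sort that knows å, ä, ö come last in that order.

```python
# Simplified Swedish alphabet: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list[int]:
    """Sort key by position in the Swedish alphabet (case-insensitive).

    Characters outside the table sort after all letters, by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["Öl", "Ärta", "Åker"]

# Sorting on raw byte values (Latin-1 here; UTF-8 gives the same order
# for these words) puts Ä before Å.
by_bytes = sorted(words, key=lambda w: w.encode("latin-1"))

# Sorting with the alphabet-aware key puts Å before Ä.
by_alphabet = sorted(words, key=swedish_key)

print(by_bytes)
print(by_alphabet)
```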
> 3) String comparisons. If you compare two strings, they might actually be
> considered equal even though the byte values are totally different. This is
> a property of Unicode, but as such, it gets reflected in the storage of the
> bytes even if encoded as UTF-8 (this is actually the issue with point 2 as
> well).
>
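A short Python illustration of point 3, using "Å" as the example. The helper name canonically_equal is made up for this sketch, and NFC is just one of the Unicode normalization forms that could be used. The precomposed letter and the letter "A" followed by a combining ring display the same but are stored as different bytes.

```python
import unicodedata

composed = "\u00c5"     # Å as a single precomposed code point
decomposed = "A\u030a"  # A followed by COMBINING RING ABOVE


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The UTF-8 byte sequences differ, so a byte-wise compare says "not equal".
print(composed.encode("utf-8"), decomposed.encode("utf-8"))
print(composed == decomposed)

# After normalization the two spellings compare as equal.
print(canonically_equal(composed, decomposed))
```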
> So, for things that don't care about the actual content of a string, the
> current string descriptors will hold a UTF-8 encoded string just as well as
> a current Latin-1 string. No changes.
> For code that manipulates and examines strings, there are subtle problems.
>
> Johnny
>