[Info-vax] Does OpenVMS Use Unicode?

Wed Jun 15 07:56:14 EDT 2016

On Tuesday, June 14, 2016 at 2:08:27 PM UTC-4, Johnny Billquist wrote:
> On 2016-06-14 13:28, Neil Rieck wrote:
> > Not wanting to engage in a flame war, the following quote from a popular web site says it all:
> >
> > The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all five- and six-byte sequences, and almost half the four-byte sequences.
> 
> Where did you get that from? Unicode started out as a 16-bit character set.
> 
> Unicode was expanded beyond 16 bits only in 1996 by Unicode 2.0.
> 
> Are you confusing Unicode with ISO 10646 perhaps?
> 
> Oh, you might actually be reading the page on UTF-8. Maybe it's worth 
> repeating: UTF-8 is an encoding scheme. The character set is Unicode.
> 
> > ###
> >
> > This restricts UTF-8 (which is a unicode encoding) to a subset of the entire unicode map. BTW, there are large holes (called planes) in the unicode map which allow for future growth. But new codes will not appear in UTF-8 unless RFC-3629 is superseded.
> 
> But the Unicode "map" only covers 0x0 to 0x10FFFF anyway, so it's not a 
> subset. It's just that UTF-8 was "trimmed" to just cover what was needed 
> to encode all Unicode code points.
> 
> 	Johnny
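
The "almost half the four-byte sequences" figure in the quoted passage can be checked with a little arithmetic. The sketch below assumes the original four-byte layout (11110xxx followed by three 10xxxxxx bytes, i.e. 21 payload bits); the constant and function names are made up for illustration.

```python
# Original (pre-RFC 3629) four-byte UTF-8 layout:
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 3 + 6 + 6 + 6 = 21 payload bits.
FOUR_BYTE_MIN = 0x10000    # smallest value that needs four bytes
FOUR_BYTE_MAX = 0x1FFFFF   # largest value that fits in 21 bits
UNICODE_MAX = 0x10FFFF     # last Unicode code point (the RFC 3629 limit)


def four_byte_counts():
    """Return (total, kept, removed) counts of four-byte values."""
    total = FOUR_BYTE_MAX - FOUR_BYTE_MIN + 1
    kept = UNICODE_MAX - FOUR_BYTE_MIN + 1
    removed = total - kept
    return total, kept, removed


total, kept, removed = four_byte_counts()
print(f"four-byte values originally: {total:,}")
print(f"still valid after RFC 3629:  {kept:,}")
print(f"removed:                     {removed:,} ({removed / total:.1%})")
```

Everything that was cut lies above U+10FFFF, so no Unicode code point lost its UTF-8 encoding.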

We did our own research to prove that other organizations were sending us improperly declared data. Some still send us raw single-byte French characters (Windows-1252) straight out of their database but declare it as UTF-8; I suspect the UTF-8 declaration came from Microsoft boilerplate produced by a different author.

Much of our starting material came from here:

http://www.unicode.org/versions/Unicode8.0.0/ch01.pdf

quote: Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems. The Unicode Standard is code-for-code identical with International Standard ISO/IEC 10646. Any implementation that is conformant to Unicode is therefore conformant to ISO/IEC 10646. The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters. The majority of the common characters used in the major languages of the world are encoded in the first 65,536 code points ...
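
To make the numbers in that quote concrete, here is a small sketch using only Python's built-in codecs. It shows where 1,114,112 comes from (17 planes of 65,536 code points each) and how many bytes a few sample characters take in each of the three encoding forms. The helper name `encoded_sizes` is just for illustration.

```python
PLANES = 17                      # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536


def encoded_sizes(ch):
    """Return the byte length of one character as (UTF-8, UTF-16, UTF-32)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # explicit byte order, so no BOM is added
        len(ch.encode("utf-32-le")),  # explicit byte order, so no BOM is added
    )


print(f"{PLANES * CODE_POINTS_PER_PLANE:,} code points in total")

# ASCII letter, accented Latin letter, euro sign, and a character outside the BMP
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the last sample lies outside the first 65,536 code points, which is why it needs a surrogate pair (four bytes) in UTF-16 and four bytes in UTF-8.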

Our OpenVMS world: we store single-byte Windows-1252 characters in our RMS databases, but we make sure the data is mapped to Unicode and then encoded as UTF-8 before transmission. For people who send us bad data, we wrote our own routines to do best-effort detection and conversion.
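
For anyone curious what such a best-effort routine might look like, here is a minimal sketch in Python. It is not our production code, and the function name `to_utf8` is hypothetical. The idea is simply: if the bytes are already valid UTF-8, pass them through; otherwise assume they are Windows-1252, map them to Unicode, and re-encode.

```python
def to_utf8(raw: bytes) -> bytes:
    """Best-effort normalisation of incoming bytes to valid UTF-8.

    Assumption: anything that is not valid UTF-8 is Windows-1252.
    """
    try:
        raw.decode("utf-8")  # strict by default; raises on invalid sequences
        return raw           # already valid UTF-8 (pure ASCII also lands here)
    except UnicodeDecodeError:
        # Map the single-byte data to Unicode. The five byte values that
        # Windows-1252 leaves undefined become U+FFFD instead of raising.
        text = raw.decode("cp1252", errors="replace")
        return text.encode("utf-8")


# Mis-declared Windows-1252 input: 0xE9 is e-acute
print(to_utf8(b"Montr\xe9al"))
# Correctly encoded UTF-8 input is left untouched
print(to_utf8(b"Montr\xc3\xa9al"))
```

The heuristic is not foolproof: a Windows-1252 string whose high bytes happen to form a valid UTF-8 sequence would be passed through unchanged. In ordinary French text that combination is rare, which is why "best effort" is the right description.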

Neil Rieck
Waterloo, Ontario, Canada.