[Info-vax] Portable OpenVMS binary data format?

Stephen Hoffman seaohveh at hoffmanlabs.invalid
Tue Aug 7 11:58:36 EDT 2018


On 2018-08-06 21:27:50 +0000, John E said:

>> 
>> 
>>> The nicest example I can think of is that stats packages like SAS & 
>>> stata have data formats that are binary and portable and what's great 
>>> is not the efficiency of storage and transfer but the fact that the 
>>> whole thing is seamless b/c it preserves all the metadata along the way 
>>> so you never worry about specifying any data formats.
>> 
>> A fondness for premature optimizations, maybe?
> 
> Stephen,
> 
> I don't follow the "premature optimization" criticism at all here.  If 
> you have SAS or Stata on multiple platforms the fact that you can just 
> FTP binary data (including metadata) across the platforms and 
> read/write with no additional effort is very handy.

Okay.  More words.  You clearly do not understand this corner of 
Fortran.  You do not understand binary data and related debugging.  You 
also either do not understand network security, or you're willingly 
headed for trouble with some of the local assumptions of (in)security 
given the use of FTP.   Don't take that the wrong way, either.  We all 
started here, too.

Transferring data via removable media and files is a classic approach 
in computing.  For the older folks here, it's an updated form of 
sneakernet with floppies.  Here using a network.

To get this stuff working, a developer can...

...Assume that this is the first time this requirement has ever 
surfaced.  This is an approach used by some developers, whether 
experienced or inexperienced, and it's an approach that's endemic in 
some organizations and communities.  Sometimes because the folks don't 
know or don't think to look.  Or sometimes because they have a strong 
streak of what's sometimes known as Not Invented Here.  Sometimes 
because of organizational requirements.  Sometimes the folks are 
getting paid to write code and writing extra code is beneficial for the 
developer; the incentives are misaligned.  With most questions 
involving Fortran and OpenVMS, the general requirements are probably 
not new.  Which means there are existing approaches and options 
available for this porting work.  Marshaling and unmarshaling data is 
not an unusual task.

...If you are going to create the bespoke code, create simple code 
first, and get that working.  Leave the understanding of binary dumping 
(xxd, OpenVMS DUMP or otherwise) until later, and work with a data 
format you can directly read and write, not just with your applications 
but with text editors and other tools.  Build the app incrementally, 
and get it working in hunks.   Then tie those hunks together.  In this 
case, getting text transfers working will give you some benchmarks for 
transfer times and effort, and will let you know whether switching to 
binary will provide a significant performance gain.   It might.  Or it 
might not.
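
As a minimal sketch of that "text first" step: this assumes one 
floating-point value per line, and the function names and layout are 
illustrative assumptions rather than anything posted in this thread.  
Python is used only for brevity; the same shape applies to a pair of 
Fortran subroutines.

```python
import io

def write_text(values, stream):
    """Write one float per line; repr() round-trips Python floats exactly."""
    for v in values:
        stream.write(repr(float(v)) + "\n")

def read_text(stream):
    """Read one float per line back in, skipping blank lines."""
    return [float(line) for line in stream if line.strip()]

# Round-trip a few sample values through an in-memory text "file".
samples = [0.1, -2.5, 6.02214076e23, 1.0 / 3.0]
buf = io.StringIO()
write_text(samples, buf)
buf.seek(0)
restored = read_text(buf)
```

The output of something like this can be opened in any text editor, 
which is the point: it can be checked by eye before any binary format 
is involved.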

You have been exceedingly stingy at providing details such as the 
volume of data involved, which makes providing performance data more 
difficult.  Binary is usually a denser encoding than text, which means 
it'll usually transfer and process faster.  But how much faster depends 
on whether we're discussing thousands or millions or billions of 
values, for instance.
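
A rough way to put numbers on that, as a sketch only: it assumes 
8-byte IEEE 754 doubles for the binary case and one value per line for 
the text case, and both helper names are hypothetical.

```python
import struct

def text_size(values):
    """Bytes needed for one repr()-formatted value per line, newline included."""
    return sum(len(repr(float(v))) + 1 for v in values)

def binary_size(values):
    """Bytes needed for the same values packed as little-endian 8-byte doubles."""
    return len(struct.pack("<%dd" % len(values), *values))

# Compare the two encodings for a batch of full-precision values.
batch = [1.0 / 3.0] * 1000
print("text bytes:  ", text_size(batch))
print("binary bytes:", binary_size(batch))
```

Note that short values such as 1.0 take fewer bytes as text than as an 
8-byte double, so the ratio depends on the data as well as the volume.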

> E.g. you can process and whittle down a large Stata dataset on a huge 
> linux machine with tons of disk & memory, then download the smaller 
> dataset to your laptop for continued work.  Super convenient and 
> something that is pretty common nowadays with many cross platform apps. 
>  Not too relevant here as Stata is not on VMS at all and the SAS 
> version on VMS is really old (I think even more out of date than 
> Fortran on VMS).

Something to aspire to?  Well, that's a pretty bad design, actually.  
Primitive.  Clumsy.  I'd expect these and other tools — and the case 
you're working on here — would work as well with remote access to text 
or binary data over SSL, without having to transfer the files around.

> The CDF format you mention is a good example of this sort of thing 
> along with HDF.

Writing code for generic problems is something that was popular prior 
to the availability of the Internet and open source, but that 
re-invention-of-code effort tends to detract from what can be invested 
in solving your specific problems; in making your app better.   This is 
part of premature optimization: what you're spending your development 
time on.  Everybody has different interests and requirements, and yours 
may well involve learning arcane Fortran syntax and binary encoding.  
Or it might be working on or optimizing whatever processing is 
happening with these floating point numbers, or providing a better user 
interface or documentation or whatever.  But marshaling and 
unmarshaling is glue code.  Necessary, but not something that an 
operating system or a language should force a developer to get bogged 
down with.  OpenVMS is not good at keeping glue code to a minimum, 
unfortunately.

Now if your goal is to learn binary encoding, there are far easier 
tools than Fortran for that.  Though Fortran can work.  And if that's 
your goal, then go look at what the CDF code is doing, and how it is 
doing it.   Or look at SQLite as was mentioned earlier, for that 
matter.  That's some very nicely written C code, and with a completely 
gonzo test suite.  Lots of examples of marshaling and unmarshaling 
around.
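
For a sense of what leaning on an existing container looks like, here 
is a small sketch using the sqlite3 module from the Python standard 
library.  The table layout and the function names are made-up examples, 
and whether SQLite builds cleanly in any particular OpenVMS environment 
is not something this sketch addresses.

```python
import sqlite3

def save_samples(con, samples):
    """Store (name, units, value) rows; the table layout is a made-up example."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS samples (name TEXT, units TEXT, value REAL)"
    )
    con.executemany("INSERT INTO samples VALUES (?, ?, ?)", samples)
    con.commit()

def load_samples(con):
    """Return the rows in insertion order as a list of tuples."""
    cur = con.execute("SELECT name, units, value FROM samples ORDER BY rowid")
    return cur.fetchall()

# ":memory:" keeps the demo self-contained; a file path gives a file to transfer.
con = sqlite3.connect(":memory:")
rows = [("temp", "K", 273.15), ("mass", "kg", 1.0 / 3.0)]
save_samples(con, rows)
restored = load_samples(con)
con.close()
```

The names and units travel with the values, and the SQLite file format 
is documented as portable across platforms and byte orders, so the 
marshaling is somebody else's well-tested code.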

> As noted these tend to be oriented to newer unix fortran versions and 
> I've never spent the time to try and get them working on VMS.  (ASCII 
> has so far been the easier path)

UTF-8, not that OpenVMS supports much more than the ASCII subset of that.

ps: OpenVMS performance here will very likely trail that of the Linux 
environments, and for various reasons.

pps: You're also approaching this very much akin to the class of folks 
that want somebody to write their homework or to write their app for 
free.  That probably isn't the case here, but your whole approach is 
making helping you more effort.  You're not posting your Fortran code 
for folks to look at and debug (you've already received replies from 
folks that have written custom Fortran for you) and you're also being 
stingy about the details and requirements.   Asking questions is a 
skill, and one of the most important ones in development.  We all get 
to see that skill in practice, and from both sides.  Sometimes from 
both sides with the exact same person.

ppps: premature optimization?  Get the basic app working.  With text, 
to start with.  Then determine whether the improvements from switching 
to binary will be worth the effort.  Object-Oriented languages tend to 
help with these sorts of tasks, not that I haven't encountered 
spaghetti objects.  Or if you want to learn how to do this, get text 
working.  Then start replacing hunks with binary.  In an OO language, 
you'd rework the marshaling and unmarshaling code.  In Fortran, you'd 
rework the equivalent subroutines.  Pragmatically, these same routines 
are where you'd remove the file I/O calls and replace them with network 
calls.  REST via HTTPS, or SSL for a home-grown network protocol that 
eliminates the use of the file.  And yes, if FTP is in play here then 
so is REST via HTTP and cleartext via TCP.  The apps you're pointing to 
are using a fairly old and comfortable approach: a data export and 
import.  That's good for backups, but it's often going to be exporting 
duplicate data, and a file transfer is still involved.  Other approaches 
available here include accessing live data — that's what REST or ilk or 
an SSL connection can provide — and it often involves transferring or 
archiving duplicate data; transaction logs and deltas can be useful for 
some of these cases.  This also gets into transaction processing, if 
you're going to be tossing data around and prefer not to have some of 
it lost during a server or app or network glitch.  This is also all 
further along past your text- or binary-formatted file transfers, and 
the live data access is what SAS (for instance) is built to host and 
serve.
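
A sketch of what "rework the marshaling and unmarshaling code" can look 
like when the text and binary versions sit behind the same interface.  
The function names, the 4-byte count header and the little-endian IEEE 
754 layout are all assumptions made for illustration; both ends of a 
real transfer would have to agree on byte order and floating-point 
format.

```python
import struct

def marshal_text(values):
    """Encode floats as ASCII text, one value per line."""
    return "\n".join(repr(float(v)) for v in values).encode("ascii")

def unmarshal_text(payload):
    """Decode the text form back into a list of floats."""
    return [float(tok) for tok in payload.decode("ascii").split()]

def marshal_binary(values):
    """Encode as a 4-byte little-endian count followed by 8-byte doubles."""
    header = struct.pack("<I", len(values))
    return header + struct.pack("<%dd" % len(values), *values)

def unmarshal_binary(payload):
    """Decode the count header, then that many doubles."""
    (count,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from("<%dd" % count, payload, 4))

# The rest of the app only sees a (marshal, unmarshal) pair, so swapping
# text for binary later means changing this one line.
marshal, unmarshal = marshal_text, unmarshal_text
data = [0.25, -1.0, 1.0 / 3.0]
assert unmarshal(marshal(data)) == data
```

The same pair of routines is where a file write could later be replaced 
with a network send, without the callers changing.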

TL;DR: premature optimization...  start simple, get the various pieces 
working, build up, then figure out whether further optimizations are 
required.  Get some code that can be profiled for performance, and get 
the transfers timed.  Or figure out whether installing gigabit Ethernet 
might be a cheaper approach, for that matter.  For questions, post 
example or 
reproducer or problematic source code and software versions and 
relevant requirements such as scale and scope, and do what you can to 
make it easier on and faster for folks to answer your question.
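
For the "get it timed" part, a minimal sketch of a timing harness.  The 
helper name, the repeat count and the two toy encoders are assumptions 
for illustration; actual numbers will depend entirely on the data and 
the systems involved, and this measures encoding only, not the transfer 
itself.

```python
import struct
import time

def encode_text(values):
    """Toy text encoder: one repr()-formatted value per line."""
    return "\n".join(repr(float(v)) for v in values).encode("ascii")

def encode_binary(values):
    """Toy binary encoder: little-endian 8-byte doubles."""
    return struct.pack("<%dd" % len(values), *values)

def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Time both encoders on the same batch before deciding which to keep.
batch = [i / 7.0 for i in range(100000)]
print("text encode   (s):", time_call(encode_text, batch))
print("binary encode (s):", time_call(encode_binary, batch))
```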


-- 
Pure Personal Opinion | HoffmanLabs LLC 



