[Info-vax] Portable OpenVMS binary data format?
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Tue Aug 7 11:58:36 EDT 2018
On 2018-08-06 21:27:50 +0000, John E said:
>>
>>
>>> The nicest example I can think of is that stats packages like SAS &
>>> stata have data formats that are binary and portable and what's great
>>> is not the efficiency of storage and transfer but the fact that the
>>> whole thing is seamless b/c it preserves all the metadata along the way
>>> so you never worry about specifying any data formats.
>>
>> A fondness for premature optimizations, maybe?
>
> Stephen,
>
> I don't follow the "premature optimization" criticism at all here. If
> you have SAS or Stata on multiple platforms the fact that you can just
> FTP binary data (including metadata) across the platforms and
> read/write with no additional effort is very handy.
Okay. More words. You clearly do not understand this corner of
Fortran. You do not understand binary data and related debugging. You
also either do not understand network security, or you're willingly
headed for trouble with some of the local assumptions of (in)security
given the use of FTP. Don't take that the wrong way, either. We all
started here, too.
Transferring data via removable media and files is a classic approach
in computing. For the older folks here, it's an updated form of
sneakernet with floppies. Here using a network.
To get this stuff working, a developer can...
...Assume that this is the first time this requirement has ever
surfaced. This is an approach used by some developers, whether
experienced or inexperienced, and it's an approach that's endemic in
some organizations and communities. Sometimes because the folks don't
know or don't think to look. Or sometimes because they have a strong
streak of what's sometimes known as Not Invented Here. Sometimes
organizational requirements. Sometimes the folks are getting paid to
write code and writing extra code is beneficial for the developer; the
incentives are misaligned. With most questions involving Fortran and
OpenVMS, the general requirements are probably not new. Which means
there are existing approaches and options available for this porting
work. Marshaling and unmarshaling data is not an unusual task.
...If you are going to create the bespoke code, create simple code
first, and get that working. Leave the understanding of binary dumping
(xxd, OpenVMS DUMP or otherwise) until later, and work with a data
format you can directly read and write, not just with your applications
but with text editors and other tools. Build the app incrementally,
and get it working in hunks. Then tie those hunks together. In this
case, getting text transfers working will give you some benchmarks for
transfer times and effort, and will let you know whether switching to
binary will provide a significant performance gain. It might. Or it
might not.
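As a minimal sketch of that text-first step (in Python rather than
Fortran, purely for brevity; the function names are made up for
illustration), one value per line is about as simple as a format gets,
and the result can be checked with any text editor:

```python
# Hypothetical helpers: one floating point value per line, as plain text.

def marshal_text(values):
    # repr() gives a decimal string that converts back to the same
    # Python float, so nothing is lost in the round trip.
    return "".join(repr(v) + "\n" for v in values)

def unmarshal_text(text):
    # Skip blank lines so a trailing newline or a hand edit is harmless.
    return [float(line) for line in text.splitlines() if line.strip()]

samples = [0.1, -2.5, 1.0e-300, 6.02214076e23]
encoded = marshal_text(samples)

# The round trip is the first thing to get working and to test.
assert unmarshal_text(encoded) == samples
```

The equivalent in Fortran would be a pair of subroutines doing
formatted reads and writes. Once that works and is timed, there is
something concrete to compare any binary version against.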
You have been exceedingly stingy at providing details such as the
volume of data involved, which makes estimating performance more
difficult. Binary is a denser encoding than text, which means it'll
generally transfer and process faster. But how much faster depends on
whether we're discussing thousands or millions or billions of values,
for instance.
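To put rough numbers on that, a sketch along these lines (Python,
hypothetical helper names, and an arbitrary made-up data set) compares
the encoded sizes. It assumes IEEE 754 doubles for the binary side and
round-trip decimal text for the text side; real data and real formats
will differ:

```python
import struct

def text_size(values):
    # One value per line, full round-trip precision, ASCII encoded.
    return len("".join(repr(v) + "\n" for v in values).encode("ascii"))

def binary_size(values):
    # "<d" is a little-endian IEEE 754 double: 8 bytes per value.
    return len(struct.pack("<%dd" % len(values), *values))

# Arbitrary sample data, invented for this comparison.
values = [i / 7.0 for i in range(1000)]

print("text bytes:  ", text_size(values))
print("binary bytes:", binary_size(values))
```

For doubles at full precision the text tends to come out somewhere
around twice the size of the binary. That can matter at billions of
values, and is noise at thousands.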
> E.g. you can process and whittle down a large Stata dataset on a huge
> linux machine with tons of disk & memory, then download the smaller
> dataset to your laptop for continued work. Super convenient and
> something that is pretty common nowadays with many cross platform apps.
> Not too relevant here as Stata is not on VMS at all and the SAS
> version on VMS is really old (I think even more out of date than
> Fortran on VMS).
Something to aspire to? Well, that's a pretty bad design, actually.
Primitive. Clumsy. I'd expect these and other tools — and the case
you're working on here — would work as well with remote access to text
or binary data over SSL, without having to transfer the files around.
> The CDF format you mention is a good example of this sort of thing
> along with HDF.
Writing code for generic problems is something that was popular prior
to the availability of the Internet and open source, but that
re-invention-of-code effort tends to detract from what can be invested
in solving your specific problems; in making your app better. This is
part of premature optimization: it's about what you're spending your
development time on. Everybody has different interests and
requirements, and yours may well involve learning arcane Fortran syntax
and binary encoding. Or it might be working on or optimizing whatever
processing is happening with these floating point numbers, or providing
a better user interface or documentation or whatever. But marshaling
and unmarshaling is glue code. Necessary, but not something that an
operating system or a language should force a developer to get
bogged down with. OpenVMS is not good at keeping glue code to a
minimum, unfortunately.
Now if your goal is to learn binary encoding, there are far easier
tools than Fortran for that. Though Fortran can work. And if that's
your goal, then go look at what the CDF code is doing, and how it is
doing it. Or look at SQLite as was mentioned earlier, for that
matter. That's some very nicely written C code, and with a completely
gonzo test suite. Lots of examples of marshaling and unmarshaling
around.
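For a feel of what that marshaling code tends to look like, here is a
small sketch in Python. The layout (a "DEMO" signature, a version, a
count, then big-endian IEEE 754 doubles) is invented for this example
and is not the CDF, HDF or SQLite format. The point is the fixed byte
order and a header that carries the metadata:

```python
import struct

MAGIC = b"DEMO"                   # hypothetical 4-byte file signature
HEADER = struct.Struct(">4sHI")   # big-endian: magic, version, count

def marshal(values, version=1):
    # Header first, then the values as big-endian 8-byte doubles.
    header = HEADER.pack(MAGIC, version, len(values))
    body = struct.pack(">%dd" % len(values), *values)
    return header + body

def unmarshal(blob):
    magic, version, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a DEMO blob")
    if version != 1:
        raise ValueError("unsupported version %d" % version)
    # Check the length before trusting the count in the header.
    if len(blob) != HEADER.size + 8 * count:
        raise ValueError("truncated or oversized blob")
    return list(struct.unpack_from(">%dd" % count, blob, HEADER.size))

blob = marshal([1.5, -2.25, 3.0])
assert unmarshal(blob) == [1.5, -2.25, 3.0]
```

Spelling out the byte order (the ">" above) is what makes the bytes
mean the same thing on every platform. Most of the rest of the code is
checking that the input is what it claims to be.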
> As noted these tend to be oriented to newer unix fortran versions and
> I've never spent the time to try and get them working on VMS. (ASCII
> has so far been the easier path)
UTF-8, not that OpenVMS supports much beyond the ASCII subset of that.
ps: OpenVMS performance here will very likely trail that of the Linux
environments, and for various reasons.
pps: You're also approaching this very much akin to the class of folks
that want somebody to write their homework or to write their app for
free. That probably isn't the case here, but your whole approach is
making it more effort to help you. You're not posting your Fortran code
for folks to look at and debug (you've already received replies from
folks that have written custom Fortran for you) and you're also being
stingy about the details and requirements. Asking questions is a
skill, and one of the most important ones in development. We all get
to see that skill in practice, and from both sides. Sometimes from
both sides with the exact same person.
ppps: premature optimization? Get the basic app working. With text,
to start with. Then determine whether the improvements from switching
to binary will be worth the effort. Object-Oriented languages tend to
help with these sorts of tasks, not that I haven't encountered
spaghetti objects. Or if you want to learn how to do this, get text
working. Then start replacing hunks with binary. In an OO language,
you'd rework the marshaling and unmarshaling code. In Fortran, you'd
rework the equivalent subroutines. Pragmatically, these same routines
are where you'd remove the file I/O calls and replace them with network
calls. REST via HTTPS, or SSL for a home-grown network protocol that
eliminates the use of the file. And yes, if FTP is in play here then
so is REST via HTTP and cleartext via TCP. The apps you're pointing to
are using a fairly old and comfortable approach; a data export and
import. That's good for backups, but it's often going to be exporting
duplicate data, and a file transfer is involved. Other approaches
available here include accessing live data (that's what REST or its
ilk or an SSL connection can provide) and it often involves transferring
or archiving duplicate data; transaction logs and deltas can be useful for
some of these cases. This also gets into transaction processing, if
you're going to be tossing data around and prefer not to have some of
it lost during a server or app or network glitch. This is also all
further along past your text- or binary-formatted file transfers, and
the live data access is what SAS (for instance) is built to host and
serve.
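A rough sketch of that rework, in Python with made-up class and
function names: if the encoding lives behind one small interface and
the transport is just a byte stream, then swapping text for binary (or
a file for a network connection) touches only one piece. The
io.BytesIO buffer here is a stand-in for a file or socket, and the
binary layout is an assumption for the example:

```python
import io
import struct

class TextCodec:
    # One value per line, ASCII text.
    def encode(self, values):
        return "".join(repr(v) + "\n" for v in values).encode("ascii")

    def decode(self, data):
        return [float(s) for s in data.decode("ascii").splitlines()]

class BinaryCodec:
    # Big-endian 4-byte count, then big-endian 8-byte doubles.
    def encode(self, values):
        return struct.pack(">I%dd" % len(values), len(values), *values)

    def decode(self, data):
        (count,) = struct.unpack_from(">I", data, 0)
        return list(struct.unpack_from(">%dd" % count, data, 4))

def send(values, codec, stream):
    # The stream could be a file, a socket wrapper, or a buffer.
    stream.write(codec.encode(values))

def receive(codec, stream):
    return codec.decode(stream.read())

values = [0.5, -1.25, 1.0e10]
for codec in (TextCodec(), BinaryCodec()):
    buf = io.BytesIO()            # stand-in for a file or connection
    send(values, codec, buf)
    buf.seek(0)
    assert receive(codec, buf) == values
```

In Fortran the same split would be a pair of subroutines per format,
with the callers not caring which pair is linked in.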
TL;DR: premature optimization... start simple, get the various pieces
working, build up, then figure out whether further optimizations are
required. Get some code that can be profiled for performance, and
transfers timed. Or whether installing a gigabit Ethernet might be a
cheaper approach, for that matter. For questions, post example or
reproducer or problematic source code and software versions and
relevant requirements such as scale and scope, and do what you can to
make it easier on and faster for folks to answer your question.
--
Pure Personal Opinion | HoffmanLabs LLC