[Info-vax] Shadow Volume Copy

Michael Moroney moroney at world.std.spaamtrap.com
Mon Mar 30 13:37:59 EDT 2009


AEF <spamsink2001 at yahoo.com> writes:

>On Mar 29, 1:42 pm, moro... at world.std.spaamtrap.com (Michael Moroney)
>wrote:

>> I just checked.  The initial read does go to the random drive, and then
>> a datacheck IO is done to the other member(s), and if they are different,
>> then the disk identified as the master is used as a source to make all
>> members identical.

>A "random" drive? How is it selected? It has to choose one somehow,
>even if via a sequence of pseudorandom numbers. But I would have
>thought that it would be done via some other "randomness". Perhaps the
>alphanumerically first drive of the set? (Come to think of it, in a
>three-member shadow set, if the original "master" doesn't come up, the
>merge has to start with one of the other two. Well, it may as well be
>random then. Silly me for not thinking of that before.)

It's actually not random.  Which drive is used depends on the device's
queue length (shorter=faster) as well as the device type (biased toward the
faster), modified by SET DEVICE/READ_COST and SET DEVICE/SITE, and on whether
it's served to the node.  If all this is equal, I think it's a simple 1, 2,
3... rotation.  When you factor in all that can go on in a cluster, trying
to predict any of this in advance becomes nearly impossible, which is why I
said "random".
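
As a rough illustration of that kind of selection (this is not the shadow
driver's actual algorithm, just a sketch of the idea), the Python below picks
the member with the lowest effective cost and rotates among members that tie.
The names Member, ReadSelector, effective_cost and the two penalty constants
are hypothetical, and the weights are made up; the real driver's costs and
tie-breaking may differ.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One shadow set member as seen from the local node (hypothetical model)."""
    name: str
    queue_length: int       # outstanding I/Os; shorter is treated as faster
    read_cost: int          # lower is preferred; stands in for device type
                            # or a SET DEVICE/READ_COST override
    same_site: bool = True  # stands in for SET DEVICE/SITE matching this node
    served: bool = False    # True if the disk is served to this node


# Made-up weights, chosen only so that remote or served members lose
# to local, directly attached ones in this sketch.
SITE_PENALTY = 100
SERVED_PENALTY = 50


def effective_cost(m: Member) -> int:
    """Combine the factors into a single number; lower wins."""
    cost = m.read_cost + m.queue_length
    if not m.same_site:
        cost += SITE_PENALTY
    if m.served:
        cost += SERVED_PENALTY
    return cost


class ReadSelector:
    """Pick the cheapest member, rotating 1, 2, 3... among equal-cost ones."""

    def __init__(self, members):
        self.members = list(members)
        self._next = 0  # rotation pointer: where the next scan starts

    def pick(self) -> Member:
        best = min(effective_cost(m) for m in self.members)
        n = len(self.members)
        # Scan from the rotation pointer so tied members take turns.
        for i in range(n):
            idx = (self._next + i) % n
            if effective_cost(self.members[idx]) == best:
                self._next = (idx + 1) % n
                return self.members[idx]
        raise RuntimeError("no members in shadow set")
```

With three otherwise identical members, repeated calls to pick() simply cycle
through them; once one member's queue grows or it becomes served, it stops
being chosen until the costs even out again.  Since queue lengths change with
every I/O from every node, the outcome looks random from the outside.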

>> There is some special code in the shadow driver for booting.  This code
>> does know enough to look for other members but that may wait until the
>> system disk is "mounted".  Before then there shouldn't be writes since
>> they aren't synchronized with anything else in the cluster, although
>> I think there may be some exceptions (the "current" parameters for one)
>>
>> >As for crash dumps, I don't think that the system can run the bugcheck
>> >code (or whatever it's called) and volume shadowing code concurrently.
>> >And I'd think that would mess up your dump, anyway. Consequently, it
>> >writes to the boot device unless you set it up to dump to some other
>> >disk. But you can't dump to a shadow set, as far as I know. (That's
>> >aside of the trivial case of a single-member shadow set, of course!)
>>
>> The bugcheck code is written by the crash dump code/console, which knows
>> nothing of shadowing.  The data goes to the boot device (ignoring the
>> dump_dev console parameter).  There is something funny within shadowing
>> or SDA that deals with the dump data only being on one drive.  I don't
>> remember the details.  A common problem was the merge code wiping out the
>> last dump.

>But if the machine has crashed, how can the volume-shadowing code
>still be running? Isn't it just set for the boot device ahead of time?

A restart from the console goes to the crash dump code, which doesn't
know much more than how to (selectively) dump memory to the dump_dev
or boot device, possibly after compression.  Shadowing is not involved.
The "magic" is recovering the data from the dump file when it was written
to only one of the shadow members of the system disk shadowset.

>It's amazing: A seemingly simple (at least to me) idea, that of
>shadowing, has all sorts of issues that have to be dealt with very
>carefully, on top of performance issues for merging and copying.

Oh gosh, it's an amazingly complex driver because of all that goes on
in a cluster, trying to synchronize everything across the cluster while
nodes can leave or drive I/Os can fail at any time.


