[Info-vax] The dangers of extended uptime. Was: Re: swap and page files
Paul Sture
nospam at sture.ch
Thu Jan 3 10:15:24 EST 2013
In article <kbqfko$k21$1 at dont-email.me>,
Stephen Hoffman <seaohveh at hoffmanlabs.invalid> wrote:
> Boxes that are up for months or years are a production risk, even when
> the operating system itself is working perfectly. If it's production
> servers, then Chaos Monkey
> <http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html>
> and similar can help spot production errors, avoiding single-point
> problems, and can help keep your servers online when Bad Things happen.
> Better to force the reboots and restarts and software updates - on a
> schedule, or automated, or entirely randomly - when staff is available
> to deal with the fallout.
A friend who is a *nix system administrator has the following problem:
2 Solaris boxes, one with an uptime of 3 years, the other with an uptime
of 6 years. These are configured together using Veritas Cluster Server
(VCS) such that for limited time periods one box should be able to cope
with the workload of both.
One of the mirrored system disks has failed on one of those boxes.
New disk ordered [1], but when it arrived they couldn't get the old disk
out.
Somewhere in this mix an extra file system was added to System A, but it
won't mount on System B (reason unknown).
A project to replace these systems was initiated about 4 years ago, but
got dropped in favour of something else.
My friend is now in the position of ensuring that the Unix team in
charge of these systems don't perform a reboot, in case either of them
or, in the worst case, both fail to come back. Oracle[2] have been called
in to investigate, but they want to take the systems down to do so,
and the customer is afraid that the boxes won't come up again. A
quick look at Wikipedia suggests that the version of Solaris mentioned
dropped off support last March.
I detected another factor at play here. The guy who set these systems
up has since left for pastures new, but there appears to be a refusal to
believe he could have missed something, simply because he had such a
good track record of getting things right. This reaction came when I
suggested checking firmware levels - it's surprising what can come
crawling out of the woodwork...
[1] Or as I understood it, "New 10 year old disk ordered".
[2] I still can't quite get used to thinking of Oracle as a hardware company.
--
Paul Sture