[Info-vax] NetBackup Performance Woes
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Wed May 20 15:32:05 EDT 2015
On 2015-05-20 18:13:05 +0000, Geek Nerdly said:
> OpenVMS 8.4
> 2-node cluster, RX2800 i2, 4 CPUs each, 32gb RAM each.
> T4 data is collected 24/7 from both nodes of the cluster.
Run the data and see what's bottoming out.
> NetBackup release 7.5
>
> I believe I/O at that level is not where the problem is.
Might want to verify that. Performance assumptions that can be
reasonably tested probably should be tested. Surprises happen, after
all.
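A quick way to test it while the backup window is open is to watch which
processes are generating the I/O and where the CPU time is going; something
along these lines, with an arbitrary sampling interval:

$! Top direct and buffered I/O consumers, sampled every five seconds
$ MONITOR PROCESSES /TOPDIO /INTERVAL=5
$ MONITOR PROCESSES /TOPBIO /INTERVAL=5
$! Time spent in interrupt, kernel, executive, user and idle, systemwide
$ MONITOR MODES /ALL /INTERVAL=5

If the NetBackup processes dominate those displays, fine; if something else
shows up, that's worth knowing too.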
> I believe this has been happening for months, but user activity periods
> have changed recently, so that users are now reporting it.
You have T4 data. Run it.
> We matched the NetBackup documentation to set process quotas on the
> account and are using a separate account for the NetBackup service so
> that we could adjust settings on either account without messing with
> other.
>
> The network service for NetBackup agent has a limit of 10 connections;
> when it's running I typically see anywhere from 2 to 8 network
> processes running under that service. All of those processes show up
> super-heavy on Direct and Buffered I/O. It is the only thing really
> doing anything on that node at that time of the day.
That's typical of a backup tool. The options are faster storage, fewer
files, or throttling via the LIMIT_BANDWIDTH setting. Whether reducing the
permissible number of those processes helps any is also worth testing.
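If the NetBackup client is registered as a TCP/IP Services service, which is
what the ten-connection limit suggests, that limit is easy to experiment
with. The service name below is a guess; use whatever SHOW SERVICE reports
for the agent:

$ TCPIP SHOW SERVICE                   ! find the NetBackup service name
$ TCPIP SHOW SERVICE BPCD /FULL        ! "BPCD" is a guess; substitute the real name
$ TCPIP SET SERVICE BPCD /LIMIT=4      ! cap the concurrent server processes
$! SET SERVICE changes the volatile database; use SET CONFIGURATION SERVICE
$! to make it permanent, and bounce the service so the new limit applies
$ TCPIP DISABLE SERVICE BPCD
$ TCPIP ENABLE SERVICE BPCD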
> Aside from the I/O slam that goes on, what happens when NetBackup runs
> is at least 2 (of 4) CPUs on node B (where NetBackup agent is) are
> heavy in Interrupt Mode the entire time (in the example I'm looking at,
> cpu 0 ~75%, cpu 3 ~100%) .
Interrupt mode has various triggers; it's often secondary to device
activity of some sort. That could be network or storage here, or
might be something else odd — any hardware errors getting logged?
Disks can get flaky — even on expensive controllers — and I've seen a
few Itanium servers tossing prodigious numbers of memory errors, for
instance.
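Checking for that is cheap. Something along these lines; the ELV syntax here
is from memory, so check HELP ANALYZE /ERROR_LOG if it objects:

$ SHOW ERROR                           ! per-device error counts since boot
$! Translate recent error log entries with ELV
$ ANALYZE/ERROR_LOG/ELV TRANSLATE /SINCE=YESTERDAY SYS$ERRORLOG:ERRLOG.SYS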
> This almost immediately (and for the entire time) affects user
> activity/response times on node A, but I don't see a related anomaly in
> T4 data for node A. The effects are much more severe on node B.
Interrupt activity usually adversely affects other hosts, secondary to
locking contention or to device I/O contention somewhere in the system.
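The lock and cluster traffic is easy enough to eyeball from both members at
once; a sketch, with placeholder node names:

$! Distributed lock manager activity on both cluster members
$ MONITOR DLOCK /NODE=(NODEA,NODEB) /INTERVAL=5
$! One-screen summary of CPU, memory, I/O and locking across the cluster
$ MONITOR CLUSTER /INTERVAL=5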
As an early and simple (simplistic?) indication of excessive I/O load,
look for an average disk I/O queue length of more than 0.5, via MONITOR
DISK/ITEM=QUEUE or via the T4 data. A persistent disk queue length average
of more than 0.5 is bad news. That usually leads to reducing the I/O load,
distributing it across more disks, or moving to faster storage.
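The live equivalent of what's in the T4 charts, for whichever disks the
backup is reading:

$! Current, average, minimum and maximum queue length per disk; persistent
$! averages above ~0.5 are the red flag
$ MONITOR DISK /ITEM=QUEUE /ALL /INTERVAL=5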
> The LIMIT_BANDWIDTH setting is only barely documented and NetBackup
> Support has not (yet) answered my question about whether the setting
> applies to each process or if it throttles the collective lot of them.
> I suspect that I would have to know what throughput I see, then divide
> that by the number of connections the service allows to have an effect.
> They say you can set LIMIT_BANDWIDTH, but nothing of how to look for
> an optimal value to use.
I suspect it's a "try it" setting. The other question is whether the setting
is dynamic, meaning whether a changed value is picked up without restarting
the client. I'd hope it is, but it'd be pretty easy to determine with some
quick testing.
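As for picking a starting value, one crude approach is to measure what the
client actually pushes during a backup and work backward from that; if the
setting turns out to be per-process, divide the aggregate rate by the
connection limit. LANCP counters will give you bytes on the wire; the device
name EWA0 is a placeholder:

$! Sample the NIC byte counters twice during the backup window, then divide
$! the delta by the elapsed seconds for an aggregate transfer rate
$ MCR LANCP SHOW DEVICE EWA0 /COUNTERS
$ WAIT 00:10:00                        ! ten minutes into the backup
$ MCR LANCP SHOW DEVICE EWA0 /COUNTERS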
>
>> I'd suspect the backup will consume all possible resources until and
>> unless throttled due to skewed quota settings or I/O limits.
> That is what seems to be happening. I don't know what quotas I should
> consider adjusting.
> NetBackup support has admitted (in writing) that they have no expertise
> or advice to offer to tune this (their own product) on this platform.
> They have pretty much said we'll have to do this by trial and error.
>
> Their other suggestion was to back up less data.
That'd probably lead me to start investigating a migration to a different
tool or to an alternative strategy.
Alan Fay of Symantec used to post around here on NetBackup, but I've
not seen anything recent.
One other out-of-left-field case I've encountered with poor performance:
sometimes the storage controller gets saturated by activity from other hosts
and other platforms sharing it. That doesn't look to be the most likely case
here, but I'd still see how busy the storage controller might be.
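From the VMS side, about all you can easily do is compare the per-disk
picture from both members at once; the controller's own management tools
have to cover any non-VMS hosts sharing the array:

$! Modest VMS-side operation rates combined with long queue lengths point
$! at a bottleneck below the host, in the controller or the fabric
$ MONITOR DISK /NODE=(NODEA,NODEB) /INTERVAL=5
$ MONITOR DISK /ITEM=QUEUE /NODE=(NODEA,NODEB) /INTERVAL=5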
Since this is Itanium, also run a check for excessive alignment faults.
<http://h71000.www7.hp.com/openvms/journal/v9/alignment_faults.html>
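For a quick first look, and assuming I'm remembering the class name
correctly, ALIGN is an I64-only MONITOR class; the article covers the
finer-grained SDA-based tracing:

$! Systemwide alignment fault rate; anything consistently large is worth
$! chasing with the SDA tracing described in the article
$ MONITOR ALIGN /INTERVAL=5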
--
Pure Personal Opinion | HoffmanLabs LLC