[Info-vax] The Future of Server Hardware?

JohnF john at please.see.sig.for.email.com
Wed Oct 3 22:23:06 EDT 2012


Stephen Hoffman <seaohveh at hoffmanlabs.invalid> wrote:
> JohnF said:
>> Stephen Hoffman <seaohveh at hoffmanlabs.invalid> wrote:
>>> 
>>> If you're looking to render images or brute-force passwords, definitely
>>> yes.  (This is why I was pointing to bcrypt and scrypt a while back,
>>> too.  But I digress.)
>>
>> Thanks for the info. So vectorizable, "definitely yes";
>> parallelizable, "not so much."

And thanks for the additional info...

> GPUs can run various tasks in parallel, as can the cores available 
> within most modern boxes.  Laptops are now arriving with two and four 
> cores, and a workstation can have 8 to 16 cores plus a couple of fast 
> GPUs.
> 
> How much data do you need to spread around?  And how often?  How 
> closely are the work-bits tied together or dependent?  (q.v. Amdahl's 
> Law.)
> 
> Adding boxes into an application design - whether x86-64 or ARM or 
> mainframe or rx2600 - means that box-to-box latency and bandwidth are 
> in play, too.
> 
> The same bandwidth and latency issues arise within a multiprocessor 
> box, but those links are usually shorter and faster.  (There are also 
> some degenerate hardware cases around, where loading and unloading a 
> particular GPU is relatively performance-prohibitive, for instance; 
> where the GPU is gonzo fast, but getting the data in and out just 
> isn't.)
> 
> If you have a good way to distribute that data or schedule the tasks, 
> or if you have fast links, then adding boxes scales nicely.
> 
> If not, then you're headed for more specialized hardware (mainframe, 
> etc), or toward a world of hurt.
> 
> For example, VMS used to run into a wall around 8 CPUs or so, where 
> adding processors didn't help or could even reduce aggregate 
> performance.  That's gotten (much) better in recent VMS releases (due 
> in no small part to the efforts of some of the VMS engineers to break 
> up the big locks), but VMS still doesn't scale as high as some other 
> available platforms.  Clustering has some similar trade-offs here, too; 
> where the overhead of sharing a disk across an increasing number of 
> hosts runs into the proverbial wall, for instance.
> 
> One size does not fit all.
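
Right, and Amdahl's Law is the sobering part. As a back-of-the-envelope
reminder (throwaway Python, numbers made up), even a mostly-parallel job
is capped by its serial fraction:

    # Amdahl's Law: speedup = 1 / ((1 - p) + p/n), where p is the
    # parallelizable fraction of the work and n is the processor count.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(0.95, 16))    # ~9.1x on 16 cores, not 16x
    print(amdahl_speedup(0.95, 1024))  # ~19.6x; the ceiling is 1/(1-p) = 20x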

>> Is there a pretty standard whitebox configuration people typically
>> put together as a gpu platform -- MB make and model, power supply
>> and cooling options, memory, etc? And is nvidia/cuda the current
>> favorite among the gpu/"language" options to put in that box?
> 
> There are probably almost as many options as there are opinions.

There seem to be several out-of-the-box vendor-assembled solutions,
e.g., http://www.nvidia.com/object/personal-supercomputing.html
as well as build-your-own recommendations, e.g.,
http://www.nvidia.com/object/tesla_build_your_own.html
(and plenty of non-nvidia pages about all that, too).
But my personal experience assembling boxes from components
doesn't translate comfortably to that regime, i.e., it's still
hard to choose wisely.

> I'd tend toward OpenCL, given the choice.  But then that's native on 
> the platforms that I most commonly deal with.

Thanks for that, too. I'll take a look. In particular, I'm trying
to spec out/prototype migrating a binomial tree solution for
Black-Scholes over to this architecture. The current system works
fine pricing along a monthly tree for 30 years (360 time steps),
but a daily tree (needed to better model rational exercise of
options) is beyond it. A few desktop Tflops might do it, and
justifying the $10K or so of hardware needed to test that should be
easy. But I don't yet have a shovel-ready proposal.
Programming time is another story. Choosing the best software
library platform to encapsulate the GPU details and expose only
the math is the harder part (for me).
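
To make the shape of the problem concrete, here's the kind of kernel
I mean, as a throwaway Python sketch (a plain CRR tree with early
exercise checked at every node -- not the production model, and the
parameters below are just placeholders):

    import math

    def crr_american_put(S0, K, r, sigma, T, steps):
        # Cox-Ross-Rubinstein binomial tree for an American put,
        # checking rational (early) exercise at every node.
        dt = T / steps
        u = math.exp(sigma * math.sqrt(dt))
        d = 1.0 / u
        p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
        disc = math.exp(-r * dt)

        # option values at expiry
        values = [max(K - S0 * u**j * d**(steps - j), 0.0)
                  for j in range(steps + 1)]

        # backward induction toward the root
        for i in range(steps - 1, -1, -1):
            for j in range(i + 1):
                cont = disc * (p * values[j + 1] + (1.0 - p) * values[j])
                exercise = max(K - S0 * u**j * d**(i - j), 0.0)
                values[j] = max(cont, exercise)
        return values[0]

    # monthly tree over 30 years = 360 steps; a daily tree is ~30 * 365
    # steps, so the backward induction grows by roughly (10950/360)^2,
    # i.e. on the order of 900x the work -- hence the interest in GPUs.
    print(crr_american_put(S0=100.0, K=100.0, r=0.03, sigma=0.25,
                           T=30.0, steps=360))

And on the library question: if OpenCL is the route, something like
PyOpenCL looks as though it gets the device plumbing down to a few
lines -- roughly like the following (untested on my end; the kernel is
just a placeholder, not the tree math):

    import numpy as np
    import pyopencl as cl

    src = """
    __kernel void scale(__global float *x, const float a) {
        int i = get_global_id(0);
        x[i] = a * x[i];   /* placeholder; real per-node math goes here */
    }
    """

    ctx = cl.create_some_context()            # picks a device, may prompt
    queue = cl.CommandQueue(ctx)
    x = np.arange(16, dtype=np.float32)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x)
    prg = cl.Program(ctx, src).build()
    prg.scale(queue, x.shape, None, buf, np.float32(2.0))
    cl.enqueue_copy(queue, x, buf)            # pull results back to the host
    print(x)                                  # 0, 2, 4, ..., 30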

> But then this is not the best group for x86-64 HPTC questions, either.

-- 
John Forkosh  ( mailto:  j at f.com  where j=john and f=forkosh )


