Some Revolutionary HPC Talk: 208 Cores+896GB < $60k, GPU’s, & ANSYS Distributed

imageThe last couple of weeks a bunch of stuff has gone on in the area of High Performance computing, or HPC, so we thought we would throw it all into one Focus posting and share some things we have learned, along with some advice, with the greater ANSYS user community.

There is a little bit of a revolution going on in both the FEA and CFD simulation side of things amongst users of ANSYS, Inc. products.  For a while now customers with large numbers of users and big nasty problems to solve have been buying lots of parallel licenses and big monster clusters. The size of problems that these firms, mostly turbomachinery and aerospace, have been growing and growing. And even more so for CFD jobs.   But also FEA for HFSS and ANSYS Mechanical/Mechanical APDL.  That is where the revolution started.

But where it is gaining momentum, where the greater impact is being seen on how simulation is used around the world, is with the smaller customers.  People with one to five seats of ANSYS products.  In the past they were happy with their two “included” Mechanical shared memory parallel tasks.  Or they might spring for 3 or 4 CFD parallel licenses.  But as 4, 6, and 8 core CPU chips become mainstream, even on the desktop, and as ANSYS delivers greater value from parallel, we are seeing more and more people invest in high performance computing. And they are getting a quick return on that investment.

Affordable High Value Hardware

208 Cores + 869 GB + 25 TB + Infiniband + Mobile Rack = $58k = HOT DAMN!

Yes, this is a commercial for PADT’s CUBE machines.  (www.CUBE-HVPC.com) Even if you would rather be an ALGOR user than purchase hardware from a lowly ANSYS Chanel Partner, keep reading. Even if you would rather go to an ANSYS meeting at HQ in the winter than brave asking your IT department if you can buy a machine not made by a major computer manufacturer, keep reading.

Because what we do with CUBE hardware is something you can do yourself, or that you can pressure your name brand hardware provider into doing.

We just got a very large CFD benchmark job for a customer.  They want multiple runs on a piece of “rotating machinery” to generate performance curves.  Lots of runs, and each run can take up to 4 or 5 days on one of our 32 core machines.  So we put together a 208 core cluster.  And we maxed out the RAM on each one to get to just under 900 GB. Here are the details:

Cores: 208 Total
    3 servers x48 2.3 GHz cores each server
    2 servers x32 3.0 GHz cores each server
RAM: 896 GB
    3 servers  x128GB DDR3 1333MHz ECC RAM each server
    2 servers  x256GB DDR3 1600MHz ECC RAM each server
DATA Array:  ~25TB
Interconnect: 5 x Infiniband 40Gbps QDR
Infiniband Switch: 8 port 40Gbps QDR
Mobile Departmental Cluster Rack – 12U

All of this cost us around $58,000 if you add up what we spent on various parts over the past year or so.  That much horsepower for $58,000.  If you look at your hourly burden rate and the impact of schedule slip on project cost, spending $60k on hardware has a quick payback. 

You do need to purchase parallel licenses. But if you go with this type of hardware configuration what you save over a high-priced solution will go a long way towards paying for those licenses.  Even if you do spend $150k on hardware, your payback with the hardware and the license is still pretty quick.

Now this is old hardware (six months to a year –  dinosaur bones).  How much would a mini-cluster departmental server cost now and what would it have inside:

Cores: 320 Total
5 servers x64 2.3 GHz cores each server
RAM: 2.56 TB
5 servers x512 GB DDR3 RAM each server
DATA Array: ~50TB
Interconnect: 5 x Infiniband 40Gbps QDR
Infiniband Switch: 8 port 40Gbps QDR
Mobile Departmental Cluster Rack – 12U

The cost?  around $80,000.  That is $250/core.  Now you need big problems to take advantage of that many cores.  If your problems are not that big, just drop the number of servers in the mini-cluster.  And drop the price proportionally. 

Same if you are a Mechanical user.  The matrices in FEA just don’t scale in parallel like they do for CFD, so a 300+ core machine won’t be 300 times faster. It might even be slower than say 32 cores.  But the cost drop is the same.  See below for some numbers.

Bottom line, hardware cost is now in a range where you will see payback quickly in increased productivity.

GPU’s

image

NVIDIA’s Tesla GPU
We think they should have some 80’s era super model
draped over the front like those Ferrari posters we
had in our dorm rooms.

For you Mechanical/Mechanical APDL users, let’s talk GPU’s.

We just got an NVIDIA TESLA C2075 GPU.  We are not done testing, but our basic results show that no matter how we couple it with cores and solvers, it speeds things up.  Anywhere from 3 times faster to 50% faster depending on the problem, shared vs. distributed memory, and how many cores we throw in with the GPU.  

This is a fundamental problem with answering the question “How much faster?” because it depends a lot on the problem and the hardware. You need to try it on your problems on your hardware. But we feel comfortable in saying if you buy an HPC pack and run on 8 cores with a GPU, the GPU should double your speed relative to just running on 8 cores.  It could even run faster on some problems. 

For some, that is a disappointment.  “The GPU has hundreds of processors, why isn’t it 10 or 50 times faster?”  Well, getting the problem broken up and running on all of those processors takes time.  But still, twice as fast for between $2,000 to $3,000 is a bargain. I don’t know what your burden rate is but it doesn’t take very many hours of saved run time to recover that cost.  And there is no additional license cost because you get a GPU license with an HPC pack.

Plus, at R14 the solver supports a GPU with distributed ANSYS, so even more improvements.  Add to that support for the unsymmetrical or damped Eigensolvers and general GPU performance increases at R14.

PADT’s advice? If you are running ANSYS Mechanical or Mechanical APDL get the HPC Pack and a GPU along with a 12 core machine with gobs of RAM (PADT’s 12 core AMD system with 64GB or RAM and 3TB of disk costs only $6,300 without the GPU, $8,500 with).  You can solve on 8 cores and play Angry Birds on the remaining 4.

Distributed ANSYS

For a long time many of users out there have avoided distributed ANSYS. It was hard to install, and unless you had the right problem you didn’t see much of a benefit for many of the early releases. Shared Memory Parallel, or SMP, was dirt easy – get an HPC license and tell it to run on 8 cores and go.

Well at R14 of ANSYS Mechanical APDL it is time to go distributed.  First off they make the install much easier.  To be honest, we found that this was the biggest deterrent for many small company users.

Second, at R14 a lot more things are supported in distributed ANSYS. This has been going on for some time so most of what people use is supported. At this release they added subspace eigensolving, TRANS, INFINI and PLANE121/230 elements (electrostatics), and SURF251/252. 

Some “issues” have been fixed like restart robustness and you now have control on when and how multiple restart files are combined after the solve. 

All and all, if you have R14, you are solving mechanical problems, and you have an HPC pack, you should be using distributed most of the time.

Conclusions

We get a ton of questions from people about what they should buy and how much.  And every situation is different. But for small and medium sized companies, the HPC revolution is here and everyone should be budgeting for taking advantage of HPC:

    • At least one HPC pack
    • Some new faster/cheaper multicore hardware (CUBE anyone?)
    • A GPU. 

STOP!  I know you were scrolling down to the comments section to write some tirade about how ANSYS, Inc overcharges for parallel, how it is on a moral equivalence with drowning puppies, and how much more reasonable competitor A, B, or C is with parallel costs.  Let me save you the time.

HPC delivers huge value.  Big productivity improvements.  And it does not write itself. It is not an enhancement to existing software.  Scores of developers are going into solver code and implementing complex logic that allows efficiency with older hardware, shared memory, distributed memory, and GPU’s. It has to be written, tested, fixed, tested again, and back and forth every release.  That value is real, and there is a cost associated with it.

And the competitors pricing model? The only thing they can do to differentiate themselves is charge nothing or very little. They have not put the effort or don’t have the expertise to deliver the kind of parallel performance that the ANSYS, Inc. solvers do.  So they give it away to compete.  Trust me, they are not being nice because they like you. They have the same business drivers as ANSYS, Inc.  They price the way they do because they did not incur as much cost and they know if they charged for it you would have no reason to use their solvers.

ANSYS users of the world unit!  Load your multicore hardware with HPC packs, feed it with a GPU, and join the revolution!