HPC on the Cheap: 48 cores for $11k

imageEvery once in a while computer hardware takes a big step forward, either in performance or lower cost.  Recently AMD, Intel’s biggest rival in CPU chips, and Super Micro Computers, a maker of high-performance motherboards and other hardware, have both come out with products that have changed the landscape of affordable High Performance computing forever.  About a month ago PADT put together a single system with 48 cores, 128 GB SDRAM, and 3TB of disk for just over $11,000.  And it is not too good to be true, the machine is fast and reliable.

We were able to get this type of price performance by leveraging the 12 Core server processors coupled with a motherboard that supports four chips: 4×12 = 48, something that only exists in the AMD world. We kind of stumbled upon this when our IT guy thought it would be good to see what AMD is up to.  We have been buying and using Intel based systems for a while now since they leapfrogged AMD the last time.  Once he started hunting around he found that AMD still supported four chips on a board, and that they had working 12 core chips.  So he went off to New Egg and built a shopping cart.  It came up at around $10,000 because we already had some of the hardware we needed.  Next came his hardest task, convincing management (me) that it wasn’t too good to be true.

After some research we found that others were benchmarking this chip and finding good performance on parallel tasks.  It is a slower clock speed than the top Intel chips, but because you can fit so many cores in one box, you get better performance for your money on applications that are parallel – like CFD.  After pulling the trigger on the order and sweating a bit the order showed up:

clip_image002

An Early Christmas at PADT:
A Barebones 1U Box, 4 Parts (48 cores), 32 SDRAM Sticks, and 3 Hard Drives

This created a bit of a buzz in the office, almost as big as when one of our customers brought his Tesla over for a test drive.  For those of us who started our career running on million dollar “super computers” with the same power as my cell phone, this was a pretty cool site to see.

The Box:

Here is what we purchased:

4 AMD Opteron 6174 Magny-Cours 2.2 GHZ 12 Core Server Processors 48 Core
32 4GB DDR3 Kingston 240 pin Registered SDRAM 128 GB RAM
3 1TB Western Digital RE3  7200 RPM, SATA 3.0 Hard Drives 3 TB Disk
1 80GB Intel X18-M Solid State Drive 80 GB / Drive
1 Supermicro AS-1042G-TF 1U Barebones Server 1 Box

We put it together, loaded it with CentOS and booted it up.  We got a little silly when we saw the boot screen:

clip_image002[4] 

None of us had seen a boot screen with a count of 48, nor one with that much RAM on a single system.  And we had to keep telling ourselves that it was only $11,000.

We had no issues with drivers or software.  ANSYS FLUENT, CFX and Mechanical APDL loaded up just fine and ran smoothly.

The Benchmarks:

The first thing we did is run a customer problem that was given to us as a test for a sale. The sales was being held up because we couldn’t get time on our existing systems to run it.  No surprise, it ran fast.  As we upped the number of CPU’s we could see that for a medium sized CFD problem, it was scaling well to 24 cores, then tapered off above that.  Once that problem was done we ran the ANSYS standard benchmarks to see how larger problems went.

First we ran the CFX benchmarks. They use something called “Speedup” which is simply the run time on one core divided by the run time on N cores. Or, how many times faster it is.

imageimage
imageimage

As you can see, you get the usual variation from problem to problem.  The Indy Car run actually did better than parallel, indicating that there is some sort of bottleneck with one processor and something about that problem that this system likes.  They other cases did well against a much more expensive IBM XEON system with a faster clock speed CPU.

Next, we looked at the performance on FLUENT.  They must have a bigger staff for benchmarking because we had a lot more machines to compare to.  But they use a different scale called a “Rating” It is 24 hours divided by the run time in hours on N Cores.  So a 1 would mean it runs in 1 day and 256 means it runs in 5.625 minutes.

imageimage

imageimage

Our takeaway from these studies is that 1) the system is pretty linear on large problems, and 2) that although it is slower than other systems, it is a lot less money, sometimes less than 25% as much.

The next thing to look at is what the speed is per core between Nehalem and AMD based systems.  For the data we have on FLUENT problems we compared a 2.9GHz Intel Nehalem based system to our 2.2GHz AMD system. These other systems have faster drives and cost 3 to 4 times as much.  What we found was that our AMD system was anywhere from 52% to 65% of the speed of the faster clock speed Intel chips.

Remember, to get to 48 cores on the Intel platform you will need six systems or blades, because each system or blade can only support two chips with four cores per chip, so 8 cores per board.  Let’s say the Intel chips are twice as fast.  So to get the same performance you would need 24 cores, or three 8 core boxes.  Then you need Infiniband with a switch and three cables.  So now you are talking three to four times the cost of the 48 core single box AMD system.

imageDon’t Forget Power

One of the knocks against the last time AMD had really competitive chips was heat and power consumption.  Well, these guys run cool and low power. We found that the power consumption maxed at 800 watts and average 760. When you consider that there are 48 cores chugging away at 100%, that is pretty impressive.  17watts per core!

  • Max Watts reached during test: 800
    • Watt usage settled down to around: 761
  • Number of hours for usage test: 167
  • Max Volts: 116 Volts
  • Amps: 6.60
  • $ rate used per/kWh: $0.11
  • Costs per/kWh:
    • Hour: $0.08
    • Day: $1.98
    • Week: $13.89
    • Month: $59.55
    • Year: $724
  • 200 Watts per part (CPU)
    • 17 Watts per core

Other Things to Consider

This particular configuration is very exciting, and it also opens up a lot of doors to other options and combinations.  The power of this system comes from parallel processing, so you need to consider the cost of adding parallel solves. Right now ANSYS, Inc. has a licensing option called HPC packs that are perfect for this application.  If you buy one pack, you can run on 8 cores, 2 packs is good for 32 cores, and 3 packs allow you to run on 128.

This doesn’t line up to well with 12 cores per chip.  But AMD does make a chip with 8 cores that run at a slightly higher clock speed (2.4 GHz vs. 2.2 GHz) that also cost less ($750ea vs. $1299ea)  So you could buy two HPC packs for 32 cores and then put together a system for less than $10,000.

Or, if that is not enough punch, you could also upgrade the motherboard to the next level up and purchase two barebones server boxes.  This allows you to use Supermicro’s Infiniband card to connect two boxes together over Infiniband without a hub.  Now you are talking 96 cores connected with a high speed interconnect, 256 GB of RAM, and all for under $27,000.

The last thing to consider is the fact that AMD has announced that they will be releasing a 16 core server chip to be widely available in 3Q11.  These will work with the motherboards we are talking about here, so you can upgrade your machines to have 4 more core per chip – so if you invest in a 96 core dual box system now, you can upgrade it to 128 cores next year.  Not bad.

Right for You?

After the initial excitement dies down for this much horsepower for this little of an investment, most users will ask themselves if this type of system is right for them.  You need to think it over and ask yourself some questions:

  1. Will I benefit from that much parallel?
    Remember, a single core actually runs slower than a top of the line 2.9GHz Intel system.  If you run Mechanical or have small problems, you may be better off paying for fewer, faster cores.
  2. Can I build my own system?
    We have the luxury at PADT of not having a corporate IT department that would freak out over a home-built system.  We also have an IT department with the skills to put systems together.  Run this idea by your IT before you go to far.  You can purchase systems with similar guts from HP and Dell, but a large bit of the cost savings go away when you do that.
  3. Do I have the Parallel Task Licenses?
    See what parallel you already own and contact your ANSYS sales person to see what it would cost to add more, especially HPC Packs.

If your problems are large and have long run times, this hardware and the needed software licenses will pay for themselves very quickly.  For us, a job that was running over 5 days now finishes in half a day.  That is huge as far as being able to either up the problem size to get more accuracy or getting results to our customers much faster.

Interested in Getting a System from PADT?

Several of our customers have expressed interest in purchasing our leasing one of these low-cost cluster systems from PADT. So We are offering a purchase or a lease on custom systems if you are located in Arizona, New Mexico, Colorado, Utah or Nevada.  Shoot
sales@padtinc.com an e-mail if you would like more information.  If you are outside of PADT’s ANSYS sales territory, we are more than happy to answer any questions you may have via e-mail to help you put your own system together, but we just don’t have the bandwidth to support outside of our area.