Why do my ANSYS jobs take days and weeks to finish? Well it depends…

Real World Lessons on How to Minimize Run Time for ANSYS HPC

Recently I had a VP of Engineering start a phone conversation with me that went something like this. “Well Dave, you see this is how it is. We just spent a truckload of money on a 256 core cluster and our solve times are slower now than with our previous 128 core cluster. What the *&(( is going on here?!”

I imagine many of us have heard similar stories or even received the same questions from our co-workers, CEO’s & Directors. I immediately had my concerns and I truly thought carefully as to what I should say next. I recalled a conversation I had with one of my college professors. He had told me that when I find myself stepping into gray areas that a good start to the conversation is to say. “Well it depends…”

Guess what, that is exactly what I said. I said “Well it depends…” followed by going into explaining to him two fundamental pillars of computer science that have plagued most of us since computers were created: I said “Well you may be, CPU bound (compute bound) or I/O bound. He told me that they had paid a premium for the best CPU’s on the market and some other details about the HPC cluster. Garnering some of other details about the cluster my hunch was that his HPC cluster may actually be I/O bound.

I/O Bound

Basically this means that your cluster’s $2,000 worth of CPU’s are basically stalled out and sitting idle. The CPU’s are waiting for new data to process and move on. I also briefly explained that his HPC cluster may be compute bound. I quickly reassured him that the likelihood of his HPC cluster being compute bound was about 10% possible and very unlikely. I knew the specifications on the CPU’s in this HPC cluster and the likelihood that they were the issue of his ANSYS slow run times was low on my radar. These literally were the latest and greatest CPU’s ever to hit this planet (at that moment in time). So, let me step back a minute, to refresh our memories on what it means when a system is compute bound.

Compute Bound

Being compute bound means that the HPC cluster’s CPU’s were sitting at 99 or 100% for long periods of time. When this happens very bad things begin to happen to your HPC cluster. CPU requests to peripherals are delayed or infinitely lost to the ether. The HPC cluster may become unresponsive and even lock up.

All I could hear was silence on the other end. “Dave, I get it, I understand, please find the problem and fix our HPC cluster for us. ” I happily agreed to help out! I concluded our phone conversation asking that he send me the specific details, down to the nuts and bolts of the hardware! I also requested operating system and software that was installed and used on the 256 core HPC cluster.

What NOT to do when configuring an ANSYS Distributed HPC cluster.

Seeking that perfect balance!

After a quick NDA signing, a few dollars exchange and a sprinkle of some other legal things that lawyers get excited about. I set out to discover the cause. After reviewing the information provided to me I almost immediately saw three concerns:

To interconnect what?

Let Merriam-Webster describe it:

Definition of INTERCONNECT

transitive verb
: to connect with one another
intransitive verb
: to be or become mutually connected

— in·ter·con·nec·tion noun
— in·ter·con·nec·tiv·i·ty noun

1. The systems are interconnected with a series of wires.
2. The lessons are designed to show students how the two subjects interconnect
3.  A series of interconnecting stories

First Known Use of INTERCONNECT: 1865

Concern numeral Uno!!! Interconnect me

Though the company’s 256 core HPC cluster had a second dedicated GigE interconnect. Distributed ANSYS is highly bandwidth and latency bound often requiring more bandwidth than a dedicated NIC (Network Interface Card) may provide. Yes, the dedicated second GigE card interconnect was much better than trying to use a single NIC for all of the network traffic which would also include the CPU interconnect. I did have a few of the MAPDL output files from the customer that I could take a peek at. After reviewing the customer output files it became fairly clear that interconnect communication speeds between the 16 core x 16 server in the cluster was not adequate. The master Message Parsing Interface (MPI) process that Distributed ANSYS uses requires a high amount of bandwidth and low latency for proper distributed scaling to the other processes. Theoretically the data bandwidth between cores solving local to the machine will be higher than the bandwidth traveling across the various interconnect methods (see below). ANSYS, Inc. recommends Infiniband for CPU interconnect traffic. Here are a couple of reasons why they recommend this. See how the theoretical data limits increase going from Gigabit Ethernet up to FDR Infiniband.

Theoretical lane bandwidth limits for:

  • Gigabit Ethernet (GigE): ~128MB/s
  • Signal Data Rate (SDR): ~ 328 MB/s
  • Double Data Rate (DDR): ~640 MB/s
  • Quad Data Rate (QDR): ~1,280 MB/s
  • Fourteen Data Rate (FRD): ~1,800 MB/s

GEEK CRED: A few years ago companies such as MELLANOX started aggregating the Infiniband channels. The typical aggregate modifiers are 4X or even a 12X increase. So for example the 4X QDR Infiniband switch and cards I use at PADT and recommended to this customer, would have a (4X 10Gbit/s) or 5,120 MB/s of throughput! Here is a quick video that I made of a MELLANOX IS5023 18-port 4X QDR full bi-directional switch in action:

This is how you do it with a CUBE HVPC! MAPDL output file from our CUBE HVPC w16i-GPU workstation. This is running the ANSYS industry benchmark V14sp-5. I wanted to show the communication speeds between the master MPI process and the other solver processes to see just how fast the solvers can communicate. With a peak communication speed of 9593 MB/s this CUBE HVPC workstation rocks!

Chassis Profile 4u standard depth or rackmountable
CPU 1 x One Dual Socket
Chipset INTEL 602 Chipset
Processors 2 x INTEL e5-2690 @ 2.9GHz
Cores 2 x 8
Memory 128GB DDR3-1600 ECC Reg RAM
OS Drives 2 x 2.5″ SATA III 256GB SSD Drives RAID 0
DATA/HOME Hard Disk Drives 4 x 3.5″ SAS2 600GB 15kRPM drives RAID 0
SAS RAID (Onboard, Optional) RAID 0 (OS RAID)
SAS RAID (RAID card, Optional) LSI 2208 (DATA VOL RAID)
Networking (Onboard) Dual GigE (Intel i350)
Video (Onboard) NVIDIA QUADRO K5000
GPU (Optional) NVIDIA TESLA K2000
Operating System Windows 7 Professional 64-bit
Optional Installed Software ANSYS 14.5 Release

imageStats for CUBE HVPC Model Number : w16i-KGPU

Learn more about this and other CUBE HVPC systems here.

Concern #2: Using RAID 5 Array for Solving Disk Volume

The hard drives that are used for I/O during a solve, the solving volume, were configured in a RAID 5 hard disk array. Some sample data below showing the minimum write speed of a similar RAID 5 array. These are speeds that are better off seen in your long-term storage volume not on your solving/working directory.

LSI 2008 HITACHI ULTASTAR 15K600
Qty / Type / Size / RAID Qty 8 x 3.5″ SAS2 15k 600GB RAID 5
TEST # p1
min Read 204 MB/s
max Read 395  MB/s
Avg Read N/A
min Write 106 MB/s
max Write 243.5 MB/s
Avg Write N/A
Access Time N/A

Concern #3: Using RAID 1 for Operating System

The hard drive array for the OS was configured in a RAID 1 configuration. For a number cruncher server having RAID 1 is not necessary. If you absolutely have to have RAID 1. Please spend the extra money and go to a RAID 10 configuration.

I really don’t want to get into the seemingly infinite details of hard drives speeds, latency. Or even begin to explain to you if I should be using an onboard RAID Controller, dedicated RAID controller or a software RAID configuration completed within the OS. There is so much information available on the web that a person gets overloaded. When it comes to Distributed ANSYS, think fast hard drives and fast RAID controllers. Start researching your hard drives and RAID controllers using the list provided below. Again, only as a suggestion! I have listed the drives in order based on a very scientific and nerdy method. If I saw a pile of hard drives, what hard drive would I reach for first?

  1. I prefer using SEAGATE SAVVIO or HITACHI enterprise class drives. (Serial Attached SCSI) SAS2 6Gbit/s 3.5”15,000 RPM spindle drives (best bang for your dollar of space, more read & write heads over a 2.5” spindle hard drive).
  2. I prefer using Micron or INTEL SSD enterprise class SSD. SATA III Solid State Drive 6 Gbit/s (SSD sizes have increased however you will need more of these for an effective solving array and they still are not cheap).
  3. I prefer using the SEAGATE SAVVIO 2.5” enterprise class spindle drives. SAS2 6Gbit/s 2.5” 15,000 RPM spindle drives (if you need a small form factor, fast and additional storage. But the 2.5” drives do not have as many read & write heads as a 3.5” drive. In a situation where I need to slam 4 or 8 drives into a tight location.
    Right now, SEAGATE SAVVIO 2.5” are the way to go!  Here is a link to a data sheet.
    Another similar option is the HITACHI ULTRASTAR 15k600.  It’s spec sheet is here.
  4. SATA II 3Gbit/s 3.5” 7,200 RPM spindle drives are also a good option.  I prefer Western Digital RE4 1TB or 2TB drives. There spec sheet is here.

LSI 2108 RAID Controller and Hard Drive data/details:

image

How a CUBE HVPC System from PADT, Inc. balanced out this configuration and how much would it cost?

I quoted out the below items, installed and out the door (including my travel expenses, etc.) at: $30,601

The company ended up going with their own preferred hardware vendor. Understandable, one good thing is that we are now on the preferred purchasing supplier list. They were greatly appreciative of my consulting time and indicated that they will request a “must have” quote for a CUBE HVPC system the next refresh in a year. They want to go over 1,000 cores the next refresh.

I recommended that they install the following into the HPC cluster based: (note they already had blazing fast hard drives)

  • 16 – Supermicro AOC-S2208L-H8iR LSI 2208 RAID controller cards.
  • 32 – Supermicro CBL-0294L-01 cabling to connect the LSI RAID cards to the SAS2 hard drives.
  • 1 – MELLANOX IS5023 18-port 4X QDR Infiniband switch
  • 16 – Supermicro AOC-UIBQ-M2 Dual port 4X QDR Infiniband card
  • 16 – Supermicro QSFP Infiniband cables in a couple different lengths

A special thanks and shout out to Sheldon Imaoka of ANSYS, Inc. for inspiring me to write this blog article!

2000 Core Milestone Passed for CUBE HVPC Systems

IMG_9548As we put the finishing touches on the latest 512 core CUBE HVPC cluster, PADT is happy to report that there are now 2,042 cores worth of High Value Performance Computing (HVPC) power out there in the form of PADT’s CUBE computer systems.  That is 2,042 Intel or AMD cores crunching away in workstations, compute servers, and mini-clusters chugging on CFD, Explicit Dynamics, and good old fashioned structural models – producing more accurate results in less time for less cost.

When PADT started selling CUBE HVPC systems it was for a very simple reason: our customers wanted to buy more compute horsepower but they could not afford it within their existing budgets. They saw the systems we were using and asked if we could build one for them.  We did. And now we have put together enough systems to get to 2,042 cores and over 9.5TB of RAM.

CUBE-HVPC-512-core-closeup3-1000h

Our Latest Cluster is Ready to Ship

We just finished testing ANSYS, FLUENT, and HFSS on our latest build, a 512 core AMD based cluster. IT is a nice system:

  • 512 2.5GHz AMD Opteron 6380 Processors: 16 cores per chip, 4 chips per node, 8 nodes
  • 2,048 GB RAM, 256GB per node, 8 nodes
  • 24 TB disk space – RAID0:  3TB per node, 8 nodes
  • 16 Port 40Gbps Infiniband Switch (so they can connect to their older cluster as well)
  • Linux

All for well under $180,000.

It was so pretty that we took some time to take some nice images of it (click to see the full size):

CUBE-HVPC-512-core-front1-1000h CUBE-HVPC-512-core-service1-1000h CUBE-HVPC-512-core-stairs-1000h

And it sounded so awesome, that we took this video so everyone can here it spooling up on an FLUENT benchmark:

If that made you smile, you are a simulation geek!

Next we are building two 64 core compute servers, for another repeat customer, with an Infiniband switch to hook up to their two existing CUBE systems. This will get them to a 256 core cluster.

We will let you know when we get to 5000 cores out there!

Are you ready to step out of the box, and step into a CUBE?  Contact us to get a quote for your next simulation workstation, compute server, or cluster.

Monster in the Closet: PADT Goes Live with 512 Core HVPC CUBE Cluster

imageThere is a closet in the back of PADT’s product development lab. It does not store empty boxes, old files, or obsolete hardware.  Within that closet is a monster.  Not the sort of monster that scares little children at night.  No, this is a monster that puts fear into the heart of those who try to paint high performance computing as a difficult and expensive task only to be undertaking by those who are in the priesthood.  It makes salespeople who earn fat commissions by selling consulting services and unnecessary add-ons quake in fear.

This closet holds PADT’s latest upgrade to our compute infrastructure: a 512 core CUBE HVPC Cluster.  No data center, no special consultants, no expensive add-ons. Just 512 cores chugging away at solving FLUENT and CFX problems, and pumping a large amount of heat up into the ceiling.

Here are the specifics:

CUBE C512 Columbia Class Cluster

  • 512 AMD 2.4GHz Cores (in 8 nodes, 4 sockets per node, 16 cores per socket)
  • 2TB RAM (256 GB per node of DDR3 1600 ECC RAM)
  • Raid Controller Card (1 per node)
  • 24TB Data Disk Space (3TB per node of SAS2 15k drives in RAID0)
  • Infiniband (8 Port switch, 40 Gbps)
  • 52 Port GIGE switch connected to 2 GIGE ports per node
  • 42 U Rack with thermal convection ducting (chimney)
  • Keyboard, monitor, mouse in drawer
  • CENTOS (switching to RedHat soon)

We built this system with CFD simulation in mind.  The original goal was to provide a proof of concept to expand our CUBE HVPC offering, showing that you can create a cluster of this size, with very good speed, for a price that small and medium sized companies can afford.  We also needed a way to run large problems for benchmarks in support of our ANSYS sales efforts and to provide faster technical support our FLUENT and CFX customers.  We already have a growing queue of benchmarks waiting to get into the machine.

The image above is the glamour shot.  Here is what it looks like in the closet:

image

Keeping with our theme of High Value Performance Computing we stuck it into this closet that was built for telephone equipment and networking equipment back at the turn of the century when Motorola had this suite.  We were able to fit a modern rack in next to an old rack that was in there. We then used the included duct to push the air up into our ceiling space and moved the A/C ducting to duct right into the front of the units.  We did need to keep the flow going into the rack instead of into the area under the networking and telephone switches, so we used an old video game poster:

image
Anyone remember Ratchet and Clank? 
Best PS2 games ever.

It works well and adds a little color to the closet.

So far our testing has shown some great numbers. Not the fastest cluster out there, but if you look at the cost, it offers incredible performance.   You could add a drive array over Infiniband, faster chips, and some redundant power. And it will run faster and more reliably, but it will cost much more.  We are cheap so we like this solution.

Oh yea, with the parts from our old CFD cluster and some new bits, we will be building a smaller mini-cluster using INTEL chips, a GPU or two, and a ton of fast disk and RAM as our FEA cluster.  Look for an update on that in a couple of months.

Interested in getting a cluster like this for you computing pleasure?  A system configured like this one will run about $150,000 (video game poster is extra). Visit our CUBE page to learn more or just shoot an email to sales@padtinc.com.  Don’t worry, we don’t sell these with sales people, someone from IT will get back with you.

Empty High-End Computer Rack, What Should we Fill it With?

Empty high end rack
Empty high end rack

We have a new rack installed in our compute server room (well closet really).  I wonder what we can fill that with? Looks like it can handle a lot of heat, and a lot of units.  We shall see what the week brings.

I smell some new CUBE HVPC hardware in PADT’s future. Stay Tuned.

Building CUBE Mini-Clusters in the Clean Room

It is a busy time in the world of CUBE computers. We are building our own new cluster, replacing a couple of older file servers we bought from “those other guys” and building a 128 core mini-cluster for a new CUBE customer.  We ran out of room in the IT cubicle so we looked around and found that PADT’s clean room was not being used.  A few tables and tools later and we had a mini-cluster assembly facility.

CUBE-HVPC-CleanRoom_Asembly

With the orders that customers have told us are on the way before the end of the year, this is going to be a busy area through December.