Using Bright CM to Manage a Linux Cluster

What goes into managing a Linux HPC (High Performance Computing) cluster?

There is an endless list of software, tools and configurations that are required or recommended for efficiently managing a shared HPC cluster environment.

A shared HPC cluster typically has many layers that together deliver a usable environment, one that doesn't depend on users coordinating closely or on system administrators being superheroes of late-night patching and just-in-time recovery.


Figure 1. Typical layers of a shared HPC cluster.

For each layer in the diagram above there are numerous open-source and paid software tools to choose from. The catch is that it is not just a matter of picking one. System administrators have to weigh user requirements, compatibility tweaks, and ease of implementation and use to come up with the perfect recipe (much like carrot cake). Once the choices have been made, users and system administrators have to train, learn, and start utilizing these tools.

HPC @ PADT Inc.

At PADT Inc. we have several Linux-based HPC clusters that are in high demand. Our clusters are based on the CUBE High Value Performance Computing (HVPC) systems and are designed to optimize the performance of numerical simulation software. We were facing several challenges that are common to building and maintaining HPC clusters, mainly in the areas of security, imaging and deployment, resource management, monitoring, and maintenance.

To solve these challenges there is an endless list of software tools and packages, both open-source and commercial. Each one of these tools comes with its own steep learning curve and the mounting time required to test and implement it.

Enter – Bright Computing

After testing several tools we came across Bright Cluster Manager (Bright CM) from Bright Computing. Bright CM eliminates the need for system administrators to manually install and configure the most common HPC cluster components. On top of that, it provides the majority of the HPC software packages, tools, and libraries in its default software image.

A Bright CM cluster installation starts off with an extremely useful installation wizard that asks all of the right questions while giving the user full control to customize the installation. With a notepad, a couple of hours, and a basic understanding of HPC clusters, you are ready to install your applications.


Figure 2. Installation Wizard

An all-knowing dashboard helps system admins manage and monitor the cluster(s), or, if you prefer the CLI, the CM shell provides the same functionality from the command line. From the dashboard, system admins can manage multiple clusters down to the finest details.


Figure 3. Cluster Management Interface.

An extensive cluster monitoring interface allows system admins, users, and key stakeholders to generate and view detailed reports about the different cluster components.


Figure 4. Cluster Monitoring Interface.

Bright CM has proven to be a valuable tool in managing and optimizing our HPC environment. For further information and a demo of Bright Cluster Manager please contact sales@padtinc.com.

Help! My New HPC System is not High Performance!

It is an all too common feeling, that sinking feeling that leads to the phrase “Oh Crap” being muttered under your breath. You just spent almost a year getting management to pay for a new compute workstation, server or cluster. You did the ROI and showed an eight-month payback because of how much faster your team’s runs will be. But now you have the benchmark data on real models, and they are not good. “Oh Crap”

Although this is a frequent problem, and the root causes are often the same, the solutions can vary. In this posting I will try to share what our IT and ANSYS technical support staff here at PADT have learned.

Hopefully this article will help you avoid or work around these pitfalls, current or future, when you order an HPC system. PADT loves numerical simulation; we have been doing this for twenty years now. We enjoy helping, so if you are stuck in this situation, let us know.

Wall Clock Time

It is very easy to get excited about clock speeds, bus bandwidth, and disk access latency. But if you are solving large FEA or CFD models you really only care about one thing: wall clock time. We cannot tell you how many times we have worked with customers, hardware vendors, and sometimes developers, who get all wrapped up in the optimization of one little aspect of the solving process. The problem with this is that high performance computing is about working in a system, and the system is only as good as its weakest link.

We see people spend thousands on disk drives and high-speed disk controllers only to discover that their solves are CPU bound, so adding better disk drives makes no difference. We also see people blow their budget on the very best CPUs but not invest in enough memory to solve their problems in-core. This often happens because they look at one small portion of the benchmark data and maximize that measurement, when that measurement often doesn't really matter.

The fundamental thing that you need to keep in mind while ordering or fixing an HPC system for numerical simulation is this: all that matters is how long it takes in the real world from when you click “Solve” till your job is finished. I bring this up first because it is so fundamental, and so often ignored.
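To make that concrete, here is a minimal sketch of measuring a solve the way a user experiences it, from clicking "Solve" to the job finishing. The solver command and input file names are hypothetical placeholders, not a real ANSYS invocation; the point is simply that elapsed wall clock time, not any single subsystem metric, is the number to compare between configurations.

```python
import subprocess
import time

# Hypothetical solver command and input file -- substitute whatever you actually run.
cmd = ["my_solver", "-np", "16", "-i", "model.inp"]

start = time.perf_counter()
subprocess.run(cmd, check=True)                 # blocks until the solve finishes
elapsed_hours = (time.perf_counter() - start) / 3600.0

# The only number worth comparing between hardware configurations:
print(f"Wall clock time: {elapsed_hours:.2f} hours")
```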

The Causes

As mentioned above, an HPC server or cluster is a system made up of hardware, software, and the people who support it. And it is only as good as its weakest link. The key to designing or fixing your HPC system is to look at it as a system, find the weakest links, and improve those links' performance. (OK, who remembers the "Weakest Link" lady? You know you kind of miss her…)

In our experience we have found that the cause for most poorly performing systems can be grouped into one of these categories:

  • Unbalanced System for the Problems Being Solved
    One of the components in the system cannot keep up with the others. This can be hardware or software, though more often than not it is the hardware being used. The items below cover several gotchas in a misconfigured numerical simulation machine.
  • I/O is a Bottleneck
    Number crunching, memory, and storage are only as fast as the devices that transfer data between them.
  • Configured Wrong
    Out of a simple lack of experience, the wrong hardware is used, the OS settings are wrong, or drivers are not configured properly.
  • Unnecessary Stuff Added out of Fear
    People tend to overcompensate out of fear that something bad might happen, so they burden a system with software and redundant hardware to avoid a one in a hundred chance of failure, and slow down the other ninety-nine runs in the process.

Avoiding an Expensive Medium Performance Computing (MPC) System

The key to avoiding these situations is to work with an expert who knows the hardware AND the software, or become that expert yourself. That starts with reading the ANSYS documentation, which is fairly complete and detailed.

Oftentimes your hardware provider will present themselves as the expert, and their heart may be in the right place. But only a handful of hardware providers really understand HPC for simulation. Most simply try to sell you the "best" configuration you can afford and don't understand the causes of poor performance listed above. More often than we would like, they sell a system that is great for databases, web serving, or virtual machines. That is not what you need.

A true numerical simulation hardware or software expert should ask you questions about the following; if they don't, you should move on:

  • What solver will you use the most?
  • What is more important, cost or performance? Or better: Where do you want to be on the cost vs. performance curve?
  • How much scratch space do you need during a solve? How much storage do you need for the files you keep from a run?
  • How will you be accessing the systems, sending data back and forth, and managing your runs?

Another good test of an expert: if you have both FEA and CFD needs, they should not recommend a single system for you. You may be constrained by budget, but an expert should know the difference between the two solvers vis-à-vis HPC and design separate solutions for each.

If they push virtual machines on you, show them the door.

The next thing you should do is step back and take the advice of writing instructors: start cutting stuff. (I know, if you have read my blog posts for a while, you know I'm not practicing what I preach. But you should see the first drafts…) You really don't need huge, costly UPSs, the expensive archival backup system, or some arctic-chill bubbling liquid nitrogen cooling system. Think of it as a race car: if it doesn't make the car go faster or keep the driver safe, you don't need it.

A hard but important step in cutting things down to the basics is to try to let go of the emotional aspect. It is in many ways like picking out a car: the red paint job doesn't make it go any faster, and the fancy tail pipes look good but don't help either. Don't design for the worst-case model either. If 90% of your models run in 32GB of RAM, don't buy a 128GB system for that one big run you need to do each year. Suffer a slow solve on that one and use the money to get a faster CPU, a better disk array, or maybe a second box.

Pull back, be an engineer, and just get what you need. Tape robots look cool; blinky lights and flashy plastic case covers look even cooler. Do you really need that? Most of the time the numerical simulation cruncher is locked up in a cold, dark room. Having an intern move data to USB drives once a month may be a more practical solution.

Another aspect of cutting back is dealing with that fear thing. The most common mistake we see is people configuring their RAID arrays for data redundancy rather than read/write speed. Turn off that redundant writing and stripe across as many drives as you can in parallel: RAID 0. Yes, you may lose a drive. Yes, that means you lose a run. But if that happens once every six months, which is very unlikely, the lost productivity from those lost runs is small compared to the lost productivity of solving all those other runs on a slow disk array.
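As a rough illustration of why striping matters for a scratch volume, here is some back-of-the-envelope arithmetic. The per-drive write speed is an assumption for a 15k SAS spindle, not a measured value; measure your own drives.

```python
# Back-of-the-envelope striping arithmetic (assumed per-drive speed, not measured).
per_drive_mb_s = 150          # rough sequential write for one 15k SAS spindle
drives = 8

# RAID 0: all spindles stream in parallel, so writes scale with drive count.
raid0_write = drives * per_drive_mb_s
print(f"RAID 0 across {drives} drives: ~{raid0_write} MB/s sequential write")

# RAID 5 gives up one drive's capacity to parity and pays a read-modify-write
# penalty on writes, so sustained write throughput lands well below this figure.
```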

Lastly, benchmark. This is obvious but often hard to do right. The key is to find real problems that represent the spectrum of runs you plan on doing. Often different runs, even within the same solver, have different HPC needs; it is a good idea to understand which are more common and bias your design toward those, as sketched below. Do not benchmark with generic standard benchmarks; use industry-accepted benchmarks for numerical simulation. Yes, it's an amazing feeling knowing that your new cluster is number 500 on the Top 500 list, but if it is number 5000 on the ANSYS numerical simulation benchmark list, nobody wins.
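One way to keep that bias honest is to weight your measured wall clock times by how often each kind of run actually happens. The sketch below is illustrative only; the case names, weights, and hours are made-up placeholders for your own measurements.

```python
# Hypothetical measurements: (case name, fraction of your runs, wall clock hours).
# Replace these placeholders with your own benchmark results and run mix.
runs = [
    ("everyday_model",   0.7,  1.5),
    ("large_assembly",   0.2,  6.0),
    ("worst_case_model", 0.1, 30.0),
]

weighted_hours = sum(frac * hours for _, frac, hours in runs)
print(f"Weighted average wall clock time: {weighted_hours:.1f} h")

# Bias the hardware design toward whatever dominates this weighted figure,
# not toward the single biggest model you run once a year.
```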

Fixing the System You Have

As of late we have started tearing down clusters at numerous companies around the US. Of course we would love to sell you new hardware; however at PADT, as mentioned before, we love numerical simulation. Fixing your current system may allow you to stretch that investment another year or more. As the co-owner of a twenty-year-old company, this makes me feel good about that initial investment. When we sic our IT team on extending the life of one of our systems, I can start thinking about and planning for that next $150k investment we will need in a year or more.

Breathing new life into your existing hardware requires almost the same steps as avoiding a bad system in the first place. PADT has sent our team around the country helping companies breathe new life into their existing infrastructure. The steps they use are the same, but instead of designing things from scratch, we change things: work with an expert, start cutting stuff out, breathe new life into the aging hardware, avoid fear- and "cool factor"-based choices, and verify everything.

Take a look at and understand the output from your solvers; there is a lot of data in there. As an example, here is an article we wrote describing some of those hidden gems within your numerical simulation outputs: http://www.padtinc.com/blog/the-focus/ansys-mechanical-io-bound-cpu-bound

Play with things, see what helps and what hurts. It may be time to bring in an outside expert to look at things with fresh eyes.

Do not be afraid to push back against what IT is suggesting; unless you are very fortunate, they probably don't have the same understanding of numerical simulation computing that you do. They care about security and minimizing the cost of maintaining systems. They tend not to be risk takers, and they don't like non-standard solutions. All of this can result in a system that is configured for IT, not for fast numerical simulation solves. You may have to bring in senior management to resolve this issue.

PADT is Here to Help

The easiest way to avoid all of this is to simply purchase your HPC hardware from PADT. We know simulation, we know HPC, and we can translate between engineers and IT. This is simply because simulation is what we do, and have done, since 1994. We can configure the right system to meet your needs, at the point on the cost vs. performance curve you want. Our CUBE systems also come preloaded and tested with your simulation software, so you don't have to worry about getting things to work once the hardware shows up.

If you already have a system or are locked in to a provider, we are still here to help. Our system architects can consult over the phone or in person, bringing their expertise to the table on fixing existing systems or spec'ing new ones. In fact, the idea for this article came about when our IT manager was reconfiguring a customer's "name brand" cluster here in Phoenix and got a call from a user in the Midwest with the exact same problem: lots of expensive hardware and disappointing performance. Both had the wrong hardware for their problems, system bottlenecks, and configuration issues.

Learn more on our HPC Server and Cluster Performance Tuning page, or by contacting us. We would love to help out. It is what we like to do and we are good at it.

Part 2: ANSYS FLUENT Performance Comparison: AMD Opteron vs. Intel XEON

AMD Opteron 6308, INTEL XEON e5-2690 & INTEL XEON e5-2667V2 Comparison using ANSYS FLUENT 14.5.7

Note: The information and data contained in this article were compiled and generated on September 12, 2013 by PADT, Inc. on CUBE HVPC hardware using ANSYS FLUENT 14.5.7. Please remember that hardware and software change with new releases, and you should always try to run your own benchmarks, on your own typical problems, to understand how performance will impact you.

By David Mastel

Due to the response to the original article on this subject,  I thought it would be good to do a quick follow-up using one of our latest CUBE HVPC builds. Again, the ANSYS Fluent standard benchmarks were used in garnering the stats on this dual socket INTEL XEON e5-2667V2 configuration.

CUBE HVPC Test configurations (Same as in last comparison)

Server 1: CUBE HVPC c16 (AMD server from last comparison)

  • CPU: 4, AMD Opteron 6308 @ 3.5GHz (Quad Core)
  • Memory: 256GB (32x8G) DDR3-1600 ECC Reg. RAM (1600MHz)
  • Hardware RAID Controller: Supermicro AOC-S2208L-H8iR 6Gbps, PCI-e x 8 Gen3
  • Hard Drives: Supermicro HDD-A0600-HUS156060VLS60 – Hitachi 600G SAS2.0 15K RPM 3.5″
  • OS: Linux 64-bit / Kernel 2.6.32-358.18.1.el6.x86_64
  • App: ANSYS FLUENT 14.5.7
  • MPI: Platform MPI
  • HCA: SMC AOC-UIBQ-M2 – QDR Infiniband
    • The IB card was installed; however, solves were run distributed locally
  • Switch: MELLANOX IS5023 non-blocking 18-port switch

Server 2: CUBE HVPC c16i (Intel server from last comparison)

  • CPU: 2, INTEL XEON e5-2690 @ 2.9GHz (Octa Core)
  • Memory: 128GB (16x8G) DDR3-1600 ECC Reg. RAM (1600MHz)
  • RAID Controller: Supermicro AOC-S2208L-H8iR 6Gbps, PCI-e x 8 Gen3
  • Hard Drives: Supermicro HDD-A0600-HUS156060VLS60 – Hitachi 600G SAS2.0 15K RPM 3.5″
  • OS: Windows 7 Professional 64-bit
  • App: ANSYS FLUENT 14.5.7
  • MPI: Platform MPI

Server 3: CUBE HVPC c16ivy (New “Ivy” based Intel server)

  • CPU: 2, INTEL XEON e5-2667V2 @ 3.3GHz (Octa Core)
  • Memory: 128GB (16x8G) DDR3-1600 ECC Reg. RAM (1600MHz)
  • RAID Controller: Supermicro AOC-S2208L-H8iR 6Gbps, PCI-e x 8 Gen3
  • Hard Drives: Supermicro HDD-A0600-HUS156060VLS60 – Hitachi 600G SAS2.0 15K RPM 3.5″
  • OS: Linux 64-bit / Kernel 2.6.32-358.18.1.el6.x86_64
  • App: ANSYS FLUENT 14.5.7
  • MPI: Platform MPI
  • HCA: SMC – QDR Infiniband
    • The IB card was installed; however, solves were run distributed locally

ANSYS FLUENT 14.5.7 Performance using the ANSYS FLUENT Benchmark suite provided by ANSYS, Inc.

ANSYS Fluent benchmark page: http://www.ansys.com/Support/Platform+Support/Benchmarks+Overview/ANSYS+Fluent+Benchmarks

Release ANSYS FLUENT 14.5.7 Test Cases
(20 Iterations each)

  • Reacting Flow with Eddy Dissipation Model (eddy_417k)
  • Single-stage Turbomachinery Flow (turbo_500k)
  • External Flow Over an Aircraft Wing (aircraft_2m)
  • External Flow Over a Passenger Sedan (sedan_4m)
  • External Flow Over a Truck Body with a Polyhedral Mesh (truck_poly_14m)
  • External Flow Over a Truck Body 14m (truck_14m)

Here are the results for all three machines, total and average time:

[Charts: total and average solve times for all three servers]

Summary: Are you sure? Part 2

So I didn’t have to have the “Are you sure?” question with Eric this time and I didn’t bother triple checking the results because indeed, the Ivy Bridge-EP Socket 2011 is one fast CPU! That combined with a 0.022 micron manufacturing process  the data speaks for itself. For example, lets re-dig into the data for the External Flow Over a Truck Body with a Polyhedral Mesh (truck_poly_14m) benchmark and see what we find:

[Figure: Intel vs. AMD FLUENT timing details for truck_poly_14m]

[Figure: Intel vs. AMD FLUENT benchmark summary]

Current Pricing of INTEL® and AMD® CPUs

Here is up-to-the-minute pricing for each CPU. I took these prices off of NewEgg's and IngramMicro's websites on October 4, 2013.

Note AMD’s price per CPU went up and the INTEL XEON e5-2690 went down. Again, these prices based on today’s pricing, October 4, 2013.

AMD Opteron 6308 Abu Dhabi 3.5GHz 4MB L2 Cache 16MB L3 Cache Socket G34 115W Quad-Core Server Processor OS6308WKT4GHKWOF

  •  $501 x 4 = $2004.00

Intel Xeon E5-2690 2.90 GHz Processor – Socket LGA-2011, L2 Cache 2MB, L3 Cache 20 MB, 8 GT/s QPI

  • $1986.48 x 2 = $3972.96

Intel Xeon E5-2667V2 3.3 GHz Processor – Socket LGA-2011, L2 Cache 2MB, L3 Cache 25 MB, 8 GT/s QPI,

  • $1933.88 x 2 = $3867.76

REFERENCES:
http://www.ingrammicro.com
http://www.newegg.com

INTEL XEON e5-2667V2
http://ark.intel.com/products/75273/Intel-Xeon-Processor-E5-2667-v2-25M-Cache-3_30-GHz

INTEL XEON e5-2690
http://ark.intel.com/products/64596/

AMD Opteron 6308
http://www.amd.com/us/Documents/Opteron_6300_QRG.pdf

http://en.wikipedia.org/wiki/Double-precision_floating-point_format

http://en.wikipedia.org/wiki/Central_processing_unit#Integer_range

http://en.wikipedia.org/wiki/Floating_point

STEP OUT OF THE BOX, STEP INTO A CUBE

PADT offers a line of high performance computing (HPC) systems specifically designed for CFD and FEA number crunching aimed at a balance between cost and performance. We call this concept High Value Performance Computing, or HVPC. These systems have allowed PADT and our customers to carry out larger simulations, with greater accuracy, in less time, at a lower cost than name-brand solutions. This leaves you more cash to buy more hardware or software.

Let CUBE HVPC by PADT, Inc. quote you a configuration today!

 

Columbia: PADT’s Killer Kilo-Core CUBE Cluster is Online

iIn the back of PADT’s product development lab is a closet.  Yesterday afternoon PADT’s tireless IT team crammed themselves into the back of that closet and powered up our new cluster, bringing 1104 connected cores online.  It sounded like a jet taking off when we submitted a test FLUENT solve across all the cores.  Music to our ears.

We have recently been slammed with benchmarks for ANSYS and CUBE customers on top of our normal load of services work, so we decided it was time to pull the trigger and double the size of our cluster while adding a storage node. And of course, we needed it yesterday. So the IT team rolled up their sleeves, configured a design, ordered hardware, built it up, tested it all, and got it online in less than two weeks. This was while they did their normal IT work and dealt with a steady stream of CUBE sales inquiries. But it was a labor of love. We have all dreamed about breaking that thousand-core barrier on one system, and this was our chance to make it happen.

If you need more horsepower and are looking for a solution that hits that sweet spot between cost and performance, visit our CUBE page at www.cube-hvpc.com and learn more about our workstations, servers, and clusters.  Our team (after they get a little rest) will be more than happy to work with you to configure the right system for your real world needs.

Now that the sales plug is done, let's take a look at the stats on this bad boy:

Name: Columbia (after the class of battlestars in Battlestar Galactica)
Brand: CUBE High Value Performance Compute Cluster, by PADT
Nodes: 18 (17 compute, 1 storage/control node, 4 CPUs per node)
Cores: 1104
CPUs: AMD Opteron, 4 x 6308 3.5 GHz, 32 x 6278 2.4 GHz, 36 x 6380 2.5 GHz
Interconnect: 18-port MELLANOX IB 4X QDR Infiniband switch
Memory: 4.864 Terabytes
Solve Disk: 43.5 TB RAID 0
Storage Disk: 64 TB RAID 50

Here are some pictures of the build and the final product:

A huge delivery from our supplier, Supermicro, started the process. This was the first pallet.

The build included installing the largest power strip any of us had ever seen.

Building a cluster consists of doing the same thing, over and over and over again.

We took over PADT's clean room because it turns out you need a lot of space to build something this big.

It is fun to get the chance to build the machine you always wanted to build.

2AM Selfie: Still going strong!

Almost there. After blowing a breaker, we needed to wait for some more power to be routed to the closet.

Up and running! Ratchet and Clank providing cooling air containment.

David, Sam, and Manny deserve a big shout-out for doing such a great job getting this thing up and running so fast!

When I logged on to my first computer, a TRS-80, in my high-school computer lab, I never, ever thought I would be running on a machine this powerful.  And I would have told people they were crazy if they said a machine with this much throughput would cost less than $300,000.  It is a good time to be a simulation user!

Now I just need to find a bigger closet for when we double the size again…


Why do my ANSYS jobs take days and weeks to finish? Well it depends…

Real World Lessons on How to Minimize Run Time for ANSYS HPC

Recently I had a VP of Engineering start a phone conversation with me that went something like this. “Well Dave, you see this is how it is. We just spent a truckload of money on a 256 core cluster and our solve times are slower now than with our previous 128 core cluster. What the *&(( is going on here?!”

I imagine many of us have heard similar stories or even received the same questions from our co-workers, CEOs, and directors. I immediately had my concerns, and I thought carefully about what I should say next. I recalled a conversation I had with one of my college professors. He had told me that when I find myself stepping into a gray area, a good way to start the conversation is to say, "Well, it depends…"

Guess what, that is exactly what I said. I said, "Well, it depends…" and then explained two fundamental limits of computer science that have plagued most of us since computers were created: "Well, you may be CPU bound (compute bound) or I/O bound." He told me that they had paid a premium for the best CPUs on the market and gave me some other details about the HPC cluster. After garnering a few more details about the cluster, my hunch was that his HPC cluster might actually be I/O bound.

I/O Bound

Basically this means that your cluster's $2,000 CPUs are stalled out and sitting idle, waiting for new data to process before they can move on. I also briefly explained that his HPC cluster might instead be compute bound, but I quickly reassured him that this was unlikely, maybe a 10% chance. I knew the specifications of the CPUs in this HPC cluster, and the likelihood that they were the cause of his slow ANSYS run times was low on my radar; these were literally the latest and greatest CPUs ever to hit this planet (at that moment in time). So let me step back a minute and refresh our memories on what it means when a system is compute bound.

Compute Bound

Being compute bound means that the HPC cluster's CPUs are sitting at 99 or 100% utilization for long periods of time. When this happens, very bad things begin to happen to your HPC cluster: CPU requests to peripherals are delayed or lost to the ether, and the HPC cluster may become unresponsive or even lock up. A rough first check that distinguishes the two situations is sketched below.
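For what it's worth, here is a rough, Linux-only sketch of a first check you can run on a node while a job is solving. It samples /proc/stat twice: sustained busy time near 100% points toward being compute bound, while a lot of iowait with idle CPUs points toward being I/O bound. It is a coarse indicator, not a replacement for proper profiling.

```python
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def report(interval=5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    idle, iowait = delta[3], delta[4]
    busy = total - idle - iowait
    print(f"busy:   {100 * busy / total:5.1f}%  (pegged near 100% -> likely compute bound)")
    print(f"iowait: {100 * iowait / total:5.1f}%  (high while busy is low -> likely I/O bound)")
    print(f"idle:   {100 * idle / total:5.1f}%")

report()
```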

All I could hear was silence on the other end. "Dave, I get it, I understand, please find the problem and fix our HPC cluster for us." I happily agreed to help out! I concluded our phone conversation by asking that he send me the specific details, down to the nuts and bolts of the hardware, along with the operating system and software that were installed and used on the 256 core HPC cluster.

What NOT to do when configuring an ANSYS Distributed HPC cluster.

Seeking that perfect balance!

After a quick NDA signing, a few dollars exchanged, and a sprinkle of some other legal things that lawyers get excited about, I set out to discover the cause. After reviewing the information provided to me, I almost immediately saw three concerns:

To interconnect what?

Let Merriam-Webster describe it:

Definition of INTERCONNECT

transitive verb
: to connect with one another
intransitive verb
: to be or become mutually connected

— in·ter·con·nec·tion noun
— in·ter·con·nec·tiv·i·ty noun

1. The systems are interconnected with a series of wires.
2. The lessons are designed to show students how the two subjects interconnect
3.  A series of interconnecting stories

First Known Use of INTERCONNECT: 1865

Concern numeral Uno!!! Interconnect me

Though the company’s 256 core HPC cluster had a second dedicated GigE interconnect. Distributed ANSYS is highly bandwidth and latency bound often requiring more bandwidth than a dedicated NIC (Network Interface Card) may provide. Yes, the dedicated second GigE card interconnect was much better than trying to use a single NIC for all of the network traffic which would also include the CPU interconnect. I did have a few of the MAPDL output files from the customer that I could take a peek at. After reviewing the customer output files it became fairly clear that interconnect communication speeds between the 16 core x 16 server in the cluster was not adequate. The master Message Parsing Interface (MPI) process that Distributed ANSYS uses requires a high amount of bandwidth and low latency for proper distributed scaling to the other processes. Theoretically the data bandwidth between cores solving local to the machine will be higher than the bandwidth traveling across the various interconnect methods (see below). ANSYS, Inc. recommends Infiniband for CPU interconnect traffic. Here are a couple of reasons why they recommend this. See how the theoretical data limits increase going from Gigabit Ethernet up to FDR Infiniband.

Theoretical lane bandwidth limits:

  • Gigabit Ethernet (GigE): ~128 MB/s
  • Single Data Rate (SDR): ~328 MB/s
  • Double Data Rate (DDR): ~640 MB/s
  • Quad Data Rate (QDR): ~1,280 MB/s
  • Fourteen Data Rate (FDR): ~1,800 MB/s
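The arithmetic behind those numbers, and behind the 4X aggregate in the next paragraph, is roughly this: take the nominal per-lane signaling rate in Gbit/s and treat 1 Gbit/s as about 128 MB/s. This is a simplification that ignores encoding and protocol overhead, so real-world throughput comes in lower.

```python
# Nominal per-lane signaling rates in Gbit/s, converted with 1 Gbit/s ~= 128 MB/s.
# Encoding and protocol overhead are ignored, so these are optimistic ceilings.
lane_gbps = {"GigE": 1.0, "SDR": 2.5, "DDR": 5.0, "QDR": 10.0, "FDR": 14.0}

for name, gbps in lane_gbps.items():
    print(f"{name}: ~{gbps * 128:.0f} MB/s per lane")

# Aggregated links multiply the lane count, e.g. a 4X QDR link:
print(f"4X QDR: ~{4 * 10.0 * 128:.0f} MB/s")   # ~5,120 MB/s, as noted below
```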

GEEK CRED: A few years ago companies such as MELLANOX started aggregating the Infiniband lanes. The typical aggregate modifiers are 4X or even 12X. So, for example, the 4X QDR Infiniband switch and cards I use at PADT, and recommended to this customer, would have (4 x 10Gbit/s) or 5,120 MB/s of throughput! Here is a quick video that I made of a MELLANOX IS5023 18-port 4X QDR full bi-directional switch in action:

This is how you do it with a CUBE HVPC! Take the MAPDL output file from our CUBE HVPC w16i-GPU workstation running the ANSYS industry benchmark V14sp-5: I wanted to show the communication speeds between the master MPI process and the other solver processes to see just how fast the solvers can communicate. With a peak communication speed of 9593 MB/s, this CUBE HVPC workstation rocks!

Chassis Profile: 4U standard depth, rackmountable
CPU: 1 x dual socket
Chipset: INTEL 602 chipset
Processors: 2 x INTEL e5-2690 @ 2.9GHz
Cores: 2 x 8
Memory: 128GB DDR3-1600 ECC Reg RAM
OS Drives: 2 x 2.5″ SATA III 256GB SSD drives, RAID 0
DATA/HOME Hard Disk Drives: 4 x 3.5″ SAS2 600GB 15kRPM drives, RAID 0
SAS RAID (onboard, optional): RAID 0 (OS RAID)
SAS RAID (RAID card, optional): LSI 2208 (DATA VOL RAID)
Networking (onboard): Dual GigE (Intel i350)
Video (onboard): NVIDIA QUADRO K5000
GPU (optional): NVIDIA TESLA K2000
Operating System: Windows 7 Professional 64-bit
Optional Installed Software: ANSYS 14.5 Release

Stats for CUBE HVPC Model Number: w16i-KGPU

Learn more about this and other CUBE HVPC systems here.

Concern #2: Using RAID 5 Array for Solving Disk Volume

The hard drives used for I/O during a solve, the solving volume, were configured in a RAID 5 hard disk array. Below is some sample data showing the write speeds of a similar RAID 5 array. These are speeds better suited to your long-term storage volume, not your solving/working directory.

LSI 2008 controller, HITACHI ULTRASTAR 15K600 drives
Qty / Type / Size / RAID: 8 x 3.5″ SAS2 15k 600GB, RAID 5
Test #: p1
Min Read: 204 MB/s
Max Read: 395 MB/s
Avg Read: N/A
Min Write: 106 MB/s
Max Write: 243.5 MB/s
Avg Write: N/A
Access Time: N/A
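If you want a quick sanity check of your own solving volume, here is a minimal sketch that streams a large temporary file and reports sequential write speed. The /scratch path is a placeholder, and a purpose-built disk benchmark will give more trustworthy numbers; treat this as a rough first look only.

```python
import os
import time

def sequential_write_mb_s(path, total_mb=2048, block_mb=8):
    """Rough sequential write test: stream total_mb of data in block_mb chunks."""
    block = os.urandom(block_mb * 1024 * 1024)
    test_file = os.path.join(path, "write_test.tmp")
    start = time.perf_counter()
    with open(test_file, "wb") as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())      # make sure the data actually reaches the array
    elapsed = time.perf_counter() - start
    os.remove(test_file)
    return total_mb / elapsed

# "/scratch" is a placeholder -- point this at your solving volume.
print(f"~{sequential_write_mb_s('/scratch'):.0f} MB/s sequential write")
```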

Concern #3: Using RAID 1 for Operating System

The hard drive array for the OS was configured as RAID 1. For a number-crunching server, RAID 1 is not necessary. If you absolutely have to have RAID 1, please spend the extra money and go to a RAID 10 configuration.

I really don’t want to get into the seemingly infinite details of hard drives speeds, latency. Or even begin to explain to you if I should be using an onboard RAID Controller, dedicated RAID controller or a software RAID configuration completed within the OS. There is so much information available on the web that a person gets overloaded. When it comes to Distributed ANSYS, think fast hard drives and fast RAID controllers. Start researching your hard drives and RAID controllers using the list provided below. Again, only as a suggestion! I have listed the drives in order based on a very scientific and nerdy method. If I saw a pile of hard drives, what hard drive would I reach for first?

  1. I prefer using SEAGATE SAVVIO or HITACHI enterprise class drives: (Serial Attached SCSI) SAS2 6Gbit/s 3.5″ 15,000 RPM spindle drives (best bang for your dollar of space, with more read & write heads than a 2.5″ spindle hard drive).
  2. I prefer using Micron or INTEL enterprise class SSDs: SATA III Solid State Drives, 6 Gbit/s (SSD sizes have increased; however, you will need more of these for an effective solving array and they still are not cheap).
  3. I prefer using the SEAGATE SAVVIO 2.5″ enterprise class spindle drives: SAS2 6Gbit/s 2.5″ 15,000 RPM spindle drives (if you need a small form factor, speed, and additional storage; the 2.5″ drives do not have as many read & write heads as a 3.5″ drive, but they shine when I need to slam 4 or 8 drives into a tight location).
     Right now, SEAGATE SAVVIO 2.5″ drives are the way to go! Here is a link to a data sheet.
     Another similar option is the HITACHI ULTRASTAR 15K600. Its spec sheet is here.
  4. SATA II 3Gbit/s 3.5″ 7,200 RPM spindle drives are also a good option. I prefer Western Digital RE4 1TB or 2TB drives. Their spec sheet is here.

LSI 2108 RAID Controller and Hard Drive data/details:

[Image: LSI 2108 RAID controller and hard drive benchmark data]

How would a CUBE HVPC system from PADT, Inc. balance out this configuration, and how much would it cost?

I quoted the items below, installed and out the door (including my travel expenses, etc.), at $30,601.

The company ended up going with their own preferred hardware vendor. Understandable; one good thing is that we are now on their preferred supplier list. They were greatly appreciative of my consulting time and indicated that they will request a "must have" quote for a CUBE HVPC system at their next refresh in a year or so, when they want to go over 1,000 cores.

I recommended that they install the following into the HPC cluster (note that they already had blazing fast hard drives):

  • 16 – Supermicro AOC-S2208L-H8iR LSI 2208 RAID controller cards.
  • 32 – Supermicro CBL-0294L-01 cabling to connect the LSI RAID cards to the SAS2 hard drives.
  • 1 – MELLANOX IS5023 18-port 4X QDR Infiniband switch
  • 16 – Supermicro AOC-UIBQ-M2 Dual port 4X QDR Infiniband card
  • 16 – Supermicro QSFP Infiniband cables in a couple different lengths

A special thanks and shout out to Sheldon Imaoka of ANSYS, Inc. for inspiring me to write this blog article!

2000 Core Milestone Passed for CUBE HVPC Systems

As we put the finishing touches on the latest 512 core CUBE HVPC cluster, PADT is happy to report that there are now 2,042 cores worth of High Value Performance Computing (HVPC) power out there in the form of PADT's CUBE computer systems. That is 2,042 Intel or AMD cores crunching away in workstations, compute servers, and mini-clusters chugging on CFD, Explicit Dynamics, and good old fashioned structural models – producing more accurate results in less time for less cost.

When PADT started selling CUBE HVPC systems it was for a very simple reason: our customers wanted to buy more compute horsepower but they could not afford it within their existing budgets. They saw the systems we were using and asked if we could build one for them.  We did. And now we have put together enough systems to get to 2,042 cores and over 9.5TB of RAM.


Our Latest Cluster is Ready to Ship

We just finished testing ANSYS, FLUENT, and HFSS on our latest build, a 512 core AMD based cluster. It is a nice system:

  • 512 cores: 2.5GHz AMD Opteron 6380 processors, 16 cores per chip, 4 chips per node, 8 nodes
  • 2,048 GB RAM, 256GB per node, 8 nodes
  • 24 TB disk space – RAID0:  3TB per node, 8 nodes
  • 16 Port 40Gbps Infiniband Switch (so they can connect to their older cluster as well)
  • Linux

All for well under $180,000.

It was so pretty that we took some time to take some nice images of it (click to see the full size):

[Photos of the 512-core CUBE HVPC cluster]

And it sounded so awesome that we took this video so everyone can hear it spooling up on a FLUENT benchmark:

If that made you smile, you are a simulation geek!

Next we are building two 64 core compute servers, for another repeat customer, with an Infiniband switch to hook up to their two existing CUBE systems. This will get them to a 256 core cluster.

We will let you know when we get to 5000 cores out there!

Are you ready to step out of the box, and step into a CUBE?  Contact us to get a quote for your next simulation workstation, compute server, or cluster.

Monster in the Closet: PADT Goes Live with 512 Core HVPC CUBE Cluster

There is a closet in the back of PADT's product development lab. It does not store empty boxes, old files, or obsolete hardware. Within that closet is a monster. Not the sort of monster that scares little children at night. No, this is a monster that puts fear into the hearts of those who try to paint high performance computing as a difficult and expensive task only to be undertaken by those in the priesthood. It makes salespeople who earn fat commissions by selling consulting services and unnecessary add-ons quake in fear.

This closet holds PADT’s latest upgrade to our compute infrastructure: a 512 core CUBE HVPC Cluster.  No data center, no special consultants, no expensive add-ons. Just 512 cores chugging away at solving FLUENT and CFX problems, and pumping a large amount of heat up into the ceiling.

Here are the specifics:

CUBE C512 Columbia Class Cluster

  • 512 AMD 2.4GHz Cores (in 8 nodes, 4 sockets per node, 16 cores per socket)
  • 2TB RAM (256 GB per node of DDR3 1600 ECC RAM)
  • Raid Controller Card (1 per node)
  • 24TB Data Disk Space (3TB per node of SAS2 15k drives in RAID0)
  • Infiniband (8 Port switch, 40 Gbps)
  • 52 Port GIGE switch connected to 2 GIGE ports per node
  • 42 U Rack with thermal convection ducting (chimney)
  • Keyboard, monitor, mouse in drawer
  • CENTOS (switching to RedHat soon)

We built this system with CFD simulation in mind. The original goal was to provide a proof of concept to expand our CUBE HVPC offering, showing that you can create a cluster of this size, with very good speed, for a price that small and medium sized companies can afford. We also needed a way to run large problems for benchmarks in support of our ANSYS sales efforts and to provide faster technical support to our FLUENT and CFX customers. We already have a growing queue of benchmarks waiting to get into the machine.

The image above is the glamour shot.  Here is what it looks like in the closet:

[Photo: the cluster installed in the closet]

Keeping with our theme of High Value Performance Computing, we stuck it into this closet, which was built for telephone and networking equipment back at the turn of the century when Motorola had this suite. We were able to fit a modern rack in next to an old rack that was already there. We then used the included duct to push the air up into our ceiling space and moved the A/C ducting to blow right into the front of the units. We did need to keep the flow going into the rack instead of into the area under the networking and telephone switches, so we used an old video game poster:

Anyone remember Ratchet and Clank?
Best PS2 games ever.

It works well and adds a little color to the closet.

So far our testing has shown some great numbers. It is not the fastest cluster out there, but if you look at the cost, it offers incredible performance. You could add a drive array over Infiniband, faster chips, and some redundant power, and it would run faster and more reliably, but it would also cost much more. We are cheap, so we like this solution.

Oh yeah, with the parts from our old CFD cluster and some new bits, we will be building a smaller mini-cluster using INTEL chips, a GPU or two, and a ton of fast disk and RAM as our FEA cluster. Look for an update on that in a couple of months.

Interested in getting a cluster like this for your computing pleasure? A system configured like this one will run about $150,000 (video game poster is extra). Visit our CUBE page to learn more or just shoot an email to sales@padtinc.com. Don't worry, we don't sell these with salespeople; someone from IT will get back to you.

Empty High-End Computer Rack, What Should we Fill it With?

Empty high-end rack

We have a new rack installed in our compute server room (well, closet really). I wonder what we can fill that with? It looks like it can handle a lot of heat and a lot of units. We shall see what the week brings.

I smell some new CUBE HVPC hardware in PADT’s future. Stay Tuned.