Help! My New HPC System is not High Performance!

It is an all too common feeling, that sinking feeling that leads to the phrase “Oh Crap” being muttered under your breath. You just spent almost a year getting management to pay for a new compute workstation, server or cluster. You did the ROI and showed an eight-month payback because of how much faster your team’s runs will be. But now you have the benchmark data on real models, and they are not good. “Oh Crap”

Although this is a frequent problem, and the root causes are often the same, the solutions can vary. In this posting I will try to share what our IT and ANSYS technical support staff here at PADT have learned.

Hopefully this article can help you avoid or work around these pitfalls, current or future, when you order an HPC system. PADT loves numerical simulation; we have been doing this for twenty years now. We enjoy helping, so if you are stuck in this situation, let us know.

Wall Clock Time

It is very easy to get excited about clock speeds, bus bandwidth, and disk access latency. But if you are solving large FEA or CFD models, you really only care about one thing: wall clock time. We cannot tell you how many times we have worked with customers, hardware vendors, and sometimes developers who get all wrapped up in optimizing one little aspect of the solving process. The problem with this is that high performance computing happens in a system, and the system is only as good as its weakest link.

We see people spend thousands on disk drives and high-speed disk controllers, only to discover that their solves are CPU bound, so better disk drives make no difference. We also see people blow their budget on the very best CPUs but not invest in enough memory to solve their problems in-core. This often happens because when people look at benchmark data, they fixate on one small measurement and maximize it, even though that measurement often doesn’t really matter.

The fundamental thing to keep in mind while ordering or fixing an HPC system for numerical simulation is this: all that matters is how long it takes in the real world from when you click “Solve” until your job is finished. I bring this up first because it is so fundamental, and so often ignored.
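To make that concrete, here is a minimal sketch, in Python, of what we mean by timing the whole job: wall clock from launch to finish, nothing else. The solver command is a placeholder, not a real invocation; substitute your own batch syntax.

    import subprocess
    import time

    # Placeholder batch command -- substitute your solver's real
    # invocation (input deck, output file, core count, and so on).
    solver_cmd = ["solver", "-b", "-i", "model.inp", "-o", "model.out"]

    start = time.perf_counter()             # wall clock, not CPU time
    subprocess.run(solver_cmd, check=True)  # blocks until the job finishes
    elapsed = time.perf_counter() - start

    print(f"Wall clock: {elapsed:.1f} s ({elapsed / 3600:.2f} h)")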

The Causes

As mentioned above, an HPC server or cluster is a system made up of hardware, software, and the people who support it. And it is only as good as its weakest link. The key to designing or fixing your HPC system is to look at it as a system, find the weakest links, and improve those links’ performance. (OK, who remembers the “Weakest Link” lady? You know you kind of miss her…)

In our experience we have found that the causes of most poorly performing systems, the gotchas of a misconfigured numerical simulation machine, can be grouped into one of these categories:

  • Unbalanced System for the Problems Being Solved:
    One of the components in the system cannot keep up with the others. This can be hardware or software, but more often than not it is the hardware.
  • I/O is a Bottleneck:
    Number crunching, memory, and storage are only as fast as the devices that transfer data between them.
  • Configured Wrong:
    Out of a simple lack of experience, the wrong hardware is used, the OS settings are wrong, or drivers are not configured properly.
  • Unnecessary Stuff Added out of Fear:
    People tend to overcompensate out of fear that something bad might happen, so they burden a system with software and redundant hardware to avoid a one-in-a-hundred chance of failure, and slow down the other ninety-nine runs in the process.

Avoiding an Expensive Medium Performance Computing (MPC) System

The key to avoiding these situations is to work with an expert who knows the hardware AND the software, or become that expert yourself. That starts with reading the ANSYS documentation, which is fairly complete and detailed.

Oftentimes your hardware provider will present themselves as the expert, and their heart may be in the right place. But only a handful of hardware providers really understand HPC for simulation. Most simply try to sell you the “best” configuration you can afford and don’t understand the causes of poor performance listed above. More often than we like, they sell a system that is great for databases, web serving, or virtual machines. That is not what you need.

A true numerical simulation hardware or software expert should ask you questions about the following; if they don’t, you should move on:

  • What solver will you use the most?
  • What is more important, cost or performance? Or better: Where do you want to be on the cost vs. performance curve?
  • How much scratch space do you need during a solve? How much storage do you need for the files you keep from a run?
  • How will you be accessing the systems, sending data back and forth, and managing your runs?

Another good test of an expert: if you have both FEA and CFD needs, they should not recommend a single system for both. You may be constrained by budget, but an expert should know the difference between the two solver types vis-à-vis HPC and design a separate solution for each.

If they push virtual machines on you, show them the door.

The next thing you should do is step back and take the advice of writing instructors: start cutting stuff. (I know, if you have read my blog posts for a while, you know I’m not practicing what I preach. But you should see the first drafts…) You really don’t need the huge costly UPSs, the expensive archival backup system, or some arctic chill bubbling liquid nitrogen cooling system. Think of it as a race car: if it doesn’t make the car go faster or keep the driver safe, you don’t need it.

A hard but important step in cutting things down to the basics is letting go of the emotional aspect. It is in many ways like picking out a car: the truth is, the red paint job doesn’t make it go any faster, and the fancy tail pipes look good but don’t help either. Don’t design for the worst-case model, either. If 90% of your models run in 32GB of RAM, don’t buy a 128GB system for the one run a year that is that big. Suffer a slow solve on that one and use the money to get a faster CPU, a better disk array, or maybe a second box.

Pull back, be an engineer, and just get what you need. Tape robots look cool; blinky lights and flashy plastic case covers look even cooler. Do you really need them? Most of the time the numerical simulation cruncher is locked up in a cold, dark room. Having an intern move data to USB drives once a month may be a more practical solution.

Another aspect of cutting back is dealing with that fear thing. The most common mistake we see is people configuring RAID for redundancy instead of read/write speed. Turn off that redundant writing and stripe across as many drives as you can in parallel: RAID 0. Yes, you may lose a drive. Yes, that means you lose a run. But if that happens once every six months, which is very unlikely, the lost productivity from those lost runs is small compared to the lost productivity of solving every other run on a slow disk array.
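If you want to see what a slow scratch array actually costs you, a crude sequential-write test tells most of the story. Below is a minimal sketch; the /scratch path is a hypothetical mount point, and real solver I/O is messier, but it is enough to compare a RAID 0 stripe against a single drive.

    import os
    import time

    def write_throughput(path, size_gb=4, block_mb=64):
        """Sequentially write size_gb of data to path and return MB/s."""
        block = os.urandom(block_mb * 1024 * 1024)
        blocks = (size_gb * 1024) // block_mb
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually hits the disks
        elapsed = time.perf_counter() - start
        os.remove(path)           # clean up the test file
        return (size_gb * 1024) / elapsed

    # Hypothetical mount point -- point this at each array you want to compare.
    print(f"Scratch array: {write_throughput('/scratch/testfile'):.0f} MB/s")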

Lastly, benchmark. This is obvious but often hard to do right. The key is to find real problems that represent the spectrum of runs you plan on doing. Different runs, even within the same solver, often have different HPC needs, so it is a good idea to understand which are most common and bias your design toward those. Do not benchmark with generic HPC benchmarks; use industry-accepted benchmarks for numerical simulation. Yes, it is an amazing feeling knowing that your new cluster is number 500 on the Top 500 list, but if it is number 5000 on the ANSYS numerical simulation benchmark list, nobody wins.
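When you do benchmark, script it so the same spectrum of jobs can be rerun after every hardware or configuration change. Here is a minimal sketch of such a harness; the model file names and the solver command are hypothetical stand-ins for your own input decks and batch syntax.

    import csv
    import subprocess
    import time

    # Hypothetical input decks representing your real spectrum of work.
    models = ["small_static.inp", "large_modal.inp", "cfd_transient.inp"]

    with open("benchmark_results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "wall_clock_s"])
        for model in models:
            start = time.perf_counter()
            # Placeholder batch invocation; use your solver's real syntax.
            subprocess.run(["solver", "-b", "-i", model], check=True)
            writer.writerow([model, round(time.perf_counter() - start, 1)])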

Fixing the System You Have

Lately we have started tearing down clusters at numerous companies around the US. Of course we would love to sell you new hardware, but at PADT, as mentioned before, we love numerical simulation, and fixing your current system may let you stretch that investment another year or more. As the co-owner of a twenty-year-old company, I feel good about stretching an initial investment that far. When we sic our IT team on extending the life of one of our systems, I can start thinking about and planning for the next $150k investment we will need in a year or more.

Breathing new life into your existing hardware requires almost the same steps as avoiding a bad system in the first place. PADT has sent our team around the country helping companies breathe new life into their existing infrastructure. The steps are the same, but instead of designing a new system, we change the one you have: work with an expert, start cutting stuff out, avoid fear- and “cool factor”-based choices, and verify everything.

Take a look at and understand the output from your solvers; there is a lot of data in there. As an example, here is an article we wrote describing some of those hidden gems within your numerical simulation outputs: http://www.padtinc.com/blog/the-focus/ansys-mechanical-io-bound-cpu-bound
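As a starting point, here is a rough sketch of pulling CPU time and elapsed time out of a solver output file and comparing them. The line patterns are assumptions, so check what your solver actually prints and adjust the regular expressions to match.

    import re

    def cpu_vs_elapsed(output_file):
        """Compare reported CPU time to elapsed (wall clock) time."""
        cpu = elapsed = None
        with open(output_file) as f:
            for line in f:
                m = re.search(r"Total CPU time.*?([\d.]+)", line)
                if m:
                    cpu = float(m.group(1))
                m = re.search(r"Elapsed [Tt]ime.*?([\d.]+)", line)
                if m:
                    elapsed = float(m.group(1))
        if cpu and elapsed:
            # Wall clock far above CPU time suggests the job is waiting
            # on something other than number crunching, often the disks.
            print(f"CPU: {cpu:.0f} s, elapsed: {elapsed:.0f} s")

    cpu_vs_elapsed("model.out")  # hypothetical output file name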

Play with things and see what helps and what hurts. It may be time to bring in an outside expert to look at things with fresh eyes.

Do not be afraid to push back against what IT is suggesting. Unless you are very fortunate, they probably don’t have the same understanding of numerical simulation computing that you do. They care about security and minimizing the cost of maintaining systems; they tend not to be risk takers, and they don’t like non-standard solutions. All of this often results in a system that is configured for IT, and not for fast numerical simulation solves. You may have to bring in senior management to resolve the issue.

PADT is Here to Help

The easiest way to avoid all of this is to simply purchase your HPC hardware from PADT. We know simulation, we know HPC, and we can translate between engineers and IT. This is simply because simulation is what we do, and have done since 1994. We can configure the right system to meet your needs, at the point on the price-performance curve you want. Our CUBE systems also come preloaded and tested with your simulation software, so you don’t have to worry about getting things to work once the hardware shows up.

If you already have a system or are locked in to a provider, we are still here to help. Our system architects can consult over the phone or in person, bringing their expertise to bear on fixing existing systems or spec’ing new ones. In fact, the idea for this article came when our IT manager was reconfiguring a customer’s “name brand” cluster here in Phoenix and got a call from a user in the Midwest who had the exact same problem: lots of expensive hardware and disappointing performance. Both had the wrong hardware for their problems, system bottlenecks, and configuration issues.

Learn more on our HPC Server and Cluster Performance Tuning page, or by contacting us. We would love to help out. It is what we like to do and we are good at it.

Empty High-End Computer Rack, What Should we Fill it With?

[Photo: empty high-end rack]

We have a new rack installed in our compute server room (well, closet really). I wonder what we can fill it with? It looks like it can handle a lot of heat, and a lot of units. We shall see what the week brings.

Picking a Server Rack Frame

Selecting a server rack frame could be the most important part of the design phase. To help you choose a proper fit for your environment, here are eight rack considerations to keep in mind:

  1. What size Rack Cabinet Enclosure do I need?
    Selecting the correct server cabinet size depends on two major factors: the type of equipment that needs rack mounting and the amount of equipment requiring enclosure space. The key to a good server rack buying experience is planning. Ideally, users should tally the total number of rack units currently needed and also keep future expansion in mind, because rack units cannot be added once a server rack is fabricated. If additional rack mount accessories such as environmental monitoring, battery back-up, and/or remote power management are required, extra front and rear cabinet space might be needed to mount rack accessories vertically and horizontally. At rackmountsales.com you can choose racks by size.
  2. What is the significance of Internal Rack Cabinet Enclosure Dimensions?
    Internal dimensions should be used as a guide to gauge the size and amount of equipment one can install in a server rack enclosure. The internal vertical measurement from the tallest point of any side rail to the bottom chassis is the total internal height. Internal depth is measured between the insides of the front and rear doors. Lastly, internal width extends from one side panel to the other.

When assessing rack mount needs, internal dimension measurements should also take into consideration rack equipment and accessories that normally mount internally to the rear or side of the cabinet. Additional space can be added during rack manufacturing to allow for side-, rear-, and front-mounted equipment. This auxiliary compartment space also provides room for ventilation systems, bulky power cords, and cable management.

  3. What is the significance of External Rack Cabinet Enclosure Dimensions?
    Determining the server rack’s location within a data center or co-location facility is often overlooked until the rack enclosure arrives at the dock for delivery. It is crucial to determine whether the final external dimensions of the server rack will fit through doorways and past other obstructions on the way to the intended location. Consider environmental factors such as ceiling height and clearance regulations in your data center or server farm, and be sure to respect the dimensions of stairways and freight elevators if server racks need to be transported through them for final placement.
  4. Will my Rack Cabinet Enclosure fit in the room it’s intended for?
    Server rack weight and height are very important factors to take into account when moving server racks from place to place. Particular server racks can weigh over 300 lbs. and stand over 7 feet tall. Server racks are large items that require considerable effort to move, round corners with, lift up stairs, and fit into tight spaces. Please ensure that enough room has been made and accounted for before rack enclosures are purchased.
  5. Will the Rack Cabinet Enclosure fit through all doors on the way into the destination room?
    All of our server rack enclosures ship fully assembled. There are some removable components, such as doors and side panels, but removing them will not change the external dimensions of the rack frame, which cannot be taken apart. Please consider all product dimensions carefully to ensure server racks meet all clearance requirements.
  6. What is a Rack Unit? What does 40U mean? 44U? 48U? etc.
    A “Rack Unit” or Rack “U” is an EIA standard unit for measuring rack mount equipment. One Rack Unit is equal to 1.75″ in height. To calculate the internal usable space of a rack enclosure, simply multiply the total number of Rack Units by 1.75″. For example, a 44U rack enclosure has 77″ of internal usable space (44 x 1.75).
  7. How do I calculate how many Rack Units I need?
    Many data center managers calculate the rack enclosure height they need by tallying optimal rack unit usage. For example, if future plans call for the addition of twenty 2U servers, users could count on needing a 44U rack enclosure: 20 x 2U = 40U for the servers, plus a 1U patch panel and a 2U UPS back-up battery, totals 43U, leaving one unit spare (see the short sketch after this list). Rear or side vertically mounted power management devices will also have sufficient room to perform their functions.
  8. What is the purpose of a 2-Post Relay Rack?
    A Relay Rack is a 2-post aluminum or steel structure with either EIA standard (round) mounting holes or universal (square) mounting holes. Relay Racks are also known as 2-Post Racks or Open Bay Racks. The vertical hole spacing on Relay Racks is standardized for mounting telco or computer/network equipment, and Relay Racks can also hold cantilever shelving for non-rack-mountable equipment. The open frame construction of the Open Bay design also provides maximum air flow for the entire rack.
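Items 6 and 7 above boil down to simple arithmetic. Here is a short sketch that works through both examples: converting rack units to usable inches, and tallying the units a planned equipment list will consume.

    RACK_UNIT_IN = 1.75   # one EIA rack unit, in inches

    def usable_height_inches(rack_units):
        """Internal usable height of an enclosure, in inches."""
        return rack_units * RACK_UNIT_IN

    def units_needed(equipment_u):
        """Sum the U heights of everything you plan to mount."""
        return sum(equipment_u)

    print(usable_height_inches(44))          # 44U -> 77.0 inches
    # Twenty 2U servers, a 1U patch panel, and a 2U UPS:
    print(units_needed([2] * 20 + [1, 2]))   # 43U -> fits in a 44U rack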

Rail Comparison

[Images: Universal square-hole mounting rails vs. EIA standard 10-32 tapped mounting rails]

Rack Mount Rails: We can manufacture server rack enclosures with either Universal Mounting Rails (square holes fitted with cage nuts) or EIA Standard rails (10-32 tapped holes). All our cabinet rails are high-quality heavy-gauge steel (1/8″ thick or more) with an electroplate finish for maximum protection.

Universal Mounting Rails: Universal rails will support 19″ EIA-width rack mount and networking equipment and almost all server equipment. Cage nuts and screws are needed to mount equipment to universal mounting rails.

EIA Standard Mounting Rails: Standard Mounting Rails support 19″ EIA-width rack mount and networking equipment and some server manufacturers’ rack mounting equipment. Please be aware that not all rack mountable equipment will match up with the EIA 10-32 hole pattern on Standard Rails. Standard mounting rails do not allow the use of cage nuts.

Which Mounting Rails do I need? It depends on the equipment you will be mounting in the rack enclosure. Most rack mount and networking equipment such as hubs, routers, and patch panels will conform to EIA standard hole spacing. However, some server and rack accessory manufacturers supply rack mounting kits to assist with attaching equipment to Universal Rails. In that case, the proper cage nuts and screws will most likely be needed to mount that equipment in one of our server cabinets.

Mounting Hardware

There are currently 3 types of Mounting Hardware used with our server cabinet rails:

10-32 Tapped Cage Nuts and Screws – American Version – Commonly used in all rack mount applications, including music, video, broadcast, data, and more. The “10” refers to the drill size for a tapped (threaded) hole. The outside diameter of a 10-32 screw is 0.19″; it is smaller than a 12-24 screw. This screw type has 32 threads per inch.

12-24 Tapped Screws – American Version – The “12” refers to the drill size required for a tapped (threaded) hole (a #12 drill is 0.189″). The outside diameter of a 12-24 screw is 0.216″; it is larger than a 10-32 screw. This screw type has 24 threads per inch.

M6 Tapped Cage Nuts and Screws – Metric Version – Metric thread size of 6 millimeters. Typical thread size for European rack applications. Also used in Compaq racks and Euro racks sold here in the US. Larger than both 10-32 and 12-24.
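The three standards are easy to mix up, so here is the same information condensed into a small lookup table, with the M6 diameter converted to inches (6 mm is about 0.236″) for comparison.

    # Outside diameters in inches, from the descriptions above.
    screws = {"10-32": 0.190, "12-24": 0.216, "M6": 0.236}

    for name, od in sorted(screws.items(), key=lambda kv: kv[1]):
        print(f'{name}: {od:.3f}" outside diameter')
    # Prints smallest to largest: 10-32, 12-24, M6, matching the text above.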

What mounting hardware do I need?
It depends on the mounting rails of the rack enclosure or relay rack you will be ordering. Most 4-post server racks, cabinets, and LAN enclosures use either cage nuts and screws for square-hole Universal Mounting Rails or 10-32 tapped screws for round-hole EIA Standard Mounting Rails. Please be aware that almost all 2-post open relay racks use 10-32 tapped screws (round-hole mounting rails).