Using Bright CM to Manage a Linux Cluster

COD_Cluster-Bright-1

What goes into managing a Linux HPC (High Performance Computing) cluster?

There is an endless list of software, tools and configurations that are required or recommended for efficiently managing a shared HPC cluster environment.

A shared HPC cluster typically has many layers that together deliver a usable environment, one that doesn’t depend on users coordinating closely or on system administrators being superheroes of late-night patching and just-in-time recovery.

bright-f1

Figure 1. Typical layers of a shared HPC cluster.

For each layer in the diagram above there are numerous open-source and commercial software tools to choose from. The catch is that it’s not just a matter of picking one. System administrators have to weigh user requirements, compatibility tweaks, and ease of implementation and use to come up with the perfect recipe (much like carrot cake). Once the choices have been made, users and system administrators have to learn, train on, and start utilizing these tools.

HPC @ PADT Inc.

At PADT Inc. we have several Linux-based HPC clusters that are in high demand. Our clusters are based on CUBE High Value Performance Computing (HVPC) systems and are designed to optimize the performance of numerical simulation software. We were facing several challenges that are common to building and maintaining HPC clusters, mainly in the areas of security, imaging and deployment, resource management, monitoring, and maintenance.

To solve these challenges there is an endless list of software tools and packages, both open-source and commercial. Each one comes with its own steep learning curve and mounting time to test and implement.

Enter – Bright Computing

After testing several tools we came across Bright Cluster Manager (Bright CM) from Bright Computing. Bright CM eliminates the need for system administrators to manually install and configure the most common HPC cluster components. On top of that, it provides the majority of common HPC software packages, tools, and libraries in its default software image.

A Bright CM cluster installation starts off with an extremely useful installation wizard that asks all of the right questions while giving the user full control to customize the installation. With a notepad, a couple of hours, and a basic understanding of HPC clusters, you are ready to install your applications.

bright-f2

Figure 2. Installation Wizard

An all-knowing dashboard helps system admins manage and monitor the cluster(s); if you prefer the CLI, the CM shell provides the same functionality from the command line. From the dashboard, system admins can manage multiple clusters down to the finest detail.

bright-f3

Figure 3. Cluster Management Interface.

An extensive cluster monitoring interface allows system admins, users, and key stakeholders to generate and view detailed reports about the different cluster components.

bright-f4

Figure 4. Cluster Monitoring Interface.

Bright CM has proven to be a valuable tool in managing and optimizing our HPC environment. For further information and a demo of Bright Cluster Manager please contact sales@padtinc.com.

From Piles to Power – My First PADT PC Build

Welcome to the PADT IT Department – now build your own PC

[Editor’s Note: Ahmed has been here a lot longer than 2 weeks, but we have been keeping him busy so he is just now finding the time to publish this.]

I have been working for PADT for a little over two weeks now. After taking the ceremonial office tour that left a fine white powder all over my shoes (it’s a PADT Inc. special treat), I was taken to meet my team: David Mastel, the IT commander-in-chief at PADT Inc. (my boss, for short), and Sam Goff, the all-knowing systems administrator.

I was shown to a cubicle that reminded me of the shady computer “recycling” outfits you’d see on a news report highlighting the vast amounts of abandoned hardware; except there were no CRT (tube) screens or little children working as slave labor.
aa1

Sacred Tradition

This tradition started with Sam, then Manny, and now it was my turn to take this rite of passage. As part of the PADT IT department, I am required by sacred tradition to build my own desktop with my bare hands – and then I was handed a screwdriver.

My background is mixed and diverse, but the places I have worked mostly had one thing in common: we depended on pre-built servers, systems, and packages. Branded machines carry an implied promise of reliability, support, and superiority over custom-built machines.

What most people don’t know about branded machines is that they carry two pretty heavy tariffs:

  1. First, you are paying upfront for the support structure, development, R&D, and supply chains required to pump out thousands of machines.
  2. Second, because these large companies are trying to maximize their margins, they will look for a proprietary, cost-effective configuration that will:
    1. Most probably fail or become obsolete as close as possible to the 3-year “expected” life-span of computers.
    2. Lock users into buying any subsequent upgrade or spare part from them.

Long story short, the last time I fully built a desktop computer was back in college, when a 2GB hard disk was a technological breakthrough and we could only imagine how many MP3s we could store on it.

The Build

There were two computer cases on the ground: one resembled a 1990 Mercury Sable, at best tolerable even when it was new, and the other looked more like a 1990 BMW 325ci, a little old but carrying a heritage and the potential to be great once again.
aa2

So, with my obvious choice for a case, I began to collect parts from the different bins and drawers, and I was immediately shocked at how “organized” this room really was. I picked up the following:

There are a few things I would have chosen differently, but they were either not available at the time of the build or ridiculous for a work desktop:

  • Replaced 2 drives with SSD disks to hold OS and applications
  • Explored a more powerful Nvidia card (not really required but desired)

So after a couple of hours of fidgeting and checking manuals this is what the build looks like.
aa3

(The case above was the first prototype ANSYS Numerical Simulation workstation in 2010. It has a special place in David’s Heart)

Now to the Good STUFF! – Benchmarking the rebuilt CUBE prototype

ANSYS R15.0.7 FEA Benchmarks

Below are the results for the v15sp5 benchmark running distributed parallel on 4-Cores.
aa4

ANSYS R15.0.7 CFD Benchmarks

Below are the results for the aircraft_2m benchmark using parallel processing on 4-Cores.
aa5

This machine is a really cool sleeper computer that is more than capable of handling whatever I throw at it.

The only thing that worries me is that when Sam handed me the case to get started, David was trying – but failing – to hide a smile, which makes me feel there is something obviously wrong with my first build that I failed to catch. I guess I will just wait and see.

Home Grown HPC on CUBE Systems

compute-cluster-1

A Little Project Background

Recently I’ve been working on developing a computer vision system for a long-standing customer. We are developing software that enables them to use computers to “see” where a particular object is in space and accurately determine its precise location with respect to the camera. From that information, they can do all kinds of useful things.

In order to figure out where something is in 3D space from a 2D image you have to perform what is commonly referred to as pose estimation. It’s a highly interesting problem by itself, but it’s not something I want to focus on in detail here. If you are interested in obtaining more information, you can Google pose estimation or PnP problems. There are, however, a couple of aspects of that problem that do pertain to this blog article. First, pose estimation is typically a nonlinear, iterative process. (Not all algorithms are iterative, but the ones I’m using are.) Second, like any algorithm, its output is dependent upon its input; namely, the accuracy of its pose estimate is dependent upon the accuracy of the upstream image processing techniques. Whatever error happens upstream of this algorithm typically gets magnified as the algorithm processes the input.

The Problem I Wish to Solve

You might be wondering where we are going with HPC given all this talk about computer vision. It’s true that computer vision, especially image processing, is computationally intensive, but I’m not going to focus on that aspect. The problem I wanted to solve was this: Is there a particular kind of pattern that I can use as a target for the vision system such that the pose estimation is less sensitive to the input noise? In order to quantify “less sensitive” I needed to do some statistics. Statistics is almost-math, but just a hair shy. You can translate that statement as: My brain neither likes nor speaks statistics… (The probability of me not understanding statistical jargon is statistically significant. I took a p-test in a cup to figure that out…) At any rate, one thing that ALL statistics requires is a data set. A big data set. Making big data sets sounds like an HPC problem, and hence it was time to roll my own HPC.

The Toolbox and the Solution

My problem reduced down to a classic Monte Carlo type simulation. This particular type of problem maps very nicely onto a parallel processing paradigm known as Map-Reduce. The concept is shown below:
matt-hpc-1

The idea is pretty simple. You break the problem into chunks and you “Map” those chunks onto available processors. The processors do some work and then you “Reduce” the solution from each chunk into a single answer. This algorithm is recursive. That is, any single “Chunk” can itself become a new blue “Problem” that can be subdivided. As you can see, you can get explosive parallelism.
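
To make the pattern concrete, here is a tiny sketch in TypeScript. This is my own toy example, not code from the project described here: each chunk is “mapped” to a partial result, and the partials are then “reduced” into a single answer.

```typescript
// A toy map/reduce: "map" each chunk to a partial result, then "reduce"
// the partials into one answer (here, a sum of squares).
const chunks: number[][] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];

// Map step: each chunk is processed independently (and could be handed
// to a separate processor).
const partials = chunks.map(chunk => chunk.reduce((sum, x) => sum + x * x, 0));

// Reduce step: combine the partial results into a single answer.
const answer = partials.reduce((sum, p) => sum + p, 0);

console.log(answer); // 285
```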

Now, there are tools that exist for this kind of thing. Hadoop is one such tool. I’m sure it is vastly superior to what I ended up using and implementing. However, I didn’t want to invest at this time in learning a specialized tool for this particular problem. I wanted to investigate a lower level tool on which this type of solution can be built. The tool I chose was node.js (www.nodejs.org).

I’m finding Node to be an awesome tool for hooking computers together in new and novel ways. It acts kind of like the post office in that you can send letters and messages and get letters and messages all while going about your normal day. It handles all of the coordinating and transporting. It basically sends out a helpful postman who taps you on the shoulder and says, “Hey, here’s a letter”. You are expected to do something (quickly) and maybe send back a letter to the original sender or someone else. More specifically, Node turns everything that a computer can do into a “tap on the shoulder”, or an event. A request like “Hey, go read this file for me” turns into “OK. I’m happy to do that. I tell you what, I’ll tap you on the shoulder when I’m done. No need to wait for me.” So, now, instead of twiddling your thumbs while the computer spins up the hard drive, finds the file and reads it, you get to go do something else you need to do. As you can imagine, this is a really awesome way of doing things when stuff like network latency, hard drives spinning and little child processes that are doing useful work are all chewing up valuable time. Time that you could be using getting someone else started on some useful work. Also, like all children, these little helpful child processes that are doing real work never seem to take the same time to do the same task twice. However, simply being notified when they are done allows the coordinator to move on to other children. Think of a teacher in a classroom. Everyone is doing work, but not at the same pace. Imagine if the teacher could only focus on one child at a time until that child fully finished. Nothing would ever get done!
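
Here is a minimal sketch of that event-driven map/reduce idea using Node’s built-in cluster module. This is not the actual simulator: the Monte Carlo work is replaced with a toy estimate of pi, and the chunk size is an arbitrary assumption. The shape is the same, though: workers “map” chunks of samples, and the coordinator “reduces” the partial counts as each tap on the shoulder (message) arrives.

```typescript
// map_reduce_mc.ts - event-driven map/reduce with Node's cluster module.
// Toy example: workers each sample random points to estimate pi; the
// primary reduces the partial counts as messages arrive, never blocking
// on any one worker.
import cluster from "node:cluster";
import { cpus } from "node:os";

const SAMPLES_PER_WORKER = 1_000_000; // assumed chunk size

if (cluster.isPrimary) {
  const workerCount = cpus().length;
  let finished = 0;
  let insideTotal = 0;

  for (let i = 0; i < workerCount; i++) {
    const worker = cluster.fork();
    // The "tap on the shoulder": react to each worker's result as it arrives.
    worker.on("message", (msg: { inside: number }) => {
      insideTotal += msg.inside; // reduce step
      finished += 1;
      worker.kill();
      if (finished === workerCount) {
        const pi = (4 * insideTotal) / (workerCount * SAMPLES_PER_WORKER);
        console.log(`Estimate of pi from ${workerCount} workers: ${pi}`);
      }
    });
  }
} else {
  // Map step: this worker processes its chunk, reports back, and exits.
  let inside = 0;
  for (let i = 0; i < SAMPLES_PER_WORKER; i++) {
    const x = Math.random();
    const y = Math.random();
    if (x * x + y * y <= 1) inside += 1;
  }
  process.send?.({ inside });
}
```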

Here is a little graph of our internal cluster at PADT cranking away on my Monte Carlo simulation.
matt-hpc-2

It’s probably impossible to read the axes, but that’s 1200+ cores cranking away. Now, here is the real kicker. All of the machines have an instance of node running on them, but one machine is coordinating the whole thing. The CPU on the master node barely nudges above idle. That is, this computer can manage and distribute all this work by barely lifting a finger.

Conclusion

There are a couple of things I want to draw your attention to as I wrap this up.

  1. CUBE systems aren’t only useful for CAE simulation HPC! They can be used for a wide range of HPC needs.
  2. PADT has a great deal of experience in software development both within the CAE ecosystem and outside of this ecosystem. This is one of the more enjoyable aspects of my job in particular.
  3. Learning new things is a blast and can have benefit in other aspects of life. Thinking about how to structure a problem as a series of events rather than a sequential series of steps has been very enlightening. In more ways than one, it is also why this blog article exists. My Monte Carlo simulator is running right now. I’m waiting on it to finish. My natural tendency is to busy wait. That is, spin brain cycles watching the CPU graph or the status counter tick down. However, in the time I’ve taken to write this article, my simulator has proceeded in parallel to my effort by eight steps. Each step represents generating and reducing a sample of 500,000,000 pose estimates! That is over 4 billion pose estimates in a little under an hour. I’ve managed to write 1,167 words…

CUBE_Logo_150w

Slide Rules, Logarithms, and Compute Servers

If any of you have been to PADT’s headquarters in Tempe, Arizona, you probably noticed the giant slide rule in the middle of our building.  You can see a portion of it in the picture below, at the top of our Training, Mentoring, and Support group picture.

PADT-TechSupport-Team-Prop

This thing is huge, over 6 feet (2 m) from side to side, in its un-extended position hanging on the wall.

In theory a gigantic slide rule could provide more accuracy, but our trophy, a Keuffel & Esser Model 68 1929 (copyrighted 1947 and 1961), was intended for teaching purposes in classrooms.  Most engineers had essentially pocket-sized or belt-holder-sized slide rules, also known as slip sticks.

For the real thing, here is a picture of a slide rule used by Eric Miller’s father Col. BT Miller while at West Point from 1955 to 1958 as well as during his Master’s program in 1964.

Burt-Miller-SlideRule-D2

Why do we care about the slide rule today?  Have you ever seen World War II aircraft, submarines, or aircraft carriers?  These were designed using slide rules and/or logarithms.  The early space program?  Slide rules were used then too.  Some phenomenal engineering was accomplished by our predecessors using these devices.  Back then the numerical operations were just a tool to utilize their engineering knowledge.  Now I think we have a tendency to focus on the numerical due to its ease of use and impressive presentation, while perhaps forgetting or at least de-emphasizing the underlying engineering.  That’s not to say that we don’t have great engineers out there; rather it’s a call to energize you all to remember, consider, and utilize your engineering knowledge as you use your simulation tools.

By contrast, here is a picture of PADT’s brand new server room, with cluster machines being put together in the big cabinets.  Hundreds of cores.

servers

What about the giant slide rule?

My father found a thick book at an estate sale a few months ago.  There are a lot of retirees living in Arizona, so estate sales are quite common and popular.  They occur at a life stage when due to death or the need for assisted living, folks are no longer able to live in their home so the contents are sold, clearing out the home and generating some cash for the family.  This particular estate sale was for a retired engineer.  The book caught my father’s eye, first because it was quite thick and second because the title was, Mechanical Engineers’ Handbook.  Figuring it was a bargain for the amazing price of $1.00, he bought it for me.  This book is better known as Marks’ Handbook.  It’s apparently still in publication, at least as late as the 11th Edition in 2006, but the particular edition my father bought for me is the Fifth Edition from 1951.

marks-handbook

Although the slide rule is mostly a curiosity to us today, in 1951 it was state of the art for numerical computation.  While Marks’ has a couple of paragraphs on “Computing Machines”, described as “electrically driven mechanical desk calculators such as the Marchant, Monroe, or Friden”, the slide rule was what I will call the calculator of choice by mechanical engineers at the beginning of the 2nd half of the 20th century. 

As an aside, these mechanical calculators performed multiplication and division, using what I will describe as incredibly complex mechanisms.  Here is a link to a Wikipedia article on the Marchant Calculator:  http://en.wikipedia.org/wiki/Marchant_Calculator

Marks’ Handbook devotes about 3 pages to the operation of the slide rule, starting with simple multiplication and division and then discussing various methods of utilization and various types of slide rules.  It starts off by stating, “The slide rule is an indispensable aid in all problems in multiplication, division, proportion, squares, square roots, etc., in which a limited degree of accuracy is sufficient.” 

The slide rule operates using logarithms.  If you’re not familiar with using logarithms then you are probably younger than me, since I recall learning them in math class in probably junior high in the late 1970’s.  The slide rule uses common logarithms, meaning the log of a number is the exponent needed to raise a base of 10 to get that number.  For example, the common log of 100 is 2.  The common log table in the 1951 edition of Marks shows us that the common log of 4.44 is 0.6474.  For the sake of completeness, the ‘other’ logarithm is the natural log, meaning the base is the irrational number e, approximated as 2.718.

log-table

Getting back to common (base of 10) logs, the math magic is that logarithms allow for shortcuts in fairly complex computations.  For example, log (ab) = log a + log b.  That means if we want to multiply two fairly complicated numbers, we can simply look up the common log of each and add them together.  Similarly, log (a/b) = log a – log b. 

Here is an example, which I will keep simple.  Let’s say we want to multiply 0.0512 by 0.624.  On a calculator this is simple, but what if you are stranded on a remote island and all you have is a log table?  Knowing the equations above, you can look up the log of 0.0512 which is 0.7093-2 and the log of 0.624 which is 0.7952-1.  We now add:
adding_numbers

Writing that sum as a positive decimal minus an integer is important to being able to look up the antilogarithm or number whose log is 0.5045 – 2.
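
Written out, the addition the original figure showed (using the values from the text) is:

```latex
\begin{aligned}
\log(0.0512 \times 0.624) &= \log 0.0512 + \log 0.624 \\
                          &= (0.7093 - 2) + (0.7952 - 1) \\
                          &= 1.5045 - 3 \\
                          &= 0.5045 - 2
\end{aligned}
```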

Looking up the number whose log is 0.5045 we get 3.195, using a little bit of linear interpolation.  The “-2” tells us to shift the decimal point to the left twice, meaning our answer is 0.03195.  Thus, using a little addition, some table lookup, a bit of in-the-head interpolation, and some knowledge on how to shift decimal points, we fairly easily arrive at the product of two three-digit fractional numbers.  Now you are free to look for more coconuts on the island.  Or maybe get back to a hatch in the ground where you need to type in the numbers 4, 8, 14, 16, 23, and 42 every 108 minutes.  Oops, I’m really becoming Lost here…

sliderule-book

Getting back to the slide rule, one way to think of it is as a graphical representation of the log tables.  In its most basic form, the slide rule consists of two logarithmic scales.  By lining up the scales, the log values can be added or subtracted.  For example, if we want to multiply something simple, like 4 x 6, we simply look from left to right on the scale on the ‘fixed’ portion of the slide rule to get to 4, then slide the moving portion of the slide so that its 1 lines up with the 4 found above on the fixed portion.  We then move left to right on the movable scale to find the 6.  The point where the 6 on the movable scale lines up on the fixed portion is our solution, 24.  What we’ve really done is add the log of 4 to the log of 6 and then find the antilog of that result, which is 24.  Now that we’ve found 24, we’re not Lost…
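
In log terms, the 4 x 6 example is just this addition (log values rounded to four decimal places, my own arithmetic):

```latex
\log 4 + \log 6 \approx 0.6021 + 0.7782 = 1.3802 = \log 24
```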

We don’t intend to give detailed instructions on all phases of performing calculations using slide rules here, but hopefully you get the basics of how it is done.  There are plenty of online resources as well as slide rule apps that provide all sorts of details.  Besides multiplication and division, slide rules can be used for squares and square roots.  There are (were) specialty slide rules for other purposes.  Note that with additional knowledge and skill in visually interpolating on a log scale, up to 3 or even 4 significant digits can be determined depending on the size of the slide rule.

ted-slide-rule

The author, attempting to prove that 4 x 6 is indeed 24

After having studied the Marks’ section on slide rules, experimenting with a slide rule app on an iPad as well as the PADT behemoth on the wall, I conclude that it was a very elegant method for calculating numbers much more quickly than could be done by traditional pencil and paper.  It’s much faster to add and subtract than to do complicated multiplication and long division.  My high school physics teacher actually spent a day or two teaching us how to use slide rules back in the early 1980’s.  By then they had been made functionally obsolete by scientific calculators, so looking back it was perhaps more about nostalgia than the math needed.  It does help me to appreciate the accomplishments made in science and engineering before the advent of numerical computing.

The preparation of this article has made me wonder what the guys and gals who used these tools proficiently back in the 1930’s, 40’s, and 50’s would think if they had access to the kind of compute power we have available today.  It also makes me wonder what people will think of our current tools 50 or 60 years from now.  When I first started in simulation over 25 years ago, it would have seemed quite a stretch to be able to solve simultaneously on hundreds if not thousands of compute cores as can be done today.  Back then we were happy to get time on the one number cruncher we had that was dedicated to ANSYS simulation.

Incidentally, this article was inspired by my colleague David Mastel’s recent blog entry on numerical simulation and how PADT is helping our customers take compute servers and work stations to the next level:

http://www.padtinc.com/blog/the-focus/launch-leave-forget-hpc-and-it-ansys

If you are ever in our PADT headquarters building in Tempe, don’t forget to look for the giant slide rule.  Now you will know its original purpose.

“Launch, Leave & Forget” – A Personal Journey of an IT Manager into Numerical Simulation HPC and how PADT is taking Compute Servers & Workstations to the Next Level

fire_and_forget_missile

Launch, Leave & Forget was a phrase first introduced in the 1960’s. The US Government was developing missiles that, once fired, no longer needed to be guided or watched by the pilot. Before that, the fighter pilot directed the missile mostly by line of sight and calculated guesswork toward a target in the distance. The pilot would often be shot down or would break away too early from guiding the launch vehicle. Hope and guesswork are not something we strive for when lives are at stake.

So I say all of that to say this: as it relates to virtual prototyping, Launch, Leave & Forget for numerical simulation is something I have been striving for at PADT, Inc. – striving internally and for our 1,800 unique customers that really need our help. We are passionate about empowering our customers to become comfortable, feel free to be creative, and be able to step back and let it go! Many of us have a unique and rewarding opportunity to work with customers from the point of design, or even from the first phone call, onward through virtual prototyping, product development, and rapid manufacturing, and lastly on to something you can bring into the physical world: a physical prototype that has already gone through 5,000 numerical simulations. Unlike the engineers in the 1960’s, who would maybe get one, two, or three shots at a working prototype, I think it is amazing that a company can go through 5,000 different virtual prototypes before finally introducing one into the real world.

cluster

At PADT I continue to look and search for new ways to Launch, Leave & Forget. One passion of mine is computers. I first started using a computer when I was nine years old, and I was programming in BASIC, creating complex little FOR-NEXT statements, before I was in seventh grade. Let’s fast forward to my arrival at PADT in 2005. I was amazed at the small company I had arrived at; creativity and innovation were bouncing off the ceiling. I had never seen anything like it! I was humbled on more than one occasion, as most of the ANSYS CFD analysts knew as much about computers as I did! No, not the menial IT tasks like networking, domain user creation, and backups. What the PADT CFD/FEA analysts communicated, sometimes loudly, was that their computers were slow! Humbled again, I would retort, “But you have the fastest machine in the building. How could it be slow?! Your machine is faster than our web server – in fact, this was going to be our new web server.” In 2005, at a stalemate, we would walk away both wondering why the solve was so slow. Over the years I would observe numerous issues. I remember spending hours using this ANSYS numerical simulation software. It was new to me and it was complicated! I would often knock on an analyst’s door and ask if they had a couple of minutes to show me how to run a simulation. For some of the programs I would have to ask two or three times – ANSYS FEA, ANSYS CFX, FLUENT, on and on – often using a round-robin approach because I didn’t want to inconvenience the ANSYS analysts. Then, probably some early morning around 3am, the various ANSYS programs and the hardware all clicked with me. I was off and running ANSYS benchmarks on my own! Freedom!! Now I could experiment with the hardware configs. Armed with the ANSYS FLUENT and ANSYS FEA benchmark suites, I wanted to make the numerical simulations run as fast or faster than anyone imagined possible. I wanted to please these ANSYS guys. Why? Because I had never met anyone like them, and I wanted to give them the power they deserved.

“What is the secret sauce or recipe for creating an effective numerical simulation?”

This is a question I hear often, whether on a conference call with a new customer or internally from our own ANSYS CFD and FEA analysts. “David, all I really care about is: when I click ‘Calculate Run’ within ANSYS, when is it going to complete?” Or, “How can we make this solver run faster?”

The secret sauce recipe? Have we signed an NDA yet? Just kidding. I have had the unique opportunity to observe not just ANSYS but other CFD/FEA codes running on compute hardware, learning better ways of optimizing hardware and software along the way. Here is how a typical process for architecting hardware for use with ANSYS software goes.

Getting Involved Early

When the sales guys let me, I am often involved at the very beginning of a qualifying lead opportunity. My favorite time to talk to a customer is when a new customer calls me directly at the office.

Nothing but the facts sir!

I have years’ worth of benchmarking data. Do your users have any benchmarking data? Quickly have them run one of the ANSYS standard benchmarks. Just one benchmark can reveal to you a wealth of information about their current IT infrastructure.

Get your IT team onboard early!

This is a huge challenge! In general here are a few roadblocks that smart IT people have in place:

IT MANAGER RULES 101

1) No! talking to sales people
2) No! talking to sales people on the phone
3) No! talking to sales people via email
4) No! talking to sales people at seminars
5) If your boss emails or calls and says, “Please talk to this sales person @vulture & hawk,” wait about a week. Then, if the boss emails back and asks, “Did you talk to this salesperson yet?”, pick up the phone and call the sales rep @vulture & hawk.

it1

What is this, a joke? Nope, most IT groups operate like this. Many are understaffed and in constant fix-it mode. Most say and think something like this: “I would appreciate it if you sat in my chair for one day. My phone constantly rings, so I don’t pick it up, or I let it go to voicemail (until the voicemail box fills up). Email constantly swoops in, so it goes to junk mail. Seminar invites and meet-and-greets keep coming in – nope, won’t go. Ultimately, I know you are going to try to sell me something.”

Who have they been talking to? Do they even know what ANSYS is? I have been humbled over the years when it comes to hardware. I seriously believed the fastest web server at that moment in time would make a fast numerical simulation server.

If I can get on the phone with another IT manager, 90% of the time the walls come down and we can talk our own language. What do they say to me? Well, I have had IT managers and directors tell me they would never buy a compute cluster or compute workstation from me: “Our policy states that we only buy from big boy pants Computer, Inc.,” or “mom & pop shop #343,” or – the best one – “the owner’s nephew, he builds computers on the side.” They stand behind their walls of policy and circumstance. But at the end of the call, they are normally asking us to send them a quote.

repair

So, now what?

Well, do you really know your software? Have you spent hours running different hardware configurations of the same workstation, observing the reads and writes of an eight-drive 600GB SAS3 15k RPM 12Gbps RAID 0 configuration? Is 3 drives for the OS and 5 drives for the solving array the best configuration for the hardware and software? Huh? What’s that?? Oh boy…

Help! My New HPC System is not High Performance!

It is an all too common feeling, that sinking feeling that leads to the phrase “Oh Crap” being muttered under your breath. You just spent almost a year getting management to pay for a new compute workstation, server or cluster. You did the ROI and showed an eight-month payback because of how much faster your team’s runs will be. But now you have the benchmark data on real models, and they are not good. “Oh Crap”

Although this is a frequent problem, and the root causes are often the same, the solutions can vary. In this posting I will try to share with you what our IT and ANSYS technical support staff here at PADT have learned.

Hopefully this article can help you avoid or circumvent current or future pitfalls when you order an HPC system. PADT loves numerical simulation; we have been doing this for twenty years now. We enjoy helping, and if you are stuck in this situation, let us know.

Wall Clock Time

It is very easy to get excited about clock speeds, bus bandwidth, and disk access latency. But if you are solving large FEA or CFD models you really only care about one thing. Wall Clock Time. We cannot tell you how many times we have worked with customers, hardware vendors, and sometimes developers, who get all wrapped up in the optimization of one little aspect of the solving process. The problem with this is that high performance computing is about working in a system, and the system is only as good as its weakest link.

We see people spend thousands on disk drives and high-speed disk controllers only to discover that their solves are CPU bound, so adding better disk drives makes no difference. We also see people blow their budget on the very best CPUs but not invest in enough memory to solve their problems in-core. This often happens because they look at one small portion of the benchmark data and maximize that measurement, when that measurement often doesn’t really matter.

The fundamental thing that you need to keep in mind while ordering or fixing an HPC system for numerical simulation is this: all that matters is how long it takes in the real world from when you click “Solve” till your job is finished. I bring this up first because it is so fundamental, and so often ignored.
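
If you want to capture that number consistently, time the whole run from launch to exit. A minimal sketch (the solver command and its arguments below are hypothetical placeholders, not a real ANSYS invocation):

```typescript
// time_solve.ts - wrap a batch solve and report end-to-end wall clock time.
// "my_solver" and its arguments are hypothetical; substitute your real
// batch solve command.
import { spawn } from "node:child_process";

const start = process.hrtime.bigint();
const solver = spawn("my_solver", ["-i", "model.dat"], { stdio: "inherit" });

solver.on("exit", (code) => {
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  console.log(`Solve exited (code ${code}) after ${seconds.toFixed(1)} s of wall clock time`);
});
```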

The Causes

As mentioned above, an HPC server or cluster is a system made up of hardware, software, and people who support it. And it is only as good as its weakest link. The key to designing or fixing your HPC system is to look at it as a system, find the weakest links, and improve their performance. (OK, who remembers the “Weakest Link” lady? You know you kind of miss her…)

In our experience we have found that the cause for most poorly performing systems can be grouped into one of these categories:

  • Unbalanced System for the Problems Being Solved:

    One of the components in the system cannot keep up with the others. This can be hardware or software. More often than not it is the hardware being used. Let’s take a quick look at several gotchas in a misconfigured numerical simulation machine.

  • I/O is a Bottleneck
    Number crunching, memory, and storage are only as fast as the devices that transfer data between them.
  • Configured Wrong

    Out of simple lack of experience the wrong hardware is used, the OS settings are wrong, or drivers are not configured properly.

  • Unnecessary Stuff Added out of Fear

    People tend to overcompensate out of fear that something bad might happen, so they burden a system with software and redundant hardware to avoid a one in a hundred chance of failure, and slow down the other ninety-nine runs in the process.

Avoiding an Expensive Medium Performance Computing (MPC) System

The key to avoiding these situations is to work with an expert who knows the hardware AND the software, or become that expert yourself. That starts with reading the ANSYS documentation, which is fairly complete and detailed.

Often times your hardware provider will present themselves as the expert, and their heart may be in the right place. But only a handful of hardware providers really understand HPC for simulation. Most simply try and sell you the “best” configuration you can afford and don’t understand the causes of poor performance listed above. More often than we like, they sell a system that is great for databases, web serving, or virtual machines. That is not what you need.

A true numerical simulation hardware or software expert should ask you questions about the following; if they don’t, you should move on:

  • What solver will you use the most?
  • What is more important, cost or performance? Or better: Where do you want to be on the cost vs. performance curve?
  • How much scratch space do you need during a solve? How much storage do you need for the files you keep from a run?
  • How will you be accessing the systems, sending data back and forth, and managing your runs?

Another good test of an expert is if you have both FEA and CFD needs, they should not recommend a single system for you. You may be constrained by budget, but an expert should know the difference between the two solvers vis-à-vis HPC and design separate solutions for each.

If they push virtual machines on you, show them the door.

The next thing you should do is step back and take the advice of writing instructors: start cutting stuff. (I know, if you have read my blog posts for a while, you know I’m not practicing what I preach. But you should see the first drafts…) You really don’t need huge, costly UPSs, the expensive archival backup system, or some arctic-chill bubbling liquid nitrogen cooling system. Think of it as a race car: if it doesn’t make the car go faster or keep the driver safe, you don’t need it.

A hard but important step in cutting things down to the basics is to try to let go of the emotional aspect. It is in many ways like picking out a car: the truth is, the red paint job doesn’t make it go any faster, and the fancy tailpipes look good but also don’t help. Don’t design for the worst-case model either. If 90% of your models run in 32GB of RAM, don’t buy a 128GB system for that one run a year that is that big. Suffer a slow solve on that one and use the money to get a faster CPU, a better disk array, or maybe a second box.

Pull back, be an engineer, and just get what you need. Tape robots look cool, blinky lights and flashy plastic case covers even cooler. Do you really need that? Most of the time the numerical simulation cruncher is locked up in a cold, dark room. Having an intern move data to USB drives once a month may be a more practical solution.

Another aspect of cutting back is dealing with that fear thing. The most common mistake we see is people using RAID configurations for storing redundant data rather than for read/write speed. Turn off that redundant writing and stripe across as many drives as you can in parallel: RAID 0. Yes, you may lose a drive. Yes, that means you lose a run. But if that happens even once every six months, which is unlikely, the lost productivity from those lost runs is small compared to the lost productivity of solving all the other runs on a slow disk array.

Intel-AMD-Flunet-Part2-Chart2

Lastly, benchmark. This is obvious but often hard to do right. The key is to find real problems that represent a spectrum of the runs you plan on doing. Often different runs, even within the same solver, have different HPC needs. It is a good idea to understand which are more common and bias your design toward those. Do not benchmark with generic HPC benchmarks; use industry-accepted benchmarks for numerical simulation. Yes, it’s an amazing feeling knowing that your new cluster is number 500 on the Top 500 list. However, if it is number 5000 on the ANSYS numerical simulation benchmark list, nobody wins.

Fixing the System You Have

As of late, we have started tearing down clusters at numerous companies around the US. Of course we would love to sell you new hardware; however, at PADT, as mentioned before, we love numerical simulation. Fixing your current system may allow you to stretch that investment another year or more. As a co-owner of a twenty-year-old company, this makes me feel good about that initial investment. When we sic our IT team on extending the life of one of our own systems, I start thinking about and planning for the next $150k investment we will need to make in a year or more.

Breathing new life into your existing hardware basically requires the same steps as avoiding a bad system in the first place. PADT has sent our team around the country helping companies breathe new life into their existing infrastructure. The steps they use are the same, but instead of designing things we change them. Work with an expert, start cutting stuff out, breathe new life into the aging hardware, avoid fear- and “cool factor”-based choices, and verify everything.

Take a look at and understand the output from your solvers; there is a lot of data in there. As an example, here is an article we wrote describing some of those hidden gems within your numerical simulation outputs: http://www.padtinc.com/blog/the-focus/ansys-mechanical-io-bound-cpu-bound

Play with things, see what helps and what hurts. It may be time to bring in an outside expert to look at things with fresh eyes.

Do not be afraid to push back against what IT is suggesting; unless you are very fortunate, they probably don’t have the same understanding of numerical simulation computing that you do. They care about security and minimizing the cost of maintaining systems. They may not be risk takers, and they don’t like non-standard solutions. All of this can result in a system that is configured for IT, not for fast numerical simulation solves. You may have to bring in senior management to resolve this issue.

PADT is Here to Help

Cube_Logo_Target1

The easiest way to avoid all of this is to simply purchase your HPC hardware from PADT.  We know simulation, we know HPC, and we can translate between engineers and IT.  This is simply because simulation is what we do, and have done since 1994.  We can configure the right system to meet your needs, at that point on the price-performance curve you want.  Our CUBE systems also come preloaded and tested with your simulation software, so you don’t have to worry about getting things to work once the hardware shows up.

If you already have a system or are locked in to a provider, we are still here to help.  Our system architects can consult over the phone or in person, bringing their expertise to the table on fixing existing systems or spec’ing new ones.  In fact, the idea for this article came when our IT manager was reconfiguring a customer’s “name brand” cluster here in Phoenix, and he got a call from a user in the Midwest that had the exact same problem.  Lots of expensive hardware, and disappointing performance. They both had the wrong hardware for their problems, system bottlenecks, and configuration issues.

Learn more on our HPC Server and Cluster Performance Tuning page, or by contacting us. We would love to help out. It is what we like to do and we are good at it.

Video Tips: Parallel Part by Part Meshing in ANSYS v15.0

This video shows you a new capability in ANSYS v15.0 that allows multiple parts to be simultaneously meshed on multiple CPU cores…with no additional licenses required!

Exercising Parallel Meshing in ANSYS Mechanical R15

[The following is an email that Manoj sent the tech support staff at PADT. I thought it was perfect for a The Focus posting, so here it is – Eric]

First of all I found out a way to get Mesh Generation time (if no one knew about this).  In ANSYS Mechanical go to Tools->Options->Miscellaneous and turn “Report Performance Diagnostics in Messages” to Yes.  It will give you “Elapsed Time for Last Mesh Generation” in the Messages window.

clip_image001

clip_image002

Next I benchmarked Parallel Part by Part meshing on a helicopter rotor hub with 502 bodies.  The mesh settings produced a mesh of 560,026 elements and 1.23 million nodes.

clip_image004

I did Parallel Part by Part meshing on this model with 1, 2, 4, 6, and 8 cores; here are the results (the numbers in parentheses are the speedup relative to 1 core – a quick check of the arithmetic follows the list).

Can I say “I LIKE IT!”

1 core: 172 seconds (1.0)
2 core:  89 seconds (1.9)
4 core:  52 seconds (3.3)
6 core:  38 seconds (4.5)
8 core:  33 seconds (5.2)
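
The speedups come straight from dividing the 1-core time by the n-core time; at 8 cores that works out to roughly 65% parallel efficiency (my own arithmetic from the numbers above):

```latex
S_n = \frac{T_1}{T_n}, \qquad
S_8 = \frac{172\,\text{s}}{33\,\text{s}} \approx 5.2, \qquad
E_8 = \frac{S_8}{8} \approx 0.65
```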

image

Of course this is a small mesh, so as the number of cores goes up, the benefits go down.  I will be doing some testing on models that take a lot longer to mesh, but I wanted to start simple. I’ll make a video summarizing that study, showing how to set up the whole process and the results.

If you are curious, Manoj is running on a PADT CUBE server. As configured it would cost around $19k. You could drop a few thousand off the price if you changed up the cards or went with CPUs that were not so leading edge.

Here are the SPECs:

CUBE HVPC w8i-KGPU
CUBE Mid-Tower Chassis – 26db quiet edition
Two XEON e5-2637 v2 (4 cores each, 3.5GHz)
128 GB of DDR3-1600 ECC Reg RAM
NVIDIA QUADRO K5000
NVIDIA TESLA K20x
7.1 HD Audio (to really rock your webinars…)
SMC LSI 2208 RAID Card – 6Gbps
OS Drive: 2 x 256GB SSD 6gbps
Solver Array: 3 x 600GB SAS2 15k RPM 6Gbps

CUBE Systems are Now Part of the ANSYS, Inc. HPC Partner Program

CUBE-HVPC-Logo-wide_thumb.png

The relationship between ANSYS, Inc. and PADT is a long one that runs deep. And that relationship just got stronger with PADT joining the HPC Partner Program with our line of CUBE compute systems specifically designed for simulation. The partner program was set up by ANSYS, Inc. to work:

CUBE-HVPC-512-core-closeup3-1000h_thumb.jpg

“… with leaders in high-performance computing (HPC) to ensure that the engineering simulation software is optimized on the latest computing platforms. In addition, HPC partners work with ANSYS to develop specific guidelines and recommended hardware and system configurations. This helps customers to navigate the rapidly changing HPC landscape and acquire the optimum infrastructure for running ANSYS software. This mutual commitment means that ANSYS customers get outstanding value from their overall HPC investment.”

CUBE-HVPC-512-core-stairs-1000h_thumb.jpg

PADT is very excited to be part of this program and to contribute to the ANSYS/HPC community as much as we can.  Users know they can count on PADT’s strong technical expertise with ANSYS Mechanical, ANSYS Mechanical APDL, ANSYS FLUENT, ANSYS CFX, ANSYS Maxwell, ANSYS HFSS, and other ANSYS, Inc. products, a true differentiator when compared with other hardware providers.

Customers around the US have fallen in love with their CUBE workstations, servers, mini-clusters, and clusters finding them to be the right mix between price and performance. CUBE systems let users carry out larger simulations, with greater accuracy, in less time, at a lower cost than name-brand solutions. This leaves you more cash to buy more hardware or software.

Assembled by PADT’s IT staff, CUBE computing systems are delivered with the customer’s simulation software loaded and tested. We configure each system specifically for simulation, making choices based upon PADT’s extensive experience using similar systems for the same kind of work. We do not add things a simulation user does not need, and focus on the hardware and setup that delivers performance.

CUBE-HVPC-512-core-front1-1000h_thumb.jpg

Is it time for you to upgrade your systems?  Is it time for you to “step out of the box, and step into a CUBE?”  Download a brochure of typical systems to see how much your money can actually buy, visit the website, or contact us.  Our experts will spend time with you to understand your needs, your budget, and what your true goals are for HPC. Then we will design your custom system to meet those needs.

 

This May Be the Fastest ANSYS Mechanical Workstation we Have Built So Far

The Build Up

It’s 6:30 am and a dark shadow looms in Eric’s doorway. I wait until Eric finishes his Monday morning company updates. “Eric, check this out: on the CUBE HVPC w16i-k20x we built for our latest customer, ANSYS Mechanical scaled to 16 cores on our test run.” Eric’s left eyebrow rises slightly. I know I have him now; I have his full and complete attention.

Why is this huge news?

This is why: Eric knows, and probably many of you reading this also know, that solving differential equations distributed and in parallel, along with using graphics processing units, makes our hearts skip a beat. The finite element method used for solving these equations is CPU intensive and I/O intensive. This is headline-news-type stuff to us geek types. We love scratching our way along the compute processing power grid to squeeze every bit of performance out of our hardware!

Oh, and yes, a lower time to solve is better! No GPUs were harmed in these tests; only one NVIDIA TESLA K20X GPU was used during the test.

Take a Deep Breath and Start from the Beginning:

I have been gathering and hoarding years’ worth of ANSYS Mechanical benchmark data. Why? Not sure, really – after all, I am a wannabe ANSYS analyst. However, it wasn’t until a couple of weeks ago that I woke up to the why again. My CUBE HVPC team sold a dual-socket INTEL Ivy Bridge based workstation to a customer out of Washington state. Once we got the order, our Supermicro reseller’s phone was bouncing off the desk. After some back and forth, the parts arrived directly from Supermicro, California. Yes, designed in the U.S.A.  And they show up in one big box:

clip_image002[4]

Normal is as Normal Does

As per normal is as normal does, I ran the series of ANSYS benchmarks – you know, the type of benchmarks that perform coupled-physics simulations and solve really huge matrices. I ran ANSYS v14sp-5, the ANSYS FLUENT benchmarks, and some benchmarks for this customer, the types of runs they want to use the new machine for. Then I was talking these benchmark results over with Eric. He thought that now was a perfect time to release the flood of benchmark data – well, some, a smidge, of the benchmark data. I do admit the data gets overwhelming, so I have tried to trim the charts and graphs down to the bare minimum. So what makes this recipe for the fastest ANSYS Mechanical workstation so special? What is truly exciting enough to tip me over in my overstuffed black leather chair?

The Fastest Ever? Yup we have been Changed Forever

Not only is it the fastest ANSYS Mechanical workstation running on CUBE HVPC hardware, it uses two 22-nanometer INTEL CPUs. Additionally, this is the first time we have had an INTEL dual-socket based workstation continue to gain faster times all the way up to its maximum core count when solving in ANSYS Mechanical APDL.

Previously, the fastest time was on the CUBE HVPC w16i-GPU workstation listed below, and it peaked at 14 cores.

Unfortunately, we only had time to gather two runs, at 14 and 16 cores, on the new machine before we shipped it off. But you can see how fast it was in this table.  It was close to the previous system at 14 cores, but blew past it at 16, whereas the older system actually got clogged up and slowed down:

Run Time (sec)

Cores Used    Config B    Config C    Config D
14            129.1       95.1        91.7
16            130.5       99.0        83.5

And here are the results as a bar graph for all the runs with this benchmark:

CUBE-Benchmark-ANSYS-2013_11_01

We can’t wait to build one of these with more than one motherboard, maybe a 32-core system with InfiniBand connecting the two. That should allow some very fast run times on some very, very large problems.

ANSYS V14sp-5 ANSYS R14 Benchmark Details

  • Elements: SOLID187, CONTA174, TARGE170
  • Nodes: 715,008
  • Materials: linear elastic
  • Nonlinearities: standard contact
  • Loading: rotational velocity
  • Other: coupling, symmetric matrix, sparse solver
  • Total DOF: 2.123 million
  • ANSYS 14.5.7

Here are the details and the data of the March 8, 2013 workstation:

Configuration C = CUBE HVPC w16i-GPU

  • CPU: 2x INTEL XEON e5-2690 (2.9GHz 8 core)
  • GPU: NVIDIA TESLA K20 Companion Processor
  • GRAPHICS: NVIDIA QUADRO K5000
  • RAM: 128GB DDR3 1600Mhz ECC
  • HD RAID Controller: SMC LSI 2208 6Gbps
  • HDD (OS and apps): 160GB SATA III SSD
  • HDD (working directory): 6x 600GB SAS2 15k RPM 6Gbps
  • OS: Windows 7 Professional 64-bit, Linux 64-bit
  • Other: ANSYS R14.0.8 / ANSYS R14.5

Here are the details from the new, November 1, 2013 workstation:

Configuration D = CUBE HVPC w16i-k20x

  • CPU: 2x INTEL XEON e5-2687W V2 (3.4GHz)
  • GPU: NVIDIA TESLA K20X Companion Processor
  • GRAPHICS: NVIDIA QUADRO K4000
  • RAM: 128GB DDR3 1600Mhz ECC
  • HDD (OS and apps): 4 x 240GB Enterprise Class Samsung SSD 6Gbps
  • HD RAID CONTROLLER: SMC LSI 2208 6Gbps
  • OS: Windows 7 Professional 64-bit, Linux 64-bit
  • Other: ANSYS 14.5.7

You can view the output from the run on the newer box (Configuration D) here:

Here is a picture of the Configuration D machine with the info on its guts:

clip_image006[4]clip_image008[4]

What is Inside that Chip:

The one (or two) CPU that rules them all: http://ark.intel.com/products/76161/

Intel® Xeon® Processor E5-2687W v2

  • Status: Launched
  • Launch Date: Q3’13
  • Processor Number: E5-2687WV2
  • # of Cores: 8
  • # of Threads: 16
  • Clock Speed: 3.4 GHz
  • Max Turbo Frequency: 4 GHz
  • Cache:  25 MB
  • Intel® QPI Speed:  8 GT/s
  • # of QPI Links:  2
  • Instruction Set:  64-bit
  • Instruction Set Extension:  Intel® AVX
  • Embedded Options Available:  No
  • Lithography:  22 nm
  • Scalability:  2S Only
  • Max TDP:  150 W
  • VID Voltage Range:  0.65–1.30V
  • Recommended Customer Price:  BOX : $2112.00, TRAY: $2108.00

The GPUs that just keep getting better and better:

Features                                            TESLA C2075    TESLA K20X      TESLA K20
Number and Type of GPU                              FERMI          Kepler GK110    Kepler GK110
Peak double precision floating point performance    515 Gflops     1.31 Tflops     1.17 Tflops
Peak single precision floating point performance    1.03 Tflops    3.95 Tflops     3.52 Tflops
Memory Bandwidth (ECC off)                          144 GB/sec     250 GB/sec      208 GB/sec
Memory Size (GDDR5)                                 6GB            6GB             5GB
CUDA Cores                                          448            2688            2496

clip_image012[4]

Ready to Try one Out?

If you are as impressed as we are, then it is time for you to try out this next iteration of the Intel chip, configured for simulation by PADT, on your problems.  There is no reason for you to be using a CAD box or a bloated web server as your HPC workstation for running ANSYS Mechanical and solving in ANSYS Mechanical APDL.  Give us a call, our team will take the time to understand the types of problems you run, the IT environment you run in, and custom configure the right system for you:

http://www.padtinc.com/products/hardware/cube-hvpc,
email: garrett.smith@padtinc.com,
or call 480.813.4884

Part 2: ANSYS FLUENT Performance Comparison: AMD Opteron vs. Intel XEON

AMD Opteron 6308, INTEL XEON e5-2690 & INTEL XEON e5-2667V2 Comparison using ANSYS FLUENT 14.5.7

Note: The information and data contained in this article were compiled and generated on September 12, 2013 by PADT, Inc. on CUBE HVPC hardware using FLUENT 14.5.7.  Please remember that hardware and software change with new releases, and you should always try to run your own benchmarks, on your own typical problems, to understand how performance will impact you.

By David Mastel

Due to the response to the original article on this subject,  I thought it would be good to do a quick follow-up using one of our latest CUBE HVPC builds. Again, the ANSYS Fluent standard benchmarks were used in garnering the stats on this dual socket INTEL XEON e5-2667V2 configuration.

CUBE HVPC Test configurations (Same as in last comparison)

  • Server 1: CUBE HVPC c16
  • CPU: 4, AMD Opteron 6308 @ 3.5GHz (Quad Core)
  • Memory: 256GB (32x8G) DDR3-1600 ECC Reg. RAM (1600MHz)
  • Hardware RAID Controller: Supermicro AOC-S2208L-H8iR 6Gbps, PCI-e x 8 Gen3
  • Hard Drives: Supermicro HDD-A0600-HUS156060VLS60 – Hitachi 600G SAS2.0 15K RPM 3.5″
  •  OS: Linux 64-bit / Kernel 2.6.32-358.18.1.el6.x86_64
  • App: ANSYS FLUENT 14.5.7
  • MPI: Platform MPI
  • HCA: SMC AOC-UIBQ-M2 – QDR Infiniband
    • The IB card was installed; however, solves were run distributed locally
  • Switch: MELLANOX IS5023 Non-Blocking 18-port switch

Server 2: CUBE HVPC c16i (Intel server from last comparison)

  • CPU: 2, INTEL XEON e5-2690 @ 2.9GHz (Octa Core)
  • Memory: 128GB (16x8G) DDR3-1600 ECC Reg. RAM (1600MHz)
  • RAID Controller: Supermicro AOC-S2208L-H8iR 6Gbps, PCI-e x 8 Gen3
  • Hard Drives: Supermicro HDD-A0600-HUS156060VLS60 – Hitachi 600G SAS2.0 15K RPM 3.5″
  • OS: Windows 7 Professional 64-bit
  • App: ANSYS FLUENT 14.5.7
  • MPI: Platform MPI

Server 3: CUBE HVPC c16ivy (New “Ivy” based Intel server)

  • CPU: 2, INTEL XEON e5-2667V2 @ 3.3GHz (Octa Core)
  • Memory: 128GB (16x8G) DDR3-1600 ECC Reg. RAM (1600MHz)
  • RAID Controller: Supermicro AOC-S2208L-H8iR 6Gbps, PCI-e x 8 Gen3
  • Hard Drives: Supermicro HDD-A0600-HUS156060VLS60 – Hitachi 600G SAS2.0 15K RPM 3.5″
  • OS: Linux 64-bit / Kernel 2.6.32-358.18.1.el6.x86_64
  • App: ANSYS FLUENT 14.5.7
  • MPI: Platform MPI
  • HCA: SMC – QDR Infiniband
    • The IB card was installed; however, solves were run distributed locally

ANSYS FLUENT 14.5.7 Performance using the ANSYS FLUENT Benchmark suite provided by ANSYS, Inc.

ANSYS Fluent Benchmark page link: http://www.ansys.com/Support/Platform+Support/Benchmarks+Overview/ANSYS+Fluent+Benchmarks

Release ANSYS FLUENT 14.5.7 Test Cases
(20 Iterations each)

  • Reacting Flow with Eddy Dissipation Model (eddy_417k)
  • Single-stage Turbomachinery Flow (turbo_500k)
  • External Flow Over an Aircraft Wing (aircraft_2m)
  • External Flow Over a Passenger Sedan (sedan_4m)
  • External Flow Over a Truck Body with a Polyhedral Mesh (truck_poly_14m)
  • External Flow Over a Truck Body 14m (truck_14m)

Here are the results for all three machines, total and average time:

Intel-AMD-Flunet-Part2-Chart1

Intel-AMD-Flunet-Part2-Chart2

 

Summary: Are you sure? Part 2

So I didn’t have to have the “Are you sure?” conversation with Eric this time, and I didn’t bother triple-checking the results, because indeed the Ivy Bridge-EP Socket 2011 is one fast CPU! That, combined with a 0.022 micron (22 nm) manufacturing process, means the data speaks for itself. For example, let’s re-dig into the data for the External Flow Over a Truck Body with a Polyhedral Mesh (truck_poly_14m) benchmark and see what we find:

Intel-AMD-FLUENT-Details

Intel-AMD-FLUENT-summary

Current Pricing of INTEL® and AMD® CPUs

Here is up-to-the-minute pricing for each CPU, taken from the NewEgg and IngramMicro websites on October 4, 2013.

Note that AMD’s price per CPU went up while the INTEL XEON e5-2690 came down. Again, these prices are based on the pricing of the day, October 4, 2013. A quick price-per-core sketch follows the list below.

AMD Opteron 6308 Abu Dhabi 3.5GHz 4MB L2 Cache 16MB L3 Cache Socket G34 115W Quad-Core Server Processor OS6308WKT4GHKWOF

  •  $501 x 4 = $2004.00

Intel Xeon E5-2690 2.90 GHz Processor – Socket LGA-2011, L2 Cache 2MB, L3 Cache 20 MB, 8 GT/s QPI

  • $1986.48 x 2 = $3972.96

Intel Xeon E5-2667V2 3.3 GHz Processor – Socket LGA-2011, L2 Cache 2MB, L3 Cache 25 MB, 8 GT/s QPI,

  • $1933.88 x 2 = $3867.76
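For a quick sanity check on value, here is a minimal sketch that turns the list prices above into price per core. The core counts come from the server configurations earlier in the article (quad-core Opterons in a 4-socket box, octa-core Xeons in dual-socket boxes).

```python
# cpu_price_per_core.py - price-per-core comparison using the October 4, 2013 list prices above.
CPUS = {
    # name: (price per CPU in USD, CPUs per server, cores per CPU)
    "AMD Opteron 6308":     (501.00,  4, 4),
    "INTEL XEON e5-2690":   (1986.48, 2, 8),
    "INTEL XEON e5-2667V2": (1933.88, 2, 8),
}

for name, (price, sockets, cores) in CPUS.items():
    total_price = price * sockets
    total_cores = sockets * cores
    print(f"{name:22s} ${total_price:9,.2f} for {total_cores} cores "
          f"-> ${total_price / total_cores:,.2f} per core")
```

All three configurations land at 16 cores per box, which is what makes the benchmark times above a fairly direct cost/performance comparison.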

REFERENCES:
http://www.ingrammicro.com
http://www.newegg.com

INTEL XEON e5-2667V2
http://ark.intel.com/products/75273/Intel-Xeon-Processor-E5-2667-v2-25M-Cache-3_30-GHz

INTEL XEON e5-2690
http://ark.intel.com/products/64596/

AMD Opteron 6308
http://www.amd.com/us/Documents/Opteron_6300_QRG.pdf

http://en.wikipedia.org/wiki/Double-precision_floating-point_format

http://en.wikipedia.org/wiki/Central_processing_unit#Integer_range

http://en.wikipedia.org/wiki/Floating_point

STEP OUT OF THE BOX, STEP INTO A CUBE

PADT offers a line of high performance computing (HPC) systems specifically designed for CFD and FEA number crunching aimed at a balance between cost and performance. We call this concept High Value Performance Computing, or HVPC. These systems have allowed PADT and our customers to carry out larger simulations, with greater accuracy, in less time, at a lower cost than name-brand solutions. This leaves you more cash to buy more hardware or software.

Let CUBE HVPC by PADT, Inc. quote you a configuration today!

 

Columbia: PADT’s Killer Kilo-Core CUBE Cluster is Online

In the back of PADT’s product development lab is a closet.  Yesterday afternoon PADT’s tireless IT team crammed themselves into the back of that closet and powered up our new cluster, bringing 1,104 connected cores online.  It sounded like a jet taking off when we submitted a test FLUENT solve across all the cores.  Music to our ears.

We have recently become slammed with benchmarks for ANSYS and CUBE customers as well as our normal load of services work, so we decided it was time to pull the trigger and double the size of our cluster while adding a storage node.  And of course, we needed it yesterday.  So the IT team rolled up their sleeves, configured a design, ordered hardware, built it up, tested it all, and got it on line, in less than two weeks.  This was while they did their normal IT work and dealt with a steady stream of CUBE sales inquiries.  But it was a labor of love. We have all dreamed about breaking that thousand core barrier on one system, and this was our chance to make it happen.

If you need more horsepower and are looking for a solution that hits that sweet spot between cost and performance, visit our CUBE page at www.cube-hvpc.com and learn more about our workstations, servers, and clusters.  Our team (after they get a little rest) will be more than happy to work with you to configure the right system for your real world needs.

Now that the sales plug is done, let’s take a look at the stats on this bad boy:

  • Name: Columbia (after the class of battlestars in Battlestar Galactica)
  • Brand: CUBE High Value Performance Compute Cluster, by PADT
  • Nodes: 18 (17 compute, 1 storage/control node, 4 CPUs per node)
  • Cores: 1,104
  • CPUs: AMD Opteron, 4 x 6308 3.5 GHz, 32 x 6278 2.4 GHz, 36 x 6380 2.5 GHz
  • Interconnect: 18-port MELLANOX 4X QDR Infiniband switch
  • Memory: 4.864 Terabytes
  • Solve Disk: 43.5 TB RAID 0
  • Storage Disk: 64 TB RAID 50

Here are some pictures of the build and the final product:

a
A huge delivery from our supplier, Supermicro, started the process. This was the first pallet.

b
The build included installing the largest power strip any of us had ever seen.

c
Building a cluster consists of doing the same thing, over and over and over again.

f
We took over PADT’s clean room because it turns out you need a lot of space to build something this big.

g
It is fun to get the chance to build the machine you always wanted to build.

h
2AM Selfie: Still going strong!

d
Almost there. After blowing a breaker, we needed to wait for some more power to be routed to the closet.

e
Up and running!
Ratchet and Clank providing cooling air containment.

David, Sam, and Manny deserve a big shout-out for doing such a great job getting this thing up and running so fast!

When I logged on to my first computer, a TRS-80, in my high-school computer lab, I never, ever thought I would be running on a machine this powerful.  And I would have told people they were crazy if they said a machine with this much throughput would cost less than $300,000.  It is a good time to be a simulation user!

Now I just need to find a bigger closet for when we double the size again…

CUBE-HVPC-Logo-wide

Why do my ANSYS jobs take days and weeks to finish? Well it depends…

Real World Lessons on How to Minimize Run Time for ANSYS HPC

Recently I had a VP of Engineering start a phone conversation with me that went something like this. “Well Dave, you see this is how it is. We just spent a truckload of money on a 256 core cluster and our solve times are slower now than with our previous 128 core cluster. What the *&(( is going on here?!”

I imagine many of us have heard similar stories or received the same questions from our co-workers, CEOs, and directors. I immediately had my concerns, and I thought carefully about what I should say next. I recalled a conversation I had with one of my college professors. He told me that when I find myself stepping into gray areas, a good start to the conversation is to say, “Well, it depends…”

Guess what, that is exactly what I said. I said “Well, it depends…” and then explained two fundamental pillars of computer science that have plagued most of us since computers were created: your cluster may be CPU bound (compute bound) or I/O bound. He told me that they had paid a premium for the best CPUs on the market, along with some other details about the HPC cluster. After gathering a few more of those details, my hunch was that his HPC cluster was actually I/O bound.

I/O Bound

Basically, this means that your cluster’s $2,000 worth of CPUs are stalled out and sitting idle, waiting for new data to process so they can move on. I also briefly explained that his HPC cluster might instead be compute bound, but I quickly reassured him that this was unlikely, maybe a 10% possibility. I knew the specifications of the CPUs in this HPC cluster, and the likelihood that they were the cause of his slow ANSYS run times was low on my radar. These were literally the latest and greatest CPUs ever to hit this planet (at that moment in time). So, let me step back a minute to refresh our memories on what it means when a system is compute bound.

Compute Bound

Being compute bound means that the HPC cluster’s CPUs are sitting at 99 or 100% utilization for long periods of time. When this happens, very bad things begin to happen to your HPC cluster: CPU requests to peripherals are delayed or lost to the ether, and the HPC cluster may become unresponsive or even lock up.
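Here is a rough way to tell which camp a compute node falls into. This is a minimal sketch, assuming the third-party psutil package is installed (not something any particular cluster necessarily has): sustained high iowait with idle cores points at an I/O bound job, while pegged user time points at a compute bound one.

```python
# io_or_cpu_bound.py - crude check of whether a node looks I/O bound or compute bound.
# Assumes the third-party 'psutil' package is installed (pip install psutil).
import psutil

def sample(interval=5.0):
    """Sample CPU time percentages over `interval` seconds and print a rough verdict."""
    t = psutil.cpu_times_percent(interval=interval)  # averaged over the interval
    iowait = getattr(t, "iowait", 0.0)               # iowait is only reported on Linux
    busy = t.user + t.system
    print(f"user+system: {busy:5.1f}%   iowait: {iowait:5.1f}%   idle: {t.idle:5.1f}%")
    if iowait > 20.0 and busy < 50.0:
        print("Looks I/O bound: cores are waiting on disk or interconnect.")
    elif busy > 95.0:
        print("Looks compute bound: cores are pegged.")
    else:
        print("No obvious bottleneck in this sample; keep watching.")

if __name__ == "__main__":
    sample()
```

Run it on a node during a solve; the thresholds are arbitrary starting points, not hard rules.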

All I could hear was silence on the other end. “Dave, I get it, I understand. Please find the problem and fix our HPC cluster for us.” I happily agreed to help out! I concluded our phone conversation by asking that he send me the specific details, down to the nuts and bolts of the hardware, along with the operating system and software installed and used on the 256 core HPC cluster.

What NOT to do when configuring an ANSYS Distributed HPC cluster.

Seeking that perfect balance!

After a quick NDA signing, an exchange of a few dollars, and a sprinkle of the other legal things that lawyers get excited about, I set out to discover the cause. After reviewing the information provided to me, I almost immediately saw three concerns:

To interconnect what?

Let Merriam-Webster describe it:

Definition of INTERCONNECT

transitive verb
: to connect with one another
intransitive verb
: to be or become mutually connected

— in·ter·con·nec·tion noun
— in·ter·con·nec·tiv·i·ty noun

1. The systems are interconnected with a series of wires.
2. The lessons are designed to show students how the two subjects interconnect
3.  A series of interconnecting stories

First Known Use of INTERCONNECT: 1865

Concern Numero Uno!!! Interconnect me

The company’s 256 core HPC cluster did have a second, dedicated GigE interconnect. However, Distributed ANSYS is highly bandwidth and latency bound, often requiring more bandwidth than a dedicated NIC (Network Interface Card) can provide. Yes, the dedicated second GigE card was much better than trying to push all of the network traffic, including the CPU interconnect, through a single NIC. I did have a few of the customer’s MAPDL output files to take a peek at, and after reviewing them it became fairly clear that interconnect communication speed between the sixteen 16-core servers in the cluster was not adequate. The master Message Passing Interface (MPI) process that Distributed ANSYS uses requires high bandwidth and low latency to scale properly to the other processes. Theoretically, the data bandwidth between cores solving locally on one machine will be higher than the bandwidth traveling across the various interconnect methods (see below). ANSYS, Inc. recommends Infiniband for CPU interconnect traffic, and here is one reason why: see how the theoretical data limits increase going from Gigabit Ethernet up to FDR Infiniband.

Theoretical per-lane bandwidth limits (a quick aggregation sketch follows this list):

  • Gigabit Ethernet (GigE): ~128MB/s
  • Single Data Rate (SDR): ~328 MB/s
  • Double Data Rate (DDR): ~640 MB/s
  • Quad Data Rate (QDR): ~1,280 MB/s
  • Fourteen Data Rate (FDR): ~1,800 MB/s
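To see where the 4X numbers quoted below come from, here is a minimal arithmetic sketch using the approximate per-lane values from the list above. These are theoretical limits, not measurements.

```python
# infiniband_aggregate.py - aggregate theoretical per-lane throughput for common IB rates.
# Per-lane figures are the approximate theoretical values quoted in the list above.
PER_LANE_MB_S = {
    "GigE": 128,   # Gigabit Ethernet, single link, shown for comparison
    "SDR": 328,
    "DDR": 640,
    "QDR": 1280,
    "FDR": 1800,
}

def aggregate(rate: str, lanes: int = 4) -> int:
    """Return the rough aggregate throughput in MB/s for `lanes` lanes of `rate`."""
    return PER_LANE_MB_S[rate] * lanes

if __name__ == "__main__":
    # A 4X QDR switch and HCA, like the MELLANOX IS5023 setup described below:
    print(f"4X QDR ~ {aggregate('QDR', 4):,} MB/s")   # 5,120 MB/s
    print(f"4X FDR ~ {aggregate('FDR', 4):,} MB/s")   # 7,200 MB/s
```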

GEEK CRED: A few years ago companies such as MELLANOX started aggregating Infiniband lanes; the typical aggregation is 4X or even 12X. So, for example, the 4X QDR Infiniband switch and cards that I use at PADT and recommended to this customer have 4 x 10Gbit/s, or roughly 5,120 MB/s, of throughput! Here is a quick video that I made of a MELLANOX IS5023 18-port 4X QDR full bi-directional switch in action:

This is how you do it with a CUBE HVPC! Here is the MAPDL output from our CUBE HVPC w16i-GPU workstation running the ANSYS industry benchmark V14sp-5. I wanted to show the communication speeds between the master MPI process and the other solver processes to see just how fast the solvers can communicate. With a peak communication speed of 9593 MB/s, this CUBE HVPC workstation rocks!
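If you want a rough measurement of MPI point-to-point bandwidth on your own interconnect, a generic ping-pong test along the lines of the sketch below can help. This is not the ANSYS benchmark itself; it assumes the third-party mpi4py and numpy packages, and you would run one rank per node (for example, mpirun -np 2 with a host list) so the message actually crosses the interconnect.

```python
# mpi_pingpong.py - crude MPI point-to-point bandwidth test between rank 0 and rank 1.
# Run with something like: mpirun -np 2 python mpi_pingpong.py
# Assumes the third-party 'mpi4py' and 'numpy' packages are installed.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NBYTES = 64 * 1024 * 1024          # 64 MB message
REPS = 20
buf = np.zeros(NBYTES, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # rank 0 sends, then waits for the echo
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)  # rank 1 echoes the message back
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    mb_moved = 2 * REPS * NBYTES / (1024 * 1024)   # each rep moves the buffer there and back
    print(f"Average point-to-point bandwidth: {mb_moved / elapsed:,.0f} MB/s")
```

Numbers over GigE versus 4X QDR Infiniband will make the table above very concrete.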

  • Chassis Profile: 4U standard depth, rackmountable
  • CPU: 1 x dual-socket motherboard
  • Chipset: INTEL 602 chipset
  • Processors: 2 x INTEL e5-2690 @ 2.9GHz
  • Cores: 2 x 8
  • Memory: 128GB DDR3-1600 ECC Reg RAM
  • OS Drives: 2 x 2.5″ SATA III 256GB SSD, RAID 0
  • DATA/HOME Hard Disk Drives: 4 x 3.5″ SAS2 600GB 15k RPM, RAID 0
  • SAS RAID (onboard, optional): RAID 0 (OS RAID)
  • SAS RAID (RAID card, optional): LSI 2208 (DATA volume RAID)
  • Networking (onboard): Dual GigE (Intel i350)
  • Video (onboard): NVIDIA QUADRO K5000
  • GPU (optional): NVIDIA TESLA K2000
  • Operating System: Windows 7 Professional 64-bit
  • Optional Installed Software: ANSYS 14.5 Release

Stats for CUBE HVPC Model Number: w16i-KGPU

Learn more about this and other CUBE HVPC systems here.

Concern #2: Using RAID 5 Array for Solving Disk Volume

The hard drives used for I/O during a solve, the solving volume, were configured in a RAID 5 array. The sample data below shows the write speed of a similar RAID 5 array; these are speeds better suited to your long-term storage volume, not your solving/working directory (a crude write-test sketch follows the numbers).

LSI 2008 controller with HITACHI ULTRASTAR 15K600 drives
  • Qty / Type / Size / RAID: 8 x 3.5″ SAS2 15k 600GB, RAID 5
  • Test #: p1
  • Min Read: 204 MB/s
  • Max Read: 395 MB/s
  • Avg Read: N/A
  • Min Write: 106 MB/s
  • Max Write: 243.5 MB/s
  • Avg Write: N/A
  • Access Time: N/A
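If you want a rough sense of what your own solving volume can sustain, a crude sequential-write test like the sketch below will get you in the ballpark. This is a generic illustration, not the benchmark tool used for the numbers above; fsync is used so the OS page cache does not flatter the result, and the path is a hypothetical scratch location.

```python
# scratch_write_test.py - crude sequential write throughput test for a solve/scratch volume.
# Usage (hypothetical path): python scratch_write_test.py /scratch/testfile.bin
import os
import sys
import time

def write_test(path, total_mb=4096, block_mb=16):
    """Write `total_mb` MB to `path` in `block_mb` MB blocks and report MB/s."""
    block = os.urandom(block_mb * 1024 * 1024)    # incompressible data
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                      # force data to the disks, not just the page cache
    elapsed = time.perf_counter() - start
    os.remove(path)
    print(f"Sequential write: {total_mb / elapsed:,.0f} MB/s over {total_mb} MB")

if __name__ == "__main__":
    write_test(sys.argv[1])
```

Run it against the solving volume and the storage volume and compare; a RAID 5 solving volume will usually show the gap described above.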

Concern #3: Using RAID 1 for Operating System

The hard drive array for the OS was configured as RAID 1. For a number-crunching server, RAID 1 is not necessary; if you absolutely must have RAID 1, please spend the extra money and go to a RAID 10 configuration.

I really don’t want to get into the seemingly infinite details of hard drive speeds and latency, or begin to explain whether you should use an onboard RAID controller, a dedicated RAID controller, or a software RAID configuration handled within the OS. There is so much information available on the web that a person gets overloaded. When it comes to Distributed ANSYS, think fast hard drives and fast RAID controllers. Start researching your hard drives and RAID controllers using the list provided below, again only as a suggestion. I have listed the drives in order based on a very scientific and nerdy method: if I saw a pile of hard drives, which drive would I reach for first?

  1. I prefer SEAGATE SAVVIO or HITACHI enterprise-class drives: SAS2 6Gbit/s 3.5″ 15,000 RPM spindle drives (best bang for your dollar of space, with more read & write heads than a 2.5″ spindle drive).
  2. I prefer Micron or INTEL enterprise-class SSDs: SATA III 6Gbit/s Solid State Drives (SSD sizes have increased, but you will need several of them for an effective solving array, and they still are not cheap).
  3. I prefer SEAGATE SAVVIO 2.5″ enterprise-class spindle drives: SAS2 6Gbit/s 2.5″ 15,000 RPM (good when you need a small form factor, speed, and additional storage; 2.5″ drives do not have as many read & write heads as 3.5″ drives, but they shine when I need to slam 4 or 8 drives into a tight location).
    Right now, the SEAGATE SAVVIO 2.5″ is the way to go!  Here is a link to a data sheet.
    Another similar option is the HITACHI ULTRASTAR 15K600.  Its spec sheet is here.
  4. SATA II 3Gbit/s 3.5″ 7,200 RPM spindle drives are also a good option.  I prefer Western Digital RE4 1TB or 2TB drives. Their spec sheet is here.

LSI 2108 RAID Controller and Hard Drive data/details:

image

How would a CUBE HVPC system from PADT, Inc. balance out this configuration, and how much would it cost?

I quoted the items below, installed and out the door (including my travel expenses, etc.), at $30,601.

The company ended up going with their own preferred hardware vendor. Understandable, and one good thing is that we are now on their preferred purchasing supplier list. They were greatly appreciative of my consulting time and indicated that they will request a “must have” quote for a CUBE HVPC system at the next refresh in a year. They want to go over 1,000 cores at that refresh.

I recommended that they install the following into the HPC cluster (note: they already had blazing fast hard drives):

  • 16 – Supermicro AOC-S2208L-H8iR LSI 2208 RAID controller cards.
  • 32 – Supermicro CBL-0294L-01 cabling to connect the LSI RAID cards to the SAS2 hard drives.
  • 1 – MELLANOX IS5023 18-port 4X QDR Infiniband switch
  • 16 – Supermicro AOC-UIBQ-M2 Dual port 4X QDR Infiniband card
  • 16 – Supermicro QSFP Infiniband cables in a couple different lengths

A special thanks and shout out to Sheldon Imaoka of ANSYS, Inc. for inspiring me to write this blog article!

2000 Core Milestone Passed for CUBE HVPC Systems

As we put the finishing touches on the latest 512 core CUBE HVPC cluster, PADT is happy to report that there are now 2,042 cores worth of High Value Performance Computing (HVPC) power out there in the form of PADT’s CUBE computer systems.  That is 2,042 Intel or AMD cores crunching away in workstations, compute servers, and mini-clusters chugging on CFD, Explicit Dynamics, and good old fashioned structural models, producing more accurate results in less time for less cost.

When PADT started selling CUBE HVPC systems it was for a very simple reason: our customers wanted to buy more compute horsepower but they could not afford it within their existing budgets. They saw the systems we were using and asked if we could build one for them.  We did. And now we have put together enough systems to get to 2,042 cores and over 9.5TB of RAM.

CUBE-HVPC-512-core-closeup3-1000h

Our Latest Cluster is Ready to Ship

We just finished testing ANSYS, FLUENT, and HFSS on our latest build, a 512 core AMD-based cluster. It is a nice system:

  • 512 2.5GHz AMD Opteron 6380 Processors: 16 cores per chip, 4 chips per node, 8 nodes
  • 2,048 GB RAM, 256GB per node, 8 nodes
  • 24 TB disk space – RAID0:  3TB per node, 8 nodes
  • 16 Port 40Gbps Infiniband Switch (so they can connect to their older cluster as well)
  • Linux

All for well under $180,000.

It was so pretty that we took some time to take some nice images of it (click to see the full size):

CUBE-HVPC-512-core-front1-1000h CUBE-HVPC-512-core-service1-1000h CUBE-HVPC-512-core-stairs-1000h

And it sounded so awesome that we took this video so everyone can hear it spooling up on a FLUENT benchmark:

If that made you smile, you are a simulation geek!

Next we are building two 64 core compute servers, for another repeat customer, with an Infiniband switch to hook up to their two existing CUBE systems. This will get them to a 256 core cluster.

We will let you know when we get to 5000 cores out there!

Are you ready to step out of the box, and step into a CUBE?  Contact us to get a quote for your next simulation workstation, compute server, or cluster.

Monster in the Closet: PADT Goes Live with 512 Core HVPC CUBE Cluster

There is a closet in the back of PADT’s product development lab. It does not store empty boxes, old files, or obsolete hardware.  Within that closet is a monster.  Not the sort of monster that scares little children at night.  No, this is a monster that puts fear into the hearts of those who try to paint high performance computing as a difficult and expensive task only to be undertaken by those in the priesthood.  It makes salespeople who earn fat commissions by selling consulting services and unnecessary add-ons quake in fear.

This closet holds PADT’s latest upgrade to our compute infrastructure: a 512 core CUBE HVPC Cluster.  No data center, no special consultants, no expensive add-ons. Just 512 cores chugging away at solving FLUENT and CFX problems, and pumping a large amount of heat up into the ceiling.

Here are the specifics:

CUBE C512 Columbia Class Cluster

  • 512 AMD 2.4GHz Cores (in 8 nodes, 4 sockets per node, 16 cores per socket)
  • 2TB RAM (256 GB per node of DDR3 1600 ECC RAM)
  • Raid Controller Card (1 per node)
  • 24TB Data Disk Space (3TB per node of SAS2 15k drives in RAID0)
  • Infiniband (8 Port switch, 40 Gbps)
  • 52 Port GIGE switch connected to 2 GIGE ports per node
  • 42 U Rack with thermal convection ducting (chimney)
  • Keyboard, monitor, mouse in drawer
  • CENTOS (switching to RedHat soon)

We built this system with CFD simulation in mind.  The original goal was to provide a proof of concept to expand our CUBE HVPC offering, showing that you can create a cluster of this size, with very good speed, for a price that small and medium sized companies can afford.  We also needed a way to run large problems for benchmarks in support of our ANSYS sales efforts and to provide faster technical support to our FLUENT and CFX customers.  We already have a growing queue of benchmarks waiting to get into the machine.

The image above is the glamour shot.  Here is what it looks like in the closet:

image

Keeping with our theme of High Value Performance Computing, we stuck it into this closet, which was built for telephone and networking equipment back at the turn of the century when Motorola had this suite.  We were able to fit a modern rack in next to an old rack that was already there. We then used the included duct to push the hot air up into our ceiling space and moved the A/C ducting to blow right into the front of the units.  We did need to keep the airflow going into the rack instead of into the area under the networking and telephone switches, so we used an old video game poster:

image
Anyone remember Ratchet and Clank? 
Best PS2 games ever.

It works well and adds a little color to the closet.

So far our testing has shown some great numbers. It is not the fastest cluster out there, but if you look at the cost, it offers incredible performance.  You could add a drive array over Infiniband, faster chips, and some redundant power, and it would run faster and more reliably, but it would cost much more.  We are cheap, so we like this solution.

Oh yeah, with the parts from our old CFD cluster and some new bits, we will be building a smaller mini-cluster using INTEL chips, a GPU or two, and a ton of fast disk and RAM as our FEA cluster.  Look for an update on that in a couple of months.

Interested in getting a cluster like this for your computing pleasure?  A system configured like this one will run about $150,000 (video game poster is extra). Visit our CUBE page to learn more or just shoot an email to sales@padtinc.com.  Don’t worry, we don’t sell these with salespeople; someone from IT will get back with you.