To GPU or Not, This is No Longer The Question…

imageExecutive Summary

Without losing you quickly due to fabulous marketing and product information. I thought it prudent if I get to the answer of my question as quickly as possible. What was the question? Should I invest in a companion processor and an ANSYS HPC Pack? Yes, now is the time.

  • If your current workstation is beginning to show its age and you are unable to purchase new hardware.
  • Get your critical results fast. It is painful waiting upwards of 50 hours for your solution to solve.
  • Assign a dollar value to your current frustration level.
  • Current pricing for Nvidia TESLA C2075 is around $2500. Current pricing for the NVidia Quadro 6000 is around $5,000.
 

“Keeping It Real”

Matt Sutton our Lead Software Development Engineer at PADT, Inc. was solving a very large ANSYS MAPDL acoustic ultrasound wave propagation model. The model labored over 12 hours to solve on Matt’s older Dell Precision 690 8 core workstation. We loaded the model onto our CUBE HVPC w8i with the NVidia Tesla C2075 and it solved the model in 50 minutes. That is a 15x speedup for Matt’s particular model!

 

Benchmarks

I started off with our benchmark assault using an industry standard ANSYS benchmark for HPC. This is a benchmark that NVidia requested we use for testing our TESLA C2075. The next benchmark that I used was an internal PADT, Inc. modal coupler benchmark.

Nvidia V14sp-5 ANSYS R14 Benchmark Matrix (Sparse solver, 2100k)

 

CUBE HVPC – w12a-GPU – 64GB

 

CONFIG A

               

64GB

#

Cores

Config A: Win v14 SMP In-Core

Config A: Win v14 SMP In-Core & GPU

Config A: Win v14 DIST In-Core

Config A: Win v14 DIST & GPU

Config A: Linux v14 SMP In-Core

Config A: Linux v14 SMP In-Core & GPU

Config A: Linux v14 DIST In-Core

Config A: Linux v14 DIST & GPU

   

speedup

  2 1388.20 463.40 1425.50 294.40 1324.40 466.30 1312.30 292.10      
  4 870.00 447.20 855.00 232.40 817.00 422.00 975.40 221.80      
  6 670.60 440.20 588.00 245.10 625.50 394.00 601.60 208.10      
  8 594.30 439.10 605.00 222.00 480.10 383.90 542.30 181.10      
  10 539.50 426.90 435.00 283.40 491.30 392.80 396.90 220.80      
  12 538.40 437.10 402.20 211.10 480.10 394.80 343.80 183.10     2x
                         
CUBE HVPC – w8i-GPU – 64GB CONFIG B                
64GB # Cores Config B: Win v14 SMP In-Core Config B: Win v14 SMP In-Core & GPU Config B: Win v14 DIST In-Core Config B: Win v14 DIST & GPU Config B: Linux v14 SMP In-Core Config B: Linux v14 SMP In-Core & GPU Config B: Linux v14 DIST In-Core Config B: Linux v14 DIST & GPU      
  2 1078.50 361.00 1111.90 235.70 1036.00 350.20 1072.90 250.60      
  4 645.50 330.70 652.30 185.50 608.10 312.00 790.90 193.20      
  6 494.30 322.70 458.30 233.70 464.90 303.20 502.70 178.70      
  8 438.50 328.30 462.20 230.40 406.10 304.80 451.20 166.00     2.5x
                         
CUBE HVPC – w16i-GPU – 128GB

 

                 
128 GB # Cores Config C: Linux v14 SMP In-Core & GPU Config C: Linux v14 DIST & GPU                  
  2 296.6 208.10                  
  4 254.4 160.70                  
  6 254.2 164.20                  
  8 239.9 138.20                  
  10 238.6 159.70                  
  12 246.3 129.60                  
  14 237.6 129.1                  
  16 248.9 130.5                  
image

 

CUBE HVPC – PADT, Inc. – Coupling Modal Benchmark (PCG solver, ~1 Million DOF, 50 modes)

 

CUBE HVPC w12i w/GPU

ANSYS 13.0

Shared Memory Parallel INTEL XEON 2×6 @3.47GHz /144GB of RAM w12i-GPU
  Processors Time Spent Computing Solution (secs) Date/Individuals Initials:
  2 5416.4 11/7/2011 – DRJM
  10+GPU 1914.2 (incore) 11/8/2011 – DRJM
  12+GPU 1946 (incore) 11/8/2011 – DRJM

CUBE HVPC w8i w/GPU

ANSYS R14

Shared Memory Parallel INTEL XEON 2×4 @2.8GHz /64GB of RAM w8i-GPU
  6+GPU 3659.4 (out of core) 4/11/12 – DRJM
  8+GPU 3686.7 (out of core) 4/11/12 – DRJM

CUBE HVPC w16i w/GPU

ANSYS R14

Shared Memory Parallel INTEL XEON e5-2690 2×8 @2.9GHz /128GB of RAM W16i-GPU
  14+GPU 2113 (incore) 4/18/12 – DRJM
  16+GPU 1533.9 (incore) 4/18/12 – DRJM

 

image

Summary, Debates, Controversy & Conclusions

One question that I hear often is this: “What operating systems is faster. Linux or Windows?”. My typical response will begin with “Well, it depends…” However, what the data illustrates with the two independent CPU workstations as well as Operating Systems. The ANSYS benchmarks were all performed in the same relative time frame. The only exception was the ANSYS modal analysis benchmark.

Are you ready for the answer? Here it is…yes Linux is faster than windows! Even if you come from the AMD CPU or INTEL CPU side of the tracks. Linux is faster! No big discovery right? I know we all knew this already but we were maybe afraid to ask by Operating system as well as CPU manufacture how much? Well with our 2.1 million degree of freedom Nvidia benchmark. Ummm, not very much as the data clearly indicates. However once again, the Linux based OS does give you a system performance advantage! If you use AMD or INTEL it is still faster. The next question that I hear next is: “What processor is faster AMD or INTEL?”. My typical response will begin with “Well, it depends…”

You did buy the NVidia TESLA C2075 GPU, okay this is great news! However, you figured it best that you keep it safe, stay the same, and continue to use Shared Memory Parallel solve method. As you ponder further on the speed up values. You will see once again that the best results are going to be found on the Linux Operating System and when you choose to solve in Distributed Memory Parallel mode. The AMD based CPU had the best speedup values when comparing the workstation against the two solve modes with a 2.2x’s speed up.

Lets begin to unpack this data even more and see if you can come up with your own judgments. This is where things might start to get controversial…so hang on.

Windows, Linux, AMD or INTEL

Time Spent Computing The Solution

  • ANSYS R14 Distributed Memory Parallel Results: (Incore)
    • LINUX 64-bit:
      • One minute forty-seven seconds faster on 2x AMD CPU (343.80 seconds vs. 451.20 seconds)
    • Windows 7 Professional 64-bit:
      • One minute faster on 2x AMD CPU (402.20 seconds vs. 462.20 seconds).
  • ANSYS R14 Shared Memory Parallel Results: (Incore)
    • LINUX 64-bit
      • One minute fourteen seconds faster on 2x INTEL XEON CPU (406.10 seconds vs. 480.10 seconds)
    • Windows 7 Professional 64-bit is sixty seconds faster solve time on AMD based CPU
      • One minute forty seconds faster on 2x INTEL XEON CPU (438.50 vs. 538.40 seconds)

The Nvidia TESLA C2075 GPU and ANSYS R14 – GPU to the rescue!

  • ANSYS R14 Distributed Memory Parallel with Nvidia TESLA C2075 GPU Assist Results
    • LINUX 64-bit:
      • Seventeen seconds faster on 2x INTEL XEON CPU (166 seconds vs. 183.10 seconds)
      • 129.1 seconds was the fastest overall solve time achieved for any operating system on this benchmark!
      • 2 x INTEL XEON E5-2690 – fourteen cores with ANSYS R14 DMP w/GPU assist
    • Windows 7 Professional 64-bit:
      • Nineteen seconds faster on 2x AMD CPU (211.10 seconds vs. 230.40 seconds).
      • 211 seconds was the fastest solve time achieved using Windows for this benchmark!
      • 2 x INTEL XEON X5560 – eight cores with ANSYS R14 DMP w/GPU assist
  • ANSYS R14 Shared Memory Parallel with Nvidia TESLA C2075 GPU Assist Results
    • LINUX 64-bit:
      • Seventeen seconds faster on 2x INTEL XEON CPU (166 seconds vs. 183.10 seconds)
      • Config C: 237.6 seconds was the fastest overall solve time achieved for any operating system on this benchmark!
      • 2 x INTEL XEON E5-2690 – fourteen cores with ANSYS R14 SMP w/GPU assist
    • Windows 7 Professional 64-bit:
      • Nineteen seconds faster on 2x AMD CPU (211.10 seconds vs. 230.40 seconds).
      • 211 seconds was the fastest solve time achieved using Windows for this benchmark!
      • 2 x INTEL XEON X5560 – eight cores with ANSYS R14 SMP w/GPU assist

ANSYS R14 Distributed Memory Parallel with GPU vs. ANSYS R14 Shared Memory Parallel with GPU speed up results:

Distributed Memory Parallel (DMP) or Shared Memory Parallel (SMP)

  • 2 x AMD Opteron 4184 based workstation vs. “DMP or SMP”
    • 2.2x’s speedup on Linux Operating System (183.10 seconds vs. 394.80 seconds)
    • 2x’s speedup on Windows Operating System (211.10 seconds vs. 437.10 seconds)
  • 2 x INTEL XEON E5-2690 based CPU vs. “DMP or SMP”
    • 1.9x’s speedup on Linux Operating System (129.41 seconds vs. 237.6 seconds)

With or Without GPU? ANSYS R14 Distributed Memory Parallel speed up results

Distributed Memory Parallel (DMP) with GPU vs. Distributed Memory Parallel without GPU

  • AMD Opteron 4184 based workstation
    • 2x’s speedup Linux Operating System (183.10 seconds vs. 343.80 seconds)
    • 2x’s speedup on Windows Operating System (211.10 seconds vs. 402.2 seconds)
  • INTEL XEON x5560 based workstation
    • 2.5x’s speedup on Linux Operating System (166 seconds vs. 451.20 seconds)
    • 1.4x’s speedup on Windows Operating System (230.40 seconds vs. 462.20 seconds)

Some Notes:

  • ANSYS Base License – unlocks up to 2 CPU Cores
  • ANSYS HPC Pack – unlocks up to 8 CPU Cores and GPU
  • The total amount of system RAM you have affects your Distributed solve times. A minimum of 48GB of RAM is recommended.
  • The processing speed of your CPU affects your Shared Memory Parallel solve times.
  • Model limits for direct will depend on largest front sizes
    • 6M DOF for 6GB Tesla C2075 and Quadro 6000
  • Model limits for iterative PCG and JCG
    • 3M DOF for 6GB Tesla C2075 and Quadro 6000

 

image

Hardware Specs of Workstations:

CONFIG A – CUBE HVPC w12a-GPU

The AMD® based CUBE HVPC w12a FEA Simulation Workstation:

  • CPU: 2x AMD Opteron 4184 (2.8GHz Ghz 6 core)
  • GPU: NVIDIA TESLA C2075 Companion Processor
  • RAM: 64GB DDR3 1333Mhz ECC
  • HDD: (os and apps): 450GB WD Velociraptor 10k
  • HDD: (working directory): 6 x 1TB WD RE4 drives using LSI 2008 RAID 0
  • OS: Windows 7 Professional 64-bit, Linux 64-bit
  • OTHER: ANSYS R14, latest NVIDIA TESLA Drivers

(Here is a picture of Sam with two full
tower CUBE HVPC workstations getting
ready to ship out!)

CONFIG B – CUBE HVPC w8i-GPU

The INTEL® based CUBE HVPC w8i FEA Simulation Workstation:

  • CPU: 2x INTEL XEON x5560 (2.8GHz 4 core)
  • GPU: NVIDIA TESLA C2075 Companion Processor
  • RAM: 64GB DDR3 1333Mhz ECC
  • HDD: (os and apps): 146GB SAS 15k
  • HDD: (working directory): 3 x 73GB SAS 15k – LSI RAID 0
  • OS: Windows 7 Professional 64-bit, Linux 64-bit
  • OTHER: ANSYS R14, latest NVIDIA TESLA Drivers

CONFIG C – CUBE HVPC c16i-GPU

The INTEL® based CUBE HVPC w16i-GPU FEA Server:

  • CPU: 2x INTEL XEON e5-2690 (2.9GHz 8 core)
  • GRAPHICS/GPU: NVIDIA QUADRO 6000
  • RAM: 128GB DDR3 1600 MHz ECC
  • HDD: (os and apps): 300GB SATA III 10k WD Velociraptor
  • HDD: (working directory): 3 x 600GB SATA III 10k WD
  • OS: Linux 64-bit
  • OTHER: ANSYS R14, latest NVIDIA TESLA Drivers

CUBE HVPC w12i-GPU

The INTEL® based CUBE HVPC w12i-GPU FEA Simulation Workstation: image

  • CPU: 2x INTEL XEON x5690 (3.47GHz 6 core)
  • GRAPHICS/GPU: NVIDIA QUADRO 6000
  • RAM: 144GB DDR3 1333Mhz ECC
  • HDD: (os and apps): 256GB SSD SATA III
  • HDD: (working directory): 4 x 600GB SAS2 15k – LSI RAID 0
  • OS: Windows 7 Professional 64-bit
  • OTHER: ANSYS R13

References