NVIDIA DIGITS with Caffe – Performance on Pascal multi-GPU

Machine learning and AI are, in my opinion, the most exciting areas of computing right now. The combination of plentiful large data sets and the extraordinary compute capability made possible with GPU acceleration has facilitated an explosion in machine inference from data analysis that was computationally intractable just a few years ago.

We put a couple of systems under (heavy) load doing image classification with NVIDIA DIGITS using Caffe to see how the GTX 1070 and Titan X Pascal perform in single and multi-GPU configurations. The overall performance was great! The multi-GPU scaling was not as good as hoped, but it was still a significant benefit for jobs that may run for tens of hours!

We used two nice base system configurations for this testing and liked them enough that they became the basis of our recommended systems for DIGITS/Caffe.

DIGITS

NVIDIA DIGITS — Deep Learning GPU Training System. This includes NVIDIA’s optimized version of the Berkeley Vision and Learning Center’s Caffe deep learning framework and experimental support for the Torch Lua framework. It bundles NVIDIA tools for deep learning including cuDNN, cuBLAS, cuSPARSE, NCCL, and of course the CUDA toolkit. Also included are tools for manipulating datasets and neural network configuration files for LeNet-5, AlexNet and GoogLeNet. The web interface provides graphical tools for setting up training and test datasets, selecting and configuring training models, controlling jobs with visualization, and testing trained models. NVIDIA is actively developing this software stack and it has become a very usable and convenient interface.
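If you want a quick sanity check that the Caffe build bundled with DIGITS can actually see a GPU from Python, a minimal sketch like the following works with the pycaffe bindings (this assumes pycaffe is on your PYTHONPATH; it is illustrative, not part of the DIGITS workflow itself):

```python
# Minimal check that the bundled Caffe was built with GPU support
# and can talk to the first CUDA device.
import caffe

caffe.set_device(0)   # select GPU 0; change the index for other cards
caffe.set_mode_gpu()  # will raise if Caffe was built without GPU support

print("Caffe is set to GPU mode on device 0")
```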

Hardware Configurations

In general …

When someone asks me “what hardware do you recommend for doing xxxx?” the answer is almost always “it depends”. (I’m thinking about cases where xxxx has GPU compute support.)

It depends on …

  • What software are you using specifically, including libraries, versions, etc.?
  • Does it have good GPU acceleration and multi-GPU support?
  • Is it performance bound by CPU, GPU, memory, disk I/O, …?
  • Are your jobs large/complex enough to take advantage of high performance hardware?
  • Is time to solution more important than budget?!

There are some hardware recommendation generalizations you can make. For software and problems that will spend most of their run-time on the GPU, and that have a fair amount of data movement from CPU/disk space to GPU memory, the following are generally good recommendations:

  • Use a single CPU
    • Why? CPUs have memory and PCIe lanes “attached” to them. More than one CPU socket means you can have a situation where a GPU needs data that lives in system memory attached to a CPU that is not controlling its PCIe lanes. Think about the path of that memory access … it can really slow things down for some applications!
  • Have twice as many (or more) CPU cores as you have GPUs
    • Why? Multi-GPU jobs will usually have a “support” CPU core for each GPU, and you may have extra CPU load from other I/O operations and compute going on at the same time.
  • Have twice as much system memory as you have total GPU memory (a quick way to check this rule and the one above is sketched after this list)
    • Why? It’s good to have enough memory to allow for full GPU memory pinning back to CPU space. And, extra memory will get used for I/O buffer cache.
  • Have your GPUs in full x16 PCIe slots
    • Why? That’s probably obvious. GPUs are so fast these days that I/O often becomes a bottleneck. x16 is better than x8!
  • Use SSDs for storage
    • Why? Again, hopefully that is obvious. You can get affordable >1TB SSDs these days. You don’t want to be held up by platter-drive latency and bandwidth!
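Here is the quick check mentioned above: a small Python sketch that compares CPU core count and system memory against the installed GPUs. It assumes the pynvml bindings (the nvidia-ml-py package) are installed and that you are on Linux; treat it as illustrative rather than definitive.

```python
# Rough check of the "2+ CPU cores per GPU" and
# "2x system memory vs. total GPU memory" rules of thumb.
import os
import pynvml

pynvml.nvmlInit()
gpu_count = pynvml.nvmlDeviceGetCount()

gpu_mem_total = 0
for i in range(gpu_count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    gpu_mem_total += pynvml.nvmlDeviceGetMemoryInfo(handle).total  # bytes

cpu_cores = os.cpu_count()
sys_mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")  # bytes, Linux only

print("GPUs: %d  CPU cores: %d" % (gpu_count, cpu_cores))
print("Total GPU memory: %.1f GiB" % (gpu_mem_total / 2.0**30))
print("System memory:    %.1f GiB" % (sys_mem / 2.0**30))
print("Cores-per-GPU rule satisfied: %s" % (cpu_cores >= 2 * gpu_count))
print("Memory-ratio rule satisfied:  %s" % (sys_mem >= 2 * gpu_mem_total))

pynvml.nvmlShutdown()
```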

You can argue about the above generalizations. In some particular use cases some of those suggestions won’t matter that much, but I feel it is good advice overall. Running Caffe with DIGITS definitely fits the use case for the generalizations above. Let’s get on to some testing!

Test Systems

We did our DIGITS testing on two base systems, the Peak Mini (Compact GPU Workstation) and the Peak Single (DIGITS GPU Workstation), configured similarly to our recommended systems for DIGITS/Caffe.

The Peak Mini (Compact GPU Workstation)
  • CPU: Intel Xeon E5-1650 v4 6-core @ 3.6GHz (3.8GHz All-Core-Turbo)
  • Memory: 128 GB DDR4 2133MHz Reg ECC
  • PCIe: (2) X16-X16 v3
  • Motherboard: ASUS X99-M WS

The Peak Single (“DIGITS” GPU Workstation)
  • CPU: Intel Core i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
  • Memory: 128 GB DDR4 2133MHz Reg ECC
  • PCIe: (4) X16-X16 v3
  • Motherboard: ASUS X99-E-10G WS

Note: We normally recommend Xeon processors in our Peak line of systems since they do have somewhat better reliability. (I just happened to have an i7 6900K on my desk when I set up the Peak Single for testing and, hey, it’s a great processor and works on that board :-)

GPUs

Video cards used for testing.

Card           | CUDA cores | GPU clock (MHz) | Memory clock (MHz) | Application clock (MHz) | FB memory (MiB)
GTX 1070       | 1920       | 1506            | 4004               | 1506                    | 8110
TITAN X Pascal | 3584       | 1911            | 5005               | 1417                    | 12186
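The numbers in that table are the sort of thing nvidia-smi reports. If you would rather query them programmatically, a rough Python sketch using pynvml looks like this (the GPU clock you observe depends on whether the card is boosting, and the application-clock query may not be supported on every GeForce card):

```python
# Print per-GPU name, clocks, and framebuffer memory, similar to the table above.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)  # may be bytes or str depending on pynvml version
    gpu_clock = pynvml.nvmlDeviceGetMaxClockInfo(h, pynvml.NVML_CLOCK_GRAPHICS)       # MHz
    mem_clock = pynvml.nvmlDeviceGetMaxClockInfo(h, pynvml.NVML_CLOCK_MEM)            # MHz
    app_clock = pynvml.nvmlDeviceGetApplicationsClock(h, pynvml.NVML_CLOCK_GRAPHICS)  # MHz
    fb_mem = pynvml.nvmlDeviceGetMemoryInfo(h).total // (1024 * 1024)                 # MiB
    print(name, gpu_clock, mem_clock, app_clock, fb_mem)
pynvml.nvmlShutdown()
```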

Caveat:

Heavy compute on GeForce cards can shorten their lifetime! I believe it is perfectly fine to use these cards but keep in mind that you may fry one now and then!

Software

The OS install was basically a default Ubuntu 14.04.5 install followed by installing CUDA 8 and the DIGITS 4 stack (similar to the description here). CUDA 8 includes a recent enough display driver for all of the NVIDIA cards except the most recent 1060 and 1050 cards. For those you will probably want to use the “375” driver from the graphics-drivers PPA. If you use Ubuntu 14.04 and the deb repo files, the install will be fairly painless. If you want to use Ubuntu 16.04 you will need to do a manual install and compile some code. It’s not bad and I will do a blog post on how to do that before too long (I hope :-).

To summarize, the software used in the testing was Ubuntu 14.04.5, CUDA 8, and DIGITS 4 with NVIDIA’s Caffe build.

Test job image dataset

I wanted to do an image classification problem with DIGITS that was large enough to stress the systems in a way more like a “real world” workload. I used the training image set from the
ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)
I only used the training set images from the “challenge”. All 138GB of them! I used the tools in DIGITS to partition this set into a training set and validation set and then used the GoogLeNet 22-layer network.

  • Training set — 960893 images
  • Validation set — 320274 images
  • Model — GoogLeNet
  • Duration — 30 Epochs
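The DIGITS dataset tool handled the train/validation partitioning through the web UI; the counts above work out to a 75% / 25% split. If you wanted to do a comparable split outside of DIGITS, a hypothetical Python sketch (the directory path is made up) might look like this:

```python
# Hypothetical ~75/25 train/validation split of an image directory,
# similar in spirit to what the DIGITS dataset tool does for you.
import os
import random

image_dir = "/data/ilsvrc2012/train"   # made-up path; point this at your own copy
images = [os.path.join(image_dir, f)
          for f in os.listdir(image_dir)
          if f.lower().endswith((".jpg", ".jpeg", ".png"))]

random.seed(42)        # make the shuffle reproducible
random.shuffle(images)

split = int(0.75 * len(images))
train, val = images[:split], images[split:]

print("training images:   %d" % len(train))
print("validation images: %d" % len(val))
```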

Many of the images in the ImageNet collection are copyrighted, which means that usage and distribution are somewhat restricted. One of the conditions listed for download is this:
“You will NOT distribute the above URL(s)”
So, I won’t. Please see the ImageNet site for information on obtaining datasets.

Citation
    Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.

Results

DIGITS with Caffe is all about the GPUs! Performance was determined by the capability of the GPU and how many cards were being used. Both system platforms, Peak Mini and Peak Single, provided adequate support for the GPUs and gave essentially the same run times for the same GPU configurations.

The following table lists model training times for 30 epochs with one to four GTX 1070 and Titan X Pascal cards.

GoogLeNet model training with Caffe on a 1.3 million image dataset for 30 epochs using 1-4 GTX 1070 and Titan X Pascal video cards

GPUs          | Model training runtime
(1) GTX 1070  | 32hr
(2) GTX 1070  | 19hr 32min
(4) GTX 1070  | 12hr 43min
(1) Titan X   | 19hr 34min
(2) Titan X   | 13hr 21min
(4) Titan X   | 8hr 1min
Notes:
  • The 1 and 2 GTX 1070 runs were done with an image batch size of 64; all others used an image batch size of 128.
  • GPU memory usage was 60-98% depending on batch size, number of GPUs used, and GPU memory size.
  • CPU utilization ranged from 2-8 cores depending on the number of GPUs and fluctuated over the course of the job run.
  • System memory usage varied from 32-120GB depending on the number of GPUs and CPU cores in use.
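For reference, the speedups implied by those run times are easy to compute. The times below are taken straight from the table; the snippet just does the arithmetic:

```python
# Speedup relative to a single card, computed from the measured run times above.
runtimes_min = {
    "GTX 1070":       {1: 32 * 60,      2: 19 * 60 + 32, 4: 12 * 60 + 43},
    "Titan X Pascal": {1: 19 * 60 + 34, 2: 13 * 60 + 21, 4: 8 * 60 + 1},
}

for card, times in runtimes_min.items():
    single = times[1]
    for n in sorted(times):
        speedup = single / times[n]
        print("%-14s x%d: %5d min  (%.2fx)" % (card, n, times[n], speedup))
```

That works out to roughly 1.5-1.6x on two cards and about 2.4-2.5x on four, for both the GTX 1070 and the Titan X Pascal.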

Note on GPU performance scaling

Multi-GPU scaling was somewhat disappointing. Going from 1 to 4 GPUs gives a little better than twice the performance. This is still a welcome improvement in job run-time, since these jobs ran for a considerable length of time. However, be cautioned that GoogLeNet is a rather demanding model to optimize. The AlexNet model did not benefit from multiple GPUs, likely because the GPUs are so fast that unless there is a substantial amount of work to be done per iteration the job becomes CPU or I/O bound. Also, see the note below from the Caffe documentation on GitHub.

“Performance is heavily dependent on the PCIe topology of the system, the configuration of the neural network you are training, and the speed of each of the layers. Systems like the DIGITS DevBox have an optimized PCIe topology (X99-E WS chipset). In general, scaling on 2 GPUs tends to be ~1.8X on average for networks like AlexNet, CaffeNet, VGG, GoogleNet. 4 GPUs begins to have falloff in scaling. Generally with “weak scaling” where the batchsize increases with the number of GPUs you will see 3.5x scaling or so. With “strong scaling”, the system can become communication bound, especially with layer performance optimizations like those in cuDNNv3, and you will likely see closer to mid 2.x scaling in performance. Networks that have heavy computation compared to the number of parameters tend to have the best scaling performance.”

The quote above is based on observations using Maxwell-based cards, the GTX 970 for example. The newer Pascal-based cards like the GTX 1070 and Titan X Pascal are twice as fast as the Maxwell cards, so multi-GPU scaling is potentially even worse than suggested above.

Conclusions and recommendations

Overall I was very impressed with the performance of the new Pascal GPUs. They offer twice the compute performance of the last-generation Maxwell cards. The Titan X Pascal and GTX 1070 are both great cards for GPU-accelerated computing. The Titan X performs around twice as fast as the GTX 1070, but at about three times the cost. An important consideration when deciding on a hardware configuration for this work is that multi-GPU scaling may not be very good unless your model is challenging enough to keep the GPUs busy between I/O operations.

For hardware recommendations I would say sticking with the generalizations I listed earlier is good advice. You probably won’t be using your system for just one task, so it’s not good to compromise on any area of your hardware. However, you don’t need to max everything out either! Balance is important. For workloads like training convolutional neural networks with Caffe you want to focus on the GPU, since that is where the majority of your performance will come from. The CPU, memory and storage are there in support of the GPU. The basic guide of 2 CPU cores per GPU and at least twice as much system memory as GPU memory is good. And, I don’t recommend platter-based hard drives for anything other than storage; you should be running your OS and staging your datasets on SSDs. It’s also good to give your GPUs full x16 PCIe lanes (but I haven’t tested the effect of that yet!). If you can’t afford a new system to work with, then I recommend just updating your video card to something like a GTX 1070 or better and then diving in and giving it a try!

This is a great time to be working on machine learning and AI. All of the ingredients are in place: a good theoretical base, lots of data, great computer hardware (GPU acceleration is a game changer!) … and lots of problems to solve!

Happy computing –dbk