06 Sep 2022

The Best Bang for Your Buck Hardware for Deep Learning

Though graphics processors were originally meant for gamers, any computer science enthusiast knows that they are extremely valuable in other domains as well. After a long period of supply chain issues and shortages, the market finally seems to be stabilizing. The deep learning community can take a breath, relax, buy a new batch of GPUs, and run some training sessions.

NVIDIA knows that gamers are not the sole target demographic for their products anymore. In September of 2018, they released the NVIDIA Tesla T4: a server-grade inference card for deep learning. The Tesla V100, meant for training, is part of their deep learning line-up as well. These cards are fitted with so-called Tensor Cores for neural network performance. The same Tensor Cores are also present in the latest generations of consumer cards like the RTX 2060, 2070, 2080 and 2080 Ti and the “SUPER” cards, as well as the 3000-series. This shows that NVIDIA is serious about deep learning, even in consumer cards. In this blog post, we'll guide you through the most important aspects of buying a GPU for machine learning purposes.

“Server-grade cards”

Now, when looking for a “bang for your buck” graphics card, you need to watch out for a couple of pitfalls. First of all, there is a profound difference in pricing between the consumer cards and the server-grade cards that NVIDIA sells. Take the Tesla V100: it is a server-grade card based on the Volta GV100 architecture. There is also a consumer-grade card, the Titan V, based on the same architecture with nearly identical specifications. Both boast 5120 CUDA cores, a TDP of 250 watts, and around 15 TFLOPS of single-precision floating-point performance. The V100 does have more memory: 16GB of HBM2 running at a slightly higher clock speed, compared to the Titan V's 12GB. The main difference lies in the price: the Titan V sets you back around $3,000, the Tesla V100 around $10,000. The server-grade cards are designed specifically to fit into server racks and to run continuously for long periods of time. Moreover, the EULA that comes with the required drivers explicitly forbids the use of consumer cards in data centers, which is why AWS, Azure and Google Cloud do not offer Titan Vs. Either way, if you are a researcher or deep learning hobbyist, you do not need server-grade cards: unless you are planning to deploy a data center full of graphics cards, a consumer GPU will give you essentially the same performance for a much lower price.

Understanding NVIDIA Performance Figures

Second, take the TFLOPS ratings that NVIDIA boasts on its website with a grain of salt. We will be looking at the Tesla V100 and T4, as these are the cards that NVIDIA mainly markets for deep learning. There are two numbers that NVIDIA keeps using, and both come with their fair share of asterisks:

Deep learning performance: For the Tesla V100, this is quoted as 125 TFLOPS, compared to its 15 TFLOPS of single-precision performance. That is an insane number; how do they get there? It is based on NVIDIA’s “mixed precision performance”. Using some mathematical trickery, NVIDIA managed to combine the advantages of FP32 and FP16 training: fast results and accurate convergence. The 640 Tensor Cores introduced in the Tesla V100 are built specifically to accelerate half-precision arithmetic, which is what allows them to reach these performance figures. To turn it on, you need to run the “NVIDIA NGC TensorFlow 19.03 container” inside a Docker instance, set an environment variable, and there you go! To quote NVIDIA: “We are also working closely with the TensorFlow team at Google to merge this feature directly into the TensorFlow framework core.” (Other libraries like PyTorch and MXNet have support as well.) Hidden in the documentation, however, you will find that mixed precision training is only supported for a small subset of operations, so your model might not benefit from this feature at all. The 125 TFLOPS figure can be achieved under ideal circumstances, but it is hardly a given. If you manage to get mixed-precision training to work and have a simple model that relies heavily on 2D convolutional layers, you might see a large speed-up. In any other case, you will not.
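
For reference, mixed precision has since been merged into mainstream TensorFlow, so you no longer need the NGC container to try it. Below is a minimal sketch of what enabling it looks like with the Keras mixed precision API; the toy ConvNet is purely our own illustration, not any benchmark model.

```python
# A minimal sketch of mixed precision training with the Keras API (TF 2.4+).
# The toy ConvNet and its shapes are made up for illustration.
import tensorflow as tf

# Compute in float16 (so Tensor Cores can be used), keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    # Keep the final softmax in float32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# Under the mixed_float16 policy, Keras automatically wraps the optimizer
# with loss scaling to avoid FP16 underflow.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```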

 

On the inference side of things, something similar is happening. NVIDIA boasts very high performance figures for the T4 inference GPU, but they are all based on FP16 or quantized performance. Optimizing with half-precision (FP16) can work, but only if your model consists of operations that meet NVIDIA's list of requirements. When using other methods of quantization, you are unlikely to reach the advertised performance numbers. Our advice is as follows: always look at the FP32 (single-precision) performance. This is the default for TensorFlow and, at this point in time, supports all of TensorFlow's built-in ops.
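
If you do want to experiment with FP16 inference on an NVIDIA card, TensorFlow ships a TensorRT integration (TF-TRT) that can convert a SavedModel. The exact API has shifted between TensorFlow versions and requires a build with TensorRT support; the sketch below follows the tf.experimental.tensorrt interface and uses a made-up model path.

```python
import tensorflow as tf

# Hedged sketch: convert a SavedModel to a TensorRT-optimized graph with FP16
# enabled. "my_savedmodel" is a hypothetical path; operations that TensorRT
# cannot handle simply stay in the regular TensorFlow graph.
params = tf.experimental.tensorrt.ConversionParams(precision_mode="FP16")
converter = tf.experimental.tensorrt.Converter(
    input_saved_model_dir="my_savedmodel",
    conversion_params=params,
)
converter.convert()
converter.save("my_savedmodel_trt_fp16")
```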

 

All newer NVIDIA cards contain Tensor Cores. Tensor Cores are like CUDA cores, but optimized for deep learning operations, and they provide a very significant speedup for models that are compatible with them. Whether some or all of the operations in your model will run on Tensor Cores depends on a large number of factors, including the precision of the operation, the operation itself, the convolution kernel size, and the channel order. In our experience, the situation is much better now than it used to be in 2020. The last two generations of NVIDIA cards in particular have broad support for Tensor Cores, and we have seen very pronounced performance increases (up to 10x), even in models that are not simple 2D ConvNets.
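
If you want to verify whether your own model actually lands on Tensor Cores, one option is to capture a short profile and inspect the GPU Kernel Stats page in TensorBoard, which reports Tensor Core eligibility and utilization per kernel. The snippet below is a rough sketch with a made-up toy model, random data, and a hypothetical log directory.

```python
# Rough sketch of checking Tensor Core usage via the TensorFlow profiler.
import numpy as np
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    # 32 -> 256: dimensions that are multiples of 8 keep the matmul
    # eligible for Tensor Cores under FP16.
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.rand(512, 32).astype("float32")
y = np.random.randint(0, 10, size=(512,))

tf.profiler.experimental.start("tc_profile")   # hypothetical log directory
model.fit(x, y, batch_size=64, epochs=1, verbose=0)
tf.profiler.experimental.stop()
# Then: `tensorboard --logdir tc_profile` -> Profile -> GPU Kernel Stats.
```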

Are there alternatives?

Unfortunately, the alternatives right now are scarce. Google is starting to release some of its TPU-based products. Though the theoretical bang for buck could be up to 5 times better, in practice the support is not great and you are quickly tied into Google's infrastructure. TPUs are not ready for prime time yet, but they are promising. Dedicated accelerators and FPGA-based chips from other vendors are also starting to hit the market (Intel is doing some work here, as are several lesser-known companies; Google it!). All of them come with questionable performance metrics: again, lots of INT8 and FP16 numbers. We even saw some companies comparing INT8 models on their hardware against FP32 models on Tesla GPUs; don't fall for it! Lastly, most of these chips are not supported by TensorFlow or PyTorch out of the box. Intel has its own framework (comparable to TensorRT) called OpenVINO. It requires your model to be converted and, of course, does not support all native TensorFlow and PyTorch operations. The same goes for AMD's ROCm + TensorFlow toolchain. We have yet to find a platform that is fully compatible with everything TensorFlow and PyTorch already have to offer. There are simply too many caveats at this point. You will likely end up working for days until you finally figure out that your specific model is either not compatible with the hardware or, even worse, is compatible but does not run as fast as it would on a cheaper NVIDIA GPU. On the other hand, if you only work with well-known, off-the-shelf models like ResNet or Inception, you might be in luck!

What to buy then?

Now that we know not to buy server-grade cards (unless you're Google) and understand how to navigate the NVIDIA product line, we can make some purchase recommendations for consumer GPUs. The most expensive cards in NVIDIA's line-up are rarely the best in terms of “bang for your buck”. Unless you have special memory requirements, you are better off getting a second-hand top-of-the-line gaming card from the previous generation. We are not just making this up! We compiled a reasonably complete table of graphics cards and considered multiple performance metrics.

We used estimates of the actual market price to calculate "bang for buck", based on the sales prices of major retailers in Europe. The market value of a GPU changes all the time, so some of these estimates might be out of date. For cards that are no longer sold by vendors, we listed the approximate second-hand price. Note that in previous versions of this blog post we did not include the Tensor Core metric in the performance score. Since Tensor Core compatibility has improved drastically since the last time we updated this post, we now believe that Tensor Cores will have a large impact on actual performance for many users, so the metric is now included in the final calculation.
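
To make the scoring idea concrete, here is a purely hypothetical sketch of what a "performance per euro" calculation could look like. The weights, helper names, and spec numbers are our own illustrative assumptions; they are not the formula behind the actual table linked below.

```python
# Purely illustrative sketch of a "bang for buck" score; weights, structure,
# and numbers are assumptions for demonstration, not the real methodology.
from dataclasses import dataclass

@dataclass
class GPU:
    name: str
    fp32_tflops: float    # single-precision throughput
    tensor_tflops: float  # Tensor Core (FP16) throughput
    memory_gb: float
    price_eur: float      # estimated (second-hand) market price

def bang_for_buck(gpu: GPU, w_fp32=1.0, w_tensor=0.5, w_mem=0.1) -> float:
    """Weighted performance per euro; the weights are arbitrary placeholders."""
    score = (w_fp32 * gpu.fp32_tflops
             + w_tensor * gpu.tensor_tflops
             + w_mem * gpu.memory_gb)
    return score / gpu.price_eur

# Approximate spec and price numbers, for illustration only.
cards = [
    GPU("RTX 2070 SUPER", 9.1, 72.5, 8, 350.0),
    GPU("RTX 3080", 29.8, 119.0, 10, 800.0),
]
for card in sorted(cards, key=bang_for_buck, reverse=True):
    print(f"{card.name}: {bang_for_buck(card):.4f} points per euro")
```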

Visit our blog post for the full list of NVIDIA GPUs: https://oddity.ai/blog/best-bang-for-buck-gpu/

The table shows that the best bang for buck GPUs at this point in time are the RTX 2070 and RTX 2070 SUPER. You should be able to purchase them second-hand for around €300 and €350 respectively (in Europe). That is a good deal: these cards have a decent number of Tensor Cores and should run optimized workloads really well. They are older cards though, and their memory size is a bit prohibitive. For anyone who has more to spend and wants their setup to be ready for the future, we recommend getting an RTX 3080 or RTX 3080 Ti. Both cards are the best bang for buck in their price range and use the latest Ampere architecture. They are also compatible with the most recent Tensor Core optimizations, so for certain models they can provide a very large speedup. Lastly, the larger memory size can be beneficial for large models.

If you are going to use a lot of off-the-shelf models such as Inception, ResNet, or common YOLO models, you might be best off getting a second-hand 2070 SUPER. If you are training a large model, a custom model, or a model that uses non-standard techniques (such as 3D convolutions), a 3080 Ti is probably your best bet. It is a card that will last you a while, and the recent improvements in Tensor Core compatibility can have a large impact on performance.

Note that at the time of writing, the new Lovelace GPUs are right around the corner. The specs are not in yet, so there is no telling whether they will make for a good deal. Their release will affect second-hand prices of previous generations, so make sure to double-check the second-hand prices of the 3000-series cards once the Lovelace cards are out.
