GPU scaling for AI workload optimization
How to optimize your infrastructure for LLM training, fine-tuning and inferencing
When AI engineers and data scientists build training and inference pipelines for large language models, infrastructure scaling quickly becomes the bottleneck.
In this blog, we will focus on LLMs and LMMs (Large Multimodal Models) in particular, as they pose the biggest challenge in terms of compute requirements.
Let’s start with a primer on how cloud infrastructure scaling works, for those who might be new to it.
Vertical Scaling: Adding more power (CPU, GPU, memory) to an existing machine, so that more data and larger models fit on a single machine (or in memory).
Horizontal Scaling: Adding more machines to the system for distributed computing, so that workloads can run in parallel across them.
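To make this concrete, here is a minimal sketch (assuming PyTorch and NVIDIA GPUs; the helper name is my own) that checks how many GPUs a single machine exposes and how much memory each one has. If your model and batch don't fit, you either move to a bigger machine (vertical) or spread the work across more machines (horizontal).

```python
import torch

def summarize_local_gpus():
    """Print the GPUs visible on this machine and their total memory.

    A rough first check before choosing a scaling strategy: if the model
    weights plus activations exceed what this machine can hold, vertical
    scaling means moving to a bigger machine, while horizontal scaling
    means sharding the work across more machines.
    """
    if not torch.cuda.is_available():
        print("No CUDA devices visible on this machine.")
        return
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")

if __name__ == "__main__":
    summarize_local_gpus()
```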
With this primer in place, let's dig into when and how you would use each of these methods!
Model Training: Training a foundation model requires a vast amount of data, which pushes you toward horizontal scaling: spreading the workload in parallel across multiple GPUs and machines to use resources efficiently. If you are working with smaller compute resources and want to fit more data in memory while training a model with hundreds of billions of parameters, you inherently need more memory in the same machine, which is where you scale your compute vertically.
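As a rough illustration of the horizontal case, the sketch below uses PyTorch's DistributedDataParallel to run one replica of a model per GPU, with each process seeing a different shard of the data. The model, the synthetic dataset, and the hyperparameters are placeholders rather than anything specific to a real foundation-model run, which would also layer in sharding techniques such as FSDP or ZeRO.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it launches.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; a real job would build an LLM
    # and stream a tokenized corpus here.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    data = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(data)  # each rank sees a different shard of the data
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across every replica
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, each GPU runs its own process and gradients are averaged across all of them after every backward pass.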
Model Fine-tuning: Typical fine-tuning passes an additional dataset through a foundation model and adjusts the model weights to fit the new data. In this scenario, techniques like chain-of-thought prompting or Monte Carlo tree search can help break prompts down sequentially into tasks. If you are working with a relatively small dataset, a single high-performance GPU will typically be able to handle this workload, so you don't need horizontal scaling.
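For the single-machine case, a fine-tuning loop can look roughly like the sketch below, which uses the Hugging Face Trainer on one GPU. The model name, dataset, and hyperparameters here are illustrative stand-ins, not recommendations from this post.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # illustrative; swap in the foundation model you are tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative public dataset; in practice this is your domain-specific corpus.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = (dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
                    .filter(lambda row: len(row["input_ids"]) > 0))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        fp16=True,  # mixed precision helps fit more on a single GPU
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates all model weights on the new dataset
```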
Parameter-Efficient Fine-Tuning (PEFT): While the term says "fine-tuning", PEFT isn't quite fine-tuning; it is closer to transfer learning. Low-Rank Adaptation (LoRA) is the most widely used PEFT method. With LoRA, you freeze the foundation model's weights and add a small number of extra parameters on top of the model, and only those get tuned on the additional dataset. Because so few parameters are trained, you don't need a lot of compute.
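A minimal LoRA setup with the Hugging Face peft library might look like the following; the base model, rank, and target modules are illustrative and would change with your architecture. The point is that only the small injected adapter weights are trainable while the base model stays frozen.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; in practice, the foundation model you are adapting.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection to adapt; this is model-specific
)

# Wrap the base model: its original weights are frozen, and only the
# injected LoRA adapter parameters will receive gradients.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total
```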
Model Inferencing: When you are scoring or running inference on the models, vertical scaling may be adequate if the throughput requirements are within the capabilities of a single machine, especially in applications with lower request volumes. In high-demand scenarios where many inference requests must be processed simultaneously, horizontal scaling distributes the load, improving response times and system resilience.
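To illustrate the single-machine (vertical) end of this, here is a sketch of batched generation with Hugging Face Transformers; the model name and generation settings are illustrative. Horizontal scaling for inference then mostly amounts to running several replicas of a process like this behind a load balancer or a dedicated serving layer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; replace with the model you actually serve
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-padding keeps generation aligned for decoder-only models
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

# Batching requests together raises throughput on one GPU, which is often
# enough at low request volumes; at higher volumes you run more replicas
# of this process on more machines instead.
prompts = ["Summarize GPU scaling for LLMs:", "Explain horizontal scaling:"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```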