How is Microsoft's 1-bit LLM going to change the LLM landscape?
A comprehensive guide to 1-bit LLMs and how they are likely to evolve
TL;DR - 1-bit LLMs can:
Drastically reduce the required GPU memory, thus enabling the model to run on less powerful, more affordable hardware.
Potentially cut down the cost of inference, considering less energy consumption and the ability to utilize less expensive hardware setups.
Open the door to deploying more powerful LLMs on edge devices, given the reduced memory and computational overhead.
While Large Language Models (LLMs) are being adopted swiftly for building applications, efficiency and cost-effectiveness have become paramount because of GPU costs and shortages.
Recent developments have birthed the concept of 1-bit LLMs, specifically BitNet b1.58, marking a new frontier in model efficiency and performance. But what exactly is a 1-bit LLM? The term "1-bit" is somewhat of a misnomer for BitNet b1.58: each weight actually takes about 1.58 bits, yet "1-bit LLM" and "1.58-bit LLM" are used interchangeably to describe it.
With this new approach, researchers are suggesting that instead of FP16 (half-precision floating-point numbers, 16 bits each) or FP32 (full-precision floating-point numbers, 32 bits each), you can build an equally capable model using the ternary digit set {-1, 0, 1}.
This necessitates log2(3) ≈ 1.58 bits per parameter. The BitNet b1.58 model epitomizes this concept, transitioning from the standard 16-bit floating-point weights (FP16) used in conventional LLMs to a more memory-efficient ternary format.
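To make the ternary idea concrete, here is a minimal NumPy sketch of absmean-style weight quantization in the spirit of what the BitNet b1.58 paper describes: scale the weight matrix by its mean absolute value, then round every entry to the nearest value in {-1, 0, 1}. The function name, the epsilon, and the per-matrix scaling granularity are my own illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a floating-point weight matrix to the ternary set {-1, 0, 1}.

    Scales by the mean absolute value of the matrix, then rounds and clips
    each entry to the nearest ternary value. Returns the ternary weights
    plus the scale needed to approximately reconstruct the originals.
    """
    scale = np.abs(w).mean() + eps                      # per-matrix scaling factor
    w_ternary = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_ternary, scale

# Each ternary weight can encode one of 3 states, i.e. log2(3) ≈ 1.58 bits,
# which is where the "1.58-bit" name comes from.
print(np.log2(3))                                       # ~1.585

w = np.random.randn(4, 4).astype(np.float32)            # stand-in for an FP16/FP32 weight matrix
w_q, s = absmean_ternary_quantize(w)
print(w_q)                                              # entries are only -1, 0, or 1
print(w_q * s)                                          # rough reconstruction of w
```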
Figure 1: 1-bit LLMs (e.g., BitNet b1.58) provide a Pareto solution to reduce inference cost (latency, throughput, and energy) of LLMs while maintaining model performance. The new computation paradigm of BitNet b1.58 calls for actions to design new hardware optimized for 1-bit LLMs. (Source)
Traditional LLMs employ floating-point numbers with 16-bit precision (FP16) for their weights, a necessity for capturing the nuanced variations in data. Each FP16 weight consumes 16 bits of memory, aggregating to massive total sizes for models often boasting billions of parameters. While beneficial for model accuracy, this precision demands significant computational resources and memory, influencing both the economic and environmental costs of deploying such models.
The research on BitNet b1.58 reminds me of regularization techniques from foundational machine learning, such as lasso and ridge, and of dropout in neural networks. Regularization in this context isn't just about preventing overfitting; it's also about optimizing the model's memory footprint and computational load. BitNet b1.58 introduces a form of regularization that significantly reduces the precision of the model's weights without sacrificing performance.
Figure 1 above shows a matrix operation with FP16 weights versus ternary weights. Are you wondering why ternary over binary? It's quite simple: like the binary set {0, 1}, ternary weights keep the arithmetic cheap, because multiplications reduce to additions, subtractions, or skips. Adding the -1 state costs essentially no extra computation time, yet gives one more state for representing the LLM's weights, as sketched below.
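Below is a purely illustrative toy showing the consequence: with weights restricted to {-1, 0, 1}, a matrix-vector product needs no floating-point multiplications at all, only additions, subtractions, and skips. A real kernel would of course operate on bit-packed weights and purpose-built hardware rather than NumPy boolean masks.

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a ternary weight matrix (entries in {-1, 0, 1}) by a vector.

    Each output element is just the sum of the activations whose weight is +1
    minus the sum of those whose weight is -1; zero weights are skipped.
    """
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()  # no multiplies needed
    return out

w_q = np.array([[1, 0, -1],
                [0, 1, 1]], dtype=np.int8)
x = np.array([0.5, -2.0, 3.0], dtype=np.float32)
print(ternary_matvec(w_q, x))       # same result as a normal matmul...
print(w_q.astype(np.float32) @ x)   # ...computed here with multiplications
```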
So, how is this going to change adoption?
With BitNet b1.58, high-performance AI models can be run more feasibly in resource-constrained environments, such as edge devices, enhancing AI's reach and applicability.
Faster computation also means higher throughput and better user experience. The paper also compares the memory consumption, batch size, and throughput of BitNet versus the Llama model.
Figure 2: Decoding latency (Left) and memory consumption (Right) of BitNet b1.58 varying the model size. (Source)
Table 1: Comparison of the throughput between BitNet b1.58 70B and LLaMA LLM 70B (Source)
However, the path to widespread adoption of 1.58-bit LLMs isn't without challenges. The trade-off between model complexity and the granularity of weight representation might affect tasks requiring extreme precision or nuanced understanding.
Running LLMs on edge devices
Typical LLMs like GPT-3.5, Mistral 7B, or LLaMA 13B leverage FP16, consuming 16 bits per weight. The transition to a 1.58-bit representation heralds significant reductions in memory footprint. For context, a GPT-3.5-scale model (commonly assumed to be around 175B parameters) needs approximately 350 GB of weight memory in FP16 versus around 35 GB at 1.58 bits per weight. Running such models at full precision can incur substantial costs, especially in energy and hardware requirements.
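The back-of-the-envelope arithmetic behind those numbers is straightforward: multiply the parameter count by the bits per weight. The snippet below uses the ~175B figure assumed above (GPT-3.5's true size is not public) and ignores activations, the KV cache, and any layers kept at higher precision.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

N = 175e9                        # assumed parameter count from the example above
print(weight_memory_gb(N, 16))   # ≈ 350 GB in FP16
print(weight_memory_gb(N, 1.58)) # ≈ 35 GB at 1.58 bits per weight
```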
Mixture-of-Experts (MoE) architecture
Mixture-of-Experts (MoE) architecture is designed to scale models efficiently by distributing different parts of the computation across many experts, each specializing in a specific aspect of the data. This architecture allows for greater model capacity without a proportional increase in computation for every input, as only a subset of experts is active per input (see the sketch below). MoE enhances model performance, especially in tasks requiring vast knowledge and nuanced understanding, enabling more personalized and accurate responses. While it significantly reduces the compute FLOPs, its high memory consumption and inter-chip communication overhead limit its deployment and application.
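For readers who have not seen "only a subset of experts is active per input" in code, here is a minimal, framework-free sketch of top-k gating; the sizes and names are made up for illustration, and production MoE layers (e.g., Switch Transformer or Mixtral) add load-balancing losses and batched expert dispatch on top of this idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # one FFN-like matrix per expert
router = rng.standard_normal((d_model, n_experts))                             # learned gating weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token through its top-k experts and mix their outputs."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                           # indices of the k highest-scoring experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over the chosen experts only
    # Only top_k of the n_experts weight matrices are touched for this token,
    # but all of them must stay resident in memory.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (16,) -- same output size, but only 2 of 8 experts computed
```

Note that compute per token scales with top_k, while memory scales with all n_experts weight matrices; that imbalance is exactly what cheaper ternary weights help with.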
Using 1-bit LLMs with the MoE architecture makes the combination even more efficient: the reduced memory footprint cuts the number of devices required to deploy MoE models, significantly reduces the overhead of transferring activations across the network, and could even allow the entire model to be placed on a single chip.
What does the future of LLM hardware look like?
The researchers behind the 1-bit LLM paper highlight the LPUs built by Groq (an LLM inference hardware startup) as a step in the right direction for optimized hardware, and they believe there is an opportunity for AI hardware engineers to build similar hardware optimized for running 1-bit LLMs.
Let’s see what Groq’s LPUs are:
An LPU Inference Engine, with LPU standing for Language Processing Unit™, is a new type of end-to-end processing unit system that provides the fastest inference for computationally intensive applications with a sequential component to them, such as AI language applications (LLMs).
The LPU is designed to overcome the two LLM bottlenecks: compute density and memory bandwidth. An LPU has greater compute capacity than a GPU or CPU with regard to LLMs. This reduces the amount of time per word calculated, allowing sequences of text to be generated much faster. Additionally, eliminating external memory bottlenecks enables the LPU Inference Engine to deliver orders of magnitude better performance on LLMs compared to GPUs.
Remember that LPUs are currently only available for inference, not training. Also, Groq has mentioned that they are not going to sell their hardware, but rather make it available as Infrastructure-as-a-Service (IaaS).
This also brings me to exciting rumors about Microsoft and OpenAI investing $100 billion to build their own hardware optimized to run Large Multimodal Model (LMM) workloads - Project Stargate. (This is not an official Microsoft/OpenAI announcement.)
💡 So where are we headed? My prediction is that in the next 3-5 years the focus will be on breaking NVIDIA's monopoly on GPUs and building hardware that is not just optimized but reasonably priced, and of course actually available!