AI with Aish

When Fine-Tuning an Open-Source Model is Actually Worth It

Aishwarya Srinivasan
Mar 28, 2026

In the fast-moving landscape of 2026, the question for AI engineers has shifted. It’s no longer “Can an AI do this?” but rather “What is the most efficient way to deploy intelligence in production?”

We have access to the proprietary models: GPT-5.4, Gemini 3 Pro, and Claude 4.6 Opus. These models are capable of reasoning that would have seemed impossible just two years ago. However, they come with significant baggage: high latency, steep API costs, and a “black box” nature that keeps your most valuable IP in someone else’s cloud.

This is where open models are on the rise. We’re seeing a massive resurgence in fine-tuning open-source (OS) models (like Llama 4-70B or Mistral 3-Large) to perform at parity with the giants for specific, vertical tasks. But fine-tuning isn’t a silver bullet; it’s an investment.

This blog is a high-resolution map to help you decide when to stick with the “Proprietary Giants” and when to fire up Open Source models to craft your own specialized model.

PS: I recently started at Nebius to lead AI Developer Relations, and my focus is Nebius Token Factory, which is used to run and fine-tune open-source models. Part of this blog focuses on how you can fine-tune open-source models on Token Factory. My promise to you is that my opinions and recommendations will be purely organic and unbiased. My goal is to share the most relevant and accurate knowledge with you all :)

AI with Aish is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Index

  • The Bird’s-Eye View

  • Proprietary Ceiling vs. Open-Source Floor

  • The Four Reasons to Go Open-Source (Privacy, Cost, Latency, Control)

  • Deep Dive: When Fine-Tuning is the Right Call

  • The Fine-Tuning Workflow: How to Actually Execute

  • Economic Analysis: Total Cost of Fine-Tuning vs. Inference APIs

  • Resource Directory for Builders


The Bird’s-Eye View

Before we get into the technical details, here’s the high-level framework. You should only consider fine-tuning when your requirements move away from “general reasoning” and toward “specialized, repeatable tasks.”

If your product needs broad, flexible intelligence (think: brainstorming, open-ended Q&A, multi-step planning), proprietary models are hard to beat. But the moment your workload becomes narrow and high-volume (think: classifying medical records, generating structured API responses, parsing legal contracts), fine-tuning starts to make a lot more sense.


Last week I was at NVIDIA GTC as a special guest, so I got some really great, up-close seats for the keynote and sessions. One of my favorites was the open model panel with Jensen Huang, featuring leaders from Mistral, Perplexity, Cursor, Thinking Machines Lab, LangChain, and more. Jensen's key message was clear: "Proprietary versus open is not a thing. It's proprietary and open."

I'd highly recommend watching the full session. It directly ties into what this entire blog is about.


The Proprietary Ceiling vs. The Open-Source Floor

DeepSeek R1 was an inflection point. It was the moment the industry realized that the moat for closed-source models is no longer just raw performance. Over the last couple of years, the gap between open and closed models has narrowed significantly in terms of capability. But the architectural trade-offs have grown.

The Proprietary Ceiling (GPT-5.4 / Gemini 3 Pro / Claude 4.6)

These models are the “everything machines.” They’ve been trained on massive datasets and fine-tuned with enormous RLHF (Reinforcement Learning from Human Feedback) budgets.

  • The ceiling problem: You can’t improve them. You can only prompt them better. If GPT-5.4 doesn’t understand your niche proprietary assembly language, you’re stuck in few-shot prompting, burning tokens on every request just to teach the model the basics.

The Open-Source Floor (Llama 4 / Mistral 3-Large)

Open-source models give you the raw building blocks. They’re highly capable out of the box, but they lack the specific alignment your product might need.

  • The floor advantage: The baseline is already high enough that these models reason effectively. Fine-tuning lets you raise that baseline specifically in the direction your product needs.


The Four Pillars of the “Why”

If you’re wondering about the real “why” behind fine-tuning open-source models, here are the four reasons that matter most:

1. Cost at Scale

If your application generates 1 million tokens a day, paying the markup on a proprietary API is fine. If you’re scaling to 100 billion tokens a month (common in log analysis, high-frequency coding agents, or large-scale document processing), the margins on proprietary APIs will eat your business alive.
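To make the scale argument concrete, here is a back-of-the-envelope sketch. Both per-million-token prices are hypothetical placeholders I made up for illustration, not quotes from any provider, and the calculation ignores the one-time cost of the fine-tuning run itself. Plug in your own numbers.

```python
# Back-of-the-envelope monthly cost comparison.
# NOTE: both prices are hypothetical placeholders, not real provider quotes.

def monthly_cost_usd(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Cost of processing a monthly token volume at a flat per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


HYPOTHETICAL_API_PRICE = 10.00    # USD per 1M tokens (assumed)
HYPOTHETICAL_OPEN_PRICE = 0.50    # USD per 1M tokens (assumed)

low_volume = 1_000_000 * 30       # 1M tokens/day over a 30-day month
high_volume = 100_000_000_000     # 100B tokens/month

for label, volume in [("1M tokens/day", low_volume), ("100B tokens/month", high_volume)]:
    api = monthly_cost_usd(volume, HYPOTHETICAL_API_PRICE)
    open_model = monthly_cost_usd(volume, HYPOTHETICAL_OPEN_PRICE)
    print(f"{label}: API ${api:,.0f} vs. fine-tuned open model ${open_model:,.0f}")
```

With these assumed prices, the low-volume case differs by a few hundred dollars a month, which is rarely worth an engineering project. The high-volume case differs by roughly $950,000 a month, which is where the investment in fine-tuning starts to pay for itself.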

2. Latency

Claude 4.6 Opus is brilliant, but it’s a heavy model. The Time to First Token (TTFT) can be 500ms or more. For a real-time UI component or an autonomous agent, that’s too slow. A fine-tuned Llama 4-8B running on optimized infrastructure like Nebius Token Factory can hit 300+ tokens/sec, giving you near-instant responses.
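TTFT and tokens/sec measure different things, so it helps to combine them into one end-to-end estimate. The sketch below is a rough model, not a benchmark: the 500ms TTFT and 300 tokens/sec come from the paragraph above, while the 50 tokens/sec decode speed for the large model and the 100ms TTFT for the small model are assumptions for illustration.

```python
def response_time_s(ttft_ms: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate end-to-end time: time to first token plus streaming time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


# Assumed: 50 tokens/sec for the large model, 100 ms TTFT for the small one.
large = response_time_s(ttft_ms=500, output_tokens=150, tokens_per_sec=50)
small = response_time_s(ttft_ms=100, output_tokens=150, tokens_per_sec=300)

print(f"large model: {large:.2f}s, small fine-tuned model: {small:.2f}s")
```

Under these assumptions, a 150-token response takes about 3.5 seconds on the large model and about 0.6 seconds on the small one. For an agent that chains ten such calls, that gap compounds quickly.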

3. Data Privacy and Compliance

If you’re in healthcare, defense, or fintech, sending your sensitive data to a third-party API is a serious compliance risk, even with enterprise agreements. Fine-tuning an open-source model on your own infrastructure (or a private cloud like Nebius) keeps the data under your control.

4. Behavior Control

Proprietary models are often over-aligned. They can be overly cautious, refuse certain tasks, or default to a generic “AI voice.” Fine-tuning lets you bake your product’s persona, tone, and specific logic directly into the model weights, instead of relying on a 2,000-word system prompt that the model might drift away from over time (this is called “instruction drift,” and it’s a real problem in production).



When Fine-Tuning is the Right Call

Let’s get specific. When do you stop prompt engineering and start training?

A. Domain-Specific Language

If you’re working in a niche like legal biotech or semiconductor physics, general models are often confidently wrong. They use everyday definitions for terms that have precise, different meanings in your field.

The fix: Fine-tune on a corpus of your internal documents (even 10,000 to 100,000 examples can make a massive difference). The model learns the statistical patterns of your specific vocabulary. It stops treating “cell” as a generic biology term and starts recognizing it as a “primary electrolytic unit” in your battery research context.

B. Structured Output (The “JSON Problem”)

Even GPT-5.4 struggles with complex, deeply nested JSON schemas when the context window is crowded with other instructions.

The fix: Fine-tune a model specifically on your schema. This is sometimes called “format alignment.” You train the model so that it consistently outputs your specific structure. This greatly reduces your reliance on expensive output parsers, validation layers, and retry loops.
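Here is a minimal sketch of what preparing format-alignment training data could look like. The invoice schema, the `REQUIRED_KEYS` set, and the `instruction`/`response` field names are all hypothetical; your platform may expect a different record layout.

```python
import json

# Hypothetical target schema: every response must be a JSON object with these keys.
REQUIRED_KEYS = {"invoice_id", "total", "line_items"}


def matches_schema(text: str) -> bool:
    """Return True if text parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj.keys())


def make_training_record(document: str, target: dict) -> str:
    """Serialize one instruction/response pair as a single JSONL line."""
    response = json.dumps(target, sort_keys=True)
    if not matches_schema(response):
        raise ValueError("target does not follow the schema")
    return json.dumps({"instruction": document, "response": response})


record = make_training_record(
    "Invoice INV-7 for 2 widgets at $5 each.",
    {"invoice_id": "INV-7", "total": 10.0, "line_items": [{"name": "widget", "qty": 2}]},
)
print(record)
```

The same `matches_schema` check can be reused after training to measure how often the fine-tuned model's outputs are valid, which gives you a concrete number to compare against the prompted baseline.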

C. Knowledge Distillation (The “Small Model” Strategy)

This is the most powerful use case. You use a large model like Claude 4.6 Opus to generate high-quality “gold standard” answers for your tasks. Then you take those answers and fine-tune a much smaller, faster model (like Llama 4-8B) to replicate the reasoning of the large model.

The result: You get roughly 90-95% of the performance of the large model at a fraction of the cost and significantly faster inference speeds. This is the backbone of how many production AI systems work today.

[Image: LLM model pruning and knowledge distillation with the NVIDIA NeMo Framework. Image source: NVIDIA Technical Blog]
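The data-collection half of distillation can be sketched in a few lines. This is an illustration under assumptions: `fake_teacher` is a stand-in for a real call to a large model, and `build_distillation_set` is a hypothetical helper name, not part of any library.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> list[dict]:
    """Ask the teacher for an answer to each prompt and keep the non-empty pairs."""
    records = []
    for prompt in prompts:
        answer = teacher(prompt).strip()
        if answer:  # drop prompts where the teacher returned nothing
            records.append({"instruction": prompt, "response": answer})
    return records


def fake_teacher(prompt: str) -> str:
    """Stand-in for a large-model API call; replace with a real client."""
    return "" if "skip" in prompt else f"Gold answer for: {prompt}"


dataset = build_distillation_set(
    ["Classify ticket A", "skip me", "Classify ticket B"], fake_teacher
)
jsonl = "\n".join(json.dumps(r) for r in dataset)
print(jsonl)
```

The resulting JSONL becomes the training set for the smaller student model. In practice you would add a quality filter or expert review between the teacher call and the final dataset.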

The Fine-Tuning Workflow: How to Actually Execute

Here’s the practical, step-by-step workflow for going from idea to deployed fine-tuned model:

Step 1: Dataset Preparation

You need high-quality pairs of instruction and response. The most effective approach is to use synthetic data generation. You can use a strong model like Gemini 3 Pro or Claude 4.6 to generate your initial training set, then have domain experts review and correct the outputs.

Here is a great resource by HuggingFace that you can check out: https://huggingface.co/spaces/HuggingFaceFW/finephrase

The quality of your dataset is the single biggest factor in how well your fine-tuned model performs. Garbage in, garbage out still applies.
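As a small illustration of basic dataset hygiene, the sketch below drops exact duplicates and near-empty responses, then holds out a validation slice. The 20-character threshold, the 10% split, and the helper names are arbitrary choices for the example, not recommendations from any platform.

```python
import random


def clean_pairs(pairs: list[dict], min_response_chars: int = 20) -> list[dict]:
    """Drop exact duplicates and pairs with an empty instruction or very short response."""
    seen = set()
    cleaned = []
    for pair in pairs:
        instruction = pair.get("instruction", "").strip()
        response = pair.get("response", "").strip()
        key = (instruction, response)
        if not instruction or len(response) < min_response_chars or key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned


def train_val_split(pairs: list[dict], val_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and hold out a validation slice."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


raw = [
    {"instruction": "Summarize clause 4.", "response": "Clause 4 limits liability to direct damages."},
    {"instruction": "Summarize clause 4.", "response": "Clause 4 limits liability to direct damages."},
    {"instruction": "Summarize clause 9.", "response": "ok"},
    {"instruction": "Summarize clause 12.", "response": "Clause 12 sets a 30-day termination notice."},
]
train, val = train_val_split(clean_pairs(raw))
print(len(train), len(val))
```

Automated filters like these only catch the obvious problems. They complement the expert review step rather than replace it.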

Step 2: Parameter-Efficient Fine-Tuning (PEFT)

Almost nobody updates all 70 billion parameters of a model during fine-tuning anymore. It’s too expensive and too slow. The standard approach is LoRA (Low-Rank Adaptation).

How LoRA works: You freeze the original model weights and only train a small set of “adapter” weights (often just megabytes in size, versus many gigabytes for the base model). This adapter modifies the model’s behavior without changing the base weights. The result is that you can fine-tune a massive model with a fraction of the compute, and swap adapters in and out depending on the task.
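To show the idea in miniature, here is a toy sketch in plain Python. It follows the standard LoRA formulation, where the frozen weight W is augmented by a scaled low-rank product B·A. The matrix values and the 4096 x 4096 layer size are made up for illustration, and a real run would use a training library rather than hand-written matrix code.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """Compute W x + (alpha / rank) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: a frozen 2x2 weight with a rank-1 adapter (values are arbitrary).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # rank x d_in
B = [[0.5], [0.0]]  # d_out x rank
output = lora_forward(W, A, B, x=[2.0, 3.0], alpha=1.0, rank=1)
print(output)

# One assumed 4096 x 4096 projection: full weight vs. rank-8 adapter.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, rank=8)
print(full, adapter, f"{adapter / full:.2%}")
```

For this single assumed layer, the rank-8 adapter has 65,536 trainable parameters against roughly 16.8 million in the full weight, under half a percent. That ratio is why adapters are cheap to train, store, and swap.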

Nebius Token Factory supports this workflow end to end. You can upload your dataset, configure a LoRA fine-tuning run, and deploy the resulting adapter on top of a base model, all within the same platform.

PS: Most platforms, including Token Factory, run LoRA by default and offer further customization options.
