Running LLaMA 4 Maverick's Full 1M-Token Context on Fireworks AI
I've been testing Meta's new LLaMA 4 Maverick model over the past few weeks, focusing specifically on its expanded 1-million-token context window. I wanted to see how this capability might change the way we work with large codebases.
Testing with the PyTorch Repository
As an experiment, I loaded the PyTorch repository's Python sources into the model. The full corpus comes to roughly 16M tokens, so the appendix script trims it to the last 1 million tokens before sending a single request. Previously, even that slice would have been impossible to analyze without chunking the codebase into much smaller segments.
The model performed well, accurately capturing the repository's structure, including file headers, import patterns, utility functions, and data flows. This comprehensive view allowed for analysis that would previously have required multiple passes with smaller-context models.
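For reference, here is roughly how that token count can be estimated before anything is sent to the API. This is a minimal sketch in the spirit of the appendix scripts; tiktoken's cl100k_base encoding is only an approximation of LLaMA's own tokenizer, so treat the figure as a rough estimate.
# Rough token estimate for a concatenated Python source tree (sketch; mirrors the
# appendix scripts, which fall back to the same tiktoken encoding).
from pathlib import Path
import tiktoken

def estimate_tokens(repo_root: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")  # approximation of LLaMA's tokenizer
    corpus = "\n".join(
        p.read_text(encoding="utf-8", errors="ignore")
        for p in Path(repo_root).rglob("*.py")
    )
    return len(enc.encode(corpus))

print(estimate_tokens("pytorch"))  # path to a local clone of the PyTorch repository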
The Practical Difference Between 128K and 1M Tokens
Working with LLaMA 3.3's 128K-token window meant implementing several workarounds when analyzing large codebases (a chunking sketch follows the list):
Splitting the corpus into multiple chunks (roughly eight 128K-token pieces to cover the same 1M-token slice of PyTorch)
Running separate analyses on each segment
Manually combining the results afterward
Accepting that some cross-module patterns might be missed
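For concreteness, the chunking step looked roughly like the sketch below. The chunk size and overlap are illustrative assumptions, not the exact values from my runs.
# Split a large corpus into ~128K-token chunks for a smaller context window (sketch).
import tiktoken

CHUNK_TOKENS = 120_000  # illustrative: leaves headroom for instructions and the completion
OVERLAP = 2_000         # illustrative: small overlap so boundary-spanning patterns aren't lost

def chunk_corpus(text: str) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # approximation of LLaMA's tokenizer
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + CHUNK_TOKENS]))
        start += CHUNK_TOKENS - OVERLAP
    return chunks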
With the expanded context window, these limitations are reduced. The model can now process the PyTorch codebase as a unified whole, which improves:
Detection of modules loaded throughout different parts of the codebase
Identification of helper functions used across various submodules
Tracing execution paths from Python APIs through to C++ and CUDA implementations
Useful Applications I've Found
During my testing, I found several practical use cases where the expanded context is particularly helpful:
1. Repository Analysis and Architecture Reviews
Mapping interconnected packages and modules becomes more straightforward when the entire codebase fits in context.
2. Refactoring Planning
The model can suggest logical divisions based on a more complete understanding of import relationships and usage patterns.
3. Debugging Complex Call Stacks
Tracing functionality like torch.Tensor.backward() from high-level Python through to low-level implementations works better with full context.
4. Documentation Analysis
Analyzing multiple research papers or extensive documentation sets together provides more cohesive summaries.
Context Windows vs. RAG: When to Use Each
Despite the impressive context size, RAG (Retrieval-Augmented Generation) still has important applications:
Long-Context Models work well when your dataset fits within the token limit
RAG Approaches remain valuable for extremely large datasets or when dealing with information that changes frequently
Rather than making RAG obsolete, the larger context window simply shifts the threshold of when retrieval becomes necessary.
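In practice, that decision can be reduced to a token-count check before choosing a strategy. A minimal sketch, assuming a count_tokens helper like the tiktoken-based code in the appendix:
# Pick a strategy based on corpus size (sketch; CONTEXT_BUDGET is an assumed input budget).
CONTEXT_BUDGET = 1_000_000  # leave headroom for instructions and the completion

def choose_strategy(corpus: str, count_tokens) -> str:
    if count_tokens(corpus) <= CONTEXT_BUDGET:
        return "long-context"  # send the whole corpus in a single request
    return "rag"               # index the corpus and retrieve relevant slices per query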
My Testing Results
I ran the same two prompts against both LLaMA 3.3 70B (128K) and LLaMA 4 Maverick (1M):
Generate a comprehensive code structure summary
Trace a complete call stack
The 128K model produced insights with noticeable gaps, particularly with modules loaded later in the codebase and cross-package dependencies. The 1M model provided more complete analysis with fewer disconnects between high-level modules and deeply nested functions.
Getting Started
If you're working with large codebases and want to experiment with expanded context windows, you can access LLaMA 4 Maverick through:
Meta AI's preview releases
Hugging Face's API
Fireworks AI's API (used in the appendix scripts; a minimal call is sketched below)
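To sanity-check access before running the full appendix scripts, a minimal call through the Fireworks Python client looks roughly like this (install fireworks-ai; the appendix scripts additionally need gitpython and tiktoken; the model ID is the one used in the appendix):
# Minimal Fireworks chat completion using the model ID from the appendix (sketch).
import os
from fireworks.client import Fireworks

client = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"])
resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama4-maverick-instruct-basic",
    messages=[{"role": "user", "content": "In one sentence, what is PyTorch?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)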
For projects that exceed even the million-token threshold or require real-time data access, combining long-context models with RAG techniques may still offer the best approach.
Appendix:
Llama 4 Maverick (1M Context Window):
"""
Showcase LLaMA 4 Maverick’s 1M-token context window
by querying the PyTorch repository in one shot
via the Fireworks AI API.
"""
import os
import sys
import logging
from pathlib import Path
from git import Repo
from fireworks.client import Fireworks
import tiktoken
# ──────────────
# Configuration
# ──────────────
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

API_KEY = os.getenv("FIREWORKS_API_KEY")
if not API_KEY:
    logging.error("Please set the FIREWORKS_API_KEY environment variable.")
    sys.exit(1)

MODEL = os.getenv("FW_MODEL", "accounts/fireworks/models/llama4-maverick-instruct-basic")
BASE_URL = os.getenv("FW_BASE_URL", "https://api.fireworks.ai/inference/v1")
REPO_URL = "https://github.com/pytorch/pytorch.git"
LOCAL_PATH = Path(os.getenv("LOCAL_PATH", "pytorch"))
MAX_TOKENS = 1_000_000  # prompt trim target for the 1M-token window
def clone_repo(repo_url: str, dest: Path) -> None:
    """Clone a repo if not already present."""
    if dest.is_dir():
        logging.info("Using existing clone at %s", dest)
    else:
        logging.info("Cloning %s → %s …", repo_url, dest)
        Repo.clone_from(repo_url, dest)


def collect_source(path: Path) -> str:
    """Read all .py files under path and concatenate them."""
    # Note: only Python sources are collected; C++/CUDA files are not included.
    logging.info("Reading .py files from %s …", path)
    texts = []
    for py_file in path.rglob("*.py"):
        try:
            content = py_file.read_text(encoding="utf-8", errors="ignore")
            header = f"# ==== File: {py_file}\n"
            texts.append(header + content + "\n")
        except Exception:
            continue
    combined = "\n".join(texts)
    logging.info("Total characters read: %d", len(combined))
    return combined
def tokenize_and_trim(text: str, model_name: str, max_ctx: int) -> str:
    """Tokenize the text for the given model and trim to the last `max_ctx` tokens."""
    logging.info("Tokenizing with tiktoken for model %s …", model_name)
    try:
        encoder = tiktoken.encoding_for_model(model_name)
    except KeyError:
        # Fireworks model IDs are unknown to tiktoken, so this fallback is the usual path;
        # cl100k_base only approximates LLaMA's tokenizer, so counts are rough.
        encoder = tiktoken.get_encoding("cl100k_base")
    tokens = encoder.encode(text)
    logging.info("Corpus is %d tokens", len(tokens))
    if len(tokens) > max_ctx:
        logging.info("Trimming to last %d tokens", max_ctx)
        tokens = tokens[-max_ctx:]
    else:
        logging.info("Corpus fits within %d-token window", max_ctx)
    return encoder.decode(tokens)
def query_model(client: Fireworks, prompt: str) -> str:
    """Send the prompt to LLaMA 4 Maverick and return the response."""
    logging.info("Sending prompt to model …")
    response = client.chat.completions.create(
        model=MODEL,
        # 1. Global Code-Structure Summary
        messages=[
            {"role": "system", "content": "You are an expert code analyst."},
            {
                "role": "user",
                "content": (
                    "I've provided you the PyTorch codebase (last 1 million tokens).\n"
                    "1. List every top-level package and submodule.\n"
                    "2. Describe the high-level data-flow: Python API → C++ core → CUDA kernels.\n"
                    "3. Highlight cross-module helper functions reused at least three times."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        # 2. End-to-End Call-Stack Trace (swap in for the messages above to run it)
        # messages=[
        #     {"role": "system", "content": "You are an expert debugger."},
        #     {
        #         "role": "user",
        #         "content": (
        #             "I've provided you the PyTorch codebase (last 1 million tokens) plus runtime logs. "
        #             "Starting from `torch.Tensor.backward()`, trace through each layer of Python, C++ and CUDA calls, "
        #             "and show me the full call sequence with file paths and line numbers."
        #         ),
        #     },
        #     {"role": "user", "content": prompt},
        # ],
        max_tokens=16_384,
        temperature=0.2,
        top_p=1.0,
        top_k=40,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        prompt_truncate_len=MAX_TOKENS,
    )
    return response.choices[0].message.content
def main():
    clone_repo(REPO_URL, LOCAL_PATH)
    corpus = collect_source(LOCAL_PATH)
    prompt = tokenize_and_trim(corpus, MODEL, MAX_TOKENS)
    client = Fireworks(api_key=API_KEY, base_url=BASE_URL)
    analysis = query_model(client, prompt)
    print("\n=== Model Response ===\n")
    print(analysis)


if __name__ == "__main__":
    main()
Llama 3.3 70B Instruct Model (128K Context Window):
"""
Showcase LLaMA 3.3’s 128K-token context window
by querying the PyTorch repository in one shot
via the Fireworks AI API.
"""
import os
import sys
import logging
from pathlib import Path
from git import Repo
from fireworks.client import Fireworks
import tiktoken
# ──────────────
# Configuration
# ──────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)

API_KEY = os.getenv("FIREWORKS_API_KEY")
if not API_KEY:
    logging.error("Please set the FIREWORKS_API_KEY environment variable.")
    sys.exit(1)

MODEL = os.getenv("FW_MODEL", "accounts/fireworks/models/llama-v3p3-70b-instruct")
BASE_URL = os.getenv("FW_BASE_URL", "https://api.fireworks.ai/inference/v1")
REPO_URL = "https://github.com/pytorch/pytorch.git"
LOCAL_PATH = Path(os.getenv("LOCAL_PATH", "pytorch"))
# Trim target: the model's window is ~131K tokens, so the prompt plus the 16K-token
# completion below may need extra headroom; lower this value if requests are rejected.
MAX_CTX = 128_000
def clone_repo(repo_url: str, dest: Path) -> None:
    """Clone the repo if not already present."""
    if dest.is_dir():
        logging.info("Using existing clone at %s", dest)
    else:
        logging.info("Cloning %s → %s …", repo_url, dest)
        Repo.clone_from(repo_url, dest)


def collect_py_files(root: Path) -> str:
    """Read all .py files under root and concatenate them."""
    # Note: only Python sources are collected; C++/CUDA files are not included.
    logging.info("Reading .py files from %s …", root)
    fragments = []
    for py_file in root.rglob("*.py"):
        try:
            content = py_file.read_text(encoding="utf-8", errors="ignore")
            header = f"# ==== File: {py_file}\n"
            fragments.append(header + content + "\n")
        except Exception:
            continue
    combined = "\n".join(fragments)
    logging.info("Total characters: %d", len(combined))
    return combined
def tokenize_and_trim(text: str, model_name: str, max_tokens: int) -> str:
    """Tokenize text for the model and trim to the last max_tokens tokens."""
    logging.info("Tokenizing for model %s …", model_name)
    try:
        encoder = tiktoken.encoding_for_model(model_name)
    except KeyError:
        # Fireworks model IDs are unknown to tiktoken, so this fallback is the usual path;
        # cl100k_base only approximates LLaMA's tokenizer, so counts are rough.
        encoder = tiktoken.get_encoding("cl100k_base")
    tokens = encoder.encode(text)
    total = len(tokens)
    logging.info("Corpus length: %d tokens", total)
    if total > max_tokens:
        logging.info("Trimming to last %d tokens", max_tokens)
        tokens = tokens[-max_tokens:]
    else:
        logging.info("Corpus fits within %d-token window", max_tokens)
    return encoder.decode(tokens)
def query_model(client: Fireworks, prompt: str) -> str:
    """Send the prompt to LLaMA 3.3 and return the response."""
    logging.info("Sending prompt to model …")
    response = client.chat.completions.create(
        model=MODEL,
        # 1. Global Code-Structure Summary
        messages=[
            {"role": "system", "content": "You are an expert code analyst."},
            {
                "role": "user",
                "content": (
                    "I've provided you the PyTorch codebase (last 128K tokens).\n"
                    "1. List every top-level package and submodule.\n"
                    "2. Describe the high-level data-flow: Python API → C++ core → CUDA kernels.\n"
                    "3. Highlight cross-module helper functions reused at least three times."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        # 2. End-to-End Call-Stack Trace (swap in for the messages above to run it)
        # messages=[
        #     {"role": "system", "content": "You are an expert debugger."},
        #     {
        #         "role": "user",
        #         "content": (
        #             "I've provided you the PyTorch codebase (last 128K tokens) plus runtime logs. "
        #             "Starting from `torch.Tensor.backward()`, trace through each layer of Python, C++ and CUDA calls, "
        #             "and show me the full call sequence with file paths and line numbers."
        #         ),
        #     },
        #     {"role": "user", "content": prompt},
        # ],
        max_tokens=16_384,
        temperature=0.2,
        top_p=1.0,
        top_k=40,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        prompt_truncate_len=MAX_CTX,
    )
    return response.choices[0].message.content
def main():
    clone_repo(REPO_URL, LOCAL_PATH)
    corpus = collect_py_files(LOCAL_PATH)
    prompt = tokenize_and_trim(corpus, MODEL, MAX_CTX)
    client = Fireworks(api_key=API_KEY, base_url=BASE_URL)
    analysis = query_model(client, prompt)
    print("\n=== Model Response ===\n")
    print(analysis)


if __name__ == "__main__":
    main()
PS: Both scripts above contain two analysis prompts; the second (the end-to-end call-stack trace) is commented out. When testing, run one prompt at a time by swapping which messages block is active.
Thank you, #Meta, for partnering with me for this. #sponsored