Think learn

Where Does GPU Memory Actually Go During Inference?

Sat, 28 Mar 2026 00:00:00 +0000

You load a 4B parameter diffusion model. The card says “~13GB VRAM.” You have a 24GB GPU. That leaves 11GB free. Plenty, right?

Then you try generating a 2048×2048 image and get an OOM error. Where did all the memory go?

What Is “GPU Memory”?

A GPU isn’t one pool of memory. It’s a hierarchy with very different sizes and speeds.

Memory	Size (H100)	Bandwidth	Role
HBM	80 GB	3.35 TB/s	Weights, activations, KV cache, everything
L2 Cache	50 MB	~5.5 TB/s	Recently accessed data (hardware-managed)
SMEM (per SM)	256 KB	~30 TB/s	Data actively being computed on
Registers (per SM)	256 KB	Instant	Values mid-computation

“GPU memory” and “VRAM” mean HBM. That’s what OOMs are about.

Architecture details from How to Think About GPUs by the Google DeepMind scaling team.

The Five Components: FLUX.2 klein 4B

The running example is FLUX.2 klein 4B: a 4B parameter rectified flow transformer that runs 4 inference steps and lists ~13GB VRAM on its model card.

1. Model Weights (~8 GB), Fixed

4B params × 2 bytes (BF16) = 8 GB. Always in memory regardless of resolution or batch size. Includes the DiT transformer, text encoder, and VAE decoder.

With FP8: drops to ~4 GB. This is the single biggest win for memory-constrained deployment. Always validate output quality against a golden set after quantizing. Some models tolerate FP8 well, others show visible degradation on faces or fine text.

2. Activations (~2-8 GB), Scales with Resolution²

Intermediate tensors from every transformer layer. At 1024×1024 (sequence length ~4096): ~2 GB. At 2048×2048 (sequence length ~16384): ~8 GB.

Why quadratic: double the resolution → 4× the tokens → 4× the activations.
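
The sequence lengths quoted above are consistent with one token per 16×16 pixel block. That factor is an assumption inferred from the numbers in this post, not a value taken from the model card. A minimal sketch of the scaling:

```python
def image_tokens(width, height, pixels_per_token_side=16):
    """Token count, assuming one token per 16x16 pixel block (assumed factor)."""
    return (width // pixels_per_token_side) * (height // pixels_per_token_side)

for side in (1024, 2048):
    print(side, image_tokens(side, side))
# 1024 -> 4096 tokens, 2048 -> 16384 tokens: doubling the side gives 4x the tokens
```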

3. Attention and KV Cache (~1-6 GB), The Hidden Hog

Every attention layer computes Q, K, V matrices and attention scores over the full sequence. Within a single forward pass, these intermediate tensors must all live in memory simultaneously. Some newer architectures (like FLUX.2 klein 9B KV) also cache K/V across image editing references, adding further memory pressure.

At 2048×2048 (sequence length ~16K), attention memory can hit 6+ GB. This is why high-resolution generation runs out of memory: weights don’t change with resolution, but attention memory grows with the square of the image side.

4. VAE Decode (~1-4 GB), The Final Spike

Converting latent space to pixels creates large temporary tensors. Often the thing that pushes you over the edge at high resolution.

5. Framework Overhead (~0.5 GB), The Tax

CUDA context, PyTorch allocator, memory fragmentation. The “where did my last gigabyte go?” tax.

The Numbers

Component	1024×1024	2048×2048	What scales it
Weights	~8 GB	~8 GB	Parameter count (fixed)
Activations	~2 GB	~8 GB	Resolution² × batch
Attention / KV	~1.5 GB	~6 GB	Resolution² × layers
VAE Decode	~1 GB	~4 GB	Output resolution
Overhead	~0.5 GB	~0.5 GB	Fixed
Total	~13 GB ✅	~26.5 GB ❌
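
The table can be turned into a rough estimator for other resolutions. This sketch takes the 1024×1024 column as a baseline, keeps weights and overhead fixed, and scales the rest with pixel count. Like the table, it simply adds the components together, so treat the result as a planning number rather than a measurement. The function name is made up for illustration.

```python
# Per-component estimates at 1024x1024, in GB, copied from the table.
BASE_GB = {"weights": 8.0, "activations": 2.0, "attention": 1.5,
           "vae_decode": 1.0, "overhead": 0.5}
# Components treated as fixed; everything else scales with pixel count.
FIXED = {"weights", "overhead"}

def estimate_peak_gb(side, base_side=1024):
    """Rough peak memory for a square image with the given side length."""
    scale = (side / base_side) ** 2
    return sum(gb if name in FIXED else gb * scale
               for name, gb in BASE_GB.items())

for side in (1024, 1536, 2048):
    print(f"{side}x{side}: ~{estimate_peak_gb(side):.1f} GB")
# 1024x1024: ~13.0 GB
# 1536x1536: ~18.6 GB
# 2048x2048: ~26.5 GB
```

On a 24GB card this puts the practical ceiling somewhere between 1536 and 2048 pixels per side before any of the mitigations below.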

Bandwidth: Why FP8 Helps Speed, Not Just Size

Even when the model fits, bandwidth determines speed. Reading 8 GB of weights on an H100 (3.35 TB/s) takes 2.4ms. With 4 steps, that’s 9.6ms just moving weights.

FP8 halves the bytes → halves the transfer time → faster inference. Caching skips the recomputation, reusing a previously stored result instead.

The ratio of compute performed to bytes moved is a workload’s arithmetic intensity. On an H100, a kernel needs ~295 FLOPs per byte to keep the compute units busy. Below that, the GPU waits for data.
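
The arithmetic behind these numbers, as a small sketch. The bandwidth comes from the table at the top of the post. The peak compute figure (989 TFLOP/s dense BF16 for an H100 SXM) is an assumption added here, since the post does not state which figure it used.

```python
HBM_BANDWIDTH = 3.35e12   # bytes/s, from the memory table
PEAK_BF16_FLOPS = 989e12  # FLOP/s, assumed dense BF16 peak for H100 SXM

def weight_read_ms(params, bytes_per_param):
    """Time to stream all weights from HBM once, in milliseconds."""
    return params * bytes_per_param / HBM_BANDWIDTH * 1e3

bf16_ms = weight_read_ms(4e9, 2)
fp8_ms = weight_read_ms(4e9, 1)
break_even = PEAK_BF16_FLOPS / HBM_BANDWIDTH

print(f"BF16: {bf16_ms:.1f} ms per step, FP8: {fp8_ms:.1f} ms per step")
print(f"Break-even intensity: {break_even:.0f} FLOPs/byte")
# BF16: 2.4 ms per step, FP8: 1.2 ms per step
# Break-even intensity: 295 FLOPs/byte
```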

What To Do About It

Reduce weights: FP8 quantization (4B params × 1 byte = 4 GB, down from 8 GB). Always do this first.

Reduce dynamic memory: CPU offloading moves the text encoder to RAM when unused. Reducing batch size to 1 is the simplest way to cut activation memory.

Trade memory for speed: Caching (TeaCache, DBCache) stores previous transformer outputs to skip redundant passes. This uses extra memory for the cache but cuts latency in half. On a memory-constrained GPU, disabling caching frees memory for higher resolutions at the cost of slower generation.

Distribute across GPUs: Weight sharding (HSDP), sequence splitting (Ulysses-SP, Ring-Attention), spatial VAE decode (VAE Patch Parallel).

Reduce generation time: Step distillation and guidance distillation cut the number of transformer passes from 50-100 down to 4-8. This does not reduce peak memory (each step uses the same amount), but it directly reduces wall-clock time and GPU cost per image.

Key Takeaways

Weights and framework overhead are fixed. Everything else scales with resolution² × batch size.
The model card number is for minimum resolution. Real usage can 2-3× it.
Bandwidth matters as much as capacity. Fitting in memory ≠ fast inference.
No single technique solves everything. Quantize weights to save memory, parallelize to scale, cache to trade memory for speed.

References

Black Forest Labs. “FLUX.2 klein 4B”, 2025.
Austin et al. “How to Think About GPUs”, Google DeepMind, 2025.
NVIDIA. “H100 Tensor Core GPU”, 2023.
Black Forest Labs. “FLUX.2 klein KV Cache”, 2025.
Black Forest Labs. “FLUX.2: Frontier Visual Intelligence”, 2025.

Step Distillation vs Guidance Distillation: Making Diffusion Inference Up to 24× Faster

Sat, 28 Mar 2026 00:00:00 +0000

Diffusion models run the transformer 50-100 times per image. Two distillation techniques cut this down, but they work differently, cost differently to train, and break differently. Here’s how to choose.

Neither technique requires a special dataset. The teacher model generates the training signal. You can use the original training data, or even random noise as input.

A good example is FLUX.2 klein 4B from Black Forest Labs. It is a 4B parameter rectified flow transformer that already ships as a distilled model: 4 inference steps, sub-second generation, ~13GB VRAM. According to the official model table, it uses both step distillation (reduced to 4 inference steps) and guidance distillation (CFG baked into a single pass). At inference time you set guidance_scale=1.0 because the guidance effect is already distilled into the model. This is exactly what combining both techniques looks like in practice.

For background on where GPU memory goes during inference, see Where Does GPU Memory Actually Go?

Step Distillation: Fewer Steps

What: Train a student so that 1 student step matches 2 teacher steps. Then repeat with the student as the new teacher, roughly halving each round: 50 → 25 → 12 → 8 → 4.

Why it works: Each round only matches 2 consecutive steps, not the full 50-step trajectory. The student is initialized from the teacher’s weights, so it converges fast.

The sweet spot: 4-8 steps. Below 4, faces get soft and textures lose crispness. At 1 step, you need adversarial losses (like SDXL Turbo) to recover sharpness. This reduces output diversity.

Training cost: Multiple rounds, each requiring a dataset + GPU hours. But you pay once; the distilled model serves forever.

Pitfalls:

Each halving round accumulates error, so quality degrades progressively
v-prediction parameterization is needed (standard ε-prediction breaks at high noise levels with few steps)
Reducing steps weakens classifier-free guidance because the effect relies on repeated small adjustments

Guidance Distillation: Cheaper Steps

What: Train a student to produce the guided output in 1 pass instead of 2.

Why 2 passes exist: CFG subtracts the unconditional output from the conditional output to isolate the prompt’s contribution, then amplifies it by the guidance scale (7× here). Both passes take the current noisy image as input, so neither can be precomputed.

Why it works: The student learns to directly predict the amplified result, conditioned on the guidance scale. One forward pass replaces two.
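
The two-pass combination that the student learns to reproduce can be sketched in a few lines. Plain lists stand in for the model outputs here, and the values are made up for illustration.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Classifier-free guidance: start from the unconditional output and
    move along the (conditional - unconditional) direction."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

cond = [1.0, 2.0]    # teacher pass with the prompt
uncond = [0.5, 1.0]  # teacher pass without the prompt
guided = cfg_combine(cond, uncond, guidance_scale=7.0)
print(guided)  # [4.0, 8.0]
```

The teacher needs both `cond` and `uncond` to produce `guided`. A guidance-distilled student is trained to output `guided` directly from one pass.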

Training cost: Single round, simpler task than step distillation. The student can still vary guidance strength at inference time.

Pitfalls:

Only helps if you use CFG (most text-to-image models do)
Must be applied before step distillation. Otherwise, the guidance effect is already weakened by fewer steps
May not perfectly capture guidance at all noise levels, causing subtle prompt adherence issues

Decision Guide

Scenario	Recommendation	Speedup
Ship fast, minimal training budget	Guidance distillation	2×
Maximum throughput on limited hardware	Step distillation to 4-8 steps	6-12×
Real-time generation (< 100ms)	Both (guidance first, then steps) + FP8	~24×
Can’t retrain at all	Caching (TeaCache, DBCache)	~2×

The Third Option: No Training Required

If retraining isn’t an option, caching skips transformer passes at runtime by detecting when consecutive steps produce nearly identical outputs. Zero training cost, ~2× speedup.

Caching is orthogonal to distillation. You can stack all three.

Always validate with a golden set

Distillation and quantization both trade quality for speed. The tradeoff is different for every model, every dataset, and every use case. You must measure it before deploying.

Build a validation set of 200-500 prompts that represent your production traffic. Include edge cases: long prompts, rare styles, text rendering, faces, fine details. Generate images with both the teacher (full model, 50 steps, CFG) and the student (distilled/quantized). Compare using:

FID (Fréchet Inception Distance): measures distribution-level quality. Lower is better. Expect 1-5 point increase after distillation.
CLIP score: measures prompt-image alignment. If this drops significantly, your guidance distillation may be too aggressive.
Human evaluation: no metric replaces looking at the outputs. Have 2-3 people rate 100 pairs blind. Focus on faces, text, and fine textures.
LLM-as-a-judge: use a vision-language model (like GPT-4o or Gemini) to score image quality, prompt adherence, and artifact detection at scale. This automates what human eval does manually and scales to your full validation set. Note: this is a costly option. Each image evaluation requires a VLM API call, so running it on 500 images can cost $5-15 per evaluation round. Use it for final validation, not iterative tuning.

Run this validation after every change: after quantization, after each distillation round, after combining techniques. The quality loss compounds. A 2% drop from FP8 plus a 5% drop from step distillation plus a 3% drop from guidance distillation can add up to a noticeable degradation that no single step revealed.
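
A quick sketch of how the drops stack. It assumes each percentage is a relative drop on a single quality score and that the stages are independent, which real pipelines will only approximate.

```python
def retained_quality(drops):
    """Fraction of quality left after applying each relative drop in turn."""
    retained = 1.0
    for drop in drops:
        retained *= 1.0 - drop
    return retained

stages = {"FP8": 0.02, "step distillation": 0.05, "guidance distillation": 0.03}
total_loss = 1.0 - retained_quality(stages.values())
print(f"Combined drop: {total_loss:.1%}")  # Combined drop: 9.7%
```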

References

Salimans, Ho. “Progressive Distillation for Fast Sampling of Diffusion Models”, ICLR 2022.
Meng et al. “On Distillation of Guided Diffusion Models”, CVPR 2023.
Ho, Salimans. “Classifier-Free Diffusion Guidance”, NeurIPS 2021.
Dieleman. “The paradox of diffusion distillation”, 2024.
Sauer et al. “Adversarial Diffusion Distillation”, 2023.
Black Forest Labs. “FLUX.2: Frontier Visual Intelligence”, 2025.

Diffusion Acceleration, Interactive Visual Explainers

Sat, 28 Mar 2026 00:00:00 +0000

How expensive is generating one image?

A 20B parameter diffusion model (like Qwen-Image) generating a single 1024×1024 image with 50 steps and classifier-free guidance requires roughly 17,200 TFLOPs of compute.

For comparison, rendering one frame of Call of Duty at 4K with ray tracing takes about 2.8 TFLOPs. That means generating a single image costs as much compute as rendering ~6,100 frames of Call of Duty, or about 100 seconds of gameplay at 60fps.

This is why diffusion acceleration matters. The techniques below can cut that cost by up to 31×.

I built 12 interactive animated explainers covering all the diffusion acceleration techniques from vLLM-Omni. Each one walks through the “why” before the “how”, with play/pause/replay controls. Here is a summary of what each technique does and how much compute it saves.

Caching: skip what barely changed

At each denoising step, the transformer predicts a direction to remove noise. In late steps, this direction barely changes from one step to the next. Caching detects this and reuses the previous output instead of recomputing it. This skips roughly half the transformer passes, saving ~8,600 TFLOPs per image (~50 seconds of Call of Duty).

There are several ways to decide when to cache. TeaCache measures the L1 distance between consecutive outputs and compares it to a threshold. If the difference is small enough, it reuses the cached result (1.5-2× speedup). DBCache takes a cheaper approach: run only the first 2 transformer layers, measure the residual difference, and decide whether to cache the remaining 22 layers (1.85× speedup). TaylorSeer goes further by predicting the next output from the trend using Taylor expansion, at zero GPU cost. SCM avoids runtime decisions entirely with a pre-defined schedule of which steps to compute and which to cache.
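
The threshold idea can be illustrated with a toy wrapper. This is a simplified sketch, not the TeaCache implementation: it compares each new input against the last one that was actually computed and uses that change as a stand-in for how much the output would move. The class name and the toy model are made up for illustration.

```python
def rel_l1(a, b):
    """Relative L1 distance between two equal-length vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den

class ThresholdCache:
    """Reuse the previous output when the input has barely changed."""

    def __init__(self, model, threshold):
        self.model = model
        self.threshold = threshold
        self.last_input = None
        self.last_output = None
        self.calls = 0  # real forward passes
        self.hits = 0   # passes skipped by reusing the cache

    def __call__(self, x):
        if self.last_input is not None and rel_l1(x, self.last_input) < self.threshold:
            self.hits += 1
            return self.last_output
        self.calls += 1
        self.last_input = list(x)
        self.last_output = self.model(x)
        return self.last_output

cached = ThresholdCache(model=lambda x: [2 * v for v in x], threshold=0.1)
for latent in ([1.0, 1.0], [1.01, 1.0], [2.0, 2.0]):
    cached(latent)
print(cached.calls, cached.hits)  # 2 1
```

Raising the threshold skips more passes and saves more compute, at the cost of reusing outputs that are further from what the model would have produced.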

After caching: ~8,600 TFLOPs per image (~3,000 CoD frames, ~50s of gameplay).

Parallelism: split work across GPUs

When a single GPU isn’t fast enough, you split the work. The most impactful technique for diffusion models is CFG-Parallel: since classifier-free guidance requires two independent transformer passes per step (one with the prompt, one without), you can run them on separate GPUs. This halves the wall-clock time per step.

For high-resolution images, the sequence length grows quadratically, making attention the bottleneck. Ulysses-SP splits the sequence across GPUs and uses all-to-all communication for attention heads (4 GPUs: 2.84× speedup). Ring-Attention takes a different tradeoff: it circulates K/V blocks in a ring, using less memory but achieving slightly lower speedup (4 GPUs: 1.94×).

When the model itself doesn’t fit on one GPU, HSDP shards the weights across GPUs and gathers them on-demand during the forward pass. It doesn’t speed up inference, but it makes large models possible on smaller hardware. Finally, VAE Patch Parallel splits the memory-hungry VAE decode step spatially across GPUs for a 4× faster decode.

Quantization: shrink the numbers

Every weight in a 20B model takes 2 bytes in BF16. That is 40GB just for weights. FP8 quantization rounds each weight to 1 byte, cutting memory in half. But the real win is speed: fewer bytes to transfer through the memory bus means faster matrix multiplications. FP8 gives roughly a 1.28× speedup, saving ~3,800 TFLOPs (~22s of CoD).

Why does rounding work? Most weight values cluster near zero. Eight bits captures the important range. The rare outliers get rounded with minimal accuracy loss. Some layers (especially attention) are more sensitive, so per-layer control lets you quantize what is safe and keep sensitive layers in BF16.

Applied alone, FP8 brings the cost to ~13,400 TFLOPs per image. Combined with other techniques, the savings compound.

Combining everything

These techniques stack. Applied together:

Optimization	TFLOPs/image	CoD frames	Gameplay at 60fps
Baseline (20B, 50 steps, CFG)	17,200	~6,100	~102s
+ FP8 quantization (1.28×)	13,400	~4,800	~80s
+ Caching, skip 50% passes (2×)	6,700	~2,400	~40s
+ Step distillation, 8 steps (6×)	1,100	~400	~6.5s
+ Guidance distillation (2×)	550	~200	~3.3s

From 102 seconds of Call of Duty per image down to 3.3 seconds. A 31× reduction.
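
The chain in the table can be reproduced by dividing the baseline by each factor in turn. The factors and the 2.8 TFLOPs-per-frame conversion are the ones quoted in this post. The results differ slightly from the table because the table rounds after every row.

```python
BASELINE_TFLOPS = 17_200
TFLOPS_PER_FRAME = 2.8
FPS = 60

stages = [
    ("FP8 quantization", 1.28),
    ("Caching", 2.0),
    ("Step distillation", 6.0),
    ("Guidance distillation", 2.0),
]

cost = BASELINE_TFLOPS
total_speedup = 1.0
for name, factor in stages:
    cost /= factor
    total_speedup *= factor
    seconds = cost / TFLOPS_PER_FRAME / FPS
    print(f"+ {name}: {cost:,.0f} TFLOPs, ~{seconds:.1f}s of gameplay")

print(f"Total: {total_speedup:.1f}x")  # Total: 30.7x
```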

Important: these optimizations compound quality loss too. Always validate against a golden set of 200-500 representative prompts after each optimization. Measure FID, CLIP score, and do blind human evaluation on faces, text, and fine details. A small drop at each stage can add up to a noticeable degradation.

The big picture: where GPU memory goes

The animation below shows how GPU memory is split between model weights, activations, KV cache, and VAE decode, and how each technique targets a different piece. For a deeper dive, see Where Does GPU Memory Actually Go During Inference?

▶ Launch all 12 Interactive Explainers

Also available: P90 Latency · Step Distillation · Guidance Distillation · GPU Parallelism

References

Black Forest Labs. “FLUX.2: Frontier Visual Intelligence”, 2025.
Black Forest Labs. “FLUX.2 klein 4B”, 2025.
Austin et al. “How to Think About GPUs”, Google DeepMind, 2025.
Dieleman. “The paradox of diffusion distillation”, 2024.
vLLM-Omni. “Diffusion Acceleration”, 2025.

Python Metaphors and Their Explanations

Thu, 10 May 2018 00:00:00 +0000

Python has four core metaphors that separate beginner code from expert code. They aren’t fancy tricks. Each one solves a specific, recurring problem. Understanding what problem they solve matters more than memorizing syntax.

Based on James Powell’s excellent talk at PyData 2017.

1. Decorators: Wrapping Behavior Around Functions

The problem: You want to add timing, logging, or authentication to 20 functions. Copy-pasting the same before/after code into each one is fragile and ugly.

The metaphor: A decorator wraps a function with before-and-after behavior, without modifying the function itself.

Without a decorator:

def add(a, b):
    return a + b

# Every time you call it, you manually time it
import time
start = time.time()
result = add(2, 3)
elapsed = time.time() - start
print(f"add took {elapsed:.4f}s")

With a decorator:

import time

def timer(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        elapsed = time.time() - start
        print(f"{func.__name__} took {elapsed:.4f}s")
        return result
    return wrapper

@timer
def add(a, b):
    return a + b

@timer
def multiply(a, b):
    return a * b

add(2, 3)       # prints: add took 0.0000s
multiply(4, 5)  # prints: multiply took 0.0000s

The @timer line is just syntactic sugar for add = timer(add). It takes the function, wraps it, and replaces it. One line instead of copy-pasting timing code everywhere.
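
One detail worth knowing once you start writing your own decorators: the wrapper replaces the original function, so its `__name__` and docstring are lost unless you copy them over. `functools.wraps` from the standard library does that. The sketch below uses a made-up `shout` decorator and also shows the manual form of the `@` syntax.

```python
import functools

def shout(func):
    @functools.wraps(func)  # copy __name__ and __doc__ onto the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return f"hello {name}"

def greet_plain(name):
    return f"hello {name}"

greet_manual = shout(greet_plain)  # what the @shout line does behind the scenes

print(greet("ada"), greet_manual("ada"))  # HELLO ADA HELLO ADA
print(greet.__name__)                      # greet
```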

When to use: Logging, timing, authentication, caching, retry logic. Anything that wraps behavior around a function without changing what the function does.

2. Generators: Lazy Computation, One Piece at a Time

The problem: You need to process a million items, but loading them all into memory at once will crash your program.

The metaphor: A generator computes one value at a time and pauses between values. It yields control back to the caller after each result.

Without a generator:

def get_squares(n):
    result = []
    for i in range(n):
        result.append(i * i)
    return result

# This creates a list of 10 million items in memory
squares = get_squares(10_000_000)
print(squares[0])  # You only needed the first one

With a generator:

def get_squares(n):
    for i in range(n):
        yield i * i

# Nothing is computed yet
squares = get_squares(10_000_000)

# Only computes the first value
print(next(squares))  # 0
print(next(squares))  # 1

# Or stop as soon as a value exceeds 10
for sq in get_squares(10_000_000):
    if sq > 10:
        break
    print(sq)  # 0, 1, 4, 9

The yield keyword pauses the function and returns a value. Next time you ask for a value, it resumes from where it paused. Memory usage stays constant regardless of how many items exist.
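
If you want a fixed number of values rather than a stopping condition, `itertools.islice` takes the first N items and leaves the rest uncomputed. A minimal sketch:

```python
from itertools import islice

def get_squares(n):
    for i in range(n):
        yield i * i

# Only five values are ever computed, even though n is 10 million
first_five = list(islice(get_squares(10_000_000), 5))
print(first_five)  # [0, 1, 4, 9, 16]
```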

When to use: Reading large files line by line, streaming data, infinite sequences, any computation where you don’t need all results at once.

3. Context Managers: Pairing Setup with Teardown

The problem: You open a file, do some work, and forget to close it. Or an exception happens before the close call, and the file handle leaks.

The metaphor: A context manager pairs a setup action with a teardown action and guarantees the teardown always runs, even if an error occurs.

Without a context manager:

f = open("data.txt", "w")
f.write("hello")
# If an exception happens here, the file never gets closed
f.close()

With a context manager:

with open("data.txt", "w") as f:
    f.write("hello")
# File is automatically closed, even if write() raises an exception

Building your own (using a generator + decorator, showing how the metaphors combine):

from contextlib import contextmanager
import time

@contextmanager
def timer(label):
    start = time.time()
    try:
        yield  # This is where the "with" block runs
    finally:
        # Runs even if the "with" block raises
        elapsed = time.time() - start
        print(f"{label}: {elapsed:.4f}s")

with timer("data processing"):
    total = sum(range(1_000_000))
# prints something like: data processing: 0.0312s

The code before yield is the setup. The code in the finally block is the teardown. The with block runs in between. Because the yield sits inside try/finally, the teardown still runs if the block raises an exception. Without the try/finally, an exception would skip it.

When to use: File handles, database connections, locks, temporary directories, GPU memory allocation. Anything that needs cleanup.
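
Underneath, `with` calls `__enter__` on the way in and `__exit__` on the way out, including when the block raises. A minimal class-based sketch makes the order visible (the `Tracker` class is made up for illustration):

```python
class Tracker:
    """Records setup and teardown so the order is visible."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("setup")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("teardown")
        return False  # do not swallow the exception

tracker = Tracker()
try:
    with tracker:
        raise ValueError("boom")
except ValueError:
    pass

print(tracker.events)  # ['setup', 'teardown']
```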

4. Metaclasses: Enforcing Rules on Subclasses

The problem: You’re writing a library. Users will subclass your base class. You need to make sure they implement certain methods, or follow certain naming conventions. You can’t trust documentation alone.

The metaphor: A metaclass hooks into the class creation process. Since Python creates classes at runtime (they’re just objects), you can inspect and validate them as they’re being created.

The problem, concretely:

class Plugin:
    def run(self):
        raise NotImplementedError

class MyPlugin(Plugin):
    pass  # Forgot to implement run()

p = MyPlugin()
p.run()  # Crashes at runtime, maybe in production

With a metaclass (the manual way):

class PluginMeta(type):
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # Skip the base class itself
            if 'run' not in namespace:
                raise TypeError(f"{name} must implement run()")

class Plugin(metaclass=PluginMeta):
    def run(self):
        raise NotImplementedError

class MyPlugin(Plugin):
    pass  # TypeError: MyPlugin must implement run()

The error is raised when the class is defined, not later when run() is finally called. You catch the mistake immediately.

The standard library way (you rarely need to write metaclasses by hand):

from abc import ABC, abstractmethod

class Plugin(ABC):
    @abstractmethod
    def run(self):
        pass

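class MyPlugin(Plugin):
    pass  # Defining the class is allowed

MyPlugin()  # TypeError: Can't instantiate abstract class

ABC uses a metaclass under the hood (ABCMeta). The check runs when you instantiate the class rather than when you define it, but you still catch the mistake before run() is ever called, without writing the metaclass yourself.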

When to use: Almost never directly. Use ABC and @abstractmethod instead. Metaclasses are justified when you need to enforce constraints that ABC can’t express (like “all method names must be lowercase” or “every subclass must register itself in a global registry”).
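
The registry case can be sketched with the same `__init__` hook used earlier. The names here (`REGISTRY`, `RegisteringMeta`, the plugin classes) are made up for illustration.

```python
REGISTRY = {}

class RegisteringMeta(type):
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the base class itself
            REGISTRY[name] = cls

class Plugin(metaclass=RegisteringMeta):
    pass

class CsvPlugin(Plugin):
    pass

class JsonPlugin(Plugin):
    pass

print(sorted(REGISTRY))  # ['CsvPlugin', 'JsonPlugin']
```

Since Python 3.6, `__init_subclass__` on the base class can do the same job without a metaclass, which is usually the simpler choice.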

How They Fit Together

These four metaphors are mostly orthogonal. You can combine them:

from contextlib import contextmanager

# connect() and transform() are placeholders for your own database driver and row handler

def log_calls(func):                    # Decorator
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@contextmanager                          # Context manager (built with generator)
def database_connection(url):
    conn = connect(url)
    try:
        yield conn                       # Generator yield point
    finally:
        conn.close()

@log_calls
def process_data():
    with database_connection("db://localhost") as conn:
        for row in conn.stream_rows():   # Generator (lazy iteration)
            transform(row)

The decorator adds logging. The context manager handles connection cleanup. The generator streams rows without loading them all into memory. Each one does its job.

The Takeaway

Don’t memorize syntax. Remember what each metaphor is for:

Metaphor	Problem it solves
Decorator	Wrap behavior around functions (timing, auth, logging)
Generator	Compute lazily, one value at a time (memory, streaming)
Context Manager	Pair setup with guaranteed teardown (files, connections)
Metaclass	Enforce rules on subclasses at definition time (libraries)

Expert Python code doesn’t use every feature. It uses the right feature for the right problem.

Pandas distinct

Fri, 09 Feb 2018 00:00:00 +0000

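import pandas as pd

# Load a headerless two-column CSV and name the columns
df = pd.read_csv("all_words.txt", header=None, names=['key', 'val'])
df.describe()

# Keep only the last row for each distinct key
df = df.drop_duplicates(subset='key', keep="last")
df.describe()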

Securing containerized MongoDB

Sun, 24 Sep 2017 00:00:00 +0000

Getting the Docker container

$ docker run --name mongoDB -d -p 27217:27017 -v /mongo-data:/data/db mongo --auth

Note: mongoDB is the name of my container; yours may differ. Also, I am exposing port 27217 on the host, not the standard 27017.

The above line will pull the latest official mongo docker image, and deploy it. The -p option exposes the port 27217 publicly and the -v option helps you persist data even if your docker container gets killed.

The final --auth flag is essential for our use case: it forces the mongodb container to require authentication for all requests, which is what we want.

Once you have done this, connect to your docker container using:

$ docker exec -it mongoDB bash

Connect to the mongo service by:

$ mongo

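Create a new admin user in the admin database:

> use admin
> db.createUser(
  {
    user: "super",
    pwd: "superpassword",
    roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
  }
)

With --auth enabled, only the first user can be created without logging in. Authenticate as the admin, then create another user for controlling access to your db:

> db.auth("super", "superpassword")
> db.createUser(
  {
    user: "client",
    pwd: "clientpassword",
    roles: [ { role: "readWrite", db: "appdb" } ]
  }
)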

Now you can connect to your db from an application with a connection string:

MONGO_URL = 'mongodb://client:clientpassword@<public_ip>:27217/appdb?authSource=admin'

Or from the mongo shell:

mongo appdb -u client -p clientpassword --host <public_ip> --port 27217 --authenticationDatabase admin
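
If the password contains characters such as @ or /, it has to be percent-encoded before it goes into the connection string. A small sketch using only the standard library. The helper name, the sample password, and the example IP address are made up for illustration.

```python
from urllib.parse import quote_plus

def mongo_url(user, password, host, port, db, auth_source="admin"):
    """Build a MongoDB connection string with escaped credentials."""
    return (f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{db}?authSource={auth_source}")

print(mongo_url("client", "p@ss/word", "203.0.113.10", 27217, "appdb"))
# mongodb://client:p%40ss%2Fword@203.0.113.10:27217/appdb?authSource=admin
```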

Pandas apply functions on groups

Thu, 24 Aug 2017 00:00:00 +0000

Groupby with custom aggregation function

import pandas as pd

# Defining a small dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
        'name': ['Miller', 'Jacobson', 'Ali', 'Miller', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

Let’s do a groupby on regiment and count how many people in each regiment share the same name.

from collections import defaultdict

dd = defaultdict(int)

def fns(name):
    # Show the running counts, then count this name
    print("\t" + name, dd)
    dd[name] += 1

for name, reg in df.groupby('regiment'):
    print(name)
    reg['name'].apply(fns)
    print(dd)
    dd.clear()
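
For plain counting, pandas can do the same thing without a custom function. This sketch rebuilds just the two relevant columns from the dataframe above and uses `value_counts` on the grouped column:

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Miller', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
})

# One count per (regiment, name) pair
counts = df.groupby('regiment')['name'].value_counts()

# Keep only names that appear more than once within a regiment
print(counts[counts > 1])
```

The custom `apply` version is still useful when the per-group logic is more than a count.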

Dota 2 Win Prediction: A Weekend ML Project

Mon, 15 Aug 2016 00:00:00 +0000

After months of optimizing ML models to fit in 50MB on budget phones, I wanted a weekend project with zero constraints. Dota 2 match prediction: given the 10 hero picks in a match, predict which team wins. Simple problem, fun dataset, and a good excuse to play more Dota.

The Dataset and Features

I pulled 50,000 ranked matches from the OpenDota API. Each match has 5 heroes per team (out of 113 total heroes) and a binary outcome. The simplest feature representation: a 113-dimensional binary vector for Radiant picks and another for Dire picks.

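import numpy as np
import requests

PUBLIC_MATCHES_URL = 'https://api.opendota.com/api/publicMatches'

def fetch_matches(n=1000):
    matches = []
    params = {}
    while len(matches) < n:
        resp = requests.get(PUBLIC_MATCHES_URL, params=params, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        matches.extend(batch)
        # Page backwards so each request returns older matches instead of the
        # same recent batch (assumes the endpoint's less_than_match_id parameter)
        params = {'less_than_match_id': min(m['match_id'] for m in batch)}
    return matches[:n]

def match_to_features(match, num_heroes=114):
    # 114 slots so hero IDs, which start at 1, can index directly (slot 0 unused)
    features = np.zeros(num_heroes * 2)
    for player in match['players']:
        hero_id = player['hero_id']
        if player['player_slot'] < 128:  # Radiant
            features[hero_id] = 1
        else:  # Dire
            features[num_heroes + hero_id] = 1
    return features, 1 if match['radiant_win'] else 0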

Logistic Regression Gets You to 65%

No neural networks needed. Logistic regression on hero pick features alone hit 58% accuracy. Adding hero pair interaction features (did hero A and hero B appear on the same team) pushed it to 65%.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Build features and labels in one pass over the matches
pairs = [match_to_features(m) for m in matches]
X = np.array([features for features, _ in pairs])
y = np.array([label for _, label in pairs])

# Basic hero picks
model = LogisticRegression(C=0.1, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Hero picks only: {scores.mean():.3f}")  # ~0.58

# Add pairwise interactions. PolynomialFeatures crosses every pair of columns,
# so this covers cross-team pairs as well as same-team pairs.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_interact = poly.fit_transform(X)

scores2 = cross_val_score(model, X_interact, y, cv=5, scoring='accuracy')
print(f"With interactions: {scores2.mean():.3f}")  # ~0.65
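
PolynomialFeatures in the code above crosses every pair of columns, including heroes on opposite teams. If you only want the same-team pairs described earlier, one option is to enumerate them explicitly. This is a sketch with made-up hero IDs, not the code used for the numbers in this post.

```python
from itertools import combinations

def same_team_pairs(radiant, dire):
    """Return one feature key per pair of heroes that share a team."""
    keys = set()
    for side, heroes in (("radiant", radiant), ("dire", dire)):
        for a, b in combinations(sorted(heroes), 2):
            keys.add((side, a, b))
    return keys

pairs = same_team_pairs([1, 5, 9, 14, 22], [2, 7, 11, 30, 41])
print(len(pairs))  # 20: C(5, 2) = 10 pairs per team
```

Each key can then be mapped to a column index to build a sparse feature matrix.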

65% might sound low, but Dota has enormous variance. Pro analysts estimate that draft advantage accounts for about 10-15% of win probability. The rest is player skill, execution, and itemization. A model that only sees hero picks is fundamentally capped.

Feature Engineering > Model Complexity

I tried a random forest, gradient boosting, and a small neural network. None beat logistic regression by more than 1-2%. The gains came from better features, not better models.

Adding average hero win rate as a feature helped. Adding a “team synergy” score based on historical win rates of hero pairs helped more. The model didn’t need to learn that Anti-Mage is good from raw match data if I told it directly.

def add_winrate_features(X, hero_winrates):
    radiant_avg_wr = np.array([
        np.mean([hero_winrates[h] for h in range(114) if row[h] == 1])
        for row in X
    ])
    dire_avg_wr = np.array([
        np.mean([hero_winrates[h] for h in range(114) if row[114 + h] == 1])
        for row in X
    ])
    return np.column_stack([X, radiant_avg_wr, dire_avg_wr])

What I Learned

This was a good reminder that feature engineering matters more than model complexity. At work we spent weeks squeezing 1% accuracy out of model architectures. Here, 30 minutes of feature engineering gave me a 7-point accuracy improvement over the baseline (58% to 65%).

The other lesson: know your ceiling. No model will predict Dota matches at 90% accuracy from draft alone. Understanding the theoretical limit saves you from chasing diminishing returns. Same principle applies to production ML. If your training data has 5% label noise, don’t expect 99% accuracy.

Fun project. Took a Saturday afternoon. Would recommend as a first ML project for anyone who plays Dota.

Docker for ML Model Deployment: From Dev Chaos to Reproducible Builds

Fri, 12 Aug 2016 00:00:00 +0000

At Indus OS, our ML pipeline touched TensorFlow (Python 2.7, specific commit hash), Android NDK r12b, Bazel 0.3.1, and custom C libraries for the TTS vocoder. If any version drifted, the build broke silently and produced models that crashed on device. Three engineers on the team, three different Ubuntu versions, three different sets of installed libraries. Docker fixed this.

The Dependency Problem

Building our TensorFlow Android library required exact versions of everything. NDK r12b worked. NDK r13 introduced a libc++ change that broke our NEON-optimized matrix multiply. Bazel 0.3.2 changed a flag that affected selective op registration. We lost two days to each of these issues before we started using Docker.

FROM ubuntu:14.04

# Exact versions that produce working builds
ENV NDK_VERSION=r12b
ENV BAZEL_VERSION=0.3.1
ENV TF_COMMIT=a23f5d7

RUN apt-get update && apt-get install -y \
    build-essential git python2.7 python-pip wget unzip openjdk-8-jdk

# Android NDK
RUN wget -q https://dl.google.com/android/repository/android-ndk-${NDK_VERSION}-linux-x86_64.zip \
    && unzip -q android-ndk-${NDK_VERSION}-linux-x86_64.zip -d /opt \
    && rm android-ndk-${NDK_VERSION}-linux-x86_64.zip
ENV ANDROID_NDK_HOME=/opt/android-ndk-${NDK_VERSION}

# Bazel
RUN wget -q https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh \
    && chmod +x bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh \
    && ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh

# TensorFlow at exact commit
RUN git clone https://github.com/tensorflow/tensorflow.git /tf \
    && cd /tf && git checkout ${TF_COMMIT}

WORKDIR /tf

Images, Containers, and Volumes for Model Files

The key concepts that mattered for us:

An image is a frozen snapshot of the build environment. We built it once and pushed it to our private registry. Every engineer and the CI server pulled the same image. No more “works on my machine.”

A container is a running instance of that image. We ran builds inside containers and extracted the output artifacts.

Volumes let us mount model weight files and training data without baking them into the image. Models changed daily. The build environment changed maybe once a month.

# Build the TF Android library inside the container
docker run --rm \
  -v "$(pwd)/models:/models" \
  -v "$(pwd)/output:/output" \
  indus-ml-build:latest \
  bash -c "cd /tf && bazel build -c opt \
    --copt='-DSELECTIVE_REGISTRATION' \
    //tensorflow/contrib/android:libtensorflow_inference.so \
    && cp bazel-bin/tensorflow/contrib/android/*.so /output/"

CI/CD for the ML Pipeline

Before Docker, our “CI” was someone running the build on their laptop and copying the .so file to a shared drive. With Docker, we set up a proper pipeline:

Train model in Python container, export frozen graph
Quantize and optimize in TF tools container
Build Android .so with NDK container
Run on-device tests via ADB from the same container

#!/bin/bash
# build_pipeline.sh

# Stop at the first failing step instead of building on stale artifacts
set -euo pipefail

# Step 1: Freeze and quantize model
docker run --rm -v "$(pwd):/work" indus-ml-train:latest \
  python /work/scripts/freeze_and_quantize.py \
    --input_model /work/checkpoints/latest \
    --output /work/artifacts/quantized_model.pb

# Step 2: Build Android library with model baked in
docker run --rm -v "$(pwd):/work" indus-ndk-build:latest \
  ndk-build -C /work/android_app APP_ABI=armeabi-v7a

echo "Build artifacts in ./artifacts/"

Each step used a different image with only the tools it needed. The train image had Python and TF. The NDK image had the Android toolchain. No cross-contamination.
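
The same two steps can also be driven from Python. This is a minimal sketch, not what we ran: the image names and commands are copied from the shell script above, while the helper names (docker_run_argv, run_pipeline) and the injectable runner are hypothetical, and the code assumes Python 3.

```python
import subprocess

# Image names and commands are taken from build_pipeline.sh;
# the structure around them is illustrative only.
STEPS = [
    ("indus-ml-train:latest", [
        "python", "/work/scripts/freeze_and_quantize.py",
        "--input_model", "/work/checkpoints/latest",
        "--output", "/work/artifacts/quantized_model.pb",
    ]),
    ("indus-ndk-build:latest", [
        "ndk-build", "-C", "/work/android_app", "APP_ABI=armeabi-v7a",
    ]),
]


def docker_run_argv(image, command, workdir):
    """Build the argv for: docker run --rm -v <workdir>:/work <image> <command...>"""
    return ["docker", "run", "--rm", "-v", f"{workdir}:/work", image, *command]


def run_pipeline(steps, workdir, runner=subprocess.run):
    """Run each step in order. check=True makes subprocess.run raise
    CalledProcessError on the first non-zero exit, so later steps never
    run against missing or stale artifacts."""
    for image, command in steps:
        runner(docker_run_argv(image, command, workdir), check=True)
```

Passing the runner as a parameter keeps the step list testable without a Docker daemon: a fake runner can record the argv lists instead of executing them.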

What Worked and What Didn’t

Docker eliminated environment drift completely. Build times went from “2 hours plus debugging” to “45 minutes, deterministic.” New engineers could build on day one instead of spending a week setting up their environment.

What didn’t work: Docker on macOS was painfully slow for our builds because of the filesystem virtualization layer. We moved CI to a Linux server and kept macOS only for development. Also, the Docker images were large (2.5GB for the full TF build environment). We considered splitting the Dockerfile into separate build and runtime images, but the complexity was not worth it for our small team. Storage was cheap. Engineer time was not.

Porting Physics-Based Animations from C++ to Python

Tue, 02 Aug 2016 00:00:00 +0000

My M.S. thesis at IIT Madras was a Smoothed Particle Hydrodynamics (SPH) fluid simulator written in C++. It ran at 60fps for 10,000 particles. I wanted to port it to Python/Kivy for a cross-platform demo. The problem: naive Python is roughly 100x slower than C++ for this kind of tight numerical loop.

Why SPH Is Computationally Expensive

SPH simulates fluids as particles. Each particle interacts with neighbors within a smoothing radius h. For every timestep, you compute density, pressure, and viscosity forces by summing contributions from all neighbors. With N particles, a brute-force approach is O(N^2).

The core computation looks like this in C++:

// Density computation for particle i
for (int j = 0; j < num_neighbors; j++) {
    float r = distance(pos[i], pos[neighbors[j]]);
    if (r < h) {
        float q = r / h;
        density[i] += mass * kernel(q);  // cubic spline kernel
    }
}

Translating this directly to Python with for-loops gave about 2fps for 1000 particles. Completely unusable.
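
For reference, a direct loop translation might look like the sketch below. The kernel form is an assumption: the source only says "cubic spline", so this uses the common variant with support radius h and leaves out the normalization constant. The function names are illustrative, not the thesis code.

```python
import math


def cubic_spline(q):
    """Unnormalized cubic spline with support radius h, where q = r / h."""
    if q <= 0.5:
        return 1.0 - 6.0 * q**2 + 6.0 * q**3
    if q < 1.0:
        return 2.0 * (1.0 - q)**3
    return 0.0


def density_naive(positions, mass, h, neighbor_lists):
    """Pure-Python density sum: one interpreter-level iteration per neighbor pair."""
    density = [0.0] * len(positions)
    for i, (xi, yi) in enumerate(positions):
        for j in neighbor_lists[i]:
            xj, yj = positions[j]
            r = math.hypot(xi - xj, yi - yj)
            if r < h:
                density[i] += mass * cubic_spline(r / h)
    return density
```

Every distance, comparison, and kernel evaluation here goes through the interpreter one value at a time, which is where the slowdown comes from.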

NumPy Vectorization

The key insight: replace per-neighbor Python loops with bulk NumPy operations. Instead of iterating over each neighbor in Python, compute the distances and kernel values for all of a particle’s neighbors as array operations.

import numpy as np

def compute_density(positions, mass, h, neighbor_lists):
    N = len(positions)
    density = np.zeros(N)

    for i in range(N):
        nbrs = neighbor_lists[i]
        if len(nbrs) == 0:
            continue
        diff = positions[nbrs] - positions[i]  # vectorized subtraction
        r = np.linalg.norm(diff, axis=1)
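        mask = r < h
        q = r[mask] / h
        # Cubic spline kernel with support radius h (q in [0, 1)), vectorized.
        # The two pieces meet at q = 0.5; the normalization constant is omitted.
        w = 1.0 - 6.0*q**2 + 6.0*q**3
        outer = q > 0.5
        w[outer] = 2.0 * (1.0 - q[outer])**3
        density[i] = mass * np.sum(w)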

    return density

This still has a Python loop over particles, but the inner computation is vectorized. For 1000 particles with ~30 neighbors each, this ran at 18fps. Better, but not interactive.
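
For small N, the remaining per-particle loop can also be removed by computing every pairwise distance at once with broadcasting. The sketch below is an illustration under assumptions, not the code from the demo: it uses an unnormalized cubic spline with support radius h and ignores neighbor lists entirely, so it costs O(N^2) time and memory.

```python
import numpy as np


def cubic_spline(q):
    """Unnormalized cubic spline with support radius h, where q = r / h."""
    q = np.asarray(q, dtype=float)
    w = np.zeros_like(q)
    inner = q <= 0.5
    outer = (q > 0.5) & (q < 1.0)
    w[inner] = 1.0 - 6.0 * q[inner]**2 + 6.0 * q[inner]**3
    w[outer] = 2.0 * (1.0 - q[outer])**3
    return w  # stays 0 wherever q >= 1


def density_all_pairs(positions, mass, h):
    """Density for every particle with no Python-level loop."""
    # (N, 1, 2) - (1, N, 2) broadcasts to an (N, N, 2) array of differences
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=2)  # (N, N) pairwise distances
    w = cubic_spline(r / h)
    np.fill_diagonal(w, 0.0)  # skip self-pairs, like neighbor lists that omit i
    return mass * w.sum(axis=1)
```

The (N, N, 2) difference array is the cost of this approach: for 1000 particles in float64 it takes 16 MB, and it grows quadratically from there.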

Spatial Hashing for Neighbor Search

The O(N^2) neighbor search was the real bottleneck. We used a spatial hash grid: divide space into cells of size h, and only check particles in adjacent cells.

import math

def build_spatial_hash(positions, h):
    grid = {}
    cell_size = h
    for i, pos in enumerate(positions):
        # floor, not int(): int() truncates toward zero, which would merge the
        # cells on either side of each axis into one double-width cell
        key = (math.floor(pos[0] / cell_size), math.floor(pos[1] / cell_size))
        grid.setdefault(key, []).append(i)
    return grid

def find_neighbors(grid, positions, i, h):
    cell_size = h
    cx = math.floor(positions[i][0] / cell_size)
    cy = math.floor(positions[i][1] / cell_size)
    neighbors = []
    # Cells are h wide, so anything within h lies in the 3x3 block around (cx, cy)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            key = (cx + dx, cy + dy)
            if key in grid:
                for j in grid[key]:
                    if i != j:
                        r = np.linalg.norm(positions[i] - positions[j])
                        if r < h:
                            neighbors.append(j)
    return neighbors

With spatial hashing, neighbor search dropped from O(N^2) to roughly O(N), as long as particles stay spread out enough that each cell holds a bounded number of them. Combined with vectorized force computation, we hit 30fps for 1000 particles on a laptop. Good enough for an interactive demo.
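
One way to gain confidence in a grid search is to compare it against brute force on random input. The sketch below is self-contained and illustrative: the function names are hypothetical, and the distance filter is vectorized per particle rather than per candidate.

```python
import numpy as np


def neighbor_lists_grid(positions, h):
    """Neighbor indices within distance h for every particle, via a uniform grid."""
    # Integer cell coordinates; floor keeps negative coordinates consistent
    cells = np.floor(positions / h).astype(int).tolist()
    grid = {}
    for i, (cx, cy) in enumerate(cells):
        grid.setdefault((cx, cy), []).append(i)

    result = []
    for i, (cx, cy) in enumerate(cells):
        # Gather candidates from the 3x3 block of cells around particle i
        candidates = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                candidates.extend(grid.get((cx + dx, cy + dy), ()))
        cand = np.array([j for j in candidates if j != i], dtype=int)
        if cand.size == 0:
            result.append(cand)
            continue
        r = np.linalg.norm(positions[cand] - positions[i], axis=1)
        result.append(np.sort(cand[r < h]))
    return result


def neighbor_lists_brute(positions, h):
    """Reference O(N^2) search over all pairs."""
    diff = positions[:, None, :] - positions[None, :, :]
    within = np.linalg.norm(diff, axis=2) < h
    np.fill_diagonal(within, False)
    return [np.flatnonzero(row) for row in within]
```

The resulting lists have the shape a per-particle density function such as compute_density expects for neighbor_lists, so the check can run on the same data the simulation uses.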

When to Use Python vs C

For 1000 particles, Python with NumPy was adequate. For 10,000+ particles, you need C/C++ or at minimum Cython for the inner loops. The thesis code handled 50,000 particles in C++ at 30fps. Python topped out at about 2000 particles for interactive rates.

The Kivy rendering was not the bottleneck. Drawing 1000 circles as Kivy widgets was fast. The physics computation dominated. If I were doing this again, I would write the simulation in C as a shared library and call it from Python via ctypes, keeping Kivy only for rendering.

What Worked and What Didn’t

NumPy vectorization gave a 9x speedup over naive Python loops (2fps to 18fps). Spatial hashing took that to 30fps, roughly another 1.7x. Together they made 1000-particle simulations interactive. The Kivy framework was pleasant to work with for 2D rendering and handled touch input well on both desktop and mobile.

What didn’t work: I tried using Python multiprocessing to parallelize the force computation. The overhead of serializing NumPy arrays between processes was larger than the computation itself for 1000 particles. Parallelism only helps at larger scales where the per-particle work dominates the communication cost.