Where Does GPU Memory Actually Go During Inference?

You load a 4B parameter diffusion model. The card says “~13GB VRAM.” You have a 24GB GPU. That leaves 11GB free. Plenty, right?

Then you try generating a 2048×2048 image and get an OOM error. Where did all the memory go?

What Is “GPU Memory”?

A GPU isn’t one pool of memory. It’s a hierarchy with very different sizes and speeds.

Memory	Size (H100)	Bandwidth	Role
HBM	80 GB	3.35 TB/s	Weights, activations, KV cache, everything
L2 Cache	50 MB	~5.5 TB/s	Recently accessed data (hardware-managed)
SMEM (per SM)	256 KB	~30 TB/s	Data actively being computed on
Registers (per SM)	256 KB	Instant	Values mid-computation

“GPU memory” and “VRAM” mean HBM. That’s what OOMs are about.

Architecture details from How to Think About GPUs by the Google DeepMind scaling team.

The Five Components: FLUX.2 klein 4B

Using FLUX.2 klein 4B . A 4B parameter rectified flow transformer, 4 inference steps, model card says ~13GB VRAM.

1. Model Weights (~8 GB), Fixed

4B params × 2 bytes (BF16) = 8 GB. Always in memory regardless of resolution or batch size. Includes the DiT transformer, text encoder, and VAE decoder.

With FP8: drops to ~4 GB. This is the single biggest win for memory-constrained deployment. Always validate output quality against a golden set after quantizing. Some models tolerate FP8 well, others show visible degradation on faces or fine text.

2. Activations (~2-8 GB), Scales with Resolution²

Intermediate tensors from every transformer layer. At 1024×1024 (sequence length ~4096): ~2 GB. At 2048×2048 (sequence length ~16384): ~8 GB.

Why quadratic: double the resolution → 4× the tokens → 4× the activations.

3. Attention and KV Cache (~1-6 GB), The Hidden Hog

Every attention layer computes Q, K, V matrices and attention scores over the full sequence. Within a single forward pass, these intermediate tensors must all live in memory simultaneously. Some newer architectures (like FLUX.2 klein 9B KV) also cache K/V across image editing references, adding further memory pressure.

At 2048×2048 (sequence length ~16K), attention memory can hit 6+ GB. This is why high-res OOMs. Weights don’t change with resolution. Attention memory does, quadratically.

4. VAE Decode (~1-4 GB), The Final Spike

Converting latent space to pixels creates large temporary tensors. Often the thing that pushes you over the edge at high resolution.

5. Framework Overhead (~0.5 GB), The Tax

CUDA context, PyTorch allocator, memory fragmentation. The “where did my last gigabyte go?” tax.

The Numbers

Component	1024×1024	2048×2048	What scales it
Weights	~8 GB	~8 GB	Parameter count (fixed)
Activations	~2 GB	~8 GB	Resolution² × batch
Attention / KV	~1.5 GB	~6 GB	Resolution² × layers
VAE Decode	~1 GB	~4 GB	Output resolution
Overhead	~0.5 GB	~0.5 GB	Fixed
Total	~13 GB ✅	~26.5 GB ❌

Bandwidth: Why FP8 Helps Speed, Not Just Size

Even when the model fits, bandwidth determines speed. Reading 8 GB of weights on an H100 (3.35 TB/s) takes 2.4ms. With 4 steps, that’s 9.6ms just moving weights.

FP8 halves the bytes → halves the transfer time → faster inference. Caching skips the recomputation, reusing a previously stored result instead.

The ratio of compute to bandwidth is the arithmetic intensity. On an H100, you need ~295 FLOPs per byte to keep the GPU busy. Below that, the GPU waits for data.

What To Do About It

Reduce weights: FP8 quantization (4B params × 1 byte = 4 GB, down from 8 GB). Always do this first.

Reduce dynamic memory: CPU offloading moves the text encoder to RAM when unused. Reducing batch size to 1 is the simplest way to cut activation memory.

Trade memory for speed: Caching (TeaCache, DBCache) stores previous transformer outputs to skip redundant passes. This uses extra memory for the cache but cuts latency in half. On a memory-constrained GPU, disabling caching frees memory for higher resolutions at the cost of slower generation.

Distribute across GPUs: Weight sharding (HSDP), sequence splitting (Ulysses-SP, Ring-Attention), spatial VAE decode (VAE Patch Parallel).

Reduce generation time: Step distillation and guidance distillation cut the number of transformer passes from 50-100 down to 4-8. This does not reduce peak memory (each step uses the same amount), but it directly reduces wall-clock time and GPU cost per image.

Key Takeaways

Weights are fixed. Everything else scales with resolution² × batch size.
The model card number is for minimum resolution. Real usage can 2-3× it.
Bandwidth matters as much as capacity. Fitting in memory ≠ fast inference.
No single technique solves everything. Quantize weights to save memory, parallelize to scale, cache to trade memory for speed.

References

Black Forest Labs. “FLUX.2 klein 4B”, 2025.
Austin et al. “How to Think About GPUs”, Google DeepMind, 2025.
NVIDIA. “H100 Tensor Core GPU”, 2023.
Black Forest Labs. “FLUX.2 klein KV Cache”, 2025.
Black Forest Labs. “FLUX.2: Frontier Visual Intelligence”, 2025.