Diffusion Acceleration Explained
Interactive visual guide to caching, parallelism, and quantization for diffusion models
🔄 Why Cache? — The Redundancy Problem
Consecutive diffusion steps produce nearly identical transformer outputs. Why recompute what barely changed?
🍵 TeaCache — Simple Adaptive Caching
Compare model outputs between consecutive steps; if the change falls below a threshold, reuse the cached result instead of recomputing. Typically a 1.5–2x speedup.
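The idea fits in a few lines. Here is a minimal sketch (not the actual TeaCache implementation, which modulates the difference with timestep embeddings): the hypothetical `run_with_teacache` skips the model call whenever the input's relative L1 change since the last computed step is below a threshold.

```python
def l1_rel_diff(a, b):
    """Mean absolute change from a to b, relative to a's mean magnitude."""
    num = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    den = sum(abs(x) for x in a) / len(a)
    return num / max(den, 1e-8)

def run_with_teacache(model, inputs, threshold=0.05):
    """Sketch: reuse the cached output whenever the input barely changed."""
    cached_out, prev_inp, n_skipped = None, None, 0
    outputs = []
    for x in inputs:
        if prev_inp is not None and l1_rel_diff(prev_inp, x) < threshold:
            out = cached_out          # similar enough: reuse cached result
            n_skipped += 1
        else:
            out = model(x)            # too different: recompute and cache
            cached_out = out
        prev_inp = x
        outputs.append(out)
    return outputs, n_skipped
```

The threshold is the whole tuning story: raise it and more steps are skipped, at the cost of drift from the exact trajectory.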
🧱 DBCache — Dual Block Cache
Run only the first N transformer blocks and measure how much their residual changed since the last step. If small → reuse the cached output of the remaining blocks. If large → compute them all.
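A toy version of the mechanic, with hypothetical names (`dbcache_forward`, `n_probe`): the first `n_probe` blocks always run as a probe; if their residual matches last step's, the cached contribution of the tail blocks is replayed instead of computed.

```python
def dbcache_forward(blocks, x, state, n_probe=2, threshold=0.1):
    """DBCache sketch: probe with the first n_probe blocks, then decide
    whether to compute the remaining blocks or replay their cached delta."""
    h = x
    for blk in blocks[:n_probe]:
        h = blk(h)
    probe_residual = [hv - xv for hv, xv in zip(h, x)]
    prev = state.get("probe_residual")
    close = prev is not None and max(
        abs(a - b) for a, b in zip(probe_residual, prev)) < threshold
    state["probe_residual"] = probe_residual
    if close and "tail_delta" in state:
        # small change: apply the cached contribution of the tail blocks
        out = [hv + dv for hv, dv in zip(h, state["tail_delta"])]
        state["skipped"] = state.get("skipped", 0) + 1
    else:
        # large change: run the tail and refresh its cached contribution
        out = h
        for blk in blocks[n_probe:]:
            out = blk(out)
        state["tail_delta"] = [ov - hv for ov, hv in zip(out, h)]
    return out
```

Compared to TeaCache, the probe blocks always run, so the decision is based on what the model is actually doing this step, not just on its input.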
🔮 TaylorSeer — Predict Instead of Compute
Use a Taylor expansion to forecast the next step's output from finite differences of previous outputs. Skip the forward pass entirely.
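As a sketch of the principle (the actual TaylorSeer method caches intermediate features and scales derivative terms; this is plain forward-difference extrapolation, which equals a truncated Taylor series with finite-difference derivatives):

```python
def taylor_predict(history, order=2):
    """Predict the next output from past ones by forward-difference
    extrapolation: f[t+1] ≈ f[t] + Δf + Δ²f, with Δf = f[t] - f[t-1]
    and Δ²f = f[t] - 2 f[t-1] + f[t-2]. Exact for quadratic trends."""
    f = history
    pred = list(f[-1])
    if order >= 1 and len(f) >= 2:
        pred = [p + (a - b) for p, a, b in zip(pred, f[-1], f[-2])]
    if order >= 2 and len(f) >= 3:
        pred = [p + (a - 2 * b + c)
                for p, a, b, c in zip(pred, f[-1], f[-2], f[-3])]
    return pred
```

For the sequence 0, 1, 4 (a quadratic, t²), the second-order prediction is exactly 9; real transformer outputs are only locally smooth, so predicted steps are interleaved with fully computed "anchor" steps.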
🎭 SCM — Step Computation Masking
Pre-define which steps to compute and which to cache. Like a schedule: "compute, cache, cache, compute..."
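Unlike the adaptive methods above, the schedule here is fixed ahead of time. A minimal sketch with hypothetical names (`make_step_mask`, `run_with_scm`); cached steps replay the last computed update, one simple choice among several possible replay rules:

```python
def make_step_mask(num_steps, compute_every=3, warmup=2):
    """Precomputed schedule: always compute the first `warmup` steps,
    then compute every `compute_every`-th step and cache the rest."""
    return [i < warmup or (i - warmup) % compute_every == 0
            for i in range(num_steps)]

def run_with_scm(step_fn, x0, mask):
    """On masked-out steps, reapply the last computed update."""
    x, cached_update, computed = x0, None, 0
    for i, do_compute in enumerate(mask):
        if do_compute or cached_update is None:
            new_x = step_fn(x, i)
            cached_update = new_x - x
            computed += 1
        else:
            new_x = x + cached_update
        x = new_x
    return x, computed
```

Because the mask is static, the speedup is exactly predictable: a mask that computes 4 of 6 steps saves one third of the model calls, regardless of content.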
🔀 CFG-Parallel — Split Guidance Branches
Run the conditional and unconditional CFG branches on separate GPUs, then merge the results. Up to 2x faster per step.
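The merge itself is just the classifier-free guidance formula. In this sketch, threads stand in for the two GPUs and `denoise` is a hypothetical stand-in for the DiT forward pass; a real setup would run one branch per rank and merge with a collective.

```python
from concurrent.futures import ThreadPoolExecutor

def cfg_merge(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

def cfg_step_parallel(denoise, x, scale=7.5):
    """Run the conditional and unconditional branches concurrently
    (threads here play the role of two GPUs), then merge."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond = pool.submit(denoise, x, True)
        uncond = pool.submit(denoise, x, False)
        return cfg_merge(uncond.result(), cond.result(), scale)
```

The two branches share no state until the merge, which is why this split is so clean: the only cost is exchanging one output tensor per step.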
⚔️ Ulysses-SP — Split Sequence, All-to-All Heads
Split the input sequence across GPUs. Before attention, an all-to-all exchange regroups the shards so each GPU attends over the full sequence for its subset of heads.
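The key move is a layout swap, which this single-process NumPy sketch simulates (a real implementation uses `all_to_all` collectives over NCCL): going in, each "GPU" holds a sequence slice with every head; coming out, each holds the full sequence for a head slice, so ordinary attention can run locally.

```python
import numpy as np

def ulysses_all_to_all(shards):
    """Simulated Ulysses all-to-all. Input shards[g]: (seq_per_gpu, heads,
    dim) — a sequence slice with all heads. Output[g]: (seq,
    heads_per_gpu, dim) — the full sequence for GPU g's head slice."""
    world = len(shards)
    _, heads, _ = shards[0].shape
    assert heads % world == 0, "head count must divide evenly across GPUs"
    hpg = heads // world
    out = []
    for g in range(world):
        # collect GPU g's head slice from every sequence shard, then
        # concatenate along the sequence axis to recover the full sequence
        pieces = [s[:, g * hpg:(g + 1) * hpg, :] for s in shards]
        out.append(np.concatenate(pieces, axis=0))
    return out
```

After attention, a mirror-image all-to-all restores the sequence-sharded layout for the MLP that follows.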
💍 Ring-Attention — Circulate K/V Blocks
Split the sequence across GPUs arranged in a ring. Each GPU keeps its local queries while K/V blocks circulate around the ring, accumulating attention results online.
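The accumulation uses the same online-softmax trick as FlashAttention, so no rank ever materializes the full score matrix. This single-process sketch simulates the ring by indexing into shard lists instead of doing point-to-point sends:

```python
import numpy as np

def ring_attention(q_shards, k_shards, v_shards):
    """Ring attention sketch: each rank keeps its queries; K/V blocks
    rotate around the ring and are folded in with an online softmax."""
    world = len(q_shards)
    outs = []
    for r in range(world):
        q = q_shards[r]
        m = np.full(q.shape[0], -np.inf)   # running row max
        l = np.zeros(q.shape[0])           # running softmax denominator
        acc = np.zeros_like(q)             # running weighted value sum
        for step in range(world):
            src = (r + step) % world       # K/V block arriving this step
            k, v = k_shards[src], v_shards[src]
            s = q @ k.T                    # scores vs. this block only
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)      # rescale old stats to new max
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs, axis=0)
```

The result matches exact attention; the ring only changes where each K/V block lives, not the math.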
🧩 HSDP — Hybrid Sharded Data Parallel
Shard model weights across GPUs using FSDP2: each GPU holds a fraction, and full weights are gathered on demand during the forward pass.
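The shard/gather mechanic within one replica group can be sketched in a few lines (in HSDP this sharding happens inside each node, with replication across nodes; real FSDP2 also frees the gathered copy right after the layer runs):

```python
import numpy as np

def shard_param(w, world):
    """FSDP-style flat sharding: flatten, pad to a multiple of the world
    size, and split evenly — each 'GPU' stores one piece."""
    flat = w.ravel()
    pad = (-flat.size) % world
    flat = np.concatenate([flat, np.zeros(pad)])
    return np.split(flat, world), w.shape, pad

def all_gather_param(shards, shape, pad):
    """Reassemble the full weight just in time for a layer's forward."""
    flat = np.concatenate(shards)
    return flat[:flat.size - pad].reshape(shape)
```

Per-GPU weight memory shrinks by the sharding factor; the price is an all-gather per layer, which frameworks hide by prefetching the next layer's weights while the current one computes.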
🧩 VAE Patch Parallelism — Spatial Decode Split
Split the VAE decode spatially across GPUs: each GPU decodes one patch, and the patches are stitched back together.
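A toy sketch with a hypothetical stand-in decoder that just upsamples each latent pixel. Because this toy decoder is purely local, the stitched result equals a full decode; a real convolutional VAE has cross-patch receptive fields, so implementations overlap patches or exchange halo regions to avoid seams.

```python
import numpy as np

def decode_patch(latent, scale=8):
    """Stand-in for a VAE decoder: upsample each latent pixel `scale`x."""
    return latent.repeat(scale, axis=0).repeat(scale, axis=1)

def patch_parallel_decode(latent, world=2, scale=8):
    """Split the latent along height, decode each slice (one per 'GPU' —
    in parallel in a real system), then stitch the results back."""
    patches = np.array_split(latent, world, axis=0)
    decoded = [decode_patch(p, scale) for p in patches]
    return np.concatenate(decoded, axis=0)
```

The payoff is memory as much as speed: decode activations scale with output resolution, and each GPU only ever holds its own patch's activations.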
📐 FP8 / Int8 Quantization
Reduce DiT linear layers from BF16 to FP8 or Int8. ~1.28x speedup with minimal quality loss.
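A minimal sketch of symmetric per-tensor Int8 quantization (production stacks use per-channel or per-block scales and fused low-precision kernels; the FP8 path is analogous with a float format instead of integers):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor Int8: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, q, scale):
    """Linear layer with quantized weights: integer matmul, then one
    float rescale (the step real kernels fuse into the matmul)."""
    return (x @ q.astype(np.float32)) * scale
```

Halving the bytes per weight halves weight memory and memory traffic, which is where the speedup comes from; the rounding error per weight is bounded by half the scale.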
🧠 GPU Memory Map — Where It All Goes
A single GPU must hold model weights, activations, KV cache, and VAE decode tensors. Here's how the techniques we've explored reduce each piece.
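The weight portion of that budget is simple arithmetic, which makes the levers easy to compare. A sketch using a hypothetical 12B-parameter DiT (illustrative numbers only):

```python
def weight_memory_gb(n_params, bytes_per_param, shard_degree=1):
    """Per-GPU resident weight memory in decimal GB, ignoring overheads.
    bytes_per_param: 2 for BF16, 1 for FP8/Int8.
    shard_degree: how many GPUs the weights are sharded across (FSDP)."""
    return n_params * bytes_per_param / shard_degree / 1e9

# Hypothetical 12B-parameter DiT:
#   BF16, unsharded        -> 24 GB of weights per GPU
#   FP8, unsharded         -> 12 GB (quantization)
#   BF16, sharded 4 ways   ->  6 GB (HSDP/FSDP)
```

Activations, KV-state, and VAE decode tensors follow different scaling laws (sequence length and resolution rather than parameter count), which is why the sequence-parallel and patch-parallel techniques attack those pieces instead.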