Diffusion Acceleration Explained
Interactive visual guide to caching, parallelism, and quantization for diffusion models
🔄 Why Cache? — The Redundancy Problem
Consecutive diffusion steps produce nearly identical transformer outputs. Why recompute what barely changed?
🍵 TeaCache — Simple Adaptive Caching
Compare model outputs between consecutive steps; if the change falls below a threshold, reuse the cached result instead of recomputing. Typically a 1.5–2x speedup.
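The idea fits in a few lines. Here is a minimal sketch (not the actual TeaCache implementation, which modulates the difference with timestep embeddings): the hypothetical `run_with_teacache` skips the model call whenever the input's relative L1 change since the last computed step is below a threshold.

```python
def l1_rel_diff(a, b):
    """Mean absolute change from a to b, relative to a's mean magnitude."""
    num = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    den = sum(abs(x) for x in a) / len(a)
    return num / max(den, 1e-8)

def run_with_teacache(model, inputs, threshold=0.05):
    """Sketch: reuse the cached output whenever the input barely changed."""
    cached_out, prev_inp, n_skipped = None, None, 0
    outputs = []
    for x in inputs:
        if prev_inp is not None and l1_rel_diff(prev_inp, x) < threshold:
            out = cached_out          # similar enough: reuse cached result
            n_skipped += 1
        else:
            out = model(x)            # too different: recompute and cache
            cached_out = out
        prev_inp = x
        outputs.append(out)
    return outputs, n_skipped
```

The threshold is the whole tuning story: raise it and more steps are skipped, at the cost of drift from the exact trajectory.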
🧱 DBCache — Dual Block Cache
Run only the first N transformer blocks and measure how much their residual changed since the last step. If small → reuse the cached output of the remaining blocks. If large → compute them all.
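A toy version of the mechanic, with hypothetical names (`dbcache_forward`, `n_probe`): the first `n_probe` blocks always run as a probe; if their residual matches last step's, the cached contribution of the tail blocks is replayed instead of computed.

```python
def dbcache_forward(blocks, x, state, n_probe=2, threshold=0.1):
    """DBCache sketch: probe with the first n_probe blocks, then decide
    whether to compute the remaining blocks or replay their cached delta."""
    h = x
    for blk in blocks[:n_probe]:
        h = blk(h)
    probe_residual = [hv - xv for hv, xv in zip(h, x)]
    prev = state.get("probe_residual")
    close = prev is not None and max(
        abs(a - b) for a, b in zip(probe_residual, prev)) < threshold
    state["probe_residual"] = probe_residual
    if close and "tail_delta" in state:
        # small change: apply the cached contribution of the tail blocks
        out = [hv + dv for hv, dv in zip(h, state["tail_delta"])]
        state["skipped"] = state.get("skipped", 0) + 1
    else:
        # large change: run the tail and refresh its cached contribution
        out = h
        for blk in blocks[n_probe:]:
            out = blk(out)
        state["tail_delta"] = [ov - hv for ov, hv in zip(out, h)]
    return out
```

Compared to TeaCache, the probe blocks always run, so the decision is based on what the model is actually doing this step, not just on its input.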
🔮 TaylorSeer — Predict Instead of Compute
Use a Taylor expansion to forecast the next step's output from finite differences of previous outputs. Skip the forward pass entirely.
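As a sketch of the principle (the actual TaylorSeer method caches intermediate features and scales derivative terms; this is plain forward-difference extrapolation, which equals a truncated Taylor series with finite-difference derivatives):

```python
def taylor_predict(history, order=2):
    """Predict the next output from past ones by forward-difference
    extrapolation: f[t+1] ≈ f[t] + Δf + Δ²f, with Δf = f[t] - f[t-1]
    and Δ²f = f[t] - 2 f[t-1] + f[t-2]. Exact for quadratic trends."""
    f = history
    pred = list(f[-1])
    if order >= 1 and len(f) >= 2:
        pred = [p + (a - b) for p, a, b in zip(pred, f[-1], f[-2])]
    if order >= 2 and len(f) >= 3:
        pred = [p + (a - 2 * b + c)
                for p, a, b, c in zip(pred, f[-1], f[-2], f[-3])]
    return pred
```

For the sequence 0, 1, 4 (a quadratic, t²), the second-order prediction is exactly 9; real transformer outputs are only locally smooth, so predicted steps are interleaved with fully computed "anchor" steps.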
🎭 SCM — Step Computation Masking
Pre-define which steps to compute and which to cache. Like a schedule: "compute, cache, cache, compute..."
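Unlike the adaptive methods above, the schedule here is fixed ahead of time. A minimal sketch with hypothetical names (`make_step_mask`, `run_with_scm`); cached steps replay the last computed update, one simple choice among several possible replay rules:

```python
def make_step_mask(num_steps, compute_every=3, warmup=2):
    """Precomputed schedule: always compute the first `warmup` steps,
    then compute every `compute_every`-th step and cache the rest."""
    return [i < warmup or (i - warmup) % compute_every == 0
            for i in range(num_steps)]

def run_with_scm(step_fn, x0, mask):
    """On masked-out steps, reapply the last computed update."""
    x, cached_update, computed = x0, None, 0
    for i, do_compute in enumerate(mask):
        if do_compute or cached_update is None:
            new_x = step_fn(x, i)
            cached_update = new_x - x
            computed += 1
        else:
            new_x = x + cached_update
        x = new_x
    return x, computed
```

Because the mask is static, the speedup is exactly predictable: a mask that computes 4 of 6 steps saves one third of the model calls, regardless of content.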
🔀 CFG-Parallel — Split Guidance Branches
Run the conditional and unconditional CFG branches on separate GPUs, then merge the results. Up to 2x faster per step.
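The merge itself is just the classifier-free guidance formula. In this sketch, threads stand in for the two GPUs and `denoise` is a hypothetical stand-in for the DiT forward pass; a real setup would run one branch per rank and merge with a collective.

```python
from concurrent.futures import ThreadPoolExecutor

def cfg_merge(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

def cfg_step_parallel(denoise, x, scale=7.5):
    """Run the conditional and unconditional branches concurrently
    (threads here play the role of two GPUs), then merge."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond = pool.submit(denoise, x, True)
        uncond = pool.submit(denoise, x, False)
        return cfg_merge(uncond.result(), cond.result(), scale)
```

The two branches share no state until the merge, which is why this split is so clean: the only cost is exchanging one output tensor per step.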
⚔️ Ulysses-SP — Split Sequence, All-to-All Heads
Split the input sequence across GPUs. Before attention, an all-to-all exchange regroups the shards so each GPU attends over the full sequence for its subset of heads.
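The key move is a layout swap, which this single-process NumPy sketch simulates (a real implementation uses `all_to_all` collectives over NCCL): going in, each "GPU" holds a sequence slice with every head; coming out, each holds the full sequence for a head slice, so ordinary attention can run locally.

```python
import numpy as np

def ulysses_all_to_all(shards):
    """Simulated Ulysses all-to-all. Input shards[g]: (seq_per_gpu, heads,
    dim) — a sequence slice with all heads. Output[g]: (seq,
    heads_per_gpu, dim) — the full sequence for GPU g's head slice."""
    world = len(shards)
    _, heads, _ = shards[0].shape
    assert heads % world == 0, "head count must divide evenly across GPUs"
    hpg = heads // world
    out = []
    for g in range(world):
        # collect GPU g's head slice from every sequence shard, then
        # concatenate along the sequence axis to recover the full sequence
        pieces = [s[:, g * hpg:(g + 1) * hpg, :] for s in shards]
        out.append(np.concatenate(pieces, axis=0))
    return out
```

After attention, a mirror-image all-to-all restores the sequence-sharded layout for the MLP that follows.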
💍 Ring-Attention — Circulate K/V Blocks
Split the sequence across GPUs arranged in a ring. Each GPU keeps its local queries while K/V blocks circulate around the ring, accumulating attention results online.
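The accumulation uses the same online-softmax trick as FlashAttention, so no rank ever materializes the full score matrix. This single-process sketch simulates the ring by indexing into shard lists instead of doing point-to-point sends:

```python
import numpy as np

def ring_attention(q_shards, k_shards, v_shards):
    """Ring attention sketch: each rank keeps its queries; K/V blocks
    rotate around the ring and are folded in with an online softmax."""
    world = len(q_shards)
    outs = []
    for r in range(world):
        q = q_shards[r]
        m = np.full(q.shape[0], -np.inf)   # running row max
        l = np.zeros(q.shape[0])           # running softmax denominator
        acc = np.zeros_like(q)             # running weighted value sum
        for step in range(world):
            src = (r + step) % world       # K/V block arriving this step
            k, v = k_shards[src], v_shards[src]
            s = q @ k.T                    # scores vs. this block only
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)      # rescale old stats to new max
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs, axis=0)
```

The result matches exact attention; the ring only changes where each K/V block lives, not the math.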
🧩 HSDP — Hybrid Sharded Data Parallel
Shard model weights across GPUs using FSDP2: each GPU holds a fraction, and full weights are gathered on demand during the forward pass.
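The shard/gather mechanic within one replica group can be sketched in a few lines (in HSDP this sharding happens inside each node, with replication across nodes; real FSDP2 also frees the gathered copy right after the layer runs):

```python
import numpy as np

def shard_param(w, world):
    """FSDP-style flat sharding: flatten, pad to a multiple of the world
    size, and split evenly — each 'GPU' stores one piece."""
    flat = w.ravel()
    pad = (-flat.size) % world
    flat = np.concatenate([flat, np.zeros(pad)])
    return np.split(flat, world), w.shape, pad

def all_gather_param(shards, shape, pad):
    """Reassemble the full weight just in time for a layer's forward."""
    flat = np.concatenate(shards)
    return flat[:flat.size - pad].reshape(shape)
```

Per-GPU weight memory shrinks by the sharding factor; the price is an all-gather per layer, which frameworks hide by prefetching the next layer's weights while the current one computes.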
🧩 VAE Patch Parallelism — Spatial Decode Split
Split the VAE decode spatially across GPUs: each GPU decodes one patch, and the patches are stitched back together.
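A toy sketch with a hypothetical stand-in decoder that just upsamples each latent pixel. Because this toy decoder is purely local, the stitched result equals a full decode; a real convolutional VAE has cross-patch receptive fields, so implementations overlap patches or exchange halo regions to avoid seams.

```python
import numpy as np

def decode_patch(latent, scale=8):
    """Stand-in for a VAE decoder: upsample each latent pixel `scale`x."""
    return latent.repeat(scale, axis=0).repeat(scale, axis=1)

def patch_parallel_decode(latent, world=2, scale=8):
    """Split the latent along height, decode each slice (one per 'GPU' —
    in parallel in a real system), then stitch the results back."""
    patches = np.array_split(latent, world, axis=0)
    decoded = [decode_patch(p, scale) for p in patches]
    return np.concatenate(decoded, axis=0)
```

The payoff is memory as much as speed: decode activations scale with output resolution, and each GPU only ever holds its own patch's activations.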
📐 FP8 / Int8 Quantization
Reduce DiT linear layers from BF16 to FP8 or Int8. ~1.28x speedup with minimal quality loss.
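A minimal sketch of symmetric per-tensor Int8 quantization (production stacks use per-channel or per-block scales and fused low-precision kernels; the FP8 path is analogous with a float format instead of integers):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor Int8: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, q, scale):
    """Linear layer with quantized weights: integer matmul, then one
    float rescale (the step real kernels fuse into the matmul)."""
    return (x @ q.astype(np.float32)) * scale
```

Halving the bytes per weight halves weight memory and memory traffic, which is where the speedup comes from; the rounding error per weight is bounded by half the scale.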
🧠 GPU Memory Map — Where It All Goes
A single GPU must hold model weights, activations, KV cache, and VAE decode tensors. Here's how the techniques we've explored reduce each piece.
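The weight portion of that budget is simple arithmetic, which makes the levers easy to compare. A sketch using a hypothetical 12B-parameter DiT (illustrative numbers only):

```python
def weight_memory_gb(n_params, bytes_per_param, shard_degree=1):
    """Per-GPU resident weight memory in decimal GB, ignoring overheads.
    bytes_per_param: 2 for BF16, 1 for FP8/Int8.
    shard_degree: how many GPUs the weights are sharded across (FSDP)."""
    return n_params * bytes_per_param / shard_degree / 1e9

# Hypothetical 12B-parameter DiT:
#   BF16, unsharded        -> 24 GB of weights per GPU
#   FP8, unsharded         -> 12 GB (quantization)
#   BF16, sharded 4 ways   ->  6 GB (HSDP/FSDP)
```

Activations, KV-state, and VAE decode tensors follow different scaling laws (sequence length and resolution rather than parameter count), which is why the sequence-parallel and patch-parallel techniques attack those pieces instead.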