For my final semester project at Columbia, my teammate and I decided to tackle one of the most annoying problems in AI right now: waiting for AI videos to generate.
If you’ve played with models like Sora or Hunyuan, you know the drill. It starts fast, but if you try to generate anything longer than a few seconds, the progress bar basically crawls to a halt. That’s because standard attention mechanisms are quadratic (O(N^2)). Double the video length, and you quadruple the pain ;-;
We wanted to see if we could fix this. We took Wan 2.1 (1.3B) and tried to swap out its attention mechanism for something smarter called Radial Attention.
Here’s what happened, how I broke the code (and fixed it), and the actual numbers on whether it worked.
The logic behind Radial Attention (NeurIPS 2025) is actually pretty simple. Think about a video. If you look at a pixel in Frame 10, it's probably really related to Frame 9 and Frame 11. It's somewhat related to Frame 8 and 12. But does it really care about Frame 100? Probably not.
This is what the paper calls "spatiotemporal energy decay."
Standard attention is wasteful: it calculates the relationship between every single token pair, even if they are totally unrelated. Radial Attention just forces the model to ignore the far-away stuff using a static mask. In theory, this brings the math down from O(N^2) to O(N log N).
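To make that concrete, here's a toy sketch of a distance-decaying static mask. This is not the paper's exact mask (the real one is defined over the video's full spatiotemporal token layout and tuned per model); it just shows the core idea: nearby frames attend densely, and the allowed interactions thin out as the temporal distance doubles.

```python
import torch

def toy_radial_mask(num_frames: int, tokens_per_frame: int, dense_window: int = 1) -> torch.Tensor:
    """Toy frame-level version of a radial-style static mask (illustration only).

    Nearby frames attend densely; each time the frame distance doubles, only a
    thinning subset of token pairs is kept, so total work grows roughly like
    O(N log N) instead of O(N^2).
    """
    n = num_frames * tokens_per_frame
    frame_idx = torch.arange(n) // tokens_per_frame            # frame index of each token
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs()     # temporal distance per token pair
    mask = dist <= dense_window                                # band 0: full attention nearby
    band = 1
    while (dense_window << band) <= num_frames:
        lo, hi = dense_window << (band - 1), dense_window << band
        in_band = (dist > lo) & (dist <= hi)
        stride = 1 << band                                     # halve density as distance doubles
        keep = (torch.arange(n)[None, :] % stride) == 0        # keep every `stride`-th key token
        mask = mask | (in_band & keep)
        band += 1
    return mask                                                # (n, n) bool, True = allowed
```

Building a mask like this is the easy part. Wiring it into an existing attention implementation is where things got messy, which brings me to the next section.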
If you've ever tried to implement a research paper from GitHub, you know the vibe. You clone the repo, run it, and... it crashes.
The official Radial Attention code was written for an older version of Wan. When I tried to drop it into the modern Hugging Face diffusers library, nothing lined up: tensor shapes were mismatched and the attention class interfaces were different. It was a mess.
I spent a good chunk of the project just patching the Wan 2.1 interface. I had to rewrite the WanSparseAttention class to make sure it actually played nice with the diffusers pipeline. The hardest part was getting the static mask to broadcast correctly across the batch and head dimensions without silently failing (which is the worst kind of bug).
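For illustration, here's a stripped-down version of what the broadcasting fix boils down to. The function name and call path are simplified (the real patch lives inside the rewritten WanSparseAttention class), but the shape handling is the part that mattered:

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, static_mask):
    """Simplified sketch: q/k/v are (batch, heads, seq, head_dim); static_mask is (seq, seq) bool.

    The bug-prone part: the (seq, seq) mask needs explicit singleton batch/head
    dimensions so it broadcasts the way you expect instead of silently doing
    something else.
    """
    attn_mask = static_mask[None, None, :, :]                  # -> (1, 1, seq, seq)
    # PyTorch SDPA: with a boolean mask, True means "this pair may attend"
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```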
Once I finally got the tensors lined up and the patch stable, I fired it up on our benchmark rig: a single NVIDIA RTX 5090.
So, was it actually faster? Short answer: yes.
We ran a bunch of tests ranging from short clips (59 frames) to longer sequences (241 frames). For short videos, it was okay: about 1.37x faster. But the cool part about O(N log N) is that it shines when things get heavy. By the time we pushed it to 241 frames, the Radial Attention implementation was 1.82x faster than the baseline.
You can see the gap widening here. Blue is the standard dense attention and orange is us.
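For reference, the timing loop looked roughly like this. The pipe(...) call signature is illustrative, not our exact benchmark script, but the synchronize-around-CUDA-events pattern is what we relied on:

```python
import torch

def time_generation(pipe, prompt, num_frames, warmup=1, runs=3):
    """Rough sketch of how each (frame count, attention variant) pair was timed."""
    for _ in range(warmup):
        _ = pipe(prompt, num_frames=num_frames)                # warm up kernels and caches
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start.record()
        _ = pipe(prompt, num_frames=num_frames)
        end.record()
        torch.cuda.synchronize()                               # wait for the GPU to finish
        times.append(start.elapsed_time(end) / 1000.0)         # ms -> seconds
    return sum(times) / len(times)
```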
Honestly, I expected our memory usage to drop off a cliff. But when I checked the logs, the peak VRAM was almost identical, hovering around 20 GB for both.
I was confused at first, but after digging into the snapshots, it made sense. For a model this size (1.3B parameters), the VRAM is mostly eaten up by the fixed costs (i.e., the model weights and the activation buffers). The attention map itself is huge, sure, but sparsifying it saved us time, not really space.
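If you want to reproduce the memory check, this is roughly what our logging boiled down to (again, the pipe(...) call is a placeholder for the actual generation call):

```python
import torch

def peak_vram_gib(pipe, prompt, num_frames):
    """Report peak allocated VRAM for one generation, in GiB (sketch only)."""
    torch.cuda.reset_peak_memory_stats()
    _ = pipe(prompt, num_frames=num_frames)                    # one full generation
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```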
Quality was my biggest worry, tbh. You can make anything fast if you don't care about the output. We tested it with many prompts, and it actually held up! (I can't seem to figure out how to upload videos here yet... but hopefully I'll be able to soon... till then you can enjoy this video still :p).
I didn't want to just say "it's faster" and call it a day. I wanted to know why. Was I hitting the cache better? Was it a bandwidth thing? So I decided to use NVIDIA Nsight Compute to profile the actual CUDA kernels running on the GPU. And the difference was stark...
Dense Attention: 30 Million SM Cycles
Radial Attention: 9 Million SM Cycles
A 70% reduction in Streaming Multiprocessor (SM) cycles! The kernel was simply doing less math; the speedup was a brute-force reduction in floating-point operations (FLOPs). That said, the profiler also showed me that our Radial kernel had lower GPU occupancy than the super-optimized FlashAttention kernel used in the baseline. So, the code is faster, but it's arguably "lazier".
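As a sanity check on why the gap grows with length, here's the back-of-the-envelope version. The token counts below are made up and the constant c is a fudge factor; the point is the trend (the dense/sparse ratio keeps growing with sequence length), not the exact numbers, since the dense local band, kernel constants, and everything outside attention keep the real end-to-end speedup far more modest.

```python
import math

def dense_to_sparse_ratio(n_tokens: int, c: float = 1.0) -> float:
    """Toy comparison: dense attention touches ~N^2 pairs, an O(N log N) pattern ~c*N*log2(N)."""
    return n_tokens**2 / (c * n_tokens * math.log2(n_tokens))

# The longer the sequence, the bigger the theoretical win, which matches the
# widening gap we measured between short and long clips (illustrative token counts).
for n in (10_000, 20_000, 40_000):
    print(n, round(dense_to_sparse_ratio(n), 1))
```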
Overall, this was a fun project. It shows that we don't always need to pay the quadratic cost for high-quality video. Sometimes you just need to tell the model to focus on what matters.
Code is up on GitHub if you want to try breaking it yourself: https://github.com/naandip/radial-attention