This November and December, I was interested in MoE efficiency, speculative decoding, and efficient attention methods.
Efficient MoE
It seems like increasing sparsity in MoE layers will become a challenge for existing hardware-aware algorithms, in both training and inference. Increasing sparsity lowers the arithmetic intensity of the MoE layers, since tokens share weights less often. Unlike dense models, where every token loads the same weights, tokens in an MoE layer often load different weights (experts). Therefore, increasing batch size or sequence length is not always enough to reach a compute-bound regime. The idea of efficient MoE is very interesting to me. My prediction is that as linear attention methods improve, MoE will become a critical bottleneck, especially for inference. If we can shrink the KV cache, the largest memory transfer between HBM and on-chip memory (SBUF) will be the MoE weights.
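To make this concrete, here is a back-of-the-envelope sketch (my own toy numbers, not from any paper) of FLOPs per byte of weight traffic for a dense FFN versus an MoE layer, assuming uniform, independent routing. The dense layer's intensity grows linearly with the number of tokens, while the MoE layer's stays low until enough tokens land on each expert.

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte of weights moved)
# for a dense FFN vs. an MoE layer. Illustrative numbers only.

def expected_distinct_experts(tokens, total_experts, top_k):
    """Expected number of experts that receive at least one token,
    assuming (unrealistically) uniform, independent routing."""
    p_expert_idle = (1 - top_k / total_experts) ** tokens
    return total_experts * (1 - p_expert_idle)

def dense_ffn_intensity(tokens, d_model, d_ff, bytes_per_param=2):
    flops = 2 * tokens * d_model * d_ff          # one matmul, 2 FLOPs per MAC
    weight_bytes = d_model * d_ff * bytes_per_param
    return flops / weight_bytes                  # grows linearly with tokens

def moe_ffn_intensity(tokens, d_model, expert_dim, total_experts, top_k,
                      bytes_per_param=2):
    flops = 2 * tokens * top_k * d_model * expert_dim
    active = expected_distinct_experts(tokens, total_experts, top_k)
    weight_bytes = active * d_model * expert_dim * bytes_per_param
    return flops / weight_bytes                  # stays low until experts fill up

for tokens in (64, 512, 4096):
    print(tokens,
          round(dense_ffn_intensity(tokens, 4096, 16384)),
          round(moe_ffn_intensity(tokens, 4096, 2048, total_experts=256, top_k=8)))
```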
SonicMoE is one of the papers that caught my eye this month. It aims to accelerate MoE through the following optimizations.
- Minimizing Activation Memory - SonicMoE does some tricks (see the paper) to cache only the input to the MoE layer and the output of the up projection. This avoids O(tkd) activation memory, where t is the number of tokens, k is the number of experts activated per token, and d is the dimension of the input. Instead, their activation memory complexity is O(td + tkn), where n is the per-expert hidden dimension. Since k and n are inversely proportional when keeping FLOPs constant, this scales better as MoE sparsity increases (a rough sketch of this scaling follows the list).
- IO Overlap - SonicMoE uses async memory gathers/loads/stores to increase utilization of tensor cores.
- Token Rounding - SonicMoE rounds the number of tokens assigned to each expert to a multiple of the tile size, either by dropping low-affinity tokens or by adding extra tokens.
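Here is the rough scaling argument behind the activation-memory point above, as I understand it. The exact terms are my assumption (a naive implementation that caches gathered per-expert input copies, t·k·d elements, versus caching only the layer input plus the up-projection output, t·d + t·k·n elements), so see the paper for the real accounting.

```python
# Rough activation-memory scaling for MoE backward caching, in elements.
# t = tokens, d = model dim, k = experts activated per token,
# n = per-expert hidden dim. Illustrative; see the SonicMoE paper for details.

def naive_activation_elems(t, d, k, n):
    # Cache the gathered per-expert input copies plus the up-projection output.
    return t * k * d + t * k * n

def cached_activation_elems(t, d, k, n):
    # Cache only the (un-gathered) layer input and the up-projection output.
    return t * d + t * k * n

t, d, flop_budget = 8192, 4096, 8 * 2048   # hold k * n fixed => constant FLOPs
for k in (2, 8, 32):
    n = flop_budget // k
    print(f"k={k:3d} n={n:5d} "
          f"naive={naive_activation_elems(t, d, k, n)/1e9:.2f}G "
          f"cached={cached_activation_elems(t, d, k, n)/1e9:.2f}G")
```

With k·n held constant, the naive cache grows with k while the input-plus-up-projection cache stays flat, which is why this matters more as sparsity increases.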
Overall, SonicMoE was a very useful paper to read, especially for understanding GPU optimizations. For TPU-style accelerators, IO overlap and token rounding may already be partially handled by the compiler. IO overlap could also speed up inference, although the ratio of IO to compute would be drastically different.
Speculative Decoding
The interaction between speculative decoding and MoE is very interesting to me. As previously mentioned, increasing sequence length does not always improve the arithmetic intensity of MoE layers, and increasing batch size does not amortize the cost of loading weights the way it does for a dense layer. MoESD approaches this behavior from a theoretical standpoint. At a high level, MoESD finds that speculative decoding can accelerate MoE inference at moderate batch sizes. At small batch sizes, a single decoding step may only activate a subset of experts, so each extra speculative token pulls in mostly new expert weights. At large batch sizes, the model can become compute-bound, so adding more verification compute increases runtime. At moderate batch sizes, however, verifying extra tokens does not significantly increase how many distinct expert weights you load, and the layer is still memory-bound. I wonder if there is a way to design a drafter that generates batches that MoE target models can verify efficiently (e.g., through low expert-selection entropy).
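Here is a toy calculation (mine, not MoESD's analysis) of the expected number of distinct experts loaded per forward pass, with and without four speculative tokens per sequence, again assuming uniform, independent routing.

```python
# Toy model of why speculative decoding can be nearly free for MoE at moderate
# batch sizes: with B sequences and gamma speculative tokens each, the number
# of *distinct* experts loaded barely grows, while the verified tokens do.
# Uniform independent routing assumed; real routers are far more correlated.

def expected_distinct_experts(tokens, total_experts=256, top_k=8):
    return total_experts * (1 - (1 - top_k / total_experts) ** tokens)

for batch in (1, 16, 128, 1024):
    base = expected_distinct_experts(batch)        # 1 token per sequence
    spec = expected_distinct_experts(batch * 4)    # gamma = 4 draft tokens
    print(f"batch={batch:5d} experts loaded: {base:6.1f} -> {spec:6.1f} "
          f"({spec / base:.2f}x weights for 4x the verified tokens)")
```

In this toy model, at batch size 1 the draft tokens pull in almost 4x the expert weights, but by a few hundred sequences the same weights get reused, which matches the intuition that the moderate-batch, memory-bound regime is where verification is nearly free (and at very large batches the extra FLOPs are no longer free because the layer is compute-bound anyway).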
Efficient Attention Methods
Like many people, I have also been interested in efficient attention methods such as state space models (Mamba) and linear attention methods (Kimi Linear, Gated DeltaNet). I really like the combination of new architectures and hardware-efficient implementations. One of my favorite things I read this year was Songlin Yang's blog posts on parallelizing DeltaNet. On the plane ride home from China, I read these posts instead of sleeping :D
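For reference, here is the naive sequential delta-rule recurrence that those blog posts (and the DeltaNet papers) show how to parallelize over the sequence length. This sketch is just the O(T) loop, not the chunkwise/parallel form that makes it hardware-efficient.

```python
import numpy as np

# Minimal sequential delta-rule recurrence:
#   S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T ,   o_t = S_t q_t

def delta_rule_recurrent(q, k, v, beta):
    """q, k, v: (T, d) arrays; beta: (T,) write strengths in [0, 1]."""
    T, d = q.shape
    S = np.zeros((d, d))                        # fast-weight state
    out = np.zeros_like(v)
    for t in range(T):
        kt, vt, bt = k[t], v[t], beta[t]
        v_old = S @ kt                          # value currently stored under key kt
        S = S + bt * np.outer(vt - v_old, kt)   # delta rule: step it toward vt
        out[t] = S @ q[t]
    return out

T, d = 128, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((T, d)) / d**0.5 for _ in range(3))
beta = rng.uniform(0, 1, T)
print(delta_rule_recurrent(q, k, v, beta).shape)   # (128, 64)
```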
I believe the main issue with linear attention methods is still retrieval. Full attention is like being able to flip through a book to find a certain section. Linear attention, however, is like reading the entire book and then being asked to recite a certain section. Recursive Language Models showed that tool use (similar to grep) can drastically help on retrieval tasks. I believe this type of tool use will benefit linear attention models greatly; it is like allowing linear methods to flip through the pages of the book again. And since tool use can blow up sequence length, linear attention methods complement tool use in turn.
Papers and Blogs
- SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
- Tiny-TPU: the why and how
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System
- Recursive Language Models
- MoESD: Unveil Speculative Decoding’s Potential for Accelerating Sparse MoE
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- Kimi Linear: An Expressive, Efficient Attention Architecture
- Gated Delta Networks: Improving Mamba2 with Delta Rule
- MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
- DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
- Gated Linear Attention Transformers with Hardware-Efficient Training
- Parallelizing Linear Transformers with the Delta Rule over Sequence Length
- (Probably missed a lot. I've got to start keeping track 🙃)