DeepSeek’s recent paper on Manifold-Constrained Hyper-Connections blew up on Twitter. At a high level, it tries to stabilize the matrix multiply inside the residual stream by constraining the matrix so that each row and each column sums to 1, which keeps activations from vanishing or exploding.
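For intuition, here is a minimal sketch of that constraint: alternately normalizing rows and columns (a Sinkhorn-style iteration; my own illustration, not DeepSeek’s actual implementation) pushes a positive matrix toward having rows and columns that each sum to 1.

```python
import torch

def sinkhorn_normalize(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Push a square matrix toward being doubly stochastic
    (nonnegative, rows and columns each summing to 1) by alternately
    normalizing rows and columns. Illustrative sketch only, not
    DeepSeek's implementation."""
    m = logits.exp()  # ensure positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)  # make rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)  # make columns sum to 1
    return m

# Example: a 4x4 mixing matrix for 4 residual streams
mix = sinkhorn_normalize(torch.randn(4, 4))
print(mix.sum(dim=-1), mix.sum(dim=-2))  # both close to 1
```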
However, I think the contribution of the original Hyper-Connections paper (which DeepSeek’s paper builds on) is even more interesting. It reveals a broader trend in LLM architecture design. Compared to classic ResNet-style residuals, the key shift is that Hyper-Connections introduce many residual pathways, rather than relying on a single high-dimensional residual.
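Concretely, instead of a single residual stream of width d that every block reads from and writes to, you keep n parallel streams and let small learned matrices decide how each block reads from, writes to, and mixes them. Here is a toy sketch of that idea (my own simplification; the actual paper also makes these weights dynamic / input-dependent):

```python
import torch
import torch.nn as nn

class ToyHyperConnection(nn.Module):
    """Toy illustration: n parallel residual streams, mixed by small
    learned matrices around a block. Omits the dynamic (input-dependent)
    weights of the actual Hyper-Connections paper."""
    def __init__(self, block: nn.Module, n_streams: int = 4):
        super().__init__()
        self.block = block
        # How much each stream contributes to the block's input.
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # How the block's output is written back to each stream.
        self.write = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        # How the streams mix with each other.
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, seq, d)
        h = torch.einsum("n,nbsd->bsd", self.read, streams)       # read: combine streams
        out = self.block(h)                                       # run the usual block
        mixed = torch.einsum("mn,nbsd->mbsd", self.mix, streams)  # mix streams with each other
        return mixed + self.write[:, None, None, None] * out      # write the output back

# Usage: wrap an ordinary MLP block; a classic residual is the special case n_streams = 1.
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
layer = ToyHyperConnection(mlp, n_streams=4)
streams = torch.randn(4, 2, 10, 64)  # (n_streams, batch, seq, d)
print(layer(streams).shape)          # torch.Size([4, 2, 10, 64])
```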
What’s interesting is how often this pattern shows up elsewhere in Transformers:
- Multihead Attention: many heads, each with a small head dimension
- MoE: many expert MLPs, with sparse routing per token (especially when top_k is large)
- Hyper-Connections: many residual branches
Across these components, a recurring recipe seems to be “more parallel computation paths, each with lower dimension.” Empirically, this trains faster than pushing everything through one wide path. If that is the pattern, the natural question is where else we can apply it. Attention, MLP, and residual pathways already cover most of the repeated blocks in a Transformer. What other parts could benefit from being split into many small, parallel routes? Off the top of my head, layer norm is the only remaining piece. Just a thought :)
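To make the layer norm thought slightly more concrete: one way to picture “many small parallel routes” for normalization is to normalize the hidden dimension in several independent low-dimensional groups instead of one wide LayerNorm, which amounts to GroupNorm over the channel dimension. Purely an illustrative sketch of that interpretation, not something proposed in either paper:

```python
import torch
import torch.nn as nn

class ManyGroupNorm(nn.Module):
    """Illustrative only: normalize the hidden dimension in several
    independent low-dimensional groups instead of one wide LayerNorm.
    This is essentially GroupNorm over the channel dim; a thought
    experiment, not something from either paper."""
    def __init__(self, d: int, n_groups: int = 8, eps: float = 1e-5):
        super().__init__()
        assert d % n_groups == 0
        self.n_groups = n_groups
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d))
        self.bias = nn.Parameter(torch.zeros(d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d) -> normalize each group of d // n_groups channels separately
        *lead, d = x.shape
        g = x.reshape(*lead, self.n_groups, d // self.n_groups)
        g = (g - g.mean(dim=-1, keepdim=True)) / torch.sqrt(
            g.var(dim=-1, unbiased=False, keepdim=True) + self.eps
        )
        return g.reshape(*lead, d) * self.weight + self.bias

x = torch.randn(2, 10, 64)
print(ManyGroupNorm(64, n_groups=8)(x).shape)  # torch.Size([2, 10, 64])
```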
Another thought: I wonder how this would work for methods that extract hidden states from intermediate layers, like Eagle 3. Would the features be more informative, or would we just reuse the projection to combine the hidden states?