NVIDIA and SakanaAILabs publish ICML 2026 paper on sparse Transformer optimization

NVIDIA AI (@NVIDIAAI), May 8, 2026

Score: 8.7

TL;DR · AI summary

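NVIDIA and SakanaAILabs co-authored an ICML 2026 paper that introduces the TwELL sparse packing format and fused CUDA kernels, reporting 20%+ inference and training speedups at scale.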

Key points

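The TwELL sparse packing format and fused CUDA kernels target unstructured sparsity in an LLM's feedforward (FFN) layers.
The kernels deliver 20%+ inference and training speedups at scale.
Modern LLMs are already naturally sparse (over 95% of FFN neurons stay silent for any given word), and simple L1 regularization can push sparsity above 99% with negligible impact on downstream performance.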
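
To make the silent-neuron point concrete, here is a minimal pure-Python sketch of a toy ReLU feedforward block. It checks that summing over only the active hidden neurons gives the same output as the dense computation. Everything in it is an illustrative assumption: the sizes, the random weights, and especially the negative bias, which is only a device to make this toy layer sparse. It is not how the paper obtains or exploits sparsity.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

rng = random.Random(0)
d_model, d_hidden = 16, 512  # hypothetical toy sizes

# Toy weights; the negative bias only serves to silence most neurons here.
W_in = [[rng.gauss(0.0, d_model ** -0.5) for _ in range(d_model)]
        for _ in range(d_hidden)]
b_in = [-2.0] * d_hidden
W_out = [[rng.gauss(0.0, 1.0) for _ in range(d_hidden)]
         for _ in range(d_model)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

# Hidden activations: h = relu(W_in x + b_in).
h = relu([a + b for a, b in zip(matvec(W_in, x), b_in)])
active = [j for j, hj in enumerate(h) if hj > 0.0]
silent_fraction = 1.0 - len(active) / d_hidden

# Dense path multiplies by every hidden neuron, including the zeros.
y_dense = matvec(W_out, h)
# Sparse path touches only the columns whose neuron fired.
y_sparse = [sum(row[j] * h[j] for j in active) for row in W_out]

print(f"silent neurons: {silent_fraction:.1%}")
print("max |dense - sparse|:",
      max(abs(a - b) for a, b in zip(y_dense, y_sparse)))
```

The sparse path does the same arithmetic on far fewer terms. The hard part, which this sketch does not address, is making that irregular access pattern fast on a GPU.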

Outline

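§ Background and motivation
Modern large language models are naturally highly sparse (over 95% of neurons stay silent), but conventional hardware cannot exploit that sparsity efficiently, so compute is wasted.
§ Core technical approach
The paper proposes the TwELL sparse packing format and dedicated fused CUDA kernels that fit the execution pipelines of modern NVIDIA GPUs, enabling efficient sparse computation.
· How the sparsity is induced
Simple L1 regularization is enough to induce over 99% sparsity, with negligible impact on downstream task performance.
· Performance results
At scale, the approach delivers 20%+ inference and training speedups, along with substantial energy-efficiency and memory-usage gains.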
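
The L1 claim can be illustrated with a toy example. The sketch below assumes the penalty is applied to the hidden activations (the post does not say where it is applied) and uses a hypothetical coefficient `lam`. A squared-error term stands in for the real language-modeling loss. It is a sketch of the general mechanism, not the paper's training recipe.

```python
def relu(z):
    return [max(0.0, v) for v in z]

def sparsity(h):
    # Fraction of exactly-zero (silent) activations.
    return sum(1 for v in h if v == 0.0) / len(h)

def train(targets, lam, lr=0.1, steps=200):
    # Gradient descent on pre-activations z for the toy objective
    #   sum_j 0.5 * (h_j - t_j)^2  +  lam * sum_j |h_j|,   h = relu(z).
    # `lam` is a hypothetical L1 coefficient; the squared error is a
    # stand-in for the task loss.
    z = list(targets)  # start at the unregularized optimum
    for _ in range(steps):
        for j, t in enumerate(targets):
            if z[j] > 0.0:  # ReLU passes gradient only while active
                z[j] -= lr * ((z[j] - t) + lam)
    return relu(z)

# 95 neurons the task barely needs, 5 neurons it needs strongly.
targets = [0.05] * 95 + [1.0] * 5

h_plain = train(targets, lam=0.0)
h_l1 = train(targets, lam=0.2)

print("sparsity without L1:", sparsity(h_plain))
print("sparsity with L1:   ", sparsity(h_l1))
```

In this toy setup the penalty drives every weakly useful neuron to exactly zero and only shrinks the strongly useful ones slightly. That is the qualitative behavior the outline describes.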

Mind map

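Sparse Transformer optimization: TwELL and GPU acceleration
- Core challenges
  - 95%+ of neurons in LLM feedforward layers stay silent
  - Conventional hardware does not support sparse computation efficiently
- Technical approach
  - TwELL sparse packing format
  - Fused CUDA kernel design
- Key results
  - 99%+ sparsity (L1 regularization)
  - 20%+ inference/training speedups
  - Better energy and memory efficiency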

Highlights

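"Over 95% of neurons in feedforward layers stay silent for any given word, but our hardware punishes them for it."
Source: quote from hardmaru
"Simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance."
Source: image description (paper abstract)
"TwELL sparse packing and fused CUDA kernels bring 20%+ inference/training speedups at scale."
Source: NVIDIA AI post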

#Transformer #SparseComputing #NVIDIAGPU #LLMOptimization #ICML2026

Original post

Great collab with

on an #ICML26 paper about sparse transformer kernels + formats optimized for modern NVIDIA GPU execution.
• TwELL sparse packing
• Fused CUDA kernels
• 20%+ inference/training speedups at scale
Paper + code below 👇

Quote: hardmaru (@hardmaru), 12h

The human brain 🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it. One of the most x.com/SakanaAILabs/s…

Image: Sparser, Faster, Lighter Transformer Language Models

Scaling autoregressive LLMs has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale.
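
Neither the post nor the abstract describes the actual TwELL layout, so the sketch below does not attempt to reproduce it. It only illustrates what a "sparse packing format" does in general, using ordinary ELLPACK-style packing: each row of a sparse activation matrix is stored as a fixed-width list of nonzero values and their column indices, padded to the widest row. The function names `ell_pack` and `ell_matmul` are hypothetical, and the code is plain Python rather than a CUDA kernel.

```python
def ell_pack(dense_rows, pad_col=-1):
    # Pack each row into `width` (value, column) slots, where `width`
    # is the largest number of nonzeros found in any row.
    width = max(sum(1 for v in row if v != 0.0) for row in dense_rows)
    values, cols = [], []
    for row in dense_rows:
        nz = [(j, v) for j, v in enumerate(row) if v != 0.0]
        pad = width - len(nz)
        cols.append([j for j, _ in nz] + [pad_col] * pad)
        values.append([v for _, v in nz] + [0.0] * pad)
    return values, cols, width

def ell_matmul(values, cols, W):
    # Computes H @ W from the packed form of H.
    # W has one row per hidden neuron (column of H).
    d_out = len(W[0])
    out = []
    for vals, idx in zip(values, cols):
        y = [0.0] * d_out
        for v, j in zip(vals, idx):
            if j < 0:
                break  # padding slots always come last
            for k in range(d_out):
                y[k] += v * W[j][k]
        out.append(y)
    return out

# Three tokens, six hidden neurons, mostly silent.
H = [
    [0.0, 0.0, 1.5, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
     [2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]

values, cols, width = ell_pack(H)
out = ell_matmul(values, cols, W)
print(cols)  # [[2, -1], [1, 5], [-1, -1]]
print(out)   # [[1.5, 1.5], [0.5, 1.5], [0.0, 0.0]]
```

The fixed row width is what makes ELL-style layouts attractive on GPUs, since every row has the same regular shape. The cost is padding when rows have very different nonzero counts.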