NVIDIA AI(@NVIDIAAI)

NVIDIA and SakanaAILabs release ICML 2026 paper on sparse Transformer optimization

Score: 8.7
TL;DR · AI Summary

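NVIDIA and SakanaAILabs collaborated on an ICML 2026 paper that introduces the TwELL sparse packing format and fused CUDA kernels, reporting 20%+ inference and training speedups at scale.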

Key points

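  • The TwELL sparse packing format is designed to fit the execution pipelines of modern NVIDIA GPUs, so that highly sparse feedforward layers can be computed efficiently
  • Fused CUDA kernels deliver 20%+ inference and training speedups at scale
  • More than 95% of feedforward (FFN) neurons already stay silent for any given token; simple L1 regularization pushes sparsity above 99% with negligible impact on downstream performance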
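
The "silent neuron" statistic in the key points can be made concrete with a small measurement. The following is a minimal sketch, not the paper's code: it assumes a ReLU feedforward layer (so a silent neuron is an activation of exactly zero), and the names `ffn_hidden`, `silent_fraction`, the toy sizes, and the negative bias are all hypothetical choices made only for illustration.

```python
import random


def relu(x):
    """ReLU: negative pre-activations become exactly zero (a silent neuron)."""
    return x if x > 0.0 else 0.0


def ffn_hidden(x, w_up, b_up):
    """Hidden activations h = ReLU(W_up x + b_up) for a single token vector x."""
    return [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_up, b_up)
    ]


def silent_fraction(h):
    """Fraction of hidden neurons whose activation is exactly zero."""
    return sum(1 for v in h if v == 0.0) / len(h)


random.seed(0)
d_model, d_ff = 8, 64  # toy sizes, far smaller than any real LLM
x = [random.gauss(0, 1) for _ in range(d_model)]
w_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
# Hypothetical negative bias, standing in for whatever training produces;
# it only serves to make most neurons silent in this toy example.
b_up = [-4.0] * d_ff

h = ffn_hidden(x, w_up, b_up)
print(f"silent neurons for this token: {silent_fraction(h):.1%}")
```

In a real model the same ratio would be averaged over many tokens and layers; the random weights here say nothing about the sparsity levels reported in the paper.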

Outline

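  1. Modern large language models are naturally highly sparse (more than 95% of feedforward neurons stay silent for any given token), but conventional hardware cannot exploit that sparsity efficiently, so compute is wasted.

  2. The paper proposes the TwELL sparse packing format together with dedicated fused CUDA kernels that fit the execution pipelines of modern NVIDIA GPUs, enabling efficient sparse computation.

  3. Simple L1 regularization is enough to induce over 99% sparsity, with negligible impact on downstream task performance.

  4. At scale, the approach delivers 20%+ inference and training speedups, along with energy-efficiency and memory-usage benefits that grow with model size.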
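
To illustrate why a packing format helps (outline item 2), here is a minimal sketch of the general idea. The source does not describe TwELL's actual layout, so this uses a generic ELL-style (ELLPACK) packing as a stand-in: each row keeps a fixed number of (column index, value) slots, padded where a row has fewer nonzeros. The function names and the `-1` padding marker are assumptions for illustration, and plain Python loops stand in for what would be a fused GPU kernel.

```python
def pack_ell(h_rows, width):
    """Pack each row's nonzeros into `width` (index, value) slots.

    Unused slots are padded with column index -1 and value 0.0.
    """
    cols, vals = [], []
    for row in h_rows:
        nz = [(j, v) for j, v in enumerate(row) if v != 0.0]
        if len(nz) > width:
            raise ValueError("row has more nonzeros than the packing width")
        pad = width - len(nz)
        cols.append([j for j, _ in nz] + [-1] * pad)
        vals.append([v for _, v in nz] + [0.0] * pad)
    return cols, vals


def sparse_down_proj(cols, vals, w_down):
    """Compute y = h @ W_down touching only the packed nonzeros.

    w_down is a list of d_ff rows, each of length d_model.
    """
    d_model = len(w_down[0])
    out = []
    for row_cols, row_vals in zip(cols, vals):
        y = [0.0] * d_model
        for j, v in zip(row_cols, row_vals):
            if j < 0:  # padding slot, nothing to accumulate
                continue
            w_row = w_down[j]
            for k in range(d_model):
                y[k] += v * w_row[k]
        out.append(y)
    return out


def dense_down_proj(h_rows, w_down):
    """Reference dense product, visiting every hidden neuron."""
    d_model = len(w_down[0])
    return [
        [sum(h[j] * w_down[j][k] for j in range(len(h))) for k in range(d_model)]
        for h in h_rows
    ]


# Two tokens, four hidden neurons, one active neuron per token.
h_rows = [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]
w_down = [[1.0, 0.0], [0.5, 1.0], [2.0, 2.0], [0.0, -1.0]]

cols, vals = pack_ell(h_rows, width=1)
y_sparse = sparse_down_proj(cols, vals, w_down)
y_dense = dense_down_proj(h_rows, w_down)
print(y_sparse == y_dense)
```

The sparse path reads one row of `w_down` per active neuron instead of all of them, which is where the savings come from when nearly all activations are zero. The fixed width keeps the layout regular, which is the usual motivation for ELL-style formats on GPUs.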

Mind map

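  • Sparse Transformer optimization: TwELL and GPU acceleration
    • Core challenges
      • 95%+ of neurons in LLM feedforward layers stay silent
      • Conventional hardware does not support sparse computation efficiently
    • Technical approach
      • TwELL sparse packing format
      • Fused CUDA kernel design
    • Key results
      • 99%+ sparsity (via L1 regularization)
      • 20%+ inference/training speedups
      • Improved energy and memory efficiency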

Highlights

  • More than 95% of feedforward-layer neurons stay silent for any given word, but existing hardware punishes this sparsity.

    Quote from hardmaru
  • L1 regularization can induce over 99% sparsity with negligible impact on downstream performance.

    Image description
  • TwELL sparse packing and fused CUDA kernels deliver 20%+ inference and training speedups at scale.

    NVIDIA AI post

#Transformer #SparseComputing #NVIDIA GPU #LLMOptimization #ICML2026

Great collab with

on an #ICML26 paper about sparse transformer kernels + formats optimized for modern NVIDIA GPU execution.
• TwELL sparse packing
• Fused CUDA kernels
• 20%+ inference/training speedups at scale
Paper + code below 👇

Quote

hardmaru

@hardmaru

12h

The human brain 🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it. One of the most x.com/SakanaAILabs/s…

Image 4: Sparser, Faster, Lighter Transformer Language Models

Scaling autoregressive LLMs has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale.
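
The abstract names "simple L1 regularization" without giving a formula. One common way to write such an objective is sketched below, purely as an assumption: it supposes the penalty is applied to the feedforward hidden activations $h_{\ell,t}$ (layer $\ell$, token $t$), averaged over $T$ tokens and summed over $L$ layers, with a coefficient $\lambda$. None of these symbols or choices come from the source. The second expression defines per-layer sparsity as the fraction of exactly-zero activations, with $d_{\mathrm{ff}}$ the hidden width.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{LM}}(\theta)
  + \lambda \sum_{\ell=1}^{L} \frac{1}{T} \sum_{t=1}^{T} \lVert h_{\ell,t} \rVert_1,
\qquad
\mathrm{sparsity}_{\ell}
  = 1 - \frac{1}{T\, d_{\mathrm{ff}}} \sum_{t=1}^{T} \lVert h_{\ell,t} \rVert_0
```

Under this reading, "over 99% sparsity" would mean $\mathrm{sparsity}_{\ell} > 0.99$, that is, fewer than one in a hundred hidden activations is nonzero for an average token.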
