cohere(@cohere)

Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4...

Score: 9.0

TL;DR · AI Summary

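Cohere has built production-ready W4A8 inference and integrated it into vLLM, reporting faster prefill and decoding than W4A16 on Hopper GPUs.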

Key points

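  • Combines 4-bit weights (low memory) with 8-bit activations (high compute) to balance memory use and compute throughput.
  • Compared with W4A16 on Hopper: up to 58% faster TTFT and 45% faster TPOT.
  • The optimization is integrated into the open-source vLLM project.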
#InferenceOptimization #vLLM #Cohere #MachineLearning

Cohere on X: "Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper. https://t.co/M37wT5KS8Z" / X


Cohere

@cohere

Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper.
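To make "4-bit weights with 8-bit activations" concrete, here is a minimal pure-Python sketch of a W4A8-style dot product. It is an illustration only, not Cohere's or vLLM's kernel. It assumes symmetric integer quantization with a single scale per vector, whereas production kernels may use different 8-bit formats (for example FP8), group-wise scales, and fused GPU code. The helper names `quantize_symmetric` and `w4a8_dot` are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Symmetric linear quantization to signed integers with one shared scale.

    Assumption for illustration: one scale per vector; real kernels often
    use per-channel or per-group scales.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed range.
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def w4a8_dot(weights, activations):
    """Dot product with 4-bit weights and 8-bit activations.

    The multiply-accumulate runs on small integers; the two scales are
    applied once at the end to return to floating point.
    """
    qw, w_scale = quantize_symmetric(weights, bits=4)
    qa, a_scale = quantize_symmetric(activations, bits=8)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * w_scale * a_scale


weights = [0.12, -0.50, 0.33, 0.07]
activations = [1.5, -2.0, 0.25, 3.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w4a8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The weights occupy a quarter of the storage of 16-bit values, which matters most when decoding is memory-bound, while the 8-bit activations let the multiply-accumulate run at reduced precision, which matters most when prefill is compute-bound. The price is a small approximation error relative to the floating-point result.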


8:38 PM · Apr 22, 2026


5,241 Views


