Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4...

cohere (@cohere) · April 22, 2026

Score: 9.0

TL;DR · AI summary

Cohere has built production-ready W4A8 inference and integrated it into vLLM, with substantial latency gains over W4A16.

Key points

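Combines 4-bit weights (low memory) with 8-bit activations (high compute) to balance memory and compute.
Up to 58% faster TTFT (time to first token) and 45% faster TPOT (time per output token) than W4A16 on Hopper.
The optimization is integrated into the open-source vLLM project.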
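
As a rough illustration of the memory side of the first point, the sketch below compares weight storage at 16 bits and at 4 bits per weight. The 7-billion-parameter count, the group size of 128, and the 16-bit per-group scale are hypothetical values chosen for the arithmetic, not figures from Cohere's post.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 7_000_000_000  # hypothetical model size
GROUP_SIZE = 128            # hypothetical: one scale shared by 128 weights
SCALE_BITS = 16             # hypothetical: each scale stored in 16 bits

# 16-bit weights need no extra quantization metadata.
w16 = weight_bytes(NUM_PARAMS, 16)

# 4-bit weights plus the amortized cost of the per-group scales.
w4 = weight_bytes(NUM_PARAMS, 4 + SCALE_BITS / GROUP_SIZE)

print(f"16-bit weights: {w16 / 1e9:.2f} GB")
print(f" 4-bit weights: {w4 / 1e9:.2f} GB")
print(f"reduction: {w16 / w4:.2f}x")
```

Under these assumptions the weights shrink by a little under 4x. Only weight storage is counted here; activations and the KV cache are not affected by the weight bit width.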

#InferenceOptimization #vLLM #Cohere #MachineLearning

Cohere

@cohere

Excited to share our work on production-ready W4A8 inference, now integrated in vLLM! By combining 4-bit weights (low memory) with 8-bit activations (high compute), we hit the sweet spot for both decoding and prefill — up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper. https://t.co/M37wT5KS8Z

8:38 PM · Apr 22, 2026 · 5,241 Views
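
To make "4-bit weights with 8-bit activations" concrete, here is a minimal pure-Python sketch of a W4A8-style dot product. It assumes symmetric integer quantization with a single scale per vector, which is a simplification for illustration. The post does not say which 8-bit format or scaling scheme is used, and this is not Cohere's or vLLM's kernel. The function names `quantize_symmetric` and `w4a8_dot` are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Quantize floats to signed integers sharing one scale.

    Integers fall in [-(2**(bits-1) - 1), 2**(bits-1) - 1],
    i.e. [-7, 7] for 4 bits and [-127, 127] for 8 bits.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def w4a8_dot(weights, activations):
    """Dot product with 4-bit weights and 8-bit activations (illustrative)."""
    qw, w_scale = quantize_symmetric(weights, bits=4)
    qa, a_scale = quantize_symmetric(activations, bits=8)
    # Accumulate in integers, then rescale once at the end.
    acc = sum(w * a for w, a in zip(qw, qa))
    return acc * w_scale * a_scale


weights = [0.12, -0.40, 0.33, 0.05]
activations = [1.5, -2.0, 0.25, 3.0]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w4a8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f} error={abs(exact - approx):.4f}")
```

The integer accumulation followed by one rescale is the part that lets hardware use low-precision arithmetic for the bulk of the work; a production kernel would typically use finer-grained scales than one per vector.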
