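What's new with SWE-bench Pro?

traeai has indexed 9 items related to SWE-bench Pro. The most recent is "Claude Fable 5's money-saving trick: set it to Low and it costs less than Opus", published by 量子位 (QbitAI).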

Concept

SWE-bench Pro

Q: What is SWE-bench Pro?

A benchmark for evaluating the coding ability of models.

Alias: SWE-bench

9 highly relevant items tracked

TraeAI Observations

If you read only three

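Claude Fable 5's money-saving trick: set it to Low and it costs less than Opus

量子位 (QbitAI) · score 8.5

At its Low setting, Claude Fable 5 outperforms Opus 4.8 and costs less on complex tasks.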

Claude Fable 5 thinks document parsing is beneath it It is absolutely crushing on all reasoning-int...

Jerry Liu (@jerryjliu0) · score 8.5

Claude Fable 5 excels at reasoning tasks, but on document parsing it only matches Gemini 3 Flash while costing 10-15 times as much.

There are no shortcuts to the frontier. Disciplined, patient, meticulous attention to detail is crit...

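Mustafa Suleyman (@mustafasuleyman) · score 8.5

Microsoft released seven models, including MAI-Thinking-1. The reasoning model scores 53% on SWE-Bench Pro, on par with Opus 4.6, and the transcription model MAI-Transcribe-1.5 supports 43 languages with a 5x speedup.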

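Claude Fable 5's money-saving trick: set it to Low and it costs less than Opus

量子位 (QbitAI) · June 11 · 2,414 characters (about 10 minutes)

At its Low setting, Claude Fable 5 outperforms Opus 4.8 and costs less on complex tasks.

Why it was selected: At its Low setting, Fable 5 outperforms Opus 4.8.

Featured article · #Claude · #AI models · #Cost optimization · Chinese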

Claude Fable 5 thinks document parsing is beneath it It is absolutely crushing on all reasoning-int...

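Jerry Liu (@jerryjliu0) · June 10 · 281 words (about 2 minutes)

Claude Fable 5 excels at reasoning tasks, but on document parsing it only matches Gemini 3 Flash while costing 10-15 times as much.

Why it was selected: Claude Fable 5 performs strongly on reasoning tasks such as SWE-Bench Pro.

Featured tweet · #Claude Fable 5 · #Gemini 3 Flash · #Document parsing · #AI models · Mixed Chinese and English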

There are no shortcuts to the frontier. Disciplined, patient, meticulous attention to detail is crit...

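Microsoft releases the MAI model family: Thinking-1 matches Opus 4.6 on reasoning, Transcribe-1.5 transcribes 5x faster

Mustafa Suleyman (@mustafasuleyman) · June 5 · 419 words (about 2 minutes)

Microsoft released seven models, including MAI-Thinking-1. The reasoning model scores 53% on SWE-Bench Pro, on par with Opus 4.6, and the transcription model MAI-Transcribe-1.5 supports 43 languages with a 5x speedup.

Why it was selected: MAI-Thinking-1 scores 53% on SWE-Bench Pro, placing it alongside Opus 4.6 in the top tier for coding and reasoning.

Featured tweet · #MAI-Thinking-1 · #SWE-Bench · #Microsoft AI · #Multimodal models · English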

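MiniMax releases open-source M3: the first frontier model to combine coding, agentic, and long-context capabilities

OpenRouter (@OpenRouterAI) · June 1 · 82 words (about 1 minute)

MiniMax has launched the open-source M3 model, the first to combine coding, agentic, and long-context capabilities. It scores 59%+ on benchmarks including SWE-Bench Pro and supports a 1M-token context window, pushing open-source large models toward the versatile frontier.

Why it was selected: MiniMax M3 achieves 59.0% on the SWE-Bench Pro benchmark, ahead of most open-source models.

Featured tweet · #Open-source models · #Large language models · #Coding ability · #Long context · #MiniMax · English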

Super excited to announce seven new world-class MAI models today. They represent what we consider a ...

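Mustafa Suleyman announces seven new MAI models

Mustafa Suleyman (@mustafasuleyman) · June 2 · 448 words (about 2 minutes)

Mustafa Suleyman announced seven new MAI models, including MAI-Thinking-1, MAI-Image-2.5, and MAI-Code-1-Flash, which perform strongly in areas such as reasoning, image editing, and code generation.

Why it was selected: MAI-Thinking-1 is a 35B-parameter MoE model that reaches 97% accuracy on AIME 2025, ahead of Sonnet 4.6.

Featured tweet · #AI · #Models · #Microsoft · #MAI · #Chips · English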

New open model: MiniMax M3 by @MiniMax_AI is live in the Arena!

Find it across Text, Vision, Docume...

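New open model: MiniMax M3 is now live in the Arena!

lmarena.ai (@lmarena_ai) · June 1 · 124 words (about 1 minute)

MiniMax M3 is the first open-weight model to support text, vision, document, and code tasks together. It performs strongly on benchmarks including SWE-Bench Pro and has a context length of 1M tokens.

Why it was selected: MiniMax M3 reaches 59.0% on SWE-Bench Pro and 66.0% on Terminal Bench 2.1, making it one of the strongest open-source models for coding.

Featured tweet · #MiniMax · #Open-source models · #Multimodal · #SWE-Bench · English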

.@MiniMax_AI M3 model is available on Ollama's Cloud!

In partnership with MiniMax, the M3 model on...

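The MiniMax M3 model is now live on Ollama Cloud!

ollama (@ollama) · June 1 · 153 words (about 1 minute)

The MiniMax M3 model is now available through Ollama Cloud, with US-based deployment and zero data retention. Designed for coding and agentic tasks, it scores 59%+ on the SWE-Bench Pro benchmark and uses sparse attention to reach a 1M-token context length.

Why it was selected: M3 achieves 59.0% on the SWE-Bench Pro benchmark, ahead of most open-source models.

Featured tweet · #M3 · #Ollama · #MiniMax · #Coding AI · #Agentic AI · English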

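Auggie vs. Claude Code benchmark: higher quality at a 33% cost advantage

Augment Code (@augmentcode) · May 20 · 890 words (about 4 minutes)

A benchmark published by Augment Code shows its AI coding assistant Auggie, running on the Opus 4.7 model, achieving a 67.4% pass rate, slightly above Claude Code's 66.3%, at roughly 33% lower cost. The gain is credited mainly to the precise retrieval and token efficiency enabled by semantic indexing in its Context Engine.

Why it was selected: On Terminal Bench 2.0, Auggie narrowly beats Claude Code with a 67.4% vs. 66.3% pass rate, while using 32% fewer tokens and costing 33% less.

Featured tweet · #AI coding assistants · #Benchmarks · #Cost optimization · #Token efficiency · #Augment Code · English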

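Ollama launches GLM-5.1

ollama (@ollama) · May 15 · 66 words (about 1 minute)

Ollama launches GLM-5.1, a new-generation flagship model with markedly improved code generation.

Why it was selected: GLM-5.1 is a new-generation flagship model now offered through Ollama.

Featured tweet · #AI models · #Code generation · #Ollama · English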

Cross-source Q&A · SWE-bench Pro

Answers are based on: 9 items related to SWE-bench Pro