The setup: ✔️40 hand-selected PRs from OpenClaw, mid-complexity (100–300 LOC excluding tests) ✔️ Thr...

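TL;DR · AI Summary
This post discloses only the initial configuration of a code-generation evaluation (40 PRs, 3 runners, 2 prompt variants, and so on). It gives no results, analysis, or methodological detail, so it is low in information density and reads as a teaser fragment.
Key points
- The experiment uses 40 mid-complexity OpenClaw PRs as test cases
- It compares three runners (Auggie, Claude Code, and Codex) under two prompt documents
- An LLM judge scores each run on five dimensions: completeness, correctness, best practices, code reuse, and unsolicited documentation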
Augment Code on X: "@karpathy @jiayuan_jy @openclaw The setup: ✔️40 hand-selected PRs from OpenClaw, mid-complexity (100–300 LOC excluding tests) ✔️ Three runners: Auggie on Opus 4.7, Claude Code on Opus 4.7, Codex on GPT-5.4 ✔️ Two variants per PR: baseline AGENTS.md (~18K chars) vs. AGENTS-karpathy.md (~20.5K chars) ✔️ 6 runs" / X

The setup: 40 hand-selected PRs from OpenClaw, mid-complexity (100–300 LOC excluding tests)
Three runners: Auggie on Opus 4.7, Claude Code on Opus 4.7, Codex on GPT-5.4
Two variants per PR: baseline AGENTS.md (~18K chars) vs. AGENTS-karpathy.md (~20.5K chars)
6 runs per config, total 18 repeats per individual PR
Scored by an LLM judge on completeness, correctness, best practices, code reuse, and unsolicited documentation
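
To make the run counts concrete, here is a minimal Python sketch that enumerates the experiment matrix described in the post. It assumes a "config" means one (runner, variant) pair, which the post does not spell out. The names `run_matrix`, `RUNNERS`, and `VARIANTS` are hypothetical and are not taken from any Augment Code tooling.

```python
from itertools import product

# Values taken from the post.
RUNNERS = ("Auggie on Opus 4.7", "Claude Code on Opus 4.7", "Codex on GPT-5.4")
VARIANTS = ("AGENTS.md", "AGENTS-karpathy.md")
NUM_PRS = 40

# Assumption: "6 runs per config" means 6 runs per (runner, variant) pair.
RUNS_PER_CONFIG = 6


def run_matrix(num_prs=NUM_PRS, runs=RUNS_PER_CONFIG):
    """Return every (pr_index, runner, variant, run_index) combination."""
    return list(product(range(num_prs), RUNNERS, VARIANTS, range(runs)))


all_runs = run_matrix()

# Runs for one PR under one variant: 3 runners x 6 runs.
per_pr_per_variant = len(RUNNERS) * RUNS_PER_CONFIG

# Runs for one PR across both variants.
per_pr = per_pr_per_variant * len(VARIANTS)

print(per_pr_per_variant, per_pr, len(all_runs))
```

Under this reading, the post's figure of 18 repeats matches the runs for one PR under one variant (3 runners × 6 runs). Counting both variants gives 36 runs per PR. The post does not say which count it means, so treat this breakdown as one plausible interpretation.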