What the Parameter Golf Challenge Taught Us

OpenAI Blog · May 12, 2026

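TL;DR · AI summary

The Parameter Golf challenge showed the potential of AI-assisted research across training optimization, quantization, test-time strategies, and new modeling ideas.

Key takeaways

More than 1,000 participants submitted over 2,000 entries, showing broad technical creativity.
AI coding agents lowered the cost of experimentation, letting more people take part.
The challenge became an effective way to discover talent, revealing exceptional machine learning taste and persistence.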

We launched Parameter Golf to engage and support the machine learning research community in exploring a new, tightly constrained machine learning problem. We wanted the challenge to be interesting enough to reward real technical creativity, while remaining conceptually simple and easy to verify.

Participants had to minimize held-out loss on a fixed FineWeb dataset while staying within a 16 MB artifact limit, including both model weights and training code, and a 10-minute training budget on 8×H100s. We provided a baseline, dataset, and evaluation scripts so participants could fork the repo, improve the model, and submit their results through GitHub.
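The artifact limit described above can be checked mechanically before submitting. The following is a minimal sketch, not the challenge's official verification script: the function names are hypothetical, and it assumes "16 MB" means 16,000,000 bytes, which the official rules may define differently.

```python
from pathlib import Path

# Assumption: "16 MB" is read here as 16 * 1000 * 1000 bytes; the official
# rules may define it differently (for example as 16 * 1024 * 1024).
ARTIFACT_LIMIT_BYTES = 16 * 1000 * 1000


def artifact_size_bytes(paths):
    """Sum the on-disk size of every file that counts toward the artifact."""
    return sum(Path(p).stat().st_size for p in paths)


def within_budget(paths, limit=ARTIFACT_LIMIT_BYTES):
    """Return (ok, total_bytes) for a hypothetical list of artifact files."""
    total = artifact_size_bytes(paths)
    return total <= limit, total
```

Because the limit covers both model weights and training code, the list passed to `within_budget` would need to include every file of both kinds.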

Over the course of eight weeks, we received more than 2,000 submissions from over 1,000 participants. We were impressed by the technical breadth, creativity, and rule-bending across the submissions, from careful optimizer tuning and quantization work to new modeling ideas and test-time training.

One of the most exciting parts of the challenge was seeing how widely participants used AI coding agents. Agents helped lower the cost of experimentation, made it easier for more people to participate, and changed the pace of the competition. They also created new challenges for submission review, attribution, and scoring.

The challenge also became a meaningful talent discovery surface for us. That was one of our goals for Parameter Golf, and it was a useful signal that open-ended technical challenges can reveal exceptional machine learning taste and persistence.

In this post, we highlight some of the submissions we found surprising and interesting, and share what we learned from running a coding contest in the age of powerful AI agents.

Technical impressions

We judged and independently reproduced each submission on the record-track leaderboard, and verified that each submission was record-breaking at the time it was submitted. Several themes stood out.

_Training optimization_

Some of the strongest results came from careful tuning of existing components.

| Submission | Contributor | Technique | Why it mattered |
| --- | --- | --- | --- |
| #60 | @notapplica | Combined prior wins from #50, #42, and likely #39, then made a deeper model work with Muon weight decay, spectral embedding initialization, residual-mix scheduling, and compiled evaluation. | A strong example of disciplined leaderboard work: identifying which existing improvements matter and combining them cleanly. |

_Quantization_

Several submissions pushed hard on compression and export.

| Submission | Contributor | Technique | Why it mattered |
| --- | --- | --- | --- |
| #414 | @signalrush | Used GPTQ-lite to quantize weights after training. | The first leaderboard submission to successfully use GPTQ-lite, leading to better evaluation. |
| #1060 | @dexhunter | Built on #634 by @raahilshah to successfully use full Hessian GPTQ. | Extended earlier quantization work into a stronger compression path. |
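For context on what post-training quantization does to a weight matrix, here is a minimal sketch of symmetric round-to-nearest quantization applied to one row of weights. This is deliberately the simpler baseline, not GPTQ-lite or full Hessian GPTQ: GPTQ-style methods additionally use second-order information from calibration activations to compensate for rounding error. The function names and the 8-bit default are illustrative assumptions, not taken from any submission.

```python
def quantize_row(row, bits=8):
    """Symmetric round-to-nearest quantization of one weight row.

    Returns integer codes plus the scale needed to map them back to floats.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    # Round each weight to the nearest grid point and clamp to the code range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]
```

With this scheme, each restored weight differs from the original by at most half a quantization step (`scale / 2`). Storing the codes in fewer bits is what shrinks the exported artifact.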

_Test-time and evaluation strategies_

Some submissions pushed the boundary between model improvement and evaluation strategy. These approaches were valid under the rules, but they required careful review from us as organizers.

| Submission | Contributor | Technique | Why it mattered |
| --- | --- | --- | --- |
| #77 | @samacqua | Used score-first, per-document LoRA test-time training: score first, adapt only on already-scored chunks, and reset at document boundaries. | Pushed the boundary between model improvement and evaluation strategy while staying reviewable under the rules. |
| #1019 | @abaybektursun | Used self-generated GPTQ calibration: generate calibration text from the trained model, then build GPTQ Hessians from those activations. | A creative calibration strategy that required careful review from organizers. |
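The ordering constraint in score-first test-time training (score a chunk, only then adapt on it, and reset at each document boundary) can be illustrated with a toy model. The sketch below is an assumption-laden stand-in: it replaces the transformer and its LoRA adapter with a simple adaptive byte-frequency model, and "reset" means starting from a fresh model rather than restoring adapter weights. None of the names come from the submission.

```python
import math
from collections import Counter


class ByteUnigram:
    """Toy stand-in for a language model: an adaptive byte-frequency model."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score_bits(self, chunk: bytes) -> float:
        """Bits needed to code `chunk` under the current, unchanged counts."""
        # Add-one smoothing over the 256 possible byte values.
        return sum(
            -math.log2((self.counts[b] + 1) / (self.total + 256)) for b in chunk
        )

    def adapt(self, chunk: bytes) -> None:
        """Update the model on a chunk that has already been scored."""
        self.counts.update(chunk)
        self.total += len(chunk)


def score_first_eval(documents, chunk_size=64):
    """Return bits per byte, never adapting on a chunk before scoring it."""
    total_bits, total_bytes = 0.0, 0
    for doc in documents:
        model = ByteUnigram()                      # reset at the document boundary
        for start in range(0, len(doc), chunk_size):
            chunk = doc[start:start + chunk_size]
            total_bits += model.score_bits(chunk)  # 1) score first
            model.adapt(chunk)                     # 2) then adapt on the scored chunk
            total_bytes += len(chunk)
    return total_bits / total_bytes if total_bytes else 0.0
```

Because every chunk is scored before the model has seen it, the reported loss never benefits from information about the bytes being predicted, and the per-document reset means one document cannot influence the score of the next.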

_New modeling and data ideas_

A few submissions introduced modeling or data ideas that were especially creative.

| Submission | Contributor | Technique | Why it mattered |
| --- | --- | --- | --- |
| #1729 | @romeerp | Introduced the CaseOps tokenizer: lossless capitalization operator tokens with original-byte BPB sidecar accounting. | A creative tokenizer and data-representation idea. |
| #265 | @unnir | Introduced XSA, an efficient partial Exclusive Self Attention approach with GQA-aware grouped views. | Brought an efficient attention variant into the challenge. |
| #65 | @aquariouseworkman | Introduced SmearGate and BigramHash: a learned previous-token embedding blend plus adjacent-token-pair hash features. | Added new feature mechanisms from scratch. |
| #1204 | @msisovic | Introduced mini depth recurrence: repeated layers 4 and 5, delayed recurrence until mid-training, and partially untied the repeated MLPs. | The first accepted leaderboard row to make recurrent layers work effectively. |
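To make the idea of lossless capitalization operators concrete, here is a rough sketch of how such a transform could work. It is an illustration under stated assumptions, not the CaseOps implementation: the marker characters, the restriction to ASCII words, and the function names are all hypothetical. The `bits_per_byte` helper reflects one reading of "original-byte accounting", namely that loss is normalized by the byte count of the original text rather than the transformed text.

```python
import math
import re

# Hypothetical operator markers (private-use code points), not the
# submission's actual tokens. Inputs are assumed not to contain them.
CAP = "\ue000"    # next word: first letter upper-case, rest lower-case
UPPER = "\ue001"  # next word: all letters upper-case

_WORD = re.compile(r"[A-Za-z]+")
_OP = re.compile("([\ue000\ue001])([a-z]+)")


def encode_case(text: str) -> str:
    """Lower-case simple ASCII words, recording their case as a marker."""
    if CAP in text or UPPER in text:
        raise ValueError("input already contains a reserved marker")

    def repl(m):
        w = m.group(0)
        if w[0].isupper() and w[1:] == w[1:].lower():
            return CAP + w.lower()
        if w.isupper():
            return UPPER + w.lower()
        return w  # all lower-case or mixed case: leave untouched

    return _WORD.sub(repl, text)


def decode_case(encoded: str) -> str:
    """Invert encode_case exactly by applying each marker to its word."""

    def repl(m):
        op, w = m.group(1), m.group(2)
        return w.capitalize() if op == CAP else w.upper()

    return _OP.sub(repl, encoded)


def bits_per_byte(total_nll_nats: float, original_text: str) -> float:
    """Normalize loss by the ORIGINAL UTF-8 byte count, not the encoded one."""
    n_bytes = len(original_text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)
```

The round trip `decode_case(encode_case(s)) == s` holds for any input that does not contain the reserved markers, which is what makes the transform lossless. Keeping the original byte count alongside the transformed text matters because the markers change the length of the encoded string, so dividing by the encoded length would misstate the score.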

We chose to highlight these nine submissions because they represent the range of results we hoped the challenge would surface. Some participants found wins through careful tuning. Others pushed quantization and low-rank techniques. Some explored edges of the evaluation rules. And several introduced modeling or data ideas, from the literature or from scratch, that produced unexpected gains.

The nonrecord track was home to many creative submissions. We highlighted 15 favorites, including approaches ranging from non-autoregressive text modeling to dynamic tokenization.

Because this track was more experimental, we focused less on raw performance and more on whether the approach was technically interesting. Three submissions stood out in particular:

These were our favorite three nonrecord submissions, even though they were not necessarily the top three by performance.

That said, the nonrecord track was still competitive. Half of nonrecord leaderboard entries beat the naive baseline of 1.22 bits per byte (BPB), and the top-ranked entry reached 1.12 BPB.

We found this encouraging. Even against strong transformer baselines, alternative approaches could sometimes hold their own against the dominant architecture.

We also think that this track benefits especially from the availability of strong coding agents. Agents made it much cheaper to prototype speculative ideas, including approaches that may previously have felt too time-consuming or uncertain to try in a short competition.

Takeaways

A major difference between Parameter Golf and earlier competitions like it was the widespread use of coding agents. The vast majority of submitters mentioned using agents as part of their work.

That lowered the barrier to entry. Participants could set up experiments faster, inspect unfamiliar code, and test ideas with less friction. RunPod’s sponsorship of $1,000,000 in compute also played a major role in making the challenge accessible to more people.

At the same time, agent use created new issues for submission and scoring. Many submissions were small changes to existing top scorers, rather than fundamentally new approaches. This was often useful: strong ideas spread quickly and were refined by others. But it also created noise. When submissions that fell outside the competition guidelines produced unusually strong scores, other agents sometimes copied those ideas and continued down the same invalid path.

The volume of submissions also changed how we had to run the competition. We could not manually inspect every submission and still keep the leaderboard moving. During the challenge, we developed an internal Codex-based triage bot to monitor new submissions and flag them for human review. This became especially important during periods when we received hundreds of submissions a day.

AI agents also became part of the community around the challenge. For much of the competition, @notapplica and their coding agent ran a “Live Updates” bulletin, tracking major events, explaining leaderboard approaches, and helping other participants follow the competition. Community review tools also appeared to help less experienced participants check whether their submissions were within the rules and avoid common invalid approaches.

What’s next?

Our primary goal was to launch a challenge that eligible participants could take part in to experience machine learning research. Parameter Golf brought in a wide range of technically strong and creative submissions, and it gave us a clearer view of how open research competitions may change as AI agents become more capable and widely used.