How DoorDash Built a Testing System to Evaluate LLMs

ByteByteGo Newsletter

ByteByteGo Newsletter, May 30, 2026


Score: 8.7

TL;DR · AI Summary

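DoorDash built a "simulation and evaluation flywheel" that simulates realistic multi-turn conversations offline and grades them automatically. It cut the time needed to fix hallucination problems in its LLM support chatbot from weeks to hours, which speeds up iteration and raises confidence in deployments.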

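Key Points

An offline simulator generates multi-turn conversation test scenarios without involving real users, which avoids risk in production.
An evaluation framework grades conversations automatically, so improvements can be quantified against specific criteria such as "no hallucination" (for example, as a pass rate).
A single iteration loop takes hours: write an evaluation → generate scenarios → run conversations → grade → adjust the prompt or architecture.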

Outline

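§Background: subtle hallucination risk in LLM customer support
DoorDash's support chatbot produced subtle hallucinations when its context was overloaded, such as misreading an order status and inventing a refund policy, which hurts the customer experience at a scale of hundreds of thousands of support contacts per day.
·The dilemma with traditional approaches: manual testing is slow, validating in production is risky
Every prompt change required manually testing dozens of conversation scenarios, which took weeks and could still miss problems; deploying directly put real customer experiences at risk.
·The core solution: the simulation and evaluation flywheel
The system combines an offline conversation simulator with an automated evaluation framework and can be triggered end to end: generate test scenarios from historical transcripts → run multi-turn conversations → grade automatically → feed the results back into improvements.
›Key component: evaluation definitions drive iteration
Engineers write an evaluation for a specific failure mode (such as "no hallucination"); the system runs the whole pipeline automatically and reports a pass rate as the basis for decisions.
·Results: hour-scale iteration and measurable improvement
With the flywheel, the team can complete multiple iterations within hours, and the pass rate serves as an objective exit criterion, which greatly increases deployment confidence and system stability.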

Mind Map


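DoorDash LLM testing flywheel
- Problem drivers
  - Subtle hallucinations: misreading context leads to invented policies
  - High volume: hundreds of thousands of support contacts per day
  - Traditional testing bottleneck: manual effort plus production risk
- System architecture
  - Offline simulator: generates multi-turn conversation scenarios
  - Evaluation framework: automatic grading (e.g., no-hallucination pass rate)
  - End-to-end pipeline: a single trigger runs the whole test chain
- Engineering practice
  - Evaluations as requirements: define the failure mode first, then develop
  - Pass rate as the exit criterion
  - Hour-scale iteration replaces weeks of manual testing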

Highlights

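LLM non-determinism means the same input can produce different outputs, which is the fundamental challenge of moving from deterministic decision trees to LLM-based systems.
(Paragraph 4)
The simulator generates test scenarios automatically from historical conversation transcripts, with no real users involved, so tests are both large in scale and realistic.
(Paragraph 6)
A single iteration loop consists of: write an evaluation → generate scenarios → run multi-turn conversations → grade automatically → adjust the prompt or architecture, and the whole loop can be triggered automatically.
(Paragraph 7)
The team uses the "no hallucination" pass rate as its key metric and deploys a change only after it reaches a preset threshold, which makes release decisions data-driven.
(Paragraph 8)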

#LLM #TestingSystems #DoorDash #AIEngineering #HallucinationDetection


How DoorDash Built a Testing System to Evaluate LLMs


DoorDash’s customer support chatbot had a hallucination problem. Not the dramatic kind where it invents entire conversations, but the subtler, harder-to-catch kind.

For example, the chatbot would look at a customer’s order history, see a delivery status field, misread it, and then confidently suggest a refund policy that didn't actually exist. The raw data was right there in the chatbot’s context window, the working memory where an LLM holds everything it needs to generate a response, but having too much information was making things worse.

For reference, DoorDash is one of the largest food delivery and local-commerce platforms in the United States, connecting customers with restaurants and stores through a network of independent delivery drivers called Dashers.

At that scale, the company handles hundreds of thousands of support contacts every day from customers, merchants, and Dashers, making automated support not just a nice-to-have but a necessity.

The team could see the problem clearly, but fixing it was a different story. They were stuck between two bad options. They could deploy changes to production and hope for the best, which meant risking real customer experiences. Or they could manually test dozens of conversation scenarios for every prompt change, which would take weeks and still might miss things.

This tension isn't unique to DoorDash. It’s the fundamental problem any team faces when it moves from traditional deterministic software to LLM-based systems. DoorDash used to run customer support on hand-built decision trees, where every change had a predictable, traceable impact. LLMs replaced that predictability with flexibility and more natural conversations, but they also introduced non-determinism, meaning the same input can produce different outputs each time.

DoorDash’s answer to this problem wasn't a better chatbot. It was a better system for improving the chatbot, something they call the simulation and evaluation flywheel. In this article, we will learn how they built this flywheel and the key takeaways.

(Disclaimer: This post is based on publicly shared details from the DoorDash Engineering Team. Please comment if you notice any inaccuracies.)

The flywheel has two interconnected pieces:

The first is an offline simulator that generates realistic multi-turn customer conversations without involving any real customers.

The second is an evaluation framework that automatically grades how the chatbot performed in those conversations.

Together, they create a tight iteration loop.

When the team notices a problem, they write an evaluation that captures the specific failure mode they want to fix. A single job trigger then runs the entire pipeline end-to-end, automatically generating test scenarios from historical transcripts, running multi-turn conversations between the simulator and the chatbot, and evaluating the results.
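The pipeline stages named above (scenarios from transcripts, multi-turn conversations, evaluation) can be sketched in a few lines. This is a minimal illustration, not DoorDash's implementation: the names `Scenario`, `run_conversation`, and `run_pipeline` are hypothetical, the chatbot and the evaluation are passed in as plain callables, and the simulated customer simply replays scripted messages where a real simulator would generate them with an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class Scenario:
    """A replayable test scenario (hypothetical shape)."""
    scenario_id: str
    customer_turns: List[str]  # customer messages, in order


@dataclass
class Transcript:
    scenario_id: str
    turns: List[Turn]


def scenarios_from_transcripts(transcripts: List[List[str]]) -> List[Scenario]:
    """Derive one scenario per historical transcript of customer messages."""
    return [Scenario(f"s{i}", list(msgs)) for i, msgs in enumerate(transcripts)]


def run_conversation(scenario: Scenario,
                     chatbot: Callable[[List[Turn]], str]) -> Transcript:
    """Alternate customer and bot turns; the bot sees the full history so far."""
    history: List[Turn] = []
    for message in scenario.customer_turns:
        history.append(("customer", message))
        history.append(("bot", chatbot(history)))
    return Transcript(scenario.scenario_id, history)


def run_pipeline(transcripts: List[List[str]],
                 chatbot: Callable[[List[Turn]], str],
                 evaluate: Callable[[Transcript], bool]) -> float:
    """Run every scenario end to end and return the evaluation pass rate."""
    scenarios = scenarios_from_transcripts(transcripts)
    results = [evaluate(run_conversation(s, chatbot)) for s in scenarios]
    return sum(results) / len(results) if results else 0.0
```

Keeping the chatbot and the evaluation as injected callables means the same pipeline can be re-run unchanged after every prompt or architecture change.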


Then they modify the prompt or the system architecture, re-run the simulator, and check whether the pass rate climbed. If it did, they keep going. If it didn't, they try something else. They repeat this cycle until the pass rate hits their exit criteria, and then they deploy with confidence that the change actually works.
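The cycle described above is essentially a loop with a pass-rate exit criterion. The sketch below assumes a hypothetical `measure_pass_rate` callable that runs the simulator and evaluation for one candidate change and returns a value between 0 and 1; the function name and the idea of a fixed list of candidates are illustrative assumptions, since in practice engineers choose the next change by hand.

```python
from typing import Callable, Iterable, Optional, Tuple


def iterate_until_pass(
    candidates: Iterable[str],
    measure_pass_rate: Callable[[str], float],
    exit_threshold: float,
) -> Tuple[Optional[str], float]:
    """Try candidate changes in order and stop once one meets the exit criterion.

    Returns the best candidate seen and its pass rate. If no candidate reaches
    the threshold, the caller can see that from the returned rate.
    """
    best_candidate: Optional[str] = None
    best_rate = -1.0
    for candidate in candidates:
        rate = measure_pass_rate(candidate)
        if rate > best_rate:  # keep improvements, discard regressions
            best_candidate, best_rate = candidate, rate
        if best_rate >= exit_threshold:
            break  # exit criterion met; remaining candidates are not run
    return best_candidate, best_rate
```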

The speed of this loop is what makes the approach powerful. DoorDash can run more than 200 simulations in under five minutes and get results immediately.
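The article does not say how DoorDash reaches that throughput. One plausible way to get there, shown purely as an assumption, is to run the conversations concurrently, because each one spends most of its time waiting on model responses. The `simulate_one` callable and the `max_workers` value below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_simulations(
    scenarios: Sequence[str],
    simulate_one: Callable[[str], bool],
    max_workers: int = 32,
) -> List[bool]:
    """Run one simulated conversation per scenario on a thread pool.

    Threads suit this workload because each simulation is I/O-bound (it waits
    on network calls to a model). Results come back in scenario order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate_one, scenarios))
```

With a pool like this, wall-clock time is governed by the slowest conversations and any provider rate limits, not by the sum of all conversation times.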


With the simulator generating conversations and the evaluation framework grading them, DoorDash had a working flywheel.

In the initial discovery phase, human reviewers noticed that the chatbot was being overwhelmed by the sheer volume of data in its context window. Order histories, delivery status updates, refund decisions, and tool call results were all being fed directly to the model as raw event logs. The chatbot would misinterpret a field or suggest a policy that didn't exist, not because the information was wrong, but because there was too much of it. This runs directly counter to the intuition that giving a model more information should produce better results.

DoorDash hypothesized that the same data that was vital for the chatbot's reasoning was becoming noise when it came time to generate a response to the customer. Their solution was an architectural layer they call the "case state", which synthesizes the raw tool history into a structured, intermediate representation. Instead of dumping everything into the context window, the case state distills the relevant facts into a clean format that the chatbot can actually use.
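To make the idea of a case state concrete, here is a minimal sketch that reduces a raw event log to a few structured fields and renders them for the prompt. The field names, event types, and rendering format are all hypothetical; the source does not describe DoorDash's actual schema or extraction logic.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class CaseState:
    """Structured summary of a support case (hypothetical fields)."""
    order_id: Optional[str] = None
    delivery_status: Optional[str] = None
    refund_decision: Optional[str] = None


def build_case_state(events: List[Dict[str, Any]]) -> CaseState:
    """Distill a chronological raw event log into the latest relevant facts.

    Later events overwrite earlier ones; unknown event types are ignored so
    they never reach the prompt as noise.
    """
    state = CaseState()
    for event in events:
        kind = event.get("type")
        if kind == "order_lookup":
            state.order_id = event.get("order_id")
        elif kind == "delivery_status":
            state.delivery_status = event.get("status")
        elif kind == "refund_decision":
            state.refund_decision = event.get("decision")
    return state


def render_for_prompt(state: CaseState) -> str:
    """Render the case state as a short block to place in the context window."""
    return "\n".join([
        f"Order: {state.order_id or 'unknown'}",
        f"Delivery status: {state.delivery_status or 'unknown'}",
        f"Refund decision: {state.refund_decision or 'none recorded'}",
    ])
```

The explicit "unknown" and "none recorded" values are a deliberate choice in this sketch: stating that a fact is absent gives the model less room to invent one.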

Getting the case state right required the flywheel. Their first attempts at extraction logic didn't work well at all. Some versions left out critical information, causing the chatbot to miss details that were essential for driving resolutions. Other versions remained too noisy or poorly structured, confusing the model in different ways. Since the simulator could generate numerous realistic conversations in minutes, the team experimented with dozens of different context shapes and prompt strategies in a rapid feedback loop. Each iteration took hours instead of the weeks it would have required through manual testing.

Over 11 iterations, the hallucination evaluation pass rate climbed steadily upward, with a notable dip at iteration 3, where a change actually made things temporarily worse. That dip shows that improvement isn't linear, even with a flywheel, and that part of the flywheel's value is catching regressions before they reach real customers.
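Catching a dip like this can be automated by comparing each iteration's pass rate with the previous one. The helper below is a hypothetical sketch; the source does not publish DoorDash's per-iteration pass rates, so any numbers passed to it are the caller's own.

```python
from typing import List, Sequence


def find_regressions(pass_rates: Sequence[float], tolerance: float = 0.0) -> List[int]:
    """Return the 1-based iteration numbers whose pass rate fell.

    An iteration counts as a regression when its pass rate is more than
    `tolerance` below the previous iteration's. A small tolerance helps
    ignore run-to-run noise from non-deterministic model output.
    """
    regressions = []
    for index in range(1, len(pass_rates)):
        if pass_rates[index] < pass_rates[index - 1] - tolerance:
            regressions.append(index + 1)  # convert 0-based index to iteration number
    return regressions
```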

The final result was a 90% reduction in hallucinations in simulation, and that improvement carried over into production. The strong correlation between their offline metrics and live traffic performance gave the team confidence that the flywheel is a reliable development tool, not just an internal sandbox disconnected from reality.

The simulation and evaluation flywheel has fundamentally changed how DoorDash develops and deploys chatbot improvements, compressing iteration cycles from days to hours and giving them a way to validate changes across hundreds of scenarios before any real customer is affected.

However, the flywheel does come with real tradeoffs worth understanding.

The main limitation is that it can only catch problems for which you've written evaluations. If a failure mode isn't captured by an evaluation, the flywheel is blind to it. DoorDash mitigates this by running a full evaluation suite before every deployment, covering hallucination, tone, and issue classification, but new failure modes can always emerge that existing evaluations don't cover. This is why human review remains the starting point for every improvement cycle. Despite all the automation, someone still has to look at real conversations and notice what's going wrong.
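A pre-deployment suite of this kind can be expressed as a gate that checks every evaluation's pass rate against its own minimum. The sketch below is an assumption about how such a gate might look; the evaluation names come from the article, but the function and any threshold values a caller supplies are hypothetical.

```python
from typing import Dict, List, Tuple


def deployment_gate(
    pass_rates: Dict[str, float],
    thresholds: Dict[str, float],
) -> Tuple[bool, List[str]]:
    """Decide whether a change may ship.

    Every evaluation listed in `thresholds` must have a measured pass rate at
    or above its minimum. An evaluation with no measurement counts as a
    failure, so a skipped evaluation can never pass silently.
    """
    failures = []
    for name, minimum in thresholds.items():
        rate = pass_rates.get(name)
        if rate is None or rate < minimum:
            failures.append(name)
    return (not failures, failures)
```

The gate only covers the evaluations it is given, which is the limitation described above: a failure mode with no evaluation has no threshold to fail.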

Simulation accuracy is another inherent limitation. Even with transcript-derived scenarios and hybrid mock data, synthetic conversations are approximations of real user behavior. DoorDash reports a strong correlation between its offline metrics and production results, which validates the approach, but that correlation isn't guaranteed to hold for every type of scenario or every kind of system change.

There's also the question of cost. Running hundreds of LLM-to-LLM conversations per test cycle, plus LLM-as-judge evaluations on each one, requires significant compute. For smaller teams or less critical applications, a lighter-weight version with fewer scenarios and more targeted evaluations might be the pragmatic starting point.
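A rough way to reason about that cost is to count model calls per test cycle. The estimate below assumes one simulator call and one chatbot call per conversation turn, plus one judge call per evaluation per conversation. These are simplifying assumptions, not figures from DoorDash, and a chatbot that calls tools may need several model calls per turn, so treat the result as a lower bound.

```python
def llm_calls_per_cycle(
    num_scenarios: int,
    turns_per_conversation: int,
    num_evaluations: int,
) -> int:
    """Estimate the minimum number of LLM calls in one test cycle.

    Each turn costs two calls (simulated customer + chatbot reply), and each
    finished conversation is judged once per evaluation.
    """
    conversation_calls = 2 * turns_per_conversation
    judge_calls = num_evaluations
    return num_scenarios * (conversation_calls + judge_calls)
```

For example, with illustrative inputs of 200 scenarios, 5 turns each, and 3 evaluations, the estimate is 200 × (10 + 3) = 2,600 calls per cycle, which shows why reducing the scenario count or the number of evaluations is the most direct lever for a smaller team.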

The broader takeaway is that LLM systems require a completely different testing paradigm than traditional software. Since we can no longer trace a single branch through a decision tree, we need a feedback loop that lets us simulate, evaluate, and iterate fast enough to build confidence before shipping.

References:

[A Simulation and Evaluation Flywheel to build LLM Chatbots at Scale](https://careersatdoordash.com/blog/doordash-simulation-evaluation-flywheel-to-develop-llm-chatbot-at-scale/)

[LLM as a Judge Pattern](https://en.wikipedia.org/wiki/LLM_as_a_judge)