Can LLMs Replace Survey Respondents?

TL;DR · AI summary
Large language models (LLMs) can replicate average responses of major household surveys, but they fail to capture the dispersion of responses, leading to a 'mode collapse' where the model's responses are too homogeneous. The paper 'Can LLMs Mimic Household Surveys?' explores this issue and attempts to address it through unlearning techniques, showing some improvement in capturing the variability of human responses.
Key points
- LLMs can accurately replicate average survey responses but fail to capture the diversity of individual responses.
- Mode collapse is a significant issue in LLM-based survey simulations, where the model's responses are too similar, lacking the variability seen in human surveys.
- Unlearning techniques, such as gradient ascent and negative preference optimization, can help mitigate mode collapse and improve the dispersion of LLM-generated survey responses.
Outline
Explores the capability of LLMs to simulate survey responses and the issue of mode collapse in their outputs.
Discusses how LLMs accurately capture average survey responses but fail to represent the diversity of individual responses.
Introduces methods like gradient ascent and negative preference optimization to improve the dispersion of LLM-generated survey responses.
Summarizes the findings and implications of using LLMs for survey simulations, highlighting the need for further research in addressing mode collapse.
Highlights
Recent papers find that large language models can replicate the average responses of major household surveys to within a percentage point.
The same Llama-3 model that hits the SCE median to within a percentage point places 95% of its simulated respondents inside a two-percentage-point window.
Unlearning strategies to mitigate mode collapse include gradient ascent and negative preference optimization.
Source URL: https://towardsdatascience.com/can-llms-replace-survey-respondents/
Published: 2026-05-20T18:26:36+00:00
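What happens when you ask an LLM to simulate 6,000 U.S. households answering a question about inflation? Recent papers find that large language models can replicate the average responses of major household surveys to within a percentage point (Zarifhonarvar, 2026). In 2020, the Survey of Consumer Expectations (SCE) reported a median one-year-ahead inflation expectation of about 3%. The median generated by prompting an LLM with realistic personas and a knowledge-cutoff instruction is also about 3%. That is close enough that LLMs have been proposed as a low-cost, high-frequency complement to the SCE, the Michigan Survey, and the Survey of Professional Forecasters.
In a recent paper, "Can LLMs Mimic Household Surveys?", co-authored with Ami Dalloul of the University of Duisburg-Essen, we look at the second moment, the part of the distribution that tells you whether the model represents one opinion or a thousand. This is where the apparent success of LLM-based surveys disappears. The same Llama-3 model that hits the SCE median to within a percentage point places 95% of its simulated respondents inside a two-percentage-point window. Real 2020 SCE responses range from about -25% to +27%. In short, the mean is right, but the population behind it does not exist. Running a simulation with a few thousand LLM personas therefore comes down to a single representative agent.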
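Figure 1 Dispersion of Real-World and Synthetic Survey Populations

_Note_: The left panel plots the dispersion of 2020 SCE respondents around their mean. The spread of the rays reflects heterogeneous beliefs across respondents. The middle panel applies the same construction to synthetic responses from a prompted Llama-3.1-8B-Instruct model whose personas match the SCE demographic distribution. The scatter collapses to a near-point: the model recovers the mean but discards everything else. The right panel uses the same Llama model after unlearning with gradient ascent (GA). The unlearned model achieves a more realistic dispersion and does not collapse around the mode.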
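Mode collapse
We benchmarked five LLMs (Llama-3-8B, Llama-3-70B, Claude-3.7-Sonnet, DeepSeek-V3, GPT-4o) against the SCE, the Michigan Survey, and the Survey of Professional Forecasters. In the human surveys, 44% to 70% of respondents give an answer more than 3 percentage points away from the modal response; in the LLM samples, that share is essentially zero.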
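The dispersion measure used here (shares of responses at the mode, near it, and far from it) can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `dispersion_shares`, the rounding to one decimal, and both sample lists are assumptions made up for the example.

```python
from collections import Counter


def dispersion_shares(responses, exact_tol=1e-9):
    """Shares of responses at the mode, within 1 pp of it, and more than 3 pp away."""
    # Assumption: round to one decimal so the mode of a numeric answer is well defined.
    rounded = [round(r, 1) for r in responses]
    mode = Counter(rounded).most_common(1)[0][0]
    n = len(rounded)
    dev = [abs(r - mode) for r in rounded]
    return {
        "mode": mode,
        "exact": sum(d <= exact_tol for d in dev) / n,
        "within_1pp": sum(d <= 1.0 for d in dev) / n,
        "beyond_3pp": sum(d > 3.0 for d in dev) / n,
    }


# Made-up illustrative samples, not data from the paper.
collapsed = [3.0] * 92 + [2.5] * 4 + [3.5] * 4
dispersed = [3.0] * 30 + [-10.0, -4.0, 0.0, 1.0, 2.0, 5.0, 8.0, 12.0, 20.0, 27.0] * 7

print(dispersion_shares(collapsed))
print(dispersion_shares(dispersed))
```

A collapsed sample puts almost all of its mass in the `exact` and `within_1pp` bins and none in `beyond_3pp`, which is the pattern described for the prompted LLMs.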
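The standard remedies in the survey-simulation literature do not improve matters. Census-based personas with rich and varied characteristics, zero-shot knowledge-cutoff instructions ("You do not know events after June 2018"), and explicit "do not look up statistics" prompts all default to the same narrow distribution. The likely reason is that LLMs have seen CPI tables, news coverage of FRBNY survey releases, and academic replications in their training corpora. When asked for the median inflation expectation in 2020, the model is retrieving memorized data. The weight of the training data overrides what the prompt instructs it to do.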
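Making LLMs forget
If memorized statistics are the problem, one possible fix is to remove them from the weights rather than ask the model to ignore them. We applied two unlearning methods to Llama-3.1-8B-Instruct, an open-source model that allows us to modify its weights:
- Gradient ascent (GA) maximizes the prediction loss on a forget set containing CPI series and survey aggregates, while minimizing a retain loss on micro-survey reasoning so that general capabilities survive.
- Negative preference optimization (NPO) treats the forget set as dispreferred completions and minimizes a bounded preference loss relative to a reference model.
The data we ask the model to forget is the official inflation record itself: the monthly CPI series and the published mean inflation expectations from the FRBNY SCE and the Michigan Survey. Table 1 shows the effect of unlearning on the response distribution.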
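The two objectives can be written down compactly. The sketch below uses the textbook forms from the unlearning literature, operating on per-example losses and sequence log-probabilities that a training loop would supply. Whether the paper uses exactly these forms is an assumption, and the names `ga_objective`, `npo_loss`, `retain_weight`, and the default `beta=0.1` are illustrative choices, not values from the paper.

```python
import numpy as np


def ga_objective(nll_forget, nll_retain, retain_weight=1.0):
    """Gradient-ascent unlearning objective, written as a quantity to minimize.

    Minimizing the negative forget-set loss is the same as maximizing that
    loss; the retain term keeps the ordinary prediction loss low elsewhere.
    """
    return -np.mean(nll_forget) + retain_weight * np.mean(nll_retain)


def npo_loss(logp_model, logp_ref, beta=0.1):
    """NPO loss on forget-set completions.

    (2 / beta) * mean(log(1 + (pi_model / pi_ref) ** beta)), computed from
    sequence log-probabilities. It is bounded below by zero, unlike plain GA.
    """
    log_ratio = np.asarray(logp_model, dtype=float) - np.asarray(logp_ref, dtype=float)
    # logaddexp(0, x) equals log(1 + exp(x)) and is numerically stable.
    return (2.0 / beta) * np.mean(np.logaddexp(0.0, beta * log_ratio))


# Toy numbers: the NPO loss shrinks as the model assigns less probability
# to a forget-set completion than the frozen reference model does.
print(npo_loss([-5.0], [-5.0]))
print(npo_loss([-50.0], [-5.0]))
```

The bounded form is the usual argument for NPO over plain gradient ascent: the GA term can be driven down without limit, while the NPO loss flattens out once the model has moved away from the reference.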
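Table 1 Tail Accuracy of Different Unlearning Strategies

_Note_: Unlearning strategies to mitigate mode collapse. Gradient ascent (GA) is a targeted unlearning method in which the model is fine-tuned to maximize the loss on CPI statistics in the forget set while minimizing the loss on survey micro-data, the retain (RT) set, to preserve general capabilities. Negative preference optimization (NPO) treats official statistics as dispreferred completions and minimizes them while treating the retain (RT) samples as positive. Inflation expectations from synthetic survey responses are reported as percentage deviations from the mode and from the mean (in parentheses), in bins for exact match, ±1, and >3% deviation. Tail accuracy measures closeness to the FRBNY tail-dispersion benchmark (>±3.0 = 44.38%).
Baseline Llama-3 (including prompt-based unlearning) produces an exact mode match in 92% of responses, and no response lies more than 3 pp away. Its tail accuracy against the SCE benchmark of 44% is therefore zero. After GA, exact matches fall to 24% and 43% of responses move beyond ±3 pp; tail accuracy reaches 97%. NPO is similar at 37% and 43%, with a tail accuracy of 98%. In other words, both unlearning methods appear to restore a more realistic distribution.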
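One way to read the tail-accuracy numbers is as one minus the relative gap between the model's share of responses beyond ±3 pp and the 44.38% benchmark. That formula is an assumption: the text does not spell out the definition, although this reading reproduces zero for the baseline and roughly 97% for a 43% tail share. The function name `tail_accuracy` is hypothetical.

```python
def tail_accuracy(model_tail_share, benchmark_tail_share=44.38):
    """One minus the relative gap between model and benchmark tail shares, floored at 0.

    Assumed definition, chosen because it matches the reported figures;
    both shares are in percent.
    """
    gap = abs(model_tail_share - benchmark_tail_share) / benchmark_tail_share
    return max(0.0, 1.0 - gap)


print(round(tail_accuracy(0.0), 2))   # baseline: no responses beyond 3 pp
print(round(tail_accuracy(43.0), 2))  # GA: about 43% of responses beyond 3 pp
```

Because the reported tail shares are rounded to whole percentages, the sketch cannot distinguish the 97% and 98% figures for GA and NPO.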
Figure 2 Dispersion of LLMs and Unlearned Models

_Note_: The left-hand side plots kernel density estimates of 2020 inflation expectations from the FRBNY SCE and two Llama-3 variants trained with unlearning methods, gradient ascent (GA) and negative preference optimization (NPO). Both unlearning variants cover the range where FRBNY SCE places probability mass, though they still remain more concentrated than the human benchmark and slightly skewed to higher means. The right-hand side compares the KDEs of prompted LLM-generated expectations (GPT-4o, Llama-3, etc.) to FRBNY SCE in 2020. The LLM curves (left axis) are tightly clustered around a narrow region, while the FRBNY SCE curve remains much broader. The LLMs can match central tendency yet fail to reproduce the cross-sectional spread of survey micro-data. Bandwidth = 0.5 for all KDEs.
The kernel densities (Figure 2) show that off-the-shelf models pile probability mass into a thin spike near the mean. The unlearned variants spread mass across the range where the human respondents of the SCE put it.
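For readers who want to reproduce this kind of comparison, here is a minimal sketch of a Gaussian kernel density estimate with a fixed bandwidth of 0.5, the value stated in the figure note. The Gaussian kernel, the function name `gaussian_kde_fixed`, and both samples are assumptions; the samples are random stand-ins, not survey or model data.

```python
import numpy as np


def gaussian_kde_fixed(samples, grid, bandwidth=0.5):
    """Gaussian kernel density estimate with an absolute bandwidth (in pp)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One kernel per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)


rng = np.random.default_rng(0)
# Made-up stand-ins: a broad "human-like" sample and a collapsed "LLM-like" one.
broad = rng.normal(loc=3.0, scale=6.0, size=2000)
narrow = rng.normal(loc=3.0, scale=0.3, size=2000)

grid = np.linspace(-30.0, 35.0, 1301)
dens_broad = gaussian_kde_fixed(broad, grid)
dens_narrow = gaussian_kde_fixed(narrow, grid)

print(dens_broad.max(), dens_narrow.max())
```

Holding the bandwidth fixed across all curves matters here: a data-driven bandwidth would shrink for the collapsed sample and widen for the broad one, which makes the spikes harder to compare on one axis.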
Simulating a randomized controlled trial
A wider distribution is necessary but not sufficient for the application that motivated our paper: replicating survey RCTs with synthetic versions. RCTs are expensive. After data collection ends, a researcher cannot go back to test a theory that emerged later or vary a treatment. Synthetic agents would let us do exactly that, if their behavior matches what real respondents produce.
To test this, we replicate a real-world RCT by Coibion, Gorodnichenko, and Weber (2022). Respondents are randomly assigned to one of several groups: a control group sees no information, several treatment groups each receive a different piece of economic information (the actual past inflation rate, the Fed’s 2% target, etc.), and a placebo group is shown content unrelated to inflation. All respondents first report a prior inflation expectation, then see whatever their group is assigned, and then report a new posterior expectation. The difference between posterior and prior is the respondent’s revision.
A treatment works if its revisions differ visibly from the control group’s, and if the direction of the shift matches what economic theory expects: downward revisions from FOMC communication, upward revisions from news of higher gasoline prices. The check for our synthetic agents is whether their revisions separate the same way the human respondents did.
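The comparison described above reduces to a difference in mean revisions between a treatment arm and the control arm. The sketch below is one simple way to compute it, with a Welch-style t-statistic as a rough significance check. It is not the estimator from the paper or from Coibion et al., and the function name, the toy data, and the assumed effect of -1 pp are all illustrative.

```python
import numpy as np


def treatment_effect(prior, posterior, treated):
    """Difference in mean revisions (posterior - prior), treated minus control."""
    revision = np.asarray(posterior, dtype=float) - np.asarray(prior, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    r_t, r_c = revision[treated], revision[~treated]
    effect = r_t.mean() - r_c.mean()
    # Welch standard error: does not assume equal variances across arms.
    se = np.sqrt(r_t.var(ddof=1) / r_t.size + r_c.var(ddof=1) / r_c.size)
    return effect, effect / se


rng = np.random.default_rng(1)
n = 2000
treated = rng.random(n) < 0.5
prior = rng.normal(5.0, 3.0, n)
# Toy data: treated agents revise down by about 1 pp; control agents do not move on average.
posterior = prior + rng.normal(0.0, 1.0, n) - 1.0 * treated

effect, t_stat = treatment_effect(prior, posterior, treated)
print(effect, t_stat)
```

A model that does not register the treatment at all would show an effect near zero in every arm, so the same function doubles as a quick check on synthetic agents.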
We built 30,000 synthetic personas with Census-derived demographics, and estimated the average treatment effect on each of the three LLMs, including our unlearned ones. The first check is on the priors themselves: the inflation expectations agents report before they see any information. Figure 3 plots the mean and standard deviation of these priors across demographic subgroups for the human benchmark and the three LLMs. One unlearning model (Llama-GA) comes close to the human aggregate in both level and dispersion. While one unlearning method worked (GA), the other did not (NPO). So unlearning may not be a one-size-fits-all remedy.
Figure 3 Model Estimates of Perceived Inflation

_Note_: Each panel plots by demographic subgroup for the human benchmark (Coibion et al., 2022), the baseline Llama-3, and its two unlearned variants (GA, NPO). The dashed line marks the human “All” value. Left-hand side: Llama-3 and Llama-NPO are essentially flat across demographic characteristics; Llama-GA tracks the human level on average but does not reproduce the within-demographic ordering (e.g. predicting the highest mean for “college or more” and “Inc T3,” contrary to the human pattern). Right-hand side: the unlearned GA model recovers most of the dispersion collapsed by the base model.
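The quantities plotted in each panel are the mean and standard deviation of reported priors within each demographic subgroup. A minimal sketch of that aggregation follows; the record layout, the `education` field, the subgroup labels other than "college or more", and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def prior_moments_by_group(records, group_key):
    """Mean and sample standard deviation of priors for each subgroup."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec["prior"])
    # A standard deviation needs at least two observations per group.
    return {g: (mean(v), stdev(v)) for g, v in groups.items() if len(v) > 1}


# Hypothetical records; field names and values are illustrative only.
records = [
    {"education": "college or more", "prior": 3.0},
    {"education": "college or more", "prior": 5.0},
    {"education": "high school or less", "prior": 6.0},
    {"education": "high school or less", "prior": 10.0},
]

moments = prior_moments_by_group(records, "education")
for group, (m, s) in moments.items():
    print(group, m, s)
```

Running the same aggregation on human and synthetic samples gives the level comparison (means) and the dispersion comparison (standard deviations) side by side.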
The next check is on how the priors get updated after the information treatment. In the baseline Llama-3 and Llama-NPO models, revisions are essentially identical across every treatment and the models do not register a treatment effect at all. Llama-GA is the only one where the treatments separate, and within its largest subgroup of agents (80% of the sample) the four monetary-policy treatments (past inflation, Fed target, FOMC forecast, FOMC statement) produce negative and significant revisions of the same sign and rough magnitude as the human respondents in Coibion et al.
What to take from this
For researchers and practitioners deciding whether to use LLMs to conduct surveys, the summary is:
- LLMs are unable to imitate different personas. Simulating surveys comes down to one agent answering the same question thousands of times, hitting something very close to the mean every time, sometimes up to four decimal places.
- Targeted unlearning recovers most of the dispersion and a respectable share of the treatment effects in an RCT with human respondents. However, unlearning methods achieve different levels of success.
- The gap between mean accuracy and distributional accuracy is large enough that any paper using synthetic respondents should report the latter.
Future work should treat distributional accuracy and data leakage as joint constraints rather than secondary concerns. Progress will depend on methods that account for both what models know and how their outputs are evaluated, with greater attention paid to dispersion, tails, and belief updating rather than averages alone.
References
Coibion, O., Y. Gorodnichenko, and M. Weber (2022). Monetary policy communications and their effects on household inflation expectations. _Journal of Political Economy_ _130_(6), 1537–1584.
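Dalloul, A., and M. Pfeifer (2026). Can LLMs mimic household surveys? From representative agents to population distributions. SSRN preprint.
Zarifhonarvar, A. (2026). Generating inflation expectations with large language models. _Journal of Monetary Economics_ _157_, 103859.
Replication data
Dalloul, A., and M. Pfeifer (2026). Replication data for: Can LLMs mimic household surveys? From representative agents to population distributions. https://doi.org/10.7910/DVN/CRIRVJ, Harvard Dataverse, V1.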