Improve your agent's tool-calling accuracy with SFT and DPO on Amazon SageMaker AI

AWS Machine Learning Blog, June 3, 2026


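TL;DR (AI summary)

Supervised fine-tuning (SFT) and direct preference optimization (DPO) can significantly improve the tool-calling accuracy of small language models on Amazon SageMaker AI. The two methods combine high-quality datasets with human feedback to optimize how a model interacts with digital tools.

Key points

* SFT and DPO improve an AI agent's ability to choose the right tool when it carries out complex tasks.
* Together, SFT and DPO form a framework for training language models to interact with a variety of digital tools.
* With these techniques, you can build AI systems that understand and generate human-like text and that interact autonomously with external applications to complete complex tasks.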


Published: 2026-06-03T07:56:50-08:00


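AI agents can handle complex, multi-step tasks autonomously, but their effectiveness depends on calling the right tool to retrieve information or take action. When an agent picks the wrong tool, formats arguments incorrectly, or breaks a workflow chain, task completion time grows, error rates rise, support costs increase, and the user experience degrades. As more organizations move agentic applications from pilot to production, agents that select the right tool for every request become essential for reliable automation.

In this post, you learn how to improve the tool-calling accuracy of a small language model (SLM) by combining supervised fine-tuning (SFT) with direct preference optimization (DPO). The example uses Amazon SageMaker AI training jobs, so you can focus on training code instead of managing your own training infrastructure. You also learn how to evaluate tool-calling accuracy and compare the base model with several fine-tuned variants, so you can make data-driven decisions about model quality.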

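## Fine-tuning methodologies

Supervised fine-tuning involves creating a high-quality dataset that is closely aligned with the model's intended function and that provides explicit examples of how the model should perform a task or interact with a specific tool. This approach is particularly effective for teaching a model the nuances of tool-specific language, commands, and constraints.

Direct preference optimization refines these interactions by incorporating human feedback or predefined objectives directly into the training loop. DPO aligns the model's outputs more closely with target outcomes by favoring certain kinds of responses or behaviors over others. DPO training data contains "prefer this, not that" pairs, which optimize the same objective as reinforcement learning without requiring a reward function or a reward model. This reduces resource requirements and training time while maintaining quality.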

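Figure 1: Diagram of the direct preference optimization training flow, which compares preferred and rejected responses to align model outputs with human preferences

Source: arXiv:2305.18290 [cs.LG]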
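To make the objective more concrete, the following is a minimal sketch of the per-example DPO loss in its sigmoid form, as described in the paper cited above. It uses plain Python rather than a training framework. The function name and the example log-probabilities are illustrative assumptions, not values from this post.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(chosen_reward - rejected_reward))."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical sequence log-probabilities (sums of token log-probs).
loss_at_start = dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0)  # policy == reference
loss_improved = dpo_sigmoid_loss(-10.0, -18.0, -12.0, -15.0)  # policy favors "chosen"
print(loss_at_start, loss_improved)
```

When the policy equals the reference model, the margin is zero and the loss is log 2 regardless of `beta`. Training lowers the loss by widening the gap between the chosen and rejected responses relative to the reference.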

For example, DPO in the Hugging Face TRL library uses the following format for training samples:

```python
{
    "prompt": ["<array of input samples>"],
    "chosen": "<complete preferred response (j)>",  # rated better than k
    "rejected": "<complete non-preferred response (k)>",  # rated worse than j
}
```

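This feedback-based approach lets you iteratively improve a model's tool-interaction capabilities based on real usage patterns captured in the training data.

Together, SFT and DPO form a framework for fine-tuning language models to interface with a variety of digital tools. With these techniques, you can build AI systems that understand and generate human-like text and that carry out complex tasks by interacting autonomously with external applications, which broadens the scope and usefulness of AI in consumer and enterprise settings.

For the costs associated with Amazon SageMaker Studio notebooks and Amazon SageMaker AI training jobs, see the SageMaker AI pricing page.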

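## Solution overview

In this section, we outline how to fine-tune Qwen3 1.7B on Amazon SageMaker AI training jobs, a fully managed service that supports distributed multi-GPU and multi-node configurations. With SageMaker AI training jobs, you can launch high-performance clusters on demand, train billion-parameter models faster, and shut down resources automatically when the job completes. Metrics from the infrastructure and from inside the training loop are sent to MLflow on SageMaker AI for later analysis.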

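## Prerequisites

To fine-tune a function-calling model on SageMaker AI, you need the following prerequisites:

* An AWS account that contains your AWS resources.
* An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn how IAM works with SageMaker AI, see AWS Identity and Access Management for Amazon SageMaker AI.
* A development environment configured to access your AWS account. You can run the notebook from your preferred environment, including integrated development environments (IDEs) such as PyCharm or Visual Studio Code. To set up a local environment, see the setup instructions for configuring the AWS Command Line Interface (AWS CLI). We recommend Amazon SageMaker Studio for a streamlined experience on SageMaker AI.
* To track your experiments with MLflow on SageMaker AI, follow the instructions in the SageMaker AI documentation.
* Access to the SageMaker AI compute instances used in this tutorial. We train with SageMaker AI training jobs on a single ml.p4d.24xlarge training instance. To check your quota, use Service Quotas in the AWS Management Console:
    1. In the Service Quotas console, find the SageMaker AI quota for ml.p4d.24xlarge for training job usage.
    2. If the applied account-level quota value is 0, request an increase of the account-level quota to 1.
* Access to the GitHub repository for this tutorial.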

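## Set up your environment

In the following sections, we run the code from a SageMaker Studio JupyterLab notebook instance. You can also use your preferred IDE, such as VS Code or PyCharm. Make sure your local environment is configured to work with AWS, as listed in the prerequisites.

Complete the following steps to set up your environment:

1. On the SageMaker AI console, choose Domains in the navigation pane, then open your domain.
2. In the navigation pane, under Applications and IDEs, choose Studio.
3. On the User profiles tab, locate your user profile, then choose Launch and Studio.
4. In SageMaker Studio, launch an ml.t3.medium JupyterLab notebook instance with at least 50 GB of storage. Because the fine-tuning jobs run on separate, ephemeral training job instances with NVIDIA accelerators, a large notebook instance isn't required.
5. To begin fine-tuning, clone the GitHub repository: `git clone https://github.com/aws-samples/amazon-sagemaker-generativeai.git`.
6. Navigate to the `6_use_cases/usecases/function-calling-sft-dpo` directory.
7. Open the `run_training_job.ipynb` notebook with a Python 3.12 or later kernel.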

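## Dataset preparation

Selecting and creating a suitable training dataset is an important first step in fine-tuning foundation models (FMs). This example uses the When2Call dataset published by NVIDIA, a benchmark for evaluating the tool-calling decision-making of FMs. It covers when to generate a tool call, when to ask a follow-up question, when to state that the question can't be answered with the tools provided, and what to do when a question appears to require a tool but no tool call is possible.

The evaluation code and the synthetic data generation scripts used to build the dataset are in NVIDIA's GitHub repository.

The dataset contains three distinct parts.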

1. A dataset for supervised fine-tuning (SFT) with 15,000 samples.

```python
from datasets import load_dataset

train_sft_ds = load_dataset("nvidia/When2Call", "train_sft")
train_sft_ds

# DatasetDict({
#     train: Dataset({
#         features: ['tools', 'messages'],
#         num_rows: 15000
#     })
# })
```

2. A dataset for preference alignment, which in this example means direct preference optimization (DPO). It contains 9,000 samples.

```python
from datasets import load_dataset

train_pref_ds = load_dataset("nvidia/When2Call", "train_pref")
train_pref_ds

# DatasetDict({
#     train: Dataset({
#         features: ['tools', 'messages', 'chosen_response', 'rejected_response'],
#         num_rows: 9000
#     })
# })
```

3. A test dataset with two files: the multiple-choice question evaluation (`mcq`) and the LLM-as-a-judge evaluation (`llm_judge`), which is a subset of the MCQ evaluation set. Both can be downloaded as a single `DatasetDict`.

```python
from datasets import load_dataset

test_ds = load_dataset("nvidia/When2Call", "test")
test_ds

# DatasetDict({
#     llm_judge: Dataset({
#         features: ['uuid', 'source', 'source_id', 'question', 'correct_answer', 'answers',
#                    'target_tool', 'tools', 'orig_tools', 'orig_question', 'held_out_param'],
#         num_rows: 300
#     })
#     mcq: Dataset({
#         features: ['uuid', 'source', 'source_id', 'question', 'correct_answer', 'answers',
#                    'target_tool', 'tools', 'orig_tools', 'orig_question', 'held_out_param'],
#         num_rows: 3652
#     })
# })
```
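The results later in this post are reported as Acc-Norm. In log-likelihood-based multiple-choice evaluation, that name commonly refers to accuracy after normalizing each option's log-likelihood by its length, so that longer answers such as full tool calls aren't penalized for having more tokens. The following is a minimal sketch of that idea. The record layout, option names, and scores are hypothetical, and the code is not taken from NVIDIA's evaluation scripts.

```python
def pick_answer_normalized(option_logliks, option_texts):
    """Return the option key with the highest log-likelihood per UTF-8 byte."""
    def score(key):
        n_bytes = max(len(option_texts[key].encode("utf-8")), 1)
        return option_logliks[key] / n_bytes
    return max(option_logliks, key=score)


def acc_norm(examples):
    """Fraction of examples where the length-normalized pick matches the label."""
    if not examples:
        return 0.0
    hits = sum(
        pick_answer_normalized(ex["logliks"], ex["answers"]) == ex["correct_answer"]
        for ex in examples
    )
    return hits / len(examples)


# Hypothetical scored examples; "logliks" would come from the model under test.
TOOL_CALL = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
examples = [
    {"answers": {"tool_call": TOOL_CALL, "unable": "I can't help."},
     "logliks": {"tool_call": -20.0, "unable": -8.0},
     "correct_answer": "tool_call"},
    {"answers": {"tool_call": TOOL_CALL, "follow_up": "Which city do you mean?"},
     "logliks": {"tool_call": -30.0, "follow_up": -6.0},
     "correct_answer": "follow_up"},
    {"answers": {"tool_call": TOOL_CALL, "unable": "I can't help."},
     "logliks": {"tool_call": -10.0, "unable": -9.0},
     "correct_answer": "unable"},
]
print(acc_norm(examples))
```

In the first example, the raw log-likelihood favors the short refusal, but the per-byte score favors the tool call. This is the effect that length normalization is meant to capture.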

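For this use case, the dataset needs some preprocessing to match the format expected by TRL's [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTTrainer) and [`DPOTrainer`](https://huggingface.co/docs/trl/main/en/dpo_trainer). To do this, we build a system prompt that contains the list of available tools and prepend it to the `messages` list of the original dataset.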

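```python
def generate_and_tokenize_prompt(data_point):
    """
    Build a system prompt that lists the tools available for one sample.

    Args:
        data_point (dict): A dataset record that contains a "tools" field.

    Returns:
        dict: A dictionary with the formatted system prompt.
    """
    full_prompt = f"""
    You are a helpful assistant with access to the following tools or function calls. Your task is to produce the sequence of tools or function calls necessary to generate a response to the user utterance. Use the following tools or function calls as required:
    {data_point["tools"]}
    """
    return {"system_prompt": full_prompt.strip()}


# Alias for the SFT DatasetDict loaded earlier as train_sft_ds
dstrain_sft = train_sft_ds
dstrain_sft = dstrain_sft.map(generate_and_tokenize_prompt, batched=False)

# Prepend the system message to the first two turns of each conversation
convos = []
for mess, system_prompt in zip(
    dstrain_sft["train"]["messages"], dstrain_sft["train"]["system_prompt"]
):
    message = {"content": f"{system_prompt}", "role": "system"}
    convos.append([message, mess[0], mess[1]])

# Keep the original turns under a new name and store the combined list as "messages"
dstrain_sft = dstrain_sft.rename_column("messages", "messages_1")
dstrain_sft["train"] = dstrain_sft["train"].add_column("messages", convos)
```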
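To see what this preprocessing produces without downloading the dataset, the same idea can be sketched on a single toy record in plain Python. The record contents, the prompt wording, and the helper names below are hypothetical. Unlike the code above, this sketch prepends the system message to all existing turns rather than to the first two only.

```python
import json


def build_system_prompt(tools):
    """Format the tool list into a system prompt (wording is illustrative)."""
    header = (
        "You are a helpful assistant with access to the following tools or "
        "function calls. Use them as required:"
    )
    return header + "\n" + "\n".join(tools)


def prepend_system_message(record):
    """Return a copy of the record whose messages start with the system prompt."""
    system_message = {
        "role": "system",
        "content": build_system_prompt(record["tools"]),
    }
    return {**record, "messages": [system_message, *record["messages"]]}


# Hypothetical record that mimics the `tools` and `messages` columns.
record = {
    "tools": [json.dumps({"name": "get_weather", "parameters": {"city": "string"}})],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": "get_weather(city='Paris')"},
    ],
}
prepared = prepend_system_message(record)
print([m["role"] for m in prepared["messages"]])
```

The original record is left unchanged, which makes the transformation easy to check before applying it to the full dataset with `map`.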

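We also need to prepare the data for DPO. The `DPOTrainer` in TRL accepts a specific format that, in addition to the `messages` column, includes columns labeled `chosen` and `rejected`, so we need to create the `messages` column and rename `chosen_response` and `rejected_response`.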

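```python
# Alias for the preference DatasetDict loaded earlier as train_pref_ds
ds_train_pref = train_pref_ds

# Add the system prompt, then rename the response columns to the names DPOTrainer expects
ds_train_pref = ds_train_pref.map(generate_and_tokenize_prompt, batched=False)
ds_train_pref = ds_train_pref.rename_column("chosen_response", "chosen")
ds_train_pref = ds_train_pref.rename_column("rejected_response", "rejected")
```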
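As an illustration of the target shape, the following sketch maps one preference row to TRL's conversational preference format, in which `prompt`, `chosen`, and `rejected` are each a list of role and content messages. It assumes that `messages` holds only the prompt turns and that the two response columns are plain strings. The sample row and the helper name are hypothetical.

```python
def to_preference_record(row):
    """Map a row with chosen/rejected response strings to prompt/chosen/rejected."""
    prompt = [{"role": "system", "content": row["system_prompt"]}, *row["messages"]]
    return {
        "prompt": prompt,
        "chosen": [{"role": "assistant", "content": row["chosen_response"]}],
        "rejected": [{"role": "assistant", "content": row["rejected_response"]}],
    }


# Hypothetical row: the tool needs a city, but the user did not provide one.
row = {
    "system_prompt": "You can call: get_weather(city)",
    "messages": [{"role": "user", "content": "What's the weather like?"}],
    "chosen_response": "Which city would you like the weather for?",
    "rejected_response": "get_weather(city='Paris')",
}
pref = to_preference_record(row)
print(sorted(pref))
```

Here the preferred response asks a follow-up question instead of guessing a missing argument, which is one of the decision types the When2Call dataset targets.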

Now save the SFT and DPO datasets to Amazon Simple Storage Service (Amazon S3) so they can be used for training.

```python
# Save the training datasets to S3 using our SageMaker session
input_path = f"s3://{sagemaker_session.default_bucket()}/datasets/nvidia_function_calling"

# Save the datasets to S3
# Workshop note: with limited compute, only 20 records would be fine-tuned;
# the calls below write the full training splits.
dstrain_sft["train"].to_json(f"{input_path}/train/dataset.json", orient="records")
sft_dataset_s3_path = f"{input_path}/train/dataset.json"
ds_train_pref["train"].to_json(f"{input_path}/pref/dataset.json", orient="records")
perf_dataset_s3_path = f"{input_path}/pref/dataset.json"

print("Training data uploaded to:")
print(sft_dataset_s3_path)
print("DPO data uploaded to:")
print(perf_dataset_s3_path)
print(f"https://s3.console.aws.amazon.com/s3/buckets/{sagemaker_session.default_bucket()}/?region={sagemaker_session.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/")
```

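## Supervised fine-tuning (SFT) on the base model

The following example demonstrates how to fine-tune the Qwen3-1.7B model. The recipes in the repository are located in the `scripts` directory, where you can modify the base model and the SFT training parameters. This example uses a [Spectrum-based](https://aws.amazon.com/blogs/machine-learning/using-spectrum-fine-tuning-to-improve-fm-training-efficiency-on-amazon-sagemaker-ai/) fine-tuning recipe, but you can also use other PEFT techniques such as LoRA or QLoRA.

The recipe contains the configuration for the model and the training parameters: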

```yaml
# Model arguments
model_name_or_path: Qwen/Qwen3-1.7B
tokenizer_name_or_path: Qwen/Qwen3-1.7B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: /opt/ml/model/Qwen3-1.7B-function-calling

# Dataset arguments
dataset_id_or_path: /opt/ml/input/data/dataset/dataset.json
max_seq_length: 2048
packing: true

# Spectrum arguments
spectrum_config_path: /opt/ml/input/data/code/spectrum-layer/snr_results_Qwen-Qwen3-1.7B_unfrozenparameters_50percent.yaml

# Training arguments
num_train_epochs: 10
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
learning_rate: 5.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1

# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
- wandb

# Hugging Face Hub
push_to_hub: false
hub_model_id: # if not defined, same as output_dir
hub_strategy: every_save
```

### Fine-tune the model using the SageMaker AI ModelTrainer

To enable experiment tracking in MLflow, provide the ARN of the MLflow tracking server to the job.

```python
# MLflow tracker
# env is the dictionary of environment variables passed to the training job
tracking_server_arn = "<YOUR MLFLOW TRACKING ARN>"
env["MLFLOW_TRACKING_ARN"] = tracking_server_arn
```

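The `Compute` section of the training setup defines the infrastructure requirements for training. In the `SourceCode` section, we define the local code path that is imported into the training job.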

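```python
from sagemaker.modules.configs import Compute, SourceCode

# Training infrastructure: a single ml.p4d.24xlarge instance
compute = Compute(
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

# Local code to upload, its dependencies, and the entry point to run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="run_training_sft.sh",
)
```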

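The following is the directory structure used for fine-tuning with SageMaker AI training jobs. The `scripts` directory also includes a `requirements.txt` file, which `ModelTrainer` automatically detects to install the dependencies needed at runtime. For advanced scenarios such as disabling build isolation, you can provide a Bash script as the entry point to run shell commands before training starts.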

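```
scripts/
├── accelerate_configs/   # Accelerate configuration files
├── run_training_sft.sh   # Launch script for distributed SFT training with Accelerate on SageMaker training jobs
├── run_training_dpo.sh   # Launch script for distributed DPO training with Accelerate on SageMaker training jobs
├── run_sft.py            # Main supervised fine-tuning (SFT) training script
├── run_dpo.py            # Main direct preference optimization (DPO) training script
├── recipes/              # Predefined training configuration recipes (YAML)
└── requirements.txt      # Python dependencies installed at runtime
```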

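Next, specify the Amazon Elastic Container Registry (Amazon ECR) location of the training container, the storage location for model checkpoints, and the name of the SageMaker AI training job. These values are passed to the `ModelTrainer` API to configure the job.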

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import StoppingCondition, CheckpointConfig

# Training container image in Amazon ECR
image_uri = f"763104351884.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/pytorch-training:2.8.0-gpu-py312-cu129-ubuntu22.04-sagemaker"

# S3 location for model checkpoints
checkpoint_s3_path = f"s3://{bucket_name}/function-calling-sft-checkpoints/checkpoints"

job_prefix = "model-trainer-distributed-function-calling-sft"

model_trainer = ModelTrainer(
    training_image=image_uri,
    compute=compute,
    hyperparameters=hyperparameters,
    environment=env,
    source_code=source_code,
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=90000,
    ),
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}",
    ),
    base_job_name=job_prefix,
)
```

Finally, configure the location of the training data and start the SFT training job with `.train()`.

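```python
from sagemaker.modules.configs import InputData

# Input channel that points to the SFT dataset in S3
training_data = InputData(
    channel_name="training_dataset",
    data_source=sft_dataset_s3_path,
)

# Start the training job and wait for it to complete
model_trainer.train(input_data_config=[training_data], wait=True)
```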

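To fine-tune across multiple GPUs, we use [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index) and [DeepSpeed ZeRO-3](https://huggingface.co/docs/accelerate/v0.10.0/en/deepspeed), which work together to train models more efficiently across multiple GPUs or nodes. Hugging Face Accelerate simplifies launching distributed training by automatically handling device placement, process management, and mixed-precision settings. DeepSpeed ZeRO-3 reduces memory usage by partitioning optimizer states, gradients, and parameters across GPUs, so that billion-parameter models fit in memory and train faster.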
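A rough back-of-the-envelope estimate shows why this partitioning matters. The sketch below assumes mixed-precision training with Adam at about 16 bytes of model state per parameter (2 for half-precision weights, 2 for gradients, and 12 for full-precision weights plus the two Adam moments) and the 8 GPUs of an ml.p4d.24xlarge instance. It ignores activations and buffers, and it treats every parameter as trainable, which overestimates a Spectrum run that freezes part of the model. The function name is hypothetical.

```python
def zero3_model_state_gb(num_params, num_gpus, bytes_per_param=16):
    """Approximate per-GPU memory (GB) for model states under ZeRO stage 3."""
    total_bytes = num_params * bytes_per_param
    # ZeRO-3 shards parameters, gradients, and optimizer states evenly.
    return total_bytes / num_gpus / 1e9


NUM_PARAMS = 1.7e9  # Qwen3-1.7B, rounded

unsharded_gb = zero3_model_state_gb(NUM_PARAMS, num_gpus=1)
sharded_gb = zero3_model_state_gb(NUM_PARAMS, num_gpus=8)
print(f"model states on one GPU: {unsharded_gb:.1f} GB")
print(f"model states per GPU with ZeRO-3 on 8 GPUs: {sharded_gb:.1f} GB")
```

The memory freed on each GPU can then go to activations, longer sequences, or larger per-device batches.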

You can run your `SFTTrainer` script with the following command:

```bash
# Count the GPUs on the instance and launch one process per GPU
NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
echo "Detected ${NUM_GPUS} GPUs on the machine"
accelerate launch \
    --config_file accelerate_configs/deepspeed_zero3.yaml \
    --num_processes ${NUM_GPUS} run_sft.py \
    --config recipes/Qwen3-0.6B-spectrum.yaml
```

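When the SFT model artifacts are ready, you can use them as the base model for DPO training. The DPO training recipe is similar to the SFT recipe, with a few changes.

* `beta`: A DPO-specific hyperparameter, typically bounded between 0 and 2, that controls how far the fine-tuned model may drift from its original reference model. Values closer to 0 are more aggressive and values closer to 2 are more conservative. A typical starting point is 0.1 to 0.5, which can drive significant behavior change but can also lead to high variance or even degradation. The optimal value is highly dataset-dependent.
* `learning_rate`: DPO benefits from a lower learning rate (for example, 5e-7), combined with a `warmup_ratio` to prevent overfitting. This value differs from the SFT learning rate used in the earlier run (5e-5). Although this example uses a constant `lr_scheduler_type`, cosine annealing is another common option.
* `batch_size`: Larger batch sizes generally perform better. The batch size in this example is intentionally small to reduce resource requirements.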
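The difference between the two scheduler types mentioned above can be sketched in a few lines. This follows the usual shape of these schedules (linear warmup, then either a flat rate or half a cosine period) and is not copied from any library. The step counts are hypothetical.

```python
import math


def lr_constant_with_warmup(step, base_lr, warmup_steps):
    """Linear warmup from 0 to base_lr, then hold base_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


def lr_cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then cosine decay from base_lr down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 1000  # hypothetical run length
for step in (0, 15, 30, 500, 1000):
    dpo_lr = lr_constant_with_warmup(step, 5e-7, warmup_steps=30)
    sft_lr = lr_cosine_with_warmup(step, 5e-5, warmup_steps=100, total_steps=TOTAL_STEPS)
    print(f"step {step:4d}  constant: {dpo_lr:.2e}  cosine: {sft_lr:.2e}")
```

A constant schedule keeps the already small DPO learning rate from decaying further, while the cosine schedule gradually reduces the larger SFT learning rate toward the end of training.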

```yaml
# Model arguments
model_name_or_path: /opt/ml/input/model/Qwen3-1.7B-function-calling/
tokenizer_name_or_path: Qwen/Qwen3-1.7B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: /opt/ml/model/sft-dpo-qwen-3-1.7b-function-calling

# Dataset arguments
dataset_id_or_path: /opt/ml/input/data/dataset/dataset.json

# Training arguments
beta: 0.1 # hyperparameter that controls how much the fine-tuned model is allowed to diverge from its original, reference model
max_length: 1536
max_prompt_length: 768
loss_type: sigmoid
num_train_epochs: 10
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
learning_rate: 5.0e-7
lr_scheduler_type: constant
warmup_ratio: 0.03

# Logging arguments
logging_strategy: steps
logging_steps: 5
report_to:
- mlflow
save_strategy: "no"
seed: 42
```
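The batch-related settings in the SFT and DPO recipes combine into an effective batch size per optimizer step. The following sketch works through that arithmetic for a single ml.p4d.24xlarge instance with 8 GPUs. The step count is an approximation: the exact number depends on how the trainer rounds partial batches, and sequence packing changes the number of SFT samples. The helper names are hypothetical.

```python
import math


def effective_batch_size(per_device, grad_accum, num_gpus):
    """Samples that contribute to one optimizer step."""
    return per_device * grad_accum * num_gpus


def optimizer_steps(num_samples, per_device, grad_accum, num_gpus, epochs):
    """Approximate number of optimizer steps over all epochs."""
    batch = effective_batch_size(per_device, grad_accum, num_gpus)
    return math.ceil(num_samples / batch) * epochs


NUM_GPUS = 8  # GPUs on one ml.p4d.24xlarge

sft_batch = effective_batch_size(per_device=4, grad_accum=2, num_gpus=NUM_GPUS)
dpo_batch = effective_batch_size(per_device=2, grad_accum=8, num_gpus=NUM_GPUS)
dpo_steps = optimizer_steps(9000, per_device=2, grad_accum=8,
                            num_gpus=NUM_GPUS, epochs=10)

print("SFT effective batch size:", sft_batch)
print("DPO effective batch size:", dpo_batch)
print("Approximate DPO optimizer steps:", dpo_steps)
```

Even with a per-device batch size of 2, gradient accumulation gives DPO a larger effective batch than SFT in these recipes, in line with the guidance that DPO tends to benefit from larger batches.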

You can provide multiple loss values to perform [Mixed Preference Optimization](https://arxiv.org/abs/2403.19443) (MPO), which combines and weights several loss types. In this example, the SFT and DPO training data are used in separate runs. If you only have DPO training data, you can use MPO with the `sft` loss type so that the accepted (`chosen`) responses in the DPO data are also used for SFT. Where possible, providing separate, distinct datasets gives you a larger corpus and better results.

```yaml
# MPO (Mixed Preference Optimization): combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
loss_type: ["sigmoid", "bco_pair", "sft"]  # loss types to combine
loss_weights: [0.8, 0.2, 1.0]  # corresponding weights, as used in the MPO paper
```

If you omit `loss_weights`, all loss types are weighted equally (the default is 1.0).
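Conceptually, combining loss types amounts to a weighted sum. The sketch below illustrates that idea under the assumption that each loss type yields one scalar per batch. The loss values and the function name are hypothetical, and this is not the trainer's actual implementation.

```python
def combine_losses(loss_values, loss_types, loss_weights=None):
    """Weighted sum of per-type losses; weights default to 1.0 each."""
    if loss_weights is None:
        loss_weights = [1.0] * len(loss_types)
    if len(loss_weights) != len(loss_types):
        raise ValueError("loss_weights must match loss_types in length")
    return sum(w * loss_values[name] for name, w in zip(loss_types, loss_weights))


# Hypothetical per-batch loss values for each loss type.
batch_losses = {"sigmoid": 0.60, "bco_pair": 0.90, "sft": 1.50}
loss_types = ["sigmoid", "bco_pair", "sft"]

weighted = combine_losses(batch_losses, loss_types, [0.8, 0.2, 1.0])
equal = combine_losses(batch_losses, loss_types)
print(weighted, equal)
```

With the weights shown earlier, the preference term dominates the BCO term, while the SFT term keeps its full weight.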

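## Direct preference optimization (DPO) on the SFT-trained model

In the DPO example, we show how to pass configuration data into the training container either as hyperparameters or as environment variables. The training script receives the former through `TRLParser` and the latter through Python `os.environ` lookups.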
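The two channels can be sketched from the training script's point of view. In the sketch below, the standard-library `argparse` stands in for `TRLParser`, and it is an assumption that hyperparameters reach the script as command-line arguments. The function name and the sample values are hypothetical. The variable names mirror those used in the configuration that follows.

```python
import argparse


def load_job_config(argv, environ):
    """Merge hyperparameters (CLI arguments) with environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_path", default="/opt/ml/input/data/dataset")
    parser.add_argument("--model_dir", default="/opt/ml/model")
    # Ignore any extra arguments that the launcher may add.
    args, _unknown = parser.parse_known_args(argv)
    return {
        "dataset_path": args.dataset_path,
        "model_dir": args.model_dir,
        "data_location": environ.get("data_location"),
        "mlflow_arn": environ.get("MLFLOW_TRACKING_ARN"),
    }


# Hypothetical inputs; a real script would pass sys.argv[1:] and os.environ.
config = load_job_config(
    ["--dataset_path", "/opt/ml/input/data/dataset", "--extra_flag", "1"],
    {"data_location": "s3://example-bucket/pref/dataset.json"},
)
print(config)
```

Passing `argv` and `environ` explicitly keeps the function easy to test outside a training container.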

The DPO training configuration is defined as follows:

```python
import os

from sagemaker.config import load_sagemaker_config
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import (
    Compute,
    SourceCode,
    InputData,
    StoppingCondition,
    CheckpointConfig,
)

configs = load_sagemaker_config()

# Environment variables passed to the training container
env = {}
env["FI_PROVIDER"] = "efa"
env["NCCL_PROTO"] = "simple"
env["NCCL_SOCKET_IFNAME"] = "eth0"
env["NCCL_IB_DISABLE"] = "1"
env["NCCL_DEBUG"] = "WARN"
env["HF_token"] = os.environ["hf_token"]  # needed for gated models; otherwise can be omitted
env["data_location"] = perf_dataset_s3_path
env["model_location"] = model_data  # model_data is not defined in this excerpt

# MLflow tracker
tracking_server_arn = "<YOUR MLFLOW TRACKING ARN>"
env["MLFLOW_TRACKING_ARN"] = tracking_server_arn

compute = Compute(
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

image_uri = f"763104351884.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/pytorch-training:2.8.0-gpu-py312-cu129-ubuntu22.04-sagemaker"

checkpoint_s3_path = f"s3://{bucket_name}/function-calling-dpo-checkpoints/checkpoints"

job_prefix = "model-trainer-distributed-function-calling-dpo"

# Hyperparameters passed to the training script
hyperparameters = {
    "dataset_path": "/opt/ml/input/data/dataset",
    "model_dir": "/opt/ml/model",
}

source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="run_training_dpo.sh",
)

model_trainer = ModelTrainer(
    training_image=image_uri,
    compute=compute,
    hyperparameters=hyperparameters,
    environment=env,
    source_code=source_code,
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=90000,
    ),
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}",
    ),
    base_job_name=job_prefix,
)

# Input channel that points to the preference dataset in S3
training_data = InputData(
    channel_name="training_dataset",
    data_source=perf_dataset_s3_path,
)
```

Then start the DPO training job:

```python
model_trainer.train(input_data_config=[training_data], wait=True)
```

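## Results

We ran experiments on three different models using the evaluation scripts provided by NVIDIA (https://github.com/NVIDIA/When2Call) and obtained the following results. Among the base models, Qwen3-0.6B performed best despite being the smallest, ahead of Qwen3-1.7B by about 6 percentage points and ahead of Llama-3.2-3B-Instruct by about 1 percentage point.

After one round of fine-tuning with SFT, the ranking changes. The accuracy of Qwen3-1.7B improves by about 19 percentage points, which puts it about 4 to 7 points ahead of the other models. The preference optimization round is also effective: it adds about 10.5 more points of accuracy and leaves Qwen3-1.7B about 8 to 9 points ahead of the other models.

This shows the effectiveness of a multi-step approach to model customization. The overall accuracy of Qwen3-1.7B improved by about 30 percentage points, and the model outperforms Llama-3.2-3B, which has nearly twice as many parameters, by more than 8 percentage points. Achieving similar or better performance with a smaller model can lower cost and increase throughput when you deploy the model.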

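| Model | Tuning technique | Acc-Norm |
| --- | --- | --- |
| Llama 3.2 3B Instruct | Base | 46.50% |
| Llama 3.2 3B Instruct | Spectrum SFT | 53.41% |
| Llama 3.2 3B Instruct | Spectrum SFT + DPO | **62.67%** |
| Qwen3-0.6B | Base | 47.64% |
| Qwen3-0.6B | Spectrum SFT | 56.10% |
| Qwen3-0.6B | Spectrum SFT + DPO | **62.02%** |
| Qwen3-1.7B | Base | 41.57% |
| Qwen3-1.7B | Spectrum SFT | 60.43% |
| Qwen3-1.7B | Spectrum SFT + DPO | **71.06%** |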
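The gains discussed above can be recomputed directly from the table. The following short script uses only the Acc-Norm values reported there. The dictionary layout and the function name are illustrative.

```python
# Acc-Norm values (%) copied from the results table.
acc_norm_scores = {
    "Llama 3.2 3B Instruct": {"base": 46.50, "sft": 53.41, "sft_dpo": 62.67},
    "Qwen3-0.6B": {"base": 47.64, "sft": 56.10, "sft_dpo": 62.02},
    "Qwen3-1.7B": {"base": 41.57, "sft": 60.43, "sft_dpo": 71.06},
}


def gains(scores):
    """Percentage-point gains contributed by each training stage."""
    return {
        "sft_gain": round(scores["sft"] - scores["base"], 2),
        "dpo_gain": round(scores["sft_dpo"] - scores["sft"], 2),
        "total_gain": round(scores["sft_dpo"] - scores["base"], 2),
    }


for model, scores in acc_norm_scores.items():
    print(model, gains(scores))

# Lead of the best model over the largest model after both stages.
lead_over_llama = round(
    acc_norm_scores["Qwen3-1.7B"]["sft_dpo"]
    - acc_norm_scores["Llama 3.2 3B Instruct"]["sft_dpo"], 2
)
print("Qwen3-1.7B lead over Llama 3.2 3B:", lead_over_llama)
```

All differences are in percentage points of Acc-Norm, not relative percentages.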

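## Cleanup

To avoid incurring charges for resources you no longer need, complete the following cleanup steps:

* Delete any SageMaker AI training jobs you launched. Training jobs that completed successfully don't continue to incur charges, but you can clean up the records from the SageMaker AI console or the AWS CLI.
* Delete the datasets you uploaded to Amazon S3: `aws s3 rm s3://<your-bucket>/datasets/nvidia_function_calling/ --recursive`
* Stop or delete the SageMaker Studio JupyterLab notebook instance to avoid idle charges.
* Delete any model checkpoints stored in Amazon S3 that you no longer need.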

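## Conclusion

In this post, we showed how to improve an agent's tool-calling accuracy by combining supervised fine-tuning (SFT) and direct preference optimization (DPO) on Amazon SageMaker AI. SFT uses a labeled dataset to refine model parameters, so the model builds a foundational understanding by learning from expert-annotated examples. DPO then aligns model outputs with human preferences or specific performance metrics through direct feedback, without the need to define a reward function.

By combining the two methods, you get a better-performing model that benefits from both the structured, knowledge-driven approach of SFT and the adaptive, user-centric optimization of DPO. The result is a model that is more accurate, more relevant, and better aligned with how users want it to behave.

For more examples of fine-tuning foundation models, visit the [AWS SageMaker AI generative AI GitHub samples repository](https://github.com/aws-samples/amazon-sagemaker-generativeai). For more information about training models in SageMaker AI, see the [SageMaker AI documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html).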
