TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google

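Emphasizes parameter reduction (<100M), architecture adaptation (e.g., state space models (SSMs) in place of Transformers), post-training quantization, and knowledge distillation.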
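As a rough illustration of the post-training quantization mentioned above, the following is a minimal sketch of symmetric per-tensor int8 quantization. The summary does not say which quantization scheme the talk uses, so the scheme and the function names `quantize_int8` and `dequantize` are assumptions made for illustration only, not LiteRT-LM APIs.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative, not a LiteRT-LM API).

    Maps floats to integers in [-127, 127] using a single scale factor
    derived from the largest absolute weight.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 when all weights are zero to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    # The round-trip error per weight is bounded by roughly half the scale.
    print(q, scale, restored)
```

Each stored weight drops from 32 bits to 8 bits at the cost of a rounding error of at most about half the scale per weight, which is the trade-off that makes quantization attractive on memory-constrained devices.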

Introduces its lightweight IR, zero-copy memory scheduling, multi-backend support for NPU, GPU, and ARM CPU, and dynamic KV cache management.

Describes the TLM-based local planner, the tool registration mechanism, the restricted sandbox environment, and the self-reflection prompt chain.

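Achieves response times under 80 ms on Pixel 8 and Raspberry Pi 5, supporting offline voice assistants, real-time translation, and sensor-coordinated decision-making.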

AI Engineer · Video · May 3, 2026

Score: 7.2

TL;DR · AI Summary

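Google presents TLMs (Tiny Language Models) and the LiteRT-LM framework, which support efficient deployment of lightweight LLMs and autonomous agents on edge devices while balancing low latency, privacy protection, and offline capability.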

Key quotes worth saving and sharing.

‘TLMs aren’t just quantized Llama — they’re rethought from the silicon up for memory bandwidth, not FLOPs.’
— 12:45
LiteRT-LM’s ‘token budget scheduler’ dynamically allocates compute across sub-tasks in an agent loop, preventing OOM on 2GB RAM devices.
— 28:11
The local agent doesn’t call APIs — it loads tool binaries (e.g., SQLite, libusb) directly into its runtime sandbox.
— 35:20
No cloud fallback by default: if network is down, the agent degrades gracefully using cached context and cached tool schemas.
— 41:03
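
The "token budget scheduler" quoted above (28:11) is not specified in any detail in this summary. The following is a minimal sketch of one way such a scheduler could work, assuming a simple greedy policy that grants tokens to sub-tasks in priority order under a fixed total budget. The names `SubTask` and `allocate` and the policy itself are hypothetical and are not taken from LiteRT-LM.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    """One step of an agent loop (hypothetical structure for illustration)."""
    name: str
    requested: int      # tokens the sub-task would like to use
    priority: int = 1   # higher values are served first


def allocate(tasks, total_budget):
    """Grant tokens in priority order without ever exceeding total_budget.

    Assumed policy: each sub-task receives what it requested if enough
    budget remains, otherwise whatever is left (possibly zero).
    """
    remaining = total_budget
    grants = {}
    for task in sorted(tasks, key=lambda t: -t.priority):
        grant = min(task.requested, remaining)
        grants[task.name] = grant
        remaining -= grant
    return grants


if __name__ == "__main__":
    loop = [
        SubTask("plan", requested=200, priority=3),
        SubTask("tool_call", requested=300, priority=2),
        SubTask("reflect", requested=400, priority=1),
    ]
    # With a 600-token budget, the lowest-priority step is trimmed to fit.
    print(allocate(loop, total_budget=600))
```

Because the sum of grants can never exceed the budget, memory tied to token count (such as the KV cache) stays bounded, which is the property the quote attributes to the scheduler on 2GB RAM devices.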

#LLM #edge computing #Google #LiteRT-LM #TLM