Accelerating On-Device AI: A Look at Arm and Google AI Edge Optimizations

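TL;DR · AI Summary
This article describes how Arm and Google collaborate to optimize AI inference at the edge, using the SME2 architecture and the Google AI Edge toolchain to improve on-device AI performance.
Key points
- Arm SME2 lets the CPU act as a high-performance AI accelerator, with up to 5x faster inference
- Google AI Edge provides a complete convert, optimize, and deploy pipeline from PyTorch to silicon
- LiteRT-Torch converts PyTorch models directly to the .tflite format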
MAY 14, 2026
Na Li, Engineering Manager
AI is evolving beyond simple text interactions toward rich multimodal capabilities, such as on-device image and audio generation, enabling developers to create highly personalized consumer experiences. While the CPU has always been the ubiquitous option for inference, running large complex models at the edge has historically required choosing between high-latency CPU execution and fragmented, specialized accelerators.
Arm Scalable Matrix Extension 2 (SME2) eliminates this tradeoff by integrating a dedicated matrix-compute unit directly into the CPU cluster. This architecture enables the CPU to function as a high-performance AI accelerator, delivering up to 5x faster inference for the matrix-heavy workloads at the heart of generative AI.
Running on-device AI on Arm hardware is dramatically streamlined with Google AI Edge, an integrated stack designed to simplify your development journey. LiteRT automatically leverages Arm SME2 at runtime through XNNPACK and Arm KleidiAI integration. It identifies and selects math-intensive kernels like iGeMM and GeMM, delivering specialized hardware acceleration. To further ease deployment, AI Edge Quantizer handles complex model compression, and Model Explorer provides a visual map to quickly identify and resolve performance hotspots.
The power of this integration is proven by deploying Stability AI’s stable-audio-open-small model entirely on Arm CPUs, delivering a major performance uplift. In this blog post, we’ll walk you through transforming the original floating-point PyTorch stable-audio-open-small model into a highly optimized, mixed-precision (FP16/INT8) implementation ready for high-performance acceleration on Arm CPUs.
Stable Audio Open Small running on LiteRT and Arm CPU with SME2
**The challenge: Balancing model quality and mobile reality**
To generate high-quality audio, such as 11-second stereo clips from a single prompt, directly on a wide range of mobile devices, practical considerations usually require a manageable model footprint, typically around 1 billion parameters. Even within this Small Language Model (SLM) range, developers face challenging deployment hurdles:
- Complexity Gap: Finding the optimal quantization configuration among many possibilities can be challenging. Furthermore, naively quantizing the entire model's weights yields a severe loss in audio quality.
- Device Coverage: Unlocking a path for efficient CPU-based audio generation is more than a technical milestone. It is an opportunity to scale innovative apps across the billions of CPU-powered devices that represent the global smartphone market.
**Google AI Edge: A seamless path from PyTorch to silicon**
By using a diffusion-based model as the optimization target, we demonstrate a complete end-to-end path with the Google AI Edge software stack. As shown below, this synergy provides a streamlined Convert → Optimize → Deploy pipeline.

Because the KleidiAI optimizations are embedded directly into XNNPACK, developers gain specialized AI acceleration automatically. There is no need to write low-level assembly or custom hardware code; the stack handles the "translation" from high-level model to silicon-optimized execution.
**Convert**: Convert from PyTorch to .tflite with LiteRT Torch
Start by converting the PyTorch version of the stable-audio-open-small model into the AI Edge ecosystem. **LiteRT-Torch** provides a direct conversion path for PyTorch models, minimizing the friction of moving from a research environment to a production mobile environment.
import litert_torch
from litert_torch.quantize import quant_config
from litert_torch.generative.quantize import quant_recipe, quant_recipe_utils

# Specify the quantization format
quant_config_int8 = quant_config.QuantConfig(
    generative_recipe=quant_recipe.GenerativeQuantRecipe(
        default=quant_recipe_utils.create_layer_quant_dynamic(),
    )
)

# Initiate the conversion. `model` is the PyTorch module and `example_inputs`
# is a tuple of sample input tensors; both are defined elsewhere.
edge_model = litert_torch.convert(
    model, example_inputs, quant_config=quant_config_int8
)
Find the code snippet to illustrate how LiteRT-Torch works in practice here
**Optimize:** Optimize with Model Explorer and AI Edge Quantizer
Previously, identifying which layers of a model were suitable for quantization was a manual, error-prone process of inspecting individual layers.
With Google’s **Model Explorer**, developers can now visualize the entire model graph. The new **node data overlay** plugin allows us to see exactly which operators are most compute-intensive or, as shown below, which are "quantization-safe". This visual verification ensures we only target layers where moving to INT8 won't degrade audio output quality.
For example, to improve the inference efficiency of the diffusion step, we applied dynamic INT8 quantization to the DiT (Diffusion Transformer) submodule:

Median difference ratio error reported for each DiT transformer block.
As shown in the screenshot above, all layers in the DiT submodule are green, indicating low error values within the DiT transformer (FP32 vs. FP32+INT8). Therefore, we expect the dynamically quantized INT8 DiT submodule to achieve quality comparable to FP32.

Fully connected layer with INT8 dynamic quantization. The error rate, reported under “NODE DATA PROVIDERS,” is approximately 1%.
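To make the per-layer check concrete, here is a minimal sketch of how a median difference ratio could be computed and used to split layers into INT8 and floating-point groups. The metric definition (median of |quantized − reference| / |reference|), the block names, the sample values, and the 5% threshold are all assumptions for illustration; the formula and color thresholds used by Model Explorer and AI Edge Quantizer may differ.

```python
from statistics import median


def median_diff_ratio(reference, candidate, eps=1e-12):
    """Median of |candidate - reference| / |reference| over all elements.

    Assumed definition for illustration only.
    """
    ratios = [
        abs(c - r) / max(abs(r), eps)
        for r, c in zip(reference, candidate)
    ]
    return median(ratios)


# Hypothetical per-block outputs: FP32 reference vs. an INT8-quantized run.
fp32_outputs = {
    "block_0": [0.50, -1.20, 2.00, 0.80],
    "block_1": [1.00, 0.25, -0.75, 3.00],
}
int8_outputs = {
    "block_0": [0.505, -1.19, 2.01, 0.80],
    "block_1": [1.30, 0.10, -0.50, 3.90],
}

ERROR_THRESHOLD = 0.05  # hypothetical cut-off for "quantization-safe"

errors = {
    name: median_diff_ratio(fp32_outputs[name], int8_outputs[name])
    for name in fp32_outputs
}
# Blocks under the threshold are candidates for INT8; the rest stay in float.
int8_safe = [name for name, err in errors.items() if err <= ERROR_THRESHOLD]
keep_float = [name for name, err in errors.items() if err > ERROR_THRESHOLD]

for name, err in errors.items():
    print(f"{name}: median diff ratio = {err:.4f}")
```

A median is less sensitive than a mean to a handful of outlier elements, which is one plausible reason to prefer it when judging whether a whole layer is safe to quantize.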
Once the suitability of INT8 quantization was confirmed, we utilized the AI Edge Quantizer to optimize the model from FP32 to INT8.
This decision resulted in a 3x performance improvement in the DiT submodule, along with a 4x reduction in its memory usage.
# AI Edge Quantizer entry point used below.
from ai_edge_quantizer import quantizer

fp32_model_path = "./dit_model_fp32.tflite"
dynamic_quant_model_path = "./dit_model_int8+fp32.tflite"

# One rule that matches every op ('regex': '.*', 'operation': '*') and
# requests symmetric, channel-wise INT8 weights with integer compute.
the_recipe = [
    {
        'regex': '.*',
        'operation': '*',
        'algorithm_key': 'min_max_uniform_quantize',
        'op_config': {
            'weight_tensor_config': {
                'num_bits': 8,
                'symmetric': True,
                'granularity': 'CHANNELWISE',
                'dtype': 'INT',
                'block_size': 0,
            },
            'compute_precision': 'INTEGER',
            'explicit_dequantize': False,
            'skip_checks': False,
            'min_weight_elements': 0
        },
    }
]

# Define the quantizer, with fp32 tflite model, and the recipe.
qt = quantizer.Quantizer(fp32_model_path, the_recipe)
# Quantize, then write the result to disk.
quant_result = qt.quantize()
quant_result.export_model(dynamic_quant_model_path, overwrite=True)
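The recipe above asks for symmetric, channel-wise, 8-bit weights derived from a min/max range. The following pure-Python sketch shows what that scheme could look like on a tiny weight matrix. It is an illustration under stated assumptions (one scale per output row, quantized range [-127, 127]) and not the AI Edge Quantizer implementation.

```python
def quantize_channelwise_symmetric(weights, num_bits=8):
    """Quantize each row (output channel) with its own symmetric scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # The largest magnitude in the row maps to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_row = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map integer weights back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Hypothetical weights: one small-magnitude and one large-magnitude channel.
weights = [
    [0.02, -0.01, 0.015, 0.005],
    [1.50, -2.00, 0.75, 0.10],
]

q_weights, scales = quantize_channelwise_symmetric(weights)
restored = dequantize(q_weights, scales)

for row, q_row, s in zip(weights, q_weights, scales):
    print(f"scale={s:.6f} original={row} quantized={q_row}")
```

Giving each channel its own scale keeps the small-magnitude row from being crushed by the range of the large one, which is the usual motivation for channel-wise over per-tensor granularity.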
**Deploy:** High-performance inference with LiteRT via XNNPACK & KleidiAI
The final step is the runtime.
When you run this quantized model in LiteRT on an Android mobile device, it defaults to the XNNPACK delegate for CPU inference. Because XNNPACK integrates KleidiAI directly within the latest LiteRT API, developers get these optimizations automatically. These micro-kernels ensure that the core INT8 and FP16 matrix multiplications of the audio model run with maximum efficiency on the CPU.
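As a rough picture of what a dynamically quantized INT8 matrix multiplication computes, the sketch below quantizes activations at runtime, accumulates in integers, and rescales the result to float. It is a conceptual, pure-Python illustration with hypothetical values and helper names; the real XNNPACK and KleidiAI micro-kernels are vectorized, hardware-specific, and may use a different quantization scheme for activations.

```python
def quantize_symmetric(values, qmax=127):
    """Quantize a vector with one symmetric scale; returns (ints, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def dynamic_int8_fully_connected(activations, q_weights, weight_scales):
    """Compute y[i] = sum_j x[j] * W[i][j] with integer accumulation."""
    # Activations are quantized at inference time (the "dynamic" part).
    q_act, act_scale = quantize_symmetric(activations)
    outputs = []
    for q_row, w_scale in zip(q_weights, weight_scales):
        acc = sum(qa * qw for qa, qw in zip(q_act, q_row))  # integer math
        outputs.append(acc * act_scale * w_scale)  # rescale to float
    return outputs


# Hypothetical layer: weights are quantized once, offline, one scale per row.
weights = [[0.5, -0.25, 0.125], [1.0, 2.0, -1.5]]
activations = [0.8, -0.4, 0.2]

q_weights, weight_scales = zip(*(quantize_symmetric(row) for row in weights))

reference = [
    sum(x * w for x, w in zip(activations, row)) for row in weights
]
approx = dynamic_int8_fully_connected(activations, q_weights, weight_scales)

for ref, out in zip(reference, approx):
    print(f"float: {ref:+.4f}  int8 dynamic: {out:+.4f}")
```

The inner loop is pure integer multiply-accumulate, which is the kind of work a matrix-compute unit is built to accelerate; only the final rescale touches floating point.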
Below is a representative snippet of how LiteRT inference is implemented in C++ using the CompiledModel API. Instructions in this guide are provided for running the audiogen app with LiteRT either on an Android™ device or macOS®.
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_tensor_buffer.h"

// 1. Initialize the LiteRT Environment
auto env = litert::Environment::Create({}).value();

// 2. Create the CompiledModel from the .tflite file
// Hardware acceleration (e.g., SME2 via KleidiAI) is handled automatically
auto compiled_model = litert::CompiledModel::Create(
    env, "autoencoder_model.tflite", litert::HwAccelerators::kCpu).value();

// 3. Prepare input and output buffers
auto autoencoder_inputs = compiled_model.CreateInputBuffers().value();
auto autoencoder_outputs = compiled_model.CreateOutputBuffers().value();

// 4. Write input data (e.g., random noise or conditioned embeddings)
auto auto_in_lock_and_ptr = scoped_lock<float>(
    autoencoder_inputs[0], litert::TensorBuffer::LockMode::kWrite);
// Fill the input

// 5. Execute inference
compiled_model.Run(autoencoder_inputs, autoencoder_outputs);

// 6. Access and read the generated audio waveform from the output buffer
auto auto_out_lock_and_ptr = scoped_lock<const float>(
    autoencoder_outputs[0], litert::TensorBuffer::LockMode::kRead);
// Read the output
**Results**: Faster, smaller, and high-quality audio generation with a lower footprint

We now benchmark the original FP32 Stable Audio Open Small model against our KleidiAI-optimized FP16 + INT8 model from the prior section, measuring both single-threaded and multi-threaded (MT) CPU performance on an SME2-based Android device and on an Apple MacBook with M4.
- Speed: We saw a more than 2x reduction in audio generation time: from 10 seconds to 4.3 seconds on an Apple MacBook M4, and from 14 seconds to 6.6 seconds on an Arm SME2-based Android device with 1 thread.
- Memory: The DiT submodel size decreased by roughly 4x, significantly lowering RAM usage during inference.
- Quality: Crucially, the generated audio maintained perceptual parity with the FP32 version.
As shown in the bar chart above, SME2 delivers more than a 2x performance improvement over the NEON instruction set, which is specialized for signal processing tasks. Even on a single core, it can generate 11 seconds of audio in under 8 seconds, which is acceptable from a user-experience perspective.
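The headline figures can be sanity-checked with a few lines of arithmetic. The timings below are the ones quoted in this post; the 1024 x 1024 layer used for the memory estimate is a hypothetical shape, and the estimate assumes one FP32 scale per output channel.

```python
CLIP_SECONDS = 11.0  # length of the generated audio clip

# (FP32 seconds, FP16 + INT8 seconds), as quoted in the results above.
measurements = {
    "Apple MacBook M4": (10.0, 4.3),
    "Arm SME2 Android, 1 thread": (14.0, 6.6),
}

speedups = {}
realtime_factors = {}
for device, (fp32_s, mixed_s) in measurements.items():
    speedups[device] = fp32_s / mixed_s
    # Values above 1.0 mean audio is generated faster than it plays back.
    realtime_factors[device] = CLIP_SECONDS / mixed_s
    print(
        f"{device}: {speedups[device]:.2f}x faster, "
        f"{realtime_factors[device]:.2f}x real time"
    )

# Weight memory for a hypothetical 1024 x 1024 fully connected layer.
rows, cols = 1024, 1024
fp32_bytes = rows * cols * 4          # 4 bytes per FP32 weight
int8_bytes = rows * cols * 1 + rows * 4  # 1 byte per weight + FP32 scale per row
memory_ratio = fp32_bytes / int8_bytes
print(f"Weight memory reduction: {memory_ratio:.2f}x")
```

The per-channel scales add only a small overhead, which is why the reduction lands just under 4x rather than exactly 4x.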
**Ready to learn more?**
These optimizations are available for developers today. Start experimenting immediately using Google AI Edge tools and KleidiAI-accelerated LiteRT.
#### Start experimenting today
Explore Arm’s sample repository to access the complete end-to-end journey for Stable Audio Open:
- Convert: Use LiteRT-Torch to bring your PyTorch models into the ecosystem.
- Optimize: Use Google AI Edge tools, including Model Explorer and the AI Edge Quantizer, to visualize and compress your models for the edge.
- Deploy: Run the sample code for Stable Audio Open Small on Arm-powered phones and laptops to see the Arm SME2 acceleration in action.
**Developer resources**
- Download **LiteRT**: Access the latest OSS or Maven versions to ensure you are using the newest XNNPACK engine, now supercharged with Arm KleidiAI micro-kernels.
- Google AI Edge Documentation: Visit the LiteRT Documentation for comprehensive development guides on model conversion and hardware delegation.
- Arm Developer Portal: Explore more about Arm SME2 and KleidiAI to understand how to unlock maximum throughput on the latest Armv9-A CPUs.
_Acknowledgements_
_Arm: Adnan Alsinan, Anitha Raj, Aude Vuilliomenet, Bala Gattu, Declan Cox, and Gian Marco Iodice_
_Stability AI credit: This post uses the Stable Audio Open Small model by Stability AI, released under the Stability AI Community License. Audio samples were generated using the model running on test devices via LiteRT & Arm KleidiAI._
_Google: Advait Jain, Andrei Kulik, Changmin Sun, Cormac Brick, Dillon Sharlet, Eric Yang, Jinjiang Li, Jing Jin, Lu Wang, Maria Lyubimtsev, Meghna Johar, Pedro Gonnet, Ram Iyengar, Sachin Kotwani, Terry (Woncheol) Heo, Vitalii Dziuba_