# ExecuTorch
[ExecuTorch](https://docs.pytorch.org/executorch/stable/index.html) is a lightweight runtime for model inference on edge devices. It exports a PyTorch model into a portable, ahead-of-time format. A small C++ runtime plans memory and dispatches operations to hardware-specific backends. Execution and memory behavior are known before the model runs on device, so inference overhead is low.
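To make the idea of memory behavior being known ahead of time more concrete, here is a toy sketch of static memory planning. It is not ExecuTorch's planner: the `Tensor` class, the `plan_memory` function, the tensor names, and the sizes are all hypothetical, and the greedy first-fit strategy is only one simple way to assign offsets. The point is that when every tensor's size and lifetime is known before execution, buffers can share one preallocated arena.

```py
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int        # bytes
    first_use: int   # index of the op that produces the tensor
    last_use: int    # index of the last op that reads it


def plan_memory(tensors):
    """Greedy first-fit planner: return ({name: offset}, arena_size)."""
    placed = []   # (offset, tensor) pairs that already have a slot
    offsets = {}
    # Place larger tensors first, breaking ties by name for determinism
    for t in sorted(tensors, key=lambda t: (-t.size, t.name)):
        # Only tensors whose lifetimes overlap with t can conflict with it
        busy = sorted(
            (off, off + other.size)
            for off, other in placed
            if not (other.last_use < t.first_use or t.last_use < other.first_use)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # the gap before this block is large enough
            offset = max(offset, end)
        offsets[t.name] = offset
        placed.append((offset, t))
    arena_size = max((off + t.size for off, t in placed), default=0)
    return offsets, arena_size


# Hypothetical intermediate tensors of a small model
tensors = [
    Tensor("embeddings", 4096, first_use=0, last_use=1),
    Tensor("attn_out", 4096, first_use=1, last_use=2),
    Tensor("mlp_out", 4096, first_use=2, last_use=3),
    Tensor("logits", 1024, first_use=3, last_use=4),
]
offsets, arena_size = plan_memory(tensors)
print(offsets, arena_size)
```

In this example the four tensors total 13312 bytes, but the plan fits them into an 8192-byte arena because tensors that are never live at the same time reuse the same offsets.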
Export a Transformers model with the [optimum-executorch](https://huggingface.co/docs/optimum-executorch/en/index) library, either from the command line or from Python.
```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"
```
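If you export several models from a script, it can help to build the CLI invocation programmatically. The sketch below is only an illustration: `build_export_command` is a hypothetical helper, and it uses only the flags shown in the command above. It prints the command rather than running it.

```py
import shlex


def build_export_command(model_id, output_dir, task="text-generation", recipe="xnnpack"):
    """Return the optimum-cli argument list for an ExecuTorch export."""
    return [
        "optimum-cli", "export", "executorch",
        "--model", model_id,
        "--task", task,
        "--recipe", recipe,
        "--output_dir", output_dir,
    ]


cmd = build_export_command("HuggingFaceTB/SmolLM2-135M-Instruct", "./smollm2_exported")
# Print a shell-safe version of the command for inspection
print(shlex.join(cmd))
```

To run the export, pass the list to `subprocess.run(cmd, check=True)` in an environment where optimum-executorch is installed. Passing a list instead of a single string avoids shell quoting issues with model names and paths.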
```py
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

# Load the checkpoint and export it to ExecuTorch with the XNNPACK recipe
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)
# Save the exported program to disk
model.save_pretrained("./smollm2_exported")

# Load the matching tokenizer for use at inference time
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
```
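Once the model and tokenizer are loaded, you can try generation from Python before moving to a device. The sketch below assumes the `text_generation` method described in the optimum-executorch README, with `tokenizer`, `prompt`, and `max_seq_len` arguments; check the version you have installed. `generate_text` and `main` are hypothetical helper names, and the heavy imports are kept inside `main` so the helper can be reused on its own.

```py
def generate_text(model, tokenizer, prompt, max_seq_len=128):
    """Generate text with an ExecuTorch-exported causal LM.

    Assumes `model` exposes `text_generation(tokenizer=..., prompt=..., max_seq_len=...)`
    as in the optimum-executorch README.
    """
    return model.text_generation(
        tokenizer=tokenizer,
        prompt=prompt,
        max_seq_len=max_seq_len,
    )


def main():
    # Imported here so that generate_text can be used without these packages loaded
    from transformers import AutoTokenizer
    from optimum.executorch import ExecuTorchModelForCausalLM

    model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
    model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(generate_text(model, tokenizer, "The capital of France is"))
```

Call `main()` in an environment with both Transformers and optimum-executorch installed.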
## Transformers integration
The export process uses several Transformers components.
1. [`~PreTrainedModel.from_pretrained`] loads the model weights in safetensors format.
2. Optimum applies graph optimizations and runs [torch.export](https://docs.pytorch.org/docs/stable/export.html) to create a `model.pte` file targeting your hardware backend.
3. [`AutoTokenizer`] or [`AutoProcessor`] loads the tokenizer or processor files, which are used during inference.
4. At runtime, a C++ runner class executes the `.pte` file on the ExecuTorch runtime.
```c++
// ExecuTorch LLM runner API (header path may vary by ExecuTorch version)
#include <executorch/extension/llm/runner/text_llm_runner.h>

#include <iostream>
#include <optional>
#include <string>
#include <utility>

using namespace executorch::extension::llm;

int main() {
  // Load tokenizer and create runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text with streaming output
  runner->generate("The capital of France is", config,
    [](const std::string& token) { std::cout << token << std::flush; },
    nullptr);

  return 0;
}
```
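The `temperature` setting in the generation config controls how sharply sampling favors the most likely token. As a rough illustration, the sketch below applies the standard temperature-scaled softmax to a few made-up logits. It shows the general technique only and is not a description of the runner's internal sampler; `softmax_with_temperature` and the logit values are hypothetical.

```py
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens
logits = [2.0, 1.0, 0.1]
for t in (0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Values below 1.0, such as the 0.8 used above, concentrate probability on the top tokens, while values above 1.0 flatten the distribution and make output more varied.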
## Resources
- [ExecuTorch](https://docs.pytorch.org/executorch/stable/index.html) docs
- [torch.export](https://docs.pytorch.org/docs/stable/export.html) docs
- [Exporting to production](../serialization#executorch) guide