# ExecuTorch

[ExecuTorch](https://docs.pytorch.org/executorch/stable/index.html) is a lightweight runtime for model inference on edge devices. It exports a PyTorch model into a portable, ahead-of-time format. A small C++ runtime plans memory and dispatches operations to hardware-specific backends. Execution and memory behavior are known before the model runs on device, so inference overhead is low.

Export a Transformers model with the [optimum-executorch](https://huggingface.co/docs/optimum-executorch/en/index) library.

```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"
```

You can also export from Python.

```py
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)
model.save_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
```

## Transformers integration

The export process uses several Transformers components.

1. [`~PreTrainedModel.from_pretrained`] loads the model weights in safetensors format.
2. Optimum applies graph optimizations and runs [torch.export](https://docs.pytorch.org/docs/stable/export.html) to create a `model.pte` file targeting your hardware backend.
3. [`AutoTokenizer`] or [`AutoProcessor`] loads the tokenizer or processor files and runs during inference.
4. At runtime, a C++ runner class executes the `.pte` file on the ExecuTorch runtime.
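Before passing the export to the C++ runner in step 4, it can help to confirm that the files the runner expects are in place. The sketch below is a hypothetical helper, not part of Transformers, Optimum, or ExecuTorch. The name `locate_runner_inputs` is illustrative, and the helper assumes the export directory contains `model.pte` and that a `tokenizer.json` was saved next to it (for example with the tokenizer's `save_pretrained`).

```python
from pathlib import Path


def locate_runner_inputs(export_dir):
    """Return (pte_path, tokenizer_path) for the C++ runner.

    Hypothetical helper: assumes the export wrote `model.pte` into
    `export_dir` and that `tokenizer.json` was saved alongside it.
    Raises FileNotFoundError listing any file that is missing.
    """
    export_dir = Path(export_dir)
    pte_path = export_dir / "model.pte"
    tokenizer_path = export_dir / "tokenizer.json"

    # Collect the names of expected files that are not present on disk.
    missing = [p.name for p in (pte_path, tokenizer_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"{export_dir} is missing: {', '.join(missing)}")
    return pte_path, tokenizer_path
```

Calling `locate_runner_inputs("./smollm2_exported")` returns the two paths to hand to the runner shown next, or raises `FileNotFoundError` naming what is missing.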
```c++
#include <iostream>
#include <optional>
#include <string>
#include <utility>

// Assumed header: the include target was lost in the original text.
// This is the LLM runner extension header; adjust the path to your ExecuTorch install.
#include <executorch/extension/llm/runner/text_llm_runner.h>

using namespace executorch::extension::llm;

int main() {
  // Load tokenizer and create runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text with streaming output
  runner->generate("The capital of France is", config,
    [](const std::string& token) { std::cout << token << std::flush; },
    nullptr);

  return 0;
}
```

## Resources

- [ExecuTorch](https://docs.pytorch.org/executorch/stable/index.html) docs
- [torch.export](https://docs.pytorch.org/docs/stable/export.html) docs
- [Exporting to production](../serialization#executorch) guide