first commit

This commit is contained in:
陈赣
2026-06-05 16:53:03 +08:00
commit 06f1fd69a6
6047 changed files with 1895387 additions and 0 deletions

benchmark_v2/.gitignore vendored Normal file

@@ -0,0 +1,2 @@
benchmark_results/
benchmark_results_profiles/

benchmark_v2/README.md Normal file

@@ -0,0 +1,138 @@
# Benchmarking v2
A comprehensive benchmarking framework for transformer models that supports multiple execution modes (eager, compiled, kernelized), detailed performance metrics collection, and structured output format.
## Quick Start
### Running All Benchmarks
```bash
# Run all benchmarks with default settings
python run_benchmarks.py
# Specify output directory
python run_benchmarks.py --output-dir my_results
# Run with custom parameters
python run_benchmarks.py \
--warmup-iterations 5 \
--measurement-iterations 10 \
--num-tokens-to-generate 200
```
### Uploading Results to HuggingFace Dataset
You can automatically upload benchmark results to a HuggingFace Dataset for tracking and analysis:
```bash
# Upload to a public dataset with auto-generated run ID
python run_benchmarks.py --upload-to-hub username/benchmark-results
# Upload with a custom run ID for easy identification
python run_benchmarks.py --upload-to-hub username/benchmark-results --run-id experiment_v1
# Upload with custom HuggingFace token (if not set in environment)
python run_benchmarks.py --upload-to-hub username/benchmark-results --token hf_your_token_here
```
**Dataset Directory Structure:**
```
dataset_name/
├── 2025-01-15/
│ ├── runs/ # Non-scheduled runs (manual, PR, etc.)
│ │ └── 123-1245151651/ # GitHub run number and ID
│ │ └── benchmark_results/
│ │ ├── benchmark_summary_20250115_143022.json
│ │ └── model-name/
│ │ └── model-name_benchmark_20250115_143022.json
│ └── benchmark_results_abc123de/ # Scheduled runs (daily CI)
│ ├── benchmark_summary_20250115_143022.json
│ └── model-name/
│ └── model-name_benchmark_20250115_143022.json
└── 2025-01-16/
└── ...
```
**Authentication for Uploads:**
For uploading results, you need a HuggingFace token with write permissions to the target dataset. You can provide the token in two ways (in order of precedence):
1. Command line: `--token hf_your_token_here`
2. Environment variable: `HF_TOKEN`
### Running Specific Benchmarks
```bash
# Include only specific benchmarks
python run_benchmarks.py --include llama
# Exclude specific benchmarks
python run_benchmarks.py --exclude old_benchmark
```
## Output Format
Results are saved as JSON files with the following structure:
```json
{
"model_name": "llama_2_7b",
"benchmark_scenarios": [
{
"scenario_name": "eager_variant",
"metadata": {
"timestamp": "2025-01-XX...",
"commit_id": "abc123...",
"hardware_info": {
"gpu_name": "NVIDIA A100",
"gpu_memory_total": 40960,
"cpu_count": 64
},
"config": {
"variant": "eager",
"warmup_iterations": 3,
"measurement_iterations": 5
}
},
"measurements": {
"latency": {
"mean": 2.45,
"median": 2.43,
"std": 0.12,
"min": 2.31,
"max": 2.67,
"p95": 2.61,
"p99": 2.65
},
"time_to_first_token": {
"mean": 0.15,
"std": 0.02
},
"tokens_per_second": {
"mean": 87.3,
"unit": "tokens/sec"
}
},
"gpu_metrics": {
"gpu_utilization_mean": 85.2,
"gpu_memory_used_mean": 12450
}
}
]
}
```
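As a rough illustration of consuming this format, the sketch below parses a trimmed copy of the example above and pulls out the mean latency and throughput per scenario. The `summarize` helper and the inline `SAMPLE` string are assumptions made for the example; real files contain the full structure shown above.

```python
import json

# Trimmed copy of the example structure above (illustrative values only).
SAMPLE = """
{
  "model_name": "llama_2_7b",
  "benchmark_scenarios": [
    {
      "scenario_name": "eager_variant",
      "measurements": {
        "latency": {"mean": 2.45, "median": 2.43},
        "tokens_per_second": {"mean": 87.3, "unit": "tokens/sec"}
      }
    }
  ]
}
"""


def summarize(result: dict) -> list[tuple[str, float, float]]:
    """Return (scenario name, mean latency, mean tokens/sec) for each scenario."""
    rows = []
    for scenario in result["benchmark_scenarios"]:
        measurements = scenario["measurements"]
        rows.append(
            (
                scenario["scenario_name"],
                measurements["latency"]["mean"],
                measurements["tokens_per_second"]["mean"],
            )
        )
    return rows


rows = summarize(json.loads(SAMPLE))
for name, latency, tps in rows:
    print(f"{name}: mean latency {latency:.2f}, {tps:.1f} tokens/sec")
```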
### Debug Mode
```bash
python run_benchmarks.py --log-level DEBUG
```
## Contributing
To add new benchmarks:
1. Create a new file in `benches/`
2. Implement the `ModelBenchmark` interface
3. Add a runner function (`run_<benchmark_name>` or `run_benchmark`)
4. Run the new benchmark via `run_benchmarks.py`


@@ -0,0 +1,443 @@
"""
Continuous batching overall benchmark suite.
Runs CB in-process across many configurations (GSM8K prompts and synthetic
data) and can compare throughput against a previously-saved run.
"""
import argparse
import gc
import json
import time
import types
from collections.abc import Callable
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any
import torch
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.lighteval_task import LightevalTask, LightevalTaskConfig
from lighteval.tasks.prompt_manager import PromptManager
from lighteval.tasks.registry import Registry
from lighteval.tasks.requests import Doc
from tabulate import tabulate
from transformers import AutoModelForCausalLM, AutoTokenizer, ContinuousBatchingConfig, GenerationConfig
# Defaults
RESULTS_DIR = Path(__file__).parent.parent / "benchmark_results/cb_overall/"
def _fmt(val: Any, spec: str = "", missing: str = "X") -> str:
    """Format `val` per `spec`, or return `missing` if val is None."""
    return format(val, spec) if val is not None else missing
def _build_gsm8k_platinum_module() -> types.ModuleType:
    """Define the gsm8k_platinum custom task inline so lighteval's Registry can pick it up via `custom_tasks=`."""

    def gsm8k_platinum_prompt(line, task_name=None):
        return Doc(
            task_name=task_name,
            query=f"Question: {line['question']}\nAnswer:",
            choices=[f" {line['answer']}"],
            gold_index=0,
        )

    metrics = list(Registry().load_all_task_configs()["gsm8k"].metrics)
    mod = types.ModuleType("_gsm8k_platinum_inline")
    mod.TASKS_TABLE = [
        LightevalTaskConfig(
            name="gsm8k_platinum",
            prompt_function=gsm8k_platinum_prompt,
            hf_repo="madrylab/gsm8k-platinum",
            hf_subset="main",
            evaluation_splits=("test",),
            few_shots_split="test",
            few_shots_select="random_sampling",
            generation_size=256,
            stop_sequence=["Question:"],
            metrics=metrics,
        ),
    ]
    return mod
def _build_lighteval_inputs_scorer(
    tokenizer: AutoTokenizer,
    *,
    task_spec: str,
    task_name: str,
    use_chat_template: bool,
    custom_tasks: Any = None,
    primary_metric: str | None = None,
    stop_sequences: tuple[str, ...] = (),
) -> tuple[list[list[int]], Callable[[Any], float]]:
    """Tokenize prompts and build a per-sample scorer for any lighteval task."""
    r = Registry(tasks=task_spec, **({"custom_tasks": custom_tasks} if custom_tasks else {}))
    metric = r.task_to_configs[task_name][0].metrics[0]
    tasks_dict = r.load_tasks()
    LightevalTask.load_datasets(tasks_dict, 1)
    docs = next(iter(tasks_dict.values())).get_docs()
    pm = PromptManager(use_chat_template=use_chat_template, tokenizer=tokenizer, system_prompt=None)
    prompts = [pm.prepare_prompt(doc) for doc in docs]
    inputs = tokenizer(prompts, add_special_tokens=not use_chat_template)["input_ids"]

    def score(outputs) -> float:
        scores = []
        for doc, (_, out) in zip(docs, outputs.items()):
            text = tokenizer.decode(out.generated_tokens, skip_special_tokens=True)
            for s in stop_sequences:
                text = text.split(s, 1)[0]
            value = metric.sample_level_fn.compute(doc, ModelResponse(text=[text]))
            # Grouped metrics return a dict keyed by sub-metric; pick the primary one.
            scores.append(value[primary_metric] if isinstance(value, dict) else value)
        return sum(scores) / len(scores)

    return inputs, score
# Data helpers
def get_tokenized_gsm8k(
    tokenizer: AutoTokenizer, n_fewshot: int = 8
) -> tuple[list[list[int]], Callable[[Any], float]]:
    """GSM8K-Platinum few-shot inputs and scorer using the same lighteval extractive_match as the gsm8k task."""
    return _build_lighteval_inputs_scorer(
        tokenizer,
        task_spec=f"gsm8k_platinum|{n_fewshot}",
        task_name="gsm8k_platinum",
        use_chat_template=False,
        custom_tasks=_build_gsm8k_platinum_module(),
        stop_sequences=("Question:",),
    )


def get_tokenized_ifeval(tokenizer: AutoTokenizer) -> tuple[list[list[int]], Callable[[Any], float]]:
    """IFEval inputs (chat-templated, 0-shot) and scorer reporting prompt-level strict accuracy."""
    return _build_lighteval_inputs_scorer(
        tokenizer,
        task_spec="ifeval|0",
        task_name="ifeval",
        use_chat_template=True,
        primary_metric="prompt_level_strict_acc",
    )


def get_random_data(batch_size: int, num_tokens: int, vocab_size: int = 16000) -> list[list[int]]:
    """Random token sequences of fixed length, for raw throughput tests."""
    rng = torch.Generator().manual_seed(0)
    return [torch.randint(0, vocab_size, (num_tokens,), generator=rng).tolist() for _ in range(batch_size)]
# Benchmark entries and collection
@dataclass
class BenchmarkEntry:
    """Single CB run: what was fed in, which configs were used, and the resulting metrics."""

    label: str
    num_samples: int
    avg_input_tokens: float
    max_new_tokens: int
    cb_config: dict[str, Any]
    gen_config: dict[str, Any]
    time_seconds: float | None = None
    num_tokens: int | None = None
    throughput_tok_per_sec: float | None = None
    peak_memory_gb: float | None = None
    accuracy: float | None = None
    error: str | None = None


def _config_summary(cfg: Any) -> dict[str, Any]:
    """Extract a JSON-friendly summary of a dataclass/config object."""
    raw = cfg.to_dict() if hasattr(cfg, "to_dict") else cfg.__dict__
    return {k: v for k, v in raw.items() if isinstance(v, (int, float, str, bool, type(None)))}
class BenchmarkResults:
    """Holds all CB benchmark runs and the shared model they execute against."""

    def __init__(self, model_id: str, attn_impl: str, tp_size: int = 1):
        self.model_id = model_id
        self.attn_impl = attn_impl
        self.tp_size = tp_size
        self.entries: list[BenchmarkEntry] = []

    def cleanup(self) -> None:
        torch.cuda.empty_cache()
        gc.collect()
        torch.cuda.reset_peak_memory_stats()

    def _get_model(self) -> Any:
        self.cleanup()
        # tp_plan and device_map are mutually exclusive: TP uses its own placement.
        placement = {"tp_plan": "auto"} if self.tp_size > 1 else {"device_map": 0}
        model = AutoModelForCausalLM.from_pretrained(self.model_id, attn_implementation=self.attn_impl, **placement)
        return model.eval()
    def add_benchmark(
        self,
        data: list[list[int]],
        max_new_tokens: int,
        cb_config: ContinuousBatchingConfig,
        gen_config: GenerationConfig | None = None,
        label: str | None = None,
        score_fn: Callable[[Any], float] | None = None,
    ) -> BenchmarkEntry:
        """Run one CB benchmark and record time, tokens, and peak memory."""
        gen_config = GenerationConfig() if gen_config is None else gen_config
        gen_config.max_new_tokens = max_new_tokens
        model = self._get_model()
        avg_input = sum(len(x) for x in data) / max(len(data), 1)
        entry = BenchmarkEntry(
            label=label or f"bench_{len(self.entries)}",
            num_samples=len(data),
            avg_input_tokens=avg_input,
            max_new_tokens=max_new_tokens,
            cb_config=_config_summary(cb_config),
            gen_config=_config_summary(gen_config),
        )
        print(f"\n[{entry.label}] samples={entry.num_samples} avg_in={avg_input:.1f} max_new={max_new_tokens}")
        self.cleanup()
        try:
            outputs = model.generate_batch(
                inputs=data,
                generation_config=gen_config,
                continuous_batching_config=cb_config,
                progress_bar=False,
            )
            # Wall-clock span from the earliest request creation to the latest completion.
            gen_start = min(out.created_time for out in outputs.values())
            gen_end = max(out.lifespan[1] for out in outputs.values())
            gen_time = gen_end - gen_start
            num_tokens = sum(len(out.generated_tokens) for out in outputs.values())
            entry.time_seconds = gen_time
            entry.num_tokens = num_tokens
            entry.throughput_tok_per_sec = num_tokens / gen_time if gen_time > 0 else 0.0
            entry.peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
            if score_fn is not None:
                entry.accuracy = score_fn(outputs)
            print(
                f" {gen_time:.2f}s, {num_tokens} tokens, "
                f"{entry.throughput_tok_per_sec:.2f} tok/s, peak {entry.peak_memory_gb:.2f} GB"
                + (f", acc {entry.accuracy:.3f}" if entry.accuracy is not None else "")
            )
        except Exception as e:
            entry.error = str(e)
            print(f" ERROR: {e}")
        self.entries.append(entry)
        self.cleanup()
        return entry
    # Persistence
    def save(self, name: str) -> Path:
        """Save all entries to a timestamped JSON file keyed by name."""
        RESULTS_DIR.mkdir(parents=True, exist_ok=True)
        filename = RESULTS_DIR / f"{name}__{int(time.time())}.json"
        payload = {
            "model_id": self.model_id,
            "attn_impl": self.attn_impl,
            "entries": [asdict(e) for e in self.entries],
        }
        filename.write_text(json.dumps(payload, indent=2))
        print(f"\nResults saved to {filename}")
        return filename

    @classmethod
    def load_most_recent(cls, name: str) -> "BenchmarkResults":
        """Load the most recent JSON file matching name."""
        candidates = sorted(RESULTS_DIR.glob(f"{name}__*.json"))
        if not candidates:
            raise FileNotFoundError(f"No baseline with name '{name}' in {RESULTS_DIR}")
        data = json.loads(candidates[-1].read_text())
        instance = cls(
            model_id=data.get("model_id"),
            attn_impl=data.get("attn_impl"),
        )
        instance.entries = [BenchmarkEntry(**e) for e in data["entries"]]
        print(f"Loaded baseline from {candidates[-1]}")
        return instance
    # Display
    def print_summary(self) -> None:
        rows = [
            {
                "label": e.label,
                "samples": e.num_samples,
                "avg_in": f"{e.avg_input_tokens:.1f}",
                "max_new": e.max_new_tokens,
                "time (s)": _fmt(e.time_seconds, ".2f"),
                "tokens": _fmt(e.num_tokens, "d"),
                "tok/s": _fmt(e.throughput_tok_per_sec, ".2f", "ERROR"),
                "mem (GB)": _fmt(e.peak_memory_gb, ".2f"),
                "acc": _fmt(e.accuracy, ".3f", "-"),
            }
            for e in self.entries
        ]
        print("\n" + tabulate(rows, headers="keys", tablefmt="github"))

    def compare_to(self, baseline: "BenchmarkResults") -> None:
        """Print a side-by-side throughput comparison against a baseline run."""
        base_tps = {e.label: e.throughput_tok_per_sec for e in baseline.entries}

        def diff(cur: float | None, base: float | None) -> str:
            # A missing or zero baseline cannot yield a relative difference.
            if cur is None or not base:
                return "N/A"
            return f"{(cur - base) / base * 100:+.1f}%"

        rows = [
            {
                "label": e.label,
                "baseline (tok/s)": _fmt(base_tps.get(e.label), ".2f", "N/A"),
                "current (tok/s)": _fmt(e.throughput_tok_per_sec, ".2f", e.error or "N/A"),
                "diff": diff(e.throughput_tok_per_sec, base_tps.get(e.label)),
            }
            for e in self.entries
        ]
        print(f"\nComparison against baseline (model={baseline.model_id}):")
        print(tabulate(rows, headers="keys", tablefmt="github"))
# Main
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", type=str, default=None, help="Name of the benchmark run (for saving).")
    parser.add_argument("--compare-to", type=str, default=None, help="Name of a previous run to compare against.")
    parser.add_argument("--model-id", type=str, default="meta-llama/Llama-3.1-8B-Instruct")
    parser.add_argument("--attn", type=str, default="kernels-community/flash-attn3")
    parser.add_argument("--tp-size", type=int, default=1, help="Tensor parallel size (1 = no TP).")
    parser.add_argument(
        "--rollouts-lengths",
        "-rl",
        type=int,
        nargs="+",
        help="If this is specified, only the rollouts benchmarks run, with the given sizes (in tokens).",
    )
    cli_args = parser.parse_args()

    results = BenchmarkResults(model_id=cli_args.model_id, attn_impl=cli_args.attn, tp_size=cli_args.tp_size)
    tokenizer = AutoTokenizer.from_pretrained(cli_args.model_id, padding_side="left")

    if cli_args.rollouts_lengths is not None:
        rollouts_only = True
        rollout_sizes = cli_args.rollouts_lengths
    else:
        rollouts_only = False
        rollout_sizes = [1024, 2048, 4096, 8192, 16384]
    if not rollouts_only:
        # GSM8K benchmarks (256 max new tokens): gsm8k_platinum dataset, 8-shot, lighteval extractive_match
        gsm8k_data, gsm8k_score_fn = get_tokenized_gsm8k(tokenizer)

        ## No options
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_default",
            score_fn=gsm8k_score_fn,
        )
        ## With sampling. Recommended chat sampling (T=0.6, top_p=0.9), low enough that math reasoning isn't derailed
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(),
            gen_config=GenerationConfig(eos_token_id=-1, do_sample=True, temperature=0.6, top_p=0.9),
            label="gsm8k_sampling",
            score_fn=gsm8k_score_fn,
        )
        ## With compile
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(use_default_compile_configs=True),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_compile",
            score_fn=gsm8k_score_fn,
        )
        ## No decode fast path
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(max_blocks_per_request=0),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_no_fast_decode",
            score_fn=gsm8k_score_fn,
        )
        ## Bare-bones CB config
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(
                max_blocks_per_request=0, use_async_batching=False, use_cuda_graph=False
            ),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_bare_bones",
            score_fn=gsm8k_score_fn,
        )

        # IFEval: 0-shot chat prompts; uses real EOS so instruction-following metrics see the model's natural stop.
        ifeval_data, ifeval_score_fn = get_tokenized_ifeval(tokenizer)
        results.add_benchmark(
            data=ifeval_data,
            max_new_tokens=1280,
            cb_config=ContinuousBatchingConfig(),
            label="ifeval_default",
            score_fn=ifeval_score_fn,
        )

        # Raw benchmarks (various options)
        ## Few blocks: tight cache pressure
        results.add_benchmark(
            data=get_random_data(batch_size=20, num_tokens=256),
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(num_blocks=16),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="few_blocks",
        )
        ## Multiple return sequences (sampling + parallel decoding)
        results.add_benchmark(
            data=get_random_data(batch_size=50, num_tokens=256),
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(),
            gen_config=GenerationConfig(eos_token_id=-1, do_sample=True, num_return_sequences=8),
            label="multi_return_seq",
        )

    ## RL rollouts: small batch, growing generation lengths
    for length in rollout_sizes:
        results.add_benchmark(
            data=get_random_data(batch_size=32, num_tokens=256),
            max_new_tokens=length,
            cb_config=ContinuousBatchingConfig(use_default_compile_configs=True),
            gen_config=GenerationConfig(eos_token_id=-1),
            label=f"rollouts_{length}",
        )
    # Post processing and display. Only on rank 0 in TP runs to avoid duplicate output / file writes.
    is_rank_zero = not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0
    if is_rank_zero:
        results.print_summary()
        if cli_args.compare_to:
            baseline = BenchmarkResults.load_most_recent(cli_args.compare_to)
            results.compare_to(baseline=baseline)
        if cli_args.name:
            results.save(cli_args.name)


@@ -0,0 +1,287 @@
import hashlib
import itertools
import json
import logging
from functools import lru_cache
from typing import Any
import torch
from transformers.generation.configuration_utils import CompileConfig
from transformers.utils import is_torch_accelerator_available
from transformers.utils.import_utils import is_flash_attn_2_available, is_kernels_available
KERNELIZATION_AVAILABLE = False
try:
    from kernels import Mode, kernelize  # noqa: F401

    KERNELIZATION_AVAILABLE = True
except ImportError:
    pass
logger = logging.getLogger(__name__)
@lru_cache
def is_fa2_or_kernel_available() -> bool:
    """Returns True if flash_attn_2 or a fallback kernel is available."""
    # Early return if flash_attn_2 is available
    if is_flash_attn_2_available():
        return True
    # Early return if kernels is not available
    if not is_kernels_available():
        logger.warning(
            "flash_attention_2 is not available. kernels is not installed. Benchmarking flash_attention_2 will not "
            "be possible."
        )
        return False
    # If kernels is available, try to get the flash_attn_2 kernel
    try:
        from kernels import get_kernel

        # TODO: Pass the 'version' kwarg to specify the binary version once kernels >= 0.12.0 is supported.
        get_kernel("kernels-community/flash-attn2")
    except Exception:
        logger.warning(
            "flash_attention_2 is not available. kernels is installed, but the flash_attn kernel is not available. "
            "Benchmarking flash_attention_2 will not be possible."
        )
        return False
    return True
class BenchmarkConfig:
    """Configuration for a single benchmark scenario."""

    all_attn_implementations = ["flash_attention_2", "eager", "sdpa", "flex_attention"]
    all_compiled_modes = [None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"]

    def __init__(
        self,
        warmup_iterations: int = 5,
        measurement_iterations: int = 20,
        gpu_monitoring: bool = True,  # NOTE: you may want to disable this at times, as we have observed it can heavily slow down benchmarks on AMD
        continuous_batching: bool = False,
        batch_size: int = 1,
        sequence_length: int = 128,
        num_tokens_to_generate: int = 128,
        attn_implementation: str = "eager",
        compile_kwargs: dict[str, Any] | None = None,
        kernelize: bool = False,
        tp_plan: str | dict[str, str] | None = None,
        name: str | None = None,
        skip_validity_check: bool = False,
    ) -> None:
        # Benchmark parameters
        self.warmup_iterations = warmup_iterations
        self.measurement_iterations = measurement_iterations
        self.gpu_monitoring = gpu_monitoring
        self.continuous_batching = continuous_batching
        # Input parameters
        self.batch_size = batch_size
        self.sequence_length = sequence_length
        self.num_tokens_to_generate = num_tokens_to_generate
        # Generation parameters
        self.attn_implementation = attn_implementation
        self.tp_plan = tp_plan
        # Optimization parameters
        if compile_kwargs is None:
            self.compile_config = None
        else:
            # fullgraph defaults to True unless the caller set it explicitly
            compile_kwargs["fullgraph"] = compile_kwargs.get("fullgraph", True)
            self.compile_config = CompileConfig(**compile_kwargs)
        self.kernelize = kernelize
        # Constant parameters
        self.dtype = "torch.bfloat16"
        self.device = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"

        self.check_validity(skip_validity_check)
        self.name = name if name is not None else self.infer_name()
    def check_validity(self, skip_validity_check: bool = False) -> None:
        if skip_validity_check:
            return
        # If flash_attention_2 is selected but not available, default to SDPA
        if self.attn_implementation == "flash_attention_2" and not is_fa2_or_kernel_available():
            logger.error("Flash attention is not available. Defaulting to SDPA.")
            self.attn_implementation = "sdpa"
        # The combination of flash_attention_2, compile and generate is not supported # FIXME: support it
        if (
            not self.continuous_batching
            and self.attn_implementation == "flash_attention_2"
            and self.compile_config is not None
        ):
            logger.error(
                "The combination of flash_attention_2, compile and generate is not supported. Turning off compile."
            )
            self.compile_config = None
        # Continuous batching does not support flex attention as an attention implementation # FIXME: support it
        if self.attn_implementation == "flex_attention" and self.continuous_batching:
            logger.error(
                "Disabling continuous batching because of invalid configuration: flex attention is not supported."
            )
            self.continuous_batching = False
        # Continuous batching supports compile mode "default" or "max-autotune-no-cudagraphs"
        if (
            self.continuous_batching
            and self.compile_config is not None
            and self.compile_config.mode not in ["default", "max-autotune-no-cudagraphs"]
        ):
            logger.error(
                f"You have continuous batching and compile enabled, but {self.compile_config.mode = } is not supported."
                " Supported modes are: default, max-autotune-no-cudagraphs. Changing to default."
            )
            self.compile_config.mode = "default"
    @property
    def hash(self) -> str:
        return hashlib.sha256(json.dumps(self.to_dict()).encode()).hexdigest()

    def infer_name(self, compact: bool = True) -> str:
        """Infer a human-readable name for the benchmark config, either compact or verbose."""
        if compact:
            iter_str = f"w{self.warmup_iterations}_i{self.measurement_iterations}"
            gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
            dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
            attn_code = self.attn_implementation
            compile_str = f"compiled_{self.compile_config.mode}" if self.compile_config is not None else "uncompiled"
            kernelize_str = "kernelized" if self.kernelize else "unkernelized"
            continuous_batching_str = "cb" if self.continuous_batching else "generate"
            tp_str = "tp" if self.tp_plan is not None else "no_tp"
            sep = "-"
        else:
            iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
            gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
            dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
            attn_code = f"{self.attn_implementation} attention"
            compile_str = "compiled" if self.compile_config is not None else "not compiled"
            kernelize_str = "kernelized" if self.kernelize else "not kernelized"
            continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
            if self.tp_plan is None:
                tp_str = "no_tp"
            else:
                tp_str = "tp_custom" if isinstance(self.tp_plan, dict) else "tp_auto"
            sep = ", "
        return sep.join(
            [
                iter_str,
                gpu_monitor_str,
                dimensions_str,
                attn_code,
                compile_str,
                kernelize_str,
                continuous_batching_str,
                tp_str,
            ]
        )
    def to_dict(self) -> dict[str, Any]:
        return {
            "name": self.name,
            "warmup_iterations": self.warmup_iterations,
            "measurement_iterations": self.measurement_iterations,
            "gpu_monitoring": self.gpu_monitoring,
            "continuous_batching": self.continuous_batching,
            "batch_size": self.batch_size,
            "sequence_length": self.sequence_length,
            "num_tokens_to_generate": self.num_tokens_to_generate,
            "attn_implementation": self.attn_implementation,
            "compile_kwargs": self.compile_config.to_dict() if self.compile_config is not None else None,
            "kernelize": self.kernelize,
            "tp_plan": self.tp_plan,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
        return cls(
            warmup_iterations=data.get("warmup_iterations", 5),
            measurement_iterations=data.get("measurement_iterations", 20),
            gpu_monitoring=data.get("gpu_monitoring", False),
            continuous_batching=data.get("continuous_batching", False),
            batch_size=data.get("batch_size", 1),
            sequence_length=data.get("sequence_length", 128),
            num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
            attn_implementation=data.get("attn_implementation", "eager"),
            compile_kwargs=data.get("compile_kwargs"),
            kernelize=data.get("kernelize", False),
            tp_plan=data.get("tp_plan"),
            name=data.get("name"),
            skip_validity_check=skip_validity_check,
        )
def adapt_configs(
    configs: list[BenchmarkConfig],
    warmup_iterations: int | list[int] = 5,
    measurement_iterations: int | list[int] = 20,
    batch_size: int | list[int] = 1,
    sequence_length: int | list[int] = 128,
    num_tokens_to_generate: int | list[int] = 128,
    gpu_monitoring: bool | list[bool] = True,
) -> list[BenchmarkConfig]:
    # Wrap scalar arguments in a list so every parameter can go through the product below
    parameters = (
        x if isinstance(x, list) else [x]
        for x in [
            warmup_iterations,
            measurement_iterations,
            batch_size,
            sequence_length,
            num_tokens_to_generate,
            gpu_monitoring,
        ]
    )
    iterator = itertools.product(*parameters)

    adapted_configs = []
    for warmup_iters, measurement_iters, bs, seqlen, ntok, monitor in iterator:
        for config in configs:
            config = config.to_dict()
            config["warmup_iterations"] = warmup_iters
            config["measurement_iterations"] = measurement_iters
            config["batch_size"] = bs
            config["sequence_length"] = seqlen
            config["num_tokens_to_generate"] = ntok
            config["gpu_monitoring"] = monitor
            # Remove the old name so it gets re-inferred with the updated values
            config.pop("name", None)
            adapted_configs.append(BenchmarkConfig.from_dict(config))
    return adapted_configs
def get_config_by_level(level: int) -> list[BenchmarkConfig]:
    configs = []
    # Early return if level is 3 or higher: we generate all combinations of configs, maybe even w/ all compile modes
    if level >= 3:
        for attn_implementation in BenchmarkConfig.all_attn_implementations:
            # Usually there is not much to gain by compiling with other modes, but we allow it for level 4
            compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
            for cm in compile_modes:
                compile_kwargs = {"mode": cm} if cm is not None else None
                # The set collapses to {False} when kernelization is unavailable
                for kernelize_on in {False, KERNELIZATION_AVAILABLE}:
                    for cb_on in [False, True]:
                        configs.append(
                            BenchmarkConfig(
                                attn_implementation=attn_implementation,
                                compile_kwargs=compile_kwargs,
                                kernelize=kernelize_on,
                                continuous_batching=cb_on,
                            )
                        )
        return configs
    # Otherwise, we add the configs for the given level
    if level >= 0:
        configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_kwargs={}))
    if level >= 1:
        configs.append(BenchmarkConfig(attn_implementation="flash_attention_2"))
        configs.append(BenchmarkConfig(attn_implementation="eager", compile_kwargs={}))
        configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", continuous_batching=True))
    if level >= 2:
        configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_kwargs={}))
        configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_kwargs={}, kernelize=True))
        configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
        configs.append(BenchmarkConfig(attn_implementation="sdpa", continuous_batching=True))
    return configs


@@ -0,0 +1,483 @@
import gc
import json
import logging
import os
import pathlib
import re
import tempfile
import time
from datetime import datetime
from queue import Queue
from typing import Any
import torch
from datasets import Dataset
from huggingface_hub import HfApi
from tqdm import trange
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    GenerationConfig,
    GenerationMixin,
    is_torch_xpu_available,
)
from transformers.generation.streamers import BaseStreamer
from transformers.utils import is_torch_accelerator_available
from .benchmark_config import BenchmarkConfig
from .data_classes import BenchmarkMetadata, BenchmarkResult, GPURawMetrics, pretty_print_dict
from .hardware_metrics import GPUMonitor
try:
from kernels import Mode, kernelize # noqa: F401
except ImportError:
kernelize = None
Mode = None
DEFAULT_PROMPT = "\n".join([
"The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
"Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
"It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
"Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
"The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
"The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
"The next three years were dominated by a struggle for political control.",
"King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
"As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
"After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
"About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
"Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
]) # fmt: skip
PUSH_TO_HUB_TOKEN = os.getenv("PUSH_TO_HUB_TOKEN", None)
def compact_json_numeric_arrays(data: dict):
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
pattern = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"
def replace_numeric_array(match):
# Get the array content
content = match.group(1)
# Remove extra whitespace but keep commas
compact_content = re.sub(r"\s+", " ", content).strip()
return f"[{compact_content}]"
return re.sub(pattern, replace_numeric_array, json.dumps(data, indent=4, default=str), flags=re.DOTALL)
def get_git_revision() -> str:
base_path = pathlib.Path(__file__).parent.parent.parent
git_dir = base_path / ".git"
with (git_dir / "HEAD").open("r") as head:
ref = head.readline().split(" ")[-1].strip()
with (git_dir / ref).open("r") as git_hash:
return git_hash.readline().strip()
def flush_memory(flush_compile: bool = True) -> None:
"""Flush GPU memory and run garbage collection. If the flush_compile flag is set, we also clear everything
related to the compile cache."""
gc.collect()
# If needed, flush everything related to torch.compile
if flush_compile:
# Dynamo resets
torch._dynamo.reset()
torch._dynamo.reset_code_caches()
if hasattr(torch._inductor, "codecache"):
# Clear FX graph cache
if hasattr(torch._inductor.codecache, "FxGraphCache"):
torch._inductor.codecache.FxGraphCache.clear()
# Clear PyCodeCache
if hasattr(torch._inductor.codecache, "PyCodeCache"):
torch._inductor.codecache.PyCodeCache.cache_clear()
# Clear TritonFuture cache (for async compilation)
if hasattr(torch._inductor.codecache, "TritonFuture"):
if hasattr(torch._inductor.codecache.TritonFuture, "_compile_cache"):
torch._inductor.codecache.TritonFuture._compile_cache.clear()
# Clear device cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
elif is_torch_xpu_available():
torch.xpu.empty_cache()
torch.xpu.synchronize()
gc.collect()
class BenchmarkStreamer(BaseStreamer):
def __init__(self, **kwargs) -> None:
self.timeout = kwargs.pop("timeout", 10)
self.timestamps = []
self.text_queue = Queue()
self.stop_signal = None
def put(self, value):
"""Receives tokens and logs the timestamp of the generation."""
self.timestamps.append(time.perf_counter())
self.text_queue.put(value)
def end(self):
self.timestamps.append(time.perf_counter())
self.text_queue.put(self.stop_signal)
def __iter__(self):
return self
def __next__(self):
value = self.text_queue.get(timeout=self.timeout)
if value == self.stop_signal:
raise StopIteration()
else:
return value
class BenchmarkRunner:
"""Main benchmark runner that coordinates benchmark execution."""
def __init__(
self,
logger: logging.Logger,
output_dir: str | None = None,
branch_name: str | None = None,
commit_id: str | None = None,
commit_message: str | None = None,
) -> None:
# These stay constant for the whole run
self.logger = logger
if output_dir is None:
output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
self.output_dir = output_dir
self.branch_name = branch_name
self.commit_id = get_git_revision() if commit_id is None else commit_id
self.commit_message = commit_message
os.makedirs(self.output_dir, exist_ok=True)
self.profile_dir = None
# Attributes that are reset for each model
self._setup_for = ""
# Attributes that are reset for each run
self.model: GenerationMixin | None = None
self.device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
self.torch_accelerator_module = getattr(torch, self.device_type, torch.cuda)
def cleanup(self) -> None:
del self.model
self.model = None
flush_memory()
@staticmethod
def _is_primary_process() -> bool:
if not torch.distributed.is_available() or not torch.distributed.is_initialized():
return True
return torch.distributed.get_rank() == 0
def setup_benchmark(self, model_id: str, config: BenchmarkConfig) -> None:
# Some attributes only need to be set once per model
if self._setup_for != model_id:
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
# We set the EOS token to the padding token for open-ended generation
self.tokenizer.eos_token = self.tokenizer.pad_token
self._setup_for = model_id
# Prepare inputs
self.inputs = self.tokenizer(
[DEFAULT_PROMPT for _ in range(config.batch_size)],
return_tensors="pt",
max_length=config.sequence_length,
truncation=True,
return_attention_mask=True,
)
self.inputs["use_cache"] = True
# Prepare generation config
generation_config_kwargs = {
"do_sample": False,
"max_new_tokens": config.num_tokens_to_generate,
}
# Add compile config if found
if config.compile_config is not None:
generation_config_kwargs.update(compile_config=config.compile_config)
# To trigger compile in generate, we need to set the cache to static
if not config.continuous_batching:
generation_config_kwargs.update(cache_implementation="static")
generation_config = GenerationConfig(**generation_config_kwargs)
# Load model
self.logger.debug(f"Loading model {model_id} on device {config.device}...")
dtype = getattr(torch, config.dtype.removeprefix("torch."))
use_kernels = config.kernelize and kernelize is not None and Mode is not None
device_map = config.device if config.tp_plan is None else None
self.model = AutoModelForCausalLM.from_pretrained(
model_id,
dtype=dtype,
attn_implementation=config.attn_implementation,
generation_config=generation_config,
use_kernels=use_kernels,
device_map=device_map,
tp_plan=config.tp_plan,
)
self.model = self.model.eval()
self.inputs = self.inputs.to(self.model.device)
def run_benchmark(self, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> BenchmarkResult | None:
"""Run a single benchmark with the given config on the model loaded by setup_benchmark()."""
with torch.no_grad():
self.logger.info(f"Running benchmark scenario: {config.name}")
self.logger.debug(f"Full config: {config.to_dict()}")
# Quick validation: try one measurement first to see if this scenario works
flush_memory()
e2e_latency = self.time_generate(config, warmup=True)[0]
if e2e_latency < 0:
self.logger.warning(f"Skipping config {config.name}: {e2e_latency = }")
return None
# Warmup runs
self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
for _ in trange(config.warmup_iterations, desc="Warmup"):
self.time_generate(config, warmup=True)
self.logger.info("Warmup over.")
# Measurement runs
result = BenchmarkResult()
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
for _ in trange(config.measurement_iterations, desc="Benchmarking"):
e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics = self.time_generate(
config, warmup=False
)
result.accumulate(e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)
self.logger.info("Benchmarking done. Cleaning up.")
# Profile if needed
if num_tokens_to_profile > 0:
self.profile_generate(num_tokens_to_profile, config.name)
return result
def time_generate(
self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[list[float]], str, GPURawMetrics | None]:
# Prepare gpu monitoring if needed
if config.gpu_monitoring and not warmup:
gpu_monitor = GPUMonitor(logger=self.logger)
gpu_monitor.start()
else:
gpu_monitor = None
# Generate and time
if config.continuous_batching:
inputs = self.inputs["input_ids"].tolist()
wall_time_0 = time.perf_counter()
outputs = self.model.generate_batch(inputs, allow_block_sharing=False, record_timestamps=True)
else:
streamer = BenchmarkStreamer()
wall_time_0 = time.perf_counter()
outputs = self.model.generate(**self.inputs, streamer=streamer)
wall_time_1 = time.perf_counter()
gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
# Retrieve timestamps and results in a way that allows similar post-processing
input_tokens = self.inputs["input_ids"].size(-1)
if config.continuous_batching:
timestamps = [output.timestamps[:] for output in outputs.values()]
results = torch.tensor([output.generated_tokens[:] for output in outputs.values()])
else:
timestamps = [streamer.timestamps[1:]] # skip the first timestamp because it's the input tokens
results = outputs[:, input_tokens:]
outputs = None
flush_memory(flush_compile=False)
# Check if generation had the right number of tokens
if results.size(-1) != config.num_tokens_to_generate:
raise RuntimeError(f"Generated {results.size(-1)} tokens, expected {config.num_tokens_to_generate}")
# Decode outputs
decoded_output = self.tokenizer.decode(results[0], skip_special_tokens=True)
shape_and_decoded_output = f"{tuple(results.shape)} | {decoded_output}"
# Compute metrics
e2e_latency = wall_time_1 - wall_time_0
timestamps = torch.tensor(timestamps).sub(wall_time_0).tolist()
self.logger.info(
f"Time generate done in {e2e_latency:.2f} seconds. Memory usage: {self.torch_accelerator_module.memory_allocated() / 1024**2:.2f} MB"
)
return e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics
def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
"""Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
activities = [torch.profiler.ProfilerActivity.CPU]
if self.device_type == "cuda":
activities.append(torch.profiler.ProfilerActivity.CUDA)
elif self.device_type == "xpu":
activities.append(torch.profiler.ProfilerActivity.XPU)
profiler = torch.profiler.profile(
activities=activities,
record_shapes=True,
)
with profiler as prof:
_ = self.model.generate(
**self.inputs,
max_new_tokens=num_tokens_to_profile,
)
if self.profile_dir is None:
self.profile_dir = self.output_dir + "_profiles"
os.makedirs(self.profile_dir, exist_ok=True)
prof.export_chrome_trace(f"{self.profile_dir}/{config_name}.json")
@torch.inference_mode()
def run_benchmarks(
self,
model_id: str,
benchmark_configs: list[BenchmarkConfig],
num_tokens_to_profile: int = 0,
pretty_print_summary: bool = True,
summarized: bool = True,
) -> tuple[str, dict[str, Any]]:
"""Run multiple benchmarks for the given model ID and list of benchmark configs."""
all_results = {}
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
start_time = time.perf_counter()
n_configs = len(benchmark_configs)
for i, config in enumerate(benchmark_configs):
# Skip if already run
if config.hash in all_results:
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
continue
# Otherwise, run the benchmark
self.setup_benchmark(model_id, config)
self.logger.info(
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
)
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
try:
result = self.run_benchmark(config, num_tokens_to_profile)
except Exception as e:
self.logger.error(f"Error running with scenario: {config.name}:\n{repr(e)}")
result = None
# Memoize
all_results[config.hash] = {
"metadata": BenchmarkMetadata(
model_id=model_id,
branch_name=self.branch_name,
commit_id=self.commit_id,
commit_message=self.commit_message,
success=result is not None,
),
"measurements": result if result is not None else BenchmarkResult(),
"config": config,
}
# Cleanup model and save results
self.cleanup()
self.save_results(model_id, all_results, timestamp=timestamp, summarized=summarized)
if len(all_results) < 1:
raise RuntimeError("No benchmark was run successfully")
if pretty_print_summary:
if not self._is_primary_process():
return (timestamp, all_results)
print()
print("=" * 100)
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
print(f"Total number of benchmarks: {len(all_results)}")
print("First run metadata:")
first_key = list(all_results.keys())[0]
first_metadata = all_results[first_key]["metadata"].to_dict()
hardware_info = first_metadata.pop("hardware_info")
pretty_print_dict(first_metadata | hardware_info, tabs=1)
for result in all_results.values():
print("=" * 100)
print(f"Config: {result['config'].infer_name(compact=False)}\n")
result["measurements"].pprint(
batch_size=result["config"].batch_size,
num_generated_tokens=result["config"].num_tokens_to_generate,
tabs=1,
)
print("=" * 100)
return (timestamp, all_results)
def save_results(self, model_name: str, results: dict, timestamp: str = "", summarized: bool = True) -> str:
"""Save benchmark results to JSON file."""
if not self._is_primary_process():
return ""
# Create model-specific subdirectory
model_name = model_name.replace("/", "_")
model_dir = os.path.join(self.output_dir, model_name)
os.makedirs(model_dir, exist_ok=True)
# Create filename with timestamp
timestamp = timestamp if timestamp else datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{model_name}_benchmark_{timestamp}.json"
filepath = os.path.join(model_dir, filename)
# Convert results to dict
converted_results = {}
for cfg_hash in results.keys():
converted_results[cfg_hash] = {
"metadata": results[cfg_hash]["metadata"].to_dict(),
"measurements": results[cfg_hash]["measurements"].to_dict(summarized=summarized),
"config": results[cfg_hash]["config"].to_dict(),
}
# Save to JSON file
with open(filepath, "w") as f:
f.write(compact_json_numeric_arrays(converted_results))
self.logger.info(f"Results saved to {filepath}")
return filepath
def push_results_to_hub(self, dataset_id: str, results: dict[Any, Any], timestamp: str) -> None:
if PUSH_TO_HUB_TOKEN is None:
raise ValueError(
"PUSH_TO_HUB_TOKEN is not set, cannot push results to the Hub. When setting dataset_id, please also set the PUSH_TO_HUB_TOKEN environment variable."
)
api = HfApi()
n_results = len(results)
for summarized in [False, True]:
self.logger.info(f"Pushing {n_results} results to: {dataset_id} with {summarized = }")
rows = []
for cfg_hash, entry in results.items():
row = {
"benchmark_config_hash": cfg_hash,
"config": entry["config"].to_dict(),
"measurements": entry["measurements"].to_dict(summarized=summarized),
"metadata": entry["metadata"].to_dict(),
}
rows.append(row)
ds = Dataset.from_list(rows)
with tempfile.TemporaryDirectory() as tmp:
file_name = "summarized_results" if summarized else "full_results"
jsonl_path = os.path.join(tmp, f"{file_name}.jsonl")
with open(jsonl_path, "w") as f:
json_lines = []
for ex in ds:
json_lines.append(json.dumps(ex, ensure_ascii=False))
f.write("\n".join(json_lines))
# NOTE: we expect the repository to already exist
timestamp = timestamp if timestamp else datetime.now().strftime("%Y%m%d_%H%M%S")
file_name = file_name + "/" + f"benchmark_run_{timestamp}.jsonl"
api.upload_file(
path_or_fileobj=jsonl_path,
path_in_repo=file_name,
repo_id=dataset_id,
repo_type="dataset",
token=PUSH_TO_HUB_TOKEN,
)
self.logger.info(f"Successfully uploaded results to: {dataset_id} with {summarized = }")
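
To make the behaviour of `compact_json_numeric_arrays` above easier to see, here is a small standalone sketch that reuses the same regular expression on toy data. The wrapper name `compact` and the sample dictionaries are made up for illustration. It also shows one limitation that follows from the pattern as written: it only covers unsigned integers and decimals, so an array containing a negative number is left in its expanded form.

```python
import json
import re

# Same pattern as compact_json_numeric_arrays: only unsigned ints/decimals match.
PATTERN = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"


def compact(data):
    """Hypothetical wrapper: dump with indent=4, then collapse numeric arrays onto one line."""

    def replace_numeric_array(match):
        # Collapse every run of whitespace inside the array to a single space
        return "[" + re.sub(r"\s+", " ", match.group(1)).strip() + "]"

    return re.sub(PATTERN, replace_numeric_array, json.dumps(data, indent=4, default=str), flags=re.DOTALL)


compacted = compact({"latency": [1, 2.5, 3], "name": "demo"})
print(compacted)

# The "-" sign is not covered by the pattern, so this array stays multi-line
signed = compact({"delta": [-1, 2]})
print("[\n" in signed)  # prints "True"
```

For latency measurements, which are non-negative, this limitation should rarely matter, but it is worth keeping in mind if signed values are ever stored in the results.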


@@ -0,0 +1,176 @@
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any
import numpy as np
from .hardware_metrics import GPURawMetrics, HardwareInfo
def compute_basic_statistics(measurements: list[float]) -> dict[str, float]:
return {
"avg": np.mean(measurements) if measurements else 0,
"std": np.std(measurements) if measurements else 0,
"min": np.min(measurements) if measurements else 0,
"med": np.median(measurements) if measurements else 0,
"max": np.max(measurements) if measurements else 0,
"p95": np.percentile(measurements, 95) if measurements else 0,
}
def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
for key in list(stats.keys()):
value = stats[key]
if value > 3600:
stats[key] = f"{(value / 3600):.2f}hr"
elif value > 60:
stats[key] = f"{(value / 60):.2f}min"
elif value > 1:
stats[key] = f"{value:.2f}s"
elif value > 1e-3:
stats[key] = f"{(value * 1e3):.2f}ms"
elif value > 1e-6:
stats[key] = f"{(value * 1e6):.2f}us"
else:
stats[key] = f"{(value * 1e9):.2f}ns"
return stats
def equalize_lengths_and_collate(stats: dict[str, dict[str, str]]) -> dict[str, str]:
"""Note: This operation is destructive as it will update values in place before returning a new correctly formatted dict"""
keys = ["avg", "std", "min", "med", "max", "p95"]
for key in keys:
max_length = max(len(stat[key]) for stat in stats.values())
for stat in stats.values():
stat[key] = stat[key].ljust(max_length, " ")
return {name: " ".join([f"{key}={stat[key]}" for key in keys]) for name, stat in stats.items()}
def pretty_print_dict(data: dict[str, str], tabs: int = 0) -> None:
max_key_length = max([len(key) for key in data.keys()])
for key, value in data.items():
tabs_str = " " * tabs
padded_key = key.ljust(max_key_length + 1, ".")
print(f"{tabs_str}{padded_key}: {value}")
@dataclass
class BenchmarkMetadata:
"""Metadata collected for each benchmark run."""
model_id: str
timestamp: str
branch_name: str
commit_id: str
commit_message: str
hardware_info: HardwareInfo
success: bool
def __init__(
self, model_id: str, commit_id: str, branch_name: str = "main", commit_message: str = "", success: bool = True
) -> None:
self.model_id = model_id
self.timestamp = datetime.now(timezone.utc).isoformat()
self.branch_name = branch_name
self.commit_id = commit_id
self.commit_message = commit_message
self.hardware_info = HardwareInfo()
self.success = success
def to_dict(self) -> dict[str, Any]:
return {
"model_id": self.model_id,
"timestamp": self.timestamp,
"branch_name": self.branch_name,
"commit_id": self.commit_id,
"commit_message": self.commit_message,
"hardware_info": self.hardware_info.to_dict(),
"success": self.success,
}
class BenchmarkResult:
"""Result from a series of benchmark runs."""
def __init__(self) -> None:
self.e2e_latency = []
self._timestamps = []
self.time_to_first_token = []
self.inter_token_latency = []
self.shape_and_decoded_outputs = []
self.gpu_metrics = []
def accumulate(
self,
e2e_latency: float,
timestamps: list[float],
shape_and_decoded_output: str,
gpu_metrics: GPURawMetrics | None,
) -> None:
self.e2e_latency.append(e2e_latency)
self._timestamps.append(timestamps)
self._accumulate_ttft_and_itl(timestamps)
self.shape_and_decoded_outputs.append(shape_and_decoded_output)
self.gpu_metrics.append(gpu_metrics)
def _accumulate_ttft_and_itl(self, timestamps: list[float]) -> None:
timestamps = np.array(timestamps)
ttft = np.min(timestamps[:, 0])
itl = np.mean(timestamps[:, -1] - timestamps[:, 0]) / (timestamps.shape[1] - 1)
self.time_to_first_token.append(ttft)
self.inter_token_latency.append(itl)
def to_dict(self, summarized: bool = False) -> dict[str, Any]:
# Save GPU metrics as None if it contains only None values or if we are summarizing
if summarized or all(gm is None for gm in self.gpu_metrics):
gpu_metrics = None
else:
gpu_metrics = [gm.to_dict() for gm in self.gpu_metrics]
return {
"e2e_latency": self.e2e_latency,
"time_to_first_token": self.time_to_first_token,
"inter_token_latency": self.inter_token_latency,
"shape_and_decoded_outputs": self.shape_and_decoded_outputs,
"gpu_metrics": gpu_metrics,
"timestamps": None if summarized else self._timestamps,
}
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "BenchmarkResult":
# Handle GPU metrics, which is saved as None if it contains only None values
if data["gpu_metrics"] is None:
gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]
else:
gpu_metrics = [GPURawMetrics.from_dict(gm) for gm in data["gpu_metrics"]]
# Handle timestamps, which can be saved as None to reduce file size
if data["timestamps"] is None:
timestamps = [None for _ in range(len(data["e2e_latency"]))]
else:
timestamps = data["timestamps"]
# Create a new instance and accumulate the data
new_instance = cls()
new_instance.e2e_latency = data["e2e_latency"]
new_instance._timestamps = timestamps
new_instance.time_to_first_token = data["time_to_first_token"]
new_instance.inter_token_latency = data["inter_token_latency"]
new_instance.shape_and_decoded_outputs = data["shape_and_decoded_outputs"]
new_instance.gpu_metrics = gpu_metrics
return new_instance
def get_throughput(self, total_generated_tokens: int) -> list[float]:
return [total_generated_tokens / e2e_latency for e2e_latency in self.e2e_latency]
def pprint(self, batch_size: int = 0, num_generated_tokens: int = 0, tabs: int = 0) -> None:
measurements = {
"E2E Latency": add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
"Time to First Token": add_unit_to_duration(compute_basic_statistics(self.time_to_first_token)),
}
if len(self.inter_token_latency) > 0:
measurements["Inter-Token Latency"] = add_unit_to_duration(
compute_basic_statistics(self.inter_token_latency)
)
if batch_size > 0:
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size * num_generated_tokens))
measurements["Throughput"] = {key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()}
dict_to_pprint = equalize_lengths_and_collate(measurements)
pretty_print_dict(dict_to_pprint, tabs=tabs)
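
The two latency metrics computed in `_accumulate_ttft_and_itl` above can be illustrated without NumPy. The following is a minimal sketch, assuming timestamps are given as one list per sequence in the batch, already expressed in seconds since the start of generation, with at least two tokens per sequence. The function name `ttft_and_itl` and the sample values are made up for this example.

```python
from statistics import mean


def ttft_and_itl(timestamps):
    """timestamps[i][j]: seconds from generation start to token j of sequence i."""
    # Time to first token: earliest first-token timestamp across the batch
    ttft = min(row[0] for row in timestamps)
    # Inter-token latency: mean span from first to last token, divided by the number of gaps
    num_gaps = len(timestamps[0]) - 1
    itl = mean(row[-1] - row[0] for row in timestamps) / num_gaps
    return ttft, itl


# Two sequences, four tokens each (made-up values)
ttft, itl = ttft_and_itl([[0.5, 0.75, 1.0, 1.25], [0.5, 1.0, 1.5, 2.0]])
print(ttft, itl)  # prints "0.5 0.375"
```

Note that the inter-token latency divides by the number of gaps between tokens, not the number of tokens, which is why a single generated token would lead to a division by zero here.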


@@ -0,0 +1,325 @@
import logging
import subprocess
import sys
import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
from multiprocessing import Pipe, Process
from multiprocessing.connection import Connection
from transformers.utils.import_utils import is_cuda_platform, is_rocm_platform
if is_cuda_platform():
import pynvml
if is_rocm_platform():
import amdsmi
import psutil
import torch
from transformers.utils import is_torch_accelerator_available
_logger = logging.getLogger(__name__)
# Data class to hold the hardware information
def get_device_name_and_memory_total() -> tuple[str, float]:
"""Returns the name and memory total of GPU 0."""
device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
torch_accelerator_module = getattr(torch, device_type, torch.cuda)
device_name = torch_accelerator_module.get_device_properties(0).name
device_memory_total = torch_accelerator_module.get_device_properties(0).total_memory / 1024**3
return device_name, device_memory_total
class HardwareInfo:
"""A class to hold information about the hardware."""
def __init__(self) -> None:
# Retrieve GPU stats
try:
self.gpu_name, self.gpu_memory_total_gb = get_device_name_and_memory_total()
except Exception:
self.gpu_name, self.gpu_memory_total_gb = None, None
# Retrieve python, torch and CUDA version
self.python_version = f"{sys.version.split()[0]}"
self.torch_version = torch.__version__
if hasattr(torch, "cuda") and torch.cuda.is_available():
self.cuda_version = torch.version.cuda
else:
self.cuda_version = None
# Retrieve general hardware information
self.cpu_count = psutil.cpu_count()
self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))
def to_dict(self) -> dict[str, None | int | float | str]:
return {
"gpu_name": self.gpu_name,
"gpu_memory_total_gb": self.gpu_memory_total_gb,
"python_version": self.python_version,
"torch_version": self.torch_version,
}
# Functions to get information about the GPU
def get_amd_gpu_stats(device_handle) -> tuple[int, float]:
"""Get AMD GPU stats using amdsmi library."""
utilization = amdsmi.amdsmi_get_gpu_activity(device_handle)["gfx_activity"]
memory_used = amdsmi.amdsmi_get_gpu_vram_usage(device_handle)["vram_used"]
return int(utilization), float(memory_used) / 1024**3 # Convert bytes to GB
def get_intel_xpu_stats() -> tuple[int, float]:
"""Returns the utilization and memory used of the most utilized Intel XPU."""
# xpu-smi outputs CSV format: Timestamp, DeviceId, GPU Memory Utilization (%), GPU Memory Used (MiB)
xpu_smi_output = subprocess.check_output(["xpu-smi", "dump", "-m", "5,18", "-n", "1"])
lines = xpu_smi_output.decode("utf-8").strip().split("\n")
# Parse all data lines (skip header) and collect stats from all cards
xpu_stats = []
for line in lines[1:]:
data_line = line.split(",")
if len(data_line) < 4:
continue
device_id = data_line[1].strip()
utilization_str = data_line[2].strip()
memory_used_str = data_line[3].strip()
if utilization_str != "N/A" and memory_used_str != "N/A":
utilization = int(float(utilization_str))
memory_used_mib = float(memory_used_str)
xpu_stats.append((device_id, utilization, memory_used_mib))
if not xpu_stats:
return 0, 0.0
# Sort by utilization (descending) and pick the highest
xpu_stats.sort(key=lambda x: x[1], reverse=True)
device_id, utilization, memory_used_mib = xpu_stats[0]
memory_used_gb = memory_used_mib / 1024
return utilization, memory_used_gb
def get_nvidia_gpu_stats(device_handle) -> tuple[int, float]:
"""Returns the utilization and memory used of an NVIDIA GPU using pynvml."""
utilization = pynvml.nvmlDeviceGetUtilizationRates(device_handle).gpu
memory_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
memory_used_gb = memory_info.used / 1024**3
return int(utilization), float(memory_used_gb)
# Simple data classes to hold the raw GPU metrics
class GPUMonitoringStatus(Enum):
"""Status of GPU monitoring."""
SUCCESS = "success"
FAILED = "failed"
NO_GPUS_AVAILABLE = "no_gpus_available"
NO_SAMPLES_COLLECTED = "no_samples_collected"
@dataclass
class GPURawMetrics:
"""Raw values for GPU utilization and memory used."""
utilization: list[float] # in percent
memory_used: list[float] # in GB
timestamps: list[float] # in seconds
timestamp_0: float # in seconds
monitoring_status: GPUMonitoringStatus
def to_dict(self) -> dict[str, None | int | float | str]:
return {
"utilization": self.utilization,
"memory_used": self.memory_used,
"timestamps": self.timestamps,
"timestamp_0": self.timestamp_0,
"monitoring_status": self.monitoring_status.value,
}
@classmethod
def from_dict(cls, data: dict[str, None | int | float | str]) -> "GPURawMetrics":
"""Create a GPURawMetrics instance from a dictionary."""
return cls(
utilization=data["utilization"],
memory_used=data["memory_used"],
timestamps=data["timestamps"],
timestamp_0=data["timestamp_0"],
monitoring_status=GPUMonitoringStatus(data["monitoring_status"]),
)
# Main class, used to monitor the GPU utilization during benchmark execution
class GPUMonitor:
"""Monitor GPU utilization during benchmark execution using a separate process."""
def __init__(self, sample_interval_sec: float = 0.05, logger: Logger | None = None):
self.sample_interval_sec = sample_interval_sec
self.logger = logger if logger is not None else _logger
self.gpu_type = None
self.process = None
device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
torch_accelerator_module = getattr(torch, device_type, torch.cuda)
self.num_available_gpus = torch_accelerator_module.device_count()
if self.num_available_gpus == 0:
self.logger.warning(f"No GPUs detected by torch.{device_type}.device_count().")
return
# Determine GPU type
device_name, _ = get_device_name_and_memory_total()
if "amd" in device_name.lower():
self.gpu_type = "amd"
elif "nvidia" in device_name.lower():
self.gpu_type = "nvidia"
elif "intel" in device_name.lower() or device_type == "xpu":
self.gpu_type = "intel"
else:
self.logger.warning(f"Unsupported GPU for monitoring: {device_name}")
@staticmethod
def _monitor_worker(gpu_type: str, sample_interval_sec: float, connection: Connection):
"""Worker process for GPU monitoring."""
gpu_utilization = []
gpu_memory_used = []
timestamps = []
device_handle = None
# Initialize GPU-specific monitoring
if gpu_type == "amd":
amdsmi.amdsmi_init()
device_handle = amdsmi.amdsmi_get_processor_handles()[0]
elif gpu_type == "nvidia":
pynvml.nvmlInit()
device_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Signal ready
try:
connection.send(0)
except Exception:
return
# Monitoring loop
stop = False
while not stop:
try:
if gpu_type == "amd":
utilization, memory_used = get_amd_gpu_stats(device_handle)
elif gpu_type == "nvidia":
utilization, memory_used = get_nvidia_gpu_stats(device_handle)
elif gpu_type == "intel":
utilization, memory_used = get_intel_xpu_stats()
else:
break
gpu_utilization.append(utilization)
gpu_memory_used.append(memory_used)
timestamps.append(time.time())
except Exception as e:
# Skips failed measurements
_logger.debug(f"Failed to collect GPU metrics sample: {e}")
stop = connection.poll(sample_interval_sec)
# Cleanup
if gpu_type == "amd":
try:
amdsmi.amdsmi_shut_down()
except Exception as e:
_logger.debug(f"Failed to shutdown AMD GPU monitoring: {e}")
elif gpu_type == "nvidia":
try:
pynvml.nvmlShutdown()
except Exception as e:
_logger.debug(f"Failed to shutdown NVIDIA GPU monitoring: {e}")
# Send results back
try:
connection.send((gpu_utilization, gpu_memory_used, timestamps))
except Exception as e:
_logger.error(f"Failed to send GPU monitoring results: {e}")
connection.close()
    def start(self):
        """Start monitoring GPU metrics in a separate process."""
        if self.gpu_type is None:
            self.logger.debug("GPU monitoring skipped (no supported GPU)")
            return

        self.child_connection, self.parent_connection = Pipe()
        self.process = Process(
            target=GPUMonitor._monitor_worker,
            args=(self.gpu_type, self.sample_interval_sec, self.child_connection),
            daemon=True,
        )
        self.process.start()

        # Wait for worker to signal ready
        if self.process.is_alive():
            self.parent_connection.recv()
        self.logger.debug("GPU monitoring started (multiprocessing)")

    def stop_and_collect(self) -> GPURawMetrics:
        """Stop monitoring and return collected metrics."""
        # No GPU available or unsupported GPU
        if self.process is None:
            return GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.NO_GPUS_AVAILABLE,
            )

        # Process crashed before we could collect results
        process_failed = False
        if not self.process.is_alive():
            process_failed = True
            gpu_utilization, gpu_memory_used, timestamps = [], [], []
        else:
            # Signal stop
            self.parent_connection.send(0)
            # Get results
            try:
                gpu_utilization, gpu_memory_used, timestamps = self.parent_connection.recv()
            except Exception:
                process_failed = True
                gpu_utilization, gpu_memory_used, timestamps = [], [], []

        self.parent_connection.close()
        self.process.join(timeout=2.0)
        if self.process.is_alive():
            self.process.terminate()

        if gpu_utilization:
            timestamp_0 = timestamps[0]
            metrics = GPURawMetrics(
                utilization=gpu_utilization,
                memory_used=gpu_memory_used,
                timestamps=[t - timestamp_0 for t in timestamps],
                timestamp_0=timestamp_0,
                monitoring_status=GPUMonitoringStatus.SUCCESS,
            )
            self.logger.debug(f"GPU monitoring completed: {len(gpu_utilization)} samples collected")
        elif process_failed:
            metrics = GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.FAILED,
            )
            self.logger.warning("GPU monitoring failed (process crashed or timed out)")
        else:
            metrics = GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.NO_SAMPLES_COLLECTED,
            )
        return metrics
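
The ready/stop/collect handshake used by the monitor can be sketched in isolation. This is a minimal illustration, not the module's implementation: the names `sample_worker` and `run_monitor` are hypothetical, the "sample" is a placeholder counter instead of a real GPU query, and the worker runs in a thread rather than a `Process` so the snippet stays portable across start methods.

```python
import time
from multiprocessing import Pipe
from threading import Thread


def sample_worker(sample_interval_sec, connection):
    """Hypothetical worker: collect placeholder samples until the parent sends any message."""
    samples, timestamps = [], []
    connection.send(0)  # signal ready, as the monitor worker does before sampling
    stop = False
    while not stop:
        samples.append(len(samples))  # placeholder for a real GPU utilization query
        timestamps.append(time.monotonic())
        # poll() doubles as the sleep: it returns True as soon as a stop message is pending
        stop = connection.poll(sample_interval_sec)
    connection.send((samples, timestamps))
    connection.close()


def run_monitor(duration_sec=0.05, sample_interval_sec=0.01):
    """Start the worker, wait for readiness, stop it, and return samples with relative timestamps."""
    child_connection, parent_connection = Pipe()
    # Assumption: a thread stands in for the separate process used by the real monitor
    worker = Thread(target=sample_worker, args=(sample_interval_sec, child_connection), daemon=True)
    worker.start()
    parent_connection.recv()  # block until the worker signals ready
    time.sleep(duration_sec)  # stand-in for the benchmarked workload
    parent_connection.send(0)  # signal stop
    samples, timestamps = parent_connection.recv()
    parent_connection.close()
    worker.join(timeout=2.0)
    timestamp_0 = timestamps[0]
    return samples, [t - timestamp_0 for t in timestamps], timestamp_0
```

Because the worker always records one sample before its first `poll()`, the returned lists are never empty in this sketch, and the first relative timestamp is exactly zero.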


@@ -0,0 +1,7 @@
numpy>=1.21.0
psutil>=5.8.0
nvidia-ml-py>=12.0.0
torch>=2.0.0
datasets>=2.10.0
huggingface_hub>=0.16.0
amdsmi>=7.0.2

benchmark_v2/run_benchmarks.py Executable file

@@ -0,0 +1,133 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Top-level benchmarking script that automatically discovers and runs all benchmarks
in the ./benches directory, organizing outputs into model-specific subfolders.
"""
import argparse
import json
import logging
import sys
import uuid
from framework.benchmark_config import BenchmarkConfig, adapt_configs, get_config_by_level
from framework.benchmark_runner import BenchmarkRunner
if __name__ == "__main__":
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", type=str, default=None, help="Output dir for benchmark results")
    parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="WARNING")
    parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
    parser.add_argument("--warmup", "-w", type=int, default=3, help="Number of warmup iterations")
    parser.add_argument("--iterations", "-i", type=int, default=10, help="Number of measurement iterations")
    parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
    parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
    parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")
    parser.add_argument(
        "--level",
        type=int,
        default=1,
        help="Level of coverage for the benchmark. 0: only the main config, 1: a few important configs, 2: a config for"
        " each attn implementation and option, 3: cross-generate all combinations of configs, 4: cross-generate all"
        " combinations of configs w/ all compile modes",
    )
    parser.add_argument("--config-file", type=str, help="Path to a config file in JSON or JSONL format")
    parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")
    parser.add_argument("--enable-tp", action="store_true", help="Enable tensor parallelism with tp_plan=auto")
    parser.add_argument("--branch-name", type=str, help="Git branch name")
    parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")
    parser.add_argument("--commit-message", type=str, help="Git commit message")
    parser.add_argument(
        "--no-gpu-monitoring", action="store_true", help="Disables GPU monitoring during benchmark runs"
    )
    parser.add_argument(
        "--push-result-to-dataset",
        type=str,
        default=None,
        help="Name of the dataset to push results to. If not provided, results are not pushed to the Hub.",
    )
    args = parser.parse_args()

    # Setup logging
    benchmark_run_uuid = str(uuid.uuid4())[:8]
    numeric_level = getattr(logging, args.log_level.upper())
    handlers = [logging.StreamHandler(sys.stdout)]
    logging.basicConfig(
        level=numeric_level, format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s", handlers=handlers
    )
    logger = logging.getLogger("benchmark_v2")
    logger.info("Starting benchmark discovery and execution")
    logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
    logger.info(f"Output directory: {args.output_dir}")

    # Error out if any of the required arguments is missing
    if any(arg is None for arg in [args.batch_size, args.sequence_length, args.num_tokens_to_generate]):
        raise ValueError(
            "All of the arguments --batch-size, --sequence-length, and --num-tokens-to-generate are required"
        )
    # We cannot compute ITL if we don't have at least two measurements
    if any(n <= 1 for n in args.num_tokens_to_generate):
        raise ValueError("--num-tokens-to-generate values must be greater than 1")
    # If a config file is provided, read it and use the configs therein. They will still be adapted to the given arguments.
    if args.config_file is not None:
        if args.config_file.endswith(".json"):
            with open(args.config_file, "r") as f:
                config_as_dicts = [json.load(f)]
        elif args.config_file.endswith(".jsonl"):
            with open(args.config_file, "r") as f:
                config_as_dicts = [json.loads(line) for line in f if line.startswith("{")]
        else:
            raise ValueError(f"Unsupported config file format: {args.config_file}")
        configs = [BenchmarkConfig.from_dict(config) for config in config_as_dicts]
    else:
        # Otherwise, get the configs for the given coverage level
        configs = get_config_by_level(args.level)

    # Adapt the configs to the given arguments
    configs = adapt_configs(
        configs,
        args.warmup,
        args.iterations,
        args.batch_size,
        args.sequence_length,
        args.num_tokens_to_generate,
        not args.no_gpu_monitoring,
    )

    if args.enable_tp:
        for config in configs:
            config.tp_plan = "auto"

    runner = BenchmarkRunner(logger, args.output_dir, args.branch_name, args.commit_id, args.commit_message)
    timestamp, results = runner.run_benchmarks(
        args.model_id, configs, args.num_tokens_to_profile, pretty_print_summary=True
    )

    dataset_id = args.push_result_to_dataset
    if dataset_id is not None and len(results) > 0 and runner._is_primary_process():
        runner.push_results_to_hub(dataset_id, results, timestamp)
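
The `--config-file` handling in the script distinguishes a `.json` file (one config object) from a `.jsonl` file (one config object per line, skipping lines that do not start with `{`). The sketch below isolates that loading step so it can be tried without the benchmark framework. The helper name `load_config_dicts` and the `batch_size` key in the demo are hypothetical; the real fields are whatever `BenchmarkConfig.from_dict` accepts.

```python
import json
import tempfile
from pathlib import Path


def load_config_dicts(path):
    """Hypothetical helper: return a list of config dicts from a .json or .jsonl file."""
    path = Path(path)
    if path.suffix == ".json":
        # A .json file holds a single config object, wrapped in a list for uniformity
        with open(path, "r") as f:
            return [json.load(f)]
    if path.suffix == ".jsonl":
        # A .jsonl file holds one object per line; lines not starting with "{" are skipped
        with open(path, "r") as f:
            return [json.loads(line) for line in f if line.startswith("{")]
    raise ValueError(f"Unsupported config file format: {path}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        jsonl_path = Path(tmp) / "configs.jsonl"
        # "batch_size" is an illustrative key, not a confirmed BenchmarkConfig field
        jsonl_path.write_text('{"batch_size": 1}\n\n{"batch_size": 2}\n')
        print(load_config_dicts(jsonl_path))
```

Note that the `startswith("{")` filter silently drops blank lines and anything else that is not a top-level object, including indented objects, so each JSONL record should begin at column zero.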