first commit
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
2
benchmark_v2/.gitignore
vendored
Normal file
@@ -0,0 +1,2 @@
benchmark_results/
benchmark_results_profiles/
138
benchmark_v2/README.md
Normal file
@@ -0,0 +1,138 @@
# Benchmarking v2

A comprehensive benchmarking framework for transformer models that supports multiple execution modes (eager, compiled, kernelized), detailed performance metrics collection, and structured JSON output.

## Quick Start

### Running All Benchmarks

```bash
# Run all benchmarks with default settings
python run_benchmarks.py

# Specify output directory
python run_benchmarks.py --output-dir my_results

# Run with custom parameters
python run_benchmarks.py \
    --warmup-iterations 5 \
    --measurement-iterations 10 \
    --num-tokens-to-generate 200
```

### Uploading Results to HuggingFace Dataset

You can automatically upload benchmark results to a HuggingFace Dataset for tracking and analysis:

```bash
# Upload to a public dataset with auto-generated run ID
python run_benchmarks.py --upload-to-hub username/benchmark-results

# Upload with a custom run ID for easy identification
python run_benchmarks.py --upload-to-hub username/benchmark-results --run-id experiment_v1

# Upload with custom HuggingFace token (if not set in environment)
python run_benchmarks.py --upload-to-hub username/benchmark-results --token hf_your_token_here
```

**Dataset Directory Structure:**
```
dataset_name/
├── 2025-01-15/
│   ├── runs/                        # Non-scheduled runs (manual, PR, etc.)
│   │   └── 123-1245151651/          # GitHub run number and ID
│   │       └── benchmark_results/
│   │           ├── benchmark_summary_20250115_143022.json
│   │           └── model-name/
│   │               └── model-name_benchmark_20250115_143022.json
│   └── benchmark_results_abc123de/  # Scheduled runs (daily CI)
│       ├── benchmark_summary_20250115_143022.json
│       └── model-name/
│           └── model-name_benchmark_20250115_143022.json
└── 2025-01-16/
    └── ...
```
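
The layout above can be read as a simple path rule. The following is only a sketch of that rule, not the uploader's actual code: `results_prefix` and its parameters are hypothetical names, and `suffix` stands in for the identifier shown as `abc123de` in the tree, whose origin is not stated here.

```python
from __future__ import annotations

from datetime import date
from pathlib import PurePosixPath


def results_prefix(
    day: date,
    *,
    scheduled: bool,
    run_number: int | None = None,
    run_id: int | None = None,
    suffix: str | None = None,
) -> PurePosixPath:
    """Return the dataset-relative folder for one run (hypothetical helper)."""
    root = PurePosixPath(day.isoformat())  # e.g. 2025-01-15
    if scheduled:
        # Scheduled runs sit directly under the date folder.
        return root / f"benchmark_results_{suffix}"
    # Non-scheduled runs are grouped under runs/<run number>-<run id>/.
    return root / "runs" / f"{run_number}-{run_id}" / "benchmark_results"


manual = results_prefix(date(2025, 1, 15), scheduled=False, run_number=123, run_id=1245151651)
nightly = results_prefix(date(2025, 1, 15), scheduled=True, suffix="abc123de")
print(manual)
print(nightly)
```

Keeping scheduled runs at the top of each date folder means a daily dashboard can list one folder per day without walking `runs/`.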

**Authentication for Uploads:**

For uploading results, you need a HuggingFace token with write permissions to the target dataset. You can provide the token in several ways (in order of precedence):

1. Command line: `--token hf_your_token_here`
2. Environment variable: `HF_TOKEN`
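
The precedence order can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `run_benchmarks.py`: `resolve_token` is a hypothetical helper, and the `env` parameter exists only so the example can be run without touching the real environment.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def resolve_token(cli_token: str | None = None, env: Mapping[str, str] | None = None) -> str | None:
    """Pick the upload token: the --token value first, then HF_TOKEN (hypothetical helper)."""
    env = os.environ if env is None else env
    if cli_token:
        # A token passed on the command line wins over the environment.
        return cli_token
    # Fall back to the environment; an unset or empty variable yields None.
    return env.get("HF_TOKEN") or None


from_cli = resolve_token("hf_cli", {"HF_TOKEN": "hf_env"})
from_env = resolve_token(None, {"HF_TOKEN": "hf_env"})
missing = resolve_token(None, {})
```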

### Running Specific Benchmarks

```bash
# Include only specific benchmarks
python run_benchmarks.py --include llama

# Exclude specific benchmarks
python run_benchmarks.py --exclude old_benchmark
```

## Output Format

Results are saved as JSON files with the following structure:

```json
{
  "model_name": "llama_2_7b",
  "benchmark_scenarios": [
    {
      "scenario_name": "eager_variant",
      "metadata": {
        "timestamp": "2025-01-XX...",
        "commit_id": "abc123...",
        "hardware_info": {
          "gpu_name": "NVIDIA A100",
          "gpu_memory_total": 40960,
          "cpu_count": 64
        },
        "config": {
          "variant": "eager",
          "warmup_iterations": 3,
          "measurement_iterations": 5
        }
      },
      "measurements": {
        "latency": {
          "mean": 2.45,
          "median": 2.43,
          "std": 0.12,
          "min": 2.31,
          "max": 2.67,
          "p95": 2.61,
          "p99": 2.65
        },
        "time_to_first_token": {
          "mean": 0.15,
          "std": 0.02
        },
        "tokens_per_second": {
          "mean": 87.3,
          "unit": "tokens/sec"
        }
      },
      "gpu_metrics": {
        "gpu_utilization_mean": 85.2,
        "gpu_memory_used_mean": 12450
      }
    }
  ]
}
```
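
As a rough illustration of how such a file might be consumed, the sketch below pulls one row per scenario out of a result that follows the structure above. `summarize` is a hypothetical helper, and the inline `example` is a trimmed copy of the sample document rather than real benchmark output.

```python
import json


def summarize(result: dict) -> list[tuple[str, float, float]]:
    """Return (scenario name, mean latency, mean tokens/sec) for each scenario."""
    rows = []
    for scenario in result["benchmark_scenarios"]:
        measurements = scenario["measurements"]
        rows.append(
            (
                scenario["scenario_name"],
                measurements["latency"]["mean"],
                measurements["tokens_per_second"]["mean"],
            )
        )
    return rows


# Trimmed copy of the sample structure shown above.
example = json.loads(
    """
    {
      "model_name": "llama_2_7b",
      "benchmark_scenarios": [
        {
          "scenario_name": "eager_variant",
          "measurements": {
            "latency": {"mean": 2.45, "p95": 2.61},
            "tokens_per_second": {"mean": 87.3, "unit": "tokens/sec"}
          }
        }
      ]
    }
    """
)

rows = summarize(example)
for name, latency, tps in rows:
    print(f"{name}: latency mean {latency}, {tps} tokens/sec")
```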

### Debug Mode

```bash
python run_benchmarks.py --log-level DEBUG
```

## Contributing

To add new benchmarks:

1. Create a new file in `benches/`
2. Implement the `ModelBenchmark` interface
3. Add a runner function (`run_<benchmark_name>` or `run_benchmark`)
4. Run `run_benchmarks.py`
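
The runner naming convention from step 3 could be resolved roughly as follows. This is a sketch only: how `run_benchmarks.py` actually discovers runners is not shown here, `find_runner` is a hypothetical helper, and preferring `run_<benchmark_name>` over `run_benchmark` is an assumption.

```python
from __future__ import annotations

import types
from collections.abc import Callable


def find_runner(module: types.ModuleType, benchmark_name: str) -> Callable | None:
    """Look up a runner by the documented names (hypothetical helper)."""
    # Assumed order: the specific name first, then the generic fallback.
    for name in (f"run_{benchmark_name}", "run_benchmark"):
        candidate = getattr(module, name, None)
        if callable(candidate):
            return candidate
    return None


# Stand-in for a file in benches/: a module object with one runner attached.
llama_module = types.ModuleType("llama")
llama_module.run_llama = lambda: "ok"

runner = find_runner(llama_module, "llama")
no_runner = find_runner(types.ModuleType("empty"), "empty")
```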
443
benchmark_v2/benchmark_scripts/continuous_batching_overall.py
Normal file
@@ -0,0 +1,443 @@
"""
Continuous batching overall benchmark suite.

Runs CB in-process across many configurations (GSM8K prompts and synthetic
data) and can compare throughput against a previously-saved run.
"""

import argparse
import gc
import json
import time
import types
from collections.abc import Callable
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any

import torch
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.lighteval_task import LightevalTask, LightevalTaskConfig
from lighteval.tasks.prompt_manager import PromptManager
from lighteval.tasks.registry import Registry
from lighteval.tasks.requests import Doc
from tabulate import tabulate

from transformers import AutoModelForCausalLM, AutoTokenizer, ContinuousBatchingConfig, GenerationConfig


# Defaults
RESULTS_DIR = Path(__file__).parent.parent / "benchmark_results/cb_overall/"


def _fmt(val: Any, spec: str = "", missing: str = "X") -> str:
    """Format `val` per `spec`, or return `missing` if val is None."""
    return format(val, spec) if val is not None else missing


def _build_gsm8k_platinum_module() -> types.ModuleType:
    """Define the gsm8k_platinum custom task inline so lighteval's Registry can pick it up via `custom_tasks=`."""

    def gsm8k_platinum_prompt(line, task_name=None):
        return Doc(
            task_name=task_name,
            query=f"Question: {line['question']}\nAnswer:",
            choices=[f" {line['answer']}"],
            gold_index=0,
        )

    metrics = list(Registry().load_all_task_configs()["gsm8k"].metrics)

    mod = types.ModuleType("_gsm8k_platinum_inline")
    mod.TASKS_TABLE = [
        LightevalTaskConfig(
            name="gsm8k_platinum",
            prompt_function=gsm8k_platinum_prompt,
            hf_repo="madrylab/gsm8k-platinum",
            hf_subset="main",
            evaluation_splits=("test",),
            few_shots_split="test",
            few_shots_select="random_sampling",
            generation_size=256,
            stop_sequence=["Question:"],
            metrics=metrics,
        ),
    ]
    return mod


def _build_lighteval_inputs_scorer(
    tokenizer: AutoTokenizer,
    *,
    task_spec: str,
    task_name: str,
    use_chat_template: bool,
    custom_tasks: Any = None,
    primary_metric: str | None = None,
    stop_sequences: tuple[str, ...] = (),
) -> tuple[list[list[int]], Callable[[Any], float]]:
    """Tokenize prompts and build a per-sample scorer for any lighteval task."""
    r = Registry(tasks=task_spec, **({"custom_tasks": custom_tasks} if custom_tasks else {}))
    metric = r.task_to_configs[task_name][0].metrics[0]
    tasks_dict = r.load_tasks()
    LightevalTask.load_datasets(tasks_dict, 1)
    docs = next(iter(tasks_dict.values())).get_docs()

    pm = PromptManager(use_chat_template=use_chat_template, tokenizer=tokenizer, system_prompt=None)
    prompts = [pm.prepare_prompt(doc) for doc in docs]
    inputs = tokenizer(prompts, add_special_tokens=not use_chat_template)["input_ids"]

    def score(outputs) -> float:
        scores = []
        for doc, (_, out) in zip(docs, outputs.items()):
            text = tokenizer.decode(out.generated_tokens, skip_special_tokens=True)
            for s in stop_sequences:
                text = text.split(s, 1)[0]
            value = metric.sample_level_fn.compute(doc, ModelResponse(text=[text]))
            # Grouped metrics return a dict keyed by sub-metric; pick the primary one.
            scores.append(value[primary_metric] if isinstance(value, dict) else value)
        return sum(scores) / len(scores)

    return inputs, score


# Data helpers
def get_tokenized_gsm8k(
    tokenizer: AutoTokenizer, n_fewshot: int = 8
) -> tuple[list[list[int]], Callable[[Any], float]]:
    """GSM8K-Platinum few-shot inputs and scorer using the same lighteval extractive_match as the gsm8k task."""
    return _build_lighteval_inputs_scorer(
        tokenizer,
        task_spec=f"gsm8k_platinum|{n_fewshot}",
        task_name="gsm8k_platinum",
        use_chat_template=False,
        custom_tasks=_build_gsm8k_platinum_module(),
        stop_sequences=("Question:",),
    )


def get_tokenized_ifeval(tokenizer: AutoTokenizer) -> tuple[list[list[int]], Callable[[Any], float]]:
    """IFEval inputs (chat-templated, 0-shot) and scorer reporting prompt-level strict accuracy."""
    return _build_lighteval_inputs_scorer(
        tokenizer,
        task_spec="ifeval|0",
        task_name="ifeval",
        use_chat_template=True,
        primary_metric="prompt_level_strict_acc",
    )


def get_random_data(batch_size: int, num_tokens: int, vocab_size: int = 16000) -> list[list[int]]:
    """Random token sequences of fixed length, for raw throughput tests."""
    rng = torch.Generator().manual_seed(0)
    return [torch.randint(0, vocab_size, (num_tokens,), generator=rng).tolist() for _ in range(batch_size)]


# Benchmark entries and collection
@dataclass
class BenchmarkEntry:
    """Single CB run: what was fed in, which configs were used, and the resulting metrics."""

    label: str
    num_samples: int
    avg_input_tokens: float
    max_new_tokens: int
    cb_config: dict[str, Any]
    gen_config: dict[str, Any]
    time_seconds: float | None = None
    num_tokens: int | None = None
    throughput_tok_per_sec: float | None = None
    peak_memory_gb: float | None = None
    accuracy: float | None = None
    error: str | None = None


def _config_summary(cfg: Any) -> dict[str, Any]:
    """Extract a JSON-friendly summary of a dataclass/config object."""
    raw = cfg.to_dict() if hasattr(cfg, "to_dict") else cfg.__dict__
    return {k: v for k, v in raw.items() if isinstance(v, (int, float, str, bool, type(None)))}


class BenchmarkResults:
    """Holds all CB benchmark runs and the shared model they execute against."""

    def __init__(self, model_id: str, attn_impl: str, tp_size: int = 1):
        self.model_id = model_id
        self.attn_impl = attn_impl
        self.tp_size = tp_size
        self.entries: list[BenchmarkEntry] = []

    def cleanup(self) -> None:
        # Collect unreachable Python objects first so empty_cache() can release their CUDA memory.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

    def _get_model(self) -> Any:
        self.cleanup()
        # tp_plan and device_map are mutually exclusive: TP uses its own placement.
        placement = {"tp_plan": "auto"} if self.tp_size > 1 else {"device_map": 0}
        model = AutoModelForCausalLM.from_pretrained(self.model_id, attn_implementation=self.attn_impl, **placement)
        return model.eval()

    def add_benchmark(
        self,
        data: list[list[int]],
        max_new_tokens: int,
        cb_config: ContinuousBatchingConfig,
        gen_config: GenerationConfig | None = None,
        label: str | None = None,
        score_fn: Callable[[Any], float] | None = None,
    ) -> BenchmarkEntry:
        """Run one CB benchmark and record time, tokens, and peak memory."""

        gen_config = GenerationConfig() if gen_config is None else gen_config
        gen_config.max_new_tokens = max_new_tokens

        model = self._get_model()

        avg_input = sum(len(x) for x in data) / max(len(data), 1)
        entry = BenchmarkEntry(
            label=label or f"bench_{len(self.entries)}",
            num_samples=len(data),
            avg_input_tokens=avg_input,
            max_new_tokens=max_new_tokens,
            cb_config=_config_summary(cb_config),
            gen_config=_config_summary(gen_config),
        )

        print(f"\n[{entry.label}] samples={entry.num_samples} avg_in={avg_input:.1f} max_new={max_new_tokens}")

        self.cleanup()

        try:
            outputs = model.generate_batch(
                inputs=data,
                generation_config=gen_config,
                continuous_batching_config=cb_config,
                progress_bar=False,
            )
            gen_start = min(out.created_time for out in outputs.values())
            gen_end = max(out.lifespan[1] for out in outputs.values())
            gen_time = gen_end - gen_start
            num_tokens = sum(len(out.generated_tokens) for out in outputs.values())

            entry.time_seconds = gen_time
            entry.num_tokens = num_tokens
            entry.throughput_tok_per_sec = num_tokens / gen_time if gen_time > 0 else 0.0
            entry.peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
            if score_fn is not None:
                entry.accuracy = score_fn(outputs)
            print(
                f" {gen_time:.2f}s, {num_tokens} tokens, "
                f"{entry.throughput_tok_per_sec:.2f} tok/s, peak {entry.peak_memory_gb:.2f} GB"
                + (f", acc {entry.accuracy:.3f}" if entry.accuracy is not None else "")
            )
        except Exception as e:
            entry.error = str(e)
            print(f" ERROR: {e}")

        self.entries.append(entry)
        self.cleanup()
        return entry

    # Persistence
    def save(self, name: str) -> Path:
        """Save all entries to a timestamped JSON file keyed by name."""
        RESULTS_DIR.mkdir(parents=True, exist_ok=True)
        filename = RESULTS_DIR / f"{name}__{int(time.time())}.json"
        payload = {
            "model_id": self.model_id,
            "attn_impl": self.attn_impl,
            "entries": [asdict(e) for e in self.entries],
        }
        filename.write_text(json.dumps(payload, indent=2))
        print(f"\nResults saved to {filename}")
        return filename

    @classmethod
    def load_most_recent(cls, name: str) -> "BenchmarkResults":
        """Load the most recent JSON file matching name."""
        candidates = sorted(RESULTS_DIR.glob(f"{name}__*.json"))
        if not candidates:
            raise FileNotFoundError(f"No baseline with name '{name}' in {RESULTS_DIR}")
        data = json.loads(candidates[-1].read_text())
        instance = cls(
            model_id=data.get("model_id"),
            attn_impl=data.get("attn_impl"),
        )
        instance.entries = [BenchmarkEntry(**e) for e in data["entries"]]
        print(f"Loaded baseline from {candidates[-1]}")
        return instance

    # Display
    def print_summary(self) -> None:
        rows = [
            {
                "label": e.label,
                "samples": e.num_samples,
                "avg_in": f"{e.avg_input_tokens:.1f}",
                "max_new": e.max_new_tokens,
                "time (s)": _fmt(e.time_seconds, ".2f"),
                "tokens": _fmt(e.num_tokens, "d"),
                "tok/s": _fmt(e.throughput_tok_per_sec, ".2f", "ERROR"),
                "mem (GB)": _fmt(e.peak_memory_gb, ".2f"),
                "acc": _fmt(e.accuracy, ".3f", "-"),
            }
            for e in self.entries
        ]
        print("\n" + tabulate(rows, headers="keys", tablefmt="github"))

    def compare_to(self, baseline: "BenchmarkResults") -> None:
        """Print a side-by-side throughput comparison against a baseline run."""
        base_tps = {e.label: e.throughput_tok_per_sec for e in baseline.entries}

        def diff(cur: float | None, base: float | None) -> str:
            if cur is None or not base:
                return "N/A"
            return f"{(cur - base) / base * 100:+.1f}%"

        rows = [
            {
                "label": e.label,
                "baseline (tok/s)": _fmt(base_tps.get(e.label), ".2f", "N/A"),
                "current (tok/s)": _fmt(e.throughput_tok_per_sec, ".2f", e.error or "N/A"),
                "diff": diff(e.throughput_tok_per_sec, base_tps.get(e.label)),
            }
            for e in self.entries
        ]
        print(f"\nComparison against baseline (model={baseline.model_id}):")
        print(tabulate(rows, headers="keys", tablefmt="github"))


# Main
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", type=str, default=None, help="Name of the benchmark run (for saving).")
    parser.add_argument("--compare-to", type=str, default=None, help="Name of a previous run to compare against.")
    parser.add_argument("--model-id", type=str, default="meta-llama/Llama-3.1-8B-Instruct")
    parser.add_argument("--attn", type=str, default="kernels-community/flash-attn3")
    parser.add_argument("--tp-size", type=int, default=1, help="Tensor parallel size (1 = no TP).")
    parser.add_argument(
        "--rollouts-lengths",
        "-rl",
        type=int,
        nargs="+",
        help="If this is specified, only the rollouts benchmarks run, with the given sizes (in tokens).",
    )
    cli_args = parser.parse_args()

    results = BenchmarkResults(model_id=cli_args.model_id, attn_impl=cli_args.attn, tp_size=cli_args.tp_size)
    tokenizer = AutoTokenizer.from_pretrained(cli_args.model_id, padding_side="left")

    if cli_args.rollouts_lengths is not None:
        rollouts_only = True
        rollout_sizes = cli_args.rollouts_lengths
    else:
        rollouts_only = False
        rollout_sizes = [1024, 2048, 4096, 8192, 16384]

    if not rollouts_only:
        # GSM8K benchmarks (256 max new tokens): gsm8k_platinum dataset, 8-shot, lighteval extractive_match
        gsm8k_data, gsm8k_score_fn = get_tokenized_gsm8k(tokenizer)

        ## No options
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_default",
            score_fn=gsm8k_score_fn,
        )

        ## With sampling. Recommended chat sampling (T=0.6, top_p=0.9), low enough that math reasoning isn't derailed
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(),
            gen_config=GenerationConfig(eos_token_id=-1, do_sample=True, temperature=0.6, top_p=0.9),
            label="gsm8k_sampling",
            score_fn=gsm8k_score_fn,
        )

        ## With compile
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(use_default_compile_configs=True),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_compile",
            score_fn=gsm8k_score_fn,
        )

        ## No decode fast path
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(max_blocks_per_request=0),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_no_fast_decode",
            score_fn=gsm8k_score_fn,
        )

        ## Bare-bones CB config
        results.add_benchmark(
            data=gsm8k_data,
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(
                max_blocks_per_request=0, use_async_batching=False, use_cuda_graph=False
            ),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="gsm8k_bare_bones",
            score_fn=gsm8k_score_fn,
        )

        # IFEval: 0-shot chat prompts; uses real EOS so instruction-following metrics see the model's natural stop.
        ifeval_data, ifeval_score_fn = get_tokenized_ifeval(tokenizer)
        results.add_benchmark(
            data=ifeval_data,
            max_new_tokens=1280,
            cb_config=ContinuousBatchingConfig(),
            label="ifeval_default",
            score_fn=ifeval_score_fn,
        )

        # Raw benchmarks (various options)

        ## Few blocks: tight cache pressure
        results.add_benchmark(
            data=get_random_data(batch_size=20, num_tokens=256),
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(num_blocks=16),
            gen_config=GenerationConfig(eos_token_id=-1),
            label="few_blocks",
        )

        ## Multiple return sequences (sampling + parallel decoding)
        results.add_benchmark(
            data=get_random_data(batch_size=50, num_tokens=256),
            max_new_tokens=256,
            cb_config=ContinuousBatchingConfig(),
            gen_config=GenerationConfig(eos_token_id=-1, do_sample=True, num_return_sequences=8),
            label="multi_return_seq",
        )

    ## RL rollouts: small batch, growing generation lengths
    for length in rollout_sizes:
        results.add_benchmark(
            data=get_random_data(batch_size=32, num_tokens=256),
            max_new_tokens=length,
            cb_config=ContinuousBatchingConfig(use_default_compile_configs=True),
            gen_config=GenerationConfig(eos_token_id=-1),
            label=f"rollouts_{length}",
        )

    # Post processing and display. Only on rank 0 in TP runs to avoid duplicate output / file writes.
    is_rank_zero = not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0
    if is_rank_zero:
        results.print_summary()
        if cli_args.compare_to:
            baseline = BenchmarkResults.load_most_recent(cli_args.compare_to)
            results.compare_to(baseline=baseline)
        if cli_args.name:
            results.save(cli_args.name)
287
benchmark_v2/framework/benchmark_config.py
Normal file
@@ -0,0 +1,287 @@
import hashlib
import itertools
import json
import logging
from functools import lru_cache
from typing import Any

import torch

from transformers.generation.configuration_utils import CompileConfig
from transformers.utils import is_torch_accelerator_available
from transformers.utils.import_utils import is_flash_attn_2_available, is_kernels_available


KERNELIZATION_AVAILABLE = False
try:
    from kernels import Mode, kernelize  # noqa: F401

    KERNELIZATION_AVAILABLE = True
except ImportError:
    pass

logger = logging.getLogger(__name__)


@lru_cache
def is_fa2_or_kernel_available() -> bool:
    """Returns True if flash_attn_2 or a fallback kernel is available."""
    # Early return if flash_attn_2 is available
    if is_flash_attn_2_available():
        return True
    # Early return if kernels is not available
    if not is_kernels_available():
        logger.warning(
            "flash_attention_2 is not available. kernels is not installed. Benchmarking flash_attention_2 will not "
            "be possible."
        )
        return False
    # If kernels is available, try to get the flash_attn_2 kernel
    try:
        from kernels import get_kernel

        # TODO: Pass the 'version' kwarg to specify the binary version once kernels >= 0.12.0 is supported.
        get_kernel("kernels-community/flash-attn2")
    except Exception:
        # The trailing space keeps the two concatenated sentences from running together.
        logger.warning(
            "flash_attention_2 is not available. kernels is installed, but the flash_attn kernel is not available. "
            "Benchmarking flash_attention_2 will not be possible."
        )
        return False
    return True


class BenchmarkConfig:
    """Configuration for a single benchmark scenario."""

    all_attn_implementations = ["flash_attention_2", "eager", "sdpa", "flex_attention"]
    all_compiled_modes = [None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"]

    def __init__(
        self,
        warmup_iterations: int = 5,
        measurement_iterations: int = 20,
        gpu_monitoring: bool = True,  # NOTE: you may want to disable this at times, as we have observed it can heavily slow down benchmarks on AMD
        continuous_batching: bool = False,
        batch_size: int = 1,
        sequence_length: int = 128,
        num_tokens_to_generate: int = 128,
        attn_implementation: str = "eager",
        compile_kwargs: dict[str, Any] | None = None,
        kernelize: bool = False,
        tp_plan: str | dict[str, str] | None = None,
        name: str | None = None,
        skip_validity_check: bool = False,
    ) -> None:
        # Benchmark parameters
        self.warmup_iterations = warmup_iterations
        self.measurement_iterations = measurement_iterations
        self.gpu_monitoring = gpu_monitoring
        self.continuous_batching = continuous_batching
        # Input parameters
        self.batch_size = batch_size
        self.sequence_length = sequence_length
        self.num_tokens_to_generate = num_tokens_to_generate
        # Generation parameters
        self.attn_implementation = attn_implementation
        self.tp_plan = tp_plan
        # Optimization parameters
        if compile_kwargs is None:
            self.compile_config = None
        else:
            compile_kwargs["fullgraph"] = compile_kwargs.get("fullgraph", True)
            self.compile_config = CompileConfig(**compile_kwargs)
        self.kernelize = kernelize
        # Constant parameters
        self.dtype = "torch.bfloat16"
        self.device = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"

        self.check_validity(skip_validity_check)
        self.name = name if name is not None else self.infer_name()

|
||||
def check_validity(self, skip_validity_check: bool = False) -> None:
|
||||
if skip_validity_check:
|
||||
return
|
||||
# If flash_attention_2 is selected but not available, default to SDPA
|
||||
if self.attn_implementation == "flash_attention_2" and not is_fa2_or_kernel_available():
|
||||
logger.error("Flash attention is not available. Defaulting to SDPA.")
|
||||
self.attn_implementation = "sdpa"
|
||||
|
||||
# The combination of flash_attention_2, compile and generate is not supported # FIXME: support it
|
||||
if (
|
||||
not self.continuous_batching
|
||||
and self.attn_implementation == "flash_attention_2"
|
||||
and self.compile_config is not None
|
||||
):
|
||||
logger.error(
|
||||
"The combination of flash_attention_2, compile and generate is not supported. Turning off compile."
|
||||
)
|
||||
self.compile_config = None
|
||||
|
||||
# Continuous batching does not support flex attention as an attention implementation # FIXME: support it
|
||||
if self.attn_implementation == "flex_attention" and self.continuous_batching:
|
||||
logger.error(
|
||||
"Disabling continuous batching because of invalid configuration: flex attention is not supported."
|
||||
)
|
||||
self.continuous_batching = False
|
||||
|
||||
# Continuous batching only supports the compile modes "default" and "max-autotune-no-cudagraphs"
|
||||
if (
|
||||
self.continuous_batching
|
||||
and self.compile_config is not None
|
||||
and self.compile_config.mode not in ["default", "max-autotune-no-cudagraphs"]
|
||||
):
|
||||
logger.error(
|
||||
f"You have continuous batching and compile enabled, but {self.compile_config.mode = } is not supported."
|
||||
" Supported modes are: default, max-autotune-no-cudagraphs. Changing to default."
|
||||
)
|
||||
self.compile_config.mode = "default"
|
||||
|
||||
@property
|
||||
def hash(self) -> str:
|
||||
return hashlib.sha256(json.dumps(self.to_dict()).encode()).hexdigest()
|
||||
|
||||
def infer_name(self, compact: bool = True) -> str:
|
||||
"""Infer a human-readable name for the benchmark config, either compact or verbose."""
|
||||
if compact:
|
||||
iter_str = f"w{self.warmup_iterations}_i{self.measurement_iterations}"
|
||||
gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
|
||||
dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
|
||||
attn_code = self.attn_implementation
|
||||
compile_str = f"compiled_{self.compile_config.mode}" if self.compile_config is not None else "uncompiled"
|
||||
kernelize_str = "kernelized" if self.kernelize else "unkernelized"
|
||||
continuous_batching_str = "cb" if self.continuous_batching else "generate"
|
||||
tp_str = "tp" if self.tp_plan is not None else "no_tp"
|
||||
sep = "-"
|
||||
else:
|
||||
iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
|
||||
gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
|
||||
dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
|
||||
attn_code = f"{self.attn_implementation} attention"
|
||||
compile_str = "compiled" if self.compile_config is not None else "not compiled"
|
||||
kernelize_str = "kernelized" if self.kernelize else "not kernelized"
|
||||
continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
|
||||
if self.tp_plan is None:
|
||||
tp_str = "no_tp"
|
||||
else:
|
||||
tp_str = "tp_custom" if isinstance(self.tp_plan, dict) else "tp_auto"
|
||||
sep = ", "
|
||||
return sep.join(
|
||||
[
|
||||
iter_str,
|
||||
gpu_monitor_str,
|
||||
dimensions_str,
|
||||
attn_code,
|
||||
compile_str,
|
||||
kernelize_str,
|
||||
continuous_batching_str,
|
||||
tp_str,
|
||||
]
|
||||
)
|
||||
|
||||
def to_dict(self) -> dict[str, Any]:
|
||||
return {
|
||||
"name": self.name,
|
||||
"warmup_iterations": self.warmup_iterations,
|
||||
"measurement_iterations": self.measurement_iterations,
|
||||
"gpu_monitoring": self.gpu_monitoring,
|
||||
"continuous_batching": self.continuous_batching,
|
||||
"batch_size": self.batch_size,
|
||||
"sequence_length": self.sequence_length,
|
||||
"num_tokens_to_generate": self.num_tokens_to_generate,
|
||||
"attn_implementation": self.attn_implementation,
|
||||
"compile_kwargs": self.compile_config.to_dict() if self.compile_config is not None else None,
|
||||
"kernelize": self.kernelize,
|
||||
"tp_plan": self.tp_plan,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
|
||||
return cls(
|
||||
warmup_iterations=data.get("warmup_iterations", 5),
|
||||
measurement_iterations=data.get("measurement_iterations", 20),
|
||||
gpu_monitoring=data.get("gpu_monitoring", True),
|
||||
continuous_batching=data.get("continuous_batching", False),
|
||||
batch_size=data.get("batch_size", 1),
|
||||
sequence_length=data.get("sequence_length", 128),
|
||||
num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
|
||||
attn_implementation=data.get("attn_implementation", "eager"),
|
||||
compile_kwargs=data.get("compile_kwargs"),
|
||||
kernelize=data.get("kernelize", False),
|
||||
tp_plan=data.get("tp_plan"),
|
||||
name=data.get("name"),
|
||||
skip_validity_check=skip_validity_check,
|
||||
)
|
||||
|
||||
|
||||
def adapt_configs(
|
||||
configs: list[BenchmarkConfig],
|
||||
warmup_iterations: int | list[int] = 5,
|
||||
measurement_iterations: int | list[int] = 20,
|
||||
batch_size: int | list[int] = 1,
|
||||
sequence_length: int | list[int] = 128,
|
||||
num_tokens_to_generate: int | list[int] = 128,
|
||||
gpu_monitoring: bool | list[bool] = True,
|
||||
) -> list[BenchmarkConfig]:
|
||||
parameters = (
|
||||
x if isinstance(x, list) else [x]
|
||||
for x in [
|
||||
warmup_iterations,
|
||||
measurement_iterations,
|
||||
batch_size,
|
||||
sequence_length,
|
||||
num_tokens_to_generate,
|
||||
gpu_monitoring,
|
||||
]
|
||||
)
|
||||
iterator = itertools.product(*parameters)
|
||||
|
||||
adapted_configs = []
|
||||
for warmup_iters, measurement_iters, bs, seqlen, ntok, monitor in iterator:
|
||||
for config in configs:
|
||||
config = config.to_dict()
|
||||
config["warmup_iterations"] = warmup_iters
|
||||
config["measurement_iterations"] = measurement_iters
|
||||
config["batch_size"] = bs
|
||||
config["sequence_length"] = seqlen
|
||||
config["num_tokens_to_generate"] = ntok
|
||||
config["gpu_monitoring"] = monitor
|
||||
# Remove the old name so it gets re-inferred with the updated values
|
||||
config.pop("name", None)
|
||||
adapted_configs.append(BenchmarkConfig.from_dict(config))
|
||||
return adapted_configs
|
||||
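The grid expansion performed by `adapt_configs` can be sketched on plain dicts. This is a minimal illustration, not the framework's code: `expand_grid` is a hypothetical helper, and the configs are stand-in dicts rather than `BenchmarkConfig` objects. It shows how scalar overrides become one-element axes and how every base config is crossed with every combination.

```python
import itertools


def expand_grid(base_configs, **overrides):
    """Hypothetical helper: cross every base config dict with every override combination."""
    keys = list(overrides)
    # Wrap scalars in a list so itertools.product treats them as one-element axes
    axes = [v if isinstance(v, list) else [v] for v in overrides.values()]
    expanded = []
    for combo in itertools.product(*axes):
        for base in base_configs:
            cfg = dict(base)  # shallow copy, so the base dict is left untouched
            cfg.update(zip(keys, combo))
            expanded.append(cfg)
    return expanded


base = [{"attn_implementation": "sdpa"}, {"attn_implementation": "eager"}]
grid = expand_grid(base, batch_size=[1, 8], sequence_length=128)
print(len(grid))  # 2 base configs x 2 batch sizes x 1 sequence length
```

The output size is the product of the axis lengths times the number of base configs, so passing several lists at once can grow the benchmark run quickly.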
|
||||
|
||||
def get_config_by_level(level: int) -> list[BenchmarkConfig]:
|
||||
configs = []
|
||||
# Early return if level is 3 or greater: we generate all combinations of configs, and from level 4 on with all compile modes
|
||||
if level >= 3:
|
||||
for attn_implementation in BenchmarkConfig.all_attn_implementations:
|
||||
# Usually there is not much to gain by compiling with other modes, but we allow it for level 4 and above
|
||||
compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
|
||||
for cm in compile_modes:
|
||||
compile_kwargs = {"mode": cm} if cm is not None else None
|
||||
for kernelize_on in {False, KERNELIZATION_AVAILABLE}:
|
||||
for cb_on in [False, True]:
|
||||
configs.append(
|
||||
BenchmarkConfig(
|
||||
attn_implementation=attn_implementation,
|
||||
compile_kwargs=compile_kwargs,
|
||||
kernelize=kernelize_on,
|
||||
continuous_batching=cb_on,
|
||||
)
|
||||
)
|
||||
return configs
|
||||
# Otherwise, we add the configs for the given level
|
||||
if level >= 0:
|
||||
configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_kwargs={}))
|
||||
if level >= 1:
|
||||
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2"))
|
||||
configs.append(BenchmarkConfig(attn_implementation="eager", compile_kwargs={}))
|
||||
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", continuous_batching=True))
|
||||
if level >= 2:
|
||||
configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_kwargs={}))
|
||||
configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_kwargs={}, kernelize=True))
|
||||
configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
|
||||
configs.append(BenchmarkConfig(attn_implementation="sdpa", continuous_batching=True))
|
||||
return configs
|
||||
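The `hash` property above is what lets the runner skip duplicate configs. A minimal sketch of that idea on plain dicts follows. The `config_hash` function is hypothetical, and `sort_keys=True` is an addition that the original does not use; it is included here only to make the digest independent of key order.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    # sort_keys=True is an assumption of this sketch, not part of the original property
    payload = json.dumps(config, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


configs = [
    {"attn_implementation": "sdpa", "batch_size": 1},
    {"batch_size": 1, "attn_implementation": "sdpa"},  # same content, different key order
    {"attn_implementation": "eager", "batch_size": 1},
]

seen = {}
for cfg in configs:
    # Keep only the first config seen for each digest
    seen.setdefault(config_hash(cfg), cfg)
print(len(seen))
```

Because the original hashes `to_dict()` output, which includes the inferred name, two configs only collide when every serialized field matches.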
483
benchmark_v2/framework/benchmark_runner.py
Normal file
@@ -0,0 +1,483 @@
|
||||
import gc
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import pathlib
|
||||
import re
|
||||
import tempfile
|
||||
import time
|
||||
from datetime import datetime
|
||||
from queue import Queue
|
||||
from typing import Any
|
||||
|
||||
import torch
|
||||
from datasets import Dataset
|
||||
from huggingface_hub import HfApi
|
||||
from tqdm import trange
|
||||
|
||||
from transformers import (
|
||||
AutoModelForCausalLM,
|
||||
AutoTokenizer,
|
||||
GenerationConfig,
|
||||
GenerationMixin,
|
||||
is_torch_xpu_available,
|
||||
)
|
||||
from transformers.generation.streamers import BaseStreamer
|
||||
from transformers.utils import is_torch_accelerator_available
|
||||
|
||||
from .benchmark_config import BenchmarkConfig
|
||||
from .data_classes import BenchmarkMetadata, BenchmarkResult, GPURawMetrics, pretty_print_dict
|
||||
from .hardware_metrics import GPUMonitor
|
||||
|
||||
|
||||
try:
|
||||
from kernels import Mode, kernelize # noqa: F401
|
||||
except ImportError:
|
||||
kernelize = None
|
||||
Mode = None
|
||||
|
||||
|
||||
DEFAULT_PROMPT = "\n".join([
|
||||
"The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
|
||||
"Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
|
||||
"It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
|
||||
"Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
|
||||
"The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
|
||||
"The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
|
||||
"The next three years were dominated by a struggle for political control.",
|
||||
"King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
|
||||
"As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
|
||||
"After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
|
||||
"About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
|
||||
"Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
|
||||
"Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
|
||||
]) # fmt: skip
|
||||
|
||||
PUSH_TO_HUB_TOKEN = os.getenv("PUSH_TO_HUB_TOKEN", None)
|
||||
|
||||
|
||||
def compact_json_numeric_arrays(data: dict):
|
||||
# Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
|
||||
pattern = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"
|
||||
|
||||
def replace_numeric_array(match):
|
||||
# Get the array content
|
||||
content = match.group(1)
|
||||
# Remove extra whitespace but keep commas
|
||||
compact_content = re.sub(r"\s+", " ", content).strip()
|
||||
return f"[{compact_content}]"
|
||||
|
||||
return re.sub(pattern, replace_numeric_array, json.dumps(data, indent=4, default=str), flags=re.DOTALL)
|
||||
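To see what `compact_json_numeric_arrays` does to indented JSON, here is a small standalone sketch that reuses the same pattern. The `compact` wrapper and the sample data are illustrative only.

```python
import json
import re

# Same pattern as above: a bracketed, multi-line list of unsigned ints/floats
PATTERN = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"


def compact(data: dict) -> str:
    text = json.dumps(data, indent=4)

    def collapse(match: re.Match) -> str:
        # Squeeze every run of whitespace (including newlines) into one space
        return "[" + re.sub(r"\s+", " ", match.group(1)).strip() + "]"

    return re.sub(PATTERN, collapse, text)


print(compact({"latency": [0.5, 1, 2.25], "name": "run"}))
```

Note that the pattern only matches unsigned decimal literals, so an array containing a negative number or a value serialized in scientific notation is left in its expanded, one-element-per-line form.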
|
||||
|
||||
def get_git_revision() -> str:
|
||||
base_path = pathlib.Path(__file__).parent.parent.parent
|
||||
git_dir = base_path / ".git"
|
||||
with (git_dir / "HEAD").open("r") as head:
|
||||
ref = head.readline().split(" ")[-1].strip()
|
||||
with (git_dir / ref).open("r") as git_hash:
|
||||
return git_hash.readline().strip()
|
||||
|
||||
|
||||
def flush_memory(flush_compile: bool = True) -> None:
|
||||
"""Flush GPU memory and run garbage collection. If the flush_compile flag is set, we also clear everything
|
||||
related to the compile cache."""
|
||||
gc.collect()
|
||||
# If needed, flush everything related to torch.compile
|
||||
if flush_compile:
|
||||
# Dynamo resets
|
||||
torch._dynamo.reset()
|
||||
torch._dynamo.reset_code_caches()
|
||||
if hasattr(torch._inductor, "codecache"):
|
||||
# Clear FX graph cache
|
||||
if hasattr(torch._inductor.codecache, "FxGraphCache"):
|
||||
torch._inductor.codecache.FxGraphCache.clear()
|
||||
# Clear PyCodeCache
|
||||
if hasattr(torch._inductor.codecache, "PyCodeCache"):
|
||||
torch._inductor.codecache.PyCodeCache.cache_clear()
|
||||
# Clear TritonFuture cache (for async compilation)
|
||||
if hasattr(torch._inductor.codecache, "TritonFuture"):
|
||||
if hasattr(torch._inductor.codecache.TritonFuture, "_compile_cache"):
|
||||
torch._inductor.codecache.TritonFuture._compile_cache.clear()
|
||||
# Clear device cache
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
elif is_torch_xpu_available():
|
||||
torch.xpu.empty_cache()
|
||||
torch.xpu.synchronize()
|
||||
gc.collect()
|
||||
|
||||
|
||||
class BenchmarkStreamer(BaseStreamer):
|
||||
def __init__(self, **kwargs) -> None:
|
||||
self.timeout = kwargs.pop("timeout", 10)
|
||||
self.timestamps = []
|
||||
self.text_queue = Queue()
|
||||
self.stop_signal = None
|
||||
|
||||
def put(self, value):
|
||||
"""Receives tokens and logs the timestamp of the generation."""
|
||||
self.timestamps.append(time.perf_counter())
|
||||
self.text_queue.put(value)
|
||||
|
||||
def end(self):
|
||||
self.timestamps.append(time.perf_counter())
|
||||
self.text_queue.put(self.stop_signal)
|
||||
|
||||
def __iter__(self):
|
||||
return self
|
||||
|
||||
def __next__(self):
|
||||
value = self.text_queue.get(timeout=self.timeout)
|
||||
if value == self.stop_signal:
|
||||
raise StopIteration()
|
||||
else:
|
||||
return value
|
||||
|
||||
|
||||
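The timestamping pattern used by `BenchmarkStreamer` can be tried without a model. The sketch below is a simplified stand-in: `TimestampedQueue` is a hypothetical class that does not inherit from `BaseStreamer`, and it compares against the stop signal with `is` rather than `==`.

```python
import time
from queue import Queue


class TimestampedQueue:
    """Hypothetical stand-in for a streamer: records a timestamp per item and iterates until end()."""

    def __init__(self, timeout: float = 10) -> None:
        self.timeout = timeout
        self.timestamps = []
        self.queue = Queue()
        self.stop_signal = None

    def put(self, value) -> None:
        self.timestamps.append(time.perf_counter())
        self.queue.put(value)

    def end(self) -> None:
        # end() also records a timestamp, so there is one more timestamp than items
        self.timestamps.append(time.perf_counter())
        self.queue.put(self.stop_signal)

    def __iter__(self):
        return self

    def __next__(self):
        value = self.queue.get(timeout=self.timeout)
        if value is self.stop_signal:
            raise StopIteration
        return value


q = TimestampedQueue()
for token in [1, 2, 3]:
    q.put(token)
q.end()
tokens = list(q)
print(tokens, len(q.timestamps))
```

In the runner, the first recorded timestamp corresponds to the prompt being passed to the streamer, which is why `time_generate` drops it before computing metrics.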
class BenchmarkRunner:
|
||||
"""Main benchmark runner that coordinates benchmark execution."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
logger: logging.Logger,
|
||||
output_dir: str | None = None,
|
||||
branch_name: str | None = None,
|
||||
commit_id: str | None = None,
|
||||
commit_message: str | None = None,
|
||||
) -> None:
|
||||
# These stay constant for the whole run
|
||||
self.logger = logger
|
||||
if output_dir is None:
|
||||
output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
|
||||
self.output_dir = output_dir
|
||||
self.branch_name = branch_name
|
||||
self.commit_id = get_git_revision() if commit_id is None else commit_id
|
||||
self.commit_message = commit_message
|
||||
os.makedirs(self.output_dir, exist_ok=True)
|
||||
self.profile_dir = None
|
||||
# Attributes that are reset for each model
|
||||
self._setup_for = ""
|
||||
# Attributes that are reset for each run
|
||||
self.model: GenerationMixin | None = None
|
||||
self.device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
|
||||
self.torch_accelerator_module = getattr(torch, self.device_type, torch.cuda)
|
||||
|
||||
def cleanup(self) -> None:
|
||||
del self.model
|
||||
self.model = None
|
||||
flush_memory()
|
||||
|
||||
@staticmethod
|
||||
def _is_primary_process() -> bool:
|
||||
if not torch.distributed.is_available() or not torch.distributed.is_initialized():
|
||||
return True
|
||||
return torch.distributed.get_rank() == 0
|
||||
|
||||
def setup_benchmark(self, model_id: str, config: BenchmarkConfig) -> None:
|
||||
# Some attributes only need to be set once per model
|
||||
if self._setup_for != model_id:
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
# We set the EOS token to the padding token for open-ended generation
|
||||
self.tokenizer.eos_token = self.tokenizer.pad_token
|
||||
self._setup_for = model_id
|
||||
|
||||
# Prepare inputs
|
||||
self.inputs = self.tokenizer(
|
||||
[DEFAULT_PROMPT for _ in range(config.batch_size)],
|
||||
return_tensors="pt",
|
||||
max_length=config.sequence_length,
|
||||
truncation=True,
|
||||
return_attention_mask=True,
|
||||
)
|
||||
self.inputs["use_cache"] = True
|
||||
|
||||
# Prepare generation config
|
||||
generation_config_kwargs = {
|
||||
"do_sample": False,
|
||||
"max_new_tokens": config.num_tokens_to_generate,
|
||||
}
|
||||
|
||||
# Add compile config if found
|
||||
if config.compile_config is not None:
|
||||
generation_config_kwargs.update(compile_config=config.compile_config)
|
||||
# To trigger compile in generate, we need to set the cache to static
|
||||
if not config.continuous_batching:
|
||||
generation_config_kwargs.update(cache_implementation="static")
|
||||
|
||||
generation_config = GenerationConfig(**generation_config_kwargs)
|
||||
|
||||
# Load model
|
||||
self.logger.debug(f"Loading model {model_id} on device {config.device}...")
|
||||
dtype = getattr(torch, config.dtype.removeprefix("torch."))
|
||||
use_kernels = config.kernelize and kernelize is not None and Mode is not None
|
||||
device_map = config.device if config.tp_plan is None else None
|
||||
self.model = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
dtype=dtype,
|
||||
attn_implementation=config.attn_implementation,
|
||||
generation_config=generation_config,
|
||||
use_kernels=use_kernels,
|
||||
device_map=device_map,
|
||||
tp_plan=config.tp_plan,
|
||||
)
|
||||
self.model = self.model.eval()
|
||||
self.inputs = self.inputs.to(self.model.device)
|
||||
|
||||
def run_benchmark(self, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> BenchmarkResult | None:
|
||||
"""Run a single benchmark with the given config, using the model loaded by setup_benchmark."""
|
||||
with torch.no_grad():
|
||||
self.logger.info(f"Running benchmark scenario: {config.name}")
|
||||
self.logger.debug(f"Full config: {config.to_dict()}")
|
||||
|
||||
# Quick validation: try one measurement first to see if this scenario works
|
||||
flush_memory()
|
||||
e2e_latency = self.time_generate(config, warmup=True)[0]
|
||||
if e2e_latency < 0:
|
||||
self.logger.warning(f"Skipping config {config.name}: {e2e_latency = }")
|
||||
return None
|
||||
|
||||
# Warmup runs
|
||||
self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
|
||||
for _ in trange(config.warmup_iterations, desc="Warmup"):
|
||||
self.time_generate(config, warmup=True)
|
||||
self.logger.info("Warmup over.")
|
||||
|
||||
# Measurement runs
|
||||
result = BenchmarkResult()
|
||||
self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
|
||||
for _ in trange(config.measurement_iterations, desc="Benchmarking"):
|
||||
e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics = self.time_generate(
|
||||
config, warmup=False
|
||||
)
|
||||
result.accumulate(e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)
|
||||
self.logger.info("Benchmarking done. Cleaning up.")
|
||||
|
||||
# Profile if needed
|
||||
if num_tokens_to_profile > 0:
|
||||
self.profile_generate(num_tokens_to_profile, config.name)
|
||||
|
||||
return result
|
||||
|
||||
def time_generate(
|
||||
self, config: BenchmarkConfig, warmup: bool
|
||||
) -> tuple[float, list[float], str, GPURawMetrics | None]:
|
||||
# Prepare gpu monitoring if needed
|
||||
if config.gpu_monitoring and not warmup:
|
||||
gpu_monitor = GPUMonitor(logger=self.logger)
|
||||
gpu_monitor.start()
|
||||
else:
|
||||
gpu_monitor = None
|
||||
|
||||
# Generate and time
|
||||
if config.continuous_batching:
|
||||
inputs = self.inputs["input_ids"].tolist()
|
||||
wall_time_0 = time.perf_counter()
|
||||
outputs = self.model.generate_batch(inputs, allow_block_sharing=False, record_timestamps=True)
|
||||
else:
|
||||
streamer = BenchmarkStreamer()
|
||||
wall_time_0 = time.perf_counter()
|
||||
outputs = self.model.generate(**self.inputs, streamer=streamer)
|
||||
|
||||
wall_time_1 = time.perf_counter()
|
||||
gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
|
||||
|
||||
# Retrieve timestamps and results in a way that allows similar post-processing
|
||||
input_tokens = self.inputs["input_ids"].size(-1)
|
||||
if config.continuous_batching:
|
||||
timestamps = [output.timestamps[:] for output in outputs.values()]
|
||||
results = torch.tensor([output.generated_tokens[:] for output in outputs.values()])
|
||||
else:
|
||||
timestamps = [streamer.timestamps[1:]] # skip the first timestamp because it's the input tokens
|
||||
results = outputs[:, input_tokens:]
|
||||
outputs = None
|
||||
flush_memory(flush_compile=False)
|
||||
|
||||
# Check if generation had the right number of tokens
|
||||
if results.size(-1) != config.num_tokens_to_generate:
|
||||
raise RuntimeError(f"Generated {results.size(-1)} tokens, expected {config.num_tokens_to_generate}")
|
||||
|
||||
# Decode outputs
|
||||
decoded_output = self.tokenizer.decode(results[0], skip_special_tokens=True)
|
||||
shape_and_decoded_output = f"{tuple(results.shape)} | {decoded_output}"
|
||||
|
||||
# Compute metrics
|
||||
e2e_latency = wall_time_1 - wall_time_0
|
||||
timestamps = torch.tensor(timestamps).sub(wall_time_0).tolist()
|
||||
self.logger.info(
|
||||
f"Time generate done in {e2e_latency:.2f} seconds. Memory usage: {self.torch_accelerator_module.memory_allocated() / 1024**2:.2f} MB"
|
||||
)
|
||||
return e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics
|
||||
|
||||
def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
|
||||
"""Profile the latency of a call to model.generate() with the given (inputs) and (max_new_tokens)."""
|
||||
activities = [torch.profiler.ProfilerActivity.CPU]
|
||||
if self.device_type == "cuda":
|
||||
activities.append(torch.profiler.ProfilerActivity.CUDA)
|
||||
elif self.device_type == "xpu":
|
||||
activities.append(torch.profiler.ProfilerActivity.XPU)
|
||||
|
||||
profiler = torch.profiler.profile(
|
||||
activities=activities,
|
||||
record_shapes=True,
|
||||
)
|
||||
with profiler as prof:
|
||||
_ = self.model.generate(
|
||||
**self.inputs,
|
||||
max_new_tokens=num_tokens_to_profile,
|
||||
)
|
||||
if self.profile_dir is None:
|
||||
self.profile_dir = self.output_dir + "_profiles"
|
||||
os.makedirs(self.profile_dir, exist_ok=True)
|
||||
prof.export_chrome_trace(f"{self.profile_dir}/{config_name}.json")
|
||||
|
||||
@torch.inference_mode()
|
||||
def run_benchmarks(
|
||||
self,
|
||||
model_id: str,
|
||||
benchmark_configs: list[BenchmarkConfig],
|
||||
num_tokens_to_profile: int = 0,
|
||||
pretty_print_summary: bool = True,
|
||||
summarized: bool = True,
|
||||
) -> tuple[str, dict[str, Any]]:
|
||||
"""Run multiple benchmarks for the given model ID and list of benchmark configs."""
|
||||
all_results = {}
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
start_time = time.perf_counter()
|
||||
|
||||
n_configs = len(benchmark_configs)
|
||||
for i, config in enumerate(benchmark_configs):
|
||||
# Skip if already run
|
||||
if config.hash in all_results:
|
||||
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
|
||||
continue
|
||||
|
||||
# Otherwise, run the benchmark
|
||||
self.setup_benchmark(model_id, config)
|
||||
self.logger.info(
|
||||
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
|
||||
)
|
||||
|
||||
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
|
||||
try:
|
||||
result = self.run_benchmark(config, num_tokens_to_profile)
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error running with scenario: {config.name}:\n{repr(e)}")
|
||||
result = None
|
||||
|
||||
# Memoize
|
||||
all_results[config.hash] = {
|
||||
"metadata": BenchmarkMetadata(
|
||||
model_id=model_id,
|
||||
branch_name=self.branch_name,
|
||||
commit_id=self.commit_id,
|
||||
commit_message=self.commit_message,
|
||||
success=result is not None,
|
||||
),
|
||||
"measurements": result if result is not None else BenchmarkResult(),
|
||||
"config": config,
|
||||
}
|
||||
|
||||
# Cleanup model and save results
|
||||
self.cleanup()
|
||||
self.save_results(model_id, all_results, timestamp=timestamp, summarized=summarized)
|
||||
|
||||
if len(all_results) < 1:
|
||||
raise RuntimeError("No benchmark was run successfully")
|
||||
|
||||
if pretty_print_summary:
|
||||
if not self._is_primary_process():
|
||||
return (timestamp, all_results)
|
||||
print()
|
||||
print("=" * 100)
|
||||
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
|
||||
print(f"Total number of benchmarks: {len(all_results)}")
|
||||
print("First run metadata:")
|
||||
first_key = list(all_results.keys())[0]
|
||||
first_metadata = all_results[first_key]["metadata"].to_dict()
|
||||
hardware_info = first_metadata.pop("hardware_info")
|
||||
pretty_print_dict(first_metadata | hardware_info, tabs=1)
|
||||
for result in all_results.values():
|
||||
print("=" * 100)
|
||||
print(f"Config: {result['config'].infer_name(compact=False)}\n")
|
||||
result["measurements"].pprint(
|
||||
batch_size=result["config"].batch_size,
|
||||
num_generated_tokens=result["config"].num_tokens_to_generate,
|
||||
tabs=1,
|
||||
)
|
||||
print("=" * 100)
|
||||
|
||||
return (timestamp, all_results)
|
||||
|
||||
def save_results(self, model_name: str, results: dict, timestamp: str = "", summarized: bool = True) -> str:
|
||||
"""Save benchmark results to JSON file."""
|
||||
if not self._is_primary_process():
|
||||
return ""
|
||||
# Create model-specific subdirectory
|
||||
model_name = model_name.replace("/", "_")
|
||||
model_dir = os.path.join(self.output_dir, model_name)
|
||||
os.makedirs(model_dir, exist_ok=True)
|
||||
|
||||
# Create filename with timestamp
|
||||
timestamp = timestamp if timestamp else datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"{model_name}_benchmark_{timestamp}.json"
|
||||
filepath = os.path.join(model_dir, filename)
|
||||
|
||||
# Convert results to dict
|
||||
converted_results = {}
|
||||
for cfg_hash in results.keys():
|
||||
converted_results[cfg_hash] = {
|
||||
"metadata": results[cfg_hash]["metadata"].to_dict(),
|
||||
"measurements": results[cfg_hash]["measurements"].to_dict(summarized=summarized),
|
||||
"config": results[cfg_hash]["config"].to_dict(),
|
||||
}
|
||||
|
||||
# Save to JSON file
|
||||
with open(filepath, "w") as f:
|
||||
f.write(compact_json_numeric_arrays(converted_results))
|
||||
|
||||
self.logger.info(f"Results saved to {filepath}")
|
||||
return filepath
|
||||
|
||||
def push_results_to_hub(self, dataset_id: str, results: dict[Any, Any], timestamp: str) -> None:
|
||||
if PUSH_TO_HUB_TOKEN is None:
|
||||
raise ValueError(
|
||||
"PUSH_TO_HUB_TOKEN is not set, cannot push results to the Hub. When setting dataset_id, please also set the PUSH_TO_HUB_TOKEN environment variable."
|
||||
)
|
||||
|
||||
api = HfApi()
|
||||
n_results = len(results)
|
||||
for summarized in [False, True]:
|
||||
self.logger.info(f"Pushing {n_results} results to: {dataset_id} with {summarized = }")
|
||||
rows = []
|
||||
for cfg_hash, entry in results.items():
|
||||
row = {
|
||||
"benchmark_config_hash": cfg_hash,
|
||||
"config": entry["config"].to_dict(),
|
||||
"measurements": entry["measurements"].to_dict(summarized=summarized),
|
||||
"metadata": entry["metadata"].to_dict(),
|
||||
}
|
||||
rows.append(row)
|
||||
|
||||
ds = Dataset.from_list(rows)
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
file_name = "summarized_results" if summarized else "full_results"
|
||||
jsonl_path = os.path.join(tmp, f"{file_name}.jsonl")
|
||||
with open(jsonl_path, "w") as f:
|
||||
json_lines = []
|
||||
for ex in ds:
|
||||
json_lines.append(json.dumps(ex, ensure_ascii=False))
|
||||
f.write("\n".join(json_lines))
|
||||
|
||||
# NOTE: we expect the repository to already exist
|
||||
timestamp = timestamp if timestamp else datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
file_name = file_name + "/" + f"benchmark_run_{timestamp}.jsonl"
|
||||
api.upload_file(
|
||||
path_or_fileobj=jsonl_path,
|
||||
path_in_repo=file_name,
|
||||
repo_id=dataset_id,
|
||||
repo_type="dataset",
|
||||
token=PUSH_TO_HUB_TOKEN,
|
||||
)
|
||||
self.logger.info(f"Successfully uploaded results to: {dataset_id} with {summarized = }")
|
||||
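`time_generate` returns per-token timestamps relative to the start of generation. How those are turned into time to first token (TTFT) and inter-token latency (ITL) is not visible in this part of the diff, so the following is only one plausible reading, with a hypothetical `ttft_and_itl` helper: TTFT is the first timestamp, and ITL is the average gap between consecutive tokens.

```python
def ttft_and_itl(timestamps: list[float]) -> tuple[float, float]:
    """Assumed definitions: timestamps are seconds since generation start, one per generated token."""
    ttft = timestamps[0]
    if len(timestamps) < 2:
        # A single token has no inter-token gap
        return ttft, 0.0
    # Average gap between the first and last token
    itl = (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)
    return ttft, itl


ttft, itl = ttft_and_itl([0.25, 0.5, 0.75, 1.0])
print(ttft, itl)
```

The actual `_accumulate_ttft_and_itl` method may define these quantities differently, for example per batch element when continuous batching is enabled.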
176
benchmark_v2/framework/data_classes.py
Normal file
@@ -0,0 +1,176 @@
|
||||
from dataclasses import dataclass
|
||||
from datetime import datetime, timezone
|
||||
from typing import Any
|
||||
|
||||
import numpy as np
|
||||
|
||||
from .hardware_metrics import GPURawMetrics, HardwareInfo
|
||||
|
||||
|
||||
def compute_basic_statistics(measurements: list[float]) -> dict[str, float]:
|
||||
return {
|
||||
"avg": np.mean(measurements) if measurements else 0,
|
||||
"std": np.std(measurements) if measurements else 0,
|
||||
"min": np.min(measurements) if measurements else 0,
|
||||
"med": np.median(measurements) if measurements else 0,
|
||||
"max": np.max(measurements) if measurements else 0,
|
||||
"p95": np.percentile(measurements, 95) if measurements else 0,
|
||||
}
|
||||
|
||||
|
||||
def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
|
||||
for key in list(stats.keys()):
|
||||
value = stats[key]
|
||||
if value > 3600:
|
||||
stats[key] = f"{(value / 3600):.2f}hr"
|
||||
elif value > 60:
|
||||
stats[key] = f"{(value / 60):.2f}min"
|
||||
elif value > 1:
|
||||
stats[key] = f"{value:.2f}s"
|
||||
elif value > 1e-3:
|
||||
stats[key] = f"{(value * 1e3):.2f}ms"
|
||||
elif value > 1e-6:
|
||||
stats[key] = f"{(value * 1e6):.2f}us"
|
||||
else:
|
||||
stats[key] = f"{(value * 1e9):.2f}ns"
|
||||
return stats
|
||||
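The unit thresholds in `add_unit_to_duration` are easier to check on a single value. The sketch below applies the same cut-offs in a hypothetical `format_duration` function that returns a new string instead of overwriting dict entries in place.

```python
def format_duration(seconds: float) -> str:
    """Format a duration in seconds using the same thresholds as above (strict comparisons)."""
    if seconds > 3600:
        return f"{seconds / 3600:.2f}hr"
    if seconds > 60:
        return f"{seconds / 60:.2f}min"
    if seconds > 1:
        return f"{seconds:.2f}s"
    if seconds > 1e-3:
        return f"{seconds * 1e3:.2f}ms"
    if seconds > 1e-6:
        return f"{seconds * 1e6:.2f}us"
    return f"{seconds * 1e9:.2f}ns"


for value in [7200, 90, 1.5, 0.0042]:
    print(value, "->", format_duration(value))
```

Because the comparisons are strict, a value of exactly one second falls through to the millisecond branch and is rendered as `1000.00ms`.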
|
||||
|
||||
def equalize_lengths_and_collate(stats: dict[str, dict[str, str]]) -> dict[str, str]:
|
||||
"""Note: This operation is destructive as it will update values in place before returning a new correctly formatted dict"""
|
||||
keys = ["avg", "std", "min", "med", "max", "p95"]
|
||||
for key in keys:
|
||||
max_length = max(len(stat[key]) for stat in stats.values())
|
||||
for stat in stats.values():
|
||||
stat[key] = stat[key].ljust(max_length, " ")
|
||||
return {name: " ".join([f"{key}={stat[key]}" for key in keys]) for name, stat in stats.items()}
|
||||
|
||||
|
||||
def pretty_print_dict(data: dict[str, str], tabs: int = 0) -> None:
    max_key_length = max([len(key) for key in data.keys()])
    for key, value in data.items():
        tabs_str = " " * tabs
        padded_key = key.ljust(max_key_length + 1, ".")
        print(f"{tabs_str}{padded_key}: {value}")


@dataclass
class BenchmarkMetadata:
    """Metadata collected for each benchmark run."""

    model_id: str
    timestamp: str
    branch_name: str
    commit_id: str
    commit_message: str
    hardware_info: HardwareInfo
    success: bool

    def __init__(
        self, model_id: str, commit_id: str, branch_name: str = "main", commit_message: str = "", success: bool = True
    ) -> None:
        self.model_id = model_id
        self.timestamp = datetime.now(timezone.utc).isoformat()
        self.branch_name = branch_name
        self.commit_id = commit_id
        self.commit_message = commit_message
        self.hardware_info = HardwareInfo()
        self.success = success

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_id": self.model_id,
            "timestamp": self.timestamp,
            "branch_name": self.branch_name,
            "commit_id": self.commit_id,
            "commit_message": self.commit_message,
            "hardware_info": self.hardware_info.to_dict(),
            "success": self.success,
        }


class BenchmarkResult:
    """Result from a series of benchmark runs."""

    def __init__(self) -> None:
        self.e2e_latency = []
        self._timestamps = []
        self.time_to_first_token = []
        self.inter_token_latency = []
        self.shape_and_decoded_outputs = []
        self.gpu_metrics = []

    def accumulate(
        self,
        e2e_latency: float,
        timestamps: list[float],
        shape_and_decoded_output: str,
        gpu_metrics: GPURawMetrics | None,
    ) -> None:
        self.e2e_latency.append(e2e_latency)
        self._timestamps.append(timestamps)
        self._accumulate_ttft_and_itl(timestamps)
        self.shape_and_decoded_outputs.append(shape_and_decoded_output)
        self.gpu_metrics.append(gpu_metrics)

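    def _accumulate_ttft_and_itl(self, timestamps: list[float]) -> None:
        # Indexed below as a 2D array: one row per sequence, one column per generated token
        timestamps = np.array(timestamps)
        # TTFT is the earliest first-token timestamp across the batch
        ttft = np.min(timestamps[:, 0])
        # ITL is the mean (last - first) token gap divided by the number of intervals
        itl = np.mean(timestamps[:, -1] - timestamps[:, 0]) / (timestamps.shape[1] - 1)
        self.time_to_first_token.append(ttft)
        self.inter_token_latency.append(itl)
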
    def to_dict(self, summarized: bool = False) -> dict[str, Any]:
        # Save GPU metrics as None if it contains only None values or if we are summarizing
        if summarized or all(gm is None for gm in self.gpu_metrics):
            gpu_metrics = None
        else:
            gpu_metrics = [gm.to_dict() for gm in self.gpu_metrics]
        return {
            "e2e_latency": self.e2e_latency,
            "time_to_first_token": self.time_to_first_token,
            "inter_token_latency": self.inter_token_latency,
            "shape_and_decoded_outputs": self.shape_and_decoded_outputs,
            "gpu_metrics": gpu_metrics,
            "timestamps": None if summarized else self._timestamps,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "BenchmarkResult":
        # Handle GPU metrics, which is saved as None if it contains only None values
        if data["gpu_metrics"] is None:
            gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]
        else:
            gpu_metrics = [GPURawMetrics.from_dict(gm) for gm in data["gpu_metrics"]]
        # Handle timestamps, which can be saved as None to reduce file size
        if data["timestamps"] is None:
            timestamps = [None for _ in range(len(data["e2e_latency"]))]
        else:
            timestamps = data["timestamps"]
        # Create a new instance and accumulate the data
        new_instance = cls()
        new_instance.e2e_latency = data["e2e_latency"]
        new_instance._timestamps = timestamps
        new_instance.time_to_first_token = data["time_to_first_token"]
        new_instance.inter_token_latency = data["inter_token_latency"]
        new_instance.shape_and_decoded_outputs = data["shape_and_decoded_outputs"]
        new_instance.gpu_metrics = gpu_metrics
        return new_instance

    def get_throughput(self, total_generated_tokens: int) -> list[float]:
        return [total_generated_tokens / e2e_latency for e2e_latency in self.e2e_latency]

    def pprint(self, batch_size: int = 0, num_generated_tokens: int = 0, tabs: int = 0) -> None:
        measurements = {
            "E2E Latency": add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
            "Time to First Token": add_unit_to_duration(compute_basic_statistics(self.time_to_first_token)),
        }
        if len(self.inter_token_latency) > 0:
            measurements["Inter-Token Latency"] = add_unit_to_duration(
                compute_basic_statistics(self.inter_token_latency)
            )
        if batch_size > 0:
            throughput_stats = compute_basic_statistics(self.get_throughput(batch_size * num_generated_tokens))
            measurements["Throughput"] = {key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()}
        dict_to_pprint = equalize_lengths_and_collate(measurements)
        pretty_print_dict(dict_to_pprint, tabs=tabs)
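The TTFT and ITL bookkeeping in `BenchmarkResult._accumulate_ttft_and_itl` can be checked by hand on a small batch. The sketch below is a standalone illustration, not part of the commit: the helper name `ttft_and_itl` and the example timestamps are hypothetical, and it assumes each row holds one sequence's token timestamps measured in seconds from the start of generation.

```python
import numpy as np


def ttft_and_itl(timestamps: list[list[float]]) -> tuple[float, float]:
    """Mirror the arithmetic of `_accumulate_ttft_and_itl` on a (batch, tokens) array."""
    ts = np.array(timestamps, dtype=float)
    # Time to first token: the earliest first-token timestamp in the batch
    ttft = float(np.min(ts[:, 0]))
    # Inter-token latency: mean (last - first) span, divided by the number of gaps
    itl = float(np.mean(ts[:, -1] - ts[:, 0]) / (ts.shape[1] - 1))
    return ttft, itl


# Hypothetical batch of 2 sequences with 4 generated tokens each
example = [
    [0.10, 0.15, 0.20, 0.25],
    [0.12, 0.18, 0.24, 0.30],
]
ttft, itl = ttft_and_itl(example)
print(f"TTFT={ttft * 1e3:.2f}ms ITL={itl * 1e3:.2f}ms")
```

With a single generated token the divisor `ts.shape[1] - 1` would be zero, which is consistent with `run_benchmarks.py` rejecting `--num-tokens-to-generate` values of 1 or less.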
325
benchmark_v2/framework/hardware_metrics.py
Normal file
@@ -0,0 +1,325 @@
import logging
import subprocess
import sys
import time
from dataclasses import dataclass
from enum import Enum
from logging import Logger
from multiprocessing import Pipe, Process
from multiprocessing.connection import Connection

from transformers.utils.import_utils import is_cuda_platform, is_rocm_platform


if is_cuda_platform():
    import pynvml

if is_rocm_platform():
    import amdsmi

import psutil
import torch

from transformers.utils import is_torch_accelerator_available


_logger = logging.getLogger(__name__)


# Data class to hold the hardware information
def get_device_name_and_memory_total() -> tuple[str, float]:
    """Returns the name and memory total of GPU 0."""
    device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
    torch_accelerator_module = getattr(torch, device_type, torch.cuda)
    device_name = torch_accelerator_module.get_device_properties(0).name
    device_memory_total = torch_accelerator_module.get_device_properties(0).total_memory / 1024**3
    return device_name, device_memory_total


class HardwareInfo:
    """A class to hold information about the hardware."""

    def __init__(self) -> None:
        # Retrieve GPU stats
        try:
            self.gpu_name, self.gpu_memory_total_gb = get_device_name_and_memory_total()
        except Exception:
            self.gpu_name, self.gpu_memory_total_gb = None, None
        # Retrieve python, torch and CUDA version
        self.python_version = f"{sys.version.split()[0]}"
        self.torch_version = torch.__version__
        if hasattr(torch, "cuda") and torch.cuda.is_available():
            self.cuda_version = torch.version.cuda
        else:
            self.cuda_version = None
        # Retrieve general hardware information
        self.cpu_count = psutil.cpu_count()
        self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))

    def to_dict(self) -> dict[str, None | int | float | str]:
        return {
            "gpu_name": self.gpu_name,
            "gpu_memory_total_gb": self.gpu_memory_total_gb,
            "python_version": self.python_version,
            "torch_version": self.torch_version,
        }


# Functions to get information about the GPU
def get_amd_gpu_stats(device_handle) -> tuple[int, float]:
    """Get AMD GPU stats using amdsmi library."""
    utilization = amdsmi.amdsmi_get_gpu_activity(device_handle)["gfx_activity"]
    memory_used = amdsmi.amdsmi_get_gpu_vram_usage(device_handle)["vram_used"]
    return int(utilization), float(memory_used) / 1024**3  # Convert bytes to GB


def get_intel_xpu_stats() -> tuple[int, float]:
    """Returns the utilization and memory used of an Intel XPU"""
    # xpu-smi outputs CSV format: Timestamp, DeviceId, GPU Memory Utilization (%), GPU Memory Used (MiB)
    xpu_smi_output = subprocess.check_output(["xpu-smi", "dump", "-m", "5,18", "-n", "1"])
    lines = xpu_smi_output.decode("utf-8").strip().split("\n")

    # Parse all data lines (skip header) and collect stats from all cards
    xpu_stats = []
    for line in lines[1:]:
        data_line = line.split(",")
        if len(data_line) < 4:
            continue
        device_id = data_line[1].strip()
        utilization_str = data_line[2].strip()
        memory_used_str = data_line[3].strip()
        if utilization_str != "N/A" and memory_used_str != "N/A":
            utilization = int(float(utilization_str))
            memory_used_mib = float(memory_used_str)
            xpu_stats.append((device_id, utilization, memory_used_mib))

    if not xpu_stats:
        return 0, 0.0

    # Sort by utilization (descending) and pick the highest
    xpu_stats.sort(key=lambda x: x[1], reverse=True)
    device_id, utilization, memory_used_mib = xpu_stats[0]
    memory_used_gb = memory_used_mib / 1024
    return utilization, memory_used_gb


def get_nvidia_gpu_stats(device_handle) -> tuple[int, float]:
    """Returns the utilization and memory used of an NVIDIA GPU using pynvml."""
    utilization = pynvml.nvmlDeviceGetUtilizationRates(device_handle).gpu
    memory_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
    memory_used_gb = memory_info.used / 1024**3
    return int(utilization), float(memory_used_gb)


# Simple data classes to hold the raw GPU metrics
class GPUMonitoringStatus(Enum):
    """Status of GPU monitoring."""

    SUCCESS = "success"
    FAILED = "failed"
    NO_GPUS_AVAILABLE = "no_gpus_available"
    NO_SAMPLES_COLLECTED = "no_samples_collected"


@dataclass
class GPURawMetrics:
    """Raw values for GPU utilization and memory used."""

    utilization: list[float]  # in percent
    memory_used: list[float]  # in GB
    timestamps: list[float]  # in seconds
    timestamp_0: float  # in seconds
    monitoring_status: GPUMonitoringStatus

    def to_dict(self) -> dict[str, None | int | float | str]:
        return {
            "utilization": self.utilization,
            "memory_used": self.memory_used,
            "timestamps": self.timestamps,
            "timestamp_0": self.timestamp_0,
            "monitoring_status": self.monitoring_status.value,
        }

    @classmethod
    def from_dict(cls, data: dict[str, None | int | float | str]) -> "GPURawMetrics":
        """Create a GPURawMetrics instance from a dictionary."""
        return cls(
            utilization=data["utilization"],
            memory_used=data["memory_used"],
            timestamps=data["timestamps"],
            timestamp_0=data["timestamp_0"],
            monitoring_status=GPUMonitoringStatus(data["monitoring_status"]),
        )


# Main class, used to monitor the GPU utilization during benchmark execution
class GPUMonitor:
    """Monitor GPU utilization during benchmark execution using a separate process."""

    def __init__(self, sample_interval_sec: float = 0.05, logger: Logger | None = None):
        self.sample_interval_sec = sample_interval_sec
        self.logger = logger if logger is not None else _logger
        self.gpu_type = None
        self.process = None

        device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
        torch_accelerator_module = getattr(torch, device_type, torch.cuda)
        self.num_available_gpus = torch_accelerator_module.device_count()
        if self.num_available_gpus == 0:
            self.logger.warning(f"No GPUs detected by torch.{device_type}.device_count().")
            return

        # Determine GPU type
        device_name, _ = get_device_name_and_memory_total()
        if "amd" in device_name.lower():
            self.gpu_type = "amd"
        elif "nvidia" in device_name.lower():
            self.gpu_type = "nvidia"
        elif "intel" in device_name.lower() or device_type == "xpu":
            self.gpu_type = "intel"
        else:
            self.logger.warning(f"Unsupported GPU for monitoring: {device_name}")

    @staticmethod
    def _monitor_worker(gpu_type: str, sample_interval_sec: float, connection: Connection):
        """Worker process for GPU monitoring."""
        gpu_utilization = []
        gpu_memory_used = []
        timestamps = []
        device_handle = None

        # Initialize GPU-specific monitoring
        if gpu_type == "amd":
            amdsmi.amdsmi_init()
            device_handle = amdsmi.amdsmi_get_processor_handles()[0]
        elif gpu_type == "nvidia":
            pynvml.nvmlInit()
            device_handle = pynvml.nvmlDeviceGetHandleByIndex(0)

        # Signal ready
        try:
            connection.send(0)
        except Exception:
            return

        # Monitoring loop
        stop = False
        while not stop:
            try:
                if gpu_type == "amd":
                    utilization, memory_used = get_amd_gpu_stats(device_handle)
                elif gpu_type == "nvidia":
                    utilization, memory_used = get_nvidia_gpu_stats(device_handle)
                elif gpu_type == "intel":
                    utilization, memory_used = get_intel_xpu_stats()
                else:
                    break

                gpu_utilization.append(utilization)
                gpu_memory_used.append(memory_used)
                timestamps.append(time.time())
            except Exception as e:
                # Skips failed measurements
                _logger.debug(f"Failed to collect GPU metrics sample: {e}")

            stop = connection.poll(sample_interval_sec)

        # Cleanup
        if gpu_type == "amd":
            try:
                amdsmi.amdsmi_shut_down()
            except Exception as e:
                _logger.debug(f"Failed to shutdown AMD GPU monitoring: {e}")
        elif gpu_type == "nvidia":
            try:
                pynvml.nvmlShutdown()
            except Exception as e:
                _logger.debug(f"Failed to shutdown NVIDIA GPU monitoring: {e}")

        # Send results back
        try:
            connection.send((gpu_utilization, gpu_memory_used, timestamps))
        except Exception as e:
            _logger.error(f"Failed to send GPU monitoring results: {e}")

        connection.close()

    def start(self):
        """Start monitoring GPU metrics in a separate process."""
        if self.gpu_type is None:
            self.logger.debug("GPU monitoring skipped (no supported GPU)")
            return

        self.child_connection, self.parent_connection = Pipe()
        self.process = Process(
            target=GPUMonitor._monitor_worker,
            args=(self.gpu_type, self.sample_interval_sec, self.child_connection),
            daemon=True,
        )
        self.process.start()

        # Wait for worker to signal ready
        if self.process.is_alive():
            self.parent_connection.recv()
        self.logger.debug("GPU monitoring started (multiprocessing)")

    def stop_and_collect(self) -> GPURawMetrics:
        """Stop monitoring and return collected metrics."""
        # No GPU available or unsupported GPU
        if self.process is None:
            return GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.NO_GPUS_AVAILABLE,
            )

        # Process crashed before we could collect results
        process_failed = False
        if not self.process.is_alive():
            process_failed = True
            gpu_utilization, gpu_memory_used, timestamps = [], [], []
        else:
            # Signal stop
            self.parent_connection.send(0)
            # Get results
            try:
                gpu_utilization, gpu_memory_used, timestamps = self.parent_connection.recv()
            except Exception:
                process_failed = True
                gpu_utilization, gpu_memory_used, timestamps = [], [], []

        self.parent_connection.close()
        self.process.join(timeout=2.0)
        if self.process.is_alive():
            self.process.terminate()

        if gpu_utilization:
            timestamp_0 = timestamps[0]
            metrics = GPURawMetrics(
                utilization=gpu_utilization,
                memory_used=gpu_memory_used,
                timestamps=[t - timestamp_0 for t in timestamps],
                timestamp_0=timestamp_0,
                monitoring_status=GPUMonitoringStatus.SUCCESS,
            )
            self.logger.debug(f"GPU monitoring completed: {len(gpu_utilization)} samples collected")
        elif process_failed:
            metrics = GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.FAILED,
            )
            self.logger.warning("GPU monitoring failed (process crashed or timed out)")
        else:
            metrics = GPURawMetrics(
                utilization=[],
                memory_used=[],
                timestamps=[],
                timestamp_0=0.0,
                monitoring_status=GPUMonitoringStatus.NO_SAMPLES_COLLECTED,
            )
        return metrics
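The handshake between `GPUMonitor.start`, `_monitor_worker`, and `stop_and_collect` follows a small pattern: the worker signals readiness, uses `connection.poll(interval)` both as its sleep and as its stop check, then sends the collected samples back. The sketch below illustrates only that pattern and is not the commit's code. It swaps in a `threading.Thread` for the `multiprocessing.Process` so it runs the same way regardless of the process start method, records wall-clock times instead of GPU readings, and the names `sampling_worker` and `run_monitor` are hypothetical.

```python
import time
from multiprocessing import Pipe
from multiprocessing.connection import Connection
from threading import Thread


def sampling_worker(sample_interval_sec: float, connection: Connection) -> None:
    """Collect one sample per interval until the other end sends anything."""
    samples = []
    connection.send(0)  # signal ready
    stop = False
    while not stop:
        samples.append(time.time())  # stand-in for a GPU utilization reading
        # poll() waits up to the interval and returns True once a stop message arrives
        stop = connection.poll(sample_interval_sec)
    connection.send(samples)  # send results back
    connection.close()


def run_monitor(duration_sec: float = 0.05, sample_interval_sec: float = 0.01) -> list[float]:
    child_connection, parent_connection = Pipe()
    worker = Thread(target=sampling_worker, args=(sample_interval_sec, child_connection), daemon=True)
    worker.start()
    parent_connection.recv()  # wait for the ready signal
    time.sleep(duration_sec)  # the benchmarked work would run here
    parent_connection.send(0)  # signal stop
    samples = parent_connection.recv()  # collect results
    parent_connection.close()
    worker.join(timeout=2.0)
    return samples


samples = run_monitor()
print(f"{len(samples)} samples collected")
```

Because the loop body runs before the first `poll`, at least one sample is taken even if the stop signal arrives immediately.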
7
benchmark_v2/requirements.txt
Normal file
@@ -0,0 +1,7 @@
numpy>=1.21.0
psutil>=5.8.0
nvidia-ml-py>=12.0.0
torch>=2.0.0
datasets>=2.10.0
huggingface_hub>=0.16.0
amdsmi>=7.0.2
133
benchmark_v2/run_benchmarks.py
Executable file
@@ -0,0 +1,133 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Top-level benchmarking script that automatically discovers and runs all benchmarks
in the ./benches directory, organizing outputs into model-specific subfolders.
"""

import argparse
import json
import logging
import sys
import uuid

from framework.benchmark_config import BenchmarkConfig, adapt_configs, get_config_by_level
from framework.benchmark_runner import BenchmarkRunner


if __name__ == "__main__":
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", type=str, default=None, help="Output dir for benchmark results")
    parser.add_argument("--log-level", type=str, choices=["DEBUG", "INFO", "WARNING", "ERROR"], default="WARNING")
    parser.add_argument("--model-id", type=str, help="Specific model ID to benchmark (if supported by benchmarks)")
    parser.add_argument("--warmup", "-w", type=int, default=3, help="Number of warmup iterations")
    parser.add_argument("--iterations", "-i", type=int, default=10, help="Number of measurement iterations")

    parser.add_argument("--batch-size", "-b", type=int, nargs="+", help="Batch size")
    parser.add_argument("--sequence-length", "-s", type=int, nargs="+", help="Sequence length")
    parser.add_argument("--num-tokens-to-generate", "-n", type=int, nargs="+", help="Number of tokens to generate")

    parser.add_argument(
        "--level",
        type=int,
        default=1,
        help="Level of coverage for the benchmark. 0: only the main config, 1: a few important configs, 2: a config for"
        " each attn implementation and option, 3: cross-generate all combinations of configs, 4: cross-generate all"
        " combinations of configs w/ all compile modes",
    )
    parser.add_argument("--config-file", type=str, help="Path to a config file stored as a json or jsonl format")
    parser.add_argument("--num-tokens-to-profile", "-p", type=int, default=0, help="Number of tokens to profile")
    parser.add_argument("--enable-tp", action="store_true", help="Enable tensor parallelism with tp_plan=auto")

    parser.add_argument("--branch-name", type=str, help="Git branch name")
    parser.add_argument("--commit-id", type=str, help="Git commit ID (if not provided, will auto-detect from git)")
    parser.add_argument("--commit-message", type=str, help="Git commit message")

    parser.add_argument(
        "--no-gpu-monitoring", action="store_true", help="Disables GPU monitoring during benchmark runs"
    )

    parser.add_argument(
        "--push-result-to-dataset",
        type=str,
        default=None,
        help="Name of the dataset to push results to. If not provided, results are not pushed to the Hub.",
    )
    args = parser.parse_args()

    # Setup logging
    benchmark_run_uuid = str(uuid.uuid4())[:8]
    numeric_level = getattr(logging, args.log_level.upper())

    handlers = [logging.StreamHandler(sys.stdout)]
    logging.basicConfig(
        level=numeric_level, format="[%(levelname)s - %(asctime)s] %(name)s: %(message)s", handlers=handlers
    )

    logger = logging.getLogger("benchmark_v2")
    logger.info("Starting benchmark discovery and execution")
    logger.info(f"Benchmark run UUID: {benchmark_run_uuid}")
    logger.info(f"Output directory: {args.output_dir}")

    # Error out if one of the arguments is not provided
    if any(arg is None for arg in [args.batch_size, args.sequence_length, args.num_tokens_to_generate]):
        raise ValueError(
            "All of the arguments --batch-size, --sequence-length, and --num-tokens-to-generate are required"
        )

    # We cannot compute ITL if we don't have at least two measurements
    if any(n <= 1 for n in args.num_tokens_to_generate):
        raise ValueError("--num-tokens-to-generate arguments should be larger than 1")

    # If a config file is provided, read it and use the configs therein. They will still be adapted to the given arguments.
    if args.config_file is not None:
        if args.config_file.endswith(".json"):
            with open(args.config_file, "r") as f:
                config_as_dicts = [json.load(f)]
        elif args.config_file.endswith(".jsonl"):
            with open(args.config_file, "r") as f:
                config_as_dicts = [json.loads(line) for line in f if line.startswith("{")]
        else:
            raise ValueError(f"Unsupported config file format: {args.config_file}")
        configs = [BenchmarkConfig.from_dict(config) for config in config_as_dicts]
    else:
        # Otherwise, get the configs for the given coverage level
        configs = get_config_by_level(args.level)

    # Adapt the configs to the given arguments
    configs = adapt_configs(
        configs,
        args.warmup,
        args.iterations,
        args.batch_size,
        args.sequence_length,
        args.num_tokens_to_generate,
        not args.no_gpu_monitoring,
    )

    if args.enable_tp:
        for config in configs:
            config.tp_plan = "auto"

    runner = BenchmarkRunner(logger, args.output_dir, args.branch_name, args.commit_id, args.commit_message)
    timestamp, results = runner.run_benchmarks(
        args.model_id, configs, args.num_tokens_to_profile, pretty_print_summary=True
    )

    dataset_id = args.push_result_to_dataset
    if dataset_id is not None and len(results) > 0 and runner._is_primary_process():
        runner.push_results_to_hub(dataset_id, results, timestamp)
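The `--config-file` branch above treats a `.json` file as one config and a `.jsonl` file as one config per line, skipping any line that does not start with `{`. The following standalone sketch reproduces just that parsing step so the behavior can be tried without the benchmark framework. The helper name `load_config_dicts` and the `"name"` key are hypothetical and chosen for illustration; the real `BenchmarkConfig` fields are not shown in this diff.

```python
import json
import os
import tempfile


def load_config_dicts(path: str) -> list[dict]:
    """Return a list of config dicts from a .json (single) or .jsonl (one per line) file."""
    if path.endswith(".json"):
        with open(path, "r") as f:
            return [json.load(f)]
    if path.endswith(".jsonl"):
        with open(path, "r") as f:
            # Lines that do not start with "{" (blank lines, comments) are skipped
            return [json.loads(line) for line in f if line.startswith("{")]
    raise ValueError(f"Unsupported config file format: {path}")


with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = os.path.join(tmp_dir, "configs.jsonl")
    with open(jsonl_path, "w") as f:
        f.write('{"name": "config_a"}\n')  # hypothetical key, for illustration only
        f.write("\n")  # blank line, ignored by the loader
        f.write('{"name": "config_b"}\n')
    loaded = load_config_dicts(jsonl_path)

print(loaded)
```

One consequence of the `startswith("{")` filter is that a JSONL line with leading whitespace is silently dropped rather than parsed.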