first commit
Some checks failed
Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled
Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled
Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled
Build documentation / build (push) Has been cancelled
Build documentation / build_other_lang (push) Has been cancelled
CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled
New model PR merged notification / Notify new model (push) Has been cancelled
PR CI / pr-ci (push) Has been cancelled
Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
Update Transformers metadata / build_and_package (push) Has been cancelled
Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled
Check Tiny Models / Check tiny models (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled
Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI - Flash Attn / Setup (push) Has been cancelled
Nvidia CI - Flash Attn / Model CI (push) Has been cancelled
Nvidia CI / Setup (push) Has been cancelled
Nvidia CI / Model CI (push) Has been cancelled
Nvidia CI / Torch pipeline CI (push) Has been cancelled
Nvidia CI / Example CI (push) Has been cancelled
Nvidia CI / Trainer/FSDP CI (push) Has been cancelled
Nvidia CI / DeepSpeed CI (push) Has been cancelled
Nvidia CI / Quantization CI (push) Has been cancelled
Nvidia CI / Kernels CI (push) Has been cancelled
Doctests / Setup (push) Has been cancelled
Doctests / Call doctest jobs (push) Has been cancelled
Doctests / Send results to webhook (push) Has been cancelled
Extras Smoke Test / Get supported Python versions (push) Has been cancelled
Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled
Extras Smoke Test / Check Slack token availability (push) Has been cancelled
Extras Smoke Test / Notify failures to Slack (push) Has been cancelled
Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled
Stale Bot / Close Stale Issues (push) Has been cancelled
287
benchmark_v2/framework/benchmark_config.py
Normal file
@@ -0,0 +1,287 @@
import hashlib
import itertools
import json
import logging
from functools import lru_cache
from typing import Any

import torch

from transformers.generation.configuration_utils import CompileConfig
from transformers.utils import is_torch_accelerator_available
from transformers.utils.import_utils import is_flash_attn_2_available, is_kernels_available


KERNELIZATION_AVAILABLE = False
try:
    from kernels import Mode, kernelize  # noqa: F401

    KERNELIZATION_AVAILABLE = True
except ImportError:
    pass

logger = logging.getLogger(__name__)


@lru_cache
def is_fa2_or_kernel_available() -> bool:
    """Returns True if flash_attn_2 or a fallback kernel is available."""
    # Early return if flash_attn_2 is available
    if is_flash_attn_2_available():
        return True
    # Early return if kernels is not available
    if not is_kernels_available():
        logger.warning(
            "flash_attention_2 is not available. kernels is not installed. Benchmarking flash_attention_2 will not "
            "be possible."
        )
        return False
    # If kernels is available, try to get the flash_attn_2 kernel
    try:
        from kernels import get_kernel

        # TODO: Pass the 'version' kwarg to specify the binary version once kernels >= 0.12.0 is supported.
        get_kernel("kernels-community/flash-attn2")
    except Exception:
        logger.warning(
            "flash_attention_2 is not available. kernels is installed, but the flash_attn kernel is not available. "
            "Benchmarking flash_attention_2 will not be possible."
        )
        return False
    return True


class BenchmarkConfig:
    """Configuration for a single benchmark scenario."""

    all_attn_implementations = ["flash_attention_2", "eager", "sdpa", "flex_attention"]
    all_compiled_modes = [None, "default", "reduce-overhead", "max-autotune", "max-autotune-no-cudagraphs"]

    def __init__(
        self,
        warmup_iterations: int = 5,
        measurement_iterations: int = 20,
        gpu_monitoring: bool = True,  # NOTE: you may want to disable this at times, as we have observed it can heavily slow down benchmarks on AMD
        continuous_batching: bool = False,
        batch_size: int = 1,
        sequence_length: int = 128,
        num_tokens_to_generate: int = 128,
        attn_implementation: str = "eager",
        compile_kwargs: dict[str, Any] | None = None,
        kernelize: bool = False,
        tp_plan: str | dict[str, str] | None = None,
        name: str | None = None,
        skip_validity_check: bool = False,
    ) -> None:
        # Benchmark parameters
        self.warmup_iterations = warmup_iterations
        self.measurement_iterations = measurement_iterations
        self.gpu_monitoring = gpu_monitoring
        self.continuous_batching = continuous_batching
        # Input parameters
        self.batch_size = batch_size
        self.sequence_length = sequence_length
        self.num_tokens_to_generate = num_tokens_to_generate
        # Generation parameters
        self.attn_implementation = attn_implementation
        self.tp_plan = tp_plan
        # Optimization parameters
        if compile_kwargs is None:
            self.compile_config = None
        else:
            compile_kwargs["fullgraph"] = compile_kwargs.get("fullgraph", True)
            self.compile_config = CompileConfig(**compile_kwargs)
        self.kernelize = kernelize
        # Constant parameters
        self.dtype = "torch.bfloat16"
        self.device = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"

        self.check_validity(skip_validity_check)
        self.name = name if name is not None else self.infer_name()

    def check_validity(self, skip_validity_check: bool = False) -> None:
        if skip_validity_check:
            return
        # If flash_attention_2 is selected but not available, default to SDPA
        if self.attn_implementation == "flash_attention_2" and not is_fa2_or_kernel_available():
            logger.error("Flash attention is not available. Defaulting to SDPA.")
            self.attn_implementation = "sdpa"

        # The combination of flash_attention_2, compile and generate is not supported # FIXME: support it
        if (
            not self.continuous_batching
            and self.attn_implementation == "flash_attention_2"
            and self.compile_config is not None
        ):
            logger.error(
                "The combination of flash_attention_2, compile and generate is not supported. Turning off compile."
            )
            self.compile_config = None

        # Continuous batching does not support flex attention as an attention implementation # FIXME: support it
        if self.attn_implementation == "flex_attention" and self.continuous_batching:
            logger.error(
                "Disabling continuous batching because of invalid configuration: flex attention is not supported."
            )
            self.continuous_batching = False

        # Continuous batching supports compile mode "default" or "max-autotune-no-cudagraphs"
        if (
            self.continuous_batching
            and self.compile_config is not None
            and self.compile_config.mode not in ["default", "max-autotune-no-cudagraphs"]
        ):
            logger.error(
                f"You have continuous batching and compile enabled, but {self.compile_config.mode = } is not supported."
                " Supported modes are: default, max-autotune-no-cudagraphs. Changing to default."
            )
            self.compile_config.mode = "default"

    @property
    def hash(self) -> str:
        return hashlib.sha256(json.dumps(self.to_dict()).encode()).hexdigest()

    def infer_name(self, compact: bool = True) -> str:
        """Infer a human-readable name for the benchmark config, either compact or verbose."""
        if compact:
            iter_str = f"w{self.warmup_iterations}_i{self.measurement_iterations}"
            gpu_monitor_str = "monitored" if self.gpu_monitoring else "unmonitored"
            dimensions_str = f"b{self.batch_size}_s{self.sequence_length}_n{self.num_tokens_to_generate}"
            attn_code = self.attn_implementation
            compile_str = f"compiled_{self.compile_config.mode}" if self.compile_config is not None else "uncompiled"
            kernelize_str = "kernelized" if self.kernelize else "unkernelized"
            continuous_batching_str = "cb" if self.continuous_batching else "generate"
            tp_str = "tp" if self.tp_plan is not None else "no_tp"
            sep = "-"
        else:
            iter_str = f"{self.warmup_iterations} warmup, {self.measurement_iterations} iterations"
            gpu_monitor_str = ("with" if self.gpu_monitoring else "no") + " GPU monitoring"
            dimensions_str = f"batch size {self.batch_size}, sequence length {self.sequence_length}, {self.num_tokens_to_generate} generated tokens"
            attn_code = f"{self.attn_implementation} attention"
            compile_str = "compiled" if self.compile_config is not None else "not compiled"
            kernelize_str = "kernelized" if self.kernelize else "not kernelized"
            continuous_batching_str = "continuous batching" if self.continuous_batching else "regular generate"
            if self.tp_plan is None:
                tp_str = "no_tp"
            else:
                tp_str = "tp_custom" if isinstance(self.tp_plan, dict) else "tp_auto"
            sep = ", "
        return sep.join(
            [
                iter_str,
                gpu_monitor_str,
                dimensions_str,
                attn_code,
                compile_str,
                kernelize_str,
                continuous_batching_str,
                tp_str,
            ]
        )

    def to_dict(self) -> dict[str, Any]:
        return {
            "name": self.name,
            "warmup_iterations": self.warmup_iterations,
            "measurement_iterations": self.measurement_iterations,
            "gpu_monitoring": self.gpu_monitoring,
            "continuous_batching": self.continuous_batching,
            "batch_size": self.batch_size,
            "sequence_length": self.sequence_length,
            "num_tokens_to_generate": self.num_tokens_to_generate,
            "attn_implementation": self.attn_implementation,
            "compile_kwargs": self.compile_config.to_dict() if self.compile_config is not None else None,
            "kernelize": self.kernelize,
            "tp_plan": self.tp_plan,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any], skip_validity_check: bool = False) -> "BenchmarkConfig":
        return cls(
            warmup_iterations=data.get("warmup_iterations", 5),
            measurement_iterations=data.get("measurement_iterations", 20),
            gpu_monitoring=data.get("gpu_monitoring", False),
            continuous_batching=data.get("continuous_batching", False),
            batch_size=data.get("batch_size", 1),
            sequence_length=data.get("sequence_length", 128),
            num_tokens_to_generate=data.get("num_tokens_to_generate", 128),
            attn_implementation=data.get("attn_implementation", "eager"),
            compile_kwargs=data.get("compile_kwargs"),
            kernelize=data.get("kernelize", False),
            tp_plan=data.get("tp_plan"),
            name=data.get("name"),
            skip_validity_check=skip_validity_check,
        )


def adapt_configs(
    configs: list[BenchmarkConfig],
    warmup_iterations: int | list[int] = 5,
    measurement_iterations: int | list[int] = 20,
    batch_size: int | list[int] = 1,
    sequence_length: int | list[int] = 128,
    num_tokens_to_generate: int | list[int] = 128,
    gpu_monitoring: bool | list[bool] = True,
) -> list[BenchmarkConfig]:
    # Wrap scalars in a one-element list so every parameter can be fed to itertools.product
    parameters = (
        x if isinstance(x, list) else [x]
        for x in [
            warmup_iterations,
            measurement_iterations,
            batch_size,
            sequence_length,
            num_tokens_to_generate,
            gpu_monitoring,
        ]
    )
    iterator = itertools.product(*parameters)

    adapted_configs = []
    for warmup_iters, measurement_iters, bs, seqlen, ntok, monitor in iterator:
        for config in configs:
            config_dict = config.to_dict()
            config_dict["warmup_iterations"] = warmup_iters
            config_dict["measurement_iterations"] = measurement_iters
            config_dict["batch_size"] = bs
            config_dict["sequence_length"] = seqlen
            config_dict["num_tokens_to_generate"] = ntok
            config_dict["gpu_monitoring"] = monitor
            # Remove the old name so it gets re-inferred with the updated values
            config_dict.pop("name", None)
            adapted_configs.append(BenchmarkConfig.from_dict(config_dict))
    return adapted_configs


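The scalar-or-list handling in `adapt_configs` can be tried out without the benchmark classes. The sketch below is illustrative only: `expand_grid` and the keyword names passed to it are hypothetical and not part of this commit. It wraps each scalar in a one-element list and takes the Cartesian product with `itertools.product`, which is the approach the function above takes.

```python
import itertools
from typing import Any


def expand_grid(**params: Any) -> list[dict[str, Any]]:
    """Return one dict per parameter combination; scalars count as one-element lists."""
    names = list(params)
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


# Two batch sizes x one sequence length x two monitoring settings = 4 combinations
grid = expand_grid(batch_size=[1, 8], sequence_length=128, gpu_monitoring=[True, False])
for point in grid:
    print(point)
```

In `adapt_configs`, every combination is applied to every base config, so the returned list has `len(configs)` times the number of combinations entries.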
def get_config_by_level(level: int) -> list[BenchmarkConfig]:
    configs = []
    # Early return if level is 3 or higher: we generate all combinations of configs, maybe even w/ all compile modes
    if level >= 3:
        for attn_implementation in BenchmarkConfig.all_attn_implementations:
            # Usually there is not much to gain by compiling with other modes, but we allow it for level 4
            compile_modes = BenchmarkConfig.all_compiled_modes if level >= 4 else [None, "default"]
            for cm in compile_modes:
                compile_kwargs = {"mode": cm} if cm is not None else None
                for kernelize_on in {False, KERNELIZATION_AVAILABLE}:
                    for cb_on in [False, True]:
                        configs.append(
                            BenchmarkConfig(
                                attn_implementation=attn_implementation,
                                compile_kwargs=compile_kwargs,
                                kernelize=kernelize_on,
                                continuous_batching=cb_on,
                            )
                        )
        return configs
    # Otherwise, we add the configs for the given level
    if level >= 0:
        configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_kwargs={}))
    if level >= 1:
        configs.append(BenchmarkConfig(attn_implementation="flash_attention_2"))
        configs.append(BenchmarkConfig(attn_implementation="eager", compile_kwargs={}))
        configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", continuous_batching=True))
    if level >= 2:
        configs.append(BenchmarkConfig(attn_implementation="sdpa", compile_kwargs={}))
        configs.append(BenchmarkConfig(attn_implementation="flex_attention", compile_kwargs={}, kernelize=True))
        configs.append(BenchmarkConfig(attn_implementation="flash_attention_2", kernelize=True))
        configs.append(BenchmarkConfig(attn_implementation="sdpa", continuous_batching=True))
    return configs
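The `hash` property of `BenchmarkConfig` derives an identifier from the serialized config. A minimal standalone sketch of that idea follows; `config_digest` and the sample dictionaries are hypothetical stand-ins for `BenchmarkConfig.to_dict()` output, not code from this commit.

```python
import hashlib
import json
from typing import Any


def config_digest(config: dict[str, Any]) -> str:
    """SHA-256 hex digest of the JSON-serialized config dict."""
    return hashlib.sha256(json.dumps(config).encode()).hexdigest()


a = {"batch_size": 1, "attn_implementation": "sdpa", "compile_kwargs": None}
b = dict(a)                  # same content, same key order
c = {**a, "batch_size": 8}   # one value changed

print(config_digest(a) == config_digest(b))  # True
print(config_digest(a) == config_digest(c))  # False
```

Note that `json.dumps` without `sort_keys=True` is sensitive to key order. That is harmless when the dict always comes from the same `to_dict` method, as it does here, but hashing dicts built elsewhere would likely need `sort_keys=True` to stay stable.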
483
benchmark_v2/framework/benchmark_runner.py
Normal file
@@ -0,0 +1,483 @@
import gc
import json
import logging
import os
import pathlib
import re
import tempfile
import time
from datetime import datetime
from queue import Queue
from typing import Any

import torch
from datasets import Dataset
from huggingface_hub import HfApi
from tqdm import trange

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    GenerationConfig,
    GenerationMixin,
    is_torch_xpu_available,
)
from transformers.generation.streamers import BaseStreamer
from transformers.utils import is_torch_accelerator_available

from .benchmark_config import BenchmarkConfig
from .data_classes import BenchmarkMetadata, BenchmarkResult, GPURawMetrics, pretty_print_dict
from .hardware_metrics import GPUMonitor


try:
    from kernels import Mode, kernelize  # noqa: F401
except ImportError:
    kernelize = None
    Mode = None


DEFAULT_PROMPT = "\n".join([
    "The French Revolution was a period of political and societal change in France that began with the Estates General of 1789 and ended with the Coup of 18 Brumaire on 9 November 1799.",
    "Many of the revolution's ideas are considered fundamental principles of liberal democracy, and its values remain central to modern French political discourse.",
    "It was caused by a combination of social, political, and economic factors which the existing regime proved unable to manage.",
    "Financial crisis and widespread social distress led to the convocation of the Estates General in May 1789, its first meeting since 1614.",
    "The representatives of the Third Estate broke away and re-constituted themselves as a National Assembly in June.",
    "The Storming of the Bastille in Paris on 14 July led to a series of radical measures by the Assembly, including the abolition of feudalism, state control over the Catholic Church in France, and issuing the Declaration of the Rights of Man and of the Citizen.",
    "The next three years were dominated by a struggle for political control.",
    "King Louis XVI's attempted flight to Varennes in June 1791 further discredited the monarchy, and military defeats after the outbreak of the French Revolutionary Wars in April 1792 led to the insurrection of 10 August 1792.",
    "As a result, the monarchy was replaced by the French First Republic in September, followed by the execution of Louis XVI himself in January 1793.",
    "After another revolt in June 1793, the constitution was suspended, and political power passed from the National Convention to the Committee of Public Safety, dominated by radical Jacobins led by Maximilien Robespierre.",
    "About 16,000 people were sentenced by the Revolutionary Tribunal and executed in the Reign of Terror, which ended in July 1794 with the Thermidorian Reaction.",
    "Weakened by external threats and internal opposition, the Committee of Public Safety was replaced in November 1795 by the Directory.",
    "Its instability ended in the coup of 18 Brumaire and the establishment of the Consulate, with Napoleon Bonaparte as First Consul.",
])  # fmt: skip

PUSH_TO_HUB_TOKEN = os.getenv("PUSH_TO_HUB_TOKEN", None)


def compact_json_numeric_arrays(data: dict):
    # Match arrays that contain only numbers (ints/floats), whitespace, commas, and newlines
    pattern = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"

    def replace_numeric_array(match):
        # Get the array content
        content = match.group(1)
        # Remove extra whitespace but keep commas
        compact_content = re.sub(r"\s+", " ", content).strip()
        return f"[{compact_content}]"

    return re.sub(pattern, replace_numeric_array, json.dumps(data, indent=4, default=str), flags=re.DOTALL)


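To see what the regex in `compact_json_numeric_arrays` does to indented JSON, the snippet below repeats the same pattern in a standalone helper so it runs on its own. `compact_numeric_arrays` and the sample data are illustrative names and values, not part of the commit.

```python
import json
import re

# Same pattern as above: a bracketed, multi-line list of unsigned ints/floats
PATTERN = r"\[\s*\n\s*((?:\d+(?:\.\d+)?\s*,\s*)*\d+(?:\.\d+)?)\s*\n\s*\]"


def compact_numeric_arrays(data: dict) -> str:
    """Serialize with indent=4, then collapse purely numeric arrays onto one line."""
    def collapse(match: re.Match) -> str:
        return "[" + re.sub(r"\s+", " ", match.group(1)).strip() + "]"

    return re.sub(PATTERN, collapse, json.dumps(data, indent=4, default=str))


compacted = compact_numeric_arrays({"latencies": [0.5, 1.25, 2], "name": "run"})
print(compacted)

# Arrays containing a negative number are not matched and stay expanded
untouched = compact_numeric_arrays({"deltas": [-1, 2]})
print(untouched)
```

Because the pattern only accepts digits with an optional decimal part, arrays that contain negative values or exponent notation such as `1e-05` are left in their multi-line form.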
def get_git_revision() -> str:
    base_path = pathlib.Path(__file__).parent.parent.parent
    git_dir = base_path / ".git"
    with (git_dir / "HEAD").open("r") as head:
        head_line = head.readline().strip()
    # A detached HEAD stores the commit hash itself rather than "ref: <path>"
    if not head_line.startswith("ref:"):
        return head_line
    ref = head_line.split(" ")[-1]
    with (git_dir / ref).open("r") as git_hash:
        return git_hash.readline().strip()


def flush_memory(flush_compile: bool = True) -> None:
    """Flush GPU memory and run garbage collection. If the flush_compile flag is set, we also clear everything
    related to the compile cache."""
    gc.collect()
    # If needed, flush everything related to torch.compile
    if flush_compile:
        # Dynamo resets
        torch._dynamo.reset()
        torch._dynamo.reset_code_caches()
        if hasattr(torch._inductor, "codecache"):
            # Clear FX graph cache
            if hasattr(torch._inductor.codecache, "FxGraphCache"):
                torch._inductor.codecache.FxGraphCache.clear()
            # Clear PyCodeCache
            if hasattr(torch._inductor.codecache, "PyCodeCache"):
                torch._inductor.codecache.PyCodeCache.cache_clear()
            # Clear TritonFuture cache (for async compilation)
            if hasattr(torch._inductor.codecache, "TritonFuture"):
                if hasattr(torch._inductor.codecache.TritonFuture, "_compile_cache"):
                    torch._inductor.codecache.TritonFuture._compile_cache.clear()
    # Clear device cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    elif is_torch_xpu_available():
        torch.xpu.empty_cache()
        torch.xpu.synchronize()
    gc.collect()


class BenchmarkStreamer(BaseStreamer):
    def __init__(self, **kwargs) -> None:
        self.timeout = kwargs.pop("timeout", 10)
        self.timestamps = []
        self.text_queue = Queue()
        self.stop_signal = None

    def put(self, value):
        """Receives tokens and logs the timestamp of the generation."""
        self.timestamps.append(time.perf_counter())
        self.text_queue.put(value)

    def end(self):
        self.timestamps.append(time.perf_counter())
        self.text_queue.put(self.stop_signal)

    def __iter__(self):
        return self

    def __next__(self):
        value = self.text_queue.get(timeout=self.timeout)
        if value == self.stop_signal:
            raise StopIteration()
        else:
            return value


class BenchmarkRunner:
    """Main benchmark runner that coordinates benchmark execution."""

    def __init__(
        self,
        logger: logging.Logger,
        output_dir: str | None = None,
        branch_name: str | None = None,
        commit_id: str | None = None,
        commit_message: str | None = None,
    ) -> None:
        # Those stay constant for the whole run
        self.logger = logger
        if output_dir is None:
            output_dir = os.path.join(os.path.dirname(os.path.dirname(__file__)), "benchmark_results")
        self.output_dir = output_dir
        self.branch_name = branch_name
        self.commit_id = get_git_revision() if commit_id is None else commit_id
        self.commit_message = commit_message
        os.makedirs(self.output_dir, exist_ok=True)
        self.profile_dir = None
        # Attributes that are reset for each model
        self._setup_for = ""
        # Attributes that are reset for each run
        self.model: GenerationMixin | None = None
        self.device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
        self.torch_accelerator_module = getattr(torch, self.device_type, torch.cuda)

    def cleanup(self) -> None:
        del self.model
        self.model = None
        flush_memory()

    @staticmethod
    def _is_primary_process() -> bool:
        if not torch.distributed.is_available() or not torch.distributed.is_initialized():
            return True
        return torch.distributed.get_rank() == 0

    def setup_benchmark(self, model_id: str, config: BenchmarkConfig) -> None:
        # Some attributes only need to be set once per model
        if self._setup_for != model_id:
            self.tokenizer = AutoTokenizer.from_pretrained(model_id)
            # We set the EOS token to the padding token for open-ended generation
            self.tokenizer.eos_token = self.tokenizer.pad_token
            self._setup_for = model_id

        # Prepare inputs
        self.inputs = self.tokenizer(
            [DEFAULT_PROMPT for _ in range(config.batch_size)],
            return_tensors="pt",
            max_length=config.sequence_length,
            truncation=True,
            return_attention_mask=True,
        )
        self.inputs["use_cache"] = True

        # Prepare generation config
        generation_config_kwargs = {
            "do_sample": False,
            "max_new_tokens": config.num_tokens_to_generate,
        }

        # Add compile config if found
        if config.compile_config is not None:
            generation_config_kwargs.update(compile_config=config.compile_config)
            # To trigger compile in generate, we need to set the cache to static
            if not config.continuous_batching:
                generation_config_kwargs.update(cache_implementation="static")

        generation_config = GenerationConfig(**generation_config_kwargs)

        # Load model
        self.logger.debug(f"Loading model {model_id} on device {config.device}...")
        dtype = getattr(torch, config.dtype.removeprefix("torch."))
        use_kernels = config.kernelize and kernelize is not None and Mode is not None
        device_map = config.device if config.tp_plan is None else None
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            dtype=dtype,
            attn_implementation=config.attn_implementation,
            generation_config=generation_config,
            use_kernels=use_kernels,
            device_map=device_map,
            tp_plan=config.tp_plan,
        )
        self.model = self.model.eval()
        self.inputs = self.inputs.to(self.model.device)

    def run_benchmark(self, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> BenchmarkResult | None:
        """Run a single benchmark with the given config, using the model loaded by setup_benchmark."""
        with torch.no_grad():
            self.logger.info(f"Running benchmark scenario: {config.name}")
            self.logger.debug(f"Full config: {config.to_dict()}")

            # Quick validation: try one measurement first to see if this scenario works
            flush_memory()
            e2e_latency = self.time_generate(config, warmup=True)[0]
            if e2e_latency < 0:
                self.logger.warning(f"Skipping config {config.name}: {e2e_latency = }")
                return None

            # Warmup runs
            self.logger.info(f"Warming up with {config.warmup_iterations} iterations...")
            for _ in trange(config.warmup_iterations, desc="Warmup"):
                self.time_generate(config, warmup=True)
            self.logger.info("Warmup over.")

            # Measurement runs
            result = BenchmarkResult()
            self.logger.info(f"Benchmarking with {config.measurement_iterations} iterations.")
            for _ in trange(config.measurement_iterations, desc="Benchmarking"):
                e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics = self.time_generate(
                    config, warmup=False
                )
                result.accumulate(e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)
            self.logger.info("Benchmarking done. Cleaning up.")

            # Profile if needed
            if num_tokens_to_profile > 0:
                self.profile_generate(num_tokens_to_profile, config.name)

            return result

    def time_generate(
        self, config: BenchmarkConfig, warmup: bool
    ) -> tuple[float, list[float], str, GPURawMetrics | None]:
        # Prepare gpu monitoring if needed
        if config.gpu_monitoring and not warmup:
            gpu_monitor = GPUMonitor(logger=self.logger)
            gpu_monitor.start()
        else:
            gpu_monitor = None

        # Generate and time
        if config.continuous_batching:
            inputs = self.inputs["input_ids"].tolist()
            wall_time_0 = time.perf_counter()
            outputs = self.model.generate_batch(inputs, allow_block_sharing=False, record_timestamps=True)
        else:
            streamer = BenchmarkStreamer()
            wall_time_0 = time.perf_counter()
            outputs = self.model.generate(**self.inputs, streamer=streamer)

        wall_time_1 = time.perf_counter()
        gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None

        # Retrieve timestamps and results in a way that allows similar post-processing
        input_tokens = self.inputs["input_ids"].size(-1)
        if config.continuous_batching:
            timestamps = [output.timestamps[:] for output in outputs.values()]
            results = torch.tensor([output.generated_tokens[:] for output in outputs.values()])
        else:
            timestamps = [streamer.timestamps[1:]]  # skip the first timestamp because it's the input tokens
            results = outputs[:, input_tokens:]
        outputs = None
        flush_memory(flush_compile=False)

        # Check if generation had the right number of tokens
        if results.size(-1) != config.num_tokens_to_generate:
            raise RuntimeError(f"Generated {results.size(-1)} tokens, expected {config.num_tokens_to_generate}")

        # Decode outputs
        decoded_output = self.tokenizer.decode(results[0], skip_special_tokens=True)
        shape_and_decoded_output = f"{tuple(results.shape)} | {decoded_output}"

        # Compute metrics
        e2e_latency = wall_time_1 - wall_time_0
        timestamps = torch.tensor(timestamps).sub(wall_time_0).tolist()
        self.logger.info(
            f"Time generate done in {e2e_latency:.2f} seconds. Memory usage: {self.torch_accelerator_module.memory_allocated() / 1024**2:.2f} MB"
        )
        return e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics

    def profile_generate(self, num_tokens_to_profile: int, config_name: str) -> None:
        """Profile a call to model.generate() limited to (num_tokens_to_profile) new tokens and export a Chrome trace named after (config_name)."""
|
||||
activities = [torch.profiler.ProfilerActivity.CPU]
|
||||
if self.device_type == "cuda":
|
||||
activities.append(torch.profiler.ProfilerActivity.CUDA)
|
||||
elif self.device_type == "xpu":
|
||||
activities.append(torch.profiler.ProfilerActivity.XPU)
|
||||
|
||||
profiler = torch.profiler.profile(
|
||||
activities=activities,
|
||||
record_shapes=True,
|
||||
)
|
||||
with profiler as prof:
|
||||
_ = self.model.generate(
|
||||
**self.inputs,
|
||||
max_new_tokens=num_tokens_to_profile,
|
||||
)
|
||||
if self.profile_dir is None:
|
||||
self.profile_dir = self.output_dir + "_profiles"
|
||||
os.makedirs(self.profile_dir, exist_ok=True)
|
||||
prof.export_chrome_trace(f"{self.profile_dir}/{config_name}.json")
|
||||
|
||||
@torch.inference_mode()
|
||||
def run_benchmarks(
|
||||
self,
|
||||
model_id: str,
|
||||
benchmark_configs: list[BenchmarkConfig],
|
||||
num_tokens_to_profile: int = 0,
|
||||
pretty_print_summary: bool = True,
|
||||
summarized: bool = True,
|
||||
) -> tuple[str, dict[str, Any]]:
|
||||
"""Run multiple benchmarks for the given model ID and list of benchmark configs."""
|
||||
all_results = {}
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
start_time = time.perf_counter()
|
||||
|
||||
n_configs = len(benchmark_configs)
|
||||
for i, config in enumerate(benchmark_configs):
|
||||
# Skip if already run
|
||||
if config.hash in all_results:
|
||||
self.logger.info(f"Skipping duplicate config {config.name} for model {model_id} ({i + 1}/{n_configs})")
|
||||
continue
|
||||
|
||||
# Otherwise, run the benchmark
|
||||
self.setup_benchmark(model_id, config)
|
||||
self.logger.info(
|
||||
f"Running benchmark of model {model_id} with scenario: {config.name} ({i + 1}/{n_configs})"
|
||||
)
|
||||
|
||||
# Launch benchmark in a try/except block to avoid stopping the whole run if one benchmark fails
|
||||
try:
|
||||
result = self.run_benchmark(config, num_tokens_to_profile)
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error running with scenario: {config.name}:\n{repr(e)}")
|
||||
result = None
|
||||
|
||||
# Memoize
|
||||
all_results[config.hash] = {
|
||||
"metadata": BenchmarkMetadata(
|
||||
model_id=model_id,
|
||||
branch_name=self.branch_name,
|
||||
commit_id=self.commit_id,
|
||||
commit_message=self.commit_message,
|
||||
success=result is not None,
|
||||
),
|
||||
"measurements": result if result is not None else BenchmarkResult(),
|
||||
"config": config,
|
||||
}
|
||||
|
||||
# Cleanup model and save results
|
||||
self.cleanup()
|
||||
self.save_results(model_id, all_results, timestamp=timestamp, summarized=summarized)
|
||||
|
||||
if len(all_results) < 1:
|
||||
raise RuntimeError("No benchmark was run successfully")
|
||||
|
||||
if pretty_print_summary:
|
||||
if not self._is_primary_process():
|
||||
return (timestamp, all_results)
|
||||
print()
|
||||
print("=" * 100)
|
||||
print(f"Finished benchmarks in {time.perf_counter() - start_time:.2f} seconds")
|
||||
print(f"Total number of benchmarks: {len(all_results)}")
|
||||
print("First run metadata:")
|
||||
first_key = list(all_results.keys())[0]
|
||||
first_metadata = all_results[first_key]["metadata"].to_dict()
|
||||
hardware_info = first_metadata.pop("hardware_info")
|
||||
pretty_print_dict(first_metadata | hardware_info, tabs=1)
|
||||
for result in all_results.values():
|
||||
print("=" * 100)
|
||||
print(f"Config: {result['config'].infer_name(compact=False)}\n")
|
||||
result["measurements"].pprint(
|
||||
batch_size=result["config"].batch_size,
|
||||
num_generated_tokens=result["config"].num_tokens_to_generate,
|
||||
tabs=1,
|
||||
)
|
||||
print("=" * 100)
|
||||
|
||||
return (timestamp, all_results)
|
||||
|
||||
def save_results(self, model_name: str, results: dict, timestamp: str = "", summarized: bool = True) -> str:
|
||||
"""Save benchmark results to JSON file."""
|
||||
if not self._is_primary_process():
|
||||
return ""
|
||||
# Create model-specific subdirectory
|
||||
model_name = model_name.replace("/", "_")
|
||||
model_dir = os.path.join(self.output_dir, model_name)
|
||||
os.makedirs(model_dir, exist_ok=True)
|
||||
|
||||
# Create filename with timestamp
|
||||
timestamp = timestamp if timestamp else datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"{model_name}_benchmark_{timestamp}.json"
|
||||
filepath = os.path.join(model_dir, filename)
|
||||
|
||||
# Convert results to dict
|
||||
converted_results = {}
|
||||
for cfg_hash in results.keys():
|
||||
converted_results[cfg_hash] = {
|
||||
"metadata": results[cfg_hash]["metadata"].to_dict(),
|
||||
"measurements": results[cfg_hash]["measurements"].to_dict(summarized=summarized),
|
||||
"config": results[cfg_hash]["config"].to_dict(),
|
||||
}
|
||||
|
||||
# Save to JSON file
|
||||
with open(filepath, "w") as f:
|
||||
f.write(compact_json_numeric_arrays(converted_results))
|
||||
|
||||
self.logger.info(f"Results saved to {filepath}")
|
||||
return filepath
|
||||
|
||||
def push_results_to_hub(self, dataset_id: str, results: dict[Any, Any], timestamp: str) -> None:
|
||||
if PUSH_TO_HUB_TOKEN is None:
|
||||
raise ValueError(
|
||||
"PUSH_TO_HUB_TOKEN is not set, cannot push results to the Hub. When setting dataset_id, please also set the PUSH_TO_HUB_TOKEN environment variable."
|
||||
)
|
||||
|
||||
api = HfApi()
|
||||
n_results = len(results)
|
||||
for summarized in [False, True]:
|
||||
self.logger.info(f"Pushing {n_results} results to: {dataset_id} with {summarized = }")
|
||||
rows = []
|
||||
for cfg_hash, entry in results.items():
|
||||
row = {
|
||||
"benchmark_config_hash": cfg_hash,
|
||||
"config": entry["config"].to_dict(),
|
||||
"measurements": entry["measurements"].to_dict(summarized=summarized),
|
||||
"metadata": entry["metadata"].to_dict(),
|
||||
}
|
||||
rows.append(row)
|
||||
|
||||
ds = Dataset.from_list(rows)
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
file_name = "summarized_results" if summarized else "full_results"
|
||||
jsonl_path = os.path.join(tmp, f"{file_name}.jsonl")
|
||||
with open(jsonl_path, "w") as f:
|
||||
json_lines = []
|
||||
for ex in ds:
|
||||
json_lines.append(json.dumps(ex, ensure_ascii=False))
|
||||
f.write("\n".join(json_lines))
|
||||
|
||||
# NOTE: we expect the repository to already exist
|
||||
timestamp = timestamp if timestamp else datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
file_name = file_name + "/" + f"benchmark_run_{timestamp}.jsonl"
|
||||
api.upload_file(
|
||||
path_or_fileobj=jsonl_path,
|
||||
path_in_repo=file_name,
|
||||
repo_id=dataset_id,
|
||||
repo_type="dataset",
|
||||
token=PUSH_TO_HUB_TOKEN,
|
||||
)
|
||||
self.logger.info(f"Successfully uploaded results to: {dataset_id} with {summarized = }")
|
||||
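The loop in `run_benchmarks` above runs each config once (keyed by `config.hash`) and records a failure instead of aborting the whole run. A minimal, dependency-free sketch of that pattern follows. `ToyConfig`, its `hash` property, and `run_all` are hypothetical stand-ins for illustration, not part of the benchmark framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for BenchmarkConfig, reduced to what the loop needs."""

    name: str
    batch_size: int

    @property
    def hash(self) -> str:
        # Assumption: configs with the same settings share the same key
        return f"bs{self.batch_size}"


def run_all(configs, run_one):
    """Run each distinct config once and keep going when one of them fails."""
    all_results = {}
    for config in configs:
        # Skip configs whose key was already recorded
        if config.hash in all_results:
            continue
        try:
            measurements, error = run_one(config), None
        except Exception as e:
            measurements, error = None, repr(e)
        # Record failures as well, so the key is never retried
        all_results[config.hash] = {
            "success": measurements is not None,
            "measurements": measurements,
            "error": error,
            "config": config,
        }
    return all_results
```

Because failed runs are stored too, the number of entries counts attempted configs, not successful ones; a caller who needs the latter has to filter on the `success` flag.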
176
benchmark_v2/framework/data_classes.py
Normal file
@@ -0,0 +1,176 @@
|
||||
from dataclasses import dataclass
|
||||
from datetime import datetime, timezone
|
||||
from typing import Any
|
||||
|
||||
import numpy as np
|
||||
|
||||
from .hardware_metrics import GPURawMetrics, HardwareInfo
|
||||
|
||||
|
||||
def compute_basic_statistics(measurements: list[float]) -> dict[str, float]:
|
||||
return {
|
||||
"avg": np.mean(measurements) if measurements else 0,
|
||||
"std": np.std(measurements) if measurements else 0,
|
||||
"min": np.min(measurements) if measurements else 0,
|
||||
"med": np.median(measurements) if measurements else 0,
|
||||
"max": np.max(measurements) if measurements else 0,
|
||||
"p95": np.percentile(measurements, 95) if measurements else 0,
|
||||
}
|
||||
|
||||
|
||||
def add_unit_to_duration(stats: dict[str, float]) -> dict[str, str]:
|
||||
for key in list(stats.keys()):
|
||||
value = stats[key]
|
||||
if value > 3600:
|
||||
stats[key] = f"{(value / 3600):.2f}hr"
|
||||
elif value > 60:
|
||||
stats[key] = f"{(value / 60):.2f}min"
|
||||
elif value > 1:
|
||||
stats[key] = f"{value:.2f}s"
|
||||
elif value > 1e-3:
|
||||
stats[key] = f"{(value * 1e3):.2f}ms"
|
||||
elif value > 1e-6:
|
||||
stats[key] = f"{(value * 1e6):.2f}us"
|
||||
else:
|
||||
stats[key] = f"{(value * 1e9):.2f}ns"
|
||||
return stats
|
||||
|
||||
|
||||
def equalize_lengths_and_collate(stats: dict[str, dict[str, str]]) -> dict[str, str]:
|
||||
"""Note: This operation is destructive as it will update values in place before returning a new correctly formatted dict"""
|
||||
keys = ["avg", "std", "min", "med", "max", "p95"]
|
||||
for key in keys:
|
||||
max_length = max(len(stat[key]) for stat in stats.values())
|
||||
for stat in stats.values():
|
||||
stat[key] = stat[key].ljust(max_length, " ")
|
||||
return {name: " ".join([f"{key}={stat[key]}" for key in keys]) for name, stat in stats.items()}
|
||||
|
||||
|
||||
def pretty_print_dict(data: dict[str, str], tabs: int = 0) -> None:
|
||||
max_key_length = max([len(key) for key in data.keys()])
|
||||
for key, value in data.items():
|
||||
tabs_str = " " * tabs
|
||||
padded_key = key.ljust(max_key_length + 1, ".")
|
||||
print(f"{tabs_str}{padded_key}: {value}")
|
||||
|
||||
|
||||
@dataclass
|
||||
class BenchmarkMetadata:
|
||||
"""Metadata collected for each benchmark run."""
|
||||
|
||||
model_id: str
|
||||
timestamp: str
|
||||
branch_name: str
|
||||
commit_id: str
|
||||
commit_message: str
|
||||
hardware_info: HardwareInfo
|
||||
success: bool
|
||||
|
||||
def __init__(
|
||||
self, model_id: str, commit_id: str, branch_name: str = "main", commit_message: str = "", success: bool = True
|
||||
) -> None:
|
||||
self.model_id = model_id
|
||||
self.timestamp = datetime.now(timezone.utc).isoformat()
|
||||
self.branch_name = branch_name
|
||||
self.commit_id = commit_id
|
||||
self.commit_message = commit_message
|
||||
self.hardware_info = HardwareInfo()
|
||||
self.success = success
|
||||
|
||||
def to_dict(self) -> dict[str, Any]:
|
||||
return {
|
||||
"model_id": self.model_id,
|
||||
"timestamp": self.timestamp,
|
||||
"branch_name": self.branch_name,
|
||||
"commit_id": self.commit_id,
|
||||
"commit_message": self.commit_message,
|
||||
"hardware_info": self.hardware_info.to_dict(),
|
||||
"success": self.success,
|
||||
}
|
||||
|
||||
|
||||
class BenchmarkResult:
|
||||
"""Result from a series of benchmark runs."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
self.e2e_latency = []
|
||||
self._timestamps = []
|
||||
self.time_to_first_token = []
|
||||
self.inter_token_latency = []
|
||||
self.shape_and_decoded_outputs = []
|
||||
self.gpu_metrics = []
|
||||
|
||||
def accumulate(
|
||||
self,
|
||||
e2e_latency: float,
|
||||
timestamps: list[list[float]],
|
||||
shape_and_decoded_output: str,
|
||||
gpu_metrics: GPURawMetrics | None,
|
||||
) -> None:
|
||||
self.e2e_latency.append(e2e_latency)
|
||||
self._timestamps.append(timestamps)
|
||||
self._accumulate_ttft_and_itl(timestamps)
|
||||
self.shape_and_decoded_outputs.append(shape_and_decoded_output)
|
||||
self.gpu_metrics.append(gpu_metrics)
|
||||
|
||||
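    def _accumulate_ttft_and_itl(self, timestamps: list[list[float]]) -> None:
        # One row per sequence in the batch, one column per generated token
        timestamps = np.array(timestamps)
        # Time to first token: earliest first-token timestamp across the batch
        ttft = np.min(timestamps[:, 0])
        # Inter-token latency: mean time from first to last token, divided by the number of intervals
        itl = np.mean(timestamps[:, -1] - timestamps[:, 0]) / (timestamps.shape[1] - 1)
        self.time_to_first_token.append(ttft)
        self.inter_token_latency.append(itl)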
|
||||
|
||||
def to_dict(self, summarized: bool = False) -> dict[str, Any]:
|
||||
# Save GPU metrics as None if it contains only None values or if we are summarizing
|
||||
if summarized or all(gm is None for gm in self.gpu_metrics):
|
||||
gpu_metrics = None
|
||||
else:
|
||||
gpu_metrics = [gm.to_dict() for gm in self.gpu_metrics]
|
||||
return {
|
||||
"e2e_latency": self.e2e_latency,
|
||||
"time_to_first_token": self.time_to_first_token,
|
||||
"inter_token_latency": self.inter_token_latency,
|
||||
"shape_and_decoded_outputs": self.shape_and_decoded_outputs,
|
||||
"gpu_metrics": gpu_metrics,
|
||||
"timestamps": None if summarized else self._timestamps,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict[str, Any]) -> "BenchmarkResult":
|
||||
# Handle GPU metrics, which is saved as None if it contains only None values
|
||||
if data["gpu_metrics"] is None:
|
||||
gpu_metrics = [None for _ in range(len(data["e2e_latency"]))]
|
||||
else:
|
||||
gpu_metrics = [GPURawMetrics.from_dict(gm) for gm in data["gpu_metrics"]]
|
||||
# Handle timestamps, which can be saved as None to reduce file size
|
||||
if data["timestamps"] is None:
|
||||
timestamps = [None for _ in range(len(data["e2e_latency"]))]
|
||||
else:
|
||||
timestamps = data["timestamps"]
|
||||
# Create a new instance and restore the saved fields directly
|
||||
new_instance = cls()
|
||||
new_instance.e2e_latency = data["e2e_latency"]
|
||||
new_instance._timestamps = timestamps
|
||||
new_instance.time_to_first_token = data["time_to_first_token"]
|
||||
new_instance.inter_token_latency = data["inter_token_latency"]
|
||||
new_instance.shape_and_decoded_outputs = data["shape_and_decoded_outputs"]
|
||||
new_instance.gpu_metrics = gpu_metrics
|
||||
return new_instance
|
||||
|
||||
def get_throughput(self, total_generated_tokens: int) -> list[float]:
|
||||
return [total_generated_tokens / e2e_latency for e2e_latency in self.e2e_latency]
|
||||
|
||||
def pprint(self, batch_size: int = 0, num_generated_tokens: int = 0, tabs: int = 0) -> None:
|
||||
measurements = {
|
||||
"E2E Latency": add_unit_to_duration(compute_basic_statistics(self.e2e_latency)),
|
||||
"Time to First Token": add_unit_to_duration(compute_basic_statistics(self.time_to_first_token)),
|
||||
}
|
||||
if len(self.inter_token_latency) > 0:
|
||||
measurements["Inter-Token Latency"] = add_unit_to_duration(
|
||||
compute_basic_statistics(self.inter_token_latency)
|
||||
)
|
||||
if batch_size > 0:
|
||||
throughput_stats = compute_basic_statistics(self.get_throughput(batch_size * num_generated_tokens))
|
||||
measurements["Throughput"] = {key: f"{value:.2f}tok/s" for key, value in throughput_stats.items()}
|
||||
dict_to_pprint = equalize_lengths_and_collate(measurements)
|
||||
pretty_print_dict(dict_to_pprint, tabs=tabs)
|
||||
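For readers who want to check the latency metrics in `BenchmarkResult` by hand, here is a small sketch of the same arithmetic without NumPy. It assumes `timestamps[i][j]` is the time in seconds, measured from the start of generation, at which token `j` of sequence `i` was produced. The function names are illustrative only, and the single-token guard is an addition that the original method does not have.

```python
from statistics import mean


def ttft_and_itl(timestamps):
    """Return (time to first token, inter-token latency) for one batched run."""
    # Time to first token: the earliest first-token timestamp in the batch
    ttft = min(seq[0] for seq in timestamps)
    n_tokens = len(timestamps[0])
    if n_tokens < 2:
        # Assumption of this sketch: with a single token there is no interval to measure
        return ttft, None
    # Mean first-to-last duration, divided by the number of intervals between tokens
    itl = mean(seq[-1] - seq[0] for seq in timestamps) / (n_tokens - 1)
    return ttft, itl


def throughput(batch_size, num_generated_tokens, e2e_latency):
    """Tokens per second over the whole batch, as in get_throughput()."""
    return batch_size * num_generated_tokens / e2e_latency
```

For a batch of two sequences with three tokens each, `ttft_and_itl([[0.5, 0.7, 0.9], [0.6, 0.9, 1.2]])` gives a TTFT of 0.5 s and an ITL of about 0.25 s.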
325
benchmark_v2/framework/hardware_metrics.py
Normal file
@@ -0,0 +1,325 @@
|
||||
import logging
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from enum import Enum
|
||||
from logging import Logger
|
||||
from multiprocessing import Pipe, Process
|
||||
from multiprocessing.connection import Connection
|
||||
|
||||
from transformers.utils.import_utils import is_cuda_platform, is_rocm_platform
|
||||
|
||||
|
||||
if is_cuda_platform():
|
||||
import pynvml
|
||||
|
||||
if is_rocm_platform():
|
||||
import amdsmi
|
||||
|
||||
import psutil
|
||||
import torch
|
||||
|
||||
from transformers.utils import is_torch_accelerator_available
|
||||
|
||||
|
||||
_logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# Helper function and data class to hold the hardware information
|
||||
def get_device_name_and_memory_total() -> tuple[str, float]:
|
||||
"""Returns the name and memory total of GPU 0."""
|
||||
device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
|
||||
torch_accelerator_module = getattr(torch, device_type, torch.cuda)
|
||||
device_name = torch_accelerator_module.get_device_properties(0).name
|
||||
device_memory_total = torch_accelerator_module.get_device_properties(0).total_memory / 1024**3
|
||||
return device_name, device_memory_total
|
||||
|
||||
|
||||
class HardwareInfo:
|
||||
"""A class to hold information about the hardware."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
# Retrieve GPU stats
|
||||
try:
|
||||
self.gpu_name, self.gpu_memory_total_gb = get_device_name_and_memory_total()
|
||||
except Exception:
|
||||
self.gpu_name, self.gpu_memory_total_gb = None, None
|
||||
# Retrieve python, torch and CUDA version
|
||||
self.python_version = f"{sys.version.split()[0]}"
|
||||
self.torch_version = torch.__version__
|
||||
if hasattr(torch, "cuda") and torch.cuda.is_available():
|
||||
self.cuda_version = torch.version.cuda
|
||||
else:
|
||||
self.cuda_version = None
|
||||
# Retrieve general hardware information
|
||||
self.cpu_count = psutil.cpu_count()
|
||||
self.memory_total_mb = int(psutil.virtual_memory().total / (1024 * 1024))
|
||||
|
||||
def to_dict(self) -> dict[str, None | int | float | str]:
|
||||
return {
|
||||
"gpu_name": self.gpu_name,
|
||||
"gpu_memory_total_gb": self.gpu_memory_total_gb,
|
||||
"python_version": self.python_version,
|
||||
"torch_version": self.torch_version,
|
||||
}
|
||||
|
||||
|
||||
# Functions to get information about the GPU
|
||||
def get_amd_gpu_stats(device_handle) -> tuple[int, float]:
|
||||
"""Get AMD GPU stats using amdsmi library."""
|
||||
utilization = amdsmi.amdsmi_get_gpu_activity(device_handle)["gfx_activity"]
|
||||
memory_used = amdsmi.amdsmi_get_gpu_vram_usage(device_handle)["vram_used"]
|
||||
return int(utilization), float(memory_used) / 1024**3 # Convert bytes to GB
|
||||
|
||||
|
||||
def get_intel_xpu_stats() -> tuple[int, float]:
    """Returns the utilization and memory used of an Intel XPU, as reported by xpu-smi."""
|
||||
# xpu-smi outputs CSV format: Timestamp, DeviceId, GPU Memory Utilization (%), GPU Memory Used (MiB)
|
||||
xpu_smi_output = subprocess.check_output(["xpu-smi", "dump", "-m", "5,18", "-n", "1"])
|
||||
lines = xpu_smi_output.decode("utf-8").strip().split("\n")
|
||||
|
||||
# Parse all data lines (skip header) and collect stats from all cards
|
||||
xpu_stats = []
|
||||
for line in lines[1:]:
|
||||
data_line = line.split(",")
|
||||
if len(data_line) < 4:
|
||||
continue
|
||||
device_id = data_line[1].strip()
|
||||
utilization_str = data_line[2].strip()
|
||||
memory_used_str = data_line[3].strip()
|
||||
if utilization_str != "N/A" and memory_used_str != "N/A":
|
||||
utilization = int(float(utilization_str))
|
||||
memory_used_mib = float(memory_used_str)
|
||||
xpu_stats.append((device_id, utilization, memory_used_mib))
|
||||
|
||||
if not xpu_stats:
|
||||
return 0, 0.0
|
||||
|
||||
# Sort by utilization (descending) and pick the highest
|
||||
xpu_stats.sort(key=lambda x: x[1], reverse=True)
|
||||
device_id, utilization, memory_used_mib = xpu_stats[0]
|
||||
memory_used_gb = memory_used_mib / 1024
|
||||
return utilization, memory_used_gb
|
||||
|
||||
|
||||
def get_nvidia_gpu_stats(device_handle) -> tuple[int, float]:
|
||||
"""Returns the utilization and memory used of an NVIDIA GPU using pynvml."""
|
||||
utilization = pynvml.nvmlDeviceGetUtilizationRates(device_handle).gpu
|
||||
memory_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
|
||||
memory_used_gb = memory_info.used / 1024**3
|
||||
return int(utilization), float(memory_used_gb)
|
||||
|
||||
|
||||
# Simple data classes to hold the raw GPU metrics
|
||||
class GPUMonitoringStatus(Enum):
|
||||
"""Status of GPU monitoring."""
|
||||
|
||||
SUCCESS = "success"
|
||||
FAILED = "failed"
|
||||
NO_GPUS_AVAILABLE = "no_gpus_available"
|
||||
NO_SAMPLES_COLLECTED = "no_samples_collected"
|
||||
|
||||
|
||||
@dataclass
|
||||
class GPURawMetrics:
|
||||
"""Raw values for GPU utilization and memory used."""
|
||||
|
||||
utilization: list[float] # in percent
|
||||
memory_used: list[float] # in GB
|
||||
timestamps: list[float] # in seconds
|
||||
timestamp_0: float # in seconds
|
||||
monitoring_status: GPUMonitoringStatus
|
||||
|
||||
def to_dict(self) -> dict[str, None | int | float | str]:
|
||||
return {
|
||||
"utilization": self.utilization,
|
||||
"memory_used": self.memory_used,
|
||||
"timestamps": self.timestamps,
|
||||
"timestamp_0": self.timestamp_0,
|
||||
"monitoring_status": self.monitoring_status.value,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict[str, None | int | float | str]) -> "GPURawMetrics":
|
||||
"""Create a GPURawMetrics instance from a dictionary."""
|
||||
return cls(
|
||||
utilization=data["utilization"],
|
||||
memory_used=data["memory_used"],
|
||||
timestamps=data["timestamps"],
|
||||
timestamp_0=data["timestamp_0"],
|
||||
monitoring_status=GPUMonitoringStatus(data["monitoring_status"]),
|
||||
)
|
||||
|
||||
|
||||
# Main class, used to monitor the GPU utilization during benchmark execution
|
||||
class GPUMonitor:
|
||||
"""Monitor GPU utilization during benchmark execution using a separate process."""
|
||||
|
||||
def __init__(self, sample_interval_sec: float = 0.05, logger: Logger | None = None):
|
||||
self.sample_interval_sec = sample_interval_sec
|
||||
self.logger = logger if logger is not None else _logger
|
||||
self.gpu_type = None
|
||||
self.process = None
|
||||
|
||||
device_type = torch.accelerator.current_accelerator().type if is_torch_accelerator_available() else "cuda"
|
||||
torch_accelerator_module = getattr(torch, device_type, torch.cuda)
|
||||
self.num_available_gpus = torch_accelerator_module.device_count()
|
||||
if self.num_available_gpus == 0:
|
||||
self.logger.warning(f"No GPUs detected by torch.{device_type}.device_count().")
|
||||
return
|
||||
|
||||
# Determine GPU type
|
||||
device_name, _ = get_device_name_and_memory_total()
|
||||
if "amd" in device_name.lower():
|
||||
self.gpu_type = "amd"
|
||||
elif "nvidia" in device_name.lower():
|
||||
self.gpu_type = "nvidia"
|
||||
elif "intel" in device_name.lower() or device_type == "xpu":
|
||||
self.gpu_type = "intel"
|
||||
else:
|
||||
self.logger.warning(f"Unsupported GPU for monitoring: {device_name}")
|
||||
|
||||
@staticmethod
|
||||
def _monitor_worker(gpu_type: str, sample_interval_sec: float, connection: Connection):
|
||||
"""Worker process for GPU monitoring."""
|
||||
gpu_utilization = []
|
||||
gpu_memory_used = []
|
||||
timestamps = []
|
||||
device_handle = None
|
||||
|
||||
# Initialize GPU-specific monitoring
|
||||
if gpu_type == "amd":
|
||||
amdsmi.amdsmi_init()
|
||||
device_handle = amdsmi.amdsmi_get_processor_handles()[0]
|
||||
elif gpu_type == "nvidia":
|
||||
pynvml.nvmlInit()
|
||||
device_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
|
||||
|
||||
# Signal ready
|
||||
try:
|
||||
connection.send(0)
|
||||
except Exception:
|
||||
return
|
||||
|
||||
# Monitoring loop
|
||||
stop = False
|
||||
while not stop:
|
||||
try:
|
||||
if gpu_type == "amd":
|
||||
utilization, memory_used = get_amd_gpu_stats(device_handle)
|
||||
elif gpu_type == "nvidia":
|
||||
utilization, memory_used = get_nvidia_gpu_stats(device_handle)
|
||||
elif gpu_type == "intel":
|
||||
utilization, memory_used = get_intel_xpu_stats()
|
||||
else:
|
||||
break
|
||||
|
||||
gpu_utilization.append(utilization)
|
||||
gpu_memory_used.append(memory_used)
|
||||
timestamps.append(time.time())
|
||||
except Exception as e:
|
||||
# Skips failed measurements
|
||||
_logger.debug(f"Failed to collect GPU metrics sample: {e}")
|
||||
|
||||
stop = connection.poll(sample_interval_sec)
|
||||
|
||||
# Cleanup
|
||||
if gpu_type == "amd":
|
||||
try:
|
||||
amdsmi.amdsmi_shut_down()
|
||||
except Exception as e:
|
||||
_logger.debug(f"Failed to shutdown AMD GPU monitoring: {e}")
|
||||
elif gpu_type == "nvidia":
|
||||
try:
|
||||
pynvml.nvmlShutdown()
|
||||
except Exception as e:
|
||||
_logger.debug(f"Failed to shutdown NVIDIA GPU monitoring: {e}")
|
||||
|
||||
# Send results back
|
||||
try:
|
||||
connection.send((gpu_utilization, gpu_memory_used, timestamps))
|
||||
except Exception as e:
|
||||
_logger.error(f"Failed to send GPU monitoring results: {e}")
|
||||
|
||||
connection.close()
|
||||
|
||||
def start(self):
|
||||
"""Start monitoring GPU metrics in a separate process."""
|
||||
if self.gpu_type is None:
|
||||
self.logger.debug("GPU monitoring skipped (no supported GPU)")
|
||||
return
|
||||
|
||||
self.child_connection, self.parent_connection = Pipe()
|
||||
self.process = Process(
|
||||
target=GPUMonitor._monitor_worker,
|
||||
args=(self.gpu_type, self.sample_interval_sec, self.child_connection),
|
||||
daemon=True,
|
||||
)
|
||||
self.process.start()
|
||||
|
||||
# Wait for worker to signal ready
|
||||
if self.process.is_alive():
|
||||
self.parent_connection.recv()
|
||||
self.logger.debug("GPU monitoring started (multiprocessing)")
|
||||
|
||||
def stop_and_collect(self) -> GPURawMetrics:
|
||||
"""Stop monitoring and return collected metrics."""
|
||||
# No GPU available or unsupported GPU
|
||||
if self.process is None:
|
||||
return GPURawMetrics(
|
||||
utilization=[],
|
||||
memory_used=[],
|
||||
timestamps=[],
|
||||
timestamp_0=0.0,
|
||||
monitoring_status=GPUMonitoringStatus.NO_GPUS_AVAILABLE,
|
||||
)
|
||||
|
||||
# Process crashed before we could collect results
|
||||
process_failed = False
|
||||
if not self.process.is_alive():
|
||||
process_failed = True
|
||||
gpu_utilization, gpu_memory_used, timestamps = [], [], []
|
||||
else:
|
||||
# Signal stop
|
||||
self.parent_connection.send(0)
|
||||
# Get results
|
||||
try:
|
||||
gpu_utilization, gpu_memory_used, timestamps = self.parent_connection.recv()
|
||||
except Exception:
|
||||
process_failed = True
|
||||
gpu_utilization, gpu_memory_used, timestamps = [], [], []
|
||||
|
||||
self.parent_connection.close()
|
||||
self.process.join(timeout=2.0)
|
||||
if self.process.is_alive():
|
||||
self.process.terminate()
|
||||
|
||||
if gpu_utilization:
|
||||
timestamp_0 = timestamps[0]
|
||||
metrics = GPURawMetrics(
|
||||
utilization=gpu_utilization,
|
||||
memory_used=gpu_memory_used,
|
||||
timestamps=[t - timestamp_0 for t in timestamps],
|
||||
timestamp_0=timestamp_0,
|
||||
monitoring_status=GPUMonitoringStatus.SUCCESS,
|
||||
)
|
||||
self.logger.debug(f"GPU monitoring completed: {len(gpu_utilization)} samples collected")
|
||||
elif process_failed:
|
||||
metrics = GPURawMetrics(
|
||||
utilization=[],
|
||||
memory_used=[],
|
||||
timestamps=[],
|
||||
timestamp_0=0.0,
|
||||
monitoring_status=GPUMonitoringStatus.FAILED,
|
||||
)
|
||||
self.logger.warning("GPU monitoring failed (process crashed or timed out)")
|
||||
else:
|
||||
metrics = GPURawMetrics(
|
||||
utilization=[],
|
||||
memory_used=[],
|
||||
timestamps=[],
|
||||
timestamp_0=0.0,
|
||||
monitoring_status=GPUMonitoringStatus.NO_SAMPLES_COLLECTED,
|
||||
)
|
||||
return metrics
|
||||
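The handshake between `GPUMonitor.start()`, `_monitor_worker` and `stop_and_collect()` can be tried in isolation. The sketch below keeps the same pipe protocol (ready signal, `poll()` as both sleep and stop check, results sent back at the end) but swaps the worker process for a thread and the GPU query for a caller-supplied `read_sample` function, so it runs without any accelerator. `sample_worker` and `monitor` are hypothetical names, not part of the framework.

```python
import threading
import time
from multiprocessing import Pipe


def sample_worker(interval_sec, connection, read_sample):
    """Collect samples until the other end of the pipe sends a stop message."""
    samples, timestamps = [], []
    # Handshake: tell the caller that sampling is about to start
    connection.send("ready")
    stop = False
    while not stop:
        samples.append(read_sample())
        timestamps.append(time.time())
        # poll() waits up to interval_sec and returns True once a message arrives
        stop = connection.poll(interval_sec)
    # Send everything back in one message, then release this end of the pipe
    connection.send((samples, timestamps))
    connection.close()


def monitor(workload, read_sample, interval_sec=0.01):
    """Run workload() while a background worker samples read_sample()."""
    worker_end, caller_end = Pipe()
    worker = threading.Thread(
        target=sample_worker,
        args=(interval_sec, worker_end, read_sample),
        daemon=True,
    )
    worker.start()
    caller_end.recv()  # block until the worker signals that it is ready
    workload()
    caller_end.send("stop")
    samples, timestamps = caller_end.recv()
    worker.join(timeout=2.0)
    caller_end.close()
    return samples, timestamps
```

Calling `monitor(lambda: time.sleep(0.05), read_sample=lambda: 42)` returns at least one sample, since the worker always samples once before its first `poll()`. A separate process, as used in the original class, additionally keeps sampling off the benchmark's own interpreter.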