Evaluation
xaytune provides two evaluation paths: custom dataset evaluation with evaluate() and benchmark evaluation with benchmark_evaluate().
evaluate()
Evaluate a model on a custom dataset with configurable metrics.
from xaytune.eval import evaluate

results = evaluate(
    model="output/my-finetune",
    dataset=[{"input_ids": ..., "labels": ...}],
    metrics=["loss", "perplexity"],
)
print(results)
# {'loss': 1.234, 'perplexity': 3.435}
Function Signature
def evaluate(
    *,
    model: Any,
    dataset: list[dict[str, Any]],
    metrics: list[str] | None = None,
) -> dict[str, float]:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | model object or str | required | A model instance or path to load from |
| dataset | list[dict] | required | List of data batches to evaluate on |
| metrics | list[str] \| None | ["loss", "perplexity"] | Metric names to compute (must be in metric_registry) |
Returns: dict[str, float] mapping metric names to their computed values.
Note
When model is a string path, xaytune automatically loads the model and tokenizer using xaytune.models.load_model().
benchmark_evaluate()
Run standard benchmarks using lm-eval.
from xaytune.eval.benchmarks import benchmark_evaluate

results = benchmark_evaluate(
    model="meta-llama/Llama-3.1-8B",
    benchmarks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,
)
for task, metrics in results.items():
    print(f"{task}: {metrics}")
Function Signature
def benchmark_evaluate(
    *,
    model: str,
    benchmarks: list[str],
    num_fewshot: int | None = None,
) -> dict[str, dict[str, Any]]:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | Model path or Hugging Face Hub name |
| benchmarks | list[str] | required | List of benchmark task names |
| num_fewshot | int \| None | None | Number of few-shot examples (benchmark default if None) |
Returns: Nested dict {task_name: {metric_name: value}}.
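The nested return shape is convenient to flatten before logging or saving. The following is a minimal sketch, not part of xaytune: `flatten_results` is a hypothetical helper, and the metric names and numbers in `example` are made up to illustrate the `{task_name: {metric_name: value}}` structure.

```python
from typing import Any


def flatten_results(results: dict[str, dict[str, Any]]) -> list[tuple[str, str, float]]:
    """Turn {task: {metric: value}} into sorted (task, metric, value) rows.

    Non-numeric entries (for example string labels) are skipped.
    """
    rows = []
    for task, metrics in results.items():
        for name, value in metrics.items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, name, float(value)))
    return sorted(rows)


# Hypothetical result dict; real metric names depend on the benchmark.
example = {
    "mmlu": {"acc": 0.65, "alias": "mmlu"},
    "gsm8k": {"exact_match": 0.52},
}
flat = flatten_results(example)
for task, name, value in flat:
    print(f"{task:<10} {name:<12} {value:.3f}")
```

Skipping non-numeric values keeps the rows safe to format as floats, since the return type allows arbitrary values per metric.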
Built-in Metrics
xaytune ships three metrics, registered in xaytune.eval.metrics.metric_registry:
| Metric | Function | Description |
|---|---|---|
| loss | compute_loss | Average cross-entropy loss |
| perplexity | compute_perplexity | Exponentiated average loss: exp(mean_loss) |
| token_accuracy | compute_token_accuracy | Fraction of correctly predicted tokens |
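The relationship between loss and perplexity in the table can be checked by hand. The sketch below is not xaytune's implementation: `mean_loss` and `perplexity` are hypothetical helpers, and it assumes the inputs are per-batch average cross-entropy losses in nats over equally sized batches.

```python
import math


def mean_loss(losses: list[float]) -> float:
    """Average a list of per-batch cross-entropy losses (hypothetical helper)."""
    if not losses:
        raise ValueError("need at least one loss value")
    return sum(losses) / len(losses)


def perplexity(losses: list[float]) -> float:
    """Perplexity as exp(mean_loss), matching the formula in the table."""
    return math.exp(mean_loss(losses))


batch_losses = [1.2, 1.3, 1.202]  # illustrative values, not real results
avg = mean_loss(batch_losses)
ppl = perplexity(batch_losses)
print(f"loss={avg:.3f} perplexity={ppl:.3f}")
# prints: loss=1.234 perplexity=3.435
```

These are the same numbers as in the evaluate() example output: a loss of 1.234 corresponds to a perplexity of about 3.435.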
Custom Metrics
Register your own metrics with the @register_metric decorator:
from xaytune.eval.metrics import register_metric

@register_metric("bleu")
def compute_bleu(predictions, references, **kwargs):
    # Your BLEU implementation here
    ...
    return score
Once registered, a custom metric can be referenced by name anywhere metric names are accepted, such as the metrics argument of evaluate() or a YAML config.
CLI Usage
Benchmark Evaluation
Dataset Evaluation
Model Comparison
Compare two models side by side on the same benchmarks. The comparison prints a table showing each model's score on every benchmark metric.
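How the CLI builds its table is not shown here. As a rough sketch of the same idea in Python, two result dicts of the shape returned by benchmark_evaluate() can be joined on the tasks and metrics they share. `comparison_rows` and `format_table` are hypothetical helpers, and the scores are made up.

```python
from typing import Any, Iterator


def comparison_rows(
    a: dict[str, dict[str, Any]],
    b: dict[str, dict[str, Any]],
) -> Iterator[tuple[str, str, float, float]]:
    """Yield (task, metric, value_a, value_b) for metrics both models report."""
    for task in sorted(a.keys() & b.keys()):
        for metric in sorted(a[task].keys() & b[task].keys()):
            yield task, metric, a[task][metric], b[task][metric]


def format_table(name_a: str, name_b: str, rows) -> str:
    """Render the rows as a fixed-width text table."""
    header = f"{'task':<10} {'metric':<12} {name_a:>10} {name_b:>10}"
    lines = [header, "-" * len(header)]
    for task, metric, va, vb in rows:
        lines.append(f"{task:<10} {metric:<12} {va:>10.3f} {vb:>10.3f}")
    return "\n".join(lines)


# Made-up scores for two hypothetical models.
base = {"mmlu": {"acc": 0.60}, "gsm8k": {"exact_match": 0.40}}
tuned = {"mmlu": {"acc": 0.63}, "gsm8k": {"exact_match": 0.55}}
rows = list(comparison_rows(base, tuned))
print(format_table("base", "tuned", rows))
```

Intersecting the keys avoids a KeyError when one model was evaluated on a task or metric the other lacks.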
Full API Reference
evaluate(*, model, dataset, metrics=None)
Evaluate a model on a list of batches and compute metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Any | A model instance or Hugging Face model name string. | required |
| dataset | list[dict[str, Any]] | List of batch dicts (each passable to …) | required |
| metrics | list[str] \| None | Metric names to compute (default: …) | None |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | Dict mapping metric names to computed values. |
Source code in xaytune/eval/evaluate.py
benchmark_evaluate(*, model, benchmarks, num_fewshot=None)
Run lm-eval-harness benchmarks against a Hugging Face model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str | Hugging Face model name or local path. | required |
| benchmarks | list[str] | Benchmark task names (e.g. …) | required |
| num_fewshot | int \| None | Number of few-shot examples. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, Any]] | Dict mapping benchmark names to their result dicts. |
Raises:
| Type | Description |
|---|---|
| ImportError | If … |
Source code in xaytune/eval/benchmarks.py
compute_loss(losses, *args, **kwargs)
compute_perplexity(losses, *args, **kwargs)
Compute perplexity as exp(mean_loss).
Source code in xaytune/eval/metrics.py
compute_token_accuracy(predictions, references, *args, **kwargs)
Compute fraction of tokens where prediction matches reference.
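As an illustration of the computation described, here is a minimal sketch over flat lists of token ids. It is not the library's source: `token_accuracy` is a hypothetical function, and skipping positions whose reference equals `ignore_index` (with -100 as the default) is an assumption borrowed from common practice, not confirmed for xaytune.

```python
def token_accuracy(predictions, references, ignore_index=-100):
    """Fraction of positions where the prediction equals the reference.

    Positions whose reference equals `ignore_index` are skipped (assumed
    masking convention; xaytune's behavior may differ).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        if ref == ignore_index:
            continue  # masked position, e.g. padding
        total += 1
        correct += int(pred == ref)
    return correct / total if total else 0.0


preds = [5, 8, 2, 9, 0]
refs = [5, 7, 2, 9, -100]  # last position is masked
acc = token_accuracy(preds, refs)
print(acc)  # prints: 0.75
```

Three of the four unmasked positions match, which gives 0.75.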