Evaluation

xaytune provides two evaluation paths: custom dataset evaluation with evaluate() and benchmark evaluation with benchmark_evaluate().

evaluate()

Evaluate a model on a custom dataset with configurable metrics.

from xaytune.eval import evaluate

results = evaluate(
    model="output/my-finetune",
    dataset=[{"input_ids": ..., "labels": ...}],
    metrics=["loss", "perplexity"],
)

print(results)
# {'loss': 1.234, 'perplexity': 3.435}

Function Signature

def evaluate(
    *,
    model: Any,
    dataset: list[dict[str, Any]],
    metrics: list[str] | None = None,
) -> dict[str, float]:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | model object or `str` | required | A model instance, or a path to load one from |
| `dataset` | `list[dict]` | required | List of data batches to evaluate on |
| `metrics` | `list[str]` \| `None` | `None` | Metric names to compute; each must be in `metric_registry`. `None` selects `["loss", "perplexity"]` |

Returns: dict[str, float] mapping metric names to their computed values.

Note

When model is a string path, xaytune automatically loads the model and tokenizer using xaytune.models.load_model().

benchmark_evaluate()

Run standard benchmarks using lm-eval.

from xaytune.eval.benchmarks import benchmark_evaluate

results = benchmark_evaluate(
    model="meta-llama/Llama-3.1-8B",
    benchmarks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,
)

for task, metrics in results.items():
    print(f"{task}: {metrics}")

Function Signature

def benchmark_evaluate(
    *,
    model: str,
    benchmarks: list[str],
    num_fewshot: int | None = None,
) -> dict[str, dict[str, Any]]:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | required | Model path or Hugging Face Hub name |
| `benchmarks` | `list[str]` | required | List of benchmark task names |
| `num_fewshot` | `int` \| `None` | `None` | Number of few-shot examples (benchmark default if `None`) |

Returns: Nested dict {task_name: {metric_name: value}}.
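The nested return shape is easy to flatten for logging or export. The following is a minimal sketch, not part of xaytune: the helper name `flatten_results`, the metric names `acc` and `exact_match`, and the scores are all placeholders chosen for illustration.

```python
def flatten_results(results):
    """Turn {task: {metric: value}} into sorted (task, metric, value) rows."""
    rows = []
    for task in sorted(results):
        for metric in sorted(results[task]):
            rows.append((task, metric, results[task][metric]))
    return rows


# Placeholder values for illustration only; these are not real benchmark scores.
example = {
    "mmlu": {"acc": 0.5},
    "gsm8k": {"exact_match": 0.25},
}

rows = flatten_results(example)
for task, metric, value in rows:
    print(f"{task:<10} {metric:<12} {value}")
```

Sorting both levels keeps the output order stable across runs, which helps when diffing logs.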

Requires lm-eval

Install the eval extra to use benchmarks:

pip install "xaytune[eval]"

Built-in Metrics

xaytune ships three metrics, registered in xaytune.eval.metrics.metric_registry:

| Metric | Function | Description |
| --- | --- | --- |
| `loss` | `compute_loss` | Average cross-entropy loss |
| `perplexity` | `compute_perplexity` | Exponentiated average loss: exp(mean_loss) |
| `token_accuracy` | `compute_token_accuracy` | Fraction of correctly predicted tokens |
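To make these definitions concrete, here is a minimal pure-Python sketch of the three quantities. It is not xaytune's implementation: the function names and the input format (one probability distribution per token position, plus integer labels) are assumptions made for illustration.

```python
import math


def mean_cross_entropy(probs, labels):
    """Average negative log-probability assigned to the correct token."""
    total = 0.0
    for dist, label in zip(probs, labels):
        total += -math.log(dist[label])
    return total / len(labels)


def perplexity(mean_loss):
    """Perplexity is the exponential of the average loss."""
    return math.exp(mean_loss)


def token_accuracy(probs, labels):
    """Fraction of positions where the most likely token equals the label."""
    correct = sum(
        1
        for dist, label in zip(probs, labels)
        if max(range(len(dist)), key=dist.__getitem__) == label
    )
    return correct / len(labels)


# Hypothetical toy data: three token positions over a 3-token vocabulary.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 0]

loss = mean_cross_entropy(probs, labels)
ppl = perplexity(loss)
acc = token_accuracy(probs, labels)
print(f"loss={loss:.4f} perplexity={ppl:.4f} token_accuracy={acc:.4f}")
```

Because perplexity is computed from the mean loss, averaging per-batch perplexities would generally give a different number than exponentiating the overall mean loss.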

Custom Metrics

Register your own metrics with the @register_metric decorator:

from xaytune.eval.metrics import register_metric

@register_metric("bleu")
def compute_bleu(predictions, references, **kwargs):
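    # Your BLEU implementation here; it must return a number.
    score = 0.0  # placeholder so the example runs; replace with a real computation
    return score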

Once registered, custom metrics can be used anywhere metrics are accepted:

results = evaluate(model=model, dataset=data, metrics=["loss", "bleu"])

Or in YAML config:

eval:
  metrics: [loss, perplexity, bleu]

CLI Usage

Benchmark Evaluation

xaytune eval --model output/my-finetune --benchmarks mmlu,gsm8k --num-fewshot 5

Dataset Evaluation

xaytune eval --model output/my-finetune --dataset data/eval.jsonl --metrics loss,perplexity
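The schema xaytune expects in the `.jsonl` file is not specified here. As a rough sketch, assuming one JSON object per line with the same `input_ids` and `labels` keys used in the `evaluate()` example, a file like this could be read into the `list[dict]` shape as follows. The helper `read_jsonl` is hypothetical and not part of xaytune.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON object per non-empty line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


# Write a tiny hypothetical file so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "eval.jsonl"
    sample.write_text(
        '{"input_ids": [1, 2, 3], "labels": [1, 2, 3]}\n'
        "\n"
        '{"input_ids": [4, 5], "labels": [4, 5]}\n',
        encoding="utf-8",
    )
    records = read_jsonl(sample)

print(len(records))
```

Reporting the line number on a parse failure makes it much easier to locate a malformed record in a large evaluation file.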

Model Comparison

Compare two models side-by-side on the same benchmarks:

xaytune compare model-a model-b --benchmarks mmlu,gsm8k

This prints a table showing each model's score on every benchmark metric.
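A similar table can be assembled by hand from two nested result dicts of the shape returned by `benchmark_evaluate()`. The sketch below is an assumption about how that might look, not the CLI's actual output format; the helper names, metric names, and scores are placeholders, and it assumes every metric value is numeric.

```python
def comparison_rows(results_a, results_b):
    """Build (task, metric, score_a, score_b) rows; a missing score is None."""
    rows = []
    for task in sorted(set(results_a) | set(results_b)):
        metrics_a = results_a.get(task, {})
        metrics_b = results_b.get(task, {})
        for metric in sorted(set(metrics_a) | set(metrics_b)):
            rows.append((task, metric, metrics_a.get(metric), metrics_b.get(metric)))
    return rows


def format_table(name_a, name_b, rows):
    """Render the rows as fixed-width text with one column per model."""

    def cell(value):
        return "-" if value is None else f"{value:.3f}"

    lines = [f"{'task':<10} {'metric':<14} {name_a:>10} {name_b:>10}"]
    for task, metric, a, b in rows:
        lines.append(f"{task:<10} {metric:<14} {cell(a):>10} {cell(b):>10}")
    return "\n".join(lines)


# Placeholder scores for illustration only; these are not real results.
results_a = {"mmlu": {"acc": 0.5}, "gsm8k": {"exact_match": 0.25}}
results_b = {"mmlu": {"acc": 0.75}}

rows = comparison_rows(results_a, results_b)
table = format_table("model-a", "model-b", rows)
print(table)
```

Taking the union of tasks and metrics, rather than the intersection, keeps a benchmark visible even when only one model has a score for it.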

Agent Evaluation

Evaluate agent performance on tool-use tasks with evaluate_agent(), which scores a dataset of prompt-response pairs across four metrics.

from xaytune.eval.agent_metrics import evaluate_agent

results = evaluate_agent(
    responses=[
        {"prompt": "Search for cats", "response": "<tool_call>..."},
    ],
    expected_tools=["search"],
    success_markers=["Done"],
    max_steps=5,
)
# {'tool_use_accuracy': 0.85, 'task_success_rate': 0.90, ...}