Data Pipeline

xaytune's data pipeline handles loading, formatting, tokenizing, packing, and validating training data. The typical flow is:

load_dataset → format (automatic) → tokenize_dataset → pack_sequences (optional) → DataLoader(collate_fn)

For preference/alignment data, use the preference-specific functions instead:

load_preference_dataset → tokenize_preference_dataset → DataLoader(collate_preference)

Loading

`load_dataset(path, *, format, source='local', streaming=False, eval_split=0.0, tokenizer=None, **kwargs)`

Load and format a dataset from a local JSONL file or the HuggingFace Hub.

Each sample is run through the registered format function ("alpaca", "sharegpt", "chat", "text", "preference"), converting raw fields into a {"text": "..."} dict ready for tokenization.
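
As a rough illustration of the formatting step, the sketch below shows how a name-keyed registry of format functions could turn raw fields into a `{"text": "..."}` dict. The `FORMATS` dict, the `register_format` decorator, and the prompt template are all hypothetical stand-ins, not xaytune's actual registry or templates; the raw field names (`instruction`, `input`, `output`) follow the common Alpaca convention and are assumed here.

```python
from typing import Callable

# Hypothetical registry: format name -> function mapping one raw sample to
# a {"text": ...} dict. xaytune's real registry and templates may differ.
FORMATS: dict[str, Callable[[dict], dict]] = {}


def register_format(name: str):
    """Register a format function under the given name (illustrative only)."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        FORMATS[name] = fn
        return fn
    return decorator


@register_format("alpaca")
def format_alpaca(sample: dict) -> dict:
    # Assumed Alpaca-style fields: "instruction", optional "input", "output".
    parts = [f"### Instruction:\n{sample['instruction']}"]
    if sample.get("input"):
        parts.append(f"### Input:\n{sample['input']}")
    parts.append(f"### Response:\n{sample['output']}")
    return {"text": "\n\n".join(parts)}


@register_format("text")
def format_text(sample: dict) -> dict:
    # Plain-text samples pass through; any extra fields are dropped.
    return {"text": sample["text"]}


raw = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
formatted = FORMATS["alpaca"](raw)
print(formatted["text"])
```

Whatever the real templates look like, the point is the same: every format function emits the same single-key shape, so the tokenization step does not need to know which format produced a sample.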

Parameters:

Name	Type	Description	Default
`path`	`str`	Local file path or HuggingFace dataset name.	required
`format`	`str`	Format name registered in the format registry.	required
`source`	`str`	`"local"` or `"huggingface"`.	`'local'`
`streaming`	`bool`	Stream from HuggingFace instead of downloading.	`False`
`eval_split`	`float`	Fraction to hold out for evaluation (0 = no split).	`0.0`
`tokenizer`	`Any | None`	Optional tokenizer for chat template application.	`None`

Returns:

Type	Description
`list[dict] | tuple[list[dict], list[dict]]`	A list of formatted samples, or a `(train, eval)` tuple when `eval_split > 0`.
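
Because the return type depends on `eval_split`, calling code may want to normalize both shapes to a single `(train, eval)` pair. The helper below is a minimal sketch of one way to do that; `as_train_eval` is a hypothetical name and not part of xaytune, and the sample values are stand-ins shaped like the two documented return types.

```python
from typing import Union

Samples = list[dict]


def as_train_eval(
    result: Union[Samples, tuple[Samples, Samples]],
) -> tuple[Samples, Samples]:
    """Normalize a load_dataset-style result to a (train, eval) pair."""
    if isinstance(result, tuple):
        # eval_split > 0: the result is already (train, eval).
        train, eval_samples = result
        return train, eval_samples
    # eval_split == 0: only a training list was returned.
    return result, []


# Stand-in values shaped like the two documented return types.
no_split = [{"text": "a"}, {"text": "b"}]
with_split = ([{"text": "a"}], [{"text": "b"}])

train_a, eval_a = as_train_eval(no_split)
train_b, eval_b = as_train_eval(with_split)
```

In real use, the argument would be the value returned by `load_dataset(...)`.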

Raises:

Type	Description
`FileNotFoundError`	If `source` is `"local"` and `path` doesn't exist.

Source code in xaytune/data/loader.py

from pathlib import Path
from typing import Any


def load_dataset(
    path: str,
    *,
    format: str,
    source: str = "local",
    streaming: bool = False,
    eval_split: float = 0.0,
    tokenizer: Any | None = None,
    **kwargs: Any,
) -> list[dict] | tuple[list[dict], list[dict]]:
    """Load and format a dataset from a local JSONL file or HuggingFace Hub.

    Each sample is run through the registered format function (``"alpaca"``,
    ``"sharegpt"``, ``"chat"``, ``"text"``, ``"preference"``), converting
    raw fields into a ``{"text": "..."}`` dict ready for tokenization.

    Args:
        path: Local file path or HuggingFace dataset name.
        format: Format name registered in the format registry.
        source: ``"local"`` or ``"huggingface"``.
        streaming: Stream from HuggingFace instead of downloading.
        eval_split: Fraction to hold out for evaluation (0 = no split).
        tokenizer: Optional tokenizer for chat template application.

    Returns:
        A list of formatted samples, or a ``(train, eval)`` tuple when
        ``eval_split > 0``.

    Raises:
        FileNotFoundError: If *source* is ``"local"`` and *path* doesn't exist.
    """
    if source == "huggingface":
        return _load_huggingface(  # type: ignore[no-any-return]
            path,
            format=format,
            streaming=streaming,
            eval_split=eval_split,
            tokenizer=tokenizer,
        )

    file_path = Path(path)
    if not file_path.exists():
        raise FileNotFoundError(f"Dataset not found: {path}")
    format_fn = _make_format_fn(format, tokenizer)
    raw_data = _load_jsonl(path)
    processed = [format_fn(sample) for sample in raw_data]
    if eval_split > 0:
        return _split_dataset(processed, eval_split)
    return processed
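
The local branch above calls `_load_jsonl` and `_split_dataset`, whose bodies are not shown in this section. The sketch below is one plausible reading of what such helpers do, written under explicit assumptions: one JSON object per non-empty line, and an unshuffled hold-out taken from the end of the list. The names `load_jsonl` and `split_dataset` are illustrative, and the real helpers may shuffle, seed, or round differently.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line (assumed file layout)."""
    samples = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


def split_dataset(
    samples: list[dict], eval_split: float
) -> tuple[list[dict], list[dict]]:
    """Hold out the trailing `eval_split` fraction as the eval set.

    Assumption: no shuffling, and at least one sample is held out when
    the list is non-empty.
    """
    if not 0 < eval_split < 1:
        raise ValueError("eval_split must be between 0 and 1 (exclusive)")
    n_eval = max(1, int(len(samples) * eval_split))
    cut = len(samples) - n_eval
    return samples[:cut], samples[cut:]


# Round trip through a temporary JSONL file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    rows = [{"text": f"sample {i}"} for i in range(10)]
    path.write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    loaded = load_jsonl(str(path))
    train, eval_samples = split_dataset(loaded, 0.2)
```

A tail hold-out keeps the split deterministic without a seed, at the cost of a biased eval set if the file is ordered; a shuffled split would avoid that bias.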