Data Pipeline¶
xaytune's data pipeline handles loading, formatting, tokenizing, packing, and validating training data. The typical flow is:
load_dataset → format (automatic) → tokenize_dataset → pack_sequences (optional) → DataLoader(collate_fn)
For preference/alignment data, use the preference-specific functions instead:
Loading¶
load_dataset(path, *, format, source='local', streaming=False, eval_split=0.0, tokenizer=None, **kwargs)
¶
Load and format a dataset from a local JSONL file or HuggingFace Hub.
Each sample is run through the registered format function ("alpaca",
"sharegpt", "chat", "text", "preference"), converting
raw fields into a {"text": "..."} dict ready for tokenization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Local file path or HuggingFace dataset name. |
required |
format
|
str
|
Format name registered in the format registry. |
required |
source
|
str
|
|
'local'
|
streaming
|
bool
|
Stream from HuggingFace instead of downloading. |
False
|
eval_split
|
float
|
Fraction to hold out for evaluation (0 = no split). |
0.0
|
tokenizer
|
Any | None
|
Optional tokenizer for chat template application. |
None
|
Returns:
| Type | Description |
|---|---|
list[dict] | tuple[list[dict], list[dict]]
|
A list of formatted samples, or a |
list[dict] | tuple[list[dict], list[dict]]
|
|
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If source is |
Source code in xaytune/data/loader.py
Formats¶
Built-in format functions registered in format_registry:
format_alpaca(sample)
¶
Format an Alpaca-style sample (instruction/input/output) into {"text": ...}.
Source code in xaytune/data/formats.py
format_sharegpt(sample)
¶
Format a ShareGPT-style multi-turn conversation into {"text": ...}.
Source code in xaytune/data/formats.py
format_chat(sample)
¶
Format an OpenAI-style chat messages list into {"text": ...}.
Source code in xaytune/data/formats.py
format_text(sample)
¶
Pass through a raw text sample as {"text": ...}.
apply_chat_template(sample, tokenizer, *, format='chat')
¶
Apply the tokenizer's chat template to a conversation sample.
Source code in xaytune/data/formats.py
Tokenization¶
tokenize_dataset(data, tokenizer, max_seq_length=0)
¶
Tokenize formatted samples into input_ids/labels/attention_mask dicts.
If samples already contain "input_ids", returns them unchanged.
Empty texts and empty encodings are filtered out.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
list[dict[str, Any]]
|
Formatted samples, each with a |
required |
tokenizer
|
Any
|
A HuggingFace tokenizer. |
required |
max_seq_length
|
int
|
Maximum sequence length (0 = use tokenizer default). |
0
|
Returns:
| Type | Description |
|---|---|
list[dict[str, list[int]]]
|
List of dicts with |
list[dict[str, list[int]]]
|
(all |
Source code in xaytune/data/tokenizer.py
collate_tokenized(batch, pad_token_id=0)
¶
Collate tokenized samples into padded tensors for model input.
Pads all sequences to the longest in the batch. Labels are padded
with -100 (cross-entropy ignore index).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
batch
|
list[dict[str, Any]]
|
List of tokenized dicts with |
required |
pad_token_id
|
int
|
Token id for input padding (masks use 0). |
0
|
Returns:
| Type | Description |
|---|---|
dict[str, Tensor]
|
Dict with |
Source code in xaytune/data/tokenizer.py
Preference Data¶
load_preference_dataset(path, *, eval_split=0.0)
¶
Load a preference JSONL file with prompt/chosen/rejected fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to a JSONL file where each line has |
required |
eval_split
|
float
|
Fraction to hold out for evaluation. |
0.0
|
Returns:
| Type | Description |
|---|---|
list[dict] | tuple[list[dict], list[dict]]
|
Formatted samples, or a |
Raises:
| Type | Description |
|---|---|
FileNotFoundError
|
If path doesn't exist. |
ValueError
|
If any row is missing required fields. |
Source code in xaytune/data/preferences.py
tokenize_preference_dataset(data, tokenizer, max_seq_length=0)
¶
Tokenize preference pairs into chosen/rejected input_ids and masks.
Concatenates prompt + chosen and prompt + rejected before
tokenizing. If samples already contain "chosen_input_ids", returns
them unchanged. Pairs with empty chosen or rejected text are skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
list[dict[str, Any]]
|
Preference samples with |
required |
tokenizer
|
Any
|
A HuggingFace tokenizer. |
required |
max_seq_length
|
int
|
Maximum sequence length (0 = use tokenizer default). |
0
|
Returns:
| Type | Description |
|---|---|
list[dict[str, list[int]]]
|
List of dicts with |
list[dict[str, list[int]]]
|
|
Source code in xaytune/data/tokenizer.py
collate_preference(batch, pad_token_id=0)
¶
Collate tokenized preference pairs into padded tensors.
Pads chosen and rejected sequences independently to their respective max lengths within the batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
batch
|
list[dict[str, Any]]
|
List of tokenized preference dicts. |
required |
pad_token_id
|
int
|
Token id for input padding (masks use 0). |
0
|
Returns:
| Type | Description |
|---|---|
dict[str, Tensor]
|
Dict with |
dict[str, Tensor]
|
|
Source code in xaytune/data/tokenizer.py
Packing¶
pack_sequences(sequences, *, max_seq_length, pad_token_id)
¶
Pack multiple short sequences into fixed-length blocks to reduce padding.
Concatenates tokenized samples end-to-end and splits at
max_seq_length boundaries. Remaining space is padded with
pad_token_id (labels use -100).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sequences
|
list[dict[str, list[int]]]
|
Tokenized samples, each with |
required |
max_seq_length
|
int
|
Target sequence length for packed blocks. |
required |
pad_token_id
|
int
|
Token id used for input padding. |
required |
Returns:
| Type | Description |
|---|---|
list[dict[str, list[int]]]
|
Packed samples with |
Source code in xaytune/data/packing.py
Validation¶
validate_dataset_sample(dataloader, *, max_seq_length=0)
¶
Draw one batch from a dataloader and validate it.
Raises:
| Type | Description |
|---|---|
DataValidationError
|
If the dataset is empty or the batch has issues. |
Source code in xaytune/data/validation.py
validate_batch(batch, *, max_seq_length=0)
¶
Check a single batch dict for common data issues.
Returns a list of human-readable issue strings (empty = valid).
Source code in xaytune/data/validation.py
DataValidationError
¶
Bases: ValueError
Raised when a data batch fails validation checks.