`metadata`

Metadata-stage checks that require a loaded tokenizer / model metadata.

Classes:

| Name | Description |
| --- | --- |
| `TokenBudgetCheck` | Verify that records and groups fit within the model's context window. |

`TokenBudgetCheck` verifies that records and groups fit within the model's context window.

This check is a heuristic approximation of what the training assembler will see, not an exact simulation. Known sources of drift:

- Sampling: only the first `token_sample_size` rows (default 5000) and the largest `top_groups_to_check` groups (default 100) are tokenized; a long-tail outlier outside the sample can still fail at assembly time.
- Top-by-records bias: groups are ranked by row count, but the token budget is driven by serialized text length, so a group with fewer rows but very wide columns could exceed the budget without being flagged.
- PII-replacement drift: when run with `--validate`, the data has not yet been PII-replaced, so token counts reflect the raw input rather than the replaced text the assembler actually sees. Replacement tokens can be shorter or longer than the originals.

Treat the output as a strong signal, not a guarantee; a clean result means the sampled rows and top groups fit, not that every row will.
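
The first two sources of drift can be illustrated with a small, self-contained sketch. It is not the implementation of `TokenBudgetCheck`: the helper names (`count_tokens`, `sampled_check`), the whitespace tokenizer, and the assumption that a group's token count is the sum of its rows' counts are all hypothetical stand-ins chosen to keep the example runnable.

```python
from collections import defaultdict


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace split.
    # A real check would use the loaded model tokenizer.
    return len(text.split())


def sampled_check(rows, budget, token_sample_size, top_groups_to_check):
    """rows is a list of (group_id, text) pairs.

    Returns (flagged_row_indices, flagged_group_ids).
    """
    # Row-level check: only the first token_sample_size rows are tokenized.
    flagged_rows = [
        i
        for i, (_, text) in enumerate(rows[:token_sample_size])
        if count_tokens(text) > budget
    ]

    groups = defaultdict(list)
    for group_id, text in rows:
        groups[group_id].append(text)

    # Group-level check: groups are ranked by row count, not by text length.
    top = sorted(groups, key=lambda g: len(groups[g]), reverse=True)
    top = top[:top_groups_to_check]
    flagged_groups = [
        g for g in top if sum(count_tokens(t) for t in groups[g]) > budget
    ]
    return flagged_rows, flagged_groups


# Group "a" has many short rows; group "b" has one very wide row at the tail.
rows = [("a", "x y"), ("a", "x y"), ("a", "x y"), ("b", " ".join(["w"] * 12))]

# Sampled: first 3 rows and the single largest group by row count.
sampled = sampled_check(rows, budget=10, token_sample_size=3, top_groups_to_check=1)

# Exhaustive: every row and every group.
exhaustive = sampled_check(rows, budget=10, token_sample_size=4, top_groups_to_check=2)
```

In this toy data the sampled run reports nothing, while the exhaustive run flags both the wide row and its group: the outlier sits outside the row sample, and its group ranks last by row count.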