metadata
Metadata-stage checks that require a loaded tokenizer / model metadata.
Classes:
| Name | Description |
|---|---|
| `TokenBudgetCheck` | Verify that records and groups fit within the model's context window. |
TokenBudgetCheck
Bases: MetadataCheck
Verify that records and groups fit within the model's context window.
This check is a heuristic approximation of what the training assembler will see, not an exact simulation. Known sources of drift:
- Sampling: only the first `token_sample_size` rows (default 5000) and the largest `top_groups_to_check` groups (default 100) are tokenized; a long-tail outlier outside the sample can still fail at assembly time.
- Top-by-records bias: groups are ranked by row count, but token budget is driven by serialized text length; a group with fewer rows but very wide columns could exceed the budget without being flagged.
- PII-replacement drift: on `--validate` the data has not been PII-replaced, so token counts reflect the raw input rather than the replaced text the assembler actually sees. Replacement tokens can be shorter or longer than the originals.
Treat the output as a strong signal, not a guarantee; a clean result means the sampled rows and top groups fit, not that every row will.
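The sampling and top-by-records caveats can be illustrated with a small sketch. This is not the implementation of `TokenBudgetCheck`; the function `sampled_budget_check`, the row layout, and the whitespace tokenizer are hypothetical stand-ins chosen for illustration. Only the parameter names `token_sample_size` and `top_groups_to_check` come from the description above.

```python
from collections import Counter


def sampled_budget_check(rows, count_tokens, context_window,
                         token_sample_size=5000, top_groups_to_check=100):
    """Hypothetical sketch: flag oversized rows and groups in the sample only.

    Each row is assumed to be a dict with "group" and "text" keys.
    Returns (oversized_row_indices, oversized_group_names).
    """
    # Row check: only the first token_sample_size rows are tokenized.
    oversized_rows = [
        i for i, row in enumerate(rows[:token_sample_size])
        if count_tokens(row["text"]) > context_window
    ]

    # Group check: groups are ranked by row count, not by text length.
    row_counts = Counter(row["group"] for row in rows)
    top_groups = [g for g, _ in row_counts.most_common(top_groups_to_check)]

    # Sum token counts for the selected groups only.
    totals = {g: 0 for g in top_groups}
    for row in rows:
        if row["group"] in totals:
            totals[row["group"]] += count_tokens(row["text"])

    oversized_groups = [g for g in top_groups if totals[g] > context_window]
    return oversized_rows, oversized_groups


def whitespace_tokens(text):
    # Stand-in tokenizer (assumption): a real check would use the model's.
    return len(text.split())


# Group "a" has many short rows; group "b" has one very wide row at the end.
rows = [{"group": "a", "text": "x y"} for _ in range(3)]
rows.append({"group": "b", "text": " ".join(["w"] * 50)})

# Small sample and a single top group: the wide row in "b" is never examined.
clean = sampled_budget_check(rows, whitespace_tokens, context_window=20,
                             token_sample_size=3, top_groups_to_check=1)

# Sample and group limits large enough to cover everything: "b" is flagged.
full = sampled_budget_check(rows, whitespace_tokens, context_window=20,
                            token_sample_size=4, top_groups_to_check=2)

print(clean)  # ([], [])
print(full)   # ([3], ['b'])
```

In the first call the result is clean even though row 3 and group `b` exceed the window, which is the failure mode the caveats describe: a clean result covers only what was sampled.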