metadata

Metadata-stage checks that require a loaded tokenizer / model metadata.

Classes:

Name              Description
TokenBudgetCheck  Verify that records and groups fit within the model's context window.

TokenBudgetCheck

Bases: MetadataCheck

Verify that records and groups fit within the model's context window.

This check is a heuristic approximation of what the training assembler will see, not an exact simulation. Known sources of drift:

  • Sampling: only the first token_sample_size rows (default 5000) and the largest top_groups_to_check groups (default 100) are tokenized; a long-tail outlier outside the sample can still fail at assembly time.
  • Top-by-records bias: groups are ranked by row count, but the token budget is driven by serialized text length, so a group with fewer rows but very wide columns could exceed the budget without being flagged.
  • PII-replacement drift: under --validate the data has not yet been PII-replaced, so token counts reflect the raw input rather than the replaced text the assembler actually sees. Replacement text can tokenize to fewer or more tokens than the original.

Treat the output as a strong signal, not a guarantee; a clean result means the sampled rows and top groups fit, not that every row will.
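
The sampling and ranking limits described above can be illustrated with a minimal sketch. This is not the actual TokenBudgetCheck implementation: the function name sketch_token_budget_check, the whitespace tokenizer, the row serialization, and the return shape are all assumptions made for illustration. Only the parameter names token_sample_size and top_groups_to_check and their defaults come from the description above.

```python
from collections import defaultdict


def count_tokens(text):
    # Stand-in tokenizer (assumption): whitespace split.
    # A real check would use the loaded model tokenizer instead.
    return len(text.split())


def serialize(row):
    # Hypothetical serialization: join all column values with spaces.
    return " ".join(str(value) for value in row.values())


def sketch_token_budget_check(rows, group_key, context_window,
                              token_sample_size=5000,
                              top_groups_to_check=100):
    """Return (oversized_row_indices, oversized_group_keys) for the sample."""
    # Only the first token_sample_size rows are tokenized.
    oversized_rows = [
        index
        for index, row in enumerate(rows[:token_sample_size])
        if count_tokens(serialize(row)) > context_window
    ]

    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Groups are ranked by row count, not by serialized text length.
    ranked = sorted(groups, key=lambda key: len(groups[key]), reverse=True)
    checked = ranked[:top_groups_to_check]

    oversized_groups = [
        key
        for key in checked
        if sum(count_tokens(serialize(r)) for r in groups[key]) > context_window
    ]
    return oversized_rows, oversized_groups


# Three short rows in group "a", then one very wide row in group "b".
rows = [{"group": "a", "text": "short note"} for _ in range(3)]
rows.append({"group": "b", "text": " ".join(["word"] * 50)})

# A narrow sample misses the wide row and its group entirely.
narrow = sketch_token_budget_check(rows, "group", context_window=20,
                                   token_sample_size=3, top_groups_to_check=1)

# Full coverage flags both the row and the group.
full = sketch_token_budget_check(rows, "group", context_window=20,
                                 token_sample_size=10, top_groups_to_check=10)
```

In this toy data the narrow run reports nothing, because the wide row falls outside the row sample and its group ranks below the cutoff by row count, while the full run flags both. That is the sense in which a clean result covers only what was sampled.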