regex_manager

`regex_manager`

JSON-schema-to-regex compiler for structured generation.

Converts a subset of JSON Schema into a regular expression that can be used by vLLM's structured-output backend to constrain model output to valid JSONL records.

Functions:

| Name | Description |
| --- | --- |
| `build_json_based_regex` | Build a regex that constrains LLM output to valid JSONL records. |
| `build_json_structural_tag` | Build an XGrammar Structural Tag for schema-constrained JSONL records. |

`build_json_based_regex(schema, config, bos_token, eos_token, whitespace_pattern=None)`

Build a regex that constrains LLM output to valid JSONL records.

Supports `properties`, `required`, `enum`, primitive `type` values, arrays/objects with min/max item or property counts, string length bounds, `pattern`, and `format` values for date-time, date, time, and UUID. Use vLLM's native JSON schema structured-output path for unsupported schema features such as `additionalProperties`, composition keywords, and `$ref`.
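
As an illustration, the schema below stays inside the supported subset described above. The field names and the `find_unsupported_keywords` helper are hypothetical and not part of `regex_manager`. The helper only looks for the unsupported keywords named above, and the list of composition keywords (`allOf`, `anyOf`, `oneOf`, `not`) is an assumption about what that term covers here.

```python
from typing import Any

# Keywords the text above names as unsupported. The composition keywords are
# spelled out as an assumption about what "composition keywords" covers.
UNSUPPORTED_KEYWORDS = {"additionalProperties", "$ref", "allOf", "anyOf", "oneOf", "not"}

# Hypothetical record schema that uses only supported features.
schema: dict[str, Any] = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "format": "uuid"},
        "created": {"type": "string", "format": "date-time"},
        "status": {"enum": ["active", "inactive"]},
        "name": {"type": "string", "minLength": 1, "maxLength": 40},
        "zip": {"type": "string", "pattern": "[0-9]{5}"},
        "scores": {"type": "array", "items": {"type": "integer"}, "minItems": 1, "maxItems": 3},
    },
    "required": ["id", "created", "status", "name", "zip", "scores"],
}


def find_unsupported_keywords(node: Any, path: str = "$") -> list[str]:
    """Return the paths of keywords listed in UNSUPPORTED_KEYWORDS.

    This is a naive walk over every dict key, so a property that is
    literally named "not" or "$ref" would also be reported.
    """
    found: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in UNSUPPORTED_KEYWORDS:
                found.append(f"{path}.{key}")
            found.extend(find_unsupported_keywords(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(find_unsupported_keywords(item, f"{path}[{index}]"))
    return found
```

A check like this could run before choosing between the regex compiler and vLLM's native JSON schema path: an empty result suggests the schema avoids the keywords named above, although it does not prove that every keyword in the schema is supported.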

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `schema` | `dict[str, Any]` | JSON schema dictionary describing the record format. | required |
| `config` | `SafeSynthesizerParameters` | Pipeline configuration (used for grouping and structured-generation settings). | required |
| `bos_token` | `str` | Beginning-of-sequence token (used when grouping). | required |
| `eos_token` | `str` | End-of-sequence token (used when grouping). | required |
| `whitespace_pattern` | `str \| None` | Optional regex fragment for matching whitespace between JSON tokens. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `str` | Compiled regex string suitable for vLLM's structured-output backend. |

Source code in `src/nemo_safe_synthesizer/generation/regex_manager.py`:

def build_json_based_regex(
    schema: dict[str, Any],
    config: SafeSynthesizerParameters,
    bos_token: str,
    eos_token: str,
    whitespace_pattern: str | None = None,
) -> str:
    """Build a regex that constrains LLM output to valid JSONL records.

    Supports ``properties``, ``required``, ``enum``, primitive ``type`` values,
    arrays/objects with min/max item or property counts, string length bounds,
    ``pattern``, and ``format`` values for date-time, date, time, and UUID.
    Use vLLM's native JSON schema structured-output path for unsupported schema
    features such as ``additionalProperties``, composition keywords, and
    ``$ref``.

    Args:
        schema: JSON schema dictionary describing the record format.
        config: Pipeline configuration (used for grouping and
            structured-generation settings).
        bos_token: Beginning-of-sequence token (used when grouping).
        eos_token: End-of-sequence token (used when grouping).
        whitespace_pattern: Optional regex fragment for matching
            whitespace between JSON tokens.

    Returns:
        Compiled regex string suitable for vLLM's structured-output
        backend.
    """
    whitespace_pattern = whitespace_pattern or ""

    record_regex = _build_regex(schema, whitespace_pattern)

    if config.data.group_training_examples_by is not None:
        sequence_regex = rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex

    if config.generation.structured_generation.use_single_sequence and config.data.max_sequences_per_example == 1:
        regex = sequence_regex
    else:
        regex = rf"({sequence_regex}\n)+"

    return regex
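
The wrapping steps in the function above can be tried in isolation. The sketch below is not the library's code path: `RECORD_REGEX` is a hand-written stand-in for the output of the private `_build_regex` helper, and `wrap_record_regex`, `make_config`, the `<s>`/`</s>` tokens, and the `"user_id"` grouping column are hypothetical names used only for illustration. The stand-in config exposes just the three attributes the function reads.

```python
import re
from types import SimpleNamespace

# Hand-written stand-in for a compiled record regex; it matches records such
# as {"id":7}. The regex produced by the real compiler will differ.
RECORD_REGEX = r'\{"id":[0-9]+\}'


def wrap_record_regex(record_regex: str, config, bos_token: str, eos_token: str) -> str:
    """Mirror the wrapping steps of build_json_based_regex for a given record regex."""
    if config.data.group_training_examples_by is not None:
        # Grouped: BOS, one or more newline-terminated records, then EOS.
        sequence_regex = rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex
    single = config.generation.structured_generation.use_single_sequence
    if single and config.data.max_sequences_per_example == 1:
        return sequence_regex
    # Otherwise allow one or more newline-terminated sequences.
    return rf"({sequence_regex}\n)+"


def make_config(group_by, use_single_sequence, max_sequences):
    """Build a minimal stand-in exposing only the attributes read above."""
    return SimpleNamespace(
        data=SimpleNamespace(
            group_training_examples_by=group_by,
            max_sequences_per_example=max_sequences,
        ),
        generation=SimpleNamespace(
            structured_generation=SimpleNamespace(use_single_sequence=use_single_sequence)
        ),
    )


ungrouped = wrap_record_regex(RECORD_REGEX, make_config(None, False, 1), "<s>", "</s>")
grouped = wrap_record_regex(RECORD_REGEX, make_config("user_id", True, 1), "<s>", "</s>")

print(bool(re.fullmatch(ungrouped, '{"id":1}\n{"id":2}\n')))
print(bool(re.fullmatch(grouped, '<s>{"id":1}\n{"id":2}\n</s>')))
```

Both checks print `True`. Note that in the multi-sequence form every sequence, including the last one, must end with a newline, so `'{"id":1}'` without a trailing newline does not match `ungrouped`.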

`build_json_structural_tag(schema, config, bos_token, eos_token)`

Build an XGrammar Structural Tag for schema-constrained JSONL records.

vLLM's raw `json` constraint describes a single JSON value. A Structural Tag lets NeMo Safe Synthesizer (NSS) describe the larger generation shape directly: one or more schema-constrained JSON records separated by newlines, optionally wrapped in BOS/EOS group delimiters.
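
To make that target shape concrete, the sketch below parses text of this form using only the standard library. It does not use XGrammar and does not build a Structural Tag. `parse_jsonl_groups` and the `<s>`/`</s>` delimiters are hypothetical names chosen for illustration, and schema validation of each record is left out.

```python
import json
from typing import Any, Optional


def parse_jsonl_groups(
    text: str,
    bos_token: Optional[str] = None,
    eos_token: Optional[str] = None,
) -> list[list[Any]]:
    """Parse newline-separated JSON records, optionally wrapped in BOS/EOS delimiters.

    Returns one list of records per group. Without delimiters the whole text
    is treated as a single group. The delimiter search is naive: it assumes
    the EOS token never appears inside a JSON string value.
    """
    if bos_token is None or eos_token is None:
        chunks = [text]
    else:
        chunks = []
        rest = text
        while rest.strip():
            rest = rest.lstrip("\n")
            if not rest.startswith(bos_token):
                raise ValueError("expected BOS delimiter")
            end = rest.find(eos_token, len(bos_token))
            if end == -1:
                raise ValueError("missing EOS delimiter")
            chunks.append(rest[len(bos_token):end])
            rest = rest[end + len(eos_token):]

    groups: list[list[Any]] = []
    for chunk in chunks:
        # Each non-empty line must be one complete JSON value.
        records = [json.loads(line) for line in chunk.split("\n") if line]
        if not records:
            raise ValueError("each group needs at least one record")
        groups.append(records)
    return groups


grouped_text = '<s>{"id": 1}\n{"id": 2}\n</s>\n<s>{"id": 3}\n</s>'
print(parse_jsonl_groups(grouped_text, "<s>", "</s>"))
```

For the sample text this prints `[[{'id': 1}, {'id': 2}], [{'id': 3}]]`: two delimited groups, the first holding two records and the second holding one.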