regex_manager

JSON-schema-to-regex compiler for structured generation.

Converts a subset of JSON Schema into a regular expression that can be used by vLLM's structured-output backend to constrain model output to valid JSONL records.

Functions:

| Name | Description |
| --- | --- |
| build_json_based_regex | Build a regex that constrains LLM output to valid JSONL records. |

build_json_based_regex(schema, config, bos_token, eos_token, whitespace_pattern=None)

Build a regex that constrains LLM output to valid JSONL records.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schema | dict | JSON schema dictionary describing the record format. | required |
| config | SafeSynthesizerParameters | Pipeline configuration (used for grouping and structured-generation settings). | required |
| bos_token | str | Beginning-of-sequence token (used when grouping). | required |
| eos_token | str | End-of-sequence token (used when grouping). | required |
| whitespace_pattern | str \| None | Optional regex fragment for matching whitespace between JSON tokens. | None |

Returns:

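| Type | Description |
| --- | --- |
| str | Regex string compiled from the schema, suitable for vLLM's structured-output backend. It is returned as a plain pattern string, not as a compiled `re.Pattern` object. |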

Source code in src/nemo_safe_synthesizer/generation/regex_manager.py
def build_json_based_regex(
    schema: dict,
    config: SafeSynthesizerParameters,
    bos_token: str,
    eos_token: str,
    whitespace_pattern: str | None = None,
):
    """Build a regex that constrains LLM output to valid JSONL records.

    Args:
        schema: JSON schema dictionary describing the record format.
        config: Pipeline configuration (used for grouping and
            structured-generation settings).
        bos_token: Beginning-of-sequence token (used when grouping).
        eos_token: End-of-sequence token (used when grouping).
        whitespace_pattern: Optional regex fragment for matching
            whitespace between JSON tokens.

    Returns:
        Compiled regex string suitable for vLLM's structured-output
        backend.
    """
    whitespace_pattern = whitespace_pattern or ""

    record_regex = _build_regex(schema, whitespace_pattern)

    if config.data.group_training_examples_by is not None:
        sequence_regex = rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex

    if config.generation.structured_generation_use_single_sequence and config.data.max_sequences_per_example == 1:
        regex = sequence_regex
    else:
        regex = rf"({sequence_regex}\n)+"

    return regex
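
The wrapping logic above yields four distinct output shapes, depending on whether grouping is enabled and whether single-sequence mode applies. The following is a minimal sketch of that composition, not the library's own code. `RECORD_REGEX` is a hypothetical stand-in for the result of `_build_regex`, `compose` is an illustrative helper, the two boolean flags replace the `config` attribute checks, and the `<s>` / `</s>` tokens are assumed values.

```python
import re

# Hypothetical stand-in for the output of _build_regex: matches one tiny
# record of the form {"id": <digits>}. A real schema-derived regex is larger.
RECORD_REGEX = r'\{"id": [0-9]+\}'


def compose(record_regex, grouped, single_sequence, bos="<s>", eos="</s>"):
    """Mirror the wrapping steps of build_json_based_regex using plain flags.

    grouped         ~ config.data.group_training_examples_by is not None
    single_sequence ~ structured_generation_use_single_sequence
                      and max_sequences_per_example == 1
    """
    if grouped:
        # One or more newline-terminated records between the BOS and EOS tokens.
        sequence = rf"{re.escape(bos)}({record_regex}\n)+{re.escape(eos)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence = record_regex
    # Outside single-sequence mode, allow one or more newline-terminated sequences.
    return sequence if single_sequence else rf"({sequence}\n)+"


single_record = compose(RECORD_REGEX, grouped=False, single_sequence=True)
jsonl = compose(RECORD_REGEX, grouped=False, single_sequence=False)
one_group = compose(RECORD_REGEX, grouped=True, single_sequence=True)
many_groups = compose(RECORD_REGEX, grouped=True, single_sequence=False)

# Each pattern must match the whole output, so fullmatch is the right check.
assert re.fullmatch(single_record, '{"id": 1}')
assert re.fullmatch(jsonl, '{"id": 1}\n{"id": 22}\n')
assert re.fullmatch(one_group, '<s>{"id": 1}\n{"id": 2}\n</s>')
assert re.fullmatch(many_groups, '<s>{"id": 1}\n</s>\n<s>{"id": 2}\n</s>\n')
```

Note that every repeated unit ends with `\n`, so outside single-sequence mode a valid output always carries a trailing newline. Code that validates generated text against the returned regex should therefore avoid stripping that newline first.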