regex_manager

`regex_manager`

JSON-schema-to-regex compiler for structured generation.

Converts a subset of JSON Schema into a regular expression that can be used by vLLM's structured-output backend to constrain model output to valid JSONL records.
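
To make the idea of compiling a schema into a regex concrete, here is a minimal sketch. It is not the module's implementation: `toy_schema_to_regex` is a hypothetical helper that handles only flat objects with `integer`, `boolean`, `string`, and `enum` properties, and its `ws` argument is an assumed analogue of `whitespace_pattern`.

```python
import json
import re


def toy_schema_to_regex(schema: dict, ws: str = "") -> str:
    """Hypothetical helper: turn a flat object schema into a regex string.

    Properties are emitted in schema order and all of them are required.
    """
    type_patterns = {
        "integer": r"-?(0|[1-9][0-9]*)",
        "boolean": r"(true|false)",
        # Strings without quotes, backslashes, or newlines (no escape support).
        "string": r'"[^"\\\n]*"',
    }
    parts = []
    for name, prop in schema["properties"].items():
        if "enum" in prop:
            # Each enum value is serialized as JSON and matched literally.
            options = "|".join(re.escape(json.dumps(v)) for v in prop["enum"])
            value = f"({options})"
        else:
            value = type_patterns[prop["type"]]
        key = re.escape(json.dumps(name))
        parts.append(f"{key}{ws}:{ws}{value}")
    return r"\{" + ws + (ws + "," + ws).join(parts) + ws + r"\}"


schema = {
    "type": "object",
    "properties": {
        "age": {"type": "integer"},
        "country": {"enum": ["US", "DE"]},
    },
}
pattern = toy_schema_to_regex(schema, ws=r"[ ]?")

accepted = re.fullmatch(pattern, '{"age": 42, "country": "DE"}') is not None
rejected_enum = re.fullmatch(pattern, '{"age": 42, "country": "FR"}') is not None
rejected_type = re.fullmatch(pattern, '{"age": "x", "country": "DE"}') is not None
```

Because the pattern fixes key order and value shapes, any string that fully matches it is a record of the expected form, which is the property a structured-output backend relies on when it masks tokens during decoding.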

Functions:

Name	Description
`build_json_based_regex`	Build a regex that constrains LLM output to valid JSONL records.

`build_json_based_regex(schema, config, bos_token, eos_token, whitespace_pattern=None)`

Build a regex that constrains LLM output to valid JSONL records.

Parameters:

Name	Type	Description	Default
`schema`	`dict`	JSON schema dictionary describing the record format.	required
`config`	`SafeSynthesizerParameters`	Pipeline configuration (used for grouping and structured-generation settings).	required
`bos_token`	`str`	Beginning-of-sequence token (used when grouping).	required
`eos_token`	`str`	End-of-sequence token (used when grouping).	required
`whitespace_pattern`	`str \| None`	Optional regex fragment for matching whitespace between JSON tokens.	`None`

Returns:

Type	Description
`str`	Compiled regex string suitable for vLLM's structured-output backend.

Source code in src/nemo_safe_synthesizer/generation/regex_manager.py

def build_json_based_regex(
    schema: dict,
    config: SafeSynthesizerParameters,
    bos_token: str,
    eos_token: str,
    whitespace_pattern: str | None = None,
):
    """Build a regex that constrains LLM output to valid JSONL records.

    Args:
        schema: JSON schema dictionary describing the record format.
        config: Pipeline configuration (used for grouping and
            structured-generation settings).
        bos_token: Beginning-of-sequence token (used when grouping).
        eos_token: End-of-sequence token (used when grouping).
        whitespace_pattern: Optional regex fragment for matching
            whitespace between JSON tokens.

    Returns:
        Compiled regex string suitable for vLLM's structured-output
        backend.
    """
    whitespace_pattern = whitespace_pattern or ""

    record_regex = _build_regex(schema, whitespace_pattern)

    if config.data.group_training_examples_by is not None:
        sequence_regex = rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex

    if config.generation.structured_generation_use_single_sequence and config.data.max_sequences_per_example == 1:
        regex = sequence_regex
    else:
        regex = rf"({sequence_regex}\n)+"

    return regex
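
The wrapping logic in the listing can be exercised on its own. The sketch below mirrors the two branches with a hypothetical helper, `compose_regex`, that takes plain booleans instead of a `SafeSynthesizerParameters` object. The record pattern is hand-written rather than produced by `_build_regex`, and `<s>` / `</s>` are placeholder tokens chosen for illustration.

```python
import re


def compose_regex(
    record_regex: str,
    bos_token: str,
    eos_token: str,
    grouped: bool,
    single_sequence: bool,
) -> str:
    """Hypothetical stand-in for the wrapping steps of build_json_based_regex."""
    if grouped:
        # A grouped sequence is BOS, one or more newline-terminated records, EOS.
        sequence_regex = (
            rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
        )
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex

    if single_sequence:
        return sequence_regex
    # Otherwise allow one or more newline-terminated sequences.
    return rf"({sequence_regex}\n)+"


# Hand-written record pattern, used only for this illustration.
RECORD = r'\{"id": [0-9]+\}'

ungrouped = compose_regex(RECORD, "<s>", "</s>", grouped=False, single_sequence=False)
grouped = compose_regex(RECORD, "<s>", "</s>", grouped=True, single_sequence=True)

ungrouped_ok = re.fullmatch(ungrouped, '{"id": 1}\n{"id": 22}\n') is not None
grouped_ok = re.fullmatch(grouped, '<s>{"id": 1}\n{"id": 22}\n</s>') is not None
missing_newline_ok = re.fullmatch(ungrouped, '{"id": 1}') is not None
```

Note that in the repeated forms every record (or sequence) must end with a newline, so a lone record without a trailing `\n` does not match. In the real function, `single_sequence` corresponds to both `structured_generation_use_single_sequence` being set and `max_sequences_per_example == 1`.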
