regex_manager

JSON-schema-to-regex compiler for structured generation.

Converts a subset of JSON Schema into a regular expression that can be used by vLLM's structured-output backend to constrain model output to valid JSONL records.

Functions:

| Name | Description |
| --- | --- |
| build_json_based_regex | Build a regex that constrains LLM output to valid JSONL records. |
| build_json_structural_tag | Build an XGrammar Structural Tag for schema-constrained JSONL records. |

build_json_based_regex(schema, config, bos_token, eos_token, whitespace_pattern=None)

Build a regex that constrains LLM output to valid JSONL records.

Supports properties, required, enum, primitive type values, arrays/objects with min/max item or property counts, string length bounds, pattern, and format values for date-time, date, time, and UUID. Use vLLM's native JSON schema structured-output path for unsupported schema features such as additionalProperties, composition keywords, and $ref.
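As a rough illustration of that split, the sketch below checks a schema for the unsupported keywords before choosing a path. The helper `find_unsupported_keywords` and both example schemas are hypothetical and not part of `regex_manager`; "composition keywords" is read here as `allOf`, `anyOf`, `oneOf`, and `not`, which is an assumption.

```python
from typing import Any

# Keywords the description above lists as unsupported by the regex compiler.
# "Composition keywords" is read here as allOf/anyOf/oneOf/not (an assumption).
UNSUPPORTED_KEYWORDS = {"additionalProperties", "allOf", "anyOf", "oneOf", "not", "$ref"}


def find_unsupported_keywords(schema: dict[str, Any]) -> set[str]:
    """Hypothetical helper: collect unsupported keywords used anywhere in a schema."""
    found = {key for key in schema if key in UNSUPPORTED_KEYWORDS}
    # Property names are user data, so recurse into the subschemas only.
    for subschema in schema.get("properties", {}).values():
        if isinstance(subschema, dict):
            found |= find_unsupported_keywords(subschema)
    items = schema.get("items")
    if isinstance(items, dict):
        found |= find_unsupported_keywords(items)
    return found


# Uses only features named as supported: format, enum, item counts, length bounds.
supported_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "format": "uuid"},
        "status": {"enum": ["active", "inactive"]},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1, "maxItems": 3},
        "name": {"type": "string", "minLength": 1, "maxLength": 40},
    },
    "required": ["id", "status"],
}

# Uses $ref and additionalProperties, so the native JSON schema path is the better fit.
unsupported_schema = {
    "type": "object",
    "properties": {"owner": {"$ref": "#/$defs/person"}},
    "additionalProperties": False,
}

print(find_unsupported_keywords(supported_schema))            # set()
print(sorted(find_unsupported_keywords(unsupported_schema)))  # ['$ref', 'additionalProperties']
```

A non-empty result suggests routing the schema to vLLM's native JSON schema path instead of the regex compiler.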

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schema | dict[str, Any] | JSON schema dictionary describing the record format. | required |
| config | SafeSynthesizerParameters | Pipeline configuration (used for grouping and structured-generation settings). | required |
| bos_token | str | Beginning-of-sequence token (used when grouping). | required |
| eos_token | str | End-of-sequence token (used when grouping). | required |
| whitespace_pattern | str \| None | Optional regex fragment for matching whitespace between JSON tokens. | None |

Returns:

| Type | Description |
| --- | --- |
| str | Compiled regex string suitable for vLLM's structured-output backend. |

Source code in src/nemo_safe_synthesizer/generation/regex_manager.py
def build_json_based_regex(
    schema: dict[str, Any],
    config: SafeSynthesizerParameters,
    bos_token: str,
    eos_token: str,
    whitespace_pattern: str | None = None,
) -> str:
    """Build a regex that constrains LLM output to valid JSONL records.

    Supports ``properties``, ``required``, ``enum``, primitive ``type`` values,
    arrays/objects with min/max item or property counts, string length bounds,
    ``pattern``, and ``format`` values for date-time, date, time, and UUID.
    Use vLLM's native JSON schema structured-output path for unsupported schema
    features such as ``additionalProperties``, composition keywords, and
    ``$ref``.

    Args:
        schema: JSON schema dictionary describing the record format.
        config: Pipeline configuration (used for grouping and
            structured-generation settings).
        bos_token: Beginning-of-sequence token (used when grouping).
        eos_token: End-of-sequence token (used when grouping).
        whitespace_pattern: Optional regex fragment for matching
            whitespace between JSON tokens.

    Returns:
        Compiled regex string suitable for vLLM's structured-output
        backend.
    """
    whitespace_pattern = whitespace_pattern or ""

    record_regex = _build_regex(schema, whitespace_pattern)

    if config.data.group_training_examples_by is not None:
        sequence_regex = rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex

    if config.generation.structured_generation.use_single_sequence and config.data.max_sequences_per_example == 1:
        regex = sequence_regex
    else:
        regex = rf"({sequence_regex}\n)+"

    return regex
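
The wrapping steps in the function above can be tried in isolation. The following is a minimal sketch, not the library function: `wrap_record_regex` is a hypothetical stand-in that folds the two config checks (`use_single_sequence` and `max_sequences_per_example == 1`) into one `single_sequence` flag, the record regex is a hand-written placeholder for whatever `_build_regex` returns, and the BOS/EOS strings are invented.

```python
import re


def wrap_record_regex(record_regex, bos_token, eos_token, grouped, single_sequence):
    """Illustrative mirror of the wrapping steps; not the library function."""
    if grouped:
        # A group is BOS, one or more newline-terminated records, then EOS.
        sequence_regex = rf"{re.escape(bos_token)}({record_regex}\n)+{re.escape(eos_token)}"
    else:
        # Without grouping, the "sequence" is a single record.
        sequence_regex = record_regex
    # Unless exactly one sequence is requested, allow one or more
    # newline-terminated sequences.
    return sequence_regex if single_sequence else rf"({sequence_regex}\n)+"


record_regex = r'\{"id":[0-9]+\}'  # stand-in for the output of _build_regex
bos, eos = "<s>", "</s>"           # placeholder tokens

ungrouped = wrap_record_regex(record_regex, bos, eos, grouped=False, single_sequence=False)
grouped = wrap_record_regex(record_regex, bos, eos, grouped=True, single_sequence=True)

print(bool(re.fullmatch(ungrouped, '{"id":1}\n{"id":2}\n')))       # True
print(bool(re.fullmatch(grouped, '<s>{"id":1}\n{"id":2}\n</s>')))  # True
print(bool(re.fullmatch(grouped, '{"id":1}\n')))                   # False
```

Note that `re.escape` is applied only to the tokens, so special characters in BOS/EOS are matched literally while the record regex keeps its meaning.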

build_json_structural_tag(schema, config, bos_token, eos_token)

Build an XGrammar Structural Tag for schema-constrained JSONL records.

The raw vLLM json constraint describes a single JSON value. Structural Tag lets NSS describe the larger generation shape directly: one or more schema-constrained JSON records separated by newlines, optionally wrapped in BOS/EOS group delimiters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schema | dict[str, Any] | JSON schema dictionary describing one record. | required |
| config | SafeSynthesizerParameters | Pipeline configuration (used for grouping and structured-generation settings). | required |
| bos_token | str | Beginning-of-sequence token (used when grouping). | required |
| eos_token | str | End-of-sequence token (used when grouping). | required |

Returns:

| Type | Description |
| --- | --- |
| str | JSON string suitable for StructuredOutputsParams(structural_tag=...). |

Source code in src/nemo_safe_synthesizer/generation/regex_manager.py
def build_json_structural_tag(
    schema: dict[str, Any],
    config: SafeSynthesizerParameters,
    bos_token: str,
    eos_token: str,
) -> str:
    """Build an XGrammar Structural Tag for schema-constrained JSONL records.

    The raw vLLM ``json`` constraint describes a single JSON value. Structural
    Tag lets NSS describe the larger generation shape directly: one or more
    schema-constrained JSON records separated by newlines, optionally wrapped
    in BOS/EOS group delimiters.

    Args:
        schema: JSON schema dictionary describing one record.
        config: Pipeline configuration (used for grouping and
            structured-generation settings).
        bos_token: Beginning-of-sequence token (used when grouping).
        eos_token: End-of-sequence token (used when grouping).

    Returns:
        JSON string suitable for ``StructuredOutputsParams(structural_tag=...)``.
    """
    record_format: dict[str, Any] = {
        "type": "json_schema",
        "json_schema": schema,
    }
    record_line_format = _sequence_format([record_format, _const_string_format("\n")])

    if config.data.group_training_examples_by is not None:
        sequence_format = _sequence_format(
            [
                _const_string_format(bos_token),
                _plus_format(record_line_format),
                _const_string_format(eos_token),
            ]
        )
    else:
        sequence_format = record_format

    if config.generation.structured_generation.use_single_sequence and config.data.max_sequences_per_example == 1:
        output_format = sequence_format
    elif config.data.group_training_examples_by is not None:
        output_format = _plus_format(_sequence_format([sequence_format, _const_string_format("\n")]))
    else:
        output_format = _plus_format(record_line_format)

    return json.dumps(
        {
            "type": "structural_tag",
            "format": output_format,
        },
        ensure_ascii=True,
    )
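
For the simplest branch of the function above (no grouping, and a single sequence requested), the record format is used directly as the top-level format, so the payload shape can be reproduced with the standard library alone. The sketch below covers that branch only; the example schema is invented, and the other branches depend on private helpers whose output is not shown here.

```python
import json

# Invented example schema with a non-ASCII enum value.
schema = {
    "type": "object",
    "properties": {"city": {"enum": ["Zürich", "Oslo"]}},
    "required": ["city"],
}

# Payload shape when there is no grouping and a single sequence is requested:
# the json_schema record format becomes the top-level "format".
structural_tag = json.dumps(
    {
        "type": "structural_tag",
        "format": {"type": "json_schema", "json_schema": schema},
    },
    ensure_ascii=True,
)

print(structural_tag.isascii())  # True: "ü" is emitted as the escape \u00fc

decoded = json.loads(structural_tag)
print(decoded["type"])                             # structural_tag
print(decoded["format"]["json_schema"] == schema)  # True: the escaping is lossless
```

Because `ensure_ascii=True` only changes how characters are written, decoding the string recovers the original schema unchanged.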