models

Classes:

Name Description
Visibility

Visibility identifiers indicating which level of access is open for each model.

ObjectRef

Maps individual model files to keys in memory

ModelManifest

Manages remote model files and versions. These configurations are model specific.

StorageConfig

Defines where and how ModelManifests are stored. These configurations are likely environment specific.

CacheManager

Handles downloading model files from the "opt" package repo.

Functions:

Name Description
get_cache_manager

Returns a singleton instance of CacheManager.

Attributes:

Name Type Description
DEFAULT_BUCKET

Default bucket, read from the NSS_OPT_BUCKET environment variable. If the variable is not set, the dev bucket nss-opt-dev-use2 is used.

DEFAULT_CACHE_DIR

StorageConfig default cache directory, read from the NSS_OPT_CACHE_DIR environment variable.

DEFAULT_BUCKET = os.getenv('NSS_OPT_BUCKET', 'nss-opt-dev-use2') module-attribute

Default bucket, read from the NSS_OPT_BUCKET environment variable. If the variable is not set, the dev bucket nss-opt-dev-use2 is used.

DEFAULT_CACHE_DIR = os.getenv('NSS_OPT_CACHE_DIR', '.optcache') module-attribute

StorageConfig default cache directory. Reads the NSS_OPT_CACHE_DIR environment variable for a cache directory and falls back to .optcache if the variable is not set. If the value is set to disabled, files won't be cached to disk.

Visibility

Bases: Enum

Visibility identifiers indicating which level of access is open for each model.

Attributes:

Name Type Description
PUBLIC

Packages that are open to the public with a public-read ACL

PRIVATE

Packages available to customers behind a paywall, accessed with an API key

INTERNAL

Only available from internal infrastructure.

PUBLIC = 'pub' class-attribute instance-attribute

Packages that are open to the public with a public-read ACL

PRIVATE = 'priv' class-attribute instance-attribute

Packages available to customers behind a paywall, accessed with an API key

INTERNAL = 'int' class-attribute instance-attribute

Only available from internal infrastructure.
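The members above can be looked up by their string values, which is handy when a visibility arrives as text (for example, as a path segment). The sketch below redefines a stand-in enum with the documented members so that it runs on its own; the value "secret" is a made-up invalid input.

```python
from enum import Enum


class Visibility(Enum):
    """Stand-in that mirrors the documented members and values."""

    PUBLIC = "pub"
    PRIVATE = "priv"
    INTERNAL = "int"


# Look a member up by its string value.
vis = Visibility("pub")
print(vis)        # Visibility.PUBLIC
print(vis.value)  # pub

# Unknown values raise ValueError, so a bad identifier fails loudly.
try:
    Visibility("secret")  # hypothetical invalid value
except ValueError as exc:
    print(f"rejected: {exc}")
```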

ObjectRef(key, file_name) dataclass

Maps individual model files to keys in memory

Attributes:

Name Type Description
key str

Lookup key to access model data

file_name str

Remote file name. Used to download model data from storage

key instance-attribute

Lookup key to access model data

file_name instance-attribute

Remote file name. Used to download model data from storage

ModelManifest(model, version, sources, visibility) dataclass

Manages remote model files and versions. These configurations are model specific.

By default, models are stored in the "opt" bucket under the path convention

/[vis]/models/[pkg]/[version]/[...files]
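A minimal sketch of how the convention above could map onto manifest fields. It assumes that [vis] is the Visibility value and [pkg] is the model identifier; remote_path is a hypothetical helper, not part of this module, and the version and file name are illustrative.

```python
from enum import Enum
from pathlib import PurePosixPath


class Visibility(Enum):
    """Stand-in that mirrors the documented members and values."""

    PUBLIC = "pub"
    PRIVATE = "priv"
    INTERNAL = "int"


def remote_path(vis: Visibility, model: str, version: str, file_name: str) -> str:
    """Hypothetical helper: build /[vis]/models/[pkg]/[version]/[file]."""
    return str(PurePosixPath("/", vis.value, "models", model, version, file_name))


# Illustrative values only; real versions and file names will differ.
print(remote_path(Visibility.PUBLIC, "spacy", "1.0.0", "model.pkl"))
# /pub/models/spacy/1.0.0/model.pkl
```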

Attributes:

Name Type Description
model str

Model identifier, e.g. spacy, fasttext, entityruler

version str

Model version

sources list[ObjectRef]

A list of pickled objects to load into memory.

visibility Visibility

Package visibility

key str

A unique key that can be used to store or fetch model sources.

model instance-attribute

Model identifier, e.g. spacy, fasttext, entityruler

version instance-attribute

Model version

sources instance-attribute

A list of pickled objects to load into memory.

visibility instance-attribute

Package visibility

key property

A unique key that can be used to store or fetch model sources.

StorageConfig(bucket=DEFAULT_BUCKET, cache_dir=None) dataclass

Defines where and how ModelManifests are stored. These configurations are likely environment specific.

Methods:

Name Description
from_system

Return a default StorageConfig based on a system's environment variables.

Attributes:

Name Type Description
bucket Optional[str]

Remote "opt" bucket model files are stored under

cache_dir Optional[Path]

Local file system directory. If this value is None, ModelManifest files won't be cached through to disk.

bucket = DEFAULT_BUCKET class-attribute instance-attribute

Remote "opt" bucket model files are stored under

cache_dir = None class-attribute instance-attribute

Local file system directory. If this value is None, ModelManifest files won't be cached through to disk.

from_system() classmethod

Return a default StorageConfig based on a system's environment variables.

By convention, it looks for the environment variable NSS_OPT_BUCKET as the bucket location and NSS_OPT_CACHE_DIR as the local cache directory (the value disabled turns disk caching off). The default settings from this function are appropriate for development without additional configuration.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
@classmethod
def from_system(cls) -> StorageConfig:
    """Return a default ``StorageConfig`` based on a system's environment variables.

    By convention, it looks for the environment variable ``NSS_OPT_BUCKET`` as
    the bucket location. The default settings from this function are appropriate
    for development without additional configuration.
    """
    if DEFAULT_CACHE_DIR == "disabled":
        cache_dir = None
    else:
        cache_dir = Path(DEFAULT_CACHE_DIR)

    return cls(bucket=DEFAULT_BUCKET, cache_dir=cache_dir)
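The defaulting logic can be sketched without importing the module. The helper below, settings_from_env, is hypothetical: it takes an explicit mapping instead of reading os.environ, and it mirrors the documented defaults and the "disabled" sentinel.

```python
from pathlib import Path
from typing import Mapping, Optional, Tuple


def settings_from_env(env: Mapping[str, str]) -> Tuple[str, Optional[Path]]:
    """Hypothetical helper mirroring the documented defaults."""
    bucket = env.get("NSS_OPT_BUCKET", "nss-opt-dev-use2")
    raw_cache_dir = env.get("NSS_OPT_CACHE_DIR", ".optcache")
    # The literal string "disabled" turns the on-disk cache off.
    cache_dir = None if raw_cache_dir == "disabled" else Path(raw_cache_dir)
    return bucket, cache_dir


# No variables set: dev bucket and the .optcache directory.
print(settings_from_env({}))

# Disk caching switched off.
print(settings_from_env({"NSS_OPT_CACHE_DIR": "disabled"}))
```

Because DEFAULT_BUCKET and DEFAULT_CACHE_DIR are module attributes, they are evaluated when the module is imported, so the environment variables need to be set before that import for from_system() to pick them up.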

CacheManager(storage_config=None)

Handles downloading model files from the "opt" package repo.

This class will also optionally cache these files to disk. This is useful for environments with local persistent state such as a local development laptop.

Parameters:

Name Type Description Default
storage_config StorageConfig

A storage config.

None

Methods:

Name Description
get_instance

Returns a singleton instance of CacheManager.

register_manifest

Registers a manifest in the cache manager. This will not download the file

set_storage_config

Apply a new StorageConfig to the CacheManager.

download_and_cache_manifest_data

Load each registered manifest into memory

resolve

Given a manifest, return its resolved data. If the manifest hasn't already been registered with the manager, it will be registered automatically.

obj_from_fs

Return the source object from the filesystem if it exists. If no file is found, return None.

Attributes:

Name Type Description
timings dict[str, float]

Load time in seconds for each model manifest, keyed by ModelManifest.model

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def __init__(self, storage_config: Optional[StorageConfig] = None):
    if CacheManager.__instance:
        raise Exception("Cannot instantiate a singleton.")
    else:
        CacheManager.__instance = self

    self.storage_config = storage_config or StorageConfig.from_system()
    # Log the config actually in use, not the (possibly None) argument.
    logger.info(f"Creating a new instance of CacheManager for {self.storage_config}")

    self._cache = {}
    self._manifests = {}
    self.timings = {}

timings = {} instance-attribute

Load time in seconds for each model manifest, keyed by ModelManifest.model

get_instance(storage_config=None) classmethod

Returns a singleton instance of CacheManager.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
@classmethod
def get_instance(cls, storage_config: Optional[StorageConfig] = None) -> CacheManager:
    """Returns a singleton instance of ``CacheManager``."""
    if not CacheManager.__instance:
        CacheManager(storage_config)
    return CacheManager.__instance
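The singleton behavior can be illustrated with a small stand-in class. Manager below is hypothetical and only copies the pattern from the listing above: the first call creates the instance, later calls return it, and any argument passed after creation is ignored.

```python
class Manager:
    """Minimal stand-in for the singleton pattern shown above."""

    __instance = None

    def __init__(self, config=None):
        if Manager.__instance:
            raise Exception("Cannot instantiate a singleton.")
        Manager.__instance = self
        self.config = config

    @classmethod
    def get_instance(cls, config=None):
        if not Manager.__instance:
            Manager(config)
        return Manager.__instance


first = Manager.get_instance("config-a")
second = Manager.get_instance("config-b")  # argument ignored: instance exists

print(first is second)  # True
print(second.config)    # config-a

# Direct construction after the instance exists raises.
try:
    Manager("config-c")
except Exception as exc:
    print(f"error: {exc}")
```

Since a storage_config passed to a later get_instance call has no effect, set_storage_config is the documented way to change the configuration once the instance exists.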

register_manifest(manifest)

Registers a manifest in the cache manager. This will not download the file

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def register_manifest(self, manifest: ModelManifest):
    """Registers a manifest in the cache manager. This will not download the file"""
    if manifest.key not in self._manifests:
        logger.info(f"Registering ModelManifest in cache: {manifest}")
        self._manifests[manifest.key] = manifest

set_storage_config(storage_config)

Apply a new StorageConfig to the CacheManager.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def set_storage_config(self, storage_config: StorageConfig):
    """Apply a new ``StorageConfig`` to the ``CacheManager``."""
    self.storage_config = storage_config

download_and_cache_manifest_data()

Load each registered manifest into memory

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def download_and_cache_manifest_data(self):
    """Load each registered manifest into memory"""
    for manifest in self._manifests.values():
        self.resolve(manifest)

resolve(manifest, evict=False, skip_pickle=False)

Given a manifest, return its resolved data. If the manifest hasn't already been registered with the manager, it will be registered automatically.

The load order is as follows:

1. From in-memory cache
2. From FS cache if enabled via a StorageConfig

Parameters:

Name Type Description Default
manifest ModelManifest

The manifest file to resolve and return

required
evict bool

If True will return the object, but won't store it in the cache. If the manifest is already in the cache, it will be removed.

False
skip_pickle bool

If True, return each source file's raw bytes instead of unpickling it.

False
Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def resolve(self, manifest: ModelManifest, evict: bool = False, skip_pickle: bool = False) -> Optional[dict]:
    """Given a manifest, return its resolved data. If the manifest hasn't
    already been registered with the manager, it will be registered automatically.

    The load order is as follows:

        1. From in-memory cache
        2. From FS cache if enabled via a ``StorageConfig``

    Args:
        manifest: The manifest file to resolve and return
        evict: If ``True`` will return the object, but won't store it in the cache.
            If the manifest is already in the cache, it will be removed.
        skip_pickle: If ``True``, return each source file's raw bytes instead of
            unpickling it.
    """
    if manifest.key in self._cache:
        return self._cache.pop(manifest.key) if evict else self._cache[manifest.key]

    self.register_manifest(manifest)

    objs = {}
    for obj_ref in manifest.sources:
        src_obj = None
        start_time = time.perf_counter()
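        # Try each loader step in order; a bound method is not iterable, so wrap it in a tuple.
        for step in (self.obj_from_fs,):
            src_obj = step(manifest, obj_ref, skip_pickle=skip_pickle)
            if src_obj is not None:
                break
        if src_obj is None:
            raise RuntimeError(f"Could not resolve manifest {manifest}. Failed to load {obj_ref}")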
        else:
            elapsed_time_seconds = time.perf_counter() - start_time
            objs[obj_ref.key] = src_obj
            self.timings[manifest.model] = elapsed_time_seconds

    if not evict:
        self._cache[manifest.key] = objs
    return objs
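The interaction between the in-memory cache and evict can be sketched with a reduced stand-in. TinyCache and its loader callable are hypothetical; only the cache-hit, cache-miss, and evict branches follow the listing above.

```python
class TinyCache:
    """Reduced stand-in for the in-memory part of resolve()."""

    def __init__(self, loader):
        self._cache = {}
        self._loader = loader  # hypothetical callable: key -> object
        self.loads = 0         # counts how often the loader ran

    def resolve(self, key, evict=False):
        if key in self._cache:
            # Hit: return the object, removing it when evict is set.
            return self._cache.pop(key) if evict else self._cache[key]
        self.loads += 1
        obj = self._loader(key)
        if not evict:
            self._cache[key] = obj
        return obj


cache = TinyCache(loader=lambda key: {"name": key})
cache.resolve("spacy")              # miss: loaded and stored
cache.resolve("spacy")              # hit: served from memory
cache.resolve("spacy", evict=True)  # hit: returned and removed
cache.resolve("spacy", evict=True)  # miss: loaded, not stored

print(cache.loads)              # 2
print("spacy" in cache._cache)  # False
```

Passing evict=True suits one-off loads of large models, since nothing stays in memory after the call returns.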

obj_from_fs(manifest, obj_ref, skip_pickle=False)

Return the source object from the filesystem if it exists. If no file is found return None.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def obj_from_fs(self, manifest: ModelManifest, obj_ref: ObjectRef, skip_pickle: bool = False) -> Optional[Any]:
    """Return the source object from the filesystem if it exists. If no file is found
    return ``None``.
    """
    logger.debug(f"Checking local FS for manifest {manifest.model} for {obj_ref.key} at {obj_ref.file_name}...")
    if not self.storage_config.cache_dir:
        logger.debug("Manifest data not found on local FS!")
        return None

    file_path = self.storage_config.cache_dir / manifest.key / obj_ref.file_name
    if not file_path.is_file():
        return None

    with open(file_path, "rb") as cache:
        if skip_pickle:
            return cache.read()
        src_obj = pickle.load(cache)
        return src_obj
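The lookup above can be exercised against a temporary directory. In the sketch below, load_cached is a hypothetical free-function version of the method, and the key demo-key and file labels.pkl are made-up names; the directory layout (cache_dir / key / file_name) follows the listing.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def load_cached(
    cache_dir: Optional[Path], manifest_key: str, file_name: str, skip_pickle: bool = False
) -> Optional[Any]:
    """Hypothetical free-function version of the filesystem lookup."""
    if not cache_dir:
        return None  # disk cache disabled
    file_path = cache_dir / manifest_key / file_name
    if not file_path.is_file():
        return None  # nothing cached for this source
    with open(file_path, "rb") as fh:
        return fh.read() if skip_pickle else pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    target = cache_dir / "demo-key" / "labels.pkl"  # made-up key and file name
    target.parent.mkdir(parents=True)
    target.write_bytes(pickle.dumps(["PERSON", "EMAIL"]))

    loaded = load_cached(cache_dir, "demo-key", "labels.pkl")
    raw = load_cached(cache_dir, "demo-key", "labels.pkl", skip_pickle=True)
    missing = load_cached(cache_dir, "demo-key", "absent.pkl")
    disabled = load_cached(None, "demo-key", "labels.pkl")

print(loaded)    # ['PERSON', 'EMAIL']
print(missing)   # None
print(disabled)  # None
```

Unpickling can execute arbitrary code, so a cache directory should only hold files from a trusted source; skip_pickle=True returns the bytes without that step.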

get_cache_manager(storage_config=None)

Returns a singleton instance of CacheManager.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/models.py
def get_cache_manager(storage_config: Optional[StorageConfig] = None) -> CacheManager:
    """Returns a singleton instance of ``CacheManager``."""
    return CacheManager.get_instance(storage_config)