Tutorials

pydantic is a package providing model/configuration classes with parameter validation at instantiation time. The exca package builds on top of it and provides “infra” pydantic configurations that can be part of a parent pydantic configuration and change the way it behaves. In particular, an infra lets one add caching and remote computation to the parent model’s methods. Check out the package philosophy for a more in-depth explanation of the “whys” of this package.

If you are not familiar with pydantic, have a look first at the Pydantic models section.

Installation

pip install exca

Two types of infra: Task and Map

Infras currently come in 2 flavors.

TaskInfra

Consider you have one pydantic model/config that fully defines one computation to perform, for instance through a process method like the one below:

import numpy as np
import pydantic

class TutorialTask(pydantic.BaseModel):
    param: int = 12

    def process(self) -> float:
        return self.param * np.random.rand()

Adding an infra to this model only requires adding a TaskInfra object to the config and applying it to the process method:

import numpy as np
import pydantic
import exca


class TutorialTask(pydantic.BaseModel):
    param: int = 12
    infra: exca.TaskInfra = exca.TaskInfra(version="1")

    @infra.apply
    def process(self) -> float:
        return self.param * np.random.rand()

TaskInfra provides configuration for caching and computation; in particular, providing a folder activates caching through the filesystem:

task = TutorialTask(param=1, infra={"folder": tmp_path})
out = task.process()
# calling process again will load the cache and not a new random number
assert out == task.process()

Adding cluster="auto" to the infra triggers the computation either on a slurm cluster if available, or in a dedicated process otherwise, as sketched below. See the API reference for all the details.
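A minimal sketch (reusing the tmp_path folder from above):

task = TutorialTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})
out = task.process()  # runs on slurm if available, in a dedicated process otherwise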

Map infra

The TaskInfra above is limited to methods that take no additional arguments, i.e. computations that are fully defined by the configuration, such as an experiment or a training. Consider now that the configuration defines a computation to be applied to a list of items (eg: processing a list of images, texts, etc.); this is the use case for MapInfra:

import typing as tp
import pydantic
import numpy as np
from exca import MapInfra

class TutorialMap(pydantic.BaseModel):
    param: int = 12
    infra: MapInfra = MapInfra(version="1")

    @infra.apply(item_uid=str)  
    def process(self, items: tp.Iterable[int]) -> tp.Iterator[np.ndarray]:
        for item in items:
            yield np.random.rand(item, self.param)

As opposed to TaskInfra, the MapInfra.apply method requires an item_uid parameter which specifies how to map each item of the input iterable to a unique string used for identification/caching.

From then on, calling the decorated process method (e.g. on [1, 2, 3]) will trigger (possibly remote) computation and caching/storage. You can control the remote resources through the infra instance. Eg: the following triggers the computation in the current process (change "cluster": None to "auto" to have it run on a slurm cluster if available, or in a dedicated process otherwise):

mapper = TutorialMap(infra={"cluster": None, "folder": tmp_path, "cpus_per_task": 1})
mapper.process([1, 2, 3])

See the API reference for all the details.

Features of MapInfra and TaskInfra

This section provides an overview of the parameters and features of infras; the full API reference page provides more options and details if need be.

Common useful parameters include the following (a usage sketch follows this list):

  • folder: where to create the cache folder

  • mode: one of:

    • cached: cache is returned if available (error or not), otherwise computed (and cached). This is the default behavior.

    • force: cache is ignored, and result is (re)computed (and cached)

    • retry (only for TaskInfra): cache is returned if available except if it’s an error, otherwise (re)computed (and cached)

  • submitit/slurm parameters (eg: gpus_per_node, cpus_per_node, slurm_partition, slurm_constraint etc)
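As a sketch, these parameters are passed through the infra configuration like any other field (reusing TutorialTask and tmp_path from above; the exact resource values are placeholders):

task = TutorialTask(
    param=1,
    infra={
        "folder": tmp_path,   # where the cache folder is created
        "mode": "force",      # ignore any existing cache and recompute
        "cluster": "auto",    # slurm if available, dedicated process otherwise
        "cpus_per_task": 1,   # submitit/slurm resource parameter
    },
)
out = task.process()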

All infra objects have common features such as the following (see the sketch after this list):

  • config export: through task.infra.config(uid=False, exclude_defaults=True).

  • uid/xp folder: through task.infra.uid_folder(). The folder is always populated with the full config, and the reduced uid config. It also contains a symlink to the job folder.
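For instance (a short sketch, assuming task was instantiated with a folder as in the caching example above):

exported = task.infra.config(uid=False, exclude_defaults=True)  # exportable config
folder = task.infra.uid_folder()  # folder holding the full/uid configs and a job symlink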

When filesystem caching is used, the folder will contain useful information (these files can be inspected as sketched further below):

  • config.yaml: the full configuration (all parameters) of the pydantic model

  • full-uid.yaml: the config defining the task/map, including defaults (not including non-uid related configs such as number of workers etc)

  • uid.yaml: the minimal config defining the task/map (not including defaults, nor non-uid related configs such as number of workers etc).

It will also optionally contain:

  • code (if workdir is specified): a symlink to the directory where the task was executed

  • submitit (for TaskInfra if cluster is not None): a symlink to the folder containing all submitit related files for the task (stdout, stderr, batch file etc)
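These files can be inspected directly, as in the sketch below (assuming caching is activated and uid_folder() returns a pathlib.Path):

folder = task.infra.uid_folder()
print((folder / "uid.yaml").read_text())       # minimal config defining the task
print((folder / "full-uid.yaml").read_text())  # same config, including defaults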

TaskInfra also has additional features (see the sketch after this list), in particular:

  • job access: through task.infra.job(). Jobs submitted through submitit provide stdout(), stderr(), cancel(), and more. All jobs have methods result(), done() and wait(). Calling infra.job() submits the job if it does not already exist.

  • cache/job clearing: through task.infra.clear_job()
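A short sketch of these two features (assuming the task was configured with a folder and a cluster as above):

job = task.infra.job()  # submits the job if it does not already exist
out = job.result()      # all jobs provide result(), done() and wait()
task.infra.clear_job()  # clears the cache/job so the task will be recomputed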

Quick comparison

feature \ tool               | lru_cache | hydra | submitit | stool | exca
RAM cache                    |     ✔     |       |          |       |  ✔
file cache                   |           |       |          |       |  ✔
remote compute               |           |   ✔   |    ✔     |   ✔   |  ✔
pure python (vs commandline) |     ✔     |       |    ✔     |       |  ✔
hierarchical config          |           |   ✔   |          |       |  ✔

For quick experimentation with infra, the exca.helpers.with_infra function decorator can add an infra parameter to most functions (with simple argument types).

import numpy as np
import exca

@exca.helpers.with_infra(folder=tmp_path)
def my_func(a: int, b: int) -> np.ndarray:
    return np.random.rand(a, b)

out = my_func(a=3, b=4)
out2 = my_func(a=3, b=4)

np.testing.assert_array_equal(out2, out)  # should be the same (as it was cached)

In the long run this is not advised, as it prevents you from using many features of infra (running an array of jobs, checking their status, etc.).

Pydantic models

This is a quick recap of important features of pydantic models. Models do not need an __init__ method; parameters are instead specified directly in the class body (as if they were class attributes, although they will not be):

import pydantic

class MyModel(pydantic.BaseModel):
    x: int
    y: str = "blublu"

mymodel = MyModel(x=12)
assert mymodel.x == 12

One can then instantiate it easily with mymodel = MyModel(x=12) and access attributes like mymodel.x. One important feature is type checking at instantiation time: since x is typed as an int, the field will not accept a string, and the following code raises an exception: mymodel = MyModel(x="wrong").
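For instance, the failing call can be checked explicitly (a small sketch using pydantic's ValidationError):

import pydantic

try:
    MyModel(x="wrong")  # x must be an int
except pydantic.ValidationError as e:
    print(e)  # explains that x is not a valid integer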

Note: pydantic is very similar to the more standard dataclasses, with a few important differences: models are type checked (dataclasses are not), one can set mutable default values like [] without risk (with dataclasses this can be buggy or require a factory), and one can use discriminators for sub-configs (more on that here).
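For example, mutable defaults are handled safely (a small sketch with a hypothetical WithList model; each instance gets its own copy of the default):

class WithList(pydantic.BaseModel):
    tags: list[str] = []  # safe with pydantic: the default is copied per instance

a, b = WithList(), WithList()
a.tags.append("x")
assert b.tags == []  # b is unaffected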

For more safety, one should set extra="forbid" on models, as this also triggers an error if you instantiate an object with parameters that do not exist in the model:

import pydantic

class MyModel(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # safer
    x: int
    y: str = "blublu"

# MyModel(x=12, wrong_parameter=12)  # will not work anymore

Note: adding a default infra automatically sets extra="forbid" as a default in the pydantic class model_config, as it is much safer to avoid silent errors.
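A small sketch of this behavior, reusing the TutorialTask defined earlier (wrong_parameter is a made-up field name):

import pydantic

try:
    TutorialTask(param=1, wrong_parameter=3)  # extra fields are rejected
except pydantic.ValidationError as e:
    print(e)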

Hierarchical config

One important aspect of models is that they can be composed: one model/config can contain another config. Instantiating such models is simple, as the sub-parameters can be specified as a dictionary and pydantic will take care of transforming them into the correct class:

class Parent(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # safer
    data: MyModel

obj = Parent(data={"x": 12})
assert obj.data.x == 12

This makes it easy to specify configs as yaml and load them into a model, eg:

import yaml

string = """
data:
  x: 12
  y: whatever
"""

dictconfig = yaml.safe_load(string)
obj = Parent(**dictconfig)
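A quick check of the result (a small sketch; model_dump is pydantic v2 API):

assert obj.data.x == 12 and obj.data.y == "whatever"
# the model can also be dumped back into a plain dictionary
assert obj.model_dump() == {"data": {"x": 12, "y": "whatever"}}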