# Tutorials
`pydantic` is a package providing model/configuration classes that validate parameters when the object is instantiated. The `exca` package builds on top of it and provides an "infra" pydantic configuration that can be part of a parent pydantic configuration and change the way it behaves. In particular, it lets you add caching and remote computation to its methods. Check out the package philosophy for a more in-depth explanation of the "whys" of this package.

If you are not familiar with `pydantic`, first have a look at the Pydantic models section below.
## Installation

```bash
pip install exca
```
## Two types of infra: Task and Map

Infras currently come in two flavors.
### TaskInfra

Consider a pydantic model/config that fully defines one computation to perform, for instance through a `process` method like below:
```python
import numpy as np
import pydantic


class TutorialTask(pydantic.BaseModel):
    param: int = 12

    def process(self) -> float:
        return self.param * np.random.rand()
```
Adding an infra on the `process` method only requires adding a `TaskInfra` object to the config:
```python
import numpy as np
import pydantic

import exca


class TutorialTask(pydantic.BaseModel):
    param: int = 12
    infra: exca.TaskInfra = exca.TaskInfra(version="1")

    @infra.apply
    def process(self) -> float:
        return self.param * np.random.rand()
```
`TaskInfra` provides configuration for caching and computation; in particular, providing a `folder` activates caching through the filesystem:
```python
task = TutorialTask(param=1, infra={"folder": tmp_path})
out = task.process()
# calling process again loads the cache instead of drawing a new random number
assert out == task.process()
```
Adding `cluster="auto"` to the infra triggers the computation either on a Slurm cluster if available, or in a dedicated process otherwise. See the API reference for all the details.
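As a hedged sketch of what this looks like (reusing the `tmp_path` cache folder from the snippet above):

```python
# same task, but the computation is submitted to slurm when available,
# or run in a dedicated process otherwise; the result is still cached on disk
task = TutorialTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})
out = task.process()  # submission and caching are handled by the infra
```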
### Map infra

The `TaskInfra` above is limited to methods that take no additional arguments, i.e. computations that are fully defined by the configuration, such as an experiment or a training. Now consider a configuration that defines a computation to be applied to a list of items (e.g. processing a list of images or texts): this is the use case for `MapInfra`:
```python
import typing as tp
import pydantic
import numpy as np

from exca import MapInfra


class TutorialMap(pydantic.BaseModel):
    param: int = 12
    infra: MapInfra = MapInfra(version="1")

    @infra.apply(item_uid=str)
    def process(self, items: tp.Iterable[int]) -> tp.Iterator[np.ndarray]:
        for item in items:
            yield np.random.rand(item, self.param)
```
As opposed to `TaskInfra`, the `MapInfra.apply` method requires an `item_uid` parameter stating how to map each item of the input iterable to a unique string, which is used for identification/caching.

From then on, calling `mapper.process([1, 2, 3])` will trigger (possibly remote) computation and caching/storage. You can control the remote resources through the `infra` instance.

E.g. the following triggers the computation in the current process (change `"cluster": None` to `"auto"` to have it run on a Slurm cluster if available, or in a dedicated process otherwise):
```python
mapper = TutorialMap(infra={"cluster": None, "folder": tmp_path, "cpus_per_task": 1})
mapper.process([1, 2, 3])
```
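As a rough usage sketch (assuming results come back in the same order as the input items, one array per item):

```python
# items 1, 2, 3 with the default param=12 yield arrays of shape (item, 12)
arrays = list(mapper.process([1, 2, 3]))
assert arrays[0].shape == (1, 12)
# a second call with the same items reads the cache instead of recomputing
arrays_again = list(mapper.process([1, 2, 3]))
np.testing.assert_array_equal(arrays_again[0], arrays[0])
```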
## Features of MapInfra and TaskInfra

This section provides an overview of the parameters and features of infra; the full API reference page provides more options and details if need be.

Common useful parameters include:
- `folder`: where to create the cache folder
- `mode`: one of:
  - `cached`: the cache is returned if available (error or not), otherwise computed (and cached). This is the default behavior.
  - `force`: the cache is ignored, and the result is (re)computed (and cached)
  - `retry` (only for `TaskInfra`): the cache is returned if available, except if it is an error, otherwise (re)computed (and cached)
- submitit/slurm parameters (e.g. `gpus_per_node`, `cpus_per_node`, `slurm_partition`, `slurm_constraint`, etc.), combined in the sketch after this list
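As a hedged illustration of how these parameters combine (the partition name below is hypothetical, and the exact resource keys accepted depend on your submitit/slurm setup):

```python
task = TutorialTask(
    param=1,
    infra={
        "folder": tmp_path,        # filesystem cache location
        "mode": "force",           # ignore any existing cache and recompute
        "cluster": "auto",         # slurm if available, dedicated process otherwise
        "gpus_per_node": 1,        # submitit/slurm resource parameters
        "slurm_partition": "dev",  # hypothetical partition name
    },
)
out = task.process()
```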
All infra objects have common features such as:

- config export: through `task.infra.config(uid=False, exclude_defaults=True)` (see the sketch below).
- uid/xp folder: through `task.infra.uid_folder()`. The folder is always populated with the full config and the reduced uid config. It also contains a symlink to the job folder.
When filesystem caching is used, the folder will contain useful information:

- `config.yaml`: the full configuration (all parameters) of the pydantic model
- `full-uid.yaml`: the config defining the task/map, including defaults (not including non-uid related configs such as number of workers, etc.)
- `uid.yaml`: the minimal config defining the task/map (not including defaults, nor non-uid related configs such as number of workers, etc.)
It will also optionally contain:

- `code` (if `workdir` is specified): a symlink to the directory where the task was executed
- `submitit` (for `TaskInfra` if `cluster` is not `None`): a symlink to the folder containing all `submitit`-related files for the task (stdout, stderr, batch file, etc.)
`TaskInfra` also has additional features, in particular:

- job access: through `task.infra.job()`. Jobs submitted through `submitit` have `stdout()`, `stderr()`, `cancel()` and more. All jobs have methods `result()`, `done()` and `wait()`. Calling `infra.job()` submits the job if it does not already exist (see the sketch below).
- cache/job clearing: through `task.infra.clear_job()`
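A hedged sketch of these job-related helpers (assuming a cluster or local executor is configured as above):

```python
task = TutorialTask(param=1, infra={"folder": tmp_path, "cluster": "auto"})
job = task.infra.job()  # submits the job if it does not already exist
job.wait()              # all jobs expose result(), done() and wait()
print(job.result())
task.infra.clear_job()  # clear the cache/job so the next call recomputes
```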
## Quick comparison

| feature \ tool | lru_cache | hydra | submitit | stool | exca |
|---|---|---|---|---|---|
| RAM cache | ✔ | ✘ | ✘ | ✘ | ✔ |
| file cache | ✘ | ✘ | ✘ | ✘ | ✔ |
| remote compute | ✘ | ✔ | ✔ | ✔ | ✔ |
| pure python (vs commandline) | ✔ | ✘ | ✔ | ✘ | ✔ |
| hierarchical config | ✘ | ✔ | ✘ | ✘ | ✔ |
## Simplified infra decorator

For quick experimentation with infra, the `exca.helpers.with_infra` function decorator can add an infra parameter to most functions (with simple arguments).
```python
import numpy as np

import exca


@exca.helpers.with_infra(folder=tmp_path)
def my_func(a: int, b: int) -> np.ndarray:
    return np.random.rand(a, b)


out = my_func(a=3, b=4)
out2 = my_func(a=3, b=4)
np.testing.assert_array_equal(out2, out)  # should be the same (as cached)
```
In the long run this is not advised, as it prevents you from using many features of infra (running an array of jobs, checking their status, etc.).
## Pydantic models

This is a quick recap of important features of `pydantic` models. Models do not need an `__init__` method; parameters are instead specified directly in the class (as if they were class attributes, although they will not be):
```python
import pydantic


class MyModel(pydantic.BaseModel):
    x: int
    y: str = "blublu"


mymodel = MyModel(x=12)
assert mymodel.x == 12
```
One can then instantiate it easily with `mymodel = MyModel(x=12)` and access attributes like `mymodel.x`. One important feature is type checking at instantiation: since `x` is typed as an `int`, the field will not accept a string such as `"wrong"`, and the following code raises an exception: `mymodel = MyModel(x="wrong")`.
Note: `pydantic` is very similar to the more standard `dataclasses`, with a few important differences: models are type checked (dataclasses are not), one can set mutable default values like `[]` without risk (with dataclasses this can be buggy or requires a factory), and one can use discriminators for sub-configs (more on that here).
For more safety, one should set `extra="forbid"` on models, as this will also trigger an error if you instantiate an object with parameters that do not exist in the model:
```python
import pydantic


class MyModel(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # safer
    x: int
    y: str = "blublu"


# MyModel(x=12, wrong_parameter=12)  # will not work anymore
```
Note: adding a default infra automatically sets `extra="forbid"` as a default in the pydantic class `model_config`, as it is much safer to avoid silent errors.
### Hierarchical config

One important aspect of models is that they can be composed: one model/config can contain another config. Instantiating such models is simple, as the sub-parameters can be specified as a dictionary and `pydantic` will take care of transforming them into the correct class:
```python
class Parent(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # safer
    data: MyModel


obj = Parent(data={"x": 12})
assert obj.data.x == 12
```
This makes it easy to specify configs as YAML and load them into a model, e.g.:
```python
import yaml

string = """
data:
  x: 12
  y: whatever
"""
dictconfig = yaml.safe_load(string)
obj = Parent(**dictconfig)
```
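Since an infra is itself a sub-config, the same mechanism applies to the tutorial tasks above: the `infra` field can be filled from a dictionary or a YAML string. A hedged sketch (the cache path below is hypothetical):

```python
string = """
param: 1
infra:
  folder: /tmp/exca_cache  # hypothetical cache folder
"""
task = TutorialTask(**yaml.safe_load(string))
out = task.process()  # cached under the configured folder, as before
```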