Explanations

Why? The philosophy

Pure Python

The tools here do not provide a script API but a way to do everything directly from Python. Specific script APIs can easily be composed on top of them if need be.

Parameter validation

Configurations should be validated before running to avoid discovering bugs much later (e.g. missing parameter, inconsistent parameters, wrong type, etc.). We do this by using pydantic.BaseModel, which works like dataclasses but validates all parameters.
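As a minimal sketch of what this validation buys you (the `TrainingCfg` model and its fields are hypothetical, introduced only for illustration), pydantic rejects a wrongly typed parameter when the config is created, well before any computation starts:

```python
import pydantic


class TrainingCfg(pydantic.BaseModel):
    # hypothetical config, for illustration only
    lr: float = 0.01
    epochs: int = 10


# valid parameters are accepted, and defaults fill in the rest
cfg = TrainingCfg(lr=0.1)
assert cfg.epochs == 10

# a wrongly typed parameter fails at creation time, not hours into a run
try:
    TrainingCfg(epochs="many")
    failed = False
except pydantic.ValidationError:
    failed = True
assert failed
```

The same check runs for every config of a grid search, so a typo in one of them surfaces before any job is submitted.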

Fast configs

Running a grid search requires creating a bunch of configs, so configurations should be easy and fast to create, and should therefore defer loading data, PyTorch models, etc. until later.

No parameter duplication - easy to extend

Configurations hold the parameters of the underlying functions/classes. To avoid duplicating these parameters, we opt for coupling each config with its actual class/function, as below:

import pydantic


class MyClassCfg(pydantic.BaseModel):
    x: int = 12
    y: str = "hello"

    def build(self) -> "MyClass":
        return MyClass(self)


class MyClass:
    def __init__(self, cfg: MyClassCfg):
        self.cfg = cfg

With this simple pattern, building an object from the config is straightforward (cfg.build()), and adding new parameters only requires updating the config. Typing stays effective, and there is little risk of parameters being silently ignored because of a mismatch between configs and functions.
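A sketch of how the pattern plays out in a small grid search, under the assumption that any heavy work happens in build or later. The grid values are made up, and the classes repeat the ones above so that the snippet runs on its own:

```python
import itertools

import pydantic


class MyClassCfg(pydantic.BaseModel):
    x: int = 12
    y: str = "hello"

    def build(self) -> "MyClass":
        return MyClass(self)


class MyClass:
    def __init__(self, cfg: MyClassCfg):
        self.cfg = cfg


# creating the whole grid only validates parameters, so it stays cheap
grid = [MyClassCfg(x=x, y=y) for x, y in itertools.product([1, 2, 3], ["a", "b"])]
assert len(grid) == 6

# objects are only built when they are actually needed
obj = grid[0].build()
assert obj.cfg.x == 1 and obj.cfg.y == "a"
```

Adding a parameter `z` would only touch `MyClassCfg`: `MyClass` reads it through `self.cfg.z`, with no second signature to keep in sync.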

Cached/distributed computation

The main aim of this package is to provide objects that slightly modify methods so that their computation is distributed and their results are cached with minimal effort. These infra objects are configurations that let you specify how caching should be performed and how computation should be distributed, directly within your experiment config (including slurm partitions, number of gpus, etc.).

Modularity

Pydantic hierarchical configurations and discriminated unions allow for modularity and reusability: several sub-configs can be proposed for a training config, and plugging in a new sub-config is straightforward.
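A minimal sketch of such a discriminated union, using only standard pydantic features. The `Adam`, `Sgd` and `Training` models are hypothetical examples, not classes provided by this package:

```python
import typing as tp

import pydantic


class Adam(pydantic.BaseModel):
    name: tp.Literal["adam"] = "adam"
    lr: float = 0.001


class Sgd(pydantic.BaseModel):
    name: tp.Literal["sgd"] = "sgd"
    lr: float = 0.01
    momentum: float = 0.9


class Training(pydantic.BaseModel):
    # the "name" field decides which sub-config gets instantiated
    optim: tp.Union[Adam, Sgd] = pydantic.Field(..., discriminator="name")
    epochs: int = 10


training = Training(optim={"name": "sgd", "lr": 0.1})
assert isinstance(training.optim, Sgd)
assert training.optim.momentum == 0.9
```

Supporting a new optimizer then amounts to writing one more sub-config class and adding it to the union.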

MapInfra / TaskInfra differences

TaskInfra must be applied to a method with no parameter (except self). It links 1 computation to 1 job and therefore provides easy tools for accessing the job stdout/stderr/status etc.

MapInfra, on the other hand, must be applied to a method with 1 parameter (in addition to self), which must be an m-sized iterator/sequence of items. It requires stating how to provide a unique uid for each item (through the item_uid function), and it maps m computations (1 for each item) to n <= m jobs, packing several computations together. Because of this non-bijective mapping, there is no support for checking jobs stderr/stdout/status.
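To illustrate the m-to-n mapping, here is a purely illustrative sketch. The `pack_into_jobs` helper and its round-robin strategy are assumptions made for this example and say nothing about how MapInfra actually groups items:

```python
import typing as tp


def pack_into_jobs(
    items: tp.Sequence[tp.Any],
    item_uid: tp.Callable[[tp.Any], str],
    max_jobs: int,
) -> tp.List[tp.List[str]]:
    """Hypothetical helper: spreads m items over at most max_jobs jobs"""
    uids = [item_uid(item) for item in items]
    if len(set(uids)) != len(uids):
        raise ValueError("item_uid must return a unique uid for each item")
    n = min(max_jobs, len(uids))
    # round-robin assignment: job k gets items k, k + n, k + 2n, ...
    return [uids[k::n] for k in range(n)]


jobs = pack_into_jobs(["a", "b", "c", "d", "e"], item_uid=str, max_jobs=2)
assert jobs == [["a", "c", "e"], ["b", "d"]]
```

Since one job handles several items, a job's status or logs cannot be attributed to a single item, which is why this kind of access is only offered for the 1-to-1 TaskInfra case.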

uid computation

A unique id, the uid, is computed for each pydantic model/each config based on public instance attributes:

  • which are non-defaults

  • which are not excluded through the _exclude_from_cls_uid class attribute list/tuple (or class method returning a list/tuple). This allows removing parameters which do not impact the result (e.g. number of workers, device, etc.).

Note: infra objects have all their parameters excluded except version, since these parameters affect how the computation is performed but not its result.
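The selection rules above can be sketched as follows. This is a simplified stand-in built on dataclasses: the `Cfg` class and the `sketch_uid` helper are hypothetical, and the real uid string format is not reproduced here:

```python
import dataclasses


@dataclasses.dataclass
class Cfg:
    # hypothetical config, for illustration only
    lr: float = 0.01
    layers: int = 2
    num_workers: int = 4
    # no annotation: a plain class attribute, not a dataclass field
    _exclude_from_cls_uid = ("num_workers",)


def sketch_uid(cfg: Cfg) -> str:
    """Hypothetical helper: keeps public, non-default, non-excluded fields"""
    parts = []
    for field in dataclasses.fields(cfg):
        if field.name.startswith("_") or field.name in cfg._exclude_from_cls_uid:
            continue
        value = getattr(cfg, field.name)
        if value == field.default:
            continue
        parts.append(f"{field.name}={value}")
    return ",".join(sorted(parts))


assert sketch_uid(Cfg(lr=0.1, layers=3, num_workers=32)) == "layers=3,lr=0.1"
# excluded parameters never change the uid
assert sketch_uid(Cfg(num_workers=1)) == sketch_uid(Cfg(num_workers=8))
```

Ignoring defaults means that adding a new parameter with a default value leaves the uid of existing configs unchanged.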

Furthermore, a specific “cache” uid is also computed, for which additional parameters can be excluded. This accounts for parameters which do not impact the cached computation but do impact the class as a whole (attributes which are used to post-process the cached computation). This is done by specifying exclude_from_cache_uid in the infra.apply method. The cache uid is used as the storage folder name for the cache. Exclusion can be specified as a list/tuple of fields, as a method, or as the name of a method (with format method:<method_name>). Note that when subclassing, if you specified the exclusion as a function, the original function will be used (not the new one if it was overridden); if you want to use the new one, specify the method through its name.

See more in the example from the how-to guide.

ConfDict

To simplify working with configuration dictionaries, we use ConfDict classes (see their API). In practice, they are dictionaries which break into sub-dictionaries on "." characters, as in a config. Data can be specified through dotted keys, directly through sub-dictionaries, or through a mixture of both:

from exca import ConfDict

cfdict = ConfDict({"training.optim.lr": 0.01})
assert cfdict == {"training": {"optim": {"lr": 0.01}}}

ConfDict instances have a few convenient methods:

# flattens the dictionary
assert cfdict.flat() == {"training.optim.lr": 0.01}

# export to yaml (can take a file as argument)
assert cfdict.to_yaml() == "training.optim.lr: 0.01\n"

# uid computation
assert cfdict.to_uid() == "training.optim.lr=0.01-346ce994"
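The dotted-key behaviour can be pictured with a small standard-library sketch. The `flatten` and `nest` functions below are hypothetical helpers written for this explanation, not part of ConfDict, and they skip edge cases such as conflicting keys:

```python
import typing as tp


def flatten(data: tp.Dict[str, tp.Any], prefix: str = "") -> tp.Dict[str, tp.Any]:
    """Hypothetical helper: turns sub-dictionaries into dotted keys"""
    out: tp.Dict[str, tp.Any] = {}
    for key, value in data.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=name + "."))
        else:
            out[name] = value
    return out


def nest(data: tp.Dict[str, tp.Any]) -> tp.Dict[str, tp.Any]:
    """Hypothetical helper: splits dotted keys into sub-dictionaries"""
    out: tp.Dict[str, tp.Any] = {}
    for key, value in flatten(data).items():
        *parents, last = key.split(".")
        node = out
        for name in parents:
            node = node.setdefault(name, {})
        node[last] = value
    return out


# dotted keys and sub-dictionaries can be mixed
mixed = {"training.optim.lr": 0.01, "training": {"optim": {"momentum": 0.9}}}
assert nest(mixed) == {"training": {"optim": {"lr": 0.01, "momentum": 0.9}}}
assert flatten(nest(mixed)) == {"training.optim.lr": 0.01, "training.optim.momentum": 0.9}
```

Going through the flat form first is what lets both notations merge into the same nested structure.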

Infra objects extensively use such dictionaries and have a config method for instantiating the ConfDict generated from an object:

task = TutorialTask(param=13)
cfdict = task.infra.config(uid=True, exclude_defaults=True)
assert cfdict == {"param": 13}

They are used for uid computation as shown above (with uid=True, exclude_defaults=True), but also to clone an instance and update its values. New values can be passed either in dotted-name format or through sub-dictionaries:

# exports a ConfDict, updates it and reinstantiates the object from it
new = task.infra.clone_obj({"param": 14})
assert new.param == 14

Caching

Cache folders are created as <full_module_import_name>.<class_name>.<method_name>,<version>/<param1=value1,...>

E.g.: mypackage.mymodule.TutorialTask.process,1/param=13-fbfu2iow
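As a small sketch of this naming scheme (the `sketch_cache_folder` helper is hypothetical and only assembles the pieces described above; it does not compute the uid part):

```python
from pathlib import PurePosixPath


def sketch_cache_folder(
    module: str, cls: str, method: str, version: str, uid: str
) -> PurePosixPath:
    """Hypothetical helper following the naming scheme described above"""
    return PurePosixPath(f"{module}.{cls}.{method},{version}") / uid


folder = sketch_cache_folder(
    "mypackage.mymodule", "TutorialTask", "process", "1", "param=13-fbfu2iow"
)
assert str(folder) == "mypackage.mymodule.TutorialTask.process,1/param=13-fbfu2iow"
```

Because the version is part of the first folder level, bumping it points the method to a fresh location instead of reusing previously cached results.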

Under the hood, data are stored using the CacheDict class (see API here). This class has a dict interface (keys, items, values, get/set item, contains); the difference is that the data can be stored to/loaded from disk automatically. The class is initialized with 2 parameters:

  • the storage folder: if provided, the data will be stored to disk in this folder, or reloaded from it

  • the keep_in_ram flag: if True, the data will be cached in RAM when stored/reloaded, for faster access

Example

import numpy as np
from exca import cachedict
# tmp_path is assumed to be an existing writable folder (e.g. pytest's tmp_path fixture)
# create a cache dict (specialized for numpy arrays)
cache = cachedict.CacheDict(folder=tmp_path, keep_in_ram=True)

# the dictionary is empty:
assert not cache

# add a value into the cache
x = np.random.rand(2, 12)
cache["blublu"] = x
assert "blublu" in cache
# the value is now available
np.testing.assert_almost_equal(cache["blublu"], x)
assert set(cache.keys()) == {"blublu"}

# create a new dict instance with same cache folder
cache2 = cachedict.CacheDict(folder=tmp_path)
# the data is still available (loading from cache folder)
assert set(cache2.keys()) == {"blublu"}

Note that the caching currently relies heavily on the filesystem. This is a limitation of the current version, as caches with more than ~100,000 entries can become slow to initialize.
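To make the filesystem dependence more concrete, here is a rough standard-library sketch of a disk-backed dict with a keep_in_ram option. The `SketchCache` class is a hypothetical stand-in, not the CacheDict implementation, and it assumes keys are strings that are valid file names:

```python
import pickle
import tempfile
import typing as tp
from pathlib import Path


class SketchCache:
    """Hypothetical stand-in for a disk-backed dict (one pickle file per key)"""

    def __init__(self, folder: tp.Union[str, Path], keep_in_ram: bool = False) -> None:
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)
        self.keep_in_ram = keep_in_ram
        self._ram: tp.Dict[str, tp.Any] = {}

    def _path(self, key: str) -> Path:
        return self.folder / f"{key}.pkl"

    def __contains__(self, key: str) -> bool:
        return key in self._ram or self._path(key).exists()

    def __setitem__(self, key: str, value: tp.Any) -> None:
        self._path(key).write_bytes(pickle.dumps(value))
        if self.keep_in_ram:
            self._ram[key] = value

    def __getitem__(self, key: str) -> tp.Any:
        if key in self._ram:
            return self._ram[key]
        path = self._path(key)
        if not path.exists():
            raise KeyError(key)
        value = pickle.loads(path.read_bytes())
        if self.keep_in_ram:
            self._ram[key] = value
        return value

    def keys(self) -> tp.Set[str]:
        # in this sketch, every call lists the whole folder
        return {path.stem for path in self.folder.glob("*.pkl")}


with tempfile.TemporaryDirectory() as tmp:
    first = SketchCache(tmp, keep_in_ram=True)
    first["blublu"] = [1, 2, 3]
    second = SketchCache(tmp)  # new instance, same folder
    reloaded = second["blublu"]
    keys = second.keys()
assert reloaded == [1, 2, 3] and keys == {"blublu"}
```

In a design like this one, discovering existing entries means scanning the folder, which gives an intuition for why very large caches can be slow to set up.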