# Tutorials

`pydantic` is a package providing model/configuration classes with parameter validation at instantiation time. The `exca` package builds on top of it and provides an "infra" pydantic configuration that can be part of a parent pydantic configuration and change the way the parent behaves. In particular, it lets one add caching and remote computation to the parent's methods.

Check out the package [philosophy](philosophy) for a more in-depth explanation of the "whys" of this package.

If you are not familiar with `pydantic`, have a look first at the [Pydantic models section](#pydantic-models).

## Installation

`pip install exca`

## Two types of infra: Task and Map

Infras currently come in two flavors.

(infra/tutorials:TaskInfra)=
### TaskInfra

Consider a pydantic model/config that fully defines one computation to perform, for instance through a `process` method like below:

```python
import numpy as np
import pydantic


class TutorialTask(pydantic.BaseModel):
    param: int = 12

    def process(self) -> float:
        return self.param * np.random.rand()
```

Adding an infra to the `process` method only requires adding a [`TaskInfra`](#exca.TaskInfra) object to the config:

```python continuation
import exca


class TutorialTask(pydantic.BaseModel):
    param: int = 12
    infra: exca.TaskInfra = exca.TaskInfra(version="1")

    @infra.apply
    def process(self) -> float:
        return self.param * np.random.rand()
```

`TaskInfra` provides configuration for caching and computation; in particular, providing a `folder` activates caching through the filesystem:

```python continuation fixture:tmp_path
task = TutorialTask(param=1, infra={"folder": tmp_path})
out = task.process()
# calling process again will load the cache and not draw a new random number
assert out == task.process()
```

Adding `cluster="auto"` to the `infra` would trigger the computation either on a slurm cluster if available, or in a dedicated process otherwise. See the [API reference](#exca.TaskInfra) for all the details.

(tutorial-map)=
### Map infra

The `TaskInfra` above is limited to methods that take no additional arguments, i.e. computations that are fully defined by the configuration, such as an experiment or a training. Consider now a configuration that defines a computation to be applied to a list of items (e.g. processing a list of images or texts); this is the use case for the [`MapInfra`](#exca.MapInfra):

```python
import typing as tp
import pydantic
import numpy as np
from exca import MapInfra


class TutorialMap(pydantic.BaseModel):
    param: int = 12
    infra: MapInfra = MapInfra(version="1")

    @infra.apply(item_uid=str)
    def process(self, items: tp.Iterable[int]) -> tp.Iterator[np.ndarray]:
        for item in items:
            yield np.random.rand(item, self.param)
```

As opposed to `TaskInfra`, the `MapInfra.apply` method requires an `item_uid` parameter stating how to map each item of the input iterable to a unique string, which is used for identification/caching. From there, calling `whatever.process([1, 2, 3])` will trigger (possibly remote) computation and caching/storage.
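Caching happens at the item level: a second call should only recompute items that were not already processed. Here is a small sketch of this behavior (reusing the `TutorialMap` above; the per-item cache reuse shown in the last line is the expected behavior given filesystem caching):

```python continuation fixture:tmp_path
mapper = TutorialMap(infra={"folder": tmp_path, "cluster": None})
first = list(mapper.process([1, 2]))  # computes and caches items "1" and "2"
second = list(mapper.process([1, 2, 3]))  # reloads "1" and "2", computes only "3"
np.testing.assert_array_equal(second[0], first[0])  # cached item is unchanged
```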
You can control the remote resources through the `infra` instance. For example, the following triggers the computation in the current process (change `"cluster": None` to `"auto"` to have it run on a `slurm` cluster if available, or in a dedicated process otherwise):

```python continuation fixture:tmp_path
mapper = TutorialMap(infra={"cluster": None, "folder": tmp_path, "cpus_per_task": 1})
mapper.process([1, 2, 3])
```

See the [API reference](#exca.MapInfra) for all the details.

### Features of MapInfra and TaskInfra

This section provides an overview of the parameters and features of infra; the full [API reference page](#exca.TaskInfra) provides more options and details if need be.

Common useful parameters include:
- `folder`: where to create the cache folder
- `mode`: one of:
  - `cached`: cache is returned if available (error or not), otherwise computed (and cached). This is the default behavior.
  - `force`: cache is ignored, and the result is (re)computed (and cached)
  - `retry` (only for `TaskInfra`): cache is returned if available, except if it is an error, otherwise (re)computed (and cached)
- submitit/slurm parameters (e.g. `gpus_per_node`, `cpus_per_node`, `slurm_partition`, `slurm_constraint`, etc.)

All infra objects have common features such as:
- **config export**: through `task.infra.config(uid=False, exclude_defaults=True)`.
- **uid/xp folder**: through `task.infra.uid_folder()`. The folder is always populated with the full config and the reduced uid config. It also contains a symlink to the job folder.

When filesystem caching is used, the folder will contain useful information:
- `config.yaml`: the full configuration (all parameters) of the pydantic model
- `full-uid.yaml`: the config defining the task/map, including defaults (but not non-uid related configs such as the number of workers)
- `uid.yaml`: the minimal config defining the task/map (excluding defaults and non-uid related configs such as the number of workers)

It will also optionally contain:
- `code` (if `workdir` is specified): a symlink to the directory where the task was executed
- `submitit` (for `TaskInfra`, if `cluster` is not `None`): a symlink to the folder containing all `submitit` related files for the task (stdout, stderr, batch file, etc.)

`TaskInfra` also has additional features, illustrated in the sketch after this list:
- *job access*: through `task.infra.job()`. Jobs submitted through `submitit` have `stdout()`, `stderr()`, `cancel()`, and more. All jobs have methods `result()`, `done()` and `wait()`. Calling `infra.job()` submits the job if it does not already exist.
- *cache/job clearing*: through `task.infra.clear_job()`
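As a minimal sketch tying these features together (reusing the `TutorialTask` defined earlier; the printed values depend on your setup):

```python continuation fixture:tmp_path
task = TutorialTask(param=3, infra={"folder": tmp_path, "cluster": None})
task.process()  # computes and caches the result

# config export: a reduced view of the configuration
print(task.infra.config(uid=False, exclude_defaults=True))

# uid/xp folder: holds config.yaml, uid.yaml and full-uid.yaml
print(task.infra.uid_folder())

# cache/job clearing: the next call to process() will recompute
task.infra.clear_job()
```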
## Quick comparison

| **feature \ tool**           | lru_cache | hydra | submitit | stool | exca |
| ---------------------------- | :-------: | :---: | :------: | :---: | :--: |
| RAM cache                    |     ✔     |   ✘   |    ✘     |   ✘   |  ✔   |
| file cache                   |     ✘     |   ✘   |    ✘     |   ✘   |  ✔   |
| remote compute               |     ✘     |   ✔   |    ✔     |   ✔   |  ✔   |
| pure python (vs commandline) |     ✔     |   ✘   |    ✔     |   ✘   |  ✔   |
| hierarchical config          |     ✘     |   ✔   |    ✘     |   ✘   |  ✔   |

## Simplified infra decorator

For quick experimentation with infra, the `exca.helpers.with_infra` function decorator can add an infra parameter to most functions (with simple arguments):

```python fixture:tmp_path
import numpy as np
import exca


@exca.helpers.with_infra(folder=tmp_path)
def my_func(a: int, b: int) -> np.ndarray:
    return np.random.rand(a, b)


out = my_func(a=3, b=4)
out2 = my_func(a=3, b=4)
np.testing.assert_array_equal(out2, out)  # should be the same (as cached)
```

In the long run this is not advised, as it will prevent you from using many features of infra (running an array of jobs, checking their status, etc.).

(pydantic-models)=
## Pydantic models

This is a quick recap of important features of `pydantic` models. You do not write an `__init__` method; parameters are instead specified directly in the class (as if they were class attributes, although they will not be):

```python
import pydantic


class MyModel(pydantic.BaseModel):
    x: int
    y: str = "blublu"


mymodel = MyModel(x=12)
assert mymodel.x == 12
```

One can then instantiate the model easily with `mymodel = MyModel(x=12)` and access attributes like `mymodel.x`. One important feature is type checking at instantiation time: as `x` is typed as an `int`, the field will not accept a string, and the following code would raise an exception: `mymodel = MyModel(x="wrong")`.

**Note**: `pydantic` is very similar to the more standard `dataclasses`, with a few important differences: models are type checked (dataclasses are not), one can set mutable default values like `[]` without risk (with dataclasses this can be buggy or requires a factory), and one can use discriminators for sub-configs ([more on that here](howto-discriminator)).

For more safety, one should set `extra="forbid"` for models, as this will also trigger an error if you instantiate an object with parameters that do not exist in the model:

```python continuation
import pydantic


class MyModel(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # safer
    x: int
    y: str = "blublu"


# MyModel(x=12, wrong_parameter=12)  # will not work anymore
```

**Note**: adding a default infra automatically sets `extra="forbid"` as a default in the pydantic class `model_config`, as it is much safer to avoid silent errors.

### Hierarchical config

One important aspect of models is that they can be composed: one model/config can contain another config. Instantiating such models is simple, as the subparameters can be specified as a dictionary and `pydantic` will take care of transforming them into the correct class:

```python continuation
class Parent(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # safer
    data: MyModel


obj = Parent(data={"x": 12})
assert obj.data.x == 12
```

This makes it easy to specify configs as yaml and load them into a model, e.g.:

```python continuation
import yaml

string = """
data:
  x: 12
  y: whatever
"""
dictconfig = yaml.safe_load(string)
obj = Parent(**dictconfig)
```
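Validation applies to nested configs as well. As a small sketch (reusing the `Parent` model above, with a deliberately invalid value), `pydantic` reports the full path of the offending field:

```python continuation
import pydantic

try:
    Parent(data={"x": "not an int"})
except pydantic.ValidationError as error:
    print(error)  # points at data.x, which should be an integer
```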