# Exca - Execution and caching

This page explains why `exca` was built. If you are only interested in how to use it, you can move on to the [tutorials](tutorials.md) and [how-to](howto.md) pages.

Here are the challenges we want to address:
1. config validation and remote computation
2. hierarchical computation and modularity
3. experiment/computation caching

## Challenge #1: Early configuration validation

```bash notest
srun --cpus-per-task=4 --time=60 python -m mytask --z=12
>> srun: job 34633429 queued and waiting for resources
>> srun: job 34633429 has been allocated resources
...
...
>> usage: mytask.py [-h] [--x X] [--y Y]
>> mytask.py: error: unrecognized arguments: --z=12
>> srun: error: learnfair0478: task 0: Exited with exit code 2
```

Or similarly:

```bash notest
mytask.py: error: argument --y: invalid int value: 'blublu'
```

### Observations and consequences

- Configurations should be validated before running on the cluster!
- We need some tool for validation
  → verify configurations locally first
  → in Python
  → submit from Python as well (avoid an additional bash command as boilerplate)
- Resource configuration (`srun` parameters) and computation configuration (`mytask` parameters) come in 2 different places (and sometimes formats)
  → specify the resource configuration within the same configuration as the computation? (while keeping them distinct in some way?!)

### Parameter validation with Pydantic

Pydantic (21k★ on github) works like dataclasses, but with (fast) validation:

```python
import pydantic


class MyTask(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # pydantic boilerplate
    x: int
    y: str = "blublu"


mytask = MyTask(x=12)
mytask.x  # this is 12
# MyTask(x="blublu")
# >> ValidationError: 1 validation error for MyTask (x should be a valid integer)
```

Pydantic supports hierarchical configurations:

```python continuation
class Parent(pydantic.BaseModel):
    task: MyTask


obj = Parent(task={"x": 12})  # parses the dict into a MyTask instance
obj.task.x  # this is 12
```

#### Note: discarded options

- `dataclasses`: no dynamic type check
- `omegaconf`: can typecheck (when using dataclasses) but is slow and not well-maintained
- `attrs`: probably usable (but smaller community: 5k★ Vs 21k★)

### Local/remote submission with exca

A convenient pattern (more on this later) is to tie the computation to the config class:

```python
class MyTask(pydantic.BaseModel):
    x: int
    y: str = "blublu"
    model_config = pydantic.ConfigDict(extra="forbid")  # pydantic boilerplate

    def compute(self) -> int:
        print(self.y)
        return 2 * self.x
```

Then, to enable remote computation, we add an `exca.TaskInfra` subconfiguration:

```python
import exca


class MyTask(pydantic.BaseModel):
    x: int
    y: str = "blublu"
    infra: exca.TaskInfra = exca.TaskInfra()  # note: automatically sets extra="forbid"

    @infra.apply
    def compute(self) -> int:
        print(self.y)
        return 2 * self.x
```

By default, this changes nothing, but you can now parametrize the infra to run the `compute` method on slurm, e.g.:

```python continuation fixture:tmp_path
import yaml

config = f"""
x: 12
y: whatever
infra:
  # resource parameters
  cluster: slurm
  folder: {tmp_path}
  cpus_per_task: 4
"""
dictconfig = yaml.safe_load(config)
obj = MyTask(**dictconfig)  # validation happens locally
out = obj.compute()  # runs in a slurm job!
assert out == 24
```

Note that the config now holds both the computation parameters (`x` and `y`) and the resource parameters (through `infra`), but they remain separated thanks to the hierarchical structure of the config.
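With this setup, the `--z=12` mistake from the opening `srun` example now fails at instantiation time, on the local machine, before anything reaches the cluster. A minimal sketch, reusing the `MyTask` class defined above:

```python continuation
import pydantic

try:
    MyTask(x=12, z=12)  # same mistake as in the srun example above
except pydantic.ValidationError as e:
    print(e)  # caught locally, before any job is submitted
```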
## Challenge #2: Complex experiments - hierarchical configurations

Do's and don'ts with `pydantic` configurations.

### Parametrizing pattern

Seen in many codebases:

```python
import pydantic
import torch


class ConvCfg(pydantic.BaseModel):
    layers: int = 12
    kernel: int = 5
    channels: int = 128


class ConvModel(torch.nn.Module):
    def __init__(self, layers: int, kernel: int, channels: int, other: int = 12) -> None:
        super().__init__()
        self.layers = layers
        self.kernel = kernel
        self.channels = channels
        self.other = other
        ...  # build layers, add forward method


# then in your code
cfg = ConvCfg(layers=10, kernel=5, channels=16)
model = ConvModel(layers=cfg.layers, kernel=cfg.kernel, channels=cfg.channels)
```

Issues:
- a lot of duplicated code/work
- easy to mess up when propagating a new parameter, as it needs 4 changes: the config, the model's init parameters, the content of the init, and the instantiation of the model from the config (any typo? any mismatch generating a silent bug?)
- some defaults may not be configurable

Here is a simpler pattern:

```python
class ConvCfg(pydantic.BaseModel):
    layers: int = 12
    kernel: int = 5
    channels: int = 128

    def build(self) -> torch.nn.Module:
        # instantiate when needed
        # (do not slow down config initialization)
        return ConvModel(self)


class ConvModel(torch.nn.Module):
    def __init__(self, cfg: ConvCfg) -> None:
        super().__init__()
        self.cfg = cfg
        ...  # build layers, add forward method


# then in your code
model = ConvCfg().build()
```

**Cost**: the classes become coupled (then again, you no longer need to import `ConvModel`)

**Benefit**: fixes all the issues mentioned above, with 1 set of defaults, in a single place

### One step further - Discriminated unions

Pipelines often get complex and require if-else conditions depending on the configuration, for instance:

```python
import typing as tp


class ModelCfg(pydantic.BaseModel):
    name: tp.Literal["conv", "transformer"] = "conv"  # special discriminator field
    # shared parameters
    layers: int = 12
    # convolution parameters
    kernel: int = 5
    channels: int = 128
    # transformer parameters
    embeddings: int = 128

    def build(self) -> torch.nn.Module:
        if self.name == "conv":
            return ConvModel(self)
        else:
            return TransformerModel(self)
```

This couples different models into a single config where some parameters are ignored depending on the case, and it becomes messier and messier as more models are added. Fortunately, `pydantic`'s discriminated unions easily address this issue:

```python continuation
from exca import TaskInfra


class ConvCfg(pydantic.BaseModel):
    name: tp.Literal["conv"] = "conv"  # special discriminator field
    layers: int = 12
    kernel: int = 5
    channels: int = 128

    def build(self) -> torch.nn.Module:
        return ConvModel(self)  # instantiate when needed

    ...


class TransformerCfg(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid")  # pydantic boilerplate: safer
    name: tp.Literal["transformer"] = "transformer"  # special discriminator field
    layers: int = 12
    embeddings: int = 128

    def build(self) -> torch.nn.Module:
        return TransformerModel(self)

    ...


class Trainer(pydantic.BaseModel):
    model: ConvCfg | TransformerCfg = pydantic.Field(..., discriminator="name")
    optimizer: str = "Adam"
    infra: TaskInfra = TaskInfra()

    @infra.apply
    def run(self) -> float:
        model = self.model.build()  # build either one of the models
        # specific location for this very config:
        ckpt_path = self.infra.uid_folder() / "checkpoint.pt"
        if ckpt_path.exists():
            # load
            ...
        ...
        for batch in loader:
            ...
        return accuracy


string = """
model:
  name: transformer  # specifies which model
  embeddings: 256  # only accepts transformer-specific parameters
optimizer: SGD
"""
trainer = Trainer(**yaml.safe_load(string))
isinstance(trainer.model, TransformerCfg)
```

Discriminated unions make it easier to build **modular pipelines**, as one can easily swap one part of the experiment for another and still get full parameter validation.
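For instance, switching from the transformer to the convolution model is a one-line change of the discriminator in the config. A small sketch reusing the `Trainer` class above:

```python continuation
conv_trainer = Trainer(**yaml.safe_load("""
model:
  name: conv  # the discriminator now selects ConvCfg
  channels: 64
"""))
assert isinstance(conv_trainer.model, ConvCfg)
```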
## Challenge #3: Experiment/computation caching

`exca` can also handle **caching of the computation result** with no extra effort, so any computation already performed will only be recomputed if explicitly required. A lot of additional benefits also come for free:
- sub-configs can have their own infra.
- running a grid search only requires a `for` loop.
- computations can be packed into a job array.
- computations can be performed in a dedicated working directory, to avoid interfering with the code.

```python continuation fixture:tmp_path
string = f"""
model:
  name: transformer  # specifies which model
  embeddings: 256
optimizer: SGD
infra:
  gpus_per_node: 8
  cpus_per_task: 80
  slurm_constraint: volta32gb
  folder: {tmp_path}
  cluster: slurm
  slurm_partition: learnfair
  workdir:
    copied:
      - .  # copies the current working directory into a dedicated workdir
      # - whatever_other_file_or_folder
"""
trainer = Trainer(**yaml.safe_load(string))

with trainer.infra.job_array() as array:
    for layers in [12, 14, 15]:
        array.append(trainer.infra.clone_obj({"model.layers": layers}))
# leaving the context submits all trainings in a job array
# and is non-blocking

# show one of the slurm jobs
print(array[0].infra.job())
```

Overall, with this way of experimenting, you easily get:
- a modular pipeline made of simple building blocks
- easy remote computation configuration
- configurations validated before being sent to the remote cluster through a job array
- cached results, so that only the missing elements of the array get sent
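As a final sketch of the caching behavior, reusing the `MyTask` example from Challenge #1 and assuming a local run (default `cluster`) with results cached as soon as an infra `folder` is provided:

```python continuation fixture:tmp_path
task = MyTask(x=12, infra={"folder": tmp_path})
out = task.compute()   # computed, then stored in the infra folder
out2 = task.compute()  # loaded from the cache, not recomputed
assert out == out2 == 24
```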