Caching/Memoization
An important part of the launcher is its caching system. When you call the
schedule method with a configured module, the launcher checks whether this
configuration has already been run and reuses the results when possible.
The cache is indexed on the configuration of the module, so if you change
anything in the configuration input, the module will be executed from scratch
and the new result will be cached with a different key. It's also important to
remember that all inputs to the module that could change its results (and thus
the caching) should be specified in the config input. Please note there is one
exception to this mechanism: the timeout_min key is ignored for caching, so
changing it won't invalidate the cache.
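To make this concrete, here is a minimal, self-contained sketch of how a config-keyed cache can behave. It is not the actual stopes implementation; the module name, config keys, and hashing scheme are made up for illustration only.

```python
import hashlib
import json

# Illustrative sketch only (not the stopes implementation): the cache key is a
# hash of the module's config, minus excluded keys such as timeout_min.
EXCLUDED_KEYS = {"timeout_min"}  # changing these does not invalidate the cache


def cache_key(module_name: str, config: dict, version: str = "0.0") -> str:
    relevant = {k: v for k, v in config.items() if k not in EXCLUDED_KEYS}
    payload = json.dumps(
        {"module": module_name, "version": version, "config": relevant},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same effective config -> same key -> the cached result is reused.
key_a = cache_key("embed_text", {"lang": "bn", "shards": 55, "timeout_min": 60})
key_b = cache_key("embed_text", {"lang": "bn", "shards": 55, "timeout_min": 240})
assert key_a == key_b  # timeout_min is ignored for caching

# Any other config change yields a new key, so the module runs from scratch.
key_c = cache_key("embed_text", {"lang": "hi", "shards": 55, "timeout_min": 60})
assert key_c != key_a
```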
If you change the code of your module in a way that changes its output, you
can implement the version() method to return a new value so that the cache
knows it needs to recompute from scratch even for known configs.
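As a hedged sketch of the idea (in practice your module would subclass the stopes module base class, which is omitted here, and the exact signature may differ), bumping version() could look like this:

```python
# Hedged sketch: base class omitted, only the version() idea is shown.
class MyEmbedModule:  # hypothetical module, for illustration only
    def __init__(self, config):
        self.config = config

    def version(self) -> str:
        # Bump this value whenever a code change alters the module's output,
        # so that cached results produced by older code are recomputed.
        return "0.2"  # was "0.1" before the code change
```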
You can also implement the validate() method to check the outputs from your
module and from the cache if you want to actively invalidate the cache. For
example, if it's known how many lines are to be embedded into a particular
dimension (say 1024), you can validate that the output file size is
num_lines * 1024 * 4 bytes (the size of a float32).
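For instance, a validate() along these lines could check the file size. This is only a sketch under assumed signature and config fields (num_lines, dim); it is not the exact stopes API.

```python
import os

FLOAT32_BYTES = 4  # each embedded value is stored as a 4-byte float32


class MyEmbedModule:  # hypothetical module, for illustration only
    def __init__(self, config):
        self.config = config  # assumed to carry num_lines and dim

    def validate(self, output_file: str) -> bool:
        # A stale or truncated output will not match the expected size,
        # so returning False here invalidates the cached entry.
        expected = self.config["num_lines"] * self.config["dim"] * FLOAT32_BYTES
        return os.path.getsize(output_file) == expected
```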
Here is an example of rerunning the global mining pipeline after it was interrupted in the middle. The caching layer recovers what was already executed successfully. The rerun was started with the exact same command that was used for the initial full run:
python yourpipeline.py src_lang=bn tgt_lang=hi +data=ccg
Here are the logs:
[global_mining][INFO] - output: .../global_mining/outputs/2021-11-02/08-56-40
[global_mining][INFO] - working dir: .../global_mining/outputs/2021-11-02/08-56-40
[mining_utils][WARNING] - No mapping for lang bn
[embed_text][INFO] - Number of shards: 55
[embed_text][INFO] - Embed bn (hi), 55 files
[stopes_launcher][INFO] - for encode.bn.55 found 55 already cached array results,0 left to compute out of 55
[train_faiss_index][INFO] - lang=bn, sents=135573728, required=40000000, index type=OPQ64,IVF65536,PQ64
[stopes_launcher][INFO] - index-train.bn.iteration_2 done from cache
[stopes_launcher][INFO] - for populate_index.OPQ64,IVF65536,PQ64.bn found 44 already cached array results,11 left to compute out of 55
[stopes_launcher][INFO] - submitted job array for populate_index.OPQ64,IVF65536,PQ64.bn: ['48535900_0', '48535900_1', '48535900_2', '48535900_3', '48535900_4', '48535900_5', '48535900_6', '48535900_7', '48535900_8', '48535900_9', '48535900_10']
[mining_utils][WARNING] - No mapping for lang hi
[embed_text][INFO] - Number of shards: 55
[embed_text][INFO] - Embed hi (hi), 55 files
[stopes_launcher][INFO] - for encode.hi.55 found 55 already cached array results,0 left to compute out of 55
[train_faiss_index][INFO] - lang=hi, sents=162844151, required=40000000, index type=OPQ64,IVF65536,PQ64
We can see that the launcher found it doesn't need to run the encode and
train-index steps for the bn lang (the src language) and can skip straight to
populating the index with embeddings; it had also already processed 44 shards
for that step, so it only re-schedules jobs for the remaining 11 shards. In
parallel, it processes the tgt language (hi): all of its encoded shards were
also recovered from the cache, but the index training step still needs to run.
All this was done automatically. The person launching the pipeline doesn't have to micromanage what has already succeeded and what needs to be started when.
Each module can specify the get_config_for_cache function to customize which
keys should be ignored or taken into account for the cache.
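As a hedged sketch (the exact signature, and the key names being dropped, are assumptions for illustration), it could look like this:

```python
# Hedged sketch, not the exact stopes API: return only the parts of the config
# that should participate in the cache key.
class MyEmbedModule:  # hypothetical module, for illustration only
    def __init__(self, config):
        self.config = config

    def get_config_for_cache(self):
        cacheable = dict(self.config)
        # Drop keys (example names) that affect scheduling or logging but not
        # the output, so changing them does not invalidate cached results.
        for ignored in ("timeout_min", "log_dir"):
            cacheable.pop(ignored, None)
        return cacheable
```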