NO LANGUAGES LEFT BEHIND
Driving inclusion through machine translation

stopes

Large-Scale Translation Tooling

Easy to Use

stopes was designed to provide a modular API for building and reproducing the pipelines at the core of large-scale translation work, in particular data mining and evaluation. Where you run your pipeline and how you scale it are independent of its core logic. Everything is config-driven, so you can easily reproduce and track results.

Batteries Included

stopes lets you focus on your core data and evaluation needs by providing common modules for these tasks and letting you write your pipelines in idiomatic Python. Common optimizations are also built in to help you scale your work.

State-of-the-art Pipelines

stopes was developed as part of the Meta AI No Language Left Behind research project and comes with state-of-the-art pipelines out of the box. You can run our multimodal mining and distillation pipelines and reproduce our research with just a few commands.

No-code Mining

stopes comes with the multimodal mining pipeline used by the NLLB and Seamless Communication teams, and you can use it out of the box without extra coding. You will need to set up an environment and create a config file pointing to your data, but then you can start mining speech and text (locally or on a Slurm cluster) using the SONAR embedding space. Check out the Quickstart guide.

python -m stopes.pipelines.bitext.global_mining_pipeline \
    src_lang=fuv \
    tgt_lang=zul \
    demo_dir=./demo \
    +preset=demo \
    output_dir=. \
    embed_text=laser3
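
To run the same pipeline on a Slurm cluster instead of locally, you typically only change the launcher configuration. As a sketch, assuming the launcher config group exposes a cluster key (launcher.cluster=slurm is an assumption and the exact key may differ between stopes versions):

python -m stopes.pipelines.bitext.global_mining_pipeline \
    src_lang=fuv \
    tgt_lang=zul \
    demo_dir=./demo \
    +preset=demo \
    output_dir=. \
    embed_text=laser3 \
    launcher.cluster=slurm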

Reproducible research

_target_: stopes.modules.preprocess.train_spm.TrainSpmModule
config:
  output_dir: ???
  vocab_size: 50_000
  input_sentence_size: 5_000_000
  character_coverage: 0.999995
  model_type: "unigram"
  shuffle_input_sentence: True
  num_threads: 4

stopes is based on Hydra, giving you full control over the behavior of your pipelines. Experiments are easily reproducible, along with their results.
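
Because the module config above carries a _target_ key, Hydra can build the module object directly from it. A minimal sketch, assuming the snippet is saved as train_spm.yaml (a hypothetical filename) and that output_dir is the only mandatory field left to fill in:

import hydra
from omegaconf import OmegaConf

# load the module config shown above (hypothetical path)
cfg = OmegaConf.load("train_spm.yaml")

# fill in the mandatory field marked ??? in the YAML
cfg.config.output_dir = "/tmp/spm"

# Hydra resolves _target_ and instantiates TrainSpmModule with this config
spm_module = hydra.utils.instantiate(cfg)

The instantiated module can then be scheduled by a launcher just like the built-in steps in the pipeline example below.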

Modular pipeline definition

stopes pipelines are composed of modules. No more duplicated, out-of-sync code: your most common preprocessing steps can be shared among all your pipelines.

This repository includes implementations of a number of modules useful for translation data mining and evaluation, neural machine translation data pre-processing, and model training. For example, we provide modules to build faiss indexes, encode text with LASER and HuggingFace Transformers, mine bitext, and train and evaluate FAIRSEQ models.

import asyncio

import hydra
from omegaconf import DictConfig

from stopes.core.utils import clone_config
from stopes.modules.bitext.indexing.populate_faiss_index import PopulateFAISSIndexModule
from stopes.modules.bitext.indexing.train_faiss_index_module import TrainFAISSIndexModule


# the pipeline
async def pipeline(config):
    # set up a launcher to connect the jobs together
    launcher = hydra.utils.instantiate(config.launcher)

    # train the faiss index
    trained_index = await launcher.schedule(TrainFAISSIndexModule(
        config=config.train_index
    ))

    # pass the trained index to the next step through its config
    with clone_config(config.populate_index) as config_with_index:
        config_with_index.index = trained_index

    # fill the index with content
    populated_index = await launcher.schedule(PopulateFAISSIndexModule(
        config=config_with_index
    ))
    print(f"Indexes are populated in: {populated_index}")


# set up main with Hydra
@hydra.main(config_path="conf", config_name="config")
def main(config: DictConfig) -> None:
    asyncio.run(pipeline(config))


if __name__ == "__main__":
    main()
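
If a step you need is not covered by the built-in modules, you can write your own by subclassing the stopes module base class and scheduling it the same way. The sketch below is a minimal illustration, assuming the StopesModule and Requirements interface from stopes.core.stopes_module; the module name, the config field input_file, and the resource values are hypothetical, and exact signatures may vary between stopes versions.

from pathlib import Path

from stopes.core.stopes_module import Requirements, StopesModule


class CountLinesModule(StopesModule):
    """Toy module that counts the lines of a text file (hypothetical example)."""

    def requirements(self) -> Requirements:
        # resources the launcher should request for this step
        return Requirements(
            nodes=1,
            tasks_per_node=1,
            gpus_per_node=0,
            cpus_per_task=1,
            timeout_min=60,
        )

    def run(self, iteration_value=None, iteration_index: int = 0):
        # self.config is the Hydra config node passed at construction time
        with Path(self.config.input_file).open() as fh:
            return sum(1 for _ in fh)

Such a module is scheduled from a pipeline exactly like the built-in ones, for example line_count = await launcher.schedule(CountLinesModule(config.count_lines)).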