Getting started with mining
Welcome to stopes
, this is a quickstart guide to discover how to run automated pipelines with stopes
. In this example, you'll be running
global mining with the stopes
toolchain. (Inspired by
CCMatrix).
Installation
Follow the installation steps from the project's README, we recommend doing this in a separate conda environment.
Getting Data
To run the global mining pipeline, you first need to get some monolingual data. The WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages has some interesting monolingual data for some African languages.
You also need some trained encoder, we usually use stopes
with LASER and we can
find such trained encoders for the languages in the WMT22 shared task too.
The demo/mining/prepare.sh
script will download the monolingual data and LASER encoders
for you. Start by running this script and wait for the download to finish.
prepare.sh
was built for the quickstart demo, it will only download the encoders and data released as part of the African Languages workshop. The NLLB project has released many encoders and data that you can leverage with stopes
.
Configuring the pipeline
In stopes
pipelines, we use hydra to configure the runs.
With hydra, you can configure everything with "overrides" on the cli, but it's
often easier to put the configurations in yaml files as there is a lot of things
to setup.
stopes/pipelines/bitext/conf/preset/demo.yaml
is a demo configuration for the
data and encoders that we've downloaded in the previous steps. Check out the
comments in that file.
The important parts of that preset config is:
- we setup the launcher to run on your local computer (no need for a cluster)
- we setup an alias for a
demo_dir
folder, so you can point to the data/models from the cli - we setup some information about the
data
:- some naming, to get nice file names as outputs
- where the data is found (with
shard_glob
)
- we tell the pipeline where to find the encoder and SentencePiece model (SPM) uses
to embed the text. We do that for each lang in
lang_configs
. Practically, if you are only processing a few languages, you don't need so many entries, here we preset them for all languages from the WMT22 task
Speech Mining
If you are interested in SONAR multimodal mining, check the Speech Mining page for documentation.
Language codes are important, but not standardized everywhere. The stopes
library does not make any assumptions to what codes you are using. As you will see in the configuration and the prepare script, codes are mostly important in naming data input files. If you want to use a different coding scheme, make sure that the files and config use the same naming conventions.
Run the Pipeline
You can now start the pipeline with:
python -m stopes.pipelines.bitext.global_mining_pipeline src_lang=fuv tgt_lang=zul demo_dir=.../stopes-repo/demo/mining +preset=demo output_dir=. embed_text=laser3
src_lang
andtgt_lang
specify the pair of languages we want to process,demo_dir
is the new variable we introduce in our preset/demo.yaml file, to point to where theprepare.sh
script downloaded our data; make sure to specify an absolute path,+preset=demo
tells hydra to load the demo.yaml preset file to set our defaults (the+
here is because we are telling hydra to append a group that doesn't exist in the default config, see the hydra doc for details),output_dir
specifies where we want the output (current run directory),embed_text=laser3
tells the pipeline to use the laser3 encoding code to load the models and encode the text.
Try using a different encoder
In the previous run, we used embed_text=laser3
, which will encode text with
the language specific laser3, but you can also use other encoders. For instance,
stopes ships with HuggingFace
sentence-transformers, so you can
use different encoders if you want to experiment.
You need to install sentence-transformers
in your environment:
pip install --user sentence-transformers
Here is an example to run LaBSE:
python -m stopes.pipelines.bitext.global_mining_pipeline src_lang=fuv tgt_lang=zul demo_dir=.../stopes-repo/demo/mining +preset=demo output_dir=. embed_text=hf_labse lang_configs=null embedding_dimensions=768