# NLLB Monolingual Pipeline
This is the monolingual "cleaning" pipeline. It does a few things:

- split paragraphs into sentences
- run Moses normalization and cleaning on the sentences
- filter out sentences that do not match some criteria (length, character ratios, etc.)
- run script detection at the sentence level; if the script doesn't match the expected lang, throw the sentence out
- run LID detection at the sentence level; if the predicted lang doesn't match the expected lang, throw the sentence out
- deduplicate sentences (this is done by sorting them)
The core filtering is in `monolingual_line_processor.py` and `utils/text_filter.py`; the sketch below gives the gist of the per-sentence checks.
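For a concrete feel of those checks, here is a minimal sketch. The thresholds, the LID model path, and the `__label__<lang>` label format are assumptions (the label format is fasttext's default, not necessarily what the pipeline uses); the real logic lives in the two files above.

```python
import fasttext  # assumes the `fasttext` package is installed

# Hypothetical thresholds -- the real values live in the pipeline config.
MIN_CHARS, MAX_CHARS = 10, 500
MIN_ALPHA_RATIO = 0.5


def keep_sentence(sentence: str, expected_lang: str, lid_model) -> bool:
    """Sentence-level length, character-ratio, and LID checks (sketch only)."""
    # length criteria
    if not MIN_CHARS <= len(sentence) <= MAX_CHARS:
        return False
    # character-ratio criteria: require the sentence to be mostly letters
    alpha_ratio = sum(c.isalpha() for c in sentence) / len(sentence)
    if alpha_ratio < MIN_ALPHA_RATIO:
        return False
    # LID: drop the sentence when the predicted lang disagrees with the expected one
    labels, _scores = lid_model.predict(sentence.replace("\n", " "), k=1)
    return labels[0] == f"__label__{expected_lang}"


def dedup(sentences):
    """Deduplicate by sorting: identical sentences become adjacent,
    so a single linear pass drops the repeats."""
    previous = None
    for sentence in sorted(sentences):
        if sentence != previous:
            yield sentence
        previous = sentence


# Usage sketch (model path is a placeholder):
# lid_model = fasttext.load_model("/path/to/lid_model.bin")
# clean = [s for s in dedup(sentences) if keep_sentence(s, "umb", lid_model)]
```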
## Run it
```bash
python monolingual_pipeline.py data_dir=yourdatahere langs='[umb,ssw]'
```

should be enough to get it running.
- `data_dir` is where the raw data is; it should have subfolders per lang and files named with the pattern `corpus_name.lang.xz`
- `langs` is an array of langs to process in this run
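For example, with the command above and a hypothetical corpus called `mycorpus`, `data_dir` would look like:

```
yourdatahere/
├── umb/
│   └── mycorpus.umb.xz
└── ssw/
    └── mycorpus.ssw.xz
```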
## Useful overrides
- `launcher.cluster=local local_tmp_dir=/tmp/monolingual` if you want to run this locally instead of on slurm
- `preproces_requirements.cpus_per_task=40` is the number of CPUs used to process each lang file in a slurm job. Higher means it will go faster, but you'll have a harder time getting a machine from the queue.
- `corpus_filter=yourcorpus` filters the lang files you'll process to only work on a specific corpus
- `input_file_glob_template` replace this if the files in your `data_dir` do not follow the expected template
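These overrides compose on the command line. For instance, a local run restricted to one corpus (reusing the hypothetical names from above) could look like:

```bash
python monolingual_pipeline.py data_dir=yourdatahere langs='[umb,ssw]' \
    launcher.cluster=local local_tmp_dir=/tmp/monolingual \
    corpus_filter=mycorpus
```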
See `monolingual.yaml` for more possible configurations.
## Outputs
The run will be started with a custom working directory that follows the pattern `outputs/{date}/{start_time}`, and all the logs will go there (including `executor_logs` from slurm jobs). By default, the data output is set in `monolingual.yaml` to `output_dir: .`. This means that the outputs will go to lang dirs in the working directory, and will end up in different places depending on the day/time you start the run. This is useful for testing, but if you want to output somewhere else (like a central clean monolingual repo), override `output_dir=/somethingstable/` when starting the run.
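For example, to send the cleaned data to a stable location instead of the per-run working directory:

```bash
python monolingual_pipeline.py data_dir=yourdatahere langs='[umb,ssw]' \
    output_dir=/somethingstable/
```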
## Logging
The run will log to the wandb monolingual dashboard. Go to wandb and make sure to enable grouping; choose to group by "group" and "lang". There will be one sub-run per process (see `cpus_per_task` above) per lang, plus a global run for the root script. The global run will have a funny name and will only report data at the end of everything. You can check progress in each sub-run.