This tutorial will guide you through conducting systematic benchmarks using fairseq2.
We’ll focus on practical examples using language models, covering:
Training speed benchmarks
Multi-node scaling efficiency
Hyperparameter sweeps
Performance profiling
Note
The examples will use LLaMA models, but the concepts apply to any model architecture.
The fairseq2 CLI is designed to support distributed training across multiple nodes, and it makes it straightforward to sweep hyperparameters across different environments.
Example SLURM Script
#!/bin/bash
#SBATCH --job-name=fairseq2_benchmark
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# List of environments to test
envs=("fairseq2_pt22" "fairseq2_pt24")

# Run benchmarks
for env_name in "${envs[@]}"; do
    conda activate $env_name
    for i in {0..1}; do  # Two runs per environment
        echo "Running $env_name run $i"
        srun fairseq2 lm instruction_finetune \
            --preset llama3_1_70b_instruct \
            --config-file configs/benchmark.yaml \
            --benchmark_outputs/${env_name}/run_${i}  # output directory
    done
    conda deactivate
done
Example benchmark.yaml
# Training config
max_num_steps: 1000
batch_size: 4
max_seq_len: 2048

# Distributed training
data_parallelism: "fsdp"
tensor_parallel_size: 8

# Optimization
optimizer:
  lr: 2e-5
  weight_decay: 0.1
mixed_precision: "static"
dtype: "bfloat16"
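As a rough sanity check on the training-speed numbers such a run produces, you can convert a measured steps-per-second value into an approximate token throughput. The sketch below is back-of-the-envelope arithmetic only; it assumes batch_size is the per-data-parallel-rank batch size and that the 4x8 GPU allocation with tensor_parallel_size: 8 yields 4 data-parallel replicas, and the steps_per_second value is a placeholder to be replaced with a measurement (see the profiling section below).

# Back-of-the-envelope token throughput for the benchmark.yaml above.
# Assumptions (adjust to your setup): batch_size is per data-parallel rank,
# and 32 GPUs with tensor_parallel_size=8 give 4 data-parallel replicas.
batch_size = 4
max_seq_len = 2048
data_parallel_size = 32 // 8   # total GPUs / tensor_parallel_size
steps_per_second = 0.5         # placeholder: measure this, e.g. with the profiler below

tokens_per_step = batch_size * max_seq_len * data_parallel_size
print(f"~{tokens_per_step * steps_per_second:,.0f} tokens/sec (upper bound)")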
fairseq2 provides powerful sweep functionality with its fairseq2.recipes.utils.sweep_tagger.SweepTagger.
It helps ensure:
Consistent directory structure across nodes
Reproducible experiments
Easy comparison of different configurations
For example, when running multi-node training:
#!/bin/bash
#SBATCH --job-name=mt_sweep
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Language pairs to sweep
lang_pairs=("eng-fra" "eng-deu" "eng-spa")

# Run MT sweeps
for pair in "${lang_pairs[@]}"; do
    src_lang=${pair%-*}
    tgt_lang=${pair#*-}

    # fairseq2 CLI will automatically use SweepTagger to create
    # a unique directory based on the config
    srun fairseq2 mt train \
        --preset nllb_600m \
        --config-file configs/mt.yaml \
        --config source_lang=$src_lang target_lang=$tgt_lang \
        --sweep_outputs/  # Base output directory
done
The fairseq2 CLI will:
Parse the config file and command line overrides
Use fairseq2.recipes.utils.sweep_tagger.SweepTagger to generate a unique tag based on sweep keys
Create a subdirectory using this tag under the base output directory
Ensure all nodes write to the same directory structure
If fmt is provided, it is used to generate the tag in a custom, user-defined format
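A minimal sketch of that last point (constructor and argument names mirror the recipe examples below; the exact tag string is illustrative): because the tag is derived only from the preset and the selected config values, every node computes the same subdirectory.

from pathlib import Path

from fairseq2.recipes.utils.sweep_tagger import SweepTagger

# Same preset + same config values => same tag on every node,
# so all ranks resolve to the same sweep subdirectory.
sweep_tagger = SweepTagger(world_size=8, allowed_keys={"source_lang", "target_lang"})

config = {"source_lang": "eng", "target_lang": "fra"}

tag_node_0 = sweep_tagger.generate("nllb_600m", config)
tag_node_3 = sweep_tagger.generate("nllb_600m", config)
assert tag_node_0 == tag_node_3

output_dir = Path("sweep_outputs") / tag_node_0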
Note
Use --no-sweep-dir when you want to disable automatic sweep directory creation. This is useful when:
Running quick tests/debugging
Using custom directory structures
Different recipes support different sweep keys.
The following examples will show how to configure sweep tags for different recipes.
For language models, there are two main finetuning approaches:
Instruction Finetuning (SFT)
from pathlib import Path

from fairseq2.recipes.lm.instruction_finetune import (
    InstructionFinetuneConfig,
    instruction_finetune_presets,
)
from fairseq2.recipes.utils.sweep_tagger import SweepTagger

# Configure LM sweep
sweep_keys = {"batch_size", "max_seq_len", "dtype", "tensor_parallel_size"}

sweep_tagger = SweepTagger(world_size=8, allowed_keys=sweep_keys)

# Example instruction finetuning config
config = {
    "max_num_steps": 1000,
    "batch_size": 4,
    "max_seq_len": 2048,
    "dtype": "bfloat16",
}

# Generate unique tag for this config
tag = sweep_tagger.generate(
    "llama3_1_70b_instruct",
    config,
    fmt="ps_{preset}.ws_{world_size}.{batch_size}_{max_seq_len}_{dtype}",
)

output_dir = Path(f"sweep_outputs/{tag}")
Preference Finetuning (DPO)
from pathlib import Path

from fairseq2.recipes.lm.preference_finetune.dpo import DpoConfig, create_dpo_unit
from fairseq2.recipes.utils.sweep_tagger import SweepTagger

# Configure DPO sweep
sweep_keys = {
    "batch_size",
    "max_seq_len",
    "beta",                            # DPO-specific
    "nll_scale",                       # DPO-specific
    "reference_tensor_parallel_size",  # DPO-specific
    "length_normalization",            # DPO-specific
}

sweep_tagger = SweepTagger(world_size=8, sweep_keys=sweep_keys)

# Example DPO config
config = {
    "max_num_steps": 1000,
    "batch_size": 4,
    "max_seq_len": 2048,
    "beta": 0.1,
    "nll_scale": 0.0,
    "reference_model": "llama3_1_8b_instruct",
    "reference_tensor_parallel_size": 1,
    "length_normalization": False,
}

# Generate unique tag for this config
tag = sweep_tagger.generate("llama3_1_8b_dpo", config)

output_dir = Path(f"sweep_outputs/{tag}")
Example SLURM script for running DPO sweeps:
#!/bin/bash
#SBATCH --job-name=dpo_sweep
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# List of beta values to sweep
betas=(0.1 0.2 0.5)

# Run DPO sweeps
for beta in "${betas[@]}"; do
    srun fairseq2 lm preference_finetune \
        --preset llama3_1_8b_dpo \
        --config-file configs/dpo.yaml \
        --config "beta=$beta" \
        --sweep_outputs/
done
MT recipes include additional sweep keys specific to translation tasks.
Example MT sweep
from pathlib import Path

from fairseq2.recipes.mt.train import load_mt_trainer, mt_train_presets
from fairseq2.recipes.utils.sweep_tagger import SweepTagger

# Configure MT sweep
sweep_keys = {
    "lr",
    "weight_decay",
    "source_lang",  # MT-specific
    "target_lang",  # MT-specific
    "max_seq_len",
    "batch_size",
}

sweep_tagger = SweepTagger(world_size=8, sweep_keys=sweep_keys)

# Example MT config
config = {
    "source_lang": "eng",
    "target_lang": "fra",
    "optimizer_config": {
        "lr": 2e-5,
        "weight_decay": 0.1,
    },
}

# Generate unique tag for this config
tag = sweep_tagger.generate("nllb_600m", config)

output_dir = Path(f"sweep_outputs/{tag}")
Speech models also have their own set of sweep parameters:
Example wav2vec2 sweep
from pathlib import Path

from fairseq2.models.wav2vec2.asr import wav2vec2_asr_archs
from fairseq2.recipes.utils.sweep_tagger import SweepTagger

# wav2vec2-specific sweep keys
sweep_keys = {
    "freeze_encoder_for_n_steps",
    "max_audio_len",
    "min_audio_len",
    "normalize_audio",
}

sweep_tagger = SweepTagger(world_size=8, allowed_keys=sweep_keys)

# Example wav2vec2 config
config = {
    "freeze_encoder_for_n_steps": 1_000,
    "max_audio_len": 100_000,
    "min_audio_len": 1_000,
    "normalize_audio": True,
}

# Generate unique tag for this config
tag = sweep_tagger.generate(
    "wav2vec2_base",
    config,
    fmt="ps_{preset}.ws_{world_size}.mal_{max_audio_len}.minal_{min_audio_len}.norm_{normalize_audio}",
)

output_dir = Path(f"sweep_outputs/{tag}")
fairseq2 uses PyTorch’s profiler to help analyze performance bottlenecks.
Profiler results are saved in TensorBoard format in the output directory, which lets you inspect your model's performance in detail and is also useful for gathering performance metrics during hyperparameter sweeps.
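For orientation, here is a minimal standalone sketch of the underlying torch.profiler API that fairseq2 builds on; it is not fairseq2's internal integration, and the profile_outputs/tb output path is a hypothetical choice.

import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])

model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Skip a few steps, warm up, then record a handful of steps; the trace handler
# writes the results in TensorBoard format.
with profile(
    activities=activities,
    schedule=schedule(skip_first=4, wait=2, warmup=2, active=4),
    on_trace_ready=tensorboard_trace_handler("profile_outputs/tb"),  # hypothetical path
) as prof:
    for step in range(16):
        x = torch.randn(8, 1024, device=device)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiler schedule once per training step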
Analysis of Profiler Results
To visualize the results, start TensorBoard with the output directory as its log directory, e.g. tensorboard --logdir <output_dir>.
You can also plot the results in a customized way for your own analysis:
from tensorboard.backend.event_processing import event_accumulator
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


def parse_tensorboard(path, scalars):
    ea = event_accumulator.EventAccumulator(
        path,
        size_guidance={event_accumulator.SCALARS: 0},
    )
    ea.Reload()
    return {k: pd.DataFrame(ea.Scalars(k)) for k in scalars}


def analyze_performance(log_dir):
    # Parse metrics
    metrics = parse_tensorboard(log_dir, ["Wall Time"])  # or "Elements per Second", "Elapsed Time"

    # Calculate statistics
    wall_time = metrics["Wall Time"]
    steps_per_second = len(wall_time) / wall_time["value"].sum()

    # Visualize
    plt.figure(figsize=(10, 6))
    sns.lineplot(data=wall_time, x="step", y="value")
    plt.title("Training Wall Time per Step")
    plt.show()

    return steps_per_second
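For example, to compare the two conda environments from the benchmark script earlier, you could aggregate steps-per-second across runs. The benchmark_outputs/<env_name>/run_<i>/tb layout and the tb subdirectory name are assumptions here; adjust them to wherever your runs write their TensorBoard logs.

from pathlib import Path

# Hypothetical layout matching the benchmark SLURM script above:
# benchmark_outputs/<env_name>/run_<i>/tb holds the TensorBoard event files.
results = {}
for env_dir in sorted(Path("benchmark_outputs").iterdir()):
    run_speeds = [
        analyze_performance(str(run_dir / "tb"))
        for run_dir in sorted(env_dir.glob("run_*"))
    ]
    results[env_dir.name] = sum(run_speeds) / len(run_speeds)

for env_name, steps_per_second in results.items():
    print(f"{env_name}: {steps_per_second:.2f} steps/sec")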