.. _tutorial-preference-optimization:

======================================
:octicon:`law` Preference Optimization
======================================

.. dropdown:: What you will learn
    :icon: multi-select
    :animate: fade-in

    * How to run the preference optimization recipe

    * How to customize the criterion units and use DPO/CPO/ORPO/SimPO

    * How to run preference optimization with multiple nodes in fairseq2

.. dropdown:: Prerequisites
    :icon: multi-select
    :animate: fade-in

    * Get familiar with fairseq2 basics (:ref:`basics-overview`)

    * Ensure you have fairseq2 installed (:ref:`installation`)

    * Get familiar with the tutorial on end-to-end fine-tuning (:ref:`tutorial-end-to-end-fine-tuning`)

Overview
--------

#. **Prepare**

   * Download the `LLaMA3.1 8B instruction finetuned model`_ from HuggingFace

   * Download the `gsm8k data`_ prepared for this tutorial

#. **Fine-Tune**

   * One simple command to run the preference optimization recipe

   * Choose between different preference optimization methods (DPO/CPO/ORPO/SimPO)

   * Customize the configuration for each method

   * Accelerate the training with multiple nodes

#. **Generate**

   * One simple command to generate from the finetuned model

   * Since this step is similar to what has been covered in :ref:`tutorial-end-to-end-fine-tuning`, we will not elaborate on it here.

Prepare
-------

Model
^^^^^

Follow the `HuggingFace Models Tutorial`_ to download the `LLaMA3.1 8B instruction finetuned model`_, which can be run on Volta 32GB GPUs.
Once you have the model in your local path (e.g. ``/models/Llama-3.1-8B-Instruct/original/consolidated.00.pth``),
you need to register it in a YAML card so that fairseq2 knows where to pull the model from (read more about :ref:`basics-assets`). To do that:

* Create a YAML file (e.g. ``my_llama3_1_8b_instruct.yaml``) with the following content:

  .. code-block:: yaml

      name: llama3_1_8b_instruct@user
      checkpoint: "/models/Llama-3.1-8B-Instruct/original/consolidated.00.pth"
      tokenizer: "/models/Llama-3.1-8B-Instruct/original/tokenizer.model"

  .. tip::

      The ``@user`` suffix specifies that this card belongs to your personal environment. It can also be extended to help resolve different domain names for your clusters.

* Save the file in one of the following locations:

  * `Option 1`: Place it in the default fairseq2 asset directory

    * ``mkdir -p ~/.config/fairseq2/assets``
    * ``mv my_llama3_1_8b_instruct.yaml ~/.config/fairseq2/assets/``

  * `Option 2`: Specify a custom directory and point ``FAIRSEQ2_USER_ASSET_DIR`` to it

    * ``export FAIRSEQ2_USER_ASSET_DIR=/path/to/custom/asset/directory``
    * ``mv my_llama3_1_8b_instruct.yaml /path/to/custom/asset/directory/``

You can check out the predefined fairseq2 LLaMA model cards `here`_.

Dataset
^^^^^^^

Follow the `HuggingFace Datasets Tutorial`_ to download the `gsm8k data`_ (already formatted for fairseq2) to your local path (e.g. ``/datasets/facebook/fairseq2-lm-gsm8k/``).
We will use ``dpo/train.jsonl`` to fine-tune the model and ``test/test.jsonl`` for evaluation.

Fine-Tune
---------

One-Liner
^^^^^^^^^

Running the preference optimization recipe is as simple as:

.. code-block:: bash

    fairseq2 lm preference_finetune $OUTPUT_DIR --config \
        dataset.path=/datasets/facebook/fairseq2-lm-gsm8k/dpo \
        model.name=llama3_1_8b_instruct \
        trainer.dtype=float16 \
        regime.num_steps=1000 \
        regime.num_data_epochs=20 \
        regime.checkpoint_every_n_steps=1000

By default, DPO (direct preference optimization) is applied (``--config criterion.name=dpo``). The use of the other methods (CPO/ORPO/SimPO) is documented below.
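As background (this is the objective from the original DPO paper linked in the method list below, not a fairseq2-specific formulation; fairseq2 additionally exposes an optional NLL term via ``nll_scale`` that is not shown here), DPO trains the policy :math:`\pi_\theta` to prefer the chosen completion :math:`y_w` over the rejected completion :math:`y_l` while staying close to a frozen reference model :math:`\pi_\mathrm{ref}`. The ``beta`` field described below corresponds to :math:`\beta`, and :math:`\sigma` is the logistic sigmoid:

.. math::

    \mathcal{L}_\mathrm{DPO} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\mathrm{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\mathrm{ref}(y_l \mid x)} \right) \right]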
The configuration fields are detailed in :ref:`basics-recipe`. The fields follow a nested structure, where each field is a key-value pair. In the example above, we override fields in several config sections, including ``dataset``, ``model``, ``trainer``, and ``regime``.

Dumping Configuration
^^^^^^^^^^^^^^^^^^^^^

For a quick overview of all the sections and fields, you can use the ``--dump-config`` option:

.. code-block:: bash

    fairseq2 lm preference_finetune --dump-config
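If you would rather start from these defaults than write a preset file by hand, one possible workflow is to redirect the dump into a file, edit it, and pass it back with ``--config-file`` (used throughout the examples below). This is only a sketch: it assumes ``--dump-config`` writes the YAML to standard output, and ``my_preset.yaml`` is an arbitrary file name, so verify the behavior on your installation:

.. code-block:: bash

    # Sketch: capture the recipe defaults as a starting point for a preset file.
    # Assumes --dump-config prints the configuration to stdout.
    fairseq2 lm preference_finetune --dump-config > my_preset.yaml

    # Edit my_preset.yaml as needed, then launch the recipe with it:
    fairseq2 lm preference_finetune $OUTPUT_DIR --config-file my_preset.yaml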
Preference Optimization Methods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

fairseq2 supports four different preference optimization methods:

1. **DPO (Direct Preference Optimization)**

   * Paper: https://arxiv.org/abs/2305.18290
   * Key configuration parameters:

     - ``beta``: Coefficient of regularization towards the reference model (default: 0.1)
     - ``nll_scale``: Coefficient of the NLL loss (default: 0.0)
     - ``length_normalization``: Whether to use length-normalized rewards (default: False)
     - ``reference_model``: Name of the reference model (default: ``llama3_1_8b_instruct``)
     - ``reference_dtype``: Data type of the reference model (default: ``float16``)

   .. dropdown:: Example preset for DPO
      :icon: code
      :animate: fade-in

      Here's an example preset config for DPO:

      .. code-block:: yaml

         # dpo.yaml
         model:
           _set_:
             name: llama3_1_8b_instruct
         dataset:
           _set_:
             name: gsm8k_dpo
             path: null
             family: generic_preference
             source_encode_mode: prompt
             target_encode_mode: prompt_response
             mask_source_tokens: true
             min_seq_len: 1
             max_seq_len: 8192
             max_num_tokens: 16384
             batch_size: null
             example_shuffle_window: 10000
             batch_shuffle_window: 1000
             num_prefetch: 4
             extras: {}
         criterion:
           _set_:
             name: dpo
             config:
               reference_model:
                 name: llama3_1_8b_instruct
               reference_dtype: bfloat16
               beta: 0.1
               nll_scale: 0.0
               length_normalization: false
         gang:
           _set_:
             tensor_parallel_size: 1
             timeout: 15
             high_priority: true
             monitored: false
         trainer:
           _set_:
             dtype: bfloat16
             data_parallelism: fsdp
             mixed_precision: static
             gradient_accumulation: 1
             activation_checkpointing: true
             max_gradient_norm: null
             fp16_loss_scale:
             - 128.0
             - 0.0001
             torch_compile: false
             profile: null
             gradient_check: false
             anomaly_detection: false
           fsdp:
             _set_:
               version: v1
               granularity: layer
               hsdp: false
               reshard_after_forward: true
               fp32_reduce: true
         optimizer:
           _set_:
             name: adamw
           config:
             _set_:
               lr: 5.5e-06
               betas:
               - 0.9
               - 0.95
               eps: 1.0e-08
               weight_decay: 0.1
               amsgrad: false
               maximize: false
               capturable: false
               differentiable: false
               impl: auto
               use_fp32: false
         lr_scheduler:
           _set_:
             name: cosine_annealing
           config:
             _set_:
               cycle_len: null
               num_warmup_steps: 0
               cycle_mul: 1.0
               lr_mul: 1.0
               start_lr: 0.0
               final_lr: null
               final_lr_scale: 0.2
         regime:
           _set_:
             num_steps: 5000
             num_data_epochs: null
             score_metric: null
             lower_score_better: false
             validate_after_n_steps: 0
             validate_every_n_steps: null
             validate_after_n_data_epochs: 0
             validate_every_n_data_epochs: null
             checkpoint_after_n_steps: 0
             checkpoint_every_n_steps: 1000
             checkpoint_after_n_data_epochs: 0
             checkpoint_every_n_data_epochs: null
             keep_last_n_checkpoints: 1
             keep_best_n_checkpoints: null
             keep_last_n_models: null
             keep_best_n_models: null
             publish_metrics_after_n_steps: 0
             publish_metrics_every_n_steps: 10
             publish_metrics_after_n_data_epochs: 0
             publish_metrics_every_n_data_epochs: null
         common:
           _set_:
             seed: 2
           metric_recorders:
             log:
               _set_:
                 enabled: true
             jsonl:
               _set_:
                 enabled: true
             tensorboard:
               _set_:
                 enabled: true
             wandb:
               _set_:
                 enabled: false
                 project: null
                 run: null
           profilers:
             torch:
               _set_:
                 enabled: false
                 skip_n_steps: 4
                 wait_n_steps: 0
                 num_warmup_steps: 1
                 num_active_steps: 4
                 repeat: 1
           assets:
             _set_:
               extra_path: null
               checkpoint_dir: null

   Then, to run the preference finetuning recipe with the DPO unit:

   .. code-block:: bash

      fairseq2 lm preference_finetune $OUTPUT_DIR --config-file /path/to/dpo.yaml

2. **CPO (Contrastive Preference Optimization)**

   * Paper: https://arxiv.org/abs/2401.08417
   * Key configuration parameters:

     - ``beta``: Coefficient for the preferred vs. dispreferred sequences (default: 1.0)
     - ``nll_scale``: Coefficient of the NLL loss (default: 1.0)

   .. dropdown:: Example preset for CPO
      :icon: code
      :animate: fade-in

      Here's an example preset config for CPO:

      .. code-block:: yaml

         # cpo.yaml
         model:
           _set_:
             name: llama3_1_8b_instruct
         dataset:
           _set_:
             path: /datasets/facebook/fairseq2-lm-gsm8k/dpo
             batch_size: 1
         criterion:
           _set_:
             name: cpo
             config:
               beta: 0.1
               nll_scale: 0.0

   Then, to run the preference finetuning recipe with the CPO unit:

   .. code-block:: bash

      fairseq2 lm preference_finetune $OUTPUT_DIR --config-file /path/to/cpo.yaml

3. **ORPO (Odds Ratio Preference Optimization)**

   * Paper: https://arxiv.org/abs/2403.07691
   * Key configuration parameters:

     - ``orpo_lambda``: Coefficient of the odds-ratio component (default: 1.0)
     - ``nll_scale``: Coefficient of the NLL loss (default: 1.0)

   .. dropdown:: Example preset for ORPO
      :icon: code
      :animate: fade-in

      Here's an example preset config for ORPO:

      .. code-block:: yaml

         # orpo.yaml
         model:
           _set_:
             name: llama3_1_8b_instruct
         dataset:
           _set_:
             path: /datasets/facebook/fairseq2-lm-gsm8k/dpo
             batch_size: 1
         criterion:
           _set_:
             name: orpo
             config:
               nll_scale: 0.0
               orpo_lambda: 0.1

   Then, to run the preference finetuning recipe with the ORPO unit:

   .. code-block:: bash

      fairseq2 lm preference_finetune $OUTPUT_DIR --config-file /path/to/orpo.yaml

4. **SimPO (Simple Preference Optimization)**

   * Paper: https://arxiv.org/abs/2405.14734
   * Key configuration parameters (see the background formula right after this list):

     - ``beta``: Scaling coefficient of the length-normalized rewards (default: 1.0)
     - ``gamma``: Target reward margin between the chosen and rejected completions (default: 0.5)
     - ``nll_scale``: Coefficient of the NLL loss (default: 0.0)

   .. dropdown:: Example preset for SimPO
      :icon: code
      :animate: fade-in

      Here's an example preset config for SimPO:

      .. code-block:: yaml

         # simpo.yaml
         model:
           _set_:
             name: llama3_1_8b_instruct
         dataset:
           _set_:
             path: /datasets/facebook/fairseq2-lm-gsm8k/dpo
             batch_size: 1
         criterion:
           _set_:
             name: simpo
             config:
               beta: 2
               nll_scale: 0.0

   Then, to run the preference finetuning recipe with the SimPO unit:

   .. code-block:: bash

      fairseq2 lm preference_finetune $OUTPUT_DIR --config-file /path/to/simpo.yaml
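As background (this is the objective from the SimPO paper linked above, not a fairseq2-specific formulation), SimPO is reference-free: it compares length-normalized log-probabilities of the chosen completion :math:`y_w` and the rejected completion :math:`y_l` and requires their margin to exceed a target :math:`\gamma`. The ``beta`` and ``gamma`` parameters above correspond to :math:`\beta` and :math:`\gamma`:

.. math::

    \mathcal{L}_\mathrm{SimPO} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right]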
Iterative Training
^^^^^^^^^^^^^^^^^^

Sometimes you may want to continue fine-tuning from a previously trained checkpoint, either to:

- Resume interrupted training
- Fine-tune on additional data
- Perform iterative fine-tuning with different hyperparameters

fairseq2 provides a clean way to handle this through the checkpoint system (learn more about :ref:`basics-ckpt-management`):

.. code-block:: bash

    # model.name=last_checkpoint picks up the most recent checkpoint
    fairseq2 lm preference_finetune $OUTPUT_DIR --config \
        common.assets.checkpoint_dir=/path/to/checkpoint \
        model.name=last_checkpoint \
        dataset.path=/path/to/data

.. dropdown:: To pick up a specific checkpoint
    :icon: code
    :animate: fade-in

    .. code-block:: bash

        CKPT_DIR="/checkpoint/user/experiments/run_0/checkpoints"
        CKPT="checkpoint_step_1000"  # e.g. checkpoint of step 1000

        fairseq2 lm preference_finetune $OUTPUT_DIR --config \
            common.assets.checkpoint_dir=$CKPT_DIR \
            model.name=$CKPT \
            dataset.path=/path/to/new/data \
            dataset.max_num_tokens=4096 \
            trainer.dtype=float16

.. note::

    If you want to pick a specific checkpoint instead of the last one, the ``model.name`` parameter must be set to ``checkpoint_step_X``, where X matches the step number of the checkpoint you want to load.

Multi-Node
^^^^^^^^^^

To help accelerate the training, fairseq2 can automatically detect a multi-node setup.

- `Option 1`: Slurm

  .. code-block:: bash

      srun --nodes=2 --ntasks-per-node=8 \
          fairseq2 lm preference_finetune $OUTPUT_DIR \
          ...

- `Option 2`: Torchrun

  .. code-block:: bash

      torchrun --standalone --nproc-per-node 8 --no-python \
          fairseq2 lm preference_finetune $OUTPUT_DIR \
          ...

Generate
--------

Once the training has finished, the model checkpoints can be found in ``$OUTPUT_DIR/checkpoints``. With that, we can now generate over the test dataset.

You can either use the fairseq2 native generation recipe:

.. code-block:: bash

    CKPT_DIR="/checkpoint/$USER/my_experiment/checkpoints"
    CKPT="last_checkpoint"
    SAVE_DIR="/checkpoint/$USER/my_experiment/generations"
    DATASET="/datasets/facebook/fairseq2-lm-gsm8k/test/test.jsonl"

    fairseq2 lm generate $SAVE_DIR --no-sweep-dir --config \
        common.assets.checkpoint_dir=$CKPT_DIR \
        model.name=$CKPT \
        seq_generator.config.temperature=0.1 \
        dataset.path=$DATASET

Or accelerate with vLLM:

.. code-block:: python

    from vllm import LLM

    llm = LLM(
        model="/path/to/your/model",          # path of your model
        tokenizer="/path/to/your/tokenizer",  # path of your tokenizer files
    )
    output = llm.generate("Hello, my name is")
    print(output)

To keep this tutorial brief, please refer to :ref:`tutorial-end-to-end-fine-tuning` for more details on the generation step.

See Also
--------

- :doc:`Design Philosophy `
- :doc:`Recipe `
- :doc:`CLI `
- :doc:`Assets `
- :ref:`tutorial-end-to-end-fine-tuning`

.. _LLaMA3.1 8B instruction finetuned model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/tree/main
.. _gsm8k data: https://huggingface.co/datasets/facebook/fairseq2-lm-gsm8k
.. _here: https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/assets/cards/models/llama.yaml
.. _HuggingFace Models Tutorial: https://huggingface.co/docs/hub/en/models-downloading
.. _HuggingFace Datasets Tutorial: https://huggingface.co/docs/hub/en/datasets-downloading
.. _HF script: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
.. _VLLM documentation: https://vllm.readthedocs.io/en/latest/