This repository contains the code for RoSTE, introduced in our work: "RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models".
Install the environment
git clone https://github.com/Hong-Lab-UMN-ECE/RoSTE
cd RoSTE
conda create -n roste python==3.10 -y
conda activate roste
pip install -r requirements.txt
Install the Fast Hadamard Transform package
git clone https://github.com/Dao-AILab/fast-hadamard-transform
cd fast-hadamard-transform
pip install .
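After installation, an optional quick import check verifies that the CUDA extension built correctly; the Python module is assumed to be named fast_hadamard_transform (matching the package name):

python -c "import fast_hadamard_transform; print('fast_hadamard_transform imported successfully')"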
We provide the details of the TL;DR summarization experiments for the Pythia and Qwen models.
Dataset: TL;DR
Models: Pythia (e.g., pythia-1b-deduped) and Qwen2.5 (e.g., Qwen2.5-0.5B, Qwen2.5-7B)
To preserve computational invariance under rotation, we transform Pythia models into a Llama/Qwen-style architecture. Specifically, we modify the Q, K, and V linear layers of the MHSA module in Pythia. The transformed Pythia models are also compatible with QuaRot and SpinQuant.
python convert_pythia_to_llama_format.py --legacy_model_dir EleutherAI/pythia-1b-deduped --new_model_dir ./save/pythia-1b/ckpt/pythia-1b-deduped-new
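For intuition, the key step in this conversion is splitting GPT-NeoX's fused query_key_value projection into separate q_proj / k_proj / v_proj weights. Below is a minimal sketch of that step; split_fused_qkv is an illustrative helper name (not from the repository), and it assumes the Hugging Face GPT-NeoX per-head interleaved QKV layout. The actual conversion script also handles biases, config fields, and the remaining architectural differences.

import torch

@torch.no_grad()
def split_fused_qkv(qkv_weight, num_heads, head_dim, hidden_size):
    # GPT-NeoX (Pythia) stores Q, K, V in one fused linear whose output rows are
    # interleaved per head: [head0_q, head0_k, head0_v, head1_q, ...].
    # Reshape to (num_heads, 3, head_dim, hidden_size) and slice out each part.
    w = qkv_weight.view(num_heads, 3, head_dim, hidden_size)
    w_q = w[:, 0].reshape(num_heads * head_dim, hidden_size)
    w_k = w[:, 1].reshape(num_heads * head_dim, hidden_size)
    w_v = w[:, 2].reshape(num_heads * head_dim, hidden_size)
    return w_q, w_k, w_v  # drop-in weights for Llama/Qwen-style q_proj, k_proj, v_proj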
We keep R1 as an offline, mergeable rotation, while R2, R3, and R4 are applied as online rotations during training. Before inference, R2 can also be merged into the weights.
We first fuse the normalization layers (LayerNorm / RMSNorm) into the adjacent weights and then apply R1.
python rotate_model_r1.py --model_dir Qwen/Qwen2.5-0.5B --is_tldr_data --is_rotate_R1 --is_save --rotated_model_dir ./save/qwen2.5-0.5b/ckpt/qwen2.5-0.5b-r1
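For reference, merging an orthogonal R1 into the weights amounts to the standard computational-invariance bookkeeping used by QuaRot/SpinQuant. The sketch below is illustrative only (merge_r1 is a hypothetical helper, not the actual RoSTE code); it assumes a Llama/Qwen-style module layout, that R1 is an orthogonal matrix in float64, and that the norm scales have already been fused into the adjacent linear layers.

import torch

@torch.no_grad()
def merge_r1(model, R1):
    # Rotate the residual stream x -> x @ R1. To keep the network's output unchanged,
    # every weight reading from the residual stream is right-multiplied by R1
    # (W -> W @ R1, since nn.Linear computes x @ W.T), and every weight writing
    # back to it is left-multiplied by R1.T (W -> R1.T @ W).
    def rotate_input(linear):
        linear.weight.data = (linear.weight.data.double() @ R1).to(linear.weight.dtype)

    def rotate_output(linear):
        linear.weight.data = (R1.T @ linear.weight.data.double()).to(linear.weight.dtype)

    # Embedding rows live in the residual stream, so they are rotated directly.
    # (For models with tied embeddings, untie the weights before rotating.)
    emb = model.model.embed_tokens
    emb.weight.data = (emb.weight.data.double() @ R1).to(emb.weight.dtype)
    for layer in model.model.layers:
        for m in (layer.self_attn.q_proj, layer.self_attn.k_proj, layer.self_attn.v_proj,
                  layer.mlp.gate_proj, layer.mlp.up_proj):
            rotate_input(m)
        for m in (layer.self_attn.o_proj, layer.mlp.down_proj):
            rotate_output(m)
    rotate_input(model.lm_head)  # lm_head also reads the rotated residual stream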
Then, we compute the quantization errors.
python rotate_model_r234_quant_error.py --model_dir ./save/qwen2.5-0.5b/ckpt/qwen2.5-0.5b-r1 --output_folder ./rotation_config/qwen/
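Conceptually, the logged quantization error measures how much each tensor changes under fake quantization. Below is a minimal illustration with symmetric 4-bit per-channel round-to-nearest (RTN) weight quantization; the actual script's bit-width, granularity, and error metric may differ.

import torch

def fake_quantize_rtn(w, n_bits=4):
    # Symmetric per-output-channel round-to-nearest fake quantization of a 2D weight.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def quant_error(w, n_bits=4):
    # Mean squared error between the weight and its fake-quantized version.
    return (w - fake_quantize_rtn(w, n_bits)).pow(2).mean().item()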
Next, we search for the optimal rotation configuration based on the two quantization error logs.
python rotate_model_r234_search_config.py --output_folder ./rotation_config/qwen/
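The idea behind the search is simple: for each rotation site, compare the two logged errors (with and without the online rotation) and keep the option with the smaller error. The sketch below is hypothetical and assumes the logs have already been parsed into per-layer dictionaries; the real script's log format and decision rule may be more involved.

def search_rotation_config(err_no_rot, err_rot):
    # err_no_rot / err_rot: dicts mapping a layer name to its logged quantization error.
    config = {}
    for name in err_no_rot:
        config[name] = bool(err_rot[name] < err_no_rot[name])  # True => apply online rotation
    return config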
We provide three training methods: SFT, QA-SFT with STE, and QA-SFT with RoSTE.
SFT
accelerate launch \
--config_file configs/ds_z3.yaml \
train_sft.py \
--config configs/recipes/qwen2.5_7b_sft.yaml
QA-SFT with STE
accelerate launch \
--config_file configs/ds_z3.yaml \
train_qa_sft_ste.py \
--config configs/recipes/qwen2.5_7b_qa_sft_ste.yaml
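For background, QA-SFT with STE fake-quantizes the weights in the forward pass and passes gradients straight through the non-differentiable rounding in the backward pass. The following is a minimal sketch of that estimator, not the training script's actual quantizer.

import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, n_bits=4):
        # Symmetric per-channel round-to-nearest fake quantization.
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        return (w / scale).round().clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat rounding as the identity in the backward pass.
        return grad_output, None

# Usage inside a quantization-aware linear layer:
#   w_q = FakeQuantSTE.apply(self.weight, 4)
#   out = torch.nn.functional.linear(x, w_q, self.bias)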
QA-SFT with RoSTE
accelerate launch \
--config_file configs/ds_z3.yaml \
train_qa_sft_roste.py \
--config configs/recipes/qwen2.5_7b_qa_sft_roste.yaml
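RoSTE differs from plain STE in that the online rotations selected above (R2/R3/R4) are applied before quantization, so the quantizer sees rotated, outlier-reduced tensors. The schematic below builds on the FakeQuantSTE sketch above and uses a dense orthonormal Hadamard matrix purely for illustration; in practice the online rotations would use the fast-hadamard-transform kernels installed earlier.

import math
import torch

def hadamard_matrix(n, dtype=torch.float32):
    # Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two).
    H = torch.ones(1, 1, dtype=dtype)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / math.sqrt(n)

def rotated_fake_quant(x, n_bits=4):
    # Rotate the last dimension, fake-quantize with STE gradients, then rotate back,
    # i.e., quantize in the rotated (outlier-reduced) basis.
    H = hadamard_matrix(x.shape[-1], dtype=x.dtype).to(x.device)
    return FakeQuantSTE.apply(x @ H, n_bits) @ H.T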
We evaluate the final models on the TL;DR test set. The evaluation script supports multi-GPU inference.
accelerate launch eval_tldr.py --model_dir Qwen/Qwen2.5-0.5B --method base --batch_size 8
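To evaluate a fine-tuned checkpoint rather than the base model, point --model_dir at the saved checkpoint directory and adjust --method accordingly.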
Our implementation builds upon the open-source projects TL;DR Summarization, Tulu 3, and Hugging Face TRL. The rotation implementation is based on QuaRot and SpinQuant. We sincerely appreciate these teams' contributions to open-source research and development.
If you find our work useful in your research, please consider citing our paper:
@article{wei2025roste,
title={RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models},
author={Wei, Quan and Yau, Chung-Yiu and Wai, Hoi-To and Zhao, Katie Yang and Kang, Dongyeop and Park, Youngsuk and Hong, Mingyi and others},
journal={arXiv preprint arXiv:2502.09003},
year={2025}
}