This repo represents a tiny, reforged version of the original MeDistil-d2n framework and the related paper studies for the BioASQ workshop on multilingual clinical text summarization. The original project has a major limitation: its dependence on `Seq2SeqTrainer`. The goal of this project is to bridge that gap by fine-tuning small language models (`AutoModelForCausalLM`) on long input contexts, relying heavily on decoder-based models together with the input Formatting Concepts described below.
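As a rough illustration of the input formatting idea (the roles and instruction text below are assumptions, not the repo's actual template), a decoder-only Qwen2.5 model receives the clinical report rendered into a single chat-style sequence:

```python
from transformers import AutoTokenizer

# Hypothetical illustration: wrap a clinical report into a chat-style prompt
# for a decoder-only model. The instruction wording is an assumption, not the
# repo's actual template (see the Formatting Concepts referenced below).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

report = "Patient admitted with chest pain. ECG showed ST elevation. ..."
messages = [
    {"role": "system", "content": "You are a clinical summarization assistant."},
    {"role": "user", "content": f"Summarize the following clinical case report:\n{report}"},
]

# apply_chat_template renders the messages into the single text sequence
# that the causal LM is fine-tuned and prompted on.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```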
- ✅ Replacement of `Seq2SeqTrainer` with `AutoModelForCausalLM` models (Qwen series in particular)
  - Support for instruction tuning
- ✅ Refactoring and narrowing of the scope, dropping dependencies
- ✅ Switched dependencies to Python 3.10+
- Narrowed the scope of the framework: DeepSpeed is not supported by default
- Reforged the data preparation concept (Qwen2.5 support) (see Formatting Concepts)
- Refactored evaluation
  - Fixed the `Trainer` limitation of not exploiting the `.generate` call for predictions
  - Fixed dataset cropping
- Support for rationale annotation using third-party API hosting (OpenRouter) (see the sketch after this list)
- Reforged the `TaskPrefixTrainer` prefix
  - Reforged the list of parameters
- ‼️ Memory leakage on evaluation
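For the OpenRouter-based rationale annotation mentioned above, here is a loose sketch of a single annotation call issued directly against OpenRouter's OpenAI-compatible endpoint. The repo itself drives this step through bulk-chain and its own scripts; the model id, environment variable, and prompt below are placeholders, not the repository's actual configuration.

```python
import os
from openai import OpenAI

# Loose illustration of rationale annotation via OpenRouter's
# OpenAI-compatible API. The repo drives this through bulk-chain;
# the model id, env var, and prompt here are placeholders.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed environment variable
)

report = "Patient admitted with chest pain. ECG showed ST elevation. ..."
response = client.chat.completions.create(
    model="qwen/qwen-2.5-72b-instruct",  # placeholder OpenRouter model id
    messages=[{
        "role": "user",
        "content": f"Explain step by step which findings matter for summarizing this case:\n{report}",
    }],
)
print(response.choices[0].message.content)
```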
- Install the complete list of dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download `punkt_tab` for `nltk`:
  ```python
  import nltk
  nltk.download('punkt_tab')
  ```
Manual training:

```bash
./distill_ft_qwen25_test.sh --from_pretrained "AutoModelForCausalLM-from-HF" --dataset "multiclinsum" --model_type "distill"
```
NOTE: We use the following post-processing script for dataset preparation.
List of the parameters:

- `--from_pretrained`: a Hugging Face model that nests `AutoModelForCausalLM`
- `--dataset`: `multiclinsum` (see the downloading script and post-processing)
- `--alpha`: task weight for multi-task training, $Loss = \alpha \cdot pred_l + (1 - \alpha) \cdot rationale_l$ (see the sketch after this list)
- `--model_type`:
  - `standard`: standard fine-tuning (baseline)
  - `distill`: distilling step-by-step
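As referenced in the `--alpha` description, here is a minimal sketch of the weighting; the tensor values are illustrative, and the actual combination happens inside the trainer:

```python
import torch

def multi_task_loss(pred_loss: torch.Tensor,
                    rationale_loss: torch.Tensor,
                    alpha: float) -> torch.Tensor:
    """Weighted sum of the answer-prediction loss and the rationale loss,
    mirroring Loss = alpha * pred_l + (1 - alpha) * rationale_l."""
    return alpha * pred_loss + (1.0 - alpha) * rationale_loss

# Example: equal weighting of both objectives.
loss = multi_task_loss(torch.tensor(2.1), torch.tensor(3.4), alpha=0.5)
print(loss)  # tensor(2.7500)
```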
The pretrained models are publicly available:
| Model 🤗 | Link |
|---|---|
| nicolay-r/qwen25-05b-multiclinsum-distil | model-card |
| nicolay-r/qwen25-05b-multiclinsum-standard | model-card |
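A minimal sketch of loading one of these checkpoints with `transformers`; the prompt wording and generation settings are assumptions rather than the exact layout used during fine-tuning (see the data formatting notes below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the published checkpoints from the Hugging Face Hub.
model_id = "nicolay-r/qwen25-05b-multiclinsum-distil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The prompt wording is an assumption; follow the repo's Formatting Concepts
# for the layout actually used during fine-tuning.
report = "Patient admitted with chest pain. ECG showed ST elevation. ..."
inputs = tokenizer(f"Summarize the following clinical case report:\n{report}",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the generated summary remains.
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```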
We use the bulk-chain project to infer:
- `rationale` prompts, necessary for distill-based fine-tuning [using this script]
- Test data for competition submissions [using this script]
- MultiClinSum
  - We use the following script for downloading datasets.
  - Web: https://temu.bsc.es/multiclinsum
  - Data: https://zenodo.org/records/15463353 (see also the Zenodo API sketch at the end of this section)
  - BioASQ: http://bioasq.org/
- Data formatting for Qwen
- Fine-tuning setup
- bulk-chain: https://github.com/nicolay-r/bulk-chain
- Annotation and test-set inference.
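The repository's downloading script (linked above) is the intended way to fetch the data; purely as an illustration, the MultiClinSum Zenodo record referenced earlier can also be inspected programmatically through Zenodo's public REST API:

```python
import requests

# Query Zenodo's public REST API for the MultiClinSum record listed above
# and print the downloadable files. This is only an illustration; use the
# repository's downloading script for the actual dataset preparation.
record = requests.get("https://zenodo.org/api/records/15463353", timeout=30).json()
for f in record.get("files", []):
    print(f["key"], f["links"]["self"])
```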