```bash
pip install -U "git+https://github.com/lapp0/distily.git"
```
Distily allows you to distill a model with:
- Quantized weights: e.g., TriLM, BitNet
- Distinct architecture: state-space models such as Mamba, Mixture-of-Experts (MoE)
- Modified architecture: decrease (or increase) the
  - number of layers
  - width and depth of the attention heads and dense layers
  - number of attention and KV heads
## Minimal Example: distily_gpt2

Command to create a distilled gpt2 with only 6 layers:
```bash
python3 -m distily.run \
    --teacher_model_name_or_path gpt2 \
    --output_dir distily_gpt2 \
    --hub_model_id "distily/distily_gpt2" \
    --push_to_hub True \
    --student_model_config '{"n_layers": 6}' \
    --student_model_as_bitnet True
```
The resulting distily_gpt2 model has (TODO: explain metrics).
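Once the run finishes and the checkpoint is pushed, the student loads like any other Hugging Face model. A minimal sketch, assuming the command above pushed its weights to `distily/distily_gpt2`:

```python
# Minimal sketch: load the distilled student and generate with it.
# Assumes the run above pushed its checkpoint to "distily/distily_gpt2".
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distily/distily_gpt2")
model = AutoModelForCausalLM.from_pretrained("distily/distily_gpt2")

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```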
For more examples, review the Examples documentation.
To push to the Hub, you must first save your Hub token:

```bash
HF_WRITE=<your hub token> python3 -c "import os; from huggingface_hub.hf_api import HfFolder; HfFolder.save_token(os.environ['HF_WRITE'])"
```
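Alternatively, `huggingface-cli login` (installed with the `huggingface_hub` package) stores a token interactively.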
TODO: commit the linked docs once complete
## Using Distily

- How Distillation Works: The Distily Recipe
- Quickstart / Examples
- Parameter Selection
## Available Models

## Contributing
- Standard knowledge distillation using logits (a minimal loss sketch appears after this list).
- Distill using intermediate features, including hidden states and attentions.
- Implement value transfer (simply a distillation loss on the v of q, k, v).
- Improve sampling efficiency through synthetic data generation.
- Implement cross-entropy classification loss (the traditional LLM loss function).
- Apply a projector to logits (https://arxiv.org/pdf/2310.17183).
- Apply "teacher recording": run teacher inference once, then reuse the recorded feature dataset any number of times.
- Distill to a model with fewer `num_hidden_layers` by implementing layer mappers (sketched below).
- Distill to a model with modified module dimensions and behaviors (e.g., `intermediate_size`, `hidden_act`) by employing projectors.
- Distill to a model with modified `num_attention_heads` and `num_key_value_heads`.
- Distill to Bitnet (b1.58)
- Distill to State-Space / Mamba
- Distill to MoE
- Distill to a parameter-sharing (ALBERT-style) model
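To make the logit-distillation and layer-mapping items above concrete, here is a minimal, hypothetical sketch of a combined loss: a temperature-scaled KL divergence on logits plus an MSE on mapped hidden states. This is not Distily's implementation; the `layer_map`, temperature, and weighting are illustrative assumptions, and it assumes the teacher and student share a hidden size (otherwise a learned projector would be needed, as noted above).

```python
# Hypothetical sketch of a combined distillation loss, not Distily's actual code.
# Assumes teacher and student share hidden_size; layer_map, temperature, and the
# alpha weighting are illustrative choices.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, layer_map, temperature=2.0, alpha=0.5):
    """student_out / teacher_out: model outputs with .logits and .hidden_states
    (run both models with output_hidden_states=True).
    layer_map: list of (student_layer_idx, teacher_layer_idx) pairs."""
    # Logit distillation: KL divergence between temperature-softened distributions,
    # rescaled by temperature**2 to keep gradient magnitudes comparable.
    s_logp = F.log_softmax(student_out.logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_out.logits / temperature, dim=-1)
    logit_loss = F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature**2

    # Hidden-state distillation: MSE between each mapped student/teacher layer pair.
    hidden_loss = torch.stack([
        F.mse_loss(student_out.hidden_states[s_idx], teacher_out.hidden_states[t_idx])
        for s_idx, t_idx in layer_map
    ]).mean()

    return alpha * logit_loss + (1 - alpha) * hidden_loss
```

A training loop would run the teacher under `torch.no_grad()` and backpropagate this loss through the student only. For the 6-layer distily_gpt2 above, one plausible `layer_map` against the 12-layer teacher is `[(i, 2 * i) for i in range(1, 7)]`, since `hidden_states[0]` is the embedding output.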