SDFT: Learning Without Forgetting via Self-Distillation
No complex RL needed. Models teach themselves to learn new skills while preserving existing capabilities.

TL;DR
- Problem: Traditional SFT causes catastrophic forgetting when learning new tasks
- Solution: SDFT (Self-Distillation Fine-Tuning)
- Key insight: Model generates its own training signals conditioned on demonstrations (On-policy)
- Result: Sequential skill accumulation without performance degradation
1. Why Does SFT Cause Forgetting?
The Limitation of Supervised Fine-Tuning
The fundamental issue with SFT:
Training data distribution ≠ Model's current output distribution

| Aspect | SFT | Problem |
|---|---|---|
| Learning type | Off-policy | Learns only from ground truth |
| Distribution shift | Large | Gap between model output and training data |
| Outcome | Forgetting | New data overwrites existing distribution |
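A toy sketch of this gap (the single-token "model" and all numbers are hypothetical): an external ground-truth target can be far less likely under the model than a self-sampled one, so the off-policy loss is much larger and pushes the distribution harder.

```python
import math
import random

# Hypothetical toy "model": a single categorical next-token distribution.
model_probs = {"mat": 0.7, "rug": 0.2, "sofa": 0.1}

def nll(token):
    # Negative log-likelihood of a target token under the toy model
    return -math.log(model_probs[token])

# Off-policy (SFT): the target comes from external data and may be
# low-probability under the model -> large loss, abrupt update.
sft_loss = nll("sofa")  # ground-truth token from the dataset

# On-policy (SDFT): the target is sampled from the model itself,
# so it tends to be likely under the current distribution.
random.seed(0)
tokens, weights = zip(*model_probs.items())
sdft_target = random.choices(tokens, weights=weights)[0]
sdft_loss = nll(sdft_target)

print(sft_loss > sdft_loss)  # True for this seed: the self-sampled target sits closer to the model
```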
Off-policy vs On-policy
Off-policy (SFT):
Model output: "The cat sat on the..."
Ground truth: "A feline rested upon the..."
→ Forces learning regardless of model's natural output

On-policy (SDFT):
Model output: "The cat sat on the mat."
Training signal: Generated from model's own output
→ Improves while maintaining current distribution

2. SDFT: Self-Distillation Fine-Tuning
Core Idea
Show the model demonstrations, then let it generate its own training data based on that knowledge.
How It Works
1. Provide Demonstrations
[example input] → [example output]
2. Model generates conditionally
π(y|x, demonstrations)
3. Self-learning from generated data
Model output → Training signal

Mathematical Formulation
Traditional SFT objective:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y^*) \sim \mathcal{D}} \left[ \log \pi_\theta(y^* \mid x) \right]$$

SDFT objective:

$$\mathcal{L}_{\text{SDFT}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x,\, \text{demonstrations})} \left[ \log \pi_\theta(y \mid x) \right]$$

Key difference: $y$ is sampled from the model itself (conditioned on the demonstrations), not taken from a fixed ground truth $y^*$.
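Conditioning generation on demonstrations (step 2 above) amounts to building an in-context prompt. A minimal sketch, assuming a simple Input/Output template (the paper's exact format may differ):

```python
def build_conditioned_prompt(demonstrations, input_x):
    """Assemble an in-context prompt: demonstrations first, then the new input.

    demonstrations: list of (example_input, example_output) pairs.
    """
    blocks = [f"Input: {ex_in}\nOutput: {ex_out}" for ex_in, ex_out in demonstrations]
    blocks.append(f"Input: {input_x}\nOutput:")  # the model completes from here
    return "\n\n".join(blocks)

demos = [("2+2", "4"), ("3+5", "8")]
prompt = build_conditioned_prompt(demos, "7+1")
print(prompt.endswith("Output:"))  # True: generation continues after the final "Output:"
```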
3. Why SDFT Works
The Importance of Distribution Preservation
SFT: P_model → P_new_data (destroys existing distribution)
SDFT: P_model → P_model + Δ (preserves while improving)

Advantages of Self-Distillation
| Aspect | SFT | SDFT |
|---|---|---|
| Training signal | External data | Self-generated |
| Distribution change | Abrupt | Gradual |
| Prior capabilities | Lost | Preserved |
| Reward function | Not needed | Not needed |
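The "abrupt vs gradual" row can be made concrete with a KL-divergence toy (all distributions here are hypothetical three-token categoricals):

```python
import math

def kl(p, q):
    # KL divergence D_KL(p || q) between two categorical distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_model = [0.70, 0.20, 0.10]     # current model distribution (toy)
p_new_data = [0.10, 0.20, 0.70]  # external SFT data distribution (toy)
p_sdft = [0.65, 0.22, 0.13]      # model distribution after a small on-policy step (toy)

# SFT pulls the model all the way toward the data distribution: a large jump.
# SDFT nudges the model by a small delta: a small step.
print(kl(p_model, p_new_data) > kl(p_model, p_sdft))  # True: SDFT's shift is far smaller
```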
Comparison with RL
Reinforcement learning is also on-policy, but:
- RL: Requires reward function (difficult to design)
- SDFT: No reward function needed (just demonstrations)
4. Experimental Results
Continual Learning Performance
Key findings from the paper:
| Metric | SFT | SDFT |
|---|---|---|
| New task accuracy | Baseline | Higher |
| Prior task retention | Sharp decline | Mostly preserved |
| Multi-task accumulation | Performance decay | Continuous improvement |
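Retention in a task sequence can be quantified with an average-forgetting score: the drop from each task's best accuracy to its final accuracy, averaged over tasks. A minimal sketch, applied to the illustrative accuracies used in this section:

```python
def average_forgetting(acc_history):
    """Average drop from each task's best accuracy to its final accuracy.

    acc_history: {task: [accuracy at each stage the task was measured]}
    """
    drops = [max(hist) - hist[-1] for hist in acc_history.values()]
    return sum(drops) / len(drops)

# Illustrative percent accuracies (same toy numbers as the A -> B -> C scenario).
sft_history = {"A": [90, 60, 40], "B": [85, 55], "C": [80]}
sdft_history = {"A": [90, 88, 86], "B": [85, 83], "C": [82]}

print(round(average_forgetting(sft_history), 1))   # 26.7 points forgotten on average
print(round(average_forgetting(sdft_history), 1))  # 2.0 points forgotten on average
```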
Sequential Learning Scenario
Task sequence: Task A → Task B → Task C
SFT results:
- Task A: 90% → 60% → 40% (continuous decline)
- Task B: — → 85% → 55%
- Task C: — → — → 80%
SDFT results:
- Task A: 90% → 88% → 86% (nearly maintained)
- Task B: — → 85% → 83%
- Task C: — → — → 82%

5. Implementation Concept
Pseudo-code
def sdft_training_step(model, input_x, demonstrations):
    # 1. Generate model output conditioned on demonstrations
    #    (sampling only, so no gradients are tracked)
    with torch.no_grad():
        # In-context generation: demonstrations steer the output
        y_generated = model.generate(
            input_x,
            context=demonstrations,
            temperature=1.0,
        )

    # 2. Train on the generated output (on-policy):
    #    minimize the negative log-likelihood of the model's own sample
    loss = -model.log_prob(y_generated, given=input_x)
    return loss

Key Components
- Demonstration conditioning: Show examples to the model
- Self-generation: Model produces its own outputs
- On-policy learning: Train on self-generated outputs
6. Limitations and Considerations
Current Limitations
| Limitation | Description |
|---|---|
| Demonstration quality | Requires good examples |
| Generation cost | Need to generate at each training step |
| Base model dependency | Model needs some initial capability |
Practical Considerations
- Demonstration selection: Need representative and diverse examples
- Generation diversity: Adjust temperature for varied outputs
- Learning rate: A learning rate that is too high can still cause forgetting
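On the generation-diversity point: temperature rescales the logits before the softmax, so lower values sharpen the sampling distribution (near-greedy) and higher values flatten it. A small sketch with hypothetical logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature-scaled softmax: divide logits by T, then normalize
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.5)  # low T: near-greedy
flat = softmax_with_temperature(logits, 2.0)   # high T: more diverse samples

print(max(sharp) > max(flat))  # True: lower temperature concentrates probability mass
```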
7. Practical Implications
When to Use SDFT?
✅ Good fit:
- Continuous model updates required
- Domain adaptation while preserving general capabilities
- Reward function design is difficult
❌ Not suitable:
- Learning completely new capabilities (no related knowledge)
- Fast one-shot adaptation needed
Future Outlook
SDFT could become a core technique for continuous LLM updates.
- Model updates: No full retraining needed when adding knowledge
- Domain adaptation: Maintain general abilities when specializing
- Safety: Preserve safety training results
Key Takeaways
| Concept | Description |
|---|---|
| Catastrophic Forgetting | New learning overwrites existing knowledge |
| Off-policy (SFT) | Learning from external data distribution → causes forgetting |
| On-policy (SDFT) | Learning from model's own distribution → prevents forgetting |
| Self-Distillation | Model teaches itself |
| Demonstration-conditioned | Examples guide model output |
References
- Paper: Self-Distillation Enables Continual Learning
- Project Page: https://self-distillation.github.io/SDFT.html
- Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal (MIT CSAIL)