SDFT: Learning Without Forgetting via Self-Distillation
No complex RL needed. Models teach themselves to learn new skills while preserving existing capabilities.

TL;DR
- Problem: Traditional SFT causes catastrophic forgetting when learning new tasks
- Solution: SDFT (Self-Distillation Fine-Tuning)
- Key insight: Model generates its own training signals conditioned on demonstrations (On-policy)
- Result: Sequential skill accumulation without performance degradation
1. Why Does SFT Cause Forgetting?
The Limitation of Supervised Fine-Tuning
The fundamental issue with SFT:
Training data distribution ≠ Model's current output distribution

| Aspect | SFT | Problem |
|---|---|---|
| Learning type | Off-policy | Learns only from ground truth |
| Distribution shift | Large | Gap between model output and training data |
| Outcome | Forgetting | New data overwrites existing distribution |
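A toy sketch of this gap (the single-token "model" and all numbers are hypothetical): an external ground-truth target can be far less likely under the model than a self-sampled one, so the off-policy loss is much larger and pushes the distribution harder.

```python
import math
import random

# Hypothetical toy "model": a single categorical next-token distribution.
model_probs = {"mat": 0.7, "rug": 0.2, "sofa": 0.1}

def nll(token):
    # Negative log-likelihood of a target token under the toy model
    return -math.log(model_probs[token])

# Off-policy (SFT): the target comes from external data and may be
# low-probability under the model -> large loss, abrupt update.
sft_loss = nll("sofa")  # ground-truth token from the dataset

# On-policy (SDFT): the target is sampled from the model itself,
# so it tends to be likely under the current distribution.
random.seed(0)
tokens, weights = zip(*model_probs.items())
sdft_target = random.choices(tokens, weights=weights)[0]
sdft_loss = nll(sdft_target)

print(sft_loss > sdft_loss)  # True for this seed: the self-sampled target sits closer to the model
```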
Off-policy vs On-policy
Off-policy (SFT):
Model output: "The cat sat on the..."
Ground truth: "A feline rested upon the..."
→ Forces learning regardless of model's natural output

On-policy (SDFT):
Model output: "The cat sat on the mat."
Training signal: Generated from model's own output
→ Improves while maintaining current distribution

2. SDFT: Self-Distillation Fine-Tuning
Core Idea
Show the model demonstrations, then let it generate its own training data based on that knowledge.
How It Works
1. Provide Demonstrations
[example input] → [example output]
2. Model generates conditionally
π(y|x, demonstrations)
3. Self-learning from generated data
Model output → Training signal

Mathematical Formulation
Traditional SFT objective:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y^*) \sim \mathcal{D}} \left[ \log \pi_\theta(y^* \mid x) \right]$$

SDFT objective:

$$\mathcal{L}_{\text{SDFT}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x,\, \text{demonstrations})} \left[ \log \pi_\theta(y \mid x) \right]$$

Key difference: $y$ is sampled from the model itself (conditioned on the demonstrations), not taken from a fixed ground truth $y^*$.
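Conditioning generation on demonstrations (step 2 above) amounts to building an in-context prompt. A minimal sketch, assuming a simple Input/Output template (the paper's exact format may differ):

```python
def build_conditioned_prompt(demonstrations, input_x):
    """Assemble an in-context prompt: demonstrations first, then the new input.

    demonstrations: list of (example_input, example_output) pairs.
    """
    blocks = [f"Input: {ex_in}\nOutput: {ex_out}" for ex_in, ex_out in demonstrations]
    blocks.append(f"Input: {input_x}\nOutput:")  # the model completes from here
    return "\n\n".join(blocks)

demos = [("2+2", "4"), ("3+5", "8")]
prompt = build_conditioned_prompt(demos, "7+1")
print(prompt.endswith("Output:"))  # True: generation continues after the final "Output:"
```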
3. Why SDFT Works
The Importance of Distribution Preservation
SFT: P_model → P_new_data (destroys existing distribution)
SDFT: P_model → P_model + Δ (preserves while improving)

Advantages of Self-Distillation
| Aspect | SFT | SDFT |
|---|---|---|
| Training signal | External data | Self-generated |
| Distribution change | Abrupt | Gradual |
| Prior capabilities | Lost | Preserved |
| Reward function | Not needed | Not needed |
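The "abrupt vs gradual" row can be made concrete with a KL-divergence toy (all distributions here are hypothetical three-token categoricals):

```python
import math

def kl(p, q):
    # KL divergence D_KL(p || q) between two categorical distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_model = [0.70, 0.20, 0.10]     # current model distribution (toy)
p_new_data = [0.10, 0.20, 0.70]  # external SFT data distribution (toy)
p_sdft = [0.65, 0.22, 0.13]      # model distribution after a small on-policy step (toy)

# SFT pulls the model all the way toward the data distribution: a large jump.
# SDFT nudges the model by a small delta: a small step.
print(kl(p_model, p_new_data) > kl(p_model, p_sdft))  # True: SDFT's shift is far smaller
```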
Comparison with RL
Reinforcement learning is also on-policy, but:
- RL: Requires reward function (difficult to design)
- SDFT: No reward function needed (just demonstrations)
4. Experimental Results
Continual Learning Performance
Key findings from the paper:
| Metric | SFT | SDFT |
|---|---|---|
| New task accuracy | Baseline | Higher |
| Prior task retention | Sharp decline | Mostly preserved |
| Multi-task accumulation | Performance decay | Continuous improvement |
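Retention in a task sequence can be quantified with an average-forgetting score: the drop from each task's best accuracy to its final accuracy, averaged over tasks. A minimal sketch, applied to the illustrative accuracies used in this section:

```python
def average_forgetting(acc_history):
    """Average drop from each task's best accuracy to its final accuracy.

    acc_history: {task: [accuracy at each stage the task was measured]}
    """
    drops = [max(hist) - hist[-1] for hist in acc_history.values()]
    return sum(drops) / len(drops)

# Illustrative percent accuracies (same toy numbers as the A -> B -> C scenario).
sft_history = {"A": [90, 60, 40], "B": [85, 55], "C": [80]}
sdft_history = {"A": [90, 88, 86], "B": [85, 83], "C": [82]}

print(round(average_forgetting(sft_history), 1))   # 26.7 points forgotten on average
print(round(average_forgetting(sdft_history), 1))  # 2.0 points forgotten on average
```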
Sequential Learning Scenario
Task sequence: Task A → Task B → Task C
SFT results:
- Task A: 90% → 60% → 40% (continuous decline)
- Task B: — → 85% → 55%
- Task C: — → — → 80%
SDFT results:
- Task A: 90% → 88% → 86% (nearly maintained)
- Task B: — → 85% → 83%
- Task C: — → — → 82%

5. Implementation Concept
Pseudo-code
def sdft_training_step(model, input_x, demonstrations):
    # 1. Generate model output conditioned on demonstrations
    #    (sampling only, so no gradients are tracked)
    with torch.no_grad():
        # In-context generation: demonstrations steer the output
        y_generated = model.generate(
            input_x,
            context=demonstrations,
            temperature=1.0,
        )

    # 2. Train on the generated output (on-policy):
    #    minimize the negative log-likelihood of the model's own sample
    loss = -model.log_prob(y_generated, given=input_x)
    return loss

Key Components
- Demonstration conditioning: Show examples to the model
- Self-generation: Model produces its own outputs
- On-policy learning: Train on self-generated outputs
6. Limitations and Considerations
Current Limitations
| Limitation | Description |
|---|---|
| Demonstration quality | Requires good examples |
| Generation cost | Need to generate at each training step |
| Base model dependency | Model needs some initial capability |
Practical Considerations
- Demonstration selection: Need representative and diverse examples
- Generation diversity: Adjust temperature for varied outputs
- Learning rate: A learning rate that is too high can still cause forgetting
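On the generation-diversity point: temperature rescales the logits before the softmax, so lower values sharpen the sampling distribution (near-greedy) and higher values flatten it. A small sketch with hypothetical logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Temperature-scaled softmax: divide logits by T, then normalize
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.5)  # low T: near-greedy
flat = softmax_with_temperature(logits, 2.0)   # high T: more diverse samples

print(max(sharp) > max(flat))  # True: lower temperature concentrates probability mass
```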
7. Practical Implications
When to Use SDFT?
✅ Good fit:
- Continuous model updates required
- Domain adaptation while preserving general capabilities
- Reward function design is difficult
❌ Not suitable:
- Learning completely new capabilities (no related knowledge)
- Fast one-shot adaptation needed
Future Outlook
SDFT could become a core technique for continuous LLM updates.
- Model updates: No full retraining needed when adding knowledge
- Domain adaptation: Maintain general abilities when specializing
- Safety: Preserve safety training results
Key Takeaways
| Concept | Description |
|---|---|
| Catastrophic Forgetting | New learning overwrites existing knowledge |
| Off-policy (SFT) | Learning from external data distribution → causes forgetting |
| On-policy (SDFT) | Learning from model's own distribution → prevents forgetting |
| Self-Distillation | Model teaches itself |
| Demonstration-conditioned | Examples guide model output |
References
- Paper: Self-Distillation Enables Continual Learning
- Project Page: https://self-distillation.github.io/SDFT.html
- Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal (MIT CSAIL)