
GraftLLM 🦙

Knowledge Fusion of Large Language Models Via Modular SkillPacks

ICLR 2026, arXiv:2505.18502v3

Talk Map

  1. Motivation and problem framing
  2. Method and architecture
  3. Experimental evidence
  4. Takeaways and appendix

Intro & Motivation

Why not just distill, finetune, or merge?

Why Cross-Capability Transfer Matters

Current pressure

  • Teams already own multiple specialist LLMs.
  • Full retraining is expensive and slow.
  • Naive multitask fusion creates interference.

Desired outcome

  • Modular transfer across heterogeneous backbones.
  • High task fidelity under parameter budgets.
  • Continual updates without catastrophic forgetting.

Heterogeneous transfer bottleneck comparison figure
Knowledge grafting intuition illustration
Figure 3 comparison of distillation and grafting scenarios

Method

GraftLLM pipeline and mechanisms

Figure 4 overview of GraftLLM with specific knowledge, fusion, and forget-free learning
Cross-capability transfer pipeline with module-aware compression and routing

Problem Formulation

Transfer a source capability into the target model, then isolate the specialized adaptation as a parameter delta:

$$\Delta\theta = \theta^*_{\mathrm{tgt}} - \theta_{\mathrm{tgt}}$$

Task vector extraction from source and target models

Goal: compose transferable skill deltas without corrupting base-model generality.
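
A minimal sketch of this extraction step, assuming \(\theta^*_{\mathrm{tgt}}\) and \(\theta_{\mathrm{tgt}}\) are same-architecture checkpoints stored as name-to-tensor dictionaries (the function and tensor names are illustrative, not from the released code):

```python
import torch

def extract_task_vector(theta_tgt: dict, theta_tgt_star: dict) -> dict:
    """Return delta_theta[name] = theta*_tgt[name] - theta_tgt[name]."""
    delta = {}
    for name, base_param in theta_tgt.items():
        tuned_param = theta_tgt_star[name]
        # The task vector isolates what the capability tuning changed.
        delta[name] = (tuned_param - base_param).detach().cpu()
    return delta

# Toy tensors standing in for real checkpoint state dicts.
base  = {"mlp.weight": torch.zeros(4, 4), "attn.weight": torch.ones(4, 4)}
tuned = {"mlp.weight": torch.ones(4, 4),  "attn.weight": torch.ones(4, 4) * 2}
delta_theta = extract_task_vector(base, tuned)
print(delta_theta["mlp.weight"].abs().mean())  # non-zero only where the skill moved weights
```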

Module-specific Adaptive Strategy

After cross-capability transfer, each module's delta is compressed with an operator chosen according to the module's role and sensitivity; a dispatch sketch follows the list below.

$$\hat{\Delta\theta}=\{C_m(\Delta\theta_m)\}_{m\in\mathcal{M}}$$

  • \(C_m(\cdot)\) can be pruning, low-rank decomposition, or quantization.
  • The compressed set forms the transferable SkillPack used in fusion and continual updates.
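
The dispatch of \(C_m\) by module role could look roughly like this; the keep ratio, ranks, and name-matching rules are illustrative assumptions, not the paper's exact settings.

```python
import torch

def prune_topk(delta: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Magnitude pruning: keep only the largest-magnitude entries of the delta."""
    k = max(1, int(delta.numel() * keep_ratio))
    threshold = delta.abs().flatten().topk(k).values.min()
    return delta * (delta.abs() >= threshold)

def low_rank(delta: torch.Tensor, rank: int):
    """Truncated SVD factors (U_r, S_r, V_r^T) of a 2-D delta."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :rank], S[:rank], Vh[:rank, :]

def compress_module(name: str, delta: torch.Tensor):
    """Pick C_m from the module name: prune embeddings/head, low-rank elsewhere."""
    if "embed" in name or "lm_head" in name:
        return "prune", prune_topk(delta)
    if "attn" in name:
        return "svd", low_rank(delta, rank=8)    # attention: aggressive rank
    return "svd", low_rank(delta, rank=32)       # MLP: more conservative rank

kind, payload = compress_module("layers.0.attn.q_proj", torch.randn(64, 64))
```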

SkillPack: module-aware adaptive compression strategy

Module-aware adaptive compression strategy for SkillPack construction

SkillPack Compression Math I

Embedding and output head: pruning

$$\hat{\Delta\theta}^{\mathrm{embed}}=\operatorname{Prune}_{\alpha}\!\left(\Delta\theta^{\mathrm{embed}}\right)$$

Attention modules: low-rank SVD

$$\Delta\theta^{\mathrm{attn}}\approx U_r\Sigma_rV_r^\top,\ \operatorname{rank}(\Sigma_r)=r$$

MLP modules: conservative rank selection

$$\sum_{i=1}^{k}\sigma_i^2\ \ge\ \beta\sum_{i=1}^{\min(d_{\mathrm{out}},d_{\mathrm{in}})}\sigma_i^2$$

Keep the smallest rank \(k\) that satisfies the explained-variance threshold \(\beta\).
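
A short sketch of this rank rule, under the assumption that \(\beta\) is interpreted as a fraction of the total squared-singular-value energy:

```python
import torch

def select_rank(delta: torch.Tensor, beta: float = 0.95) -> int:
    """Smallest k whose cumulative squared singular values reach beta of the total energy."""
    S = torch.linalg.svdvals(delta)                 # singular values, descending
    energy = S.pow(2)
    cumulative = torch.cumsum(energy, dim=0) / energy.sum()
    k = int((cumulative < beta).sum().item()) + 1   # first index reaching the threshold
    return min(k, S.numel())

# Toy delta with true rank <= 64 out of min(256, 512).
delta_mlp = torch.randn(256, 64) @ torch.randn(64, 512) * 1e-3
print(select_rank(delta_mlp, beta=0.95))
```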

SkillPack Compression Math II: Quantization

$$\hat{\theta}=\operatorname{Quant}_k(\theta,\mathbf{x})=\arg\min_{\hat{\theta}}\|\theta\mathbf{x}-\hat{\theta}\mathbf{x}\|^2$$

$$\hat{V}_{[r]}^\top=\operatorname{Quant}_k\!\left(V_{[r]}^\top,\mathbf{x}\right),\ \hat{U}_{[r]}=\operatorname{Quant}_k\!\left(U_{[r]},\Sigma_{[r]}\hat{V}_{[r]}^\top\mathbf{x}\right)$$

  • Mixed precision (bit-width \(k>1\)) is allocated according to singular-value importance.
  • Group-wise GPTQ quantization reduces storage while preserving critical module behavior; a simplified sketch follows.
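
For intuition only, here is a simplified group-wise round-to-nearest quantizer; it is not GPTQ (which minimizes the activation-aware objective \(\operatorname{Quant}_k\) above), and the bit-width and group size are assumed values:

```python
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 64):
    """Symmetric round-to-nearest quantization with one scale per column group."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(w.shape[0], -1, group_size)   # assumes columns divisible by group_size
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_groupwise(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scale).reshape(shape)

V_r = torch.randn(16, 512)                           # e.g. a truncated right factor
q, s = quantize_groupwise(V_r)
V_r_hat = dequantize_groupwise(q, s, V_r.shape)
print((V_r - V_r_hat).abs().mean())                  # small reconstruction error
```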

SkillPack Composition and Reconstruction

Each SkillPack is decoded through dequantization and reconstructed via truncated SVD before fusion.

$$\Delta\theta^{(\mathrm{dq})}=U\Sigma V^\top\approx\Delta\theta$$

$$\theta_{\mathrm{fused}}=\theta_{\mathrm{tgt}}+\Delta\theta$$

  • \(U,\Sigma,V\) are obtained from the truncated decomposition of compressed modules.
  • The reconstructed task delta is added back to the base parameters for model fusion, as sketched below.
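
A sketch of the reconstruction-and-fusion step, assuming each SkillPack entry stores already-dequantized truncated-SVD factors keyed by module name (the data layout is an assumption for illustration):

```python
import torch

def reconstruct_delta(U: torch.Tensor, S: torch.Tensor, Vh: torch.Tensor) -> torch.Tensor:
    """Rebuild the low-rank delta: U diag(S) V^T."""
    return (U * S) @ Vh

def fuse(theta_tgt: dict, skillpack: dict) -> dict:
    """theta_fused = theta_tgt + reconstructed delta for every module the SkillPack covers."""
    fused = {}
    for name, base in theta_tgt.items():
        if name in skillpack:
            U, S, Vh = skillpack[name]               # already dequantized factors
            fused[name] = base + reconstruct_delta(U, S, Vh)
        else:
            fused[name] = base.clone()               # untouched modules keep base weights
    return fused

base = {"mlp.weight": torch.zeros(8, 8)}
pack = {"mlp.weight": (torch.randn(8, 2), torch.ones(2), torch.randn(2, 8))}
print(fuse(base, pack)["mlp.weight"].shape)
```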

Router Mechanism for Selective Integration

For \(n\) SkillPacks \(\{\hat{\Delta\theta}_i\}_{i=1}^n\), a router \(\mathcal{R}\) selects which updates to activate:

$$\theta_{\mathrm{fused}}=\theta_{\mathrm{tgt}}+\sum_{i=1}^{n}\mathcal{R}\!\left(\hat{\Delta\theta}_i\right)$$

  • Classifier-based router: lightweight FFN predicts the most suitable SkillPack.
  • Manual task-type assignment: deterministic mapping from task type to SkillPack.
  • Inference commonly uses top-1 routing to keep overhead low; a toy router sketch follows.
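
A toy classifier-based top-1 router consistent with the description above; the embedding dimension, hidden size, and pooling choice are assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class SkillPackRouter(nn.Module):
    """Lightweight FFN that scores SkillPacks from a pooled query embedding."""
    def __init__(self, embed_dim: int, n_skillpacks: int, hidden: int = 128):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_skillpacks)
        )

    def forward(self, query_embedding: torch.Tensor) -> int:
        logits = self.ffn(query_embedding)
        return int(logits.argmax(dim=-1).item())     # top-1: activate a single SkillPack

router = SkillPackRouter(embed_dim=768, n_skillpacks=4)
idx = router(torch.randn(768))                       # e.g. a mean-pooled query embedding
print(f"activate SkillPack #{idx}")
```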

Routing Overview

The router selects the most relevant SkillPack for the user query and activates a single path

Evidence

Experiments and results

Baseline Methods

Pairwise LLM grafting

  • PEFT baselines: LoRA with varied ranks in both SFT and DPO stages.
  • Task vector compression: full-parameter tuning then pruning, SVD, or quantization at varied compression ratios.

Heterogeneous fusion + forget-free learning

  • Fusion baselines: multi-teacher distillation, parameter merging, routing-based, and mask-based methods.
  • Forget-free baselines: LoRA, Model Grafting, and Model Tailor.

Representative baselines include FuseLLM, Task Arithmetic/TIES/SCE/PCB/DARE/InfiFusion, Routed LoRA/Twin-Merging, and TALL Mask/EMR-Merging.

Datasets and Architectures

Coverage

  • 10 established benchmarks across instruction following, QA, reasoning, math, and coding.
  • Benchmarks are grouped into four categories with domain-specific response sampling.
  • Full benchmark listing is provided in App. E.3.

Pairwise setup

  • Target model: Llama-3.1-8B-Instruct.
  • Primary source model: Qwen-2.5-72B-Instruct.

Fusion and Continual-Learning Setups

  • Explicit fusion (FuseChat 2.0 protocol): OpenChat-3.5-7B as pivot plus six chat-model sources.
  • Implicit fusion (FuseChat 3.0 protocol): targets Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct with four stronger source LLMs.
  • Forget-free learning: sequential acquisition of math then coding capabilities using both SFT and DPO datasets.
  • Architecture details are in App. E; implementation details (training, hyperparameters, compute/runtime) are in App. F.

Pairwise GraftLLM (Sec. 5.1)

Pairwise GraftLLM results corresponding to Figures 7 and 8

Knowledge Fusion: Explicit (Sec. 5.2)

Explicit knowledge fusion results

Takeaway

GraftLLM improves explicit fusion quality over strong merging baselines while keeping parameter growth relatively low.

Knowledge Fusion: Implicit (Sec. 5.2)

Implicit knowledge fusion results

Takeaway

Implicit fusion shows stronger average transfer across benchmark tasks than representative distillation and merging alternatives.

Forget-free Learning (Sec. 5.3)

Forget-free learning comparison results

Takeaway

GraftLLM retains prior capabilities while adding new math and coding skills, outperforming representative continual-learning baselines under a fixed parameter budget.

Highly Distinct Fusion Domains (Sec. 5.4)

This setting tests fusion when source capabilities come from highly distinct domains, stressing cross-domain compatibility and conflict handling.

Result trend: routed SkillPack composition remains robust, showing better balance than direct parameter blending in disparate-task mixtures.

Results table for highly distinct fusion domains

Core Results (Key Slide)

Transfer + explicit fusion

  • Near full-finetune transfer quality under compression.
  • Routed GraftLLM (9.2B) MT-Bench avg: 7.70.
  • AlpacaEval 2.0 LC gain: +8.07 vs best parameter-fusion baseline.

Implicit fusion + continual learning

  • Average gain: +0.8 (Llama target), +1.2 (Qwen target).
  • Continual setting (10% budget): 64.3 average score.
  • Beats Model Tailor (62.2) and Model Grafting (61.3).


Ablations

Section 6: Component and Routing Analysis

6.1 Ablation Study of Each Component

Table 5 ablation study of GraftLLM components

Takeaway

Each component contributes measurably; removing module-aware compression or routing causes noticeable quality drops.

6.2 Effect of Task Difficulty and Data Settings

Figure 9 effect of task difficulty and data settings

Takeaway

Performance gaps widen on harder settings, where SkillPack compression plus routing better preserves transferred capability.

6.3 Router Behavior and SkillPack Usage

Table 6 router behavior and SkillPack analysis

Takeaway

Router selection aligns with task type and activates specialized SkillPacks, reducing interference across capabilities.

Conclusions

Limitations and Future Work

Conclusions

Main result

GraftLLM enables modular capability transfer across heterogeneous LLMs through compressed SkillPacks.

Practical impact

It reaches strong transfer/fusion quality while keeping parameter overhead and continual-learning interference low.

Overall, the method provides a composable interface for transfer, fusion, and incremental updates.

Limitations

  • Router and reconstruction can add inference overhead versus lightweight adapter-only paths.
  • Performance depends on the quality of SFT/DPO data and source-model distillation.
  • Compression choices (rank, bit allocation, pruning ratio) remain largely empirically tuned.
  • Cross-domain behavior may still degrade when tasks are highly mismatched.

Future Work

  • Learned/automatic schedules for rank and mixed-precision assignment.
  • Lower-latency routing and activation strategies for deployment.
  • Robustness extensions to broader architectures and stronger domain shifts.
  • More standardized evaluation for safety, privacy, and long-horizon continual updates.

Appendix

Backup material

Appendix A: Benchmark Highlights

Setting | Metric | Reported value
Explicit fusion | MT-Bench avg | 7.70
Explicit fusion | AlpacaEval 2.0 LC gain | +8.07 vs best parameter-fusion baseline
Implicit fusion | Average gain | +0.8 (Llama target), +1.2 (Qwen target)
Continual learning | Average score | 64.3 at 10% budget


Appendix B: Compression Sensitivity

  • Higher SVD rank usually improves retention, with storage tradeoff.
  • Mixed precision outperforms naive uniform low-bit assignments.
  • DPO-oriented tasks degrade faster under aggressive compression than easier SFT settings.
  • MLP handling drives much of the final quality variation.

Appendix C: SFT + DPO Objective Details (Backup)

$$\mathcal{L}_{\mathrm{SFT}}(\theta)=-\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}}[\log p_\theta(y\mid x)]$$

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})=-\mathbb{E}_{(x,y_w,y_l)}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

  • Preference pairs are sourced consistently to reduce reward bias.
  • Rule-based checks are used for math and coding tasks.
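
The DPO objective above can be computed from summed log-probabilities as in this reference-style sketch (standard formulation; the batch values are toy numbers, not paper data):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO loss from summed log-probs of chosen (y_w) and rejected (y_l) responses."""
    chosen_margin = policy_logp_w - ref_logp_w
    rejected_margin = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of three preference pairs (illustrative numbers only).
loss = dpo_loss(
    policy_logp_w=torch.tensor([-12.0, -9.5, -20.1]),
    policy_logp_l=torch.tensor([-14.2, -11.0, -19.8]),
    ref_logp_w=torch.tensor([-13.0, -10.0, -20.0]),
    ref_logp_l=torch.tensor([-13.5, -10.5, -20.0]),
)
print(loss)
```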

Appendix D: Future Work and Sources

  • Reduce inference overhead in reconstructed deployment path.
  • Automate rank and bit scheduling.
  • Extend routing-friendly compression to broader MoE settings.

Du et al., 2026 (ICLR). arXiv:2505.18502v3

PDF | Code