ICLR 2026, arXiv:2505.18502v3
Transfer the source model's capability to the target model, then isolate the specialized adaptation as a weight delta:
$$\Delta\theta = \theta^*_{\mathrm{tgt}} - \theta_{\mathrm{tgt}}$$
Goal: compose transferable skill deltas without corrupting base-model generality.
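The delta extraction above can be sketched in a few lines. This is a minimal illustration, assuming weights are stored as a dict of numpy arrays keyed by module name (the dict layout and the `attn.w` key are illustrative, not from the paper):

```python
import numpy as np

def extract_skill_delta(theta_tuned, theta_base):
    """Per-module skill delta: fine-tuned weights minus base weights."""
    return {name: theta_tuned[name] - theta_base[name] for name in theta_base}

# Toy example with a single 2x2 "module".
base  = {"attn.w": np.array([[1.0, 0.0], [0.0, 1.0]])}
tuned = {"attn.w": np.array([[1.5, 0.2], [0.1, 0.9]])}
delta = extract_skill_delta(tuned, base)
# Adding the delta back onto the base recovers the tuned weights exactly.
```
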
After cross-capability transfer, each module delta is compressed with an operator chosen by module role and sensitivity.
$$\hat{\Delta\theta}=\{C_m(\Delta\theta_m)\}_{m\in\mathcal{M}}$$
$$\Delta\theta^{\mathrm{embed}}=\operatorname{Prune}_{\alpha}\!\left(\Delta\theta^{\mathrm{embed}}\right)$$
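A sketch of the magnitude-pruning operator for embedding deltas, assuming `Prune` keeps the top-α fraction of entries by absolute value and zeroes the rest (a standard interpretation; the exact criterion is the paper's):

```python
import numpy as np

def prune_by_magnitude(delta, keep_ratio=0.1):
    """Keep only the largest-|value| fraction of entries; zero the rest."""
    flat = np.abs(delta).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

rng = np.random.default_rng(0)
d = rng.normal(size=(8, 8))
sparse = prune_by_magnitude(d, keep_ratio=0.25)  # 16 of 64 entries survive
```
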
$$\Delta\theta^{\mathrm{attn}}\approx U_r\Sigma_rV_r^\top,\ \operatorname{rank}(\Sigma_r)=r$$
$$\sum_{i=1}^{k}\sigma_i^2\ \ge\ \beta\sum_{i=1}^{\min(d_{\mathrm{out}},d_{\mathrm{in}})}\sigma_i^2$$
Keep the smallest rank \(k\) that satisfies the explained-variance threshold \(\beta\).
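The rank-selection rule above can be implemented directly with a cumulative sum over squared singular values. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def truncated_svd(delta, beta=0.9):
    """Approximate a delta matrix with the smallest rank whose singular
    values explain at least a beta fraction of the squared spectrum."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, beta)) + 1  # smallest k with energy >= beta
    return U[:, :r], s[:r], Vt[:r, :]

rng = np.random.default_rng(1)
# A nearly rank-2 matrix plus small noise, mimicking a low-rank attention delta.
low_rank = rng.normal(size=(32, 2)) @ rng.normal(size=(2, 32))
delta = low_rank + 0.01 * rng.normal(size=(32, 32))
U, s, Vt = truncated_svd(delta, beta=0.99)
approx = U @ np.diag(s) @ Vt
```

By construction the residual Frobenius norm is at most \(\sqrt{1-\beta}\) of the original, so the kept rank stays tiny on near-low-rank deltas.
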
$$\hat{\theta}=\operatorname{Quant}_k(\theta,\mathbf{x})=\arg\min_{\hat{\theta}}\|\theta\mathbf{x}-\hat{\theta}\mathbf{x}\|^2$$
$$\hat{V}_{[r]}^\top=\operatorname{Quant}_k\!\left(V_{[r]}^\top,\mathbf{x}\right),\ \hat{U}_{[r]}=\operatorname{Quant}_k\!\left(U_{[r]},\Sigma_{[r]}\hat{V}_{[r]}^\top\mathbf{x}\right)$$
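The activation-aware objective above is what GPTQ-style solvers minimize. As a simple stand-in, a per-row round-to-nearest quantizer already keeps the output error \(\|\theta\mathbf{x}-\hat{\theta}\mathbf{x}\|\) small; this sketch is a baseline illustration, not the paper's actual `Quant_k` solver:

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Per-row symmetric round-to-nearest quantization (a simple proxy
    for the activation-aware Quant_k operator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(2)
W = rng.normal(size=(16, 16))
x = rng.normal(size=(16, 4))  # calibration activations
q, scale = quantize_rtn(W, bits=8)
W_hat = dequantize(q, scale)
# The quantity Quant_k minimizes is the relative output error on x.
err = np.linalg.norm(W @ x - W_hat @ x) / np.linalg.norm(W @ x)
```
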
Each SkillPack is decoded by dequantizing its low-rank factors and reconstructing the delta from the truncated SVD before fusion.
$$\Delta\theta^{(dq)}=\hat{U}_{[r]}\Sigma_{[r]}\hat{V}_{[r]}^\top\approx\Delta\theta$$
$$\theta_{\mathrm{fused}}=\theta_{\mathrm{tgt}}+\Delta\theta^{(dq)}$$
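Reconstruction and fusion reduce to multiplying the stored factors back together and adding the result onto the target weights. A minimal sketch (quantization omitted for clarity; the helper name is ours):

```python
import numpy as np

def fuse_skillpack(theta_tgt, U, s, Vt):
    """Reconstruct the low-rank delta from its SVD factors and fuse it."""
    return theta_tgt + U @ np.diag(s) @ Vt

rng = np.random.default_rng(3)
theta_tgt = rng.normal(size=(8, 8))
# Pretend these rank-2 factors were decoded from a SkillPack.
U, s, Vt = np.linalg.svd(rng.normal(size=(8, 8)), full_matrices=False)
r = 2
fused = fuse_skillpack(theta_tgt, U[:, :r], s[:r], Vt[:r, :])
```
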
For \(n\) SkillPacks \(\{\hat{\Delta\theta}_i\}_{i=1}^n\), a router \(\mathcal{R}\) selects which updates to activate:
$$\theta_{\mathrm{fused}}=\theta_{\mathrm{tgt}}+\sum_{i=1}^{n}\mathcal{R}\!\left(\hat{\Delta\theta}_i\right)$$
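The routed composition above can be sketched with any predicate that maps a task to a subset of SkillPacks; the keyword-matching router here is a toy stand-in for the paper's learned/selected router \(\mathcal{R}\):

```python
import numpy as np

def route_and_fuse(theta_tgt, skillpacks, task, router):
    """Add only the SkillPack deltas the router activates for this task."""
    fused = theta_tgt.copy()
    for name, delta in skillpacks.items():
        if router(task, name):
            fused = fused + delta
    return fused

# Toy router: activate a pack when its tag appears in the task description.
router = lambda task, name: name in task
theta = np.zeros((2, 2))
packs = {"math": np.eye(2), "code": 2 * np.eye(2)}
fused = route_and_fuse(theta, packs, task="solve this math problem", router=router)
```

Only the `math` pack fires here, so the `code` delta never touches the weights, which is exactly the interference-avoidance behavior the router is for.
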
Representative baselines include FuseLLM, Task Arithmetic/TIES/SCE/PCB/DARE/InfiFusion, Routed LoRA/Twin-Merging, and TALL Mask/EMR-Merging.
Takeaway
GraftLLM improves explicit fusion quality over strong merging baselines while keeping parameter growth relatively low.
Takeaway
Implicit fusion shows stronger average transfer across benchmark tasks than representative distillation and merging alternatives.
Takeaway
GraftLLM retains prior capabilities while adding new math and coding skills, outperforming representative continual-learning baselines under fixed budget.
This setting tests fusion when source capabilities come from highly distinct domains, stressing cross-domain compatibility and conflict handling.
Result trend: routed SkillPack composition remains robust, showing better balance than direct parameter blending in disparate-task mixtures.
Manual asset slot: add Figure 5/6 plots and Table 1 crop if you want paper-native result visuals.
Takeaway
Each component contributes measurably; removing module-aware compression or routing causes noticeable quality drops.
Takeaway
Performance gaps widen on harder settings, where SkillPack compression plus routing better preserves transferred capability.
Takeaway
Router selection aligns with task type and activates specialized SkillPacks, reducing interference across capabilities.
GraftLLM enables modular capability transfer across heterogeneous LLMs through compressed SkillPacks.
It reaches strong transfer/fusion quality while keeping parameter overhead and continual-learning interference low.
Overall, the method provides a composable interface for transfer, fusion, and incremental updates.
| Setting | Metric | Reported value |
|---|---|---|
| Explicit fusion | MT-Bench avg | 7.70 |
| Explicit fusion | AlpacaEval 2.0 LC gain | +8.07 vs best parameter-fusion baseline |
| Implicit fusion | Average gain | +0.8 (Llama target), +1.2 (Qwen target) |
| Continual learning | Average score | 64.3 at 10% budget |
Manual asset slot: optional full benchmark table image from the PDF appendix.
$$\mathcal{L}_{\mathrm{SFT}}(\theta)=-\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}}[\log p_\theta(y\mid x)]$$
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})=-\mathbb{E}_{(x,y_w,y_l)}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$
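Given per-sequence log-probabilities under the policy and the frozen reference, the DPO loss above is a one-liner. A numeric sketch (inputs are illustrative log-prob values, not real model outputs):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen y_w, rejected y_l) pair: -log sigmoid of the
    beta-scaled implicit reward margin relative to the reference policy."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.log1p(np.exp(-margin))  # equals -log(sigmoid(margin))

# Policy prefers y_w more than the reference does -> loss below log 2.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

At a zero margin the loss is exactly \(\log 2\); it decreases as the policy's preference for \(y_w\) over \(y_l\) grows beyond the reference's.
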
Du et al., 2026 (ICLR). arXiv:2505.18502v3