External Publication

DRM-SAINT-G: Testing Transformer growth by recomposable grafts instead of full retraining

Hugging Face Forums [Unofficial] May 19, 2026

Hi everyone,

I’m working on an experimental project called DRM-SAINT-G, a research runtime for testing whether Transformer models can gain useful capacity through structured, recomposable grafts instead of updating every weight in the model.

Repository:

GitHub: gnai-creator/DRM-SAINT-G (DRM grafting with SAINT-Phi for compact model adaptation and growth)

The core question is simple:

Can a model gain useful new capacity without retraining the full parameter space?

Instead of treating training as an all-or-nothing process, DRM-SAINT-G freezes a smaller base model and adds trainable graft modules at selected points in the network.

The current experiment is:

DRM 5M base + 24 trainable graft blocks ≈ 125M effective parameter budget

Each graft block is a residual module attached to Transformer block outputs:

h_out = h + scale * down(silu(up(h)))
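As a minimal, dependency-free sketch of the residual form above: the class name `GraftBlock`, the bias-free projections, the inner width `d_graft`, and the zero initial `scale` are all assumptions made for illustration, not details taken from the repository. A real implementation would presumably use tensor layers rather than Python lists.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


class GraftBlock:
    """Residual graft: h_out = h + scale * down(silu(up(h))).

    Hypothetical reference version with bias-free projections stored as lists.
    """

    def __init__(self, d_model: int, d_graft: int, scale: float = 0.0, seed: int = 0):
        rng = random.Random(seed)
        # up maps d_model -> d_graft, down maps d_graft -> d_model.
        self.up = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_graft)]
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(d_graft)] for _ in range(d_model)]
        # Assumption: scale starts at 0 so a fresh graft is an exact no-op.
        self.scale = scale

    def num_parameters(self) -> int:
        d_graft, d_model = len(self.up), len(self.down)
        return 2 * d_model * d_graft + 1  # two matrices plus the scalar scale

    def forward(self, h: list[float]) -> list[float]:
        z = [silu(sum(w * x for w, x in zip(row, h))) for row in self.up]
        delta = [sum(w * a for w, a in zip(row, z)) for row in self.down]
        return [x + self.scale * d for x, d in zip(h, delta)]


graft = GraftBlock(d_model=4, d_graft=8)
h = [0.5, -1.0, 2.0, 0.0]
unchanged = graft.forward(h)   # scale == 0.0, so the hidden state passes through
graft.scale = 1.0
shifted = graft.forward(h)     # the graft now adds its residual update
```

With `scale` at zero the block returns `h` unchanged, so attaching it does not perturb the frozen base before any graft training happens.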

For the current DRM 5M configuration:

base parameters: 5,699,059
24 graft parameters: 119,296,536
effective total: 124,995,595
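The parameter budget can be checked with a few lines of arithmetic. The per-graft figure below assumes the 24 grafts are equally sized, which the post does not state explicitly; the variable names are illustrative.

```python
base_params = 5_699_059
graft_params_total = 119_296_536
num_grafts = 24

effective_total = base_params + graft_params_total
# Assumption: all grafts have the same size.
per_graft, remainder = divmod(graft_params_total, num_grafts)
trainable_fraction = graft_params_total / effective_total

print(f"effective total: {effective_total:,}")          # 124,995,595
print(f"per graft (if equal): {per_graft:,}")           # 4,970,689
print(f"remainder: {remainder}")                        # 0
print(f"trainable fraction: {trainable_fraction:.2%}")  # 95.44%
```

So although the base is frozen, roughly 95% of the effective parameters are trainable graft weights, which is worth keeping in mind when comparing against PEFT methods that train a small fraction of the total.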

This is being compared against a full DRM 125M smoke baseline.

Current reference numbers:

Full DRM 125M:

100 steps took roughly 4 hours on RTX 4090
validation loss: 9.0499

DRM 5M + 24 grafts:

effective budget: ~125M parameters
CUDA peak in smoke: ~3.43 GiB
recomposable checkpoint: works
checkpoint reload/recompose loss diff: 0.0
short tests show controlled validation behavior
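The "reload/recompose loss diff: 0.0" check can be illustrated with a toy round trip. This is only a sketch under stated assumptions: the JSON artifact layout, the `apply_grafts` stand-in for the grafted forward pass, and `toy_loss` are hypothetical, and real graft artifacts would hold tensors rather than short lists.

```python
import json
import tempfile
from pathlib import Path


def save_graft(path: Path, graft: dict) -> None:
    """Write one graft artifact (hypothetical JSON layout) to disk."""
    path.write_text(json.dumps(graft))


def load_graft(path: Path) -> dict:
    """Read one graft artifact back from disk."""
    return json.loads(path.read_text())


def apply_grafts(h: list[float], grafts: list[dict]) -> list[float]:
    """Toy stand-in for the grafted forward pass: h <- h + scale * delta."""
    for g in grafts:
        h = [x + g["scale"] * d for x, d in zip(h, g["delta"])]
    return h


def toy_loss(h: list[float]) -> float:
    """Placeholder loss: mean squared activation."""
    return sum(x * x for x in h) / len(h)


grafts = [
    {"layer": 0, "scale": 0.1, "delta": [0.3, -0.7, 1.1]},
    {"layer": 1, "scale": 0.05, "delta": [-0.2, 0.4, 0.9]},
]
h0 = [1.0, 2.0, 3.0]
loss_before = toy_loss(apply_grafts(h0, grafts))

# Save each graft as its own artifact, then recompose from disk.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for g in grafts:
        p = Path(tmp) / f"graft_{g['layer']:02d}.json"
        save_graft(p, g)
        paths.append(p)
    reloaded = [load_graft(p) for p in paths]

loss_after = toy_loss(apply_grafts(h0, reloaded))
loss_diff = abs(loss_after - loss_before)
```

An exact 0.0 difference is the expected outcome when serialization is lossless and the forward pass is deterministic; a nonzero diff would point at precision loss in the artifact format or at a graft being attached in a different place or order.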

Important caveat: this does not yet beat the full 125M model in absolute validation loss.

The current honest result is:

Full 125M validation loss: 9.0499
5M + 24 grafts short-run loss: ~10.4159
remaining gap: ~1.37 loss points
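To put the gap in more familiar terms, the short calculation below converts it into a perplexity ratio. It assumes both losses are mean token-level cross-entropy in nats, which the post does not state.

```python
import math

full_loss = 9.0499    # full DRM 125M, from the post
graft_loss = 10.4159  # DRM 5M + 24 grafts, short run, from the post

gap = graft_loss - full_loss
# Assumption: losses are in nats, so exp(gap) is the ratio of perplexities.
ppl_ratio = math.exp(gap)

print(f"gap: {gap:.4f} nats")
print(f"perplexity ratio (grafted / full): {ppl_ratio:.2f}")
```

Under that assumption the grafted run sits at a bit under four times the perplexity of the full baseline, and both runs are still at a very early stage of training.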

But the operational efficiency difference is large enough that the next experiment, an equal wall-clock comparison, is more interesting than an equal-step one:

Full 125M: 100 steps in ~4 hours

DRM 5M + 24 grafts: train for the same ~4 hours
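An equal wall-clock comparison needs a loop that stops on elapsed time rather than step count. The sketch below is one way to do that; `train_for_budget`, `train_step`, and the injectable `clock` are hypothetical names, and the fake clock only exists to make the example run instantly.

```python
import time
from typing import Callable


def train_for_budget(
    train_step: Callable[[], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[int, float]:
    """Run train_step until the wall-clock budget is spent.

    Returns (steps_completed, last_loss). The clock is checked before each
    step, so the final step may overrun the budget slightly.
    """
    start = clock()
    steps, last_loss = 0, float("nan")
    while clock() - start < budget_seconds:
        last_loss = train_step()
        steps += 1
    return steps, last_loss


class FakeClock:
    """Manually advanced clock so the example does not actually wait."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


fake = FakeClock()


def fake_step() -> float:
    # 100 steps in ~4 hours implies roughly 4 * 3600 / 100 = 144 s per step.
    fake.now += 144.0
    return 9.0  # placeholder loss value


steps, _ = train_for_budget(fake_step, budget_seconds=4 * 3600, clock=fake)
print(steps)  # 100
```

Running both configurations through the same budgeted loop, with the same validation schedule, keeps the comparison about wall-clock efficiency rather than about step counts.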

The hypothesis is not that grafting magically replaces full pretraining. The hypothesis is more specific:

structured local growth + global loss + recomposable checkpoints may provide better memory/time/checkpoint efficiency under constrained hardware.

Although I’m testing this first with my custom drm_transformer backbone, the grafting mechanism is not DRM-specific. In principle, the same idea can be applied to any Transformer-style model where we can attach modules to hidden states or target projections:

GPT-style causal LMs
encoder-decoder Transformers
custom Transformer backbones
potentially PEFT-style workflows

DRM-SAINT-G should be viewed as related to PEFT, LoRA/QLoRA, adapters, sparse updates, modular growth, and structured matrix factorization. I’m not claiming general superiority over LoRA or full training. LoRA/QLoRA remain required baselines.

The distinction I’m exploring is:

route where capacity should grow
train compact structured grafts
validate each graft by gain per parameter / byte / time
keep the model recomposable through graft artifacts
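The "gain per parameter / byte / time" validation can be sketched as a small report object. The `GraftReport` fields, the units, and the numbers in the example are illustrative assumptions, not measurements from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraftReport:
    """Hypothetical per-graft record used to compute efficiency metrics."""

    name: str
    loss_before: float   # validation loss without this graft
    loss_after: float    # validation loss with this graft attached
    params: int          # trainable parameters in the graft
    artifact_bytes: int  # size of the saved graft artifact
    train_seconds: float

    @property
    def gain(self) -> float:
        """Validation improvement; positive means the graft helped."""
        return self.loss_before - self.loss_after

    def efficiency(self) -> dict[str, float]:
        return {
            "gain_per_mparam": self.gain / (self.params / 1e6),
            "gain_per_mib": self.gain / (self.artifact_bytes / 2**20),
            "gain_per_gpu_hour": self.gain / (self.train_seconds / 3600),
        }


# Illustrative values only.
report = GraftReport(
    name="graft_03",
    loss_before=10.5,
    loss_after=10.0,
    params=5_000_000,
    artifact_bytes=20 * 2**20,
    train_seconds=1800.0,
)
metrics = report.efficiency()
```

Reporting all three ratios side by side makes it harder for a graft to look good on one axis (say, loss per parameter) while being expensive on another (training time or checkpoint size).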

Current infrastructure that works:

frozen base model
trainable graft blocks
5M → ~125M effective parameter budget
CUDA-friendly execution
recomposable graft checkpoints
reload/recompose validation matching exactly
validation loss tracking
roadmap toward 350M and eventually partial 70B adaptation

Near-term next steps:

Finish the 4-hour wall-clock comparison: full 125M vs 5M + 24 grafts.
Improve graft selection: choose grafts by validation gain, not just by index.
Compare against LoRA/QLoRA on equivalent Transformer backbones.
Scale the grafted path toward 350M effective capacity on RTX 4090.
Publish clearer reports on:
- validation loss
- VRAM
- tokens/s
- checkpoint size
- gain per parameter
- gain per GPU-hour
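For the graft-selection step, one possible approach is a greedy pass that ranks candidates by validation gain per parameter and keeps those that fit a parameter budget. This is a sketch of that idea, not the project's selection rule; `Candidate`, `select_grafts`, and the example values are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    val_gain: float  # validation loss before minus after; positive is better
    params: int


def select_grafts(candidates: list[Candidate], param_budget: int) -> list[Candidate]:
    """Greedily keep positive-gain grafts, best gain per parameter first."""
    ranked = sorted(candidates, key=lambda c: c.val_gain / c.params, reverse=True)
    chosen: list[Candidate] = []
    used = 0
    for c in ranked:
        if c.val_gain <= 0:
            continue  # a graft that does not improve validation is dropped
        if used + c.params <= param_budget:
            chosen.append(c)
            used += c.params
    return chosen


# Illustrative values only.
candidates = [
    Candidate("graft_00", 0.30, 5_000_000),
    Candidate("graft_01", -0.05, 5_000_000),
    Candidate("graft_02", 0.10, 5_000_000),
    Candidate("graft_03", 0.20, 5_000_000),
]
chosen = select_grafts(candidates, param_budget=10_000_000)
print([c.name for c in chosen])  # ['graft_00', 'graft_03']
```

A greedy ranking ignores interactions between grafts, so gains measured one graft at a time may not add up when several are attached together; re-validating the composed model after selection would catch that.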

I’d be interested in feedback from the Hugging Face community, especially around:

fair PEFT baselines to compare against;
best practices for adapter/graft checkpoint formats;
whether similar “growth by recomposable modules” experiments already exist;
how to make this compatible with standard HF Transformer models.

Again, this is early research, not a finished claim. But the first result is promising on the efficiency axis: a ~125M effective grafted model path running with low VRAM and recomposable checkpoints, while a full 125M baseline is much more expensive to train.
