DRM-SAINT-G: Testing Transformer growth by recomposable grafts instead of full retraining

Hugging Face Forums [Unofficial] May 19, 2026

Hi everyone,

I’m working on an experimental project called DRM-SAINT-G, a research runtime for testing whether Transformer models can gain useful capacity through structured, recomposable grafts instead of updating every weight in the model.

Repository:

github.com

GitHub - gnai-creator/DRM-SAINT-G: DRM-SAINT-G - DRM grafting with SAINT-Phi for...

DRM-SAINT-G - DRM grafting with SAINT-Phi for compact model adaptation and growth

The core question is simple:

Can a model gain useful new capacity without retraining the full parameter space?

Instead of treating training as an all-or-nothing process, DRM-SAINT-G freezes a smaller base model and adds trainable graft modules at selected points in the network.

The current experiment is:

DRM 5M base + 24 trainable graft blocks ≈ 125M effective parameter budget

Each graft block is a residual module attached to Transformer block outputs:

h_out = h + scale * down(silu(up(h)))
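
To make the formula concrete, here is a minimal sketch of one graft applied to a single hidden vector. It is written in plain Python so it runs without any framework and is not taken from the repository. The bias-free projections, the toy sizes, and the helper names (`linear`, `graft_forward`) are assumptions for illustration only.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def linear(weight, vec):
    """Bias-free matrix-vector product; weight is a list of rows (assumed layout)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def graft_forward(h, up_w, down_w, scale):
    """Compute h_out = h + scale * down(silu(up(h))) for one hidden vector."""
    inner = [silu(z) for z in linear(up_w, h)]   # up-projection, then SiLU
    delta = linear(down_w, inner)                # down-projection back to len(h)
    return [hi + scale * di for hi, di in zip(h, delta)]

# Hypothetical toy sizes: hidden width 2, graft inner width 3.
h = [1.0, -2.0]
up_w = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]      # shape (3, 2)
down_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # shape (2, 3)

identity_out = graft_forward(h, up_w, down_w, scale=0.0)  # graft switched off
grafted_out = graft_forward(h, up_w, down_w, scale=1.0)   # graft fully on

print(identity_out)
print(grafted_out)
```

With `scale = 0.0` the block returns `h` unchanged, which follows directly from the residual form. Whether the actual grafts start from a zero scale is not stated here, so treat that as one possible initialization rather than the project's choice.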

For the current DRM 5M configuration:

base parameters: 5,699,059
24 graft parameters: 119,296,536
effective total: 124,995,595
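
The parameter accounting above can be checked with a few lines of arithmetic. The per-graft figure below assumes all 24 grafts are the same size, which the totals are consistent with but which is not confirmed here.

```python
BASE_PARAMS = 5_699_059
GRAFT_PARAMS = 119_296_536
NUM_GRAFTS = 24

# Effective budget = frozen base + all trainable grafts.
effective_total = BASE_PARAMS + GRAFT_PARAMS

# Assumption: grafts are equally sized, so the total divides evenly.
per_graft, remainder = divmod(GRAFT_PARAMS, NUM_GRAFTS)

# Share of the effective budget that is trainable (the base stays frozen).
trainable_fraction = GRAFT_PARAMS / effective_total

print(f"effective total: {effective_total:,}")
print(f"per graft:       {per_graft:,} (remainder {remainder})")
print(f"trainable share: {trainable_fraction:.1%}")
```

So roughly 95% of the effective budget sits in the grafts, and only about 5% is the frozen base.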

This is being compared against a full DRM 125M smoke-test baseline.

Current reference numbers:

Full DRM 125M:

  • 100 steps took roughly 4 hours on an RTX 4090
  • validation loss: 9.0499

DRM 5M + 24 grafts:

  • effective budget: ~125M parameters
  • peak CUDA memory in the smoke run: ~3.43 GiB
  • recomposable checkpoint: works
  • checkpoint reload/recompose loss diff: 0.0
  • short tests show controlled validation behavior

Important caveat: this does not yet beat the full 125M model in absolute validation loss.

The current honest result is:

Full 125M validation loss: 9.0499
5M + 24 grafts short-run loss: ~10.4159
remaining gap: ~1.37 loss points

But the difference in operational efficiency is large enough that the next experiment is more interesting than an equal-step comparison:

Full 125M: 100 steps in ~4 hours

vs

DRM 5M + 24 grafts: train for the same ~4 hours
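
An equal-wall-clock comparison means stopping on elapsed time rather than on step count. A minimal sketch of such a loop follows, assuming a hypothetical `step_fn` that performs one optimizer step; the injectable `clock` and the `FakeClock` helper exist only so the example runs instantly and are not part of the project.

```python
import time

def train_for_budget(step_fn, budget_seconds, clock=time.monotonic):
    """Call step_fn until the wall-clock budget is spent.

    Returns (completed_steps, elapsed_seconds). The check happens before
    each step, so the budget can be overshot by at most one step.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps, clock() - start

class FakeClock:
    """Deterministic stand-in for time.monotonic, used only for this demo."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def advance(self, seconds):
        self.now += seconds

BUDGET_SECONDS = 4 * 3600
# From the reference run: 100 steps in ~4 hours, i.e. ~144 s per step.
BASELINE_SECONDS_PER_STEP = BUDGET_SECONDS / 100

clock = FakeClock()
steps, elapsed = train_for_budget(
    lambda: clock.advance(BASELINE_SECONDS_PER_STEP),
    BUDGET_SECONDS,
    clock=clock,
)
print(steps, elapsed)
```

Replaying the baseline's ~144 s per step against a 4-hour budget gives back the 100 steps reported above. The grafted run's step time is not given here, so its step count under the same budget is what the experiment would measure.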

The hypothesis is not that grafting magically replaces full pretraining. The hypothesis is more specific:

structured local growth + global loss + recomposable checkpoints may provide better memory/time/checkpoint efficiency under constrained hardware.

Although I’m testing this first with my custom drm_transformer backbone, the grafting mechanism is not DRM-specific. In principle, the same idea can be applied to any Transformer-style model where we can attach modules to hidden states or target projections:

  • GPT-style causal LMs
  • encoder-decoder Transformers
  • custom Transformer backbones
  • potentially PEFT-style workflows

DRM-SAINT-G should be viewed as related to PEFT, LoRA/QLoRA, adapters, sparse updates, modular growth, and structured matrix factorization. I’m not claiming general superiority over LoRA or full training. LoRA/QLoRA remain required baselines.

The distinction I’m exploring is:

  • route where capacity should grow;
  • train compact structured grafts;
  • validate each graft by gain per parameter / byte / time;
  • keep the model recomposable through graft artifacts.
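
One way to make "gain per parameter / byte / time" operational is to record a small trial result per graft and rank on it. The sketch below is an illustration only: the `GraftTrial` fields, the `select_grafts` rule, and every number in the toy data are made up for the example and are not results from the project.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GraftTrial:
    name: str
    val_loss_before: float   # validation loss without this graft
    val_loss_after: float    # validation loss with this graft trained
    num_params: int
    artifact_bytes: int
    train_seconds: float

    @property
    def gain(self):
        """Positive when the graft lowered validation loss."""
        return self.val_loss_before - self.val_loss_after

def efficiency(trial):
    """Normalize the validation gain by each cost axis."""
    return {
        "gain": trial.gain,
        "gain_per_mparam": trial.gain / (trial.num_params / 1e6),
        "gain_per_mib": trial.gain / (trial.artifact_bytes / 2**20),
        "gain_per_hour": trial.gain / (trial.train_seconds / 3600),
    }

def select_grafts(trials, min_gain=0.0):
    """Keep grafts that help, ordered by gain per parameter (best first)."""
    kept = [t for t in trials if t.gain > min_gain]
    return sorted(kept, key=lambda t: t.gain / t.num_params, reverse=True)

# Hypothetical toy data, NOT measured results.
trials = [
    GraftTrial("block_11", 10.5, 10.0, 16_000_000, 64 * 2**20, 1800.0),
    GraftTrial("block_03", 10.5, 10.25, 4_000_000, 16 * 2**20, 600.0),
    GraftTrial("block_07", 10.5, 10.5625, 4_000_000, 16 * 2**20, 600.0),
]

selected = select_grafts(trials)
for t in selected:
    print(t.name, efficiency(t))
```

In this toy data the smaller graft ranks first despite a smaller absolute gain, and the graft that raised validation loss is dropped. Ranking by bytes or time instead only requires changing the sort key.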

Current infrastructure that works:

  • frozen base model
  • trainable graft blocks
  • 5M → ~125M effective parameter budget
  • CUDA-friendly execution
  • recomposable graft checkpoints
  • reload/recompose validation matching exactly
  • validation loss tracking
  • roadmap toward 350M and eventually partial 70B adaptation
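
The "reload/recompose validation matching exactly" check from the list above can be sketched as follows: store each graft as its own artifact plus a manifest, rebuild the set from disk, and compare the loss before and after. The JSON layout, the file names, and the scalar `toy_loss` stand-in are assumptions for this example; the repository's actual checkpoint format may differ.

```python
import json
import math
import tempfile
from pathlib import Path

def save_graft(directory, name, params):
    """Write one graft as its own artifact file."""
    path = Path(directory) / f"{name}.json"
    path.write_text(json.dumps(params))
    return path

def write_manifest(directory, names):
    """Record which grafts make up the model, in application order."""
    (Path(directory) / "manifest.json").write_text(json.dumps({"grafts": names}))

def recompose(directory):
    """Rebuild the graft set from the manifest and the per-graft artifacts."""
    directory = Path(directory)
    names = json.loads((directory / "manifest.json").read_text())["grafts"]
    return {n: json.loads((directory / f"{n}.json").read_text()) for n in names}

def toy_loss(base_weight, grafts, x, target):
    """Stand-in for a validation pass: scalar model with residual grafts."""
    h = base_weight * x
    for g in grafts.values():
        z = g["up"] * h
        h = h + g["scale"] * g["down"] * (z / (1.0 + math.exp(-z)))  # SiLU
    return (h - target) ** 2

# Hypothetical graft parameters for the demo.
grafts = {
    "graft_00": {"up": 0.3, "down": -1.7, "scale": 0.1},
    "graft_01": {"up": 1.1, "down": 0.4, "scale": 0.05},
}

with tempfile.TemporaryDirectory() as tmp:
    for name, params in grafts.items():
        save_graft(tmp, name, params)
    write_manifest(tmp, list(grafts))
    reloaded = recompose(tmp)

loss_diff = abs(toy_loss(0.5, grafts, 2.0, 1.0) - toy_loss(0.5, reloaded, 2.0, 1.0))
print(loss_diff)
```

Python's `json` module round-trips floats exactly, so the difference here is exactly zero. With real tensors, an exact match additionally depends on saving in the same dtype and applying grafts in the same order.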

Near-term next steps:

  1. Finish the 4-hour wall-clock comparison: full 125M vs 5M + 24 grafts.

  2. Improve graft selection: choose grafts by validation gain, not just by index.

  3. Compare against LoRA/QLoRA on equivalent Transformer backbones.

  4. Scale the grafted path toward 350M effective capacity on an RTX 4090.

  5. Publish clearer reports on:

    • validation loss
    • VRAM
    • tokens/s
    • checkpoint size
    • gain per parameter
    • gain per GPU-hour

I’d be interested in feedback from the Hugging Face community, especially around:

  • fair PEFT baselines to compare against;
  • best practices for adapter/graft checkpoint formats;
  • whether similar “growth by recomposable modules” experiments already exist;
  • how to make this compatible with standard HF Transformer models.

Again, this is early research, not a finished claim. But the first result is promising on the efficiency axis: a ~125M effective grafted model path running with low VRAM and recomposable checkpoints, while a full 125M baseline is much more expensive to train.
