DRM-SAINT-G: Testing Transformer growth by recomposable grafts instead of full retraining
Hi everyone,
I’m working on an experimental project called DRM-SAINT-G, a research runtime for testing whether Transformer models can gain useful capacity through structured, recomposable grafts instead of updating every weight in the model.
Repository: github.com/gnai-creator/DRM-SAINT-G (DRM grafting with SAINT-Phi for compact model adaptation and growth)
The core question is simple:
Can a model gain useful new capacity without retraining the full parameter space?
Instead of treating training as an all-or-nothing process, DRM-SAINT-G freezes a smaller base model and adds trainable graft modules at selected points in the network.
The current experiment is:
DRM 5M base + 24 trainable graft blocks ≈ 125M effective parameter budget
Each graft block is a residual module attached to Transformer block outputs:
h_out = h + scale * down(silu(up(h)))
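To make the formula concrete, here is a minimal dependency-free Python sketch of one graft block. The names `GraftBlock`, `d_model`, and `d_graft` are hypothetical, and zero-initializing `down` (so the graft starts as an identity map) is an assumption borrowed from common adapter practice, not something taken from the repo. In PyTorch, `up` and `down` would typically be `nn.Linear` layers.

```python
import math
import random


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a (rows x cols) nested-list matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


class GraftBlock:
    """Residual graft: h_out = h + scale * down(silu(up(h))). Illustrative only."""

    def __init__(self, d_model, d_graft, scale=1.0, seed=0):
        rng = random.Random(seed)
        # up: d_model -> d_graft, small random init (assumption).
        self.up = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                   for _ in range(d_graft)]
        # down: d_graft -> d_model, zero init so the graft starts as identity (assumption).
        self.down = [[0.0] * d_graft for _ in range(d_model)]
        self.scale = scale

    def __call__(self, h):
        hidden = [silu(u) for u in matvec(self.up, h)]
        delta = matvec(self.down, hidden)
        # Residual connection: the frozen base activation h passes through unchanged.
        return [h_i + self.scale * d_i for h_i, d_i in zip(h, delta)]


graft = GraftBlock(d_model=4, d_graft=8)
h = [0.5, -1.0, 2.0, 0.0]
h_out = graft(h)  # equals h while `down` is all zeros
```

With this kind of initialization the grafted model reproduces the frozen base exactly at step 0, so any change in validation loss can be attributed to graft training.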
For the current DRM 5M configuration:
base parameters: 5,699,059
24 graft parameters: 119,296,536
effective total: 124,995,595
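The parameter accounting can be checked with a few lines of Python. The per-graft figure below assumes all 24 grafts are the same size, which the post does not state explicitly (although the total does divide evenly by 24, at 4,970,689 parameters each).

```python
BASE_PARAMS = 5_699_059
GRAFT_PARAMS = 119_296_536
NUM_GRAFTS = 24

effective_total = BASE_PARAMS + GRAFT_PARAMS
# Assumes equally sized grafts; a non-zero remainder would contradict that.
per_graft, remainder = divmod(GRAFT_PARAMS, NUM_GRAFTS)
trainable_fraction = GRAFT_PARAMS / effective_total

print(f"effective total:    {effective_total:,}")
print(f"per graft:          {per_graft:,} (remainder {remainder})")
print(f"trainable fraction: {trainable_fraction:.1%}")
```

Note that the grafts account for roughly 95% of the effective budget, so almost all of the "125M" is trainable graft capacity sitting on top of a small frozen core.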
This is being compared against a full DRM 125M smoke baseline.
Current reference numbers:
Full DRM 125M:
- 100 steps took roughly 4 hours on RTX 4090
- validation loss: 9.0499
DRM 5M + 24 grafts:
- effective budget: ~125M parameters
- CUDA peak in smoke: ~3.43 GiB
- recomposable checkpoint: works
- checkpoint reload/recompose loss diff: 0.0
- short tests show controlled validation behavior
Important caveat: this does not yet beat the full 125M model in absolute validation loss.
The current honest result is:
Full 125M validation loss: 9.0499
5M + 24 grafts short-run loss: ~10.4159
gap still exists: ~1.36 loss points
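For a sense of scale, the gap can be converted into a perplexity ratio. This is only meaningful under the assumption that both losses are mean per-token cross-entropy in nats measured on the same validation set, which is not confirmed here.

```python
import math

full_loss = 9.0499    # full DRM 125M, validation
graft_loss = 10.4159  # DRM 5M + 24 grafts, short run (approximate)

gap = graft_loss - full_loss
# exp(gap) is the perplexity ratio only if the loss is cross-entropy in nats (assumption).
ppl_ratio = math.exp(gap)

print(f"loss gap: {gap:.4f}")
print(f"perplexity ratio (grafted / full): {ppl_ratio:.2f}x")
```

Under that assumption the grafted model's perplexity is a bit under 4x that of the full baseline, which is why the gap should be read as substantial rather than marginal.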
But the operational efficiency difference is large enough that an equal-wall-clock comparison is now more interesting than an equal-step one:
Full 125M: 100 steps in ~4 hours
vs
DRM 5M + 24 grafts: train for the same ~4 hours
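An equal-wall-clock run can be driven by a time budget instead of a step count. The sketch below is one way to do that; `train_for_budget` and `step_fn` are hypothetical names, and `step_fn` stands in for whatever function runs one optimizer step and returns the loss.

```python
import time


def train_for_budget(step_fn, budget_seconds, clock=time.monotonic):
    """Call step_fn() repeatedly until the wall-clock budget is spent.

    step_fn is a hypothetical callable that runs one optimizer step and
    returns the training loss. Returns (steps_completed, last_loss).
    The clock is injectable so the loop can be tested without waiting.
    """
    start = clock()
    steps, last_loss = 0, None
    while clock() - start < budget_seconds:
        last_loss = step_fn()
        steps += 1
    return steps, last_loss


# Usage for the comparison above (run_one_step is a placeholder):
# steps, loss = train_for_budget(run_one_step, budget_seconds=4 * 3600)
```

For reference, 100 steps in roughly 4 hours works out to about 144 seconds per step for the full 125M baseline, so the number of steps the grafted path completes in the same budget is itself a result worth reporting. The budget is only checked between steps, so the last step may overrun slightly.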
The hypothesis is not that grafting magically replaces full pretraining. The hypothesis is more specific:
structured local growth + global loss + recomposable checkpoints may provide better memory/time/checkpoint efficiency under constrained hardware.
Although I’m testing this first with my custom drm_transformer backbone, the grafting mechanism is not DRM-specific. In principle, the same idea can be applied to any Transformer-style model where we can attach modules to hidden states or target projections:
- GPT-style causal LMs
- encoder-decoder Transformers
- custom Transformer backbones
- potentially PEFT-style workflows
DRM-SAINT-G should be viewed as related to PEFT, LoRA/QLoRA, adapters, sparse updates, modular growth, and structured matrix factorization. I’m not claiming general superiority over LoRA or full training. LoRA/QLoRA remain required baselines.
The distinction I’m exploring is:
- route where capacity should grow;
- train compact structured grafts;
- validate each graft by gain per parameter / byte / time;
- keep the model recomposable through graft artifacts.
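The "gain per parameter / byte / time" validation could be computed along these lines. `GraftResult` and its fields are hypothetical, and "gain" is assumed to mean the drop in validation loss from adding the graft; the numbers in the example are made up purely to show the units.

```python
from dataclasses import dataclass


@dataclass
class GraftResult:
    """Measurements for one trained graft (hypothetical record layout)."""
    name: str
    loss_before: float   # validation loss without this graft
    loss_after: float    # validation loss with this graft attached
    params: int          # trainable parameters in the graft
    bytes_on_disk: int   # size of the saved graft artifact
    train_seconds: float

    @property
    def gain(self):
        # Positive gain means the graft lowered validation loss.
        return self.loss_before - self.loss_after


def efficiency(result):
    """Normalize the gain by parameters, artifact size, and training time."""
    return {
        "gain": result.gain,
        "gain_per_mparam": result.gain / (result.params / 1e6),
        "gain_per_mib": result.gain / (result.bytes_on_disk / 2**20),
        "gain_per_gpu_hour": result.gain / (result.train_seconds / 3600),
    }


# Illustrative values only, not measured results.
example = GraftResult("graft_00", 10.0, 9.5, 2_000_000, 4 * 2**20, 1800.0)
scores = efficiency(example)
```

Keeping the three normalizations separate, rather than collapsing them into one score, makes it easier to see whether a graft is cheap in memory but slow to train, or the reverse.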
Current infrastructure that works:
- frozen base model
- trainable graft blocks
- 5M → ~125M effective parameter budget
- CUDA-friendly execution
- recomposable graft checkpoints
- reload/recompose validation matching exactly
- validation loss tracking
- roadmap toward 350M and eventually partial 70B adaptation
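As an illustration of what "recomposable" can mean in practice, the sketch below stores each graft as its own artifact alongside a manifest of SHA-256 digests, then reloads any subset and verifies integrity. The file layout and the functions `save_grafts` and `recompose` are assumptions for illustration, not the repo's actual format; a real run would more likely store tensors in a binary format such as safetensors rather than JSON.

```python
import hashlib
import json
from pathlib import Path


def save_grafts(grafts, out_dir):
    """Write each graft as its own JSON artifact plus a manifest of SHA-256 digests."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for name, weights in grafts.items():
        payload = json.dumps(weights, sort_keys=True).encode("utf-8")
        (out_dir / f"{name}.json").write_bytes(payload)
        manifest[name] = hashlib.sha256(payload).hexdigest()
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def recompose(out_dir, names=None):
    """Reload a chosen subset of grafts, rejecting any artifact whose digest changed."""
    out_dir = Path(out_dir)
    manifest = json.loads((out_dir / "manifest.json").read_text())
    grafts = {}
    for name in (names if names is not None else manifest):
        payload = (out_dir / f"{name}.json").read_bytes()
        if hashlib.sha256(payload).hexdigest() != manifest[name]:
            raise ValueError(f"graft {name!r} does not match its manifest digest")
        grafts[name] = json.loads(payload)
    return grafts
```

Because Python's `json` module round-trips finite floats exactly, reloaded weights are bit-identical to the saved ones, which is the property behind a reload/recompose loss diff of 0.0. The frozen base is stored once, and each graft can be shipped, dropped, or swapped independently.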
Near-term next steps:
1. Finish the 4-hour wall-clock comparison: full 125M vs 5M + 24 grafts.
2. Improve graft selection: choose grafts by validation gain, not just by index.
3. Compare against LoRA/QLoRA on equivalent Transformer backbones.
4. Scale the grafted path toward 350M effective capacity on RTX 4090.
5. Publish clearer reports on:
- validation loss
- VRAM
- tokens/s
- checkpoint size
- gain per parameter
- gain per GPU-hour
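For the graft-selection step in the list above, one simple option is a greedy pick by validation gain per parameter under a parameter budget. This is a sketch of that idea, not the project's method: `select_grafts` is a hypothetical helper, and it assumes each candidate's gain was measured in isolation, which ignores interactions between grafts.

```python
def select_grafts(candidates, param_budget):
    """Greedily pick grafts by validation gain per parameter within a budget.

    candidates: list of (name, validation_gain, params) tuples, where
    validation_gain is the measured drop in validation loss (hypothetical input).
    Returns (chosen_names, params_used).
    """
    # Discard grafts that did not help, then rank the rest by gain per parameter.
    ranked = sorted(
        (c for c in candidates if c[1] > 0),
        key=lambda c: c[1] / c[2],
        reverse=True,
    )
    chosen, used = [], 0
    for name, gain, params in ranked:
        if used + params <= param_budget:
            chosen.append(name)
            used += params
    return chosen, used


# Illustrative values only, not measured results.
candidates = [
    ("g0", 0.30, 3_000_000),
    ("g1", 0.20, 1_000_000),
    ("g2", -0.05, 1_000_000),
    ("g3", 0.10, 2_000_000),
]
chosen, used = select_grafts(candidates, param_budget=4_000_000)
```

Since gains from separately measured grafts are unlikely to add up exactly, the selected set should be re-validated as a whole before its combined gain is reported.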
I’d be interested in feedback from the Hugging Face community, especially around:
- fair PEFT baselines to compare against;
- best practices for adapter/graft checkpoint formats;
- whether similar “growth by recomposable modules” experiments already exist;
- how to make this compatible with standard HF Transformer models.
Again, this is early research, not a finished claim. But the first result is promising on the efficiency axis: a ~125M effective grafted model path running with low VRAM and recomposable checkpoints, while a full 125M baseline is much more expensive to train.