{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibmdbiretlekwiho5wmcxk4oqnfuzprd6ikn6zggalkm6qdqt3kqa",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mma4wgtdbud2"
},
"path": "/t/drm-saint-g-testing-transformer-growth-by-recomposable-grafts-instead-of-full-retraining/176105#post_1",
"publishedAt": "2026-05-19T18:32:36.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"github.com",
"GitHub - gnai-creator/DRM-SAINT-G: DRM-SAINT-G - DRM grafting with SAINT-Phi for..."
],
"textContent": "Hi everyone,\n\nI’m working on an experimental project called DRM-SAINT-G, a research runtime for testing whether Transformer models can gain useful capacity through structured, recomposable grafts instead of updating every weight in the model.\n\nRepository:\n\ngithub.com\n\n### GitHub - gnai-creator/DRM-SAINT-G: DRM-SAINT-G - DRM grafting with SAINT-Phi for...\n\nDRM-SAINT-G - DRM grafting with SAINT-Phi for compact model adaptation and growth\n\nThe core question is simple:\n\nCan a model gain useful new capacity without retraining the full parameter space?\n\nInstead of treating training as an all-or-nothing process, DRM-SAINT-G freezes a smaller base model and adds trainable graft modules at selected points in the network.\n\nThe current experiment is:\n\nDRM 5M base\n\n * 24 trainable graft blocks\n≈ 125M effective parameter budget\n\n\n\nEach graft block is a residual module attached to Transformer block outputs:\n\nh_out = h + scale * down(silu(up(h)))\n\nFor the current DRM 5M configuration:\n\nbase parameters: 5,699,059\n24 graft parameters: 119,296,536\neffective total: 124,995,595\n\nThis is being compared against a full DRM 125M smoke baseline.\n\nCurrent reference numbers:\n\nFull DRM 125M:\n\n * 100 steps took roughly 4 hours on RTX 4090\n * validation loss: 9.0499\n\n\n\nDRM 5M + 24 grafts:\n\n * effective budget: ~125M parameters\n * CUDA peak in smoke: ~3.43 GiB\n * recomposable checkpoint: works\n * checkpoint reload/recompose loss diff: 0.0\n * short tests show controlled validation behavior\n\n\n\nImportant caveat: this does not yet beat the full 125M model in absolute validation loss.\n\nThe current honest result is:\n\nFull 125M validation loss: 9.0499\n5M + 24 grafts short-run loss: ~10.4159\ngap still exists: ~1.36 loss points\n\nBut the operational efficiency difference is large enough that the next experiment is more interesting than equal-step comparison:\n\nFull 125M:\n100 steps in ~4 hours\n\nvs\n\nDRM 5M + 24 
grafts:\ntrain for the same ~4 hours\n\nThe hypothesis is not that grafting magically replaces full pretraining. The hypothesis is more specific:\n\nstructured local growth + global loss + recomposable checkpoints\nmay provide better memory/time/checkpoint efficiency under constrained hardware.\n\nAlthough I’m testing this first with my custom drm_transformer backbone, the grafting mechanism is not DRM-specific. In principle, the same idea can be applied to any Transformer-style model where we can attach modules to hidden states or target projections:\n\n * GPT-style causal LMs\n * encoder-decoder Transformers\n * custom Transformer backbones\n * potentially PEFT-style workflows\n\n\n\nDRM-SAINT-G should be viewed as related to PEFT, LoRA/QLoRA, adapters, sparse updates, modular growth, and structured matrix factorization. I’m not claiming general superiority over LoRA or full training. LoRA/QLoRA remain required baselines.\n\nThe distinction I’m exploring is:\n\nroute where capacity should grow,\ntrain compact structured grafts,\nvalidate each graft by gain per parameter / byte / time,\nkeep the model recomposable through graft artifacts.\n\nCurrent infrastructure that works:\n\n * frozen base model\n * trainable graft blocks\n * 5M → ~125M effective parameter budget\n * CUDA-friendly execution\n * recomposable graft checkpoints\n * reload/recompose validation matching exactly\n * validation loss tracking\n * roadmap toward 350M and eventually partial 70B adaptation\n\n\n\nNear-term next steps:\n\n 1. Finish the 4-hour wall-clock comparison:\nfull 125M vs 5M + 24 grafts.\n\n 2. Improve graft selection:\nchoose grafts by validation gain, not just by index.\n\n 3. Compare against LoRA/QLoRA on equivalent Transformer backbones.\n\n 4. Scale the grafted path toward 350M effective capacity on RTX 4090.\n\n 5. 
Publish clearer reports on:\n\n * validation loss\n * VRAM\n * tokens/s\n * checkpoint size\n * gain per parameter\n * gain per GPU-hour\n\n\n\nI’d be interested in feedback from the Hugging Face community, especially around:\n\n * fair PEFT baselines to compare against;\n * best practices for adapter/graft checkpoint formats;\n * whether similar “growth by recomposable modules” experiments already exist;\n * how to make this compatible with standard HF Transformer models.\n\n\n\nAgain, this is early research, not a finished claim. But the first result is promising on the efficiency axis: a ~125M effective grafted model path running with low VRAM and recomposable checkpoints, while a full 125M baseline is much more expensive to train.",
"title": "DRM-SAINT-G: Testing Transformer growth by recomposable grafts instead of full retraining"
}