{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibmdbiretlekwiho5wmcxk4oqnfuzprd6ikn6zggalkm6qdqt3kqa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mma4wgtdbud2"
  },
  "path": "/t/drm-saint-g-testing-transformer-growth-by-recomposable-grafts-instead-of-full-retraining/176105#post_1",
  "publishedAt": "2026-05-19T18:32:36.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "github.com",
    "GitHub - gnai-creator/DRM-SAINT-G: DRM-SAINT-G - DRM grafting with SAINT-Phi for..."
  ],
  "textContent": "Hi everyone,\n\nI’m working on an experimental project called DRM-SAINT-G, a research runtime for testing whether Transformer models can gain useful capacity through structured, recomposable grafts instead of updating every weight in the model.\n\nRepository:\n\ngithub.com\n\n### GitHub - gnai-creator/DRM-SAINT-G: DRM-SAINT-G - DRM grafting with SAINT-Phi for...\n\nDRM-SAINT-G - DRM grafting with SAINT-Phi for compact model adaptation and growth\n\nThe core question is simple:\n\nCan a model gain useful new capacity without retraining the full parameter space?\n\nInstead of treating training as an all-or-nothing process, DRM-SAINT-G freezes a smaller base model and adds trainable graft modules at selected points in the network.\n\nThe current experiment is:\n\nDRM 5M base\n\n  * 24 trainable graft blocks\n≈ 125M effective parameter budget\n\n\n\nEach graft block is a residual module attached to Transformer block outputs:\n\nh_out = h + scale * down(silu(up(h)))\n\nFor the current DRM 5M configuration:\n\nbase parameters: 5,699,059\n24 graft parameters: 119,296,536\neffective total: 124,995,595\n\nThis is being compared against a full DRM 125M smoke baseline.\n\nCurrent reference numbers:\n\nFull DRM 125M:\n\n  * 100 steps took roughly 4 hours on RTX 4090\n  * validation loss: 9.0499\n\n\n\nDRM 5M + 24 grafts:\n\n  * effective budget: ~125M parameters\n  * CUDA peak in smoke: ~3.43 GiB\n  * recomposable checkpoint: works\n  * checkpoint reload/recompose loss diff: 0.0\n  * short tests show controlled validation behavior\n\n\n\nImportant caveat: this does not yet beat the full 125M model in absolute validation loss.\n\nThe current honest result is:\n\nFull 125M validation loss: 9.0499\n5M + 24 grafts short-run loss: ~10.4159\ngap still exists: ~1.36 loss points\n\nBut the operational efficiency difference is large enough that the next experiment is more interesting than equal-step comparison:\n\nFull 125M:\n100 steps in ~4 hours\n\nvs\n\nDRM 5M + 24 
grafts:\ntrain for the same ~4 hours\n\nThe hypothesis is not that grafting magically replaces full pretraining. The hypothesis is more specific:\n\nstructured local growth + global loss + recomposable checkpoints\nmay provide better memory/time/checkpoint efficiency under constrained hardware.\n\nAlthough I’m testing this first with my custom drm_transformer backbone, the grafting mechanism is not DRM-specific. In principle, the same idea can be applied to any Transformer-style model where we can attach modules to hidden states or target projections:\n\n  * GPT-style causal LMs\n  * encoder-decoder Transformers\n  * custom Transformer backbones\n  * potentially PEFT-style workflows\n\n\n\nDRM-SAINT-G should be viewed as related to PEFT, LoRA/QLoRA, adapters, sparse updates, modular growth, and structured matrix factorization. I’m not claiming general superiority over LoRA or full training. LoRA/QLoRA remain required baselines.\n\nThe distinction I’m exploring is:\n\nroute where capacity should grow,\ntrain compact structured grafts,\nvalidate each graft by gain per parameter / byte / time,\nkeep the model recomposable through graft artifacts.\n\nCurrent infrastructure that works:\n\n  * frozen base model\n  * trainable graft blocks\n  * 5M → ~125M effective parameter budget\n  * CUDA-friendly execution\n  * recomposable graft checkpoints\n  * reload/recompose validation matching exactly\n  * validation loss tracking\n  * roadmap toward 350M and eventually partial 70B adaptation\n\n\n\nNear-term next steps:\n\n  1. Finish the 4-hour wall-clock comparison:\nfull 125M vs 5M + 24 grafts.\n\n  2. Improve graft selection:\nchoose grafts by validation gain, not just by index.\n\n  3. Compare against LoRA/QLoRA on equivalent Transformer backbones.\n\n  4. Scale the grafted path toward 350M effective capacity on RTX 4090.\n\n  5. 
Publish clearer reports on:\n\n     * validation loss\n     * VRAM\n     * tokens/s\n     * checkpoint size\n     * gain per parameter\n     * gain per GPU-hour\n\n\n\nI’d be interested in feedback from the Hugging Face community, especially around:\n\n  * fair PEFT baselines to compare against;\n  * best practices for adapter/graft checkpoint formats;\n  * whether similar “growth by recomposable modules” experiments already exist;\n  * how to make this compatible with standard HF Transformer models.\n\n\n\nAgain, this is early research, not a finished claim. But the first result is promising on the efficiency axis: a ~125M effective grafted model path running with low VRAM and recomposable checkpoints, while a full 125M baseline is much more expensive to train.",
  "title": "DRM-SAINT-G: Testing Transformer growth by recomposable grafts instead of full retraining"
}