{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidpv5jkj7mpg3tuwaqixfzb27jvafg6otyxv7e7qxhl6fwgsw7xte",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhfhy2o7ptv2"
  },
  "path": "/t/working-my-way-up-to-build-a-ai-model-from-scratch/174377#post_2",
  "publishedAt": "2026-03-19T05:47:52.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "T2V models",
    "Stanford CS336",
    "GitHub",
    "Hugging Face",
    "arXiv",
    "Google AI for Developers"
  ],
  "textContent": "> build a LLM of my own from scratch (say a prompt to video LLM!)\n\nIf this refers to T2V models, building it from scratch would be financially unfeasible…\n\n* * *\n\nThe best way to work up to “my own AI model” today is **not** to jump straight into training a full prompt-to-video model from random weights. The effective path in 2026 is: **train small models from scratch to learn the mechanics, then build the real system by adapting strong open models**. That is not a compromise. It is the fastest route to both real understanding and something usable. Stanford’s CS336 is still explicitly about language modeling from scratch, Hugging Face still teaches tokenizer/model training separately, and Open-Sora 2.0 still shows that serious video pretraining is a large-scale engineering project, not a normal first build. (Stanford CS336)\n\n## First, fix the mental model\n\nA “prompt-to-video LLM” is usually **not one model**. In the current open stack, the system is usually split into: a **text model or text encoder** that understands the prompt, a **video latent compressor** such as a 3D VAE, a **video generator** built with diffusion or flow matching, and then a decoder that turns latent video back into pixels. HunyuanVideo describes an LLM-based text encoder plus a Causal 3D VAE, and LTX-2 describes itself as an audio-video foundation model rather than a plain language model. (GitHub)\n\nThat means there are really **three different goals** hidden inside “build a model from scratch”:\n\n  1. **Learn the internals deeply.**\n  2. **Build a useful product.**\n  3. **Pretrain your own foundation model.**\n\n\n\nThose are different projects with different budgets, different timelines, and different failure modes. Most confusion starts when people mix them together. 
(Stanford CS336)\n\n## What “from scratch” should mean for you\n\nFor your stage, “from scratch” should mainly mean:\n\n  * write and train a **small language model** yourself\n  * write and train a **small diffusion model** yourself\n  * then build the real prompt-to-video system with pretrained parts\n\n\n\nThat gives you the concepts without forcing you into research-lab-scale compute on day one. Hugging Face’s course still says training from scratch makes sense mainly when you have **a lot of data** and it is **very different** from the data used by existing models. (Hugging Face)\n\n## What to build from scratch first\n\n### 1. A tiny text model\n\nStart with a tokenizer and a small decoder-only transformer. The important lessons here are not “how to get a great model,” but:\n\n  * what tokenization actually does\n  * how next-token prediction works\n  * how attention works\n  * how sampling works\n  * how training loss behaves\n  * how evaluation differs from vibes\n\n\n\nHugging Face’s tokenizer chapter is useful because it states the key point clearly: **training a tokenizer is not the same as training a model**. CS336 is the best current full-stack course for this path. (Hugging Face)\n\n### 2. A tiny diffusion model\n\nSince your end goal is video, you should learn diffusion early. Hugging Face’s Diffusion Course and the official basic training tutorial still provide the cleanest entry. The tutorial explicitly walks through training a `UNet2DModel` from scratch, which is exactly the right scale for learning denoising, conditioning, and sampling before you touch video. (Hugging Face)\n\nThis is why I would not treat “LLM” as the whole problem. For text-to-video, the hard part is usually not just language understanding. It is latent video generation, temporal consistency, and the data/training pipeline around that. Open-Sora and HunyuanVideo make that structure very clear. 
(arXiv)\n\n## The better way to build something real today\n\nIf your real goal is “I want a system that turns prompts into videos,” the strongest 2026 approach is this:\n\n  1. **Use a small text or multimodal model for planning and prompt rewriting.**\n  2. **Use an open video model as the actual generator.**\n  3. **Adapt it with LoRA or the model’s official trainer.**\n  4. **Add ranking, filtering, and post-processing.**\n\n\n\nThat is much closer to how modern open systems are actually used. HunyuanVideo has an official prompt-rewrite component, LTX-2 emphasizes production-ready audio+video workflows, and Wan provides practical local inference paths. (GitHub)\n\n## Which current base models are worth considering\n\n### For planning, rewriting, and control\n\nUse a **small open text or multimodal model** for the controller side.\n\n**Gemma 3n** is a good choice if you want a compact multimodal controller that can run on everyday devices. Google’s docs say it is optimized for phones, laptops, and tablets, and it supports text, vision, and audio input. (Google AI for Developers)\n\n**Qwen3.5** is a good choice if you want a small modern open text family for prototyping and task-specific fine-tuning. The official model cards explicitly position the small Qwen3.5 models for prototyping and task-specific tuning. (Hugging Face)\n\n**Qwen3-VL** is useful if you want the planner to inspect images, reference frames, or storyboards and then output prompts or structured instructions. The official collection and model card present it as a text-image-video-capable family. (Hugging Face)\n\nThese models are good for:\n\n  * prompt cleanup\n  * shot planning\n  * storyboard text\n  * structured output\n  * caption expansion\n  * tool calls\n\n\n\nThey are **not** the actual video engine. (Google AI for Developers)\n\n### For actual video generation\n\n**Wan2.1** is still one of the easiest low-barrier entries. 
Its official repo says the T2V-1.3B model needs only **8.19 GB VRAM** and can generate a 5-second 480p clip on a 4090 in about 4 minutes. If your main goal is “get a local open video model running,” this is still a strong starting point. (GitHub)\n\n**Wan2.2** is the more current Wan line. The official repo adds newer task branches such as image-to-video and text+image-to-video, and it documents single-GPU inference paths for the released models. If your hardware is decent and you want the newer stack, prefer 2.2 over 2.1. (GitHub)\n\n**HunyuanVideo-1.5** is the stronger current open base if you care more about quality and official training support. Its repo says training code and a LoRA tuning script were released in December 2025, and it supports distributed training, FSDP, context parallelism, and gradient checkpointing. (GitHub)\n\n**LTX-2** is the most interesting if your long-term goal is a controllable creative system rather than just raw generation. The official repo positions it as the first DiT-based audio-video foundation model with synchronized audio/video, multiple performance modes, and production-oriented outputs. The trainer package supports LoRA, full fine-tuning, and IC-LoRA/video-to-video training on custom datasets. (GitHub)\n\n**Open-Sora** is the research reference, not the starter project. It is the clearest open example of what a full video training stack looks like. Open-Sora 1.2 describes a reproducible setup with about **30 million video clips** totaling about **80k hours**, and Open-Sora 2.0 says a commercial-level model was trained for about **$200k**. That is the right thing to study if you want to understand the frontier, but not the right first thing to build. 
(arXiv)\n\n## Fine-tune or train from scratch\n\nToday, the default answer should be:\n\n  * **use a pretrained base**\n  * **adapt with LoRA**\n  * use **QLoRA or other memory-saving approaches** if hardware is tight\n  * only consider full pretraining after you have evidence the base model is the bottleneck\n\n\n\nPEFT exists exactly for this. Its official docs say PEFT methods adapt large pretrained models by training only a small number of extra parameters, cutting compute and storage cost. The LoRA docs explain the core idea directly: low-rank adapters reduce the number of trainable parameters drastically. TRL’s SFTTrainer is the standard supervised fine-tuning path on top of that. (GitHub)\n\nThat means your practical path is usually:\n\n  * pick a current base\n  * define the behavior you want\n  * create a focused dataset\n  * fine-tune adapters\n  * evaluate\n  * only then decide if you need something heavier\n\n\n\n## How I would structure the roadmap\n\n### Stage 1. Learn the mechanics\n\nBuild:\n\n  * a tokenizer\n  * a tiny GPT-like model\n  * a tiny diffusion model\n\n\n\nResources:\n\n  * CS336\n  * Hugging Face LLM Course\n  * Hugging Face Diffusion Course\n(Stanford CS336)\n\n\n\n### Stage 2. Build a usable pipeline\n\nUse:\n\n  * Gemma 3n or Qwen3.5/Qwen3-VL for planning\n  * Wan2.1 or Wan2.2 for the easiest video start\n  * HunyuanVideo-1.5 if you want a stronger base with official training support\n  * LTX-2 if you care about audio-video and stronger control\n(Google AI for Developers)\n\n\n\n### Stage 3. Scale only when needed\n\nWhen your single-GPU scripts stop being enough, move to:\n\n  * **Accelerate**\n  * **FSDP**\n  * **DeepSpeed**\n\n\n\nThe official Accelerate docs explicitly cover FSDP and DeepSpeed integration for scaling training. (Hugging Face)\n\n## The two biggest pitfalls right now\n\n### 1. Following stale tutorials\n\nThis is a real 2026 problem. 
Hugging Face **Transformers v5** is a major release with meaningful API and architecture-handling changes, and the official blog presents it as a major simplification and modernization step. Separately, `datasets` changed enough that many older tutorials now fail with **“Dataset scripts are no longer supported.”** If you follow random 2023–2024 tutorials uncritically, some of them will simply be broken. (Hugging Face)\n\n### 2. Confusing inference with training\n\nA model that is easy to **run** can still be hard to **tune**. Wan gives accessible inference paths, but HunyuanVideo-1.5’s training release makes clear that serious tuning workflows involve distributed training features and a specific optimizer recommendation. LTX-2’s trainer also signals a more demanding setup than “download and click run.” (GitHub)\n\n## My recommendation, plainly\n\nIf your goal is **understanding**, train small models from scratch.\n\nIf your goal is **a usable prompt-to-video system today**, do this instead:\n\n  * use a small controller model such as **Gemma 3n** or **Qwen3.5/Qwen3-VL**\n  * use **Wan2.1/2.2**, **HunyuanVideo-1.5**, or **LTX-2** as the generation base\n  * adapt with **LoRA/PEFT**\n  * train with **TRL** or the model’s official trainer\n  * spend a lot of effort on prompts, data quality, and evaluation\n\n\n\nIf your goal is **your own full video foundation model from zero**, treat Open-Sora as the benchmark for what that really implies in data, cost, and engineering. (Google AI for Developers)\n\nThe shortest honest summary is:\n\n**Learn from scratch. Ship with pretrained bases. Fine-tune before you pretrain.**",
  "title": "Working my way up to build a AI Model from scratch"
}