Working my way up to building an AI model from scratch
build an LLM of my own from scratch (say a prompt-to-video LLM!)
If this refers to T2V models, building it from scratch would be financially unfeasible…
The best way to work up to “my own AI model” today is not to jump straight into training a full prompt-to-video model from random weights. The effective path in 2026 is: train small models from scratch to learn the mechanics, then build the real system by adapting strong open models. That is not a compromise. It is the fastest route to both real understanding and something usable. Stanford’s CS336 is still explicitly about language modeling from scratch, Hugging Face still teaches tokenizer/model training separately, and Open-Sora 2.0 still shows that serious video pretraining is a large-scale engineering project, not a normal first build. (Stanford CS336)
First, fix the mental model
A “prompt-to-video LLM” is usually not one model. In the current open stack, the system is typically split into four parts: a text model or text encoder that understands the prompt, a video latent compressor such as a 3D VAE, a video generator built with diffusion or flow matching, and a decoder that turns latent video back into pixels. HunyuanVideo describes an LLM-based text encoder plus a Causal 3D VAE, and LTX-2 describes itself as an audio-video foundation model rather than a plain language model. (GitHub)
That means there are really three different goals hidden inside “build a model from scratch”:
- Learn the internals deeply.
- Build a useful product.
- Pretrain your own foundation model.
Those are different projects with different budgets, different timelines, and different failure modes. Most confusion starts when people mix them together. (Stanford CS336)
What “from scratch” should mean for you
For your stage, “from scratch” should mainly mean:
- write and train a small language model yourself
- write and train a small diffusion model yourself
- then build the real prompt-to-video system with pretrained parts
That gives you the concepts without forcing you into research-lab-scale compute on day one. Hugging Face’s course still says training from scratch makes sense mainly when you have a lot of data and it is very different from the data used by existing models. (Hugging Face)
What to build from scratch first
1. A tiny text model
Start with a tokenizer and a small decoder-only transformer. The important lessons here are not “how to get a great model,” but:
- what tokenization actually does
- how next-token prediction works
- how attention works
- how sampling works
- how training loss behaves
- how evaluation differs from vibes
Hugging Face’s tokenizer chapter is useful because it states the key point clearly: training a tokenizer is not the same as training a model. CS336 is the best current full-stack course for this path. (Hugging Face)
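To make the list above concrete before any transformer is involved, here is a minimal sketch of next-token prediction, training loss, and temperature sampling. It deliberately swaps the transformer for a count-based character bigram model, so it shows the training objective and the sampling loop, not attention. The corpus and the function names (`train_bigram`, `avg_nll`, `sample`) are made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count character bigrams and normalize them into next-token probabilities."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def avg_nll(model, text):
    """Average negative log-likelihood per predicted token (the usual LM loss)."""
    floor = 1e-9  # probability assigned to transitions never seen in training
    losses = [
        -math.log(model.get(prev, {}).get(nxt, floor))
        for prev, nxt in zip(text, text[1:])
    ]
    return sum(losses) / len(losses)


def sample(model, start, length, temperature=1.0, seed=0):
    """Generate text one token at a time; lower temperature sharpens the distribution."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        dist = model.get(out[-1])
        if not dist:  # no known continuation for this token
            break
        tokens = list(dist)
        # p ** (1 / T) is softmax(log p / T) before normalization;
        # rng.choices normalizes the weights for us.
        weights = [dist[t] ** (1.0 / temperature) for t in tokens]
        out.append(rng.choices(tokens, weights=weights)[0])
    return "".join(out)


corpus = "the cat sat on the mat. the cat ate the rat."
model = train_bigram(corpus)
print("loss per token:", round(avg_nll(model, corpus), 3))
print("sample:", sample(model, "t", 20, temperature=0.7))
```

A real small GPT replaces the count table with a trained network, but the loss being minimized and the sampling loop keep the same shape, which is why this is a useful first exercise.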
2. A tiny diffusion model
Since your end goal is video, you should learn diffusion early. Hugging Face’s Diffusion Course and the official basic training tutorial still provide the cleanest entry point. The tutorial explicitly walks through training a UNet2DModel from scratch, which is exactly the right scale for learning denoising, conditioning, and sampling before you touch video. (Hugging Face)
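As a rough illustration of what "learning denoising" involves, the sketch below implements only the forward (noising) half of a DDPM-style diffusion process in plain Python. The linear schedule and its values (1000 steps, betas from 1e-4 to 0.02) are common defaults assumed here for illustration, not something taken from the tutorial, and the function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed DDPM-style defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """Running product of (1 - beta_t): the fraction of signal variance left at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise and return both."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)
rng = random.Random(0)

x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for a flattened image
xt, noise = add_noise(x0, 500, alpha_bars, rng)

# A denoising network would be trained to predict `noise` from (xt, t),
# typically with a mean-squared-error loss against the true noise.
print("signal kept at t=500:", round(alpha_bars[500], 4))
```

The network (a UNet in the tutorial) is the part that learns to invert this process; the noising step itself has no trainable parameters.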
This is why I would not treat “LLM” as the whole problem. For text-to-video, the hard part is usually not just language understanding. It is latent video generation, temporal consistency, and the data/training pipeline around that. Open-Sora and HunyuanVideo make that structure very clear. (arXiv)
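To see why latent video generation dominates the problem, it helps to compute how much a 3D VAE shrinks a clip before the generator ever sees it. The sketch below assumes a causal 3D VAE with 4x temporal and 8x spatial compression and 16 latent channels. Those factors, the 81-frame 480x832 clip, and the function names are illustrative assumptions, so check the model card of whichever model you use.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8, latent_channels=16):
    """Latent tensor shape (C, T, H, W) for a causal 3D VAE.

    "Causal" here means the first frame is encoded on its own, so the
    remaining (frames - 1) frames are what gets compressed in time.
    """
    if (frames - 1) % t_factor or height % s_factor or width % s_factor:
        raise ValueError("dimensions must fit the compression factors")
    return (
        latent_channels,
        (frames - 1) // t_factor + 1,
        height // s_factor,
        width // s_factor,
    )


def numel(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixels = (3, 81, 480, 832)  # RGB, 81 frames, 480x832 (illustrative clip)
latents = latent_shape(81, 480, 832)
ratio = numel(pixels) / numel(latents)
print(latents, f"about {ratio:.1f}x fewer values than raw pixels")
```

Under these assumptions the generator works on a tensor dozens of times smaller than the raw video, which is what makes diffusion over video tractable at all.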
The better way to build something real today
If your real goal is “I want a system that turns prompts into videos,” the strongest 2026 approach is this:
- Use a small text or multimodal model for planning and prompt rewriting.
- Use an open video model as the actual generator.
- Adapt it with LoRA or the model’s official trainer.
- Add ranking, filtering, and post-processing.
That is much closer to how modern open systems are actually used. HunyuanVideo has an official prompt-rewrite component, LTX-2 emphasizes production-ready audio+video workflows, and Wan provides practical local inference paths. (GitHub)
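The four steps above can be sketched as a thin orchestration layer. Everything model-specific is passed in as a callable, and the three `fake_*` functions are hypothetical stand-ins: a real system would call an LLM for rewriting, a video model such as the ones discussed below for generation, and some quality metric for scoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    prompt: str
    video_path: str
    score: float


def run_pipeline(
    user_prompt: str,
    rewrite: Callable[[str], str],
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
    min_score: float = 0.5,
) -> List[Candidate]:
    """Rewrite the prompt, generate several seeds, then filter and rank the results."""
    detailed = rewrite(user_prompt)
    candidates = []
    for seed in range(num_candidates):
        path = generate(detailed, seed)
        candidates.append(Candidate(detailed, path, score(detailed, path)))
    kept = [c for c in candidates if c.score >= min_score]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Hypothetical stand-ins so the sketch runs without any model installed.
FAKE_SCORES = {"out/clip_0.mp4": 0.2, "out/clip_1.mp4": 0.9,
               "out/clip_2.mp4": 0.6, "out/clip_3.mp4": 0.4}


def fake_rewrite(prompt: str) -> str:
    return f"{prompt}, cinematic lighting, slow dolly-in, 5 seconds"


def fake_generate(prompt: str, seed: int) -> str:
    return f"out/clip_{seed}.mp4"  # a real generator would write this file


def fake_score(prompt: str, video_path: str) -> float:
    return FAKE_SCORES[video_path]


ranked = run_pipeline("a fox running through snow", fake_rewrite, fake_generate, fake_score)
for c in ranked:
    print(c.video_path, c.score)
```

Keeping the stages behind plain callables is a design choice that makes it easy to swap the generator, or add a LoRA-adapted one, without touching the ranking logic.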
Which current base models are worth considering
For planning, rewriting, and control
Use a small open text or multimodal model for the controller side.
Gemma 3n is a good choice if you want a compact multimodal controller that can run on everyday devices. Google’s docs say it is optimized for phones, laptops, and tablets, and it supports text, vision, and audio input. (Google AI for Developers)
Qwen3.5 is a good choice if you want a small modern open text family for prototyping and task-specific fine-tuning. The official model cards explicitly position the small Qwen3.5 models for prototyping and task-specific tuning. (Hugging Face)
Qwen3-VL is useful if you want the planner to inspect images, reference frames, or storyboards and then output prompts or structured instructions. The official collection and model card present it as a text-image-video-capable family. (Hugging Face)
These models are good for:
- prompt cleanup
- shot planning
- storyboard text
- structured output
- caption expansion
- tool calls
They are not the actual video engine. (Google AI for Developers)
For actual video generation
Wan2.1 is still one of the easiest low-barrier entries. Its official repo says the T2V-1.3B model needs only 8.19 GB VRAM and can generate a 5-second 480p clip on a 4090 in about 4 minutes. If your main goal is “get a local open video model running,” this is still a strong starting point. (GitHub)
Wan2.2 is the more current Wan line. The official repo adds newer task branches such as image-to-video and text+image-to-video, and it documents single-GPU inference paths for the released models. If your hardware is decent and you want the newer stack, prefer 2.2 over 2.1. (GitHub)
HunyuanVideo-1.5 is the stronger current open base if you care more about quality and official training support. Its repo says training code and a LoRA tuning script were released in December 2025, and it supports distributed training, FSDP, context parallelism, and gradient checkpointing. (GitHub)
LTX-2 is the most interesting if your long-term goal is a controllable creative system rather than just raw generation. The official repo positions it as the first DiT-based audio-video foundation model with synchronized audio/video, multiple performance modes, and production-oriented outputs. The trainer package supports LoRA, full fine-tuning, and IC-LoRA/video-to-video training on custom datasets. (GitHub)
Open-Sora is the research reference, not the starter project. It is the clearest open example of what a full video training stack looks like. Open-Sora 1.2 describes a reproducible setup with about 30 million video clips totaling about 80k hours, and Open-Sora 2.0 says a commercial-level model was trained for about $200k. That is the right thing to study if you want to understand the frontier, but not the right first thing to build. (arXiv)
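A quick back-of-envelope calculation helps put those dataset figures in perspective. The sketch below uses only the two approximate numbers quoted above (30 million clips, 80k hours); the variable names are illustrative.

```python
clips = 30_000_000   # approximate clip count quoted for Open-Sora 1.2
hours = 80_000       # approximate total duration quoted for Open-Sora 1.2

avg_clip_seconds = hours * 3600 / clips
years_to_watch = hours / 24 / 365  # continuous playback, no breaks

print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"time to watch it all once: {years_to_watch:.1f} years")
```

Even before any compute cost, collecting, filtering, and captioning that much footage is a project of its own.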
Fine-tune or train from scratch
Today, the default answer should be:
- use a pretrained base
- adapt with LoRA
- use QLoRA or other memory-saving approaches if hardware is tight
- only consider full pretraining after you have evidence the base model is the bottleneck
PEFT exists exactly for this. Its official docs say PEFT methods adapt large pretrained models by training only a small number of extra parameters, cutting compute and storage cost. The LoRA docs explain the core idea directly: low-rank adapters reduce the number of trainable parameters drastically. TRL’s SFTTrainer is the standard supervised fine-tuning path on top of that. (GitHub)
That means your practical path is usually:
- pick a current base
- define the behavior you want
- create a focused dataset
- fine-tune adapters
- evaluate
- only then decide if you need something heavier
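The claim that LoRA drastically reduces trainable parameters is easy to check with arithmetic. The sketch below counts parameters for one weight matrix and shows the adapter forward pass, h = W x + (alpha / r) * B (A x), in plain Python. The 4096x4096 layer, rank 16, and the helper names are illustrative assumptions, not values from any particular model or from the PEFT library's API.

```python
def full_params(d_out, d_in):
    """Trainable values when the whole weight matrix W is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable values in the adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full = full_params(4096, 4096)
lora = lora_params(4096, 4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")

# Tiny example: with B initialized to zeros, the adapter leaves the base
# model's output unchanged at the start of training.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]        # rank 1
B = [[0.0], [0.0]]       # zero-initialized
x = [1.0, 1.0]
print(lora_forward(W, A, B, x, alpha=2.0, rank=1))
```

Under these assumptions the adapter trains well under 1% of the values in that layer, which is the reason adapter fine-tuning fits on hardware where full fine-tuning does not.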
How I would structure the roadmap
Stage 1. Learn the mechanics
Build:
- a tokenizer
- a tiny GPT-like model
- a tiny diffusion model
Resources:
- CS336
- Hugging Face LLM Course
- Hugging Face Diffusion Course (Stanford CS336)
Stage 2. Build a usable pipeline
Use:
- Gemma 3n or Qwen3.5/Qwen3-VL for planning
- Wan2.1 or Wan2.2 for the easiest video start
- HunyuanVideo-1.5 if you want a stronger base with official training support
- LTX-2 if you care about audio-video and stronger control (Google AI for Developers)
Stage 3. Scale only when needed
When your single-GPU scripts stop being enough, move to:
- Accelerate
- FSDP
- DeepSpeed
The official Accelerate docs explicitly cover FSDP and DeepSpeed integration for scaling training. (Hugging Face)
The two biggest pitfalls right now
1. Following stale tutorials
This is a real 2026 problem. Hugging Face Transformers v5 is a major release with meaningful API and architecture-handling changes, and the official blog presents it as a major simplification and modernization step. Separately, the datasets library changed enough that many older tutorials now fail with “Dataset scripts are no longer supported.” If you follow random 2023–2024 tutorials uncritically, some of them will simply be broken. (Hugging Face)
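One cheap habit that catches this early is comparing the major versions you have installed against the ones a tutorial was written for. The sketch below uses only the standard library's `importlib.metadata`. The helper names and the expected versions in the example call are hypothetical, so fill in whatever the tutorial actually pins.

```python
from importlib import metadata


def installed_major(package):
    """Major version of an installed package, or None if it is not installed."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])


def check_tutorial(expected_majors):
    """Return one warning per package whose major version differs from the tutorial's."""
    warnings = []
    for package, expected in expected_majors.items():
        found = installed_major(package)
        if found is None:
            warnings.append(f"{package}: not installed")
        elif found != expected:
            warnings.append(
                f"{package}: tutorial targets v{expected}, installed v{found}"
            )
    return warnings


# Hypothetical example: an older tutorial written against these major versions.
for line in check_tutorial({"transformers": 4, "datasets": 2}):
    print("warning:", line)
```

A mismatch does not prove the tutorial is broken, but it tells you to read the migration notes before copying code.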
2. Confusing inference with training
A model that is easy to run can still be hard to tune. Wan gives accessible inference paths, but HunyuanVideo-1.5’s training release makes clear that serious tuning workflows involve distributed training features and a specific optimizer recommendation. LTX-2’s trainer also signals a more demanding setup than “download and click run.” (GitHub)
My recommendation, plainly
If your goal is understanding, train small models from scratch.
If your goal is a usable prompt-to-video system today, do this instead:
- use a small controller model such as Gemma 3n or Qwen3.5/Qwen3-VL
- use Wan2.1/2.2, HunyuanVideo-1.5, or LTX-2 as the generation base
- adapt with LoRA/PEFT
- train with TRL or the model’s official trainer
- spend a lot of effort on prompts, data quality, and evaluation
If your goal is your own full video foundation model from zero, treat Open-Sora as the benchmark for what that really implies in data, cost, and engineering. (Google AI for Developers)
The shortest honest summary is:
Learn from scratch. Ship with pretrained bases. Fine-tune before you pretrain.