External Publication

Working my way up to build an AI Model from scratch

Hugging Face Forums [Unofficial] March 20, 2026

If the motivation is purely academic—with no regard for profitability—or if it’s a hobby for a wealthy individual like Elon Musk, there’s no problem at all. But for anyone else, building a competitive T2V model from scratch is quite unrealistic. Money and resources are the biggest hurdles.

A T2V model is a collection of many models. Stable Diffusion is also a collection, but a T2V model is even more so. The knowledge required to build it is so extensive that, even with the help of advanced AI, it would be difficult for a single expert to construct all the model architectures and datasets, so hiring additional staff would likely be necessary. The equipment is not only expensive but also requires data center-level power consumption.

If an individual aims for some kind of “tangible results,” I think it’s more realistic to try to shine in niche areas—such as fine-tuning existing models using LoRA, creating user-friendly frontends or services, or building workflows and pipelines that combine existing models in innovative ways to produce compelling outputs…

While the following figures are merely estimates, given current market trends, it is reasonable to assume that prices will continue to rise:

These are planning estimates, not hard deadlines. The time windows are an inference from three things: the open video stack is still improving with releases like Wan2.2 and HunyuanVideo-1.5; the market is still fragmented rather than winner-take-all; and application-layer startups are getting traction by focusing on workflow and consistency rather than training new foundation models. (GitHub)

1) Time window by goal

Goal	What it really means	Time to build something credible	How long the opportunity likely stays meaningful	My view
Prototype on open bases	Prompt → video → selection → export	4–10 weeks	Open now	Good bet
Niche workflow product	Ads, storyboard, avatar, ecommerce, brand consistency	3–9 months	12–36 months	Best bet
Generic consumer T2V app	“Type prompt, get cool clip”	2–6 months	6–12 months before standing out gets much harder	Weakening fast
New T2V foundation model from scratch	Full pretraining, data pipeline, eval, infra	12–24+ months	Bad race already for most solo builders	Poor bet

Why this table looks like this: a16z says enterprise image/video deployments use a median of 14 models, which implies room for orchestration and workflow products, not just raw generation. Reuters’ Higgsfield story points the same way: they integrate third-party models and add a proprietary reasoning/workflow layer. (Andreessen Horowitz)

2) Hardware and budget by path

Path	Practical goal	Minimum workable hardware	Comfortable hardware	Rough GPU budget	Other resources	Source / basis
A. Build on open bases	Get a functional T2V system running	1× 24GB GPU	1× 48GB GPU	$100–$1,000	64GB RAM, 200GB+ SSD is a good working assumption	Wan2.2 says 720p/24fps can run on consumer cards like the 4090; LTX docs recommend 64GB+ RAM and 200GB+ SSD. (GitHub)
B. Fine-tune / LoRA-tune video bases	Better control, style, consistency, niche adaptation	1× 32–48GB GPU	1× 80GB GPU	$500–$5,000+	64GB+ RAM, 200GB+ SSD, more storage for datasets/checkpoints	LTX-2 trainer recommends 80GB+ VRAM, with a low-VRAM config for 32GB GPUs; HunyuanVideo-I2V says 60GB minimum for 720p inference and 80GB recommended. (GitHub)
C. From-scratch T2V pretraining	New base model	Cluster	Large H100/H200 cluster	$70k–$200k+	Multi-TB storage, large data pipeline, engineering time, eval stack	Open-Sora 1.2 reports 35k H100 GPU-hours on >30M clips / ~80k hours; Open-Sora 2.0 reports $200k for a commercial-level model. (GitHub)

3) GPU tier cheat sheet

GPU tier	Best use	What it can realistically do	What it usually cannot do comfortably
24GB	Cheapest serious entry	Run lighter open T2V stacks, build prototypes, tune small controller LLMs	Serious video LoRA tuning on heavier stacks is tight
32GB	Low-VRAM tuning tier	Some low-VRAM video fine-tuning paths	Heavy official video tuning remains constrained
48GB	Practical sweet spot	Better local iteration, more breathing room for video tuning	Still below the “official comfortable” tier for many heavy video stacks
80GB	Serious single-GPU work	Hunyuan/LTX-class serious tuning and high-end inference	Still not enough for true from-scratch frontier training
8× 80GB / 8× H200	Serious distributed work	Official-style training workflows, bigger ablations	Still expensive and overkill for first projects
~200 H200-class GPUs	Frontier pretraining	Real from-scratch T2V base-model training	Not a solo-builder path

This tiering is anchored by official docs: HunyuanVideo-1.5 supports consumer inference with 14GB minimum when offloading is enabled; HunyuanVideo-I2V recommends 80GB; LTX recommends A100 80GB or H100 and 64GB+ RAM; Open-Sora 2.0 used large H200 clusters. (GitHub)
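As a rough illustration, the single-GPU rows of the cheat sheet can be written as a small lookup. This is only a sketch: the function name `tier_for_vram` and the rule "pick the highest tier whose VRAM floor you meet" are my own assumptions, not something taken from any of the projects' docs.

```python
import bisect

# Single-GPU tiers from the cheat sheet: (VRAM floor in GB, "best use" label).
SINGLE_GPU_TIERS = [
    (24, "Cheapest serious entry"),
    (32, "Low-VRAM tuning tier"),
    (48, "Practical sweet spot"),
    (80, "Serious single-GPU work"),
]


def tier_for_vram(vram_gb):
    """Return the label of the highest tier whose floor is <= vram_gb.

    Returns None below 24GB, which the cheat sheet does not cover.
    (Hypothetical helper; the rounding-down rule is an assumption.)
    """
    floors = [floor for floor, _ in SINGLE_GPU_TIERS]
    idx = bisect.bisect_right(floors, vram_gb) - 1
    if idx < 0:
        return None
    return SINGLE_GPU_TIERS[idx][1]


if __name__ == "__main__":
    for vram in (16, 24, 40, 48, 80, 141):
        print(vram, "GB ->", tier_for_vram(vram))
```

A 40GB card lands in the 32GB tier here because it does not reach the 48GB floor. That is a conservative reading of the table, and you may reasonably round the other way.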

4) Current public cloud price anchors

GPU	Public price anchor	Notes	Source
RTX 4090 24GB	from $0.34/hr	Cheap prototype tier	(Runpod)
L4 24GB	about $0.43–$0.44/hr	Useful 24GB cloud option	(Runpod)
RTX 6000 Ada 48GB	$0.74/hr	Good 48GB option	(Runpod)
L40S 48GB	$0.79/hr	Strong 48GB option	(Runpod)
A100 80GB	$1.19/hr	Strong budget training tier	(Runpod)
H100 80GB	from $1.99/hr, often $2.39/hr	Faster, but a step up in cost	(Runpod)
H200 141GB	from $3.59/hr on budget cloud; $4.975/hr per GPU on AWS Capacity Blocks (Tokyo p5e)	Premium tier	(Runpod)

5) What those rates mean in actual project terms

These are simple arithmetic estimates from the public hourly prices above.

Scenario	Example hardware	Rough wall-clock	Approx GPU cost
Prototype sprint	1× RTX 4090	1 week continuous	~$57
Prototype sprint, more headroom	1× RTX 6000 Ada	1 week continuous	~$124
Prototype sprint, strong 48GB	1× L40S	1 week continuous	~$133
Small serious tuning run	1× A100 80GB	3 days continuous	~$86
Small serious tuning run	1× H100 80GB	3 days continuous	~$143–$172
1-week serious tuning	1× A100 80GB	7 days continuous	~$200
1-week serious tuning	1× H100 80GB	7 days continuous	~$334–$402
1-week premium run	1× H200	7 days continuous	~$603–$836
Open-Sora 1.2-scale pretraining anchor	35,000 H100 GPU-hours	n/a	~$70k–$84k on low-cost H100 pricing
Open-Sora 2.0 commercial-level anchor	large H200 cluster	n/a	~$200k reported

Notes:

The H100 row is shown as a range because public references differ between low-cost and more secure/retail pricing. (Runpod)
The Open-Sora rows are the clearest public anchors for what “real” from-scratch T2V training costs. (GitHub)
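The arithmetic behind the table is just GPU-hours × hourly rate. Here is a minimal sketch if you want to plug in your own numbers. The rates are copied from the price anchors in section 4; the names `RATES` and `gpu_cost` are made up for illustration, and real bills will also include storage, egress, and idle time that this ignores.

```python
# (low, high) USD per GPU-hour, copied from the public price anchors above.
RATES = {
    "RTX 4090": (0.34, 0.34),
    "RTX 6000 Ada": (0.74, 0.74),
    "L40S": (0.79, 0.79),
    "A100 80GB": (1.19, 1.19),
    "H100 80GB": (1.99, 2.39),
    "H200": (3.59, 4.975),
}


def gpu_cost(gpu, hours, num_gpus=1):
    """Return (low, high) estimated cost in whole USD for a continuous run.

    Hypothetical helper: cost = hours * num_gpus * hourly rate, nothing else.
    """
    low_rate, high_rate = RATES[gpu]
    gpu_hours = hours * num_gpus
    return round(low_rate * gpu_hours), round(high_rate * gpu_hours)


if __name__ == "__main__":
    week = 7 * 24  # 168 hours
    print("1 week on RTX 4090:", gpu_cost("RTX 4090", week))
    print("3 days on A100 80GB:", gpu_cost("A100 80GB", 3 * 24))
    print("1 week on H200:", gpu_cost("H200", week))
    # Open-Sora 1.2 anchor: 35,000 H100 GPU-hours.
    print("35k H100 GPU-hours:", gpu_cost("H100 80GB", 35_000))
```

The last call gives roughly $70k to $84k, which is where the from-scratch row gets its range.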

6) Non-GPU resources

Resource	Prototype / fine-tune assumption	From-scratch assumption	Source
System RAM	64GB+ is a good working target	More if preprocessing on the same box	(LTX Documentation)
Fast local SSD	200GB+	More if storing datasets/checkpoints locally	(LTX Documentation)
Object storage	1–5TB is often enough to start	Many TB becomes normal	S3 Standard is $0.023/GB-month for the first 50TB. (Amazon Web Services, Inc.)
Example storage cost	200GB SSD ≈ $34/month on GCP US example	n/a	(Google Cloud)
Data engineering	Helpful	Mandatory	Open-Sora 1.2 scale: >30M clips / ~80k hours. (GitHub)
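To put a number on the object storage row, here is a small sketch using the S3 Standard rate quoted in the table. It assumes 1TB = 1,000GB and only covers the first 50TB tier, since that is the only rate given here. The function name is hypothetical, and request and egress charges are not included.

```python
# S3 Standard rate for the first 50TB, as quoted in the table above.
S3_STANDARD_USD_PER_GB_MONTH = 0.023
FIRST_TIER_LIMIT_TB = 50


def monthly_object_storage_cost(terabytes, usd_per_gb_month=S3_STANDARD_USD_PER_GB_MONTH):
    """Estimate monthly storage cost in USD for `terabytes` of data.

    Assumes 1TB = 1,000GB and a flat rate; raises above 50TB because
    the rates for higher tiers are not given in this post.
    """
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    if terabytes > FIRST_TIER_LIMIT_TB:
        raise ValueError("only the first-50TB rate is known here")
    return terabytes * 1000 * usd_per_gb_month


if __name__ == "__main__":
    for tb in (1, 5):
        print(f"{tb}TB: ${monthly_object_storage_cost(tb):.2f}/month")
```

So the "1–5TB to start" range works out to roughly $23 to $115 per month at that rate, which is small next to the GPU bill.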

7) Decision table

Your budget / setup	Best move
< $500	Do not try to train a new video model. Build a prototype on open bases using 24GB cloud GPUs.
$500–$2k	Build a real workflow prototype. Maybe one or two narrow LoRA experiments.
$2k–$10k	Serious niche fine-tuning and repeated iteration become realistic.
$10k–$50k	You can run a small team-style tuning program, but still not a serious from-scratch frontier race.
$70k+	From-scratch pretraining enters the conversation, but only as a serious engineering project.
$200k+	Now you are in the same broad budget class as Open-Sora 2.0’s reported commercial-level training effort.
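The decision table can also be read as a simple threshold lookup. The sketch below is only illustrative. The name `best_move` is made up, each boundary value is assigned to the higher tier, and the $50k–$70k gap that the table leaves open is treated as part of the $10k–$50k tier. Those last two choices are my assumptions, so adjust them if you read the table differently.

```python
# (budget floor in USD, recommended move), condensed from the decision table.
TIERS = [
    (0, "Do not try to train a new video model; prototype on open bases with 24GB cloud GPUs."),
    (500, "Build a real workflow prototype, maybe one or two narrow LoRA experiments."),
    (2_000, "Serious niche fine-tuning and repeated iteration become realistic."),
    # Assumption: budgets between $50k and $70k stay in this tier.
    (10_000, "Small team-style tuning program, still not a from-scratch frontier race."),
    (70_000, "From-scratch pretraining enters the conversation as a serious engineering project."),
    (200_000, "Same broad budget class as Open-Sora 2.0's reported training effort."),
]


def best_move(budget_usd):
    """Return the recommendation for the highest tier whose floor is met."""
    if budget_usd < 0:
        raise ValueError("budget must be non-negative")
    move = TIERS[0][1]
    for floor, text in TIERS:
        if budget_usd >= floor:
            move = text
    return move


if __name__ == "__main__":
    for budget in (300, 1_500, 8_000, 60_000, 250_000):
        print(f"${budget:,}: {best_move(budget)}")
```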

8) Bottom line

Question	Best answer
How much time do I have?	Enough time to ship a meaningful niche/workflow product. Not much time to stand out with a generic “prompt in, video out” app.
How much hardware do I need?	24GB to start, 48GB for a practical sweet spot, 80GB for serious single-GPU tuning, and a cluster for from-scratch pretraining.
How much money do I need?	$100–$1,000 for a prototype, $500–$5,000+ for serious fine-tuning, $70k–$200k+ for real from-scratch T2V pretraining.
What is the best bet?	Build on open bases, fine-tune for a niche, and compete on control, consistency, and workflow rather than raw generation.