Working my way up to build an AI Model from scratch

Hugging Face Forums [Unofficial] March 20, 2026

If the motivation is purely academic, with no regard for profitability, or if it’s a hobby for a wealthy individual like Elon Musk, there’s no problem at all. But for anyone else, building a competitive text-to-video (T2V) model from scratch is quite unrealistic. Money and resources are the biggest hurdles.

A T2V model is really a collection of many models. Stable Diffusion is also a collection, but a T2V model has even more components. The knowledge required to build one is so extensive that, even with the help of advanced AI, a single expert would struggle to construct all the model architectures and datasets, so hiring additional staff would likely be necessary. The equipment is not only expensive but also draws data-center-level power.

If an individual aims for some kind of “tangible results,” I think it’s more realistic to try to shine in niche areas—such as fine-tuning existing models using LoRA, creating user-friendly frontends or services, or building workflows and pipelines that combine existing models in innovative ways to produce compelling outputs…

While the following figures are merely estimates, given current market trends, it is reasonable to assume that prices will continue to rise:


These are planning estimates, not hard deadlines. The time windows are an inference from three things: the open video stack is still improving with releases like Wan2.2 and HunyuanVideo-1.5; the market is still fragmented rather than winner-take-all; and application-layer startups are getting traction by focusing on workflow and consistency rather than training new foundation models. (GitHub)

1) Time window by goal

| Goal | What it really means | Time to build something credible | How long the opportunity likely stays meaningful | My view |
|---|---|---|---|---|
| Prototype on open bases | Prompt → video → selection → export | 4–10 weeks | Open now | Good bet |
| Niche workflow product | Ads, storyboard, avatar, ecommerce, brand consistency | 3–9 months | 12–36 months | Best bet |
| Generic consumer T2V app | “Type prompt, get cool clip” | 2–6 months | 6–12 months before standing out gets much harder | Weakening fast |
| New T2V foundation model from scratch | Full pretraining, data pipeline, eval, infra | 12–24+ months | Bad race already for most solo builders | Poor bet |

Why this table looks like this: a16z says enterprise image/video deployments use a median of 14 models, which implies room for orchestration and workflow products, not just raw generation. Reuters’ Higgsfield story points the same way: they integrate third-party models and add a proprietary reasoning/workflow layer. (Andreessen Horowitz)

2) Hardware and budget by path

| Path | Practical goal | Minimum workable hardware | Comfortable hardware | Rough GPU budget | Other resources | Source / basis |
|---|---|---|---|---|---|---|
| A. Build on open bases | Get a functional T2V system running | 1× 24GB GPU | 1× 48GB GPU | $100–$1,000 | 64GB RAM, 200GB+ SSD is a good working assumption | Wan2.2 says 720p/24fps can run on consumer cards like 4090; LTX docs recommend 64GB+ RAM and 200GB+ SSD. (GitHub) |
| B. Fine-tune / LoRA-tune video bases | Better control, style, consistency, niche adaptation | 1× 32–48GB GPU | 1× 80GB GPU | $500–$5,000+ | 64GB+ RAM, 200GB+ SSD, more storage for datasets/checkpoints | LTX-2 trainer recommends 80GB+ VRAM, with a low-VRAM config for 32GB GPUs; HunyuanVideo-I2V says 60GB minimum for 720p inference and 80GB recommended. (GitHub) |
| C. From-scratch T2V pretraining | New base model | Cluster | Large H100/H200 cluster | $70k–$200k+ | Multi-TB storage, large data pipeline, engineering time, eval stack | Open-Sora 1.2 reports 35k H100 GPU-hours on >30M clips / ~80k hours; Open-Sora 2.0 reports $200k for a commercial-level model. (GitHub) |

3) GPU tier cheat sheet

| GPU tier | Best use | What it can realistically do | What it usually cannot do comfortably |
|---|---|---|---|
| 24GB | Cheapest serious entry | Run lighter open T2V stacks, build prototypes, tune small controller LLMs | Serious video LoRA tuning on heavier stacks is tight |
| 32GB | Low-VRAM tuning tier | Some low-VRAM video fine-tuning paths | Heavy official video tuning remains constrained |
| 48GB | Practical sweet spot | Better local iteration, more breathing room for video tuning | Still below the “official comfortable” tier for many heavy video stacks |
| 80GB | Serious single-GPU work | Hunyuan/LTX-class serious tuning and high-end inference | Still not enough for true from-scratch frontier training |
| 8× 80GB / 8× H200 | Serious distributed work | Official-style training workflows, bigger ablations | Still expensive and overkill for first projects |
| ~200 H200-class GPUs | Frontier pretraining | Real from-scratch T2V base-model training | Not a solo-builder path |

This tiering is anchored by official docs: HunyuanVideo-1.5 supports consumer inference with 14GB minimum when offloading is enabled; HunyuanVideo-I2V recommends 80GB; LTX recommends A100 80GB or H100 and 64GB+ RAM; Open-Sora 2.0 used large H200 clusters. (GitHub)
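To make the gap between the tiers concrete, here is a minimal sketch that converts a GPU-hour budget into wall-clock time. It uses the 35,000 H100 GPU-hours cited for Open-Sora 1.2 and assumes perfect linear scaling across GPUs, which real clusters do not achieve. The function name `wall_clock_days` is hypothetical, chosen only for illustration.

```python
# Hypothetical helper: idealized wall-clock time for a fixed GPU-hour budget.
# Assumption: perfect linear scaling with GPU count (real runs scale worse).

OPEN_SORA_1_2_GPU_HOURS = 35_000  # H100 GPU-hours, as cited in this post


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Return idealized wall-clock days to burn `gpu_hours` on `num_gpus` GPUs."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24


for n in (1, 8, 200):
    days = wall_clock_days(OPEN_SORA_1_2_GPU_HOURS, n)
    print(f"{n:>3} GPUs: {days:,.1f} days")
```

Under these assumptions, the same job takes roughly four years on a single GPU, about half a year on an 8-GPU node, and about a week on 200 GPUs. That is why the single-GPU tiers are treated as tuning tiers rather than pretraining tiers.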

4) Current public cloud price anchors

| GPU | Public price anchor | Notes | Source |
|---|---|---|---|
| RTX 4090 24GB | from $0.34/hr | Cheap prototype tier | (Runpod) |
| L4 24GB | about $0.43–$0.44/hr | Useful 24GB cloud option | (Runpod) |
| RTX 6000 Ada 48GB | $0.74/hr | Good 48GB option | (Runpod) |
| L40S 48GB | $0.79/hr | Strong 48GB option | (Runpod) |
| A100 80GB | $1.19/hr | Strong budget training tier | (Runpod) |
| H100 80GB | from $1.99/hr, often $2.39/hr | Faster, but a step up in cost | (Runpod) |
| H200 141GB | from $3.59/hr on budget cloud; $4.975/hr per GPU on AWS Capacity Blocks (Tokyo p5e) | Premium tier | (Runpod) |

5) What those rates mean in actual project terms

These are simple arithmetic estimates from the public hourly prices above.

| Scenario | Example hardware | Rough wall-clock | Approx GPU cost |
|---|---|---|---|
| Prototype sprint | 1× RTX 4090 | 1 week continuous | ~$57 |
| Prototype sprint, more headroom | 1× RTX 6000 Ada | 1 week continuous | ~$124 |
| Prototype sprint, strong 48GB | 1× L40S | 1 week continuous | ~$133 |
| Small serious tuning run | 1× A100 80GB | 3 days continuous | ~$86 |
| Small serious tuning run | 1× H100 80GB | 3 days continuous | ~$143–$172 |
| 1-week serious tuning | 1× A100 80GB | 7 days continuous | ~$200 |
| 1-week serious tuning | 1× H100 80GB | 7 days continuous | ~$334–$402 |
| 1-week premium run | 1× H200 | 7 days continuous | ~$603–$836 |
| Open-Sora 1.2-scale pretraining anchor | 35,000 H100 GPU-hours | n/a | ~$70k–$84k on low-cost H100 pricing |
| Open-Sora 2.0 commercial-level anchor | large H200 cluster | n/a | ~$200k reported |

Notes:

  • The H100 row is shown as a range because public references differ between low-cost and more secure/retail pricing. (Runpod)
  • The Open-Sora rows are the clearest public anchors for what “real” from-scratch T2V training costs. (GitHub)
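The arithmetic behind these scenarios is just hourly rate × hours × GPU count. The sketch below reproduces it so the numbers can be re-run when prices change. The rates are copied from the price anchors in this post and are a snapshot, not live data. The names `RATES`, `cost_for_gpu_hours`, and `run_cost` are hypothetical.

```python
# Hourly USD rates as (low, high), copied from the price anchors in this post.
# Assumption: these are a snapshot and will drift; update before relying on them.
RATES = {
    "RTX 4090": (0.34, 0.34),
    "RTX 6000 Ada": (0.74, 0.74),
    "L40S": (0.79, 0.79),
    "A100 80GB": (1.19, 1.19),
    "H100 80GB": (1.99, 2.39),
    "H200": (3.59, 4.975),
}


def cost_for_gpu_hours(gpu: str, gpu_hours: float) -> tuple[float, float]:
    """Return the (low, high) USD cost of `gpu_hours` on the given GPU type."""
    low, high = RATES[gpu]
    return (low * gpu_hours, high * gpu_hours)


def run_cost(gpu: str, days: float, num_gpus: int = 1) -> tuple[float, float]:
    """Return the (low, high) USD cost of a continuous run lasting `days`."""
    return cost_for_gpu_hours(gpu, days * 24 * num_gpus)


for gpu, days in [("RTX 4090", 7), ("A100 80GB", 3), ("H100 80GB", 7), ("H200", 7)]:
    low, high = run_cost(gpu, days)
    print(f"{gpu}, {days} days: ${low:,.2f} to ${high:,.2f}")

# Pretraining anchor: 35,000 H100 GPU-hours, as cited for Open-Sora 1.2.
low, high = cost_for_gpu_hours("H100 80GB", 35_000)
print(f"35k H100 GPU-hours: ${low:,.0f} to ${high:,.0f}")
```

These figures cover GPU rental only. Storage, egress, failed runs, and idle time are not included, so a real project should budget above them.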

6) Non-GPU resources

| Resource | Prototype / fine-tune assumption | From-scratch assumption | Source |
|---|---|---|---|
| System RAM | 64GB+ is a good working target | More if preprocessing on the same box | (LTX Documentation) |
| Fast local SSD | 200GB+ | More if storing datasets/checkpoints locally | (LTX Documentation) |
| Object storage | 1–5TB is often enough to start | Many TB becomes normal | S3 Standard is $0.023/GB-month for the first 50TB. (Amazon Web Services, Inc.) |
| Example storage cost | 200GB SSD ≈ $34/month on GCP US example | n/a | (Google Cloud) |
| Data engineering | Helpful | Mandatory | Open-Sora 1.2 scale: >30M clips / ~80k hours. (GitHub) |
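As a rough illustration of what the object storage row means per month, the sketch below applies the quoted S3 Standard rate of $0.023/GB-month to the 1–5TB starting range. It assumes 1TB = 1,000GB and covers only the first-50TB pricing tier quoted above. Request and egress charges are ignored, and the function name is hypothetical.

```python
# S3 Standard rate for the first 50TB, as quoted in this post (USD per GB-month).
S3_STANDARD_USD_PER_GB_MONTH = 0.023
FIRST_TIER_LIMIT_TB = 50


def monthly_object_storage_cost(terabytes: float, gb_per_tb: int = 1000) -> float:
    """Return the monthly USD storage cost, valid only within the first 50TB tier.

    Assumptions: 1TB = 1,000GB by default; request and egress fees are ignored.
    """
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    if terabytes > FIRST_TIER_LIMIT_TB:
        raise ValueError("only the first-50TB rate is modeled here")
    return terabytes * gb_per_tb * S3_STANDARD_USD_PER_GB_MONTH


for tb in (1, 5):
    print(f"{tb}TB: ${monthly_object_storage_cost(tb):,.2f}/month")
```

At these rates, a 1–5TB starting bucket costs about $23 to $115 per month, which is small next to GPU rental at the prototype and fine-tuning stages.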

7) Decision table

| Your budget / setup | Best move |
|---|---|
| < $500 | Do not try to train a new video model. Build a prototype on open bases using 24GB cloud GPUs. |
| $500–$2k | Build a real workflow prototype. Maybe one or two narrow LoRA experiments. |
| $2k–$10k | Serious niche fine-tuning and repeated iteration become realistic. |
| $10k–$50k | You can run a small team-style tuning program, but still not a serious from-scratch frontier race. |
| $70k+ | From-scratch pretraining enters the conversation, but only as a serious engineering project. |
| $200k+ | Now you are in the same broad budget class as Open-Sora 2.0’s reported commercial-level training effort. |
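The decision table can also be read as a simple threshold lookup, sketched below. The advice strings paraphrase the table rows. One assumption is mine rather than the post's: the table leaves $50k–$70k unassigned, and this sketch extends the $10k–$50k advice up to $70k. The names `TIERS` and `best_move` are hypothetical.

```python
import bisect

# (lower bound in USD, advice) pairs paraphrasing the decision table.
# Assumption: the unlisted $50k-$70k band is folded into the $10k tier.
TIERS = [
    (0, "Prototype on open bases with 24GB cloud GPUs; do not train a new video model."),
    (500, "Build a real workflow prototype, plus one or two narrow LoRA experiments."),
    (2_000, "Serious niche fine-tuning and repeated iteration become realistic."),
    (10_000, "Small team-style tuning program; still not a from-scratch frontier race."),
    (70_000, "From-scratch pretraining becomes possible as a serious engineering project."),
    (200_000, "Same broad budget class as Open-Sora 2.0's reported training effort."),
]
LOWER_BOUNDS = [bound for bound, _ in TIERS]


def best_move(budget_usd: float) -> str:
    """Return the advice for the highest tier whose lower bound is <= budget."""
    if budget_usd < 0:
        raise ValueError("budget must be non-negative")
    index = bisect.bisect_right(LOWER_BOUNDS, budget_usd) - 1
    return TIERS[index][1]


for budget in (300, 1_500, 60_000, 250_000):
    print(f"${budget:,}: {best_move(budget)}")
```

A budget exactly on a boundary, such as $500, falls into the higher tier here, matching the table's "< $500" wording for the first row.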

8) Bottom line

| Question | Best answer |
|---|---|
| How much time do I have? | Enough time to ship a meaningful niche/workflow product. Not much time to stand out with a generic “prompt in, video out” app. |
| How much hardware do I need? | 24GB to start, 48GB for a practical sweet spot, 80GB for serious single-GPU tuning, and a cluster for from-scratch pretraining. |
| How much money do I need? | $100–$1,000 for a prototype, $500–$5,000+ for serious fine-tuning, $70k–$200k+ for real from-scratch T2V pretraining. |
| What is the best bet? | Build on open bases, fine-tune for a niche, and compete on control, consistency, and workflow rather than raw generation. |
