{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiem5ozrh7ny5f3ym2qpwb3hftrk4wj7cmy2fh5nexdrkji7tmdgnu",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhitg2dg4qj2"
},
"path": "/t/working-my-way-up-to-build-a-ai-model-from-scratch/174377#post_4",
"publishedAt": "2026-03-20T12:59:15.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"fine-tuning existing models using LoRA",
"GitHub",
"Andreessen Horowitz",
"Runpod",
"LTX Documentation",
"Amazon Web Services, Inc.",
"Google Cloud"
],
"textContent": "If the motivation is purely academic—with no regard for profitability—or if it’s a hobby for a wealthy individual like Elon Musk, there’s no problem at all. But for anyone else, building a competitive T2V model from scratch is quite unrealistic. Money and resources are the biggest hurdles.\n\nA T2V model is a collection of many models. Stable Diffusion is also a collection, but a T2V model is even more so. The knowledge required to build it is so extensive that, even with the help of advanced AI, it would be difficult for a single expert to construct all the model architectures and datasets, so hiring additional staff would likely be necessary. The equipment is not only expensive but also requires data center-level power consumption.\n\nIf an individual aims for some kind of “tangible results,” I think it’s more realistic to try to shine in niche areas—such as fine-tuning existing models using LoRA, creating user-friendly frontends or services, or building workflows and pipelines that combine existing models in innovative ways to produce compelling outputs…\n\nWhile the following figures are merely estimates, given current market trends, it is reasonable to assume that prices will continue to rise:\n\n* * *\n\nThese are **planning estimates** , not hard deadlines. The time windows are an **inference** from three things: the open video stack is still improving with releases like **Wan2.2** and **HunyuanVideo-1.5** ; the market is still fragmented rather than winner-take-all; and application-layer startups are getting traction by focusing on workflow and consistency rather than training new foundation models. 
(GitHub)\n\n## 1) Time window by goal\n\nGoal | What it really means | Time to build something credible | How long the opportunity likely stays meaningful | My view\n---|---|---|---|---\nPrototype on open bases | Prompt → video → selection → export | **4–10 weeks** | **Open now** | Good bet\nNiche workflow product | Ads, storyboard, avatar, ecommerce, brand consistency | **3–9 months** | **12–36 months** | Best bet\nGeneric consumer T2V app | “Type prompt, get cool clip” | **2–6 months** | **6–12 months** before standing out gets much harder | Weakening fast\nNew T2V foundation model from scratch | Full pretraining, data pipeline, eval, infra | **12–24+ months** | Already a bad race for most solo builders | Poor bet\n\n**Why this table looks like this:** a16z says enterprise image/video deployments use a **median of 14 models**, which implies room for orchestration and workflow products, not just raw generation. Reuters’ Higgsfield story points the same way: Higgsfield integrates third-party models and adds a proprietary reasoning/workflow layer. (Andreessen Horowitz)\n\n## 2) Hardware and budget by path\n\nPath | Practical goal | Minimum workable hardware | Comfortable hardware | Rough GPU budget | Other resources | Source / basis\n---|---|---|---|---|---|---\n**A. Build on open bases** | Get a functional T2V system running | **1× 24GB GPU** | **1× 48GB GPU** | **$100–$1,000** | 64GB RAM and a 200GB+ SSD are a good working assumption | Wan2.2 says 720p/24fps can run on consumer cards like the **4090**; LTX docs recommend **64GB+ RAM** and **200GB+ SSD**. (GitHub)\n**B. Fine-tune / LoRA-tune video bases** | Better control, style, consistency, niche adaptation | **1× 32–48GB GPU** | **1× 80GB GPU** | **$500–$5,000+** | 64GB+ RAM, 200GB+ SSD, more storage for datasets/checkpoints | LTX-2 trainer recommends **80GB+ VRAM**, with a low-VRAM config for **32GB** GPUs; HunyuanVideo-I2V says **60GB minimum** for 720p inference and **80GB recommended**. (GitHub)\n**C. 
From-scratch T2V pretraining** | New base model | **Cluster** | **Large H100/H200 cluster** | **$70k–$200k+** | Multi-TB storage, large data pipeline, engineering time, eval stack | Open-Sora 1.2 reports **35k H100 GPU-hours** on **>30M clips / ~80k hours**; Open-Sora 2.0 reports **$200k** for a commercial-level model. (GitHub)\n\n## 3) GPU tier cheat sheet\n\nGPU tier | Best use | What it can realistically do | What it usually cannot do comfortably\n---|---|---|---\n**24GB** | Cheapest serious entry | Run lighter open T2V stacks, build prototypes, tune small controller LLMs | Serious video LoRA tuning on heavier stacks is tight\n**32GB** | Low-VRAM tuning tier | Some low-VRAM video fine-tuning paths | Heavy official video tuning remains constrained\n**48GB** | Practical sweet spot | Better local iteration, more breathing room for video tuning | Still below the “official comfortable” tier for many heavy video stacks\n**80GB** | Serious single-GPU work | Hunyuan/LTX-class serious tuning and high-end inference | Still not enough for true from-scratch frontier training\n**8× 80GB / 8× H200** | Serious distributed work | Official-style training workflows, bigger ablations | Still expensive and overkill for first projects\n**~200 H200-class GPUs** | Frontier pretraining | Real from-scratch T2V base-model training | Not a solo-builder path\n\nThis tiering is anchored by official docs: HunyuanVideo-1.5 supports consumer inference with **14GB minimum** when offloading is enabled; HunyuanVideo-I2V recommends **80GB**; LTX recommends **A100 80GB or H100** and **64GB+ RAM**; Open-Sora 2.0 used large H200 clusters. 
(GitHub)\n\n## 4) Current public cloud price anchors\n\nGPU | Public price anchor | Notes | Source\n---|---|---|---\nRTX 4090 24GB | **from $0.34/hr** | Cheap prototype tier | (Runpod)\nL4 24GB | **about $0.43–$0.44/hr** | Useful 24GB cloud option | (Runpod)\nRTX 6000 Ada 48GB | **$0.74/hr** | Good 48GB option | (Runpod)\nL40S 48GB | **$0.79/hr** | Strong 48GB option | (Runpod)\nA100 80GB | **$1.19/hr** | Strong budget training tier | (Runpod)\nH100 80GB | **from $1.99/hr**, often **$2.39/hr** | Faster, but a step up in cost | (Runpod)\nH200 141GB | **from $3.59/hr** on budget cloud; **$4.975/hr per GPU** on AWS Capacity Blocks (Tokyo p5e) | Premium tier | (Runpod)\n\n## 5) What those rates mean in actual project terms\n\nThese are **simple arithmetic estimates** (hourly rate × hours of continuous use) from the public hourly prices above.\n\nScenario | Example hardware | Rough wall-clock | Approx GPU cost\n---|---|---|---\nPrototype sprint | 1× RTX 4090 | 1 week continuous | **~$57**\nPrototype sprint, more headroom | 1× RTX 6000 Ada | 1 week continuous | **~$124**\nPrototype sprint, strong 48GB | 1× L40S | 1 week continuous | **~$133**\nSmall serious tuning run | 1× A100 80GB | 3 days continuous | **~$86**\nSmall serious tuning run | 1× H100 80GB | 3 days continuous | **~$143–$172**\n1-week serious tuning | 1× A100 80GB | 7 days continuous | **~$200**\n1-week serious tuning | 1× H100 80GB | 7 days continuous | **~$334–$402**\n1-week premium run | 1× H200 | 7 days continuous | **~$603–$836**\nOpen-Sora 1.2-scale pretraining anchor | 35,000 H100 GPU-hours | n/a | **~$70k–$84k** on low-cost H100 pricing\nOpen-Sora 2.0 commercial-level anchor | large H200 cluster | n/a | **~$200k** reported\n\n**Notes:**\n\n * The H100 rows are shown as ranges because public references differ between low-cost and more secure/retail pricing. (Runpod)\n * The Open-Sora rows are the clearest public anchors for what “real” from-scratch T2V training costs. 
(GitHub)\n\n## 6) Non-GPU resources\n\nResource | Prototype / fine-tune assumption | From-scratch assumption | Source\n---|---|---|---\nSystem RAM | **64GB+** is a good working target | More if preprocessing on the same box | (LTX Documentation)\nFast local SSD | **200GB+** | More if storing datasets/checkpoints locally | (LTX Documentation)\nObject storage | **1–5TB** is often enough to start | **Many TB** becomes normal | S3 Standard is **$0.023/GB-month** for the first 50TB. (Amazon Web Services, Inc.)\nExample storage cost | 200GB SSD ≈ **$34/month** in a GCP US pricing example | n/a | (Google Cloud)\nData engineering | Helpful | Mandatory | Open-Sora 1.2 scale: **>30M clips / ~80k hours**. (GitHub)\n\n## 7) Decision table\n\nYour budget / setup | Best move\n---|---\n**< $500** | Do **not** try to train a new video model. Build a prototype on open bases using 24GB cloud GPUs.\n**$500–$2k** | Build a real workflow prototype. Maybe one or two narrow LoRA experiments.\n**$2k–$10k** | Serious niche fine-tuning and repeated iteration become realistic.\n**$10k–$50k** | You can run a small team-style tuning program, but still not a serious from-scratch frontier race.\n**$70k+** | From-scratch pretraining enters the conversation, but only as a serious engineering project.\n**$200k+** | Now you are in the same broad budget class as Open-Sora 2.0’s reported commercial-level training effort.\n\n## 8) Bottom line\n\nQuestion | Best answer\n---|---\n**How much time do I have?** | Enough time to ship a **meaningful niche/workflow product**. 
Not much time to stand out with a generic “prompt in, video out” app.\n**How much hardware do I need?** | **24GB** to start, **48GB** for a practical sweet spot, **80GB** for serious single-GPU tuning, and a **cluster** for from-scratch pretraining.\n**How much money do I need?** | **$100–$1,000** for a prototype, **$500–$5,000+** for serious fine-tuning, **$70k–$200k+** for real from-scratch T2V pretraining.\n**What is the best bet?** | Build on open bases, fine-tune for a niche, and compete on **control, consistency, and workflow** rather than raw generation.",
"title": "Working my way up to build a AI Model from scratch"
}