{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifx3o5x2edz4jeixglmd74ua4tjmsatfhyazubtum6zmouhybvlti",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi7wdmg544r2"
  },
  "path": "/t/indic-faker-generate-realistic-indian-synthetic-data-for-nlp-ml-8-languages-native-scripts-batch-dataframe-export/174762#post_1",
  "publishedAt": "2026-03-29T12:50:26.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "https://github.com/adwaith-0/indic-faker"
  ],
  "textContent": "Hi HuggingFace community!\n\nI’ve released **indic-faker**, a Python library for generating realistic Indian synthetic data, designed specifically for ML/NLP pipelines.\n\n**Why it matters for HF users:**\n\n  * Need Indian text data for fine-tuning? Generate thousands of realistic Indian names, addresses, and profiles in the native scripts of **8 languages** (Hindi, Tamil, Malayalam, Telugu, Bengali, Kannada, Gujarati, Marathi).\n\n  * Need structured synthetic data? `fake.to_dataframe(10000)` gives you 10K records as a pandas DataFrame.\n\n  * Need reproducible datasets? Pass a seed: `IndicFaker(seed=42)`.\n\n**Example: generating a multilingual Indian dataset**\n\n    from indic_faker import IndicFaker\n    import pandas as pd\n\n    # Generate 1000 records per language (8 languages, 8000 rows in total)\n    all_data = []\n    for lang in [\"hi\", \"ta\", \"ml\", \"te\", \"bn\", \"kn\", \"gu\", \"mr\"]:\n        # One seeded generator per language keeps the output reproducible\n        fake = IndicFaker(language=lang, seed=42)\n        for _ in range(1000):\n            all_data.append({\n                \"name_latin\": fake.name(),\n                \"name_native\": fake.name(script=\"native\"),\n                \"language\": lang,\n                \"phone\": fake.phone(),\n                \"city\": fake.city(),\n            })\n\n    df = pd.DataFrame(all_data)\n    print(f\"Generated {len(df)} multilingual Indian records\")\n    df.to_csv(\"indic_dataset_8k.csv\", index=False)\n\n**Install:** `pip install indic-faker`\n\n**GitHub:** https://github.com/adwaith-0/indic-faker\n\nI would love to hear whether this is useful for your Indian language projects! Open to feature requests.",
  "title": "Indic-faker: Generate realistic Indian synthetic data for NLP/ML — 8 languages, native scripts, batch DataFrame export"
}