{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifx3o5x2edz4jeixglmd74ua4tjmsatfhyazubtum6zmouhybvlti",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mia52ivjwtk2"
},
"path": "/t/indic-faker-generate-realistic-indian-synthetic-data-for-nlp-ml-8-languages-native-scripts-batch-dataframe-export/174762#post_1",
"publishedAt": "2026-03-29T12:50:26.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"https://github.com/adwaith-0/indic-faker"
],
"textContent": "Hi HuggingFace community!\n\nI’ve released **indic-faker** — a Python library for generating realistic Indian synthetic data, specifically designed for ML/NLP pipelines.\n\n**Why it matters for HF users:**\n\n * Need Indian text data for fine-tuning? Generate thousands of realistic Indian names, addresses, and profiles in **8 native scripts** (Devanagari, Tamil, Malayalam, Telugu, Bengali, Kannada, Gujarati, Marathi)\n\n * Need structured synthetic data? `fake.to_dataframe(10000)` gives you 10K records as a pandas DataFrame\n\n * Need reproducible datasets? Seed support: `IndicFaker(seed=42)`\n\n\n\n\n**Example — generating a multilingual Indian dataset:**\n\n\n python\n\n\nfrom indic_faker import IndicFaker\n\nimport pandas as pd\n\n# Generate 1000 records per language\n\nall_data = []\n\nfor lang in [“hi”, “ta”, “ml”, “te”, “bn”, “kn”, “gu”, “mr”]:\n\n\n fake = IndicFaker(language=lang, seed=42)\n\n\nfor _ in range(1000):\n\n\n all_data.append({\n\n\n“name_latin”: fake.name(),\n\n“name_native”: fake.name(script=“native”),\n\n“language”: lang,\n\n“phone”: fake.phone(),\n\n“city”: fake.city(),\n\n\n })\n\n\ndf = pd.DataFrame(all_data)\n\nprint(f\"Generated {len(df)} multilingual Indian records\")\n\ndf.to_csv(“indic_dataset_8k.csv”, index=False)\n\n**Install:** `pip install indic-faker` **GitHub:** https://github.com/adwaith-0/indic-faker\n\nWould love to hear if this is useful for your Indian language projects! Open to feature requests.",
"title": "Indic-faker: Generate realistic Indian synthetic data for NLP/ML — 8 languages, native scripts, batch DataFrame export"
}