
Indic-faker: Generate realistic Indian synthetic data for NLP/ML — 8 languages, native scripts, batch DataFrame export

Hugging Face Forums [Unofficial] March 29, 2026

Hi Hugging Face community!

I’ve released indic-faker — a Python library for generating realistic Indian synthetic data, specifically designed for ML/NLP pipelines.

Why it matters for HF users:

  • Need Indian text data for fine-tuning? Generate thousands of realistic Indian names, addresses, and profiles in 8 languages, each in its native script (Hindi, Tamil, Malayalam, Telugu, Bengali, Kannada, Gujarati, Marathi)

  • Need structured synthetic data? fake.to_dataframe(10000) gives you 10K records as a pandas DataFrame

  • Need reproducible datasets? Seed support: IndicFaker(seed=42)
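If reproducibility matters for your pipeline, it may be worth verifying it rather than assuming it. Below is a minimal sketch of such a check. The helper `same_sequence` is a hypothetical name, not part of indic-faker, and the demo uses the standard library's `random.Random` as a stand-in so the snippet runs without the package installed.

```python
import random
from typing import Any, Callable


def same_sequence(make_gen: Callable[[], Any],
                  draw: Callable[[Any], Any],
                  n: int = 100) -> bool:
    """Return True if two freshly built generators yield the same first n values."""

    def run() -> list:
        # Build the generator and draw from it in one go, so the check also
        # works for implementations that seed some shared global state.
        gen = make_gen()
        return [draw(gen) for _ in range(n)]

    return run() == run()


# Stand-in demo: a seeded standard-library RNG, NOT indic-faker itself.
seeded = same_sequence(lambda: random.Random(42), lambda g: g.random())
print("seeded generators match:", seeded)
```

With indic-faker installed, the same check would presumably read `same_sequence(lambda: IndicFaker(seed=42), lambda f: f.name())`, using only the constructor and `name()` call shown in this post.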

Example — generating a multilingual Indian dataset:

python

from indic_faker import IndicFaker
import pandas as pd

# Generate 1000 records per language
all_data = []
for lang in ["hi", "ta", "ml", "te", "bn", "kn", "gu", "mr"]:
    fake = IndicFaker(language=lang, seed=42)
    for _ in range(1000):
        all_data.append({
            "name_latin": fake.name(),
            "name_native": fake.name(script="native"),
            "language": lang,
            "phone": fake.phone(),
            "city": fake.city(),
        })

df = pd.DataFrame(all_data)
print(f"Generated {len(df)} multilingual Indian records")
df.to_csv("indic_dataset_8k.csv", index=False)
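Since a dataset like this mixes eight languages, a quick sanity check on the native-script column can catch rows that ended up in the wrong script. The sketch below uses only the standard library's `unicodedata` module. The `EXPECTED_SCRIPT` mapping and the `in_expected_script` helper are assumptions for illustration, not part of indic-faker, and the sample names are hand-written rather than library output.

```python
import unicodedata

# Unicode character-name prefix expected for each language code used above.
# This mapping is an illustrative assumption (Hindi and Marathi share Devanagari).
EXPECTED_SCRIPT = {
    "hi": "DEVANAGARI",
    "mr": "DEVANAGARI",
    "ta": "TAMIL",
    "ml": "MALAYALAM",
    "te": "TELUGU",
    "bn": "BENGALI",
    "kn": "KANNADA",
    "gu": "GUJARATI",
}


def in_expected_script(text: str, lang: str) -> bool:
    """Return True if every letter or combining mark in text is in lang's script."""
    script = EXPECTED_SCRIPT[lang]
    checked = 0
    for ch in text:
        # Only letters (L*) and marks (M*) are checked; spaces, digits,
        # punctuation, and joiners are skipped.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        if not unicodedata.name(ch, "").startswith(script):
            return False
        checked += 1
    # An empty or letter-free string does not count as a valid native name.
    return checked > 0


# Hand-written samples, not generated by the library.
print(in_expected_script("राहुल शर्मा", "hi"))
print(in_expected_script("Rahul Sharma", "hi"))
```

On the DataFrame built above, this could be applied row by row, for example `df.apply(lambda r: in_expected_script(r["name_native"], r["language"]), axis=1)`, to flag rows that need a closer look.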

Install: pip install indic-faker

GitHub: https://github.com/adwaith-0/indic-faker

Would love to hear if this is useful for your Indian language projects! Open to feature requests.
