{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreibzhglmfoiy2yahogidnlcuvp4gqqisfp74vrnfb2ff55sipzotee",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjzczgbfi2a2"
},
"path": "/t/spanish-historical-web-corpus-unique-categories-religion-folklore-conspiracies-boe/175446#post_1",
"publishedAt": "2026-04-21T13:49:05.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Pepere45 (Dang)",
"https://spanishcorpusai.tech"
],
"textContent": "Hi everyone, I’ve been building a Spanish historical web corpus collected from the Internet Archive (Wayback Machine) covering 2002–2023, and I wanted to share it with the community. What makes it different: Most Spanish corpora focus on news and Wikipedia. This one goes deeper into categories that are virtually non-existent elsewhere: - Religion & Catholic traditions (Semana Santa, pilgrimages, cofradías) - Folklore & regional legends (Galician meigas, Basque basajaun, Celtic myths) - Esotericism & mysticism (astrology, tarot, occult Spanish web) - Conspiracies & pseudoscience — critical for misinformation detection - BOE legal texts — formal administrative Spanish since 2004 - Oposiciones exam materials — formal academic Spanish - Regional news from all 17 autonomous communities - Forums & colloquial Spanish (2003–2022) All records include automatic labeling: - topics, region, sentiment + score - linguistic era (web_1_0 → ia_era) - quality score (0–100), readability, lexical density - MD5 dedup hash Format: JSONL (Hugging Face compatible, auto-converted to Parquet) Available now: Pepere45 (Dang) More datasets coming this week (Wikipedia ES, religion, folklore, esotericism). Open to research collaborations, bulk licensing and custom extractions. Contact: info@spanishcorpusai.tech | https://spanishcorpusai.tech Happy to answer questions!",
"title": "Spanish Historical Web Corpus — unique categories (religion, folklore, conspiracies, BOE)"
}