{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibzhglmfoiy2yahogidnlcuvp4gqqisfp74vrnfb2ff55sipzotee",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjzczgbfi2a2"
  },
  "path": "/t/spanish-historical-web-corpus-unique-categories-religion-folklore-conspiracies-boe/175446#post_1",
  "publishedAt": "2026-04-21T13:49:05.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Pepere45 (Dang)",
    "https://spanishcorpusai.tech"
  ],
  "textContent": "Hi everyone, I’ve been building a Spanish historical web corpus collected from the Internet Archive (Wayback Machine) covering 2002–2023, and I wanted to share it with the community. What makes it different: Most Spanish corpora focus on news and Wikipedia. This one goes deeper into categories that other Spanish corpora rarely cover: - Religion & Catholic traditions (Semana Santa, pilgrimages, cofradías) - Folklore & regional legends (Galician meigas, Basque basajaun, Celtic myths) - Esotericism & mysticism (astrology, tarot, occult Spanish web) - Conspiracies & pseudoscience (critical for misinformation detection) - BOE legal texts (formal administrative Spanish since 2004) - Oposiciones exam materials (formal academic Spanish) - Regional news from all 17 autonomous communities - Forums & colloquial Spanish (2003–2022) All records include automatic labeling: - topics, region, sentiment + score - linguistic era (web_1_0 → ia_era) - quality score (0–100), readability, lexical density - MD5 dedup hash Format: JSONL (Hugging Face compatible, auto-converted to Parquet). Available now: Pepere45 (Dang). More datasets are coming this week (Wikipedia ES, religion, folklore, esotericism). Open to research collaborations, bulk licensing, and custom extractions. Contact: info@spanishcorpusai.tech | https://spanishcorpusai.tech Happy to answer questions!",
  "title": "Spanish Historical Web Corpus — unique categories (religion, folklore, conspiracies, BOE)"
}