Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiavhbwbkr433qwhcvvtxlimjpfe75m2jfyw67jxw5e2a2ivpnl2m4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mltttd4kexd2"
  },
  "path": "/t/dataset-cli-1m-975k-nl-shell-pairs-13-languages-6-shells-apache-2-0/176022#post_1",
  "publishedAt": "2026-05-14T21:08:17.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "CLI-1M Explorer - a Hugging Face Space by carosh",
    "@CaroDaShellShib"
  ],
  "textContent": "Hi HF community! I just published carosh/cli-1m, the first large-scale multilingual dataset for natural-language → shell command generation.\n\n**Numbers:**\n\n- 975,933 training pairs\n- 6 shells: bash, zsh, fish, PowerShell, nushell, oils-osh\n- 13 languages\n- 18 industry buckets\n- 108× the size of NL2Bash (the previous reference dataset)\n\n**Load it:**\n\n```python\nfrom datasets import load_dataset\n\n# SFT training\nds = load_dataset(\"carosh/cli-1m\", split=\"train\")\n\n# 50k browse-friendly subset\nds = load_dataset(\"carosh/cli-1m\", name=\"sample\", split=\"train\")\n\n# Domain-specific (security only)\nds = load_dataset(\"carosh/cli-1m\", name=\"domains\", split=\"security\")\n```\n\n**Interactive explorer:** CLI-1M Explorer - a Hugging Face Space by carosh\n\n**Help wanted:** Looking for native speakers of Hebrew, Arabic, Hindi, Korean, or Russian to spot-check 50 translations each (~30 min, full credit in dataset card). DM @CaroDaShellShib on X or reply here to join the contributor waitlist.\n\nApache-2.0. Feedback welcome.",
  "title": "[Dataset] CLI-1M: 975K NL→shell pairs — 13 languages, 6 shells, Apache-2.0"
}