Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiey3qmqmrsur7vr6hmjhtfzdpnki4mzyzb6rkusixbu7v33obyaki",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgaqvyigbad2"
  },
  "path": "/t/i-need-to-create-my-own-dataset-based-on-mlabonne-orpo-dpo-mix-40k-but-when-i-does-it-and-create-a-dataset-for-orpo-training-it-gives-error/83014#post_4",
  "publishedAt": "2026-03-04T15:09:16.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Yep, that error is happening because `apply_chat_template()` expects a **list of message dicts** like `[{ \"role\": \"...\", \"content\": \"...\" }, ...]`, but your CSV has `chosen` and `rejected` as plain strings. So Jinja tries to read `.role` on a string and crashes.\n\nTwo easy fixes.\n\n### Fix 1: Wrap your strings into chat messages (most common)\n\nIf your CSV has:\n\n  * `prompt` = user instruction\n\n  * `chosen` = preferred assistant answer\n\n  * `rejected` = worse assistant answer\n\n\n\n\nThen convert them like this:\n\n\n    import os\n    from datasets import load_dataset\n\n    dataset = load_dataset(\"csv\", data_files=\"my_file.csv\")[\"train\"]\n\n    def format_chat_template(row):\n        chosen_msgs = [\n            {\"role\": \"user\", \"content\": row[\"prompt\"]},\n            {\"role\": \"assistant\", \"content\": row[\"chosen\"]},\n        ]\n        rejected_msgs = [\n            {\"role\": \"user\", \"content\": row[\"prompt\"]},\n            {\"role\": \"assistant\", \"content\": row[\"rejected\"]},\n        ]\n\n        row[\"chosen\"] = tokenizer.apply_chat_template(chosen_msgs, tokenize=False)\n        row[\"rejected\"] = tokenizer.apply_chat_template(rejected_msgs, tokenize=False)\n        return row\n\n    dataset = dataset.map(format_chat_template, num_proc=os.cpu_count())\n    dataset = dataset.train_test_split(test_size=0.01, seed=42)\n\n\n\nThat will make your data compatible with the `mlabonne/orpo-dpo-mix-40k` style schema.\n\n### Fix 2: If your CSV already stores JSON lists of messages\n\nIf your `chosen` column looks like a JSON string (starts with `[` and has `role/content`), then parse it first:\n\n\n    import json\n\n    def format_chat_template(row):\n        chosen_msgs = json.loads(row[\"chosen\"])\n        rejected_msgs = json.loads(row[\"rejected\"])\n        row[\"chosen\"] = tokenizer.apply_chat_template(chosen_msgs, tokenize=False)\n        row[\"rejected\"] = 
tokenizer.apply_chat_template(rejected_msgs, tokenize=False)\n        return row\n\nQuick check: print `dataset[0][\"chosen\"]` before mapping to confirm whether it is plain text or a JSON string.\n\nAlso, are you still hitting this issue, or have you solved it and moved on to improving your ORPO dataset quality?",
  "title": "I need to create my own dataset based on mlabonne/orpo-dpo-mix-40k. but when i does it and create a dataset for ORPO training it gives error"
}