I need to create my own dataset based on mlabonne/orpo-dpo-mix-40k, but when I do that and build a dataset for ORPO training, it gives an error

Hugging Face Forums [Unofficial] March 4, 2026

Yep, that error is happening because apply_chat_template() expects a list of message dicts like [{ "role": "...", "content": "..." }, ...], but your CSV has chosen and rejected as plain strings. So Jinja tries to read .role on a string and crashes.
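To see the shape mismatch without loading a model, here is a minimal sketch. The toy_chat_template function is a hypothetical stand-in written in plain Python, not the real Jinja template or the tokenizer API; it only mimics the part that reads role and content from each message.

```python
def toy_chat_template(messages):
    # Hypothetical stand-in for a chat template: it expects an iterable of
    # dicts and reads "role" and "content" from each one.
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}")
    return "\n".join(parts)


# Correct shape: a list of message dicts.
good = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
rendered = toy_chat_template(good)

# Wrong shape: a plain string. Iterating over it yields single characters,
# and a character has no "role" key, so the lookup fails.
try:
    toy_chat_template("Hello!")
    failed = False
except TypeError:
    failed = True
```

The real template fails for the same underlying reason, although the exact exception and message you see will come from Jinja rather than from this toy function.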

Two easy fixes.

Fix 1: Wrap your strings into chat messages (most common)

If your CSV has:

  • prompt = user instruction

  • chosen = preferred assistant answer

  • rejected = worse assistant answer

Then convert them like this:

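import os

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: base_model is the ID of the model you are fine-tuning, defined
# earlier in your script. Its tokenizer must have a chat template set.
tokenizer = AutoTokenizer.from_pretrained(base_model)

dataset = load_dataset("csv", data_files="my_file.csv")["train"]

def format_chat_template(row):
    # Wrap the plain strings into the list-of-messages shape the template expects.
    chosen_msgs = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["chosen"]},
    ]
    rejected_msgs = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["rejected"]},
    ]

    # tokenize=False returns the formatted text instead of token IDs.
    row["chosen"] = tokenizer.apply_chat_template(chosen_msgs, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(rejected_msgs, tokenize=False)
    return row

dataset = dataset.map(format_chat_template, num_proc=os.cpu_count())
dataset = dataset.train_test_split(test_size=0.01, seed=42)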

That builds the same list-of-messages structure that mlabonne/orpo-dpo-mix-40k stores in chosen and rejected, so apply_chat_template() can format your rows the same way.

Fix 2: If your CSV already stores JSON lists of messages

If your chosen column looks like a JSON string (starts with [ and has role/content), then parse it first:

import json

def format_chat_template(row):
    chosen_msgs = json.loads(row["chosen"])
    rejected_msgs = json.loads(row["rejected"])
    row["chosen"] = tokenizer.apply_chat_template(chosen_msgs, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(rejected_msgs, tokenize=False)
    return row

Quick check: print one row["chosen"] before mapping to confirm whether it is plain text or JSON.
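If you would rather not eyeball it, that check can be sketched with the standard library alone. The helper name looks_like_message_list is made up for this example, and it assumes a "JSON" cell means a non-empty list of dicts that each have role and content keys.

```python
import json


def looks_like_message_list(value):
    """Return True if value is a JSON string encoding a list of role/content dicts."""
    if not isinstance(value, str):
        return False
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        # Not valid JSON at all, so treat it as plain text.
        return False
    return (
        isinstance(parsed, list)
        and len(parsed) > 0
        and all(
            isinstance(m, dict) and "role" in m and "content" in m
            for m in parsed
        )
    )


plain_cell = "Paris is the capital of France."
json_cell = '[{"role": "user", "content": "Capital of France?"}]'

plain_is_json = looks_like_message_list(plain_cell)
json_is_json = looks_like_message_list(json_cell)
```

If the helper returns False for your chosen cell, use Fix 1; if it returns True, use Fix 2.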

Also, are you still hitting this issue now, or did you solve it and you are trying to improve your ORPO dataset quality next?
