Automatic -100 masking of the questions in Labels

Hugging Face Forums [Unofficial] May 21, 2026

Hi. A frontier model tried to convince me to use "from trl import DataCollatorForCompletionOnlyLM" instead of "from transformers import DataCollatorForLanguageModeling", the reason being that "DataCollatorForCompletionOnlyLM automatically handles the -100 masking of the question part in labels."

When I countered that trl's DataCollatorForCompletionOnlyLM was deprecated, it acknowledged its error and stated that DataCollatorForLanguageModeling from transformers, coupled with a prompt-completion dataset format and the setting "completion_only_loss=True", would take care of the -100 masking automatically, with no need for a custom data_collator function to mask manually.

My dataset is in the conversational format (messages: system-user-assistant). I printed the batch coming out of the data_collator and can see that no -100 masking of the question part of the labels took place (as shown below). Is there a transformers data_collator setting or an SFT setting that can be used to force it? Does dataset_text_field="messages" play a role in this case, the same way dataset_text_field="text" does with a prompt-completion dataset format?

Any hints will be greatly appreciated!

Labels (Batch 0): tensor([248045, 8678, 198, 2523, 513, 264, 10631, 1558, 421, 5529, 2708, 321, 61446, 10926, 13, 248046, 198, 248045, 846, 198, 4199, 1599, 12417, 1467, 68677, 3222, 506, 18773, 30, 248046, 198, 248045, 74455, 198, 248068, 271, 248069, 271, 11280, 12417, 25, 7884, 24093, 10254, 318, 30425, 220, 17, 15, 15, 18, 310, 6443, 220, 17, 15, 15, 19, 553, 248046, 198])
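
For reference, here is a minimal sketch of the masking I expected to see. It is a hand-rolled illustration, not anything from trl or transformers. It assumes, purely from the pattern in the labels printed above, that 248045 and 248046 are the turn start and end markers and that 74455 followed by 198 is the "assistant" role name plus a newline; I have not confirmed this against the tokenizer. The function name mask_non_assistant is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_non_assistant(input_ids, assistant_header, end_id, ignore_index=IGNORE_INDEX):
    """Return labels in which only assistant-turn tokens keep their ids.

    Everything else (system and user turns, the assistant header itself,
    and tokens after the end marker) is set to ignore_index.
    """
    labels = [ignore_index] * len(input_ids)
    n = len(assistant_header)
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + n] == assistant_header:
            j = i + n  # first token of the assistant reply
            while j < len(input_ids):
                labels[j] = input_ids[j]  # keep the reply tokens for the loss
                if input_ids[j] == end_id:
                    break  # the end marker is the last token kept
                j += 1
            i = j + 1
        else:
            i += 1
    return labels


# Assumed ids, read off the printed labels (not verified against the tokenizer).
ASSISTANT_HEADER = [248045, 74455, 198]
END_ID = 248046

# Shortened version of the batch above: one user turn, one assistant turn.
ids = [248045, 846, 198, 4199, 30, 248046, 198,
       248045, 74455, 198, 11280, 12417, 248046, 198]
labels = mask_non_assistant(ids, ASSISTANT_HEADER, END_ID)
print(labels)
```

With masking like this, the user turn and the assistant header would show up as -100 in the labels, and only the assistant reply through its end marker would keep real token ids. That is what I expected the collator output to look like.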
