Automatic -100 masking of the questions in Labels
Hi. A frontier model tried to convince me to use "from trl import DataCollatorForCompletionOnlyLM" instead of "from transformers import DataCollatorForLanguageModeling", the reason being that "the DataCollatorForCompletionOnlyLM handles the -100 masking of the question part in labels automatically."
When I countered that trl's DataCollatorForCompletionOnlyLM was deprecated, it acknowledged its error and stated that transformers' DataCollatorForLanguageModeling, coupled with a prompt-completion dataset format and setting "completion_only_loss=True", will take care of the -100 masking automatically, with no need for a custom data_collator function to mask manually.
My dataset is in the conversational format (messages: system-user-assistant). I printed the batch output coming out of the data_collator and I can see that no -100 masking of the question part of the labels took place (as shown below). Is there a transformers data_collator setting or an SFT setting that can be used to force it? Does dataset_text_field="messages" play a role in this case, the same way dataset_text_field="text" does with a prompt-completion dataset format?
Any hints will be greatly appreciated!
Labels (Batch 0): tensor([248045, 8678, 198, 2523, 513, 264, 10631, 1558, 421, 5529, 2708, 321, 61446, 10926, 13, 248046, 198, 248045, 846, 198, 4199, 1599, 12417, 1467, 68677, 3222, 506, 18773, 30, 248046, 198, 248045, 74455, 198, 248068, 271, 248069, 271, 11280, 12417, 25, 7884, 24093, 10254, 318, 30425, 220, 17, 15, 15, 18, 310, 6443, 220, 17, 15, 15, 19, 553, 248046, 198])
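For comparison, here is a minimal sketch of the masking I expected to see, written in plain Python so it does not depend on any trl or transformers API. It assumes the assistant turn is introduced by the token sequence 248045, 74455, 198, which is how it appears in the labels above; that this sequence is the chat template's assistant header is my assumption, and mask_prompt_tokens is a hypothetical helper name, not a library function. The example labels are a shortened version of the batch above.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt_tokens(labels, assistant_header):
    """Return a copy of `labels` where every token up to and including the
    LAST occurrence of `assistant_header` is replaced by IGNORE_INDEX.

    Hypothetical helper for illustration only; not part of trl or transformers.
    """
    n = len(assistant_header)
    start = None
    # Scan from the end so that only the final assistant turn keeps its labels.
    for i in range(len(labels) - n, -1, -1):
        if labels[i:i + n] == assistant_header:
            start = i + n
            break
    if start is None:
        # No assistant header found: nothing to train on in this sequence.
        return [IGNORE_INDEX] * len(labels)
    return [IGNORE_INDEX] * start + labels[start:]


# Assumed assistant header, taken from the token ids printed in the batch above.
ASSISTANT_HEADER = [248045, 74455, 198]

# Shortened version of the labels above: system turn, user turn, assistant turn.
labels = [
    248045, 8678, 198, 2523, 248046, 198,        # system turn
    248045, 846, 198, 4199, 30, 248046, 198,     # user turn
    248045, 74455, 198,                          # assistant header
    248068, 271, 248069, 271, 11280, 248046, 198,  # assistant answer
]

masked = mask_prompt_tokens(labels, ASSISTANT_HEADER)
print(masked)
```

With masking in place, the system turn, the user turn, and the assistant header would all show up as -100 in the printed labels, and only the assistant answer tokens would keep their ids. That is not what my batch shows.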