
`refusals_v3` moderation eval failing on fine-tuning jobs again — internal errors across `gpt-4.1-nano` and `gpt-4o-mini`

OpenAI Developer Community May 12, 2026

I’m hitting a consistent failure in the post-training moderation evaluation step of fine-tuning jobs. Training completes and the checkpoints and fine-tuned model are created, but the refusals_v3 eval then fails with an internal error and exhausts all 3 retry attempts.

Reproduced 3 times so far: 2 runs on gpt-4.1-nano and 1 run on gpt-4o-mini. The failure pattern is identical every time, even though the base models differ.

Event log from the most recent run (newest entries first):

Retrying moderation eval refusals_v3 (attempt 3/3) due to an internal error.   00:37:07
Retrying moderation eval refusals_v3 (attempt 2/3) due to an internal error.   00:26:56
Evaluating model against our usage policies                                    00:26:56
New fine-tuned model created                                                   00:26:56
Checkpoint created at step 302                                                 00:26:56
Checkpoint created at step 151                                                 23:41:39
Fine-tuning job started                                                        23:41:37
Files validated, moving job to queued state                                    23:41:36
Validating training file: file-PrsA2qk3fi3ppPc3S1Lkgq                          23:41:36
Created fine-tuning job: ftjob-3S2R2CNYXZOiZUIIhd7x2Bqu
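
For anyone who wants to check their own jobs for the same pattern, here is a minimal sketch that scans event messages for the retry line shown in the log above. The regex is based only on the message wording in my log, and the helper names (`summarize_eval_retries`, `reached_final_attempt`) are made up for illustration. It assumes you already have the event messages as plain strings, for example copied from the dashboard or pulled from the fine-tuning events endpoint.

```python
import re

# Matches the retry message exactly as it is worded in the event log above.
# The wording is an assumption taken from that log; it may differ for other evals.
RETRY_RE = re.compile(
    r"Retrying moderation eval (?P<name>\S+) "
    r"\(attempt (?P<n>\d+)/(?P<total>\d+)\) due to an internal error"
)


def summarize_eval_retries(messages):
    """Return {eval_name: (highest_attempt_seen, max_attempts)} for retry events."""
    summary = {}
    for msg in messages:
        m = RETRY_RE.search(msg)
        if m is None:
            continue  # not a retry message
        name, n, total = m["name"], int(m["n"]), int(m["total"])
        highest_so_far, _ = summary.get(name, (0, total))
        summary[name] = (max(highest_so_far, n), total)
    return summary


def reached_final_attempt(summary, name):
    """True if the last allowed attempt for this eval has been started."""
    attempt, total = summary.get(name, (0, 0))
    return total > 0 and attempt >= total


# Messages from the log above (newest first, timestamps dropped).
messages = [
    "Retrying moderation eval refusals_v3 (attempt 3/3) due to an internal error.",
    "Retrying moderation eval refusals_v3 (attempt 2/3) due to an internal error.",
    "Evaluating model against our usage policies",
    "New fine-tuned model created",
    "Checkpoint created at step 302",
]

summary = summarize_eval_retries(messages)
print(summary)
print(reached_final_attempt(summary, "refusals_v3"))
```

Note that this only tells you the final retry was started, not that it failed. The job's final status is still the thing to check.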

This looks like the same issue reported in February:

Fine-tuning job fails after 3 retries during moderation eval refusals_v3 (internal error, gpt-4.1-mini-2025-04-14) — Feb 12, 2026

In that thread, multiple users confirmed the failure across different base models and completely different (benign) datasets, and it was eventually resolved on the service side (“The refusals_v3 service seems to be up and running again”). The thread was closed without an official root-cause post. There are also a couple of related threads from January/February with the same internal error pattern.

A few things I’d appreciate clarity on:

Is refusals_v3 having issues again? Nothing on the status page as of right now, and I haven’t found a recent post about it.
The fine-tuned model is created before the eval runs — is it actually usable, or does a failed moderation eval block deployment regardless?
When the eval itself fails for internal reasons (not a content issue), what’s the recommended action?

Job ID and file ID are in the log above for anyone from the team who wants to dig in. Happy to share more details.

Thanks.
