Fine-tuning job fails after 3 retries during moderation eval refusals_v3 (internal error, gpt-4.1-mini-2025-04-14)
Hi,
I’m encountering an issue where a fine-tuning job completes training successfully but ultimately fails during the moderation evaluation phase after three retry attempts due to an internal error.
Job Information
Job ID:
ftjob-OkPixpS21QjyQWCsIEJflUJbTraining Method: Supervised
Base Model:
gpt-4.1-mini-2025-04-14
Timeline from Logs(TZ: Asia/Seoul)
15:31:07 Created fine-tuning job
15:31:07 Validating training and validation files
15:31:14 Files validated, moving job to queued state
15:31:17 Fine-tuning job started
16:18:34 Checkpoint created at step 873
16:18:34 Checkpoint created at step 1746
16:18:34 New fine-tuned model created
16:18:34 Evaluating model against our usage policies
16:29:37 Retrying moderation eval refusals_v3 (attempt 2/3) due to an internal error.
16:59:38 Retrying moderation eval refusals_v3 (attempt 3/3) due to an internal error.
After the third retry attempt, the job status becomes failed.
Observations
The training phase completes successfully.
A fine-tuned model is created before the moderation evaluation step.
The failure occurs specifically during the
refusals_v3moderation evaluation.The error message indicates an “internal error”, not a policy violation.
Questions
Is this a known issue with moderation evaluation on
gpt-4.1-mini-2025-04-14?Does this indicate a problem in my training dataset format or content?
Is there a way to debug or bypass this moderation evaluation failure?
Should I retry the job, or is there a known mitigation?
Any insight would be greatly appreciated.
Thank you!
Discussion in the ATmosphere