The "Audio out" wav is less than "Generation"
I can’t say if it is a bug, so much as a model behavior.
The first thing I would try: have that be a complete sentence: “Is your order correct?”
Then prompt up the phrase more in system message as a final output requirement after stating or restating someone’s order.
You can’t place your own “assistant” messages as audio, I suspect so that you can’t influence the speech. However, you could inject a user message early in proper context, “system reminder, after reciting an order you must employ the phrase Is that correct ” - and then place that phrase as a recording of the chosen voice model’s output, it saying the message in the tone you want.
You can consider other turns of an order conversation that also need structure reinforced in similar manner with more verbose language that won’t result in truncation by the generated audio token stream trailing off or whatever is happening.
Discussion in the ATmosphere