New feature: moderation scores in Chat API responses, by parameter
developers.openai.com
Moderate generated content - Moderation | OpenAI API
Use OpenAI moderation models to detect harmful content in text and images. You can classify standalone inputs with the moderation endpoint or request moderation scores alongside a generated response. Use the results to enforce your application’s...
- Send:
`"moderation":{"model": "omni-moderation-latest"}`
- Receive: according to the documentation, the moderation endpoint’s classification object comes back at:
`response.moderation.input`
`response.moderation.output`
However, that documentation is wrong for the RESTful API itself.
The object is actually found at output → moderation → output.
Or look for “moderation” in the event with "type": "response.completed".
From this you can see whether the input you ran was flagged, and whether the model’s output was flagged because it did not refuse.
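Since the documented location and the observed location disagree, a defensive client can search the parsed response for any "moderation" key instead of hard-coding one path. This is a minimal sketch: `find_moderation` is a hypothetical helper, and the sample payloads are made-up placeholders that only mimic the two nestings described above, not real API output.

```python
from typing import Any, Iterator, Tuple

# Request fragment from the post; merge it into your normal request body.
MODERATION_PARAM = {"moderation": {"model": "omni-moderation-latest"}}


def find_moderation(node: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every key named "moderation" in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "moderation":
                yield child, value  # report it and do not descend further
            else:
                yield from find_moderation(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_moderation(item, f"{path}[{i}]")


if __name__ == "__main__":
    # Placeholder shapes (assumptions): the documented layout and the
    # literal "output -> moderation -> output" reading from the post.
    documented = {"moderation": {"input": {"flagged": True}}}
    observed = {"output": {"moderation": {"output": {"flagged": True}}}}
    for sample in (documented, observed):
        for where, value in find_moderation(sample):
            print(where, value)
```

The same function works on a parsed "response.completed" event, because it walks the whole object rather than assuming where the key lives.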
Concern
This only includes a score in the response.
There is no option to prevent flagged input from being run.
It only protects you from OpenAI generations. It doesn’t protect you from OpenAI.
Your API organization is still in jeopardy when you use unclassified, unfiltered user content. It is in jeopardy anyway from many things that moderation does not classify and that stay undocumented until you are banned and your credits are taken: distillation, cyber research, biological topics, etc.
Perhaps a new API method, to instill false confidence and encourage you to get banned?
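Because the parameter does not stop flagged input from being run, the gate has to live in your own code. Here is a rough sketch of that idea, assuming you pre-screen with the standalone moderation endpoint through the official `openai` Python package. `guarded_generate` and `openai_flagged` are hypothetical names, and `generate` stands in for whatever call you use to produce a response. It does nothing about the categories that moderation does not classify.

```python
from typing import Callable, Optional


def openai_flagged(text: str) -> bool:
    """Classify text with the standalone moderation endpoint.

    Needs network access and an API key in the environment.
    """
    from openai import OpenAI  # lazy import: the gate itself has no dependency

    client = OpenAI()
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged


def guarded_generate(
    user_text: str,
    generate: Callable[[str], str],
    is_flagged: Callable[[str], bool] = openai_flagged,
) -> Optional[str]:
    """Run generate(user_text) only if the input is not flagged.

    Returns None when the input is flagged, so it is never sent on.
    """
    if is_flagged(user_text):
        return None
    return generate(user_text)
```

Passing the classifier in as an argument keeps the gate testable with a stub, for example `guarded_generate("hi", my_generate, lambda t: False)`.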