External Publication
Visit Post

Point AI at your own work before someone else does (building securely with AI / tips)

OpenAI Developer Community June 1, 2026
Source

If you are building with AI, you already have the same tool your future attacker will use against you. The sensible move is to use it first.

The capability that lets a model find a flaw in a stranger’s app is the capability that finds the flaw in yours. The asymmetry is that you get to go first, and you go first with every advantage: the source open in front of you, the architecture in your head, unlimited attempts against a staging copy, and time before anything ships. An attacker gets none of that. They work blind, from the outside, under rate limits and detection. Right up until you launch without having used any of it, every edge in this matchup belongs to you.

This is a short argument for spending that edge, and for spending it on two surfaces, your code and the behaviour of the product you wrap around the model.

Red-team the code

AI-assisted code review is the cheap, constant first pass. A few things that work better than “find bugs in this”:

  • Ask how it gets misused, beyond the obvious bugs. “How would someone abuse this endpoint?” surfaces more than “is this code correct?” because it puts the model in an adversary’s frame.

  • Trace input to sink. Ask the model to follow where user-controlled input reaches a query, a file path, a shell call, or a deserialiser, and where an object gets fetched without an ownership check. Injection and broken authorization are most of what actually gets exploited, and both are about data flow the model can follow.

  • Turn findings into tests. When it flags something, have it generate the adversarial input that proves it, and keep that input in your test suite so the issue cannot quietly come back.

  • Treat output as leads, not a verdict. It will hallucinate findings and miss real ones in the same pass. Every result is something to verify, and the value is the cheap coverage, not the confidence.

The product is its own attack surface

This is the part most teams skip, and it is the part I most want to land. Your code can be clean and your product still trivially exploitable, because an AI product has an attack surface that does not live in the code at all. It lives in the model’s behaviour.

If you ship anything where a model reads untrusted input or takes actions, you have inherited a class of problems that a code review will never catch. Worth putting a model in the attacker’s seat against, on a staging build:

  1. Prompt injection. The moment your product feeds untrusted text (a web page, an email, an uploaded document, another user’s message) into a model that can act, that text is effectively code running with your agent’s privileges. Plant instructions inside the data and watch what the agent does with them.
  2. Tool and function-call abuse. Can a user talk your agent into calling a tool it should not, or calling the right tool with arguments that cross a boundary it should respect?
  3. Exfiltration through the model. Can someone pull your system prompt, another user’s data, or a secret out of the context window with the right phrasing?
  4. Your safety layer. Whatever policy you wrap around the model, try to talk past it. Assume your users will.
  5. Business-logic abuse. Discounts, free-tier loops, rate limits, refund flows. Models are unreasonably good at finding the weird path through a workflow, which is exactly what you want before a real user finds it.

Treat the product the way an attacker treats it: as a system that can be talked into misbehaving. Then have a model try to talk it into misbehaving, repeatedly, before launch and on a schedule after.

Make it continuous, and vary the inputs

A one-time audit before launch ages badly. The version that holds up is wired into the loop: an adversarial review step in CI, an eval suite of abuse cases that runs on every change, a recurring “break the staging build” session that someone actually owns.

One lesson worth borrowing from the world of evaluation: if you always test against the same fixed list of attack strings, you have measured your defences against exactly that list and nothing beyond it. Have the model generate fresh adversarial cases each run rather than replaying last month’s, so you are testing the defence itself rather than your memory of what already got patched. A defence that only blocks the payloads you have already seen is a defence with a short shelf life. For most products this is a useful habit. For one group here it is the whole ballgame, which is the next section.

If you are building a security product, your eval is the hard part

A lot of people in this community are building the security tools themselves: scanners, triage agents, things that read a codebase or a live target and report what is wrong. If that is you, everything above applies twice over, because the two distortions stop being a hygiene concern and become the thing that decides whether your product actually works on a target it has never seen.

Two traps in particular:

  • Contaminated benchmarks. If you measure your scanner against a public set of known-vulnerable apps, part of what you are measuring is whether the underlying model already memorised those apps from its training data. The score looks great and tells you little about a target the model has never met. The team behind one of the most cited web-security benchmarks publicly retired it for this exact reason once it had been public long enough to be ingested.

  • Undefended targets. If you only ever test against bare apps with no WAF, no rate limiting, and no detection, you are measuring the easy version of the job. Real customers run defences, and a tool that is loud and easily blocked behaves very differently in production than it does on the bench.

This is the problem I went and built something for. PolyRange is an open-source framework that generates a fresh, themed, fully working target on every run, with real backing services, a real database, and real defences to get past, so the vulnerability class stays fixed while everything memorisable is randomised per deploy. A model cannot pass by recall, because the specific target did not exist before the run started. It is MIT-licensed and self-hostable, and if you are evaluating your own security product it gives you a way to produce a number that still means something next month.

I am not claiming it is the only way to do this. The general principle, generating fresh and defended targets rather than replaying a static set, is the part that matters, and you can apply it to your own eval suite without my code at all. The point is that if you are shipping a security product on top of these models, this is the axis your evaluation lives or dies on, and it is worth taking seriously before a customer takes it seriously for you.

Stay honest about what this is

AI red-teaming is jagged. It finds real issues and misses obvious ones in the same pass, and it produces confident-looking false positives that cost you time. It does not replace threat modeling, a proper pentest before a high-stakes launch, or review by someone who does security for a living. The honest framing is that it is the cheap, always-on first line that clears the easy wins and frees scarce human attention for the hard ones. Used that way it is a clear gain. Trusted as a clean bill of health it will hurt you.

The point

The capability is already sitting in your stack. Someone will eventually point it at your product. You can point it there first, today, with the source open and the clock stopped, which is a position your attacker will never get to occupy. The cheapest red-teamer you will ever have is the one already inside your dev loop, and the only real mistake is leaving it idle.

Discussion in the ATmosphere

Loading comments...