
Product Proposal: Expanding GPT-4o into Real-Time Multimodal Video Agents

OpenAI Developer Community, May 17, 2026

Hi OpenAI Product Team,

With the recent advancements in your real-time multimodal capabilities, the audio latency and conversational flow of your current models are incredible. However, there is a massive untapped opportunity in the presentation layer: transitioning from voice-only interactions to fully autonomous, visually interactive video agents.

I am conceptualizing a framework for an AI agent that doesn’t just converse, but actively participates in video calls as a human-like presenter. The core architecture involves an agent that can dynamically control a user interface (sharing its screen, generating real-time visual highlights, and gesturing to graphics) while simultaneously explaining the content over a low-latency voice stream.

The immediate use cases are highly scalable B2B applications:

The Virtual Chief of Staff: An agent that joins a morning brief, pulling up live market charts and pointing to specific data sets while summarizing global news.

The Interactive Tutor: An AI that doesn’t just dictate math steps, but interacts with a virtual whiteboard, circling equations and highlighting errors in real time.

While the current API supports robust tool-calling, executing a synchronized, visually interactive presentation requires a much tighter integration between your speech generation, UI triggering (via JSON command outputs), and low-latency video streaming protocols like WebRTC.

I would love to connect with someone on the developer relations or product team to discuss how we could optimize the OpenAI API stack for this specific visual-agent workflow, or to see if this aligns with any internal frameworks you are currently testing.

Thank you for your time and the incredible tools you continue to build.

Best regards,
Suhan Kazi
+91 8329477022
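
P.S. To make the "UI triggering via JSON command outputs" idea a little more concrete, here is a minimal Python sketch of how a client might validate the model's commands and fire them in step with the voice stream. Everything in it is an assumption for illustration: the command fields (action, target, at_ms), the action names, and the helper functions are hypothetical and not part of any existing OpenAI API.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical action names; a real product would define its own vocabulary.
ALLOWED_ACTIONS = {"share_screen", "highlight", "point"}


@dataclass(frozen=True)
class UICommand:
    action: str   # what the presenter UI should do
    target: str   # which on-screen element it applies to
    at_ms: int    # offset into the speech audio at which to fire


def parse_commands(raw: str) -> List[UICommand]:
    """Parse a JSON array of commands, reject unknown actions, sort by time."""
    commands = []
    for item in json.loads(raw):
        if item["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {item['action']}")
        commands.append(
            UICommand(item["action"], item["target"], int(item["at_ms"]))
        )
    return sorted(commands, key=lambda c: c.at_ms)


def due_commands(commands: List[UICommand], last_ms: int, now_ms: int) -> List[UICommand]:
    """Return commands whose trigger time falls in the window (last_ms, now_ms]."""
    return [c for c in commands if last_ms < c.at_ms <= now_ms]


# Example: the model emits commands alongside its speech (assumed format).
raw_output = json.dumps([
    {"action": "highlight", "target": "chart.q3", "at_ms": 1200},
    {"action": "share_screen", "target": "market_dashboard", "at_ms": 0},
])
timeline = parse_commands(raw_output)

# As audio playback advances, the client polls for commands that became due.
for cmd in due_commands(timeline, last_ms=-1, now_ms=500):
    print(cmd.action, cmd.target)
```

The point of keying each command to an offset in the audio, rather than to wall-clock time, is that the visuals stay aligned with the narration even if the voice stream buffers or stalls. In a WebRTC setup, the client could read the playback position from the audio track and call due_commands on each tick.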
