
Product Proposal: Expanding GPT-4o into Real-Time Multimodal Video Agents

OpenAI Developer Community May 17, 2026
Hi OpenAI Product Team,

With the recent advancements in your real-time multimodal capabilities, the audio latency and conversational flow of your current models are incredible. However, there is a massive untapped opportunity in the presentation layer: transitioning from voice-only interactions to fully autonomous, visually interactive video agents.

I am conceptualizing a framework for an AI agent that doesn't just converse, but actively participates in video calls as a human-like presenter. The core architecture involves an agent that can dynamically control a user interface (sharing its screen, generating real-time visual highlights, and gesturing to graphics) while simultaneously explaining the content over a low-latency voice stream.

The immediate use cases are highly scalable B2B applications:

The Virtual Chief of Staff: An agent that joins a morning brief, pulling up live market charts and pointing to specific data sets while summarizing global news.

The Interactive Tutor: An AI that doesn't just dictate math steps, but interacts with a virtual whiteboard, circling equations and highlighting errors in real time.

While the current API supports robust tool-calling, executing a synchronized, visually interactive presentation requires a much tighter integration between your speech generation, UI triggering (via JSON command outputs), and low-latency video streaming protocols like WebRTC.

I would love to connect with someone on the developer relations or product team to discuss how we could optimize the OpenAI API stack for this specific visual-agent workflow, or to see if this aligns with any internal frameworks you are currently testing.

Thank you for your time and the incredible tools you continue to build.

Best regards,
Suhan Kazi
+91 8329477022
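
To make the idea of UI triggering via JSON command outputs more concrete, here is a minimal sketch of how a client might line up visual actions with the audio playback position. Everything in it is an assumption for illustration only: the field names `at_ms`, `action`, and `target`, the action strings, and the helpers `parse_commands` and `due_commands` are hypothetical and are not part of any existing OpenAI API or WebRTC interface.

```python
import json
from dataclasses import dataclass


# Hypothetical command schema: these field names are illustrative only
# and do not come from any existing API.
@dataclass(frozen=True)
class UICommand:
    at_ms: int   # offset from the start of the spoken segment, in milliseconds
    action: str  # e.g. "show_whiteboard", "circle", "highlight"
    target: str  # identifier of the UI element to act on


def parse_commands(payload: str) -> list[UICommand]:
    """Parse a JSON array of commands and order them by time offset."""
    raw = json.loads(payload)
    commands = [UICommand(int(c["at_ms"]), c["action"], c["target"]) for c in raw]
    return sorted(commands, key=lambda c: c.at_ms)


def due_commands(commands: list[UICommand], last_ms: int, now_ms: int) -> list[UICommand]:
    """Return the commands whose offset falls in the window (last_ms, now_ms]."""
    return [c for c in commands if last_ms < c.at_ms <= now_ms]


# Example payload an agent might emit alongside one spoken segment (assumed shape).
payload = """[
  {"at_ms": 1200, "action": "circle", "target": "equation-2"},
  {"at_ms": 0, "action": "show_whiteboard", "target": "board"},
  {"at_ms": 2500, "action": "highlight", "target": "error-term"}
]"""

commands = parse_commands(payload)

# Simulate the audio playback position advancing in one-second ticks.
last = -1
for now in (0, 1000, 2000, 3000):
    for cmd in due_commands(commands, last, now):
        print(f"{now} ms: {cmd.action} -> {cmd.target}")
    last = now
```

In this sketch the audio clock drives the visuals: each command fires once, in the first tick where playback has reached its offset. A real client would presumably take the playback position from its audio pipeline and deliver the commands over some channel next to the media stream, but those details are outside what this example assumes.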
