Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidcdsco3kdv22eqyetfajy72e5y4xvmcq5pbptoszloviqrtparu4",
    "uri": "at://did:plc:j4nmy4ymoeorm3j6hzbijapg/app.bsky.feed.post/3md3u4dx2gcq2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibm4igey63g6sjayigezgvmsanv6klcf5wobyhjgzib6s3nnpiaxq"
    },
    "mimeType": "image/jpeg",
    "size": 677015
  },
  "description": "Dictation turns speech into text. Conversation works in time. PersonaPlex marks the moment voice AI starts to operate in real-time dialogue.",
  "path": "/personaplex-marks-a-shift-from-dictation-to-conversation/",
  "publishedAt": "2026-01-23T13:56:00.000Z",
  "site": "https://hoeijmakers.net",
  "tags": [
    "PersonaPlex",
    "Whisper",
    "Speech recognition, Speech-to-Text, Dictation, and Transcription: What’s the Difference?",
    "NVIDIA PersonaPlex: Natural Conversational AI With Any Role and Voice",
    "From Typing to Talking"
  ],
  "textContent": "I have written quite a bit about speech recognition, dictation, and voice over the past few years. Looking back, I now see that I was often talking about different things under the same label.\n\nPersonaPlex is a research model from NVIDIA that explores native, real-time speech-to-speech conversation, where listening and speaking happen at the same time.\n\nIt helps make a distinction visible that had been forming for a while already.\n\nNot because earlier systems were wrong or outdated, but because _what we mean by voice_ has started to change.\n\n💡\n\n**PersonaPlex, in short:** PersonaPlex is a research model that treats speech as a continuous, real-time process rather than as input to be processed in turns. It listens and speaks at the same time, meaning interruptions, timing, and tone directly shape what the system does next. That makes it fundamentally about conversation, not dictation.\n\n## Dictation solved one problem\n\nModels such as Whisper are excellent at dictation.\n\nYou speak. The system listens. You get clean, reliable text.\n\nThat alone was a major step. It removed friction between thinking and writing. It made meetings, interviews, and spoken notes usable at scale. For this purpose, dictation models are still extremely strong.\n\nBut dictation treats speech as something that already happened. Accuracy matters more than timing.\n\nSpeech recognition, Speech-to-Text, Dictation, and Transcription: What’s the Difference?\n\n## Conversation already feels solved, doesn’t it?\n\nIf you use ChatGPT with voice, or Gemini Live, it is easy to think: _this already is full duplex_.\n\nYou can interrupt them. They stop speaking immediately. 
The interaction feels fluid compared to older voice assistants.\n\nFrom a user’s perspective, that intuition makes sense.\n\nBut under the hood, something else is going on.\n\n## How today’s voice systems actually work\n\nMost production voice systems today rely on several fast components working together:\n\n  * One part listens for speech and detects interruptions.\n  * Another part reasons about what to say.\n  * Another part turns that into audio.\n\n\n\nWhen you interrupt, a very fast detector notices this and simply cuts off the audio output. The system stops talking right away, even if another part is still finishing its thought elsewhere.\n\nTo you, it feels like the system was listening while speaking.\nTechnically, it mostly just stopped speaking very quickly.\n\nThis is not a flaw. It is a sensible engineering choice. It is cheaper, more stable, and easier to control.\n\nBut it is not yet what researchers mean by **full duplex**.\n\n## What “full duplex” really means\n\nFull duplex simply means _listening and speaking at the same time_.\n\nNot taking turns.\nNot stopping first.\nNot restarting after an interruption.\n\nIn a full-duplex speech system, incoming sound continues to shape what the system is doing while it is already talking. Interruptions are not just stop signals. They carry information: timing, tone, urgency.\n\nSpeech is no longer just an interface around reasoning. It becomes part of the reasoning itself.\n\nThat is the real shift.\n\n## Why this matters\n\nSeen in that light, PersonaPlex is not just another voice demo.\n\nIt is a concrete example of what changes when speech is no longer treated as input that must first be stabilised, but as a medium in which interaction itself takes place.\n\nDictation systems listen, then act.\nMost current voice assistants react quickly.\nPersonaPlex listens _while_ it speaks.\n\nThat distinction may sound subtle, but it changes what kinds of conversations become possible. 
Especially outside the assistant setting: in phone calls, service conversations, and other situations where flow, interruption, and timing matter.\n\nPersonaPlex does not replace dictation models, nor does it invalidate today’s voice assistants. It shows that voice is becoming layered.\n\nAnd it gives a first, working glimpse of what that next layer looks like.\n\n### Further reading\n\n  * NVIDIA PersonaPlex: Natural Conversational AI With Any Role and Voice\n  * From Typing to Talking\n\n",
  "title": "PersonaPlex marks a shift from dictation to conversation",
  "updatedAt": "2026-05-10T08:53:38.537Z"
}