Microsoft released new MAI voice and image models for Build 2026

TestingCatalog May 30, 2026

UPDATE: Microsoft announced seven new AI models at Microsoft Build 2026: MAI Image 2.5, MAI Image 2.5 Flash, MAI Voice 2, MAI Voice 2 Flash, MAI Transcribe 1.5, MAI Code 1 Flash, and MAI Thinking 1.

Seven new models launching at Build: let’s go! Reasoning. Code. Image. Transcribe. Voice.

Built from scratch on a clean data lineage, designed for efficiency, working seamlessly as a family of models

Thread 🧵 #MSBuild pic.twitter.com/g3WQIcIQ24

— Microsoft AI (@MicrosoftAI) June 2, 2026

The Story

Microsoft heads into its Build conference on June 2 in San Francisco with more in its model pipeline than the MAI-Image-2.5 that it has already shown on Arena, where the text-to-image system landed third behind OpenAI’s gpt-image-2 and Google’s Nano Banana 2. That release is lined up for the MAI Playground and Foundry, but three additional models are taking shape within the company’s stack, none of which are publicly available yet.

Exciting news, MAI-Image-2.5 (Preview) from @MicrosoftAI debuts at #3 in the Text-to-Image Arena with a score of 1,254 — a +72 point improvement over MAI-Image-2.

A top 5 arena previously held only by @GoogleDeepMind and @OpenAI has a new lab in the mix.

Congrats to the… https://t.co/stHydZYbNN pic.twitter.com/4eVXxfbI6M

— Arena.ai (@arena) May 26, 2026

The first, MAI-Transcribe-1.5, is a modest step up from the speech-to-text model launched in April, which already claimed the lowest word error rate across 25 languages. The image side draws more attention: MAI-Image-2.5 looks set to ship in two variants, a high-quality version and a faster one labeled MAI-Image-2.5e, mirroring the split seen with MAI-Image-2. It would also accept image uploads, opening the model to editing as well as generation and putting it on par with rivals from Google and OpenAI.

The most striking find is MAI-Voice-2, a multilingual successor to the company’s text-to-speech model. While MAI-Voice-1 began in English, the new version adds German, Australian and US English, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Dutch, Portuguese, Turkish, Vietnamese, and Chinese, with a wider emotional range that covers tones such as angry, confused, and embarrassed. Early samples suggest it can whisper, too.

Audio samples: “Harper whisper egret” (18.8 s), “Ethan shouting egret” (15.7 s), “Field isla joyful” (16.2 s)

All three would feed Copilot, Teams, and Azure Speech, and fit the developer crowd that Build is made for. The timing matches a broader push, as Mustafa Suleyman’s team weans the company off OpenAI following April’s renegotiation. Reports point to a homegrown coding model for GitHub Copilot at the show, too, while a Copilot “super app” that integrates chat, coding, and agents into a single hub is expected later in the summer.
