Mistral unveils new open-source model for speech generation
Mistral has released a new open-source speech generation model, expanding AI voice capabilities and advancing accessible, high-quality text-to-speech technology.
French AI firm Mistral AI on Thursday unveiled a new open-source text-to-speech model designed for applications ranging from voice assistants to enterprise use cases such as customer support. The release positions the company in direct competition with players like ElevenLabs, Deepgram, and OpenAI.
The newly introduced model, Voxtral TTS, supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
“Our customers have been asking for a speech model. So we built a small speech model that can fit on a smartwatch, smartphone, laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance,” Pierre Stock, VP of science operations at Mistral AI, said in a phone interview.
Mistral noted that the model can generate a custom voice from a sample of less than 5 seconds, capturing subtle accents, inflexions, intonation, and natural speech irregularities. Built on Ministral 3B, the system can transition between languages while preserving the speaker’s voice characteristics, making it suitable for applications like dubbing and real-time translation. Stock emphasised that the goal was to produce speech that sounds natural rather than robotic.
According to the company, the model is optimised for real-time performance. It achieves a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second sample of 500 characters. It also delivers a real-time factor (RTF) of 6x, meaning it can produce a 10-second audio clip in approximately 1.6 seconds.
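As a rough check on the latency figure, the sketch below works through the arithmetic. It assumes that "6x" means audio is generated six times faster than it plays back. The function name `synthesis_time` and its parameters are illustrative only and are not part of any Mistral API.

```python
def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of audio
    when generation runs `speedup` times faster than real time.

    Assumption: a "6x" real-time factor is read as a 6x speedup over playback.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


clip_seconds = 10.0  # clip length quoted in the article
speedup = 6.0        # real-time factor quoted in the article

elapsed = synthesis_time(clip_seconds, speedup)
print(f"{elapsed:.2f} s")  # 1.67 s
```

Under that reading, a 10-second clip takes about 1.67 seconds to generate, in line with the roughly 1.6 seconds the company quotes.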
Earlier this year, Mistral introduced two transcription models — one aimed at large-scale batch processing and another designed for low-latency, real-time scenarios. With the addition of this speech model, the company appears to be expanding its suite of voice-related tools for enterprise customers.
“We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output,” Stock said.
Mistral’s approach centres on its open-source and customisation capabilities, which it believes will encourage enterprise adoption by allowing organisations to tailor the model to their specific needs.