Voxtral: Transcribing at the Speed of Sound.

Introduction to Voxtral Transcribe 2

Today marks the release of Voxtral Transcribe 2, a family of next-generation speech-to-text models with state-of-the-art transcription quality, speaker diarization, and ultra-low latency. The lineup includes Voxtral Mini Transcribe V2, designed for batch transcription, and Voxtral Realtime, built for live applications. Voxtral Realtime is particularly noteworthy: its weights are released under the Apache 2.0 license, which allows flexible deployment options.

Experience the New Audio Playground

Complementing this release, Mistral Studio introduces an audio playground where users can test transcription capabilities instantly, powered by Voxtral Transcribe 2. You can upload up to 10 audio files and choose features like diarization and timestamp granularity to suit your needs. This playground supports various audio file formats, including .mp3, .wav, .m4a, .flac, and .ogg, with a maximum size of 1GB for each file.
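The upload limits above can be checked before sending a batch. The following is a minimal sketch, not the playground's own validation: the function name `check_upload_batch` is hypothetical, "1GB" is assumed to mean 10**9 bytes, and the format list covers only the extensions named here (the playground may accept others).

```python
from pathlib import Path

# Limits as stated in the announcement. How the playground enforces them is not documented here.
MAX_FILES = 10
MAX_BYTES = 1_000_000_000  # assumption: "1GB" read as 10**9 bytes
# The announcement says "including", so the real list may be longer.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def check_upload_batch(files):
    """Return a list of problems for a batch of (file_name, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files:
        ext = Path(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if size > MAX_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_BYTES}")
    return problems
```

An empty list means the batch fits the stated limits; otherwise each entry names the file and the limit it breaks.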

Highlights of Voxtral Transcribe 2

Voxtral Mini Transcribe V2: Delivers state-of-the-art transcription with speaker diarization, context biasing, and word-level timestamps in 13 languages.
Voxtral Realtime: Engineered for live transcription with latency configurable to under 200 ms, making it ideal for voice agents and interactive applications.
Best-in-class efficiency: Boasts industry-leading accuracy at a fraction of the cost, with Voxtral Mini Transcribe V2 achieving the lowest word error rate at the most competitive price points.
Open weights: The Voxtral Realtime model is available under the Apache 2.0 license, making it deployable on edge devices for privacy-first applications.

A Closer Look at Voxtral Realtime

Voxtral Realtime is designed for scenarios where low latency is crucial. It uses a novel streaming architecture that transcribes audio as it arrives, rather than buffering it into chunks for an offline-style model. The transcription delay is configurable to under 200 ms, setting a new standard for voice-first applications.

The model supports 13 languages and delivers robust transcription performance across them. It maintains strong accuracy while running on edge devices, so audio from sensitive applications can stay on the device.

You can now access the model weights on the Hugging Face Hub, underlining the commitment to community-driven development and innovation.

The Power of Voxtral Mini Transcribe V2

Voxtral Mini Transcribe V2 raises the bar for transcription and diarization quality. It reaches a 4% word error rate on the FLEURS benchmark at a cost of just $0.003 per minute, which positions it as the most cost-effective transcription API available today.

In side-by-side comparisons, Voxtral Mini Transcribe V2 outperforms competitors such as GPT-4o Mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova. It also processes audio approximately 3x faster than ElevenLabs’ Scribe v2 while matching its quality at one-fifth of the price.
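For readers unfamiliar with the metric, word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. A 4% WER therefore means roughly 4 word errors per 100 reference words. The sketch below is a plain illustration of the standard definition, not the evaluation code behind the reported figure. Benchmark evaluations typically normalize text (case, punctuation) first, which this example does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing "the cat sat on the mat" with "the cat sat on a mat" gives one substitution over six reference words.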

Key Features of Voxtral Mini Transcribe V2

Speaker Diarization

Generate transcriptions with speaker labels, allowing for clear attributions in varied settings, such as meetings and interviews. This feature is particularly useful for understanding multi-party conversations.
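As an illustration of how speaker labels can be used downstream, the sketch below merges consecutive segments from the same speaker into readable turns. The segment shape (a dict with `speaker` and `text` keys) and the function name `format_dialogue` are assumptions made for this example; the actual response format is described in the official documentation.

```python
def format_dialogue(segments):
    """Merge consecutive segments from the same speaker into labelled turns."""
    turns = []  # list of (speaker, [text, ...]) in order of appearance
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1][1].append(text)  # same speaker keeps talking
        else:
            turns.append((seg["speaker"], [text]))  # speaker change starts a new turn
    return "\n".join(f"{speaker}: {' '.join(parts)}" for speaker, parts in turns)


# Hypothetical diarized segments, for illustration only.
example_segments = [
    {"speaker": "speaker_0", "text": "Hello."},
    {"speaker": "speaker_0", "text": "Shall we start?"},
    {"speaker": "speaker_1", "text": "Yes."},
]
dialogue = format_dialogue(example_segments)
```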

Context Biasing

Context biasing allows users to provide up to 100 words or phrases as guidance, improving accuracy for names, technical terms, and other domain-specific vocabulary. While optimized for English, it has experimental support for other languages.
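Because the list is capped at 100 entries, it can help to clean it before use. The following is a minimal sketch with a hypothetical helper, `prepare_bias_terms`. It only prepares the list and makes no assumption about how the API accepts it. Treating entries that differ only by case as duplicates is a choice made for this example.

```python
MAX_BIAS_TERMS = 100  # limit stated in the announcement


def prepare_bias_terms(terms):
    """Strip whitespace, drop blanks and case-insensitive duplicates, enforce the limit."""
    seen = set()
    cleaned = []
    for term in terms:
        stripped = term.strip()
        key = stripped.casefold()
        if stripped and key not in seen:
            seen.add(key)
            cleaned.append(stripped)  # keep the first spelling encountered
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms given; at most {MAX_BIAS_TERMS} are allowed"
        )
    return cleaned
```

Raising an error instead of silently truncating keeps the caller in control of which terms to drop.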

Word-level Timestamps

This capability generates precise timestamps for each word, which aids in applications like subtitle generation, audio search, and content alignment.
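To make the subtitle use case concrete, here is a rough sketch that groups word-level timestamps into SubRip (SRT) cues. The input shape (dicts with `word`, `start`, and `end` keys, times in seconds) is an assumption for this example, as are the function names and the fixed words-per-cue grouping. A production subtitler would also break on pauses and line length.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm form used by SRT files."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group word-level timestamps into numbered SRT cues of up to max_words words."""
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])  # cue starts with its first word
        end = srt_timestamp(chunk[-1]["end"])     # and ends with its last word
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{n}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```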

Expanded Language Support

Supporting the same 13 languages as Realtime, this model significantly enhances transcription performance across linguistic and cultural contexts.

Noise Robustness

Voxtral Mini Transcribe V2 maintains high accuracy in challenging acoustic environments, such as busy call centers or industrial settings.

Longer Audio Support

This model can process recordings of up to 3 hours in one request, making it a robust solution for lengthy audio content.
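Recordings longer than the 3-hour limit need to be split across several requests. The sketch below only computes the time windows. The helper name `plan_requests` is hypothetical, and cutting at fixed offsets is a simplification, since a real pipeline would likely cut at silences to avoid splitting words.

```python
import math

MAX_SECONDS_PER_REQUEST = 3 * 60 * 60  # 3 hours, as stated in the announcement


def plan_requests(duration_seconds):
    """Split a recording into (start, end) windows, in seconds, that each fit one request."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    count = math.ceil(duration_seconds / MAX_SECONDS_PER_REQUEST)
    return [
        (i * MAX_SECONDS_PER_REQUEST,
         min((i + 1) * MAX_SECONDS_PER_REQUEST, duration_seconds))
        for i in range(count)
    ]
```

A 5-hour recording, for instance, yields one full 3-hour window followed by a 2-hour window.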

Mistral Studio Audio Playground

The new audio playground in Mistral Studio offers users a chance to experiment with Voxtral Transcribe 2. This hands-on experience allows you to upload audio files and customize features like diarization and context biasing, making it particularly useful for developers and researchers aiming to refine their projects.

Transforming Voice Applications

The capabilities of Voxtral extend far beyond basic transcription, powering diverse voice workflows across various industries:

Meeting Intelligence

For organizations that value clear communication, Voxtral transcribes multilingual recordings with speaker diarization. This not only improves meeting efficiency but also makes annotation and content retrieval practical at low cost.

Voice Agents and Virtual Assistants

With sub-200ms transcription latency, you can seamlessly integrate Voxtral Realtime into conversational AI systems, facilitating natural interactions between users and your technology.

Contact Center Automation

By enabling real-time call transcription, Voxtral allows AI systems to analyze sentiment, suggest responses, and populate CRM fields as conversations occur, improving customer service outcomes.

Media and Broadcast

This technology can generate live multilingual subtitles with minimal latency, ensuring that audiences receive timely and accurate information.

Compliance and Documentation

Transcribing interactions for regulatory compliance has never been easier, with diarization ensuring clarity in speaker attribution and timestamps providing valuable audit trails.

Getting Started with Voxtral

If you’re eager to integrate these tools into your workflow, Voxtral Mini Transcribe V2 is available via API at $0.003 per minute. You can begin exploring its features through the Mistral Studio audio playground or the official documentation.

Voxtral Realtime is available through the API at $0.006 per minute and as open weights on Hugging Face.
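The per-minute prices make cost estimates straightforward. The sketch below is a back-of-the-envelope calculator, assuming billing is strictly proportional to audio duration. Rounding and minimum-charge rules are not covered in this post, and the function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Per-minute API prices quoted in the announcement, in US dollars.
PRICE_PER_MINUTE = {
    "Voxtral Mini Transcribe V2": Decimal("0.003"),
    "Voxtral Realtime": Decimal("0.006"),
}


def estimate_cost(model: str, minutes) -> Decimal:
    """Estimate API cost, assuming billing is strictly proportional to duration."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return PRICE_PER_MINUTE[model] * Decimal(str(minutes))


# One maximum-length (3 hour) batch request, and one hour of live audio.
batch_cost = estimate_cost("Voxtral Mini Transcribe V2", 180)
live_cost = estimate_cost("Voxtral Realtime", 60)
```

Under these assumptions, a 3-hour recording costs $0.54 with Voxtral Mini Transcribe V2, and an hour of live transcription costs $0.36 with Voxtral Realtime.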

Join Our Team

If you’re passionate about pioneering advancements in speech AI and want to be a part of this exciting journey, consider joining the Mistral team. Explore opportunities to contribute to cutting-edge technology that puts powerful speech models in the hands of developers worldwide.