Whisper is an open-source automatic speech recognition (ASR) and speech translation model released by OpenAI in September 2022 under the MIT licence. The largest checkpoint, large-v3, was trained on roughly 1 million hours of weakly labelled audio plus 4 million hours of audio pseudo-labelled with large-v2, about 5 million hours in total. The architecture is a straightforward Transformer encoder-decoder: input audio is split into 30-second chunks, each chunk is converted into a log-Mel spectrogram (80 Mel bins in the earlier checkpoints, 128 in large-v3), and the decoder emits a token sequence that interleaves language identification, timestamps, and the transcription itself. Six model sizes are available: tiny (~39M parameters), base, small, medium, large (~1.55B parameters), and turbo (an optimised large-v3), with English-only variants for the four smaller sizes. The pre-trained checkpoints generalise zero-shot to nearly 100 languages and can translate from many of them into English. OpenAI reports around 50% fewer errors than models specialised to a single benchmark when both are evaluated across diverse datasets. Multiple optimised runtimes, including whisper.cpp, faster-whisper, Whisper JAX, WhisperKit, and AMD’s Ryzen NPU implementation, let it run fully offline on Raspberry Pi boards, mobile phones, and embedded devices, which has made it a common choice for privacy-sensitive on-robot voice input.
Open-source automatic speech recognition model from OpenAI. Trained on 680,000 hours of multilingual audio (about 5 million hours for large-v3); six model sizes from tiny (~39M parameters) to large (~1.55B parameters). Runs fully offline via whisper.cpp, faster-whisper, or WhisperKit, and is robust to accents and background noise across nearly 100 languages.
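The 30-second chunking and log-Mel front end described above can be made concrete with a little arithmetic. The sketch below is illustrative only and does not call any Whisper library. The 16 kHz sample rate and 160-sample hop are assumptions taken from the constants in the reference implementation, and `Chunk`, `plan_chunks`, and `spectrogram_shape` are hypothetical helper names. It also uses fixed, non-overlapping windows, whereas the real transcription loop advances its window according to the timestamps the decoder predicts.

```python
from __future__ import annotations

from dataclasses import dataclass

SAMPLE_RATE = 16_000   # Hz; assumed from the reference implementation
CHUNK_SECONDS = 30     # chunk length stated in the text
HOP_LENGTH = 160       # samples between spectrogram frames (assumed, i.e. 10 ms)
N_MELS = 128           # Mel bins for large-v3 (earlier checkpoints use 80)

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS      # 480,000 samples
FRAMES_PER_CHUNK = SAMPLES_PER_CHUNK // HOP_LENGTH   # 3,000 frames


@dataclass(frozen=True)
class Chunk:
    start_sample: int  # first audio sample covered by this chunk
    end_sample: int    # one past the last real audio sample
    padding: int       # zero samples needed to fill the 30-second window


def plan_chunks(n_samples: int) -> list[Chunk]:
    """Split an audio buffer of n_samples into fixed 30-second windows."""
    chunks = []
    for start in range(0, n_samples, SAMPLES_PER_CHUNK):
        real = min(SAMPLES_PER_CHUNK, n_samples - start)
        chunks.append(Chunk(start, start + real, SAMPLES_PER_CHUNK - real))
    return chunks


def spectrogram_shape() -> tuple[int, int]:
    """Shape (Mel bins, frames) of the encoder input for one padded chunk."""
    return (N_MELS, FRAMES_PER_CHUNK)


if __name__ == "__main__":
    # A 75-second recording needs three windows; the last one is half padding.
    for chunk in plan_chunks(75 * SAMPLE_RATE):
        print(chunk)
    print("encoder input shape per chunk:", spectrogram_shape())
```

Under these assumptions every chunk reaches the encoder at the same fixed size regardless of how much real audio it holds, which is why short clips are padded rather than processed at their native length.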
