Deep Dive: How Whisper Works Under the Hood
An explanation of the Transformer architecture behind OpenAI's Whisper model and how we ported it to the web.
Tech Engineering
8 min read
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
The Transformer Engine
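At its core, Whisper is an encoder-decoder Transformer. The audio is first converted to a log-Mel spectrogram, which is fed into the encoder. The decoder then predicts the text tokens one at a time, each conditioned on the encoder output and on the tokens already generated.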
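To make the token-by-token prediction concrete, here is a minimal sketch of a greedy decoding loop. It assumes a hypothetical `DecoderStep` function standing in for the real decoder call, and it simplifies the prompt to a single start token; details such as Whisper's special task tokens and beam search are left out.

```typescript
// A decoder step: given the encoder output and the tokens generated so far,
// return one score (logit) per vocabulary entry for the next token.
// Hypothetical signature used for illustration, not a real library API.
type DecoderStep = (encoderOutput: Float32Array, tokens: number[]) => Float32Array;

// Index of the largest value in the logits array.
function argmax(logits: Float32Array): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Greedy autoregressive decoding: at each step, take the highest-scoring
// token, append it, and feed the longer sequence back into the decoder.
function greedyDecode(
  encoderOutput: Float32Array,
  step: DecoderStep,
  startToken: number,
  endToken: number,
  maxTokens: number,
): number[] {
  const tokens: number[] = [startToken];
  while (tokens.length < maxTokens) {
    const next = argmax(step(encoderOutput, tokens));
    if (next === endToken) break; // the model signalled end of text
    tokens.push(next);
  }
  return tokens.slice(1); // drop the start token
}
```

The encoder runs once per audio segment, while the decoder step runs once per output token, so the decoder loop is usually where most of the time goes on long transcripts.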
Porting to the Web
Using ONNX Runtime and Emscripten, we can execute these matrix operations efficiently in JavaScript environments...
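For a sense of what one of these matrix operations looks like, here is a minimal sketch of a matrix multiply over flat, row-major `Float32Array` buffers. It illustrates the operation only and is not the runtime's implementation; the `matmul` helper and its parameter names are assumptions made for this example.

```typescript
// Multiply A (m x k) by B (k x n). Both matrices are stored row-major
// in flat Float32Array buffers. Illustrative helper, not a library API.
function matmul(
  a: Float32Array,
  b: Float32Array,
  m: number,
  k: number,
  n: number,
): Float32Array {
  if (a.length !== m * k || b.length !== k * n) {
    throw new Error("matrix dimensions do not match buffer lengths");
  }
  const out = new Float32Array(m * n);
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      let sum = 0;
      for (let p = 0; p < k; p++) {
        sum += a[i * k + p] * b[p * n + j];
      }
      out[i * n + j] = sum;
    }
  }
  return out;
}
```

A Transformer runs many such products for every decoded token, which is why a plain triple loop like this one is not a practical way to run the model.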