Deep Dive: How Whisper Works Under the Hood
An explanation of the Transformer architecture behind OpenAI's Whisper model and how we ported it to the web.
OpenAI's Whisper is an encoder-decoder Transformer model trained on 680,000 hours of multilingual audio data. Whisper Web brings this model to the browser by executing it via ONNX Runtime compiled to WebAssembly, with optional WebGPU acceleration for 3-5x faster inference on supported hardware.
Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
The Transformer Engine
At its core, Whisper is an encoder-decoder Transformer. The input audio is first converted into a log-Mel spectrogram, which is fed into the encoder; the decoder then predicts the text tokens one at a time, conditioning on the encoder output and on the tokens it has already generated.
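The encode-then-decode loop can be sketched as a toy in plain JavaScript. The numbers and the "model" below are hypothetical stand-ins, not real Whisper weights; the point is the shape of the computation: encode once, then decode autoregressively until an end-of-text token appears.

```javascript
const EOT = 0; // hypothetical end-of-text token id

// Stand-in "encoder": summarize audio frames into one context value.
function encode(audioFrames) {
  return audioFrames.reduce((a, b) => a + b, 0) / audioFrames.length;
}

// Stand-in "decoder" step: the next token depends on the encoder
// context and on how many tokens have been emitted so far.
function decodeStep(context, tokens) {
  const next = Math.round(context) - tokens.length; // arbitrary toy rule
  return next > 0 ? next : EOT;
}

// Greedy autoregressive loop: feed the growing token sequence back
// into the decoder at every step, exactly as Whisper's decoder does.
function transcribe(audioFrames, maxTokens = 16) {
  const context = encode(audioFrames);
  const tokens = [];
  for (let i = 0; i < maxTokens; i++) {
    const next = decodeStep(context, tokens);
    if (next === EOT) break;
    tokens.push(next);
  }
  return tokens;
}

console.log(transcribe([2, 4, 6])); // → [ 4, 3, 2, 1 ]
```

The real model replaces both stand-ins with stacks of attention layers, but the control flow is the same: the encoder runs once per audio chunk, while the decoder runs once per generated token.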
Porting to the Web
Using ONNX Runtime compiled to WebAssembly with Emscripten, we can execute the model's large matrix operations efficiently inside JavaScript environments, with WebGPU available as a faster backend on supported hardware.
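In outline, this looks like the sketch below. It assumes an onnxruntime-web-style API (`ort.InferenceSession.create` with an `executionProviders` option); the model URL and the `pickExecutionProviders` helper are illustrative, and the session creation is shown but not executed so the snippet stays runnable as a plain script.

```javascript
// In a real bundle: import * as ort from 'onnxruntime-web';

// Prefer WebGPU when the browser exposes it; fall back to WebAssembly.
function pickExecutionProviders(hasWebGPU) {
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// Compile the ONNX graph for the chosen backend (sketch, not executed here).
async function createSession(ort, modelUrl, hasWebGPU) {
  return ort.InferenceSession.create(modelUrl, {
    executionProviders: pickExecutionProviders(hasWebGPU),
  });
}

// navigator.gpu is the WebGPU feature-detection entry point in browsers.
const hasWebGPU = typeof navigator !== 'undefined' && !!navigator.gpu;
console.log(pickExecutionProviders(hasWebGPU));
```

Listing both providers lets the runtime fall back to the WebAssembly backend when WebGPU initialization fails, so the same code path serves both classes of hardware.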