
Deep Dive: How Whisper Works Under the Hood

An explanation of the Transformer architecture behind OpenAI's Whisper model and how we ported it to the web.

Tech Engineering
8 min read

Whisper is a general-purpose speech recognition model. Trained on a large, diverse audio dataset, it is a multitask model: the same network performs multilingual speech recognition, speech translation, and language identification.
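One way to picture the multitasking: in the published Whisper design, the task is chosen by a short prefix of special tokens handed to the decoder. The sketch below builds that prefix as plain strings. The helper name `build_prompt` is hypothetical, and a real tokenizer would map these strings to integer IDs, so treat this as an illustration rather than the library's API.

```python
from typing import List

def build_prompt(language: str, task: str = "transcribe",
                 timestamps: bool = False) -> List[str]:
    """Build the decoder prefix that selects language and task (illustrative)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prompt = ["<|startoftranscript|>"]
    # Language token, e.g. <|en|>. For language identification the decoder is
    # given only the start token and the most likely language token is read off.
    prompt.append(f"<|{language}|>")
    # Task token: transcribe in the spoken language, or translate to English.
    prompt.append(f"<|{task}|>")
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt

print(build_prompt("fr", task="translate"))
```

Because only the prefix changes between tasks, a single set of weights can serve all of them.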

The Transformer Engine

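At its core, Whisper is an encoder-decoder Transformer. The audio is first converted into a log-Mel spectrogram and fed into the encoder. The decoder then predicts the text tokens one at a time, each conditioned on the encoder output and on the tokens generated so far.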
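The token-by-token prediction can be sketched as a greedy decoding loop. This is a minimal illustration, not Whisper's actual decoding code: `greedy_decode`, `toy_step`, and the token IDs are all hypothetical, and the toy decoder simply stands in for a real Transformer forward pass. Real implementations often use beam search or sampling instead of plain greedy selection.

```python
from typing import Callable, List, Sequence

# A "step" takes the encoder output and the tokens so far, and returns one
# score (logit) per vocabulary entry for the next token.
StepFn = Callable[[Sequence[int], Sequence[int]], List[float]]

def greedy_decode(step: StepFn, encoder_out: Sequence[int],
                  prompt: Sequence[int], eot_id: int,
                  max_new_tokens: int = 32) -> List[int]:
    """Repeatedly pick the highest-scoring next token until end-of-text."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = step(encoder_out, tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eot_id:
            break
        tokens.append(next_id)
    return tokens[len(prompt):]  # only the newly generated tokens

def toy_step(encoder_out: Sequence[int], tokens: Sequence[int]) -> List[float]:
    """Toy decoder: copies encoder_out in order, then emits end-of-text (id 0).

    Assumes a one-token prompt and a vocabulary of 5 tokens.
    """
    generated = len(tokens) - 1
    logits = [0.0] * 5
    target = encoder_out[generated] if generated < len(encoder_out) else 0
    logits[target] = 1.0
    return logits

result = greedy_decode(toy_step, encoder_out=[3, 1, 4], prompt=[2], eot_id=0)
print(result)
```

The loop shape is the point here: each iteration re-runs the decoder on everything produced so far, which is why decoding cost grows with transcript length.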

Porting to the Web

Using ONNX Runtime and Emscripten, we can run these matrix operations efficiently in JavaScript environments...