OpenAI's Whisper is an encoder-decoder Transformer model trained on 680,000 hours of multilingual audio data. Whisper Web brings this model to the browser by executing it via ONNX Runtime compiled to WebAssembly, with optional WebGPU acceleration for 3-5x faster inference on supported hardware.

Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

The Transformer Engine

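At its core, Whisper is an encoder-decoder Transformer. Audio is split into 30-second chunks and converted to log-Mel spectrograms, which are fed into the encoder. The decoder then predicts the text tokens one at a time, each conditioned on the encoder output and the tokens generated so far.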
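The token-by-token prediction described above can be sketched as a greedy decoding loop. This is a minimal illustration, not Whisper's actual decoding code: the `DecoderStep` type, the `greedyDecode` function, and the start/end token parameters are hypothetical names introduced here, and real decoders typically add refinements such as beam search and timestamp handling.

```typescript
// Hypothetical decoder step (an assumption for illustration): given the encoder
// output and the tokens generated so far, return one logit per vocabulary entry.
type DecoderStep = (encoded: Float32Array, tokens: number[]) => Float32Array;

// Index of the largest value; the first maximum wins on ties.
function argmax(values: Float32Array): number {
  let best = 0;
  let bestValue = -Infinity;
  values.forEach((v, i) => {
    if (v > bestValue) {
      bestValue = v;
      best = i;
    }
  });
  return best;
}

// Greedy autoregressive decoding: repeatedly pick the most likely next token
// until the end token appears or the length limit is reached.
function greedyDecode(
  encoded: Float32Array,
  step: DecoderStep,
  startToken: number,
  endToken: number,
  maxTokens: number
): number[] {
  const tokens: number[] = [startToken];
  while (tokens.length < maxTokens) {
    const next = argmax(step(encoded, tokens));
    if (next === endToken) break;
    tokens.push(next);
  }
  // Drop the start token so only the predicted tokens are returned.
  return tokens.slice(1);
}
```

The encoder runs once per audio chunk, while the decoder step runs once per output token, which is why decoding usually dominates the cost for long transcripts.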

Porting to the Web

Using ONNX Runtime, compiled to WebAssembly with Emscripten, we can run the Transformer's matrix operations efficiently in JavaScript environments...
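Choosing between the WebAssembly path and the optional WebGPU path can be sketched as a simple feature check. This is an assumption about how such a selection might look, not Whisper Web's actual code: `pickBackends` and `NavigatorLike` are hypothetical names, and the check relies only on the presence of `navigator.gpu`, which browsers with WebGPU support expose.

```typescript
// Backend names as used for ONNX Runtime Web execution providers.
type Backend = "webgpu" | "wasm";

// Minimal shape of the browser's navigator object that this sketch needs.
interface NavigatorLike {
  gpu?: unknown;
}

// Prefer WebGPU when the browser exposes it, and always keep WebAssembly
// as the fallback so the model can still run on unsupported hardware.
function pickBackends(nav: NavigatorLike | undefined): Backend[] {
  const hasWebGpu = nav !== undefined && nav.gpu !== undefined;
  return hasWebGpu ? ["webgpu", "wasm"] : ["wasm"];
}
```

In a browser, the result of `pickBackends(navigator)` could be passed as the `executionProviders` option when creating an ONNX Runtime Web inference session, so that the runtime tries the listed backends in order.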