Whisper: Web-Scale Supervised Pretraining for Speech Recognition

Robust Speech Recognition via Large-Scale Weak Supervision

Takeaways

Introduction

Previous work on unsupervised pre-training

Previous work on supervised pre-training

Dataset for weakly supervised learning

This work

Methods

Overview of Whisper

Dataset

A minimalist approach to data pre-processing:

Construct a very diverse dataset from audio paired with transcripts on the Internet. While diversity in audio quality helps train a robust model, diversity in transcript quality is not similarly beneficial.

Transcript filtering:
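A rough sketch of the kind of heuristic filtering this implies. The specific rules below (all-caps or all-lowercase text and missing punctuation as signs of machine-generated transcripts) are illustrative guesses, and the thresholds are arbitrary, not the paper's actual pipeline.

```python
import re

def looks_machine_generated(transcript: str) -> bool:
    """Heuristic check for transcripts that are likely ASR output rather
    than human-written text. Rules and thresholds are illustrative only."""
    letters = [c for c in transcript if c.isalpha()]
    if not letters:
        return True
    # All-uppercase or all-lowercase text is a common sign of ASR output.
    if all(c.isupper() for c in letters) or all(c.islower() for c in letters):
        return True
    # Human transcripts of longer audio usually contain some punctuation.
    if len(transcript.split()) > 50 and not re.search(r"[.,!?]", transcript):
        return True
    return False

# Keep only (audio, transcript) pairs whose transcript passes the filter:
# dataset = [(a, t) for a, t in dataset if not looks_machine_generated(t)]
```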

Model

Input Audio

  1. All audio is re-sampled to 16,000 Hz.
  2. An 80-channel log magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds.
  3. Globally scale the input to be between -1 and 1 with approximately zero mean across the pre-training dataset.
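A minimal sketch of this front-end, assuming librosa is available. The log floor and the per-utterance scaling below are illustrative stand-ins for the dataset-level statistics described in step 3, not the constants used in the released code.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path):
    # 1. Re-sample the audio to 16,000 Hz.
    audio, sr = librosa.load(path, sr=16000)

    # 2. 80-channel Mel magnitude spectrogram over 25 ms windows
    #    (400 samples) with a 10 ms stride (160 samples).
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, win_length=400,
        n_mels=80, power=1.0,
    )
    # Log magnitude, with a small floor to avoid log(0).
    log_mel = np.log10(np.maximum(mel, 1e-10))

    # 3. Scale to roughly [-1, 1] with approximately zero mean.
    #    Per-utterance statistics are used here as a stand-in for statistics
    #    estimated over the whole pre-training dataset.
    centered = log_mel - log_mel.mean()
    return centered / (np.abs(centered).max() + 1e-8)
```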

Network

Vocabulary

Multitask Format

In addition to predicting which words were spoken in a given audio snippet, a fully featured speech recognition system can involve many additional components, e.g., voice activity detection, speaker diarization (partitioning an input audio stream into homogeneous segments according to speaker identity), and inverse text normalization.

To reduce this complexity, this work uses a single model to perform the entire speech processing pipeline, with a simple format that specifies all tasks and conditioning information as a sequence of input tokens to the decoder; a code sketch of this format follows the list below.

Since the decoder is an audio-conditional language model, the authors also train it to condition on the history of the text of the transcript in the hope that it will learn to use longer-range text context to resolve ambiguous audio. Specifically, with some probability, the transcript text preceding the current audio segment is added to the decoder’s context.

  1. Indicate the beginning of the prediction with a <|startoftranscript|> token.
  2. First, predict the language being spoken, which is represented by a unique token for each language in the training set (99 total). If there is no speech in an audio segment, the model is trained to predict a <|nospeech|> token.
  3. The next token specifies the task (either transcription or translation) with an <|transcribe|> or <|translate|> token.
  4. After this, specify whether or not to predict timestamps, using a <|notimestamps|> token for the case where they are not predicted.
  5. At this point, the task and desired format are fully specified, and the output begins.
  6. Timestamp prediction:
    1. Predict time relative to the current audio segment, quantizing all times to the nearest 20 milliseconds (the native time resolution of Whisper models), and add a token to the vocabulary for each of these quantized times.
    2. Interleave timestamp prediction with the caption tokens: the start time token is predicted before each caption’s text, and the end time token is predicted after.
    3. When a final transcript segment is only partially included in the current 30-second audio chunk, only its start time token is predicted in timestamp mode, indicating that subsequent decoding should be performed on an audio window aligned with that time; otherwise the audio is truncated to not include the segment.
  7. Lastly, add a <|endoftranscript|> token.
  8. Mask out the training loss only over the previous context text, and train the model to predict all other tokens.
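A minimal sketch of how a decoder token sequence could be assembled from the format above. The helper below is an assumption for illustration: timestamps are rendered as `<|t.tt|>` strings quantized to 20 ms, the previous-context text is plainly prepended (the released tokenizer uses its own special tokens for this), and the partial-final-segment case from step 6.3 is omitted.

```python
def build_decoder_sequence(
    segments,            # list of (start_sec, end_sec, text) within the 30 s chunk
    language="en",       # one of the 99 language tokens
    task="transcribe",   # "transcribe" or "translate"
    timestamps=True,
    previous_text=None,  # optional transcript text preceding this audio segment
    no_speech=False,
):
    def ts(t):
        # Quantize to the nearest 20 ms, matching Whisper's native resolution.
        return f"<|{round(t / 0.02) * 0.02:.2f}|>"

    tokens = []
    if previous_text:
        # Previous-context conditioning (step described above); plain
        # prepending here is a simplification of the real special-token scheme.
        tokens.append(previous_text)
    tokens.append("<|startoftranscript|>")
    if no_speech:
        return tokens + ["<|nospeech|>", "<|endoftranscript|>"]
    tokens.append(f"<|{language}|>")
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
        tokens.extend(text for _, _, text in segments)
    else:
        # Interleave: start time token, caption text, end time token.
        for start, end, text in segments:
            tokens.extend([ts(start), text, ts(end)])
    tokens.append("<|endoftranscript|>")
    return tokens

# Example: a single fully transcribed segment with timestamps.
# build_decoder_sequence([(0.0, 3.44, "Hello world.")])
# -> ['<|startoftranscript|>', '<|en|>', '<|transcribe|>',
#     '<|0.00|>', 'Hello world.', '<|3.44|>', '<|endoftranscript|>']
```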

Experiments

Evaluation

The goal of Whisper is to develop a single robust speech processing system that works reliably without the need for dataset-specific fine-tuning to achieve high-quality results on specific distributions.

Evaluate Whisper in a zero-shot setting, without using any of each dataset's training data, to measure broad generalization.

Metrics: word error rate (WER)
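For reference, a minimal word error rate implementation via word-level edit distance; in practice a text normalizer is applied to both reference and hypothesis before this computation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("the cat sat", "the cat sat down")  -> 0.333...
```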

English Speech Recognition

For the previous SOTA supervised methods, there is a gap between reportedly superhuman performance in-distribution and subhuman performance out-of-distribution.

This might be due to conflating the different capabilities that human and machine performance on a test set actually measure.

Results:

Multi-lingual Speech Recognition

Low-data Benchmarks

Whisper performs well on Multilingual LibriSpeech. However, on VoxPopuli, Whisper significantly underperforms prior work.

Relationship between Size of Dataset and Performance

Translation

Study the translation capabilities of Whisper by measuring performance on X -> en speech translation.

Results:

Language Identification

The zero-shot performance of Whisper is not competitive with prior supervised work here and underperforms the supervised SOTA by 13.6%.

Robustness to Additive Noise
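These robustness tests mix noise into the evaluation audio at controlled signal-to-noise ratios. A minimal sketch of that mixing with white noise as the noise source; the specific noise types and SNR levels used in the paper are not reproduced here.

```python
import numpy as np

def add_noise_at_snr(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Mix white noise into `signal` so the mixture has the given SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10 * log10(signal_power / scaled_noise_power) == snr_db.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / (noise_power + 1e-12))
    return signal + noise
```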

Long-form Transcription

Results: Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription.

Comparison with Human Performance

These results indicate that Whisper’s English ASR performance is not perfect but very close to human-level accuracy.

Ablations

Model Scaling

Concerns with using a large but noisy dataset:

Study the zero-shot generalization of Whisper models as a function of the model size.

Dataset Scaling

To study how important the raw dataset size is to Whisper's performance.

Results:

Multitask and Multilingual Transfer

A potential concern with jointly training a single model on many tasks and languages is negative transfer: interference between the learning of several tasks can result in performance worse than would be achieved by training on only a single task or language.

Compare the performance of models trained on just English speech recognition with the standard multitask and multilingual training setup.

Results:

Text Normalization

There is a risk that the text normalizer is overfitted to fixing Whisper's transcription style rather than addressing general variation in transcription.

Compare the performance of Whisper using the proposed normalizer versus an independently developed one from another project.
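A highly simplified stand-in for what an English text normalizer does before WER is computed (lowercasing, stripping punctuation, collapsing whitespace). The normalizers compared here handle far more (numbers, contractions, spelling variants), so this is only an illustrative sketch.

```python
import re

def simple_normalize(text: str) -> str:
    """Toy English text normalizer: lowercase, drop punctuation,
    collapse whitespace. Far simpler than the normalizers compared above."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation except apostrophes
    text = re.sub(r"\s+", " ", text).strip()
    return text

# simple_normalize("Mr. Brown's DOG barked!")  ->  "mr brown's dog barked"
```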

Results:

Strategies for Reliable Long-form Transcription

Transcribing long-form audio with Whisper relies on accurate prediction of the timestamp tokens to determine how far to shift the model's 30-second audio context window; inaccurate transcription in one window may negatively impact transcription in subsequent windows.
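A minimal sketch of the sliding-window loop this describes. Here `decode_segment` is a hypothetical stand-in for running the model on one 30-second chunk and returning timestamped text, and the failure-avoidance strategies listed below (temperature fallback, no-speech filtering, etc.) are omitted.

```python
CHUNK_SECONDS = 30.0

def transcribe_long_form(audio, sample_rate, decode_segment):
    """Slide a 30-second window over the audio, advancing it by the last
    timestamp the model predicts in each window.

    `decode_segment(chunk)` is assumed to return a list of
    (start_sec, end_sec, text) tuples relative to the chunk start.
    """
    results = []
    offset = 0.0
    total = len(audio) / sample_rate
    while offset < total:
        start = int(offset * sample_rate)
        end = int(min(offset + CHUNK_SECONDS, total) * sample_rate)
        segments = decode_segment(audio[start:end])
        if not segments:
            # Nothing decoded (e.g., no speech): skip ahead by a full window.
            offset += CHUNK_SECONDS
            continue
        for seg_start, seg_end, text in segments:
            results.append((offset + seg_start, offset + seg_end, text))
        # Shift the window to the last predicted timestamp, so a partially
        # covered final segment is re-decoded in the next window.
        advance = segments[-1][1]
        if advance <= 0:
            advance = CHUNK_SECONDS  # safety: avoid getting stuck
        offset += advance
    return results
```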

Strategies to avoid failure cases of long-form transcription:

Limitations

Improved Decoding Strategies

Perception-related errors:

Non-perceptual errors:

Increase Training Data For Lower-Resource Languages

Whisper’s speech recognition performance is still quite poor on many languages.

The pre-training dataset is currently very English-heavy due to biases of the data collection pipeline.

A targeted effort at increasing the amount of data for these rarer languages could result in a large improvement to average speech recognition performance even with only a small increase in the overall training dataset size.

Studying Fine-Tuning

This work focused on the robustness properties of speech processing systems and, as a result, only studied zero-shot transfer.

It is likely that results can be improved further by fine-tuning.

Studying the Impact of Language Models on Robustness

The authors suspect that Whisper’s robustness is partially due to its strong decoder, which is an audio-conditional LM.

It’s currently unclear to what degree the benefits of Whisper stem from training its encoder, decoder, or both.

Potential Experiments:

Adding Auxiliary Training Objectives

Whisper departs noticeably from the most recent SOTA speech recognition systems in its lack of unsupervised pre-training or self-training methods. It is possible that the results could be further improved by incorporating them.