Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Text-to-text provides a simple way to train a single model on a wide variety of text tasks. The text-to-text framework is simple, yet it achieves performance comparable to task-specific architectures and ultimately yields SOTA results when combined with scale.
Architectures: The original encoder-decoder form worked best in the text-to-text setting.
Unsupervised objectives: Denoising objectives performed best in the text-to-text setting.
Motivation: There is a need for a more rigorous understanding of the contributions of different components in transfer learning for NLP (large-scale pre-training models), e.g., different models, pre-training objectives, datasets, and fine-tuning methods.
The basic idea: Introduce a unified framework (T5) that converts all text-based language problems into a text-to-text format. The text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task considered.
This work is primarily a survey, exploration, and empirical comparison of existing techniques; it also explores the limits of current approaches by scaling up the resulting insights (training models with up to 11 billion parameters on a dataset of roughly 750 GB).
T5 closely follows the original Transformer
Main differences: minor modifications only, such as a simplified layer normalization (no bias, placed outside the residual path) and relative position embeddings in place of sinusoidal ones.
Otherwise, T5 uses the same encoder-decoder architecture as the original Transformer.
The Colossal Clean Crawled Corpus (C4), ~ 750 GB.
Cast all of the tasks considered into a “text-to-text” format, i.e., a task where the model is fed some text for context or conditioning and is then asked to produce some output text.
The text-to-text framework provides a consistent training objective both for pre-training and fine-tuning.
T5 is trained with a maximum likelihood objective (using “teacher forcing”, i.e., using ground truth as input, instead of model output from a prior time step as an input) and a cross-entropy loss regardless of the task. To specify which task the model should perform, a task-specific (text) prefix is added to the original input sequence before feeding it to the model.
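For concreteness, a few (input, target) pairs in the text-to-text format along the lines of the examples in the paper, plus a tiny sketch of teacher forcing; the prefixes are illustrative, and the token IDs / `bos_id` below are placeholders rather than the actual T5 vocabulary.

```python
# Representative (input, target) text pairs; the task is named by a text prefix.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm in attala county ."),
]

def teacher_forcing(target_ids, bos_id=0):
    """Teacher forcing: at step t the decoder is fed the ground-truth tokens
    up to t-1 (not its own earlier predictions) and is trained with a
    cross-entropy loss to predict the ground-truth token at step t."""
    decoder_inputs = [bos_id] + target_ids[:-1]  # targets shifted right
    decoder_labels = target_ids                  # predicted at every position
    return decoder_inputs, decoder_labels

print(teacher_forcing([318, 92, 45, 1]))  # hypothetical token IDs
```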
Compare to GPT-2, which also specifies tasks with natural-language prompts but relies on zero-shot predictions from a decoder-only language model; T5 instead fine-tunes separately on each downstream task.
A standard encoder-decoder Transformer
Use SentencePiece to encode text as WordPiece tokens, with a vocabulary of 32,000 wordpieces.
The SentencePiece model was trained on a mixture of 10 parts English C4 data with 1 part each of data classified as German, French, or Romanian. This vocabulary is shared across both the input and output of the model. Note that the fixed vocabulary means the model can only process a predetermined set of languages.
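A minimal sketch of this tokenization step using the `sentencepiece` Python package; the file names and the training text file are assumptions, not the paper's released artifacts.

```python
import sentencepiece as spm

# Hypothetical training run on a text file sampled from C4 at a 10:1:1:1
# English:German:French:Romanian ratio (the file name is an assumption).
spm.SentencePieceTrainer.train(
    input="c4_mixed_sample.txt", model_prefix="spiece", vocab_size=32000)

# Encode text with the resulting model; the same vocabulary is shared by
# the encoder input and the decoder output.
sp = spm.SentencePieceProcessor(model_file="spiece.model")
print(sp.encode("translate English to German: That is good.", out_type=str))
print(sp.encode("translate English to German: That is good."))  # token IDs
```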
Use the “denoising” objectives, i.e., masked language modeling. The model is trained to predict missing or otherwise corrupted tokens in the input.
Design an objective that randomly samples and then drops out 15% of the tokens in the input sequence. Each consecutive span of dropped-out tokens is replaced by a single sentinel token, and each sentinel token is assigned a token ID that is unique to the sequence.
The target then consists of all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence, plus a final sentinel token to mark the end of the target sequence. An example follows, with a minimal implementation sketch after it.
Original text: Thank you for inviting me to your party last week
Inputs: Thank you for <X> to your party <Y> week
Target: <X> for inviting <Y> last <Z>
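A minimal, self-contained sketch of this corruption procedure over a whitespace-tokenized string; the `<extra_id_i>` sentinel naming mirrors the released T5 vocabulary, while the example above writes the sentinels as <X>, <Y>, <Z>.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, seed=0):
    """Drop ~15% of tokens i.i.d., collapse each run of dropped tokens into a
    single unique sentinel, and emit the dropped spans (delimited by the same
    sentinels, plus a final end-of-target sentinel) as the target."""
    rng = random.Random(seed)
    dropped = [rng.random() < corruption_rate for _ in tokens]
    sentinel = lambda i: f"<extra_id_{i}>"   # unique ID per span in the sequence

    inputs, target, span_id, in_span = [], [], 0, False
    for tok, is_dropped in zip(tokens, dropped):
        if is_dropped:
            if not in_span:                  # start of a new dropped span
                inputs.append(sentinel(span_id))
                target.append(sentinel(span_id))
                span_id += 1
                in_span = True
            target.append(tok)               # dropped tokens go to the target
        else:
            inputs.append(tok)
            in_span = False
    target.append(sentinel(span_id))         # final sentinel marks end of target
    return inputs, target

print(span_corrupt("Thank you for inviting me to your party last week".split()))
```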
Review and compare the following architectural variants.
Different schematics of the Transformer architecture variants.
Different attention mask patterns.
A major distinguishing factor for different architectures is the “mask” used by different attention mechanisms in the model.
Architecture | Mask | # of layer stacks |
---|---|---|
Encoder-decoder (e.g., T5) | Encoder: fully-visible; decoder: causal | 2 |
Language model (e.g., GPT) | Causal | 1 |
Prefix LM | Causal with prefix | 1 |
A fundamental and frequently cited drawback of using an LM in the text-to-text setting is that causal masking forces the model's representation of the \(i\)-th entry of the input sequence to depend only on the entries up to and including \(i\). This issue can be avoided in a Transformer-based language model simply by changing the masking pattern (the prefix LM; see the mask sketch below).
The main difference between a prefix LM and the BERT architecture is that the classifier is simply integrated into the output layer of the Transformer decoder in the prefix LM.
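For illustration, the three masking patterns written as boolean matrices, where entry (i, j) says whether output position i may attend to input position j; numpy only, sizes arbitrary.

```python
import numpy as np

def fully_visible_mask(n):
    """Every output position may attend to every input position (the encoder)."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    """Output position i may attend only to input positions j <= i (language model)."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n, prefix_len):
    """Causal mask, except that every position may attend to the whole prefix,
    so the prefix (the conditioning text) is processed bidirectionally."""
    mask = causal_mask(n)
    mask[:, :prefix_len] = True
    return mask

print(prefix_lm_mask(6, prefix_len=3).astype(int))
```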
Considered both the standard language modeling objective and the denoising objective discussed in the previous section.
Language modeling objective:
For models that ingest a prefix before making predictions (the encoder-decoder model and prefix LM), we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions.
For the standard language model, we train the model to predict the entire span from beginning to end.
Denoising objective:
The unsupervised denoising objective is designed for text-to-text models; to adapt it for use with a language model, the inputs and targets are concatenated (see the sketch below).
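A small sketch of both adaptations just described; the function names and the exact split convention are assumptions.

```python
import random

def lm_split(tokens, model_type, seed=0):
    """Models that ingest a prefix (encoder-decoder, prefix LM) get a random
    (prefix, target) split of the sampled span; a standard LM predicts the
    entire span from beginning to end."""
    if model_type in ("encoder_decoder", "prefix_lm"):
        split = random.Random(seed).randrange(1, len(tokens))
        return tokens[:split], tokens[split:]
    return [], tokens  # standard language model

def denoising_for_lm(corrupted_inputs, targets):
    """To use the denoising objective with a decoder-only LM, the corrupted
    inputs and the targets are simply concatenated into one sequence."""
    return corrupted_inputs + targets
```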
Explore different unsupervised objectives. Overall, all of the objectives ingest a sequence of token IDs corresponding to a tokenized span of text from our unlabeled text data set. The token sequence is processed to produce a (corrupted) input sequence and a corresponding target. Then, the model is trained as usual with maximum likelihood to predict the target sequence.
Objective | Example input | Example target |
---|---|---|
Prefix LM | Thank you for inviting | me to your party last week . |
BERT-style | Thank you <M> <M> me to your party apple week . | (original text) |
Deshuffling | party me for your to . last fun you inviting week Thank | (original text) |
MASS-style | Thank you <M> <M> me to your party <M> week . | (original text) |
I.i.d. noise, replace spans | Thank you <X> me to your party <Y> week . | <X> for inviting <Y> last <Z> |
I.i.d. noise, drop tokens | Thank you me to your party week . | for inviting last |
Random spans | Thank you <X> to <Y> week . | <X> for inviting me <Y> your party last <Z> |
The standard method is to fine-tune all parameters in the model.
Two alternative methods: adapter layers (small dense-ReLU-dense blocks with a residual connection added after each feed-forward block, with only the adapter and layer-norm parameters updated; sketched below) and gradual unfreezing (fine-tuning progressively more layers over time, starting from the final layer).
The standard method (fine-tuning all parameters) performs best.
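A sketch of an adapter layer as an assumed PyTorch module (the paper's experiments use Mesh TensorFlow; this is only meant to show the shape of the idea).

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Dense-ReLU-dense bottleneck with a residual connection, added after
    each feed-forward block; during fine-tuning only the adapter (and
    layer-norm) parameters are updated."""
    def __init__(self, d_model: int, d_adapter: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_adapter)  # project to the inner dimension
        self.up = nn.Linear(d_adapter, d_model)    # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))
```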
Train the model on multiple tasks simultaneously (the unsupervised task and downstream supervised tasks). For the unified text-to-text framework, “multi-task learning” simply corresponds to mixing data sets together.
In general, multi-task training underperforms pre-training followed by fine-tuning on most tasks.
The model is pre-trained on all tasks at once but is then fine-tuned on the individual supervised tasks.
Fine-tuning after multi-task pre-training results in comparable performance to the baseline (unsupervised pre-training + supervised fine-tuning). This suggests that using fine-tuning after multi-task learning can help mitigate some of the trade-offs between different mixing rates.
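The mixing rates above are set by heuristics such as examples-proportional mixing with an artificial data-set size limit K, optionally temperature-scaled; a minimal sketch (the task names and example counts are made up):

```python
def mixing_rates(num_examples, K=2**21, temperature=1.0):
    """Examples-proportional mixing: clip each data set's size at K, then
    (optionally) raise the rates to the power 1/T and renormalize; T > 1
    flattens the mixture so huge data sets dominate less."""
    clipped = {task: min(n, K) for task, n in num_examples.items()}
    scaled = {task: c ** (1.0 / temperature) for task, c in clipped.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

# Hypothetical example counts, just to show the shape of the output.
print(mixing_rates({"c4_denoising": 10**9, "cola": 8_500, "squad": 88_000},
                   temperature=2.0))
```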
Compared various strategies for taking advantage of additional compute, including training the model on more data, training a larger model, and using an ensemble of models. Each approach conferred a significant boost in performance. Specifically,
The final T5 model is as follows.