Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Text-to-text provides a simple way to train a single model on a wide variety of text tasks. The text-to-text framework is simple, yet it achieves performance comparable to task-specific architectures and ultimately yields SOTA results when combined with scale.
Architectures: The original encoder-decoder form worked best in the text-to-text setting.
Unsupervised objectives: Denoising objectives performed best in the text-to-text setting.
Motivation: There is a need for a more rigorous understanding of the contributions of different components in transfer learning for NLP (large-scale pre-training models), e.g., different models, pre-training objectives, datasets, and fine-tuning methods.
The basic idea: Introduce a unified framework (T5) that converts all text-based language problems into a text-to-text format. The text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task considered.
This work is primarily a survey, exploration, and empirical comparison of existing techniques; it also explores the limits of current approaches by scaling up the resulting insights (training models with up to 11 billion parameters on a dataset of roughly 750 GB).
T5 closely follows the original Transformer
Main differences: minor modifications only, such as a simplified layer normalization (no bias, placed outside the residual path) and relative position embeddings in place of sinusoidal ones.
Otherwise, T5 uses the same encoder-decoder architecture as the original Transformer.
The Colossal Clean Crawled Corpus (C4), ~ 750 GB.
Cast all of the tasks considered into a “text-to-text” format, i.e., a task where the model is fed some text for context or conditioning and is then asked to produce some output text.
The text-to-text framework provides a consistent training objective both for pre-training and fine-tuning.
T5 is trained with a maximum likelihood objective (using “teacher forcing”, i.e., using ground truth as input, instead of model output from a prior time step as an input) and a cross-entropy loss regardless of the task. To specify which task the model should perform, a task-specific (text) prefix is added to the original input sequence before feeding it to the model.
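For concreteness, a few (input, target) pairs in the text-to-text format along the lines of the examples in the paper, plus a tiny sketch of teacher forcing; the prefixes are illustrative, and the token IDs / `bos_id` below are placeholders rather than the actual T5 vocabulary.

```python
# Representative (input, target) text pairs; the task is named by a text prefix.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm in attala county ."),
]

def teacher_forcing(target_ids, bos_id=0):
    """Teacher forcing: at step t the decoder is fed the ground-truth tokens
    up to t-1 (not its own earlier predictions) and is trained with a
    cross-entropy loss to predict the ground-truth token at step t."""
    decoder_inputs = [bos_id] + target_ids[:-1]  # targets shifted right
    decoder_labels = target_ids                  # predicted at every position
    return decoder_inputs, decoder_labels

print(teacher_forcing([318, 92, 45, 1]))  # hypothetical token IDs
```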
Compare to GPT-2, which also specifies tasks with natural-language prompts but relies on zero-shot predictions from a decoder-only language model; T5 instead fine-tunes separately on each downstream task.
A standard encoder-decoder Transformer
Use SentencePiece to encode text as WordPiece tokens, with a vocabulary of 32,000 wordpieces.
The SentencePiece model was trained on a mixture of 10 parts English C4 data with 1 part each of data classified as German, French, or Romanian. This vocabulary is shared across both the input and output of the model. Note that the fixed vocabulary means the model can only process a predetermined set of languages.
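A minimal sketch of this tokenization step using the `sentencepiece` Python package; the file names and the training text file are assumptions, not the paper's released artifacts.

```python
import sentencepiece as spm

# Hypothetical training run on a text file sampled from C4 at a 10:1:1:1
# English:German:French:Romanian ratio (the file name is an assumption).
spm.SentencePieceTrainer.train(
    input="c4_mixed_sample.txt", model_prefix="spiece", vocab_size=32000)

# Encode text with the resulting model; the same vocabulary is shared by
# the encoder input and the decoder output.
sp = spm.SentencePieceProcessor(model_file="spiece.model")
print(sp.encode("translate English to German: That is good.", out_type=str))
print(sp.encode("translate English to German: That is good."))  # token IDs
```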
Use the “denoising” objectives, i.e., masked language modeling. The model is trained to predict missing or otherwise corrupted tokens in the input.
Design an objective that randomly samples and then drops out 15% of the tokens in the input sequence. Each consecutive span of dropped-out tokens is replaced by a single sentinel token, and each sentinel token is assigned a token ID that is unique to the sequence.
The target then consists of all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence, plus a final sentinel token to mark the end of the target sequence. An example follows, with a minimal implementation sketch after it.
Original text: Thank you for inviting me to your party last week
Inputs: Thank you for <X> to your party <Y> week
Target: <X> for inviting <Y> last <Z>
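A minimal, self-contained sketch of this corruption procedure over a whitespace-tokenized string; the `<extra_id_i>` sentinel naming mirrors the released T5 vocabulary, while the example above writes the sentinels as <X>, <Y>, <Z>.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, seed=0):
    """Drop ~15% of tokens i.i.d., collapse each run of dropped tokens into a
    single unique sentinel, and emit the dropped spans (delimited by the same
    sentinels, plus a final end-of-target sentinel) as the target."""
    rng = random.Random(seed)
    dropped = [rng.random() < corruption_rate for _ in tokens]
    sentinel = lambda i: f"<extra_id_{i}>"   # unique ID per span in the sequence

    inputs, target, span_id, in_span = [], [], 0, False
    for tok, is_dropped in zip(tokens, dropped):
        if is_dropped:
            if not in_span:                  # start of a new dropped span
                inputs.append(sentinel(span_id))
                target.append(sentinel(span_id))
                span_id += 1
                in_span = True
            target.append(tok)               # dropped tokens go to the target
        else:
            inputs.append(tok)
            in_span = False
    target.append(sentinel(span_id))         # final sentinel marks end of target
    return inputs, target

print(span_corrupt("Thank you for inviting me to your party last week".split()))
```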
Review and compare the following architectural variants.
Different schematics of the Transformer architecture variants.
Different attention mask patterns.
A major distinguishing factor for different architectures is the “mask” used by different attention mechanisms in the model.
Architecture | Mask | # of layer stacks |
---|---|---|
Encoder-decoder (e.g., T5) | Encoder: fully-visible; decoder: causal | 2 |
Language model (e.g., GPT) | Causal | 1 |
Prefix LM | Causal with prefix | 1 |
A fundamental and frequently cited drawback of using an LM in the text-to-text setting is that causal masking forces the model's representation of the \(i\)-th entry of the input sequence to depend only on the entries up to and including \(i\). This issue can be avoided in a Transformer-based language model simply by changing the masking pattern (the prefix LM; see the mask sketch below).
The main difference between a prefix LM and the BERT architecture is that the classifier is simply integrated into the output layer of the Transformer decoder in the prefix LM.
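For illustration, the three masking patterns written as boolean matrices, where entry (i, j) says whether output position i may attend to input position j; numpy only, sizes arbitrary.

```python
import numpy as np

def fully_visible_mask(n):
    """Every output position may attend to every input position (the encoder)."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    """Output position i may attend only to input positions j <= i (language model)."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n, prefix_len):
    """Causal mask, except that every position may attend to the whole prefix,
    so the prefix (the conditioning text) is processed bidirectionally."""
    mask = causal_mask(n)
    mask[:, :prefix_len] = True
    return mask

print(prefix_lm_mask(6, prefix_len=3).astype(int))
```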
Considered both the standard language modeling objective and the denoising objective discussed in the previous section.
Language modeling objective:
For models that ingest a prefix before making predictions (the encoder-decoder model and prefix LM), we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions.
For the standard language model, we train the model to predict the entire span from beginning to end.
Denoising objective:
The unsupervised denoising objective is designed for text-to-text models; to adapt it for use with a language model, the inputs and targets are concatenated (see the sketch below).
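A small sketch of both adaptations just described; the function names and the exact split convention are assumptions.

```python
import random

def lm_split(tokens, model_type, seed=0):
    """Models that ingest a prefix (encoder-decoder, prefix LM) get a random
    (prefix, target) split of the sampled span; a standard LM predicts the
    entire span from beginning to end."""
    if model_type in ("encoder_decoder", "prefix_lm"):
        split = random.Random(seed).randrange(1, len(tokens))
        return tokens[:split], tokens[split:]
    return [], tokens  # standard language model

def denoising_for_lm(corrupted_inputs, targets):
    """To use the denoising objective with a decoder-only LM, the corrupted
    inputs and the targets are simply concatenated into one sequence."""
    return corrupted_inputs + targets
```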
Explore different unsupervised objectives. Overall, all of the objectives ingest a sequence of token IDs corresponding to a tokenized span of text from our unlabeled text data set. The token sequence is processed to produce a (corrupted) input sequence and a corresponding target. Then, the model is trained as usual with maximum likelihood to predict the target sequence.
Objective | Example input | Example target |
---|---|---|
Prefix LM | Thank you for inviting | me to your party last week . |
BERT-style | Thank you <M> <M> me to your party apple week . | (original text) |
Deshuffling | party me for your to . last fun you inviting week Thank | (original text) |
MASS-style | Thank you <M> <M> me to your party <M> week . | (original text) |
I.i.d. noise, replace spans | Thank you <X> me to your party <Y> week . | <X> for inviting <Y> last <Z> |
I.i.d. noise, drop tokens | Thank you me to your party week . | for inviting last |
Random spans | Thank you <X> to <Y> week . | <X> for inviting me <Y> your party last <Z> |
The standard method is to fine-tune all parameters in the model.
Two alternative methods: adapter layers (small dense-ReLU-dense blocks with a residual connection added after each feed-forward block, with only the adapter and layer-norm parameters updated; sketched below) and gradual unfreezing (fine-tuning progressively more layers over time, starting from the final layer).
The standard method (fine-tuning all parameters) performs best.
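A sketch of an adapter layer as an assumed PyTorch module (the paper's experiments use Mesh TensorFlow; this is only meant to show the shape of the idea).

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Dense-ReLU-dense bottleneck with a residual connection, added after
    each feed-forward block; during fine-tuning only the adapter (and
    layer-norm) parameters are updated."""
    def __init__(self, d_model: int, d_adapter: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_adapter)  # project to the inner dimension
        self.up = nn.Linear(d_adapter, d_model)    # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))
```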
Train the model on multiple tasks simultaneously (the unsupervised task and downstream supervised tasks). For the unified text-to-text framework, “multi-task learning” simply corresponds to mixing data sets together.
In general, multi-task training underperforms pre-training followed by fine-tuning on most tasks.
The model is pre-trained on all tasks at once but is then fine-tuned on the individual supervised tasks.
Fine-tuning after multi-task pre-training results in comparable performance to the baseline (unsupervised pre-training + supervised fine-tuning). This suggests that using fine-tuning after multi-task learning can help mitigate some of the trade-offs between different mixing rates.
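The mixing rates above are set by heuristics such as examples-proportional mixing with an artificial data-set size limit K, optionally temperature-scaled; a minimal sketch (the task names and example counts are made up):

```python
def mixing_rates(num_examples, K=2**21, temperature=1.0):
    """Examples-proportional mixing: clip each data set's size at K, then
    (optionally) raise the rates to the power 1/T and renormalize; T > 1
    flattens the mixture so huge data sets dominate less."""
    clipped = {task: min(n, K) for task, n in num_examples.items()}
    scaled = {task: c ** (1.0 / temperature) for task, c in clipped.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

# Hypothetical example counts, just to show the shape of the output.
print(mixing_rates({"c4_denoising": 10**9, "cola": 8_500, "squad": 88_000},
                   temperature=2.0))
```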
Compared various strategies for taking advantage of additional compute, including training the model on more data, training a larger model, and using an ensemble of models. Each approach conferred a significant boost in performance. Specifically,
The final T5 model is as follows.