Language Models are Unsupervised Multitask Learners
Supervised learning systems are brittle and sensitive to changes in distribution and task (“narrow experts”). The prevalence of single-task training on single-domain datasets might be a major contributor to the lack of generalization observed in current systems.
Multitask learning is a promising framework for improving general performance. However, multi-task training in NLP is still nascent.
The trend of pre-trained language representations in NLP: from transferring learned word vectors, to transferring contextual representations, to fine-tuning entire pre-trained models, with each step requiring less task-specific architecture but still needing supervised data for the downstream task.
This work: LMs can perform a wide range of downstream tasks in a zero-shot setting, without any parameter or architecture modification.
Language modeling is framed probabilistically as estimating a conditional distribution of the output given the input and the task information (the multi-task/meta-learning view):
\[p(\texttt{output}|\texttt{input}, \texttt{task})\]
The language model itself is auto-regressive, i.e., it predicts the next word given the previous words.
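In the paper's notation, the underlying unconditional objective is the standard auto-regressive factorization of a sequence of symbols \((s_1, \ldots, s_n)\), which the task-conditional view above generalizes:
\[p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})\]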
LMs with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a LM is able to do this, it will, in effect, be performing unsupervised multitask learning.
Test this by analyzing the performance of LMs in a zero-shot setting on a wide variety of tasks.
Although web scrapes such as Common Crawl are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues.
The authors created a new web scrape, WebText, that emphasizes document quality by collecting only outbound links from Reddit that received at least 3 karma, a simple heuristic for pages that humans found interesting or useful.
The result is slightly over 8 million documents for a total of 40 GB of text.
A general LM should be able to compute the probability of (and also generate) any string.
While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement, current byte-level LMs are not competitive with word-level LMs on large-scale datasets.
Byte Pair Encoding (BPE) offers a practical middle ground: an interpolation between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences.
BPE on Unicode code points: the size of the base vocabulary would be too large (> 130,000) compared to the 32,000 to 64,000 token vocabularies often used with BPE. GPT-2 therefore applies BPE to the raw UTF-8 bytes (base vocabulary of 256), with merge rules that prevent merging across character categories (with an exception for spaces).
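A minimal sketch of byte-level BPE merge learning, to make the idea concrete. This is not the actual GPT-2 tokenizer (which also blocks merges across character categories and builds a 50,257-token vocabulary); the function name and toy corpus are illustrative assumptions.

```python
# Minimal, illustrative byte-level BPE (NOT the actual GPT-2 tokenizer).
from collections import Counter

def learn_bpe_merges(text: str, num_merges: int):
    """Learn `num_merges` merge rules over the UTF-8 bytes of `text`."""
    # Start from the 256-symbol byte vocabulary: each token is a tuple of byte values.
    seq = [(b,) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))          # count adjacent token pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]    # most frequent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):                         # replace every occurrence of (a, b)
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, tokens = learn_bpe_merges("low lower lowest low low", num_merges=10)
print(len(merges), "merges learned;", len(tokens), "tokens after merging")
```

Starting from the 256-value byte vocabulary guarantees any Unicode string can be tokenized, while the learned merges recover word-like units for frequent sequences.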
The model largely follows the details of the original GPT, with a few modifications: layer normalization is moved to the input of each sub-block plus an additional layer norm after the final self-attention block, a modified initialization accounts for accumulation along the residual path, the vocabulary is expanded to 50,257 tokens, and the context size is increased from 512 to 1024 tokens.
The largest model (GPT-2) has 1.5B parameters.
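For reference, the four model sizes trained in the paper, written as a small config dictionary (the layer counts and widths are the published ones; the dictionary itself is only an illustrative sketch):

```python
# The four training configurations reported in the paper (parameter counts are
# approximate); the dictionary structure itself is only an illustrative sketch.
GPT2_CONFIGS = {
    "117M":  {"layers": 12, "d_model": 768},   # comparable to the original GPT
    "345M":  {"layers": 24, "d_model": 1024},  # comparable to BERT-Large
    "762M":  {"layers": 36, "d_model": 1280},
    "1542M": {"layers": 48, "d_model": 1600},  # the model referred to as GPT-2
}
```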
This is the primary task the models are trained for.
Task: Evaluating the log probability of different datasets according to the LM.
Results: GPT-2 transfers well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting.
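To make the evaluation concrete, here is a sketch of computing zero-shot perplexity on a piece of text with the Hugging Face transformers library (a later convenience API, not what the authors used; the model name and text are placeholders):

```python
# Illustrative zero-shot perplexity evaluation (uses the Hugging Face
# `transformers` library, which postdates the paper; shown only as a sketch).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # The model returns the mean cross-entropy of predicting each next token.
    loss = model(input_ids, labels=input_ids).loss

print(f"perplexity = {torch.exp(loss).item():.2f}")
```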
The LAMBADA dataset tests the ability of systems to model long-range dependencies in text.
Task: predict the final word of sentences that require at least 50 tokens of context for a human to successfully predict.
Results: GPT-2 improves the SOTA, reducing perplexity considerably (from 99.8 to 8.6) and also improving accuracy.
Common error: valid continuations of the sentence, but not valid final words.
This suggests that the LM is not using the additional useful constraint that the word must be the final word of the sentence. Adding a stop-word filter as an approximation to this constraint further increases accuracy.
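A sketch of the stop-word-filter idea: walk down the model's ranked final-word candidates and skip function words. The exact filter the authors used is not described here, so the word list and helper below are assumptions.

```python
# Illustrative stop-word filter for LAMBADA final-word prediction.
# `candidates` and their probabilities would come from the LM's distribution
# over the next word; the stop-word list is an assumption.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "is", "was"}

def pick_final_word(candidates):
    """candidates: list of (word, probability), highest probability first."""
    for word, prob in candidates:
        if word.lower() not in STOP_WORDS:
            return word, prob
    return candidates[0]  # fall back to the top prediction

print(pick_final_word([("the", 0.31), ("garden", 0.22), ("of", 0.10)]))
# -> ('garden', 0.22)
```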
The Conversational Question Answering dataset (CoQA) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document. CoQA tests reading comprehension capabilities and also the ability of models to answer questions that depend on conversation history (such as “Why?”).
Use greedy decoding from GPT-2 conditioned on a document, the history of the associated conversation, and a final token "A:".
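A sketch of how such a conditioning context could be assembled before greedy decoding. The exact formatting used for evaluation is not reproduced here, so the field layout and the "Q:"/"A:" labels within the history are assumptions; only the final "A:" cue comes from the paper.

```python
# Illustrative construction of the zero-shot CoQA prompt: document, dialogue
# history, then the token "A:" to cue an answer. Field names are assumptions.
def build_coqa_prompt(document, history):
    """history: list of (question, answer) pairs; the last answer may be None."""
    lines = [document]
    for question, answer in history:
        lines.append(f"Q: {question}")
        if answer is not None:
            lines.append(f"A: {answer}")
    lines.append("A:")  # the cue after the final, unanswered question
    return "\n".join(lines)

prompt = build_coqa_prompt(
    "Jessica went to the park with her dog.",
    [("Who went to the park?", "Jessica"), ("Why?", None)],
)
print(prompt)
```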
Results: GPT-2 achieves 55 F1 on the development set, matching or exceeding 3 out of 4 baseline systems without using the supervised training data, but it still underperforms the supervised SOTA (a BERT-based system approaching human-level 89 F1).
Induce summarization behavior by adding the text "TL;DR:" after the article and generating 100 tokens with top-\(k\) random sampling (\(k = 2\)), which reduces repetition and encourages more abstractive summaries than greedy decoding. The first 3 generated sentences of these 100 tokens are used as the summary.
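A sketch of a single top-\(k\) sampling step with \(k = 2\): keep only the two most likely next tokens, renormalize, and sample between them (PyTorch is used here purely for illustration).

```python
# Illustrative top-k sampling step (k = 2): keep only the k most likely
# tokens, renormalize, and sample. Assumes a 1-D PyTorch tensor of logits.
import torch

def sample_top_k(logits: torch.Tensor, k: int = 2) -> int:
    topk = torch.topk(logits, k)                      # k largest logits and their indices
    probs = torch.softmax(topk.values, dim=-1)        # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)  # sample one of the k
    return topk.indices[choice].item()

logits = torch.tensor([1.0, 3.5, 0.2, 3.4])
print(sample_top_k(logits, k=2))  # returns index 1 or 3
```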
Results: On ROUGE 1, 2, and L, the generated summaries only begin to approach the performance of classic neural baselines and just barely outperform selecting 3 random sentences from the article; removing the "TL;DR:" hint noticeably degrades performance, showing that the prompt is invoking task-specific behavior.
Help GPT-2 infer the translation task by conditioning the LM on a context of example pairs of the format "english sentence = french sentence", followed by a final prompt of "english sentence =". After the prompt, outputs are sampled with greedy decoding and the first generated sentence is used as the translation.
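A sketch of assembling the few-shot translation prompt and taking the first generated sentence as the translation; the example pairs are made-up placeholders, and the helper functions are illustrative assumptions.

```python
# Illustrative few-shot prompt of the form "english sentence = french sentence".
# The example pairs are made-up placeholders; generation itself is not shown.
def build_translation_prompt(example_pairs, source_sentence):
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")  # model is expected to continue in French
    return "\n".join(lines)

def first_sentence(generated: str) -> str:
    """Take the first generated sentence as the translation."""
    return generated.split("\n")[0].split(".")[0].strip()

prompt = build_translation_prompt(
    [("good morning", "bonjour"), ("thank you very much", "merci beaucoup")],
    "where is the library?",
)
print(prompt)
```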
Results: GPT-2 gets 5 BLEU on the WMT-14 English-French test set (slightly worse than word-by-word substitution with a bilingual lexicon) and 11.5 BLEU on French-English, outperforming several unsupervised MT baselines but still far below the 33.5 BLEU of the best unsupervised MT approach.
Note: Since non-English webpages were filtered from WebText, it only contains a very small (10MB) corpus in the French language.
Similar to translation, the context of the language model is seeded with example question-answer pairs, which helps the model infer the short-answer style of the Natural Questions dataset.
Results: GPT-2 answers 4.1% of questions correctly by the exact match metric, which is still far worse than the 30 to 50% range of existing open-domain question answering systems that hybridize information retrieval with extractive document question answering.