Fundamentals of Large Language Models - Ep.7: GPT-2 and GPT-3
We covered GPT-1 in the previous article, where we learned that pre-training the transformer decoder with an unsupervised language-modeling objective, followed by finetuning it on a specific task, can produce a well-performing model. Now, let’s take a look at its successors: GPT-2 and GPT-3.
GPT-2
Unlike its predecessor, GPT-2 can achieve very high performance across multiple tasks without being finetuned on those tasks. In fact, the authors conjectured that training a model on a single task might be the source of its limited ability to generalize. How can we achieve this level of performance without any supervised finetuning (i.e., zero-shot)?
First, the authors highlighted the importance of having a high-quality dataset, which led them to introduce a new dataset called WebText. Second, some changes were made to the model architecture, but the most notable difference is scale: compared to GPT-1 and its 117M parameters, GPT-2 is a lot bigger at 1.542B parameters (over 13 times bigger!).
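As a quick sanity check on these numbers, we can instantiate the architectures from their configurations and count parameters. This is a minimal sketch assuming the Hugging Face transformers library, where gpt2 is the smallest released GPT-2 (roughly comparable in size to GPT-1) and gpt2-xl is the 1.5B-parameter model; building from the config avoids downloading the full checkpoints.

```python
from transformers import GPT2Config, GPT2LMHeadModel

def count_params(model_name: str) -> int:
    # Build the model from its config with randomly initialized weights,
    # so only a small config file is downloaded, not the checkpoint itself.
    config = GPT2Config.from_pretrained(model_name)
    model = GPT2LMHeadModel(config)
    return sum(p.numel() for p in model.parameters())

for name in ["gpt2", "gpt2-xl"]:
    print(f"{name}: {count_params(name) / 1e6:.0f}M parameters")

# Expect roughly ~124M for gpt2 and ~1,500M for gpt2-xl; reported counts
# differ slightly depending on how tied embedding weights are counted.
```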
The model, trained on the WebText dataset, demonstrates zero-shot capability: rather than finetuning the model on a specific task, one can get it to perform that task simply by prompting it to do so.
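To make this concrete, here is a minimal sketch of zero-shot prompting with the public GPT-2 checkpoint, assuming the Hugging Face transformers library. Appending "TL;DR:" to an article to induce summarization (with top-k sampling, k=2) is the setup described in the GPT-2 paper; the article text itself is a placeholder.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Zero-shot summarization: no finetuning, the task is specified
# entirely by the prompt ("TL;DR:" signals "summarize the text above").
article = "..."  # placeholder: any article text
prompt = article + "\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=2,                              # top-k sampling as in the GPT-2 paper's TL;DR setup
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)

# Strip the prompt tokens and keep only the generated continuation.
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print(summary)
```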
GPT-3
Further scaling GPT-2 results in GPT-3. While there are some differences and engineering tricks (e.g., the dataset used, slightly different operations, etc.), the biggest difference is size: GPT-3 has 175B parameters! GPT-3 demonstrates that scaling up GPT-2 yields an even better-performing model, and it reinforced the notion of scale as a key factor in improving model performance.
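As the title of the GPT-3 paper suggests, scale also enables few-shot, in-context learning: the task is specified by placing a handful of demonstrations directly in the prompt, with no gradient updates. Since the GPT-3 weights are not public, the sketch below only illustrates the general prompt format; the English-to-French pairs and the template are illustrative examples, not taken verbatim from the paper.

```python
# Few-shot, in-context learning: the "training" happens in the prompt itself.
# Hypothetical demonstration pairs; the format loosely follows the style of
# the translation prompts shown in the GPT-3 paper.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint"

prompt = "Translate English to French:\n"
for english, french in demonstrations:
    prompt += f"{english} => {french}\n"
prompt += f"{query} =>"

print(prompt)
# This completed prompt would then be sent to the model, which is expected
# to continue the pattern with the French translation of the query.
```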
Summary
A large model trained in an unsupervised manner on a large, diverse, and high-quality dataset can generalize well across multiple tasks. Though these models are not perfect yet, this work laid the foundation for much higher-performing models like GPT-4, which we will discuss in a future article.
References
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report, 2019.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 2020.