
Now that we know how transformers work, let’s take a look at the first generation of models that sparked the GPT craze: GPT-1.

GPT-1

In the GPT-1 paper, the authors show that it is practical to generatively pre-train a language model on unlabeled text and then finetune it discriminatively for a target task. GPT-1 is based on the transformer architecture, specifically its decoder component. Here is a refresher on what the decoder looks like.

Illustration of the decoder.

First, just as the name suggests, we pre-train the model in an unsupervised manner. Assuming we have an unlabeled text dataset, we sample a sequence of tokens from it and train the model to predict the next token. For example, we can sample “I ate cereal this morning”, use “I ate cereal this” as the input tokens, and use “morning” as the label. Thus, during the pre-training phase the final layer of the model is a softmax layer that outputs a probability distribution over the vocabulary.
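The next-token setup described above can be sketched in a few lines of plain Python. This is only a toy illustration, not the GPT-1 training code: the helper names (`next_token_pairs`, `softmax`, `next_token_loss`), the five-word vocabulary, and the all-zero logits are assumptions made for the example. It shows how one sampled sentence yields (input, label) pairs and how a softmax output is scored against the label.

```python
import math

def next_token_pairs(tokens):
    """Return every (context, next_token) training pair from one token sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Negative log-likelihood of the correct next token."""
    return -math.log(softmax(logits)[target_index])

tokens = ["I", "ate", "cereal", "this", "morning"]
pairs = next_token_pairs(tokens)

# The last pair matches the example in the text:
# input tokens "I ate cereal this", label "morning".
context, label = pairs[-1]

# Toy vocabulary built from the sentence itself (an assumption for this sketch).
vocab = sorted(set(tokens))

# Hypothetical logits from an untrained model: every token equally likely.
logits = [0.0] * len(vocab)
loss = next_token_loss(logits, vocab.index(label))
print(context, "->", label, "| loss:", round(loss, 3))
```

With uniform logits the loss equals the log of the vocabulary size; training pushes it down by raising the probability assigned to the correct next token.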

Once pre-training is done, we can finetune the model in a supervised manner on a target task. For example, the target task can be text completion, where the input tokens are some partial text such as “I like to drink coffee” and the output tokens are “without sugar and milk.” To finetune the model, we simply replace the output layer with one that matches the requirements of the target task.
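Here is a minimal sketch of what replacing the output layer means in practice. It is a stand-in, not GPT-1 itself: `ToyModel`, the layer sizes, and the two-class target task are all hypothetical, and a single linear layer plays the role of the pre-trained decoder stack. The point is only that the pre-trained body is kept while a fresh head of the right size is swapped in.

```python
import random

def linear(weights, x):
    """Multiply an (n_out x n_in) weight matrix by a vector of length n_in."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def new_layer(n_out, n_in, rng):
    """Create a randomly initialised (n_out x n_in) weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

class ToyModel:
    """Stand-in for a pre-trained network: a shared body plus a task-specific head."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.body = new_layer(n_hidden, n_in, rng)   # plays the role of the decoder stack
        self.head = new_layer(n_out, n_hidden, rng)  # output layer

    def forward(self, x):
        hidden = linear(self.body, x)
        return linear(self.head, hidden)

rng = random.Random(0)
VOCAB_SIZE, HIDDEN, NUM_CLASSES = 10, 4, 2  # hypothetical sizes

# During pre-training the head produces one score per vocabulary token.
model = ToyModel(n_in=3, n_hidden=HIDDEN, n_out=VOCAB_SIZE, rng=rng)
pretrained_body = model.body  # keep a reference to confirm it is reused

# Finetuning setup: keep the body, swap in a fresh head sized for the target task.
model.head = new_layer(NUM_CLASSES, HIDDEN, rng)

scores = model.forward([1.0, 0.5, -0.5])
print(len(scores))  # one score per target-task class
```

Only the new head starts from random weights; the body carries over whatever it learned during pre-training, and finetuning then updates the model on the labeled target data.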

Summary

That’s GPT-1: pre-training followed by finetuning works. As always, please feel free to send me an email if you have questions or suggestions, or if you find any mistakes in this article.

References

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report, 2018.
