Previous: GPT-1

BERT addresses an issue that language models like GPT-1 suffer from: they only “read” their input in one direction, from left to right. This potentially limits the model’s ability to understand the whole context. As an analogy, while we generally read text in one direction, we often understand a sentence better only after reading it in its entirety. Training BERT involves a process similar to the one we used for training GPT-1: it is a two-stage process, where the model is first trained on unsupervised tasks and then finetuned on a target task.

BERT

Similar to GPT-1, which we learned about in the previous article, BERT also uses the transformer architecture at its core. In contrast to GPT-1, however, BERT only uses the transformer encoder as opposed to the decoder. Another major difference is the set of tasks used during the pre-training phase. Here is a refresher on what the encoder looks like.

Illustration of the encoder.

Pretraining

During the pre-training phase, we train BERT on two objectives. The first goal is to minimize the Masked Language Model (MLM) objective, where, given partially masked input tokens, the model is tasked with predicting what the masked tokens actually are. For example, the input sentence can be

\(\textrm{I like to drink coffee in the morning}\),

and the masked version will be

\(\textrm{I like to [MASK] coffee in the [MASK]}\).

For this task, we will add a classification head to our typical transformer encoder. Concretely, the MLM classification head is a feed forward neural network that maps a \(D\)-dimensional vector into a \(\vert V \vert\)-dimensional vector. Just like in the notation used in the previous article about the transformer, \(D\) denotes the embedding dimension and \(\vert V \vert\) denotes the vocabulary size. Recall that given an input sequence of length \(N\), the encoder outputs \(N\) vectors of dimension \(D\). So, in the above example where we have two masked tokens, we take the encoder outputs that correspond to these masked tokens and process them individually through the classification head.
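To make this concrete, here is a minimal sketch in PyTorch of applying an MLM head to the encoder outputs at the masked positions. All names, sizes, and token ids here are illustrative placeholders (the real BERT MLM head also includes an extra dense layer and layer normalization, which we skip for brevity), and a random tensor stands in for the encoder output:

```python
import torch
import torch.nn as nn

D = 768      # embedding dimension (toy choice)
V = 30522    # vocabulary size (toy choice)
N = 10       # input sequence length

# MLM head: a feed forward layer mapping a D-dimensional vector to |V| logits.
mlm_head = nn.Linear(D, V)

encoder_output = torch.randn(1, N, D)      # (batch, N, D), stand-in for the real encoder output
masked_positions = torch.tensor([3, 8])    # indices of the [MASK] tokens in the sequence

# Take only the encoder outputs at the masked positions and classify them individually.
masked_vectors = encoder_output[0, masked_positions]   # (num_masked, D)
logits = mlm_head(masked_vectors)                      # (num_masked, |V|)

# Cross-entropy against the true token ids of the masked words (placeholder ids).
true_token_ids = torch.tensor([1234, 5678])
mlm_loss = nn.functional.cross_entropy(logits, true_token_ids)
```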

The second objective is to minimize the loss of the Next Sentence Prediction (NSP) task. For this task, we use two sentences as the input, and the goal is for the model to predict whether the second sentence is the right continuation of the first sentence (i.e., a binary classification task). For example, if the first sentence is “I like to drink coffee in the morning”, the second sentence could be “I like my coffee without sugar” but probably not “My computer is performing great”. Implementation-wise, we also add an NSP task head to our model. Furthermore, we will need to introduce two new types of tokens to the input sequence: [CLS] and [SEP]. We use the [CLS] token to mark the beginning of the input sequence, and the [SEP] token to mark the end of a sentence. Thus, the input sequence may look as follows:

\(\textrm{[CLS] I like to [MASK] coffee in the [MASK] [SEP] I like my [MASK] without sugar [SEP]}\).

Note that both the MLM and NSP objectives are optimized at the same time during training.
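As a rough sketch of how the NSP head and the combined loss can look, again in PyTorch with placeholder tensors standing in for the real encoder outputs and labels:

```python
import torch
import torch.nn as nn

D = 768
encoder_output = torch.randn(1, 24, D)   # (batch, sequence length, D), stand-in for the encoder

# NSP head: a binary classifier on top of the encoder output at the [CLS] position.
nsp_head = nn.Linear(D, 2)
cls_vector = encoder_output[:, 0]        # [CLS] is the first token of the sequence
nsp_logits = nsp_head(cls_vector)        # (batch, 2)

nsp_label = torch.tensor([1])            # placeholder: 1 = second sentence is the continuation
nsp_loss = nn.functional.cross_entropy(nsp_logits, nsp_label)

# Both objectives are optimized together; the total pre-training loss is simply their sum.
# total_loss = mlm_loss + nsp_loss   (mlm_loss computed as in the previous sketch)
```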

After pre-training, we can also use the model as an embedding model. This, however, differs from a standard word embedding: BERT gives us a contextual embedding, where the embedding of a token will be different when the token appears in a different sequence.
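For example, using the Hugging Face transformers library (not part of this article, just a convenient way to load pretrained BERT weights; assuming the library is installed and the bert-base-uncased checkpoint is available), we can see that the same word gets a different vector in different sentences:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank", "We sat on the bank of the river"]
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]   # (sequence length, D)

        # Find the position of the token "bank" and print a few of its embedding values.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        idx = tokens.index("bank")
        print(text, "->", hidden[idx, :4])
```

The printed values differ between the two sentences, even though the token is the same.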

Segment Embedding

In addition to the typical word embedding and positional embedding, BERT adds another one called the segment embedding. The way it works is simple: all tokens belonging to sentence 1 are assigned a “sentence 1 embedding” and all tokens belonging to sentence 2 are assigned a “sentence 2 embedding”. We then sum the segment embedding together with both the positional and word embeddings.
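A minimal sketch of how the three embeddings can be combined follows; the sizes and token ids are illustrative placeholders, not BERT’s exact implementation:

```python
import torch
import torch.nn as nn

D, V, max_len = 768, 30522, 512

word_emb = nn.Embedding(V, D)        # one vector per vocabulary entry
pos_emb = nn.Embedding(max_len, D)   # one vector per position (BERT learns these)
seg_emb = nn.Embedding(2, D)         # one vector for sentence 1, one for sentence 2

token_ids = torch.tensor([[101, 1045, 2066, 103, 4157, 102, 1045, 2066, 103, 102]])  # placeholder ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])   # 0 = sentence 1, 1 = sentence 2

# The encoder input is simply the sum of the three embeddings.
encoder_input = word_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)
```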

Finetuning

To finetune a pretrained BERT on a specific task, we simply add another head that is suitable for the task. We can safely throw away the MLM and NSP heads that we used during pretraining, but we keep the [CLS] and [SEP] tokens. Only now, depending on the task that we want our model to perform, we can input more than two sentences. A sketch of a task-specific head is shown below.
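For instance, for a sentence-level classification task, the new head can be a small classifier on top of the [CLS] output, trained end to end together with the pretrained encoder. This is only a sketch under those assumptions; a random tensor again stands in for the encoder output:

```python
import torch
import torch.nn as nn

D, num_classes = 768, 3

# Task-specific head replacing the MLM and NSP heads.
task_head = nn.Linear(D, num_classes)

encoder_output = torch.randn(1, 24, D)   # stand-in for the pretrained encoder's output
cls_vector = encoder_output[:, 0]        # take the [CLS] representation
logits = task_head(cls_vector)

label = torch.tensor([2])                # placeholder gold label
loss = nn.functional.cross_entropy(logits, label)
# In practice, loss.backward() updates both the new head and the pretrained encoder parameters.
```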

Summary

That’s it for BERT. Just like with GPT-1, pre-training a transformer and then finetuning it works, only now with the encoder rather than the decoder.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018.

Next: GPT-2 and GPT-3