Previous: Word Embeddings

In this post, we will learn about the Seq2Seq model. Seq2Seq is one of the earliest neural models that map an input sequence to a target sequence. For example, Seq2Seq can be used for language translation tasks, where the input sequence is a sentence in French and the target sequence is its translation in English. As another example, Seq2Seq can be used for text summarisation tasks, where the input sequence is a paragraph of text that we want summarised and the target sequence is a summary of that paragraph. Understanding Seq2Seq should also help us understand the attention mechanism (which we will discuss in the next post).

Seq2Seq architecture

The basic idea of Seq2Seq is to first encode an input sequence (e.g., a sentence in French) into an embedding using an encoder network. This embedding is supposed to capture all the information we need from the input sequence, so that a decoder network can generate an output sequence (e.g., a sentence in English). Thus, in terms of design, the Seq2Seq model consists of:

  1. an encoder network that maps an input sequence to an embedding,
  2. a decoder network that maps this embedding to an output sequence.
Illustration of Seq2Seq architecture.

An encoder \(E\) (e.g., a recurrent neural network) first takes an input \(x\) (e.g., a sentence) and produces a hidden state at the last time step, \(h^E_T\). The encoder hidden state \(h^E_T\) is then passed to a decoder \(D\) (e.g., another recurrent neural network) to be decoded into an output sequence \(\hat{y}\). SOS and EOS represent “start-of-sentence” and “end-of-sentence”, respectively. We use the SOS token as the input to the decoder at the first time step, while the EOS token lets the decoder learn when to stop generating, which allows output sequences of variable length. We will discuss how to use these tokens later.
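To make the special tokens a bit more concrete, here is a minimal sketch of how a sentence might be turned into token indices with SOS and EOS reserved in the vocabulary. The toy vocabulary and sentence below are made up purely for illustration.

```python
# A toy vocabulary; indices 0 and 1 are reserved for the special tokens.
vocab = {"<SOS>": 0, "<EOS>": 1, "le": 2, "chat": 3, "dort": 4}

def encode_sentence(tokens, vocab):
    """Map tokens to indices and append the end-of-sentence marker."""
    return [vocab[t] for t in tokens] + [vocab["<EOS>"]]

# The decoder will later receive vocab["<SOS>"] as its very first input,
# and it is trained to emit vocab["<EOS>"] when the output sequence ends.
print(encode_sentence(["le", "chat", "dort"], vocab))  # [2, 3, 4, 1]
```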

Let’s take a look at how these networks work individually.

Seq2Seq encoder

The encoder belongs to the recurrent neural network (RNN) family of architectures, and there is nothing particularly novel about how it works. Its job is to encode the whole input sequence \(x = \{x_1, x_2, ..., x_T\}\) into its hidden state (i.e., the encoder takes an input \(x_t\) at each time step \(t\) and produces a hidden state \(h^E_t\)). The hope is that the final encoder hidden state \(h^E_T\) contains enough information about the input sequence for the decoder \(D\) to decode it into the target sequence. The following figure explains how this all works.

Illustration of how the encoder works.
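As a concrete sketch, the encoder can be implemented as an embedding layer followed by a GRU that consumes the input tokens and returns its final hidden state \(h^E_T\). This is only one possible implementation, and the class name and layer sizes below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, T) tensor of token indices for the input sequence.
        embedded = self.embedding(x)       # (batch, T, embed_dim)
        outputs, h_T = self.rnn(embedded)  # h_T: (1, batch, hidden_dim)
        return h_T                         # final hidden state h^E_T

encoder = Encoder(vocab_size=1000, embed_dim=64, hidden_dim=128)
h_T = encoder(torch.randint(0, 1000, (2, 7)))  # two sequences of length 7
print(h_T.shape)  # torch.Size([1, 2, 128])
```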

Seq2Seq decoder

The decoder is also a recurrent neural network; at each time step it takes its own prediction from the previous time step, \(\hat{y}_{t-1}\), and produces a new prediction. At the first time step, the decoder takes the last encoder hidden state \(h^E_T\) and the SOS token as its inputs and outputs the first prediction \(\hat{y}_1\). At the second time step, the decoder takes its own prediction and hidden state from the previous time step (i.e., \(\hat{y}_1\) and \(h^D_1\)) to produce \(\hat{y}_2\). This process continues until the decoder outputs an EOS token. The figure below describes the mechanics of the decoder.

Illustration of how the decoder works.
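A matching decoder sketch is shown below: the decoder advances one time step at a time, and a small greedy decoding loop feeds each prediction back in as the next input until EOS is produced. As with the encoder, the class names, the assumed SOS/EOS indices, and the layer sizes are illustrative choices, not a definitive implementation.

```python
import torch
import torch.nn as nn

SOS, EOS = 0, 1  # assumed indices of the special tokens

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, y_prev, h_prev):
        # y_prev: (batch, 1) previous token; h_prev: (1, batch, hidden_dim)
        embedded = self.embedding(y_prev)       # (batch, 1, embed_dim)
        output, h = self.rnn(embedded, h_prev)  # advance a single time step
        logits = self.out(output.squeeze(1))    # (batch, vocab_size)
        return logits, h

def greedy_decode(decoder, h_T, max_len=20):
    """Decode one sequence by feeding each prediction back as the next input."""
    y_prev = torch.tensor([[SOS]])  # start with the SOS token
    h = h_T                         # initialise with the encoder hidden state
    tokens = []
    for _ in range(max_len):
        logits, h = decoder(y_prev, h)
        y_hat = logits.argmax(dim=-1)  # greedy choice of the next token
        if y_hat.item() == EOS:        # stop once the decoder predicts EOS
            break
        tokens.append(y_hat.item())
        y_prev = y_hat.unsqueeze(1)    # shape (1, 1) for the next step
    return tokens

decoder = Decoder(vocab_size=1000, embed_dim=64, hidden_dim=128)
print(greedy_decode(decoder, torch.zeros(1, 1, 128)))  # untrained, so random tokens
```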

Seq2Seq training

The above shows exactly how the decoder works at inference time. During training, we may also use something called “teacher forcing”, where the ground-truth labels \(y = \{y_1, y_2, ..., EOS\}\) are fed to the decoder as inputs instead of its own predictions, as illustrated below.

Illustration of teacher forcing.

We can then train Seq2Seq end-to-end by simply accumulating the cross-entropy loss between each predicted token distribution and the corresponding ground-truth token.
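Putting teacher forcing and the loss together, here is a minimal training-step sketch. It assumes the hypothetical Encoder and Decoder classes (and the SOS/EOS indices) from the sketches above are in scope; the optimiser and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

# Assumes Encoder, Decoder, SOS and EOS from the sketches above.
encoder = Encoder(vocab_size=1000, embed_dim=64, hidden_dim=128)
decoder = Decoder(vocab_size=1000, embed_dim=64, hidden_dim=128)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One teacher-forced update on a single (input, target) pair.
    x: (1, T_in) source token indices; y: (1, T_out) target indices ending with EOS."""
    optimizer.zero_grad()
    h = encoder(x)                  # final encoder hidden state h^E_T
    y_prev = torch.tensor([[SOS]])  # decoder starts from the SOS token
    loss = torch.tensor(0.0)
    for t in range(y.size(1)):
        logits, h = decoder(y_prev, h)
        loss = loss + criterion(logits, y[:, t])  # cross-entropy on every target token
        y_prev = y[:, t].unsqueeze(1)             # teacher forcing: feed the ground truth
    loss.backward()
    optimizer.step()
    return loss.item()

source = torch.randint(2, 1000, (1, 7))
target = torch.cat([torch.randint(2, 1000, (1, 5)), torch.tensor([[EOS]])], dim=1)
print(train_step(source, target))
```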

Summary

That is all for Seq2Seq! Please feel free to send me an email if you have questions, suggestions, or if you found some mistakes in this article.

References

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems, 2014.

Next: Attention