Introduction

For the fifth post in this series, I read “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. Alongside the GPT architecture (decoder-only), the BERT architecture (encoder-only) is one of the most popular transformer architectures used today, thanks to its proficiency at learning downstream tasks that GPT either can’t perform or performs significantly worse at.

Summary

The BERT architecture was partially inspired by GPT’s pre-training approach, but with the key distinction that it uses bidirectional attention rather than causal attention. From an implementation perspective, this means that BERT-based architectures use only a transformer encoder, which can attend to the entire input both left-to-right and right-to-left, whereas GPT-based architectures use only a transformer decoder, which applies attention in a unidirectional, left-to-right manner. The authors’ motivation was that a great deal of information, context, and understanding is missed when attention is restricted to being causal. That missing context is crucial for tasks like question answering, where it’s more important to understand the entire input than it is to generate new text. At the highest level, BERT architectures can only perform analysis tasks, which they excel at, but are unable to generate brand new text like GPT architectures can, because they require the entire input for a forward pass.
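As an illustrative sketch (not code from the paper), the difference between bidirectional and causal attention comes down to the attention mask: position i may attend to position j only where `mask[i][j] == 1`.

```python
def bidirectional_mask(seq_len):
    # BERT-style encoder: every token can attend to every other token.
    return [[1] * seq_len for _ in range(seq_len)]

def causal_mask(seq_len):
    # GPT-style decoder: each token attends only to itself and earlier tokens.
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

# bidirectional_mask(3) -> [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
# causal_mask(3)        -> [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```

The lower-triangular causal mask is what prevents a decoder from "seeing the future," and removing that restriction is exactly what makes BERT bidirectional.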

The BERT architecture itself, at least the one used in the paper, is nearly identical to the original encoder architecture from the “Attention is all you need” paper by Vaswani et al. It consists of an input embedding layer followed by a positional embedding layer and then N transformer blocks (self-attention followed by layer normalization and a feed-forward network). This architecture constitutes the pre-training portion of BERT, similar to how GPT used just a decoder for its pre-training. Additionally, BERT makes use of a special [CLS] token that is used for classification tasks and a [SEP] token to separate inputs such as sentence pairs for association or causality tasks. For cases where there are two separated inputs, such as sentence A and sentence B, BERT also adds a learned segment embedding indicating which sentence, or portion of the input, each token belongs to. You can think of this as marking every token in the first sequence with segment A and every token in the second sequence with segment B.
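A minimal sketch of how those special tokens and segment ids come together (the helper name is mine, not from the paper): [CLS] goes first, [SEP] closes each segment, and every token gets a segment id of 0 for sentence A or 1 for sentence B.

```python
def build_bert_input(tokens_a, tokens_b=None):
    # Hypothetical helper that formats a one- or two-sentence BERT input.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # [CLS], sentence A, first [SEP] -> segment A
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B, second [SEP] -> segment B
    return tokens, segment_ids

tokens, segs = build_bert_input(["the", "dog"], ["it", "barked"])
# tokens -> ['[CLS]', 'the', 'dog', '[SEP]', 'it', 'barked', '[SEP]']
# segs   -> [0, 0, 0, 0, 1, 1, 1]
```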

To accommodate bidirectional attention, as opposed to causal attention, BERT uses different learning objectives than standard language modelling (i.e. predicting the next token). The first is known as masked language modelling, or MLM for short. This task consists of randomly masking out tokens from the encoder’s input with a set probability and then using the non-masked tokens to predict the masked ones via self-attention. To mitigate the mismatch between pre-training and fine-tuning, the policy BERT uses for MLM training is to first choose 15% of token positions at random for prediction, then replace each chosen token with the special [MASK] token 80% of the time, a random token 10% of the time, and the original token 10% of the time. The outputs at the masked positions are fed into a softmax over the vocabulary, and cross-entropy loss is used to calculate the loss and backpropagate.

The second learning objective BERT uses is known as next sentence prediction (NSP). Unlike MLM or traditional language modelling, NSP doesn’t mask the input, but it does require that the input be a pair of sequences, sequence A and sequence B. The objective of this task is to predict whether or not sequence B directly follows sequence A; in the case of sentences, it’s essentially asking “does sentence B follow sentence A?”. This task is accomplished by taking any monolingual corpus and creating a dataset of sentence pairs where 50% of the time sentence B is the actual next sentence after sentence A and the other 50% of the time it isn’t. The objective depends on BERT’s [CLS] output token to encode the binary label of “yes, B follows A” or “no, B doesn’t follow A”. Since this is just a binary classification task, the rest of BERT’s output tokens can be discarded during NSP, as they aren’t used at all.
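The 80/10/10 masking policy for MLM can be sketched in a few lines (this is my illustrative implementation, not the paper’s code; `vocab` is a hypothetical list of token strings to sample random replacements from):

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, seed=0):
    # Sketch of BERT's MLM corruption policy: pick ~15% of positions to
    # predict, then replace each chosen token with [MASK] 80% of the time,
    # a random token 10% of the time, or leave it unchanged 10% of the time.
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (but still predict it)
    return corrupted, targets
```

Note that positions replaced with their original token are still prediction targets; only the loss at positions in `targets` is used, computed as cross-entropy against a softmax over the vocabulary.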
Pre-training is thus done by using massive datasets of unlabeled text data to train the encoder on both MLM and NSP.
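The NSP dataset construction described above can also be sketched simply (again an illustrative helper of my own; for brevity, the randomly chosen negative sentence may occasionally coincide with the true next sentence):

```python
import random

def make_nsp_examples(sentences, seed=0):
    # Sketch of NSP data creation from an ordered monolingual corpus:
    # 50% of pairs keep the true next sentence (label 1, "IsNext"),
    # the other 50% swap in a random sentence (label 0, "NotNext").
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], 1))
        else:
            examples.append((sentences[i], rng.choice(sentences), 0))
    return examples
```

Each (A, B) pair would then be packed into a single input with [CLS]/[SEP] tokens and segment embeddings, with the [CLS] output carrying the binary prediction.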

Fine-tuning BERT, similar to fine-tuning GPT, uses much smaller, labeled datasets to optimize for a particular downstream task. Examples of tasks include paraphrasing, entailment, question answering, sequence tagging, and sentiment analysis. Such tasks are typically accomplished by adding some form of additional output layer and then using the smaller labeled dataset to learn the parameters for the small output layer. Some of the above tasks, such as sentiment analysis, only make use of the [CLS] token, while other tasks, such as question answering, use the entire output to analyze sequence B in relation to sequence A.
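For a sentence-level task like sentiment analysis, the added output layer can be as small as a single linear layer plus softmax over the [CLS] output vector; those weights are the only new parameters learned during fine-tuning. A minimal pure-Python sketch (names and shapes are mine, for illustration):

```python
import math

def classify_from_cls(cls_vector, weights, bias):
    # Hypothetical fine-tuning head: logits = W @ cls_vector + b, then softmax.
    # `weights` is a list of rows (one per class), `bias` one value per class.
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]      # class probabilities
```

Token-level tasks like sequence tagging or question answering apply a similar small head to every output position rather than just [CLS].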

BERT as an architecture was revolutionary to the field of NLP due to its state-of-the-art performance on so many text analysis tasks (outperforming GPT on many of them). While many architectures that followed it, such as RoBERTa, continued to revise and build upon the original design, BERT established the pre-training approach for encoder-based architectures and opened the door for incredible gains in the field of NLP as a whole.

Questions/Notes I Have

  1. For a more in-depth explanation of BERT, I’d recommend this Hugging Face article, which has great illustrations of many of the important ideas outlined in the paper.
  2. One of the really cool aspects about BERT compared to the GPT series is that BERT is entirely open source for anyone to use. This is in large part why so many improvements and iterations have been made on top of the BERT architecture and why BERT is so widely used.