Introduction

For the third post in this series, I read “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al., more commonly known as the “word2vec” paper. In it, the authors present their research into efficient methods of translating tokenized vocabulary words into dense word embeddings using a stripped-down feedforward neural network and specially designed learning objectives.

Summary

The motivation for the word2vec paper was to find a way to create high-quality word embedding vectors by efficiently using a very large amount of data, rather than by making the model more complex. At the time of publishing, there weren’t really any good ways to use massive datasets of words/vocabularies to generate word embeddings. The most naive method is to use a one-hot vector for each word, where the dimension \(N\) of the vector is the size of the vocabulary itself. For small vocabularies, this method can work reasonably well, but for large vocabularies, it causes the number of parameters in language models (pre-LLMs at the time of publishing) to explode and makes the models much less efficient.

Aside from using one-hot vectors, there were two main (ML) methods of generating word embeddings. The first used a feedforward network (FFN) that took the one-hot vectors of the preceding words (the context window) as input, projected them into a lower-dimensional space, passed them through a dense hidden layer, and applied a softmax over the vocabulary to predict the most likely next word in the sequence. The second used an RNN, similar to more traditional seq2seq models, where you input a sequence one token at a time. In the paper’s description, the RNN model has no projection layer: the current input and the previous hidden state feed a recurrent hidden layer, which produces the output for the current timestep and the hidden state that gets passed to the next timestep as additional input. This RNN-style architecture was also trained to predict the next word in a given sequence.
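
To make the one-hot and projection idea concrete, here is a minimal sketch in plain Python. The toy vocabulary, the 2-dimensional projection matrix, and the helper names (`one_hot`, `project`) are all assumptions for illustration, not anything from the paper.

```python
# Toy vocabulary; the words and sizes are illustrative assumptions.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical 5 x 2 projection matrix: one 2-dimensional row per word.
projection = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
    [0.9, 1.0],
]


def project(one_hot_vector, matrix):
    """Multiply a one-hot row vector by the projection matrix."""
    dims = len(matrix[0])
    return [
        sum(one_hot_vector[i] * matrix[i][d] for i in range(len(matrix)))
        for d in range(dims)
    ]


cat = one_hot(word_to_index["cat"], len(vocab))
dense = project(cat, projection)

# The product just selects the matrix row at the word's index.
print(dense == projection[word_to_index["cat"]])
```

The one-hot vector grows with the vocabulary, while the projected vector stays at whatever small dimension you choose, which is why the projection layer is where the embeddings live.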

However, due to the dense hidden layers in both architectures, these networks were very inefficient to train and couldn’t handle massive datasets or vocabularies. To address this problem, the authors of the word2vec paper essentially kept the FFN architecture but stripped away the hidden layer, leaving only the inputs feeding into a projection (embedding) layer. In addition to this substantial simplification of the architecture, the authors came up with two learning objectives that are far more successful at generating embeddings than the generic “next token in a sequence” objective: the Continuous Bag of Words (CBOW) objective and the Skip-gram objective. Each learning objective comes with a matching network architecture, and each word in the vocabulary is assigned a “center word” vector, \(v\), and a “context word” vector, \(u\), both of which are learned during training.
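
The paper makes the efficiency argument with per-example training complexity terms, and a rough sketch of those terms shows how much the hidden layer costs. The function names and the specific sizes below are my own assumptions for illustration. The feedforward cost uses a full softmax output, and the CBOW and Skip-gram costs assume a hierarchical softmax output, following the paper's formulas.

```python
import math


def nnlm_cost(n_context, dims, hidden, vocab_size):
    """Feedforward NNLM: projection + dense hidden layer + full softmax."""
    return n_context * dims + n_context * dims * hidden + hidden * vocab_size


def cbow_cost(n_context, dims, vocab_size):
    """CBOW: projection + hierarchical softmax, with no hidden layer."""
    return n_context * dims + dims * math.log2(vocab_size)


def skipgram_cost(max_distance, dims, vocab_size):
    """Skip-gram: one projection and tree traversal per predicted word."""
    return max_distance * (dims + dims * math.log2(vocab_size))


# Illustrative sizes (assumptions): 10 context words, 500-dimensional
# embeddings, 500 hidden units, and a vocabulary of one million words.
N, D, H, V = 10, 500, 500, 1_000_000

print("NNLM:     ", nnlm_cost(N, D, H, V))
print("CBOW:     ", round(cbow_cost(N, D, V)))
print("Skip-gram:", round(skipgram_cost(N, D, V)))
```

With these sizes, the terms involving the hidden layer dominate the feedforward model's cost by several orders of magnitude, which is the gap the simplified architectures exploit.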

The CBOW learning objective is essentially trying to predict a center word given all of the surrounding words, called context words. For example, if I set the context window, \(C\), to 3, my input is the one-hot vectors of the three words before and the three words after the current word. Each of these context words is fed into the embedding layer individually, and the resulting vectors are average-pooled to create a single context embedding vector. This context embedding vector is then dotted with the learned center word vector of every word in the vocabulary, and the result is softmaxed to generate the probability of each word in the vocabulary being the center word.
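
The CBOW forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration using a full softmax over a toy 4-word vocabulary. The vector values and the function name `cbow_probabilities` are assumptions, and no training step is shown.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_probabilities(context_indices, context_vectors, center_vectors):
    """Average the context vectors (u), then score every center vector (v)."""
    dims = len(context_vectors[0])
    pooled = [
        sum(context_vectors[i][d] for i in context_indices) / len(context_indices)
        for d in range(dims)
    ]
    scores = [dot(pooled, v_w) for v_w in center_vectors]
    return softmax(scores)


# Toy 4-word vocabulary with 2-dimensional vectors (made-up values).
context_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # u
center_vectors = [[0.0, 0.0], [2.0, 0.0], [0.0, -1.0], [1.0, 1.0]]   # v

# Context words 0 and 2 surround the unknown center word.
probs = cbow_probabilities([0, 2], context_vectors, center_vectors)
predicted = probs.index(max(probs))
print(predicted, probs)
```

Because the context vectors are averaged, the order of the context words does not matter, which is where the “bag of words” part of the name comes from.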

The Skip-gram learning objective, on the other hand, is the reverse of the CBOW learning objective: the goal is to predict the context words given a center word. The architecture essentially consists of a single one-hot vector for the center word, which is fed into the embedding layer to produce its center word vector. That vector is then dotted with the learned context vector of every word in the vocabulary, and the result is softmaxed to generate the probability of each word in the vocabulary being a context word.
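
Here is the matching sketch for Skip-gram, again with a full softmax over a toy vocabulary. The vector values and the function names (`skipgram_probabilities`, `skipgram_loss`) are assumptions for illustration. The loss shown is the usual negative log-likelihood summed over the observed context words.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_probabilities(center_index, center_vectors, context_vectors):
    """Score the center vector (v_c) against every context vector (u_w)."""
    v_c = center_vectors[center_index]
    scores = [dot(u_w, v_c) for u_w in context_vectors]
    return softmax(scores)


def skipgram_loss(center_index, context_indices, center_vectors, context_vectors):
    """Negative log-likelihood of the observed context words."""
    probs = skipgram_probabilities(center_index, center_vectors, context_vectors)
    return -sum(math.log(probs[o]) for o in context_indices)


# Toy 4-word vocabulary with 2-dimensional vectors (made-up values).
center_vectors = [[0.0, 0.0], [2.0, 0.0], [0.0, -1.0], [1.0, 1.0]]   # v
context_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # u

probs = skipgram_probabilities(3, center_vectors, context_vectors)
loss = skipgram_loss(3, [0, 2], center_vectors, context_vectors)
print(probs.index(max(probs)), loss)
```

Note that the same probability distribution is reused for every context position; only the target word changes from one term of the loss to the next.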

Both architectures apply a dot product before the softmax as a way of measuring the similarity between the center word and the context words in the embedding vector space. This similarity is what acts as the signal for whether or not a word is related to its surrounding words. The idea is worth pointing out because it closely resembles the scaled dot-product attention mechanism used in today’s LLMs in both the encoder and the decoder, although the learning objective and the inputs/outputs are very different.
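
To make the comparison concrete, here are the two formulas side by side, using the \(u\), \(v\) notation from above, with \(c\) the center word, \(o\) a candidate context word, and \(N\) the vocabulary size. The first is the standard full-softmax form of the Skip-gram probability. The second uses the usual attention symbols \(Q\), \(K\), \(V\), and \(d_k\), which are not defined in the word2vec paper.

\[
P(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{N} \exp\left(u_w^{\top} v_c\right)}
\]

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

In both cases, a softmax over dot products turns similarity scores into a probability distribution. Word2vec uses that distribution directly as its prediction, whereas attention uses it to weight the value vectors.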

There are also some other strategies employed to make training effective and feasible. The paper uses hierarchical softmax, which arranges the vocabulary in a binary tree so that the softmax over a very large vocabulary becomes far cheaper to compute. Negative sampling, which was introduced in the authors’ follow-up paper rather than in this one, samples random non-context words as noise, and the model learns not to predict them as context words/center words (i.e. they act as negative examples). The paper showed that both the CBOW and the Skip-gram architectures/objectives outperformed the more traditional language models of the time and were more effective at generating high-quality word embeddings, since the simplified models could ingest far more training data.

The result of a fully trained embedding model is that you can feed it a word index and get out a very high-quality word embedding to use as a much better and more informative input to the rest of a model. As such, many NLP models have used word2vec, or another set of pre-trained embedding weights, to initialize their embedding layers and then simply fine-tuned them while training the rest of the model. This can greatly increase training efficiency as well as the performance of the overall model.
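
The negative-sampling idea can be sketched as a loss over one true (center, context) pair and a handful of sampled noise words. This follows the commonly cited formulation of the loss. The function names are my own, and uniform sampling is a simplifying assumption, since real implementations typically draw negatives from a smoothed word-frequency distribution instead.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    """Logistic function mapping a score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_negatives(vocab_size, exclude, k, rng):
    """Uniformly sample k word indices that are not in `exclude`.

    Uniform sampling is an assumption made to keep the sketch short.
    """
    candidates = [i for i in range(vocab_size) if i not in exclude]
    return [rng.choice(candidates) for _ in range(k)]


def negative_sampling_loss(center_vector, true_context_vector, negative_vectors):
    """Reward a high score for the true pair and low scores for noise pairs."""
    positive = -math.log(sigmoid(dot(true_context_vector, center_vector)))
    negative = -sum(
        math.log(sigmoid(-dot(u_k, center_vector))) for u_k in negative_vectors
    )
    return positive + negative


rng = random.Random(0)
negatives = sample_negatives(vocab_size=10, exclude={2, 5}, k=3, rng=rng)

v_c = [1.0, 0.0]
# True context aligned with v_c, noise word pointing away: low loss.
good = negative_sampling_loss(v_c, [2.0, 0.0], [[-2.0, 0.0]])
# True context pointing away, noise word aligned: high loss.
bad = negative_sampling_loss(v_c, [-2.0, 0.0], [[2.0, 0.0]])
print(negatives, good, bad)
```

The point of the trick is that each update touches only the true word and the few sampled negatives, rather than every word in the vocabulary as a full softmax would.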

Questions/Notes I Have

  1. There’s a lot of good mathematical intuition behind why word2vec works that isn’t included in the paper. It’s properly explained here, which I would recommend for people with a more solid probability background.
  2. One of the big constraints of these pre-trained embedding models is that if you want to use them and fine-tune them, your input vocabulary has to match the embedding’s input vocabulary exactly. This is because a one-hot vector has to be fed in, and the corresponding word index has to match what the embedding layer was trained on. Otherwise, you will get the embedding vector for a completely different word.
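
The vocabulary constraint in the second note is easy to demonstrate. In this sketch, the vocabularies, the vectors, and the `align_embeddings` helper are all hypothetical. Filling unseen words with zeros is just one possible choice, and is an assumption rather than a recommendation.

```python
# Hypothetical pre-trained vocabulary and its embedding rows.
pretrained_vocab = ["king", "queen", "apple"]
pretrained_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

# A new vocabulary with a different ordering and an unseen word.
my_vocab = ["apple", "king", "banana"]


def align_embeddings(pretrained_vocab, pretrained_vectors, new_vocab, dims):
    """Build an embedding table for `new_vocab` by matching on the word string.

    Words missing from the pre-trained vocabulary get a zero vector here;
    that fallback is an assumption made for the sketch.
    """
    lookup = dict(zip(pretrained_vocab, pretrained_vectors))
    return [
        list(lookup[word]) if word in lookup else [0.0] * dims
        for word in new_vocab
    ]


# Naive reuse: index 0 means "apple" to us but "king" to the pre-trained table.
naive = pretrained_vectors[my_vocab.index("apple")]

# Aligned reuse: rows are reordered so each index maps to the right word.
aligned = align_embeddings(pretrained_vocab, pretrained_vectors, my_vocab, dims=2)

print("naive: ", naive)
print("aligned:", aligned[my_vocab.index("apple")])
```

Matching on the word string rather than the index is what keeps the pre-trained rows attached to the right words when the two vocabularies are ordered differently.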