Attention Mechanism in Seq2Seq and BiDAF – An Illustrated Guide

This article is the third in a series of four articles that aim to illustrate the working of Bi-Directional Attention Flow (BiDAF), a popular machine learning model for question answering (Q&A).

To recap, BiDAF is a closed-domain, extractive Q&A model. This means that to be able to answer a Query, BiDAF needs to consult an accompanying text that contains the information needed to answer the Query. This accompanying text is called the Context. BiDAF works by extracting a substring of the Context that best answers the Query — this is what we refer to as the Answer to the Query. I intentionally capitalize the words Query, Context, and Answer to signal that I am using them in their specialized technical capacities.

An example of a Context, a Query, and an Answer. Notice how the Answer can be found verbatim in the Context.

In the first article in the series, I presented a high-level overview of BiDAF. In the second article, I talked about how BiDAF uses three embedding layers to obtain vector representations of the Context and the Query. The final outputs of these embedding steps are two matrices — H (which represents the words in the Context) and U (which represents the words in the Query). H and U are the inputs to the attention layers, whose function is to combine their informational content. These attention layers are the core component of BiDAF that differentiates it from earlier models and enables it to score highly on the SQuAD leaderboard. The workings of the attention layers are the focus of this article. Let's dive in!

Before delving into the details of the attention mechanism used in BiDAF, it behooves us first to understand what attention is. Attention was first introduced in 2014 as part of a sequence-to-sequence (seq2seq) model. As its name suggests, seq2seq is a neural network model whose aim is to convert one sequence into another sequence. An example application of seq2seq is the translation of a French sentence into an English sentence, such as the one below.

A seq2seq model is built from two recurrent neural networks (RNNs). The first RNN, called the "encoder", is responsible for understanding the input sequence (in our example, a French sentence) and converting its informational content into a fixed-size intermediary vector. The second RNN, called the "decoder", then uses this intermediary vector to generate an output sequence (in our example, the English translation of the French sentence).

Prior to the incorporation of the attention mechanism, seq2seq models could only deal with short sequences. This restriction arises because "vanilla" seq2seq models can only fit a limited amount of information into the intermediary vector, and some informational content is lost in the process. This situation is akin to attempting to translate a French book by first reading the whole book, memorizing its content, and then writing the English translation purely from memory. As you can imagine, such an attempt is bound to fail — our poor translator will have forgotten most of the book by the time he starts writing the English translation!

The attention mechanism was developed to solve this information bottleneck. The core idea of attention is that at each step of the decoding process, we make a direct connection to specific parts of the encoder. In the context of our French-English translation task, this means that every time our model is about to generate the next English word, it focuses only on the most relevant portions of the input French sentence.

Conceptually, a seq2seq translation model with attention works just like a human translator would. He reads the first paragraph of the French text and translates it into English, moves on to the second paragraph and translates that one, and so on. By doing so, he doesn't have to commit the whole content of the book to memory and run the risk of forgetting most of it.

Practically, we can include an attention mechanism in a seq2seq model by performing the following three steps, which are depicted in the diagram below.

1. Comparison of Sequences and Calculation of Attention Scores

At each time step during the decoding process, we compare the decoder hidden state with all of the encoder hidden states. This comparison can be done using any function that takes two vectors and outputs a scalar reflecting their similarity. The simplest such function is a dot product. The scalar output of the similarity function is called an "attention score"; these attention scores are depicted by blue circles in the diagram above.

2. Conversion of Attention Scores into an Attention Distribution

We then take the softmax of all these attention scores. The softmax function normalizes the attention scores into a probability distribution (a set of numbers that sum to one). This probability distribution is called the "attention distribution"; it signals the parts of the input sequence that are most relevant to the decoding step at hand. The blue bars in the diagram above show the attention distribution. We see that the bar corresponding to the second French word, "me", is the tallest; this is because this word translates to "I" in English, which is the first word in our output sequence.

3. Multiplying the Attention Distribution with the Encoder Hidden States to Get the Attention Output

We then multiply each element of the attention distribution by its corresponding encoder hidden state and sum up all of these products to produce a single vector called the "attention output". You can think of the attention output as a selective summary of the input sequence. The attention output then becomes an input to the next decoding step.

Although the three attention steps above were first applied to seq2seq, they are generalizable to other applications. As we will see later, BiDAF uses the same three steps in its implementation of attention, albeit with some minor modifications.
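To make the three steps concrete, here is a minimal NumPy sketch of dot-product attention for a single decoding step. The variable names (decoder_state, encoder_states) and the toy shapes are illustrative assumptions, not the implementation of any particular seq2seq library.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_states):
    """One decoding step of dot-product attention.

    decoder_state:  (hidden_dim,)          current decoder hidden state
    encoder_states: (src_len, hidden_dim)  hidden state for each source word
    """
    # Step 1: attention scores = similarity of the decoder state with each encoder state
    scores = encoder_states @ decoder_state          # (src_len,)

    # Step 2: softmax turns the scores into an attention distribution (sums to 1)
    attn_dist = softmax(scores)                      # (src_len,)

    # Step 3: weighted sum of encoder states = attention output (selective summary)
    attn_output = attn_dist @ encoder_states         # (hidden_dim,)
    return attn_dist, attn_output

# Toy example: 3 source words, hidden size 4
encoder_states = np.random.randn(3, 4)
decoder_state = np.random.randn(4)
dist, output = dot_product_attention(decoder_state, encoder_states)
print(dist, output.shape)
```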

With this quick overview of the attention mechanism and its implementation in seq2seq, we are now ready to see how this concept is applied in BiDAF. C'est parti!

Just to remind you where we left off in the last article: our last step (step 5) was a contextual embedding step that produced two output matrices, H for the Context and U for the Query. Our overarching goal for the attention steps (steps 6 to 9) is to fuse together information from H and U to create several matrix representations of the Context that also contain information from the Query.

Our sixth step — the first attention-related step — is the formation of the so-called similarity matrix S. S is a tall, skinny matrix with a dimension of T-by-J (the number of words in the Context by the number of words in the Query). The generation of the similarity matrix S corresponds to the first step of the seq2seq attention mechanism discussed above. It entails applying a comparison function to each column in H and each column in U. The value in row t and column j of S represents the similarity of the t-th Context word and the j-th Query word.

Let's take a look at an example of a similarity matrix S. Suppose we have this Query/Context pair:

Query: "Where is Singapore situated?"
Context: "Singapore is a small country located in Southeast Asia."

The similarity matrix S produced from the above Query/Context pair is shown below.

An example of a similarity matrix S.

We can make a couple of observations from the matrix S above.

As we expect, the matrix has a dimension of 9-by-4, 9 being the length of the Context (T) and 4 being the length of the Query (J).

The cell in row 1, column 3 contains a relatively high value, as indicated by its bright yellow color. This implies that the Query word and the Context word associated with this coordinate are highly similar to each other. These words turn out to be the exact same word — "Singapore" — hence it makes sense that their vector representations are very similar.

However, a Context word and a Query word being identical doesn't necessarily imply that their vector representations are highly similar! Look at the cell in row 2, column 2 — this cell encodes the similarity of the Context word "is" and the identical Query word "is". Its value is not as high as that of the "Singapore" pair above. This is because these vector representations also incorporate information from the surrounding phrases. This contextual contribution is especially important for small copulas such as "is".

On the other hand, we can see that the similarity value of two distinct words with close semantic and grammatical meaning, such as "situated" and "located", is relatively high. This is thanks to our word and character embedding layers, which can generate vector representations that reflect a word's meaning fairly accurately.

Now let me tell you how we calculate the values in S. The comparison function used to perform this calculation is called α. α is more complex than the dot product used in seq2seq; here is the equation for α, as defined in the BiDAF paper:

α(h, u) = w(S)ᵀ [h; u; h ∘ u]

where w(S) is a trainable weight vector of size 6d, [;] denotes vector concatenation, and ∘ denotes element-wise multiplication. As the function α comprises a multiplication of a row vector by an equally-sized column vector, it always returns a scalar. The diagram below shows all the matrix operations performed in this step.

Step 6. Using the similarity function α, we combine the Context matrix H and the Query matrix U to form the similarity matrix S.
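Here is a small NumPy sketch of how a similarity matrix like S could be computed from H and U with the α function above. The dimension d and the random weight vector w_S are stand-ins for illustration; in the real model, w_S is learned during training.

```python
import numpy as np

def similarity_matrix(H, U, w_S):
    """Compute the T-by-J similarity matrix S.

    H:   (2d, T) contextual embeddings of the Context words
    U:   (2d, J) contextual embeddings of the Query words
    w_S: (6d,)   trainable weight vector of the similarity function alpha
    """
    T = H.shape[1]
    J = U.shape[1]
    S = np.zeros((T, J))
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            # alpha(h, u) = w_S . [h; u; h*u]  -> a single scalar
            S[t, j] = w_S @ np.concatenate([h, u, h * u])
    return S

# Toy shapes: d = 2, T = 9 Context words, J = 4 Query words
d, T, J = 2, 9, 4
H = np.random.randn(2 * d, T)
U = np.random.randn(2 * d, J)
w_S = np.random.randn(6 * d)
S = similarity_matrix(H, U, w_S)
print(S.shape)  # (9, 4)
```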
The similarity matrix S serves as an input to the next two steps: Context-to-Query attention (C2Q) and Query-to-Context attention (Q2C). In this section, we will focus on C2Q. The goal of this step is to find which Query words are most relevant to each Context word. Performing C2Q is similar to performing the second and third steps of the seq2seq attention.

First, we use the scalar values in S to calculate the attention distribution. This is done by taking the row-wise softmax of S. The result is another matrix; it is not explicitly named in the BiDAF paper, but let's call it matrix A. The matrix A, whose dimension is the same as that of S, indicates which Query words are the most relevant to each Context word. Let's look at an example of A.

An example of matrix A, the row-wise softmaxed version of S.

By observing the heatmap above, we can conclude that:

Semantic similarity strongly contributes to relevance. We see that for the Context words ["Singapore", "is", "located"], the most relevant Query words are ["Singapore", "is", "situated"]. These are also the words with which they share a strong semantic similarity.

Context words "understand" the information requested by Query words. We see that the Query word "Where" is the most relevant Query word for the Context words ["a", "small", "country", "in", "Southeast", "Asia"] — words related to geographical location.

We then take every row of A to get the attention distribution At:, which has a dimension of 1-by-J. At: reflects the relative importance of each Query word for the t-th Context word. We then calculate the weighted sum of the Query matrix U with respect to the elements of the attention distribution At:. The result of this step is the attention output matrix called Ũ, which is a 2d-by-T matrix.

Ũ, just like H, is a matrix representation of the Context. However, Ũ contains different information from H! Whereas H encapsulates the semantic, syntactic, and contextual meaning of each Context word, Ũ encapsulates the information about the relevance of each Query word to each Context word. The whole process of generating Ũ from the similarity matrix S and the Query matrix U is depicted below.

The next step is Q2C, which, like C2Q, also starts with the similarity matrix S. In this step, our goal is to find which Context words are most similar to any one of the Query words and hence are critical for answering the Query.

We first take the maximum across the rows of the similarity matrix S to get a column vector. This column vector is not explicitly named in the paper, but let's call it z. Let's now step back and think about what z symbolizes. Remember, our similarity matrix S records the similarity between each Context word and each Query word. Let's take a second look at the example S we created above and focus on its fourth row, which corresponds to the Context word "small". One can see that there isn't any bright cell across this row! This indicates that there is no word in the Query that is similar in meaning to the Context word "small". When we take the maximum across this row, the value obtained will be close to zero. Contrast this with the words "Singapore" and "located", in whose rows we do find at least one bright cell, indicating the existence of Query word(s) similar to these words. When we take the maximum of these two rows, the corresponding values in the vector z will be relatively high. Here is the z obtained for our example:

In Q2C, the values in z serve as our attention values. We apply the softmax to z to get an attention distribution called b. We then use b to take a weighted sum of the Context matrix H. The resulting attention output is a 2d-by-1 column vector called ĥ. The last step of Q2C is to copy ĥ T times and combine these copies into a 2d-by-T matrix called Ĥ.

Ĥ is yet another representation of the Context, one that encapsulates the information about the most important words in the Context with respect to the Query. The whole process of generating Ĥ from the similarity matrix S and the Context matrix H is depicted below.
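Below is a minimal NumPy sketch of the C2Q and Q2C computations described above, starting from a precomputed similarity matrix S. The shapes follow the notation used in this article; the random inputs are placeholders for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def c2q_attention(S, U):
    """Context-to-Query attention.

    S: (T, J) similarity matrix
    U: (2d, J) Query matrix
    Returns U_tilde: (2d, T), one attended Query summary per Context word.
    """
    A = softmax(S, axis=1)        # row-wise softmax: (T, J)
    U_tilde = U @ A.T             # weighted sum of Query columns: (2d, T)
    return U_tilde

def q2c_attention(S, H):
    """Query-to-Context attention.

    S: (T, J) similarity matrix
    H: (2d, T) Context matrix
    Returns H_hat: (2d, T), the attended Context vector copied T times.
    """
    z = S.max(axis=1)             # max over Query words for each Context word: (T,)
    b = softmax(z)                # attention distribution over Context words: (T,)
    h_hat = H @ b                 # weighted sum of Context columns: (2d,)
    H_hat = np.tile(h_hat[:, None], (1, S.shape[0]))  # copy T times: (2d, T)
    return H_hat

# Toy example with d = 2, T = 9, J = 4
d, T, J = 2, 9, 4
H = np.random.randn(2 * d, T)
U = np.random.randn(2 * d, J)
S = np.random.randn(T, J)
print(c2q_attention(S, U).shape, q2c_attention(S, H).shape)  # (4, 9) (4, 9)
```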
The matrices produced in steps 5, 7, and 8 are then combined to form a giant matrix G. To refresh your memory, these three matrices are as follows:

H: the original Context matrix, which encapsulates the semantic, syntactic, and contextual meaning of the Context words.
Ũ: a Context matrix that encapsulates the relevance of each Query word to each Context word.
Ĥ: a Context matrix that encapsulates the information about the most important words in the Context with respect to the Query.

These three matrices have the same dimension — 2d-by-T. This "megamerging", unfortunately, is not as simple as stacking the three matrices vertically! Here, we make use of another custom function called β. Here is the equation for β, as defined in the BiDAF paper:

β(h, ũ, ĥ) = [h; ũ; h ∘ ũ; h ∘ ĥ]

where [;] denotes vertical concatenation and ∘ denotes element-wise multiplication, so β takes three 2d-dimensional column vectors and returns a single 8d-dimensional column vector. We define our megamerged matrix G by G:t = β(H:t, Ũ:t, Ĥ:t), where G:t is the t-th column vector of G, corresponding to the t-th Context word. G has a dimension of 8d-by-T. The whole process of generating G from H, Ũ, and Ĥ is depicted below.

Step 9. Merging of the three Context matrices H, Ũ, and Ĥ to form G.

The giant matrix G contains all the information in H, Ũ, and Ĥ. That is, you can think of each column vector in G as a vector representation of a Context word that is "aware" of the existence of the Query and has incorporated relevant information from it.
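As a final sketch, here is how the merge in step 9 could look in NumPy, applying β to every Context position at once. The shapes mirror the article's notation and the random inputs are placeholders, not the actual model's learned representations.

```python
import numpy as np

def megamerge(H, U_tilde, H_hat):
    """Apply beta column by column to build G.

    H, U_tilde, H_hat: (2d, T) Context representations from steps 5, 7, and 8
    Returns G: (8d, T), where G[:, t] = [h; u_tilde; h * u_tilde; h * h_hat]
    """
    return np.vstack([H, U_tilde, H * U_tilde, H * H_hat])

# Toy example with d = 2, T = 9
d, T = 2, 9
H = np.random.randn(2 * d, T)
U_tilde = np.random.randn(2 * d, T)
H_hat = np.random.randn(2 * d, T)
G = megamerge(H, U_tilde, H_hat)
print(G.shape)  # (16, 9), i.e. 8d-by-T with d = 2
```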

So that's all there is to it about the attention layers in BiDAF! I know this might be a lot to take in, especially with the myriad of symbols and matrix operations involved. If you really want to study BiDAF in detail, I recommend you print out the glossary below as well as all the diagrams above and study them side by side. In the next article, which will be the last of the series, I will discuss how G serves as an input to the modeling and output layers. The output layer is the one that gives out the probability of each Context word being included in the Answer span. I hope to see you in that last article!