
## Sequence to Sequence basics

Let’s explain the sequence to sequence framework, as we’ll rely on it for our model. Let’s start with its simplest version, on a translation task.

As an example, let’s translate *how are you* into French: *comment vas tu*.

### Vanilla Seq2Seq

The Seq2Seq framework relies on the encoder-decoder paradigm: the encoder encodes the input sequence, while the decoder produces the target sequence.

Encoder

Our input sequence is *how are you*. Each word from the input sequence is associated with a vector $w \in \mathbb{R}^d$ (via a lookup table). In our case, we have 3 words, so our input is transformed into $[w_0, w_1, w_2] \in \mathbb{R}^{d \times 3}$. Then, we simply run an LSTM over this sequence of vectors and store the last hidden state output by the LSTM: this is our encoder representation $e$. Let’s write the hidden states $[e_0, e_1, e_2]$ (and thus $e = e_2$).

Vanilla Encoder

Decoder

Now that we have a vector $e$ that captures the meaning of the input sequence, we’ll use it to generate the target sequence word by word. We feed another LSTM cell with $e$ as its initial hidden state and a special start-of-sentence vector $w_{\text{sos}}$ as input. The LSTM computes the next hidden state $h_0 \in \mathbb{R}^{h}$. Then, we apply some function $g : \mathbb{R}^h \mapsto \mathbb{R}^V$ so that $s_0 := g(h_0) \in \mathbb{R}^V$ is a vector of the same size as the vocabulary.
\begin{align} h_0 &= \operatorname{LSTM}\left(e, w_{\text{sos}} \right)\\ s_0 &= g(h_0)\\ p_0 &= \operatorname{softmax}(s_0)\\ i_0 &= \operatorname{argmax}(p_0)\\ \end{align}
Then, we apply a softmax to $s_0$ to normalize it into a vector of probabilities $p_0 \in \mathbb{R}^V$. Each entry of $p_0$ measures how likely each word in the vocabulary is. Let’s say that the word “comment” has the highest probability (and thus $i_0 = \operatorname{argmax}(p_0)$ corresponds to the index of “comment”). We get the corresponding vector $w_{i_0} = w_{\text{comment}}$ and repeat the procedure: the LSTM takes $h_0$ as hidden state and $w_{\text{comment}}$ as input, and outputs a probability vector $p_1$ over the second word, etc.
\begin{align} h_1 &= \operatorname{LSTM}\left(h_0, w_{i_0} \right)\\ s_1 &= g(h_1)\\ p_1 &= \operatorname{softmax}(s_1)\\ i_1 &= \operatorname{argmax}(p_1) \end{align}
The decoding stops when the predicted word is a special end of sentence token.
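This greedy decoding step can be sketched in NumPy; the linear map used for $g$, the sizes, and the random parameters below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, h_dim = 5, 4                    # toy vocabulary and hidden sizes
W_g = rng.normal(size=(V, h_dim))  # parameters of g (a linear map here)

def softmax(s):
    # numerically stable softmax
    z = np.exp(s - s.max())
    return z / z.sum()

h0 = rng.normal(size=h_dim)   # hidden state output by the LSTM
s0 = W_g @ h0                 # scores over the vocabulary, g(h0)
p0 = softmax(s0)              # probability vector p0
i0 = int(np.argmax(p0))      # index of the predicted word
```

The predicted index `i0` would then be looked up in the embedding table to produce the input of the next LSTM step.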

Vanilla Decoder

Intuitively, the hidden vector represents the “amount of meaning” that has not been decoded yet.

The above method aims at modelling the distribution of the next word conditioned on the beginning of the sentence
$$\mathbb{P}\left[ y_{t+1} | y_1, \dots, y_{t}, x_0, \dots, x_n \right]$$
by writing
$$\mathbb{P}\left[ y_{t+1} | y_t, h_{t}, e \right]$$

### Seq2Seq with Attention

The previous model has been refined over the past few years and has greatly benefited from what is known as attention. Attention is a mechanism that forces the model to learn to focus (= to attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder’s LSTM. One way of performing attention is explained by Bahdanau et al. We slightly modify the recurrence formula that we defined above by adding a new vector $c_t$ to the input of the LSTM
\begin{align} h_{t} &= \operatorname{LSTM}\left(h_{t-1}, [w_{i_{t-1}}, c_t] \right)\\ s_t &= g(h_t)\\ p_t &= \operatorname{softmax}(s_t)\\ i_t &= \operatorname{argmax}(p_t) \end{align}
The vector $c_t$ is the attention (or context) vector. We compute a new context vector at each decoding step. First, with a function $f : (h_{t-1}, e_{t’}) \mapsto \alpha_{t’} \in \mathbb{R}$, we compute a score for each hidden state $e_{t^{\prime}}$ of the encoder. Then, we normalize the sequence of $\alpha_{t’}$ using a softmax and compute $c_t$ as the weighted average of the $e_{t^{\prime}}$.
\begin{align} \alpha_{t’} &= f(h_{t-1}, e_{t’}) \in \mathbb{R} & \text{for all } t’\\ \bar{\alpha} &= \operatorname{softmax} (\alpha)\\ c_t &= \sum_{t’=0}^n \bar{\alpha}_{t’} e_{t’} \end{align}

Attention Mechanism

The choice of the function $f$ varies, but it is usually one of the following
$$f(h_{t-1}, e_{t’}) = \begin{cases} h_{t-1}^T e_{t’} & \text{dot}\\ h_{t-1}^T W e_{t’} & \text{general}\\ v^T \tanh \left(W [h_{t-1}, e_{t’}]\right) & \text{concat}\\ \end{cases}$$
It turns out that the attention weights $\bar{\alpha}$ can be easily interpreted. When generating the word *vas* (corresponding to *are* in English), we expect $\bar{\alpha} _ {\text{are}}$ to be close to $1$ while $\bar{\alpha} _ {\text{how}}$ and $\bar{\alpha} _ {\text{you}}$ are close to $0$. Intuitively, the context vector $c$ will then be roughly equal to the hidden vector of *are*, and it will help to generate the French word *vas*.
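The whole attention computation fits in a few lines of NumPy. Here is a sketch using the "dot" scoring variant from above (encoder states and decoder state are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n, h_dim = 3, 4                  # 3 encoder states ("how", "are", "you")
e = rng.normal(size=(n, h_dim))  # encoder hidden states e_0, e_1, e_2
h_prev = rng.normal(size=h_dim)  # decoder hidden state h_{t-1}

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

alpha = e @ h_prev       # "dot" scoring: f(h_{t-1}, e_{t'}) = h_{t-1}^T e_{t'}
alpha_bar = softmax(alpha)  # normalized attention weights
c_t = alpha_bar @ e         # context vector: weighted average of the e_{t'}
```

The "general" and "concat" variants only change how `alpha` is computed; the softmax and weighted average stay the same.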

By putting the attention weights into a matrix (rows = input sequence, columns = output sequence), we would have access to the alignment between the words of the English and French sentences… (see page 6) There is still a lot to say about sequence to sequence models (for instance, it often works better if the encoder processes the input sequence backwards…).

## Training

What happens if the model is not sure whether it should generate *comment* or *vas* at the first time step (the most likely case at the beginning of training)? Then the entire sequence would be messed up, and the model would hardly learn anything…

If we used the predicted token as input to the next step during training (as explained above), errors would accumulate and the model would rarely be exposed to the correct distribution of inputs, making training slow or impossible. To speed things up, one trick is to feed the actual output sequence (<sos> comment vas tu) into the decoder’s LSTM and predict the next token at every position (comment vas tu <eos>).

Training

The decoder outputs a vector of probabilities over the vocabulary $p_i \in \mathbb{R}^V$ for each time step. Then, for a given target sequence $y_1, \dots, y_m$, we can compute its probability as the product of the probabilities of each token being produced at each relevant time step:
$$\mathbb{P}\left(y_1, \dots, y_m \right) = \prod_{i=1}^m p_i [y_i]$$
where $p_i [y_i]$ means that we extract the $y_i$-th entry of the probability vector $p_i$ from the $i$-th decoding step. In particular, we can compute the probability of the actual target sequence. A perfect system would give a probabilty of 1 to this target sequence, so we are going to train our network to maximize the probability of the target sequence, which is the same as minimizing
\begin{align} -\log \mathbb{P} \left(y_1, \dots, y_m \right) &= - \log \prod_{i=1}^m p_i [y_i]\\ &= - \sum_{i=1}^m \log p_i [y_i]\\ \end{align}
in our example, this is equal to
$$- \log p_1[\text{comment}] - \log p_2[\text{vas}] - \log p_3[\text{tu}] - \log p_4[\text{<eos>}]$$
and you recognize the standard cross entropy: we are actually minimizing the cross entropy between the target distribution (one-hot vectors) and the predicted distribution output by our model (our vectors $p_i$).
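The equivalence between "product of probabilities" and "sum of negative log probabilities" is easy to check numerically. A toy example with made-up probability vectors and token ids:

```python
import numpy as np

# toy probability vectors p_i from 4 decoding steps (each row sums to 1)
p = np.array([
    [0.7, 0.1, 0.1, 0.1],   # p_1, over a vocabulary of size 4
    [0.1, 0.6, 0.2, 0.1],   # p_2
    [0.1, 0.1, 0.7, 0.1],   # p_3
    [0.1, 0.1, 0.1, 0.7],   # p_4
])
target = [0, 1, 2, 3]        # toy ids for comment, vas, tu, <eos>

# probability of the target sequence: product of p_i[y_i]
seq_prob = np.prod([p[i, y] for i, y in enumerate(target)])
# negative log-likelihood: sum of token-level cross entropies
nll = -sum(np.log(p[i, y]) for i, y in enumerate(target))
```

Maximizing `seq_prob` and minimizing `nll` are the same objective, but the sum of logs is numerically far better behaved than the product.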

## Decoding

The main takeaway from the discussion above is that for the same model, we can define different behaviors. In particular, we defined a specific behavior that speeds up training.

What about inference/testing time then? Is there another way to decode a sentence?

There are indeed 2 main ways of performing decoding at testing time (translating a sentence for which we don’t have a translation). The first of these methods is the one covered at the beginning of the article: greedy decoding. It is the most natural way and consists in feeding the most likely word predicted at the previous step as input to the next step.

Greedy Decoder - feeds the best token to the next step

But didn’t we say that this behavior is likely to accumulate errors?

Even after having trained the model, it can happen that the model makes a small error (and gives a small advantage to vas over comment for the first step of the decoding). This would mess up the entire decoding…

There is a better way of performing decoding, called Beam Search. Instead of only predicting the token with the best score, we keep track of $k$ hypotheses (for example $k=5$; we refer to $k$ as the beam size). At each new time step, each of these 5 hypotheses can be extended with $V$ possible tokens, which makes a total of $5V$ new hypotheses. Then, we only keep the $5$ best ones, and so on… Formally, define $\mathcal{H}_t$ as the set of hypotheses decoded at time step $t$.
$$\mathcal{H}_ t := \{ (w^1_1, \dots, w^1_t), \dots, (w^k_1, \dots, w^k_t) \}$$
For instance if $k=2$, one possible $\mathcal{H}_2$ would be
$$\mathcal{H}_ 2 := \{ (\text{comment vas}), (\text{comment tu}) \}$$
Now we consider all the possible candidates $\mathcal{C}_{t+1}$, produced from $\mathcal{H}_t$ by adding all possible new tokens
$$\mathcal{C}_ {t+1} := \bigcup_{i=1}^k \{ (w^i_1, \dots, w^i_t, 1), \dots, (w^i_1, \dots, w^i_t, V) \}$$
and keep the $k$ highest scores (probability of the sequence). If we keep our example
\begin{align} \mathcal{C}_ 3 =& \{ (\text{comment vas comment}), (\text{comment vas vas}), (\text{comment vas tu})\} \\ \cup & \{ (\text{comment tu comment}), \ \ (\text{comment tu vas}), \ \ (\text{comment tu tu}) \} \end{align}
and for instance we can imagine that the 2 best ones would be
$$\mathcal{H}_ 3 := \{ (\text{comment vas tu}), (\text{comment tu vas}) \}$$
Once every hypothesis has reached the <eos> token, we return the hypothesis with the highest score.
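One beam-search step (expand every hypothesis, then keep the $k$ best) can be sketched in pure Python. The toy scoring model below is invented purely for illustration:

```python
import math

def beam_step(hypotheses, log_prob_fn, k):
    """One beam-search step.
    hypotheses: list of (token_sequence, log_score) pairs
    log_prob_fn: maps a sequence to a list of log-probs over the vocabulary
    Returns the k best extended hypotheses."""
    candidates = []
    for seq, score in hypotheses:
        for token, lp in enumerate(log_prob_fn(seq)):
            candidates.append((seq + (token,), score + lp))
    # sort by cumulative log-probability of the sequence, best first
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]

# toy model over a 3-word vocabulary {0: comment, 1: vas, 2: tu}
def toy_log_probs(seq):
    probs = [0.2, 0.5, 0.3] if seq and seq[-1] == 1 else [0.5, 0.3, 0.2]
    return [math.log(p) for p in probs]

H = [((0,), math.log(0.5))]          # start from "comment"
H = beam_step(H, toy_log_probs, k=2)  # the 2 best length-2 hypotheses
```

Greedy decoding is simply the special case $k = 1$.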

If we use beam search, a small error at the first step might be rectified at the next step, as we may keep the gold hypothesis in the beam!

## Conclusion

In this article we covered the seq2seq concepts. We showed that training is different from decoding. We covered two methods for decoding: greedy and beam search. While beam search generally achieves better results, it is not perfect and still suffers from exposure bias: during training, the model is never exposed to its errors! It also suffers from a loss-evaluation mismatch: the model is optimized w.r.t. token-level cross entropy, while we are interested in the reconstruction of the whole sentence…

Now, let’s apply Seq2Seq for LaTeX generation from images!

Producing LaTeX code from an image

## Approach

The previous part covered the concepts of sequence-to-sequence applied to Machine Translation. The same framework can be applied to our LaTeX generation problem: the input sequence is just replaced by an image, preprocessed with some convolutional model adapted to OCR (in a sense, if we unfold the pixels of an image into a sequence, this is exactly the same problem). This idea proved to be efficient for image captioning (see the reference paper Show, Attend and Tell). Building on some great work from the Harvard NLP group, my teammate Romain and I chose to follow a similar approach.

Keep the seq2seq framework but replace the encoder by a convolutional network over the image!

Good Tensorflow implementations of such models were hard to find. Together with this post, I am releasing the code and hope some will find it useful. You can use it to train your own image captioning model or adapt it for a more advanced use. The code does not rely on the Tensorflow Seq2Seq library as it was not entirely ready at the time of the project and I also wanted more flexibility (but adopts a similar interface).

## Data

To train our model, we’ll need labeled examples: images of formulas along with the LaTeX code used to generate the images. A good source of LaTeX code is arXiv, which has thousands of articles in .tex format. After applying some heuristics to find equations in the .tex files, and keeping only the ones that actually compile, the Harvard NLP group extracted $\sim 100,000$ formulas.

Wait… Don’t you have a problem as different LaTeX codes can give the same image?

Good point: (x^2 + 1) and \left( x^{2} + 1 \right) indeed give the same output. That’s why Harvard’s paper found out that normalizing the data using a parser (KaTeX) improved performance. It forces adoption of some conventions, like writing x ^ { 2 } instead of x^2, etc. After normalization, they end up with a .txt file containing one formula per line that looks like

From this file, we’ll produce images 0.png, 1.png, etc. and a matching file mapping the image files to the indices (=line numbers) of the formulas
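Concretely, the two files could look like this (the contents below are purely illustrative, chosen to match the small vocabulary shown further down):

```
\alpha + \frac { 1 } { 2 }
\beta + 1 + 2
```

and the matching file, pairing each image with the line index of its formula (format illustrative):

```
0.png 0
1.png 1
```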

The reason why we use this format is that it is flexible and allows you to use the pre-built dataset from Harvard (You may need to use the preprocessing scripts as explained here). You’ll also need to have pdflatex and ImageMagick installed.

We also build a vocabulary, to map LaTeX tokens to indices that will be given as input to our model. If we keep the same data as above, our vocabulary will look like

+ 1 2 \alpha \beta \frac { }

## Model

Our model is going to rely on a variation of the Seq2Seq model, adapted to images. First, let’s define the input of our graph. Not surprisingly we get as input a batch of black-and-white images of shape $[H,W]$ and a batch of formulas (ids of the LaTeX tokens):

A special note on the type of the image input. You may have noticed that we use tf.uint8. This is because our image is encoded in grey levels (integers from 0 to 255, and $2^8 = 256$). While we could feed a tf.float32 Tensor to Tensorflow, this would be 4 times more expensive in terms of memory bandwidth. As data starvation is one of the main bottlenecks of GPUs, this simple trick can save us some training time. For further improvement of the data pipeline, have a look at the new Tensorflow data pipeline.
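The factor of 4 is easy to verify with NumPy (the image size below is just an illustrative choice):

```python
import numpy as np

H, W = 60, 400  # an illustrative formula-image size
img_u8 = np.zeros((H, W), dtype=np.uint8)    # 1 byte per pixel
img_f32 = np.zeros((H, W), dtype=np.float32)  # 4 bytes per pixel

ratio = img_f32.nbytes // img_u8.nbytes  # float32 moves 4x the data
```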

### Encoder

**High-level idea** Apply some convolutional network on top of the image and flatten the output into a sequence of vectors $[e_1, \cdots, e_n]$, each corresponding to a region of the input image. These vectors will play the role of the hidden vectors of the LSTM that we used for translation.

Once our image is transformed into a sequence, we can use the seq2seq model!

Convolutional Encoder - produces a sequence of vectors

We need to extract features from our image, and for this, nothing has (yet) proven more effective than convolutions. Here, there is not much to say, except that we pick an architecture that has proven effective for Optical Character Recognition (OCR), stacking convolutional and max-pooling layers to produce a Tensor of shape $[H^{\prime},W^{\prime},512]$

Now that we have extracted some features from the image, let’s unfold it to get a sequence, so that we can use our sequence to sequence framework. We end up with a sequence of length $H^{\prime} \times W^{\prime}$

Don’t you lose a lot of structural information by reshaping? I’m afraid that when performing attention over the image, my decoder won’t be able to understand the location of each feature vector in the original image!

It turns out that the model manages to work despite this issue, but that’s not completely satisfying. In the case of translation, the hidden states of the LSTM contained some positional information that was computed by the LSTM (after all, LSTM are by essence sequential). Can we fix this issue?

**Positional Embeddings** I decided to follow the idea from Attention is All you Need of adding positional embeddings to the image representation (out); it has the huge advantage of not adding any new trainable parameter to our model. The idea is that for each position of the image, we compute a vector of size $512$ whose components are values of $\cos$ or $\sin$. More formally, the $(2i)$-th and $(2i+1)$-th entries of the positional embedding $v$ at position $p$ will be
\begin{align} v_{2i} &= \sin\left(p / f^{2i}\right)\\ v_{2i+1} &= \cos\left(p / f^{2i}\right)\\ \end{align}
where $f$ is some frequency parameter. Intuitively, because $\sin(a+b)$ and $\cos(a+b)$ can be expressed in terms of $\sin(a)$, $\sin(b)$, $\cos(a)$ and $\cos(b)$, there will be linear dependencies between the components of distant embeddings, allowing the model to extract relative positioning information. Good news: the Tensorflow code for this technique is available in the tensor2tensor library, so we can reuse the same function to transform our out.
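The Tensorflow call is not reproduced here (in tensor2tensor it is, to the best of my knowledge, `common_attention.add_timing_signal_nd`), but the embedding defined above is easy to sketch in NumPy; the value of $f$ and the dimension below are arbitrary:

```python
import numpy as np

def positional_embedding(p, d, f=1.15):
    """Embedding at position p with dimension d, following
    v_{2i} = sin(p / f^{2i}), v_{2i+1} = cos(p / f^{2i}).
    f is the frequency parameter (its value here is arbitrary)."""
    i = np.arange(d // 2)
    angles = p / f ** (2 * i)   # one angle per (sin, cos) pair
    v = np.empty(d)
    v[0::2] = np.sin(angles)    # even entries
    v[1::2] = np.cos(angles)    # odd entries
    return v

v0 = positional_embedding(0, 8)  # embedding at position 0
v1 = positional_embedding(1, 8)  # embedding at position 1
```

Different positions get distinct, deterministic vectors, so no extra parameters need to be learned.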

### Decoder

Now that we have a sequence of vectors $[e_1, \dots, e_n]$ that represents our input image, let’s decode it! First, let’s explain what variant of the Seq2Seq framework we are going to use.

**First hidden vector of the decoder’s LSTM** In the seq2seq framework, this is usually just the last hidden vector of the encoder’s LSTM. Here, we don’t have such a vector, so a good choice is to learn to compute it with a matrix $W$ and a vector $b$
$$h_0 = \tanh\left( W \cdot \left( \frac{1}{n} \sum_{i=1}^n e_i\right) + b \right)$$
In Tensorflow, this amounts to averaging the encoder vectors and applying a dense layer with a tanh activation.
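As a NumPy sketch of this initialization (sizes illustrative, weights random where the model would learn them):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, h_dim = 6, 512, 256               # illustrative sizes
e = rng.normal(size=(n, d))             # encoder vectors e_1 .. e_n
W = rng.normal(size=(h_dim, d)) * 0.01  # learned parameters (random here)
b = np.zeros(h_dim)

# h_0 = tanh(W . mean(e_i) + b)
h0 = np.tanh(W @ e.mean(axis=0) + b)
```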

**Attention Mechanism** We first need to compute a score $\alpha_{t^{\prime}}$ for each vector $e_{t^{\prime}}$ of the sequence. We use the following method
\begin{align} \alpha_{t’} &= \beta^T \tanh\left( W_1 \cdot e_{t’} + W_2 \cdot h_{t} \right)\\ \bar{\alpha} &= \operatorname{softmax}\left(\alpha\right)\\ c_t &= \sum_{t’=1}^n \bar{\alpha}_{t’} e_{t’}\\ \end{align}
This can be done in Tensorflow with the following code

Note that the line W1_e = tf.layers.dense(inputs=seq, units=512, use_bias=False) is common to every decoder time step, so we can just compute it once and for all. The dense layer with no bias is just a matrix multiplication.

Now that we have our attention vector, let’s just add a small modification and compute another vector $o_{t-1}$ (as in Luong, Pham and Manning) that we will use to make our final prediction and that we will feed as input to the LSTM at the next step. Here $w_{t-1}$ denotes the embedding of the token generated at the previous step.

$o_t$ passes some information about the distribution from the previous time step as well as the confidence it had for the predicted token

\begin{align} h_t &= \operatorname{LSTM}\left( h_{t-1}, [w_{t-1}, o_{t-1}] \right)\\ c_t &= \operatorname{Attention}([e_1, \dots, e_n], h_t)\\ o_{t} &= \tanh\left(W_3 \cdot [h_{t}, c_t] \right)\\ p_t &= \operatorname{softmax}\left(W_4 \cdot o_{t} \right)\\ \end{align}

In code, one decoding step chains these four operations.
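The equations above can be sketched end to end in NumPy, with a crude stand-in for the LSTM cell (a real cell also has gates and a memory); all parameter names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, h_dim, V = 4, 8, 8, 10               # illustrative sizes
e = rng.normal(size=(n, d))                # encoder vectors
W1 = rng.normal(size=(d, d)) * 0.1         # attention parameters
W2 = rng.normal(size=(d, h_dim)) * 0.1
beta = rng.normal(size=d) * 0.1
W3 = rng.normal(size=(h_dim, h_dim + d)) * 0.1
W4 = rng.normal(size=(V, h_dim)) * 0.1

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def attention(e, h):
    # alpha_{t'} = beta^T tanh(W1 e_{t'} + W2 h)
    alpha = np.array([beta @ np.tanh(W1 @ e_t + W2 @ h) for e_t in e])
    return softmax(alpha) @ e  # context vector c_t

def lstm_step(h_prev, x):
    # crude stand-in for the real LSTM update
    return np.tanh(h_prev + x[:h_prev.shape[0]])

def decode_step(h_prev, w_prev, o_prev):
    h_t = lstm_step(h_prev, np.concatenate([w_prev, o_prev]))
    c_t = attention(e, h_t)
    o_t = np.tanh(W3 @ np.concatenate([h_t, c_t]))
    p_t = softmax(W4 @ o_t)
    return h_t, o_t, p_t

h, o = np.zeros(h_dim), np.zeros(h_dim)
w = rng.normal(size=h_dim)  # embedding of the previous token
h, o, p = decode_step(h, w, o)
```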

If I read carefully, I notice that for the first step of the decoding process, we need to compute an $o_0$ too, right?

This is a good point, and we just use the same technique that we used to generate $h_0$ but with different weights.

## Training

We’ll need to create 2 different outputs in the Tensorflow graph: one for training (that uses the formula and feeds the ground truth at each time step, see part I) and one for test time (that ignores everything about the actual formula and uses the prediction from the previous step).

### AttentionCell

We’ll need to encapsulate the recurrent logic into a custom cell that inherits from RNNCell. Our custom cell will be able to call the LSTM cell (initialized in its __init__). It also has a special recurrent state that combines the LSTM state and the vector $o$ (as we need to pass it through). An elegant way is to define a namedtuple for this recurrent state:
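For instance (field names here are illustrative, not necessarily the ones used in the released code):

```python
from collections import namedtuple

# recurrent state combining the LSTM state and the vector o,
# so both are threaded through the recurrence together
AttentionState = namedtuple("AttentionState", ["lstm_state", "o"])

state = AttentionState(lstm_state=("c", "h"), o="o_vector")
```

A namedtuple works well here because Tensorflow can flatten and rebuild nested tuple states when running the recurrence.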

Then, to compute our output sequence, we just need to call the previous cell on the sequence of LaTeX tokens. We first produce the sequence of token embeddings to which we concatenate the special <sos> token. Then, we call dynamic_rnn.

### Loss

Code speaks for itself

and when iterating over the batches during training, train_op will be given to the tf.Session along with a feed_dict containing the data for the placeholders.

## Decoding in Tensorflow

Let’s have a look at the Tensorflow implementation of the Greedy Method before dealing with Beam Search

While greedy decoding is easy to conceptualize, implementing it in Tensorflow is not straightforward, as you need to use the previous prediction and can’t use dynamic_rnn on the formula. There are basically 2 ways of approaching the problem

1. Modify our AttentionCell and AttentionState so that AttentionState also contains the embedding of the predicted word at the previous time step,

This technique has a few downsides. It doesn’t use inputs (which used to be the embedding of the gold token from the formula and thus we would have to call dynamic_rnn on a “fake” sequence). Also, how do you know when to stop decoding, once you’ve reached the <eos> token?

2. Implement a variant of dynamic_rnn that would not run on a sequence but feed the prediction from the previous time step to the cell, while having a maximum number of decoding steps. This would involve delving deeper into Tensorflow, using tf.while_loop. That’s the method we’re going to use as it fixes all the problems of the first technique. We eventually want something that looks like

Much better, isn’t it? Now let’s see what GreedyDecoderCell and dynamic_decode look like.

### Greedy Decoder Cell

We first wrap the attention cell in a GreedyDecoderCell that takes care of the greedy logic for us, without having to modify the AttentionCell

### Dynamic Decode primitive

We need to implement a function dynamic_decode that will recursively call the above step function. We do this with a tf.while_loop that stops when all the hypotheses have reached <eos> or when time is greater than the maximum number of iterations.
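Stripped of the Tensorflow machinery, the loop that tf.while_loop implements looks like this plain-Python sketch (function and parameter names are illustrative):

```python
def dynamic_decode_sketch(step_fn, initial_state, sos_id, eos_id, max_iter):
    """Feed the previous prediction back into the cell,
    stopping on <eos> or after max_iter steps."""
    token, state, outputs = sos_id, initial_state, []
    for _ in range(max_iter):
        token, state = step_fn(token, state)
        outputs.append(token)
        if token == eos_id:
            break
    return outputs

# toy step function: counts up, then emits <eos> (id 99) after token 3
def toy_step(token, state):
    nxt = token + 1
    return (99 if nxt > 3 else nxt), state

decoded = dynamic_decode_sketch(toy_step, None, sos_id=0, eos_id=99, max_iter=10)
```

In the real graph-mode implementation, the loop state additionally carries the growing output arrays and a per-hypothesis "finished" flag.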

Some details using TensorArrays or nest.map_structure have been omitted for clarity but may be found on github

Notice that we place the tf.while_loop inside a scope named rnn. This is because dynamic_rnn does the same thing, and thus the weights of our LSTM are defined in that scope.

### Beam Search Decoder Cell

We can follow the same approach as in the greedy method and use dynamic_decode

Let’s create a new wrapper for AttentionCell in the same way we did for GreedyDecoderCell. This time, the code is going to be more complicated and the following is just for intuition. Note that when selecting the top $k$ hypotheses from the set of candidates, we must know which “beginning” they used (= parent hypothesis).

Look at github for the details. The main idea is that we add a beam dimension to every tensor, but when feeding it into AttentionCell we merge the beam dimension with the batch dimension. There is also some trickery involved to compute the parents and the new ids using modulos.

## Conclusion

I hope that you learned something with this post, either about the technique or Tensorflow. While the model achieves impressive performance (at least on short formulas with roughly 85% of the LaTeX being reconstructed), it still raises some questions that I list here:

How do we evaluate the performance of our model? We can use standard metrics from Machine Translation like BLEU to evaluate how good the decoded LaTeX is compared to the reference. We can also choose to compile the predicted LaTeX sequence to get the image of the formula, and then compare this image to the original. As a formula is a sequence, computing a pixel-wise distance wouldn’t really make sense. A good idea is proposed by Harvard’s paper: first, slice the images vertically; then, compute the edit distance between these slices…

How do we fix exposure bias? While beam search generally achieves better results, it is not perfect and still suffers from exposure bias: during training, the model is never exposed to its errors! It also suffers from a loss-evaluation mismatch: the model is optimized w.r.t. token-level cross entropy, while we are interested in the reconstruction of the whole sentence…

An Example of LaTeX generation - which one is the reference?