In this article, we build a deep learning model for automatic translation from English to Russian using TensorFlow and Keras.

## Introduction

Google Translate works so well, it often seems like magic. But it’s not magic — it’s deep learning!

In this series of articles, we’ll show you how to use deep learning to create an automatic translation system. This series can be viewed as a step-by-step tutorial that helps you understand and build a neural machine translation system.

This series assumes that you are familiar with the concepts of machine learning: model training, supervised learning, neural networks, as well as artificial neurons, layers, and backpropagation.

In the previous article, we installed all the tools required to develop an automatic translation system, and defined the development workflow. In this article, we’ll go ahead and build our AI language translation system.

We’ll need to write very few lines of code because, for most of the logic, we’ll use ready-made Keras components.

If you'd like to see the final code we end up with, it's available in this Python notebook.

## Importing Libraries

As a start, we need to load the required libraries:

```python
import warnings
warnings.filterwarnings("ignore")
import tensorflow as tf
import numpy as np
import string
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
```

## Building Model Components

Building our model with Keras is very straightforward. We'll start by creating our model using the Sequential model provided by Keras.

```python
model = Sequential()
```

Next, we add a long short-term memory (LSTM) layer. In Keras' LSTM class, most parameters of an LSTM cell have default values, so the only thing we need to explicitly define is the dimensionality of the output: the number of LSTM cells that will be created for our sequence-to-sequence recurrent neural network (RNN).

The size of the input vector is the total number of tokens in the original sentence. Because we’re using an embedding over tokenized words, a single word can be split into several subtokens, which increases the length of the input sequence.
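To make this concrete, here is a toy illustration (the splits below are invented for the example; real subword tokenizers learn their splits from data):

```python
# Hypothetical subword splits, for illustration only.
splits = {"unbelievable": ["un", "believ", "able"], "results": ["results"]}

sentence = ["unbelievable", "results"]
tokens = [piece for word in sentence for piece in splits[word]]
print(tokens)  # 2 words become 4 tokens: ['un', 'believ', 'able', 'results']
```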

To keep our model size manageable (and therefore ensure we can train it in a reasonable amount of time), we set the output dimensionality to 512. We add two LSTM layers: the first is the encoder, and the second is the decoder.

```python
model.add(LSTM(512))
model.add(RepeatVector(LEN_EN))
model.add(LSTM(512))
```

Note that we've added a RepeatVector layer in the middle. It repeats the encoder's output `LEN_EN` times so the decoder receives a full sequence rather than a single vector; the attention mechanism we'll add shortly will sit alongside it.
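To see what RepeatVector actually does, here is a minimal NumPy sketch of the same operation: it tiles a single vector along a new time axis.

```python
import numpy as np

def repeat_vector(x, n):
    # Mimics Keras' RepeatVector: (batch, features) -> (batch, n, features)
    return np.repeat(x[:, np.newaxis, :], n, axis=1)

encoded = np.arange(4, dtype="float32").reshape(1, 4)  # one encoded sentence
repeated = repeat_vector(encoded, 3)
print(repeated.shape)  # (1, 3, 4): the same vector, repeated 3 times
```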

Next, we add a Dense layer to our model. This layer takes all the output neurons from the previous layer; we need it because we’re making predictions: we want the Russian sentence with the maximal score for the input English sentence. The Dense layer applies a softmax to the LSTM outputs, turning the scores into a probability distribution over the output vocabulary.

```python
model.add(Dense(LEN_RU, activation='softmax'))
```

`LEN_RU` is the size of the output vector; the same goes for the variable `LEN_EN` (we will compute both parameters later on).
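To make the softmax step concrete, here is a NumPy sketch over a toy Russian vocabulary (the vocabulary, its size, and the scores are illustrative assumptions, not values from our model):

```python
import numpy as np

# Illustrative only: LEN_RU as the size of a toy Russian output vocabulary
ru_vocab = ["кот", "сидел", "на", "ковре"]
LEN_RU = len(ru_vocab)                      # 4 in this toy example

def softmax(z):
    e = np.exp(z - z.max())                 # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5, 0.1, -1.0])    # raw outputs for one decoding step
probs = softmax(scores)                     # probabilities over ru_vocab
best = ru_vocab[int(np.argmax(probs))]      # the highest-scoring token
print(best)  # кот
```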

Here's how our model should look so far:

```python
model = Sequential()
model.add(LSTM(512))
model.add(RepeatVector(LEN_EN))
model.add(LSTM(512))
model.add(Dense(LEN_RU, activation='softmax'))
rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')
```

We are using a Keras optimizer called RMSprop, a variant of gradient descent that adapts each weight's step size using a moving average of its recent squared gradients.
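As a rough sketch of the idea (the standard RMSprop update rule on a toy problem, not Keras' internal code):

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.001, rho=0.9, eps=1e-7):
    # cache: moving average of squared gradients
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # large recent gradients shrink the effective step; small ones enlarge it
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

w, cache = np.array([1.0]), np.array([0.0])
for _ in range(3):
    grad = 2.0 * w              # gradient of the toy loss w**2
    w, cache = rmsprop_step(w, grad, cache)
print(w)  # w moves toward 0, the minimum of w**2
```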

We still need to add the embedding layer, as well as include an attention layer between the encoder and the decoder.

The embedding layer is created with Word2Vec; it is, in fact, a pretrained embedding layer. We now need to generate the Word2Vec weights matrix (the weights of the layer's neurons) and fill a standard Keras Embedding layer with that matrix.

We can use the `gensim` package to obtain the embedding layer automatically:

```python
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # toy corpus used to fit Word2Vec below
```

Then, we create our Word2Vec embedding layer:

```python
model_w2v = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
```

The embedding layer can then be retrieved as follows:

```python
model_w2v.wv.get_keras_embedding(train_embeddings=False)
```
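Under the hood, an embedding layer is just a lookup into the Word2Vec weights matrix. Here is a NumPy sketch (random weights stand in for `model_w2v.wv.vectors`; the vocabulary size is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 12, 100                          # e.g. 12 words, 100-dim vectors
weights = rng.standard_normal((vocab_size, dim))   # stand-in for model_w2v.wv.vectors

token_ids = np.array([3, 1, 7])                    # a tokenized 3-word sentence
embedded = weights[token_ids]                      # lookup: one row per token
print(embedded.shape)  # (3, 100)
```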

We can call the `model.summary()` function to get an overview of our model:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, None, 100)         1200
_________________________________________________________________
lstm_1 (LSTM)                (None, 512)               1255424
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 8, 512)            0
_________________________________________________________________
lstm_2 (LSTM)                (None, 512)               2099200
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656
=================================================================
Total params: 3,618,480
Trainable params: 3,617,280
Non-trainable params: 1,200
_________________________________________________________________
```

## Adding Attention Mechanism

Now we want to add an attention mechanism. We could write it from scratch, but a simpler solution is to use an existing Keras-compatible module, such as the `keras-self-attention` package.

Let’s import this module:

```python
from keras_self_attention import SeqSelfAttention
```

Now we will add the imported module between the two LSTM blocks:

```python
model.add(SeqSelfAttention(attention_activation='sigmoid'))
```
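As a rough intuition for what this layer computes, here is a simplified dot-product sketch (not `SeqSelfAttention`'s exact additive formula): each time step is scored against every other, the scores are squashed with a sigmoid (our `attention_activation`), normalized, and used to re-weight the sequence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_attention(h):
    # h: (timesteps, features)
    scores = sigmoid(h @ h.T)                             # (T, T) pairwise scores
    weights = scores / scores.sum(axis=1, keepdims=True)  # normalize each row
    return weights @ h                                    # re-weighted sequence

h = np.random.default_rng(1).standard_normal((8, 512))
out = self_attention(h)
print(out.shape)  # (8, 512): same shape in, same shape out
```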

Our model is now complete.

## Putting the Model Together

Here is the final code of our neural network in Keras:

```python
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import string
from numpy import array, argmax, random, take
import tensorflow as tf
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
from keras_self_attention import SeqSelfAttention

model = Sequential()
model_w2v = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4)
model.add(model_w2v.wv.get_keras_embedding(train_embeddings=False))
model.add(LSTM(512))
model.add(RepeatVector(8))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(LSTM(512))
# LEN_RU: size of the output vector (we compute this parameter later in the series)
model.add(Dense(LEN_RU, activation='softmax'))
rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')
model.summary()
```

After we run the code, we get the following output:

```
[root@ids ~]# python3 NMT.py
Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, None, 100)         1200
_________________________________________________________________
lstm_1 (LSTM)                (None, 512)               1255424
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 8, 512)            0
_________________________________________________________________
seq_self_attention_1 (SeqSel (None, 8, 512)            32833
_________________________________________________________________
lstm_2 (LSTM)                (None, 512)               2099200
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656
=================================================================
Total params: 3,651,313
Trainable params: 3,650,113
Non-trainable params: 1,200
```

Although our model code works well as-is, enclosing the model-creation code in a function would make it easier to reuse. You don't *have* to do this, but to get an idea of how it might look, see the final translator code in the notebook we mentioned earlier.

## Next Steps

Now our model is ready. In the next article, we’ll train and test this model. Stay tuned!