Embeddings with TensorFlow

Embeddings in TensorFlow

To represent discrete values such as words to a machine learning algorithm, we need to transform every class to a one-hot encoded vector or to an embedding vector.

Using embeddings for a sparse data often results in more efficient representation as compared to the one-hot encoding approach. For example, a typical vocabulary size for NLP problems is usually from 20,000 to 200,000 unique words. It will be very inefficient to represent every word by a vector of thousands of 0s and only one 1.

Embeddings can also be "trained" by an optimizer to have different similarities which could represent semantic similarities between words. For example, a model using trained embeddings could predict a test dataset with words unseen before in the training dataset and still have a logical inference based on similar words before seen when training.

In this post, I'll show and describe use cases of embeddings with Python and TensorFlow.

In [1]:
%env CUDA_VISIBLE_DEVICES=''
import tensorflow as tf
import numpy as np
import pandas as pd
env: CUDA_VISIBLE_DEVICES=''

We will use words in the text as a use case for demonstrating using embeddings, but it is important to point out that embeddings can be used to represent discrete values other than words. Before we feed a text into a machine learning model, we need to pre-process it and the first step is often tokenization. Let's split the text to create example tokens (words).

In [2]:
text = 'My cat is a great cat'
tokens = text.lower().split()
print('Words in the our text:', tokens)
Words in the our text: ['my', 'cat', 'is', 'a', 'great', 'cat']

Define the vocabulary out of the tokens:

In [3]:
vocab = set(tokens)
vocab = pd.Series(range(len(vocab)), index=vocab)
vocab
Out[3]:
great    0
cat      1
my       2
a        3
is       4
dtype: int64

To convert such text to one-hot vectors we can just use Pandas or any other library in python.

In [4]:
pd.get_dummies(tokens)
Out[4]:
a cat great is my
0 0 0 0 0 1
1 0 1 0 0 0
2 0 0 0 1 0
3 1 0 0 0 0
4 0 0 1 0 0
5 0 1 0 0 0

One-hot encoding in Python and then sending it to TensorFlow could be very inefficient. It will be better to use the built-in method in TensorFlow tf.one_hot for that. It expects integer representation for every class and the total number of classes. For our words example, we need to assign a unique integer for every unique word.

In [5]:
word_ids = vocab.loc[tokens].values
word_ids
Out[5]:
array([2, 1, 4, 3, 0, 1])

One-Hot with TensorFlow

We are only passing integers instead of potentially huge vectors to TensorFlow and it will internally convert the integers to one-hot vectors.

In [6]:
inputs = tf.placeholder(tf.int32, [None])

# TensorFlow has an operation for one-hot encoding
one_hot_inputs = tf.one_hot(inputs, len(vocab))

transformed = tf.Session().run(one_hot_inputs, {inputs: word_ids})
transformed
Out[6]:
array([[0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]], dtype=float32)

Embeddings with TensorFlow

With embeddings representation, every word will be transformed into a vector of real numbers with a chosen length (embedding_size).

This example is created with embedding_size = 3 in order to easily output the embeddings vectors. It means that every word is represented by a vector of 3 real numbers. In practice, a common size for word embedding size is 200 or 300.

The tensor embeddings is a two dimensional matrix of type tf.float32 with len(vocab) rows and embedding_size columns. The method tf.nn.embedding_lookup converts our inputs from integers representing words to vectors form the embeddings matrix where every input integer is the index of a row from the embeddings. Every row of the embeddings matrix is a vector representing a word so every word will be represented as a point in embedding_size dimensional space. The tensor embeddings has a random initialization so its content will be different every time and by default, the embeddings will not represent any relationship (as syntactic similarities) between them.

The example below transorms our text of six words to a 6x3 array. The second word "cat" is the second and the last word in the text so the second resulting vector is the same as the last one.

In [7]:
embedding_size = 3

inputs = tf.placeholder(tf.int32, [None], name='word_ids')

# This is where the embedding vectors live
# This will be modified by the optimization unless trainable=False
# I choose random normal distribution but you can try other distributions
embeddings = tf.random_normal(shape=(len(vocab), embedding_size))

# this will return the embedding lookup
embedded = tf.nn.embedding_lookup(embeddings, inputs)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
transformed = sess.run(embedded, {inputs: word_ids})
transformed
Out[7]:
array([[-0.02937758, -0.60863554,  1.1070673 ],
       [ 0.5732564 , -0.7388431 ,  0.30292028],
       [-0.21285258, -1.8346152 , -1.9110047 ],
       [ 0.6378179 ,  1.3454263 , -0.8725002 ],
       [-0.89525574, -0.79854083, -0.40036395],
       [ 0.5732564 , -0.7388431 ,  0.30292028]], dtype=float32)

Here is the content of the embeddings matrix. It has only five rows because of there are five unique words in the vocabulary.

In [8]:
sess.run(embeddings)
Out[8]:
array([[ 2.1316075 , -1.061635  ,  0.7775889 ],
       [ 1.5460204 , -0.6848818 ,  0.21591993],
       [-1.3386676 ,  1.6173289 ,  0.7851124 ],
       [ 0.3835993 ,  0.2665336 , -0.32107848],
       [ 0.08313944,  1.1281649 , -1.151932  ]], dtype=float32)

The method tf.nn.embedding_lookup index to column lookup. If we pass [0, 2] as inputs for the lookup we'll get the first and third row from the embeddings back.

In [9]:
sess.run(embedded, {inputs: [0, 2]})
Out[9]:
array([[-0.26338172,  0.06488156, -0.14654125],
       [ 0.44013414,  0.22667229,  0.5294355 ]], dtype=float32)

Pretrained embedings

When the tensor embeddings is created, will be initialised by a random initialization and the distance between words will also have random values. In order to have any similarity between the vectors, we can train them with models like word2vec or use pre-trained vectors.

An easy way to get pre-train vectors is with a package called chakin. The project lives here. To install it:

pip install chakin

To list the available pre-trained word vectors:

In [10]:
import chakin

chakin.search(lang='English')
                   Name  Dimension                     Corpus VocabularySize  \
2          fastText(en)        300                  Wikipedia           2.5M   
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K   
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K   
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K   
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K   
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M   
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M   
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M   
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M   
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M   
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M   
21  word2vec.GoogleNews        300          Google News(100B)           3.0M   

      Method Language    Author  
2   fastText  English  Facebook  
11     GloVe  English  Stanford  
12     GloVe  English  Stanford  
13     GloVe  English  Stanford  
14     GloVe  English  Stanford  
15     GloVe  English  Stanford  
16     GloVe  English  Stanford  
17     GloVe  English  Stanford  
18     GloVe  English  Stanford  
19     GloVe  English  Stanford  
20     GloVe  English  Stanford  
21  word2vec  English    Google  

To get the Facebooks's fastText model:

In [11]:
chakin.download(number=2, save_dir='.', )
Test: 100% ||                                       | Time: 0:03:12  32.6 MiB/s
Out[11]:
'./wiki.en.vec'

It downloads a .vec file in the current directory. The .vec files are text files similar to .csv files. The first column has two numbers for the number of vocabulary and dimensions of the vectors and the reset is token and numbers separated by a space. I just load the file with pd.read_csv but there may be other more efficient ways to do it.

In [12]:
import csv

with open('wiki.en.vec') as f:
    rows, cols = f.readline().strip().split(' ')

vectors = pd.read_csv(
    'wiki.en.vec', sep=' ', skiprows=1, header=None, index_col=0,
    quoting=csv.QUOTE_NONE, encoding='utf-8')

# remove one junk column
vectors = vectors.dropna(axis=1)
assert vectors.shape == (int(rows), int(cols))
vectors.head()
Out[12]:
1 2 3 4 5 6 7 8 9 10 ... 291 292 293 294 295 296 297 298 299 300
0
, -0.023167 -0.004248 -0.105720 0.042783 -0.143160 -0.078954 0.078187 -0.194540 0.022303 0.312070 ... 0.046595 -0.11558 0.044184 -0.023124 0.025860 -0.116530 0.010936 0.089398 -0.01590 0.148660
. -0.111120 -0.001386 -0.177800 0.064508 -0.240370 0.031087 -0.030144 -0.368830 -0.043855 0.248310 ... 0.095332 -0.21914 -0.042760 -0.136850 0.097470 -0.218180 -0.058233 0.063374 -0.12161 0.039339
the -0.065334 -0.093031 -0.017571 0.200070 0.029521 -0.039920 -0.163280 -0.072946 0.089604 0.080907 ... 0.064944 -0.21673 -0.037683 0.081860 -0.039891 -0.051334 -0.101650 0.166420 -0.13079 0.035397
</s> 0.050258 -0.073228 0.435810 0.174830 -0.185460 -0.399210 -0.507670 -0.506600 -0.155570 0.031451 ... -0.096853 -0.47723 -0.027511 0.259640 -0.010468 -0.298150 -0.236090 0.205250 0.75183 0.097156
of 0.048804 -0.285280 0.018557 0.205770 0.060704 0.085446 -0.036267 -0.068373 0.145070 0.178520 ... 0.169560 -0.33677 -0.060286 0.086097 -0.065001 0.004833 -0.100960 0.139100 -0.13714 -0.039705

5 rows × 300 columns

Every word in this model is represented by a vector of 300 number. Here is the vector for the word "car":

In [13]:
vectors.loc[['car']]
Out[13]:
1 2 3 4 5 6 7 8 9 10 ... 291 292 293 294 295 296 297 298 299 300
0
car -0.092271 -0.14855 -0.14696 0.013 -0.40305 -0.31004 0.1022 -0.42087 -0.22948 0.12853 ... 0.096352 0.031328 0.31818 -0.18818 0.14998 -0.18162 -0.35564 0.28245 -0.18557 -0.060884

1 rows × 300 columns

To demonstrate similarity, we can use cosine distance.

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

print('Similarity:')
print('   bus to car =', cosine_similarity(vectors.loc[['bus']], vectors.loc[['car']])[0][0])
print('   bus to dog =', cosine_similarity(vectors.loc[['bus']], vectors.loc[['cat']])[0][0])
print('   dog to cat =', cosine_similarity(vectors.loc[['dog']], vectors.loc[['cat']])[0][0])
print('   cat to car =', cosine_similarity(vectors.loc[['dog']], vectors.loc[['bus']])[0][0])
Similarity:
   bus to car = 0.46000568082170107
   bus to dog = 0.17041678778979216
   dog to cat = 0.6380517245741392
   cat to car = 0.15984959272698984

In order to use this vectors for our vocabulary, we need to put them in the same order.

In [15]:
pretrained_embeddings = vectors.loc[vocab.index, :]
pretrained_embeddings
Out[15]:
1 2 3 4 5 6 7 8 9 10 ... 291 292 293 294 295 296 297 298 299 300
0
great -0.267620 -0.081004 -0.21283 0.380140 -0.156980 -0.167050 0.370720 0.108050 0.293960 0.159460 ... -0.115460 -0.10987 0.285300 -0.071547 -0.253600 -0.101550 -0.013372 0.105520 0.037726 0.253640
cat -0.138190 0.140290 -0.32621 0.116240 -0.198060 0.455260 0.212820 -0.512560 0.033657 0.154290 ... 0.151850 -0.32703 0.102690 -0.053309 -0.068975 -0.006616 -0.066738 0.273190 0.520030 -0.008721
my -0.126860 0.152880 0.14903 0.039269 -0.130520 -0.038069 -0.162300 -0.002766 0.121190 0.142020 ... 0.062873 0.14942 -0.146160 -0.210860 0.321350 -0.037258 -0.060301 0.419110 0.032854 -0.123030
a 0.115590 0.301920 -0.11465 0.010010 -0.032187 -0.107550 0.060674 -0.104770 0.174880 0.008112 ... -0.020257 -0.18694 -0.065594 -0.202230 -0.122180 -0.297980 0.034272 0.110480 0.130740 0.041164
is 0.035927 0.145170 0.11926 0.078836 -0.047748 0.100960 0.090815 -0.221760 -0.095085 -0.022610 ... 0.040324 -0.27410 -0.116330 -0.089418 -0.072754 -0.260430 0.084246 -0.001608 0.170800 -0.035512

5 rows × 300 columns

Using pre-trained Embeddings with TensorFlow

Instead of random values I can now initialize the embeddings with pretrained_embeddings.

In [16]:
inputs = tf.placeholder(tf.int32, [None])

embeddings = tf.Variable(pretrained_embeddings.values)

embedded = tf.nn.embedding_lookup(embeddings, inputs)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
transformed = sess.run(embedded, {inputs: word_ids})
transformed
Out[16]:
array([[-0.12686  ,  0.15288  ,  0.14903  , ...,  0.41911  ,  0.032854 ,
        -0.12303  ],
       [-0.13819  ,  0.14029  , -0.32621  , ...,  0.27319  ,  0.52003  ,
        -0.0087214],
       [ 0.035927 ,  0.14517  ,  0.11926  , ..., -0.0016082,  0.1708   ,
        -0.035512 ],
       [ 0.11559  ,  0.30192  , -0.11465  , ...,  0.11048  ,  0.13074  ,
         0.041164 ],
       [-0.26762  , -0.081004 , -0.21283  , ...,  0.10552  ,  0.037726 ,
         0.25364  ],
       [-0.13819  ,  0.14029  , -0.32621  , ...,  0.27319  ,  0.52003  ,
        -0.0087214]])

Training your embeddings

There are several models for "similarity training" of embeddings. The most popular are:

Comments

Comments powered by Disqus