Embeddings in TensorFlow
To represent discrete values such as words in a machine learning algorithm, we need to transform every class into either a one-hot encoded vector or an embedding vector.
Using embeddings for sparse data often results in a more efficient representation than one-hot encoding. For example, a typical vocabulary for an NLP problem contains from 20,000 to 200,000 unique words, and it would be very inefficient to represent every word by a vector of tens of thousands of 0s and a single 1.
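As a rough, made-up illustration of the difference (the 50,000-word vocabulary and 300 dimensions below are just example numbers, not taken from any particular dataset):
vocab_size = 50000       # example vocabulary size (assumption, for illustration)
embedding_size = 300     # typical embedding dimensionality
n_tokens = 1000          # number of tokens we want to represent
bytes_per_float = 4      # float32
one_hot_mb = n_tokens * vocab_size * bytes_per_float / 1e6
embedded_mb = n_tokens * embedding_size * bytes_per_float / 1e6
print('one-hot :', one_hot_mb, 'MB')    # 200.0 MB
print('embedded:', embedded_mb, 'MB')   # 1.2 MB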
Embeddings can also be "trained" by an optimizer so that the distances between them reflect semantic similarities between words. For example, a model using trained embeddings can make a reasonable prediction for test examples containing words it never saw during training, based on their similarity to words it did see.
In this post, I'll show and describe use cases of embeddings with Python and TensorFlow.
%env CUDA_VISIBLE_DEVICES=''
import tensorflow as tf
import numpy as np
import pandas as pd
We will use words in a text as the running example for embeddings, but it is important to point out that embeddings can represent discrete values other than words. Before we feed text into a machine learning model, we need to pre-process it, and the first step is often tokenization. Let's split the text to create example tokens (words).
text = 'My cat is a great cat'
tokens = text.lower().split()
print('Words in our text:', tokens)
Define the vocabulary out of the tokens:
vocab = set(tokens)
vocab = pd.Series(range(len(vocab)), index=vocab)
vocab
To convert such text to one-hot vectors, we can just use Pandas or any other Python library.
pd.get_dummies(tokens)
One-hot encoding in Python and then sending the result to TensorFlow can be very inefficient. It is better to use the built-in TensorFlow operation tf.one_hot for that. It expects an integer representation of every class and the total number of classes. For our words example, we need to assign a unique integer to every unique word.
word_ids = vocab.loc[tokens].values
word_ids
One-Hot with TensorFlow
We pass only integers instead of potentially huge vectors to TensorFlow, and it internally converts the integers to one-hot vectors.
inputs = tf.placeholder(tf.int32, [None])
# TensorFlow has an operation for one-hot encoding
one_hot_inputs = tf.one_hot(inputs, len(vocab))
transformed = tf.Session().run(one_hot_inputs, {inputs: word_ids})
transformed
Embeddings with TensorFlow
With the embeddings representation, every word is transformed into a vector of real numbers of a chosen length (embedding_size). This example uses embedding_size = 3 so that the embedding vectors are easy to print: every word is represented by a vector of 3 real numbers. In practice, a common embedding size for words is 200 or 300.
The tensor embeddings is a two-dimensional matrix of type tf.float32 with len(vocab) rows and embedding_size columns. The method tf.nn.embedding_lookup converts our inputs from integers representing words to vectors from the embeddings matrix, where every input integer is the index of a row in embeddings. Every row of the embeddings matrix is a vector representing a word, so every word becomes a point in an embedding_size-dimensional space. The tensor embeddings has a random initialization, so its content will be different every time, and by default the embeddings will not represent any relationship (such as semantic or syntactic similarity) between words.
The example below transforms our text of six words into a 6x3 array. The word "cat" is both the second and the last word in the text, so the second resulting vector is the same as the last one.
embedding_size = 3
inputs = tf.placeholder(tf.int32, [None], name='word_ids')
# This is where the embedding vectors live
# It will be modified by the optimizer during training unless trainable=False
# I chose a random normal initialization but you can try other distributions
embeddings = tf.Variable(tf.random_normal(shape=(len(vocab), embedding_size)))
# Look up the embedding vector for every word id in the inputs
embedded = tf.nn.embedding_lookup(embeddings, inputs)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
transformed = sess.run(embedded, {inputs: word_ids})
transformed
Here is the content of the embeddings matrix. It has only five rows because there are five unique words in the vocabulary.
sess.run(embeddings)
The method tf.nn.embedding_lookup performs an index-to-row lookup. If we pass [0, 2] as inputs to the lookup, we get the first and the third rows of the embeddings matrix back.
sess.run(embedded, {inputs: [0, 2]})
Pre-trained embeddings
When the tensor embeddings is created, it is randomly initialized, so the distances between words are also random. In order to have meaningful similarities between the vectors, we can train them with models like word2vec or use pre-trained vectors.
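Before moving to pre-trained vectors, here is a minimal sketch of what "training the embeddings" means in TensorFlow. It is not word2vec, just a toy classifier with made-up labels and hyperparameters (the sketch_* names are mine): the point is that the embedding matrix is a regular trainable variable, so the optimizer updates it together with the rest of the model.
# Minimal sketch (not word2vec): the embedding matrix is a trainable
# variable updated by the optimizer along with a toy classifier.
sketch_inputs = tf.placeholder(tf.int32, [None])
sketch_labels = tf.placeholder(tf.float32, [None, 1])
sketch_embeddings = tf.Variable(tf.random_normal((len(vocab), 3)))
sketch_embedded = tf.nn.embedding_lookup(sketch_embeddings, sketch_inputs)
# toy prediction head: one logit per token
logits = tf.layers.dense(sketch_embedded, 1)
loss = tf.losses.sigmoid_cross_entropy(sketch_labels, logits)
train_op = tf.train.AdamOptimizer(0.1).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
before = sess.run(sketch_embeddings)
sess.run(train_op, {sketch_inputs: word_ids, sketch_labels: [[1.]] * len(word_ids)})
after = sess.run(sketch_embeddings)
print('embeddings changed by training:', not np.allclose(before, after))
The rest of this post takes the second route and uses pre-trained vectors.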
An easy way to get pre-trained vectors is with a package called chakin. The project lives on GitHub. To install it:
pip install chakin
To list the available pre-trained word vectors:
import chakin
chakin.search(lang='English')
To get Facebook's fastText model:
chakin.download(number=2, save_dir='.')
It downloads a .vec file into the current directory. A .vec file is a text file similar to a .csv file: the first line contains two numbers, the vocabulary size and the dimensionality of the vectors, and every following line is a token followed by its numbers, separated by spaces. I just load the file with pd.read_csv, but there may be other, more efficient ways to do it.
import csv
with open('wiki.en.vec') as f:
    rows, cols = f.readline().strip().split(' ')
vectors = pd.read_csv(
    'wiki.en.vec', sep=' ', skiprows=1, header=None, index_col=0,
    quoting=csv.QUOTE_NONE, encoding='utf-8')
# remove one junk column
vectors = vectors.dropna(axis=1)
assert vectors.shape == (int(rows), int(cols))
vectors.head()
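As mentioned, there may be more efficient ways to read the file. For reference, here is a plain-Python sketch (my own, not from fastText) that parses it line by line into a dictionary of word to vector; whether it is faster depends on your setup, and it assumes tokens contain no spaces.
import numpy as np
def load_vec(path):
    # Parse a fastText .vec file into a dict of word -> numpy vector (sketch).
    word_vectors = {}
    with open(path, encoding='utf-8') as f:
        n_words, n_dims = map(int, f.readline().split())  # header: vocab size, dimensions
        for line in f:
            parts = line.rstrip().split(' ')
            word_vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return word_vectors
# word_vectors = load_vec('wiki.en.vec')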
Every word in this model is represented by a vector of 300 numbers. Here is the vector for the word "car":
vectors.loc[['car']]
To demonstrate similarity, we can use cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity
print('Similarity:')
print(' bus to car =', cosine_similarity(vectors.loc[['bus']], vectors.loc[['car']])[0][0])
print(' bus to dog =', cosine_similarity(vectors.loc[['bus']], vectors.loc[['dog']])[0][0])
print(' dog to cat =', cosine_similarity(vectors.loc[['dog']], vectors.loc[['cat']])[0][0])
print(' cat to car =', cosine_similarity(vectors.loc[['cat']], vectors.loc[['car']])[0][0])
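Going one step further, the same cosine similarity can rank every word in the fastText vocabulary against a query word. A small sketch (most_similar is a hypothetical helper of mine, and running it over the full wiki.en matrix is memory- and compute-heavy):
from sklearn.metrics.pairwise import cosine_similarity
def most_similar(word, k=5):
    # cosine similarity between the query word and every row of the matrix
    sims = cosine_similarity(vectors, vectors.loc[[word]]).ravel()
    return pd.Series(sims, index=vectors.index).drop(word).nlargest(k)
most_similar('car')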
In order to use these vectors for our vocabulary, we need to put them in the same order.
pretrained_embeddings = vectors.loc[vocab.index, :]
pretrained_embeddings
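This works because every word in our tiny vocabulary exists in the pre-trained model. If some words were missing, one common workaround is to reindex and fill the out-of-vocabulary rows with small random vectors; a sketch of that (equivalent to the cell above when nothing is missing):
# Reindex so that words missing from the pre-trained model become NaN rows,
# then fill those rows with small random vectors (sketch only).
pretrained_embeddings = vectors.reindex(vocab.index)
missing = pretrained_embeddings.index[pretrained_embeddings.isnull().any(axis=1)]
print('Out-of-vocabulary words:', list(missing))
if len(missing) > 0:
    pretrained_embeddings.loc[missing] = np.random.normal(
        scale=0.1, size=(len(missing), vectors.shape[1]))
pretrained_embeddings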
Using pre-trained Embeddings with TensorFlow
Instead of random values, I can now initialize embeddings with pretrained_embeddings.
inputs = tf.placeholder(tf.int32, [None])
# initialize the embedding matrix from the pre-trained vectors instead of random values
embeddings = tf.Variable(pretrained_embeddings.values)
embedded = tf.nn.embedding_lookup(embeddings, inputs)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
transformed = sess.run(embedded, {inputs: word_ids})
transformed
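One last detail: by default this embeddings variable will keep being updated by an optimizer during training. If you want to keep the pre-trained vectors fixed, you can mark the variable as non-trainable; a minimal sketch, reusing pretrained_embeddings from above:
# Freeze the pre-trained vectors so an optimizer leaves them untouched;
# drop trainable=False to fine-tune them instead.
inputs = tf.placeholder(tf.int32, [None])
embeddings = tf.Variable(
    pretrained_embeddings.values, dtype=tf.float32, trainable=False)
embedded = tf.nn.embedding_lookup(embeddings, inputs)
# the frozen matrix no longer shows up among the trainable variables
print(any(v is embeddings for v in tf.trainable_variables()))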