Word2Vec with TensorFlow
Word2Vec with Skip-Gram and TensorFlow¶
This is a tutorial and a basic example for getting started with the word2vec model by Mikolov et al. It is used for learning vector representations of words, called "word embeddings". For more information about embeddings, read my previous post.
The word2vec model can be trained with two different objectives:¶
- Continuous Bag-of-Words (CBOW): predicts target words (e.g. 'mat') from source context words ('the cat sits on the')
- Skip-Gram: predicts source context words from the target word (the inverse of CBOW)
Skip-Gram tends to do better in practice, and this tutorial implements word2vec with skip-grams.¶
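To make the Skip-Gram objective concrete, here is a minimal sketch (not part of the original notebook) that generates (target, context) training pairs from the example sentence above, using a small context window:
sentence = ['the', 'cat', 'sits', 'on', 'the', 'mat']
window = 2
pairs = []
for i, target in enumerate(sentence):
    # every word within `window` positions of the target becomes a context word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))
print(pairs[:5])
# [('the', 'cat'), ('the', 'sits'), ('cat', 'the'), ('cat', 'sits'), ('cat', 'on')]
The model is then trained to predict the context word (second element) given the target word (first element).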
The goal is to train the model's embeddings layer so that words with similar meanings end up close to each other in their N-dimensional vector representation. The model has two layers: the embeddings layer and a linear layer. Because the last layer is linear, the distance between embedding vectors is linearly related to the distance in meaning between the corresponding words. In other words, we can do arithmetic like this with the vectors: [king] - [man] + [woman] ~= [queen]
%env CUDA_VISIBLE_DEVICES=0
import time
import numpy as np
import pandas as pd
import tensorflow as tf
import sklearn
import nltk
Dataset¶
To train a word2vec model, we need a large text corpus. This example uses the text from the "20 newsgroups dataset". The dataset contains 11314 messages from newsgroups, with corresponding labels for their topics. We just merge all the messages together and ignore the labels. In practice, it's better to use a larger, domain-specific corpus. Lowercasing the text is optional, but recommended when working with a small corpus.
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
text = ' '.join(data.data).lower()
text[100:350]
Sentence Tokenize¶
The skip-grams will work better if they are created from text that is split into sentences. nltk.sent_tokenize
will break a string into a list of sentences.
sentences_text = nltk.sent_tokenize(text)
len(sentences_text)
Word Tokenize¶
Next, break all sentences into tokens (words) with nltk.word_tokenize.
sentences = [nltk.word_tokenize(s) for s in sentences_text]
print(sentences[10])
Vocabulary (unique words)¶
In this example, we filter out words that appear fewer than 5 times in the text, as well as stop words and punctuation.
from collections import Counter
from string import punctuation
from nltk.corpus import stopwords
min_count = 5
puncs = set(punctuation)
stops = set(stopwords.words('english'))
flat_words = []
for sentence in sentences:
    flat_words += sentence
counts = Counter(flat_words)
counts = pd.DataFrame(counts.most_common())
counts.columns = ['word', 'count']
counts = counts[counts['count'] >= min_count]
counts = counts[~counts['word'].isin(puncs)]
counts = counts[~counts['word'].isin(stops)]
vocab = pd.Series(range(len(counts)), index=counts['word']).sort_index()
print('The vocabulary has:', len(vocab), 'words')
Filter tokens not in vocabulary¶
Some words were excluded from the vocabulary because they are too rare or too common to carry useful signal. We have to remove them from our sentences.
filtered_sentences = []
for sentence in sentences:
    sentence = [word for word in sentence if word in vocab.index]
    if len(sentence):
        filtered_sentences.append(sentence)
sentences = filtered_sentences
Transform the words to integer indexes¶
for i, sentence in enumerate(sentences):
    sentences[i] = [vocab.loc[word] for word in sentence]
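As a quick sanity check (this helper is not in the original notebook), the integer codes can be mapped back to words by inverting the vocab series:
# inverse mapping: integer index -> word
inverse_vocab = pd.Series(vocab.index, index=vocab.values)
print([inverse_vocab[i] for i in sentences[0][:10]])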
Create Skip-Gram dataset¶
from nltk.util import skipgrams
window_size = 10
data = []
for sentence in sentences:
    data += skipgrams(sentence, 2, window_size)
data = pd.DataFrame(data, columns=['x', 'y'])
data.head()
Train and Validation Split¶
validation_size = 5000
data_valid = data.iloc[-validation_size:]
data_train = data.iloc[:-validation_size]
print('Train size:', len(data_train), 'Validation size:', len(data_valid))
Model Hyperparameters¶
learning_rate = .01
embed_size = 300
batch_size = 64
steps = 1000000
Model Inputs¶
inputs = tf.placeholder(tf.int32, [None])
targets = tf.placeholder(tf.int32, [None])
Embeddings Layer¶
This is the embeddings layer. It's a len(vocab) by embed_size matrix, initialized from a random uniform distribution. The optimizer will adjust its rows so that similar words end up with more similar vectors.
embeddings = tf.Variable(tf.random_uniform((len(vocab), embed_size), -1, 1))
embed = tf.nn.embedding_lookup(embeddings, inputs)
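For intuition, tf.nn.embedding_lookup simply gathers rows of the embeddings matrix by index; a rough NumPy equivalent (illustration only, not part of the model) looks like this:
# selecting the embedding vectors for word indexes 3 and 7 is just row selection
demo_embeddings = np.random.uniform(-1, 1, (len(vocab), embed_size))
demo_vectors = demo_embeddings[[3, 7]]   # shape: (2, embed_size)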
Linear layer¶
We use a linear layer with activation=None. We don't need this layer after training; think of it as part of the loss function.
logits = tf.layers.dense(embed, len(vocab), activation=None,
                         kernel_initializer=tf.random_normal_initializer())
Loss & Optimization¶
There is a more efficient, noise-contrastive loss function for training word embeddings: tf.nn.nce_loss. I use tf.nn.softmax_cross_entropy_with_logits for simplicity. For more information about nce_loss, look at the TensorFlow word2vec tutorial.
labels = tf.one_hot(targets, len(vocab))
loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
loss = tf.reduce_mean(loss)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
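For reference, here is a rough sketch of how the softmax loss above could be swapped for tf.nn.nce_loss, which draws a handful of negative samples per example instead of normalizing over the whole vocabulary (the variable names and num_sampled value are assumptions, not from the original notebook):
# sketch only: NCE works directly on the embedding vectors,
# so it needs its own output weights and biases instead of the dense layer above
nce_weights = tf.Variable(tf.truncated_normal([len(vocab), embed_size], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([len(vocab)]))
nce_loss = tf.reduce_mean(tf.nn.nce_loss(
    weights=nce_weights,
    biases=nce_biases,
    labels=tf.reshape(targets, [-1, 1]),   # nce_loss expects shape [batch_size, 1]
    inputs=embed,
    num_sampled=64,                        # negative samples per batch (assumption)
    num_classes=len(vocab)))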
Start Session¶
sess = tf.Session()
sess.run(tf.global_variables_initializer())
Training Loop¶
from sklearn.metrics.pairwise import cosine_similarity
def get_batches(x, y, batch_size, n=None):
    if n:
        # cheap way to add some randomization
        rand_start = np.random.randint(0, len(x) - batch_size * n)
        x = x[rand_start:]
        y = y[rand_start:]
    for start in range(len(x))[::batch_size][:n]:
        end = start + batch_size
        yield x[start:end], y[start:end]
step = 0
while step < steps:
    start = time.time()
    # shuffle train data once in a while
    if step % 100000 == 0:
        data_train = data_train.sample(frac=1.)
    # train part
    train_loss = []
    for x, y in get_batches(
            data_train['x'].values, data_train['y'].values, batch_size, n=10000):
        step += 1
        _, batch_loss = sess.run([train_op, loss], {inputs: x, targets: y})
        train_loss.append(batch_loss)
    # validation part (one batch of "validation_size")
    feed_dict = {inputs: data_valid['x'].values, targets: data_valid['y'].values}
    valid_loss, x_vectors = sess.run([loss, embed], feed_dict)
    y_vectors = sess.run(embed, {inputs: data_valid['y'].values})
    # outputs
    print('Step:', step, 'TLoss:', np.mean(train_loss), 'VLoss:', np.mean(valid_loss),
          'Similarity: %.3f' % cosine_similarity(x_vectors, y_vectors).mean(),
          'Seconds %.1f' % (time.time() - start))
We have trained embeddings!¶
vectors = sess.run(embeddings)
vectors = pd.DataFrame(vectors, index=vocab.index)
Demonstrate similarity¶
from sklearn.metrics.pairwise import cosine_similarity
print('Similarity:')
print(' computer to mouse =', cosine_similarity(vectors.loc[['computer']], vectors.loc[['mouse']])[0][0])
print(' cat to mouse =', cosine_similarity(vectors.loc[['cat']], vectors.loc[['mouse']])[0][0])
print(' dog to mouse =', cosine_similarity(vectors.loc[['dog']], vectors.loc[['mouse']])[0][0])
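As a rough follow-up to the [king] - [man] + [woman] ~= [queen] idea from the introduction, an analogy query can be sketched like this (assuming the three words survived the vocabulary filter; with a corpus this small the results will be noisy):
# find the vocabulary words closest to the vector king - man + woman
query = (vectors.loc['king'] - vectors.loc['man'] + vectors.loc['woman']).values.reshape(1, -1)
similarities = cosine_similarity(query, vectors.values)[0]
print(vectors.index[np.argsort(-similarities)[:5]].tolist())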