GloVe is an unsupervised learning algorithm for obtaining vector representations for words. In this tutorial we will see how to generate the vector for a given word.
Use Google Colab or a Jupyter Notebook, since we have to execute some Unix commands as well.
Download GloVe pre-trained vectors
!wget http://nlp.stanford.edu/data/glove.6B.zip
Unzip downloaded zip file
!unzip glove.6B.zip
You should see output similar to the one below
Archive: glove.6B.zip
inflating: glove.6B.50d.txt
inflating: glove.6B.100d.txt
inflating: glove.6B.200d.txt
inflating: glove.6B.300d.txt
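If you cannot run shell commands in your environment, the same download and extraction can be done from Python itself. This is an optional sketch using only the standard library; the URL is the same one used by the wget command above, and the download (around 822 MB) may take a while.
import urllib.request
import zipfile

# Download the pre-trained GloVe vectors and unzip them into the
# current directory (equivalent to the wget/unzip commands above).
url = "http://nlp.stanford.edu/data/glove.6B.zip"
urllib.request.urlretrieve(url, "glove.6B.zip")

with zipfile.ZipFile("glove.6B.zip") as zf:
    zf.extractall(".")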
View files after unzipping
!ls -lrt
You should see output similar to the one below
total 3039136
-rw-rw-r-- 1 root root 347116733 Aug 4 2014 glove.6B.100d.txt
-rw-rw-r-- 1 root root 693432828 Aug 4 2014 glove.6B.200d.txt
-rw-rw-r-- 1 root root 171350079 Aug 4 2014 glove.6B.50d.txt
-rw-rw-r-- 1 root root 1037962819 Aug 27 2014 glove.6B.300d.txt
-rw-r--r-- 1 root root 862182613 Oct 25 2015 glove.6B.zip
From the output we can see that there are four text files: glove.6B.100d.txt, glove.6B.200d.txt, glove.6B.50d.txt and glove.6B.300d.txt.
These files contain vectors of length 100, 200, 50 and 300 respectively. For this tutorial we will generate vectors of length 50, so we will use glove.6B.50d.txt.
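Each line in these files is a word followed by its vector values, separated by spaces. As a quick sanity check (not part of the original steps), you can peek at the first line of glove.6B.50d.txt; this is the format the parsing code below relies on.
# Peek at the format: each line is "<word> <v1> <v2> ... <v50>"
with open('./glove.6B.50d.txt', encoding='utf-8') as f:
    first_line = f.readline().split()
print(first_line[0], len(first_line) - 1)  # e.g. 'the' and 50 values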
Create a dictionary having each word as key and its vector as value for all of the entries in glove.6B.50d.txt
import numpy as np

# Create an empty dictionary
word2vector = {}

# Fill the dictionary with each word and its corresponding vector
with open('./glove.6B.50d.txt', encoding='utf-8') as file:
    for line in file:
        list_of_values = line.split()
        word = list_of_values[0]
        vector_of_word = np.asarray(list_of_values[1:], dtype='float32')
        word2vector[word] = vector_of_word

msg = f"Total number of words and corresponding vectors in word2vector is {len(word2vector)}"
print(msg)

# View the first record in the word2vector dictionary
for word, vector in word2vector.items():
    print(word)
    print(vector)
    print(vector.shape)
    break
You should see output similar to the one below
Total number of words and corresponding vectors in word2vector is 400000
the
[ 4.1800e-01 2.4968e-01 -4.1242e-01 1.2170e-01 3.4527e-01 -4.4457e-02
-4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01 2.7843e-01 -1.4767e-01
-5.5677e-01 1.4658e-01 -9.5095e-03 1.1658e-02 1.0204e-01 -1.2792e-01
-8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
-1.9181e-01 -1.8823e+00 -7.6746e-01 9.9051e-02 -4.2125e-01 -1.9526e-01
4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01 5.9213e-04 7.4449e-03
1.7778e-01 -1.5897e-01 1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
-3.4758e-01 -4.5637e-02 -4.4251e-01 1.8785e-01 2.7849e-03 -1.8411e-01
-1.1514e-01 -7.8581e-01]
(50,)
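As a quick sanity check, the vectors can be compared with cosine similarity: related words should score higher than unrelated ones. The sketch below is illustrative and not part of the original tutorial; the word pairs are just example choices from the vocabulary.
# Cosine similarity between two word vectors (illustrative sketch)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(word2vector["king"], word2vector["queen"]))   # related pair
print(cosine_similarity(word2vector["king"], word2vector["banana"]))  # unrelated pair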
Generate word vectors for a sentence
# Sample sentence with four words
sample_sentence = "convert word into vectors"

for word in sample_sentence.split():
    print(word, word2vector[word], word2vector[word].shape)
You should see output similar to the one below
convert [ 0.33661 -0.51168 0.87064 -0.95326 0.74852 0.12839
-0.37988 0.10754 -0.36786 1.4141 0.62383 0.45762
0.62611 -0.11105 -0.41305 0.67618 0.43104 -0.57291
0.016154 -0.0049896 0.40332 -0.59646 -0.43036 0.20764
-0.1147 -0.99394 0.68397 -1.089 0.51071 -0.37707
2.0347 -0.13211 -0.35318 0.01808 0.40005 -0.13595
-0.058802 0.073057 0.12816 0.0078398 0.70848 -0.36644
0.25745 0.75544 -0.037074 0.50653 -0.055351 -0.20353
-0.37791 -0.67328 ] (50,)
word [-0.1643 0.15722 -0.55021 -0.3303 0.66463 -0.1152
-0.2261 -0.23674 -0.86119 0.24319 0.074499 0.61081
0.73683 -0.35224 0.61346 0.0050975 -0.62538 -0.0050458
0.18392 -0.12214 -0.65973 -0.30673 0.35038 0.75805
1.0183 -1.7424 -1.4277 0.38032 0.37713 -0.74941
2.9401 -0.8097 -0.66901 0.23123 -0.073194 -0.13624
0.24424 -1.0129 -0.24919 -0.06893 0.70231 -0.022177
-0.64684 0.59599 0.027092 0.11203 0.61214 0.74339
0.23572 -0.1369 ] (50,)
into [ 6.6749e-01 -4.1321e-01 6.5755e-02 -4.6653e-01 2.7619e-04 1.8348e-01
-6.5269e-01 9.3383e-02 -8.6802e-03 -1.8874e-01 -6.3057e-03 4.4894e-02
-6.6801e-01 4.8506e-01 -1.1850e-01 1.9968e-01 1.8180e-01 3.3144e-02
-5.9108e-01 -2.1829e-01 4.1438e-01 5.6740e-02 4.2155e-01 2.7798e-01
-1.1322e-01 -1.9227e+00 3.5513e-02 6.1928e-01 6.2206e-01 -6.3987e-01
3.9115e+00 -2.1078e-02 -2.4685e-01 -1.3922e-01 -2.2545e-01 5.9131e-01
-7.3220e-01 1.1620e-01 4.1550e-01 -1.5188e-01 -1.4933e-01 4.0739e-02
-1.0415e-01 2.3733e-01 -4.3800e-01 6.0590e-02 5.5073e-01 -9.6571e-01
-2.6875e-01 -1.1741e+00] (50,)
vectors [ 1.3247 -0.38281 0.27162 0.82353 1.7431 0.63094 1.9888
-1.0854 0.97619 -0.79769 0.70562 0.28915 -0.44682 0.16009
-0.25901 -0.35215 0.10791 -0.71015 -0.80975 -0.70704 -1.0186
-1.619 0.93473 1.1258 -0.22782 0.71059 0.22179 -0.42324
0.61644 0.30039 1.1298 0.075558 0.049487 -0.40429 -0.4642
-0.41281 0.193 0.29502 -0.74731 1.3598 1.2449 0.30083
-0.63276 1.5004 -0.30381 0.21208 1.1786 -0.036461 -0.3919
0.71549 ] (50,)
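Note that word2vector[word] raises a KeyError for any word that is not in the 400,000-word vocabulary, and the vocabulary is lowercase, so "Convert" would fail while "convert" works. A common, simple way to turn the per-word vectors into a single sentence vector is to average them while skipping unknown words; the sketch below shows one such approach and is not part of the original tutorial.
# Average the word vectors of a sentence, skipping out-of-vocabulary words.
# The dictionary is lowercase, so normalise the words first.
def sentence_vector(sentence, word2vector, dim=50):
    vectors = [word2vector[w] for w in sentence.lower().split() if w in word2vector]
    if not vectors:
        return np.zeros(dim, dtype='float32')
    return np.mean(vectors, axis=0)

print(sentence_vector("convert word into vectors", word2vector).shape)  # (50,)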