How to convert a word into a vector with GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations of words. In this tutorial we will see how to get the vector for a given word from the pre-trained GloVe vectors.

Prerequisites
  • Run the code snippets below in Google Colab or a Jupyter Notebook, because some of the steps execute Unix commands (the lines starting with !).

Download GloVe pre-trained vectors


!wget http://nlp.stanford.edu/data/glove.6B.zip


Unzip the downloaded zip file


!unzip glove.6B.zip

You should see output similar to the following:


Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt 
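
Outside a notebook, where the ! shell commands are not available, the same step can be sketched with Python's standard zipfile module. This is a minimal sketch, assuming glove.6B.zip is already in the working directory; the helper name extract_member is hypothetical, and it extracts only the one file you ask for instead of all four.

```python
import zipfile

def extract_member(zip_path, member, dest_dir="."):
    """Extract a single file from a zip archive and return the extracted path."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.extract returns the path of the file it created
        return archive.extract(member, path=dest_dir)

# Example (assumes the archive was downloaded in the previous step):
# extract_member("glove.6B.zip", "glove.6B.50d.txt")
```

Extracting a single member can save disk space when only one vector size is needed.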

View files after unzipping


!ls -lrt

You should see output similar to the following:


total 3039136
-rw-rw-r-- 1 root root  347116733 Aug  4  2014 glove.6B.100d.txt
-rw-rw-r-- 1 root root  693432828 Aug  4  2014 glove.6B.200d.txt
-rw-rw-r-- 1 root root  171350079 Aug  4  2014 glove.6B.50d.txt
-rw-rw-r-- 1 root root 1037962819 Aug 27  2014 glove.6B.300d.txt
-rw-r--r-- 1 root root  862182613 Oct 25  2015 glove.6B.zip

From the output we can see four text files: glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt and glove.6B.300d.txt. These files contain pre-trained vectors of length 50, 100, 200 and 300 respectively. For this tutorial we will work with vectors of length 50, so we will use glove.6B.50d.txt.
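
To confirm the vector length of a file before loading all of it, you can read just its first line. The sketch below assumes each line holds a word followed by space-separated numbers; the function name vector_length is hypothetical.

```python
def vector_length(path):
    """Return the number of vector components on the first line of a GloVe text file."""
    with open(path, encoding="utf-8") as file:
        first_line = file.readline()
    # The first token is the word itself; the rest are the vector components
    return len(first_line.split()) - 1

# Example (assumes the file was unzipped in the previous step):
# print(vector_length("glove.6B.50d.txt"))
```

For glove.6B.50d.txt the result should match the 50 in the file name.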

Create a dictionary with each word as key and its vector as value, for all of the entries in the file glove.6B.50d.txt


import numpy as np

# Create an empty dictionary
word2vector = {}

# Each line holds a word followed by its vector components, separated by spaces
with open('./glove.6B.50d.txt', encoding='utf-8') as file:
  for line in file:
    list_of_values = line.split()
    word = list_of_values[0]
    vector_of_word = np.asarray(list_of_values[1:], dtype='float32')
    word2vector[word] = vector_of_word

msg = f"Total number of words and corresponding vectors in word2vectors are {len(word2vector)}"
print(msg)

# View first record in word2vector dictionary
for word, vector in word2vector.items():
  print(word)
  print(vector)
  print(vector.shape)
  break

You should see output similar to the following:


Total number of words and corresponding vectors in word2vectors are 400000
the
[ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
(50,)

Generate word vectors for a sentence


# Sample sentence with four words
sample_sentence = "convert word into vectors"

# Look up each word; a word that is not in the file raises a KeyError
for token in sample_sentence.split():
  print(token, word2vector[token], word2vector[token].shape)

You should see output similar to the following:


convert [ 0.33661   -0.51168    0.87064   -0.95326    0.74852    0.12839
 -0.37988    0.10754   -0.36786    1.4141     0.62383    0.45762
  0.62611   -0.11105   -0.41305    0.67618    0.43104   -0.57291
  0.016154  -0.0049896  0.40332   -0.59646   -0.43036    0.20764
 -0.1147    -0.99394    0.68397   -1.089      0.51071   -0.37707
  2.0347    -0.13211   -0.35318    0.01808    0.40005   -0.13595
 -0.058802   0.073057   0.12816    0.0078398  0.70848   -0.36644
  0.25745    0.75544   -0.037074   0.50653   -0.055351  -0.20353
 -0.37791   -0.67328  ] (50,)
word [-0.1643     0.15722   -0.55021   -0.3303     0.66463   -0.1152
 -0.2261    -0.23674   -0.86119    0.24319    0.074499   0.61081
  0.73683   -0.35224    0.61346    0.0050975 -0.62538   -0.0050458
  0.18392   -0.12214   -0.65973   -0.30673    0.35038    0.75805
  1.0183    -1.7424    -1.4277     0.38032    0.37713   -0.74941
  2.9401    -0.8097    -0.66901    0.23123   -0.073194  -0.13624
  0.24424   -1.0129    -0.24919   -0.06893    0.70231   -0.022177
 -0.64684    0.59599    0.027092   0.11203    0.61214    0.74339
  0.23572   -0.1369   ] (50,)
into [ 6.6749e-01 -4.1321e-01  6.5755e-02 -4.6653e-01  2.7619e-04  1.8348e-01
 -6.5269e-01  9.3383e-02 -8.6802e-03 -1.8874e-01 -6.3057e-03  4.4894e-02
 -6.6801e-01  4.8506e-01 -1.1850e-01  1.9968e-01  1.8180e-01  3.3144e-02
 -5.9108e-01 -2.1829e-01  4.1438e-01  5.6740e-02  4.2155e-01  2.7798e-01
 -1.1322e-01 -1.9227e+00  3.5513e-02  6.1928e-01  6.2206e-01 -6.3987e-01
  3.9115e+00 -2.1078e-02 -2.4685e-01 -1.3922e-01 -2.2545e-01  5.9131e-01
 -7.3220e-01  1.1620e-01  4.1550e-01 -1.5188e-01 -1.4933e-01  4.0739e-02
 -1.0415e-01  2.3733e-01 -4.3800e-01  6.0590e-02  5.5073e-01 -9.6571e-01
 -2.6875e-01 -1.1741e+00] (50,)
vectors [ 1.3247   -0.38281   0.27162   0.82353   1.7431    0.63094   1.9888
 -1.0854    0.97619  -0.79769   0.70562   0.28915  -0.44682   0.16009
 -0.25901  -0.35215   0.10791  -0.71015  -0.80975  -0.70704  -1.0186
 -1.619     0.93473   1.1258   -0.22782   0.71059   0.22179  -0.42324
  0.61644   0.30039   1.1298    0.075558  0.049487 -0.40429  -0.4642
 -0.41281   0.193     0.29502  -0.74731   1.3598    1.2449    0.30083
 -0.63276   1.5004   -0.30381   0.21208   1.1786   -0.036461 -0.3919
  0.71549 ] (50,)
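
The direct lookup above fails with a KeyError when a word is not in the dictionary. One way to handle this is sketched below. It is a sketch under stated assumptions, not the only approach: the helper name lookup is hypothetical, lower-casing assumes the glove.6B vocabulary is uncased, and returning a zero vector for unknown words is just one simple fallback. A small stand-in dictionary is used so the snippet runs without the GloVe file.

```python
import numpy as np

def lookup(word, word2vector, dim=50):
    """Return the vector for `word`, or a zero vector if the word is unknown."""
    # Assumption: the vocabulary is lower-case, so normalise the word first
    vector = word2vector.get(word.lower())
    if vector is None:
        # Fallback choice for this sketch: a zero vector of the same length
        return np.zeros(dim, dtype="float32")
    return vector

# Tiny stand-in dictionary so the sketch runs without the GloVe file
toy_vectors = {"word": np.ones(50, dtype="float32")}

for token in "Word qwertyzzz".split():
    print(token, lookup(token, toy_vectors).shape)
```

With the real dictionary, pass word2vector in place of toy_vectors and keep dim equal to the vector length of the file you loaded.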
