How to convert words into vectors with GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations of words. In this tutorial we will see how to look up the vector for a given word using the pre-trained GloVe vectors.

  • Run the code snippets below in Google Colab or a Jupyter Notebook, since we also have to execute some Unix commands.

Download GloVe pre-trained vectors
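
In a notebook this step is usually a single shell command. A minimal sketch of the same step in Python follows. It assumes the archive is named glove.6B.zip and is hosted at the Stanford NLP URL shown in the code; neither the URL nor the file name is stated in this tutorial, and `download_file` is a hypothetical helper name.

```python
import os
import urllib.request

# Assumed location of the pre-trained GloVe 6B archive (not stated in this tutorial).
GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"

def download_file(url, dest_path):
    """Download url to dest_path unless the file already exists; return dest_path."""
    if not os.path.exists(dest_path):
        urllib.request.urlretrieve(url, dest_path)
    return dest_path

# The archive is several hundred MB, so the call is left commented out.
# download_file(GLOVE_URL, "glove.6B.zip")
```

Skipping the download when the file already exists avoids fetching the archive again each time the notebook cell is re-run.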


Unzip downloaded zip file
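
The "inflating:" lines listed next are what the command-line `unzip` tool prints. As a sketch of the same step in pure Python, assuming the archive was saved as glove.6B.zip in the current directory (`extract_archive` is a hypothetical helper name):

```python
import zipfile

def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Example (assumes glove.6B.zip is in the current directory):
# print(extract_archive("glove.6B.zip"))
```

Unlike `unzip`, this prints nothing by itself; the returned list of names plays the role of the listing.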


You should see output similar to the following:

  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt 

View files after unzipping

!ls -lrt

You should see output similar to the following:

total 3039136
-rw-rw-r-- 1 root root  347116733 Aug  4  2014 glove.6B.100d.txt
-rw-rw-r-- 1 root root  693432828 Aug  4  2014 glove.6B.200d.txt
-rw-rw-r-- 1 root root  171350079 Aug  4  2014 glove.6B.50d.txt
-rw-rw-r-- 1 root root 1037962819 Aug 27  2014 glove.6B.300d.txt
-rw-r--r-- 1 root root  862182613 Oct 25  2015

From the output we can see four text files: glove.6B.100d.txt, glove.6B.200d.txt, glove.6B.50d.txt and glove.6B.300d.txt. These files contain word vectors of length 100, 200, 50 and 300 respectively. For this tutorial we will work with vectors of length 50, so we will use glove.6B.50d.txt.
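
To confirm the vector length of a file before loading all of it, one option is to inspect only its first line. The sketch below assumes each line holds a word followed by its space-separated vector components; `vector_length` is a hypothetical helper name.

```python
def vector_length(path):
    """Return the number of vector components on the first line of a GloVe text file."""
    with open(path, encoding="utf-8") as file:
        first_line = file.readline()
    # Each line is a word followed by its space-separated vector components.
    return len(first_line.split()) - 1

# Example (assumes the file is in the current directory):
# print(vector_length("glove.6B.50d.txt"))
```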

Create a dictionary with each word as key and its vector as value, for all entries in the file glove.6B.50d.txt

import os
import numpy as np

# Create an empty dictionary
word2vector = {}

# Fill the dictionary with each word and its corresponding vector
with open(os.path.join('.', 'glove.6B.50d.txt'), encoding='utf-8') as file:
  for line in file:
    # Each line holds a word followed by its vector components
    list_of_values = line.split()
    word = list_of_values[0]
    vector_of_word = np.asarray(list_of_values[1:], dtype='float32')
    word2vector[word] = vector_of_word

msg = f"Total number of words and corresponding vectors in word2vectors are {len(word2vector)}"
print(msg)

# View the vector of the first record in the word2vector dictionary
for word, vector in word2vector.items():
  print(vector)
  break
You should see output similar to the following:

Total number of words and corresponding vectors in word2vectors are 400000
[ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]

Generate word vectors for a sentence

#Sample sentence with four words
sample_sentence  = "convert word into vectors"

for i in sample_sentence.split():
  print(i, word2vector[i], word2vector[i].shape)

You should see output similar to the following:

convert [ 0.33661   -0.51168    0.87064   -0.95326    0.74852    0.12839
 -0.37988    0.10754   -0.36786    1.4141     0.62383    0.45762
  0.62611   -0.11105   -0.41305    0.67618    0.43104   -0.57291
  0.016154  -0.0049896  0.40332   -0.59646   -0.43036    0.20764
 -0.1147    -0.99394    0.68397   -1.089      0.51071   -0.37707
  2.0347    -0.13211   -0.35318    0.01808    0.40005   -0.13595
 -0.058802   0.073057   0.12816    0.0078398  0.70848   -0.36644
  0.25745    0.75544   -0.037074   0.50653   -0.055351  -0.20353
 -0.37791   -0.67328  ] (50,)
word [-0.1643     0.15722   -0.55021   -0.3303     0.66463   -0.1152
 -0.2261    -0.23674   -0.86119    0.24319    0.074499   0.61081
  0.73683   -0.35224    0.61346    0.0050975 -0.62538   -0.0050458
  0.18392   -0.12214   -0.65973   -0.30673    0.35038    0.75805
  1.0183    -1.7424    -1.4277     0.38032    0.37713   -0.74941
  2.9401    -0.8097    -0.66901    0.23123   -0.073194  -0.13624
  0.24424   -1.0129    -0.24919   -0.06893    0.70231   -0.022177
 -0.64684    0.59599    0.027092   0.11203    0.61214    0.74339
  0.23572   -0.1369   ] (50,)
into [ 6.6749e-01 -4.1321e-01  6.5755e-02 -4.6653e-01  2.7619e-04  1.8348e-01
 -6.5269e-01  9.3383e-02 -8.6802e-03 -1.8874e-01 -6.3057e-03  4.4894e-02
 -6.6801e-01  4.8506e-01 -1.1850e-01  1.9968e-01  1.8180e-01  3.3144e-02
 -5.9108e-01 -2.1829e-01  4.1438e-01  5.6740e-02  4.2155e-01  2.7798e-01
 -1.1322e-01 -1.9227e+00  3.5513e-02  6.1928e-01  6.2206e-01 -6.3987e-01
  3.9115e+00 -2.1078e-02 -2.4685e-01 -1.3922e-01 -2.2545e-01  5.9131e-01
 -7.3220e-01  1.1620e-01  4.1550e-01 -1.5188e-01 -1.4933e-01  4.0739e-02
 -1.0415e-01  2.3733e-01 -4.3800e-01  6.0590e-02  5.5073e-01 -9.6571e-01
 -2.6875e-01 -1.1741e+00] (50,)
vectors [ 1.3247   -0.38281   0.27162   0.82353   1.7431    0.63094   1.9888
 -1.0854    0.97619  -0.79769   0.70562   0.28915  -0.44682   0.16009
 -0.25901  -0.35215   0.10791  -0.71015  -0.80975  -0.70704  -1.0186
 -1.619     0.93473   1.1258   -0.22782   0.71059   0.22179  -0.42324
  0.61644   0.30039   1.1298    0.075558  0.049487 -0.40429  -0.4642
 -0.41281   0.193     0.29502  -0.74731   1.3598    1.2449    0.30083
 -0.63276   1.5004   -0.30381   0.21208   1.1786   -0.036461 -0.3919
  0.71549 ] (50,)
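
The direct lookup `word2vector[i]` raises a KeyError for any word that is not in the dictionary. A minimal sketch of a more forgiving lookup is shown below. It assumes the vocabulary is lowercase (which is how the glove.6B files are described) and maps unknown words to a zero vector, an arbitrary choice made for this sketch. `sentence_to_matrix` and `toy_word2vector` are hypothetical names, and the toy dictionary only stands in for the real one so that the snippet runs without the GloVe file.

```python
import numpy as np

def sentence_to_matrix(sentence, word2vector, dim=50):
    """Return an array of shape (number_of_words, dim) for a whitespace-split sentence.

    Words are lowercased before lookup; words missing from the dictionary
    are mapped to a zero vector (an arbitrary choice for this sketch).
    """
    rows = []
    for word in sentence.lower().split():
        rows.append(word2vector.get(word, np.zeros(dim, dtype="float32")))
    return np.stack(rows) if rows else np.zeros((0, dim), dtype="float32")

# Tiny stand-in dictionary so the sketch runs without the GloVe file.
toy_word2vector = {
    "convert": np.ones(50, dtype="float32"),
    "word": np.full(50, 2.0, dtype="float32"),
}
matrix = sentence_to_matrix("Convert word qwertyzzz", toy_word2vector)
print(matrix.shape)  # (3, 50)
```

With the real dictionary, the call would be `sentence_to_matrix(sample_sentence, word2vector)`, which gives one row of length 50 per word.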

