Demo 1: Deep Learning for NLP - Word Embeddings

Christophe Servan


Tools

Data


Report

You should send me your report the day before the next session, in PDF format, at christophe_dot_servan_at_epita_dot_fr (replace _dot_ with "." and _at_ with "@")

Remarks:


Work to do

      This TP is here to help you get hands-on experience with word embeddings (WE)

    I. Train word embeddings

      The main idea is to train your first word embeddings model

      WARNING: Read carefully the documentation available here: https://github.com/facebookresearch/fastText

        Exercise 1: Train two models for each dataset, with a vector dimension of 50, 5 training epochs, and a window size of 5.

          Q1: Train a cbow model and explain your command line (1pt)

          Q2: Train a skip-gram model and explain your command line (1pt)

          Q3: Give an example of the vector file and describe it (1pt)

    II. Loading word embeddings

      Loading the WE is not difficult in itself. To make computations fast, we store the whole set of embeddings in a numpy array of shape (num_words, dimensions).
      We also build a dictionary (vocab) mapping each word to its row index in this array, and a list (rev_vocab) mapping indices back to word forms.

              import numpy as np

              def load(filename):
                  vocab = {}       # word -> row index
                  rev_vocab = []   # row index -> word
                  lines = open(filename).readlines()
                  # the first line of the .vec file holds "num_words dimensions"
                  num_words, dim = (int(x) for x in lines[0].split())
                  vectors = np.zeros((num_words, dim))
                  for i, line in enumerate(lines):
                      tokens = line.strip().split()
                      if i > 0:
                          vocab[tokens[0]] = i - 1
                          rev_vocab.append(tokens[0])
                          vectors[i - 1] = [float(value) for value in tokens[1:]]
                  return vocab, rev_vocab, vectors
      
                  

        Exercise 1: Loading

          Q1: Write the Python script needed to load the WE (1pt)

          Q2: What do "vocab", "rev_vocab" and "vectors" stand for? (1pt)

        Exercise 2: Compute the cosine similarity for each model and each dataset

          Q1: Explain what the cosine similarity is. (1pt)

          Q2: How can you compute it in Python using numpy? And using scipy? (2pt)

          Q3: What is the cosine similarity between the vectors representing "dog" and "cat"? What about "dog" and "dentist"? (1pt)

          Q4: Which word is closest to "bank": "river" or "trade"? (1pt)
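For reference, the cosine similarity can be written directly with numpy, or obtained from scipy, whose scipy.spatial.distance.cosine returns the cosine *distance* (1 − similarity). A minimal sketch with made-up toy vectors (not taken from any trained model):

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_similarity(u, v):
    # dot product of the two vectors divided by the product of their norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])  # colinear with u, so similarity is 1

print(cosine_similarity(u, v))  # numpy version
print(1.0 - cosine(u, v))       # scipy gives a distance: 1 - similarity
```

With your trained model, u and v would be rows of the vectors array returned by load.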

    III. Closest words

      Now that you have your WE model, you will use a trick to compute all the closest words.
      This can be done once and for all, by computing the dot product between the matrix containing all vectors and the transpose of the target word vector (np.dot(vectors, v.T)).
      Then a numpy trick recovers the indices of the n highest scores.

              def closest(vectors, vector, n=10):
                  n = n + 1
                  scores = np.dot(vectors, vector.T)
                  indices = np.argpartition(scores, -n)[-n:]
                  indices = indices[np.argsort(scores[indices])]
                  output = []
                  for i in [int(x) for x in indices]:
                      output.append((scores[i], i))
                  return reversed(output)
      

        Exercise 1: Code the function

          Q1: What preprocessing do you NEED to apply to all the vectors before using the dot product instead of the cosine? (3pt)

          Q2: What does "argpartition" do? (1pt)

          Q3: Add comments to the code. (1pt)

        Exercise 2: Analysis

          Q1: What are the closest words to "apple"? (1pt)

          Q2: What about other words close to "apple"? (1pt)

          Q3: Can you find words that have a strange neighborhood? (1pt)

          Q4: Check your answers using scipy.spatial.distance.cosine. Given the scores obtained, what can you conclude about the dot product?

    IV. Analogy

      Word analogies can be exposed by translating a word vector in a direction that corresponds to a linguistic or semantic relationship between two other words.
      So if w1 and w2 are in a relation R(w1, w2), we can compute the relation vector r = vec(w2) - vec(w1), and then apply this relation to the vector of another word, vec(w3).
      The word closest to vec(w3) + r should exhibit the same relation.
      Therefore, the idea is to use the closest function to find the vectors most similar to vec(w2) - vec(w1) + vec(w3).
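The construction above can be sketched as follows, assuming the dot-product closest function from section III and unit-normalized vectors. The embeddings here are a hypothetical 4-word toy example, only there to make the snippet self-contained:

```python
import numpy as np

def closest(vectors, vector, n=10):
    # same trick as in section III: dot products against all rows at once
    scores = np.dot(vectors, vector.T)
    indices = np.argpartition(scores, -n)[-n:]
    indices = indices[np.argsort(scores[indices])]
    return [(scores[i], i) for i in reversed(indices)]

def analogy(vectors, vocab, rev_vocab, w1, w2, w3, n=1):
    # r = vec(w2) - vec(w1); look near vec(w3) + r
    target = vectors[vocab[w2]] - vectors[vocab[w1]] + vectors[vocab[w3]]
    target /= np.linalg.norm(target)
    results = []
    for score, i in closest(vectors, target, n + 3):
        word = rev_vocab[i]
        if word not in (w1, w2, w3):  # skip the query words themselves
            results.append(word)
    return results[:n]

# hypothetical toy embeddings just to illustrate the call
rev_vocab = ["king", "man", "woman", "queen"]
vocab = {w: i for i, w in enumerate(rev_vocab)}
vectors = np.array([[0.9, 0.1, 0.8],
                    [0.9, 0.1, 0.1],
                    [0.1, 0.9, 0.1],
                    [0.1, 0.9, 0.8]])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit rows

print(analogy(vectors, vocab, rev_vocab, "man", "king", "woman"))
```

Note that the query words are filtered out of the results, since vec(w3) + r usually stays very close to the query vectors themselves.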

        Exercise 1: Code the analogy function

          Q1: Can you reuse the previous code? To what extent? (1pt)

        Exercise 2: Solve the analogies

          Q1: "paris" is to "france" what "delhi" is to ... ? (1pt)

          Q2: "gates" - "microsoft" + "apple" = ... ? (1pt)

          Q3: "king" - "man" + "woman" = ... ? (1pt)

          Q4: "slow" - "slower" + "fast" = ... ? (1pt)

        Exercise 3: Bonus

          Q1: Increase the dimension of the WE to 300 and retrain them (1pt)

          Q2: What is the impact on the analogies? (1pt)

    V. Evaluation

      To evaluate the quality of a set of word embeddings, we will use data from the WS-353 dataset (https://github.com/k-kawakami/embedding-evaluation).
      The file human_correlation.txt contains on each line a pair of words followed by the average similarity rating given by human judges. The higher the rating, the more similar the judges thought the words in the pair were.
      One way to evaluate the quality of word embeddings is to compute the correlation between the cosine similarity of the pairs and the human ratings.
      Since cosine similarities and human judgements are on different scales, we will use a rank correlation (a measure of how similar two rankings are).
      To compute Spearman's ρ correlation, you can use scipy. The spearmanr function takes two lists of values as arguments and returns two values: the first is ρ, the second is the significance of the estimate (pval).
      Here we are only interested in the first one.

      from scipy import stats
      serie1 = [1.3, 2.2, 3.1, 4.0]
      serie2 = [4.5, 1.8, 3.2, 2.7]
      print(stats.spearmanr(serie1, serie2)[0])
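Putting the pieces together, the evaluation loop can be sketched as below. The embeddings and ratings here are made up purely for illustration; in the exercise the ratings come from human_correlation.txt and the vectors from the models trained in section I:

```python
import numpy as np
from scipy import stats

def cosine_similarity(u, v):
    # cosine of the angle between u and v
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# hypothetical toy embeddings and human ratings, just to show the mechanics
embeddings = {
    "dog": np.array([0.9, 0.1]),
    "cat": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}
pairs = [("dog", "cat", 7.5), ("dog", "car", 2.0), ("cat", "car", 1.5)]

# one cosine score per pair, aligned with the human ratings
model_scores = [cosine_similarity(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]

rho, pval = stats.spearmanr(model_scores, human_scores)
print(rho)
```

Pairs whose words are missing from your model's vocabulary will need to be skipped before computing the correlation.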
      

        Exercise 1: Compute the ρ correlation between human judgements and the word embeddings you have created, for different window lengths

          Q1: Which window length correlates best with the human judgements? (1pt)

          Q2: Which vector dimension correlates best with the human judgements? (1pt)



The End.