sudo apt update && sudo apt install cmake gcc
bash FT_installation.bash
WARNING: the tool will be installed by default into the "tools" directory located in your home directory.
source ~/.bashrc
fasttext --help
The result should print the help of the fasttext command.
You should send me your report, in PDF format, the day before the next session, to the address christophe_dot_servan_at_epita_dot_fr (replace _dot_ and _at_ with "." and "@", respectively)
Remarks:
The main idea is to train your first word embeddings model
WARNING: Read carefully the documentation available here: https://github.com/facebookresearch/fastText
Exercise 1: Train two models for each dataset, with a vector dimension of 50, 5 training epochs, and a window size of 5.
Q1: Train a cbow model, explain your command line (1pt)
Q2: Train a skip-gram model, explain your command line (1pt)
Q3: Give an example from the vector file and describe it (1pt)
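As a starting point, and assuming a training corpus named data.txt (replace it with each dataset's file; the output names are also placeholders), command lines along these lines should cover both models. The -dim, -epoch and -ws options are fastText's flags for vector dimension, number of epochs and window size (see the fastText README):

```shell
# Hypothetical file names; adapt -input to each dataset.
fasttext cbow     -input data.txt -output model_cbow     -dim 50 -epoch 5 -ws 5
fasttext skipgram -input data.txt -output model_skipgram -dim 50 -epoch 5 -ws 5
```

Each run writes a binary model (.bin) and a plain-text vector file (.vec); the .vec file is what the loading code in the next part reads.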
Loading the WE is not difficult in itself.
To make computations fast, we store the whole set of embeddings in a numpy array of shape (num_words, dimensions).
We also build a dictionary (vocab) mapping words to row indices in this array, and a list (rev_vocab) mapping indices back to word forms.
import numpy as np

def load(filename):
    vocab = {}        # word -> row index
    rev_vocab = []    # row index -> word
    lines = open(filename).readlines()
    # Header line of a .vec file: "<number of words> <vector dimension>"
    vectors = np.zeros((int(lines[0].split()[0]), int(lines[0].split()[1])))
    for i, line in enumerate(lines):
        tokens = line.strip().split()
        if i > 0:
            vocab[tokens[0]] = i - 1
            rev_vocab.append(tokens[0])
            vectors[i - 1] = [float(value) for value in tokens[1:]]
    return vocab, rev_vocab, vectors
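For reference, the .vec file this function reads is plain text: a header line giving the vocabulary size and the dimension, then one line per word with its components. A toy 2-word, 3-dimensional file would look like this (values made up):

```text
2 3
dog 0.1 0.2 0.3
cat 0.4 0.5 0.6
```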
Exercise 1: Loading
Q1: write in Python the necessary script to load the WE (1pt)
Q2: what do "vocab", "rev_vocab" and "vectors" stand for? (1pt)
Exercise 2: Compute the cosine similarity for each model and each dataset
Q1: Explain what cosine similarity is. (1pt)
Q2: How do you compute it in Python using numpy? And using scipy? (2pt)
Q3: What is the cosine similarity between the vectors representing "dog" and "cat", and what about "dog" and "dentist"? (1pt)
Q4: What is the closest word to "bank": "river" or "trade"? (1pt)
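One pitfall worth flagging: scipy.spatial.distance.cosine returns the cosine *distance*, i.e. 1 minus the similarity. A minimal sketch on made-up vectors (the function name cosine_similarity is ours, not a library call):

```python
import numpy as np
from scipy.spatial import distance

def cosine_similarity(u, v):
    # cos(u, v) = u.v / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = np.array([3.0, 2.0, 1.0])

sim_np = cosine_similarity(u, v)
# scipy gives the cosine distance, so convert it back to a similarity
sim_sp = 1.0 - distance.cosine(u, v)

print(round(sim_np, 4))  # 0.7143
print(round(sim_sp, 4))  # 0.7143
```

Both routes agree; with your real models, u and v are rows of the loaded vectors array.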
Now that you have your WE model, you will use a trick to compute all the closest words.
This can be done once and for all, by computing the dot product between the matrix containing all
vectors and the transpose of the target word vector (np.dot(vectors, v.T)).
Then we can use a numpy trick to recover the indices of the n highest scores.
def closest(vectors, vector, n=10):
    n = n + 1
    scores = np.dot(vectors, vector.T)
    indices = np.argpartition(scores, -n)[-n:]
    indices = indices[np.argsort(scores[indices])]
    output = []
    for i in [int(x) for x in indices]:
        output.append((scores[i], i))
    return reversed(output)
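To see what the argpartition/argsort pair inside closest is doing, here is a tiny standalone run on made-up scores (top-3 indices, best first):

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
n = 3
# argpartition moves the indices of the n largest scores into the
# last n slots (in arbitrary order); argsort then orders those n.
indices = np.argpartition(scores, -n)[-n:]
indices = indices[np.argsort(scores[indices])]
print(indices[::-1])  # indices sorted by score, highest first
```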
Exercise 1: Code the function
Q1: What preprocessing do you NEED to apply to all the vectors before using the dot product instead of the cosine? (3pt)
Q2: What does "argpartition" stand for? (1pt)
Q3: Add comments to the code. (1pt)
Exercise 2: Analysis
Q1: What are the closest words to "apple"? (1pt)
Q2: What about other words (proceed as for "apple")? (1pt)
Q3: Can you find words which have a strange neighborhood? (1pt)
Q4: Check your answers using scipy.spatial.distance.cosine; regarding the scores obtained, what can you conclude about the dot product?
Word analogies can be exposed by translating a word vector in a direction that
corresponds to a linguistic or semantic relationship between two other words.
So if we have w1 and w2 in a relation R(w1,w2), we can compute the relation
vector r = vec(w2)-vec(w1), and then apply this relation to the vector of another
word, vec(w3).
The word closest to vec(w3)+r should also exhibit the same relation.
Therefore, the idea is to use the closest function to find vectors similar to vec(w2)-vec(w1)+vec(w3).
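Before running this on the real model, the idea can be sanity-checked on a made-up 2-D vocabulary. Note that in practice the three query words must be excluded from the candidates, otherwise they tend to come out on top; the analogy function below is a sketch, not the expected solution:

```python
import numpy as np

def analogy(vectors, v1, v2, v3):
    # Target point: apply the relation r = v2 - v1 to v3.
    target = v2 - v1 + v3
    target = target / np.linalg.norm(target)
    # Assumes the rows of `vectors` are L2-normalized, so the dot
    # product ranks candidates the same way the cosine would.
    return np.dot(vectors, target.T)

# Toy unit-norm "word" vectors (made up).
vecs = np.array([[1.0, 0.0],   # w1
                 [0.0, 1.0],   # w2
                 [1.0, 0.0],   # w3: same direction as w1
                 [0.0, 1.0]])  # expected answer: same relation to w3 as w2 to w1
scores = analogy(vecs, vecs[0], vecs[1], vecs[2])
scores[[0, 1, 2]] = -np.inf    # exclude the three query words
print(int(np.argmax(scores)))  # 3
```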
Exercise 1: Code the analogy function
Q1: Can you reuse the previous code? To what extent? (1pt)
Exercise 2: Solve the analogies
Q1: "paris" is to "france" what "delhi" is to ... ? (1pt)
Q2: "gates" - "microsoft" + "apple" = ... ? (1pt)
Q3: "king" - "man" + "woman" = ... ? (1pt)
Q4: "slow" - "slower" + "fast" = ... ? (1pt)
Exercise 3: Bonus
Q1: Increase the dimension of the WE to 300 and retrain them (1pt)
Q2: What is the impact on the analogies? (1pt)
To evaluate the quality of a set of word embeddings, we will use data from the WS-353 dataset (https://github.com/k-kawakami/embedding-evaluation).
The file human_correlation.txt contains on each line a pair of words followed by the average of a similarity rating by human judges.
The higher the rating, the more similar the judges thought the words in the pair are.
One way to evaluate the quality of word embeddings is to compute the correlation between the cosine similarity of the pairs and human ratings.
Since cosine similarities and human judgements are on different scales, we will use a rank correlation (a measure of how similar two rankings are).
To compute Spearman's ρ correlation, you can use scipy.
The spearmanr function takes two lists of values as arguments and returns two values.
The first one is ρ, the second one is the significance of the estimation (pval).
Here we are only interested in the first one.
from scipy import stats

serie1 = [1.3, 2.2, 3.1, 4.0]
serie2 = [4.5, 1.8, 3.2, 2.7]
print(stats.spearmanr(serie1, serie2)[0])
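Putting the pieces together, the evaluation loop might look like the following sketch; the word pairs, ratings and vectors here are all made up, and the parsing of human_correlation.txt is left out:

```python
import numpy as np
from scipy import stats

# Made-up mini dataset in the spirit of human_correlation.txt:
# (word1, word2, human similarity rating).
pairs = [("dog", "cat", 7.0), ("dog", "car", 2.0), ("cup", "mug", 9.0)]

# Made-up embeddings standing in for a loaded model.
emb = {"dog": np.array([1.0, 1.0]), "cat": np.array([1.0, 0.8]),
       "car": np.array([-1.0, 1.0]), "cup": np.array([0.5, 2.0]),
       "mug": np.array([0.6, 2.1])}

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

human = [r for _, _, r in pairs]           # human ratings
model = [cos(emb[a], emb[b]) for a, b, _ in pairs]  # model similarities
rho, pval = stats.spearmanr(human, model)
print(round(rho, 2))  # 1.0: perfect rank agreement on this toy data
```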
Exercise 1: compute the ρ correlation between the human judgements and the word embeddings you created, for different window lengths
Q1: Which window length correlates best? (1pt)
Q2: Which vector dimension correlates best? (1pt)