Methods based on the Neural Network Language Model or Word2Vec pretraining both use local co-occurrence information as the signal for self-supervised learning. Another way to estimate word vectors is matrix decomposition, which proceeds in two steps. First, run a statistical analysis over the corpus to obtain a word-context matrix that captures global co-occurrence statistics. Second, apply Singular Value Decomposition (SVD) to this matrix to obtain low-dimensional representations of the words.
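As a minimal sketch of the decomposition step (the toy matrix, the name cooccur, the target dimensionality dim, and the log transform are all illustrative assumptions, not a prescribed recipe), truncated SVD with NumPy might look like:

import numpy as np

# Hypothetical word-context co-occurrence matrix (rows: words, columns: contexts).
# In practice it is built from corpus statistics; a log transform is one common
# way to compress raw counts before decomposing.
cooccur = np.array([[4.0, 1.0, 0.0],
                    [1.0, 3.0, 2.0],
                    [0.0, 2.0, 5.0]])

dim = 2  # target dimensionality of the word vectors
U, S, Vt = np.linalg.svd(np.log1p(cooccur), full_matrices=False)

# Keep the top-`dim` singular directions; each row of `word_vectors`
# is the low-dimensional embedding of one word.
word_vectors = U[:, :dim] * S[:dim]
print(word_vectors.shape)  # (3, 2)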
A method that combines this use of global statistics with a prediction objective is Global Vectors for Word Representation (GloVe).
Pretraining Task
The core idea of GloVe is to predict entries of the word-context co-occurrence matrix. Formally, $M_{w, c} = \sum_{i}\frac{1}{d_i(w, c)}$, where $d_i(w, c)$ is the distance between word $w$ and context word $c$ at their $i$-th co-occurrence, so closer co-occurrences contribute more. For example, if $w$ and $c$ co-occur twice, at distances 1 and 2, then $M_{w, c} = 1 + \frac{1}{2} = 1.5$. After computing $M$, word and context vectors are fit to the regression target $v_w^\top v_c + b_w + b_c = \log M_{w, c}$, where $v_w$ is the word vector, $v_c$ is the context vector, and $b_w$, $b_c$ are the corresponding biases. Solving this equation (approximately, by minimizing the squared error over all co-occurring pairs) yields vector representations for both words and contexts.
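As a minimal sketch of this scoring function (the class name GloveModel and its arguments are illustrative assumptions, not the source's code), a PyTorch module holding separate word and context embeddings plus per-word biases could be:

import torch.nn as nn

class GloveModel(nn.Module):
    # Sketch of the GloVe score v_w^T v_c + b_w + b_c for a batch of pairs.
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.w_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.w_biases = nn.Embedding(vocab_size, 1)
        self.c_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.c_biases = nn.Embedding(vocab_size, 1)

    def forward(self, words, contexts):
        w_embeds = self.w_embeddings(words)      # (batch, dim)
        c_embeds = self.c_embeddings(contexts)   # (batch, dim)
        w_biases = self.w_biases(words)          # (batch, 1)
        c_biases = self.c_biases(contexts)       # (batch, 1)
        # Predicted value of log M[w, c] for each pair in the batch.
        return (w_embeds * c_embeds).sum(dim=1) + w_biases.squeeze(1) + c_biases.squeeze(1)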
from collections import defaultdict

import torch
from torch.utils.data import Dataset
from tqdm import tqdm

class GloveDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        # Accumulate distance-weighted co-occurrence counts M[w, c].
        self.cooccur_counts = defaultdict(float)
        # BOS_TOKEN and EOS_TOKEN are assumed to be defined elsewhere
        # (e.g., "<bos>" and "<eos>" entries in the vocabulary).
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence) - 1):
                w = sentence[i]
                left_contexts = sentence[max(0, i - context_size):i]
                right_contexts = sentence[i + 1:min(len(sentence), i + context_size) + 1]
                # Weight each co-occurrence by the inverse of its distance to w.
                for k, c in enumerate(left_contexts[::-1]):
                    self.cooccur_counts[(w, c)] += 1 / (k + 1)
                for k, c in enumerate(right_contexts):
                    self.cooccur_counts[(w, c)] += 1 / (k + 1)
        self.data = [(w, c, count) for (w, c), count in self.cooccur_counts.items()]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        words = torch.tensor([ex[0] for ex in examples])
        contexts = torch.tensor([ex[1] for ex in examples])
        counts = torch.tensor([ex[2] for ex in examples])
        return (words, contexts, counts)
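To connect the dataset to the regression objective, here is a hedged training sketch. It assumes the GloveModel sketch above and pre-built corpus and vocab objects; the batch size, learning rate, and epoch count are illustrative, while the weighting constants m_max = 100 and alpha = 0.75 follow the defaults reported in the GloVe paper:

import torch
import torch.optim as optim
from torch.utils.data import DataLoader

m_max, alpha = 100, 0.75  # weighting-function constants from the GloVe paper
dataset = GloveDataset(corpus, vocab, context_size=2)
data_loader = DataLoader(dataset, batch_size=512,
                         collate_fn=dataset.collate_fn, shuffle=True)
model = GloveModel(len(vocab), embedding_dim=100)
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for words, contexts, counts in data_loader:
        preds = model(words, contexts)
        # Down-weight rare pairs: f(M) = (M / m_max)^alpha, capped at 1.
        weights = torch.clamp(torch.pow(counts / m_max, alpha), max=1.0)
        # Weighted squared error between the score and log M[w, c].
        loss = torch.mean(weights * (preds - torch.log(counts)) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The weighting function keeps very frequent pairs from dominating the loss while still discounting pairs seen only a few times, which is why the squared error is scaled rather than used raw.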