
word2vec Word Vectors


Motivation

Neural-network language models used for pre-training only consider history (left-context) information and ignore future (right-context) information. To address this, more efficient pre-trained word-vector models were proposed, including CBOW (Continuous Bag-Of-Words) and Skip-Gram. Strictly speaking, they are not language models; they learn word vectors purely from co-occurrence information.

CBOW

The core idea of CBOW: given a context window of fixed size (e.g. 5 words) such as $w_{t-2}w_{t-1}w_{t}w_{t+1}w_{t+2}$, predict $w_t$ from the context words $C_t = \{w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\}$. The difference between CBOW and a neural language model is that CBOW does not consider the order or position of the context words.

(1) Input Layer: Suppose the length of the fixed window is n; then the input has dimension $(n-1) \times |V|$, where $V$ is the vocabulary and every column of the input matrix is a word represented by one-hot encoding.

(2) Embedding Layer: Map the input space to the word-vector space through the matrix $E \in R^{d \times |V|}$: $v_{w_i} = Ee_{w_i}$. Suppose $C_t = \{w_{t-k}, \cdots, w_{t-1}, w_{t+1}, \cdots, w_{t+k}\}$. Then we compute $v_{C_t} = \frac{1}{|C_t|}\sum_{w \in C_t}v_w$.

(3) Output Layer: $E' \in R^{|V| \times d}$ is the transformation matrix from the hidden layer to the output layer. Let $v'_{w_i}$ denote the row vector of $E'$ corresponding to $w_i$. The probability of outputting $w_t$ is $P(w_t|C_t) = \frac{\exp(v_{C_t} \cdot v'_{w_t})}{\sum_{w' \in V}\exp(v_{C_t} \cdot v'_{w'})}$.
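A minimal sketch of this forward pass with plain tensors may make the three layers concrete; the sizes $d$ and $|V|$ and the context indices here are illustrative assumptions:

import torch
import torch.nn.functional as F

d, V = 8, 100                                # illustrative embedding size and vocabulary size
E = torch.randn(d, V)                        # embedding matrix: one column per word
E_out = torch.randn(V, d)                    # output matrix E': one row per word

context_ids = torch.tensor([3, 7, 11, 42])   # indices of w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
v_Ct = E[:, context_ids].mean(dim=1)         # average of the context word vectors
scores = E_out @ v_Ct                        # one score per candidate w_t
probs = F.softmax(scores, dim=0)             # P(w_t | C_t) over the vocabulary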

In the CBOW model, both $E$ and $E'$ can serve as the word-vector matrix. In some tasks, we can also combine the two to get better performance.

Skip-Gram

The CBOW method uses all the words in the context window to predict the target word. The Skip-Gram model simplifies this process: it uses each word in the context window independently to predict the target word. So Skip-Gram tries to establish a word-to-word co-occurrence relationship, specifically $P(w_t|w_{t+j})$, where $j \in \{\pm 1,\cdots,\pm k\}$. Equivalently, it can be described as predicting the context from the current word, i.e. $P(w_{t+j}|w_t)$.

In the Input Layer, the one-hot encoding of the current word $w_t$ is mapped from the input layer to the Embedding Layer through the matrix $E$: $v_{w_t} = Ee_{w_t}$.
Then we use the linear transformation matrix $E'$ to predict the context word in the output layer: $P(c|w_t) = \frac{\exp(v_{w_t} \cdot v'_c)}{\sum_{w' \in V}\exp(v_{w_t} \cdot v'_{w'})}$.
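A comparable sketch for Skip-Gram scores a single context word against the current word; the sizes and indices are again illustrative assumptions:

import torch
import torch.nn.functional as F

d, V = 8, 100
E = torch.randn(d, V)              # embedding matrix for the current word
E_out = torch.randn(V, d)          # output matrix E' for context words

w_t, c = 5, 17                     # indices of the current word and one context word
v_wt = E[:, w_t]                   # embedding of w_t
scores = E_out @ v_wt              # score for every candidate context word
log_probs = F.log_softmax(scores, dim=0)
print(log_probs[c])                # log P(c | w_t)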

Parameter Estimation

Both the CBOW model and the Skip-Gram model need to estimate the parameters $\theta=\{E, E'\}$, but they have different loss functions:

  • For CBOW, $L(\theta) = -\sum_{t=1}^T \log P(w_t|C_t)$
  • For Skip-Gram, $L(\theta) = -\sum_{t=1}^T \sum_{-k \leq j \leq k, j \neq 0} \log P(w_{t+j}|w_t)$ (see the sketch after this list)
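Both objectives are plain negative log-likelihoods, so with models that output log_softmax scores (as the code below does) they reduce to nn.NLLLoss. A minimal sketch, where the batch size, vocabulary size, and random scores are illustrative assumptions:

import torch
import torch.nn as nn

V = 100                                            # illustrative vocabulary size
log_probs = torch.randn(4, V).log_softmax(dim=1)   # stand-in for model outputs, shape (batch, |V|)
targets = torch.tensor([3, 7, 11, 42])             # gold indices: w_t for CBOW, w_{t+j} for Skip-Gram
loss = nn.NLLLoss()(log_probs, targets)            # mean of -log P(target) over the batch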

Negative Sampling

When the vocabulary is very large and computing resources are limited, both the CBOW model and the Skip-Gram model suffer in efficiency because of the normalization over the whole vocabulary in the output layer. Negative Sampling recasts the task: given the current word and its context, maximize the probability that they co-occur. The problem then reduces to a binary classification task over pairs $(w, c)$. The probability that $(w,c)$ is a true co-occurrence pair is $P(D=1|w,c) = \sigma(v_w \cdot v'_c)$; otherwise, $P(D=0|w,c) = 1 - P(D=1|w,c) = \sigma(-v_w \cdot v'_c)$.

In the Skip-Gram model, $w=w_t$ and $c=w_{t+j}$. If the pair $(w, c)$ satisfies the co-occurrence condition, it is a positive sample. At the same time, we can build negative samples from it: we sample some words that are not in the context, marked as $\widetilde{w_{i}}$. The term $\log P(w_{t+j}|w_t)$ in the loss function is then replaced by $\log \sigma(v_{w_t} \cdot v'_{w_{t+j}}) + \sum_{i=1}^K \log \sigma(- v_{w_t} \cdot v'_{\widetilde{w_i}})$. Usually, the negative samples are drawn from a specific distribution (e.g., the unigram distribution raised to the 0.75 power, as in the training code below).
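A minimal sketch of this per-pair objective, with randomly drawn vectors standing in for the trained embeddings and the sampled negatives (all sizes are illustrative assumptions):

import torch
import torch.nn.functional as F

d, K = 8, 5
v_w = torch.randn(d)                            # embedding of the current word w_t
v_c = torch.randn(d)                            # context embedding of w_{t+j}
v_neg = torch.randn(K, d)                       # context embeddings of K sampled negatives

pos_term = F.logsigmoid(torch.dot(v_w, v_c))    # log σ(v_w · v'_c)
neg_term = F.logsigmoid(-(v_neg @ v_w)).sum()   # Σ_i log σ(-v_w · v'_{w̃_i})
loss = -(pos_term + neg_term)                   # maximizing the objective = minimizing its negative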

Code

CBOW

create dataset

class CbowDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            # Skip sentences too short to provide a full context window
            if len(sentence) < context_size * 2 + 1:
                continue
            for i in range(context_size, len(sentence) - context_size):
                # context_size words on each side of position i
                context = sentence[i-context_size:i] + sentence[i+1:i+context_size+1]
                target = sentence[i]
                self.data.append((context, target))
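The snippet above only shows the construction loop. To batch it with a DataLoader, the dataset also needs __len__ and __getitem__ (returning len(self.data) and self.data[i], exactly as in SGNSDataset below) plus a batching function. Here is a minimal sketch of such a helper; this is an assumption, not code from the original post:

import torch

def cbow_collate_fn(examples):
    # Each example is a (context, target) pair of token ids built in __init__.
    inputs = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
    targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
    return inputs, targets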

create model

class CbowModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CbowModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        # Average the context word embeddings (bag-of-words)
        hidden = embeds.mean(dim=1)
        output = self.output(hidden)
        log_probs = F.log_softmax(output, dim=1)
        return log_probs
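Only the SGNS training loop is shown at the end of the post, so here is a hedged sketch of how CbowModel might be trained with the dataset and collate helper above; load_reuters, device, and the hyperparameter values are assumptions carried over from the SGNS example below:

# Hypothetical CBOW training loop, mirroring the SGNS loop shown later in the post.
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

embedding_dim, context_size, batch_size, num_epochs = 128, 2, 1024, 10
corpus, vocab = load_reuters()                      # assumed corpus loader, as in the SGNS example
dataset = CbowDataset(corpus, vocab, context_size=context_size)
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=cbow_collate_fn)

model = CbowModel(len(vocab), embedding_dim).to(device)
criterion = nn.NLLLoss()                            # pairs with the log_softmax output of CbowModel
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epochs):
    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()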

Skip-Gram

create dataset

class SkipGramDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence) - 1):
                w = sentence[i]
                left_context_index = max(0, i - context_size)
                right_context_index = min(len(sentence), i + context_size)
                context = sentence[left_context_index:i] + sentence[i+1:right_context_index+1]
                # One (current word, context word) pair per context word
                self.data.extend([(w, c) for c in context])
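As a quick sanity check of what this dataset produces, a toy corpus yields one (w, c) pair per context word; the vocabulary and sentence below are illustrative assumptions:

toy_vocab = {BOS_TOKEN: 0, EOS_TOKEN: 1, "the": 2, "cat": 3, "sat": 4}
toy_corpus = [[2, 3, 4]]                 # "the cat sat", already mapped to token ids
toy_dataset = SkipGramDataset(toy_corpus, toy_vocab, context_size=2)
print(toy_dataset.data[:5])              # [(2, 0), (2, 3), (2, 4), (3, 0), (3, 2)]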

create model

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        output = self.output(embeds)
        log_probs = F.log_softmax(output, dim=1)
        return log_probs

Skip-Gram with Negative Sampling

create dataset

class SGNSDataset(Dataset):
    '''
    @param ns_dist: negative sampling distribution over the vocabulary
                    (defaults to a uniform distribution)
    '''
    def __init__(self, corpus, vocab, context_size=2, n_negatives=5, ns_dist=None):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        self.pad = vocab[PAD_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            for i in range(1, len(sentence) - 1):
                w = sentence[i]
                left_context_index = max(0, i - context_size)
                right_context_index = min(len(sentence), i + context_size)
                context = sentence[left_context_index:i] + sentence[i+1:right_context_index+1]
                # Pad the context so every example has the same width
                context += [self.pad] * (2 * context_size - len(context))
                self.data.append((w, context))
        self.n_negatives = n_negatives
        self.ns_dist = ns_dist if ns_dist is not None else torch.ones(len(vocab))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]

    def collate_fn(self, examples):
        words = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
        contexts = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
        batch_size, context_size = contexts.shape
        neg_contexts = []
        for i in range(batch_size):
            # Zero out the probability of words already in this example's context
            ns_dist = self.ns_dist.index_fill(0, contexts[i], 0.0)
            neg_contexts.append(torch.multinomial(ns_dist, self.n_negatives * context_size, replacement=True))
        neg_contexts = torch.stack(neg_contexts, dim=0)
        return words, contexts, neg_contexts

create model
Maintain two embedding layers: w_embeddings for the target-word representations and c_embeddings for the context representations.

class SGNSModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SGNSModel, self).__init__()
        self.w_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.c_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward_w(self, words):
        w_embeds = self.w_embeddings(words)
        return w_embeds

    def forward_c(self, contexts):
        c_embeds = self.c_embeddings(contexts)
        return c_embeds

compute the unigram distribution

def get_unigram_distribution(corpus, vocab_size):
    # Count token frequencies and normalize to a probability distribution
    token_counts = torch.tensor([0] * vocab_size)
    total_count = 0
    for sentence in corpus:
        total_count += len(sentence)
        for token in sentence:
            token_counts[token] += 1
    unigram_dist = torch.div(token_counts.float(), total_count)
    return unigram_dist
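The training script below calls a get_loader helper that is not shown in the post; a plausible minimal version that delegates batching to the dataset's own collate_fn could look like this (an assumption, not the original helper):

from torch.utils.data import DataLoader

def get_loader(dataset, batch_size, shuffle=True):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=dataset.collate_fn,   # delegate batching to the dataset
    )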

training

embedding_dim = 128
context_size = 3
batch_size = 1024
n_negatives = 5
num_epochs = 10

corpus, vocab = load_reuters()
unigram_dist = get_unigram_distribution(corpus, len(vocab))
# Negative sampling distribution: p(w) ** 0.75, renormalized
negative_sampling_dist = unigram_dist ** 0.75
negative_sampling_dist /= negative_sampling_dist.sum()

dataset = SGNSDataset(
    corpus,
    vocab,
    context_size=context_size,
    n_negatives=n_negatives,
    ns_dist=negative_sampling_dist
)
data_loader = get_loader(dataset, batch_size)

model = SGNSModel(len(vocab), embedding_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
        words, contexts, neg_contexts = [x.to(device) for x in batch]
        optimizer.zero_grad()
        batch_size = words.shape[0]
        # (batch, dim, 1) so it can be batch-multiplied with the context embeddings
        word_embeds = model.forward_w(words).unsqueeze(dim=2)
        context_embeds = model.forward_c(contexts)
        neg_context_embeds = model.forward_c(neg_contexts)
        # Log-probability that each (word, context) pair is a true co-occurrence
        context_loss = F.logsigmoid(torch.bmm(context_embeds, word_embeds).squeeze(dim=2))