Pre-training based on a neural language model considers only history information and ignores future information. To address this, more efficient pre-trained word vector models were proposed, including CBOW (Continuous Bag-Of-Words) and Skip-Gram. Strictly speaking, they are not language models; they learn word vectors based only on co-occurrence information.
CBOW
The core idea of CBOW: given a fixed-length window (e.g., 5 words) over the text $w_{t-2}w_{t-1}w_{t}w_{t+1}w_{t+2}$, predict $w_t$ based on the context words $C_t = \{w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}\}$. The difference between CBOW and a neural language model is that the CBOW model does not consider the order or position of the context words.
(1) Input Layer: Suppose the fixed window length is $n$; then the input has dimension $(n-1) \times |V|$, where $V$ is the vocabulary and each of the $n-1$ context words is represented by a $|V|$-dimensional one-hot vector.
(2) Embedding Layer: Map the input space to the word vector space through the matrix $E \in R^{d \times |V|}$: $v_{w_i} = Ee_{w_i}$. Suppose $C_t = \{w_{t-k}, \cdots, w_{t-1}, w_{t+1}, \cdots, w_{t+k}\}$. Then we compute the context representation $v_{C_t} = \frac{1}{|C_t|}\sum_{w \in C_t}v_{w}$.
(3) Output Layer: $E' \in R^{|V| \times d}$ is the transformation matrix from the hidden layer to the output layer. Let $v'_{w_i}$ denote the row vector of $E'$ corresponding to $w_i$. The probability of the output $w_t$ is $P(w_t|C_t) = \frac{\exp(v_{C_t} \cdot v'_{w_t})}{\sum_{w' \in V}\exp(v_{C_t} \cdot v'_{w'})}$.
In the CBOW model, both $E$ and $E'$ can be used as the word vector matrix. In some tasks, combining the two can give better performance.
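To make the three layers concrete, here is a minimal PyTorch sketch of the CBOW forward pass; the class name `CbowModel` and its arguments are illustrative, not taken from any reference implementation (`nn.Embedding` plays the role of multiplying the one-hot inputs by $E$).

```python
import torch.nn as nn

class CbowModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)       # E: looks up v_w = E e_w
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)  # E': hidden-to-output matrix

    def forward(self, contexts):
        # contexts: (batch, 2k) indices of the surrounding words C_t
        v_c = self.embeddings(contexts).mean(dim=1)   # v_{C_t}: average of the context vectors
        logits = self.output(v_c)                     # v_{C_t} . v'_w for every w in V
        return logits                                 # softmax over logits gives P(w_t | C_t)
```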
Skip-Gram
The CBOW method uses all the words in the context window to predict the target word. The Skip-Gram model simplifies this process: it uses each word in the context window independently to predict the target word. So the Skip-Gram model tries to establish a co-occurrence relationship between pairs of single words, specifically $P(w_t|w_{t+j})$, where $j \in \{\pm 1,\cdots,\pm k\}$. It can also be described as predicting the context from the current word, i.e., $P(w_{t+j}|w_t)$.
In the Input Layer, the one-hot encoding of the current word $w_t$ is mapped from the input layer to the Embedding Layer through the matrix $E$: $v_{w_t} = Ee_{w_t}$. Then we use the linear transformation matrix $E'$ to predict the context word in the output layer: $P(c|w_t) = \frac{\exp(v_{w_t} \cdot v'_{c})}{\sum_{w' \in V}\exp(v_{w_t} \cdot v'_{w'})}$.
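As a counterpart to the CBOW sketch above, a minimal Skip-Gram forward pass might look like the following (again, the class name `SkipGramModel` is only illustrative).

```python
import torch.nn as nn

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)       # E
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)  # E'

    def forward(self, words):
        # words: (batch,) indices of the current word w_t
        v_w = self.embeddings(words)   # v_{w_t}
        logits = self.output(v_w)      # v_{w_t} . v'_c for every candidate context word c
        return logits                  # softmax over logits gives P(c | w_t)
```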
Parameter Estimation
Both the CBOW model and the Skip-Gram model need to estimate the parameters $\theta=\{E, E'\}$, but they have different loss functions:
For CBOW, $L(\theta) = -\sum_{t=1}^T \log P(w_t|C_t)$; for Skip-Gram, $L(\theta) = -\sum_{t=1}^T \sum_{-k \le j \le k,\, j \ne 0} \log P(w_{t+j}|w_t)$.
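In code, this negative log-likelihood is simply the cross-entropy between the softmax output and the target word. A sketch using the hypothetical `CbowModel` above:

```python
import torch.nn.functional as F

# contexts: (batch, 2k) context word indices; targets: (batch,) center word indices
logits = model(contexts)                 # (batch, |V|) scores from the CBOW forward pass
loss = F.cross_entropy(logits, targets)  # mean of -log P(w_t | C_t) over the batch
loss.backward()                          # gradients w.r.t. theta = {E, E'}
```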
When the vocabulary is very large and computing resources are limited, both the CBOW model and the Skip-Gram model suffer from low computational efficiency because of the normalization over the whole vocabulary in the output layer. Negative Sampling provides a new view of the task: given the current word and its context, maximize the probability that they co-occur. The problem then simplifies to a binary classification task over pairs $(w, c)$. The probability that $(w,c)$ has a co-occurrence relationship is $P(D=1|w,c) = \sigma(v_w \cdot v'_c)$; otherwise, $P(D=0|w,c) = 1 - P(D=1|w,c) = \sigma(-v_w \cdot v'_c)$.
In the Skip-Gram model, $w=w_t$ and $c=w_{t+j}$. If the pair $(w, c)$ satisfies the co-occurrence condition, it is a positive sample. At the same time, we can draw negative samples based on it: we sample some words that are not in the context, denoted $\widetilde{w_{i}}$. The term $\log P(w_{t+j}|w_t)$ in the loss function is then replaced by $\log \sigma(v_{w_t} \cdot v'_{w_{t+j}}) + \sum_{i=1}^K \log \sigma(- v_{w_t} \cdot v'_{\widetilde{w_i}})$. Usually, the negative samples are drawn from a specific noise distribution.
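Below is a sketch of the skip-gram-with-negative-sampling objective, assuming the negative samples have already been drawn for each pair; `SGNSModel` and the tensor shapes are assumptions for illustration, not the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGNSModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)    # E: v_w for the current word
        self.out_embed = nn.Embedding(vocab_size, embedding_dim)   # E': v'_c for context/negative words

    def forward(self, words, contexts, negatives):
        # words: (batch,), contexts: (batch,), negatives: (batch, K) sampled non-context words
        v_w = self.in_embed(words)                                  # (batch, d)
        v_c = self.out_embed(contexts)                              # (batch, d)
        v_n = self.out_embed(negatives)                             # (batch, K, d)
        pos = F.logsigmoid((v_w * v_c).sum(dim=-1))                 # log sigma(v_w . v'_c)
        neg = F.logsigmoid(-torch.bmm(v_n, v_w.unsqueeze(-1)).squeeze(-1)).sum(dim=-1)
        return -(pos + neg).mean()                                  # minimize the negative objective
```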
Code
CBOW
Create the dataset
```python
from torch.utils.data import Dataset
from tqdm import tqdm

# BOS_TOKEN and EOS_TOKEN are assumed to be defined elsewhere (sentence boundary markers).

class CbowDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.bos = vocab[BOS_TOKEN]
        self.eos = vocab[EOS_TOKEN]
        for sentence in tqdm(corpus, desc="Dataset Construction"):
            sentence = [self.bos] + sentence + [self.eos]
            # Skip sentences that are too short to fill a full context window
            if len(sentence) < context_size * 2 + 1:
                continue
            for i in range(context_size, len(sentence) - context_size):
                # Context: context_size words on each side of the current position
                context = sentence[i - context_size:i] + sentence[i + 1:i + context_size + 1]
                # Target: the center word to be predicted
                target = sentence[i]
                self.data.append((context, target))
```
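One possible way to feed this dataset into training, assuming the class also defines the usual `__len__` and `__getitem__` methods (not shown in the excerpt) and that `corpus` holds sentences of word indices; the collate function below is an assumption about batching, not part of the original code.

```python
import torch
from torch.utils.data import DataLoader

def collate_fn(examples):
    # Each example is a (context, target) pair produced by CbowDataset
    contexts = torch.tensor([ex[0] for ex in examples], dtype=torch.long)  # (batch, 2 * context_size)
    targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)   # (batch,)
    return contexts, targets

dataset = CbowDataset(corpus, vocab, context_size=2)
dataloader = DataLoader(dataset, batch_size=512, shuffle=True, collate_fn=collate_fn)
```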