2023 Feb 15

Implementation of pointer networks

Encoder

There are multiple ways an encoder can be implemented. To keep this an experiment, rather than using an LSTM or a GRU I started with a Linear layer with a hidden size of 32.

In the Encoder class I also implemented the hidden vectors as parameters: a weight query, a weight key, and a weight value.

Attention

Three linear layers were used here with a softmax; the softmax scores are then multiplied with a linear projection of the original input, which in this case is the encoded input from the encoder.

Attention mechanisms are used in neural networks to focus on the most relevant parts of input data or sequence during the prediction or classification process. This can be useful in tasks such as machine translation, image captioning, and speech recognition where the length of the input sequence can vary and it may not be feasible to process the entire sequence at once.

The attention mechanism allows the model to selectively weight different parts of the input data, so that the model can attend to the most important parts of the input while ignoring the less important parts. This helps the model to make more accurate predictions by focusing on the most relevant information.

The equation for the attention mechanism varies depending on the specific implementation, but one common form of attention used in sequence-to-sequence models is the dot product attention. In dot product attention, the relevance of each part of the input sequence is determined by taking the dot product between the current decoder state and each of the encoder states.

Here is the general equation for dot product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:

$Q$ is the query vector, which represents the current state of the decoder.
$K$ and $V$ are the key and value vectors, which represent the encoder states and their corresponding values.
$\sqrt{d_k}$ is a scaling factor to prevent the dot product from becoming too large.
$\text{softmax}$ is used to normalize the dot product scores so that they sum to 1 and represent attention weights for each encoder state.

The output of the attention mechanism is a weighted sum of the value vectors, where the weights are the attention scores. This weighted sum is then combined with the decoder state to produce the final output of the model.
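As a quick check of the equation above, here is a minimal PyTorch sketch of scaled dot product attention; the function name and the toy tensor shapes are just for illustration. (The Attention module in the network below uses the same three projections but leaves out the $\sqrt{d_k}$ scaling.)

import math
import torch
from torch.nn.functional import softmax

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    # QK^T, scaled by sqrt(d_k) so the dot products do not grow too large
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    # softmax over the keys so the weights for each query sum to 1
    weights = softmax(scores, dim=-1)
    # weighted sum of the value vectors
    return torch.matmul(weights, v)

# toy check: batch of 1, sequence of 5 elements, d_k = 32
q = k = v = torch.randn(1, 5, 32)
out = scaled_dot_product_attention(q, k, v)   # shape: (1, 5, 32)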

Basic pointer network I built

import torch
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.distributions import Categorical
from torch.nn.functional import softmax

import random

# Run on the GPU when one is available, otherwise fall back to the CPU.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Encoder(nn.Module):
    # Embeds each input with a single linear layer instead of an LSTM/GRU.
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc = nn.Linear(input_size, hidden_size)

    def forward(self, inputs):
        x = self.fc(inputs)
        # x = F.relu(x)
        return x

    def init_hidden(self):
        # Placeholder hidden vectors for the query, key and value weights;
        # they are frozen (requires_grad=False) and not used elsewhere yet.
        self.w_query = Parameter(torch.zeros(1), requires_grad=False)
        self.w_key = Parameter(torch.zeros(1), requires_grad=False)
        self.w_value = Parameter(torch.zeros(1), requires_grad=False)

        return self.w_query, self.w_key, self.w_value


class Attention(nn.Module):
    # Single-head dot product attention built from three linear projections.
    def __init__(self, emb_dim, model_dim):
        super(Attention, self).__init__()
        self.emb_dim = emb_dim
        self.model_dim = model_dim
        self.to_query = nn.Linear(emb_dim, model_dim, bias=False)
        self.to_key = nn.Linear(emb_dim, model_dim, bias=False)
        self.to_value = nn.Linear(emb_dim, model_dim, bias=False)

    def forward(self, inputs):
        q = self.to_query(inputs)
        k = self.to_key(inputs)
        v = self.to_value(inputs)
        # Compute attention scores
        attn_scores = torch.matmul(q, k.transpose(-1, -2))
        # Convert attention scores into probability distributions
        softmax_attn_scores = softmax(attn_scores, dim=-1)
        # Weighted sum of the value vectors
        output = torch.matmul(softmax_attn_scores, v)
        return output


class Decoder(nn.Module):
    # Maps the attended representation to a score for each candidate position.
    def __init__(self, input_size, hidden_size) -> None:
        super().__init__()
        self.fc = nn.Linear(input_size, hidden_size)

    def forward(self, inputs):
        x = self.fc(inputs)
        # x = F.relu(x)
        return x

class PointerNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.output_size = output_size
        self.encoder = Encoder(input_size, hidden_size)
        self.attention = Attention(hidden_size, hidden_size)
        self.decoder = Decoder(hidden_size, output_size)

    def forward(self, input):
        # encode -> attend -> decode into one score per candidate position
        x = self.encoder(input)
        x = self.attention(x)
        x = self.decoder(x)

        return x

    def act(self, state, epsilon):
        # if random.random() > epsilon:
        state = torch.as_tensor(state, dtype=torch.float32, device=DEVICE).unsqueeze(0)
        q_value = self.forward(state)
        # Turn the scores into a probability distribution and sample a pointer
        q_value = softmax(q_value, dim=-1)
        m = Categorical(q_value.squeeze(0))

        action = m.sample()
        return action.item(), m.log_prob(action)
        # else:
        #     action = random.randrange(self.output_size)
        #     return action, epsilon
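
To check the shapes end to end, here is a small usage sketch; the input size of 2, hidden size of 32, and the 5 candidate positions are just example values, not from a real task.

# toy setup: each state has 2 features, the network points at one of 5 positions
model = PointerNetwork(input_size=2, hidden_size=32, output_size=5).to(DEVICE)

state = [0.3, -1.2]                      # a single state with 2 features
action, log_prob = model.act(state, epsilon=0.1)
print(action, log_prob)                  # an index in [0, 5) and its log-probability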