
I'm trying to implement Adagrad in Python. For learning purposes, I'm using matrix factorisation as an example, and I'll be using Autograd to compute the gradients.

My main question is whether the implementation is correct.

Problem description

Given a matrix A (M x N) with some missing entries, decompose it into W and H of sizes (M x k) and (k x N) respectively. The goal is to learn W and H using Adagrad. I'll be following this guide for the Autograd implementation.

NB: I am well aware that ALS-based implementations are better suited to this problem. I'm using Adagrad purely for learning purposes.
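Concretely, the quantity being minimised is the root-mean-squared error over the observed entries only (this is exactly what the cost function below computes):

$$\text{cost}(W, H) = \sqrt{\frac{1}{|\Omega|} \sum_{(i, j) \in \Omega} \big((W H)_{ij} - A_{ij}\big)^2}$$

where $\Omega$ is the set of non-missing entries of $A$.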

Customary imports

import autograd.numpy as np
import pandas as pd

Creating the matrix to be decomposed

A = np.array([[3, 4, 5, 2],
              [4, 4, 3, 3],
              [5, 5, 4, 3]], dtype=np.float32).T

Masking one entry

A[0, 0] = np.nan

Defining the cost function

def cost(W, H):
    pred = np.dot(W, H)
    mask = ~np.isnan(A)
    return np.sqrt(((pred - A)[mask].flatten() ** 2).mean(axis=None))
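As a quick sanity check (not part of the original code, and using a hypothetical fixed seed only to make it repeatable), the cost can be evaluated on a small random initialisation with rank 2:

np.random.seed(0)  # hypothetical seed, just for a repeatable check
W0 = np.abs(np.random.randn(A.shape[0], 2))
H0 = np.abs(np.random.randn(2, A.shape[1]))
print(cost(W0, H0))  # RMSE over the eleven observed entries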

Decomposition params

rank = 2
learning_rate = 0.01
n_steps = 10000

Gradient of cost wrt params W and H

from autograd import grad, multigrad
grad_cost = multigrad(cost, argnums=[0, 1])
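If multigrad is not available (it has been removed from newer autograd releases), an equivalent sketch, assuming a current autograd where grad accepts an argnum argument, is to build one gradient function per argument and wrap them so the rest of the code stays unchanged:

from autograd import grad

# one gradient function per argument; argnum selects which argument to differentiate
grad_cost_W = grad(cost, 0)
grad_cost_H = grad(cost, 1)

def grad_cost(W, H):
    # returns the same (dW, dH) pair that multigrad would produce
    return grad_cost_W(W, H), grad_cost_H(W, H)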

Main Adagrad routine (this needs to be checked)

shape = A.shape

# Initialising W and H
H = np.abs(np.random.randn(rank, shape[1]))
W = np.abs(np.random.randn(shape[0], rank))

# gt_w and gt_h accumulate the sum of squared gradients for W and H
gt_w = np.zeros_like(W)
gt_h = np.zeros_like(H)

# stability factor
eps = 1e-8
print "Iteration, Cost"
for i in range(n_steps):

    if i%1000==0:
        print "*"*20
        print i,",", cost(W, H)

    # computing grad. wrt W and H
    del_W, del_H = grad_cost(W, H)

    # accumulating the sum of squared gradients
    gt_w += np.square(del_W)
    gt_h += np.square(del_H)

    # per-parameter (adaptive) learning rates
    mod_learning_rate_W = np.divide(learning_rate, np.sqrt(gt_w + eps))
    mod_learning_rate_H = np.divide(learning_rate, np.sqrt(gt_h + eps))
    W = W - del_W * mod_learning_rate_W
    H = H - del_H * mod_learning_rate_H
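Once the loop finishes, a quick way to inspect the result (again, not part of the original code) is to look at the reconstruction and the value imputed for the masked entry:

pred = np.dot(W, H)
print(pred[0, 0])          # the estimate Adagrad produces for the masked entry A[0, 0]
print(np.round(pred, 2))   # full reconstruction, to compare against A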

While the optimisation converges and I get a reasonable solution, I was wondering whether the implementation is correct. Specifically, is my understanding of accumulating the sum of squared gradients and then computing the adaptive learning rate correct?
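For reference, the per-parameter Adagrad update I'm trying to reproduce is (with all operations element-wise, $\eta$ the learning rate and $\epsilon$ the stability factor):

$$G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$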
