Linear Regression with Gradient Descent

We'll consider the data on the radial velocity of Galaxy NGC7531, described at

http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/galaxy.info.txt
Radial Velocity of Galaxy NGC7531

SUMMARY:
       The galaxy data frame records the  radial  velocity  of  a
       spiral  galaxy  measured  at 323 points in the area of sky
       which it covers.  All the measurements lie within seven
       slots crossing at the origin.  The positions of the
       measurements are given by four variables (columns).

DATA DESCRIPTION:
east.west:     the east-west coordinate.  The origin,  (0,0),  is
       near  the  center of the galaxy, east is negative, west is
       positive.
north.south:   the north-south coordinate.  The origin, (0,0), is
       near the center of the galaxy, south is negative, north is
       positive.
angle:    degrees of counter-clockwise rotation from the horizontal
       of the slot within which the observation lies.
radial.position:    signed  distance  from  origin;  negative  if
       east-west coordinate is negative.
velocity:   radial velocity measured in km/sec.

SOURCE:
       Buta, R. (1987)  The  Structure  and  Dynamics  of  Ringed
       Galaxies,  III:  Surface  Photometry and Kinematics of the
       Ringed Nonbarred Spiral NGC7531.    The  Astrophysical  J.
       Supplement Ser. Vol. 64, pp. 1--37.

         John M. Chambers and Trevor J. Hastie, (eds.)  Statistical
       Models in S, Wadsworth and Brooks, Pacific Grove, CA 1992,
       pg. 352.
In [1]:
import pandas as pd
from numpy import *
from numpy.linalg import norm

dat = pd.read_csv("galaxy.data")

x1 = dat.loc[:, "east.west"].to_numpy()
x2 = dat.loc[:, "north.south"].to_numpy()
y = dat.loc[:, "velocity"].to_numpy()
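
As a quick, optional sanity check on the load, we can inspect the frame's shape and first few rows (this cell is a sketch and was not part of the original run):

In [ ]:
print(dat.shape)                # the description above says 323 measured points
print(dat.columns.tolist())
print(dat.head())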

Now, let's write the cost function (the sum of squared errors) and its gradient.
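
Written out, with a 1 prepended to each input vector so that the first component of theta serves as the intercept (which is exactly what the vstack of ones does in the code below), the cost and its gradient are

    f(\theta) = \sum_{i=1}^{N} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,
    \qquad
    \nabla f(\theta) = -2 \sum_{i=1}^{N} \left( y^{(i)} - \theta^T x^{(i)} \right) x^{(i)}.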

In [2]:
def f(x, y, theta):
    # Sum of squared errors; a row of ones is prepended to x so that
    # theta[0] acts as the intercept.
    x = vstack((ones((1, x.shape[1])), x))
    return sum((y - dot(theta.T, x)) ** 2)

def df(x, y, theta):
    # Gradient of the cost with respect to theta.
    x = vstack((ones((1, x.shape[1])), x))
    return -2 * sum((y - dot(theta.T, x)) * x, 1)
    

The gradient descent algorithm is the same as before:

In [3]:
def grad_descent(f, df, x, y, init_t, alpha):
    EPS = 1e-5   # EPS = 10**(-5)
    prev_t = init_t - 10 * EPS
    t = init_t.copy()
    max_iter = 30000
    iter = 0
    # Step in the direction of the negative gradient until the parameters
    # stop changing (up to EPS) or we hit max_iter.
    while norm(t - prev_t) > EPS and iter < max_iter:
        prev_t = t.copy()
        t -= alpha * df(x, y, t)
        if iter % 500 == 0:
            print("Iter", iter)
            print("x = (%.2f, %.2f, %.2f), f(x) = %.2f" % (t[0], t[1], t[2], f(x, y, t)))
            print("Gradient: ", df(x, y, t), "\n")
        iter += 1
    return t

Let's check the gradient against a finite-difference approximation to make sure it is correct:

In [4]:
x = vstack((x1, x2))
theta = array([-3, 2, 1])

h = 0.000001
print((f(x, y, theta + array([0, h, 0])) - f(x, y, theta - array([0, h, 0]))) / (2 * h))
print(df(x, y, theta))
217161.297798
[-1030866.79673168   217161.22653259   -29883.8912257 ]
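
The check above only compares the second component of the gradient. Here is a minimal sketch that repeats the same central-difference check for every component, using the f, df, x, y and theta already defined (not part of the original run):

In [ ]:
h = 1e-6
for i in range(len(theta)):
    e = zeros(len(theta))
    e[i] = h                # perturb only component i
    approx = (f(x, y, theta + e) - f(x, y, theta - e)) / (2 * h)
    print("component %d: finite difference %.4f, df %.4f" % (i, approx, df(x, y, theta)[i]))
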
In [5]:
x = vstack((x1, x2))
theta0 = array([0., 0., 0.])
theta = grad_descent(f, df, x, y, theta0, 0.0000010)
Iter 0
x = (1.03, -0.05, 0.32), f(x) = 822017775.24
Gradient:  [-1028488.48191787    39503.94379375  -211469.28135073] 

Iter 500
x = (440.65, 0.38, -0.29), f(x) = 432219704.26
Gradient:  [-745195.41327602   -1246.48132255    2086.28965059] 

Iter 1000
x = (759.38, 0.92, -1.19), f(x) = 227355448.41
Gradient:  [-540285.76135469    -903.73088504    1512.61343293] 

Iter 1500
x = (990.47, 1.30, -1.83), f(x) = 119666065.36
Gradient:  [-391721.01534996    -655.22803896    1096.68348152] 

Iter 2000
x = (1158.01, 1.58, -2.30), f(x) = 63057833.85
Gradient:  [-284007.76929983    -475.05711064     795.12361351] 

Iter 2500
x = (1279.49, 1.79, -2.64), f(x) = 33301029.76
Gradient:  [-205912.90704841    -344.42857288     576.48498534] 

Iter 3000
x = (1367.56, 1.93, -2.89), f(x) = 17659004.46
Gradient:  [-149292.13166828    -249.7195372      417.96637991] 

Iter 3500
x = (1431.41, 2.04, -3.07), f(x) = 9436583.97
Gradient:  [-108240.61928678    -181.05306055     303.03633083] 

Iter 4000
x = (1477.71, 2.12, -3.20), f(x) = 5114368.82
Gradient:  [-78477.22135563   -131.26810622    219.70910154] 

Iter 4500
x = (1511.28, 2.17, -3.29), f(x) = 2842343.96
Gradient:  [-56897.99552406    -95.17273919    159.29472604] 

Iter 5000
x = (1535.61, 2.21, -3.36), f(x) = 1648026.51
Gradient:  [-41252.50408631    -69.00267358    115.4927564 ] 

Iter 5500
x = (1553.26, 2.24, -3.41), f(x) = 1020219.04
Gradient:  [-29909.12206515    -50.02870571     83.73520651] 

Iter 6000
x = (1566.05, 2.26, -3.44), f(x) = 690204.43
Gradient:  [-21684.87956116    -36.27209303     60.71016944] 

Iter 6500
x = (1575.33, 2.28, -3.47), f(x) = 516728.24
Gradient:  [-15722.09309781    -26.29819648     44.01642781] 

Iter 7000
x = (1582.05, 2.29, -3.49), f(x) = 425538.37
Gradient:  [-11398.92018671    -19.06686603     31.91303756] 

Iter 7500
x = (1586.93, 2.30, -3.50), f(x) = 377603.31
Gradient:  [-8264.50909651   -13.82396623    23.13776962] 

Iter 8000
x = (1590.46, 2.31, -3.51), f(x) = 352405.67
Gradient:  [-5991.98077428   -10.02272959    16.7754756 ] 

Iter 8500
x = (1593.02, 2.31, -3.52), f(x) = 339160.22
Gradient:  [-4344.33953427    -7.26673567    12.16264948] 

Iter 9000
x = (1594.88, 2.31, -3.53), f(x) = 332197.59
Gradient:  [-3149.75743415    -5.26856948     8.8182324 ] 

Iter 9500
x = (1596.23, 2.32, -3.53), f(x) = 328537.60
Gradient:  [-2283.65481466    -3.81984782     6.39344435] 

Iter 10000
x = (1597.21, 2.32, -3.53), f(x) = 326613.68
Gradient:  [-1655.708232      -2.76948751     4.63541091] 

Iter 10500
x = (1597.91, 2.32, -3.53), f(x) = 325602.35
Gradient:  [-1200.43087594    -2.00794938     3.36079164] 

Iter 11000
x = (1598.43, 2.32, -3.54), f(x) = 325070.73
Gradient:  [-870.34313176   -1.45581473    2.43666002] 

Iter 11500
x = (1598.80, 2.32, -3.54), f(x) = 324791.28
Gradient:  [-631.02106267   -1.05550297    1.76664092] 

Iter 12000
x = (1599.07, 2.32, -3.54), f(x) = 324644.38
Gradient:  [-457.50643281   -0.76526668    1.28085992] 

Iter 12500
x = (1599.26, 2.32, -3.54), f(x) = 324567.16
Gradient:  [-331.70388192   -0.55483795    0.92865625] 

Iter 13000
x = (1599.41, 2.32, -3.54), f(x) = 324526.57
Gradient:  [-240.49381034   -0.40227172    0.67329956] 

Iter 13500
x = (1599.51, 2.32, -3.54), f(x) = 324505.23
Gradient:  [-174.36417228   -0.2916573     0.48815943] 

Iter 14000
x = (1599.58, 2.32, -3.54), f(x) = 324494.02
Gradient:  [-126.4184909    -0.21145901    0.35392809] 

Iter 14500
x = (1599.64, 2.32, -3.54), f(x) = 324488.12
Gradient:  [-91.65664387  -0.1533132    0.25660693] 

Iter 15000
x = (1599.68, 2.32, -3.54), f(x) = 324485.02
Gradient:  [-66.45341441  -0.111156     0.1860466 ] 

Iter 15500
x = (1599.71, 2.32, -3.54), f(x) = 324483.39
Gradient:  [-48.18042752  -0.08059095   0.13488855] 

Iter 16000
x = (1599.73, 2.32, -3.54), f(x) = 324482.54
Gradient:  [-34.932044    -0.0584305    0.09779765] 

Iter 16500
x = (1599.74, 2.32, -3.54), f(x) = 324482.09
Gradient:  [-25.32662661  -0.04236361   0.0709058 ] 

Iter 17000
x = (1599.75, 2.32, -3.54), f(x) = 324481.85
Gradient:  [-18.36245299  -0.0307147    0.05140852] 

Iter 17500
x = (1599.76, 2.32, -3.54), f(x) = 324481.72
Gradient:  [-13.31324873  -0.02226895   0.0372725 ] 

For linear least squares there is also an exact, closed-form solution, given by the normal equations: theta = (X X^T)^{-1} X y, where X is the 3xN design matrix whose first row is all ones and whose remaining rows are x1 and x2:

In [7]:
x = vstack((x1, x2))
x = vstack((ones((1, x.shape[1])), x))
dot(dot(linalg.inv(dot(x, x.T)), x), y)   # (X X^T)^{-1} X y
Out[7]:
array([ 1599.7805884 ,     2.32128786,    -3.53935822])
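
As a final sanity check (a sketch using the names defined above; theta is the result returned by grad_descent), the cost at the closed-form solution should be no larger than the cost gradient descent reached (about 324481.7 at iteration 17500):

In [ ]:
x_feat = vstack((x1, x2))                               # features only; f prepends the bias row itself
X = vstack((ones((1, x_feat.shape[1])), x_feat))        # design matrix with the bias row
theta_exact = dot(dot(linalg.inv(dot(X, X.T)), X), y)   # normal-equation solution, same as Out[7]
print(f(x_feat, y, theta_exact))                        # minimum of the cost
print(f(x_feat, y, theta))                              # cost at the gradient-descent estimate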