Linear Least Squares (Key)¶

In class we derived the formula for linear least squares in one variable. In this notebook you will learn a bit of the numerical library numpy, use numpy to compute a linear regression, and then compute it yourself using the formulas from class.

Begin by running the cell below. Then go back and carefully read through all the code. There is a lot of new stuff here. Note how to create numpy arrays/matrices and how to compute a linear least squares regression.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Prepare your data
# x: Independent variable (input)
# y: Dependent variable (output)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Step 2: Perform linear regression using the least squares method

# Add a column of ones to the input data for the intercept (bias term)
X = np.vstack([x, np.ones(len(x))]).T

# Calculate the slope (a) and intercept (b)
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"Slope (a): {a:.4f}")
print(f"Intercept (b): {b:.4f}")

# Step 3: Predict y values using the regression line
y_pred = a * x + b

# Optional: Plot the data and the regression line
plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, y_pred, color='red', label='Regression line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
Slope (a): 0.6000
Intercept (b): 2.2000
[Figure: scatter of the data points (blue) with the fitted regression line (red)]
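
As an aside, numpy also offers np.polyfit, which fits a degree-1 polynomial and returns the same slope and intercept; a minimal sketch (not used in the rest of this notebook):

# Equivalent fit with np.polyfit: for degree 1 the coefficients come back
# highest degree first, i.e. (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)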

An aside about numpy matrices¶

What happened to x? Here's the original $x$, which is an array:

In [2]:
x
Out[2]:
array([1, 2, 3, 4, 5])

We stack a row of 1s below it and take the transpose to get the input matrix $X$:

In [3]:
X
Out[3]:
array([[1., 1.],
       [2., 1.],
       [3., 1.],
       [4., 1.],
       [5., 1.]])

Breaking this down into pieces, first let's make a Python list that contains $x$ and an array of ones:

In [4]:
[x,np.ones(len(x))]
Out[4]:
[array([1, 2, 3, 4, 5]), array([1., 1., 1., 1., 1.])]

Now let's use numpy to make a vertical stack. The first element in the list becomes the first row:

In [5]:
np.vstack([x, np.ones(len(x))])
Out[5]:
array([[1., 2., 3., 4., 5.],
       [1., 1., 1., 1., 1.]])

And now take the transpose:

In [6]:
np.vstack([x, np.ones(len(x))]).T
Out[6]:
array([[1., 1.],
       [2., 1.],
       [3., 1.],
       [4., 1.],
       [5., 1.]])
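
An equivalent one-step construction (not used above) is np.column_stack, which stacks 1-D arrays as columns; a minimal sketch:

# Same result as np.vstack([x, np.ones(len(x))]).T
np.column_stack([x, np.ones(len(x))])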

Practice with matrices¶

Make a numpy matrix that is a row of 5 zeros followed by a row of 5 ones, then 5 zeros, then 5 ones again. Use built-in functions and vstack (don't just type out a bunch of 0s and 1s; can you guess the name of the function that makes an array of zeros?).

In [7]:
np.vstack([np.zeros(5), np.ones(5), np.zeros(5), np.ones(5)])
Out[7]:
array([[0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1.]])

Now make a similar matrix that is a row of all 1s followed by all 2s in the second row, then 3s, then 4s. Again use the built-in function np.ones. Name this matrix M. Hint: $[2,2,2,2,2] = 2\cdot[1,1,1,1,1]$

In [8]:
a = np.ones(5)
M = np.vstack([a, 2*a, 3*a, 4*a])

Compute $M$ times $M$ transpose and $M$ transpose times $M$ ($MM^T$ and $M^TM$). In numpy, the product $AB$ can be computed with A @ B for matrices A and B.

In [9]:
M @ M.T
Out[9]:
array([[ 5., 10., 15., 20.],
       [10., 20., 30., 40.],
       [15., 30., 45., 60.],
       [20., 40., 60., 80.]])
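
The other product, $M^TM$, is a $5\times 5$ matrix whose every entry is $1^2+2^2+3^2+4^2 = 30$; a quick check:

# M is 4x5, so M.T @ M is 5x5; each entry is 1^2 + 2^2 + 3^2 + 4^2 = 30
M.T @ M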

Linear Least Squares Regression¶

You can create a vector of normally distributed samples with mean $\mu$ and standard deviation $\sigma$ by using the numpy function np.random.normal(mu, sigma, n). Try creating a vector with 10 random samples, with a mean of 100 and a standard deviation of 5.

In [10]:
np.random.normal(100,5,10)
Out[10]:
array([100.93573996, 100.20696146, 100.33860864, 101.54845134,
        97.07334677, 101.96880446, 103.85349105, 102.62606973,
       104.37269392,  95.25899349])
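
If you want reproducible samples across runs, you can seed numpy's random generator first; a minimal sketch (the seed value 42 is arbitrary):

# Seeding makes the pseudo-random draws repeatable across runs
np.random.seed(42)
np.random.normal(100, 5, 10)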

Now create some data for linear regression. Make a vector $x$ of ints over the range $[0,9]$ and let $y$ be a linear function of $x$, $y = 3x+2+\epsilon(x)$ where $\epsilon(x)$ is a random Gaussian noise function $\epsilon(x) \sim N(0,1)$. Make a scatter plot of $y$ vs. $x$ and label it.

In [11]:
x = np.array(range(10))
y = 3*x+2 + np.random.normal(0,1,10)
plt.scatter(x,y);
plt.title("Linear function $y \\approx 3x+2$");
[Figure: scatter plot of the noisy data, titled "Linear function $y \approx 3x+2$"]

Compute the correct linear regression coefficients using numpy as above. Check that they are reasonable.

In [12]:
# Add a column of ones to the input data for the intercept (bias term)
X = np.vstack([x, np.ones(len(x))]).T

# Calculate the slope (a0) and intercept (b0)
a0, b0 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"Slope (a): {a0}")
print(f"Intercept (b): {b0}")
Slope (a): 3.1815959167949623
Intercept (b): 1.6929138583475791

Now compute the regression coefficients using the formulas from class. Begin by defining some very helpful variables: Sx and Sy will be $\sum_i {x_i}$ and $\sum_i {y_i}$ respectively. Next, Sxx and Syy are the sums of squares $\sum_i {x_i}^2$ and $\sum_i {y_i}^2$. Finally, the inner product Sxy $= \sum_i x_iy_i$. The quickest way to do this involves comprehensions and the sum function, but you can use loops for now if you need to.

In [13]:
Sx = sum(x)
Sy = sum(y)
Sxx = sum([i**2 for i in x])
Syy = sum([i**2 for i in y])
Sxy = sum([i*j for i,j in zip(x,y)])
In [14]:
Sx, Sy, Sxx, Syy, Sxy
Out[14]:
(45, 160.1009548392492, 285, 3403.5519567262904, 982.9359599122058)
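
As an aside, these sums can also be computed with vectorized numpy operations rather than comprehensions; a minimal sketch of equivalent calls:

# Vectorized equivalents of the comprehensions above
Sx  = np.sum(x)
Sy  = np.sum(y)
Sxx = np.sum(x**2)
Syy = np.sum(y**2)
Sxy = np.dot(x, y)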

Finally determine $a,b$ as in class. Display the absolute errors between your calculations and the ones numpy returned. (They should be close to machine precision, which is $10^{-15}$, give or take.)
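
For reference, these are the closed-form one-variable least squares formulas that the code below implements:

$$a = \frac{nS_{xy} - S_xS_y}{nS_{xx} - S_x^2}, \qquad b = \frac{S_{xx}S_y - S_xS_{xy}}{nS_{xx} - S_x^2}$$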

In [15]:
n = len(x)
denom = n*Sxx - Sx*Sx
a = (n*Sxy - Sx*Sy) / denom
b = (Sxx*Sy - Sx*Sxy) / denom
In [16]:
a,b
Out[16]:
(3.1815959167949615, 1.6929138583475882)
In [17]:
a-a0, b-b0 
Out[17]:
(-8.881784197001252e-16, 9.103828801926284e-15)

Application¶

Write a function linear_least_squares(x,y) which takes input vectors x,y and returns a,b as above. (Did you know Python functions can return two values? Here's an example.)

In [18]:
def two_numbers():
    a = 1
    b = 10
    return a,b
In [19]:
A, B = two_numbers()
print(A,B)
1 10
In [20]:
def linear_least_squares(x,y):
    Sx = sum(x)
    Sy = sum(y)
    Sxx = sum([i**2 for i in x])
    Syy = sum([i**2 for i in y])
    Sxy = sum([i*j for i,j in zip(x,y)])
    n = len(x)
    denom = n*Sxx - Sx*Sx
    a = (n*Sxy - Sx*Sy) / denom
    b = (Sxx*Sy - Sx*Sxy) / denom
    return a,b
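
As a quick sanity check (a sketch, reusing the x, y, a0, b0 from above), the function should reproduce the lstsq coefficients up to machine precision:

# Differences should be on the order of 1e-15
a_check, b_check = linear_least_squares(x, y)
print(a_check - a0, b_check - b0)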

Application¶

Now, using $a=5, b=-15$, run linear least squares 100 times on 100 vector pairs $(x,y)$, where the $x$ vector is the same each time but each $y$ gets its own draw of Gaussian noise. Plot the resulting best-fit lines all on the same graph.

  • Use np.arange to make your input vector $x$ cover the domain $[-5,5]$ with a step-size of 0.01
  • Create arrays to store all the computed a and b values (you'll use this later)
  • If you call plt.plot() in a loop, it will keep adding to the same plot
  • Give your plot a title!
In [21]:
all_a = []
all_b = []
x = np.arange(-5, 5, 0.01)   # same x for every trial
for i in range(100):
    y = 5*x - 15 + np.random.normal(0, 10, len(x))   # fresh noise each trial
    a, b = linear_least_squares(x, y)
    y_pred = a*x + b
    all_a += [a]
    all_b += [b]
    plt.plot(x, y_pred)
plt.title("100 least-squares fits to $y = 5x - 15$ plus noise");
[Figure: 100 best-fit lines plotted on the same axes]

Determine the average of the $a$s and $b$s returned above. Compare these to the true $a, b$. Explain your result. (There is an np.mean function.)

In [22]:
np.mean(all_a), np.mean(all_b)
Out[22]:
(5.005095981078317, -15.002493519769928)
In [23]:
plt.hist(all_a);
[Figure: histogram of the 100 estimated slopes all_a]