Lecture Notes: Introduction to Multiple Linear Regression

These notes are designed to accompany the Test_Scores.ipynb Jupyter notebook. We will explore how to build and interpret a multiple linear regression model to predict student exam scores.

1. What is a Multiple Linear Regression Model?

Simple linear regression models the relationship between a single independent variable (X) and a dependent variable (Y). However, in most real-world scenarios, an outcome is influenced by more than one factor.

Multiple Linear Regression is an extension of this concept. It allows us to model the linear relationship between a dependent variable and two or more independent variables.

The goal is to find an equation that best predicts the dependent variable (our target) as a linear combination of the independent variables (our features).

The general form of the equation is:

$$ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon $$

Here $Y$ is the dependent variable, $X_1, \dots, X_p$ are the independent variables, $\beta_0$ is the intercept, each $\beta_i$ is the coefficient measuring the effect of $X_i$ while the other variables are held constant, and $\epsilon$ is the error term capturing variation the model does not explain.
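As a quick numerical illustration (using made-up coefficient values, not the notebook's fitted ones), a prediction is simply the intercept plus a weighted sum of the feature values:

```python
import numpy as np

# Hypothetical coefficients for illustration only (not the notebook's estimates)
beta_0 = 10.0                         # intercept
betas = np.array([0.5, 0.7, -0.01])   # one coefficient per feature

# Feature values for one student: minutes studied, current grade, screen-time minutes
x = np.array([120, 85, 300])

# The regression prediction is the intercept plus the dot product of coefficients and features
y_hat = beta_0 + betas @ x
print(y_hat)  # 10 + 0.5*120 + 0.7*85 - 0.01*300 = 126.5
```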

2. Identifying Relationships: The Correlation Heatmap

Before building a model, it's crucial to understand the relationships between our variables. The correlation matrix heatmap in the notebook is a powerful tool for this.

How to Read the Heatmap:

Each cell shows the correlation coefficient between a pair of variables, ranging from -1 (a perfect negative linear relationship) through 0 (no linear relationship) to +1 (a perfect positive linear relationship). To find promising predictors, scan the row (or column) for final_exam_score and note which features correlate most strongly with it.

Collinearity (or Multicollinearity):

The heatmap also helps us spot potential issues. Collinearity occurs when two or more independent variables are highly correlated with each other (e.g., minutes_studying and current_grade). When features overlap like this, the model has trouble attributing the effect to one variable rather than the other, which can inflate the standard errors of the coefficients and make their estimates unstable.
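A minimal sketch of how such a heatmap can be produced with pandas and seaborn (the file name below is hypothetical; use whatever data file the notebook actually loads):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; substitute the notebook's actual data source
df = pd.read_csv("test_scores.csv")

# Pairwise Pearson correlations between the numeric columns
corr = df.corr(numeric_only=True)

# Annotated heatmap: values near +1 or -1 indicate strong linear relationships
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of student features")
plt.show()
```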

3. Interpreting the statsmodels OLS Regression Results

The statsmodels library provides a detailed summary of our regression model. Let's break down the key components from the notebook's output.
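As a rough sketch, a summary like the one below can be produced by fitting an OLS model with statsmodels; this assumes the DataFrame df from the heatmap example and the column names shown in the output. (The uncentered R-squared in the notebook's output suggests the model was fit without an added intercept term.)

```python
import statsmodels.api as sm

# Features and target, matching the columns in the summary below
features = ["student_id", "minutes_studying", "current_grade",
            "num_pets", "screen_time_minutes"]
X = df[features]
y = df["final_exam_score"]

# Ordinary least squares fit; use sm.add_constant(X) instead of X if an intercept is wanted
results = sm.OLS(y, X).fit()
print(results.summary())
```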


                         OLS Regression Results
================================================================================
Dep. Variable:     final_exam_score   R-squared (uncentered):           0.997
Model:                          OLS   Adj. R-squared (uncentered):      0.997
...
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
student_id             -0.0129      0.018     -0.720      0.474      -0.049       0.023
minutes_studying        0.4746      0.029     16.317      0.000       0.417       0.533
current_grade           0.6617      0.044     14.945      0.000       0.574       0.750
num_pets                0.2385      0.323      0.738      0.463      -0.405       0.882
screen_time_minutes    -0.0057      0.003     -1.746      0.085      -0.012       0.001
==============================================================================
Omnibus:                        0.974   Durbin-Watson:                   2.299
...
Kurtosis:                       3.305   Cond. No.                         555.
==============================================================================
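Beyond reading the printed table, the same quantities can be pulled from the fitted results object programmatically; a small sketch, reusing the results variable from the fitting example above:

```python
# Estimated coefficients (the "coef" column), indexed by feature name
print(results.params)

# p-values for each coefficient (the "P>|t|" column)
print(results.pvalues)

# 95% confidence intervals (the "[0.025  0.975]" columns)
print(results.conf_int())

# R-squared of the fit (reported as uncentered when no constant is included)
print(results.rsquared)
```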

Key Takeaways