These notes are designed to accompany the Test_Scores.ipynb
Jupyter notebook. We will explore how to build and interpret a multiple linear regression model to predict student exam scores.
Simple linear regression models the relationship between a single independent variable (X) and a dependent variable (Y). However, in most real-world scenarios, an outcome is influenced by more than one factor.
Multiple Linear Regression is an extension of this concept. It allows us to model the linear relationship between a dependent variable and two or more independent variables.
The goal is to find an equation that best predicts the dependent variable (our target) as a linear combination of the independent variables (our features).
The general form of the equation is:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

Here Y is the dependent variable (`final_exam_score`) and X₁ through Xₙ are the independent variables (`minutes_studying`, `current_grade`, etc.). Note: to match the statsmodels output, we run a model without an intercept (dropping the β₀ term) for simplicity in interpretation.

Before building a model, it's crucial to understand the relationships between our variables. The correlation matrix heatmap in the notebook is a powerful tool for this.
- `minutes_studying` has a strong positive correlation (0.8) with `final_exam_score`. This is indicated by a light color.
- `screen_time_minutes` has a negative correlation (-0.23) with `final_exam_score`. This is indicated by a dark color.
- `num_pets` has a very weak correlation (0.13) with the final exam score.

The heatmap also helps us spot potential issues. Collinearity occurs when two or more independent variables are highly correlated with each other (e.g., `minutes_studying` and `current_grade`). This can sometimes make it difficult for the model to determine the individual effect of each variable.
statsmodels OLS Regression Results

The statsmodels library provides a detailed summary of our regression model. Let's break down the key components from the notebook's output.
OLS Regression Results
================================================================================
Dep. Variable: final_exam_score R-squared (uncentered): 0.997
Model: OLS Adj. R-squared (uncentered): 0.997
...
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
student_id -0.0129 0.018 -0.720 0.474 -0.049 0.023
minutes_studying 0.4746 0.029 16.317 0.000 0.417 0.533
current_grade 0.6617 0.044 14.945 0.000 0.574 0.750
num_pets 0.2385 0.323 0.738 0.463 -0.405 0.882
screen_time_minutes -0.0057 0.003 -1.746 0.085 -0.012 0.001
==============================================================================
Omnibus: 0.974 Durbin-Watson: 2.299
...
Kurtosis: 3.305 Cond. No. 555.
==============================================================================
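The table above is printed by statsmodels, but the fit itself is ordinary least squares. A minimal NumPy sketch of fitting a no-intercept model on synthetic data (the column layout and coefficient values are illustrative, not the notebook's actual data):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Design matrix with one column per feature and NO intercept column,
# matching the no-intercept model in these notes
X = np.column_stack([
    rng.uniform(0, 300, n),   # minutes_studying
    rng.uniform(40, 100, n),  # current_grade
    rng.uniform(0, 600, n),   # screen_time_minutes
])
true_beta = np.array([0.47, 0.66, -0.006])
y = X @ true_beta + rng.normal(0, 1.0, n)  # noisy final_exam_score

# Ordinary least squares: minimize ||y - X @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to true_beta

# Interpreting a coefficient: raise one feature by one unit and the
# prediction moves by that coefficient, all else held constant
x = X[0].copy()
bump = x.copy()
bump[0] += 1.0  # one extra minute of studying
print(bump @ beta_hat - x @ beta_hat)  # ~ beta_hat[0]
```

In the notebook, the equivalent statsmodels call is `sm.OLS(y, X).fit()`, whose `.summary()` prints a table like the one shown above.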
`coef` (Coefficients): This is one of the most important parts. Each coefficient tells you how much `final_exam_score` is expected to change if that independent variable increases by one unit, while all other variables are held constant.

- `minutes_studying` (0.4746): For every additional minute a student studies, their final exam score is predicted to increase by about 0.47 points, assuming their current grade, number of pets, etc., stay the same.
- `current_grade` (0.6617): For each point higher in their current grade, a student's final exam score is predicted to increase by about 0.66 points, all else being equal.
- `screen_time_minutes` (-0.0057): For every additional minute of screen time, the score is predicted to decrease by about 0.0057 points.

`P>|t|`
(p-value): This value helps determine the statistical significance of each variable. It tests the null hypothesis that the coefficient is 0 (i.e., the variable has no effect on the target). A low p-value (typically below 0.05) indicates that the variable is a statistically significant predictor of `final_exam_score`.

- `minutes_studying` and `current_grade` have p-values of 0.000, making them highly significant.
- `num_pets` has a p-value of 0.463, which is high. This indicates that the number of pets is not a statistically significant predictor of the exam score in this model.

`Cond. No.` (Condition Number): This number helps diagnose multicollinearity. A high condition number (often cited as greater than 30) suggests that there may be strong correlations between your independent variables, which can make the coefficient estimates less reliable. Our value of 555 is very high, indicating that multicollinearity is likely present in our model (as we suspected from the heatmap).
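The same diagnostic can be recomputed directly from the design matrix with `np.linalg.cond`; highly correlated columns inflate it. A sketch with synthetic features (the column construction is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

minutes_studying = rng.uniform(0, 300, n)
# current_grade is made (artificially) almost proportional to
# minutes_studying, so the two columns are nearly collinear
current_grade = 0.2 * minutes_studying + rng.normal(0, 0.5, n)

X_collinear = np.column_stack([minutes_studying, current_grade])
X_independent = np.column_stack([minutes_studying, rng.uniform(0, 300, n)])

print(np.linalg.cond(X_collinear))    # well above the ~30 rule of thumb
print(np.linalg.cond(X_independent))  # much smaller
```

When the condition number is this large, small changes in the data can swing the estimated coefficients, which is exactly why the `num_pets` and `student_id` estimates should be read with caution.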