These notes are designed to accompany the Test_Scores.ipynb
Jupyter notebook. We will explore how to build and interpret a multiple linear regression model to predict student exam scores.
Simple linear regression models the relationship between a single independent variable (X) and a dependent variable (Y). However, in most real-world scenarios, an outcome is influenced by more than one factor.
Multiple Linear Regression is an extension of this concept. It allows us to model the linear relationship between a dependent variable and two or more independent variables.
The goal is to find an equation that best predicts the dependent variable (our target) as a linear combination of the independent variables (our features).
The general form of the equation is:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

Here Y is the dependent variable (`final_exam_score`) and X₁ through Xₙ are the independent variables (`minutes_studying`, `current_grade`, etc.). Note: to match the statsmodels output, we run a model without an intercept (dropping the β₀ term) for simplicity in interpretation.

Before building a model, it's crucial to understand the relationships between our variables. The correlation matrix heatmap in the notebook is a powerful tool for this.
- `minutes_studying` has a strong positive correlation (0.8) with `final_exam_score`. This is indicated by a light color.
- `screen_time_minutes` has a negative correlation (-0.23) with `final_exam_score`. This is indicated by a dark color.
- `num_pets` has a very weak correlation (0.13) with the final exam score.

The heatmap also helps us spot potential issues. Collinearity occurs when two or more independent variables are highly correlated with each other (e.g., `minutes_studying` and `current_grade`). This can sometimes make it difficult for the model to determine the individual effect of each variable.
statsmodels OLS Regression Results

The statsmodels library provides a detailed summary of our regression model. Let's break down the key components from the notebook's output.
OLS Regression Results
================================================================================
Dep. Variable: final_exam_score R-squared (uncentered): 0.997
Model: OLS Adj. R-squared (uncentered): 0.997
...
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
student_id -0.0129 0.018 -0.720 0.474 -0.049 0.023
minutes_studying 0.4746 0.029 16.317 0.000 0.417 0.533
current_grade 0.6617 0.044 14.945 0.000 0.574 0.750
num_pets 0.2385 0.323 0.738 0.463 -0.405 0.882
screen_time_minutes -0.0057 0.003 -1.746 0.085 -0.012 0.001
==============================================================================
Omnibus: 0.974 Durbin-Watson: 2.299
...
Kurtosis: 3.305 Cond. No. 555.
==============================================================================
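The table above is printed by statsmodels, but the fit itself is ordinary least squares. A minimal NumPy sketch of fitting a no-intercept model on synthetic data (the column layout and coefficient values are illustrative, not the notebook's actual data):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Design matrix with one column per feature and NO intercept column,
# matching the no-intercept model in these notes
X = np.column_stack([
    rng.uniform(0, 300, n),   # minutes_studying
    rng.uniform(40, 100, n),  # current_grade
    rng.uniform(0, 600, n),   # screen_time_minutes
])
true_beta = np.array([0.47, 0.66, -0.006])
y = X @ true_beta + rng.normal(0, 1.0, n)  # noisy final_exam_score

# Ordinary least squares: minimize ||y - X @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to true_beta

# Interpreting a coefficient: raise one feature by one unit and the
# prediction moves by that coefficient, all else held constant
x = X[0].copy()
bump = x.copy()
bump[0] += 1.0  # one extra minute of studying
print(bump @ beta_hat - x @ beta_hat)  # ~ beta_hat[0]
```

In the notebook, the equivalent statsmodels call is `sm.OLS(y, X).fit()`, whose `.summary()` prints a table like the one shown above.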
`coef` (Coefficients): This is one of the most important parts. Each coefficient tells you how much `final_exam_score` is expected to change if that independent variable increases by one unit, while all other variables are held constant.

- `minutes_studying` (0.4746): For every additional minute a student studies, their final exam score is predicted to increase by about 0.47 points, assuming their current grade, number of pets, etc., stay the same.
- `current_grade` (0.6617): For each point higher in their current grade, a student's final exam score is predicted to increase by about 0.66 points, all else being equal.
- `screen_time_minutes` (-0.0057): For every additional minute of screen time, the score is predicted to decrease by about 0.0057 points.

`P>|t|`
(p-value): This value helps determine the statistical significance of each variable. It tests the null hypothesis that the coefficient is 0 (i.e., the variable has no effect on the target). A low p-value (typically below 0.05) indicates that the variable is a statistically significant predictor of `final_exam_score`.

- `minutes_studying` and `current_grade` have p-values of 0.000, making them highly significant.
- `num_pets` has a p-value of 0.463, which is high. This indicates that the number of pets is not a statistically significant predictor of the exam score in this model.

`Cond. No.` (Condition Number): This number helps diagnose multicollinearity. A high condition number (often cited as greater than 30) suggests that there may be strong correlations between your independent variables, which can make the coefficient estimates less reliable. Our value of 555 is very high, indicating that multicollinearity is likely present in our model (as we suspected from the heatmap).
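The same diagnostic can be recomputed directly from the design matrix with `np.linalg.cond`; highly correlated columns inflate it. A sketch with synthetic features (the column construction is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

minutes_studying = rng.uniform(0, 300, n)
# current_grade is made (artificially) almost proportional to
# minutes_studying, so the two columns are nearly collinear
current_grade = 0.2 * minutes_studying + rng.normal(0, 0.5, n)

X_collinear = np.column_stack([minutes_studying, current_grade])
X_independent = np.column_stack([minutes_studying, rng.uniform(0, 300, n)])

print(np.linalg.cond(X_collinear))    # well above the ~30 rule of thumb
print(np.linalg.cond(X_independent))  # much smaller
```

When the condition number is this large, small changes in the data can swing the estimated coefficients, which is exactly why the `num_pets` and `student_id` estimates should be read with caution.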