Brogramo

Simple linear regression: Predicting the total number of wins using average points scored

Introduction

This post will determine if a team’s average number of points in a regular season is a predictor variable for the total number of wins in a regular season using an aggregated sample from the FiveThirtyEight NBA Elo dataset for the years 1995 to 2015.

Frequently used variables

The total_wins variable represents the total number of wins by a team in a regular season, and the avg_pts variable measures the average points scored by a team in a regular season.

Figure 1

OLS regression results for total number of wins and average points scored

How a simple linear regression model predicts the response variable using the predictor variable

In general terms, a simple linear regression model predicts the value of a response variable using one predictor variable if there is a linear relationship between the response and predictor variables.

Coefficient of correlation

The coefficient of correlation measures the strength and direction of the linear relationship between a response variable and a predictor variable.

The coefficient of correlation is denoted ρ (rho) for a population and r for a sample. According to ZyBooks (n.d.), an r-value greater than 0 and less than 0.40 corresponds to a weak correlation, a value greater than 0.40 and less than 0.80 corresponds to a moderate correlation, and a value greater than 0.80 and less than 1.0 corresponds to a strong correlation.

Additionally, if one variable increases and the other increases, the relationship is positive, and if one variable increases and the other variable decreases, the relationship is negative.
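The cut-offs above can be sketched as a small helper. This is only an illustration: the function name describe_correlation is made up, the boundary values 0.40 and 0.80 are assigned to the stronger category as an assumption (the quoted ranges leave them open), and negative values are classified by their absolute value, which is also an assumption.

```python
def describe_correlation(r: float) -> str:
    """Label a sample correlation r using the cut-offs quoted above.

    Assumptions: exactly 0.40 counts as moderate, exactly 0.80 as strong,
    and negative values are judged by their absolute value.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < 0.40:
        strength = "weak"
    elif size < 0.80:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction}"


# 0.4777 is the Pearson coefficient reported later in this post.
print(describe_correlation(0.4777))  # moderate positive
```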

Constructing a simple regression model

The population simple linear regression model is Y = β0 + β1X + ε, where ε is the regression error term, Y is the response variable, X is the predictor variable, β0 is the intercept, and β1 is the slope. The slope is not the coefficient of correlation: it is the expected change in Y for a one-unit increase in X.

Since the dataset at hand is a sample aggregated from the FiveThirtyEight NBA Elo dataset for the years 1995 to 2015, I will use the sample simple linear regression model, which is Ŷi = b0 + b1Xi. In this model, Ŷi is the predicted value of the response variable, Xi is the predictor variable, b0 is the estimated intercept, and b1 is the estimated slope. The caret (hat) symbol in Ŷ denotes an estimated value, and each index i corresponds to a data point in the sample data.
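For readers who want to see where the intercept and slope estimates come from, here is a minimal sketch of the ordinary least squares formulas in plain Python. The four data points and the function name fit_simple_ols are hypothetical and chosen only for illustration; they are not the NBA sample, so the resulting coefficients differ from those in figure 1.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: average points scored and total wins for four team-seasons
avg_pts = [90.0, 95.0, 100.0, 105.0]
total_wins = [30.0, 38.0, 41.0, 51.0]

b0, b1 = fit_simple_ols(avg_pts, total_wins)
print(round(b0, 2), round(b1, 2))
```

In practice a statistics package produces these estimates along with the test statistics, as the OLS summary in figure 1 does.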

I can begin using the model by inputting an X-value of average points scored, and the model will output the expected total number of wins in a regular season.

Ŷ = b0 + b1X
Ŷ = -85.5476 + 1.2849(X)
Total wins = -85.5476 + 1.2849(average points scored)
Where b0 = intercept and b1 = slope, both taken from the OLS model in figure 1

However, the usefulness of the model has not yet been determined. An F-test and a T-test will confirm if the model is useful for predicting the total number of wins using the average points scored.

An F-test will check if at least one predictor variable is useful for predicting the total number of wins, even though the model only has one predictor variable.

A T-test will check if there is an association between the average points scored and the total number of wins.

Typically, an F-test is used in a multiple regression model (a model with more than one predictor variable) to check if at least one predictor variable is useful for predicting the response variable. Once it is understood that at least one predictor variable is useful for predicting the response variable, a T-test checks each predictor variable individually to determine if it has an association with the response variable.

Besides conducting an F-test and a T-test, I will interpret the goodness of fit of the model by discussing the coefficient of determination, reported in figure 1 as R-squared.

F-test

The null hypothesis for the F-test is that the model is not useful for predicting the total number of wins in a regular season because all slope parameters are equal to zero. Mathematically, the null hypothesis is H0:β1=0.

The alternative hypothesis is that at least one predictor variable is useful in predicting the total number of wins in a regular season, meaning at least one of the slope parameters is non-zero. Mathematically, the alternative hypothesis is Ha: at least one βi≠0. With a single predictor, this reduces to Ha: β1≠0.

Since the p-value of 4.41e-243 from figure 1 is nearly zero, sufficient evidence exists to reject the null hypothesis in favor of the alternative hypothesis that at least one predictor variable is useful in predicting the total number of wins in a regular season.

Since a simple regression model only has one predictor variable, we know that the predictor variable being referenced in the F-test is average points scored. A T-test will explicitly determine if the average points scored variable is a good predictor for the total number of wins.
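In a simple regression the F-test and the T-test on the slope are two views of the same test, because the F-statistic equals the square of the slope's t-statistic. The sketch below checks this numerically. The dataset and the function name regression_stats are hypothetical and used only for illustration; they are not the NBA sample behind figure 1.

```python
import math


def regression_stats(xs, ys):
    """Return (t statistic for the slope, F statistic) of a simple OLS fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual sum of squares and regression sum of squares
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssr = sum(((b0 + b1 * x) - y_bar) ** 2 for x in xs)

    mse = sse / (n - 2)            # residual variance, n - 2 degrees of freedom
    se_b1 = math.sqrt(mse / sxx)   # standard error of the slope
    t_stat = b1 / se_b1
    f_stat = (ssr / 1) / mse       # one predictor: 1 numerator degree of freedom
    return t_stat, f_stat


# Hypothetical data: average points scored and total wins for four team-seasons
t_stat, f_stat = regression_stats([90.0, 95.0, 100.0, 105.0],
                                  [30.0, 38.0, 41.0, 51.0])
print(round(t_stat ** 2, 2), round(f_stat, 2))  # the two values agree
```

This is why the two tests in this post lead to the same conclusion about average points scored.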

T-test

The null hypothesis for the T-test is that no association exists between the total number of wins and average points scored in a regular season. Mathematically, the null hypothesis is H0:β1=0. If β1=0, then the coefficient of correlation between the total number of wins and average points scored is also zero. In other words, a zero coefficient of correlation means there is no linear association between the two variables.

Since the team’s coach assumes that the average score of points for a team can be used to predict the total number of wins in a regular season, the alternative hypothesis is that an association exists between the two variables. Mathematically, the alternative hypothesis is Ha:β1≠0 and is two-tailed.

Figure 1 reports the two-tailed p-value for the t-statistic as 0.0000, meaning it is zero to four decimal places. According to ZyBooks (n.d.), reject the null hypothesis in favor of the alternative hypothesis if the p-value is less than the significance level. Otherwise, if the p-value is greater than the significance level, fail to reject the null hypothesis.

Since the significance level was never specified, assume a 5% significance level. Even if we had chosen a 1% significance level, the result would still be statistically significant. Since the reported p-value of 0.0000 is less than the significance level of 0.05 (or 0.01 for that matter), the conclusion is to reject the null hypothesis in favor of the alternative hypothesis. The results of this test suggest that there is an association between the total number of wins and average points scored in a regular season.
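The decision rule used in both tests can be written as a short helper. This is a sketch only: the function name decide is made up for illustration, and the 5% default mirrors the significance level assumed above.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# The F-test p-value reported in figure 1
print(decide(4.41e-243))              # reject H0
print(decide(4.41e-243, alpha=0.01))  # reject H0
```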

Using the simple regression model to predict total number of wins

At this point in the analysis, the simple regression model has been defined and tested for predicting the total number of wins using average points scored.

I will use the model to predict the total number of wins for average points scored of 85, 95, and 110.

Ŷ = -85.5476 + 1.2849(X)
Ŷ = -85.5476 + 1.2849(85) = 23.67 ≈ 24 wins
Ŷ = -85.5476 + 1.2849(95) = 36.52 ≈ 37 wins
Ŷ = -85.5476 + 1.2849(110) = 55.79 ≈ 56 wins
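The same predictions can be reproduced with a few lines of Python. The coefficients are the ones reported in figure 1; the function name predict_total_wins is just an illustrative choice.

```python
INTERCEPT = -85.5476  # b0 from the OLS results in figure 1
SLOPE = 1.2849        # b1 from the OLS results in figure 1


def predict_total_wins(avg_pts: float) -> float:
    """Predicted total wins for a given average points scored."""
    return INTERCEPT + SLOPE * avg_pts


for pts in (85, 95, 110):
    # Round to whole wins, as in the hand calculations above
    print(pts, round(predict_total_wins(pts)))
# 85 24
# 95 37
# 110 56
```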

Figure 2

Scatterplot of average total number of wins vs average total points scored

Figure 2 shows the relationship between the total number of wins and the average points scored. I can use the scatterplot in figure 2 to check if my model produced reasonable results.

The R-squared value in figure 1 measures the proportion of the variance in the total number of wins that the model explains. Because the total number of wins in figure 2 is widely spread at each X-value, the R-squared value is low, indicating a less than perfect model.

If the R-squared value were 1, the model would predict every outcome exactly and all the points in figure 2 would lie on the regression line. However, if the R-squared value were zero, the model would explain none of the variation in the total number of wins, and average points scored would be of no use as a linear predictor.

For an X-value of 85 average points, the model predicted 24 total wins, while the scatterplot shows that actual win totals at around 85 points range between 15 and 65 wins. This wide spread is consistent with the R-squared value of 0.228: the model explains about 22.8% of the variance in the total number of wins.

The difference between the actual value and the predicted value is known as the residual. A positive residual means the actual value is higher than the predicted value, and a negative residual means the actual value is lower than the predicted value.

If we know the observed Y value for a given X-value, we can determine the residual by subtracting the predicted value Ŷ from the observed value Y.
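As a small worked example of a residual, the sketch below uses the coefficients from figure 1 together with two hypothetical team-seasons. The win totals of 45 and 30 are invented for illustration and do not come from the dataset.

```python
def residual(actual_wins: float, avg_pts: float) -> float:
    """Observed wins minus the wins predicted by the model in figure 1."""
    predicted_wins = -85.5476 + 1.2849 * avg_pts
    return actual_wins - predicted_wins


# Hypothetical team-seasons, both averaging 95 points (predicted: about 36.52 wins)
print(round(residual(45, 95), 2))  # 8.48, the team won more games than predicted
print(round(residual(30, 95), 2))  # -6.52, the team won fewer games than predicted
```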

How the correlation coefficient gets the strength and direction of the association between two variables

Figure 2 shows the relationship between the total number of wins and the average points scored for the NBA Wins dataset. It also shows that the Pearson coefficient of correlation for the total number of wins and average points scored is 0.4777.

According to MedCalc (2022), the Pearson correlation coefficient ranges between -1 and 1 and measures the strength and direction of the linear relationship between two variables. There is a positive correlation when one variable increases as the other increases and a negative correlation when one variable increases as the other decreases.

A correlation of 1 or -1 indicates a perfect linear relationship between X and Y and shows that all the data points lie on the regression line. A correlation of zero implies that there is no linear relationship between X and Y. Since the correlation of 0.4777 is positive, it indicates that the average values of Y increase as X increases. The strength is moderate, since the value is greater than 0.40 and less than 0.80.
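In a simple linear regression with an intercept, the R-squared value equals the square of the Pearson correlation coefficient. A quick check with the values reported in this post shows that the two figures agree (the variable names are only illustrative):

```python
pearson_r = 0.4777  # Pearson coefficient reported with figure 2

r_squared = pearson_r ** 2
print(round(r_squared, 3))  # 0.228, matching the 22.8% R-squared from figure 1
```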

Since the p-value reported in figure 2 is 0.0, the Pearson coefficient of correlation is statistically significant.

Figure 2 also shows a roughly linear relationship between the total number of wins and average points scored. Although the relationship is not perfect and predictions made using this model might have large residuals, average points scored seems to be a useful predictor of the total number of wins.

Conclusion

In this post I developed a simple regression model to predict the total number of wins in a regular season using the average points scored as the predictor variable from a data sample from the NBA Dataset.

The F-test and T-test showed that the average points scored can be used to predict the total number of wins in a regular season. The coefficient of determination for the model in figure 1 (R-squared) showed that the model explains about 22.8% of the variance in the total number of wins. The Pearson coefficient of correlation between the two variables was also statistically significant.