Testing the Prediction Ability of Composite Score

by Jon Nichols

Composite Score was never developed with the intention of being able to predict the future.  The goal was to create a stat that could accurately reflect why teams won basketball games and which players contributed the most to that winning.  CS does not take into account age, health, and other factors that vary from year to year.

Despite all that, I’ve always been curious to see how good of a job Composite Score would do at predicting a team’s success.  To test this, I did a bit of regression analysis (this article will have some statistical stuff in it, but I’ll try to explain things so that people can understand it whether or not they’re experienced with stats).

For this simple analysis, I collected three variables.  One variable was each team’s win percentage in 2006-07.  Another variable was each team’s win percentage in 2007-08.  The third and final variable was each 07-08 team’s weighted 06-07 Composite Score.

For the weighted 06-07 Composite Score, I looked at the percentage of minutes each player on the 07-08 teams played (this data is available at 82games.com), multiplied that by his 06-07 Composite Score, and summed those products to get a single team total.  This simulated a situation where if you knew how the minutes of your favorite team were going to be distributed this season, and you knew all their Composite Scores from last season, you could predict their win total.
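The weighting described above can be sketched in a few lines of code.  The roster below is entirely made up for illustration; the real calculation uses each team's actual minute distribution from 82games.com and the players' actual 06-07 Composite Scores.

```python
# Sketch of the weighted Composite Score calculation.
# Minute shares and CS values here are invented, not real data.

def weighted_team_cs(roster):
    """Sum each player's share of team minutes times his prior-season CS."""
    return sum(minutes_share * prior_cs for minutes_share, prior_cs in roster)

# Hypothetical roster: (fraction of team minutes in 07-08, 06-07 Composite Score)
roster = [
    (0.16, 25.0),   # heavy-minutes starter with a high CS
    (0.15, 18.0),
    (0.14, 12.0),
    (0.13, 10.0),
    (0.12, 8.0),
    (0.30, 5.0),    # remaining bench minutes lumped together
]

print(round(weighted_team_cs(roster), 2))  # → 12.14
```

The minute shares should sum to 1 so the result is a true weighted average of the roster's Composite Scores.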

To sum it all up, I’m testing which is better at predicting a team’s success:
1. Their prior winning percentage
2. Their prior Composite Scores

(One note: Players that are on winning teams have higher Composite Scores, so the two factors above aren’t totally unrelated.)

06-07 Record vs. 07-08 Record

Residuals:
Min       1Q   Median       3Q      Max
-0.33934 -0.08645  0.02291  0.07975  0.43006

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1979     0.1097   1.804  0.08195 .
V3            0.6041     0.2123   2.845  0.00821 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1513 on 28 degrees of freedom
Multiple R-squared: 0.2242,     Adjusted R-squared: 0.1965
F-statistic: 8.094 on 1 and 28 DF,  p-value: 0.008212

As you can see from the chart, there is a pretty strong correlation between a team’s record last year and its record this year.  This is supported by the numbers above (which may look like nonsense to many of you).  The p-value of .00821 tells us the relationship is almost certainly real (lower p-values indicate stronger evidence that a relationship exists).  The R^2 value is .2242, which means last year’s record explains about 22% of the variation in this year’s record (R^2 ranges from 0 to 1, with numbers closer to 1 indicating a stronger linear relationship).  More on this later.
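For readers who want to see where numbers like the slope and R^2 come from, here is the one-variable least-squares fit written out in plain Python.  The five win percentages below are invented stand-ins, not the actual 30-team data used in the regressions above.

```python
# A minimal one-variable ordinary least squares fit, written out by hand.
# Input lists are hypothetical win percentages, not the real 06-07/07-08 data.

def ols(x, y):
    """Return (intercept, slope, r_squared) for a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical prior-season and current-season win percentages for five teams.
prior = [0.30, 0.45, 0.50, 0.60, 0.70]
current = [0.35, 0.40, 0.55, 0.58, 0.68]
b0, b1, r2 = ols(prior, current)
```

R's `lm()` does the same fit (plus the standard errors, t values, and p-values shown in the summaries here).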

06-07 Composite Score vs. 07-08 Record

Residuals:
Min        1Q    Median        3Q       Max
-0.361736 -0.097663  0.005405  0.100009  0.253191

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.414374   0.033511  12.365 7.32e-13 ***
V2          0.005569   0.001432   3.888 0.000566 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1384 on 28 degrees of freedom
Multiple R-squared: 0.3506,     Adjusted R-squared: 0.3274
F-statistic: 15.12 on 1 and 28 DF,  p-value: 0.0005664

The scatter plot above looks pretty similar to the previous one, so it’s hard to tell if Composite Scores are a better predictor by simply looking at the charts.  Instead, we’ll turn to the statistics.  As you’ll recall, the p-value of the last regression was .00821.  With Composite Score, it is even lower, at .000566.  This is an argument in Composite Score’s favor.  In addition, the R^2 for Composite Score (.3506) is higher than with the 06-07 record (.2242), another argument in CS’s favor.  Finally, if you compare the MSE (mean square error) of the record regression (0.02288) to the MSE of the CS regression (0.01915), you can see that the predictions based on the 06-07 record miss by more, on average.
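The MSE figures just quoted are simply the residual standard errors from the two R summaries, squared (within rounding).  A quick check:

```python
# Residual MSE is the residual standard error squared.
# Values taken from the two R regression summaries above.

rse_record = 0.1513   # residual standard error, 06-07 record model
rse_cs = 0.1384       # residual standard error, Composite Score model

mse_record = rse_record ** 2
mse_cs = rse_cs ** 2

print(round(mse_record, 5), round(mse_cs, 5))  # → 0.02289 0.01915
```

The small difference from the quoted 0.02288 is just rounding in the printed residual standard error.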

In other words, if we made two models for projecting a team’s winning percentage, with one based on last season’s record and the other based on last season’s weighted Composite Score, the one based on the records would produce larger errors, more often.

Taking a step back, I think this is significant, but there are a few catches.  I think it’s significant because teams generally play pretty similarly from year to year, so the last season’s win percentage should generally be a good predictor of the next season’s win percentage.  However, using Composite Scores (and of course somehow knowing how many minutes each player would play) is an even more accurate way of predicting the team’s record.  If I knew nothing of basketball and was asked to predict a team’s record, I would prefer to know the Composite Scores of its players (and how many minutes they would play) over how that team did last season.

Now, for the catches.  As I mentioned before, Composite Score is slightly based on team success, so it does cheat a little bit.  In addition, using the Composite Score method takes into account free agent signings and trades (although it ignores rookies and injuries), which gives it the edge.  There are also some statistical limitations.  Correlation does not mean causation.  In this case, Composite Score and Win % are both reflecting something else: the talents of the players themselves.

Finally, this study is not comparing my rating system to any of the other great ones out there.  It’s simply showing that Composite Scores are not a bunch of random numbers; they do have some predictive value.