Predicting NBA Veterans’ Value for the 2021–2022 Season
Veteran assets are incredibly important for adding pedigree to young rosters and turning teams into contenders. At the same time, veterans differ vastly in the accolades they have acquired throughout their careers and in how much they impact winning. Thus, this project isn't simply predicting the value of veterans but also classifying veteran NBA players as high-level bench players, starters, stars, or superstars. The model applies a variety of player evaluation metrics per season dating back to 1996, when advanced stats came into wide use for evaluating player performance. The model then predicts player drop-offs or improvements for the 2021–2022 season using a veteran value metric.
While statistics can certainly paint a picture of how impactful a player is to his specific team, it is hard to evaluate all players on the same basis given the differences in roster construction. In addition, this model doesn't take into account the new rule changes that hinder free throw rates for certain players. However, this project can help show why a player predicted to improve significantly over the previous season could be a great pickup for a team this year, or even next year in free agency. This model could also help fantasy enthusiasts key in on sleeper pickups who are projected to play well this season.
The majority of the data used to train the model comes from a Kaggle dataset of player stats spanning 1950–2017. While the data has a lot of basic counting stats, most advanced stats are left off because they weren't calculated until the late 1990s. Along with the player stats and advanced player stats, I also separately scraped Team Net Rating (TmNetRtg) and Roster Continuity data. Both of these statistics are important in evaluating players: while this is an individual evaluation, basketball is a team sport, and context needs to be added to player performance. The advanced statistics and team statistics were taken from Basketball Reference and the official NBA website. The completed CSV data can be found at the GitHub link provided, under model_data.csv.
There was a lot of data cleaning involved. Since TmNetRtg is only tracked from the 1996–1997 season onwards, I discarded all data before that season. I also had to apply constraints for injury-riddled seasons, NBA lockout seasons, and, of course, for veteran players themselves. I defined veteran players as players over the age of 30, with thresholds for minimum games and minutes played in a season. To match TmNetRtg with players, I had to handle players traded mid-season: I excluded the 'TOT' rows, which combine a player's stats across the teams he played for that season, and instead assigned each traded player to the team he played the most games for, as sketched below.
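Here is a minimal sketch of that cleanup step in pandas. The file and column names (Seasons_Stats.csv, Year, Tm, G, Player) follow the Kaggle dataset's conventions but should be treated as assumptions:

```python
import pandas as pd

# Assumed file/column names based on the Kaggle dataset described above.
stats = pd.read_csv("Seasons_Stats.csv")

# TmNetRtg is only tracked from 1996-97 onward, so earlier seasons are dropped.
stats = stats[stats["Year"] >= 1997]

# Traded players appear once per team plus a combined 'TOT' row. Drop the
# 'TOT' rows and keep each player's row for the team where he logged the
# most games that season, so TmNetRtg can be joined on a single team.
stats = stats[stats["Tm"] != "TOT"]
stats = (stats.sort_values("G", ascending=False)
              .drop_duplicates(subset=["Player", "Year"]))
```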
In the original dataset, player portfolios encompassed many different counting stats, including FT, FG%, AST, REB, etc. The issue with counting stats is that they do not provide context to player performance. The best player on a below-.500 team can put up crazy stats at low efficiency, which translates to fewer wins. For that reason, advanced stats are a better measure of player impact, as they look at rates of scoring, assisting, and rebounding rather than raw totals. After deciding on the baseline features, I used a heatmap to evaluate correlation between the feature options. Highly correlated features can skew the performance of a model, and we want a set of features that encompasses both offensive and defensive impact. Below is the original set of feature options before choices were made.
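The heatmap itself can be generated along these lines, continuing from the cleaned stats frame above (the column list assumes the Kaggle naming):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Candidate advanced stats; names assume the Kaggle dataset's columns.
cols = ["PER", "TS%", "ORB%", "DRB%", "TRB%", "USG%",
        "OWS", "DWS", "WS", "WS/48", "BPM", "VORP"]
corr = stats[cols].dropna().corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation between candidate features")
plt.show()
```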
Highly correlated features are marked in red, with values representing how correlated the features are. Features with a correlation above 0.85 are considered significant. DRB% (Defensive Rebound %) is highly correlated with TRB%, so it makes more sense to include DRB% and ORB% separately and dismiss the total stat, as we want to evaluate offensive and defensive impact independently. Since we have OWS and DWS, it doesn't make sense to include WS, as that is just the sum of two other features; the same goes for BPM. VORP is a heavily offense-focused statistic, built on player creation and scoring per 100 possessions, which is why it is highly correlated with OWS. FG% is highly correlated with eFG%, and since we have 2P% and 3P%, we don't need FG%.
The main concerns left are evaluating WS against OWS and WS/48. Win Shares estimate how much a player impacts winning based on a variety of individual stats, as opposed to team-level evaluation. While these features are highly correlated, Win Shares will be used in our veteran value metric, and it is still important to keep WS/48 and OWS since both are factored into the model, so it doesn't make sense to drop any of them. The final issue is evaluating TS% (True Shooting %) against eFG% (Effective Field Goal %).
True shooting factors in free throw attempts and makes, so it is a more reliable stat for determining efficiency as a scorer; a player who draws and converts a lot of free throws will boost his overall TS%. eFG%, by contrast, only covers field goal attempts and applies extra weight to made threes. A high-volume shooter who makes threes at close to the same rate as his two-pointers will post a high eFG%. For the purpose of evaluating veterans, high-volume shooting is less of a concern than efficient scoring, because veterans typically receive fewer shot attempts than they did in their “prime” years. Thus, eFG% is dropped from the feature choices. The standard formulas, shown below, make the difference concrete.
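These are the standard definitions of the two stats; the toy numbers are just for illustration:

```python
def true_shooting(pts, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)) -- credits free throw scoring."""
    return pts / (2 * (fga + 0.44 * fta))

def effective_fg(fgm, fg3m, fga):
    """eFG% = (FGM + 0.5 * 3PM) / FGA -- weights threes, ignores free throws."""
    return (fgm + 0.5 * fg3m) / fga

# A player who piles up made free throws lifts his TS%, while his eFG%,
# which never sees the free throws, stays put.
print(true_shooting(pts=25, fga=15, fta=10))  # ~0.644
print(effective_fg(fgm=8, fg3m=3, fga=15))    # ~0.633
```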
The purpose of the target veteran value metric is to evaluate how much a player will improve or decline from their previous season. Two factors feed into the veteran value. The first is the change in a player's win shares between the current season and the next. The second is the change in team net rating between the current season and the next. While win shares capture a player's ability to impact winning, team net rating accounts for how good the team is in general. A piece-wise combination of these two metrics adds context and weighs improvement on good and bad teams respectively; one possible shape of that combination is sketched below. The full implementation of this metric is in the regression analysis notebook available at the GitHub link provided below.
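For illustration only, a piece-wise combination along these lines captures the idea; the actual weights and scaling live in the notebook and will differ:

```python
def veteran_value(ws_now, ws_next, net_now, net_next):
    """Illustrative sketch only -- weights and scale are placeholders."""
    d_ws = ws_next - ws_now     # change in individual win shares
    d_net = net_next - net_now  # change in team net rating
    # Piece-wise weighting: lean more on the player's own improvement
    # when the team was bad, and more on team context when it was good.
    if net_now < 0:
        return 0.7 * d_ws + 0.3 * d_net
    return 0.5 * d_ws + 0.5 * d_net
```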
A simple logistic regression model is fit on each feature individually to see how much it impacts a player's chance of declining in the following season, where a decline is simply a veteran value score below 0. Some of the key results are shown below. Both Offensive Win Shares and PER have a significant impact in predicting a veteran's decline. This makes sense, as win shares are a major part of the veteran value metric and PER encapsulates many of the basic offensive tracking stats that appear as separate features in this model. A positive defensive box plus-minus translates to a lower probability of decline, which is interesting: strong offensive performance is more likely to precede a decline than strong defensive performance.
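A sketch of the per-feature fits, assuming the assembled model_data.csv uses these column names (the "VeteranValue" column name in particular is hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("model_data.csv")
y = (data["VeteranValue"] < 0).astype(int)  # decline = veteran value below 0

# Fit a one-feature logistic model per candidate and compare coefficients.
for col in ["OWS", "PER", "DBPM", "WS/48"]:
    clf = LogisticRegression().fit(data[[col]], y)
    print(f"{col}: coefficient = {clf.coef_[0][0]:+.3f}")
```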
There are a few hypotheses worth examining. The first is the distribution of veteran value amongst older veterans, aged 35 and up. It is important to see the distribution for these players given the difference in age and, in turn, mileage on the body. The data shows that older veterans are less likely to have steep declines or improvements, as they are typically well past their prime and their roles are much more defined and limited. Very few teams are asking a lot out of a 35+ year old, unless your name is LeBron James or Chris Paul.
Another important factor to consider is how much a player's role impacts their veteran value classification. A superstar is asked to do a lot more than a typical starter or bench player, and even a superstar predicted to decline significantly is miles better than a bench player expected to improve substantially. Classifying players this way adds context when it comes time to analyze the model's results. The graph below shows that starters have a wider distribution of veteran values than bench players. Once again, a larger role as a starter creates larger swings in performance between consecutive seasons.
Given the above results, there is clearly a need to classify players. A heuristic is applied to classify each player as a superstar (3), a star (2), a starter (1), or a bench player (0), using Win Shares per 48 minutes, minutes played per game, and usage rate. Both superstars and stars should have a WS/48 of at least 0.15 and play at least 30 minutes per game; these thresholds were estimated from historical player data. The main difference between a superstar and a star is that the usage rate should be higher for a superstar, while the output is similar. A starter should have a lower WS/48 while still playing at least 25 minutes per game. The classification values were applied to each player and added as a feature called 'Player Level', as in the sketch below.
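A compact version of the heuristic; the WS/48 and minutes cutoffs come from the thresholds above, while the usage-rate split between superstars and stars, along with the column names, is a placeholder:

```python
def player_level(ws48, mpg, usg):
    """Heuristic tiers: 3 = superstar, 2 = star, 1 = starter, 0 = bench."""
    if ws48 >= 0.15 and mpg >= 30:
        return 3 if usg >= 28 else 2  # the 28 USG% split is assumed
    if mpg >= 25:
        return 1
    return 0

# Applied row-wise to add the 'Player Level' feature.
data["Player Level"] = [
    player_level(w, m, u)
    for w, m, u in zip(data["WS/48"], data["MPG"], data["USG%"])
]
```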
Before choosing a model, it is important to apply some basic principles to improve performance. Removing outliers is the first step: the veteran value is bounded to (-30000, 30000), and 44 data points were removed from the original 821, leaving 777 players to train the model on. The problem also performed better framed as classification than as regression, so the veteran value metric is transformed into discrete classes. After testing different class sizes, I decided on 5 classes split equally by percentile thresholds. The classes range over [0, 4], where 0 represents a significant drop-off from the previous year, 1 a smaller drop-off, 2 essentially the same level as previous years, 3 a slight improvement, and 4 a significantly large improvement over last year.
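The clipping and percentile binning can be expressed in a couple of lines of pandas (column names again assumed):

```python
# Clip outliers, then bin the continuous veteran value into 5 equal-size
# percentile classes: 0 = big decline ... 4 = big improvement.
data = data[data["VeteranValue"].between(-30000, 30000)]
data["VVClass"] = pd.qcut(data["VeteranValue"], q=5,
                          labels=[0, 1, 2, 3, 4]).astype(int)
```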
Six models were tested to see which performed best: a Logistic Regression Classifier, KNN Classifier, Random Forest Classifier, AdaBoost Classifier, Neural Network, and a Voting Classifier. Ten-fold cross-validation was used first to compare the models. Applying a large number of folds tests whether there is high variability in performance and gathers more measurements with which to evaluate each model. With 10 folds, the data is split into 10 parts, and each model is evaluated 10 times, with each part serving once as the test set. Across cross-validation, the Random Forest Classifier performed the best. The models were then trained using a 90–10 split of training to validation data; this split gave the best performance in terms of area under the curve (AUC). The higher the AUC, the better the model is at distinguishing between classes.
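A sketch of the cross-validation step with scikit-learn, where X and y are the feature matrix and the 5-class labels built above (feature_cols stands in for the chosen feature list):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X = data[feature_cols]  # feature_cols holds the selected features
y = data["VVClass"]

# 10-fold cross-validation to compare candidate models.
rf = RandomForestClassifier(random_state=42)
scores = cross_val_score(rf, X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# 90-10 train/validation split for the final evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42)
```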
On the validation set, the voting classifier performed the best. The voting classifier combines the other five models and predicts the veteran value class from their pooled probabilities. It uses soft voting, which means the output class is the one with the highest average predicted probability across the five models. The final model is trained on the set of players spanning the 1997 to 2020 NBA seasons and predicts the VV class for the 2021–2022 NBA veteran class, as sketched below.
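A sketch of the soft-voting ensemble in scikit-learn; the hyperparameters here are defaults, and X_2022 is a hypothetical frame holding the assembled 2021–22 veteran features:

```python
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Soft voting: average each model's class probabilities, take the argmax.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
        ("mlp", MLPClassifier(max_iter=1000, random_state=42)),
    ],
    voting="soft",
)
voter.fit(X_train, y_train)
vv_class_2022 = voter.predict(X_2022)  # hypothetical 2021-22 feature frame
```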
Note: For more information on the implementation of this model, as well as the model heuristics calculated, visit the GitHub link below. Stay tuned for a follow-up article analyzing some of the player predictions for this season. Check out my LinkedIn as well if you want to connect or collaborate on a project.