Decoding NBA Tactics: Using Shot Zone Data and Markov Chains for Game Simulation

8 min read · Jun 18, 2024

If you follow the NBA or any other major professional sport, you may have heard the term “tracking data.” Tracking data captures the position of each player on the court along with other spatio-temporal metrics (speed, acceleration, etc.). This technology has revolutionized the world of sports, making game strategy and analysis much more data-driven.

The growth of tracking data has produced numerous metrics, such as shot zone appetite and efficiency for both players and teams. In the pace-and-space era, shot efficiency and shot appetite have become major focuses for teams as the game has become more analytical.

The Boston Celtics are a great example of this trend. Head coach Joe Mazzulla has a clear philosophy of using pace and three-point shooting to beat teams. He utilizes a 5-out offense, with his best lineup featuring four great ball handlers and shooters who can all drive and kick to create open threes. The statistics back up this strategy: they took the most above-the-break threes in the NBA this season (33 per game) and also had the best efficiency on corner threes (43% on 9.3 attempts). This offensive philosophy, paired with versatile defensive personnel, has allowed the Celtics to achieve an NBA-best 64 wins in the 2023–2024 regular season.

This brings up an important question: how effective is shot zone data at simulating NBA games? The purpose of this project was to evaluate the utility of each team's shot zone distribution within a Markov Chain model. While there are more robust and extensive sports betting models, this project serves as research into mapping the intricacies of shot distribution and transition probabilities between game states. The goal is to gain a deeper understanding of the strategic decisions that shape the modern NBA game and of how well they predict outcomes.

Model Definition

I decided to use a Markov Chain model to simulate offensive possessions between two teams. A Markov model consists of a finite set of states. In the context of simulating game scores, these states represent in-game events, with the absorbing states being events that lead to a change in possession. The states I defined are:

‘turnover’, ‘jump ball’, ‘timeout’, ‘start of period’, ‘substitution’, ‘end of period’, ‘3pt shot’, ‘2pt shot’, ‘defensive rebound’, ‘offensive rebound’, ‘Non-shooting foul’, ‘2-pt shooting foul’, ‘3-pt shooting foul’, ‘1-pt shooting foul’

The absorbing states are ‘defensive rebound’ and ‘turnover’, as these states mark the end of a possession.
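To make the state space concrete, here is a minimal Python sketch of how the states and absorbing states could be represented. The names `STATES`, `ABSORBING_STATES`, and `ends_possession` are my own illustrative choices and are not taken from the project's code.

```python
# The 14 states listed above, spelled exactly as in the article.
STATES = [
    "turnover", "jump ball", "timeout", "start of period", "substitution",
    "end of period", "3pt shot", "2pt shot", "defensive rebound",
    "offensive rebound", "Non-shooting foul", "2-pt shooting foul",
    "3-pt shooting foul", "1-pt shooting foul",
]

# States that end the offensive possession.
ABSORBING_STATES = {"turnover", "defensive rebound"}


def ends_possession(state: str) -> bool:
    """Return True if reaching `state` hands the ball to the other team."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    return state in ABSORBING_STATES


print(ends_possession("turnover"))           # absorbing state
print(ends_possession("offensive rebound"))  # possession continues
```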

Shot Zone Classification

With the states defined, we need data. I sourced the full 2022–2023 season of play-by-play data from NBAStuffer. The key aspect of this play-by-play data is that it includes the x, y coordinates of each shot relative to a standard NBA court. This allows me to classify shots into shot zones using a rule-based function. Here are the core rules I defined:
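As an illustration of what a rule-based classifier of this kind can look like, here is a sketch built on standard NBA court dimensions. It is not necessarily the exact rule set used in this project. It assumes coordinates in feet with the origin at the center of the hoop, `x` running sideline to sideline and `y` running toward midcourt. That convention, the left/right orientation, the zone names, and the function name `classify_shot_zone` are all assumptions, and the play-by-play source may use a different origin or units.

```python
import math

# Standard NBA court dimensions, measured from the center of the hoop (feet).
RESTRICTED_AREA_RADIUS = 4.0
CORNER_THREE_X = 22.0      # distance from hoop to the straight corner three line
CORNER_THREE_MAX_Y = 8.75  # corner line runs 14 ft from the baseline; hoop is 5.25 ft out
ARC_THREE_RADIUS = 23.75
PAINT_HALF_WIDTH = 8.0     # the lane is 16 ft wide
PAINT_MAX_Y = 13.75        # free throw line is 19 ft from the baseline


def classify_shot_zone(x: float, y: float) -> str:
    """Map a shot location (assumed hoop-centered, in feet) to a shot zone."""
    distance = math.hypot(x, y)
    if distance <= RESTRICTED_AREA_RADIUS:
        return "Restricted Area"
    if y <= CORNER_THREE_MAX_Y and abs(x) >= CORNER_THREE_X:
        # Assumption: negative x is the left side of the court.
        return "Left Corner 3" if x < 0 else "Right Corner 3"
    if y > CORNER_THREE_MAX_Y and distance >= ARC_THREE_RADIUS:
        return "Above the Break 3"
    if abs(x) <= PAINT_HALF_WIDTH and y <= PAINT_MAX_Y:
        return "In The Paint (Non-RA)"
    return "Mid-Range"


print(classify_shot_zone(0.0, 25.0))
print(classify_shot_zone(-23.0, 2.0))
```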

Some shots had missing x, y coordinates. For those rows, I used the latest OpenAI model, GPT-4o, passing in the play description and prompting it to classify the shot based on that description. After applying the shot classification, I scraped team shot zone data from the NBA Stats website and stored it in a config file. I did the same with possessions per game and free throw percentage for each team.

Expected Possession Value

After merging the shot zone efficiency data with the classified shot zones, I filtered the dataset to include only the defined states mentioned earlier. It was time to create an Expected Possession Value (EPV) metric. The EPV metric quantifies the value of a possession in the simulation. It utilizes the frequency of shot types combined with the team’s shot zone efficiency. I also included the frequency of shooting fouls and free throw percentage to account for “and-ones” and shooting fouls. The metric is shown below:

If the state was not a 2-point shot or 3-point shot, the EPV was set to 0. After applying this user-defined function (UDF) to the data, our dataset was ready for the simulations.
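To show how these ingredients can combine, here is one plausible form of such a metric as a Python sketch. It should not be read as the project's exact formula. The function name, the parameter names, and the way the shooting foul term is weighted are all assumptions made for illustration. The only rule taken directly from the text is that any state other than a 2-point or 3-point shot gets an EPV of 0.

```python
POINTS_BY_SHOT_STATE = {"2pt shot": 2, "3pt shot": 3}


def expected_possession_value(
    state: str,
    zone_frequency: float,           # share of the team's shots taken from this zone
    zone_fg_pct: float,              # team field goal percentage in this zone
    shooting_foul_frequency: float,  # how often this shot type draws a shooting foul
    ft_pct: float,                   # team free throw percentage
) -> float:
    """Hypothetical EPV: field goal value plus a free throw term."""
    points = POINTS_BY_SHOT_STATE.get(state)
    if points is None:
        # Non-shot states carry no expected points.
        return 0.0
    field_goal_value = zone_frequency * zone_fg_pct * points
    # Assumption: a shooting foul is worth up to `points` free throws.
    free_throw_value = shooting_foul_frequency * ft_pct * points
    return field_goal_value + free_throw_value


# Made-up inputs, for illustration only.
print(expected_possession_value("3pt shot", 0.35, 0.38, 0.02, 0.80))
print(expected_possession_value("timeout", 0.35, 0.38, 0.02, 0.80))
```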

Transition Matrices

Another core part of a Markov Chain model is the transition matrix. The transition matrix is a square matrix that describes the probabilities of moving from one state to another. Each element in the matrix, P(i,j), represents the probability of transitioning from state i to state j. There are two important facts about how these transitions behave:

  1. The probability of transitioning to the next state depends only on the current state, not on the sequence of states that preceded it. This is known as the Markov property.
  2. The sum of the probabilities of transitioning from a given state to all possible next states is 1.

For each NBA team, I filtered the dataset to include only regular season plays where the team had possession. I tracked the state and next state for each of these plays to construct the transition matrix. These matrices were stored in a hashmap and saved to a config file. Here is the transition matrix for the Golden State Warriors in the 2022–2023 NBA season:
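The counting procedure described above can be sketched in a few lines of Python. This is a simplified stand-in rather than the project's code: the function name `build_transition_matrix` and the tiny sample of (state, next state) pairs are made up for illustration. The result is stored as a nested dictionary, in keeping with the hashmap storage mentioned above.

```python
from collections import Counter, defaultdict


def build_transition_matrix(transitions):
    """Turn (state, next_state) pairs into row-normalized probabilities."""
    counts = defaultdict(Counter)
    for state, next_state in transitions:
        counts[state][next_state] += 1

    matrix = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        # Each row is divided by its total so its probabilities sum to 1.
        matrix[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return matrix


# Made-up sample of consecutive plays for one team.
sample_transitions = [
    ("start of period", "2pt shot"),
    ("2pt shot", "defensive rebound"),
    ("2pt shot", "offensive rebound"),
    ("offensive rebound", "2pt shot"),
    ("2pt shot", "defensive rebound"),
]

matrix = build_transition_matrix(sample_transitions)
for state, row in matrix.items():
    print(state, row)
```

Because every row is normalized by its own total, the second property listed above (each row sums to 1) holds by construction.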

Simulation

With the transition matrices and EPV metric calculated and stored, we can start simulating our games. A few configurations need to be made:

  1. Number of Possessions: I calculated the average number of possessions per game for the two teams and rounded it to the nearest even number to ensure each team has an equal number of possessions.
  2. Tip-off: The starting team is chosen randomly with a 50/50 chance.
  3. Initial State: The first state is always ‘start of period’.
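These three settings can be sketched as follows. The function names and the possessions-per-game numbers are hypothetical. Note that when the average falls exactly between two even numbers (for example 101.0), this sketch follows Python's built-in `round` for the tie, which may differ from the project's behavior.

```python
import random

INITIAL_STATE = "start of period"


def game_possessions(team_a_poss_per_game: float, team_b_poss_per_game: float) -> int:
    """Average the two teams' possessions per game and round to an even number."""
    mean = (team_a_poss_per_game + team_b_poss_per_game) / 2
    return 2 * round(mean / 2)


def pick_tipoff_winner(team_a: str, team_b: str, rng: random.Random) -> str:
    """Choose the team that starts with the ball with a 50/50 chance."""
    return team_a if rng.random() < 0.5 else team_b


rng = random.Random(42)  # seeded only to make the example repeatable
total = game_possessions(100.2, 101.4)  # made-up possessions per game
print(total, "possessions,", total // 2, "per team")
print("first possession:", pick_tipoff_winner("Team A", "Team B", rng))
```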

The Monte Carlo simulation proceeds as follows:

  • The next state is chosen randomly using the probabilities from the transition matrix for the current state.
  • If the state is not an absorbing state, the possession continues.
  • If the state is a 2-point shot or 3-point shot, the shot zone distribution for the team in possession is used to randomly select a shot zone based on the probability distributions.
  • The EPV is then calculated as the average EPV for the team in the selected shot zone.
  • If the current state is not a shot, the EPV is calculated as the average EPV for the current state transitioning to the chosen next state.

Refer to the code below for a better understanding:
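The project's actual implementation is in the GitHub repository linked at the end of this article. As a simplified stand-in, the sketch below walks through the same steps with a toy model. Every probability, zone name, and EPV value in it is made up, the function names are my own, and two simplifications are assumed: each possession restarts from ‘start of period’, and non-shot states contribute zero points.

```python
import random

ABSORBING_STATES = {"turnover", "defensive rebound"}
SHOT_STATES = {"2pt shot", "3pt shot"}


def weighted_pick(options, rng):
    """Draw one key from a {key: probability} dictionary."""
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]


def simulate_possession(model, rng, max_steps=50):
    """Walk the chain until an absorbing state, summing EPV for each shot."""
    state = "start of period"
    points = 0.0
    for _ in range(max_steps):
        row = model["transitions"].get(state)
        if row is None:  # no outgoing transitions recorded for this state
            break
        state = weighted_pick(row, rng)
        if state in SHOT_STATES:
            zone = weighted_pick(model["zone_distribution"][state], rng)
            points += model["zone_epv"][zone]
        if state in ABSORBING_STATES:
            break
    return points


def simulate_game(team_models, n_possessions, rng):
    """Alternate possessions between two teams and total their expected points."""
    teams = list(team_models)
    offense = rng.randrange(2)  # 50/50 tip-off
    scores = {team: 0.0 for team in teams}
    for _ in range(n_possessions):
        team = teams[offense]
        scores[team] += simulate_possession(team_models[team], rng)
        offense = 1 - offense
    return scores


# Toy model with made-up numbers, shared by both teams for brevity.
toy_model = {
    "transitions": {
        "start of period": {"2pt shot": 0.5, "3pt shot": 0.3, "turnover": 0.2},
        "2pt shot": {"defensive rebound": 0.8, "offensive rebound": 0.2},
        "3pt shot": {"defensive rebound": 0.85, "offensive rebound": 0.15},
        "offensive rebound": {"2pt shot": 0.7, "turnover": 0.3},
    },
    "zone_distribution": {
        "2pt shot": {"Restricted Area": 0.6, "Mid-Range": 0.4},
        "3pt shot": {"Above the Break 3": 0.75, "Corner 3": 0.25},
    },
    "zone_epv": {
        "Restricted Area": 1.3, "Mid-Range": 0.8,
        "Above the Break 3": 1.08, "Corner 3": 1.2,
    },
}

rng = random.Random(0)
scores = simulate_game({"Team A": toy_model, "Team B": toy_model}, 100, rng)
print(scores)
```

Passing a seeded `random.Random` instance keeps a run repeatable, which helps when comparing changes to the model across batches of simulated games.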

Model Validation

To test this model, I decided to run simulations on the 2023 playoff matchups to compare the simulations with the actual outcomes. The main objective was to accurately predict the winners of each round. Secondary objectives included accurately measuring the point differentials between teams and the over/under for scores.

I created live graphs that update after each simulation. The first graph shows the total scores after each simulation, and the second graph displays the current win percentage after N simulations. The video below demonstrates how the live graphs work.

As you can see, after N simulations (games) the model had the Lakers (LAL) winning 55% of the games. In the Western Conference semifinals, the Lakers won the series 4–2 against the Warriors. I ran the same simulation for the Warriors and Kings over 100 games. Interestingly, the Kings were favored with a 61% win percentage. Similarly, Boston was slightly favored against Miami, winning 56% of the games in a 100-game simulation.

Left: Warriors vs Kings, Right: Heat vs Celtics

Insights and Limitations

There are several interesting points to take away from this model. In both the Warriors-Kings and Heat-Celtics matchups, the team the model favored less won the actual series. During the regular season, both Sacramento and Boston were considered much better teams than their opponents. However, this model does not account for playoff coaching schemes or individual player reliability in the playoffs. It’s challenging to capture the individual impact that players like Jimmy Butler and Steph Curry have on their teams in the playoffs using a team-focused, shot-zone-based simulation. Additionally, the model does not consider injuries, which are a significant factor in the playoffs.

Another observation is that the simulated scores seem to slightly underestimate the actual game scores. The Heat averaged around 110 points per game in the series, while Boston averaged 105 points per game.

Despite these limitations, the model performed well in predicting the absolute average point differential. In the Warriors-Kings series, the simulation predicted a point differential of 12.27, while the actual series had a point differential of 10.71. Similarly, in the Celtics-Heat series, the simulation predicted a point differential of 12.5, and the actual point differential was 12.71. This consistency suggests that while the EPV metric may need adjustments (such as including luck-adjusted efficiency in a shortened series or incorporating individual usage percentages with player shooting metrics), it is effective across matchups and yields fairly accurate results in terms of predicting the margin of victory.
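For reference, the two summary numbers used in this validation (win percentage and average point differential) can be computed from a batch of simulated scores along these lines. The sketch assumes that “absolute average point differential” means the mean of the per-game absolute margins. The function name and the sample scores are hypothetical and are not results from the model.

```python
def summarize_simulations(results):
    """Summarize (team_a_score, team_b_score) pairs from simulated games."""
    n_games = len(results)
    if n_games == 0:
        raise ValueError("need at least one simulated game")
    wins_a = sum(1 for score_a, score_b in results if score_a > score_b)
    mean_abs_diff = sum(abs(a - b) for a, b in results) / n_games
    return {
        "win_pct_a": 100 * wins_a / n_games,
        "mean_abs_point_diff": mean_abs_diff,
    }


# Made-up scores for four simulated games.
sample_results = [(110, 100), (98, 105), (120, 108), (101, 99)]
summary = summarize_simulations(sample_results)
print(summary)
```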

Areas for Improvement

  1. Incorporating Individual Player Impact: To better reflect playoff performance, consider integrating player-specific metrics, particularly for high-impact players like Jimmy Butler and Steph Curry.
  2. Adjusting for Injuries: Factor in injury reports and their impact on team performance.
  3. Enhancing EPV Metric: Improve the EPV metric by including luck-adjusted efficiency and individual player usage percentages.
  4. Defensive Metrics: Incorporate team defensive impact in some form to enhance EPV metrics.

Final Thoughts

The goal of this project was to isolate shot appetite (where teams take their shots) and combine it with their efficiency to evaluate how effective this combination is at simulating matchups. I am fairly pleased with how well it performs. It seems that offensive shot profiles are a significant indicator of team performance, highlighting how much teams define their offensive identity around analytics and coaching philosophy.

Tracking data has fundamentally changed the game. In the pace-and-space era, teams that can create open three-point shots while also applying pressure at the rim to draw fouls have a very solid chance of winning. This model underscores the importance of these strategies.

It would be interesting to compare this Markov model to a more comprehensive stats-based model that incorporates a wide range of player and team statistics. Sports betting models are fascinating, and while this is not the most robust model out there, there is much to learn from game simulation using shot zone data.

If you are interested in testing this model out yourself, refer to the following:

App Website: https://nbashotzonemarkov.streamlit.app/

Github: https://github.com/GauravMohan1/nba_shot_zone_markov

If you are interested in collaborating or have any questions, reach out to me on LinkedIn.

LinkedIn: https://www.linkedin.com/in/gaurav-mohan/

Written by Gaurav Mohan

I enjoy exploring how data science and computer vision can influence sports strategy. I also enjoy exploring the use cases of Generative AI in full-stack apps.