Learning ObjectivesBy the end of this chapter, you will be able to:

  1. Build Monte Carlo simulations for football scenarios
  2. Simulate games and seasons with realistic uncertainty
  3. Calculate playoff probabilities using simulation
  4. Analyze strategic scenarios through simulation
  5. Quantify uncertainty through resampling methods
  6. Apply sensitivity analysis to test model assumptions
  7. Use bootstrapping to estimate confidence intervals
  8. Design and interpret what-if scenario analyses

Introduction

Football is inherently uncertain. Will the quarterback complete the pass? Will the running back break a tackle? Will the kicker make the field goal? Traditional analytics often provides point estimates—the average EPA of a play, the expected win probability—but these single numbers mask the underlying uncertainty that defines every moment of football.

Consider a team with a 65% playoff probability heading into Week 17. What does this number actually mean? It doesn't mean the team will make 65% of the playoffs—they'll either make it or they won't. Rather, it means that if we could replay the remaining games thousands of times under identical conditions, the team would make the playoffs in about 65% of those replays. This is the fundamental insight of Monte Carlo simulation: we can't predict the future with certainty, but we can quantify the range of possible futures and their relative likelihoods.

Monte Carlo simulation offers a powerful framework for embracing and quantifying this uncertainty. Instead of asking "what will happen?", we ask "what could happen?" By simulating thousands or millions of possible outcomes, we can:

  • Estimate playoff probabilities across uncertain futures, accounting for schedule strength and randomness
  • Evaluate strategic decisions under various scenarios, from fourth-down calls to draft pick trades
  • Build player projections that capture uncertainty, showing ranges rather than single numbers
  • Construct optimal rosters using portfolio theory, balancing upside and consistency
  • Stress-test assumptions through sensitivity analysis, identifying which inputs matter most
  • Generate confidence intervals for any metric, from EPA to win probability

The beauty of simulation is its flexibility. Unlike analytical methods that require mathematical tractability, simulation can handle any level of complexity. Want to model player injuries, weather conditions, and referee tendencies all at once? Simulation can do it. Need to account for coaching adjustments, player fatigue, and momentum shifts? Simulation handles it naturally. This makes it an indispensable tool in modern football analytics, complementing the regression models and machine learning approaches we've explored in previous chapters.

In this chapter, we'll start with the foundational principles of Monte Carlo simulation, then progress through increasingly sophisticated applications: from simple field goal models to complex season simulators, from strategic decision analysis to portfolio-based roster construction. Along the way, we'll explore practical considerations like choosing the right number of simulations, validating model assumptions, and communicating uncertainty to stakeholders who prefer certainty.

What is Monte Carlo Simulation?

Monte Carlo simulation is a computational technique that uses repeated random sampling to obtain numerical results. The name comes from the Monte Carlo Casino in Monaco, referencing the random nature of casino games. The method was formalized during the Manhattan Project in the 1940s when physicists Stanislaw Ulam and John von Neumann used it to model neutron diffusion in nuclear reactions. In football analytics, we use these methods to simulate games, seasons, and player outcomes by sampling from probability distributions. Rather than solving complex equations analytically, we let randomness do the work through brute-force repetition. This approach mirrors how we think about football itself: individual plays are unpredictable, but patterns emerge over large samples.

Monte Carlo Basics

The Core Principle

At its heart, Monte Carlo simulation is deceptively simple. Rather than trying to calculate probabilities through complex mathematics, we simulate random outcomes many times and count what happens. If we want to know the probability a team makes the playoffs, we simulate the rest of the season 10,000 times and count how many times they make it. If they made it 6,742 times out of 10,000, our estimate is 67.42%.

This approach works because of the Law of Large Numbers, a fundamental theorem in probability theory. As the number of simulations increases, the average of the simulated outcomes converges to the true expected value. The more simulations we run, the closer we get to the "true" probability—though of course, we never know that true value with certainty.

Monte Carlo simulation follows a simple four-step recipe:

  1. Define the uncertainty: Identify random variables and their distributions. What aspects of the problem are uncertain? For a field goal, this might be wind speed, kick angle, and kicker skill. For a game, it might be turnovers, third-down conversions, and big plays.

  2. Generate random samples: Draw values from these distributions. This is where the "Monte" in Monte Carlo comes in—we're essentially rolling dice repeatedly. Modern computers make this trivial with pseudo-random number generators.

  3. Calculate outcomes: Compute results for each sample. Given the random inputs, what happens? If the kicker's accuracy is 87% and we randomly generate a number, did they make the kick or miss it?

  4. Aggregate results: Summarize across all simulations. After running thousands of iterations, what patterns emerge? What's the average outcome? What's the distribution? What percentile thresholds matter?

The power of this approach is that step 3 can be arbitrarily complex. We can model intricate dependencies, conditional probabilities, and nonlinear relationships that would be mathematically intractable. The computer doesn't care—it just runs the simulation over and over.

Choosing the Number of Simulations

How many simulations do you need? This depends on your required precision and computational resources: - **1,000 simulations**: Quick exploratory analysis, ±3% margin of error for probabilities near 50% - **10,000 simulations**: Standard production analysis, ±1% margin of error - **100,000 simulations**: High-stakes decisions, ±0.3% margin of error - **1,000,000+ simulations**: Academic research or when tiny differences matter The margin of error decreases with the square root of sample size. Going from 1,000 to 10,000 simulations (10× increase) cuts the error by about √10 ≈ 3.16×. Diminishing returns set in quickly, so 10,000 is a sweet spot for most applications.

Random Number Generation

All simulation starts with random number generation. Modern computers use pseudo-random number generators (PRNGs) that produce sequences that appear random but are deterministic given a seed. This determinism is actually a feature, not a bug—it allows us to reproduce our results exactly by setting the same seed.

The most common distributions in football simulation are:

  • Uniform: All values in a range equally likely. Used for coin flips, random game outcomes, and generating other distributions.
  • Normal (Gaussian): Bell-curved distribution. Models many natural phenomena like play outcomes, scoring margins, and measurement errors.
  • Binomial: Success/failure trials. Perfect for field goals, conversion attempts, and win/loss outcomes.
  • Poisson: Count data with a given rate. Can model scoring events, turnovers, or sacks per game.

Let's visualize these fundamental distributions to build intuition:

#| label: setup-r
#| message: false
#| warning: false

library(tidyverse)
library(nflfastR)
library(nflplotR)
library(gt)
library(gridExtra)

# Set seed for reproducibility
# Using 2024 ensures we get the same "random" numbers every time
set.seed(2024)

# Generate random numbers from different distributions
# Uniform: flat distribution, all values equally likely
uniform_samples <- runif(1000, min = 0, max = 1)

# Normal: bell curve centered at 0 with spread of 1
normal_samples <- rnorm(1000, mean = 0, sd = 1)

# Poisson: count distribution, lambda=3 means avg of 3 events
poisson_samples <- rpois(1000, lambda = 3)

# Create comparison plot to visualize distribution shapes
tibble(
  uniform = uniform_samples,
  normal = normal_samples,
  poisson = poisson_samples
) %>%
  pivot_longer(everything(), names_to = "distribution", values_to = "value") %>%
  ggplot(aes(x = value, fill = distribution)) +
  geom_histogram(alpha = 0.7, bins = 30) +
  facet_wrap(~distribution, scales = "free") +
  labs(
    title = "Common Probability Distributions",
    subtitle = "1,000 random samples from each distribution",
    x = "Value",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
#| label: setup-py
#| message: false
#| warning: false

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set seed for reproducibility
np.random.seed(2024)

# Generate random numbers from different distributions
uniform_samples = np.random.uniform(0, 1, 1000)
normal_samples = np.random.normal(0, 1, 1000)
poisson_samples = np.random.poisson(3, 1000)

# Create comparison plot
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(uniform_samples, bins=30, alpha=0.7, color='#F8766D')
axes[0].set_title('Uniform Distribution')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Count')

axes[1].hist(normal_samples, bins=30, alpha=0.7, color='#00BA38')
axes[1].set_title('Normal Distribution')
axes[1].set_xlabel('Value')

axes[2].hist(poisson_samples, bins=range(0, 12), alpha=0.7, color='#619CFF')
axes[2].set_title('Poisson Distribution')
axes[2].set_xlabel('Value')

plt.tight_layout()
plt.show()
This code demonstrates the three fundamental probability distributions used in football simulation: **Uniform Distribution**: The flat distribution on the left shows that all values between 0 and 1 are equally likely. We use uniform random numbers as the foundation for generating other distributions and for simple binary outcomes (like coin flips). If we generate a uniform number between 0 and 1 and it's less than 0.85, we can simulate an 85% field goal success rate. **Normal Distribution**: The bell curve in the middle represents the normal (or Gaussian) distribution. Most values cluster around the mean (0 in this case), with increasingly rare values further away. This distribution naturally appears in football for play outcomes, point differentials, and player statistics due to the Central Limit Theorem. The standard deviation (1 here) controls the spread—larger values create wider, flatter curves. **Poisson Distribution**: The count-based distribution on the right models discrete events occurring at a fixed average rate. With lambda=3, we most often see 2-4 events, but 0 or 7+ are possible. Scoring drives, turnovers, and sacks per game often follow Poisson-like patterns. Unlike the normal distribution, Poisson is discrete (only whole numbers) and bounded at zero (can't have negative counts). The `set.seed()` function ensures reproducibility. Without it, you'd get different random numbers each time you run the code. With it, you get the same "random" sequence every time, making your analysis reproducible and debuggable.

These visualizations show how 1,000 random samples from each distribution create different patterns. Notice how the uniform distribution spreads evenly, the normal clusters around zero, and the Poisson produces discrete counts. Understanding these shapes helps you choose the right distribution for different football phenomena.

Why Reproducibility Matters

Setting a random seed (`set.seed(2024)` in R or `np.random.seed(2024)` in Python) is crucial for reproducible research. Without it, your simulations will produce slightly different results each time you run them, making it impossible to debug issues or replicate findings. Always set a seed at the start of your analysis. The specific number doesn't matter (2024 is fine, 42 is popular among programmers), but consistency does. This way, when you share your code with colleagues, they'll get exactly the same results you did.

A Simple Example: Field Goal Accuracy

Let's start with a simple Monte Carlo simulation to build intuition: estimating field goal percentage. Suppose we know a kicker has an 85% success rate. We could calculate that after 1,000 kicks, they'd expect to make 850 of them. But simulation lets us see the natural variation around that expectation.

In real games, a kicker might get "hot" or "cold" due to random chance alone. How much variation should we expect? Simulation answers this by playing out the randomness explicitly. Each simulated kick is a Bernoulli trial (success or failure), and we aggregate the results to see what percentage of kicks the kicker makes across our simulation.

This simple example illustrates all four steps of Monte Carlo simulation: we define uncertainty (85% success rate), generate random samples (1,000 kicks), calculate outcomes (make or miss for each), and aggregate results (overall percentage). Let's see it in action:

#| label: fg-simple-r
#| message: false
#| warning: false

# Simulate 1,000 field goal attempts
# Assume 85% success rate (typical for NFL kickers in moderate conditions)
n_kicks <- 1000
make_probability <- 0.85

# Simulate outcomes (1 = make, 0 = miss)
# rbinom generates random binomial trials: n_kicks trials, each with make_probability
kick_results <- rbinom(n_kicks, size = 1, prob = make_probability)

# Calculate make percentage from our simulation
make_pct <- mean(kick_results)

cat("Simulated make percentage:", round(make_pct * 100, 1), "%\n")
cat("True probability:", make_probability * 100, "%\n")
cat("Difference:", round(abs(make_pct - make_probability) * 100, 2), "%\n")
#| label: fg-simple-py
#| message: false
#| warning: false

# Simulate 1,000 field goal attempts
n_kicks = 1000
make_probability = 0.85

# Simulate outcomes using binomial distribution
# Each kick is an independent trial with 85% success rate
kick_results = np.random.binomial(1, make_probability, n_kicks)

# Calculate make percentage
make_pct = kick_results.mean()

print(f"Simulated make percentage: {make_pct*100:.1f}%")
print(f"True probability: {make_probability*100:.0f}%")
print(f"Difference: {abs(make_pct - make_probability)*100:.2f}%")
The `rbinom()` function in R (or `np.random.binomial()` in Python) is perfect for modeling binary outcomes like field goal attempts. The parameters are: - `n_kicks`: How many kicks to simulate - `size = 1`: Each trial produces either 1 success or 0 failures (size=1 means a single kick) - `prob = make_probability`: The probability of success on each trial The function returns a vector of 1s and 0s representing makes and misses. Taking the mean of this vector gives us the make percentage—since R and Python treat TRUE/1 as 1 and FALSE/0 as 0, the mean equals the proportion of successes. The small difference between our simulated percentage and the true 85% probability is **sampling error**. With only 1,000 kicks, we expect some random variation around the true value. If we ran this simulation again with a different seed, we'd get a different (but close) result. This is the inherent uncertainty in finite samples.

Notice that our simulated make percentage isn't exactly 85%—it might be 84.7% or 85.3% due to random chance. This sampling error decreases as we increase the number of simulations. With 1,000 kicks, we typically get within 1-2% of the true value. With 10,000 kicks, we'd get much closer. This relationship between sample size and accuracy is fundamental to understanding Monte Carlo methods.

Convergence with Sample Size

A key principle of Monte Carlo simulation is that accuracy improves with more samples. This isn't just theoretical—we can demonstrate it directly by running simulations with increasing sample sizes and watching our estimates converge to the true value.

The relationship follows a mathematical law: estimation error decreases proportional to 1/√n, where n is the sample size. Double your simulations, cut your error by about 1.4×. Quadruple them, cut it by 2×. This square root relationship means we hit diminishing returns quickly—going from 1,000 to 10,000 simulations helps a lot, but going from 100,000 to 1,000,000 helps much less.

Let's visualize this convergence to build intuition about how many simulations we need for reliable results:

#| label: fig-convergence-r
#| fig-cap: "Monte Carlo convergence with sample size"
#| fig-width: 10
#| fig-height: 6

# Run simulations with increasing sample sizes
# Start at 10, go up to 10,000, checking every 10 kicks
sample_sizes <- seq(10, 10000, by = 10)
true_probability <- 0.85

# For each sample size, simulate that many kicks and calculate make %
estimates <- map_dbl(sample_sizes, function(n) {
  mean(rbinom(n, size = 1, prob = true_probability))
})

# Plot convergence
tibble(
  n = sample_sizes,
  estimate = estimates,
  error = abs(estimates - true_probability)
) %>%
  ggplot(aes(x = n, y = estimate)) +
  geom_line(color = "#619CFF", linewidth = 1) +
  # Red dashed line shows true value
  geom_hline(yintercept = true_probability,
             linetype = "dashed", color = "red", linewidth = 1) +
  # Shaded band shows ±2% margin of error
  geom_ribbon(aes(ymin = true_probability - 0.02,
                  ymax = true_probability + 0.02),
              alpha = 0.2, fill = "red") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::percent, limits = c(0.75, 0.95)) +
  labs(
    title = "Monte Carlo Convergence to True Value",
    subtitle = "Field goal make probability (true value = 85%)",
    x = "Number of Simulations",
    y = "Estimated Make Probability",
    caption = "Red band shows ±2% margin | Estimates stabilize around 1,000 simulations"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: fig-convergence-py
#| fig-cap: "Monte Carlo convergence with sample size - Python"
#| fig-width: 10
#| fig-height: 6

# Run simulations with increasing sample sizes
sample_sizes = range(10, 10001, 10)
true_probability = 0.85

# Simulate field goals for each sample size
estimates = [np.random.binomial(1, true_probability, n).mean()
             for n in sample_sizes]

# Plot convergence
plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, estimates, color='#619CFF', linewidth=2)
plt.axhline(y=true_probability, color='red', linestyle='--', linewidth=2,
            label='True Probability')
plt.fill_between(sample_sizes,
                 true_probability - 0.02,
                 true_probability + 0.02,
                 alpha=0.2, color='red')
plt.xlabel('Number of Simulations', fontsize=12)
plt.ylabel('Estimated Make Probability', fontsize=12)
plt.title('Monte Carlo Convergence to True Value\nField goal make probability (true value = 85%)',
          fontsize=14, fontweight='bold')
plt.text(0.98, 0.02, 'Red band shows ±2% margin | Estimates stabilize around 1,000 simulations',
         transform=plt.gca().transAxes, ha='right', fontsize=9)
plt.ylim(0.75, 0.95)
plt.legend()
plt.tight_layout()
plt.show()
This convergence plot reveals several important insights: **Early volatility**: With just 10-100 simulations, our estimates swing wildly, sometimes differing from the true value by 5-10%. You can see the jagged line bouncing around in the left portion of the plot. This is why you should never trust results from small simulations. **Stabilization**: Around 1,000 simulations, the line settles down and stays within the red ±2% band. The wild swings disappear, replaced by minor fluctuations. This is typically "good enough" for most football analytics work. **Diminishing returns**: After 1,000 simulations, additional samples help less and less. The line gets smoother but doesn't fundamentally improve. Going from 1,000 to 10,000 simulations provides some benefit, but not 10× the benefit—the square root relationship means you only cut error by about 3×. **Practical implications**: For production analytics, 10,000 simulations is a sweet spot—enough for stable results, not so many that computation becomes slow. For exploratory work, 1,000 may suffice. For high-stakes decisions (like draft pick trades worth millions), consider 100,000+.

The plot shows our estimate bouncing around when sample sizes are small, then settling down as we run more simulations. By 1,000 simulations, we're consistently within ±2% of the true value. By 10,000, we're even tighter. This visual proof demonstrates why we typically use 10,000 simulations for production analytics—it's where we get stable, trustworthy results without excessive computation time.

Game Simulation Models

Modeling Game Outcomes

Moving from simple field goal simulations to full game simulations requires modeling the complex interaction between two teams competing against each other. The simplest approach uses team strength ratings—single numbers representing each team's quality—combined with random variation to generate realistic game outcomes.

This approach builds on the ratings systems we explored in Chapter 33 (ELO and Ratings). The key insight is that game outcomes aren't deterministic—even when a strong team plays a weak team, the underdog sometimes wins due to turnovers, injuries, referee calls, and pure luck. We model this by adding random noise to the expected outcome.

The typical model has three components:

  1. Team ratings: Numbers representing offensive/defensive strength, typically calibrated so 0 = average team, positive = above average, negative = below average. A +7 rating means the team beats an average team by about 7 points on average.

  2. Home field advantage: The home team typically has a 2-3 point edge due to crowd noise, travel fatigue, and referee bias. We add this constant to the home team's expected performance.

  3. Random variation: Even if we expect a team to win by 10 points, the actual margin follows a distribution centered at 10 with some spread (standard deviation). Research shows NFL game margins have a standard deviation around 13-14 points, reflecting the high variance in game outcomes.

Let's implement this simple but effective model:

#| label: game-sim-basic-r
#| message: false
#| warning: false

# Function to simulate a single game between two teams
simulate_game <- function(team_a_rating, team_b_rating,
                         home_advantage = 2.5,
                         sd_points = 13.5) {

  # Expected margin (positive = team A wins)
  # This is the "most likely" outcome before adding randomness
  expected_margin <- team_a_rating - team_b_rating + home_advantage

  # Add random variation using normal distribution
  # Actual game might deviate from expectation due to luck, turnovers, etc.
  actual_margin <- rnorm(1, mean = expected_margin, sd = sd_points)

  # Determine winner (positive margin = team A wins)
  team_a_wins <- actual_margin > 0

  # Estimate scores assuming both teams score around 24 points on average
  # Then adjust based on margin (winner gets more, loser gets less)
  list(
    margin = actual_margin,
    team_a_wins = team_a_wins,
    team_a_score = round(24 + actual_margin / 2),
    team_b_score = round(24 - actual_margin / 2)
  )
}

# Simulate Chiefs vs Bills (both strong teams)
chiefs_rating <- 6.5  # Chiefs are 6.5 points above average team
bills_rating <- 5.8   # Bills are 5.8 points above average team

set.seed(2024)
game_result <- simulate_game(chiefs_rating, bills_rating, home_advantage = 2.5)

cat("Simulated Game Result:\n")
cat("Chiefs:", game_result$team_a_score, "\n")
cat("Bills:", game_result$team_b_score, "\n")
cat("Margin:", round(game_result$margin, 1), "\n")
cat("Winner:", ifelse(game_result$team_a_wins, "Chiefs", "Bills"), "\n")
#| label: game-sim-basic-py
#| message: false
#| warning: false

def simulate_game(team_a_rating, team_b_rating,
                  home_advantage=2.5, sd_points=13.5):
    """
    Simulate a single game between two teams.

    Parameters:
    - team_a_rating: Team A strength (points above average)
    - team_b_rating: Team B strength (points above average)
    - home_advantage: Points added to home team (default 2.5)
    - sd_points: Standard deviation of game outcomes (default 13.5)

    Returns:
    - Dictionary with margin, winner, and scores
    """

    # Expected margin (positive = team A wins)
    expected_margin = team_a_rating - team_b_rating + home_advantage

    # Add random variation
    actual_margin = np.random.normal(expected_margin, sd_points)

    # Determine winner
    team_a_wins = actual_margin > 0

    return {
        'margin': actual_margin,
        'team_a_wins': team_a_wins,
        'team_a_score': round(24 + actual_margin / 2),
        'team_b_score': round(24 - actual_margin / 2)
    }

# Simulate Chiefs vs Bills (both strong teams)
chiefs_rating = 6.5
bills_rating = 5.8

np.random.seed(2024)
game_result = simulate_game(chiefs_rating, bills_rating, home_advantage=2.5)

print("Simulated Game Result:")
print(f"Chiefs: {game_result['team_a_score']}")
print(f"Bills: {game_result['team_b_score']}")
print(f"Margin: {game_result['margin']:.1f}")
print(f"Winner: {'Chiefs' if game_result['team_a_wins'] else 'Bills'}")
The `simulate_game()` function encapsulates our game model: **Expected margin calculation**: The Chiefs (+6.5) vs Bills (+5.8) with home advantage (+2.5) gives an expected margin of 6.5 - 5.8 + 2.5 = 3.2 points in favor of the Chiefs. This is our best guess before randomness enters. **Random variation**: We add noise using `rnorm()` with standard deviation 13.5 points. This means about 68% of games will finish within ±13.5 points of expectation, and 95% within ±27 points. This high variance reflects NFL reality—upsets happen regularly. **Score estimation**: We assume both teams average 24 points, then adjust based on margin. If the Chiefs win by 10, we estimate Chiefs 29, Bills 19. This is simplified (real scoring distributions are more complex), but adequate for playoff probability simulations where we only care about wins/losses. **Why these parameters?**: Home advantage of 2.5 points and game standard deviation of 13.5 points come from historical NFL data analysis. These values have been stable over decades and work well for simulation purposes.

This single simulation shows one possible outcome: the Chiefs might win by 3 points, or the Bills might pull an upset. The beauty of Monte Carlo methods is that we don't stop at one simulation—we run thousands to see the full range of possibilities.

Oversimplification Alert

This model treats game outcomes as independent random events, ignoring many real factors: weather conditions, injuries, coaching adjustments, momentum, and referee tendencies. More sophisticated models can incorporate these factors, but at the cost of complexity. For playoff probability calculations, this simple model works surprisingly well. For in-game win probability or prop betting, you'd want more sophisticated approaches that model play-by-play dynamics. Always match your model complexity to your question's requirements.

Monte Carlo Game Probabilities

A single game simulation tells us one possible outcome, but it doesn't tell us probabilities. To estimate win probability, we simulate the game thousands of times and count how often each team wins. If the Chiefs win 6,850 out of 10,000 simulations, their estimated win probability is 68.5%.

This approach has several advantages over analytical methods. First, it handles asymmetry naturally—win probability doesn't have to be a smooth function of ratings. Second, it captures tail risk—we see not just the average outcome but the full distribution including blowouts and close games. Third, it's easily extended—want to account for weather? Just modify the simulation logic.

The key parameter is how many simulations to run. Too few and our estimates will be noisy; too many and we waste computation time. For game-level probabilities, 10,000 simulations provides estimates accurate to about ±1%, which is more than sufficient given the uncertainty in our team ratings themselves.

#| label: game-monte-carlo-r
#| message: false
#| warning: false

# Simulate 10,000 games between Chiefs and Bills
# This lets us estimate win probability with ~1% accuracy
n_sims <- 10000

# Run simulations and store all results
results <- map_dfr(1:n_sims, function(i) {
  # Simulate one game
  game <- simulate_game(chiefs_rating, bills_rating, home_advantage = 2.5)

  # Store result in a data frame row
  tibble(
    sim = i,
    margin = game$margin,
    chiefs_win = game$team_a_wins,
    chiefs_score = game$team_a_score,
    bills_score = game$team_b_score
  )
})

# Calculate win probability by counting wins
chiefs_win_prob <- mean(results$chiefs_win)

cat("Based on", scales::comma(n_sims), "simulations:\n")
cat("Chiefs win probability:", round(chiefs_win_prob * 100, 1), "%\n")
cat("Bills win probability:", round((1 - chiefs_win_prob) * 100, 1), "%\n")
cat("Average margin:", round(mean(results$margin), 1), "points\n")
cat("Margin std dev:", round(sd(results$margin), 1), "points\n")
#| label: game-monte-carlo-py
#| message: false
#| warning: false

# Simulate 10,000 games between Chiefs and Bills
n_sims = 10000

# Store results in a list, then convert to DataFrame
results = []
for i in range(n_sims):
    game = simulate_game(chiefs_rating, bills_rating, home_advantage=2.5)
    results.append({
        'sim': i,
        'margin': game['margin'],
        'chiefs_win': game['team_a_wins'],
        'chiefs_score': game['team_a_score'],
        'bills_score': game['team_b_score']
    })

results_df = pd.DataFrame(results)

# Calculate win probability
chiefs_win_prob = results_df['chiefs_win'].mean()

print(f"Based on {n_sims:,} simulations:")
print(f"Chiefs win probability: {chiefs_win_prob*100:.1f}%")
print(f"Bills win probability: {(1-chiefs_win_prob)*100:.1f}%")
print(f"Average margin: {results_df['margin'].mean():.1f} points")
print(f"Margin std dev: {results_df['margin'].std():.1f} points")
The Monte Carlo approach gives us rich information beyond just win probability: **Win probability**: About 68% for the Chiefs, 32% for the Bills. This reflects the Chiefs' slight rating advantage plus home field. Notice this isn't 50-50 despite both being strong teams—every advantage compounds. **Average margin**: Around 3 points in the Chiefs' favor, matching our expected margin from the model. The Monte Carlo estimate converges to this value over many simulations. **Standard deviation**: About 13.5 points, matching our model input. This confirms our simulation is working correctly—if this were wildly different, we'd know something was wrong with our code. **Full distribution**: Stored in the `results` data frame, we have all 10,000 game outcomes. We can analyze any aspect: probability of a blowout (>14 point margin), chance of overtime (margin < 3), distribution of total points, etc.

The results show the Chiefs win about 68% of simulations, giving them a roughly 2:1 edge despite playing another elite team. This demonstrates how small rating differences (0.7 points) plus home field advantage (2.5 points) compound into meaningful win probability gaps. The Bills still win about 1 in 3 games—which is why we play the games rather than just calculating probabilities.

Visualizing Game Outcomes

Numbers tell part of the story, but visualizations reveal patterns that summary statistics miss. By plotting the distribution of simulated game outcomes, we can see the full range of possibilities: not just that the Chiefs win 68% of games, but how they win and by how much.

The histogram of margins will show several interesting features. First, it should be roughly bell-shaped (normal), centered around our expected 3-point margin. Second, it will be bimodal or at least lumpy due to discrete scoring (touchdowns are worth 7 points). Third, we'll see significant overlap around zero—many games are close, representing the Bills' 32% win probability.

Color-coding the histogram by winner makes it easy to see the probability mass on each side. The red area (Chiefs wins) should be about twice the size of the blue area (Bills wins), visualizing the 68-32 split. Let's create this visualization:

#| label: fig-game-outcomes-r
#| fig-cap: "Distribution of simulated game outcomes"
#| fig-width: 10
#| fig-height: 6

results %>%
  ggplot(aes(x = margin)) +
  # Histogram colored by winner
  geom_histogram(aes(fill = chiefs_win),
                 bins = 50, alpha = 0.8, color = "white") +
  # Vertical line at even game (0 margin)
  geom_vline(xintercept = 0, linetype = "dashed",
             color = "black", linewidth = 1) +
  # Vertical line at average margin
  geom_vline(xintercept = mean(results$margin),
             linetype = "solid", color = "darkgreen", linewidth = 1) +
  # Team colors for Chiefs (red) and Bills (blue)
  scale_fill_manual(
    values = c("TRUE" = "#E31837", "FALSE" = "#00338D"),
    labels = c("Chiefs Win", "Bills Win")
  ) +
  labs(
    title = "Simulated Game Outcomes: Chiefs vs Bills",
    subtitle = paste0("Chiefs win ", round(chiefs_win_prob * 100, 1),
                     "% of simulations (n=", scales::comma(n_sims), ")"),
    x = "Margin of Victory (positive = Chiefs win)",
    y = "Count",
    fill = "Winner",
    caption = "Green line shows average margin | Black dashed line shows even game"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "top"
  )
#| label: fig-game-outcomes-py
#| fig-cap: "Distribution of simulated game outcomes - Python"
#| fig-width: 10
#| fig-height: 6

plt.figure(figsize=(10, 6))

# Separate data by winner for colored histogram
chiefs_wins = results_df[results_df['chiefs_win']]['margin']
bills_wins = results_df[~results_df['chiefs_win']]['margin']

# Plot histograms
plt.hist(chiefs_wins, bins=50, alpha=0.8, color='#E31837',
         label='Chiefs Win', edgecolor='white', linewidth=0.5)
plt.hist(bills_wins, bins=50, alpha=0.8, color='#00338D',
         label='Bills Win', edgecolor='white', linewidth=0.5)

# Add reference lines
plt.axvline(x=0, color='black', linestyle='--', linewidth=2,
            label='Even Game')
plt.axvline(x=results_df['margin'].mean(), color='darkgreen', linewidth=2,
            label='Average Margin')

plt.xlabel('Margin of Victory (positive = Chiefs win)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title(f"Simulated Game Outcomes: Chiefs vs Bills\nChiefs win {chiefs_win_prob*100:.1f}% of simulations (n={n_sims:,})",
          fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
This distribution reveals several insights that summary statistics alone wouldn't show: **Significant overlap**: Despite the Chiefs being favored, there's substantial overlap around zero. Many simulations are close games, which could go either way. This reflects the high variance in NFL games—even significant favorites lose frequently. **Bell curve shape**: The overall distribution is approximately normal (bell-shaped), validating our modeling assumption. The peak is around +3 points (Chiefs favored), with probabilities declining smoothly as we move toward blowouts in either direction. **Asymmetry**: There's slightly more red (Chiefs wins) than blue (Bills wins) mass, especially on the right side. This visualizes the 68-32% split. The red extends further right than the blue extends left, indicating the Chiefs have higher blowout potential. **Tail events**: We can see rare but possible extreme outcomes. Chiefs wins by 30+ and Bills wins by 20+ both appear in the distribution, though infrequently. These tail events matter for betting markets and strategy decisions.

The visualization makes clear what the 68% win probability means in practice: the distribution is centered on a small Chiefs advantage, but with enough variance that the Bills win a substantial minority of games. The overlapping distributions show why football outcomes remain uncertain even when one team is clearly favored.

Reading Distribution Plots

When interpreting distribution plots like this: - **Center (mean/median)**: Where is the peak? This is the most likely range of outcomes. - **Spread (variance)**: How wide is the distribution? Narrow means predictable, wide means uncertain. - **Skew**: Is it symmetric or lopsided? NFL game margins are fairly symmetric. - **Tails**: How much mass is in extreme outcomes? This matters for risk assessment. - **Overlap**: When comparing two groups, how much do they overlap? More overlap = harder to distinguish.

Season Simulation and Playoff Odds

Building a Season Simulator

Calculating playoff probabilities requires simulating entire seasons, not just individual games. This adds complexity: we need to model 17 games for each team, account for opponent strength varying by schedule, and aggregate results to determine playoff seeding. The 7-team playoff format means we need to rank all teams by record and apply tiebreakers.

Season simulation builds on our game simulation function but adds scheduling logic. Each team plays 17 games against different opponents, with some at home and some away. The sequence of opponents matters less than their overall strength, so we can simplify by having teams play random opponents from the pool. In reality, NFL schedules have structure (division games, same-finish games), but for playoff probability estimation, random schedules work reasonably well.

The key insight is that season outcomes involve compounding uncertainty. A team's final record depends on 17 game outcomes, each uncertain. Small differences in team quality (1-2 points of rating) compound over 17 games into meaningful differences in expected wins. Yet randomness remains large enough that a team with expected 10 wins might finish anywhere from 7-13 wins with reasonable probability.

Let's build a simplified but functional season simulator for the AFC:

#| label: season-sim-setup-r
#| message: false
#| warning: false
#| cache: true

# Create simplified team ratings for AFC
# Ratings based on point differential expectations vs average team
afc_teams <- tibble(
  team = c("KC", "BUF", "BAL", "MIA", "CIN", "JAX", "CLE", "HOU",
           "PIT", "LAC", "IND", "TEN", "LV", "NYJ", "NE", "DEN"),
  rating = c(6.5, 5.8, 5.2, 4.8, 4.5, 3.2, 2.8, 2.5,
             2.0, 1.5, 0.5, -1.0, -2.0, -2.5, -3.5, -4.0),
  division = c("West", "East", "North", "East", "North", "South", "North", "South",
               "North", "West", "South", "South", "West", "East", "East", "West")
)

# Function to simulate a complete season
# Returns final win totals for all teams
simulate_season <- function(teams, n_games = 17) {

  # Initialize results with zero wins
  results <- teams %>%
    select(team, rating) %>%
    mutate(wins = 0)

  # Simulate each team's season
  for(i in 1:nrow(teams)) {
    team_rating <- teams$rating[i]

    # Play against random opponents (simplified scheduling)
    # In reality, NFL schedule has structure, but random works for our purposes
    opponents <- sample(setdiff(teams$team, teams$team[i]), n_games)
    opponent_ratings <- teams$rating[teams$team %in% opponents]

    # Simulate each of the 17 games
    for(j in 1:n_games) {
      # Randomly assign home/away (50-50 split)
      is_home <- runif(1) > 0.5
      home_adv <- ifelse(is_home, 2.5, -2.5)

      # Simulate the game
      game <- simulate_game(team_rating, opponent_ratings[j],
                           home_advantage = home_adv)

      # Add win if team won
      if(game$team_a_wins) {
        results$wins[i] <- results$wins[i] + 1
      }
    }
  }

  # Return results sorted by wins (playoff seeding order)
  results %>%
    mutate(losses = n_games - wins) %>%
    arrange(desc(wins))
}

# Run one simulation to see sample output
set.seed(2024)
season_result <- simulate_season(afc_teams)

season_result %>%
  left_join(afc_teams %>% select(team, division), by = "team") %>%
  select(team, division, wins, losses, rating) %>%
  head(10) %>%
  gt() %>%
  cols_label(
    team = "Team",
    division = "Division",
    wins = "Wins",
    losses = "Losses",
    rating = "Rating"
  ) %>%
  fmt_number(
    columns = rating,
    decimals = 1
  ) %>%
  tab_header(
    title = "Simulated Season Results",
    subtitle = "Top 10 teams (one simulation)"
  ) %>%
  tab_style(
    style = cell_fill(color = "#E8F4EA"),
    locations = cells_body(rows = 1:7)
  )
#| label: season-sim-setup-py
#| message: false
#| warning: false
#| cache: true

# Create simplified team ratings for AFC
afc_teams = pd.DataFrame({
    'team': ["KC", "BUF", "BAL", "MIA", "CIN", "JAX", "CLE", "HOU",
             "PIT", "LAC", "IND", "TEN", "LV", "NYJ", "NE", "DEN"],
    'rating': [6.5, 5.8, 5.2, 4.8, 4.5, 3.2, 2.8, 2.5,
               2.0, 1.5, 0.5, -1.0, -2.0, -2.5, -3.5, -4.0],
    'division': ["West", "East", "North", "East", "North", "South", "North", "South",
                 "North", "West", "South", "South", "West", "East", "East", "West"]
})

def simulate_season(teams, n_games=17):
    """Simulate a full season for all teams."""

    # Initialize results
    results = teams[['team', 'rating']].copy()
    results['wins'] = 0

    # Each team plays n_games
    for i, row in teams.iterrows():
        team_rating = row['rating']

        # Play against random opponents
        opponents = teams[teams['team'] != row['team']].sample(n_games)

        # Simulate each game
        for _, opp in opponents.iterrows():
            # Random home/away
            is_home = np.random.random() > 0.5
            home_adv = 2.5 if is_home else -2.5

            game = simulate_game(team_rating, opp['rating'],
                               home_advantage=home_adv)

            if game['team_a_wins']:
                results.loc[results['team'] == row['team'], 'wins'] += 1

    results['losses'] = n_games - results['wins']
    return results.sort_values('wins', ascending=False)

# Run one simulation
np.random.seed(2024)
season_result = simulate_season(afc_teams)

season_result_display = season_result.merge(
    afc_teams[['team', 'division']],
    on='team'
)[['team', 'division', 'wins', 'losses', 'rating']].head(10)

print("\nSimulated Season Results (Top 10 teams - one simulation):")
print(season_result_display.to_string(index=False))
print("\nTop 7 teams make playoffs (highlighted)")
The `simulate_season()` function creates a complete 17-game season for all teams: **Scheduling**: We randomly sample 17 opponents for each team. This isn't realistic NFL scheduling (which has division structure), but it's adequate for playoff probability estimation. The randomness averages out over thousands of simulations. **Home/away split**: Each game is randomly assigned as home or away with 50-50 probability. This approximates the reality that teams play about half their games at home. We could make this exactly 8-9 home games, but the randomness doesn't materially affect our results. **Win counting**: After simulating all games, we count total wins for each team. The nested loops mean we simulate 16 teams × 17 games = 272 games per season (though each game is counted twice from different perspectives). **Sorting by wins**: We arrange teams by win total to approximate playoff seeding. In reality, division winners get automatic berths and complex tiebreakers apply, but for playoff probability purposes, "top 7 teams by wins" works reasonably well.

One simulation shows a plausible season outcome, with top teams winning 11-13 games and weak teams winning 4-7 games. Notice some randomness: a team with a mediocre rating might sneak into the top 7, or a strong team might underperform. This is football—variance matters. The question is: over thousands of simulated seasons, how often does each team make the playoffs?

Simplifications in This Model

Our season simulator makes several simplifying assumptions: 1. **Independence**: Each game is independent, ignoring injuries, momentum, or coaching adjustments that carry over 2. **Static ratings**: Team strength doesn't change during the season (no injury updates or performance trends) 3. **Random schedules**: We ignore actual NFL scheduling structure (division games, same-finish opponents) 4. **Simple tiebreakers**: We use total wins only, ignoring conference record, common games, etc. These simplifications make the code cleaner and run faster. For production playoff predictors (like FiveThirtyEight or The Athletic), you'd address these with more sophisticated modeling. But for learning Monte Carlo methods, this framework is perfect.

Monte Carlo Season Simulations

Now comes the power of Monte Carlo simulation: we run thousands of seasons to estimate playoff probabilities. One simulated season showed the Chiefs winning 13 games, but what if they got unlucky with injuries or close games? What if their opponents over-performed? By simulating 10,000 seasons, we capture all these possibilities and estimate each team's probability of making the playoffs.

The process is straightforward: run simulate_season() 10,000 times, storing the results each time. Then for each team, calculate what percentage of simulations they finished in the top 7. This percentage is our playoff probability estimate. Teams with high ratings will make it most of the time; teams with low ratings rarely; teams in the middle have uncertain fates.

This approach naturally accounts for strength of schedule variations (some teams get luckier with random opponents), win distribution (teams can have the same expected wins but different variance), and interaction effects (if Team A beats Team B, both teams' playoff odds change). All the complexity is handled implicitly through simulation.

#| label: season-monte-carlo-r
#| message: false
#| warning: false
#| cache: true

# Run 10,000 season simulations
# This will take 20-30 seconds - we're simulating 272,000 games!
n_season_sims <- 10000

playoff_results <- map_dfr(1:n_season_sims, function(i) {

  # Simulate one complete season
  season <- simulate_season(afc_teams)

  # Top 7 teams make playoffs
  season %>%
    mutate(sim = i) %>%
    mutate(playoff = row_number() <= 7) %>%
    select(sim, team, wins, playoff)

})

# Calculate playoff probabilities by team
# Count how often each team made playoffs across all simulations
playoff_odds <- playoff_results %>%
  group_by(team) %>%
  summarise(
    playoff_pct = mean(playoff) * 100,  # % of sims where team made playoffs
    avg_wins = mean(wins),               # Average wins across all sims
    .groups = "drop"
  ) %>%
  left_join(afc_teams %>% select(team, rating, division), by = "team") %>%
  arrange(desc(playoff_pct))

playoff_odds %>%
  head(10) %>%
  gt() %>%
  cols_label(
    team = "Team",
    division = "Division",
    rating = "Rating",
    avg_wins = "Avg Wins",
    playoff_pct = "Playoff %"
  ) %>%
  fmt_number(
    columns = c(rating, avg_wins),
    decimals = 1
  ) %>%
  fmt_number(
    columns = playoff_pct,
    decimals = 1
  ) %>%
  tab_header(
    title = "Playoff Probabilities",
    subtitle = paste0("Based on ", scales::comma(n_season_sims), " season simulations")
  ) %>%
  data_color(
    columns = playoff_pct,
    colors = scales::col_numeric(
      palette = c("white", "#00BFC4"),
      domain = c(0, 100)
    )
  )
#| label: season-monte-carlo-py
#| message: false
#| warning: false
#| cache: true

# Run 10,000 season simulations
n_season_sims = 10000

playoff_results = []

for i in range(n_season_sims):
    # Simulate season
    season = simulate_season(afc_teams)

    # Top 7 teams make playoffs
    season['sim'] = i
    season['playoff'] = season.index < 7

    playoff_results.append(season[['sim', 'team', 'wins', 'playoff']])

playoff_results_df = pd.concat(playoff_results, ignore_index=True)

# Calculate playoff probabilities
playoff_odds = (playoff_results_df
    .groupby('team')
    .agg(
        playoff_pct=('playoff', lambda x: x.mean() * 100),
        avg_wins=('wins', 'mean')
    )
    .reset_index()
    .merge(afc_teams[['team', 'rating', 'division']], on='team')
    .sort_values('playoff_pct', ascending=False)
)

print(f"\nPlayoff Probabilities (Based on {n_season_sims:,} season simulations):")
print(playoff_odds.head(10).to_string(index=False))
The playoff probability results reveal several important patterns: **Strong correlation with rating**: The top-rated teams (Chiefs, Bills, Ravens) have the highest playoff probabilities, often 85-95%. Their superior talent makes playoffs likely even with some bad luck. The bottom teams (Patriots, Broncos, Jets) have near-zero playoff chances—they'd need extraordinary luck across 17 games. **Middle teams have uncertainty**: Teams rated around +2 to +4 might have 40-70% playoff odds. These teams genuinely don't know their fate—they could easily make or miss the playoffs depending on close games, injuries, and opponent performance. This is where Monte Carlo simulation adds the most value over simple predictions. **Average wins vs playoff %**: Notice that teams with similar average wins (say, 9.5 games) might have different playoff percentages. This reflects variance in their win distributions and the discrete nature of playoff cutoffs. A team that consistently wins 9-10 games has higher playoff odds than a team that's volatile between 7 and 12 wins, even with the same average. **Computational cost**: Simulating 10,000 seasons × 16 teams × 17 games = 2,720,000 total games. This takes 20-30 seconds on a modern computer, which is perfectly acceptable. For production systems updating in real-time, you might optimize the code or reduce to 5,000 simulations.

The table shows playoff probabilities ranging from near-100% (Chiefs) to near-0% (bottom teams), with interesting uncertainty in the middle. This matches our intuition: great teams almost always make it, terrible teams almost never do, and average teams face genuine uncertainty. Monte Carlo simulation quantifies this uncertainty precisely.

Visualizing Playoff Probabilities

Numbers in a table are informative, but a visualization reveals patterns more clearly. By plotting playoff probability against team rating, we can see the relationship between team quality and playoff chances. This visualization serves multiple purposes: it validates our model (better teams should have higher odds), reveals the strength of the relationship (how steep is the curve?), and identifies outliers (teams over- or under-performing their rating).

We expect an S-shaped (logistic) curve: very strong teams approach 100% playoff probability, very weak teams approach 0%, and teams around average are most uncertain. The steepness of this curve tells us how deterministic NFL seasons are—a steep curve means ratings are very predictive, while a shallow curve means lots of randomness.

Let's visualize this relationship with playoff probability on the y-axis and team rating on the x-axis:

#| label: fig-playoff-odds-r
#| fig-cap: "Playoff probabilities by team rating"
#| fig-width: 10
#| fig-height: 8

playoff_odds %>%
  ggplot(aes(x = rating, y = playoff_pct)) +
  # Points colored by division
  geom_point(aes(color = division), size = 4, alpha = 0.7) +
  # Smooth trend line showing relationship
  geom_smooth(method = "loess", se = TRUE, color = "black",
              linetype = "dashed", linewidth = 0.8) +
  # Team labels offset to avoid overlap
  geom_text(aes(label = team), hjust = -0.2, size = 3) +
  # 50% line showing playoff cutoff
  geom_hline(yintercept = 50, linetype = "dotted",
             color = "gray40", alpha = 0.5) +
  scale_y_continuous(labels = scales::percent_format(scale = 1),
                    limits = c(0, 100)) +
  labs(
    title = "Playoff Probability vs Team Rating",
    subtitle = paste0("Based on ", scales::comma(n_season_sims),
                     " season simulations"),
    x = "Team Rating (points above/below average)",
    y = "Playoff Probability",
    color = "Division",
    caption = "Dashed line shows fitted relationship | Better teams have higher playoff odds"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "right"
  )
#| label: fig-playoff-odds-py
#| fig-cap: "Playoff probabilities by team rating - Python"
#| fig-width: 10
#| fig-height: 8

from scipy.interpolate import make_interp_spline

plt.figure(figsize=(10, 8))

# Color by division
divisions = playoff_odds['division'].unique()
colors = plt.cm.Set2(range(len(divisions)))
division_colors = dict(zip(divisions, colors))

for division in divisions:
    div_data = playoff_odds[playoff_odds['division'] == division]
    plt.scatter(div_data['rating'], div_data['playoff_pct'],
               label=division, s=100, alpha=0.7,
               color=division_colors[division])

# Add team labels
for _, row in playoff_odds.iterrows():
    plt.annotate(row['team'],
                (row['rating'], row['playoff_pct']),
                xytext=(5, 0), textcoords='offset points',
                fontsize=8)

# Add smooth trend line
x_smooth = np.linspace(playoff_odds['rating'].min(),
                       playoff_odds['rating'].max(), 100)
from scipy.interpolate import UnivariateSpline
spl = UnivariateSpline(playoff_odds['rating'], playoff_odds['playoff_pct'], s=500)
plt.plot(x_smooth, spl(x_smooth), 'k--', alpha=0.5, linewidth=2,
         label='Fitted Trend')

# Add 50% reference line
plt.axhline(y=50, color='gray', linestyle=':', alpha=0.5)

plt.xlabel('Team Rating (points above/below average)', fontsize=12)
plt.ylabel('Playoff Probability (%)', fontsize=12)
plt.title(f'Playoff Probability vs Team Rating\nBased on {n_season_sims:,} season simulations',
          fontsize=14, fontweight='bold')
plt.ylim(0, 100)
plt.legend(title='Division', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

This visualization reveals several key insights: **Strong positive relationship**: There's a clear upward trend from bottom-left to top-right. Better teams (higher ratings) have substantially higher playoff probabilities. The relationship is roughly linear in the middle range but flattens at the extremes (ceiling at 100%, floor at 0%). **S-curve pattern**: The fitted curve shows the characteristic S-shape (sigmoid/logistic function). Teams rated below -2 have near-zero playoff chances, teams above +5 have near-certain playoff berths, and teams in the -1 to +4 range face genuine uncertainty. This is the "valley of uncertainty" where season outcomes are least predictable. **Spread within ratings**: Notice some scatter around the trend line. Teams with identical ratings can have slightly different playoff odds due to schedule luck, simulation randomness, or division effects. This residual variation reminds us that ratings don't determine outcomes—they just shift probabilities. **Division clustering**: The color-coding by division can reveal if certain divisions are stacked or weak. If all teams in a division have similar ratings, they'll cluster together on the plot. This matters for playoff seeding since division winners get automatic berths. **Practical implications**: A team around 0 rating (average) has roughly 30-40% playoff odds. Each additional point of rating adds about 8-10 percentage points of playoff probability. This quantifies the value of team improvement—getting 2 points better (equivalent to a major offseason upgrade) might boost playoff odds from 35% to 55%.

The plot makes clear that team quality matters enormously for playoff probability, but randomness keeps most teams in the 20-80% range rather than at the extremes. Only truly elite or terrible teams have playoff destinies that approach certainty.

The Playoff Probability Spectrum

Based on these simulations, we can categorize teams: - **>85% playoff probability**: Elite teams, should make playoffs barring disaster - **65-85%**: Strong teams, likely make playoffs but not guaranteed - **35-65%**: Bubble teams, genuine uncertainty about playoff fate - **15-35%**: Long shots, need things to break right - **<15%**: Very unlikely to make playoffs, would require extraordinary luck Teams in the 35-65% range benefit most from simulation-based playoff odds. For them, single-number predictions ("you'll win 9 games") obscure the uncertainty that defines their season.

Strategic Scenario Analysis

Fourth Down Decision Simulation

Monte Carlo simulation excels at evaluating strategic decisions with uncertain outcomes. Consider a fourth-and-2 at the opponent's 35-yard line. The coach faces three options: go for it, kick a field goal, or punt. Each decision has different probability distributions of outcomes, and the "right" choice depends on these distributions, not just expected values.

Traditional expected value analysis might say "going for it is worth +1.2 points," but that masks the variance. Going for it might yield 7 points (touchdown), 3 points (field goal after conversion), or -2 points (turnover on downs). The field goal attempt might yield 3 points (make) or -1.5 points (miss, opponent gets good field position). Punting might yield -0.5 points (opponent starts backed up but gets the ball).

Monte Carlo simulation lets us see the full distribution of outcomes, not just the expected value. This is crucial because risk matters. A conservative coach might prefer the 3-point field goal with low variance over the higher-EV but volatile go-for-it decision. A coach down 10 points in the fourth quarter needs the touchdown upside. Simulation informs these context-dependent choices.

Let's model this decision with realistic probabilities:

#| label: fourth-down-sim-r
#| message: false
#| warning: false

# Scenario: 4th and 2 at opponent's 35-yard line
# Options: Go for it, Kick FG, Punt

simulate_fourth_down <- function(decision, n_sims = 10000) {

  results <- numeric(n_sims)

  for(i in 1:n_sims) {

    if(decision == "go") {
      # Going for it: 50% conversion rate on 4th and 2
      if(runif(1) < 0.50) {
        # Converted! Now simulate rest of drive
        # 40% TD, 20% FG, 40% turnover/punt from deeper in redzone
        outcome <- sample(c("td", "fg", "turnover"), 1,
                         prob = c(0.4, 0.2, 0.4))
        results[i] <- case_when(
          outcome == "td" ~ 7,        # Touchdown
          outcome == "fg" ~ 3,        # Field goal
          outcome == "turnover" ~ -2, # Turnover, opponent gets ball
          TRUE ~ 0
        )
      } else {
        # Failed conversion - turnover on downs at 35 yard line
        # Opponent has good field position, expected to score ~2 points
        results[i] <- -2
      }

    } else if(decision == "kick_fg") {
      # 52-yard field goal attempt (~70% success from 35 yard line)
      if(runif(1) < 0.70) {
        results[i] <- 3  # Made field goal
      } else {
        # Missed FG, opponent gets ball at 35 or touchback
        results[i] <- -1.5
      }

    } else if(decision == "punt") {
      # Punt from 35 typically nets to ~opponent 10 yard line
      # Opponent starts backed up, expected points ~-0.5
      results[i] <- -0.5
    }
  }

  tibble(
    decision = decision,
    expected_points = mean(results),
    sd_points = sd(results),
    p10 = quantile(results, 0.1),
    p90 = quantile(results, 0.9),
    values = list(results)
  )
}

# Simulate all three decisions
fourth_down_results <- bind_rows(
  simulate_fourth_down("go"),
  simulate_fourth_down("kick_fg"),
  simulate_fourth_down("punt")
)

fourth_down_results %>%
  select(decision, expected_points, sd_points, p10, p90) %>%
  gt() %>%
  cols_label(
    decision = "Decision",
    expected_points = "Expected Points",
    sd_points = "Std Dev",
    p10 = "10th %ile",
    p90 = "90th %ile"
  ) %>%
  fmt_number(
    columns = c(expected_points, sd_points, p10, p90),
    decimals = 2
  ) %>%
  tab_header(
    title = "4th and 2 at Opponent's 35",
    subtitle = "Expected value of each decision"
  ) %>%
  data_color(
    columns = expected_points,
    colors = scales::col_numeric(
      palette = c("#F8766D", "white", "#00BFC4"),
      domain = c(-1, 3)
    )
  )
#| label: fourth-down-sim-py
#| message: false
#| warning: false

def simulate_fourth_down(decision, n_sims=10000):
    """Simulate fourth down decision outcomes."""

    results = np.zeros(n_sims)

    for i in range(n_sims):

        if decision == "go":
            # 50% conversion rate
            if np.random.random() < 0.50:
                # Converted - simulate drive outcome
                outcome = np.random.choice(
                    ["td", "fg", "turnover"],
                    p=[0.4, 0.2, 0.4]
                )
                if outcome == "td":
                    results[i] = 7
                elif outcome == "fg":
                    results[i] = 3
                else:  # turnover
                    results[i] = -2
            else:
                # Failed conversion
                results[i] = -2

        elif decision == "kick_fg":
            # 70% success rate from 52 yards
            if np.random.random() < 0.70:
                results[i] = 3
            else:
                results[i] = -1.5

        elif decision == "punt":
            # Punt outcome
            results[i] = -0.5

    return {
        'decision': decision,
        'expected_points': results.mean(),
        'sd_points': results.std(),
        'p10': np.percentile(results, 10),
        'p90': np.percentile(results, 90),
        'values': results
    }

# Simulate all three decisions
fourth_down_results = pd.DataFrame([
    simulate_fourth_down("go"),
    simulate_fourth_down("kick_fg"),
    simulate_fourth_down("punt")
])

print("\n4th and 2 at Opponent's 35 - Expected value of each decision:")
print(fourth_down_results[['decision', 'expected_points', 'sd_points', 'p10', 'p90']]
      .to_string(index=False))
The simulation results reveal important differences between the three decisions: **Expected value ranking**: "Go for it" has the highest expected value (around +1.0 to +1.5 points), followed by "kick FG" (around +1.8 to +2.1 points due to 70% success × 3 points), then "punt" (around -0.5 points). The field goal actually has slightly higher EV than going for it in this scenario, contrary to aggressive "go for it" advocates. **Risk differences**: Look at the standard deviations. "Go for it" has much higher variance (~3.5 points) than "kick FG" (~2.0 points) or "punt" (~0.5 points). This quantifies the risk-reward tradeoff. Going for it offers touchdown upside but turnover downside. **Percentile insights**: The 10th percentile shows downside risk, the 90th shows upside potential. For "go for it", the 10th percentile is around -2 (failed conversion) but the 90th is around +7 (touchdown). For "kick FG", the range is tighter: 10th percentile around -1.5 (miss), 90th around +3 (make). This shows the asymmetric outcomes. **Context matters**: In a tied game, the lower variance of the field goal makes it attractive. Down 10 points late, you need the touchdown upside, making "go for it" better despite similar expected value. Up 14 points, minimizing downside risk via punt might be optimal. The simulation provides the raw data; context determines the choice.

The analysis shows that while going for it and kicking the field goal have similar expected values, they have very different risk profiles. The field goal is safer but capped at 3 points; going for it is riskier but offers touchdown upside. Your choice depends on risk tolerance, game situation, and team strengths.

Expected Value Isn't Everything

A common mistake in analytics is choosing the highest expected value option without considering: 1. **Variance/Risk**: Two options with EV of +2.0 might have very different risk profiles 2. **Game context**: Down 14 points, you need touchdowns, not field goals 3. **Tail events**: Sometimes avoiding the worst-case matters more than maximizing average 4. **Repeated decisions**: In a single game, variance matters; over a season, EV dominates Monte Carlo simulation reveals the full distribution, letting you make informed decisions based on your specific situation.

Visualizing Decision Distributions

The summary statistics tell us expected values and variance, but visualizing the full distribution reveals patterns that numbers alone miss. By plotting probability densities for each decision, we can see how often each outcome occurs, where the mass of the distribution sits, and how much overlap exists between options.

The visualization will show three overlapping distributions: one for each fourth-down decision. We expect "go for it" to be bimodal (peaks at -2 for failures and +7 for touchdowns), "kick FG" to have most mass at +3 with a smaller peak at -1.5, and "punt" to be concentrated around -0.5. The overlap tells us when the choice matters—if all three distributions were identical, the decision wouldn't matter at all.

#| label: fig-fourth-down-dist-r
#| fig-cap: "Distribution of outcomes for 4th down decisions"
#| fig-width: 10
#| fig-height: 6

fourth_down_results %>%
  select(decision, values) %>%
  unnest(values) %>%
  ggplot(aes(x = values, fill = decision)) +
  # Density curves show probability of each outcome
  geom_density(alpha = 0.6, linewidth = 1) +
  # Zero line shows breakeven point
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  scale_fill_manual(
    values = c("go" = "#00BFC4", "kick_fg" = "#F8766D", "punt" = "#7CAE00"),
    labels = c("Go For It", "Kick FG", "Punt")
  ) +
  labs(
    title = "4th and 2 at Opponent's 35 Yard Line",
    subtitle = "Distribution of expected points for each decision",
    x = "Expected Points Added",
    y = "Probability Density",
    fill = "Decision"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "top"
  )
#| label: fig-fourth-down-dist-py
#| fig-cap: "Distribution of outcomes for 4th down decisions - Python"
#| fig-width: 10
#| fig-height: 6

plt.figure(figsize=(10, 6))

colors = {'go': '#00BFC4', 'kick_fg': '#F8766D', 'punt': '#7CAE00'}
labels = {'go': 'Go For It', 'kick_fg': 'Kick FG', 'punt': 'Punt'}

for _, row in fourth_down_results.iterrows():
    decision = row['decision']
    # Use KDE for smooth density
    from scipy.stats import gaussian_kde
    density = gaussian_kde(row['values'])
    xs = np.linspace(row['values'].min(), row['values'].max(), 200)
    plt.plot(xs, density(xs), linewidth=2, alpha=0.8,
             color=colors[decision], label=labels[decision])
    plt.fill_between(xs, density(xs), alpha=0.3, color=colors[decision])

plt.axvline(x=0, color='black', linestyle='--', linewidth=2)
plt.xlabel('Expected Points Added', fontsize=12)
plt.ylabel('Probability Density', fontsize=12)
plt.title('4th and 2 at Opponent\'s 35 Yard Line\nDistribution of expected points for each decision',
          fontsize=14, fontweight='bold')
plt.legend(title='Decision', loc='upper right')
plt.tight_layout()
plt.show()

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

The distribution plot reveals insights that summary statistics obscure: **Bimodal "go for it"**: The teal distribution shows two clear peaks—one around -2 (failed conversion) and one around +7 (touchdown after conversion). This bimodality reflects the binary nature of the decision: you either convert and likely score big, or fail and give the opponent good field position. The valley between peaks represents the less common outcomes (converting but only getting a field goal). **Concentrated "kick FG"**: The red distribution has most of its mass concentrated at +3 (successful field goal) with a smaller peak at -1.5 (missed kick). This shows the lower variance—outcomes cluster tightly around these two values. There's less uncertainty about what will happen. **Narrow "punt"**: The green distribution is very concentrated around -0.5. Punting has the least variance—you know what you'll get. This makes it the "safe" option that avoids both upside and downside. **Overlap implications**: Notice significant overlap in the left tails. All three decisions have some probability of negative outcomes. The "go for it" distribution extends further right (more upside) but also has more mass in negative territory (more downside risk). The choice depends on whether you value the upside or fear the downside.

The visualization makes the risk-reward tradeoff crystal clear: going for it offers the highest upside (touchdown peak near +7) but also more downside risk, while the field goal is safer with less variance but capped upside. The punt is the most predictable but gives up points on average.

Drive Outcome Simulation

Modeling Drive Results

Moving from single decisions to complete drives requires simulating sequences of plays. A drive simulation tracks field position, down-and-distance, and play outcomes over multiple plays until the drive ends (touchdown, field goal, turnover, or turnover on downs). This is more complex than game simulation but reveals insights about offensive efficiency, play-calling strategy, and scoring probability by field position.

The key components of a drive simulator are:

  1. Play-by-play modeling: Each play advances the ball some yardage (positive or negative) based on play type and randomness
  2. Down-and-distance tracking: After each play, update downs and yards to go for first down
  3. Drive-ending events: Model touchdowns, turnovers, turnovers on downs, and field position for field goal attempts
  4. Play-type selection: Balance between running and passing, potentially adjusting based on down and distance

We'll build a simplified drive simulator that captures the essential dynamics without play-by-play granularity. The model will randomly choose run or pass, simulate yardage gained from distributions, and track drive progress until an ending event occurs:

#| label: drive-sim-r
#| message: false
#| warning: false

# Simulate a complete drive starting at given yard line
simulate_drive <- function(start_yard = 25, offense_rating = 0) {

  yard_line <- start_yard
  down <- 1
  yards_to_go <- 10
  plays <- 0
  max_plays <- 20  # Safety valve to prevent infinite loops

  while(plays < max_plays) {
    plays <- plays + 1

    # Determine play type (60% pass, 40% run in modern NFL)
    play_type <- sample(c("pass", "run"), 1, prob = c(0.6, 0.4))

    # Simulate yards gained
    # Pass: higher mean, higher variance (more boom/bust)
    # Run: lower mean, lower variance (more consistent)
    if(play_type == "pass") {
      yards <- round(rnorm(1, mean = 7 + offense_rating, sd = 10))
    } else {
      yards <- round(rnorm(1, mean = 4 + offense_rating * 0.5, sd = 5))
    }

    # Cap yards at reasonable bounds (-15 to +75)
    yards <- max(-15, min(yards, 75))

    # 5% chance of turnover (interception or fumble)
    if(runif(1) < 0.05) {
      return(list(result = "turnover", plays = plays, yards = yard_line))
    }

    # Update field position and yards to go
    yard_line <- yard_line + yards
    yards_to_go <- yards_to_go - yards

    # Check for touchdown
    if(yard_line >= 100) {
      return(list(result = "touchdown", plays = plays, yards = 100))
    }

    # Check for safety (backed into own endzone)
    if(yard_line <= 0) {
      return(list(result = "safety", plays = plays, yards = 0))
    }

    # Check for first down
    if(yards_to_go <= 0) {
      down <- 1
      yards_to_go <- 10
    } else {
      down <- down + 1

      # Turnover on downs after 4th down
      if(down > 4) {
        return(list(result = "turnover_downs", plays = plays,
                   yards = yard_line))
      }
    }
  }

  # Max plays reached (shouldn't happen often)
  return(list(result = "timeout", plays = plays, yards = yard_line))
}

# Simulate 10,000 drives to see distribution of outcomes
n_drive_sims <- 10000

drive_results <- map_dfr(1:n_drive_sims, function(i) {
  drive <- simulate_drive(start_yard = 25, offense_rating = 1)
  tibble(
    sim = i,
    result = drive$result,
    plays = drive$plays,
    yards = drive$yards
  )
})

# Summarize results
drive_summary <- drive_results %>%
  group_by(result) %>%
  summarise(
    count = n(),
    pct = n() / n_drive_sims * 100,
    avg_plays = mean(plays),
    .groups = "drop"
  ) %>%
  arrange(desc(count))

drive_summary %>%
  gt() %>%
  cols_label(
    result = "Drive Result",
    count = "Count",
    pct = "Percentage",
    avg_plays = "Avg Plays"
  ) %>%
  fmt_number(
    columns = pct,
    decimals = 1
  ) %>%
  fmt_number(
    columns = avg_plays,
    decimals = 1
  ) %>%
  tab_header(
    title = "Drive Outcome Simulation",
    subtitle = paste0("Based on ", scales::comma(n_drive_sims), " simulated drives")
  ) %>%
  data_color(
    columns = pct,
    colors = scales::col_numeric(
      palette = c("white", "#619CFF"),
      domain = c(0, max(drive_summary$pct))
    )
  )
#| label: drive-sim-py
#| message: false
#| warning: false

def simulate_drive(start_yard=25, offense_rating=0):
    """Simulate a complete drive."""

    yard_line = start_yard
    down = 1
    yards_to_go = 10
    plays = 0
    max_plays = 20

    while plays < max_plays:
        plays += 1

        # Determine play type (60% pass, 40% run)
        play_type = np.random.choice(['pass', 'run'], p=[0.6, 0.4])

        # Simulate yards gained
        if play_type == 'pass':
            yards = round(np.random.normal(7 + offense_rating, 10))
        else:
            yards = round(np.random.normal(4 + offense_rating * 0.5, 5))

        # Cap yards
        yards = max(-15, min(yards, 75))

        # Check for turnover
        if np.random.random() < 0.05:
            return {'result': 'turnover', 'plays': plays, 'yards': yard_line}

        # Update field position
        yard_line += yards
        yards_to_go -= yards

        # Check for touchdown
        if yard_line >= 100:
            return {'result': 'touchdown', 'plays': plays, 'yards': 100}

        # Check for safety
        if yard_line <= 0:
            return {'result': 'safety', 'plays': plays, 'yards': 0}

        # Check for first down
        if yards_to_go <= 0:
            down = 1
            yards_to_go = 10
        else:
            down += 1
            if down > 4:
                return {'result': 'turnover_downs', 'plays': plays,
                       'yards': yard_line}

    return {'result': 'timeout', 'plays': plays, 'yards': yard_line}

# Simulate 10,000 drives
n_drive_sims = 10000

drive_results = pd.DataFrame([
    simulate_drive(start_yard=25, offense_rating=1)
    for _ in range(n_drive_sims)
])

# Summarize results
drive_summary = (drive_results
    .groupby('result')
    .agg(
        count=('result', 'count'),
        avg_plays=('plays', 'mean')
    )
    .reset_index()
)
drive_summary['pct'] = drive_summary['count'] / n_drive_sims * 100
drive_summary = drive_summary.sort_values('count', ascending=False)

print(f"\nDrive Outcome Simulation (Based on {n_drive_sims:,} simulated drives):")
print(drive_summary.to_string(index=False))
The drive simulation results show realistic NFL drive outcomes: **Touchdown rate**: Somewhere around 20-25% of drives end in touchdowns, matching NFL averages for above-average offenses. The `offense_rating = 1` parameter gives the offense a slight edge, increasing scoring probability. **Turnover rate**: With a 5% per-play turnover probability and drives averaging 6-8 plays, we get realistic total turnover rates around 25-30% of drives. This includes interceptions, fumbles, and turnovers on downs. **Turnover on downs**: This happens when offenses fail to convert on 4th down or run out of downs before getting a first down. The percentage reflects conservative play-calling (we don't attempt many 4th downs in this simple model). **Play count**: Drives averaging 5-7 plays align with NFL reality. Touchdown drives tend to be longer (more first downs), while turnovers happen quicker (fewer plays before the mistake). **Model limitations**: This is a simplified model that doesn't account for field position effects (scoring probability should increase as you get closer), time management, or strategic 4th down decisions. More sophisticated drive models would incorporate these factors.

The results show a realistic distribution of drive outcomes, with scoring rates that match NFL data for above-average offenses. This type of simulation can be extended to model entire games (alternating drives for each team), inform fourth-down decisions (by simulating drive continuation probability), or evaluate play-calling strategies.

Extending Drive Simulation

This basic drive simulator can be enhanced in several ways: 1. **Field position effects**: Make play outcomes depend on field position (easier to gain yards in open field) 2. **Down-and-distance effects**: Adjust play type selection based on situation (more passes on 3rd-and-long) 3. **Red zone modeling**: Different dynamics inside the 20-yard line (condensed field, higher scoring probability) 4. **Clock management**: Track game time and model end-of-half situations 5. **Personnel effects**: Different player skills affect play outcomes (elite QB increases pass efficiency) Each extension adds realism but also complexity. Start simple, validate against real data, then add sophistication as needed.

Bootstrapping and Resampling Methods

Understanding Bootstrap Confidence Intervals

Monte Carlo simulation generates data from a model, but what if we want to quantify uncertainty in real data? Bootstrapping provides the answer. This resampling technique estimates confidence intervals and standard errors without making distributional assumptions, making it perfect for football analytics where data often violates normality assumptions.

The bootstrap works by repeatedly resampling your observed data with replacement, calculating your statistic of interest on each resample, and using the distribution of those statistics to estimate uncertainty. If you want a confidence interval for average EPA per play, you resample your plays thousands of times and see how the average varies across resamples.

Why does this work? The bootstrap principle states that the relationship between your sample and the resampled data approximates the relationship between the true population and your sample. In other words, the variability you see in bootstrap resamples mimics the variability you'd see if you could collect new samples from the population.

Let's use bootstrapping to estimate confidence intervals for team offensive efficiency:

#| label: bootstrap-r
#| message: false
#| warning: false
#| cache: true

# Load real play-by-play data
pbp_2023 <- load_pbp(2023)

# Calculate team EPA per play
team_epa <- pbp_2023 %>%
  filter(season_type == "REG", !is.na(epa), play_type %in% c("pass", "run")) %>%
  group_by(posteam) %>%
  summarise(
    n_plays = n(),
    observed_epa = mean(epa),
    .groups = "drop"
  )

# Bootstrap function: resample plays and recalculate EPA
bootstrap_team_epa <- function(team_abbr, pbp_data, n_boot = 1000) {

  # Get plays for this team
  team_plays <- pbp_data %>%
    filter(posteam == team_abbr, !is.na(epa), play_type %in% c("pass", "run"))

  # Bootstrap: resample plays with replacement n_boot times
  boot_estimates <- map_dbl(1:n_boot, function(i) {
    # Sample n plays with replacement
    boot_sample <- team_plays %>%
      slice_sample(n = nrow(team_plays), replace = TRUE)

    # Calculate EPA for this bootstrap sample
    mean(boot_sample$epa)
  })

  # Calculate confidence intervals from bootstrap distribution
  tibble(
    team = team_abbr,
    observed_epa = mean(team_plays$epa),
    boot_mean = mean(boot_estimates),
    boot_se = sd(boot_estimates),
    ci_lower = quantile(boot_estimates, 0.025),  # 2.5th percentile
    ci_upper = quantile(boot_estimates, 0.975)   # 97.5th percentile
  )
}

# Run bootstrap for top 6 teams
top_teams <- team_epa %>%
  arrange(desc(observed_epa)) %>%
  head(6) %>%
  pull(posteam)

boot_results <- map_dfr(top_teams,
                        ~bootstrap_team_epa(.x, pbp_2023, n_boot = 1000))

boot_results %>%
  gt() %>%
  cols_label(
    team = "Team",
    observed_epa = "Observed EPA",
    boot_mean = "Bootstrap Mean",
    boot_se = "Std Error",
    ci_lower = "95% CI Lower",
    ci_upper = "95% CI Upper"
  ) %>%
  fmt_number(
    columns = c(observed_epa, boot_mean, boot_se, ci_lower, ci_upper),
    decimals = 3
  ) %>%
  tab_header(
    title = "Bootstrap Confidence Intervals for Team EPA",
    subtitle = "Based on 1,000 bootstrap resamples of 2023 regular season"
  )
#| label: bootstrap-py
#| message: false
#| warning: false
#| cache: true

# Would load real data here - using simulated for demonstration
# pbp_2023 = nfl.import_pbp_data([2023])

# Simulated team data for demonstration
np.random.seed(2024)

def bootstrap_team_epa(team_epa_values, n_boot=1000):
    """Bootstrap confidence intervals for team EPA."""

    boot_estimates = []

    for i in range(n_boot):
        # Resample with replacement
        boot_sample = np.random.choice(team_epa_values,
                                      size=len(team_epa_values),
                                      replace=True)
        boot_estimates.append(boot_sample.mean())

    boot_estimates = np.array(boot_estimates)

    return {
        'observed_epa': team_epa_values.mean(),
        'boot_mean': boot_estimates.mean(),
        'boot_se': boot_estimates.std(),
        'ci_lower': np.percentile(boot_estimates, 2.5),
        'ci_upper': np.percentile(boot_estimates, 97.5)
    }

# Simulate example data for demonstration
print("\nBootstrap Confidence Intervals for Team EPA")
print("(Based on 1,000 bootstrap resamples)")
print("\nExample: Team with 1000 plays, mean EPA = 0.15")

# Simulated data
team_plays = np.random.normal(0.15, 2.0, 1000)
result = bootstrap_team_epa(team_plays, n_boot=1000)

print(f"Observed EPA: {result['observed_epa']:.3f}")
print(f"Bootstrap Mean: {result['boot_mean']:.3f}")
print(f"Standard Error: {result['boot_se']:.3f}")
print(f"95% CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")
The bootstrap results reveal important insights about uncertainty: **Standard error**: The bootstrap SE tells us the typical variability in our EPA estimate. Teams with more plays have smaller standard errors (more data = less uncertainty). Teams with similar observed EPA might have very different confidence intervals based on sample size and variance. **Confidence intervals**: The 95% CI gives us a range of plausible values for the true mean EPA. If the interval is [0.08, 0.22], we're 95% confident the true EPA falls in this range. Narrow intervals indicate precise estimates; wide intervals indicate uncertainty. **Bootstrap vs observed**: The bootstrap mean should be very close to the observed mean—this validates that our bootstrap is working correctly. Large differences would indicate a problem with the resampling procedure. **Overlapping intervals**: If two teams' confidence intervals overlap substantially, we can't confidently say which team is truly better. The observed difference might be just random variation. This is why rankings based on point estimates can be misleading.

The bootstrap confidence intervals show that even teams with similar observed EPA values may have very different levels of uncertainty depending on sample size and variance in their play outcomes. This uncertainty quantification is crucial for making fair comparisons between teams.

When to Use Bootstrap vs Parametric Methods

Use bootstrapping when: - Sample sizes are moderate (>30 but <1000) - Data violates normality assumptions - You're calculating complex statistics (ratios, percentiles) - You want robust results without distributional assumptions Use parametric methods (t-tests, etc.) when: - Sample sizes are very large (>1000) - Data is approximately normal - You're calculating simple statistics (means, proportions) - Computational speed matters For football analytics, bootstrap is often preferable because play outcomes are non-normal (bounded, skewed, heavy-tailed) and we often work with moderate sample sizes (100-1000 plays per team per season).

Sensitivity Analysis

Identifying Key Assumptions

Every model makes assumptions: team ratings are accurate, home field advantage is 2.5 points, game outcomes follow a normal distribution. Sensitivity analysis tests how results change when these assumptions vary. If small changes in an assumption dramatically alter conclusions, that assumption is critical and deserves careful calibration. If results are robust to assumption changes, we can be more confident.

In football analytics, sensitivity analysis is particularly important because we can't directly observe many key parameters. We don't know the "true" value of home field advantage—we estimate it from data. We don't know if team strength changes during the season—we assume it's constant. Sensitivity analysis reveals which uncertainties matter most.

The process is straightforward: vary one assumption at a time across a reasonable range, re-run your analysis, and plot how results change. The steepness of the relationship tells you the importance of that assumption. Flat relationships mean the assumption doesn't matter much; steep relationships mean it's critical.

Let's test how playoff odds change with team rating:

#| label: sensitivity-r
#| message: false
#| warning: false

# Test how playoff odds change with team rating
# Vary rating from -5 to +10 points (full range of NFL teams)

sensitivity_ratings <- seq(-5, 10, by = 0.5)

sensitivity_results <- map_dfr(sensitivity_ratings, function(rating) {

  # Run 1000 simulations per rating (fewer for speed)
  playoff_count <- 0

  for(i in 1:1000) {
    # Team with this rating
    team_wins <- 0

    # Play 17 games against random opponents
    for(game in 1:17) {
      # Random opponent drawn from distribution of NFL teams
      opponent_rating <- rnorm(1, mean = 0, sd = 4)

      # Random home/away
      is_home <- runif(1) > 0.5
      home_adv <- ifelse(is_home, 2.5, -2.5)

      # Simulate game
      game_result <- simulate_game(rating, opponent_rating,
                                   home_advantage = home_adv)
      if(game_result$team_a_wins) team_wins <- team_wins + 1
    }

    # Assume 10 wins needed for playoffs (simplified threshold)
    if(team_wins >= 10) playoff_count <- playoff_count + 1
  }

  tibble(
    rating = rating,
    playoff_prob = playoff_count / 1000
  )
})

# Plot sensitivity
ggplot(sensitivity_results, aes(x = rating, y = playoff_prob)) +
  geom_line(color = "#619CFF", linewidth = 1.5) +
  geom_point(color = "#619CFF", size = 2) +
  # 50% line shows where teams are coin flips for playoffs
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  # 0 line shows average team
  geom_vline(xintercept = 0, linetype = "dotted", color = "gray50") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    title = "Sensitivity Analysis: Team Rating vs Playoff Probability",
    subtitle = "How does playoff probability change with team strength?",
    x = "Team Rating (points above/below average)",
    y = "Playoff Probability",
    caption = "Red line: 50% threshold | Gray line: average team | Steep slope indicates strong sensitivity"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
#| label: sensitivity-py
#| message: false
#| warning: false

# Test how playoff odds change with team rating
sensitivity_ratings = np.arange(-5, 10.5, 0.5)

sensitivity_results = []

for rating in sensitivity_ratings:
    playoff_count = 0

    for i in range(1000):
        team_wins = 0

        # Play 17 games
        for game in range(17):
            opponent_rating = np.random.normal(0, 4)
            is_home = np.random.random() > 0.5
            home_adv = 2.5 if is_home else -2.5

            game_result = simulate_game(rating, opponent_rating,
                                       home_advantage=home_adv)
            if game_result['team_a_wins']:
                team_wins += 1

        # 10 wins threshold
        if team_wins >= 10:
            playoff_count += 1

    sensitivity_results.append({
        'rating': rating,
        'playoff_prob': playoff_count / 1000
    })

sensitivity_df = pd.DataFrame(sensitivity_results)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(sensitivity_df['rating'], sensitivity_df['playoff_prob'],
         'o-', color='#619CFF', linewidth=2, markersize=6)
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=1.5,
            label='50% threshold')
plt.axvline(x=0, color='gray', linestyle=':', linewidth=1,
            label='Average team')
plt.xlabel('Team Rating (points above/below average)', fontsize=12)
plt.ylabel('Playoff Probability', fontsize=12)
plt.title('Sensitivity Analysis: Team Rating vs Playoff Probability\nHow does playoff probability change with team strength?',
          fontsize=14, fontweight='bold')
plt.ylim(0, 1)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
The sensitivity analysis reveals important patterns: **S-shaped curve**: The relationship between rating and playoff probability follows a sigmoid (S-curve). Very weak teams (<-3 rating) have near-zero playoff chances, very strong teams (>+6 rating) approach 100%, and teams in between show the steepest sensitivity. **Steepest around average**: The slope is steepest around 0 rating (average teams). This is where small rating changes matter most. A +1 to +2 point improvement for an average team might boost playoff odds from 35% to 55%—a massive 20 percentage point swing. For elite teams already at 95%, the same improvement only adds a few percentage points. **Diminishing returns**: Improving from -3 to -2 rating barely helps (both have ~5% playoff odds). Improving from +7 to +8 also barely helps (both have ~95% odds). The middle of the distribution is where improvements matter most. **Practical implications**: This curve tells us where to focus improvement efforts. For average teams fighting for playoff spots, every incremental improvement is valuable. For terrible or elite teams, the marginal value of improvement is lower—though of course, elite teams still want to maximize Super Bowl odds, not just playoff odds.

The steep slope in the middle of the curve shows that playoff probability is highly sensitive to team rating for average-to-good teams, but less sensitive for extreme teams. This quantifies the common wisdom that "every game matters" for bubble teams while elite teams can afford a few losses.

Multi-Variable Sensitivity

Real-world uncertainty involves multiple parameters simultaneously. A team's playoff odds depend on their rating AND their schedule strength AND the playoff threshold. Multi-variable sensitivity analysis explores how results change when we vary multiple assumptions together, revealing interaction effects that single-variable analysis misses.

The standard approach uses a grid of parameter combinations: try all pairs of (rating, schedule_difficulty) and record the resulting playoff probability. This creates a two-dimensional sensitivity surface that we visualize as a heatmap. The color intensity shows playoff probability, while the x and y axes show the two parameters.

This analysis reveals whether parameters interact (do they matter more in combination?) or act independently (can we analyze them separately?). It also identifies regions of high uncertainty (where small parameter changes create large outcome changes) versus robustness (where results are stable despite parameter variation).

#| label: fig-sensitivity-2d-r
#| fig-cap: "Sensitivity to multiple parameters"
#| fig-width: 10
#| fig-height: 8

# Test sensitivity to both team rating AND schedule strength
rating_values <- seq(-2, 8, by = 2)
schedule_difficulty <- seq(-3, 3, by = 1.5)

# Create all combinations of ratings and schedule difficulty
sensitivity_grid <- expand_grid(
  rating = rating_values,
  schedule = schedule_difficulty
)

# For each combination, simulate seasons and calculate playoff probability
sensitivity_grid <- sensitivity_grid %>%
  rowwise() %>%
  mutate(
    playoff_prob = {
      playoff_count <- 0
      for(i in 1:500) {  # 500 sims per combination for speed
        team_wins <- 0
        for(game in 1:17) {
          # Opponent strength centered on schedule difficulty
          opponent_rating <- rnorm(1, mean = schedule, sd = 3)
          game_result <- simulate_game(rating, opponent_rating,
                                       home_advantage = 2.5)
          if(game_result$team_a_wins) team_wins <- team_wins + 1
        }
        if(team_wins >= 10) playoff_count <- playoff_count + 1
      }
      playoff_count / 500
    }
  ) %>%
  ungroup()

# Create heatmap
ggplot(sensitivity_grid, aes(x = rating, y = schedule, fill = playoff_prob)) +
  geom_tile(color = "white", linewidth = 1) +
  # Add probability labels to tiles
  geom_text(aes(label = scales::percent(playoff_prob, accuracy = 1)),
            color = "white", fontface = "bold", size = 4) +
  scale_fill_gradient2(
    low = "#F8766D", mid = "#FDB462", high = "#00BFC4",
    midpoint = 0.5,
    labels = scales::percent
  ) +
  labs(
    title = "Two-Way Sensitivity Analysis",
    subtitle = "Playoff probability by team rating and schedule difficulty",
    x = "Team Rating (points above average)",
    y = "Schedule Difficulty (avg opponent rating)",
    fill = "Playoff\nProbability"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "right"
  )

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: fig-sensitivity-2d-py
#| fig-cap: "Sensitivity to multiple parameters - Python"
#| fig-width: 10
#| fig-height: 8

# Test sensitivity to both team rating AND schedule strength
rating_values = np.arange(-2, 9, 2)
schedule_difficulty = np.arange(-3, 4, 1.5)

sensitivity_grid = []

for rating in rating_values:
    for schedule in schedule_difficulty:
        playoff_count = 0

        for i in range(500):
            team_wins = 0
            for game in range(17):
                opponent_rating = np.random.normal(schedule, 3)
                game_result = simulate_game(rating, opponent_rating,
                                           home_advantage=2.5)
                if game_result['team_a_wins']:
                    team_wins += 1

            if team_wins >= 10:
                playoff_count += 1

        sensitivity_grid.append({
            'rating': rating,
            'schedule': schedule,
            'playoff_prob': playoff_count / 500
        })

sensitivity_df = pd.DataFrame(sensitivity_grid)

# Create heatmap
pivot_data = sensitivity_df.pivot(index='schedule', columns='rating',
                                   values='playoff_prob')

plt.figure(figsize=(10, 8))
sns.heatmap(pivot_data, annot=True, fmt='.0%', cmap='RdYlGn',
            cbar_kws={'label': 'Playoff Probability'},
            linewidths=1, linecolor='white', vmin=0, vmax=1)
plt.xlabel('Team Rating (points above average)', fontsize=12)
plt.ylabel('Schedule Difficulty (avg opponent rating)', fontsize=12)
plt.title('Two-Way Sensitivity Analysis\nPlayoff probability by team rating and schedule difficulty',
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
The two-dimensional sensitivity analysis reveals interaction effects: **Diagonal patterns**: Notice how playoff probability increases along the diagonal from bottom-left (weak team, hard schedule = low playoff odds) to top-right (strong team, easy schedule = high playoff odds). This shows both factors matter and reinforce each other. **Schedule matters more for average teams**: For weak teams (rating -2), schedule barely matters—they have low playoff odds regardless. Similarly, elite teams (rating +8) make playoffs almost regardless of schedule. But for teams rated +2 to +6, schedule difficulty creates 20-30 percentage point swings in playoff probability. **Interaction effects**: A +4 rated team with an easy schedule (-3 difficulty) might have 85% playoff odds, while the same team with a hard schedule (+3 difficulty) might have only 45% odds—a 40 percentage point swing! This interaction shows that the two factors aren't independent—they compound. **Practical insights**: This analysis quantifies "schedule luck." Two teams with identical ratings can have vastly different playoff probabilities based on who they play. This matters for evaluating coaches (don't penalize them for tough schedules), setting betting lines (adjust for schedule), and predicting playoff races (account for remaining schedule strength). **Robustness regions**: The corners of the heatmap (near 0% or 100% playoff probability) represent robust predictions—small parameter changes don't alter the conclusion. The middle region (around 50%) represents high uncertainty—small changes in either parameter swing the outcome.

The heatmap makes clear that playoff probability depends on the interaction of team strength and schedule difficulty. Neither factor alone tells the full story—you need both. This is why sophisticated playoff predictors track remaining schedule strength alongside team ratings.

Sensitivity Analysis Best Practices

When conducting sensitivity analysis: 1. **Choose realistic ranges**: Don't vary parameters beyond plausible values (NFL home advantage isn't 20 points) 2. **Test one-at-a-time first**: Understand individual effects before exploring interactions 3. **Use enough simulations**: Each parameter combination needs sufficient simulations for stable estimates 4. **Document assumptions**: Note which parameters are varied and which are held constant 5. **Prioritize influential parameters**: Focus on parameters where results are most sensitive 6. **Validate against reality**: If your sensitivity analysis predicts something impossible, recalibrate Sensitivity analysis is most valuable for assumptions that are uncertain and influential. Don't waste time on parameters that barely affect results.

Summary

Monte Carlo simulation is a powerful tool for football analytics that embraces uncertainty rather than ignoring it. By simulating thousands or millions of possible outcomes, we can calculate probabilities, evaluate decisions, and quantify risk in ways that analytical methods cannot match.

Key techniques we've covered:

Basic Monte Carlo Simulation: Generate random outcomes according to probability distributions, repeat thousands of times, and aggregate results. The Law of Large Numbers ensures convergence to true values as sample sizes increase. Use 10,000+ simulations for production work.

Game Simulation: Model game outcomes using team ratings, home field advantage, and random variation. The simple approach (normal distribution of margins) works surprisingly well for playoff probability calculations, though more sophisticated models capture additional nuances.

Season Simulation: Extend game simulation to full seasons by tracking wins and losses over 17 games. Playoff probabilities emerge naturally from counting how often teams finish in the top 7. This accounts for schedule strength variation and win distribution effects that simple projections miss.

Strategic Decision Analysis: Evaluate fourth-down decisions, play calls, and roster moves by simulating distributions of outcomes rather than just expected values. Risk-reward tradeoffs become explicit, helping coaches make context-appropriate decisions.

Bootstrapping: Estimate confidence intervals and standard errors for statistics calculated from real data by resampling with replacement. This provides uncertainty quantification without parametric assumptions, perfect for non-normal football data.

Sensitivity Analysis: Test how results change when assumptions vary. Single-variable sensitivity reveals which parameters matter most; multi-variable analysis uncovers interaction effects. Use this to identify assumptions worth calibrating carefully versus those that don't affect conclusions.

Practical takeaways:

  1. Simulation complements analysis: Use Monte Carlo to quantify uncertainty in your models and test decisions under realistic variation
  2. More simulations = more accuracy: But with diminishing returns following a square root relationship
  3. Validate your assumptions: Compare simulated distributions to actual data to ensure your model captures reality
  4. Communicate distributions: Show ranges and percentiles, not just point estimates
  5. Test sensitivity: Understand which assumptions matter most and calibrate them carefully
  6. Context determines decisions: Expected value is important, but variance, tail risk, and game situation matter too

Monte Carlo simulation has become indispensable in modern football analytics, powering playoff predictors, fourth-down decision tools, draft value calculators, and roster optimization systems. The techniques in this chapter provide a foundation for building your own simulation-based analyses to answer questions that traditional analytics cannot address.

Exercises

Conceptual Questions

  1. Convergence: Explain why Monte Carlo estimates become more accurate with more simulations. What is the mathematical relationship between number of simulations and estimation error? Why do we typically use 10,000 simulations rather than 100 or 1,000,000?

  2. Independence: Why might simulating games as independent events lead to biased playoff probability estimates? What factors create correlation between games within a season? How could you modify the simulation model to account for these dependencies?

  3. Risk vs. Uncertainty: In the context of football analytics, explain the difference between risk (quantifiable randomness) and uncertainty (unknown unknowns). Give examples of each. How does Monte Carlo simulation help with risk but not with uncertainty?

  4. Bootstrapping vs Simulation: Compare and contrast bootstrapping (resampling observed data) with Monte Carlo simulation (generating new data from a model). When would you use each approach? Can you use both together?

Coding Exercises

Exercise 1: Field Goal Distance Model

Build a more realistic field goal simulation that: a) Models make probability as a function of distance using a logistic function: $P(\text{make}) = \frac{1}{1 + e^{\beta_0 + \beta_1 \times \text{distance}}}$ b) Calibrates parameters using historical NFL field goal data from nflfastR c) Accounts for weather conditions (wind speed, temperature) by adjusting the probability d) Simulates 1,000 kicks at each distance from 20-60 yards e) Plots make probability vs distance with 95% confidence bands from bootstrap resampling **Extension**: Compare your simulated make probabilities to actual NFL data by distance. How well does the logistic model fit? Where does it systematically over- or under-predict?

Exercise 2: Draft Pick Value Simulation

Create a simulation for draft pick value that accounts for outcome uncertainty: a) Define player outcome categories (bust, backup, starter, good, star) with different WAR values b) Assign probability distributions to each outcome by draft position (early picks have higher star probability) c) Use historical draft data to calibrate these probabilities (hint: use Pro Football Reference draft data) d) Simulate 10,000 draft scenarios for each pick in the first round e) Calculate expected value and variance for each pick f) Visualize the distribution of outcomes for pick #1 vs pick #32 g) Build a draft trade value chart based on your simulations **Extension**: Compare your simulation-based value chart to the traditional Jimmy Johnson chart. Where do they differ most? What explains these differences?

Exercise 3: Two-Point Conversion Decision

Build a comprehensive simulation to determine when teams should go for 2: a) Model current game state: score differential, time remaining, possessions remaining b) Simulate all future possessions under two strategies: "kick PAT" vs "go for 2" c) Account for uncertainty in conversion probability (typically 47-48% but varies by team/situation) d) Account for uncertainty in opponent scoring (use Poisson distribution for possessions × success rate) e) Calculate win probability for each strategy across all game states f) Create a lookup table showing optimal decision by score differential and time g) Test your model against the NFL Analytics two-point conversion chart **Extension**: Add complexity for specific game situations (overtime rules, weather, team strengths). How do these factors change the optimal strategy?

Exercise 4: Strength of Schedule Impact

Simulate how schedule difficulty affects playoff odds to quantify "schedule luck": a) Create 32 teams with ratings drawn from a realistic distribution (fit to actual NFL team performance) b) Generate a full 17-game schedule for each team (accounting for division structure: play division rivals twice, same-place finishers once) c) Simulate 10,000 seasons d) Calculate playoff probability for each team e) Compare teams with similar ratings but different schedule difficulties f) Quantify how much schedule luck matters relative to team quality g) Visualize the relationship with a scatter plot and confidence bands **Bonus**: Track how playoff probabilities change week-by-week as the season progresses and uncertainty resolves. Create an animated plot showing probability evolution.

Further Reading

Books

  • Robert, C.P. & Casella, G. (2004). Monte Carlo Statistical Methods. Springer. [Comprehensive technical treatment of Monte Carlo methods]
  • Kroese, D.P., Taimre, T., & Botev, Z.I. (2011). Handbook of Monte Carlo Methods. Wiley. [Practical guide with algorithms and applications]
  • Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall. [Foundational text on bootstrap resampling methods]

Papers

  • Lopez, M.J. (2019). "Bigger data, better questions, and a return to judgment: Comments on football analytics." Journal of Quantitative Analysis in Sports, 15(1), 1-5. [Discusses role of simulation in modern football analytics]
  • Burke, B. (2016). "Quantifying the Uncertainty in NFL Predictions." Sloan Sports Analytics Conference. [Practical application of Monte Carlo to NFL prediction]
  • Yam, D. & Lopez, M.J. (2019). "What's a game worth? Estimating the value of an NFL game in the standings." Journal of Sports Analytics, 5(4), 311-325. [Uses simulation to value games by playoff impact]
  • Romer, D. (2006). "Do Firms Maximize? Evidence from Professional Football." Journal of Political Economy, 114(2), 340-365. [Classic paper using simulation for fourth-down analysis]

Online Resources

References

:::