Chapter 32: Bayesian Methods in Football | Football Analytics Textbook

Learning ObjectivesBy the end of this chapter, you will be able to:

Understand Bayesian inference fundamentals and philosophical differences from frequentist approaches
Build hierarchical models for football data that account for nested structure
Update beliefs with new evidence using Bayes' theorem
Quantify uncertainty in predictions using posterior distributions
Compare Bayesian and frequentist approaches for football analytics problems

Introduction

In traditional (frequentist) statistics, we estimate parameters as fixed but unknown values and quantify uncertainty through confidence intervals and p-values. Bayesian statistics offers a fundamentally different paradigm: parameters are treated as random variables with probability distributions that reflect our uncertainty about their true values.

This philosophical difference has profound practical implications for football analytics. In the frequentist view, a player's true ability is a fixed number that we're trying to estimate with increasing precision as we collect more data. In the Bayesian view, our uncertainty about a player's ability is represented by a probability distribution that evolves as we observe more performance data. This distribution captures not just our best guess, but the entire range of plausible values weighted by their likelihood.

For football analytics, the Bayesian approach offers several compelling advantages:

Natural uncertainty quantification: Rather than point estimates, we get full probability distributions that show the range of plausible values and their relative likelihoods
Hierarchical modeling: Perfect for nested structures (players within teams, teams within divisions, games within seasons) that are endemic to football data
Information pooling: Weak data can borrow strength from strong data, allowing us to make reasonable inferences even for backup players with limited snaps
Sequential learning: Easy to update predictions as new games occur, reflecting how coaches and analysts actually think about player evaluation
Prior knowledge incorporation: Can formally use expert judgment, historical data, or domain knowledge to improve estimates

Consider a practical example: You want to evaluate a rookie quarterback's true ability after 3 games where he averaged 0.15 EPA per play. A frequentist approach gives you a point estimate (0.15 EPA/play) with a wide confidence interval—perhaps [-0.05, 0.35]. This tells you the data is consistent with a wide range of abilities, but doesn't help you decide if this QB is genuinely good or just got lucky.

A Bayesian approach gives you a probability distribution over possible abilities, incorporating both the limited data you've observed and prior knowledge about typical rookie performance. If historical data shows most rookies average around 0.0 EPA/play, your posterior distribution might center around 0.08 EPA/play—partway between the observed 0.15 and the historical 0.0. Moreover, it naturally shrinks extreme estimates toward more reasonable values, protecting you from overreacting to small samples.

Perhaps most importantly, Bayesian methods let you answer questions that matter for decision-making: "What's the probability this QB is above average?" or "How likely is he to be a top-10 QB?" These probabilistic statements are natural in the Bayesian framework but awkward or impossible in the frequentist framework.

Why Bayesian Methods for Football?

Football presents ideal conditions for Bayesian methods: 1. **Limited sample sizes**: Players and teams have finite games 2. **Hierarchical structure**: Players nest within teams, teams within leagues 3. **Expert knowledge**: Decades of football understanding can inform priors 4. **Sequential updates**: Games happen weekly, allowing belief updates 5. **Uncertainty matters**: Decision-making requires knowing confidence levels

Bayes' Theorem and the Bayesian Framework

The Foundation: Bayes' Theorem

At the heart of Bayesian inference lies Bayes' theorem, a mathematical formula that describes how to rationally update beliefs in light of new evidence. Despite its reputation as an advanced statistical technique, the core idea is intuitive: we start with some initial beliefs (the prior), observe some data, and update our beliefs to account for what we've learned (the posterior).

Bayes' theorem describes how to update probabilities based on new evidence:

$$ P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{P(D)} $$

Where:

$P(\theta | D)$ = Posterior: Our updated belief about parameter $\theta$ after seeing data $D$. This is what we ultimately care about—what should we believe about the parameter given everything we know?
$P(D | \theta)$ = Likelihood: Probability of observing data $D$ given parameter $\theta$. This captures how consistent the observed data is with different parameter values.
$P(\theta)$ = Prior: Our initial belief about $\theta$ before seeing data. This encodes our background knowledge, expert opinion, or simply our state of uncertainty before collecting data.
$P(D)$ = Marginal likelihood: Total probability of the data (normalizing constant). This ensures the posterior is a valid probability distribution (sums/integrates to 1).

The marginal likelihood $P(D)$ is often difficult to compute, but we rarely need it for inference. Instead, we work with the proportional form:

$$ P(\theta | D) \propto P(D | \theta) \cdot P(\theta) $$

This says: Posterior is proportional to likelihood times prior. The posterior combines what the data tells us (likelihood) with what we knew beforehand (prior). When we have lots of data, the likelihood dominates and the prior has little influence. When we have little data, the prior prevents us from drawing extreme conclusions from noise.

The Key Insight: Learning as Belief Updating

Bayes' theorem formalizes how we naturally think about learning. If you believe a kicker is "probably good" (prior), then observe him make 3 of 3 field goals (likelihood), you update to "definitely good, but maybe he got lucky" (posterior). Conversely, if you observe him miss 3 of 3 (different likelihood), you update to "probably not that good" (different posterior). Bayes' theorem tells us exactly how to balance prior beliefs and new evidence.

The Bayesian Workflow

Specify the model: Choose likelihood function for your data
Choose priors: Encode prior knowledge about parameters
Compute posterior: Combine prior and likelihood (usually via MCMC)
Check the model: Posterior predictive checks, convergence diagnostics
Make inference: Extract summaries, predictions, decisions from posterior

Prior, Likelihood, and Posterior

Let's illustrate these concepts with a concrete example: estimating a kicker's true field goal success probability. This is an ideal starting example because:

The parameter is intuitive: We're estimating a probability (between 0 and 1)
The data is simple: Success/failure outcomes on field goal attempts
We have prior knowledge: We know NFL kickers typically make 80-85% of field goals
The math is tractable: This is a "conjugate prior" situation where we can compute the posterior analytically

In this example, we'll observe a kicker make 18 of 20 field goal attempts and use Bayesian inference to estimate his true ability. The key question: Should we believe his true ability is 90% (the observed rate), or should we be more conservative given the small sample? Bayesian inference gives us a principled answer.

We'll use a Beta-Binomial model, which is the standard Bayesian approach for estimating proportions or probabilities. The Beta distribution is a flexible probability distribution on [0, 1], making it perfect for representing beliefs about success probabilities. The Binomial distribution models the number of successes in a fixed number of independent trials—exactly what we have with field goal attempts.

#| label: beta-binomial-setup-r
#| message: false
#| warning: false

library(tidyverse)
library(patchwork)

# Set up grid for plotting
# We'll evaluate the distributions at 1000 points from 0 to 1
# This gives smooth curves for visualization
theta <- seq(0, 1, length.out = 1000)

# Observed data: kicker made 18 of 20 field goals
successes <- 18
attempts <- 20

# Prior: Beta(10, 3) - encoding belief that kickers are generally good
# Beta(α, β) has mean = α/(α+β), so Beta(10, 3) has mean ≈ 0.77
# This captures our prior belief that NFL kickers typically make ~77% of FGs
# The parameters α=10 and β=3 represent equivalent to having seen
# 10 makes and 3 misses before observing this kicker
prior <- dbeta(theta, 10, 3)

# Likelihood: Binomial(20, theta)
# For each possible value of theta (success probability),
# what's the probability of observing exactly 18 successes in 20 attempts?
likelihood <- dbinom(successes, attempts, theta)
# Scale likelihood to match prior height for visualization purposes only
likelihood_scaled <- likelihood / max(likelihood) * max(prior)

# Posterior: Beta(10+18, 3+2) = Beta(28, 5)
# With conjugate priors, the posterior is also Beta with updated parameters:
# α_post = α_prior + successes
# β_post = β_prior + failures
# This is the magic of conjugacy—we can compute the posterior analytically!
posterior <- dbeta(theta, 10 + successes, 3 + (attempts - successes))

# Combine into dataframe for plotting
bayes_df <- tibble(
  theta = rep(theta, 3),
  density = c(prior, likelihood_scaled, posterior),
  distribution = rep(c("Prior", "Likelihood", "Posterior"), each = length(theta))
) %>%
  mutate(distribution = factor(distribution,
                               levels = c("Prior", "Likelihood", "Posterior")))

# Create visualization showing all three components
ggplot(bayes_df, aes(x = theta, y = density, color = distribution)) +
  geom_line(linewidth = 1.2) +
  scale_color_manual(values = c("Prior" = "#619CFF",
                                 "Likelihood" = "#F8766D",
                                 "Posterior" = "#00BA38")) +
  labs(
    title = "Bayesian Inference for Kicker Success Rate",
    subtitle = "Observed: 18/20 field goals made",
    x = "Success Probability (θ)",
    y = "Density",
    color = "Distribution"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )

#| label: beta-binomial-setup-py
#| message: false
#| warning: false

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Set up grid for plotting
# Evaluate distributions at 1000 points between 0 and 1
theta = np.linspace(0, 1, 1000)

# Observed data: kicker made 18 of 20 field goals
successes = 18
attempts = 20

# Prior: Beta(10, 3) - encoding belief that kickers are generally good
# Beta(α, β) distribution with α=10, β=3 gives mean = 10/13 ≈ 0.77
# This represents our prior belief before seeing this kicker's data
prior = stats.beta(10, 3).pdf(theta)

# Likelihood: Binomial(20, theta)
# For each possible success probability theta, calculate the probability
# of observing 18 successes in 20 attempts
likelihood = stats.binom(attempts, theta).pmf(successes)
# Scale for visualization (doesn't affect the actual inference)
likelihood_scaled = likelihood / likelihood.max() * prior.max()

# Posterior: Beta(10+18, 3+2) = Beta(28, 5)
# The posterior combines prior and likelihood with simple addition:
# New α = old α + observed successes
# New β = old β + observed failures
posterior = stats.beta(10 + successes, 3 + (attempts - successes)).pdf(theta)

# Create visualization showing the Bayesian updating process
plt.figure(figsize=(10, 6))
plt.plot(theta, prior, label='Prior', linewidth=2, color='#619CFF')
plt.plot(theta, likelihood_scaled, label='Likelihood (scaled)',
         linewidth=2, color='#F8766D')
plt.plot(theta, posterior, label='Posterior', linewidth=2, color='#00BA38')
plt.xlabel('Success Probability (θ)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bayesian Inference for Kicker Success Rate\nObserved: 18/20 field goals made',
          fontsize=14, fontweight='bold')
plt.legend(loc='upper left')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate and display posterior summary statistics
# These tell us what we should believe about the kicker's true ability
post_mean = (10 + successes) / (13 + attempts)
post_median = stats.beta(10 + successes, 3 + (attempts - successes)).median()
post_95 = stats.beta(10 + successes, 3 + (attempts - successes)).interval(0.95)

print(f"\nPosterior Summary:")
print(f"  Mean: {post_mean:.3f}")
print(f"  Median: {post_median:.3f}")
print(f"  95% Credible Interval: ({post_95[0]:.3f}, {post_95[1]:.3f})")

This code demonstrates the complete Bayesian inference workflow for a simple problem: **Prior Selection**: We use Beta(10, 3), which has mean 0.77 and represents roughly the league-average kicker. Think of this as saying "before seeing this kicker, I believe he's probably around league average, equivalent to having seen about 13 field goal attempts." **Likelihood**: The binomial likelihood captures how probable our observed data (18/20) is under different values of the true success probability. The likelihood peaks around 0.90 (the observed rate) because that value makes the observed data most probable. **Posterior Computation**: Because the Beta distribution is "conjugate" to the Binomial likelihood, the posterior is also Beta with updated parameters. We simply add the observed data to the prior parameters: Beta(10+18, 3+2) = Beta(28, 5). **Interpretation**: The posterior mean of 0.848 is between the prior mean (0.77) and observed rate (0.90). With only 20 attempts, we don't completely trust the data, so we shrink toward our prior. With more attempts, the posterior would move closer to the observed rate.

The posterior represents our updated belief: it's a compromise between the prior (centered around 0.77) and the data (18/20 = 0.90). Notice how the posterior (green curve) lies between the prior (blue) and likelihood (red). This is Bayesian learning in action—we've moved from our initial belief toward what the data suggests, but not all the way there given the limited sample.

With more data, the posterior would move closer to the observed rate. If this kicker maintains a 90% success rate over 100 attempts, the posterior would center much closer to 0.90, because the large sample provides strong evidence that overwhelms the prior. Conversely, if he made 18 of 20 but then regressed to 75% over the next 100 attempts, our posterior would reflect that he's probably around league average, and his hot start was likely luck.

Understanding Credible Intervals

The 95% credible interval [0.78, 0.91] means: "Given our model and data, there's a 95% probability the kicker's true success rate is between 78% and 91%." This is much more intuitive than a frequentist confidence interval, which has a more convoluted interpretation about long-run frequencies of interval-construction procedures.

Bayesian Inference for Simple Problems

Example 1: Beta-Binomial Model for Conversion Rates

Now that we understand the basic mechanics of Bayesian inference, let's apply it to a real football analytics problem: estimating team-specific third down conversion rates. This is a more complex example than the single kicker because we'll be estimating rates for all 32 NFL teams simultaneously.

The Beta-binomial model is a conjugate prior setup perfect for any success/failure data in football:

Third down conversion rates: Did the offense gain a first down? (Yes/No)
Red zone touchdown rates: Did a red zone drive end in a TD? (Yes/No)
Two-point conversion success: Did the attempt succeed? (Yes/No)
Fourth down conversion rates: Did the offense convert? (Yes/No)
Completion percentage: Was the pass completed? (Yes/No)

The key advantage of this model is its simplicity and interpretability. We're estimating a probability for each team, and the Beta-binomial conjugacy means we can compute posteriors analytically without MCMC sampling.

Model Setup:

$$ \begin{align} \text{Successes} | \theta &\sim \text{Binomial}(n, \theta) \\ \theta &\sim \text{Beta}(\alpha, \beta) \end{align} $$

Where:
- $\theta$ is the true (unknown) conversion probability for a team
- $n$ is the number of attempts
- The Beta prior has parameters $\alpha$ and $\beta$ that encode our prior belief

Conjugacy: If the prior is Beta($\alpha$, $\beta$) and we observe $k$ successes in $n$ trials, the posterior is Beta($\alpha + k$, $\beta + n - k$). This means:

The posterior has the same distributional form as the prior (both Beta)
We update by simply adding observed data to prior parameters
No complex computation or sampling required
We get exact results instantly

Why Conjugate Priors Are Valuable

Conjugate priors like Beta-Binomial are special because they allow analytical posterior computation. While modern MCMC methods can handle any prior-likelihood combination, conjugate priors provide: 1. **Instant results**: No sampling, no convergence diagnostics 2. **Intuitive interpretation**: Prior parameters can be thought of as "pseudo-observations" 3. **Computational efficiency**: Useful for real-time applications 4. **Mathematical elegance**: Clear understanding of how prior and data combine However, they're limited to specific model families. For more complex models, we'll need MCMC.

#| label: third-down-analysis-r
#| message: false
#| warning: false
#| cache: true

library(nflfastR)

# Load 2023 data
pbp_2023 <- load_pbp(2023)

# Calculate third down conversion rates by team
third_down_data <- pbp_2023 %>%
  filter(down == 3, !is.na(posteam), play_type %in% c("pass", "run")) %>%
  mutate(conversion = if_else(first_down_rush == 1 | first_down_pass == 1, 1, 0)) %>%
  group_by(posteam) %>%
  summarise(
    attempts = n(),
    conversions = sum(conversion, na.rm = TRUE),
    raw_rate = conversions / attempts,
    .groups = "drop"
  )

# Bayesian estimation with weakly informative prior
# Prior: Beta(4, 6) centered at 0.4 (league average roughly)
alpha_prior <- 4
beta_prior <- 6

third_down_bayes <- third_down_data %>%
  mutate(
    # Posterior parameters
    alpha_post = alpha_prior + conversions,
    beta_post = beta_prior + (attempts - conversions),
    # Posterior mean (point estimate)
    bayes_rate = alpha_post / (alpha_post + beta_post),
    # Posterior standard deviation
    bayes_sd = sqrt(alpha_post * beta_post /
                    ((alpha_post + beta_post)^2 * (alpha_post + beta_post + 1))),
    # 95% credible interval
    lower_95 = qbeta(0.025, alpha_post, beta_post),
    upper_95 = qbeta(0.975, alpha_post, beta_post)
  ) %>%
  arrange(desc(bayes_rate))

# Show top 10 teams
third_down_bayes %>%
  select(posteam, attempts, conversions, raw_rate, bayes_rate, lower_95, upper_95) %>%
  head(10) %>%
  gt::gt() %>%
  gt::cols_label(
    posteam = "Team",
    attempts = "Attempts",
    conversions = "Conv.",
    raw_rate = "Raw Rate",
    bayes_rate = "Bayes Rate",
    lower_95 = "Lower 95%",
    upper_95 = "Upper 95%"
  ) %>%
  gt::fmt_number(
    columns = c(raw_rate, bayes_rate, lower_95, upper_95),
    decimals = 3
  ) %>%
  gt::tab_header(
    title = "Third Down Conversion Rates (2023)",
    subtitle = "Bayesian Estimates with 95% Credible Intervals"
  )

#| label: third-down-analysis-py
#| message: false
#| warning: false
#| cache: true

import nfl_data_py as nfl

# Load 2023 data
pbp_2023 = nfl.import_pbp_data([2023])

# Calculate third down conversion rates by team
third_down_data = (pbp_2023
    .query("down == 3 & posteam.notna() & play_type.isin(['pass', 'run'])")
    .assign(conversion=lambda x: ((x['first_down_rush'] == 1) |
                                   (x['first_down_pass'] == 1)).astype(int))
    .groupby('posteam')
    .agg(
        attempts=('conversion', 'count'),
        conversions=('conversion', 'sum')
    )
    .reset_index()
    .assign(raw_rate=lambda x: x['conversions'] / x['attempts'])
)

# Bayesian estimation with weakly informative prior
# Prior: Beta(4, 6) centered at 0.4 (league average roughly)
alpha_prior = 4
beta_prior = 6

third_down_bayes = (third_down_data
    .assign(
        alpha_post=lambda x: alpha_prior + x['conversions'],
        beta_post=lambda x: beta_prior + (x['attempts'] - x['conversions']),
        bayes_rate=lambda x: x['alpha_post'] / (x['alpha_post'] + x['beta_post']),
        bayes_sd=lambda x: np.sqrt(
            x['alpha_post'] * x['beta_post'] /
            ((x['alpha_post'] + x['beta_post'])**2 *
             (x['alpha_post'] + x['beta_post'] + 1))
        ),
        lower_95=lambda x: stats.beta(x['alpha_post'], x['beta_post']).ppf(0.025),
        upper_95=lambda x: stats.beta(x['alpha_post'], x['beta_post']).ppf(0.975)
    )
    .sort_values('bayes_rate', ascending=False)
)

# Show top 10 teams
print("\nThird Down Conversion Rates (2023)")
print("Bayesian Estimates with 95% Credible Intervals\n")
print(third_down_bayes[['posteam', 'attempts', 'conversions',
                        'raw_rate', 'bayes_rate', 'lower_95', 'upper_95']]
      .head(10)
      .to_string(index=False, float_format=lambda x: f'{x:.3f}'))

This code demonstrates Bayesian estimation for binomial data (success/failure outcomes) applied to all 32 NFL teams: **Data Preparation**: We filter to third down plays (down == 3) that are passes or runs, then calculate conversion rates (whether the play resulted in a first down) for each team. This gives us the raw observed conversion rates. **Prior Selection**: We use Beta(4, 6), which has mean 4/(4+6) = 0.4, roughly the league-average third down conversion rate. The parameters α=4 and β=6 can be interpreted as "pseudo-observations"—equivalent to having seen 4 conversions and 6 failures before observing any team-specific data. This prior is weakly informative: it provides some regularization but allows data to dominate with even modest sample sizes. **Posterior Computation**: With Beta-Binomial conjugacy, updating is trivial: - α_posterior = α_prior + observed conversions - β_posterior = β_prior + observed failures The posterior mean is α_post / (α_post + β_post), which automatically balances prior and data. **Credible Intervals**: Using `qbeta()` in R or `ppf()` in Python, we compute 95% credible intervals from the posterior Beta distribution. These represent our uncertainty about each team's true conversion rate after seeing the data. **Key Insight**: Teams with fewer attempts have wider credible intervals (more uncertainty) and estimates shrunk more toward 0.4 (the prior mean). Teams with many attempts have narrow intervals and estimates close to their raw rates. This is optimal regularization—trust data when you have lots of it, rely on prior when you don't.

Notice how Bayesian estimates "shrink" toward the prior mean (0.4), especially for teams with fewer attempts. This is called regularization and prevents overfitting to small samples. A team that converts 8 of 10 third downs (80%) probably isn't truly an 80% team—they likely got lucky in a small sample. The Bayesian estimate of perhaps 55-60% better reflects their true ability, accounting for both the impressive performance and the limited evidence.

This shrinkage is adaptive: it depends on sample size. Teams with 200+ attempts are barely shrunk, while teams with 50 attempts are shrunk substantially. This is exactly what we want—a data-driven regularization that protects us from small-sample noise without ignoring large-sample signal.

Example 2: Comparing Two Teams

One advantage of Bayesian inference is that comparing parameters is straightforward—we can directly compute probabilities like "What's the probability Team A has a higher conversion rate than Team B?"

R
Python

#| label: team-comparison-r
#| message: false
#| warning: false

# Compare two teams: let's pick two interesting teams
team_a <- "SF"
team_b <- "DAL"

# Get posterior parameters
team_a_params <- third_down_bayes %>%
  filter(posteam == team_a) %>%
  select(alpha_post, beta_post)

team_b_params <- third_down_bayes %>%
  filter(posteam == team_b) %>%
  select(alpha_post, beta_post)

# Monte Carlo simulation from posteriors
set.seed(2024)
n_sims <- 10000

theta_a <- rbeta(n_sims, team_a_params$alpha_post, team_a_params$beta_post)
theta_b <- rbeta(n_sims, team_b_params$alpha_post, team_b_params$beta_post)

# Calculate probability that Team A > Team B
prob_a_better <- mean(theta_a > theta_b)

# Expected difference
diff_samples <- theta_a - theta_b

# Visualize
tibble(
  Team_A = theta_a,
  Team_B = theta_b,
  Difference = diff_samples
) %>%
  pivot_longer(everything(), names_to = "Parameter", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = Parameter)) +
  geom_density(alpha = 0.6) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  scale_fill_manual(
    values = c("Team_A" = "#AA0000", "Team_B" = "#003594",
               "Difference" = "#999999"),
    labels = c(paste(team_a, "3rd Down %"),
               paste(team_b, "3rd Down %"),
               "Difference (SF - DAL)")
  ) +
  labs(
    title = "Posterior Distributions for Third Down Conversion Rate",
    subtitle = sprintf("P(%s > %s) = %.3f", team_a, team_b, prob_a_better),
    x = "Conversion Rate",
    y = "Density",
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )

# Print summary
cat(sprintf("\nComparison Summary:\n"))
cat(sprintf("P(%s > %s) = %.3f\n", team_a, team_b, prob_a_better))
cat(sprintf("Expected difference: %.3f\n", mean(diff_samples)))
cat(sprintf("95%% CI for difference: [%.3f, %.3f]\n",
            quantile(diff_samples, 0.025),
            quantile(diff_samples, 0.975)))

#| label: team-comparison-py
#| message: false
#| warning: false

# Compare two teams
team_a = "SF"
team_b = "DAL"

# Get posterior parameters
team_a_params = third_down_bayes[third_down_bayes['posteam'] == team_a][['alpha_post', 'beta_post']].iloc[0]
team_b_params = third_down_bayes[third_down_bayes['posteam'] == team_b][['alpha_post', 'beta_post']].iloc[0]

# Monte Carlo simulation from posteriors
np.random.seed(2024)
n_sims = 10000

theta_a = stats.beta(team_a_params['alpha_post'], team_a_params['beta_post']).rvs(n_sims)
theta_b = stats.beta(team_b_params['alpha_post'], team_b_params['beta_post']).rvs(n_sims)

# Calculate probability that Team A > Team B
prob_a_better = np.mean(theta_a > theta_b)

# Expected difference
diff_samples = theta_a - theta_b

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(theta_a, bins=50, alpha=0.6, density=True,
        label=f'{team_a} 3rd Down %', color='#AA0000')
ax.hist(theta_b, bins=50, alpha=0.6, density=True,
        label=f'{team_b} 3rd Down %', color='#003594')
ax.axvline(x=0, color='black', linestyle='--', alpha=0.7)
ax.set_xlabel('Conversion Rate', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Posterior Distributions for Third Down Conversion Rate\n' +
             f'P({team_a} > {team_b}) = {prob_a_better:.3f}',
             fontsize=14, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()

# Print summary
print(f"\nComparison Summary:")
print(f"P({team_a} > {team_b}) = {prob_a_better:.3f}")
print(f"Expected difference: {diff_samples.mean():.3f}")
print(f"95% CI for difference: [{np.percentile(diff_samples, 2.5):.3f}, " +
      f"{np.percentile(diff_samples, 97.5):.3f}]")

This probabilistic comparison is more nuanced than a simple hypothesis test. We get a direct answer to "How much better is Team A?" with full uncertainty quantification.

Hierarchical Models for Football Data

Hierarchical (multilevel) models are ideal for football's nested structure: players within teams, teams within divisions, games within seasons. They provide a principled way to share information across groups while respecting individual differences. This is perhaps the most powerful application of Bayesian methods to football analytics.

The fundamental insight is that football data exhibits natural hierarchies. Players are nested within teams. Teams are nested within divisions. Games are nested within seasons. When we estimate player performance, we should use not just that player's data, but also information from similar players, players on the same team, and the league as a whole. Hierarchical models formalize this intuition.

Why Hierarchical Models?

Consider the problem of rating NFL quarterbacks. Every season, we have about 40-50 QBs who throw meaningful numbers of passes. Some are starters with 500+ attempts, others are backups with 50 attempts. How should we estimate their abilities?

We could:

Complete pooling: Assume all QBs have the same ability (clearly wrong—Patrick Mahomes is not the same as a third-string backup)
No pooling: Estimate each QB independently using only his data (inefficient and noisy, especially for backups with limited attempts)
Partial pooling (hierarchical): Share information across QBs while allowing individual differences (optimal balance)

Hierarchical models implement partial pooling through a two-level structure:

Level 1 (Individual): Each QB has his own true ability $\mu_i$
Level 2 (Group): All QB abilities come from a common distribution with mean $\mu_{\text{league}}$ and spread $\sigma_{\text{QB}}$

This creates automatic regularization that:

Shrinks estimates with less data toward the group mean (protecting against noise)
Respects estimates with more data (trusting solid evidence)
Quantifies both within-group variation (play-to-play noise) and between-group variation (true differences in QB ability)
Handles imbalanced data naturally (starters and backups in the same model)
Pools information across players (backup QBs borrow strength from similar players)

The Shrinkage Principle

Shrinkage is not a flaw—it's a feature! When we have limited data, extreme estimates are usually wrong. A backup QB who averages 0.30 EPA over 50 attempts probably isn't an elite QB who just never got a chance. He's more likely an average-ish backup who had a lucky stretch. Hierarchical models automatically and optimally shrink such estimates toward the mean, with the amount of shrinkage determined by: 1. **Sample size**: More data = less shrinkage 2. **Group variance**: More spread in true abilities = less shrinkage 3. **Within-group noise**: Higher play-to-play variation = more shrinkage This is exactly what domain experts do intuitively—hierarchical models formalize it mathematically.

Example: Hierarchical QB Rating Model

We'll build a hierarchical model for QB performance measured by EPA per play.

Model Structure:

$$ \begin{align} \text{EPA}_{ij} &\sim \text{Normal}(\mu_i, \sigma_y) && \text{[Likelihood]} \\ \mu_i &\sim \text{Normal}(\mu_{\text{league}}, \sigma_{\text{QB}}) && \text{[QB effects]} \\ \mu_{\text{league}} &\sim \text{Normal}(0.05, 0.5) && \text{[League mean prior]} \\ \sigma_y &\sim \text{Exponential}(1/2) && \text{[Within-QB variation]} \\ \sigma_{\text{QB}} &\sim \text{Exponential}(1/0.5) && \text{[Between-QB variation]} \end{align} $$

Where:
- $\text{EPA}_{ij}$ is EPA for QB $i$ on play $j$
- $\mu_i$ is the true ability of QB $i$
- $\mu_{\text{league}}$ is the league average
- $\sigma_{\text{QB}}$ controls how much QBs differ from league average
- $\sigma_y$ is play-to-play variation

#| label: hierarchical-qb-r
#| message: false
#| warning: false
#| cache: true
#| results: hide

library(brms)
library(tidybayes)

# Prepare QB data (regular passers with at least 100 attempts)
qb_data <- pbp_2023 %>%
  filter(
    !is.na(epa),
    play_type == "pass",
    !is.na(passer_player_name),
    sack == 0
  ) %>%
  group_by(passer_player_name, passer_player_id) %>%
  filter(n() >= 100) %>%
  ungroup() %>%
  select(passer_player_name, passer_player_id, epa)

# Fit hierarchical model
qb_model <- brm(
  epa ~ (1 | passer_player_name),
  data = qb_data,
  family = gaussian(),
  prior = c(
    prior(normal(0.05, 0.5), class = Intercept),
    prior(exponential(2), class = sd),
    prior(exponential(0.5), class = sigma)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  cores = 4,
  seed = 2024,
  backend = "cmdstanr",
  silent = 2,
  refresh = 0
)

#| label: hierarchical-qb-r-results
#| message: false
#| warning: false

# Extract QB-specific intercepts (random effects)
qb_effects <- qb_model %>%
  spread_draws(r_passer_player_name[passer_player_name, ]) %>%
  mean_qi(.width = 0.95) %>%
  rename(qb_effect = r_passer_player_name)

# Get league-level intercept
league_mean <- fixef(qb_model)[1, "Estimate"]

# Add raw means for comparison
qb_summary <- qb_data %>%
  group_by(passer_player_name) %>%
  summarise(
    n_plays = n(),
    raw_epa = mean(epa),
    .groups = "drop"
  ) %>%
  left_join(qb_effects, by = "passer_player_name") %>%
  mutate(
    hierarchical_epa = league_mean + qb_effect,
    shrinkage = raw_epa - hierarchical_epa
  ) %>%
  arrange(desc(hierarchical_epa))

# Plot comparison: raw vs hierarchical estimates
qb_summary %>%
  filter(n_plays >= 200) %>%
  head(15) %>%
  ggplot(aes(y = reorder(passer_player_name, hierarchical_epa))) +
  geom_point(aes(x = raw_epa, color = "Raw"), size = 3) +
  geom_point(aes(x = hierarchical_epa, color = "Hierarchical"), size = 3) +
  geom_segment(aes(x = raw_epa, xend = hierarchical_epa,
                   yend = reorder(passer_player_name, hierarchical_epa)),
               arrow = arrow(length = unit(0.2, "cm"))) +
  scale_color_manual(values = c("Raw" = "#F8766D", "Hierarchical" = "#00BA38")) +
  labs(
    title = "QB EPA: Raw vs Hierarchical Estimates",
    subtitle = "Arrows show shrinkage toward league mean",
    x = "EPA per Play",
    y = NULL,
    color = "Estimate Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )

# Show top QBs
qb_summary %>%
  head(10) %>%
  select(passer_player_name, n_plays, raw_epa, hierarchical_epa, .lower, .upper) %>%
  gt::gt() %>%
  gt::cols_label(
    passer_player_name = "QB",
    n_plays = "Plays",
    raw_epa = "Raw EPA",
    hierarchical_epa = "Bayes EPA",
    .lower = "Lower 95%",
    .upper = "Upper 95%"
  ) %>%
  gt::fmt_number(
    columns = c(raw_epa, hierarchical_epa, .lower, .upper),
    decimals = 3
  ) %>%
  gt::tab_header(
    title = "Top 10 QBs by Hierarchical EPA (2023)",
    subtitle = "With 95% Credible Intervals"
  )

#| label: hierarchical-qb-py
#| message: false
#| warning: false
#| cache: true

import pymc as pm
import arviz as az

# Prepare QB data (regular passers with at least 100 attempts)
qb_data_py = (pbp_2023
    .query("epa.notna() & play_type == 'pass' & passer_player_name.notna() & sack == 0")
    .groupby('passer_player_name')
    .filter(lambda x: len(x) >= 100)
    .reset_index(drop=True)
)

# Create numeric indices for QBs
qb_data_py['qb_idx'] = pd.Categorical(qb_data_py['passer_player_name']).codes
n_qbs = qb_data_py['qb_idx'].nunique()

# Build hierarchical model
with pm.Model() as qb_model_py:
    # Hyperpriors
    mu_league = pm.Normal('mu_league', mu=0.05, sigma=0.5)
    sigma_qb = pm.Exponential('sigma_qb', lam=2.0)
    sigma_y = pm.Exponential('sigma_y', lam=0.5)

    # QB-specific effects
    qb_effect = pm.Normal('qb_effect', mu=0, sigma=sigma_qb, shape=n_qbs)

    # Expected value for each play
    mu = mu_league + qb_effect[qb_data_py['qb_idx'].values]

    # Likelihood
    epa_obs = pm.Normal('epa_obs', mu=mu, sigma=sigma_y,
                        observed=qb_data_py['epa'].values)

    # Sample from posterior
    trace = pm.sample(2000, tune=1000, chains=4, cores=4,
                     random_seed=2024, return_inferencedata=True,
                     progressbar=False)

# Extract QB effects
qb_names = pd.Categorical(qb_data_py['passer_player_name']).categories
qb_effects_post = trace.posterior['qb_effect'].values.reshape(-1, n_qbs)
league_mean_post = trace.posterior['mu_league'].values.flatten()

# Calculate hierarchical EPA for each QB
qb_results = pd.DataFrame({
    'passer_player_name': qb_names,
    'qb_effect_mean': qb_effects_post.mean(axis=0),
    'qb_effect_lower': np.percentile(qb_effects_post, 2.5, axis=0),
    'qb_effect_upper': np.percentile(qb_effects_post, 97.5, axis=0)
}).assign(
    hierarchical_epa=lambda x: league_mean_post.mean() + x['qb_effect_mean']
)

# Add raw EPA
raw_epa = (qb_data_py
    .groupby('passer_player_name')
    .agg(n_plays=('epa', 'count'), raw_epa=('epa', 'mean'))
    .reset_index()
)

qb_summary_py = qb_results.merge(raw_epa, on='passer_player_name').sort_values(
    'hierarchical_epa', ascending=False
)

# Show top 10 QBs
print("\nTop 10 QBs by Hierarchical EPA (2023)")
print("With 95% Credible Intervals\n")
print(qb_summary_py[['passer_player_name', 'n_plays', 'raw_epa',
                     'hierarchical_epa', 'qb_effect_lower', 'qb_effect_upper']]
      .head(10)
      .to_string(index=False, float_format=lambda x: f'{x:.3f}'))

This hierarchical model demonstrates the power of partial pooling for player evaluation: **Model Structure**: We have a two-level hierarchy: - **Level 1 (individual plays)**: Each play's EPA comes from a Normal distribution centered at that QB's ability - **Level 2 (QBs)**: Each QB's ability comes from a Normal distribution centered at the league mean **Hyperparameters**: `mu_league` is the league-average EPA, while `sigma_qb` controls how much QBs differ from each other. If `sigma_qb` is large, there's substantial variation in QB ability. If it's small, most QBs are similar. **Random Effects**: The `qb_effect` for each QB represents their deviation from league average. A positive effect means above-average, negative means below-average. Crucially, these effects are **not estimated independently**—they all come from the same distribution, so information pools across QBs. **Sampling**: We use `pm.sample()` with NUTS (No-U-Turn Sampler), the state-of-the-art MCMC algorithm. With 4 chains of 2000 iterations (1000 warmup), we get 4000 posterior samples for each parameter. **Shrinkage in Action**: The hierarchical structure automatically shrinks QB estimates toward `mu_league`. The amount of shrinkage for each QB depends on: 1. Sample size (more plays = less shrinkage) 2. Between-QB variance `sigma_qb` (more variance = less shrinkage) 3. Within-QB noise `sigma_y` (more noise = more shrinkage) **Why This Matters**: A backup QB with 50 attempts and 0.25 raw EPA might have hierarchical EPA of 0.15—reflecting that with limited data, we should be skeptical of extreme performance. Meanwhile, a starter with 500 attempts and 0.25 raw EPA gets minimal shrinkage because we have strong evidence of his ability.

The hierarchical model shrinks extreme estimates (especially for QBs with fewer plays) toward the league mean. This is automatic regularization: QBs with strong evidence stay close to their raw EPA, while those with weak evidence are pulled toward average. This protects us from ranking backup QBs with lucky small samples ahead of established starters with larger, more representative samples.

#| label: fig-shrinkage-r
#| fig-cap: "Hierarchical shrinkage effect on QB EPA estimates"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false

qb_summary %>%
  ggplot(aes(x = n_plays, y = shrinkage)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_point(aes(size = abs(shrinkage), color = shrinkage > 0), alpha = 0.7) +
  geom_smooth(method = "loess", se = TRUE, color = "black") +
  scale_color_manual(
    values = c("TRUE" = "#F8766D", "FALSE" = "#00BA38"),
    labels = c("Shrunk Down", "Shrunk Up")
  ) +
  scale_size_continuous(range = c(1, 5), guide = "none") +
  labs(
    title = "Hierarchical Shrinkage Effect by Sample Size",
    subtitle = "QBs with fewer plays are shrunk more toward league mean",
    x = "Number of Pass Plays",
    y = "Shrinkage (Raw EPA - Hierarchical EPA)",
    color = "Direction"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )

#| label: fig-shrinkage-py
#| fig-cap: "Hierarchical shrinkage effect on QB EPA estimates (Python)"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false

# Calculate shrinkage
qb_summary_py['shrinkage'] = qb_summary_py['raw_epa'] - qb_summary_py['hierarchical_epa']

# Create plot
fig, ax = plt.subplots(figsize=(10, 6))

# Separate shrinkage up and down
shrink_up = qb_summary_py[qb_summary_py['shrinkage'] > 0]
shrink_down = qb_summary_py[qb_summary_py['shrinkage'] <= 0]

ax.scatter(shrink_up['n_plays'], shrink_up['shrinkage'],
          s=np.abs(shrink_up['shrinkage'])*500, alpha=0.7,
          color='#F8766D', label='Shrunk Down')
ax.scatter(shrink_down['n_plays'], shrink_down['shrinkage'],
          s=np.abs(shrink_down['shrinkage'])*500, alpha=0.7,
          color='#00BA38', label='Shrunk Up')

ax.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
ax.set_xlabel('Number of Pass Plays', fontsize=12)
ax.set_ylabel('Shrinkage (Raw EPA - Hierarchical EPA)', fontsize=12)
ax.set_title('Hierarchical Shrinkage Effect by Sample Size\n' +
             'QBs with fewer plays are shrunk more toward league mean',
             fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Markov Chain Monte Carlo (MCMC)

For most real-world Bayesian models, we cannot compute the posterior analytically. The Beta-Binomial conjugacy we used earlier is a special case—most models don't have this convenient property. Instead, we use Markov Chain Monte Carlo (MCMC) algorithms to generate samples from the posterior distribution.

MCMC is both conceptually elegant and practically powerful. Rather than computing the posterior distribution directly (which requires solving difficult integrals), MCMC generates samples from the posterior. These samples can then be used to approximate any posterior quantity we care about—means, medians, quantiles, probabilities, predictions, etc.

Why MCMC?

The core challenge in Bayesian inference is computing the posterior:

$$ P(\theta | D) = \frac{P(D | \theta) P(\theta)}{\int P(D | \theta') P(\theta') d\theta'} $$

The numerator $P(D | \theta) P(\theta)$ is usually easy to compute—we specify the likelihood and prior. The denominator $\int P(D | \theta') P(\theta') d\theta'$ is the problem. This integral:

Marginalizes over all possible parameter values
Sums/integrates in potentially high-dimensional spaces
Rarely has a closed form except for conjugate prior families
Cannot be computed numerically when models have many parameters

For example, our hierarchical QB model has 40+ parameters (one per QB plus hyperparameters). The denominator requires integrating over a 40+ dimensional space—computationally infeasible with standard numerical integration.

MCMC cleverly avoids computing this integral by generating samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}$ that follow the posterior distribution without ever computing the normalizing constant. The key insight: we only need to evaluate the numerator (likelihood × prior) at different parameter values. MCMC algorithms construct a Markov chain whose stationary distribution is the posterior, so after a burn-in period, samples from the chain are samples from the posterior.

Monte Carlo: Named After a Casino

The term "Monte Carlo" comes from the Monte Carlo Casino in Monaco and refers to using randomness to solve deterministic problems. The "Markov Chain" part refers to the sequential nature of the sampling—each new sample depends only on the previous sample, not the entire history (the Markov property). Together, MCMC uses random walks (Markov chains) to generate samples (Monte Carlo) from complex distributions.

Common MCMC Algorithms

Metropolis-Hastings: Classic algorithm, proposes moves and accepts/rejects
Gibbs Sampling: Updates one parameter at a time conditionally
Hamiltonian Monte Carlo (HMC): Uses gradient information (Stan, PyMC)
NUTS (No-U-Turn Sampler): Adaptive version of HMC (default in Stan/PyMC)

Modern tools like Stan and PyMC use NUTS, which is highly efficient for complex models.

MCMC Diagnostics

After running MCMC, we must check:

Convergence: Did chains reach the stationary distribution?
Mixing: Do chains explore the posterior efficiently?
Autocorrelation: How independent are sequential samples?

R (brms diagnostics)
Python (PyMC diagnostics)

#| label: mcmc-diagnostics-r
#| fig-cap: "MCMC diagnostics for QB hierarchical model"
#| fig-width: 10
#| fig-height: 8
#| message: false
#| warning: false

# Trace plots for key parameters
plot(qb_model, variable = c("b_Intercept", "sd_passer_player_name__Intercept", "sigma"),
     ask = FALSE)

# Summary with convergence diagnostics
summary(qb_model)

# More detailed diagnostics
# R-hat: should be < 1.01 (measures convergence across chains)
# Effective sample size: should be > 400 per chain ideally
cat("\nConvergence Diagnostics:\n")
cat(sprintf("All R-hat values < 1.01: %s\n",
            all(rhat(qb_model) < 1.01, na.rm = TRUE)))
cat(sprintf("Min effective sample size: %.0f\n",
            min(neff_ratio(qb_model) * 4000, na.rm = TRUE)))

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: mcmc-diagnostics-py
#| fig-cap: "MCMC diagnostics for QB hierarchical model (Python)"
#| fig-width: 10
#| fig-height: 8
#| message: false
#| warning: false

# Trace plots
az.plot_trace(trace, var_names=['mu_league', 'sigma_qb', 'sigma_y'])
plt.tight_layout()
plt.show()

# Summary statistics
print("\nPosterior Summary:")
print(az.summary(trace, var_names=['mu_league', 'sigma_qb', 'sigma_y']))

# Convergence diagnostics
print("\nConvergence Diagnostics:")
summary_df = az.summary(trace, var_names=['mu_league', 'sigma_qb', 'sigma_y'])
print(f"All R-hat values < 1.01: {all(summary_df['r_hat'] < 1.01)}")
print(f"Min effective sample size: {summary_df['ess_bulk'].min():.0f}")

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

Key Diagnostics:

R-hat: Measures convergence across chains. Should be < 1.01.
Effective Sample Size (ESS): Accounts for autocorrelation. Want > 400 per chain.
Trace plots: Should look like "hairy caterpillars" with no trends or stuck sections
Divergences: Warnings about numerical issues. Should be 0 or very few.

Bayesian Regression for Football Metrics

Linear regression is a workhorse of football analytics. Bayesian regression provides:

Full posterior distributions for coefficients
Natural regularization through priors
Principled uncertainty quantification
Easy incorporation of domain knowledge

Example: Predicting EPA from Play Characteristics

Let's build a Bayesian regression model predicting EPA from:

Down (1st, 2nd, 3rd, 4th)
Yards to go
Field position (yardline)
Score differential
Time remaining

R
Python

#| label: bayesian-regression-r
#| message: false
#| warning: false
#| cache: true
#| results: hide

# Prepare regression data
reg_data <- pbp_2023 %>%
  filter(
    !is.na(epa),
    play_type %in% c("pass", "run"),
    !is.na(down), !is.na(ydstogo), !is.na(yardline_100),
    !is.na(score_differential)
  ) %>%
  mutate(
    down = factor(down),
    qtr = factor(qtr),
    # Standardize continuous predictors for better sampling
    ydstogo_std = scale(ydstogo)[,1],
    yardline_std = scale(yardline_100)[,1],
    score_diff_std = scale(score_differential)[,1]
  ) %>%
  select(epa, down, ydstogo_std, yardline_std, score_diff_std)

# Fit Bayesian regression
epa_reg_model <- brm(
  epa ~ down + ydstogo_std + yardline_std + score_diff_std,
  data = reg_data,
  family = gaussian(),
  prior = c(
    prior(normal(0, 1), class = Intercept),
    prior(normal(0, 0.5), class = b),
    prior(exponential(0.5), class = sigma)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  cores = 4,
  seed = 2024,
  backend = "cmdstanr",
  silent = 2,
  refresh = 0
)

#| label: bayesian-regression-r-results
#| message: false
#| warning: false

# Coefficient plot with uncertainty
mcmc_plot(epa_reg_model, type = "intervals", prob = 0.95) +
  labs(
    title = "Bayesian Regression Coefficients for EPA",
    subtitle = "95% Credible Intervals",
    x = "Coefficient Value"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

# Detailed summary
summary(epa_reg_model)

# Posterior predictive check
pp_check(epa_reg_model, ndraws = 100) +
  labs(title = "Posterior Predictive Check",
       subtitle = "Model predictions (light blue) vs observed data (dark blue)")

#| label: bayesian-regression-py
#| message: false
#| warning: false
#| cache: true

# Prepare regression data
reg_data_py = (pbp_2023
    .query("epa.notna() & play_type.isin(['pass', 'run']) & " +
           "down.notna() & ydstogo.notna() & yardline_100.notna() & " +
           "score_differential.notna()")
    .assign(
        down_cat=lambda x: pd.Categorical(x['down']),
        down_idx=lambda x: x['down_cat'].codes,
        ydstogo_std=lambda x: (x['ydstogo'] - x['ydstogo'].mean()) / x['ydstogo'].std(),
        yardline_std=lambda x: (x['yardline_100'] - x['yardline_100'].mean()) / x['yardline_100'].std(),
        score_diff_std=lambda x: (x['score_differential'] - x['score_differential'].mean()) / x['score_differential'].std()
    )
)

# Build Bayesian regression model
with pm.Model() as epa_reg_model_py:
    # Priors
    intercept = pm.Normal('intercept', mu=0, sigma=1)
    beta_down = pm.Normal('beta_down', mu=0, sigma=0.5, shape=3)  # 3 down dummies
    beta_ydstogo = pm.Normal('beta_ydstogo', mu=0, sigma=0.5)
    beta_yardline = pm.Normal('beta_yardline', mu=0, sigma=0.5)
    beta_score = pm.Normal('beta_score', mu=0, sigma=0.5)
    sigma = pm.Exponential('sigma', lam=0.5)

    # Linear model
    mu = (intercept +
          beta_down[reg_data_py['down_idx'].values] +
          beta_ydstogo * reg_data_py['ydstogo_std'].values +
          beta_yardline * reg_data_py['yardline_std'].values +
          beta_score * reg_data_py['score_diff_std'].values)

    # Likelihood
    epa_obs = pm.Normal('epa_obs', mu=mu, sigma=sigma,
                       observed=reg_data_py['epa'].values)

    # Sample
    trace_reg = pm.sample(2000, tune=1000, chains=4, cores=4,
                         random_seed=2024, return_inferencedata=True,
                         progressbar=False)

# Plot coefficients
az.plot_forest(trace_reg, var_names=['beta_ydstogo', 'beta_yardline', 'beta_score'],
               combined=True, hdi_prob=0.95)
plt.title('Bayesian Regression Coefficients for EPA\n95% Credible Intervals',
          fontweight='bold')
plt.tight_layout()
plt.show()

# Summary
print("\nPosterior Summary:")
print(az.summary(trace_reg, var_names=['intercept', 'beta_ydstogo',
                                       'beta_yardline', 'beta_score']))

Interpreting Results:

Yards to go: Negative coefficient (longer distance = lower EPA)
Yardline: Positive coefficient (closer to end zone = higher EPA)
Score differential: Tells us about game script effects
Down effects: Compare downs' impact on EPA

Unlike frequentist regression, we get full probability distributions for each coefficient, allowing statements like "There's a 95% probability the yardline effect is between 0.15 and 0.21."

Model Comparison and Selection

How do we choose between competing models? Bayesian methods offer several approaches:

1. Leave-One-Out Cross-Validation (LOO-CV)

LOO approximates leave-one-out cross-validation using Pareto-smoothed importance sampling (PSIS). It provides:

ELPD (Expected Log Predictive Density): Higher is better
LOO-IC: Lower is better (analogous to AIC/BIC)
Pareto k diagnostics: Identifies influential observations

2. WAIC (Widely Applicable Information Criterion)

WAIC is a fully Bayesian alternative to AIC:

$$ \text{WAIC} = -2 \sum_{i=1}^{n} \log \left( \mathbb{E}_{\text{post}}[p(y_i | \theta)] \right) + 2 \sum_{i=1}^{n} \text{Var}_{\text{post}}[\log p(y_i | \theta)] $$

Lower WAIC indicates better predictive performance.

R
Python

#| label: model-comparison-r
#| message: false
#| warning: false
#| cache: true

# Fit simpler model without down
epa_reg_model_simple <- brm(
  epa ~ ydstogo_std + yardline_std + score_diff_std,
  data = reg_data,
  family = gaussian(),
  prior = c(
    prior(normal(0, 1), class = Intercept),
    prior(normal(0, 0.5), class = b),
    prior(exponential(0.5), class = sigma)
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  cores = 4,
  seed = 2024,
  backend = "cmdstanr",
  silent = 2,
  refresh = 0
)

# Compute LOO for both models
loo_full <- loo(epa_reg_model)
loo_simple <- loo(epa_reg_model_simple)

# Compare models
loo_compare(loo_full, loo_simple)

# Also compute WAIC
waic_full <- waic(epa_reg_model)
waic_simple <- waic(epa_reg_model_simple)

# Display comparison
cat("\nModel Comparison (LOO):\n")
print(loo_compare(loo_full, loo_simple))

cat("\n\nModel Comparison (WAIC):\n")
cat(sprintf("Full model WAIC: %.2f\n", waic_full$estimates["waic", "Estimate"]))
cat(sprintf("Simple model WAIC: %.2f\n", waic_simple$estimates["waic", "Estimate"]))
cat(sprintf("Difference: %.2f (lower is better)\n",
            waic_simple$estimates["waic", "Estimate"] -
            waic_full$estimates["waic", "Estimate"]))

#| label: model-comparison-py
#| message: false
#| warning: false
#| cache: true

# Fit simpler model without down effects
with pm.Model() as epa_reg_model_simple_py:
    # Priors
    intercept = pm.Normal('intercept', mu=0, sigma=1)
    beta_ydstogo = pm.Normal('beta_ydstogo', mu=0, sigma=0.5)
    beta_yardline = pm.Normal('beta_yardline', mu=0, sigma=0.5)
    beta_score = pm.Normal('beta_score', mu=0, sigma=0.5)
    sigma = pm.Exponential('sigma', lam=0.5)

    # Linear model (no down effects)
    mu = (intercept +
          beta_ydstogo * reg_data_py['ydstogo_std'].values +
          beta_yardline * reg_data_py['yardline_std'].values +
          beta_score * reg_data_py['score_diff_std'].values)

    # Likelihood
    epa_obs = pm.Normal('epa_obs', mu=mu, sigma=sigma,
                       observed=reg_data_py['epa'].values)

    # Sample
    trace_reg_simple = pm.sample(2000, tune=1000, chains=4, cores=4,
                                random_seed=2024, return_inferencedata=True,
                                progressbar=False)

# Compute LOO for both models
loo_full = az.loo(trace_reg)
loo_simple = az.loo(trace_reg_simple)

# Compare models
loo_compare = az.compare({'Full': trace_reg, 'Simple': trace_reg_simple})

print("\nModel Comparison (LOO-CV):")
print(loo_compare)

# Also compute WAIC
waic_full = az.waic(trace_reg)
waic_simple = az.waic(trace_reg_simple)

print(f"\nFull model WAIC: {waic_full.waic:.2f}")
print(f"Simple model WAIC: {waic_simple.waic:.2f}")
print(f"Difference: {waic_simple.waic - waic_full.waic:.2f} (lower is better)")

The full model (with down effects) should substantially outperform the simple model, confirming that down is an important predictor of EPA.

Uncertainty Quantification

A key advantage of Bayesian inference is comprehensive uncertainty quantification. We can:

Credible intervals: "There's a 95% probability the parameter is in this range"
Posterior predictive distributions: Predictions with uncertainty
Probability statements: Direct answers to "What's the probability that...?"

Posterior Predictive Distributions

The posterior predictive distribution answers: "Given our model and observed data, what do we predict for new data?"

$$ p(\tilde{y} | D) = \int p(\tilde{y} | \theta) p(\theta | D) d\theta $$

This integrates over all posterior uncertainty in $\theta$.

R
Python

#| label: posterior-predictive-r
#| fig-cap: "Posterior predictive distribution for EPA"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false

# Generate posterior predictions for specific scenarios
new_scenarios <- tibble(
  scenario = c("1st & 10, Own 25", "3rd & 5, Opp 30", "2nd & 15, Midfield"),
  down = factor(c(1, 3, 2)),
  ydstogo_std = scale(c(10, 5, 15),
                     center = attr(reg_data$ydstogo_std, "scaled:center"),
                     scale = attr(reg_data$ydstogo_std, "scaled:scale"))[,1],
  yardline_std = scale(c(75, 30, 50),
                      center = attr(reg_data$yardline_std, "scaled:center"),
                      scale = attr(reg_data$yardline_std, "scaled:scale"))[,1],
  score_diff_std = scale(c(0, 0, 0),
                        center = attr(reg_data$score_diff_std, "scaled:center"),
                        scale = attr(reg_data$score_diff_std, "scaled:scale"))[,1]
)

# Generate posterior predictions
posterior_preds <- posterior_predict(epa_reg_model, newdata = new_scenarios)

# Convert to tidy format
pred_df <- as_tibble(t(posterior_preds)) %>%
  set_names(new_scenarios$scenario) %>%
  pivot_longer(everything(), names_to = "scenario", values_to = "epa_pred")

# Visualize
pred_df %>%
  ggplot(aes(x = epa_pred, fill = scenario)) +
  geom_density(alpha = 0.6) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Posterior Predictive Distributions for EPA",
    subtitle = "Predictions with full uncertainty for different scenarios",
    x = "Predicted EPA",
    y = "Density",
    fill = "Scenario"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )

# Summary statistics
pred_summary <- pred_df %>%
  group_by(scenario) %>%
  summarise(
    mean = mean(epa_pred),
    median = median(epa_pred),
    lower_95 = quantile(epa_pred, 0.025),
    upper_95 = quantile(epa_pred, 0.975),
    prob_positive = mean(epa_pred > 0),
    .groups = "drop"
  )

pred_summary %>%
  gt::gt() %>%
  gt::cols_label(
    scenario = "Scenario",
    mean = "Mean",
    median = "Median",
    lower_95 = "Lower 95%",
    upper_95 = "Upper 95%",
    prob_positive = "P(EPA > 0)"
  ) %>%
  gt::fmt_number(
    columns = c(mean, median, lower_95, upper_95, prob_positive),
    decimals = 3
  ) %>%
  gt::tab_header(
    title = "Posterior Predictive Summary"
  )

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: posterior-predictive-py
#| fig-cap: "Posterior predictive distribution for EPA (Python)"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false

# Generate posterior predictions for specific scenarios
ydstogo_mean = reg_data_py['ydstogo'].mean()
ydstogo_std_val = reg_data_py['ydstogo'].std()
yardline_mean = reg_data_py['yardline_100'].mean()
yardline_std_val = reg_data_py['yardline_100'].std()
score_mean = reg_data_py['score_differential'].mean()
score_std_val = reg_data_py['score_differential'].std()

scenarios = [
    {"name": "1st & 10, Own 25", "down": 0, "ydstogo": 10, "yardline": 75, "score": 0},
    {"name": "3rd & 5, Opp 30", "down": 2, "ydstogo": 5, "yardline": 30, "score": 0},
    {"name": "2nd & 15, Midfield", "down": 1, "ydstogo": 15, "yardline": 50, "score": 0}
]

# Generate predictions
fig, ax = plt.subplots(figsize=(10, 6))

for scenario in scenarios:
    # Standardize inputs
    ydstogo_std = (scenario["ydstogo"] - ydstogo_mean) / ydstogo_std_val
    yardline_std = (scenario["yardline"] - yardline_mean) / yardline_std_val
    score_std = (scenario["score"] - score_mean) / score_std_val

    # Sample from posterior predictive
    intercept_samples = trace_reg.posterior['intercept'].values.flatten()
    beta_down_samples = trace_reg.posterior['beta_down'].values.reshape(-1, 3)[:, scenario["down"]]
    beta_ydstogo_samples = trace_reg.posterior['beta_ydstogo'].values.flatten()
    beta_yardline_samples = trace_reg.posterior['beta_yardline'].values.flatten()
    beta_score_samples = trace_reg.posterior['beta_score'].values.flatten()
    sigma_samples = trace_reg.posterior['sigma'].values.flatten()

    # Posterior predictive samples
    mu_samples = (intercept_samples + beta_down_samples +
                 beta_ydstogo_samples * ydstogo_std +
                 beta_yardline_samples * yardline_std +
                 beta_score_samples * score_std)

    # Add sampling variation
    pred_samples = np.random.normal(mu_samples, sigma_samples)

    # Plot
    ax.hist(pred_samples, bins=50, alpha=0.6, density=True, label=scenario["name"])

ax.axvline(x=0, color='black', linestyle='--', alpha=0.7)
ax.set_xlabel('Predicted EPA', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Posterior Predictive Distributions for EPA\n' +
             'Predictions with full uncertainty for different scenarios',
             fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

These distributions show not just point predictions, but the full range of plausible outcomes accounting for all sources of uncertainty.

Prior Selection and Sensitivity Analysis

Choosing priors is both an art and science, and it's often cited as the most subjective aspect of Bayesian analysis. However, with football data, we usually have substantial domain knowledge that makes prior selection straightforward. The key is to be transparent about prior choices and assess their influence on conclusions.

Prior selection involves balancing several competing concerns:

Incorporating knowledge: Using what we know about football to improve estimates
Avoiding overconfidence: Not being so strong that data can't change our minds
Ensuring computational stability: Ruling out impossible or implausible values
Maintaining transparency: Being clear about assumptions

Types of Priors

The spectrum of prior informativeness ranges from very strong (encoding substantial knowledge) to very weak (minimal influence). Here's how to think about different types:

Informative priors: Encode substantial domain knowledge with meaningful precision
- Example: QB EPA likely between -0.2 and 0.4 based on historical data
- When to use: You have strong prior information (historical data, expert consensus)
- Risk: If prior is wrong, it can bias conclusions
- Benefit: Improves estimates with limited data, incorporates decades of football knowledge
Weakly informative priors: Provide gentle regularization without strong constraints
- Example: Normal(0, 1) on standardized predictors
- When to use: Default choice when you want some regularization but not strong assumptions
- Risk: Minimal—these priors rule out extreme values but let data speak
- Benefit: Stabilizes estimation, prevents nonsensical values, improves out-of-sample prediction
Uninformative/vague priors: Minimal influence on posterior, "let the data decide"
- Example: Uniform(-∞, ∞) or Normal(0, 1000)
- When to use: Rarely! These can cause numerical problems
- Risk: No regularization, models may not converge, extreme estimates possible
- Benefit: None in practice—weakly informative priors are always better

The Myth of "Objective" Priors

Some practitioners seek "objective" or "uninformative" priors to avoid "subjective" Bayesian choices. This is misguided for several reasons: 1. **No prior is truly uninformative**: Even a "flat" prior encodes assumptions (e.g., all values equally likely) 2. **Vague priors cause problems**: They can lead to poor sampling, nonsensical estimates, and worse predictions 3. **Domain knowledge is valuable**: Football has been analyzed for decades—ignoring this knowledge is wasteful 4. **Frequentist methods aren't "objective"**: They also make assumptions, just implicitly The solution: Use weakly informative priors based on domain knowledge, and check sensitivity to prior choices.

Prior Predictive Checks

Before fitting the model to data, you should always perform prior predictive checks: simulate data from your prior to ensure it produces reasonable predictions. This reveals whether your priors are sensible before you invest time in inference.

The logic: If your prior predicts EPA values of ±50 or field goal success rates of 10%, something is wrong with your prior specification. Fix it before seeing the data.

Here's how to conduct prior predictive checks:

R
Python

#| label: prior-predictive-r
#| fig-cap: "Prior predictive check"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false

# Sample from prior predictive distribution (before seeing data)
prior_pred <- brm(
  epa ~ down + ydstogo_std + yardline_std + score_diff_std,
  data = reg_data,
  family = gaussian(),
  prior = c(
    prior(normal(0, 1), class = Intercept),
    prior(normal(0, 0.5), class = b),
    prior(exponential(0.5), class = sigma)
  ),
  sample_prior = "only",  # Sample only from prior, ignore data
  chains = 4,
  iter = 2000,
  cores = 4,
  seed = 2024,
  backend = "cmdstanr",
  silent = 2,
  refresh = 0
)

# Generate predictions from prior
prior_preds <- posterior_predict(prior_pred)

# Compare prior predictions to observed data
tibble(
  epa_sim = prior_preds[sample(1:nrow(prior_preds), 1000), 1]
) %>%
  ggplot(aes(x = epa_sim)) +
  geom_density(fill = "gray", alpha = 0.6) +
  geom_density(data = reg_data, aes(x = epa), fill = "blue", alpha = 0.4) +
  scale_x_continuous(limits = c(-10, 10)) +
  labs(
    title = "Prior Predictive Check",
    subtitle = "Gray = Prior predictions, Blue = Observed data",
    x = "EPA",
    y = "Density"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: prior-predictive-py
#| fig-cap: "Prior predictive check (Python)"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false

# Sample from prior predictive distribution
with pm.Model() as prior_model:
    # Priors (same as before)
    intercept = pm.Normal('intercept', mu=0, sigma=1)
    beta_down = pm.Normal('beta_down', mu=0, sigma=0.5, shape=3)
    beta_ydstogo = pm.Normal('beta_ydstogo', mu=0, sigma=0.5)
    beta_yardline = pm.Normal('beta_yardline', mu=0, sigma=0.5)
    beta_score = pm.Normal('beta_score', mu=0, sigma=0.5)
    sigma = pm.Exponential('sigma', lam=0.5)

    # Use first observation's predictors
    mu = (intercept +
          beta_down[0] +  # arbitrary down
          beta_ydstogo * 0 +
          beta_yardline * 0 +
          beta_score * 0)

    # Prior predictive
    epa_prior = pm.Normal('epa_prior', mu=mu, sigma=sigma)

    # Sample from prior
    prior_samples = pm.sample_prior_predictive(samples=1000, random_seed=2024)

# Plot prior predictive vs observed
fig, ax = plt.subplots(figsize=(10, 6))

prior_epa = prior_samples.prior['epa_prior'].values.flatten()
ax.hist(prior_epa, bins=50, density=True, alpha=0.6,
        color='gray', label='Prior predictions')
ax.hist(reg_data_py['epa'], bins=50, density=True, alpha=0.4,
        color='blue', label='Observed data')

ax.set_xlim(-10, 10)
ax.set_xlabel('EPA', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Prior Predictive Check\nGray = Prior predictions, Blue = Observed data',
             fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

If prior predictions produce impossible values (e.g., EPA of 50), the prior is too vague and should be tightened.

Sensitivity Analysis

Check how results change with different priors:

Prior Sensitivity Guidelines

1. **Vary prior width**: Try σ = 0.5, 1.0, 2.0 for Normal(0, σ) priors 2. **Compare posteriors**: If posteriors are similar, priors have little influence (data dominates) 3. **Document choices**: Always report and justify priors 4. **Use domain knowledge**: Expert opinion should inform prior selection

Practical Applications in Football

Having covered the theoretical foundations and core techniques, let's explore how Bayesian methods solve real football analytics problems. These applications demonstrate why teams increasingly use Bayesian approaches for everything from in-game decisions to long-term roster construction.

Application 1: Fourth Down Decision Making

Fourth down decisions are among the highest-leverage in-game choices coaches make. Should you go for it on 4th-and-2 from midfield when trailing by 7 in the third quarter? Traditional analytics provides point estimates of conversion probability and expected value, but doesn't quantify uncertainty or account for team-specific abilities.

Bayesian methods can dramatically improve 4th down decision-making by:

Modeling team-specific conversion rates: Not all teams convert 4th-and-2 at the same rate. Elite offensive lines and mobile QBs convert more often. Hierarchical models estimate team-specific rates while borrowing strength across teams.
Updating beliefs as season progresses: After Week 1, you have minimal team-specific data. By Week 10, you have substantial evidence about each team's short-yardage ability. Bayesian models naturally incorporate accumulating evidence.
Quantifying uncertainty in recommendations: Saying "you have a 55% chance to convert" is less useful than "you have a 55% chance with 95% credible interval [45%, 65%]." The wide interval suggests substantial uncertainty, which might affect the decision.
Incorporating situational factors: You can include yardage, field position, opponent quality, and other context in a regression framework while maintaining proper uncertainty quantification.

The hierarchical structure is key: you're estimating league-average conversion rates, team-specific deviations from the average, and situation-specific modifiers, all simultaneously with proper uncertainty propagation.

Real-Time Bayesian Updating During Games

Imagine a team that's 0-for-3 on fourth down attempts today. Should this change your conversion probability estimate? A simple frequency approach says "they're 0% today!" A Bayesian hierarchical model says "they've had bad luck today, but their true ability is probably close to their season-long rate, shrunk slightly toward zero based on today's failures." This is exactly the rational updating coaches should do—and hierarchical models do it automatically.

R
Python

#| label: fourth-down-app-r
#| message: false
#| warning: false
#| cache: true

# Prepare 4th down data
fourth_down <- pbp_2023 %>%
  filter(
    down == 4,
    !is.na(go_boost),  # 4th down situation
    ydstogo <= 5,  # Short yardage
    play_type %in% c("pass", "run"),
    !is.na(posteam)
  ) %>%
  mutate(
    conversion = if_else(first_down_pass == 1 | first_down_rush == 1, 1, 0),
    ydstogo_group = cut(ydstogo, breaks = c(0, 2, 5),
                       labels = c("0-2 yards", "3-5 yards"))
  )

# Hierarchical model: team effects + yardage effects
fourth_model <- brm(
  conversion ~ ydstogo_group + (1 | posteam),
  data = fourth_down,
  family = bernoulli(),
  prior = c(
    prior(normal(0, 1.5), class = Intercept),
    prior(normal(0, 1), class = b),
    prior(exponential(1), class = sd)
  ),
  chains = 4,
  iter = 2000,
  cores = 4,
  seed = 2024,
  backend = "cmdstanr",
  silent = 2,
  refresh = 0
)

# Extract team-specific conversion probabilities
team_probs <- fourth_model %>%
  spread_draws(r_posteam[posteam,]) %>%
  mutate(
    # Add fixed effects
    prob_short = plogis(fixef(fourth_model)[1, 1] + r_posteam),
    prob_medium = plogis(fixef(fourth_model)[1, 1] +
                        fixef(fourth_model)[2, 1] + r_posteam)
  ) %>%
  group_by(posteam) %>%
  summarise(
    prob_0_2 = mean(prob_short),
    prob_3_5 = mean(prob_medium),
    lower_0_2 = quantile(prob_short, 0.025),
    upper_0_2 = quantile(prob_short, 0.975),
    .groups = "drop"
  ) %>%
  arrange(desc(prob_0_2))

# Show top teams
team_probs %>%
  head(10) %>%
  gt::gt() %>%
  gt::cols_label(
    posteam = "Team",
    prob_0_2 = "P(Convert 0-2 yds)",
    lower_0_2 = "Lower 95%",
    upper_0_2 = "Upper 95%",
    prob_3_5 = "P(Convert 3-5 yds)"
  ) %>%
  gt::fmt_percent(
    columns = c(prob_0_2, lower_0_2, upper_0_2, prob_3_5),
    decimals = 1
  ) %>%
  gt::tab_header(
    title = "4th Down Conversion Probabilities by Team",
    subtitle = "Short yardage situations (2023)"
  )

#| label: fourth-down-app-py
#| message: false
#| warning: false
#| cache: true

# Prepare 4th down data
fourth_down_py = (pbp_2023
    .query("down == 4 & go_boost.notna() & ydstogo <= 5 & " +
           "play_type.isin(['pass', 'run']) & posteam.notna()")
    .assign(
        conversion=lambda x: ((x['first_down_pass'] == 1) |
                             (x['first_down_rush'] == 1)).astype(int),
        ydstogo_group=lambda x: pd.cut(x['ydstogo'], bins=[0, 2, 5],
                                       labels=['0-2 yards', '3-5 yards'])
    )
    .dropna(subset=['ydstogo_group'])
)

# Convert to numeric
fourth_down_py['ydstogo_binary'] = (fourth_down_py['ydstogo_group'] == '3-5 yards').astype(int)
fourth_down_py['team_idx'] = pd.Categorical(fourth_down_py['posteam']).codes
n_teams_4th = fourth_down_py['team_idx'].nunique()

# Hierarchical logistic regression
with pm.Model() as fourth_model_py:
    # Hyperpriors
    mu_intercept = pm.Normal('mu_intercept', mu=0, sigma=1.5)
    sigma_team = pm.Exponential('sigma_team', lam=1)

    # Team effects
    team_effect = pm.Normal('team_effect', mu=0, sigma=sigma_team, shape=n_teams_4th)

    # Yardage effect
    beta_yardage = pm.Normal('beta_yardage', mu=0, sigma=1)

    # Linear predictor
    logit_p = (mu_intercept + team_effect[fourth_down_py['team_idx'].values] +
              beta_yardage * fourth_down_py['ydstogo_binary'].values)

    # Likelihood
    conversion_obs = pm.Bernoulli('conversion_obs', logit_p=logit_p,
                                 observed=fourth_down_py['conversion'].values)

    # Sample
    trace_4th = pm.sample(2000, tune=1000, chains=4, cores=4,
                         random_seed=2024, return_inferencedata=True,
                         progressbar=False)

print("\n4th Down Conversion Model Summary:")
print(az.summary(trace_4th, var_names=['mu_intercept', 'beta_yardage', 'sigma_team']))

Application 2: Player Development Trajectories

One of the most valuable applications of Bayesian methods is tracking how players develop over time, particularly rookies. A rookie QB might start hot, struggle mid-season, then finish strong. How should we interpret this performance arc? Is the early success real talent or luck? Is the mid-season struggle growing pains or exposure of weakness?

Sequential Bayesian updating provides the ideal framework:

Start with a prior based on draft position, college stats, and historical rookie performance
Update after each game by combining the prior with new performance data
Track the posterior over time to see when beliefs stabilize
Quantify uncertainty at each point—are we confident this is the QB's true level?

This approach mirrors how coaches actually evaluate players: they have initial expectations (priors), update them based on performance (likelihood), and eventually form stable beliefs (posterior). The Bayesian framework formalizes this process and quantifies confidence at each stage.

Application 3: Injury Risk Prediction

Predicting injury risk is critical for roster management, contract negotiations, and game planning. However, injuries are rare events with substantial uncertainty. A player might have no injuries for 5 years, then suffer two in one season. How do we distinguish bad luck from true injury proneness?

Bayesian survival models are ideal because they:

Handle censoring: Some players leave the league before getting injured (censored observations)
Incorporate covariates: Age, position, playing style, workload all affect injury risk
Pool information: Borrow strength across similar players to estimate risk for players with limited history
Quantify uncertainty: Provide probability distributions over time-to-injury, not just point estimates

Hierarchical Bayesian survival models can estimate baseline injury hazards by position, then player-specific deviations, accounting for age, usage patterns, and injury history. This gives roster managers probabilistic forecasts: "This RB has a 30% chance [CI: 20-45%] of missing 3+ games to injury this season."

Application 4: Draft Pick Valuation

How much is a 2nd-round pick worth relative to a 1st-rounder? What's the value of a specific draft position (e.g., pick #15) relative to another (e.g., pick #25)? These questions are critical for trade negotiations and draft strategy.

Hierarchical Bayesian models can estimate draft pick value by:

Modeling career outcomes (e.g., career WAR, years as starter) as a function of draft position
Pooling across positions: QBs and RBs have different value curves, but share common structure
Accounting for draft class variation: Some drafts are strong, others weak—hierarchical models separate these sources of variation
Incorporating uncertainty: Late-round picks might bust or become Pro Bowlers—capture this with full posterior distributions

The result is a draft value chart with uncertainty quantification. Rather than saying "pick #15 is worth 1050 points," you get "pick #15 is worth 1050 [CI: 950-1200] points in expected career value." The uncertainty matters when evaluating trades—is trading up worth the cost given the uncertainty?

Bayesian Models for Multi-Year Contracts

When evaluating free agent contracts, Bayesian methods shine. You need to project player performance over 3-5 years, accounting for age decline, injury risk, and regression to the mean. Hierarchical Bayesian models that pool information across similar players while respecting individual differences provide the most realistic projections with proper uncertainty quantification. This helps teams avoid overpaying for recent peak performance that won't persist.

Bayesian vs Frequentist: A Comparison

Aspect	Frequentist	Bayesian
Parameters	Fixed but unknown	Random variables with distributions
Inference	Confidence intervals, p-values	Posterior distributions, credible intervals
Interpretation	"95% of intervals contain true value"	"95% probability parameter is in this interval"
Prior knowledge	Not formally incorporated	Naturally incorporated via priors
Sample size	Relies on asymptotic theory	Works with any sample size
Uncertainty	Confidence intervals	Full posterior distribution
Computation	Often closed-form	Usually requires MCMC
Overfitting	Regularization via penalties	Natural regularization via priors

:Comparison of Bayesian and Frequentist approaches

When to Use Bayesian Methods

Bayesian methods are ideal when:

You have limited data (regularization helps)
Hierarchical structure exists (players in teams)
You want to incorporate prior knowledge
Uncertainty quantification is critical
You need to update predictions sequentially

Frequentist methods may be preferred when:

Computational resources are limited
Sample sizes are very large (results converge)
You want to avoid specifying priors
Simple hypothesis testing suffices

In practice, modern football analytics often blends both approaches, using whichever is most appropriate for the problem at hand.

Summary

Bayesian methods offer a powerful and increasingly essential framework for football analytics. As the field matures, teams recognize that point estimates without uncertainty quantification are insufficient for high-stakes decisions. Bayesian methods provide the rigorous probabilistic reasoning needed for modern football analytics.

Key Takeaways:

Bayes' theorem provides a principled way to update beliefs with new evidence, formalizing how we naturally think about learning from data while incorporating prior knowledge
Hierarchical models naturally handle football's nested structure (players within teams, teams within divisions), allowing information pooling that improves estimates for all groups while respecting individual differences
MCMC algorithms enable inference for complex models that would be intractable with analytical methods, making sophisticated hierarchical and regression models practical
Uncertainty quantification is natural and comprehensive in the Bayesian framework—every estimate comes with a full posterior distribution showing the range of plausible values
Prior knowledge can be formally incorporated through prior distributions, allowing decades of football understanding to inform modern analyses rather than starting from scratch with each dataset
Model comparison uses principled tools like LOO-CV and WAIC that balance fit and complexity, helping choose models that predict well rather than just fit well
Practical applications span the full range of football decisions: player evaluation, game-time tactics, roster construction, contract negotiation, and draft strategy

The Complete Bayesian Workflow:

Specify model: Choose appropriate likelihood for your data and prior distributions for parameters
Prior predictive checks: Simulate from the prior to ensure it produces reasonable predictions before seeing data
Fit using MCMC: Use modern tools like Stan, PyMC, or brms to sample from the posterior
Check diagnostics: Verify convergence (R-hat < 1.01), adequate sampling (ESS > 400), and no pathologies (divergences)
Validate with posterior predictive checks: Ensure the model generates data consistent with observed data
Compare models: Use LOO-CV or WAIC to compare competing models
Perform sensitivity analysis: Check how conclusions change with different prior choices
Make inferences and predictions: Extract posterior summaries, compute probabilities, generate predictions with uncertainty

When Bayesian Methods Excel:

Bayesian methods require more computational effort and statistical sophistication than traditional approaches, but provide richer, more nuanced insights. They are especially valuable in football analytics where:

Sample sizes are limited: Players and teams have finite games—regularization via priors helps
Hierarchical structure exists: Pooling information across players/teams improves all estimates
Prior knowledge is available: Decades of football analysis can inform modern models
Uncertainty matters: High-stakes decisions require knowing confidence levels, not just point estimates
Sequential updating is needed: As the season progresses, beliefs should rationally update
Rare events occur: Injuries, fourth downs, two-point conversions benefit from information pooling

Looking Forward:

Bayesian methods are not just academic exercises—they're increasingly used by NFL teams for practical decision-making. As computational power grows and software improves, Bayesian analyses that once took hours now take minutes. The barrier to adoption is no longer computational, but educational. Mastering these methods provides a significant competitive advantage in modern football analytics.

The Bayesian Mindset

Beyond specific techniques, Bayesian methods encourage a valuable mindset for analytics: 1. **Think probabilistically**: Express beliefs as distributions, not points 2. **Update rationally**: Combine evidence using Bayes' theorem, not ad-hoc adjustments 3. **Quantify uncertainty**: Always know how confident you should be 4. **Regularize appropriately**: Shrink toward reasonable values when data is limited 5. **Be transparent**: Make assumptions explicit through prior specification This mindset—thinking clearly about uncertainty and updating beliefs rationally—is valuable far beyond the specific models in this chapter.

Exercises

Conceptual Questions

Priors and Posteriors: A kicker makes 3 of 3 field goals in his first game. Compare:
- Frequentist estimate (3/3 = 100%)
- Bayesian estimate with Beta(20, 5) prior (encoding that kickers are typically around 80%)

Which seems more reasonable? Why?

Hierarchical Shrinkage: Explain in your own words why hierarchical models shrink extreme estimates toward the group mean. Why is this beneficial for player evaluation?
Credible vs Confidence Intervals:
- A 95% confidence interval for a QB's EPA is [0.10, 0.30]
- A 95% credible interval for a QB's EPA is [0.10, 0.30]

Explain the difference in interpretation. Which is more intuitive for decision-making?

Coding Exercises

Exercise 1: Bayesian Kicker Evaluation

Build a hierarchical Beta-binomial model for NFL kickers' field goal success rates in 2023. **Tasks:** a) Filter to kickers with at least 10 attempts b) Fit a hierarchical model with team-level effects c) Compare raw success rates to hierarchical estimates d) Visualize shrinkage as a function of attempts e) Rank kickers by posterior mean success probability **Bonus**: Compare short (< 40 yards) vs long (≥ 40 yards) field goals.

Exercise 2: Hierarchical Team Strength Model

Create a hierarchical model estimating team strength based on point differential. **Model:** $$ \begin{align} \text{Point Diff}_{ij} &\sim \text{Normal}(\theta_i - \theta_j, \sigma_y) \\ \theta_i &\sim \text{Normal}(\mu_{\text{league}}, \sigma_{\text{team}}) \end{align} $$ Where $\theta_i$ is team $i$'s strength. **Tasks:** a) Fit the model using 2023 data b) Rank teams by posterior mean strength c) Compute probability that Team A > Team B for selected pairs d) Visualize posterior distributions for top 5 teams

Exercise 3: Bayesian EPA Regression

Extend the EPA regression model from the chapter to include interactions. **Tasks:** a) Add interaction between down and yards to go b) Include play type (pass vs run) c) Fit the model and interpret coefficients d) Use LOO-CV to compare to the simpler model e) Generate posterior predictive distributions for: - 3rd & 3, pass, midfield - 3rd & 3, run, midfield - Which has higher expected EPA? **Bonus**: Include team-specific random effects.

Exercise 4: Sequential Bayesian Updating

Track a quarterback's EPA through the season using sequential Bayesian updates. **Tasks:** a) Pick a QB who played full 2023 season b) Start with weakly informative prior: Normal(0.05, 0.3) c) After each game, update posterior using that game's EPA values d) Plot how posterior mean and uncertainty evolve over season e) Compare final posterior to raw season-end EPA **Interpretation**: When did the posterior stabilize? How much did early games influence final estimate?

References

:::

Learning ObjectivesBy the end of this chapter, you will be able to:

Introduction

Why Bayesian Methods for Football?

Bayes' Theorem and the Bayesian Framework

The Foundation: Bayes' Theorem

The Key Insight: Learning as Belief Updating

The Bayesian Workflow

Prior, Likelihood, and Posterior

Understanding Credible Intervals

Bayesian Inference for Simple Problems

Example 1: Beta-Binomial Model for Conversion Rates

Why Conjugate Priors Are Valuable

Example 2: Comparing Two Teams

Hierarchical Models for Football Data

Why Hierarchical Models?

The Shrinkage Principle

Example: Hierarchical QB Rating Model

Visualizing Hierarchical Shrinkage

Markov Chain Monte Carlo (MCMC)

Why MCMC?

Monte Carlo: Named After a Casino

Common MCMC Algorithms

MCMC Diagnostics

📊 Visualization Output

📊 Visualization Output

Bayesian Regression for Football Metrics

Example: Predicting EPA from Play Characteristics

Model Comparison and Selection

1. Leave-One-Out Cross-Validation (LOO-CV)

2. WAIC (Widely Applicable Information Criterion)

Uncertainty Quantification

Posterior Predictive Distributions

📊 Visualization Output

📊 Visualization Output

Prior Selection and Sensitivity Analysis

Types of Priors

The Myth of "Objective" Priors

Prior Predictive Checks

📊 Visualization Output

Sensitivity Analysis

Prior Sensitivity Guidelines

Practical Applications in Football

Application 1: Fourth Down Decision Making

Real-Time Bayesian Updating During Games

Application 2: Player Development Trajectories

Application 3: Injury Risk Prediction

Application 4: Draft Pick Valuation

Bayesian Models for Multi-Year Contracts

Bayesian vs Frequentist: A Comparison

When to Use Bayesian Methods

Summary

The Bayesian Mindset

Exercises

Conceptual Questions

Coding Exercises

Exercise 1: Bayesian Kicker Evaluation

Exercise 2: Hierarchical Team Strength Model

Exercise 3: Bayesian EPA Regression

Exercise 4: Sequential Bayesian Updating

Further Reading

Books

Papers

Software Documentation

Tutorials

References