Learning ObjectivesBy the end of this chapter, you will be able to:
- Understand Bayesian inference fundamentals and philosophical differences from frequentist approaches
- Build hierarchical models for football data that account for nested structure
- Update beliefs with new evidence using Bayes' theorem
- Quantify uncertainty in predictions using posterior distributions
- Compare Bayesian and frequentist approaches for football analytics problems
Introduction
In traditional (frequentist) statistics, we estimate parameters as fixed but unknown values and quantify uncertainty through confidence intervals and p-values. Bayesian statistics offers a fundamentally different paradigm: parameters are treated as random variables with probability distributions that reflect our uncertainty about their true values.
This philosophical difference has profound practical implications for football analytics. In the frequentist view, a player's true ability is a fixed number that we're trying to estimate with increasing precision as we collect more data. In the Bayesian view, our uncertainty about a player's ability is represented by a probability distribution that evolves as we observe more performance data. This distribution captures not just our best guess, but the entire range of plausible values weighted by their likelihood.
For football analytics, the Bayesian approach offers several compelling advantages:
- Natural uncertainty quantification: Rather than point estimates, we get full probability distributions that show the range of plausible values and their relative likelihoods
- Hierarchical modeling: Perfect for nested structures (players within teams, teams within divisions, games within seasons) that are endemic to football data
- Information pooling: Weak data can borrow strength from strong data, allowing us to make reasonable inferences even for backup players with limited snaps
- Sequential learning: Easy to update predictions as new games occur, reflecting how coaches and analysts actually think about player evaluation
- Prior knowledge incorporation: Can formally use expert judgment, historical data, or domain knowledge to improve estimates
Consider a practical example: You want to evaluate a rookie quarterback's true ability after 3 games where he averaged 0.15 EPA per play. A frequentist approach gives you a point estimate (0.15 EPA/play) with a wide confidence interval—perhaps [-0.05, 0.35]. This tells you the data is consistent with a wide range of abilities, but doesn't help you decide if this QB is genuinely good or just got lucky.
A Bayesian approach gives you a probability distribution over possible abilities, incorporating both the limited data you've observed and prior knowledge about typical rookie performance. If historical data shows most rookies average around 0.0 EPA/play, your posterior distribution might center around 0.08 EPA/play—partway between the observed 0.15 and the historical 0.0. Moreover, it naturally shrinks extreme estimates toward more reasonable values, protecting you from overreacting to small samples.
Perhaps most importantly, Bayesian methods let you answer questions that matter for decision-making: "What's the probability this QB is above average?" or "How likely is he to be a top-10 QB?" These probabilistic statements are natural in the Bayesian framework but awkward or impossible in the frequentist framework.
Why Bayesian Methods for Football?
Football presents ideal conditions for Bayesian methods: 1. **Limited sample sizes**: Players and teams have finite games 2. **Hierarchical structure**: Players nest within teams, teams within leagues 3. **Expert knowledge**: Decades of football understanding can inform priors 4. **Sequential updates**: Games happen weekly, allowing belief updates 5. **Uncertainty matters**: Decision-making requires knowing confidence levelsBayes' Theorem and the Bayesian Framework
The Foundation: Bayes' Theorem
At the heart of Bayesian inference lies Bayes' theorem, a mathematical formula that describes how to rationally update beliefs in light of new evidence. Despite its reputation as an advanced statistical technique, the core idea is intuitive: we start with some initial beliefs (the prior), observe some data, and update our beliefs to account for what we've learned (the posterior).
Bayes' theorem describes how to update probabilities based on new evidence:
$$ P(\theta | D) = \frac{P(D | \theta) \cdot P(\theta)}{P(D)} $$
Where:
- $P(\theta | D)$ = Posterior: Our updated belief about parameter $\theta$ after seeing data $D$. This is what we ultimately care about—what should we believe about the parameter given everything we know?
- $P(D | \theta)$ = Likelihood: Probability of observing data $D$ given parameter $\theta$. This captures how consistent the observed data is with different parameter values.
- $P(\theta)$ = Prior: Our initial belief about $\theta$ before seeing data. This encodes our background knowledge, expert opinion, or simply our state of uncertainty before collecting data.
- $P(D)$ = Marginal likelihood: Total probability of the data (normalizing constant). This ensures the posterior is a valid probability distribution (sums/integrates to 1).
The marginal likelihood $P(D)$ is often difficult to compute, but we rarely need it for inference. Instead, we work with the proportional form:
$$ P(\theta | D) \propto P(D | \theta) \cdot P(\theta) $$
This says: Posterior is proportional to likelihood times prior. The posterior combines what the data tells us (likelihood) with what we knew beforehand (prior). When we have lots of data, the likelihood dominates and the prior has little influence. When we have little data, the prior prevents us from drawing extreme conclusions from noise.
The Key Insight: Learning as Belief Updating
Bayes' theorem formalizes how we naturally think about learning. If you believe a kicker is "probably good" (prior), then observe him make 3 of 3 field goals (likelihood), you update to "definitely good, but maybe he got lucky" (posterior). Conversely, if you observe him miss 3 of 3 (different likelihood), you update to "probably not that good" (different posterior). Bayes' theorem tells us exactly how to balance prior beliefs and new evidence.The Bayesian Workflow
- Specify the model: Choose likelihood function for your data
- Choose priors: Encode prior knowledge about parameters
- Compute posterior: Combine prior and likelihood (usually via MCMC)
- Check the model: Posterior predictive checks, convergence diagnostics
- Make inference: Extract summaries, predictions, decisions from posterior
Prior, Likelihood, and Posterior
Let's illustrate these concepts with a concrete example: estimating a kicker's true field goal success probability. This is an ideal starting example because:
- The parameter is intuitive: We're estimating a probability (between 0 and 1)
- The data is simple: Success/failure outcomes on field goal attempts
- We have prior knowledge: We know NFL kickers typically make 80-85% of field goals
- The math is tractable: This is a "conjugate prior" situation where we can compute the posterior analytically
In this example, we'll observe a kicker make 18 of 20 field goal attempts and use Bayesian inference to estimate his true ability. The key question: Should we believe his true ability is 90% (the observed rate), or should we be more conservative given the small sample? Bayesian inference gives us a principled answer.
We'll use a Beta-Binomial model, which is the standard Bayesian approach for estimating proportions or probabilities. The Beta distribution is a flexible probability distribution on [0, 1], making it perfect for representing beliefs about success probabilities. The Binomial distribution models the number of successes in a fixed number of independent trials—exactly what we have with field goal attempts.
#| label: beta-binomial-setup-r
#| message: false
#| warning: false
library(tidyverse)
library(patchwork)
# Set up grid for plotting
# We'll evaluate the distributions at 1000 points from 0 to 1
# This gives smooth curves for visualization
theta <- seq(0, 1, length.out = 1000)
# Observed data: kicker made 18 of 20 field goals
successes <- 18
attempts <- 20
# Prior: Beta(10, 3) - encoding belief that kickers are generally good
# Beta(α, β) has mean = α/(α+β), so Beta(10, 3) has mean ≈ 0.77
# This captures our prior belief that NFL kickers typically make ~77% of FGs
# The parameters α=10 and β=3 represent equivalent to having seen
# 10 makes and 3 misses before observing this kicker
prior <- dbeta(theta, 10, 3)
# Likelihood: Binomial(20, theta)
# For each possible value of theta (success probability),
# what's the probability of observing exactly 18 successes in 20 attempts?
likelihood <- dbinom(successes, attempts, theta)
# Scale likelihood to match prior height for visualization purposes only
likelihood_scaled <- likelihood / max(likelihood) * max(prior)
# Posterior: Beta(10+18, 3+2) = Beta(28, 5)
# With conjugate priors, the posterior is also Beta with updated parameters:
# α_post = α_prior + successes
# β_post = β_prior + failures
# This is the magic of conjugacy—we can compute the posterior analytically!
posterior <- dbeta(theta, 10 + successes, 3 + (attempts - successes))
# Combine into dataframe for plotting
bayes_df <- tibble(
theta = rep(theta, 3),
density = c(prior, likelihood_scaled, posterior),
distribution = rep(c("Prior", "Likelihood", "Posterior"), each = length(theta))
) %>%
mutate(distribution = factor(distribution,
levels = c("Prior", "Likelihood", "Posterior")))
# Create visualization showing all three components
ggplot(bayes_df, aes(x = theta, y = density, color = distribution)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Prior" = "#619CFF",
"Likelihood" = "#F8766D",
"Posterior" = "#00BA38")) +
labs(
title = "Bayesian Inference for Kicker Success Rate",
subtitle = "Observed: 18/20 field goals made",
x = "Success Probability (θ)",
y = "Density",
color = "Distribution"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top"
)
#| label: beta-binomial-setup-py
#| message: false
#| warning: false
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Set up grid for plotting
# Evaluate distributions at 1000 points between 0 and 1
theta = np.linspace(0, 1, 1000)
# Observed data: kicker made 18 of 20 field goals
successes = 18
attempts = 20
# Prior: Beta(10, 3) - encoding belief that kickers are generally good
# Beta(α, β) distribution with α=10, β=3 gives mean = 10/13 ≈ 0.77
# This represents our prior belief before seeing this kicker's data
prior = stats.beta(10, 3).pdf(theta)
# Likelihood: Binomial(20, theta)
# For each possible success probability theta, calculate the probability
# of observing 18 successes in 20 attempts
likelihood = stats.binom(attempts, theta).pmf(successes)
# Scale for visualization (doesn't affect the actual inference)
likelihood_scaled = likelihood / likelihood.max() * prior.max()
# Posterior: Beta(10+18, 3+2) = Beta(28, 5)
# The posterior combines prior and likelihood with simple addition:
# New α = old α + observed successes
# New β = old β + observed failures
posterior = stats.beta(10 + successes, 3 + (attempts - successes)).pdf(theta)
# Create visualization showing the Bayesian updating process
plt.figure(figsize=(10, 6))
plt.plot(theta, prior, label='Prior', linewidth=2, color='#619CFF')
plt.plot(theta, likelihood_scaled, label='Likelihood (scaled)',
linewidth=2, color='#F8766D')
plt.plot(theta, posterior, label='Posterior', linewidth=2, color='#00BA38')
plt.xlabel('Success Probability (θ)', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bayesian Inference for Kicker Success Rate\nObserved: 18/20 field goals made',
fontsize=14, fontweight='bold')
plt.legend(loc='upper left')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate and display posterior summary statistics
# These tell us what we should believe about the kicker's true ability
post_mean = (10 + successes) / (13 + attempts)
post_median = stats.beta(10 + successes, 3 + (attempts - successes)).median()
post_95 = stats.beta(10 + successes, 3 + (attempts - successes)).interval(0.95)
print(f"\nPosterior Summary:")
print(f" Mean: {post_mean:.3f}")
print(f" Median: {post_median:.3f}")
print(f" 95% Credible Interval: ({post_95[0]:.3f}, {post_95[1]:.3f})")
The posterior represents our updated belief: it's a compromise between the prior (centered around 0.77) and the data (18/20 = 0.90). Notice how the posterior (green curve) lies between the prior (blue) and likelihood (red). This is Bayesian learning in action—we've moved from our initial belief toward what the data suggests, but not all the way there given the limited sample.
With more data, the posterior would move closer to the observed rate. If this kicker maintains a 90% success rate over 100 attempts, the posterior would center much closer to 0.90, because the large sample provides strong evidence that overwhelms the prior. Conversely, if he made 18 of 20 but then regressed to 75% over the next 100 attempts, our posterior would reflect that he's probably around league average, and his hot start was likely luck.
Understanding Credible Intervals
The 95% credible interval [0.78, 0.91] means: "Given our model and data, there's a 95% probability the kicker's true success rate is between 78% and 91%." This is much more intuitive than a frequentist confidence interval, which has a more convoluted interpretation about long-run frequencies of interval-construction procedures.Bayesian Inference for Simple Problems
Example 1: Beta-Binomial Model for Conversion Rates
Now that we understand the basic mechanics of Bayesian inference, let's apply it to a real football analytics problem: estimating team-specific third down conversion rates. This is a more complex example than the single kicker because we'll be estimating rates for all 32 NFL teams simultaneously.
The Beta-binomial model is a conjugate prior setup perfect for any success/failure data in football:
- Third down conversion rates: Did the offense gain a first down? (Yes/No)
- Red zone touchdown rates: Did a red zone drive end in a TD? (Yes/No)
- Two-point conversion success: Did the attempt succeed? (Yes/No)
- Fourth down conversion rates: Did the offense convert? (Yes/No)
- Completion percentage: Was the pass completed? (Yes/No)
The key advantage of this model is its simplicity and interpretability. We're estimating a probability for each team, and the Beta-binomial conjugacy means we can compute posteriors analytically without MCMC sampling.
Model Setup:
$$ \begin{align} \text{Successes} | \theta &\sim \text{Binomial}(n, \theta) \\ \theta &\sim \text{Beta}(\alpha, \beta) \end{align} $$
Where:
- $\theta$ is the true (unknown) conversion probability for a team
- $n$ is the number of attempts
- The Beta prior has parameters $\alpha$ and $\beta$ that encode our prior belief
Conjugacy: If the prior is Beta($\alpha$, $\beta$) and we observe $k$ successes in $n$ trials, the posterior is Beta($\alpha + k$, $\beta + n - k$). This means:
- The posterior has the same distributional form as the prior (both Beta)
- We update by simply adding observed data to prior parameters
- No complex computation or sampling required
- We get exact results instantly
Why Conjugate Priors Are Valuable
Conjugate priors like Beta-Binomial are special because they allow analytical posterior computation. While modern MCMC methods can handle any prior-likelihood combination, conjugate priors provide: 1. **Instant results**: No sampling, no convergence diagnostics 2. **Intuitive interpretation**: Prior parameters can be thought of as "pseudo-observations" 3. **Computational efficiency**: Useful for real-time applications 4. **Mathematical elegance**: Clear understanding of how prior and data combine However, they're limited to specific model families. For more complex models, we'll need MCMC.#| label: third-down-analysis-r
#| message: false
#| warning: false
#| cache: true
library(nflfastR)
# Load 2023 data
pbp_2023 <- load_pbp(2023)
# Calculate third down conversion rates by team
third_down_data <- pbp_2023 %>%
filter(down == 3, !is.na(posteam), play_type %in% c("pass", "run")) %>%
mutate(conversion = if_else(first_down_rush == 1 | first_down_pass == 1, 1, 0)) %>%
group_by(posteam) %>%
summarise(
attempts = n(),
conversions = sum(conversion, na.rm = TRUE),
raw_rate = conversions / attempts,
.groups = "drop"
)
# Bayesian estimation with weakly informative prior
# Prior: Beta(4, 6) centered at 0.4 (league average roughly)
alpha_prior <- 4
beta_prior <- 6
third_down_bayes <- third_down_data %>%
mutate(
# Posterior parameters
alpha_post = alpha_prior + conversions,
beta_post = beta_prior + (attempts - conversions),
# Posterior mean (point estimate)
bayes_rate = alpha_post / (alpha_post + beta_post),
# Posterior standard deviation
bayes_sd = sqrt(alpha_post * beta_post /
((alpha_post + beta_post)^2 * (alpha_post + beta_post + 1))),
# 95% credible interval
lower_95 = qbeta(0.025, alpha_post, beta_post),
upper_95 = qbeta(0.975, alpha_post, beta_post)
) %>%
arrange(desc(bayes_rate))
# Show top 10 teams
third_down_bayes %>%
select(posteam, attempts, conversions, raw_rate, bayes_rate, lower_95, upper_95) %>%
head(10) %>%
gt::gt() %>%
gt::cols_label(
posteam = "Team",
attempts = "Attempts",
conversions = "Conv.",
raw_rate = "Raw Rate",
bayes_rate = "Bayes Rate",
lower_95 = "Lower 95%",
upper_95 = "Upper 95%"
) %>%
gt::fmt_number(
columns = c(raw_rate, bayes_rate, lower_95, upper_95),
decimals = 3
) %>%
gt::tab_header(
title = "Third Down Conversion Rates (2023)",
subtitle = "Bayesian Estimates with 95% Credible Intervals"
)
#| label: third-down-analysis-py
#| message: false
#| warning: false
#| cache: true
import nfl_data_py as nfl
# Load 2023 data
pbp_2023 = nfl.import_pbp_data([2023])
# Calculate third down conversion rates by team
third_down_data = (pbp_2023
.query("down == 3 & posteam.notna() & play_type.isin(['pass', 'run'])")
.assign(conversion=lambda x: ((x['first_down_rush'] == 1) |
(x['first_down_pass'] == 1)).astype(int))
.groupby('posteam')
.agg(
attempts=('conversion', 'count'),
conversions=('conversion', 'sum')
)
.reset_index()
.assign(raw_rate=lambda x: x['conversions'] / x['attempts'])
)
# Bayesian estimation with weakly informative prior
# Prior: Beta(4, 6) centered at 0.4 (league average roughly)
alpha_prior = 4
beta_prior = 6
third_down_bayes = (third_down_data
.assign(
alpha_post=lambda x: alpha_prior + x['conversions'],
beta_post=lambda x: beta_prior + (x['attempts'] - x['conversions']),
bayes_rate=lambda x: x['alpha_post'] / (x['alpha_post'] + x['beta_post']),
bayes_sd=lambda x: np.sqrt(
x['alpha_post'] * x['beta_post'] /
((x['alpha_post'] + x['beta_post'])**2 *
(x['alpha_post'] + x['beta_post'] + 1))
),
lower_95=lambda x: stats.beta(x['alpha_post'], x['beta_post']).ppf(0.025),
upper_95=lambda x: stats.beta(x['alpha_post'], x['beta_post']).ppf(0.975)
)
.sort_values('bayes_rate', ascending=False)
)
# Show top 10 teams
print("\nThird Down Conversion Rates (2023)")
print("Bayesian Estimates with 95% Credible Intervals\n")
print(third_down_bayes[['posteam', 'attempts', 'conversions',
'raw_rate', 'bayes_rate', 'lower_95', 'upper_95']]
.head(10)
.to_string(index=False, float_format=lambda x: f'{x:.3f}'))
Notice how Bayesian estimates "shrink" toward the prior mean (0.4), especially for teams with fewer attempts. This is called regularization and prevents overfitting to small samples. A team that converts 8 of 10 third downs (80%) probably isn't truly an 80% team—they likely got lucky in a small sample. The Bayesian estimate of perhaps 55-60% better reflects their true ability, accounting for both the impressive performance and the limited evidence.
This shrinkage is adaptive: it depends on sample size. Teams with 200+ attempts are barely shrunk, while teams with 50 attempts are shrunk substantially. This is exactly what we want—a data-driven regularization that protects us from small-sample noise without ignoring large-sample signal.
Example 2: Comparing Two Teams
One advantage of Bayesian inference is that comparing parameters is straightforward—we can directly compute probabilities like "What's the probability Team A has a higher conversion rate than Team B?"
#| label: team-comparison-r
#| message: false
#| warning: false
# Compare two teams: let's pick two interesting teams
team_a <- "SF"
team_b <- "DAL"
# Get posterior parameters
team_a_params <- third_down_bayes %>%
filter(posteam == team_a) %>%
select(alpha_post, beta_post)
team_b_params <- third_down_bayes %>%
filter(posteam == team_b) %>%
select(alpha_post, beta_post)
# Monte Carlo simulation from posteriors
set.seed(2024)
n_sims <- 10000
theta_a <- rbeta(n_sims, team_a_params$alpha_post, team_a_params$beta_post)
theta_b <- rbeta(n_sims, team_b_params$alpha_post, team_b_params$beta_post)
# Calculate probability that Team A > Team B
prob_a_better <- mean(theta_a > theta_b)
# Expected difference
diff_samples <- theta_a - theta_b
# Visualize
tibble(
Team_A = theta_a,
Team_B = theta_b,
Difference = diff_samples
) %>%
pivot_longer(everything(), names_to = "Parameter", values_to = "Value") %>%
ggplot(aes(x = Value, fill = Parameter)) +
geom_density(alpha = 0.6) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
scale_fill_manual(
values = c("Team_A" = "#AA0000", "Team_B" = "#003594",
"Difference" = "#999999"),
labels = c(paste(team_a, "3rd Down %"),
paste(team_b, "3rd Down %"),
"Difference (SF - DAL)")
) +
labs(
title = "Posterior Distributions for Third Down Conversion Rate",
subtitle = sprintf("P(%s > %s) = %.3f", team_a, team_b, prob_a_better),
x = "Conversion Rate",
y = "Density",
fill = NULL
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top"
)
# Print summary
cat(sprintf("\nComparison Summary:\n"))
cat(sprintf("P(%s > %s) = %.3f\n", team_a, team_b, prob_a_better))
cat(sprintf("Expected difference: %.3f\n", mean(diff_samples)))
cat(sprintf("95%% CI for difference: [%.3f, %.3f]\n",
quantile(diff_samples, 0.025),
quantile(diff_samples, 0.975)))
#| label: team-comparison-py
#| message: false
#| warning: false
# Compare two teams
team_a = "SF"
team_b = "DAL"
# Get posterior parameters
team_a_params = third_down_bayes[third_down_bayes['posteam'] == team_a][['alpha_post', 'beta_post']].iloc[0]
team_b_params = third_down_bayes[third_down_bayes['posteam'] == team_b][['alpha_post', 'beta_post']].iloc[0]
# Monte Carlo simulation from posteriors
np.random.seed(2024)
n_sims = 10000
theta_a = stats.beta(team_a_params['alpha_post'], team_a_params['beta_post']).rvs(n_sims)
theta_b = stats.beta(team_b_params['alpha_post'], team_b_params['beta_post']).rvs(n_sims)
# Calculate probability that Team A > Team B
prob_a_better = np.mean(theta_a > theta_b)
# Expected difference
diff_samples = theta_a - theta_b
# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(theta_a, bins=50, alpha=0.6, density=True,
label=f'{team_a} 3rd Down %', color='#AA0000')
ax.hist(theta_b, bins=50, alpha=0.6, density=True,
label=f'{team_b} 3rd Down %', color='#003594')
ax.axvline(x=0, color='black', linestyle='--', alpha=0.7)
ax.set_xlabel('Conversion Rate', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title(f'Posterior Distributions for Third Down Conversion Rate\n' +
f'P({team_a} > {team_b}) = {prob_a_better:.3f}',
fontsize=14, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()
# Print summary
print(f"\nComparison Summary:")
print(f"P({team_a} > {team_b}) = {prob_a_better:.3f}")
print(f"Expected difference: {diff_samples.mean():.3f}")
print(f"95% CI for difference: [{np.percentile(diff_samples, 2.5):.3f}, " +
f"{np.percentile(diff_samples, 97.5):.3f}]")
This probabilistic comparison is more nuanced than a simple hypothesis test. We get a direct answer to "How much better is Team A?" with full uncertainty quantification.
Hierarchical Models for Football Data
Hierarchical (multilevel) models are ideal for football's nested structure: players within teams, teams within divisions, games within seasons. They provide a principled way to share information across groups while respecting individual differences. This is perhaps the most powerful application of Bayesian methods to football analytics.
The fundamental insight is that football data exhibits natural hierarchies. Players are nested within teams. Teams are nested within divisions. Games are nested within seasons. When we estimate player performance, we should use not just that player's data, but also information from similar players, players on the same team, and the league as a whole. Hierarchical models formalize this intuition.
Why Hierarchical Models?
Consider the problem of rating NFL quarterbacks. Every season, we have about 40-50 QBs who throw meaningful numbers of passes. Some are starters with 500+ attempts, others are backups with 50 attempts. How should we estimate their abilities?
We could:
- Complete pooling: Assume all QBs have the same ability (clearly wrong—Patrick Mahomes is not the same as a third-string backup)
- No pooling: Estimate each QB independently using only his data (inefficient and noisy, especially for backups with limited attempts)
- Partial pooling (hierarchical): Share information across QBs while allowing individual differences (optimal balance)
Hierarchical models implement partial pooling through a two-level structure:
Level 1 (Individual): Each QB has his own true ability $\mu_i$
Level 2 (Group): All QB abilities come from a common distribution with mean $\mu_{\text{league}}$ and spread $\sigma_{\text{QB}}$
This creates automatic regularization that:
- Shrinks estimates with less data toward the group mean (protecting against noise)
- Respects estimates with more data (trusting solid evidence)
- Quantifies both within-group variation (play-to-play noise) and between-group variation (true differences in QB ability)
- Handles imbalanced data naturally (starters and backups in the same model)
- Pools information across players (backup QBs borrow strength from similar players)
The Shrinkage Principle
Shrinkage is not a flaw—it's a feature! When we have limited data, extreme estimates are usually wrong. A backup QB who averages 0.30 EPA over 50 attempts probably isn't an elite QB who just never got a chance. He's more likely an average-ish backup who had a lucky stretch. Hierarchical models automatically and optimally shrink such estimates toward the mean, with the amount of shrinkage determined by: 1. **Sample size**: More data = less shrinkage 2. **Group variance**: More spread in true abilities = less shrinkage 3. **Within-group noise**: Higher play-to-play variation = more shrinkage This is exactly what domain experts do intuitively—hierarchical models formalize it mathematically.Example: Hierarchical QB Rating Model
We'll build a hierarchical model for QB performance measured by EPA per play.
Model Structure:
$$ \begin{align} \text{EPA}_{ij} &\sim \text{Normal}(\mu_i, \sigma_y) && \text{[Likelihood]} \\ \mu_i &\sim \text{Normal}(\mu_{\text{league}}, \sigma_{\text{QB}}) && \text{[QB effects]} \\ \mu_{\text{league}} &\sim \text{Normal}(0.05, 0.5) && \text{[League mean prior]} \\ \sigma_y &\sim \text{Exponential}(1/2) && \text{[Within-QB variation]} \\ \sigma_{\text{QB}} &\sim \text{Exponential}(1/0.5) && \text{[Between-QB variation]} \end{align} $$
Where:
- $\text{EPA}_{ij}$ is EPA for QB $i$ on play $j$
- $\mu_i$ is the true ability of QB $i$
- $\mu_{\text{league}}$ is the league average
- $\sigma_{\text{QB}}$ controls how much QBs differ from league average
- $\sigma_y$ is play-to-play variation
#| label: hierarchical-qb-r
#| message: false
#| warning: false
#| cache: true
#| results: hide
library(brms)
library(tidybayes)
# Prepare QB data (regular passers with at least 100 attempts)
qb_data <- pbp_2023 %>%
filter(
!is.na(epa),
play_type == "pass",
!is.na(passer_player_name),
sack == 0
) %>%
group_by(passer_player_name, passer_player_id) %>%
filter(n() >= 100) %>%
ungroup() %>%
select(passer_player_name, passer_player_id, epa)
# Fit hierarchical model
qb_model <- brm(
epa ~ (1 | passer_player_name),
data = qb_data,
family = gaussian(),
prior = c(
prior(normal(0.05, 0.5), class = Intercept),
prior(exponential(2), class = sd),
prior(exponential(0.5), class = sigma)
),
chains = 4,
iter = 2000,
warmup = 1000,
cores = 4,
seed = 2024,
backend = "cmdstanr",
silent = 2,
refresh = 0
)
#| label: hierarchical-qb-r-results
#| message: false
#| warning: false
# Extract QB-specific intercepts (random effects)
qb_effects <- qb_model %>%
spread_draws(r_passer_player_name[passer_player_name, ]) %>%
mean_qi(.width = 0.95) %>%
rename(qb_effect = r_passer_player_name)
# Get league-level intercept
league_mean <- fixef(qb_model)[1, "Estimate"]
# Add raw means for comparison
qb_summary <- qb_data %>%
group_by(passer_player_name) %>%
summarise(
n_plays = n(),
raw_epa = mean(epa),
.groups = "drop"
) %>%
left_join(qb_effects, by = "passer_player_name") %>%
mutate(
hierarchical_epa = league_mean + qb_effect,
shrinkage = raw_epa - hierarchical_epa
) %>%
arrange(desc(hierarchical_epa))
# Plot comparison: raw vs hierarchical estimates
qb_summary %>%
filter(n_plays >= 200) %>%
head(15) %>%
ggplot(aes(y = reorder(passer_player_name, hierarchical_epa))) +
geom_point(aes(x = raw_epa, color = "Raw"), size = 3) +
geom_point(aes(x = hierarchical_epa, color = "Hierarchical"), size = 3) +
geom_segment(aes(x = raw_epa, xend = hierarchical_epa,
yend = reorder(passer_player_name, hierarchical_epa)),
arrow = arrow(length = unit(0.2, "cm"))) +
scale_color_manual(values = c("Raw" = "#F8766D", "Hierarchical" = "#00BA38")) +
labs(
title = "QB EPA: Raw vs Hierarchical Estimates",
subtitle = "Arrows show shrinkage toward league mean",
x = "EPA per Play",
y = NULL,
color = "Estimate Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top"
)
# Show top QBs
qb_summary %>%
head(10) %>%
select(passer_player_name, n_plays, raw_epa, hierarchical_epa, .lower, .upper) %>%
gt::gt() %>%
gt::cols_label(
passer_player_name = "QB",
n_plays = "Plays",
raw_epa = "Raw EPA",
hierarchical_epa = "Bayes EPA",
.lower = "Lower 95%",
.upper = "Upper 95%"
) %>%
gt::fmt_number(
columns = c(raw_epa, hierarchical_epa, .lower, .upper),
decimals = 3
) %>%
gt::tab_header(
title = "Top 10 QBs by Hierarchical EPA (2023)",
subtitle = "With 95% Credible Intervals"
)
#| label: hierarchical-qb-py
#| message: false
#| warning: false
#| cache: true
import pymc as pm
import arviz as az
# Prepare QB data (regular passers with at least 100 attempts)
qb_data_py = (pbp_2023
.query("epa.notna() & play_type == 'pass' & passer_player_name.notna() & sack == 0")
.groupby('passer_player_name')
.filter(lambda x: len(x) >= 100)
.reset_index(drop=True)
)
# Create numeric indices for QBs
qb_data_py['qb_idx'] = pd.Categorical(qb_data_py['passer_player_name']).codes
n_qbs = qb_data_py['qb_idx'].nunique()
# Build hierarchical model
with pm.Model() as qb_model_py:
# Hyperpriors
mu_league = pm.Normal('mu_league', mu=0.05, sigma=0.5)
sigma_qb = pm.Exponential('sigma_qb', lam=2.0)
sigma_y = pm.Exponential('sigma_y', lam=0.5)
# QB-specific effects
qb_effect = pm.Normal('qb_effect', mu=0, sigma=sigma_qb, shape=n_qbs)
# Expected value for each play
mu = mu_league + qb_effect[qb_data_py['qb_idx'].values]
# Likelihood
epa_obs = pm.Normal('epa_obs', mu=mu, sigma=sigma_y,
observed=qb_data_py['epa'].values)
# Sample from posterior
trace = pm.sample(2000, tune=1000, chains=4, cores=4,
random_seed=2024, return_inferencedata=True,
progressbar=False)
# Extract QB effects
qb_names = pd.Categorical(qb_data_py['passer_player_name']).categories
qb_effects_post = trace.posterior['qb_effect'].values.reshape(-1, n_qbs)
league_mean_post = trace.posterior['mu_league'].values.flatten()
# Calculate hierarchical EPA for each QB
qb_results = pd.DataFrame({
'passer_player_name': qb_names,
'qb_effect_mean': qb_effects_post.mean(axis=0),
'qb_effect_lower': np.percentile(qb_effects_post, 2.5, axis=0),
'qb_effect_upper': np.percentile(qb_effects_post, 97.5, axis=0)
}).assign(
hierarchical_epa=lambda x: league_mean_post.mean() + x['qb_effect_mean']
)
# Add raw EPA
raw_epa = (qb_data_py
.groupby('passer_player_name')
.agg(n_plays=('epa', 'count'), raw_epa=('epa', 'mean'))
.reset_index()
)
qb_summary_py = qb_results.merge(raw_epa, on='passer_player_name').sort_values(
'hierarchical_epa', ascending=False
)
# Show top 10 QBs
print("\nTop 10 QBs by Hierarchical EPA (2023)")
print("With 95% Credible Intervals\n")
print(qb_summary_py[['passer_player_name', 'n_plays', 'raw_epa',
'hierarchical_epa', 'qb_effect_lower', 'qb_effect_upper']]
.head(10)
.to_string(index=False, float_format=lambda x: f'{x:.3f}'))
The hierarchical model shrinks extreme estimates (especially for QBs with fewer plays) toward the league mean. This is automatic regularization: QBs with strong evidence stay close to their raw EPA, while those with weak evidence are pulled toward average. This protects us from ranking backup QBs with lucky small samples ahead of established starters with larger, more representative samples.
Visualizing Hierarchical Shrinkage
#| label: fig-shrinkage-r
#| fig-cap: "Hierarchical shrinkage effect on QB EPA estimates"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
qb_summary %>%
ggplot(aes(x = n_plays, y = shrinkage)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_point(aes(size = abs(shrinkage), color = shrinkage > 0), alpha = 0.7) +
geom_smooth(method = "loess", se = TRUE, color = "black") +
scale_color_manual(
values = c("TRUE" = "#F8766D", "FALSE" = "#00BA38"),
labels = c("Shrunk Down", "Shrunk Up")
) +
scale_size_continuous(range = c(1, 5), guide = "none") +
labs(
title = "Hierarchical Shrinkage Effect by Sample Size",
subtitle = "QBs with fewer plays are shrunk more toward league mean",
x = "Number of Pass Plays",
y = "Shrinkage (Raw EPA - Hierarchical EPA)",
color = "Direction"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top"
)
#| label: fig-shrinkage-py
#| fig-cap: "Hierarchical shrinkage effect on QB EPA estimates (Python)"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
# Calculate shrinkage
qb_summary_py['shrinkage'] = qb_summary_py['raw_epa'] - qb_summary_py['hierarchical_epa']
# Create plot
fig, ax = plt.subplots(figsize=(10, 6))
# Separate shrinkage up and down
shrink_up = qb_summary_py[qb_summary_py['shrinkage'] > 0]
shrink_down = qb_summary_py[qb_summary_py['shrinkage'] <= 0]
ax.scatter(shrink_up['n_plays'], shrink_up['shrinkage'],
s=np.abs(shrink_up['shrinkage'])*500, alpha=0.7,
color='#F8766D', label='Shrunk Down')
ax.scatter(shrink_down['n_plays'], shrink_down['shrinkage'],
s=np.abs(shrink_down['shrinkage'])*500, alpha=0.7,
color='#00BA38', label='Shrunk Up')
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
ax.set_xlabel('Number of Pass Plays', fontsize=12)
ax.set_ylabel('Shrinkage (Raw EPA - Hierarchical EPA)', fontsize=12)
ax.set_title('Hierarchical Shrinkage Effect by Sample Size\n' +
'QBs with fewer plays are shrunk more toward league mean',
fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Markov Chain Monte Carlo (MCMC)
For most real-world Bayesian models, we cannot compute the posterior analytically. The Beta-Binomial conjugacy we used earlier is a special case—most models don't have this convenient property. Instead, we use Markov Chain Monte Carlo (MCMC) algorithms to generate samples from the posterior distribution.
MCMC is both conceptually elegant and practically powerful. Rather than computing the posterior distribution directly (which requires solving difficult integrals), MCMC generates samples from the posterior. These samples can then be used to approximate any posterior quantity we care about—means, medians, quantiles, probabilities, predictions, etc.
Why MCMC?
The core challenge in Bayesian inference is computing the posterior:
$$ P(\theta | D) = \frac{P(D | \theta) P(\theta)}{\int P(D | \theta') P(\theta') d\theta'} $$
The numerator $P(D | \theta) P(\theta)$ is usually easy to compute—we specify the likelihood and prior. The denominator $\int P(D | \theta') P(\theta') d\theta'$ is the problem. This integral:
- Marginalizes over all possible parameter values
- Sums/integrates in potentially high-dimensional spaces
- Rarely has a closed form except for conjugate prior families
- Cannot be computed numerically when models have many parameters
For example, our hierarchical QB model has 40+ parameters (one per QB plus hyperparameters). The denominator requires integrating over a 40+ dimensional space—computationally infeasible with standard numerical integration.
MCMC cleverly avoids computing this integral by generating samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}$ that follow the posterior distribution without ever computing the normalizing constant. The key insight: we only need to evaluate the numerator (likelihood × prior) at different parameter values. MCMC algorithms construct a Markov chain whose stationary distribution is the posterior, so after a burn-in period, samples from the chain are samples from the posterior.
Monte Carlo: Named After a Casino
The term "Monte Carlo" comes from the Monte Carlo Casino in Monaco and refers to using randomness to solve deterministic problems. The "Markov Chain" part refers to the sequential nature of the sampling—each new sample depends only on the previous sample, not the entire history (the Markov property). Together, MCMC uses random walks (Markov chains) to generate samples (Monte Carlo) from complex distributions.Common MCMC Algorithms
- Metropolis-Hastings: Classic algorithm, proposes moves and accepts/rejects
- Gibbs Sampling: Updates one parameter at a time conditionally
- Hamiltonian Monte Carlo (HMC): Uses gradient information (Stan, PyMC)
- NUTS (No-U-Turn Sampler): Adaptive version of HMC (default in Stan/PyMC)
Modern tools like Stan and PyMC use NUTS, which is highly efficient for complex models.
MCMC Diagnostics
After running MCMC, we must check:
- Convergence: Did chains reach the stationary distribution?
- Mixing: Do chains explore the posterior efficiently?
- Autocorrelation: How independent are sequential samples?
#| label: mcmc-diagnostics-r
#| fig-cap: "MCMC diagnostics for QB hierarchical model"
#| fig-width: 10
#| fig-height: 8
#| message: false
#| warning: false
# Trace plots for key parameters
plot(qb_model, variable = c("b_Intercept", "sd_passer_player_name__Intercept", "sigma"),
ask = FALSE)
# Summary with convergence diagnostics
summary(qb_model)
# More detailed diagnostics
# R-hat: should be < 1.01 (measures convergence across chains)
# Effective sample size: should be > 400 per chain ideally
cat("\nConvergence Diagnostics:\n")
cat(sprintf("All R-hat values < 1.01: %s\n",
all(rhat(qb_model) < 1.01, na.rm = TRUE)))
cat(sprintf("Min effective sample size: %.0f\n",
min(neff_ratio(qb_model) * 4000, na.rm = TRUE)))
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
#| label: mcmc-diagnostics-py
#| fig-cap: "MCMC diagnostics for QB hierarchical model (Python)"
#| fig-width: 10
#| fig-height: 8
#| message: false
#| warning: false
# Trace plots
az.plot_trace(trace, var_names=['mu_league', 'sigma_qb', 'sigma_y'])
plt.tight_layout()
plt.show()
# Summary statistics
print("\nPosterior Summary:")
print(az.summary(trace, var_names=['mu_league', 'sigma_qb', 'sigma_y']))
# Convergence diagnostics
print("\nConvergence Diagnostics:")
summary_df = az.summary(trace, var_names=['mu_league', 'sigma_qb', 'sigma_y'])
print(f"All R-hat values < 1.01: {all(summary_df['r_hat'] < 1.01)}")
print(f"Min effective sample size: {summary_df['ess_bulk'].min():.0f}")
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
Key Diagnostics:
- R-hat: Measures convergence across chains. Should be < 1.01.
- Effective Sample Size (ESS): Accounts for autocorrelation. Want > 400 per chain.
- Trace plots: Should look like "hairy caterpillars" with no trends or stuck sections
- Divergences: Warnings about numerical issues. Should be 0 or very few.
Bayesian Regression for Football Metrics
Linear regression is a workhorse of football analytics. Bayesian regression provides:
- Full posterior distributions for coefficients
- Natural regularization through priors
- Principled uncertainty quantification
- Easy incorporation of domain knowledge
Example: Predicting EPA from Play Characteristics
Let's build a Bayesian regression model predicting EPA from:
- Down (1st, 2nd, 3rd, 4th)
- Yards to go
- Field position (yardline)
- Score differential
- Time remaining
#| label: bayesian-regression-r
#| message: false
#| warning: false
#| cache: true
#| results: hide
# Prepare regression data
reg_data <- pbp_2023 %>%
filter(
!is.na(epa),
play_type %in% c("pass", "run"),
!is.na(down), !is.na(ydstogo), !is.na(yardline_100),
!is.na(score_differential)
) %>%
mutate(
down = factor(down),
qtr = factor(qtr),
# Standardize continuous predictors for better sampling
ydstogo_std = scale(ydstogo)[,1],
yardline_std = scale(yardline_100)[,1],
score_diff_std = scale(score_differential)[,1]
) %>%
select(epa, down, ydstogo_std, yardline_std, score_diff_std)
# Fit Bayesian regression
epa_reg_model <- brm(
epa ~ down + ydstogo_std + yardline_std + score_diff_std,
data = reg_data,
family = gaussian(),
prior = c(
prior(normal(0, 1), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(0.5), class = sigma)
),
chains = 4,
iter = 2000,
warmup = 1000,
cores = 4,
seed = 2024,
backend = "cmdstanr",
silent = 2,
refresh = 0
)
#| label: bayesian-regression-r-results
#| message: false
#| warning: false
# Coefficient plot with uncertainty
mcmc_plot(epa_reg_model, type = "intervals", prob = 0.95) +
labs(
title = "Bayesian Regression Coefficients for EPA",
subtitle = "95% Credible Intervals",
x = "Coefficient Value"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14))
# Detailed summary
summary(epa_reg_model)
# Posterior predictive check
pp_check(epa_reg_model, ndraws = 100) +
labs(title = "Posterior Predictive Check",
subtitle = "Model predictions (light blue) vs observed data (dark blue)")
#| label: bayesian-regression-py
#| message: false
#| warning: false
#| cache: true
# Prepare regression data
reg_data_py = (pbp_2023
.query("epa.notna() & play_type.isin(['pass', 'run']) & " +
"down.notna() & ydstogo.notna() & yardline_100.notna() & " +
"score_differential.notna()")
.assign(
down_cat=lambda x: pd.Categorical(x['down']),
down_idx=lambda x: x['down_cat'].codes,
ydstogo_std=lambda x: (x['ydstogo'] - x['ydstogo'].mean()) / x['ydstogo'].std(),
yardline_std=lambda x: (x['yardline_100'] - x['yardline_100'].mean()) / x['yardline_100'].std(),
score_diff_std=lambda x: (x['score_differential'] - x['score_differential'].mean()) / x['score_differential'].std()
)
)
# Build Bayesian regression model
with pm.Model() as epa_reg_model_py:
# Priors
intercept = pm.Normal('intercept', mu=0, sigma=1)
beta_down = pm.Normal('beta_down', mu=0, sigma=0.5, shape=3) # 3 down dummies
beta_ydstogo = pm.Normal('beta_ydstogo', mu=0, sigma=0.5)
beta_yardline = pm.Normal('beta_yardline', mu=0, sigma=0.5)
beta_score = pm.Normal('beta_score', mu=0, sigma=0.5)
sigma = pm.Exponential('sigma', lam=0.5)
# Linear model
mu = (intercept +
beta_down[reg_data_py['down_idx'].values] +
beta_ydstogo * reg_data_py['ydstogo_std'].values +
beta_yardline * reg_data_py['yardline_std'].values +
beta_score * reg_data_py['score_diff_std'].values)
# Likelihood
epa_obs = pm.Normal('epa_obs', mu=mu, sigma=sigma,
observed=reg_data_py['epa'].values)
# Sample
trace_reg = pm.sample(2000, tune=1000, chains=4, cores=4,
random_seed=2024, return_inferencedata=True,
progressbar=False)
# Plot coefficients
az.plot_forest(trace_reg, var_names=['beta_ydstogo', 'beta_yardline', 'beta_score'],
combined=True, hdi_prob=0.95)
plt.title('Bayesian Regression Coefficients for EPA\n95% Credible Intervals',
fontweight='bold')
plt.tight_layout()
plt.show()
# Summary
print("\nPosterior Summary:")
print(az.summary(trace_reg, var_names=['intercept', 'beta_ydstogo',
'beta_yardline', 'beta_score']))
Interpreting Results:
- Yards to go: Negative coefficient (longer distance = lower EPA)
- Yardline: Positive coefficient (closer to end zone = higher EPA)
- Score differential: Tells us about game script effects
- Down effects: Compare downs' impact on EPA
Unlike frequentist regression, we get full probability distributions for each coefficient, allowing statements like "There's a 95% probability the yardline effect is between 0.15 and 0.21."
Model Comparison and Selection
How do we choose between competing models? Bayesian methods offer several approaches:
1. Leave-One-Out Cross-Validation (LOO-CV)
LOO approximates leave-one-out cross-validation using Pareto-smoothed importance sampling (PSIS). It provides:
- ELPD (Expected Log Predictive Density): Higher is better
- LOO-IC: Lower is better (analogous to AIC/BIC)
- Pareto k diagnostics: Identifies influential observations
2. WAIC (Widely Applicable Information Criterion)
WAIC is a fully Bayesian alternative to AIC:
$$ \text{WAIC} = -2 \sum_{i=1}^{n} \log \left( \mathbb{E}_{\text{post}}[p(y_i | \theta)] \right) + 2 \sum_{i=1}^{n} \text{Var}_{\text{post}}[\log p(y_i | \theta)] $$
Lower WAIC indicates better predictive performance.
#| label: model-comparison-r
#| message: false
#| warning: false
#| cache: true
# Fit simpler model without down
epa_reg_model_simple <- brm(
epa ~ ydstogo_std + yardline_std + score_diff_std,
data = reg_data,
family = gaussian(),
prior = c(
prior(normal(0, 1), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(0.5), class = sigma)
),
chains = 4,
iter = 2000,
warmup = 1000,
cores = 4,
seed = 2024,
backend = "cmdstanr",
silent = 2,
refresh = 0
)
# Compute LOO for both models
loo_full <- loo(epa_reg_model)
loo_simple <- loo(epa_reg_model_simple)
# Compare models
loo_compare(loo_full, loo_simple)
# Also compute WAIC
waic_full <- waic(epa_reg_model)
waic_simple <- waic(epa_reg_model_simple)
# Display comparison
cat("\nModel Comparison (LOO):\n")
print(loo_compare(loo_full, loo_simple))
cat("\n\nModel Comparison (WAIC):\n")
cat(sprintf("Full model WAIC: %.2f\n", waic_full$estimates["waic", "Estimate"]))
cat(sprintf("Simple model WAIC: %.2f\n", waic_simple$estimates["waic", "Estimate"]))
cat(sprintf("Difference: %.2f (lower is better)\n",
waic_simple$estimates["waic", "Estimate"] -
waic_full$estimates["waic", "Estimate"]))
#| label: model-comparison-py
#| message: false
#| warning: false
#| cache: true
# Fit simpler model without down effects
with pm.Model() as epa_reg_model_simple_py:
# Priors
intercept = pm.Normal('intercept', mu=0, sigma=1)
beta_ydstogo = pm.Normal('beta_ydstogo', mu=0, sigma=0.5)
beta_yardline = pm.Normal('beta_yardline', mu=0, sigma=0.5)
beta_score = pm.Normal('beta_score', mu=0, sigma=0.5)
sigma = pm.Exponential('sigma', lam=0.5)
# Linear model (no down effects)
mu = (intercept +
beta_ydstogo * reg_data_py['ydstogo_std'].values +
beta_yardline * reg_data_py['yardline_std'].values +
beta_score * reg_data_py['score_diff_std'].values)
# Likelihood
epa_obs = pm.Normal('epa_obs', mu=mu, sigma=sigma,
observed=reg_data_py['epa'].values)
# Sample
trace_reg_simple = pm.sample(2000, tune=1000, chains=4, cores=4,
random_seed=2024, return_inferencedata=True,
progressbar=False)
# Compute LOO for both models
loo_full = az.loo(trace_reg)
loo_simple = az.loo(trace_reg_simple)
# Compare models
loo_compare = az.compare({'Full': trace_reg, 'Simple': trace_reg_simple})
print("\nModel Comparison (LOO-CV):")
print(loo_compare)
# Also compute WAIC
waic_full = az.waic(trace_reg)
waic_simple = az.waic(trace_reg_simple)
print(f"\nFull model WAIC: {waic_full.waic:.2f}")
print(f"Simple model WAIC: {waic_simple.waic:.2f}")
print(f"Difference: {waic_simple.waic - waic_full.waic:.2f} (lower is better)")
The full model (with down effects) should substantially outperform the simple model, confirming that down is an important predictor of EPA.
Uncertainty Quantification
A key advantage of Bayesian inference is comprehensive uncertainty quantification. We can:
- Credible intervals: "There's a 95% probability the parameter is in this range"
- Posterior predictive distributions: Predictions with uncertainty
- Probability statements: Direct answers to "What's the probability that...?"
Posterior Predictive Distributions
The posterior predictive distribution answers: "Given our model and observed data, what do we predict for new data?"
$$ p(\tilde{y} | D) = \int p(\tilde{y} | \theta) p(\theta | D) d\theta $$
This integrates over all posterior uncertainty in $\theta$.
#| label: posterior-predictive-r
#| fig-cap: "Posterior predictive distribution for EPA"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
# Generate posterior predictions for specific scenarios
new_scenarios <- tibble(
scenario = c("1st & 10, Own 25", "3rd & 5, Opp 30", "2nd & 15, Midfield"),
down = factor(c(1, 3, 2)),
ydstogo_std = scale(c(10, 5, 15),
center = attr(reg_data$ydstogo_std, "scaled:center"),
scale = attr(reg_data$ydstogo_std, "scaled:scale"))[,1],
yardline_std = scale(c(75, 30, 50),
center = attr(reg_data$yardline_std, "scaled:center"),
scale = attr(reg_data$yardline_std, "scaled:scale"))[,1],
score_diff_std = scale(c(0, 0, 0),
center = attr(reg_data$score_diff_std, "scaled:center"),
scale = attr(reg_data$score_diff_std, "scaled:scale"))[,1]
)
# Generate posterior predictions
posterior_preds <- posterior_predict(epa_reg_model, newdata = new_scenarios)
# Convert to tidy format
pred_df <- as_tibble(t(posterior_preds)) %>%
set_names(new_scenarios$scenario) %>%
pivot_longer(everything(), names_to = "scenario", values_to = "epa_pred")
# Visualize
pred_df %>%
ggplot(aes(x = epa_pred, fill = scenario)) +
geom_density(alpha = 0.6) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Posterior Predictive Distributions for EPA",
subtitle = "Predictions with full uncertainty for different scenarios",
x = "Predicted EPA",
y = "Density",
fill = "Scenario"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top"
)
# Summary statistics
pred_summary <- pred_df %>%
group_by(scenario) %>%
summarise(
mean = mean(epa_pred),
median = median(epa_pred),
lower_95 = quantile(epa_pred, 0.025),
upper_95 = quantile(epa_pred, 0.975),
prob_positive = mean(epa_pred > 0),
.groups = "drop"
)
pred_summary %>%
gt::gt() %>%
gt::cols_label(
scenario = "Scenario",
mean = "Mean",
median = "Median",
lower_95 = "Lower 95%",
upper_95 = "Upper 95%",
prob_positive = "P(EPA > 0)"
) %>%
gt::fmt_number(
columns = c(mean, median, lower_95, upper_95, prob_positive),
decimals = 3
) %>%
gt::tab_header(
title = "Posterior Predictive Summary"
)
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
#| label: posterior-predictive-py
#| fig-cap: "Posterior predictive distribution for EPA (Python)"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
# Generate posterior predictions for specific scenarios
ydstogo_mean = reg_data_py['ydstogo'].mean()
ydstogo_std_val = reg_data_py['ydstogo'].std()
yardline_mean = reg_data_py['yardline_100'].mean()
yardline_std_val = reg_data_py['yardline_100'].std()
score_mean = reg_data_py['score_differential'].mean()
score_std_val = reg_data_py['score_differential'].std()
scenarios = [
{"name": "1st & 10, Own 25", "down": 0, "ydstogo": 10, "yardline": 75, "score": 0},
{"name": "3rd & 5, Opp 30", "down": 2, "ydstogo": 5, "yardline": 30, "score": 0},
{"name": "2nd & 15, Midfield", "down": 1, "ydstogo": 15, "yardline": 50, "score": 0}
]
# Generate predictions
fig, ax = plt.subplots(figsize=(10, 6))
for scenario in scenarios:
# Standardize inputs
ydstogo_std = (scenario["ydstogo"] - ydstogo_mean) / ydstogo_std_val
yardline_std = (scenario["yardline"] - yardline_mean) / yardline_std_val
score_std = (scenario["score"] - score_mean) / score_std_val
# Sample from posterior predictive
intercept_samples = trace_reg.posterior['intercept'].values.flatten()
beta_down_samples = trace_reg.posterior['beta_down'].values.reshape(-1, 3)[:, scenario["down"]]
beta_ydstogo_samples = trace_reg.posterior['beta_ydstogo'].values.flatten()
beta_yardline_samples = trace_reg.posterior['beta_yardline'].values.flatten()
beta_score_samples = trace_reg.posterior['beta_score'].values.flatten()
sigma_samples = trace_reg.posterior['sigma'].values.flatten()
# Posterior predictive samples
mu_samples = (intercept_samples + beta_down_samples +
beta_ydstogo_samples * ydstogo_std +
beta_yardline_samples * yardline_std +
beta_score_samples * score_std)
# Add sampling variation
pred_samples = np.random.normal(mu_samples, sigma_samples)
# Plot
ax.hist(pred_samples, bins=50, alpha=0.6, density=True, label=scenario["name"])
ax.axvline(x=0, color='black', linestyle='--', alpha=0.7)
ax.set_xlabel('Predicted EPA', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Posterior Predictive Distributions for EPA\n' +
'Predictions with full uncertainty for different scenarios',
fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
These distributions show not just point predictions, but the full range of plausible outcomes accounting for all sources of uncertainty.
Prior Selection and Sensitivity Analysis
Choosing priors is both an art and science, and it's often cited as the most subjective aspect of Bayesian analysis. However, with football data, we usually have substantial domain knowledge that makes prior selection straightforward. The key is to be transparent about prior choices and assess their influence on conclusions.
Prior selection involves balancing several competing concerns:
- Incorporating knowledge: Using what we know about football to improve estimates
- Avoiding overconfidence: Not being so strong that data can't change our minds
- Ensuring computational stability: Ruling out impossible or implausible values
- Maintaining transparency: Being clear about assumptions
Types of Priors
The spectrum of prior informativeness ranges from very strong (encoding substantial knowledge) to very weak (minimal influence). Here's how to think about different types:
-
Informative priors: Encode substantial domain knowledge with meaningful precision
- Example: QB EPA likely between -0.2 and 0.4 based on historical data
- When to use: You have strong prior information (historical data, expert consensus)
- Risk: If prior is wrong, it can bias conclusions
- Benefit: Improves estimates with limited data, incorporates decades of football knowledge -
Weakly informative priors: Provide gentle regularization without strong constraints
- Example: Normal(0, 1) on standardized predictors
- When to use: Default choice when you want some regularization but not strong assumptions
- Risk: Minimal—these priors rule out extreme values but let data speak
- Benefit: Stabilizes estimation, prevents nonsensical values, improves out-of-sample prediction -
Uninformative/vague priors: Minimal influence on posterior, "let the data decide"
- Example: Uniform(-∞, ∞) or Normal(0, 1000)
- When to use: Rarely! These can cause numerical problems
- Risk: No regularization, models may not converge, extreme estimates possible
- Benefit: None in practice—weakly informative priors are always better
The Myth of "Objective" Priors
Some practitioners seek "objective" or "uninformative" priors to avoid "subjective" Bayesian choices. This is misguided for several reasons: 1. **No prior is truly uninformative**: Even a "flat" prior encodes assumptions (e.g., all values equally likely) 2. **Vague priors cause problems**: They can lead to poor sampling, nonsensical estimates, and worse predictions 3. **Domain knowledge is valuable**: Football has been analyzed for decades—ignoring this knowledge is wasteful 4. **Frequentist methods aren't "objective"**: They also make assumptions, just implicitly The solution: Use weakly informative priors based on domain knowledge, and check sensitivity to prior choices.Prior Predictive Checks
Before fitting the model to data, you should always perform prior predictive checks: simulate data from your prior to ensure it produces reasonable predictions. This reveals whether your priors are sensible before you invest time in inference.
The logic: If your prior predicts EPA values of ±50 or field goal success rates of 10%, something is wrong with your prior specification. Fix it before seeing the data.
Here's how to conduct prior predictive checks:
#| label: prior-predictive-r
#| fig-cap: "Prior predictive check"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
# Sample from prior predictive distribution (before seeing data)
prior_pred <- brm(
epa ~ down + ydstogo_std + yardline_std + score_diff_std,
data = reg_data,
family = gaussian(),
prior = c(
prior(normal(0, 1), class = Intercept),
prior(normal(0, 0.5), class = b),
prior(exponential(0.5), class = sigma)
),
sample_prior = "only", # Sample only from prior, ignore data
chains = 4,
iter = 2000,
cores = 4,
seed = 2024,
backend = "cmdstanr",
silent = 2,
refresh = 0
)
# Generate predictions from prior
prior_preds <- posterior_predict(prior_pred)
# Compare prior predictions to observed data
tibble(
epa_sim = prior_preds[sample(1:nrow(prior_preds), 1000), 1]
) %>%
ggplot(aes(x = epa_sim)) +
geom_density(fill = "gray", alpha = 0.6) +
geom_density(data = reg_data, aes(x = epa), fill = "blue", alpha = 0.4) +
scale_x_continuous(limits = c(-10, 10)) +
labs(
title = "Prior Predictive Check",
subtitle = "Gray = Prior predictions, Blue = Observed data",
x = "EPA",
y = "Density"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14))
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
#| label: prior-predictive-py
#| fig-cap: "Prior predictive check (Python)"
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
# Sample from prior predictive distribution
with pm.Model() as prior_model:
# Priors (same as before)
intercept = pm.Normal('intercept', mu=0, sigma=1)
beta_down = pm.Normal('beta_down', mu=0, sigma=0.5, shape=3)
beta_ydstogo = pm.Normal('beta_ydstogo', mu=0, sigma=0.5)
beta_yardline = pm.Normal('beta_yardline', mu=0, sigma=0.5)
beta_score = pm.Normal('beta_score', mu=0, sigma=0.5)
sigma = pm.Exponential('sigma', lam=0.5)
# Use first observation's predictors
mu = (intercept +
beta_down[0] + # arbitrary down
beta_ydstogo * 0 +
beta_yardline * 0 +
beta_score * 0)
# Prior predictive
epa_prior = pm.Normal('epa_prior', mu=mu, sigma=sigma)
# Sample from prior
prior_samples = pm.sample_prior_predictive(samples=1000, random_seed=2024)
# Plot prior predictive vs observed
fig, ax = plt.subplots(figsize=(10, 6))
prior_epa = prior_samples.prior['epa_prior'].values.flatten()
ax.hist(prior_epa, bins=50, density=True, alpha=0.6,
color='gray', label='Prior predictions')
ax.hist(reg_data_py['epa'], bins=50, density=True, alpha=0.4,
color='blue', label='Observed data')
ax.set_xlim(-10, 10)
ax.set_xlabel('EPA', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Prior Predictive Check\nGray = Prior predictions, Blue = Observed data',
fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
If prior predictions produce impossible values (e.g., EPA of 50), the prior is too vague and should be tightened.
Sensitivity Analysis
Check how results change with different priors:
Prior Sensitivity Guidelines
1. **Vary prior width**: Try σ = 0.5, 1.0, 2.0 for Normal(0, σ) priors 2. **Compare posteriors**: If posteriors are similar, priors have little influence (data dominates) 3. **Document choices**: Always report and justify priors 4. **Use domain knowledge**: Expert opinion should inform prior selectionPractical Applications in Football
Having covered the theoretical foundations and core techniques, let's explore how Bayesian methods solve real football analytics problems. These applications demonstrate why teams increasingly use Bayesian approaches for everything from in-game decisions to long-term roster construction.
Application 1: Fourth Down Decision Making
Fourth down decisions are among the highest-leverage in-game choices coaches make. Should you go for it on 4th-and-2 from midfield when trailing by 7 in the third quarter? Traditional analytics provides point estimates of conversion probability and expected value, but doesn't quantify uncertainty or account for team-specific abilities.
Bayesian methods can dramatically improve 4th down decision-making by:
-
Modeling team-specific conversion rates: Not all teams convert 4th-and-2 at the same rate. Elite offensive lines and mobile QBs convert more often. Hierarchical models estimate team-specific rates while borrowing strength across teams.
-
Updating beliefs as season progresses: After Week 1, you have minimal team-specific data. By Week 10, you have substantial evidence about each team's short-yardage ability. Bayesian models naturally incorporate accumulating evidence.
-
Quantifying uncertainty in recommendations: Saying "you have a 55% chance to convert" is less useful than "you have a 55% chance with 95% credible interval [45%, 65%]." The wide interval suggests substantial uncertainty, which might affect the decision.
-
Incorporating situational factors: You can include yardage, field position, opponent quality, and other context in a regression framework while maintaining proper uncertainty quantification.
The hierarchical structure is key: you're estimating league-average conversion rates, team-specific deviations from the average, and situation-specific modifiers, all simultaneously with proper uncertainty propagation.
Real-Time Bayesian Updating During Games
Imagine a team that's 0-for-3 on fourth down attempts today. Should this change your conversion probability estimate? A simple frequency approach says "they're 0% today!" A Bayesian hierarchical model says "they've had bad luck today, but their true ability is probably close to their season-long rate, shrunk slightly toward zero based on today's failures." This is exactly the rational updating coaches should do—and hierarchical models do it automatically.#| label: fourth-down-app-r
#| message: false
#| warning: false
#| cache: true
# Prepare 4th down data
fourth_down <- pbp_2023 %>%
filter(
down == 4,
!is.na(go_boost), # 4th down situation
ydstogo <= 5, # Short yardage
play_type %in% c("pass", "run"),
!is.na(posteam)
) %>%
mutate(
conversion = if_else(first_down_pass == 1 | first_down_rush == 1, 1, 0),
ydstogo_group = cut(ydstogo, breaks = c(0, 2, 5),
labels = c("0-2 yards", "3-5 yards"))
)
# Hierarchical model: team effects + yardage effects
fourth_model <- brm(
conversion ~ ydstogo_group + (1 | posteam),
data = fourth_down,
family = bernoulli(),
prior = c(
prior(normal(0, 1.5), class = Intercept),
prior(normal(0, 1), class = b),
prior(exponential(1), class = sd)
),
chains = 4,
iter = 2000,
cores = 4,
seed = 2024,
backend = "cmdstanr",
silent = 2,
refresh = 0
)
# Extract team-specific conversion probabilities
team_probs <- fourth_model %>%
spread_draws(r_posteam[posteam,]) %>%
mutate(
# Add fixed effects
prob_short = plogis(fixef(fourth_model)[1, 1] + r_posteam),
prob_medium = plogis(fixef(fourth_model)[1, 1] +
fixef(fourth_model)[2, 1] + r_posteam)
) %>%
group_by(posteam) %>%
summarise(
prob_0_2 = mean(prob_short),
prob_3_5 = mean(prob_medium),
lower_0_2 = quantile(prob_short, 0.025),
upper_0_2 = quantile(prob_short, 0.975),
.groups = "drop"
) %>%
arrange(desc(prob_0_2))
# Show top teams
team_probs %>%
head(10) %>%
gt::gt() %>%
gt::cols_label(
posteam = "Team",
prob_0_2 = "P(Convert 0-2 yds)",
lower_0_2 = "Lower 95%",
upper_0_2 = "Upper 95%",
prob_3_5 = "P(Convert 3-5 yds)"
) %>%
gt::fmt_percent(
columns = c(prob_0_2, lower_0_2, upper_0_2, prob_3_5),
decimals = 1
) %>%
gt::tab_header(
title = "4th Down Conversion Probabilities by Team",
subtitle = "Short yardage situations (2023)"
)
#| label: fourth-down-app-py
#| message: false
#| warning: false
#| cache: true
# Prepare 4th down data
fourth_down_py = (pbp_2023
.query("down == 4 & go_boost.notna() & ydstogo <= 5 & " +
"play_type.isin(['pass', 'run']) & posteam.notna()")
.assign(
conversion=lambda x: ((x['first_down_pass'] == 1) |
(x['first_down_rush'] == 1)).astype(int),
ydstogo_group=lambda x: pd.cut(x['ydstogo'], bins=[0, 2, 5],
labels=['0-2 yards', '3-5 yards'])
)
.dropna(subset=['ydstogo_group'])
)
# Convert to numeric
fourth_down_py['ydstogo_binary'] = (fourth_down_py['ydstogo_group'] == '3-5 yards').astype(int)
fourth_down_py['team_idx'] = pd.Categorical(fourth_down_py['posteam']).codes
n_teams_4th = fourth_down_py['team_idx'].nunique()
# Hierarchical logistic regression
with pm.Model() as fourth_model_py:
# Hyperpriors
mu_intercept = pm.Normal('mu_intercept', mu=0, sigma=1.5)
sigma_team = pm.Exponential('sigma_team', lam=1)
# Team effects
team_effect = pm.Normal('team_effect', mu=0, sigma=sigma_team, shape=n_teams_4th)
# Yardage effect
beta_yardage = pm.Normal('beta_yardage', mu=0, sigma=1)
# Linear predictor
logit_p = (mu_intercept + team_effect[fourth_down_py['team_idx'].values] +
beta_yardage * fourth_down_py['ydstogo_binary'].values)
# Likelihood
conversion_obs = pm.Bernoulli('conversion_obs', logit_p=logit_p,
observed=fourth_down_py['conversion'].values)
# Sample
trace_4th = pm.sample(2000, tune=1000, chains=4, cores=4,
random_seed=2024, return_inferencedata=True,
progressbar=False)
print("\n4th Down Conversion Model Summary:")
print(az.summary(trace_4th, var_names=['mu_intercept', 'beta_yardage', 'sigma_team']))
Application 2: Player Development Trajectories
One of the most valuable applications of Bayesian methods is tracking how players develop over time, particularly rookies. A rookie QB might start hot, struggle mid-season, then finish strong. How should we interpret this performance arc? Is the early success real talent or luck? Is the mid-season struggle growing pains or exposure of weakness?
Sequential Bayesian updating provides the ideal framework:
- Start with a prior based on draft position, college stats, and historical rookie performance
- Update after each game by combining the prior with new performance data
- Track the posterior over time to see when beliefs stabilize
- Quantify uncertainty at each point—are we confident this is the QB's true level?
This approach mirrors how coaches actually evaluate players: they have initial expectations (priors), update them based on performance (likelihood), and eventually form stable beliefs (posterior). The Bayesian framework formalizes this process and quantifies confidence at each stage.
Application 3: Injury Risk Prediction
Predicting injury risk is critical for roster management, contract negotiations, and game planning. However, injuries are rare events with substantial uncertainty. A player might have no injuries for 5 years, then suffer two in one season. How do we distinguish bad luck from true injury proneness?
Bayesian survival models are ideal because they:
- Handle censoring: Some players leave the league before getting injured (censored observations)
- Incorporate covariates: Age, position, playing style, workload all affect injury risk
- Pool information: Borrow strength across similar players to estimate risk for players with limited history
- Quantify uncertainty: Provide probability distributions over time-to-injury, not just point estimates
Hierarchical Bayesian survival models can estimate baseline injury hazards by position, then player-specific deviations, accounting for age, usage patterns, and injury history. This gives roster managers probabilistic forecasts: "This RB has a 30% chance [CI: 20-45%] of missing 3+ games to injury this season."
Application 4: Draft Pick Valuation
How much is a 2nd-round pick worth relative to a 1st-rounder? What's the value of a specific draft position (e.g., pick #15) relative to another (e.g., pick #25)? These questions are critical for trade negotiations and draft strategy.
Hierarchical Bayesian models can estimate draft pick value by:
- Modeling career outcomes (e.g., career WAR, years as starter) as a function of draft position
- Pooling across positions: QBs and RBs have different value curves, but share common structure
- Accounting for draft class variation: Some drafts are strong, others weak—hierarchical models separate these sources of variation
- Incorporating uncertainty: Late-round picks might bust or become Pro Bowlers—capture this with full posterior distributions
The result is a draft value chart with uncertainty quantification. Rather than saying "pick #15 is worth 1050 points," you get "pick #15 is worth 1050 [CI: 950-1200] points in expected career value." The uncertainty matters when evaluating trades—is trading up worth the cost given the uncertainty?
Bayesian Models for Multi-Year Contracts
When evaluating free agent contracts, Bayesian methods shine. You need to project player performance over 3-5 years, accounting for age decline, injury risk, and regression to the mean. Hierarchical Bayesian models that pool information across similar players while respecting individual differences provide the most realistic projections with proper uncertainty quantification. This helps teams avoid overpaying for recent peak performance that won't persist.Bayesian vs Frequentist: A Comparison
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables with distributions |
| Inference | Confidence intervals, p-values | Posterior distributions, credible intervals |
| Interpretation | "95% of intervals contain true value" | "95% probability parameter is in this interval" |
| Prior knowledge | Not formally incorporated | Naturally incorporated via priors |
| Sample size | Relies on asymptotic theory | Works with any sample size |
| Uncertainty | Confidence intervals | Full posterior distribution |
| Computation | Often closed-form | Usually requires MCMC |
| Overfitting | Regularization via penalties | Natural regularization via priors |
:Comparison of Bayesian and Frequentist approaches
When to Use Bayesian Methods
Bayesian methods are ideal when:
- You have limited data (regularization helps)
- Hierarchical structure exists (players in teams)
- You want to incorporate prior knowledge
- Uncertainty quantification is critical
- You need to update predictions sequentially
Frequentist methods may be preferred when:
- Computational resources are limited
- Sample sizes are very large (results converge)
- You want to avoid specifying priors
- Simple hypothesis testing suffices
In practice, modern football analytics often blends both approaches, using whichever is most appropriate for the problem at hand.
Summary
Bayesian methods offer a powerful and increasingly essential framework for football analytics. As the field matures, teams recognize that point estimates without uncertainty quantification are insufficient for high-stakes decisions. Bayesian methods provide the rigorous probabilistic reasoning needed for modern football analytics.
Key Takeaways:
-
Bayes' theorem provides a principled way to update beliefs with new evidence, formalizing how we naturally think about learning from data while incorporating prior knowledge
-
Hierarchical models naturally handle football's nested structure (players within teams, teams within divisions), allowing information pooling that improves estimates for all groups while respecting individual differences
-
MCMC algorithms enable inference for complex models that would be intractable with analytical methods, making sophisticated hierarchical and regression models practical
-
Uncertainty quantification is natural and comprehensive in the Bayesian framework—every estimate comes with a full posterior distribution showing the range of plausible values
-
Prior knowledge can be formally incorporated through prior distributions, allowing decades of football understanding to inform modern analyses rather than starting from scratch with each dataset
-
Model comparison uses principled tools like LOO-CV and WAIC that balance fit and complexity, helping choose models that predict well rather than just fit well
-
Practical applications span the full range of football decisions: player evaluation, game-time tactics, roster construction, contract negotiation, and draft strategy
The Complete Bayesian Workflow:
- Specify model: Choose appropriate likelihood for your data and prior distributions for parameters
- Prior predictive checks: Simulate from the prior to ensure it produces reasonable predictions before seeing data
- Fit using MCMC: Use modern tools like Stan, PyMC, or brms to sample from the posterior
- Check diagnostics: Verify convergence (R-hat < 1.01), adequate sampling (ESS > 400), and no pathologies (divergences)
- Validate with posterior predictive checks: Ensure the model generates data consistent with observed data
- Compare models: Use LOO-CV or WAIC to compare competing models
- Perform sensitivity analysis: Check how conclusions change with different prior choices
- Make inferences and predictions: Extract posterior summaries, compute probabilities, generate predictions with uncertainty
When Bayesian Methods Excel:
Bayesian methods require more computational effort and statistical sophistication than traditional approaches, but provide richer, more nuanced insights. They are especially valuable in football analytics where:
- Sample sizes are limited: Players and teams have finite games—regularization via priors helps
- Hierarchical structure exists: Pooling information across players/teams improves all estimates
- Prior knowledge is available: Decades of football analysis can inform modern models
- Uncertainty matters: High-stakes decisions require knowing confidence levels, not just point estimates
- Sequential updating is needed: As the season progresses, beliefs should rationally update
- Rare events occur: Injuries, fourth downs, two-point conversions benefit from information pooling
Looking Forward:
Bayesian methods are not just academic exercises—they're increasingly used by NFL teams for practical decision-making. As computational power grows and software improves, Bayesian analyses that once took hours now take minutes. The barrier to adoption is no longer computational, but educational. Mastering these methods provides a significant competitive advantage in modern football analytics.
The Bayesian Mindset
Beyond specific techniques, Bayesian methods encourage a valuable mindset for analytics: 1. **Think probabilistically**: Express beliefs as distributions, not points 2. **Update rationally**: Combine evidence using Bayes' theorem, not ad-hoc adjustments 3. **Quantify uncertainty**: Always know how confident you should be 4. **Regularize appropriately**: Shrink toward reasonable values when data is limited 5. **Be transparent**: Make assumptions explicit through prior specification This mindset—thinking clearly about uncertainty and updating beliefs rationally—is valuable far beyond the specific models in this chapter.Exercises
Conceptual Questions
- Priors and Posteriors: A kicker makes 3 of 3 field goals in his first game. Compare:
- Frequentist estimate (3/3 = 100%)
- Bayesian estimate with Beta(20, 5) prior (encoding that kickers are typically around 80%)
Which seems more reasonable? Why?
-
Hierarchical Shrinkage: Explain in your own words why hierarchical models shrink extreme estimates toward the group mean. Why is this beneficial for player evaluation?
-
Credible vs Confidence Intervals:
- A 95% confidence interval for a QB's EPA is [0.10, 0.30]
- A 95% credible interval for a QB's EPA is [0.10, 0.30]
Explain the difference in interpretation. Which is more intuitive for decision-making?
Coding Exercises
Exercise 1: Bayesian Kicker Evaluation
Build a hierarchical Beta-binomial model for NFL kickers' field goal success rates in 2023. **Tasks:** a) Filter to kickers with at least 10 attempts b) Fit a hierarchical model with team-level effects c) Compare raw success rates to hierarchical estimates d) Visualize shrinkage as a function of attempts e) Rank kickers by posterior mean success probability **Bonus**: Compare short (< 40 yards) vs long (≥ 40 yards) field goals.Exercise 2: Hierarchical Team Strength Model
Create a hierarchical model estimating team strength based on point differential. **Model:** $$ \begin{align} \text{Point Diff}_{ij} &\sim \text{Normal}(\theta_i - \theta_j, \sigma_y) \\ \theta_i &\sim \text{Normal}(\mu_{\text{league}}, \sigma_{\text{team}}) \end{align} $$ Where $\theta_i$ is team $i$'s strength. **Tasks:** a) Fit the model using 2023 data b) Rank teams by posterior mean strength c) Compute probability that Team A > Team B for selected pairs d) Visualize posterior distributions for top 5 teamsExercise 3: Bayesian EPA Regression
Extend the EPA regression model from the chapter to include interactions. **Tasks:** a) Add interaction between down and yards to go b) Include play type (pass vs run) c) Fit the model and interpret coefficients d) Use LOO-CV to compare to the simpler model e) Generate posterior predictive distributions for: - 3rd & 3, pass, midfield - 3rd & 3, run, midfield - Which has higher expected EPA? **Bonus**: Include team-specific random effects.Exercise 4: Sequential Bayesian Updating
Track a quarterback's EPA through the season using sequential Bayesian updates. **Tasks:** a) Pick a QB who played full 2023 season b) Start with weakly informative prior: Normal(0.05, 0.3) c) After each game, update posterior using that game's EPA values d) Plot how posterior mean and uncertainty evolve over season e) Compare final posterior to raw season-end EPA **Interpretation**: When did the posterior stabilize? How much did early games influence final estimate?Further Reading
Books
-
Gelman, A., et al. (2020). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC. [The definitive reference]
-
McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC Press. [Excellent intuitive introduction]
-
Kruschke, J. (2015). Doing Bayesian Data Analysis (2nd ed.). Academic Press. [Accessible, with many examples]
Papers
-
Yurko, R., Ventura, S., & Horowitz, M. (2020). "Going deep: models for continuous-time within-play valuation of game outcomes in American football with tracking data." Journal of Quantitative Analysis in Sports, 16(2), 163-182.
-
Lopez, M. (2019). "Bigger data, better questions, and a return to judgment: Comments on football analytics." Harvard Data Science Review.
-
Carpenter, B., et al. (2017). "Stan: A probabilistic programming language." Journal of Statistical Software, 76(1).
Software Documentation
- brms: https://paul-buerkner.github.io/brms/
- PyMC: https://docs.pymc.io/
- Stan: https://mc-stan.org/
- ArviZ (Bayesian visualization): https://python.arviz.org/
Tutorials
- Bayesian modeling with brms: https://mvuorre.github.io/posts/2016-09-29-bayesian-meta-analysis/
- PyMC case studies: https://www.pymc.io/projects/examples/en/latest/
References
:::