Learning ObjectivesBy the end of this chapter, you will be able to:
- Apply supervised learning methods to football prediction problems
- Build classification models for categorical outcomes (wins, play types)
- Develop regression models for continuous metrics (EPA, player performance)
- Engineer football-specific features for improved model performance
- Evaluate and validate machine learning models using appropriate metrics
- Interpret model predictions using SHAP values and feature importance
- Select appropriate algorithms for different football analytics tasks
Introduction
Machine learning has revolutionized football analytics, enabling teams to make predictions and identify patterns that would be impossible through traditional statistical methods. From predicting game outcomes to forecasting player performance, from classifying play types to optimizing personnel decisions, ML algorithms power many of the most sophisticated analytics applications in modern football.
Unlike traditional statistical models that rely on predefined relationships and assumptions, machine learning algorithms can automatically discover complex, non-linear patterns in data. This capability is particularly valuable in football, where interactions between variables—field position, down and distance, personnel groupings, game situation—create intricate relationships that simple models struggle to capture.
This chapter provides a comprehensive introduction to machine learning for football analytics. We'll explore both classification and regression problems, implement multiple algorithms, develop football-specific features, validate model performance, and interpret predictions. By the end, you'll have a complete toolkit for building, evaluating, and deploying ML models for football.
What is Machine Learning?
Machine learning is a branch of artificial intelligence focused on building systems that learn from data without being explicitly programmed. Instead of writing rules to solve a problem, we provide examples and let the algorithm discover the patterns. **Key characteristics:** - **Data-driven**: Learn from examples rather than rules - **Pattern discovery**: Automatically identify relationships - **Generalization**: Make predictions on new, unseen data - **Adaptability**: Improve as more data becomes availableSupervised vs Unsupervised Learning
Machine learning encompasses several paradigms, but two dominate football analytics applications:
Supervised Learning
Supervised learning uses labeled training data where the outcome is known. The algorithm learns to map inputs (features) to outputs (labels).
Classification tasks (categorical outcomes):
- Will the team win this game? (binary: yes/no)
- What play type will be called? (multi-class: pass/run/punt/FG)
- Will this be a successful play? (binary: success/failure)
- Will the player make the Pro Bowl? (binary: yes/no)
Regression tasks (continuous outcomes):
- What will the EPA of this play be?
- How many yards will this player gain next season?
- What's the expected point differential?
- What will this player's PFF grade be?
Unsupervised Learning
Unsupervised learning finds patterns in unlabeled data without predefined outcomes.
Common applications:
- Clustering: Group similar plays, players, or teams
- Dimensionality reduction: Simplify high-dimensional tracking data
- Anomaly detection: Identify unusual plays or performances
This chapter focuses on supervised learning, which addresses most practical football analytics problems.
Choosing Between Classification and Regression
- Use **classification** when predicting categories (win/loss, play type) - Use **regression** when predicting quantities (EPA, yards, points) - Some problems can be framed either way (e.g., "Will EPA be positive?" vs "What will EPA be?")Classification Problems in Football
Classification models predict categorical outcomes. Let's explore two fundamental football classification problems.
Win Prediction with Logistic Regression
The most basic classification task is predicting game outcomes. We'll build a logistic regression model using pre-game team statistics.
#| label: win-prediction-r
#| message: false
#| warning: false
#| cache: true
library(tidyverse)
library(nflfastR)
library(caret)
library(gt)
# Load multiple seasons
pbp <- load_pbp(2018:2023)
# Calculate team stats by game
team_game_stats <- pbp %>%
filter(!is.na(posteam), !is.na(epa)) %>%
group_by(season, week, game_id, posteam) %>%
summarise(
off_epa_play = mean(epa, na.rm = TRUE),
off_success_rate = mean(epa > 0, na.rm = TRUE),
off_pass_epa = mean(epa[pass == 1], na.rm = TRUE),
off_rush_epa = mean(epa[rush == 1], na.rm = TRUE),
plays = n(),
.groups = "drop"
)
# Get game results
game_results <- pbp %>%
filter(!is.na(posteam), !is.na(result)) %>%
group_by(season, week, game_id, posteam) %>%
summarise(
won = first(result) > 0,
result = first(result),
.groups = "drop"
)
# Calculate rolling averages (last 4 games)
team_rolling <- team_game_stats %>%
arrange(posteam, season, week) %>%
group_by(posteam) %>%
mutate(
avg_off_epa = lag(zoo::rollmean(off_epa_play, k = 4, fill = NA, align = "right"), 1),
avg_success_rate = lag(zoo::rollmean(off_success_rate, k = 4, fill = NA, align = "right"), 1),
avg_pass_epa = lag(zoo::rollmean(off_pass_epa, k = 4, fill = NA, align = "right"), 1),
avg_rush_epa = lag(zoo::rollmean(off_rush_epa, k = 4, fill = NA, align = "right"), 1)
) %>%
ungroup()
# Join with results and remove NAs
model_data <- team_rolling %>%
inner_join(game_results, by = c("season", "week", "game_id", "posteam")) %>%
filter(!is.na(avg_off_epa)) %>%
mutate(won = factor(won, levels = c(FALSE, TRUE), labels = c("Loss", "Win")))
# Split into train/test (2018-2022 train, 2023 test)
train_data <- model_data %>% filter(season < 2023)
test_data <- model_data %>% filter(season == 2023)
# Train logistic regression
set.seed(42)
logistic_model <- glm(
won ~ avg_off_epa + avg_success_rate + avg_pass_epa + avg_rush_epa,
data = train_data,
family = binomial(link = "logit")
)
# Model summary
summary(logistic_model)
# Predictions on test set
test_predictions <- test_data %>%
mutate(
pred_prob = predict(logistic_model, newdata = test_data, type = "response"),
pred_class = factor(ifelse(pred_prob > 0.5, "Win", "Loss"),
levels = c("Loss", "Win"))
)
# Calculate accuracy
accuracy <- mean(test_predictions$pred_class == test_predictions$won)
cat("\nTest Set Accuracy:", round(accuracy * 100, 1), "%\n")
#| label: win-prediction-py
#| message: false
#| warning: false
#| cache: true
import pandas as pd
import numpy as np
import nfl_data_py as nfl
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
# Load multiple seasons
pbp = nfl.import_pbp_data(list(range(2018, 2024)))
# Calculate team stats by game
team_game_stats = (pbp
.query("posteam.notna() & epa.notna()")
.groupby(['season', 'week', 'game_id', 'posteam'])
.agg(
off_epa_play=('epa', 'mean'),
off_success_rate=('epa', lambda x: (x > 0).mean()),
off_pass_epa=('epa', lambda x: x[pbp.loc[x.index, 'pass'] == 1].mean()),
off_rush_epa=('epa', lambda x: x[pbp.loc[x.index, 'rush'] == 1].mean()),
plays=('epa', 'count')
)
.reset_index()
)
# Get game results
game_results = (pbp
.query("posteam.notna() & result.notna()")
.groupby(['season', 'week', 'game_id', 'posteam'])
.agg(
won=('result', lambda x: x.iloc[0] > 0),
result=('result', 'first')
)
.reset_index()
)
# Calculate rolling averages
team_game_stats = team_game_stats.sort_values(['posteam', 'season', 'week'])
team_rolling = (team_game_stats
.groupby('posteam')
.apply(lambda x: x.assign(
avg_off_epa=x['off_epa_play'].rolling(window=4, min_periods=1).mean().shift(1),
avg_success_rate=x['off_success_rate'].rolling(window=4, min_periods=1).mean().shift(1),
avg_pass_epa=x['off_pass_epa'].rolling(window=4, min_periods=1).mean().shift(1),
avg_rush_epa=x['off_rush_epa'].rolling(window=4, min_periods=1).mean().shift(1)
))
.reset_index(drop=True)
)
# Join with results
model_data = (team_rolling
.merge(game_results, on=['season', 'week', 'game_id', 'posteam'])
.dropna(subset=['avg_off_epa'])
)
# Split train/test
train_data = model_data[model_data['season'] < 2023]
test_data = model_data[model_data['season'] == 2023]
# Prepare features
feature_cols = ['avg_off_epa', 'avg_success_rate', 'avg_pass_epa', 'avg_rush_epa']
X_train = train_data[feature_cols]
y_train = train_data['won']
X_test = test_data[feature_cols]
y_test = test_data['won']
# Train logistic regression
logistic_model = LogisticRegression(random_state=42, max_iter=1000)
logistic_model.fit(X_train, y_train)
# Predictions
y_pred = logistic_model.predict(X_test)
y_pred_proba = logistic_model.predict_proba(X_test)[:, 1]
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Set Accuracy: {accuracy * 100:.1f}%")
# Coefficients
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': logistic_model.coef_[0]
})
print("\nModel Coefficients:")
print(coef_df.to_string(index=False))
Play Type Classification with Random Forest
Predicting offensive play-calling is a multi-class classification problem. Random forests excel at capturing complex interactions between game situation variables.
#| label: play-type-classification-r
#| message: false
#| warning: false
#| cache: true
library(randomForest)
# Prepare play-level data
play_data <- pbp %>%
filter(
season == 2023,
!is.na(down),
down <= 3,
play_type %in% c("pass", "run")
) %>%
select(
play_type,
down,
ydstogo,
yardline_100,
qtr,
score_differential,
half_seconds_remaining,
shotgun,
no_huddle
) %>%
mutate(
play_type = factor(play_type),
shotgun = as.integer(shotgun),
no_huddle = as.integer(no_huddle)
) %>%
drop_na()
# Split data
set.seed(42)
train_idx <- sample(1:nrow(play_data), 0.8 * nrow(play_data))
train_plays <- play_data[train_idx, ]
test_plays <- play_data[-train_idx, ]
# Train random forest
rf_model <- randomForest(
play_type ~ .,
data = train_plays,
ntree = 100,
mtry = 3,
importance = TRUE
)
# Predictions
rf_predictions <- predict(rf_model, newdata = test_plays)
rf_accuracy <- mean(rf_predictions == test_plays$play_type)
cat("\nRandom Forest Accuracy:", round(rf_accuracy * 100, 1), "%\n")
# Confusion matrix
conf_matrix <- table(Predicted = rf_predictions, Actual = test_plays$play_type)
print("\nConfusion Matrix:")
print(conf_matrix)
# Feature importance
importance_df <- as.data.frame(importance(rf_model)) %>%
rownames_to_column("Feature") %>%
arrange(desc(MeanDecreaseAccuracy))
print("\nTop Features by Importance:")
print(head(importance_df, 5))
#| label: play-type-classification-py
#| message: false
#| warning: false
#| cache: true
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
# Prepare play-level data
play_data = (pbp
.query("season == 2023 & down.notna() & down <= 3 & play_type.isin(['pass', 'run'])")
[['play_type', 'down', 'ydstogo', 'yardline_100', 'qtr',
'score_differential', 'half_seconds_remaining', 'shotgun', 'no_huddle']]
.assign(
shotgun=lambda x: x['shotgun'].astype(int),
no_huddle=lambda x: x['no_huddle'].astype(int)
)
.dropna()
)
# Features and target
X = play_data.drop('play_type', axis=1)
y = play_data['play_type']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train random forest
rf_model = RandomForestClassifier(
n_estimators=100,
max_features=3,
random_state=42,
n_jobs=-1
)
rf_model.fit(X_train, y_train)
# Predictions
y_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred)
print(f"\nRandom Forest Accuracy: {rf_accuracy * 100:.1f}%")
# Confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(
conf_mat,
index=['Actual Pass', 'Actual Run'],
columns=['Predicted Pass', 'Predicted Run']
))
# Feature importance
importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop Features by Importance:")
print(importance_df.head(5).to_string(index=False))
Regression Problems in Football
Regression models predict continuous numerical outcomes. Let's explore EPA prediction and player performance modeling.
EPA Prediction with Gradient Boosting
Gradient boosting machines (GBM) excel at regression tasks with complex feature interactions. We'll predict EPA for upcoming plays.
#| label: epa-prediction-r
#| message: false
#| warning: false
#| cache: true
library(xgboost)
# Prepare regression data
epa_data <- pbp %>%
filter(
season %in% 2022:2023,
!is.na(epa),
!is.na(down),
down <= 4,
play_type %in% c("pass", "run")
) %>%
select(
epa,
down,
ydstogo,
yardline_100,
qtr,
score_differential,
half_seconds_remaining,
shotgun,
no_huddle,
play_type
) %>%
mutate(
shotgun = as.integer(shotgun),
no_huddle = as.integer(no_huddle),
is_pass = as.integer(play_type == "pass")
) %>%
select(-play_type) %>%
drop_na()
# Split by season
train_epa <- epa_data %>% filter(half_seconds_remaining > 900 | qtr < 4)
test_epa <- epa_data %>% filter(half_seconds_remaining <= 900 & qtr == 4)
# Prepare matrices for xgboost
X_train <- train_epa %>% select(-epa) %>% as.matrix()
y_train <- train_epa$epa
X_test <- test_epa %>% select(-epa) %>% as.matrix()
y_test <- test_epa$epa
# Train gradient boosting model
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)
params <- list(
objective = "reg:squarederror",
eta = 0.1,
max_depth = 6,
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(train = dtrain, test = dtest),
verbose = 0
)
# Predictions
predictions <- predict(xgb_model, dtest)
# Evaluation metrics
rmse <- sqrt(mean((predictions - y_test)^2))
mae <- mean(abs(predictions - y_test))
r2 <- cor(predictions, y_test)^2
cat("\nGradient Boosting Performance:\n")
cat("RMSE:", round(rmse, 3), "\n")
cat("MAE:", round(mae, 3), "\n")
cat("R²:", round(r2, 3), "\n")
# Feature importance
importance_matrix <- xgb.importance(
feature_names = colnames(X_train),
model = xgb_model
)
print("\nTop 5 Features:")
print(head(importance_matrix, 5))
#| label: epa-prediction-py
#| message: false
#| warning: false
#| cache: true
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Prepare regression data
epa_data = (pbp
.query("season in [2022, 2023] & epa.notna() & down.notna() & down <= 4 & play_type.isin(['pass', 'run'])")
[['epa', 'down', 'ydstogo', 'yardline_100', 'qtr', 'score_differential',
'half_seconds_remaining', 'shotgun', 'no_huddle', 'play_type']]
.assign(
shotgun=lambda x: x['shotgun'].astype(int),
no_huddle=lambda x: x['no_huddle'].astype(int),
is_pass=lambda x: (x['play_type'] == 'pass').astype(int)
)
.drop('play_type', axis=1)
.dropna()
)
# Split data
train_epa = epa_data.query("half_seconds_remaining > 900 | qtr < 4")
test_epa = epa_data.query("half_seconds_remaining <= 900 & qtr == 4")
X_train = train_epa.drop('epa', axis=1)
y_train = train_epa['epa']
X_test = test_epa.drop('epa', axis=1)
y_test = test_epa['epa']
# Train gradient boosting model
gbm_model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=6,
subsample=0.8,
random_state=42
)
gbm_model.fit(X_train, y_train)
# Predictions
predictions = gbm_model.predict(X_test)
# Evaluation metrics
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("\nGradient Boosting Performance:")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R²: {r2:.3f}")
# Feature importance
importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': gbm_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop 5 Features:")
print(importance_df.head(5).to_string(index=False))
Player Performance Regression
Predicting future player performance requires careful feature engineering and handling of small sample sizes.
#| label: player-performance-r
#| message: false
#| warning: false
#| cache: true
# Calculate QB season stats
qb_stats <- pbp %>%
filter(
season %in% 2019:2023,
!is.na(passer_player_name),
!is.na(epa),
pass == 1
) %>%
group_by(season, passer_player_id, passer_player_name) %>%
summarise(
attempts = n(),
epa_per_play = mean(epa),
success_rate = mean(epa > 0),
cpoe = mean(cpoe, na.rm = TRUE),
air_yards_per_att = mean(air_yards, na.rm = TRUE),
.groups = "drop"
) %>%
filter(attempts >= 200)
# Create lagged features (predict next season)
qb_model_data <- qb_stats %>%
arrange(passer_player_id, season) %>%
group_by(passer_player_id) %>%
mutate(
next_epa = lead(epa_per_play),
seasons_played = row_number()
) %>%
ungroup() %>%
filter(!is.na(next_epa)) %>%
select(
next_epa,
epa_per_play,
success_rate,
cpoe,
air_yards_per_att,
attempts,
seasons_played
)
# Split data
set.seed(42)
train_idx <- sample(1:nrow(qb_model_data), 0.8 * nrow(qb_model_data))
train_qb <- qb_model_data[train_idx, ]
test_qb <- qb_model_data[-train_idx, ]
# Linear regression with regularization
library(glmnet)
X_train_qb <- train_qb %>% select(-next_epa) %>% as.matrix()
y_train_qb <- train_qb$next_epa
X_test_qb <- test_qb %>% select(-next_epa) %>% as.matrix()
y_test_qb <- test_qb$next_epa
# Ridge regression
ridge_model <- cv.glmnet(
X_train_qb,
y_train_qb,
alpha = 0,
nfolds = 5
)
# Predictions
ridge_pred <- predict(ridge_model, newx = X_test_qb, s = "lambda.min")
# Evaluation
qb_rmse <- sqrt(mean((ridge_pred - y_test_qb)^2))
qb_r2 <- cor(ridge_pred, y_test_qb)^2
cat("\nQB Performance Prediction:\n")
cat("RMSE:", round(qb_rmse, 3), "EPA per play\n")
cat("R²:", round(qb_r2, 3), "\n")
#| label: player-performance-py
#| message: false
#| warning: false
#| cache: true
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# Calculate QB season stats
qb_stats = (pbp
.query("season in [2019, 2020, 2021, 2022, 2023] & passer_player_name.notna() & epa.notna() & pass == 1")
.groupby(['season', 'passer_player_id', 'passer_player_name'])
.agg(
attempts=('epa', 'count'),
epa_per_play=('epa', 'mean'),
success_rate=('epa', lambda x: (x > 0).mean()),
cpoe=('cpoe', 'mean'),
air_yards_per_att=('air_yards', 'mean')
)
.reset_index()
.query("attempts >= 200")
)
# Create lagged features
qb_stats = qb_stats.sort_values(['passer_player_id', 'season'])
qb_model_data = (qb_stats
.assign(
next_epa=lambda x: x.groupby('passer_player_id')['epa_per_play'].shift(-1),
seasons_played=lambda x: x.groupby('passer_player_id').cumcount() + 1
)
.dropna(subset=['next_epa'])
[['next_epa', 'epa_per_play', 'success_rate', 'cpoe',
'air_yards_per_att', 'attempts', 'seasons_played']]
)
# Split data
from sklearn.model_selection import train_test_split
X_qb = qb_model_data.drop('next_epa', axis=1)
y_qb = qb_model_data['next_epa']
X_train_qb, X_test_qb, y_train_qb, y_test_qb = train_test_split(
X_qb, y_qb, test_size=0.2, random_state=42
)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_qb)
X_test_scaled = scaler.transform(X_test_qb)
# Ridge regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train_qb)
# Predictions
ridge_pred = ridge_model.predict(X_test_scaled)
# Evaluation
qb_rmse = np.sqrt(mean_squared_error(y_test_qb, ridge_pred))
qb_r2 = r2_score(y_test_qb, ridge_pred)
print("\nQB Performance Prediction:")
print(f"RMSE: {qb_rmse:.3f} EPA per play")
print(f"R²: {qb_r2:.3f}")
Feature Engineering for Football Data
Feature engineering—creating meaningful variables from raw data—often matters more than algorithm selection. Football provides rich opportunities for creative feature development.
Temporal Features
#| label: feature-engineering-r
#| message: false
#| warning: false
# Example: Create time-based features
pbp_features <- pbp %>%
filter(season == 2023, !is.na(epa)) %>%
mutate(
# Game situation features
is_first_half = qtr <= 2,
is_two_minute_drill = half_seconds_remaining <= 120,
is_goal_to_go = yardline_100 <= 10 & ydstogo >= yardline_100,
is_red_zone = yardline_100 <= 20,
is_third_down = down == 3,
is_fourth_down = down == 4,
# Score situation
is_close_game = abs(score_differential) <= 7,
is_winning = score_differential > 0,
is_losing = score_differential < 0,
score_diff_bucket = cut(
score_differential,
breaks = c(-Inf, -14, -7, 0, 7, 14, Inf),
labels = c("Down 14+", "Down 7-13", "Down 1-6", "Up 1-6", "Up 7-13", "Up 14+")
),
# Field position
field_zone = cut(
yardline_100,
breaks = c(0, 20, 50, 80, 100),
labels = c("Red Zone", "Mid-Field-Near", "Mid-Field-Far", "Own Territory")
),
# Down-distance combinations
down_distance_combo = case_when(
down == 1 ~ "First Down",
down == 2 & ydstogo <= 3 ~ "Second & Short",
down == 2 & ydstogo >= 7 ~ "Second & Long",
down == 3 & ydstogo <= 3 ~ "Third & Short",
down == 3 & ydstogo >= 7 ~ "Third & Long",
TRUE ~ "Other"
)
)
# Show summary of engineered features
feature_summary <- pbp_features %>%
summarise(
pct_first_half = mean(is_first_half),
pct_two_min = mean(is_two_minute_drill),
pct_red_zone = mean(is_red_zone),
pct_close_game = mean(is_close_game),
pct_third_down = mean(is_third_down)
)
print("Feature Statistics:")
print(feature_summary)
#| label: feature-engineering-py
#| message: false
#| warning: false
# Create time-based features
pbp_features = (pbp
.query("season == 2023 & epa.notna()")
.assign(
# Game situation features
is_first_half=lambda x: x['qtr'] <= 2,
is_two_minute_drill=lambda x: x['half_seconds_remaining'] <= 120,
is_goal_to_go=lambda x: (x['yardline_100'] <= 10) & (x['ydstogo'] >= x['yardline_100']),
is_red_zone=lambda x: x['yardline_100'] <= 20,
is_third_down=lambda x: x['down'] == 3,
is_fourth_down=lambda x: x['down'] == 4,
# Score situation
is_close_game=lambda x: x['score_differential'].abs() <= 7,
is_winning=lambda x: x['score_differential'] > 0,
is_losing=lambda x: x['score_differential'] < 0,
score_diff_bucket=lambda x: pd.cut(
x['score_differential'],
bins=[-np.inf, -14, -7, 0, 7, 14, np.inf],
labels=['Down 14+', 'Down 7-13', 'Down 1-6', 'Up 1-6', 'Up 7-13', 'Up 14+']
),
# Field position
field_zone=lambda x: pd.cut(
x['yardline_100'],
bins=[0, 20, 50, 80, 100],
labels=['Red Zone', 'Mid-Field-Near', 'Mid-Field-Far', 'Own Territory']
),
# Down-distance combinations
down_distance_combo=lambda x: np.select(
[
x['down'] == 1,
(x['down'] == 2) & (x['ydstogo'] <= 3),
(x['down'] == 2) & (x['ydstogo'] >= 7),
(x['down'] == 3) & (x['ydstogo'] <= 3),
(x['down'] == 3) & (x['ydstogo'] >= 7)
],
[
'First Down',
'Second & Short',
'Second & Long',
'Third & Short',
'Third & Long'
],
default='Other'
)
)
)
# Feature statistics
feature_summary = pbp_features[[
'is_first_half', 'is_two_minute_drill', 'is_red_zone',
'is_close_game', 'is_third_down'
]].mean()
print("Feature Statistics:")
print(feature_summary)
Aggregated Historical Features
#| label: historical-features-r
#| message: false
#| warning: false
# Team tendency features (rolling windows)
team_tendencies <- pbp %>%
filter(season == 2023, !is.na(posteam)) %>%
arrange(posteam, week, game_id) %>%
group_by(posteam) %>%
mutate(
# Rolling pass rate
pass_rate_L3 = lag(zoo::rollmean(pass, k = 50, fill = NA, align = "right"), 1),
# Rolling EPA
epa_L3 = lag(zoo::rollmean(epa, k = 50, fill = NA, align = "right"), 1),
# Situational tendencies
third_down_conv_rate = lag(
zoo::rollmean(ifelse(down == 3, as.numeric(first_down), NA),
k = 20, fill = NA, align = "right", na.rm = TRUE),
1
)
) %>%
ungroup()
print("Historical features created")
#| label: historical-features-py
#| message: false
#| warning: false
# Team tendency features
team_tendencies = (pbp
.query("season == 2023 & posteam.notna()")
.sort_values(['posteam', 'week', 'game_id'])
.assign(
pass_int=lambda x: x['pass'].astype(float)
)
.groupby('posteam', group_keys=False)
.apply(lambda x: x.assign(
pass_rate_L3=x['pass_int'].rolling(window=50, min_periods=1).mean().shift(1),
epa_L3=x['epa'].rolling(window=50, min_periods=1).mean().shift(1)
))
)
print("Historical features created")
Model Selection and Algorithms
Different algorithms have different strengths for football analytics problems:
Algorithm Comparison
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Logistic Regression | Interpretable, fast, probabilistic | Linear decision boundaries | Win prediction, simple classification |
| Random Forest | Handles interactions, robust | Less interpretable, can overfit | Play-calling, complex classification |
| Gradient Boosting | High accuracy, handles non-linearity | Slower training, tuning required | EPA prediction, player metrics |
| Neural Networks | Extremely flexible, handles complex patterns | Black box, needs lots of data | Tracking data, computer vision |
| Ridge/Lasso | Regularization prevents overfitting | Assumes linearity | Player projections, small samples |
Cross-Validation and Model Selection
#| label: cross-validation-r
#| message: false
#| warning: false
#| cache: true
library(caret)
# Prepare data for model comparison
model_comp_data <- pbp %>%
filter(
season == 2023,
!is.na(epa),
down <= 3,
play_type %in% c("pass", "run")
) %>%
select(
epa,
down,
ydstogo,
yardline_100,
score_differential,
half_seconds_remaining
) %>%
drop_na() %>%
slice_sample(n = 5000) # Sample for speed
# Set up cross-validation
train_control <- trainControl(
method = "cv",
number = 5,
verboseIter = FALSE
)
# Train multiple models
set.seed(42)
# Linear model
lm_cv <- train(
epa ~ .,
data = model_comp_data,
method = "lm",
trControl = train_control
)
# Ridge regression
ridge_cv <- train(
epa ~ .,
data = model_comp_data,
method = "ridge",
trControl = train_control,
tuneGrid = expand.grid(lambda = seq(0, 1, by = 0.1))
)
# Random forest
rf_cv <- train(
epa ~ .,
data = model_comp_data,
method = "rf",
trControl = train_control,
tuneGrid = expand.grid(mtry = c(2, 3, 4)),
ntree = 50
)
# Compare models
model_comparison <- resamples(list(
Linear = lm_cv,
Ridge = ridge_cv,
RandomForest = rf_cv
))
summary(model_comparison)
#| label: cross-validation-py
#| message: false
#| warning: false
#| cache: true
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
# Prepare data
model_comp_data = (pbp
.query("season == 2023 & epa.notna() & down <= 3 & play_type.isin(['pass', 'run'])")
[['epa', 'down', 'ydstogo', 'yardline_100', 'score_differential', 'half_seconds_remaining']]
.dropna()
.sample(n=5000, random_state=42)
)
X = model_comp_data.drop('epa', axis=1)
y = model_comp_data['epa']
# Define models
models = {
'Linear': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'RandomForest': RandomForestRegressor(n_estimators=50, random_state=42)
}
# Cross-validation
cv_results = {}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
cv_results[name] = {
'mean_r2': scores.mean(),
'std_r2': scores.std()
}
# Display results
cv_df = pd.DataFrame(cv_results).T
print("\nCross-Validation Results:")
print(cv_df.to_string())
Hyperparameter Tuning
Hyperparameters control model complexity and learning. Systematic tuning improves performance.
#| label: hyperparameter-tuning-r
#| message: false
#| warning: false
#| cache: true
# Grid search for Random Forest
rf_grid <- expand.grid(
mtry = c(2, 3, 4, 5),
splitrule = c("variance", "extratrees"),
min.node.size = c(5, 10, 20)
)
# Use ranger (faster RF implementation)
library(ranger)
rf_tuned <- train(
epa ~ .,
data = model_comp_data,
method = "ranger",
trControl = train_control,
tuneGrid = rf_grid,
num.trees = 100,
importance = "impurity"
)
# Best parameters
cat("\nBest Hyperparameters:\n")
print(rf_tuned$bestTune)
cat("\nBest RMSE:", round(min(rf_tuned$results$RMSE), 3), "\n")
#| label: hyperparameter-tuning-py
#| message: false
#| warning: false
#| cache: true
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100],
'max_depth': [5, 10, None],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2]
}
# Grid search
rf_grid_search = GridSearchCV(
RandomForestRegressor(random_state=42),
param_grid,
cv=3,
scoring='r2',
n_jobs=-1
)
rf_grid_search.fit(X, y)
# Best parameters
print("\nBest Hyperparameters:")
print(rf_grid_search.best_params_)
print(f"\nBest R² Score: {rf_grid_search.best_score_:.3f}")
Model Evaluation Metrics
Choosing the right evaluation metric depends on the problem and business context.
Classification Metrics
#| label: classification-metrics-r
#| message: false
#| warning: false
#| fig-width: 10
#| fig-height: 5
library(pROC)
library(yardstick)
# Get predictions from win prediction model
eval_data <- test_predictions %>%
mutate(
won_binary = as.integer(won == "Win"),
pred_binary = as.integer(pred_class == "Win")
)
# Confusion matrix
conf_mat <- table(Predicted = eval_data$pred_class, Actual = eval_data$won)
# Calculate metrics
accuracy <- mean(eval_data$pred_binary == eval_data$won_binary)
precision <- sum(eval_data$pred_binary == 1 & eval_data$won_binary == 1) /
sum(eval_data$pred_binary == 1)
recall <- sum(eval_data$pred_binary == 1 & eval_data$won_binary == 1) /
sum(eval_data$won_binary == 1)
f1 <- 2 * (precision * recall) / (precision + recall)
cat("Classification Metrics:\n")
cat("Accuracy: ", round(accuracy, 3), "\n")
cat("Precision:", round(precision, 3), "\n")
cat("Recall: ", round(recall, 3), "\n")
cat("F1 Score: ", round(f1, 3), "\n")
# ROC curve
roc_obj <- roc(eval_data$won_binary, eval_data$pred_prob)
auc_value <- auc(roc_obj)
cat("\nAUC:", round(auc_value, 3), "\n")
#| label: classification-metrics-py
#| message: false
#| warning: false
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, roc_curve
)
# Classification metrics (using win prediction from earlier)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)
print("Classification Metrics:")
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
print(f"\nAUC: {auc:.3f}")
Visualizing Model Performance
#| label: fig-roc-curve-r
#| fig-cap: "ROC Curve for Win Prediction Model"
#| fig-width: 8
#| fig-height: 6
#| message: false
#| warning: false
# ROC Curve
plot(roc_obj,
col = "#0072B2",
lwd = 2,
main = "ROC Curve: Win Prediction Model",
print.auc = TRUE,
print.auc.x = 0.4,
print.auc.y = 0.4)
abline(a = 0, b = 1, lty = 2, col = "gray")
#| label: fig-roc-curve-py
#| fig-cap: "ROC Curve for Win Prediction Model (Python)"
#| fig-width: 8
#| fig-height: 6
#| message: false
#| warning: false
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='#0072B2', lw=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve: Win Prediction Model', fontsize=14, fontweight='bold')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Confusion Matrix Visualization
#| label: fig-confusion-matrix-r
#| fig-cap: "Confusion Matrix for Win Prediction"
#| fig-width: 7
#| fig-height: 6
#| message: false
#| warning: false
library(ggplot2)
# Create confusion matrix dataframe
conf_df <- as.data.frame(conf_mat) %>%
mutate(
Percentage = Freq / sum(Freq) * 100,
Label = paste0(Freq, "\n(", round(Percentage, 1), "%)")
)
ggplot(conf_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", size = 1.5) +
geom_text(aes(label = Label), size = 5, fontface = "bold") +
scale_fill_gradient(low = "#E8F4F8", high = "#0072B2") +
labs(
title = "Confusion Matrix: Win Prediction Model",
subtitle = "2023 Season Test Set",
x = "Actual Outcome",
y = "Predicted Outcome"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 11),
legend.position = "none"
)
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
#| label: fig-confusion-matrix-py
#| fig-cap: "Confusion Matrix for Win Prediction (Python)"
#| fig-width: 7
#| fig-height: 6
#| message: false
#| warning: false
import seaborn as sns
# Create confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(7, 6))
sns.heatmap(
conf_mat,
annot=True,
fmt='d',
cmap='Blues',
xticklabels=['Loss', 'Win'],
yticklabels=['Loss', 'Win'],
cbar_kws={'label': 'Count'}
)
plt.xlabel('Actual Outcome', fontsize=12)
plt.ylabel('Predicted Outcome', fontsize=12)
plt.title('Confusion Matrix: Win Prediction Model\n2023 Season Test Set',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
Model Interpretability with SHAP
SHAP (SHapley Additive exPlanations) values provide model-agnostic interpretability by measuring each feature's contribution to predictions.
#| label: shap-analysis-r
#| message: false
#| warning: false
#| cache: true
library(shapr)
# Note: SHAP in R requires specific setup
# Using variable importance as alternative
importance_plot_data <- importance_df %>%
head(10) %>%
mutate(Feature = fct_reorder(Feature, MeanDecreaseAccuracy))
ggplot(importance_plot_data, aes(x = MeanDecreaseAccuracy, y = Feature)) +
geom_col(fill = "#0072B2") +
labs(
title = "Feature Importance: Play Type Prediction",
subtitle = "Random Forest Model",
x = "Mean Decrease in Accuracy",
y = NULL
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text.y = element_text(size = 11)
)
#| label: shap-analysis-py
#| message: false
#| warning: false
#| cache: true
#| fig-width: 10
#| fig-height: 6
# SHAP analysis
try:
import shap
# Sample data for SHAP (computationally expensive)
X_sample = X_test.sample(n=min(100, len(X_test)), random_state=42)
# Create explainer
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_sample)
# Summary plot
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values[1], X_sample, show=False)
plt.title("SHAP Summary: Play Type Prediction", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
except ImportError:
print("SHAP library not available. Install with: pip install shap")
# Alternative: Feature importance
importance_df_top = importance_df.head(10)
plt.figure(figsize=(10, 6))
plt.barh(range(len(importance_df_top)), importance_df_top['Importance'])
plt.yticks(range(len(importance_df_top)), importance_df_top['Feature'])
plt.xlabel('Feature Importance', fontsize=12)
plt.title('Feature Importance: Play Type Prediction',
fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Handling Football-Specific Data Challenges
Football data presents unique challenges that require specialized approaches.
Class Imbalance
Some outcomes are rare (4th down conversions, touchdowns), creating imbalanced datasets.
Solutions for Class Imbalance:
1. **Resampling**: Oversample minority class or undersample majority class 2. **Class weights**: Penalize misclassification of rare events more heavily 3. **SMOTE**: Synthetic Minority Over-sampling Technique 4. **Ensemble methods**: Use algorithms robust to imbalance (e.g., balanced random forest) 5. **Appropriate metrics**: Use F1, precision-recall, or AUC instead of accuracySmall Sample Sizes
Player-level predictions often suffer from limited observations (rookies, backups).
Solutions for Small Samples:
1. **Regularization**: Ridge/Lasso regression prevents overfitting 2. **Hierarchical models**: Borrow strength across similar players 3. **Feature engineering**: Create stable aggregate features 4. **Domain priors**: Incorporate expert knowledge (Bayesian methods) 5. **Ensemble predictions**: Combine multiple weak learnersTemporal Dependencies
Football data has strong time dependencies (player improvement, scheme changes).
Solutions for Temporal Dependencies:
1. **Time-series CV**: Use rolling or expanding window cross-validation 2. **Recurrent networks**: LSTM/GRU for sequential patterns 3. **Temporal features**: Include time trends, seasonal effects 4. **Separate test period**: Never test on past dataProduction Deployment Considerations
Moving models from development to production requires careful planning.
Model Versioning and Monitoring
#| eval: false
# Example model versioning structure
model_metadata <- list(
model_name = "win_probability_v2",
algorithm = "logistic_regression",
training_date = Sys.Date(),
training_seasons = "2018-2022",
features = c("avg_off_epa", "avg_success_rate", "avg_pass_epa", "avg_rush_epa"),
performance = list(
train_accuracy = 0.68,
test_accuracy = 0.66,
auc = 0.72
),
hyperparameters = list(
regularization = "none"
)
)
# Save model and metadata
saveRDS(logistic_model, "models/win_prob_v2.rds")
saveRDS(model_metadata, "models/win_prob_v2_metadata.rds")
Retraining Strategy
Model Drift and Retraining
Football strategies evolve. Models must be retrained regularly: - **Weekly updates**: For in-season predictions - **Seasonal retraining**: After each season completes - **Rule change updates**: When NFL rules change significantly - **Performance monitoring**: Track prediction accuracy over time - **A/B testing**: Compare new models against production baselineAPI Deployment Example
#| eval: false
# Example Flask API for model serving
from flask import Flask, request, jsonify
import joblib
app = Flask(__name__)
# Load model
model = joblib.load('models/win_prob_v2.pkl')
@app.route('/predict', methods=['POST'])
def predict():
# Get features from request
data = request.get_json()
features = [[
data['avg_off_epa'],
data['avg_success_rate'],
data['avg_pass_epa'],
data['avg_rush_epa']
]]
# Make prediction
probability = model.predict_proba(features)[0][1]
return jsonify({
'win_probability': float(probability),
'model_version': 'v2'
})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
Advanced Topics and Extensions
Ensemble Methods
Combining multiple models often outperforms single models:
#| label: ensemble-methods-r
#| message: false
#| warning: false
#| eval: false
# Stacking ensemble
library(caretEnsemble)
# Train multiple models
model_list <- caretList(
epa ~ .,
data = model_comp_data,
trControl = train_control,
methodList = c("lm", "ridge", "rf"),
tuneList = NULL
)
# Stack models
stacked_model <- caretStack(
model_list,
method = "glm",
trControl = train_control
)
# Stacked predictions often improve performance
#| label: ensemble-methods-py
#| message: false
#| warning: false
#| eval: false
from sklearn.ensemble import VotingRegressor
# Create ensemble
ensemble = VotingRegressor([
('linear', LinearRegression()),
('ridge', Ridge(alpha=1.0)),
('rf', RandomForestRegressor(n_estimators=50, random_state=42))
])
# Train ensemble
ensemble.fit(X_train, y_train)
# Ensemble predictions
ensemble_pred = ensemble.predict(X_test)
Neural Networks for Complex Patterns
Deep learning excels with large datasets and complex interactions:
#| eval: false
# Simple neural network example
from sklearn.neural_network import MLPRegressor
nn_model = MLPRegressor(
hidden_layer_sizes=(64, 32, 16),
activation='relu',
solver='adam',
max_iter=500,
random_state=42
)
nn_model.fit(X_train, y_train)
nn_pred = nn_model.predict(X_test)
Summary
Machine learning provides powerful tools for football analytics, enabling predictions and insights impossible with traditional methods. Key takeaways:
Classification vs Regression:
- Classification for categorical outcomes (wins, play types)
- Regression for continuous metrics (EPA, yards, player performance)
Algorithm Selection:
- Logistic regression for interpretable classification
- Random forests for complex interactions
- Gradient boosting for maximum predictive accuracy
- Ridge/Lasso for regularization with small samples
Feature Engineering:
- Domain knowledge drives feature creation
- Temporal, situational, and aggregated features
- Handle football-specific patterns
Evaluation:
- Choose metrics appropriate for problem (accuracy, AUC, RMSE, R²)
- Use cross-validation to prevent overfitting
- Visualize performance with ROC curves, confusion matrices
Interpretability:
- SHAP values explain individual predictions
- Feature importance identifies key drivers
- Balance accuracy with interpretability
Production:
- Version models and track performance
- Retrain regularly as strategies evolve
- Monitor for model drift
In this chapter, we:
- Distinguished supervised from unsupervised learning
- Built classification models for wins and play types
- Developed regression models for EPA and player performance
- Engineered football-specific features
- Compared algorithms and tuned hyperparameters
- Evaluated models with appropriate metrics
- Interpreted predictions with SHAP values
- Addressed football data challenges
- Considered production deployment
Exercises
Conceptual Questions
-
Algorithm Selection: You need to predict whether a team will convert on 4th down. The dataset is highly imbalanced (conversions are rare). Which algorithm would you choose and why? What techniques would you use to handle the class imbalance?
-
Feature Engineering: Design five features for predicting quarterback performance next season. Explain why each feature might be predictive and how you would calculate it from play-by-play data.
-
Cross-Validation Strategy: Why is random cross-validation inappropriate for time-series football data? Describe a better validation strategy for predicting game outcomes.
Coding Exercises
Exercise 1: Build a Fourth Down Conversion Model
Using nflfastR data, build a classification model to predict 4th down conversion success. **Requirements:** a) Load 2023 season data and filter to 4th down plays b) Create features: distance, field position, score situation, time remaining c) Train logistic regression, random forest, and gradient boosting models d) Compare model performance using accuracy, precision, recall, and AUC e) Visualize feature importance **Bonus**: Handle class imbalance using SMOTE or class weights.Exercise 2: Receiver Yards After Catch Prediction
Build a regression model to predict yards after catch (YAC) for receivers. **Requirements:** a) Filter to completed passes with YAC data b) Engineer features: air yards, defender proximity, field position, route type c) Train multiple regression models d) Use 5-fold cross-validation to compare models e) Calculate RMSE and R² for each model f) Identify the top 5 most important features **Bonus**: Create SHAP plots to explain predictions for specific catches.Exercise 3: Team Performance Forecasting
Predict team offensive EPA for the next 4 weeks based on recent performance. **Requirements:** a) Calculate team-level EPA statistics by week b) Create rolling average features (last 3 games, last 5 games) c) Include opponent defensive metrics d) Build a model to predict next week's offensive EPA e) Evaluate using time-series cross-validation f) Compare predictions to actual results **Bonus**: Incorporate schedule difficulty and rest days.Exercise 4: Model Interpretation Challenge
Take any model from the previous exercises and: a) Generate SHAP values for 100 predictions b) Create a SHAP summary plot c) Select 3 interesting predictions and explain why the model made those predictions d) Identify cases where the model is most uncertain e) Suggest features that might improve model performance **Goal**: Practice explaining model behavior to non-technical stakeholders.Further Reading
Books
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R. Springer.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.). O'Reilly.
Papers
- Burke, B. (2009). "Advanced NFL Stats: Win Probability." Advanced Football Analytics.
- Lopez, M., Matthews, G., & Baumer, B. (2018). "How often does the best team win? A unified approach to understanding randomness in North American sport." The Annals of Applied Statistics, 12(4), 2483-2516.
- Yurko, R., Ventura, S., & Horowitz, M. (2019). "nflWAR: A Reproducible Method for Offensive Player Evaluation in Football." Journal of Quantitative Analysis in Sports, 15(3), 163-183.
Online Resources
- scikit-learn documentation: https://scikit-learn.org/
- SHAP documentation: https://shap.readthedocs.io/
- caret package documentation: https://topepo.github.io/caret/
- nflfastR beginner's guide: https://www.nflfastr.com/
References
:::