Learning Objectives
By the end of this chapter, you will be able to:
- Master the core nflverse packages (nflfastR, nflreadr, nflplotR)
- Understand the structure and key variables in play-by-play data
- Work effectively with roster, team, and player data
- Create professional NFL visualizations with team logos and colors
- Integrate multiple nflverse data sources for comprehensive analysis
- Apply best practices and avoid common pitfalls in NFL data analysis
Introduction
The nflverse is a comprehensive ecosystem of R packages and Python tools designed specifically for NFL data analysis. Developed and maintained by a community of football analytics enthusiasts, the nflverse has become the de facto standard for accessing, analyzing, and visualizing NFL data.
What makes the nflverse special is its:
- Comprehensive Coverage: Play-by-play data, rosters, schedules, betting lines, and more
- Clean, Consistent Data: Pre-processed and standardized across seasons
- Active Development: Regular updates and new features
- Strong Community: Extensive documentation and support
- Cross-Language Support: Both R (nflverse) and Python (nfl_data_py) implementations
The nflverse Philosophy
The nflverse is built on the principle of making NFL data analysis accessible to everyone. All packages are open-source, free to use, and designed to work seamlessly together. The data is sourced from publicly available NFL feeds and enhanced with advanced metrics calculated by the community.

In this chapter, we'll explore the major components of the nflverse ecosystem, learn how to work with its data structures, and build practical skills for real-world football analysis.
The nflverse Package Family
The nflverse consists of several specialized packages, each serving a specific purpose:
| Package | Purpose | Primary Use Cases |
|---|---|---|
| nflfastR | Play-by-play data and models | EPA, WP, success rate calculations |
| nflreadr | Fast data loading | Efficient data import from nflverse repository |
| nflplotR | NFL-specific visualizations | Team logos, colors, charts |
| nfl4th | Fourth down analysis | 4th down decision recommendations |
| nflseedR | Playoff scenarios | Playoff probability simulations |
| nflfastR-roster | Player information (data repository, not a package) | Roster data and player identifiers |
For Python users, nfl_data_py provides equivalent functionality to nflreadr with a Pythonic interface.
nflfastR: The Foundation
nflfastR is the cornerstone of the nflverse, providing access to detailed play-by-play data with advanced metrics already calculated. Let's explore its capabilities.
Loading Play-by-Play Data
#| label: setup-r
#| message: false
#| warning: false
library(tidyverse)
library(nflfastR)
library(nflplotR)
library(gt)
# Load play-by-play data for multiple seasons
pbp <- load_pbp(2021:2023)
# Display data dimensions
cat("Loaded", nrow(pbp), "plays from",
length(unique(pbp$season)), "seasons\n")
cat("Number of columns:", ncol(pbp), "\n")
cat("Seasons included:", paste(unique(pbp$season), collapse = ", "), "\n")
#| label: setup-py
#| message: false
#| warning: false
import pandas as pd
import numpy as np
import nfl_data_py as nfl
import matplotlib.pyplot as plt
import seaborn as sns
# Load play-by-play data for multiple seasons
pbp = nfl.import_pbp_data([2021, 2022, 2023])
# Display data dimensions
print(f"Loaded {len(pbp):,} plays from {pbp['season'].nunique()} seasons")
print(f"Number of columns: {len(pbp.columns)}")
print(f"Seasons included: {', '.join(map(str, sorted(pbp['season'].unique())))}")
Understanding Play-by-Play Structure
Play-by-play data contains over 370 columns of information about each play. Let's examine the key categories:
#| label: pbp-structure-r
#| message: false
# Display key variable categories
key_vars <- list(
Game_Info = c("game_id", "season", "week", "game_date", "season_type"),
Teams = c("posteam", "defteam", "home_team", "away_team"),
Situation = c("down", "ydstogo", "yardline_100", "quarter_seconds_remaining"),
Play_Info = c("play_type", "yards_gained", "touchdown", "fumble", "interception"),
Advanced_Metrics = c("epa", "wpa", "success", "cpoe", "air_yards"),
Players = c("passer_player_name", "rusher_player_name", "receiver_player_name")
)
# Show sample data for each category
pbp %>%
filter(season == 2023, week == 1, !is.na(epa)) %>%
  select(all_of(unlist(key_vars, use.names = FALSE))) %>%  # !!! on a named vector would rename columns
slice(1:3) %>%
glimpse()
#| label: pbp-structure-py
#| message: false
# Display key variable categories
key_vars = {
'Game_Info': ['game_id', 'season', 'week', 'game_date', 'season_type'],
'Teams': ['posteam', 'defteam', 'home_team', 'away_team'],
'Situation': ['down', 'ydstogo', 'yardline_100', 'quarter_seconds_remaining'],
'Play_Info': ['play_type', 'yards_gained', 'touchdown', 'fumble', 'interception'],
'Advanced_Metrics': ['epa', 'wpa', 'success', 'cpoe', 'air_yards'],
'Players': ['passer_player_name', 'rusher_player_name', 'receiver_player_name']
}
# Flatten the list of columns
all_cols = [col for cols in key_vars.values() for col in cols]
# Show sample data
sample_data = (pbp
.query("season == 2023 & week == 1 & epa.notna()")
[all_cols]
.head(3)
)
print("\nSample Play-by-Play Data:")
print(sample_data.to_string())
Key Variables Explained
Here are the most important variables you'll work with:
Identifiers and Context:
- `game_id`: Unique game identifier (format: `YYYY_WW_AWAY_HOME`)
- `play_id`: Unique play identifier within a game
- `old_game_id`: Legacy NFL game ID for cross-referencing with older data sources
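The `game_id` pattern can be split mechanically when you need its parts; a small sketch (with a hypothetical helper name, `parse_game_id`) that parses it into components:

```python
# Parse a game_id of the form YYYY_WW_AWAY_HOME (hypothetical helper)
def parse_game_id(game_id: str) -> dict:
    season, week, away, home = game_id.split("_")
    return {
        "season": int(season),
        "week": int(week),
        "away_team": away,
        "home_team": home,
    }

print(parse_game_id("2023_01_DET_KC"))
# {'season': 2023, 'week': 1, 'away_team': 'DET', 'home_team': 'KC'}
```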
Game Situation:
- `down`: Down (1-4)
- `ydstogo`: Yards needed for a first down
- `yardline_100`: Yards from opponent's end zone (0-100)
- `quarter_seconds_remaining`: Time left in current quarter
- `half_seconds_remaining`: Time left in current half
- `game_seconds_remaining`: Time left in game
Play Outcome:
- `play_type`: Type of play (pass, run, punt, field_goal, etc.)
- `yards_gained`: Net yards gained on the play
- `touchdown`: Binary indicator for touchdown
- `first_down`: Binary indicator for first down conversion
Advanced Metrics:
- `epa`: Expected Points Added
- `wpa`: Win Probability Added
- `success`: Binary success indicator (EPA > 0)
- `cpoe`: Completion Percentage Over Expected (passing plays)
- `air_yards`: Yards ball traveled in the air (passing plays)
- `yards_after_catch`: Yards gained after reception
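Because `success` is just an indicator for EPA > 0, a team's success rate is simply the mean of that indicator. A minimal pandas sketch with made-up EPA values:

```python
import pandas as pd

# Toy EPA values for five plays (illustrative, not real data)
plays = pd.DataFrame({"epa": [0.45, -0.62, 1.10, -0.08, 0.03]})

# success is the indicator EPA > 0; success rate is its mean
plays["success"] = (plays["epa"] > 0).astype(int)
success_rate = plays["success"].mean()
print(f"Success rate: {success_rate:.1%}")  # 3 of 5 plays had positive EPA
```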
Working with Play Types
Common play types include:
- `pass`: Passing attempt
- `run`: Running play
- `punt`: Punt
- `field_goal`: Field goal attempt
- `kickoff`: Kickoff
- `extra_point`: PAT attempt
- `qb_kneel`: QB kneel
- `qb_spike`: QB spike

For most analyses, filter to `play_type %in% c("pass", "run")` and `!is.na(epa)` to focus on standard offensive plays.
nflreadr: Efficient Data Loading
While nflfastR's load_pbp() function is convenient, nflreadr provides faster, more flexible data loading and access to additional datasets.
Available Datasets
#| label: nflreadr-datasets-r
#| eval: false
library(nflreadr)
# Play-by-play data (faster than load_pbp)
pbp <- load_pbp(2023)
# Roster data
rosters <- load_rosters(2023)
# Team information
teams <- load_teams()
# Schedule data
schedules <- load_schedules(2023)
# Player statistics
player_stats <- load_player_stats(2023)
# Next Gen Stats
ngs_passing <- load_nextgen_stats(2023, stat_type = "passing")
ngs_rushing <- load_nextgen_stats(2023, stat_type = "rushing")
ngs_receiving <- load_nextgen_stats(2023, stat_type = "receiving")
# Depth charts
depth_charts <- load_depth_charts(2023)
# Injuries
injuries <- load_injuries(2023)
# Combine results
combine <- load_combine()
# Draft picks
draft_picks <- load_draft_picks()
# PFF data (if available)
# snap_counts <- load_snap_counts(2023)
#| label: nflreadr-datasets-py
#| eval: false
import nfl_data_py as nfl
# Play-by-play data
pbp = nfl.import_pbp_data([2023])
# Roster data
rosters = nfl.import_rosters([2023])
# Team information
teams = nfl.import_team_desc()
# Schedule data
schedules = nfl.import_schedules([2023])
# Weekly player statistics
weekly_data = nfl.import_weekly_data([2023])
# Seasonal player statistics
seasonal_data = nfl.import_seasonal_data([2023])
# Next Gen Stats
ngs_passing = nfl.import_ngs_data('passing', [2023])
ngs_rushing = nfl.import_ngs_data('rushing', [2023])
ngs_receiving = nfl.import_ngs_data('receiving', [2023])
# Depth charts
depth_charts = nfl.import_depth_charts([2023])
# Injuries
injuries = nfl.import_injuries([2023])
# Combine results
combine = nfl.import_combine_data([2020, 2021, 2022, 2023])
# Draft picks
draft_picks = nfl.import_draft_picks([2020, 2021, 2022, 2023])
# Snap counts
snap_counts = nfl.import_snap_counts([2023])
Working with Roster Data
Roster data connects player IDs across different systems and provides biographical information:
#| label: roster-data-r
#| message: false
#| warning: false
#| cache: true
library(nflreadr)
# Load roster data
rosters_2023 <- load_rosters(2023)
# Display key roster information
rosters_2023 %>%
filter(team == "KC") %>%
select(full_name, position, jersey_number, height, weight,
college, years_exp, gsis_id) %>%
arrange(position, jersey_number) %>%
head(10) %>%
gt() %>%
tab_header(
title = "Kansas City Chiefs Roster Sample",
subtitle = "2023 Season"
) %>%
cols_label(
full_name = "Player",
position = "Pos",
jersey_number = "No.",
height = "Height",
weight = "Weight",
college = "College",
years_exp = "Exp",
gsis_id = "GSIS ID"
)
#| label: roster-data-py
#| message: false
#| warning: false
#| cache: true
# Load roster data
rosters_2023 = nfl.import_rosters([2023])
# Display key roster information
kc_roster = (rosters_2023
.query("team == 'KC'")
[['full_name', 'position', 'jersey_number', 'height', 'weight',
'college', 'years_exp', 'gsis_id']]
.sort_values(['position', 'jersey_number'])
.head(10)
)
print("\nKansas City Chiefs Roster Sample (2023):")
print(kc_roster.to_string(index=False))
Team Information and Colors
The teams dataset provides official team colors, logos, and organizational information:
#| label: team-info-r
#| message: false
#| warning: false
# Load team information
teams <- load_teams()
# Display team colors and identifiers
teams %>%
  select(team_abbr, team_name, team_color, team_color2,
         team_logo_espn, team_conf, team_division) %>%
  filter(team_division == "AFC West") %>%
  gt() %>%
  tab_header(
    title = "AFC West Team Information"
  ) %>%
  cols_label(
    team_abbr = "Team",
    team_name = "Name",
    team_color = "Primary Color",
    team_color2 = "Secondary Color",
    team_logo_espn = "Logo URL",
    team_conf = "Conf",
    team_division = "Div"
  )
#| label: team-info-py
#| message: false
#| warning: false
# Load team information
teams = nfl.import_team_desc()
# Display team colors and identifiers
afc_west = (teams
.query("team_conf == 'AFC' & team_division == 'AFC West'")
[['team_abbr', 'team_name', 'team_color', 'team_color2',
'team_conf', 'team_division']]
)
print("\nAFC West Team Information:")
print(afc_west.to_string(index=False))
nflplotR: Professional NFL Visualizations
nflplotR is a powerful package that makes it easy to create publication-quality NFL visualizations with team logos, colors, and consistent styling.
Basic Logo Plots
#| label: fig-logo-plot-r
#| fig-cap: "Average EPA per play by team (2023 season)"
#| fig-width: 12
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
# Calculate team offensive EPA
team_epa <- pbp %>%
filter(season == 2023, !is.na(epa), !is.na(posteam)) %>%
filter(play_type %in% c("pass", "run")) %>%
group_by(posteam) %>%
summarise(
plays = n(),
epa_per_play = mean(epa),
success_rate = mean(success),
.groups = "drop"
) %>%
arrange(desc(epa_per_play))
# Create logo plot
team_epa %>%
ggplot(aes(x = reorder(posteam, epa_per_play), y = epa_per_play)) +
geom_col(aes(fill = posteam, color = posteam), alpha = 0.7, width = 0.7) +
scale_fill_nfl(type = "primary") +
scale_color_nfl(type = "secondary") +
nflplotR::geom_nfl_logos(aes(team_abbr = posteam), width = 0.04, alpha = 0.9) +
coord_flip() +
labs(
title = "Offensive Efficiency by Team",
subtitle = "Average EPA per play, 2023 Regular Season",
x = NULL,
y = "EPA per Play",
caption = "Data: nflfastR | Includes pass and run plays only"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
axis.text.y = element_blank(),
legend.position = "none",
panel.grid.major.y = element_blank()
)
#| label: fig-logo-plot-py
#| fig-cap: "Average EPA per play by team - Python (2023 season)"
#| fig-width: 12
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
# Calculate team offensive EPA
team_epa = (pbp
.query("season == 2023 & epa.notna() & posteam.notna()")
.query("play_type.isin(['pass', 'run'])")
.groupby('posteam')
.agg(
plays=('epa', 'count'),
epa_per_play=('epa', 'mean'),
success_rate=('success', 'mean')
)
.reset_index()
.sort_values('epa_per_play', ascending=True)
)
# Create bar plot
fig, ax = plt.subplots(figsize=(12, 8))
# Get team colors
team_colors = dict(zip(teams['team_abbr'], teams['team_color']))
colors = [team_colors.get(team, '#333333') for team in team_epa['posteam']]
bars = ax.barh(range(len(team_epa)), team_epa['epa_per_play'], color=colors, alpha=0.7)
# Customize plot
ax.set_yticks(range(len(team_epa)))
ax.set_yticklabels(team_epa['posteam'])
ax.set_xlabel('EPA per Play', fontsize=12)
ax.set_title('Offensive Efficiency by Team\nAverage EPA per play, 2023 Regular Season',
fontsize=16, fontweight='bold', pad=20)
ax.text(0.99, 0.01, 'Data: nfl_data_py | Includes pass and run plays only',
transform=ax.transAxes, ha='right', va='bottom', fontsize=8, style='italic')
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5, alpha=0.3)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
Scatter Plots with Logos
One of nflplotR's most powerful features is the ability to replace points with team logos:
#| label: fig-scatter-logos-r
#| fig-cap: "Team offensive vs defensive EPA (2023 season)"
#| fig-width: 10
#| fig-height: 10
#| message: false
#| warning: false
#| cache: true
# Calculate offensive and defensive EPA
team_off_def <- pbp %>%
filter(season == 2023, !is.na(epa)) %>%
filter(play_type %in% c("pass", "run")) %>%
group_by(posteam) %>%
summarise(off_epa = mean(epa), .groups = "drop") %>%
left_join(
pbp %>%
filter(season == 2023, !is.na(epa)) %>%
filter(play_type %in% c("pass", "run")) %>%
group_by(defteam) %>%
summarise(def_epa = mean(epa), .groups = "drop"),
by = c("posteam" = "defteam")
)
# Create scatter plot with logos
team_off_def %>%
ggplot(aes(x = off_epa, y = -def_epa)) +
geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
geom_nfl_logos(aes(team_abbr = posteam), width = 0.06, alpha = 0.8) +
labs(
title = "Team Performance Matrix: Offense vs Defense",
subtitle = "EPA per play, 2023 Regular Season | Top right = Best teams",
x = "Offensive EPA per Play",
y = "Defensive EPA per Play (reversed, higher = better)",
caption = "Data: nflfastR"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11),
panel.grid.minor = element_blank()
)
#| label: fig-scatter-logos-py
#| fig-cap: "Team offensive vs defensive EPA - Python (2023 season)"
#| fig-width: 10
#| fig-height: 10
#| message: false
#| warning: false
#| cache: true
# Calculate offensive EPA
off_epa = (pbp
.query("season == 2023 & epa.notna() & play_type.isin(['pass', 'run'])")
.groupby('posteam')
.agg(off_epa=('epa', 'mean'))
.reset_index()
)
# Calculate defensive EPA
def_epa = (pbp
.query("season == 2023 & epa.notna() & play_type.isin(['pass', 'run'])")
.groupby('defteam')
.agg(def_epa=('epa', 'mean'))
.reset_index()
)
# Merge
team_off_def = off_epa.merge(def_epa, left_on='posteam', right_on='defteam')
# Create scatter plot
fig, ax = plt.subplots(figsize=(10, 10))
# Plot points with team colors
for _, row in team_off_def.iterrows():
team = row['posteam']
color = team_colors.get(team, '#333333')
ax.scatter(row['off_epa'], -row['def_epa'], s=300, color=color, alpha=0.6)
ax.text(row['off_epa'], -row['def_epa'], team,
ha='center', va='center', fontsize=8, fontweight='bold')
# Add reference lines
ax.axvline(x=0, linestyle='--', alpha=0.5, color='gray')
ax.axhline(y=0, linestyle='--', alpha=0.5, color='gray')
# Customize plot
ax.set_xlabel('Offensive EPA per Play', fontsize=12)
ax.set_ylabel('Defensive EPA per Play (reversed, higher = better)', fontsize=12)
ax.set_title('Team Performance Matrix: Offense vs Defense\nEPA per play, 2023 Regular Season | Top right = Best teams',
fontsize=14, fontweight='bold', pad=20)
ax.text(0.99, 0.01, 'Data: nfl_data_py',
transform=ax.transAxes, ha='right', va='bottom', fontsize=8, style='italic')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Time Series with Team Colors
#| label: fig-time-series-r
#| fig-cap: "Weekly EPA trends for AFC West teams (2023)"
#| fig-width: 12
#| fig-height: 6
#| message: false
#| warning: false
#| cache: true
# Calculate weekly EPA for AFC West teams
afc_west_teams <- c("KC", "LV", "LAC", "DEN")
weekly_epa <- pbp %>%
filter(season == 2023, season_type == "REG",
posteam %in% afc_west_teams,
!is.na(epa), play_type %in% c("pass", "run")) %>%
group_by(posteam, week) %>%
summarise(epa_per_play = mean(epa), .groups = "drop")
# Create line plot
weekly_epa %>%
ggplot(aes(x = week, y = epa_per_play, color = posteam)) +
geom_line(linewidth = 1.5, alpha = 0.8) +
geom_point(size = 3) +
scale_color_nfl(type = "primary") +
scale_x_continuous(breaks = 1:18) +
labs(
title = "AFC West Weekly Offensive Performance",
subtitle = "EPA per play by week, 2023 Regular Season",
x = "Week",
y = "EPA per Play",
color = "Team",
caption = "Data: nflfastR"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top",
panel.grid.minor.x = element_blank()
)
#| label: fig-time-series-py
#| fig-cap: "Weekly EPA trends for AFC West teams - Python (2023)"
#| fig-width: 12
#| fig-height: 6
#| message: false
#| warning: false
#| cache: true
# Calculate weekly EPA for AFC West teams
afc_west_teams = ['KC', 'LV', 'LAC', 'DEN']
weekly_epa = (pbp
.query("season == 2023 & season_type == 'REG' & posteam in @afc_west_teams")
.query("epa.notna() & play_type.isin(['pass', 'run'])")
.groupby(['posteam', 'week'])
.agg(epa_per_play=('epa', 'mean'))
.reset_index()
)
# Create line plot
fig, ax = plt.subplots(figsize=(12, 6))
for team in afc_west_teams:
team_data = weekly_epa[weekly_epa['posteam'] == team]
color = team_colors.get(team, '#333333')
ax.plot(team_data['week'], team_data['epa_per_play'],
marker='o', linewidth=2.5, label=team, color=color, alpha=0.8)
ax.set_xlabel('Week', fontsize=12)
ax.set_ylabel('EPA per Play', fontsize=12)
ax.set_title('AFC West Weekly Offensive Performance\nEPA per play by week, 2023 Regular Season',
fontsize=14, fontweight='bold', pad=20)
ax.legend(title='Team', loc='best', frameon=True)
ax.grid(alpha=0.3)
ax.set_xticks(range(1, 19))
ax.text(0.99, 0.01, 'Data: nfl_data_py',
transform=ax.transAxes, ha='right', va='bottom', fontsize=8, style='italic')
plt.tight_layout()
plt.show()
nfl4th: Fourth Down Decision Analysis
The nfl4th package provides recommendations for 4th down decisions based on win probability models.
#| label: nfl4th-example-r
#| eval: false
#| echo: true
library(nfl4th)
# Get 4th down recommendation
# Example: 4th and 3 at opponent's 35, down by 3, 5 minutes left
recommendation <- add_4th_probs(
df = tibble(
# Game situation
game_seconds_remaining = 300,
half_seconds_remaining = 300,
# Field position and down
yardline_100 = 35,
ydstogo = 3,
# Score
score_differential = -3,
# Other context
home = 1,
posteam_timeouts_remaining = 3,
defteam_timeouts_remaining = 3
)
)
# View recommendation
recommendation %>%
  select(yardline_100, ydstogo, go_boost, first_down_prob,
         go_wp, fg_wp, punt_wp) %>%  # columns added by add_4th_probs()
glimpse()
#| label: nfl4th-example-py
#| eval: false
#| echo: true
# Python doesn't have a direct nfl4th equivalent
# But you can calculate win probability using EPA models
# Here's a conceptual example of how you might approach this
def calculate_4th_down_wp(
yardline_100, ydstogo, score_diff,
seconds_remaining, timeouts_remaining
):
"""
Simplified 4th down win probability calculator
In practice, you would use a trained model
"""
# This is a placeholder for demonstration
# Real implementation would use historical data
# to build a win probability model
# Factors to consider:
# - Distance to go
# - Field position
# - Score differential
# - Time remaining
# - Historical conversion rates
pass # Implementation would go here
# Example usage
situation = {
'yardline_100': 35,
'ydstogo': 3,
'score_diff': -3,
'seconds_remaining': 300,
'timeouts': 3
}
nflseedR: Playoff Simulation
nflseedR simulates playoff scenarios and calculates playoff probabilities.
#| label: nflseedr-example-r
#| eval: false
#| echo: true
library(nflseedR)
# Simulate playoff scenarios
# This runs thousands of simulations of remaining games
sims <- nflseedR::simulate_nfl(
nfl_season = 2023,
fresh_season = FALSE, # Use current standings
simulations = 10000
)
# Summarise playoff probabilities
# simulate_nfl() returns a list; its `overall` element already holds
# each team's simulated playoff, division, conference, and Super Bowl odds
playoff_probs <- sims$overall %>%
  arrange(desc(playoff))
# View top teams
playoff_probs %>%
  head(10) %>%
  gt() %>%
  fmt_percent(columns = c(playoff, div1, won_conf, won_sb), decimals = 1) %>%
  fmt_number(columns = wins, decimals = 1)
#| label: nflseedr-example-py
#| eval: false
#| echo: true
# Python implementation of playoff simulation
# This is a simplified conceptual example
import numpy as np
def simulate_remaining_games(schedule, team_strengths, n_sims=10000):
"""
Simulate remaining NFL games to calculate playoff probabilities
"""
results = []
for sim in range(n_sims):
# Simulate each remaining game
season_results = schedule.copy()
for idx, game in season_results.iterrows():
home_team = game['home_team']
away_team = game['away_team']
# Get team strengths (could use EPA, ELO, etc.)
home_strength = team_strengths.get(home_team, 0)
away_strength = team_strengths.get(away_team, 0)
# Calculate win probability (simplified)
home_wp = 1 / (1 + np.exp(-(home_strength - away_strength + 0.2)))
# Simulate outcome
home_wins = np.random.random() < home_wp
# Store result
# ... continue simulation logic
return results
# This would be expanded into full playoff simulation
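A compact, runnable version of that loop, using made-up strength ratings and a four-game toy schedule (all numbers are illustrative, not real estimates):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical strength ratings and remaining (home, away) games
team_strengths = {"KC": 0.15, "DEN": -0.05, "LAC": 0.02, "LV": -0.10}
schedule = [("KC", "DEN"), ("LAC", "LV"), ("DEN", "LAC"), ("LV", "KC")]

def simulate_wins(schedule, strengths, n_sims=10_000, hfa=0.2):
    """Monte Carlo estimate of expected wins from remaining games."""
    wins = {team: 0 for team in strengths}
    for _ in range(n_sims):
        for home, away in schedule:
            # Logistic win probability with a home-field bump (hfa)
            home_wp = 1 / (1 + np.exp(-(strengths[home] - strengths[away] + hfa)))
            winner = home if rng.random() < home_wp else away
            wins[winner] += 1
    return {team: w / n_sims for team, w in wins.items()}

expected_wins = simulate_wins(schedule, team_strengths)
print(expected_wins)
```

A full playoff model would also track division standings and tiebreakers per simulation, which is exactly the bookkeeping nflseedR handles in R.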
Combining Multiple Data Sources
The real power of the nflverse comes from combining multiple datasets. Let's build a comprehensive player analysis:
#| label: combine-data-r
#| message: false
#| warning: false
#| cache: true
# Load multiple data sources
rosters <- load_rosters(2023)
pbp_2023 <- load_pbp(2023)
# Analyze QB performance with roster information
qb_analysis <- pbp_2023 %>%
filter(!is.na(passer_player_id), play_type == "pass") %>%
group_by(passer_player_id, passer_player_name) %>%
summarise(
team = first(posteam),
dropbacks = n(),
epa_per_dropback = mean(epa, na.rm = TRUE),
cpoe = mean(cpoe, na.rm = TRUE),
success_rate = mean(success, na.rm = TRUE),
.groups = "drop"
) %>%
filter(dropbacks >= 100) %>%
# Join with roster data
left_join(
rosters %>% select(gsis_id, college, years_exp, height, weight),
by = c("passer_player_id" = "gsis_id")
) %>%
arrange(desc(epa_per_dropback))
# Display top QBs
qb_analysis %>%
select(passer_player_name, team, dropbacks, epa_per_dropback,
cpoe, success_rate, college, years_exp) %>%
head(10) %>%
gt() %>%
tab_header(
title = "Top Quarterback Performance",
subtitle = "Minimum 100 dropbacks, 2023 season"
) %>%
cols_label(
passer_player_name = "Player",
team = "Team",
dropbacks = "Dropbacks",
epa_per_dropback = "EPA/Play",
cpoe = "CPOE",
success_rate = "Success %",
college = "College",
years_exp = "Exp"
) %>%
fmt_number(
columns = c(epa_per_dropback, cpoe, success_rate),
decimals = 3
) %>%
fmt_number(
columns = dropbacks,
decimals = 0
)
#| label: combine-data-py
#| message: false
#| warning: false
#| cache: true
# Load multiple data sources
rosters = nfl.import_rosters([2023])
pbp_2023_py = nfl.import_pbp_data([2023])
# Analyze QB performance with roster information
qb_analysis = (pbp_2023_py
.query("passer_player_id.notna() & play_type == 'pass'")
.groupby(['passer_player_id', 'passer_player_name'])
.agg(
team=('posteam', 'first'),
dropbacks=('epa', 'count'),
        epa_per_dropback=('epa', 'mean'),
        cpoe=('cpoe', 'mean'),
        success_rate=('success', 'mean')
)
.reset_index()
.query("dropbacks >= 100")
)
# Join with roster data
qb_analysis = qb_analysis.merge(
rosters[['gsis_id', 'college', 'years_exp', 'height', 'weight']],
left_on='passer_player_id',
right_on='gsis_id',
how='left'
)
# Sort and display
qb_analysis = qb_analysis.sort_values('epa_per_dropback', ascending=False)
print("\nTop Quarterback Performance (Min 100 dropbacks, 2023):")
print(qb_analysis[['passer_player_name', 'team', 'dropbacks', 'epa_per_dropback',
'cpoe', 'success_rate', 'college', 'years_exp']].head(10).to_string(index=False))
Advanced Example: Team Composition Analysis
Let's analyze team roster composition and its relationship to performance:
#| label: roster-composition-r
#| message: false
#| warning: false
#| cache: true
# Analyze roster composition by team
roster_composition <- rosters %>%
filter(status == "ACT") %>%
group_by(team, position) %>%
summarise(
count = n(),
avg_exp = mean(years_exp, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = position,
values_from = count,
values_fill = 0
)
# Calculate team performance
team_performance <- pbp %>%
filter(season == 2023, !is.na(epa), play_type %in% c("pass", "run")) %>%
group_by(posteam) %>%
summarise(
epa_per_play = mean(epa),
.groups = "drop"
)
# Combine and analyze
roster_performance <- roster_composition %>%
left_join(team_performance, by = c("team" = "posteam"))
# Show correlation between roster composition and performance
print("Roster composition analysis complete")
#| label: roster-composition-py
#| message: false
#| warning: false
#| cache: true
# Analyze roster composition by team
roster_composition = (rosters
.query("status == 'ACT'")
.groupby(['team', 'position'])
.agg(
count=('gsis_id', 'count'),
avg_exp=('years_exp', 'mean')
)
.reset_index()
.pivot_table(
index='team',
columns='position',
values='count',
fill_value=0
)
.reset_index()
)
# Calculate team performance
team_performance = (pbp_2023_py
.query("epa.notna() & play_type.isin(['pass', 'run'])")
.groupby('posteam')
.agg(epa_per_play=('epa', 'mean'))
.reset_index()
)
# Combine and analyze
roster_performance = roster_composition.merge(
team_performance,
left_on='team',
right_on='posteam',
how='left'
)
print("\nRoster composition analysis complete")
print(f"Teams analyzed: {len(roster_performance)}")
Data Structure Deep Dive
Understanding the data structure is crucial for effective analysis. Let's examine the most important aspects:
Play-by-Play Data Schema
The play-by-play data follows a hierarchical structure:
Season
└── Game
└── Drive
└── Play
├── Pre-snap situation
├── Play execution
├── Post-play situation
└── Calculated metrics
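This nesting can be demonstrated with a toy frame: each row is a play carrying its game and drive identifiers, so drives and games are recovered by de-duplicating (IDs made up for illustration):

```python
import pandas as pd

# Toy play-by-play rows; IDs are made up for illustration
pbp_toy = pd.DataFrame({
    "game_id": ["2023_01_DET_KC"] * 4 + ["2023_01_BUF_NYJ"] * 2,
    "drive":   [1, 1, 2, 2, 1, 1],
    "play_id": [55, 76, 101, 122, 39, 60],
})

# Plays nest inside drives, and drives inside games, so drives must
# be counted per game (drive numbers restart each game)
n_games = pbp_toy["game_id"].nunique()
n_drives = len(pbp_toy[["game_id", "drive"]].drop_duplicates())
n_plays = len(pbp_toy)
print(n_games, n_drives, n_plays)  # 2 3 6
```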
Key Data Relationships
#| label: data-relationships-r
#| message: false
#| warning: false
# Examine data relationships
data_structure <- tibble(
Level = c("Season", "Game", "Drive", "Play"),
Key_Field = c("season", "game_id", "drive", "play_id"),
Count_2023 = c(
pbp %>% filter(season == 2023) %>% pull(season) %>% n_distinct(),
pbp %>% filter(season == 2023) %>% pull(game_id) %>% n_distinct(),
pbp %>% filter(season == 2023) %>% distinct(game_id, drive) %>% nrow(),  # drives restart each game, so count distinct game/drive pairs
pbp %>% filter(season == 2023) %>% nrow()
),
Example_Fields = c(
"season_type, week",
"home_team, away_team, game_date",
"drive_start_yard_line, drive_end_yard_line",
"down, ydstogo, play_type, epa"
)
)
data_structure %>%
gt() %>%
tab_header(
title = "Play-by-Play Data Structure",
subtitle = "Hierarchical organization of NFL data"
) %>%
fmt_number(
columns = Count_2023,
decimals = 0,
use_seps = TRUE
)
#| label: data-relationships-py
#| message: false
#| warning: false
# Examine data relationships
data_structure = pd.DataFrame({
'Level': ['Season', 'Game', 'Drive', 'Play'],
'Key_Field': ['season', 'game_id', 'drive', 'play_id'],
'Count_2023': [
pbp_2023_py['season'].nunique(),
pbp_2023_py['game_id'].nunique(),
pbp_2023_py[['game_id', 'drive']].drop_duplicates().shape[0] if 'drive' in pbp_2023_py.columns else 0,  # count distinct game/drive pairs
len(pbp_2023_py)
],
'Example_Fields': [
'season_type, week',
'home_team, away_team, game_date',
'drive_start_yard_line, drive_end_yard_line',
'down, ydstogo, play_type, epa'
]
})
print("\nPlay-by-Play Data Structure:")
print(data_structure.to_string(index=False))
Missing Data Patterns
Understanding which fields may have missing data is important:
#| label: missing-data-r
#| message: false
#| warning: false
# Analyze missing data patterns
missing_analysis <- pbp %>%
filter(season == 2023) %>%
summarise(
total_plays = n(),
missing_epa = sum(is.na(epa)),
missing_wpa = sum(is.na(wpa)),
missing_cpoe = sum(is.na(cpoe)),
missing_air_yards = sum(is.na(air_yards)),
missing_passer = sum(is.na(passer_player_id))
) %>%
pivot_longer(
cols = starts_with("missing_"),
names_to = "field",
values_to = "missing_count"
) %>%
mutate(
field = str_remove(field, "missing_"),
pct_missing = missing_count / total_plays * 100
)
missing_analysis %>%
gt() %>%
tab_header(
title = "Missing Data Analysis",
subtitle = "2023 season play-by-play data"
) %>%
fmt_number(
columns = c(missing_count, total_plays),
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = pct_missing,
decimals = 2
)
#| label: missing-data-py
#| message: false
#| warning: false
# Analyze missing data patterns
fields_to_check = ['epa', 'wpa', 'cpoe', 'air_yards', 'passer_player_id']
total_plays = len(pbp_2023_py)
missing_data = []
for field in fields_to_check:
if field in pbp_2023_py.columns:
missing_count = pbp_2023_py[field].isna().sum()
missing_data.append({
'field': field,
'missing_count': missing_count,
'pct_missing': (missing_count / total_plays) * 100
})
missing_analysis = pd.DataFrame(missing_data)
print("\nMissing Data Analysis (2023 season):")
print(f"Total plays: {total_plays:,}")
print(missing_analysis.to_string(index=False))
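One reason this audit matters: pandas skips NaN silently when averaging, so a mean can be computed over far fewer plays than you expect. A toy illustration:

```python
import numpy as np
import pandas as pd

# Toy EPA column with missing values (as on some special-teams plays)
epa = pd.Series([0.5, np.nan, -0.3, np.nan, 0.1])

# mean() skips NaN silently, averaging only the 3 observed plays
print(epa.mean())
# Audit missingness before trusting an average
print(epa.isna().sum())   # 2 plays missing
print(epa.isna().mean())  # 0.4 -> 40% missing
```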
Common Missing Data Patterns
- **EPA/WPA**: Missing for special teams plays (kickoffs, punts, XP)
- **CPOE**: Only available for pass attempts
- **Air Yards**: Only available for pass attempts
- **Player IDs**: May be missing for some older seasons or certain play types
- **Success**: Defined differently for different play types

Always filter `!is.na(epa)` for offensive analyses to avoid issues.
Best Practices
1. Data Loading and Caching
#| label: best-practice-loading-r
#| eval: false
#| echo: true
# DO: Cache loaded data
pbp <- load_pbp(2020:2023)
saveRDS(pbp, "data/pbp_2020_2023.rds")
# Later, load from cache
pbp <- readRDS("data/pbp_2020_2023.rds")
# DON'T: Repeatedly download the same data
for (season in 2020:2023) {
pbp <- load_pbp(season) # Inefficient!
}
#| label: best-practice-loading-py
#| eval: false
#| echo: true
# DO: Cache loaded data
pbp = nfl.import_pbp_data(list(range(2020, 2024)))
pbp.to_parquet('data/pbp_2020_2023.parquet')
# Later, load from cache
pbp = pd.read_parquet('data/pbp_2020_2023.parquet')
# DON'T: Repeatedly download the same data
for season in range(2020, 2024):
pbp = nfl.import_pbp_data([season]) # Inefficient!
2. Filtering Best Practices
```{r}
#| label: best-practice-filtering-r
#| eval: false
#| echo: true
# DO: Filter comprehensively
analysis_data <- pbp %>%
  filter(
    !is.na(epa),                      # Remove plays without EPA
    play_type %in% c("pass", "run"),  # Only standard plays
    !is.na(posteam),                  # Remove missing team
    season_type == "REG"              # Only regular season
  )

# DON'T: Incomplete filtering
bad_data <- pbp %>%
  filter(play_type == "pass")  # Missing other important filters!
```
```{python}
#| label: best-practice-filtering-py
#| eval: false
#| echo: true
# DO: Filter comprehensively
analysis_data = (pbp
    .query("epa.notna()")                      # Remove plays without EPA
    .query("play_type.isin(['pass', 'run'])")  # Only standard plays
    .query("posteam.notna()")                  # Remove missing team
    .query("season_type == 'REG'")             # Only regular season
)

# DON'T: Incomplete filtering
bad_data = pbp.query("play_type == 'pass'")  # Missing other important filters!
```
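Since these four filters travel together, it can help to wrap them in one reusable function so every analysis starts from the same base. A minimal sketch — `offensive_plays` is a hypothetical helper name, and the column names follow the nflfastR play-by-play schema:

```python
import pandas as pd

def offensive_plays(pbp: pd.DataFrame) -> pd.DataFrame:
    """Apply the standard analysis filters in one place: non-missing EPA,
    pass/run plays only, a known offense, and regular season only."""
    return pbp[
        pbp['epa'].notna()
        & pbp['play_type'].isin(['pass', 'run'])
        & pbp['posteam'].notna()
        & (pbp['season_type'] == 'REG')
    ]
```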
### 3. Grouping and Aggregation
```{r}
#| label: best-practice-grouping-r
#| eval: false
#| echo: true
# DO: Use meaningful aggregations
team_stats <- pbp %>%
  filter(!is.na(epa), play_type %in% c("pass", "run")) %>%
  group_by(posteam, season) %>%
  summarise(
    plays = n(),
    epa_per_play = mean(epa),
    success_rate = mean(success),
    explosive_rate = mean(epa > 1.0),
    .groups = "drop"
  ) %>%
  filter(plays >= 100)  # Minimum sample size

# DON'T: Aggregate without sample size filters
bad_stats <- pbp %>%
  group_by(posteam) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))  # Includes all plays!
```
```{python}
#| label: best-practice-grouping-py
#| eval: false
#| echo: true
# DO: Use meaningful aggregations
team_stats = (pbp
    .query("epa.notna() & play_type.isin(['pass', 'run'])")
    .groupby(['posteam', 'season'])
    .agg(
        plays=('epa', 'count'),
        epa_per_play=('epa', 'mean'),
        success_rate=('success', 'mean'),
        explosive_rate=('epa', lambda x: (x > 1.0).mean())
    )
    .reset_index()
    .query("plays >= 100")  # Minimum sample size
)

# DON'T: Aggregate without sample size filters
bad_stats = (pbp
    .groupby('posteam')
    .agg(avg_epa=('epa', 'mean'))  # Includes all plays!
)
```
### 4. Player Identification
**Player ID Best Practices**

Always use player IDs (like `gsis_id`) rather than names when joining data:

- Names may have spelling variations
- Multiple players may share the same name
- Names may change (Jr., II, III variations)
- IDs are guaranteed unique

```r
# DO: Join on the stable player ID
left_join(pbp, rosters, by = c("passer_player_id" = "gsis_id"))

# DON'T: Join on player names
left_join(pbp, rosters, by = c("passer_player_name" = "full_name"))
```
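The duplicate-name hazard is easy to demonstrate with pandas `merge` on two made-up rows (synthetic data, not real rosters): a name join fans one play out to every player who shares the name, while an ID join stays one-to-one.

```python
import pandas as pd

plays = pd.DataFrame({
    'passer_player_id': ['00-001'],
    'passer_player_name': ['J.Allen'],
    'epa': [0.3],
})
rosters = pd.DataFrame({
    'gsis_id': ['00-001', '00-002'],
    'full_name': ['J.Allen', 'J.Allen'],  # two distinct players, same name
    'position': ['QB', 'LB'],
})

# Name join: ambiguous -- the single play matches both roster rows
by_name = plays.merge(rosters, left_on='passer_player_name',
                      right_on='full_name')

# ID join: unambiguous -- exactly one matching row
by_id = plays.merge(rosters, left_on='passer_player_id',
                    right_on='gsis_id')
```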
## Common Pitfalls
### 1. Not Filtering for Regular Season
```r
# WRONG: Includes preseason and playoffs
team_stats <- pbp %>%
  group_by(posteam, season) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))

# CORRECT: Only regular season
team_stats <- pbp %>%
  filter(season_type == "REG") %>%
  group_by(posteam, season) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))
```
### 2. Including Non-Offensive Plays

```r
# WRONG: Includes all plays
avg_epa <- mean(pbp$epa, na.rm = TRUE)

# CORRECT: Only offensive plays
avg_epa <- pbp %>%
  filter(play_type %in% c("pass", "run"), !is.na(epa)) %>%
  pull(epa) %>%
  mean()
```
### 3. Ignoring Sample Size

```r
# WRONG: No minimum sample size
qb_stats <- pbp %>%
  group_by(passer_player_name) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))

# CORRECT: Minimum attempts threshold
qb_stats <- pbp %>%
  filter(play_type == "pass", !is.na(epa)) %>%
  group_by(passer_player_name) %>%
  summarise(
    attempts = n(),
    avg_epa = mean(epa)
  ) %>%
  filter(attempts >= 100)  # Meaningful sample
```
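Why the threshold matters is easy to see with simulated data (synthetic dropbacks, not real players): a passer with a handful of lucky attempts will "lead the league" until a minimum-attempts filter removes him.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# One QB with 200 synthetic dropbacks, one with just 3 hot ones
pbp = pd.DataFrame({
    'passer_player_name': ['Starter'] * 200 + ['Backup'] * 3,
    'epa': np.concatenate([rng.normal(0.05, 1.0, 200), [2.1, 1.8, 2.4]]),
})

qb = (pbp.groupby('passer_player_name')
         .agg(attempts=('epa', 'size'), avg_epa=('epa', 'mean'))
         .reset_index())

# Without a threshold the 3-attempt backup tops the table;
# with one, only the meaningful sample survives
qb_filtered = qb[qb['attempts'] >= 100]
```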
### 4. Roster and Play-by-Play Misalignment

Be careful when joining roster data with play-by-play data from different time periods:

```r
# WRONG: Using 2023 rosters with 2021 data
pbp_2021 %>%
  left_join(load_rosters(2023), by = c("passer_player_id" = "gsis_id"))

# CORRECT: Match seasons
pbp_2021 %>%
  left_join(load_rosters(2021), by = c("passer_player_id" = "gsis_id"))
```
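When rosters for several seasons live in one frame, the equivalent safeguard is to join on the season key as well as the player ID, so a player who changed teams is attributed correctly. A pandas sketch on made-up rows:

```python
import pandas as pd

# A player who changed teams between seasons (synthetic rows)
rosters = pd.DataFrame({
    'gsis_id': ['00-001', '00-001'],
    'season':  [2021, 2023],
    'team':    ['DEN', 'NYJ'],
})
plays = pd.DataFrame({
    'passer_player_id': ['00-001'],
    'season': [2021],
    'epa': [0.2],
})

# Joining on both ID and season attributes the play to the 2021 team
joined = plays.merge(
    rosters,
    left_on=['passer_player_id', 'season'],
    right_on=['gsis_id', 'season'],
)
```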
## Performance Optimization Tips
### 1. Use nflreadr Instead of nflfastR for Data Loading
```r
# Slower
pbp <- nflfastR::load_pbp(2023)

# Faster
pbp <- nflreadr::load_pbp(2023)
```
### 2. Select Only the Columns You Need

```r
# Slower: Load all 370+ columns
pbp <- load_pbp(2023)

# Faster: Select specific columns
pbp <- load_pbp(2023) %>%
  select(game_id, posteam, defteam, down, ydstogo,
         play_type, yards_gained, epa, wpa)
```
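The memory savings from dropping unused columns are easy to measure. A sketch with a synthetic wide frame standing in for the 370+ column play-by-play table (`memory_usage(deep=True)` reports per-column bytes):

```python
import pandas as pd
import numpy as np

# Synthetic wide frame: 10,000 rows x 50 numeric columns
wide = pd.DataFrame(
    np.random.default_rng(1).normal(size=(10_000, 50)),
    columns=[f'col_{i}' for i in range(50)],
)

# Keep only the columns the analysis actually needs
needed = ['col_0', 'col_1', 'col_2']
narrow = wide[needed]

full_mb = wide.memory_usage(deep=True).sum() / 1e6
slim_mb = narrow.memory_usage(deep=True).sum() / 1e6
print(f"{full_mb:.1f} MB vs {slim_mb:.1f} MB")
```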
### 3. Cache Expensive Computations

```r
# DO: Cache expensive computations
if (!file.exists("cache/team_epa.rds")) {
  team_epa <- pbp %>%
    group_by(posteam, season) %>%
    summarise(epa_per_play = mean(epa), .groups = "drop")
  saveRDS(team_epa, "cache/team_epa.rds")
} else {
  team_epa <- readRDS("cache/team_epa.rds")
}
```