Learning Objectives
By the end of this chapter, you will be able to:
- Master the core nflverse packages (nflfastR, nflreadr, nflplotR)
- Understand the structure and key variables in play-by-play data
- Work effectively with roster, team, and player data
- Create professional NFL visualizations with team logos and colors
- Integrate multiple nflverse data sources for comprehensive analysis
- Apply best practices and avoid common pitfalls in NFL data analysis
Introduction
The nflverse is a comprehensive ecosystem of R packages and Python tools designed specifically for NFL data analysis. Developed and maintained by a community of football analytics enthusiasts, the nflverse has become the de facto standard for accessing, analyzing, and visualizing NFL data.
What makes the nflverse special is its:
- Comprehensive Coverage: Play-by-play data, rosters, schedules, betting lines, and more
- Clean, Consistent Data: Pre-processed and standardized across seasons
- Active Development: Regular updates and new features
- Strong Community: Extensive documentation and support
- Cross-Language Support: Both R (nflverse) and Python (nfl_data_py) implementations
The nflverse Philosophy
The nflverse is built on the principle of making NFL data analysis accessible to everyone. All packages are open-source, free to use, and designed to work seamlessly together. The data is sourced from publicly available NFL feeds and enhanced with advanced metrics calculated by the community.

In this chapter, we'll explore the major components of the nflverse ecosystem, learn how to work with its data structures, and build practical skills for real-world football analysis.
The nflverse Package Family
The nflverse consists of several specialized packages, each serving a specific purpose:
| Package | Purpose | Primary Use Cases |
|---|---|---|
| nflfastR | Play-by-play data and models | EPA, WP, success rate calculations |
| nflreadr | Fast data loading | Efficient data import from nflverse repository |
| nflplotR | NFL-specific visualizations | Team logos, colors, charts |
| nfl4th | Fourth down analysis | 4th down decision recommendations |
| nflseedR | Playoff scenarios | Playoff probability simulations |
| nflfastR-roster | Player information (data repository, not a package) | Roster data and player identifiers |
For Python users, nfl_data_py provides equivalent functionality to nflreadr with a Pythonic interface.
nflfastR: The Foundation
nflfastR is the cornerstone of the nflverse, providing access to detailed play-by-play data with advanced metrics already calculated. Let's explore its capabilities.
Loading Play-by-Play Data
#| label: setup-r
#| message: false
#| warning: false
library(tidyverse)
library(nflfastR)
library(nflplotR)
library(gt)
# Load play-by-play data for multiple seasons
pbp <- load_pbp(2021:2023)
# Display data dimensions
cat("Loaded", nrow(pbp), "plays from",
length(unique(pbp$season)), "seasons\n")
cat("Number of columns:", ncol(pbp), "\n")
cat("Seasons included:", paste(unique(pbp$season), collapse = ", "), "\n")
#| label: setup-py
#| message: false
#| warning: false
import pandas as pd
import numpy as np
import nfl_data_py as nfl
import matplotlib.pyplot as plt
import seaborn as sns
# Load play-by-play data for multiple seasons
pbp = nfl.import_pbp_data([2021, 2022, 2023])
# Display data dimensions
print(f"Loaded {len(pbp):,} plays from {pbp['season'].nunique()} seasons")
print(f"Number of columns: {len(pbp.columns)}")
print(f"Seasons included: {', '.join(map(str, sorted(pbp['season'].unique())))}")
Understanding Play-by-Play Structure
Play-by-play data contains over 370 columns of information about each play. Let's examine the key categories:
#| label: pbp-structure-r
#| message: false
# Display key variable categories
key_vars <- list(
Game_Info = c("game_id", "season", "week", "game_date", "season_type"),
Teams = c("posteam", "defteam", "home_team", "away_team"),
Situation = c("down", "ydstogo", "yardline_100", "quarter_seconds_remaining"),
Play_Info = c("play_type", "yards_gained", "touchdown", "fumble", "interception"),
Advanced_Metrics = c("epa", "wpa", "success", "cpoe", "air_yards"),
Players = c("passer_player_name", "rusher_player_name", "receiver_player_name")
)
# Show sample data for each category
pbp %>%
filter(season == 2023, week == 1, !is.na(epa)) %>%
  select(all_of(unlist(key_vars, use.names = FALSE))) %>%  # !!! on a named vector would rename columns
slice(1:3) %>%
glimpse()
#| label: pbp-structure-py
#| message: false
# Display key variable categories
key_vars = {
'Game_Info': ['game_id', 'season', 'week', 'game_date', 'season_type'],
'Teams': ['posteam', 'defteam', 'home_team', 'away_team'],
'Situation': ['down', 'ydstogo', 'yardline_100', 'quarter_seconds_remaining'],
'Play_Info': ['play_type', 'yards_gained', 'touchdown', 'fumble', 'interception'],
'Advanced_Metrics': ['epa', 'wpa', 'success', 'cpoe', 'air_yards'],
'Players': ['passer_player_name', 'rusher_player_name', 'receiver_player_name']
}
# Flatten the list of columns
all_cols = [col for cols in key_vars.values() for col in cols]
# Show sample data
sample_data = (pbp
.query("season == 2023 & week == 1 & epa.notna()")
[all_cols]
.head(3)
)
print("\nSample Play-by-Play Data:")
print(sample_data.to_string())
Key Variables Explained
Here are the most important variables you'll work with:
Identifiers and Context:
- `game_id`: Unique game identifier (format: `YYYY_WW_AWAY_HOME`)
- `play_id`: Unique play identifier within a game
- `old_game_id`: Legacy NFL game ID for cross-referencing with older data sources
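The `game_id` pattern can be split mechanically when you need its parts; a small sketch (with a hypothetical helper name, `parse_game_id`) that parses it into components:

```python
# Parse a game_id of the form YYYY_WW_AWAY_HOME (hypothetical helper)
def parse_game_id(game_id: str) -> dict:
    season, week, away, home = game_id.split("_")
    return {
        "season": int(season),
        "week": int(week),
        "away_team": away,
        "home_team": home,
    }

print(parse_game_id("2023_01_DET_KC"))
# {'season': 2023, 'week': 1, 'away_team': 'DET', 'home_team': 'KC'}
```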
Game Situation:
- `down`: Down (1-4)
- `ydstogo`: Yards needed for a first down
- `yardline_100`: Yards from opponent's end zone (0-100)
- `quarter_seconds_remaining`: Time left in current quarter
- `half_seconds_remaining`: Time left in current half
- `game_seconds_remaining`: Time left in game
Play Outcome:
- `play_type`: Type of play (pass, run, punt, field_goal, etc.)
- `yards_gained`: Net yards gained on the play
- `touchdown`: Binary indicator for touchdown
- `first_down`: Binary indicator for first down conversion
Advanced Metrics:
- `epa`: Expected Points Added
- `wpa`: Win Probability Added
- `success`: Binary success indicator (EPA > 0)
- `cpoe`: Completion Percentage Over Expected (passing plays)
- `air_yards`: Yards ball traveled in the air (passing plays)
- `yards_after_catch`: Yards gained after reception
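Because `success` is just an indicator for EPA > 0, a team's success rate is simply the mean of that indicator. A minimal pandas sketch with made-up EPA values:

```python
import pandas as pd

# Toy EPA values for five plays (illustrative, not real data)
plays = pd.DataFrame({"epa": [0.45, -0.62, 1.10, -0.08, 0.03]})

# success is the indicator EPA > 0; success rate is its mean
plays["success"] = (plays["epa"] > 0).astype(int)
success_rate = plays["success"].mean()
print(f"Success rate: {success_rate:.1%}")  # 3 of 5 plays had positive EPA
```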
Working with Play Types
Common play types include:
- `pass`: Passing attempt
- `run`: Running play
- `punt`: Punt
- `field_goal`: Field goal attempt
- `kickoff`: Kickoff
- `extra_point`: PAT attempt
- `qb_kneel`: QB kneel
- `qb_spike`: QB spike

For most analyses, filter to `play_type %in% c("pass", "run")` and `!is.na(epa)` to focus on standard offensive plays.
nflreadr: Efficient Data Loading
While nflfastR's load_pbp() function is convenient, nflreadr provides faster, more flexible data loading and access to additional datasets.
Available Datasets
#| label: nflreadr-datasets-r
#| eval: false
library(nflreadr)
# Play-by-play data (faster than load_pbp)
pbp <- load_pbp(2023)
# Roster data
rosters <- load_rosters(2023)
# Team information
teams <- load_teams()
# Schedule data
schedules <- load_schedules(2023)
# Player statistics
player_stats <- load_player_stats(2023)
# Next Gen Stats
ngs_passing <- load_nextgen_stats(2023, stat_type = "passing")
ngs_rushing <- load_nextgen_stats(2023, stat_type = "rushing")
ngs_receiving <- load_nextgen_stats(2023, stat_type = "receiving")
# Depth charts
depth_charts <- load_depth_charts(2023)
# Injuries
injuries <- load_injuries(2023)
# Combine results
combine <- load_combine()
# Draft picks
draft_picks <- load_draft_picks()
# PFF data (if available)
# snap_counts <- load_snap_counts(2023)
#| label: nflreadr-datasets-py
#| eval: false
import nfl_data_py as nfl
# Play-by-play data
pbp = nfl.import_pbp_data([2023])
# Roster data
rosters = nfl.import_rosters([2023])
# Team information
teams = nfl.import_team_desc()
# Schedule data
schedules = nfl.import_schedules([2023])
# Weekly player statistics
weekly_data = nfl.import_weekly_data([2023])
# Seasonal player statistics
seasonal_data = nfl.import_seasonal_data([2023])
# Next Gen Stats
ngs_passing = nfl.import_ngs_data('passing', [2023])
ngs_rushing = nfl.import_ngs_data('rushing', [2023])
ngs_receiving = nfl.import_ngs_data('receiving', [2023])
# Depth charts
depth_charts = nfl.import_depth_charts([2023])
# Injuries
injuries = nfl.import_injuries([2023])
# Combine results
combine = nfl.import_combine_data([2020, 2021, 2022, 2023])
# Draft picks
draft_picks = nfl.import_draft_picks([2020, 2021, 2022, 2023])
# Snap counts
snap_counts = nfl.import_snap_counts([2023])
Working with Roster Data
Roster data connects player IDs across different systems and provides biographical information:
#| label: roster-data-r
#| message: false
#| warning: false
#| cache: true
library(nflreadr)
# Load roster data
rosters_2023 <- load_rosters(2023)
# Display key roster information
rosters_2023 %>%
filter(team == "KC") %>%
select(full_name, position, jersey_number, height, weight,
college, years_exp, gsis_id) %>%
arrange(position, jersey_number) %>%
head(10) %>%
gt() %>%
tab_header(
title = "Kansas City Chiefs Roster Sample",
subtitle = "2023 Season"
) %>%
cols_label(
full_name = "Player",
position = "Pos",
jersey_number = "No.",
height = "Height",
weight = "Weight",
college = "College",
years_exp = "Exp",
gsis_id = "GSIS ID"
)
#| label: roster-data-py
#| message: false
#| warning: false
#| cache: true
# Load roster data
rosters_2023 = nfl.import_rosters([2023])
# Display key roster information
kc_roster = (rosters_2023
.query("team == 'KC'")
[['full_name', 'position', 'jersey_number', 'height', 'weight',
'college', 'years_exp', 'gsis_id']]
.sort_values(['position', 'jersey_number'])
.head(10)
)
print("\nKansas City Chiefs Roster Sample (2023):")
print(kc_roster.to_string(index=False))
Team Information and Colors
The teams dataset provides official team colors, logos, and organizational information:
#| label: team-info-r
#| message: false
#| warning: false
# Load team information
teams <- load_teams()
# Display team colors and identifiers
teams %>%
  select(team_abbr, team_name, team_color, team_color2,
         team_logo_espn, team_conf, team_division) %>%
  filter(team_division == "AFC West") %>%
  gt() %>%
  tab_header(
    title = "AFC West Team Information"
  ) %>%
  cols_label(
    team_abbr = "Team",
    team_name = "Name",
    team_color = "Primary Color",
    team_color2 = "Secondary Color",
    team_logo_espn = "Logo URL",
    team_conf = "Conf",
    team_division = "Div"
  )
#| label: team-info-py
#| message: false
#| warning: false
# Load team information
teams = nfl.import_team_desc()
# Display team colors and identifiers
afc_west = (teams
.query("team_conf == 'AFC' & team_division == 'AFC West'")
[['team_abbr', 'team_name', 'team_color', 'team_color2',
'team_conf', 'team_division']]
)
print("\nAFC West Team Information:")
print(afc_west.to_string(index=False))
nflplotR: Professional NFL Visualizations
nflplotR is a powerful package that makes it easy to create publication-quality NFL visualizations with team logos, colors, and consistent styling.
Basic Logo Plots
#| label: fig-logo-plot-r
#| fig-cap: "Average EPA per play by team (2023 season)"
#| fig-width: 12
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
# Calculate team offensive EPA
team_epa <- pbp %>%
filter(season == 2023, !is.na(epa), !is.na(posteam)) %>%
filter(play_type %in% c("pass", "run")) %>%
group_by(posteam) %>%
summarise(
plays = n(),
epa_per_play = mean(epa),
success_rate = mean(success),
.groups = "drop"
) %>%
arrange(desc(epa_per_play))
# Create logo plot
team_epa %>%
ggplot(aes(x = reorder(posteam, epa_per_play), y = epa_per_play)) +
geom_col(aes(fill = posteam, color = posteam), alpha = 0.7, width = 0.7) +
scale_fill_nfl(type = "primary") +
scale_color_nfl(type = "secondary") +
nflplotR::geom_nfl_logos(aes(team_abbr = posteam), width = 0.04, alpha = 0.9) +
coord_flip() +
labs(
title = "Offensive Efficiency by Team",
subtitle = "Average EPA per play, 2023 Regular Season",
x = NULL,
y = "EPA per Play",
caption = "Data: nflfastR | Includes pass and run plays only"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
axis.text.y = element_blank(),
legend.position = "none",
panel.grid.major.y = element_blank()
)
#| label: fig-logo-plot-py
#| fig-cap: "Average EPA per play by team - Python (2023 season)"
#| fig-width: 12
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
# Calculate team offensive EPA
team_epa = (pbp
.query("season == 2023 & epa.notna() & posteam.notna()")
.query("play_type.isin(['pass', 'run'])")
.groupby('posteam')
.agg(
plays=('epa', 'count'),
epa_per_play=('epa', 'mean'),
success_rate=('success', 'mean')
)
.reset_index()
.sort_values('epa_per_play', ascending=True)
)
# Create bar plot
fig, ax = plt.subplots(figsize=(12, 8))
# Get team colors
team_colors = dict(zip(teams['team_abbr'], teams['team_color']))
colors = [team_colors.get(team, '#333333') for team in team_epa['posteam']]
bars = ax.barh(range(len(team_epa)), team_epa['epa_per_play'], color=colors, alpha=0.7)
# Customize plot
ax.set_yticks(range(len(team_epa)))
ax.set_yticklabels(team_epa['posteam'])
ax.set_xlabel('EPA per Play', fontsize=12)
ax.set_title('Offensive Efficiency by Team\nAverage EPA per play, 2023 Regular Season',
fontsize=16, fontweight='bold', pad=20)
ax.text(0.99, 0.01, 'Data: nfl_data_py | Includes pass and run plays only',
transform=ax.transAxes, ha='right', va='bottom', fontsize=8, style='italic')
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5, alpha=0.3)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
Scatter Plots with Logos
One of nflplotR's most powerful features is the ability to replace points with team logos:
#| label: fig-scatter-logos-r
#| fig-cap: "Team offensive vs defensive EPA (2023 season)"
#| fig-width: 10
#| fig-height: 10
#| message: false
#| warning: false
#| cache: true
# Calculate offensive and defensive EPA
team_off_def <- pbp %>%
filter(season == 2023, !is.na(epa)) %>%
filter(play_type %in% c("pass", "run")) %>%
group_by(posteam) %>%
summarise(off_epa = mean(epa), .groups = "drop") %>%
left_join(
pbp %>%
filter(season == 2023, !is.na(epa)) %>%
filter(play_type %in% c("pass", "run")) %>%
group_by(defteam) %>%
summarise(def_epa = mean(epa), .groups = "drop"),
by = c("posteam" = "defteam")
)
# Create scatter plot with logos
team_off_def %>%
ggplot(aes(x = off_epa, y = -def_epa)) +
geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
geom_nfl_logos(aes(team_abbr = posteam), width = 0.06, alpha = 0.8) +
labs(
title = "Team Performance Matrix: Offense vs Defense",
subtitle = "EPA per play, 2023 Regular Season | Top right = Best teams",
x = "Offensive EPA per Play",
y = "Defensive EPA per Play (reversed, higher = better)",
caption = "Data: nflfastR"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11),
panel.grid.minor = element_blank()
)
#| label: fig-scatter-logos-py
#| fig-cap: "Team offensive vs defensive EPA - Python (2023 season)"
#| fig-width: 10
#| fig-height: 10
#| message: false
#| warning: false
#| cache: true
# Calculate offensive EPA
off_epa = (pbp
.query("season == 2023 & epa.notna() & play_type.isin(['pass', 'run'])")
.groupby('posteam')
.agg(off_epa=('epa', 'mean'))
.reset_index()
)
# Calculate defensive EPA
def_epa = (pbp
.query("season == 2023 & epa.notna() & play_type.isin(['pass', 'run'])")
.groupby('defteam')
.agg(def_epa=('epa', 'mean'))
.reset_index()
)
# Merge
team_off_def = off_epa.merge(def_epa, left_on='posteam', right_on='defteam')
# Create scatter plot
fig, ax = plt.subplots(figsize=(10, 10))
# Plot points with team colors
for _, row in team_off_def.iterrows():
team = row['posteam']
color = team_colors.get(team, '#333333')
ax.scatter(row['off_epa'], -row['def_epa'], s=300, color=color, alpha=0.6)
ax.text(row['off_epa'], -row['def_epa'], team,
ha='center', va='center', fontsize=8, fontweight='bold')
# Add reference lines
ax.axvline(x=0, linestyle='--', alpha=0.5, color='gray')
ax.axhline(y=0, linestyle='--', alpha=0.5, color='gray')
# Customize plot
ax.set_xlabel('Offensive EPA per Play', fontsize=12)
ax.set_ylabel('Defensive EPA per Play (reversed, higher = better)', fontsize=12)
ax.set_title('Team Performance Matrix: Offense vs Defense\nEPA per play, 2023 Regular Season | Top right = Best teams',
fontsize=14, fontweight='bold', pad=20)
ax.text(0.99, 0.01, 'Data: nfl_data_py',
transform=ax.transAxes, ha='right', va='bottom', fontsize=8, style='italic')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Time Series with Team Colors
#| label: fig-time-series-r
#| fig-cap: "Weekly EPA trends for AFC West teams (2023)"
#| fig-width: 12
#| fig-height: 6
#| message: false
#| warning: false
#| cache: true
# Calculate weekly EPA for AFC West teams
afc_west_teams <- c("KC", "LV", "LAC", "DEN")
weekly_epa <- pbp %>%
filter(season == 2023, season_type == "REG",
posteam %in% afc_west_teams,
!is.na(epa), play_type %in% c("pass", "run")) %>%
group_by(posteam, week) %>%
summarise(epa_per_play = mean(epa), .groups = "drop")
# Create line plot
weekly_epa %>%
ggplot(aes(x = week, y = epa_per_play, color = posteam)) +
geom_line(linewidth = 1.5, alpha = 0.8) +
geom_point(size = 3) +
scale_color_nfl(type = "primary") +
scale_x_continuous(breaks = 1:18) +
labs(
title = "AFC West Weekly Offensive Performance",
subtitle = "EPA per play by week, 2023 Regular Season",
x = "Week",
y = "EPA per Play",
color = "Team",
caption = "Data: nflfastR"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "top",
panel.grid.minor.x = element_blank()
)
#| label: fig-time-series-py
#| fig-cap: "Weekly EPA trends for AFC West teams - Python (2023)"
#| fig-width: 12
#| fig-height: 6
#| message: false
#| warning: false
#| cache: true
# Calculate weekly EPA for AFC West teams
afc_west_teams = ['KC', 'LV', 'LAC', 'DEN']
weekly_epa = (pbp
.query("season == 2023 & season_type == 'REG' & posteam in @afc_west_teams")
.query("epa.notna() & play_type.isin(['pass', 'run'])")
.groupby(['posteam', 'week'])
.agg(epa_per_play=('epa', 'mean'))
.reset_index()
)
# Create line plot
fig, ax = plt.subplots(figsize=(12, 6))
for team in afc_west_teams:
team_data = weekly_epa[weekly_epa['posteam'] == team]
color = team_colors.get(team, '#333333')
ax.plot(team_data['week'], team_data['epa_per_play'],
marker='o', linewidth=2.5, label=team, color=color, alpha=0.8)
ax.set_xlabel('Week', fontsize=12)
ax.set_ylabel('EPA per Play', fontsize=12)
ax.set_title('AFC West Weekly Offensive Performance\nEPA per play by week, 2023 Regular Season',
fontsize=14, fontweight='bold', pad=20)
ax.legend(title='Team', loc='best', frameon=True)
ax.grid(alpha=0.3)
ax.set_xticks(range(1, 19))
ax.text(0.99, 0.01, 'Data: nfl_data_py',
transform=ax.transAxes, ha='right', va='bottom', fontsize=8, style='italic')
plt.tight_layout()
plt.show()
nfl4th: Fourth Down Decision Analysis
The nfl4th package provides recommendations for 4th down decisions based on win probability models.
#| label: nfl4th-example-r
#| eval: false
#| echo: true
library(nfl4th)
# Get 4th down recommendation
# Example: 4th and 3 at opponent's 35, down by 3, 5 minutes left
recommendation <- add_4th_probs(
df = tibble(
# Game situation
game_seconds_remaining = 300,
half_seconds_remaining = 300,
# Field position and down
yardline_100 = 35,
ydstogo = 3,
# Score
score_differential = -3,
# Other context
home = 1,
posteam_timeouts_remaining = 3,
defteam_timeouts_remaining = 3
)
)
# View recommendation
recommendation %>%
  select(yardline_100, ydstogo, go_boost, first_down_prob,
         go_wp, fg_wp, punt_wp) %>%  # columns added by add_4th_probs()
glimpse()
#| label: nfl4th-example-py
#| eval: false
#| echo: true
# Python doesn't have a direct nfl4th equivalent
# But you can calculate win probability using EPA models
# Here's a conceptual example of how you might approach this
def calculate_4th_down_wp(
yardline_100, ydstogo, score_diff,
seconds_remaining, timeouts_remaining
):
"""
Simplified 4th down win probability calculator
In practice, you would use a trained model
"""
# This is a placeholder for demonstration
# Real implementation would use historical data
# to build a win probability model
# Factors to consider:
# - Distance to go
# - Field position
# - Score differential
# - Time remaining
# - Historical conversion rates
pass # Implementation would go here
# Example usage
situation = {
'yardline_100': 35,
'ydstogo': 3,
'score_diff': -3,
'seconds_remaining': 300,
'timeouts': 3
}
nflseedR: Playoff Simulation
nflseedR simulates playoff scenarios and calculates playoff probabilities.
#| label: nflseedr-example-r
#| eval: false
#| echo: true
library(nflseedR)
# Simulate playoff scenarios
# This runs thousands of simulations of remaining games
sims <- nflseedR::simulate_nfl(
nfl_season = 2023,
fresh_season = FALSE, # Use current standings
simulations = 10000
)
# Summarise playoff probabilities
# simulate_nfl() returns a list; its `overall` element already holds
# each team's simulated playoff, division, conference, and Super Bowl odds
playoff_probs <- sims$overall %>%
  arrange(desc(playoff))
# View top teams
playoff_probs %>%
  head(10) %>%
  gt() %>%
  fmt_percent(columns = c(playoff, div1, won_conf, won_sb), decimals = 1) %>%
  fmt_number(columns = wins, decimals = 1)
#| label: nflseedr-example-py
#| eval: false
#| echo: true
# Python implementation of playoff simulation
# This is a simplified conceptual example
import numpy as np
def simulate_remaining_games(schedule, team_strengths, n_sims=10000):
"""
Simulate remaining NFL games to calculate playoff probabilities
"""
results = []
for sim in range(n_sims):
# Simulate each remaining game
season_results = schedule.copy()
for idx, game in season_results.iterrows():
home_team = game['home_team']
away_team = game['away_team']
# Get team strengths (could use EPA, ELO, etc.)
home_strength = team_strengths.get(home_team, 0)
away_strength = team_strengths.get(away_team, 0)
# Calculate win probability (simplified)
home_wp = 1 / (1 + np.exp(-(home_strength - away_strength + 0.2)))
# Simulate outcome
home_wins = np.random.random() < home_wp
# Store result
# ... continue simulation logic
return results
# This would be expanded into full playoff simulation
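A compact, runnable version of that loop, using made-up strength ratings and a four-game toy schedule (all numbers are illustrative, not real estimates):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical strength ratings and remaining (home, away) games
team_strengths = {"KC": 0.15, "DEN": -0.05, "LAC": 0.02, "LV": -0.10}
schedule = [("KC", "DEN"), ("LAC", "LV"), ("DEN", "LAC"), ("LV", "KC")]

def simulate_wins(schedule, strengths, n_sims=10_000, hfa=0.2):
    """Monte Carlo estimate of expected wins from remaining games."""
    wins = {team: 0 for team in strengths}
    for _ in range(n_sims):
        for home, away in schedule:
            # Logistic win probability with a home-field bump (hfa)
            home_wp = 1 / (1 + np.exp(-(strengths[home] - strengths[away] + hfa)))
            winner = home if rng.random() < home_wp else away
            wins[winner] += 1
    return {team: w / n_sims for team, w in wins.items()}

expected_wins = simulate_wins(schedule, team_strengths)
print(expected_wins)
```

A full playoff model would also track division standings and tiebreakers per simulation, which is exactly the bookkeeping nflseedR handles in R.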
Combining Multiple Data Sources
The real power of the nflverse comes from combining multiple datasets. Let's build a comprehensive player analysis:
#| label: combine-data-r
#| message: false
#| warning: false
#| cache: true
# Load multiple data sources
rosters <- load_rosters(2023)
pbp_2023 <- load_pbp(2023)
# Analyze QB performance with roster information
qb_analysis <- pbp_2023 %>%
filter(!is.na(passer_player_id), play_type == "pass") %>%
group_by(passer_player_id, passer_player_name) %>%
summarise(
team = first(posteam),
dropbacks = n(),
epa_per_dropback = mean(epa, na.rm = TRUE),
cpoe = mean(cpoe, na.rm = TRUE),
success_rate = mean(success, na.rm = TRUE),
.groups = "drop"
) %>%
filter(dropbacks >= 100) %>%
# Join with roster data
left_join(
rosters %>% select(gsis_id, college, years_exp, height, weight),
by = c("passer_player_id" = "gsis_id")
) %>%
arrange(desc(epa_per_dropback))
# Display top QBs
qb_analysis %>%
select(passer_player_name, team, dropbacks, epa_per_dropback,
cpoe, success_rate, college, years_exp) %>%
head(10) %>%
gt() %>%
tab_header(
title = "Top Quarterback Performance",
subtitle = "Minimum 100 dropbacks, 2023 season"
) %>%
cols_label(
passer_player_name = "Player",
team = "Team",
dropbacks = "Dropbacks",
epa_per_dropback = "EPA/Play",
cpoe = "CPOE",
success_rate = "Success %",
college = "College",
years_exp = "Exp"
) %>%
fmt_number(
columns = c(epa_per_dropback, cpoe, success_rate),
decimals = 3
) %>%
fmt_number(
columns = dropbacks,
decimals = 0
)
#| label: combine-data-py
#| message: false
#| warning: false
#| cache: true
# Load multiple data sources
rosters = nfl.import_rosters([2023])
pbp_2023_py = nfl.import_pbp_data([2023])
# Analyze QB performance with roster information
qb_analysis = (pbp_2023_py
.query("passer_player_id.notna() & play_type == 'pass'")
.groupby(['passer_player_id', 'passer_player_name'])
.agg(
team=('posteam', 'first'),
dropbacks=('epa', 'count'),
        epa_per_dropback=('epa', 'mean'),
        cpoe=('cpoe', 'mean'),
        success_rate=('success', 'mean')
)
.reset_index()
.query("dropbacks >= 100")
)
# Join with roster data
qb_analysis = qb_analysis.merge(
rosters[['gsis_id', 'college', 'years_exp', 'height', 'weight']],
left_on='passer_player_id',
right_on='gsis_id',
how='left'
)
# Sort and display
qb_analysis = qb_analysis.sort_values('epa_per_dropback', ascending=False)
print("\nTop Quarterback Performance (Min 100 dropbacks, 2023):")
print(qb_analysis[['passer_player_name', 'team', 'dropbacks', 'epa_per_dropback',
'cpoe', 'success_rate', 'college', 'years_exp']].head(10).to_string(index=False))
Advanced Example: Team Composition Analysis
Let's analyze team roster composition and its relationship to performance:
#| label: roster-composition-r
#| message: false
#| warning: false
#| cache: true
# Analyze roster composition by team
roster_composition <- rosters %>%
filter(status == "ACT") %>%
group_by(team, position) %>%
summarise(
count = n(),
avg_exp = mean(years_exp, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = position,
values_from = count,
values_fill = 0
)
# Calculate team performance
team_performance <- pbp %>%
filter(season == 2023, !is.na(epa), play_type %in% c("pass", "run")) %>%
group_by(posteam) %>%
summarise(
epa_per_play = mean(epa),
.groups = "drop"
)
# Combine and analyze
roster_performance <- roster_composition %>%
left_join(team_performance, by = c("team" = "posteam"))
# Show correlation between roster composition and performance
print("Roster composition analysis complete")
#| label: roster-composition-py
#| message: false
#| warning: false
#| cache: true
# Analyze roster composition by team
roster_composition = (rosters
.query("status == 'ACT'")
.groupby(['team', 'position'])
.agg(
count=('gsis_id', 'count'),
avg_exp=('years_exp', 'mean')
)
.reset_index()
.pivot_table(
index='team',
columns='position',
values='count',
fill_value=0
)
.reset_index()
)
# Calculate team performance
team_performance = (pbp_2023_py
.query("epa.notna() & play_type.isin(['pass', 'run'])")
.groupby('posteam')
.agg(epa_per_play=('epa', 'mean'))
.reset_index()
)
# Combine and analyze
roster_performance = roster_composition.merge(
team_performance,
left_on='team',
right_on='posteam',
how='left'
)
print("\nRoster composition analysis complete")
print(f"Teams analyzed: {len(roster_performance)}")
Data Structure Deep Dive
Understanding the data structure is crucial for effective analysis. Let's examine the most important aspects:
Play-by-Play Data Schema
The play-by-play data follows a hierarchical structure:
Season
└── Game
└── Drive
└── Play
├── Pre-snap situation
├── Play execution
├── Post-play situation
└── Calculated metrics
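This nesting can be demonstrated with a toy frame: each row is a play carrying its game and drive identifiers, so drives and games are recovered by de-duplicating (IDs made up for illustration):

```python
import pandas as pd

# Toy play-by-play rows; IDs are made up for illustration
pbp_toy = pd.DataFrame({
    "game_id": ["2023_01_DET_KC"] * 4 + ["2023_01_BUF_NYJ"] * 2,
    "drive":   [1, 1, 2, 2, 1, 1],
    "play_id": [55, 76, 101, 122, 39, 60],
})

# Plays nest inside drives, and drives inside games, so drives must
# be counted per game (drive numbers restart each game)
n_games = pbp_toy["game_id"].nunique()
n_drives = len(pbp_toy[["game_id", "drive"]].drop_duplicates())
n_plays = len(pbp_toy)
print(n_games, n_drives, n_plays)  # 2 3 6
```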
Key Data Relationships
#| label: data-relationships-r
#| message: false
#| warning: false
# Examine data relationships
data_structure <- tibble(
Level = c("Season", "Game", "Drive", "Play"),
Key_Field = c("season", "game_id", "drive", "play_id"),
Count_2023 = c(
pbp %>% filter(season == 2023) %>% pull(season) %>% n_distinct(),
pbp %>% filter(season == 2023) %>% pull(game_id) %>% n_distinct(),
pbp %>% filter(season == 2023) %>% distinct(game_id, drive) %>% nrow(),  # drives restart each game, so count distinct game/drive pairs
pbp %>% filter(season == 2023) %>% nrow()
),
Example_Fields = c(
"season_type, week",
"home_team, away_team, game_date",
"drive_start_yard_line, drive_end_yard_line",
"down, ydstogo, play_type, epa"
)
)
data_structure %>%
gt() %>%
tab_header(
title = "Play-by-Play Data Structure",
subtitle = "Hierarchical organization of NFL data"
) %>%
fmt_number(
columns = Count_2023,
decimals = 0,
use_seps = TRUE
)
#| label: data-relationships-py
#| message: false
#| warning: false
# Examine data relationships
data_structure = pd.DataFrame({
'Level': ['Season', 'Game', 'Drive', 'Play'],
'Key_Field': ['season', 'game_id', 'drive', 'play_id'],
'Count_2023': [
pbp_2023_py['season'].nunique(),
pbp_2023_py['game_id'].nunique(),
pbp_2023_py[['game_id', 'drive']].drop_duplicates().shape[0] if 'drive' in pbp_2023_py.columns else 0,  # count distinct game/drive pairs
len(pbp_2023_py)
],
'Example_Fields': [
'season_type, week',
'home_team, away_team, game_date',
'drive_start_yard_line, drive_end_yard_line',
'down, ydstogo, play_type, epa'
]
})
print("\nPlay-by-Play Data Structure:")
print(data_structure.to_string(index=False))
Missing Data Patterns
Understanding which fields may have missing data is important:
#| label: missing-data-r
#| message: false
#| warning: false
# Analyze missing data patterns
missing_analysis <- pbp %>%
filter(season == 2023) %>%
summarise(
total_plays = n(),
missing_epa = sum(is.na(epa)),
missing_wpa = sum(is.na(wpa)),
missing_cpoe = sum(is.na(cpoe)),
missing_air_yards = sum(is.na(air_yards)),
missing_passer = sum(is.na(passer_player_id))
) %>%
pivot_longer(
cols = starts_with("missing_"),
names_to = "field",
values_to = "missing_count"
) %>%
mutate(
field = str_remove(field, "missing_"),
pct_missing = missing_count / total_plays * 100
)
missing_analysis %>%
gt() %>%
tab_header(
title = "Missing Data Analysis",
subtitle = "2023 season play-by-play data"
) %>%
fmt_number(
columns = c(missing_count, total_plays),
decimals = 0,
use_seps = TRUE
) %>%
fmt_number(
columns = pct_missing,
decimals = 2
)
#| label: missing-data-py
#| message: false
#| warning: false
# Analyze missing data patterns
fields_to_check = ['epa', 'wpa', 'cpoe', 'air_yards', 'passer_player_id']
total_plays = len(pbp_2023_py)
missing_data = []
for field in fields_to_check:
if field in pbp_2023_py.columns:
missing_count = pbp_2023_py[field].isna().sum()
missing_data.append({
'field': field,
'missing_count': missing_count,
'pct_missing': (missing_count / total_plays) * 100
})
missing_analysis = pd.DataFrame(missing_data)
print("\nMissing Data Analysis (2023 season):")
print(f"Total plays: {total_plays:,}")
print(missing_analysis.to_string(index=False))
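One reason this audit matters: pandas skips NaN silently when averaging, so a mean can be computed over far fewer plays than you expect. A toy illustration:

```python
import numpy as np
import pandas as pd

# Toy EPA column with missing values (as on some special-teams plays)
epa = pd.Series([0.5, np.nan, -0.3, np.nan, 0.1])

# mean() skips NaN silently, averaging only the 3 observed plays
print(epa.mean())
# Audit missingness before trusting an average
print(epa.isna().sum())   # 2 plays missing
print(epa.isna().mean())  # 0.4 -> 40% missing
```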
Common Missing Data Patterns
- **EPA/WPA**: Missing for special teams plays (kickoffs, punts, XP)
- **CPOE**: Only available for pass attempts
- **Air Yards**: Only available for pass attempts
- **Player IDs**: May be missing for some older seasons or certain play types
- **Success**: Defined differently for different play types

Always filter `!is.na(epa)` for offensive analyses to avoid issues.
Best Practices
1. Data Loading and Caching
#| label: best-practice-loading-r
#| eval: false
#| echo: true
# DO: Cache loaded data
pbp <- load_pbp(2020:2023)
saveRDS(pbp, "data/pbp_2020_2023.rds")
# Later, load from cache
pbp <- readRDS("data/pbp_2020_2023.rds")
# DON'T: Repeatedly download the same data
for (season in 2020:2023) {
pbp <- load_pbp(season) # Inefficient!
}
#| label: best-practice-loading-py
#| eval: false
#| echo: true
# DO: Cache loaded data
pbp = nfl.import_pbp_data(list(range(2020, 2024)))
pbp.to_parquet('data/pbp_2020_2023.parquet')
# Later, load from cache
pbp = pd.read_parquet('data/pbp_2020_2023.parquet')
# DON'T: Repeatedly download the same data
for season in range(2020, 2024):
pbp = nfl.import_pbp_data([season]) # Inefficient!
2. Filtering Best Practices
```{r}
#| label: best-practice-filtering-r
#| eval: false
#| echo: true
# DO: Filter comprehensively
analysis_data <- pbp %>%
  filter(
    !is.na(epa),                      # Remove plays without EPA
    play_type %in% c("pass", "run"),  # Only standard plays
    !is.na(posteam),                  # Remove missing team
    season_type == "REG"              # Only regular season
  )

# DON'T: Incomplete filtering
bad_data <- pbp %>%
  filter(play_type == "pass")  # Missing other important filters!
```
```{python}
#| label: best-practice-filtering-py
#| eval: false
#| echo: true
# DO: Filter comprehensively
analysis_data = (pbp
    .query("epa.notna()")                      # Remove plays without EPA
    .query("play_type.isin(['pass', 'run'])")  # Only standard plays
    .query("posteam.notna()")                  # Remove missing team
    .query("season_type == 'REG'")             # Only regular season
)

# DON'T: Incomplete filtering
bad_data = pbp.query("play_type == 'pass'")  # Missing other important filters!
```
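Since these four filters travel together, it can help to wrap them in one reusable function so every analysis starts from the same base. A minimal sketch — `offensive_plays` is a hypothetical helper name, and the column names follow the nflfastR play-by-play schema:

```python
import pandas as pd

def offensive_plays(pbp: pd.DataFrame) -> pd.DataFrame:
    """Apply the standard analysis filters in one place: non-missing EPA,
    pass/run plays only, a known offense, and regular season only."""
    return pbp[
        pbp['epa'].notna()
        & pbp['play_type'].isin(['pass', 'run'])
        & pbp['posteam'].notna()
        & (pbp['season_type'] == 'REG')
    ]
```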
### 3. Grouping and Aggregation
```{r}
#| label: best-practice-grouping-r
#| eval: false
#| echo: true
# DO: Use meaningful aggregations
team_stats <- pbp %>%
  filter(!is.na(epa), play_type %in% c("pass", "run")) %>%
  group_by(posteam, season) %>%
  summarise(
    plays = n(),
    epa_per_play = mean(epa),
    success_rate = mean(success),
    explosive_rate = mean(epa > 1.0),
    .groups = "drop"
  ) %>%
  filter(plays >= 100)  # Minimum sample size

# DON'T: Aggregate without sample size filters
bad_stats <- pbp %>%
  group_by(posteam) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))  # Includes all plays!
```
```{python}
#| label: best-practice-grouping-py
#| eval: false
#| echo: true
# DO: Use meaningful aggregations
team_stats = (pbp
    .query("epa.notna() & play_type.isin(['pass', 'run'])")
    .groupby(['posteam', 'season'])
    .agg(
        plays=('epa', 'count'),
        epa_per_play=('epa', 'mean'),
        success_rate=('success', 'mean'),
        explosive_rate=('epa', lambda x: (x > 1.0).mean())
    )
    .reset_index()
    .query("plays >= 100")  # Minimum sample size
)

# DON'T: Aggregate without sample size filters
bad_stats = (pbp
    .groupby('posteam')
    .agg(avg_epa=('epa', 'mean'))  # Includes all plays!
)
```
### 4. Player Identification
**Player ID Best Practices**

Always use player IDs (like `gsis_id`) rather than names when joining data:

- Names may have spelling variations
- Multiple players may share the same name
- Names may change (Jr., II, III variations)
- IDs are guaranteed unique

```r
# DO: Join on the stable player ID
left_join(pbp, rosters, by = c("passer_player_id" = "gsis_id"))

# DON'T: Join on player names
left_join(pbp, rosters, by = c("passer_player_name" = "full_name"))
```
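The duplicate-name hazard is easy to demonstrate with pandas `merge` on two made-up rows (synthetic data, not real rosters): a name join fans one play out to every player who shares the name, while an ID join stays one-to-one.

```python
import pandas as pd

plays = pd.DataFrame({
    'passer_player_id': ['00-001'],
    'passer_player_name': ['J.Allen'],
    'epa': [0.3],
})
rosters = pd.DataFrame({
    'gsis_id': ['00-001', '00-002'],
    'full_name': ['J.Allen', 'J.Allen'],  # two distinct players, same name
    'position': ['QB', 'LB'],
})

# Name join: ambiguous -- the single play matches both roster rows
by_name = plays.merge(rosters, left_on='passer_player_name',
                      right_on='full_name')

# ID join: unambiguous -- exactly one matching row
by_id = plays.merge(rosters, left_on='passer_player_id',
                    right_on='gsis_id')
```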
## Common Pitfalls
### 1. Not Filtering for Regular Season
```r
# WRONG: Includes preseason and playoffs
team_stats <- pbp %>%
  group_by(posteam, season) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))

# CORRECT: Only regular season
team_stats <- pbp %>%
  filter(season_type == "REG") %>%
  group_by(posteam, season) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))
```
### 2. Including Non-Offensive Plays

```r
# WRONG: Includes all plays
avg_epa <- mean(pbp$epa, na.rm = TRUE)

# CORRECT: Only offensive plays
avg_epa <- pbp %>%
  filter(play_type %in% c("pass", "run"), !is.na(epa)) %>%
  pull(epa) %>%
  mean()
```
### 3. Ignoring Sample Size

```r
# WRONG: No minimum sample size
qb_stats <- pbp %>%
  group_by(passer_player_name) %>%
  summarise(avg_epa = mean(epa, na.rm = TRUE))

# CORRECT: Minimum attempts threshold
qb_stats <- pbp %>%
  filter(play_type == "pass", !is.na(epa)) %>%
  group_by(passer_player_name) %>%
  summarise(
    attempts = n(),
    avg_epa = mean(epa)
  ) %>%
  filter(attempts >= 100)  # Meaningful sample
```
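Why the threshold matters is easy to see with simulated data (synthetic dropbacks, not real players): a passer with a handful of lucky attempts will "lead the league" until a minimum-attempts filter removes him.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# One QB with 200 synthetic dropbacks, one with just 3 hot ones
pbp = pd.DataFrame({
    'passer_player_name': ['Starter'] * 200 + ['Backup'] * 3,
    'epa': np.concatenate([rng.normal(0.05, 1.0, 200), [2.1, 1.8, 2.4]]),
})

qb = (pbp.groupby('passer_player_name')
         .agg(attempts=('epa', 'size'), avg_epa=('epa', 'mean'))
         .reset_index())

# Without a threshold the 3-attempt backup tops the table;
# with one, only the meaningful sample survives
qb_filtered = qb[qb['attempts'] >= 100]
```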
### 4. Roster and Play-by-Play Misalignment

Be careful when joining roster data with play-by-play data from different time periods:

```r
# WRONG: Using 2023 rosters with 2021 data
pbp_2021 %>%
  left_join(load_rosters(2023), by = c("passer_player_id" = "gsis_id"))

# CORRECT: Match seasons
pbp_2021 %>%
  left_join(load_rosters(2021), by = c("passer_player_id" = "gsis_id"))
```
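When rosters for several seasons live in one frame, the equivalent safeguard is to join on the season key as well as the player ID, so a player who changed teams is attributed correctly. A pandas sketch on made-up rows:

```python
import pandas as pd

# A player who changed teams between seasons (synthetic rows)
rosters = pd.DataFrame({
    'gsis_id': ['00-001', '00-001'],
    'season':  [2021, 2023],
    'team':    ['DEN', 'NYJ'],
})
plays = pd.DataFrame({
    'passer_player_id': ['00-001'],
    'season': [2021],
    'epa': [0.2],
})

# Joining on both ID and season attributes the play to the 2021 team
joined = plays.merge(
    rosters,
    left_on=['passer_player_id', 'season'],
    right_on=['gsis_id', 'season'],
)
```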
## Performance Optimization Tips
### 1. Use nflreadr Instead of nflfastR for Data Loading
```r
# Slower
pbp <- nflfastR::load_pbp(2023)

# Faster
pbp <- nflreadr::load_pbp(2023)
```
### 2. Select Only the Columns You Need

```r
# Slower: Load all 370+ columns
pbp <- load_pbp(2023)

# Faster: Select specific columns
pbp <- load_pbp(2023) %>%
  select(game_id, posteam, defteam, down, ydstogo,
         play_type, yards_gained, epa, wpa)
```
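The memory savings from dropping unused columns are easy to measure. A sketch with a synthetic wide frame standing in for the 370+ column play-by-play table (`memory_usage(deep=True)` reports per-column bytes):

```python
import pandas as pd
import numpy as np

# Synthetic wide frame: 10,000 rows x 50 numeric columns
wide = pd.DataFrame(
    np.random.default_rng(1).normal(size=(10_000, 50)),
    columns=[f'col_{i}' for i in range(50)],
)

# Keep only the columns the analysis actually needs
needed = ['col_0', 'col_1', 'col_2']
narrow = wide[needed]

full_mb = wide.memory_usage(deep=True).sum() / 1e6
slim_mb = narrow.memory_usage(deep=True).sum() / 1e6
print(f"{full_mb:.1f} MB vs {slim_mb:.1f} MB")
```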
### 3. Cache Expensive Computations

```r
# DO: Cache expensive computations
if (!file.exists("cache/team_epa.rds")) {
  team_epa <- pbp %>%
    group_by(posteam, season) %>%
    summarise(epa_per_play = mean(epa), .groups = "drop")
  saveRDS(team_epa, "cache/team_epa.rds")
} else {
  team_epa <- readRDS("cache/team_epa.rds")
}
```