Learning Objectives
By the end of this chapter, you will be able to:
- Master tidyverse (R) and pandas (Python) for football data manipulation
- Filter, select, and transform play-by-play and roster data effectively
- Handle missing data and outliers in football datasets
- Create derived variables and features for advanced analytics
- Aggregate and summarize data at player, team, and game levels
- Reshape data between wide and long formats
- Join and merge multiple football datasets
- Clean and standardize team names, player names, and temporal data
Introduction
Data wrangling—also called data munging or data preparation—is the process of transforming raw data into a clean, structured format suitable for analysis. In football analytics, this step is crucial because raw play-by-play data, roster information, and schedule data rarely arrive in the exact format you need for analysis. Understanding how to efficiently reshape, filter, combine, and clean data is what separates beginners from practitioners who can tackle real-world analytical challenges.
Think of data wrangling as the foundation of a building. No matter how sophisticated your statistical models or elegant your visualizations, if your data foundation is unstable—filled with errors, missing values, or improperly structured—everything built on top will be compromised. A quarterback evaluation model that includes penalty plays, a team efficiency analysis that doesn't account for garbage time, or a player comparison that fails to handle missing data will produce misleading results that lead to poor decisions.
The importance of data wrangling in football analytics cannot be overstated for several reasons:
Play-by-play data complexity: The nflfastR dataset includes over 370 variables per play. Most analyses require only 10-20 of these variables, but identifying which variables to use, understanding their meanings, and selecting them efficiently is essential. Many variables are highly correlated (for example, yards_gained and epa), some are situation-specific (air_yards exists only for passes), and others contain nuanced information that requires domain knowledge to interpret correctly.
Missing values are pervasive: EPA is missing on penalties and two-point conversions. Air yards and completion probability don't exist for run plays. Player IDs are missing when players aren't involved in a play. Understanding why data is missing and how to handle it appropriately is crucial. Blindly deleting rows with any missing values would eliminate most of your dataset, while improperly filling missing values can introduce bias and lead to incorrect conclusions.
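The difference between blind and targeted deletion is easy to demonstrate. The following Python sketch uses a tiny hand-made DataFrame (illustrative values, not real nflfastR output) to show that dropna() across all columns discards a perfectly valid run play, while restricting the check to the one column your analysis actually needs keeps it:

```python
import pandas as pd
import numpy as np

# Toy play-by-play fragment: EPA is missing on the no_play (penalty),
# air_yards is structurally missing on the run (values are made up)
plays = pd.DataFrame({
    "play_type": ["pass", "run", "no_play", "pass"],
    "epa":       [0.45, -0.20, np.nan, 1.10],
    "air_yards": [8.0, np.nan, np.nan, 22.0],
})

# Dropping rows with ANY missing value also discards the valid run play
naive = plays.dropna()

# Dropping only where the needed column is missing keeps the run play
targeted = plays.dropna(subset=["epa"])

print(len(naive), len(targeted))  # 2 3
```

The subset argument is the key: it scopes the missingness check to the variables your question depends on instead of the whole row.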
Outliers require domain knowledge: A play with -15 EPA might be a catastrophic turnover in great field position, or it might be a garbage-time play that shouldn't influence your model of competitive performance. A 99-yard touchdown pass is a legitimate outlier that tells you something important about a player's ability, while a Hail Mary completion that gained 60 yards but still resulted in negative EPA requires context to interpret correctly. You can't mechanically apply statistical outlier detection rules without understanding football context.
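One context-aware alternative to mechanical outlier rules is to restrict analysis to competitive situations. The sketch below, using made-up numbers, filters on a pre-snap win probability column like nflfastR's wp; the 5%-95% cutoffs are a common convention among analysts, not a fixed rule:

```python
import pandas as pd

# Toy plays with win probability (wp) and EPA — all values invented
plays = pd.DataFrame({
    "wp":  [0.50, 0.97, 0.03, 0.62, 0.99],
    "epa": [-0.3, -15.0, 2.0, 0.8, -4.0],
})

# Keep only competitive snaps: win probability between 5% and 95%
competitive = plays[(plays["wp"] >= 0.05) & (plays["wp"] <= 0.95)]

print(len(competitive))  # 2 rows survive
print(round(plays["epa"].mean(), 2), round(competitive["epa"].mean(), 2))
```

Here the garbage-time plays drag the naive mean far below the competitive-only mean, which is exactly the distortion the paragraph above warns about.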
Multiple datasets must be combined: Answering interesting questions requires joining play-by-play data with rosters (to get player demographics), with schedules (to get game context), with Next Gen Stats (to get tracking data), and potentially with external data sources (weather, Vegas lines, injury reports). Each join introduces opportunities for errors if not handled carefully—mismatched team abbreviations, missing keys, and duplicate records can all corrupt your analysis.
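pandas can audit a join for exactly these problems. This toy example (hand-made frames, not real nflverse data) uses merge() with indicator=True to flag plays whose team/player keys found no roster match:

```python
import pandas as pd

# Toy data: one team abbreviation ("KAN") won't match the roster's "KC"
plays = pd.DataFrame({
    "posteam": ["KC", "KAN", "BUF"],
    "passer":  ["P.Mahomes", "P.Mahomes", "J.Allen"],
})
rosters = pd.DataFrame({
    "team":    ["KC", "BUF"],
    "player":  ["P.Mahomes", "J.Allen"],
    "college": ["Texas Tech", "Wyoming"],
})

# indicator=True adds a _merge column recording which side each row came from
joined = plays.merge(
    rosters, how="left",
    left_on=["posteam", "passer"], right_on=["team", "player"],
    indicator=True,
)
unmatched = joined[joined["_merge"] == "left_only"]
print(len(joined), len(unmatched))  # 3 rows out, 1 failed to match
```

Checking the _merge counts after every join is a cheap habit that catches mismatched abbreviations and missing keys before they silently corrupt downstream results.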
Inconsistent formats require standardization: Team names appear as "KC" in some datasets and "KAN" in others. Player names might be "P.Mahomes", "Patrick Mahomes", or "MAHOMES, PATRICK". Dates might be strings ("2023-09-07") or date objects. Standardizing these inconsistencies is tedious but essential for correct analysis.
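A small crosswalk dictionary plus a name-normalizing helper handles most of this. The mappings and the normalize_name function below are illustrative sketches, not part of any nflverse package:

```python
import pandas as pd

# Hypothetical crosswalk from alternate abbreviations to one standard code
TEAM_FIX = {"KAN": "KC", "NWE": "NE", "GNB": "GB"}

teams = pd.Series(["KC", "KAN", "GNB", "BUF"])
# Replace only the known aliases; everything else passes through unchanged
standardized = teams.replace(TEAM_FIX)
print(standardized.tolist())  # ['KC', 'KC', 'GB', 'BUF']

def normalize_name(name: str) -> str:
    """Convert 'MAHOMES, PATRICK' style names to 'Patrick Mahomes'."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name.title()

print(normalize_name("MAHOMES, PATRICK"))  # Patrick Mahomes
```

Dates get the same treatment: pd.to_datetime("2023-09-07") converts string dates into proper date objects so they sort and subtract correctly.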
Studies suggest that data scientists spend 60-80% of their time on data wrangling, with only 20-40% on modeling and visualization. This chapter equips you with the essential techniques to efficiently prepare football data for analysis. Mastering these skills will dramatically improve your productivity, reduce errors, and enable you to tackle more sophisticated analytical questions.
The 80/20 Rule of Data Analysis
In football analytics, 80% of your time will be spent wrangling data, and 20% on modeling and visualization. While this might seem backwards—shouldn't analysis be the main activity?—it reflects reality. Clean, well-structured data makes analysis straightforward. Messy, poorly understood data makes even simple analyses error-prone and time-consuming. The good news: time invested in learning data wrangling pays dividends. Techniques you'll learn in this chapter—filtering, grouping, joining, handling missing data—apply to virtually every analysis you'll ever conduct. Master these fundamentals, and more complex analyses become manageable.
Tools of the Trade
We'll use two primary ecosystems for data wrangling, each with distinct strengths and philosophies:
R (tidyverse): The tidyverse is a collection of R packages designed for data science with a consistent, human-readable syntax. Key packages include:
- dplyr: Data manipulation (filter, select, mutate, summarize) using verb-based functions that express intent clearly
- tidyr: Data reshaping (pivot_longer, pivot_wider) for converting between wide and long formats
- stringr: String manipulation with consistent function names and arguments
- lubridate: Date/time handling that simplifies common temporal operations
The tidyverse philosophy emphasizes readability and consistency. Functions have verb names (filter, select, arrange) that describe what they do. The pipe operator (%>% or |>) chains operations together in a readable left-to-right sequence. This makes tidyverse code relatively easy to read and understand even months after writing it.
Python (pandas): pandas is the foundational data manipulation library in Python, providing:
- DataFrames for tabular data with powerful indexing and querying capabilities
- Rich methods for filtering, grouping, and transforming with both method chaining and direct operations
- Integration with NumPy for numerical operations, enabling efficient vectorized computations
- Flexibility in syntax—multiple ways to accomplish the same task, allowing you to choose the most appropriate for your situation
pandas draws inspiration from R but adds Python's general-purpose programming capabilities. This makes it ideal for analyses that combine data manipulation with other tasks (web scraping, machine learning, automation). While pandas syntax can be less consistent than tidyverse (multiple ways to filter, select, or group), it's extremely powerful and widely used in industry.
Which Should You Choose? Both ecosystems are excellent. R/tidyverse excels at interactive data analysis and visualization, particularly for statistical work. Python/pandas excels at data pipelines, integration with machine learning libraries, and production systems. Many professional analysts use both, choosing based on the specific task. This textbook provides examples in both languages, allowing you to learn either or both.
Learning Both Languages: A Strategic Advantage
While you can accomplish nearly any football analytics task in either R or Python, learning both provides significant advantages:
1. **Career Flexibility**: Some organizations prefer R (common in research and statistics-heavy teams), others prefer Python (common in engineering-heavy teams). Knowing both makes you valuable to more organizations.
2. **Tool Selection**: Some tasks are easier in R (statistical modeling, publication-quality visualizations), others easier in Python (machine learning, data pipelines, web applications). Choose the right tool for each task.
3. **Cross-Pollination**: Concepts learned in one language deepen understanding of the other. Understanding tidyverse's dplyr helps you appreciate pandas, and vice versa.
4. **Reading Others' Code**: Much football analytics work is shared publicly in either R or Python. Being bilingual lets you learn from all sources, not just those in your preferred language.
Start with whichever language you're more comfortable with, but consider learning the other as your skills develop. The conceptual understanding—what operations you need to perform—matters more than the specific syntax.
Loading Required Libraries
Before we can wrangle data, we need to load the libraries that provide data manipulation functionality. This is a necessary first step in every R or Python session where you'll be working with data.
#| label: setup-r
#| message: false
#| warning: false
# Load core tidyverse packages
# The tidyverse is actually a meta-package that loads multiple packages:
# - dplyr (data manipulation)
# - ggplot2 (visualization)
# - tidyr (data reshaping)
# - readr (data import)
# - purrr (functional programming)
# - tibble (modern data frames)
# - stringr (string manipulation)
# - forcats (factor handling)
library(tidyverse) # Data manipulation and visualization
# Load NFL data packages
# nflfastR provides play-by-play data with advanced metrics pre-calculated
# This includes EPA, WPA, success rate, and 370+ other variables
library(nflfastR) # NFL play-by-play data
# Load additional packages for specific tasks
# lubridate simplifies date/time operations (parsing, arithmetic, extraction)
library(lubridate) # Date/time handling
# gt creates publication-quality tables with extensive formatting options
library(gt) # Professional table formatting
# nflplotR adds NFL team logos, colors, and styling to ggplot2 visualizations
library(nflplotR) # NFL-specific plotting tools
# Set default ggplot2 theme for consistent appearance
# theme_minimal() provides clean, uncluttered plots suitable for presentations
theme_set(theme_minimal())
# Display confirmation
cat("✓ All R packages loaded successfully\n")
cat("✓ Ready to wrangle football data!\n")
#| label: setup-py
#| message: false
#| warning: false
# Import core data manipulation packages
# pandas is the primary library for working with tabular data (DataFrames)
# By convention, we import pandas as 'pd' - this is universal in Python data science
import pandas as pd
# numpy provides numerical computing capabilities (arrays, math functions, etc.)
# By convention, we import numpy as 'np'
import numpy as np
# Import NFL data package
# nfl_data_py is the Python equivalent of nflfastR
# By convention, we import it as 'nfl'
import nfl_data_py as nfl
# Import visualization packages
# matplotlib.pyplot is the core plotting library (like base R graphics)
import matplotlib.pyplot as plt
# seaborn provides statistical visualization built on matplotlib
# It creates more attractive plots with less code
import seaborn as sns
# Import datetime for temporal operations
from datetime import datetime, timedelta
# Configure pandas display options for better readability
# These settings control how DataFrames are displayed in notebooks/console
pd.set_option('display.max_columns', 20) # Show up to 20 columns before truncating
pd.set_option('display.width', 120) # Wider display width
pd.set_option('display.precision', 3) # Show 3 decimal places for floats
# Display confirmation
print("✓ All Python packages loaded successfully")
print("✓ Ready to wrangle football data!")
Loading Sample Data
Throughout this chapter, we'll work with 2023 NFL play-by-play data as our primary dataset. This provides a rich, real-world dataset for demonstrating data wrangling techniques. We'll also load roster and schedule data to demonstrate joining multiple datasets.
Understanding what data you're working with—its structure, size, and contents—is the first step in any data wrangling task. Let's load data and examine what we have.
#| label: load-sample-data-r
#| message: false
#| warning: false
#| cache: true
# Load 2023 play-by-play data
# load_pbp() is the primary nflfastR function for accessing play-by-play data
# It accepts a season year or vector of years, e.g., load_pbp(2020:2023)
# The function automatically caches data locally, so subsequent loads are fast
pbp <- load_pbp(2023)
# Load roster data for the 2023 season
# Rosters include player names, positions, height, weight, college, draft info
# This data is essential for player-level analyses and demographics
rosters <- load_rosters(2023)
# Load schedule data
# Schedules include game dates, locations, outcomes, and other game metadata
# This is crucial for aggregating play-level data to game level
schedules <- load_schedules(2023)
# Display summary information about loaded datasets
cat(strrep("=", 50), "\n")
cat("LOADED DATA SUMMARY\n")
cat(strrep("=", 50), "\n\n")
cat("Play-by-Play Data:\n")
cat(sprintf(" - %s plays (rows)\n", format(nrow(pbp), big.mark = ",")))
cat(sprintf(" - %s variables (columns)\n", ncol(pbp)))
cat(sprintf(" - Memory size: %.1f MB\n",
as.numeric(object.size(pbp)) / 1024^2))
cat(sprintf(" - Weeks: %d through %d\n",
min(pbp$week, na.rm = TRUE), max(pbp$week, na.rm = TRUE)))
cat("\nRoster Data:\n")
cat(sprintf(" - %s player-season records (rows)\n",
format(nrow(rosters), big.mark = ",")))
cat(sprintf(" - %s variables (columns)\n", ncol(rosters)))
cat("\nSchedule Data:\n")
cat(sprintf(" - %s games (rows)\n", nrow(schedules)))
cat(sprintf(" - %s variables (columns)\n", ncol(schedules)))
# Display sample of key variables from play-by-play data
cat("\n", strrep("=", 50), "\n", sep = "")
cat("SAMPLE PLAY-BY-PLAY DATA (First 3 Plays)\n")
cat(strrep("=", 50), "\n")
pbp %>%
select(game_id, posteam, defteam, down, ydstogo, yardline_100,
desc, yards_gained, epa) %>%
head(3) %>%
print(width = 120)
#| label: load-sample-data-py
#| message: false
#| warning: false
#| cache: true
# Load 2023 play-by-play data
# import_pbp_data() is the primary nfl_data_py function for play-by-play data
# It accepts a list of season years, e.g., [2020, 2021, 2022, 2023]
# Like nflfastR, it caches data locally for fast subsequent loads
pbp = nfl.import_pbp_data([2023])
# Load roster data for 2023 season
# Rosters include player demographics, positions, and background information
rosters = nfl.import_rosters([2023])
# Load schedule data
# Schedules provide game-level information for joining and aggregation
schedules = nfl.import_schedules([2023])
# Display summary information about loaded datasets
print("=" * 70)
print("LOADED DATA SUMMARY")
print("=" * 70)
print("\nPlay-by-Play Data:")
print(f" - {len(pbp):,} plays (rows)")
print(f" - {len(pbp.columns)} variables (columns)")
print(f" - Memory size: {pbp.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f" - Weeks: {pbp['week'].min()} through {pbp['week'].max()}")
print("\nRoster Data:")
print(f" - {len(rosters):,} player-season records (rows)")
print(f" - {len(rosters.columns)} variables (columns)")
print("\nSchedule Data:")
print(f" - {len(schedules)} games (rows)")
print(f" - {len(schedules.columns)} variables (columns)")
# Display sample of key variables from play-by-play data
print("\n" + "=" * 70)
print("SAMPLE PLAY-BY-PLAY DATA (First 3 Plays)")
print("=" * 70)
sample_cols = ['game_id', 'posteam', 'defteam', 'down', 'ydstogo',
'yardline_100', 'desc', 'yards_gained', 'epa']
print(pbp[sample_cols].head(3).to_string(index=False))
Understanding Data Dimensions
The dimensions of your dataset—number of rows and columns—provide crucial context for all subsequent operations:
**Rows (Observations)**: Each row in play-by-play data represents one play. With ~43,000 plays across 272 games, this averages to ~158 plays per game, which matches typical NFL totals (about 130-140 offensive plays plus special teams and penalties).
**Columns (Variables)**: The 370+ variables might seem overwhelming, but you'll typically use only 10-20 for any specific analysis. Think of the comprehensive variable set as insurance—whatever question you want to answer, the data probably includes the variables you need.
**Working with Dimensions**:
- When filtering rows, track how many remain to ensure your filters work as intended
- When selecting columns, choose only what you need for clarity and performance
- When creating new variables, keep track of what you're adding and why
- When joining datasets, understand how dimensions change (rows might increase if it's a one-to-many join)
Always print dimension summaries after major operations to verify your code is working as expected. Unexpected dimension changes often indicate errors in your data wrangling logic.
Selecting and Filtering Data: Focusing Your Analysis
The first step in most analyses is narrowing your dataset to the relevant subset. With 370+ columns and 43,000+ rows, you need to select only the columns you need and filter to only the rows relevant to your question. This section covers the essential operations for focusing your analysis.
Selecting and filtering serve different purposes:
- Selecting columns reduces clutter, improves performance, and helps you focus on relevant variables
- Filtering rows removes irrelevant observations, allowing you to analyze specific subsets (e.g., only pass plays, only specific teams, only competitive situations)
Mastering these operations is foundational—you'll use them in virtually every analysis you ever conduct.
Selecting Columns: Working with What You Need
Play-by-play data includes 370+ variables, but most analyses use only 10-20. Selecting just the columns you need offers several advantages:
Clarity: With fewer columns, your data is easier to inspect and understand. When you print a dataset with 10 columns, you can see all of them. With 370 columns, most are hidden, making it hard to verify your data looks correct.
Performance: Many operations (printing, filtering, grouping) are faster with fewer columns because there's less data to process. While the performance difference is negligible for single-season data, it becomes significant when working with multiple seasons or very large datasets.
Reduced Errors: With fewer columns, you're less likely to accidentally use the wrong variable. If you have both passer_player_name and receiver_player_name in your dataset, you might accidentally use the wrong one. If you've selected only the variables you need, such mistakes become impossible.
Documentation: Your column selection serves as documentation of what variables your analysis uses. Someone reading your code (including future you) can look at the select() call and immediately see what data you're working with.
Let's look at various approaches to selecting columns:
#| label: select-columns-r
# =====================================
# Basic Column Selection
# =====================================
# Select specific columns by name
# This is the most explicit and self-documenting approach
# Use this when you know exactly which columns you need
pbp_basic <- pbp %>%
select(
# Identifiers
game_id, play_id, game_date,
# Teams
posteam, defteam,
# Situation
down, ydstogo, yardline_100,
# Outcome
play_type, yards_gained,
# Advanced metrics
epa, wpa, success
)
cat("Basic selection:", ncol(pbp_basic), "columns\n")
# =====================================
# Pattern-Based Selection
# =====================================
# Select columns matching patterns
# This is useful when you want all columns related to a specific concept
# The tidyverse provides helper functions for pattern matching:
# - starts_with(): columns that start with a prefix
# - ends_with(): columns that end with a suffix
# - contains(): columns that contain a string
# - matches(): columns matching a regular expression
pbp_pass <- pbp %>%
select(
game_id, play_id, # Always include identifiers
starts_with("pass"), # All passing-related columns
ends_with("_epa"), # All EPA variants
contains("air_yards") # All air yards columns
)
cat("Pass-focused selection:", ncol(pbp_pass), "columns\n")
# =====================================
# Selection by Range
# =====================================
# Select consecutive columns
# Use the colon operator to select all columns between two endpoints
# This is less common because column order might not be meaningful
pbp_range <- pbp %>%
select(game_id:play_id, down:ydstogo, epa:wpa)
cat("Range selection:", ncol(pbp_range), "columns\n")
# =====================================
# Negative Selection (Dropping Columns)
# =====================================
# Sometimes it's easier to specify what to remove rather than what to keep
# Use the minus sign (-) to drop columns
pbp_no_ids <- pbp %>%
select(
-old_game_id, # Drop individual columns
-ends_with("_player_id"), # Drop pattern-matched columns
-starts_with("td_") # Drop all touchdown probability columns
)
cat("After dropping columns:", ncol(pbp_no_ids), "columns\n")
# =====================================
# Selection with Renaming
# =====================================
# Select and rename simultaneously for clearer variable names
pbp_renamed <- pbp %>%
select(
game = game_id,
offense = posteam,
defense = defteam,
down,
distance = ydstogo,
field_position = yardline_100,
play_description = desc,
yards = yards_gained,
expected_points_added = epa
)
cat("Selection with renaming:", ncol(pbp_renamed), "columns\n")
# Display column names from each selection approach
cat("\n========================================\n")
cat("COLUMN NAMES FROM BASIC SELECTION:\n")
cat("========================================\n")
cat(paste(names(pbp_basic), collapse = ", "), "\n")
cat("\n========================================\n")
cat("SAMPLE OF PASS-FOCUSED COLUMNS:\n")
cat("========================================\n")
cat(paste(head(names(pbp_pass), 15), collapse = ", "), "\n")
cat("... and", ncol(pbp_pass) - 15, "more columns\n")
#| label: select-columns-py
# =====================================
# Basic Column Selection
# =====================================
# Select specific columns by name using list indexing
# This is the most explicit approach in pandas
# Pass a list of column names inside square brackets
pbp_basic = pbp[[
# Identifiers
'game_id', 'play_id', 'game_date',
# Teams
'posteam', 'defteam',
# Situation
'down', 'ydstogo', 'yardline_100',
# Outcome
'play_type', 'yards_gained',
# Advanced metrics
'epa', 'wpa', 'success'
]]
print(f"Basic selection: {len(pbp_basic.columns)} columns")
# =====================================
# Pattern-Based Selection
# =====================================
# Select columns matching patterns using list comprehension
# Python doesn't have built-in pattern matching like R's dplyr,
# but list comprehensions provide flexible pattern matching
# Columns starting with 'pass'
pass_cols = [col for col in pbp.columns if col.startswith('pass')]
# Columns ending with '_epa'
epa_cols = [col for col in pbp.columns if col.endswith('_epa')]
# Columns containing 'air_yards'
air_cols = [col for col in pbp.columns if 'air_yards' in col]
# Combine all selected columns with identifiers
pbp_pass = pbp[['game_id', 'play_id'] + pass_cols + epa_cols + air_cols]
print(f"Pass-focused selection: {len(pbp_pass.columns)} columns")
# =====================================
# Selection by Position (less common)
# =====================================
# Select columns by position using iloc
# Rarely used because position-based selection is brittle
# (column order might change, breaking your code)
pbp_range = pbp.iloc[:, 0:10] # First 10 columns
print(f"Position-based selection: {len(pbp_range.columns)} columns")
# =====================================
# Negative Selection (Dropping Columns)
# =====================================
# Drop columns using the drop() method
# axis=1 indicates we're dropping columns (not rows)
# Use this when it's easier to specify what to remove
# Drop specific columns
pbp_no_ids = pbp.drop(columns=['old_game_id'])
# Drop columns matching patterns
player_id_cols = [col for col in pbp.columns if col.endswith('_player_id')]
td_cols = [col for col in pbp.columns if col.startswith('td_')]
pbp_no_ids = pbp_no_ids.drop(columns=player_id_cols + td_cols)
print(f"After dropping columns: {len(pbp_no_ids.columns)} columns")
# =====================================
# Selection with Renaming
# =====================================
# Select and rename using rename() method
# First select columns, then rename them
pbp_renamed = pbp[[
'game_id', 'posteam', 'defteam', 'down', 'ydstogo',
'yardline_100', 'desc', 'yards_gained', 'epa'
]].rename(columns={
'game_id': 'game',
'posteam': 'offense',
'defteam': 'defense',
'ydstogo': 'distance',
'yardline_100': 'field_position',
'desc': 'play_description',
'yards_gained': 'yards',
'epa': 'expected_points_added'
})
print(f"Selection with renaming: {len(pbp_renamed.columns)} columns")
# Display column names from each selection approach
print("\n" + "=" * 70)
print("COLUMN NAMES FROM BASIC SELECTION:")
print("=" * 70)
print(", ".join(pbp_basic.columns))
print("\n" + "=" * 70)
print("SAMPLE OF PASS-FOCUSED COLUMNS:")
print("=" * 70)
print(", ".join(list(pbp_pass.columns)[:15]))
print(f"... and {len(pbp_pass.columns) - 15} more columns")
The explicit-selection idiom is nearly identical across languages: R's select(game_id, posteam, epa) corresponds to pandas' pbp[['game_id', 'posteam', 'epa']]. Pattern helpers save the most typing when a concept spans many columns: rather than spelling out select(pass_attempt, pass_touchdown, pass_length, pass_location, ...), write select(starts_with("pass")).
Column Selection Best Practices
1. **Select early**: Narrow your dataset to needed columns as soon as possible. This improves performance and reduces clutter throughout your analysis.
2. **Be explicit in production code**: For analyses that will be shared or revisited, explicitly list columns rather than using patterns. This makes dependencies clear and prevents breaking changes if the dataset structure changes.
3. **Use patterns for exploration**: When you're exploring data and don't know exactly what you need, pattern matching helps you quickly grab related variables.
4. **Document your selections**: Add comments explaining why you selected specific columns, especially if the choice isn't obvious.
5. **Verify your selection**: After selecting columns, print the column names or dimensions to verify you got what you expected. This catches typos and logic errors.
6. **Consider creating a standard selection**: If you repeatedly use the same set of columns, create a vector/list of column names that you can reuse:
# R
standard_cols <- c("game_id", "posteam", "down", "ydstogo", "epa")
pbp_subset <- pbp %>% select(all_of(standard_cols))
# Python
standard_cols = ['game_id', 'posteam', 'down', 'ydstogo', 'epa']
pbp_subset = pbp[standard_cols]
Filtering Rows: Subsetting to Relevant Observations
While selecting columns narrows the variables you're working with, filtering rows narrows the observations. Filtering is how you focus your analysis on specific situations, teams, players, or time periods.
Effective filtering requires understanding:
- What conditions matter for your analysis question
- How to express those conditions in code
- What happens to missing values when filtering
- How to verify your filter worked as intended
Let's explore filtering with increasingly complex examples:
#| label: filter-basic-r
# =====================================
# Single-Condition Filtering
# =====================================
# Filter to only pass plays
# The == operator tests for equality
# This creates a logical vector (TRUE/FALSE for each row)
# filter() keeps rows where the condition is TRUE
pass_plays <- pbp %>%
filter(play_type == "pass")
cat("Total plays:", format(nrow(pbp), big.mark = ","), "\n")
cat("Pass plays:", format(nrow(pass_plays), big.mark = ","),
sprintf("(%.1f%%)\n", 100 * nrow(pass_plays) / nrow(pbp)))
# =====================================
# Multiple-Condition Filtering (AND logic)
# =====================================
# Filter to passes on 3rd down
# Multiple conditions in filter() are combined with AND
# A row must satisfy ALL conditions to be kept
third_down_passes <- pbp %>%
filter(
play_type == "pass", # Must be a pass AND
down == 3 # Must be 3rd down
)
cat("\n3rd down passes:", format(nrow(third_down_passes), big.mark = ","), "\n")
# =====================================
# Complex Filtering with Multiple Operators
# =====================================
# Filter to critical situations: 3rd/4th down, medium/long distance,
# 4th quarter, close game
# This demonstrates combining multiple logical operators:
# - %in% tests if value is in a set
# - >= tests greater than or equal
# - <= tests less than or equal
critical_plays <- pbp %>%
filter(
down %in% c(3, 4), # 3rd or 4th down
ydstogo >= 5, # 5+ yards to go
qtr == 4, # 4th quarter
score_differential >= -8, # Within 8 points (trailing)
score_differential <= 8 # Within 8 points (leading)
)
cat("\nCritical situation plays:", format(nrow(critical_plays), big.mark = ","), "\n")
# =====================================
# Filtering with OR Logic
# =====================================
# Filter to plays that are either touchdowns OR explosive gains (20+ yards)
# Use the | operator for OR logic
# A row is kept if it satisfies ANY of the conditions
explosive_plays <- pbp %>%
filter(
touchdown == 1 | # Touchdown, OR
yards_gained >= 20, # 20+ yard gain
!is.na(epa) # Must have valid EPA
)
cat("\nExplosive plays:", format(nrow(explosive_plays), big.mark = ","), "\n")
# =====================================
# Negative Filtering (Exclusion)
# =====================================
# Filter to non-penalty, regular offensive plays with EPA
# Use ! for negation (NOT)
# !is.na(x) means "is not missing"
# ! (condition) means "not condition"
clean_plays <- pbp %>%
filter(
!is.na(epa), # EPA must not be missing
play_type %in% c("pass", "run"), # Must be pass or run
penalty == 0 | is.na(penalty) # No penalty (or penalty is NA)
)
cat("\nClean offensive plays:", format(nrow(clean_plays), big.mark = ","), "\n")
# =====================================
# Filtering by Field Position Zones
# =====================================
# Filter to red zone plays (inside opponent's 20-yard line)
# Field position in nflfastR is yards to opponent's goal
# yardline_100 = 20 means 20 yards from opponent's goal (own 80-yard line)
# yardline_100 = 10 means 10 yards from opponent's goal (opponent's 10)
red_zone <- pbp %>%
filter(
yardline_100 <= 20, # 20 or fewer yards to goal
yardline_100 > 0, # Not in end zone
play_type %in% c("pass", "run"),
!is.na(epa)
)
cat("\nRed zone plays:", format(nrow(red_zone), big.mark = ","), "\n")
# =====================================
# Filtering by Team
# =====================================
# Filter to Kansas City Chiefs offensive plays
# posteam is the team with possession (offensive team)
chiefs_offense <- pbp %>%
filter(
posteam == "KC",
play_type %in% c("pass", "run"),
!is.na(epa)
)
cat("\nKC offensive plays:", format(nrow(chiefs_offense), big.mark = ","), "\n")
# =====================================
# Demonstrate Filter Logic with Counts
# =====================================
cat("\n========================================\n")
cat("FILTERING IMPACT SUMMARY\n")
cat("========================================\n")
filter_summary <- tibble(
Filter = c(
"All plays",
"Pass plays only",
"3rd down passes",
"Critical situations",
"Explosive plays",
"Clean offensive plays",
"Red zone plays",
"KC offensive plays"
),
Rows = c(
nrow(pbp),
nrow(pass_plays),
nrow(third_down_passes),
nrow(critical_plays),
nrow(explosive_plays),
nrow(clean_plays),
nrow(red_zone),
nrow(chiefs_offense)
)
) %>%
mutate(
`Percent of Total` = Rows / nrow(pbp) * 100,
`Rows Kept` = format(Rows, big.mark = ","),
`% Kept` = sprintf("%.1f%%", `Percent of Total`)
) %>%
select(Filter, `Rows Kept`, `% Kept`)
print(filter_summary)
#| label: filter-basic-py
# =====================================
# Single-Condition Filtering
# =====================================
# Filter to only pass plays
# pandas provides multiple filtering syntaxes:
# 1. Boolean indexing: df[df['column'] == value]
# 2. query() method: df.query("column == 'value'")
# We'll demonstrate both
# Boolean indexing approach
pass_plays = pbp[pbp['play_type'] == 'pass']
# query() approach (often more readable)
# pass_plays = pbp.query("play_type == 'pass'")
print(f"Total plays: {len(pbp):,}")
print(f"Pass plays: {len(pass_plays):,} ({100*len(pass_plays)/len(pbp):.1f}%)")
# =====================================
# Multiple-Condition Filtering (AND logic)
# =====================================
# Filter to passes on 3rd down
# Multiple conditions with & (AND) operator
# Note: Must wrap each condition in parentheses
third_down_passes = pbp[
(pbp['play_type'] == 'pass') & # Must be a pass AND
(pbp['down'] == 3) # Must be 3rd down
]
# Alternative using query() method (often more readable)
# third_down_passes = pbp.query("play_type == 'pass' and down == 3")
print(f"\n3rd down passes: {len(third_down_passes):,}")
# =====================================
# Complex Filtering with Multiple Operators
# =====================================
# Filter to critical situations
# Demonstrates multiple comparison operators
critical_plays = pbp[
(pbp['down'].isin([3, 4])) & # 3rd or 4th down
(pbp['ydstogo'] >= 5) & # 5+ yards to go
(pbp['qtr'] == 4) & # 4th quarter
(pbp['score_differential'] >= -8) & # Trailing by 8 or fewer points
(pbp['score_differential'] <= 8) # Leading by 8 or fewer (one-score game)
]
# Alternative using query() method
# critical_plays = pbp.query(
# "down in [3, 4] and ydstogo >= 5 and qtr == 4 and "
# "score_differential >= -8 and score_differential <= 8"
# )
print(f"\nCritical situation plays: {len(critical_plays):,}")
# =====================================
# Filtering with OR Logic
# =====================================
# Filter to touchdowns OR explosive gains
# Use the | operator for OR logic
# Note: Must wrap each condition in parentheses
explosive_plays = pbp[
((pbp['touchdown'] == 1) | # Touchdown, OR
(pbp['yards_gained'] >= 20)) & # 20+ yard gain
(pbp['epa'].notna()) # Must have valid EPA
]
# Alternative using query() method
# explosive_plays = pbp.query(
# "(touchdown == 1 or yards_gained >= 20) and epa.notna()"
# )
print(f"\nExplosive plays: {len(explosive_plays):,}")
# =====================================
# Negative Filtering (Exclusion)
# =====================================
# Filter to clean offensive plays
# The ~ operator negates a mask (NOT); here .notna() handles exclusion directly
# .notna() checks for non-missing values (opposite of .isna())
clean_plays = pbp[
(pbp['epa'].notna()) & # EPA must not be missing
(pbp['play_type'].isin(['pass', 'run'])) # Must be pass or run
]
# Alternative using query() method
# clean_plays = pbp.query(
# "epa.notna() and play_type in ['pass', 'run']"
# )
print(f"\nClean offensive plays: {len(clean_plays):,}")
# =====================================
# Filtering by Field Position Zones
# =====================================
# Filter to red zone plays
red_zone = pbp[
(pbp['yardline_100'] <= 20) & # 20 or fewer yards to goal
(pbp['yardline_100'] > 0) & # Not in end zone
(pbp['play_type'].isin(['pass', 'run'])) &
(pbp['epa'].notna())
]
print(f"\nRed zone plays: {len(red_zone):,}")
# =====================================
# Filtering by Team
# =====================================
# Filter to Kansas City Chiefs offensive plays
chiefs_offense = pbp[
(pbp['posteam'] == 'KC') &
(pbp['play_type'].isin(['pass', 'run'])) &
(pbp['epa'].notna())
]
print(f"\nKC offensive plays: {len(chiefs_offense):,}")
# =====================================
# Demonstrate Filter Logic with Summary
# =====================================
print("\n" + "=" * 70)
print("FILTERING IMPACT SUMMARY")
print("=" * 70)
filter_summary = pd.DataFrame({
'Filter': [
'All plays',
'Pass plays only',
'3rd down passes',
'Critical situations',
'Explosive plays',
'Clean offensive plays',
'Red zone plays',
'KC offensive plays'
],
'Rows': [
len(pbp),
len(pass_plays),
len(third_down_passes),
len(critical_plays),
len(explosive_plays),
len(clean_plays),
len(red_zone),
len(chiefs_offense)
]
})
filter_summary['Percent of Total'] = (
filter_summary['Rows'] / len(pbp) * 100
)
filter_summary['Rows Kept'] = filter_summary['Rows'].apply(
lambda x: f"{x:,}"
)
filter_summary['% Kept'] = filter_summary['Percent of Total'].apply(
lambda x: f"{x:.1f}%"
)
print(filter_summary[['Filter', 'Rows Kept', '% Kept']].to_string(index=False))
pbp %>% filter(down == 3)
# R: Comma means AND
filter(down == 3, ydstogo >= 5)
# Python: & operator means AND
pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]
# R: | operator means OR
filter(down == 3 | down == 4)
# Python: | operator means OR
pbp[(pbp['down'] == 3) | (pbp['down'] == 4)]
# WRONG: Comma means AND, so this returns zero rows
filter(down == 3, down == 4) # A play can't be both 3rd AND 4th down!
# CORRECT:
filter(down == 3 | down == 4) # OR logic
# WRONG: & binds before ==, so this raises an error (ambiguous truth value)
pbp[pbp['down'] == 3 & pbp['ydstogo'] >= 5] # Precedence issue!
# CORRECT: Parentheses ensure proper order
pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]
# This DOESN'T work as you might expect:
filter(epa == NA) # Returns zero rows!
# This is correct:
filter(is.na(epa)) # Returns rows where epa is missing
# To keep non-missing:
filter(!is.na(epa)) # Returns rows where epa is NOT missing
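The pandas counterpart behaves analogously: comparisons with NaN are always False, so `== np.nan` matches nothing, and `.isna()` / `.notna()` are the reliable tests. A minimal sketch on a toy frame (this `pbp` is a hypothetical stand-in, not the nflfastR data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({'epa': [0.4, np.nan, -0.2]})

# This DOESN'T work: NaN never compares equal, even to itself
broken = pbp[pbp['epa'] == np.nan]    # zero rows

# These are correct:
missing = pbp[pbp['epa'].isna()]      # rows where epa IS missing
present = pbp[pbp['epa'].notna()]     # rows where epa is NOT missing
print(len(broken), len(missing), len(present))  # 0 1 2
```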
# R: Efficient way to test multiple values
filter(down %in% c(3, 4))
# Python: .isin() method
pbp[pbp['down'].isin([3, 4])]
# This is equivalent to but cleaner than:
# R: filter(down == 3 | down == 4)
# Python: pbp[(pbp['down'] == 3) | (pbp['down'] == 4)]
# Option 1: Two conditions with AND
filter(yardline_100 >= 20, yardline_100 <= 50)
# Option 2: between() helper function (R only)
filter(between(yardline_100, 20, 50))
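pandas offers a similar helper: `Series.between()`, which is inclusive on both endpoints by default, mirroring dplyr's `between()`. A quick sketch with toy data (hypothetical `pbp`):

```python
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({'yardline_100': [10, 25, 45, 60, 35]})

# Both endpoints inclusive by default (inclusive='both')
midfield = pbp[pbp['yardline_100'].between(20, 50)]
print(midfield['yardline_100'].tolist())  # [25, 45, 35]
```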
# Verify filter
filtered_data <- pbp %>% filter(down == 3)
# Check that all rows have down == 3
all(filtered_data$down == 3) # Should be TRUE
# Check range of other variables makes sense
summary(filtered_data$ydstogo)
# Boolean indexing (traditional approach)
result = pbp[
(pbp['down'] == 3) &
(pbp['ydstogo'] >= 5) &
(pbp['play_type'] == 'pass')
]
# query() method (often more readable)
result = pbp.query("down == 3 and ydstogo >= 5 and play_type == 'pass'")
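The same verification habit carries over to pandas; a sketch on a toy frame (hypothetical data):

```python
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({'down': [1, 3, 3, 4], 'ydstogo': [10, 7, 2, 5]})

filtered = pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]

# Check that every surviving row meets the criteria
assert (filtered['down'] == 3).all()
assert (filtered['ydstogo'] >= 5).all()

# Check that the row count makes sense
print(f"{len(filtered)} of {len(pbp)} rows kept")  # 1 of 4 rows kept
```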
### Common Filtering Pitfalls
1. **Forgetting Missing Values**: In R, comparisons involving NA return NA (not TRUE or FALSE), and filter() silently drops rows where the condition evaluates to NA. Always handle missing values explicitly:
# WRONG: This silently drops rows with missing epa
filter(epa > 0) # Drops NA values silently
# CORRECT: Explicit about NAs
filter(epa > 0 | is.na(epa)) # Keeps positive EPA and NAs
# OR
filter(!is.na(epa), epa > 0) # Explicitly requires non-missing
2. **Operator Precedence (Python)**: & and | bind more tightly than comparisons, so each condition needs its own parentheses:
# WRONG: Precedence issues
pbp[pbp['down'] == 3 & pbp['ydstogo'] >= 5]
# CORRECT: Explicit parentheses
pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]
3. **Floating-Point Equality**: Exact equality on computed decimal values can silently fail:
# WRONG: May not work due to floating point precision
filter(epa == 0.5)
# BETTER: Use tolerance
filter(abs(epa - 0.5) < 0.0001)
# OR: Use near() function (R)
filter(near(epa, 0.5))
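Python shares this pitfall; numpy's `isclose()` plays the role of R's `near()`. A sketch with toy values:

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is not exactly 0.3 in floating point
epa = pd.Series([0.1 + 0.2, 0.3])

exact = epa[epa == 0.3]             # misses the computed value
close = epa[np.isclose(epa, 0.3)]   # matches both within tolerance
print(len(exact), len(close))       # 1 2
```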
4. **Skipping Verification**: Always sanity-check a filter's output:
# After filtering, verify results
filtered_data <- pbp %>% filter(down == 3, ydstogo >= 5)
# Check: Do all rows meet criteria?
# Check: Does row count make sense?
# Check: Do value ranges make sense?
5. **Confusing Offense and Defense**: posteam and defteam select opposite sides of the ball:
# Kansas City's offensive plays
filter(posteam == "KC")
# Kansas City's defensive plays (plays against KC)
filter(defteam == "KC")
# These are different datasets!
```</div>
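To get every play from a team's games regardless of which side has the ball, combine the two columns with OR. A sketch on a hypothetical frame (toy data, not nflfastR):

```python
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({
    'posteam': ['KC', 'BUF', 'DAL', 'KC'],
    'defteam': ['BUF', 'KC', 'PHI', 'LV'],
})

# All plays in KC games: KC on offense OR on defense
kc_games = pbp[(pbp['posteam'] == 'KC') | (pbp['defteam'] == 'KC')]
print(len(kc_games))  # 3
```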
### Practical Example: Red Zone Efficiency Analysis
Let's combine selection and filtering to answer a real analytical question: **How do pass and run efficiency differ in the red zone?**
The red zone—inside the opponent's 20-yard line—is strategically important because the compressed field changes play dynamics. Defenses don't have to defend as much space, offensive playbooks are more limited, and scoring opportunities are valuable. Understanding red zone efficiency helps teams optimize play-calling in these high-value situations.
This example demonstrates how to:
1. Filter to a specific field position zone
2. Calculate efficiency metrics by group
3. Present results professionally
4. Interpret findings in football context
<div class="panel-tabset">
<ul class="nav nav-tabs" role="tablist">
<li class="nav-item"><a class="nav-link active" data-bs-toggle="tab" href="#tab-r-6262">R</a></li>
<li class="nav-item"><a class="nav-link" data-bs-toggle="tab" href="#tab-python-6262">Python</a></li>
<li class="nav-item"><a class="nav-link" data-bs-toggle="tab" href="#tab-interpretation-6262">Interpretation</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="tab-r-6262">
```{r}
#| label: red-zone-analysis-r
# =====================================
# Define and Filter Red Zone Plays
# =====================================
# Red zone is defined as inside opponent's 20-yard line
# In nflfastR, yardline_100 represents yards to opponent's goal:
# - yardline_100 = 20 means you're AT your opponent's 20-yard line
# - yardline_100 = 10 means you're at your opponent's 10-yard line
# - yardline_100 = 1 means you're at the 1-yard line
# We exclude yardline_100 = 0 (plays in end zone, usually PATs)
red_zone <- pbp %>%
filter(
# Only pass and run plays (exclude special teams)
play_type %in% c("pass", "run"),
# Inside the 20-yard line but not in end zone
yardline_100 <= 20,
yardline_100 > 0,
# Only plays with valid EPA
!is.na(epa)
)
cat("Red Zone Analysis\n")
cat(strrep("=", 50), "\n")
cat("Total plays in red zone:", format(nrow(red_zone), big.mark = ","), "\n")
cat("Percentage of all offensive plays:",
sprintf("%.1f%%\n", 100 * nrow(red_zone) / sum(pbp$play_type %in% c("pass", "run") & !is.na(pbp$epa))))
# =====================================
# Calculate Efficiency Metrics by Play Type
# =====================================
# For each play type (pass/run), calculate:
# - Total plays (volume)
# - Touchdowns (scoring outcomes)
# - Touchdown rate (efficiency)
# - Average EPA (comprehensive efficiency measure)
# - Success rate (consistency)
red_zone_summary <- red_zone %>%
group_by(play_type) %>%
summarise(
# Count total plays
plays = n(),
# Count touchdowns
# touchdown variable is 1 for TD, 0 otherwise
# sum() counts the 1s
touchdowns = sum(touchdown == 1, na.rm = TRUE),
# Calculate TD rate
# This is the primary outcome we care about in red zone
td_rate = touchdowns / plays,
# Calculate average EPA
# EPA captures full value including field position changes
avg_epa = mean(epa, na.rm = TRUE),
# Calculate success rate
# Success = positive EPA
success_rate = mean(epa > 0, na.rm = TRUE),
# Calculate median EPA for robustness to outliers
median_epa = median(epa, na.rm = TRUE),
.groups = "drop"
) %>%
# Calculate percentage of total red zone plays
mutate(
pct_of_rz_plays = plays / sum(plays) * 100
)
# =====================================
# Display Results in Professional Table
# =====================================
red_zone_summary %>%
# Create publication-quality table
gt() %>%
# Add title and context
tab_header(
title = "Red Zone Efficiency by Play Type",
subtitle = "2023 NFL Regular Season | Plays inside opponent's 20-yard line"
) %>%
# Readable column labels
cols_label(
play_type = "Play Type",
plays = "Plays",
pct_of_rz_plays = "% of RZ Plays",
touchdowns = "Touchdowns",
td_rate = "TD Rate",
avg_epa = "Avg EPA",
median_epa = "Median EPA",
success_rate = "Success Rate"
) %>%
# Format numbers appropriately
fmt_number(
columns = c(plays, touchdowns),
decimals = 0,
use_seps = TRUE
) %>%
fmt_percent(
columns = c(td_rate, success_rate, pct_of_rz_plays),
decimals = 1
) %>%
fmt_number(
columns = c(avg_epa, median_epa),
decimals = 3
) %>%
# Add source note
tab_source_note(
source_note = "Data: nflfastR | Red zone defined as inside opponent's 20-yard line"
) %>%
# Highlight best values
# Highest values are best for avg_epa, td_rate, success_rate
data_color(
columns = c(avg_epa, td_rate, success_rate),
palette = "Greens"
)
# =====================================
# Narrative Summary of Findings
# =====================================
cat("\n\n", strrep("=", 50), "\n", sep = "")
cat("KEY FINDINGS\n")
cat(strrep("=", 50), "\n\n")
pass_data <- red_zone_summary %>% filter(play_type == "pass")
run_data <- red_zone_summary %>% filter(play_type == "run")
cat(sprintf("VOLUME:\n"))
cat(sprintf("- Passes: %s plays (%.1f%% of red zone plays)\n",
format(pass_data$plays, big.mark = ","),
pass_data$pct_of_rz_plays))
cat(sprintf("- Runs: %s plays (%.1f%% of red zone plays)\n",
format(run_data$plays, big.mark = ","),
run_data$pct_of_rz_plays))
cat(sprintf("\nTOUCHDOWN EFFICIENCY:\n"))
cat(sprintf("- Pass TD rate: %.1f%%\n", pass_data$td_rate * 100))
cat(sprintf("- Run TD rate: %.1f%%\n", run_data$td_rate * 100))
cat(sprintf("- Advantage: %s by %.1f percentage points\n",
ifelse(pass_data$td_rate > run_data$td_rate, "Pass", "Run"),
abs(pass_data$td_rate - run_data$td_rate) * 100))
cat(sprintf("\nEPA EFFICIENCY:\n"))
cat(sprintf("- Pass avg EPA: %.3f\n", pass_data$avg_epa))
cat(sprintf("- Run avg EPA: %.3f\n", run_data$avg_epa))
cat(sprintf("- Advantage: %s by %.3f EPA\n",
ifelse(pass_data$avg_epa > run_data$avg_epa, "Pass", "Run"),
abs(pass_data$avg_epa - run_data$avg_epa)))
cat(sprintf("\nCONSISTENCY (Success Rate):\n"))
cat(sprintf("- Pass success rate: %.1f%%\n", pass_data$success_rate * 100))
cat(sprintf("- Run success rate: %.1f%%\n", run_data$success_rate * 100))
#| label: red-zone-analysis-py
# =====================================
# Define and Filter Red Zone Plays
# =====================================
# Red zone is inside opponent's 20-yard line
red_zone = pbp[
(pbp['play_type'].isin(['pass', 'run'])) &
(pbp['yardline_100'] <= 20) &
(pbp['yardline_100'] > 0) &
(pbp['epa'].notna())
]
print("Red Zone Analysis")
print("=" * 70)
print(f"Total plays in red zone: {len(red_zone):,}")
# Calculate percentage of all offensive plays
all_offensive = pbp[
pbp['play_type'].isin(['pass', 'run']) & pbp['epa'].notna()
]
print(f"Percentage of all offensive plays: {100*len(red_zone)/len(all_offensive):.1f}%")
# =====================================
# Calculate Efficiency Metrics by Play Type
# =====================================
# Group by play type and calculate multiple metrics
red_zone_summary = red_zone.groupby('play_type').agg(
# Count plays
plays=('play_id', 'count'),
# Count touchdowns (touchdown column is 1 for TD, 0 otherwise)
touchdowns=('touchdown', lambda x: (x == 1).sum()),
# Calculate average EPA
avg_epa=('epa', 'mean'),
# Calculate median EPA
median_epa=('epa', 'median'),
# Calculate success rate (EPA > 0)
success_rate=('epa', lambda x: (x > 0).mean())
).reset_index()
# Calculate TD rate
red_zone_summary['td_rate'] = (
red_zone_summary['touchdowns'] / red_zone_summary['plays']
)
# Calculate percentage of red zone plays
red_zone_summary['pct_of_rz_plays'] = (
red_zone_summary['plays'] / red_zone_summary['plays'].sum() * 100
)
# =====================================
# Display Results
# =====================================
print("\n" + "=" * 70)
print("RED ZONE EFFICIENCY BY PLAY TYPE")
print("2023 NFL Regular Season | Plays inside opponent's 20-yard line")
print("=" * 70)
print()
# Format for display
display_summary = red_zone_summary.copy()
display_summary['Plays'] = display_summary['plays'].apply(lambda x: f"{x:,}")
display_summary['% of RZ'] = display_summary['pct_of_rz_plays'].apply(lambda x: f"{x:.1f}%")
display_summary['TDs'] = display_summary['touchdowns'].apply(lambda x: f"{x:,}")
display_summary['TD Rate'] = display_summary['td_rate'].apply(lambda x: f"{x:.1%}")
display_summary['Avg EPA'] = display_summary['avg_epa'].apply(lambda x: f"{x:.3f}")
display_summary['Median EPA'] = display_summary['median_epa'].apply(lambda x: f"{x:.3f}")
display_summary['Success Rate'] = display_summary['success_rate'].apply(lambda x: f"{x:.1%}")
display_cols = ['play_type', 'Plays', '% of RZ', 'TDs', 'TD Rate',
'Avg EPA', 'Median EPA', 'Success Rate']
print(display_summary[display_cols].to_string(index=False))
# =====================================
# Narrative Summary of Findings
# =====================================
print("\n" + "=" * 70)
print("KEY FINDINGS")
print("=" * 70)
pass_data = red_zone_summary[red_zone_summary['play_type'] == 'pass'].iloc[0]
run_data = red_zone_summary[red_zone_summary['play_type'] == 'run'].iloc[0]
print("\nVOLUME:")
print(f"- Passes: {pass_data['plays']:,} plays ({pass_data['pct_of_rz_plays']:.1f}% of red zone plays)")
print(f"- Runs: {run_data['plays']:,} plays ({run_data['pct_of_rz_plays']:.1f}% of red zone plays)")
print("\nTOUCHDOWN EFFICIENCY:")
print(f"- Pass TD rate: {pass_data['td_rate']:.1%}")
print(f"- Run TD rate: {run_data['td_rate']:.1%}")
advantage = "Pass" if pass_data['td_rate'] > run_data['td_rate'] else "Run"
diff = abs(pass_data['td_rate'] - run_data['td_rate']) * 100
print(f"- Advantage: {advantage} by {diff:.1f} percentage points")
print("\nEPA EFFICIENCY:")
print(f"- Pass avg EPA: {pass_data['avg_epa']:.3f}")
print(f"- Run avg EPA: {run_data['avg_epa']:.3f}")
advantage = "Pass" if pass_data['avg_epa'] > run_data['avg_epa'] else "Run"
diff = abs(pass_data['avg_epa'] - run_data['avg_epa'])
print(f"- Advantage: {advantage} by {diff:.3f} EPA")
print("\nCONSISTENCY (Success Rate):")
print(f"- Pass success rate: {pass_data['success_rate']:.1%}")
print(f"- Run success rate: {run_data['success_rate']:.1%}")