Learning Objectives
By the end of this chapter, you will be able to:
- Master tidyverse (R) and pandas (Python) for football data manipulation
- Filter, select, and transform play-by-play and roster data effectively
- Handle missing data and outliers in football datasets
- Create derived variables and features for advanced analytics
- Aggregate and summarize data at player, team, and game levels
- Reshape data between wide and long formats
- Join and merge multiple football datasets
- Clean and standardize team names, player names, and temporal data
Introduction
Data wrangling—also called data munging or data preparation—is the process of transforming raw data into a clean, structured format suitable for analysis. In football analytics, this step is crucial because raw play-by-play data, roster information, and schedule data rarely arrive in the exact format you need for analysis. Understanding how to efficiently reshape, filter, combine, and clean data is what separates beginners from practitioners who can tackle real-world analytical challenges.
Think of data wrangling as the foundation of a building. No matter how sophisticated your statistical models or elegant your visualizations, if your data foundation is unstable—filled with errors, missing values, or improperly structured—everything built on top will be compromised. A quarterback evaluation model that includes penalty plays, a team efficiency analysis that doesn't account for garbage time, or a player comparison that fails to handle missing data will produce misleading results that lead to poor decisions.
The importance of data wrangling in football analytics cannot be overstated for several reasons:
Play-by-play data complexity: The nflfastR dataset includes over 370 variables per play. Most analyses require only 10-20 of these variables, but identifying which variables to use, understanding their meanings, and selecting them efficiently is essential. Many variables are highly correlated (for example, yards_gained and epa), some are situation-specific (air_yards exists only for passes), and others contain nuanced information that requires domain knowledge to interpret correctly.
Missing values are pervasive: EPA is missing on penalties and two-point conversions. Air yards and completion probability don't exist for run plays. Player IDs are missing when players aren't involved in a play. Understanding why data is missing and how to handle it appropriately is crucial. Blindly deleting rows with any missing values would eliminate most of your dataset, while improperly filling missing values can introduce bias and lead to incorrect conclusions.
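The difference between blind and targeted deletion is easy to demonstrate. The following Python sketch uses a tiny hand-made DataFrame (illustrative values, not real nflfastR output) to show that dropna() across all columns discards a perfectly valid run play, while restricting the check to the one column your analysis actually needs keeps it:

```python
import pandas as pd
import numpy as np

# Toy play-by-play fragment: EPA is missing on the no_play (penalty),
# air_yards is structurally missing on the run (values are made up)
plays = pd.DataFrame({
    "play_type": ["pass", "run", "no_play", "pass"],
    "epa":       [0.45, -0.20, np.nan, 1.10],
    "air_yards": [8.0, np.nan, np.nan, 22.0],
})

# Dropping rows with ANY missing value also discards the valid run play
naive = plays.dropna()

# Dropping only where the needed column is missing keeps the run play
targeted = plays.dropna(subset=["epa"])

print(len(naive), len(targeted))  # 2 3
```

The subset argument is the key: it scopes the missingness check to the variables your question depends on instead of the whole row.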
Outliers require domain knowledge: A play with -15 EPA might be a catastrophic turnover in great field position, or it might be a garbage-time play that shouldn't influence your model of competitive performance. A 99-yard touchdown pass is a legitimate outlier that tells you something important about a player's ability, while a Hail Mary completion that gained 60 yards but still resulted in negative EPA requires context to interpret correctly. You can't mechanically apply statistical outlier detection rules without understanding football context.
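One context-aware alternative to mechanical outlier rules is to restrict analysis to competitive situations. The sketch below, using made-up numbers, filters on a pre-snap win probability column like nflfastR's wp; the 5%-95% cutoffs are a common convention among analysts, not a fixed rule:

```python
import pandas as pd

# Toy plays with win probability (wp) and EPA — all values invented
plays = pd.DataFrame({
    "wp":  [0.50, 0.97, 0.03, 0.62, 0.99],
    "epa": [-0.3, -15.0, 2.0, 0.8, -4.0],
})

# Keep only competitive snaps: win probability between 5% and 95%
competitive = plays[(plays["wp"] >= 0.05) & (plays["wp"] <= 0.95)]

print(len(competitive))  # 2 rows survive
print(round(plays["epa"].mean(), 2), round(competitive["epa"].mean(), 2))
```

Here the garbage-time plays drag the naive mean far below the competitive-only mean, which is exactly the distortion the paragraph above warns about.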
Multiple datasets must be combined: Answering interesting questions requires joining play-by-play data with rosters (to get player demographics), with schedules (to get game context), with Next Gen Stats (to get tracking data), and potentially with external data sources (weather, Vegas lines, injury reports). Each join introduces opportunities for errors if not handled carefully—mismatched team abbreviations, missing keys, and duplicate records can all corrupt your analysis.
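pandas can audit a join for exactly these problems. This toy example (hand-made frames, not real nflverse data) uses merge() with indicator=True to flag plays whose team/player keys found no roster match:

```python
import pandas as pd

# Toy data: one team abbreviation ("KAN") won't match the roster's "KC"
plays = pd.DataFrame({
    "posteam": ["KC", "KAN", "BUF"],
    "passer":  ["P.Mahomes", "P.Mahomes", "J.Allen"],
})
rosters = pd.DataFrame({
    "team":    ["KC", "BUF"],
    "player":  ["P.Mahomes", "J.Allen"],
    "college": ["Texas Tech", "Wyoming"],
})

# indicator=True adds a _merge column recording which side each row came from
joined = plays.merge(
    rosters, how="left",
    left_on=["posteam", "passer"], right_on=["team", "player"],
    indicator=True,
)
unmatched = joined[joined["_merge"] == "left_only"]
print(len(joined), len(unmatched))  # 3 rows out, 1 failed to match
```

Checking the _merge counts after every join is a cheap habit that catches mismatched abbreviations and missing keys before they silently corrupt downstream results.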
Inconsistent formats require standardization: Team names appear as "KC" in some datasets and "KAN" in others. Player names might be "P.Mahomes", "Patrick Mahomes", or "MAHOMES, PATRICK". Dates might be strings ("2023-09-07") or date objects. Standardizing these inconsistencies is tedious but essential for correct analysis.
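A small crosswalk dictionary plus a name-normalizing helper handles most of this. The mappings and the normalize_name function below are illustrative sketches, not part of any nflverse package:

```python
import pandas as pd

# Hypothetical crosswalk from alternate abbreviations to one standard code
TEAM_FIX = {"KAN": "KC", "NWE": "NE", "GNB": "GB"}

teams = pd.Series(["KC", "KAN", "GNB", "BUF"])
# Replace only the known aliases; everything else passes through unchanged
standardized = teams.replace(TEAM_FIX)
print(standardized.tolist())  # ['KC', 'KC', 'GB', 'BUF']

def normalize_name(name: str) -> str:
    """Convert 'MAHOMES, PATRICK' style names to 'Patrick Mahomes'."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name.title()

print(normalize_name("MAHOMES, PATRICK"))  # Patrick Mahomes
```

Dates get the same treatment: pd.to_datetime("2023-09-07") converts string dates into proper date objects so they sort and subtract correctly.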
Studies suggest that data scientists spend 60-80% of their time on data wrangling, with only 20-40% on modeling and visualization. This chapter equips you with the essential techniques to efficiently prepare football data for analysis. Mastering these skills will dramatically improve your productivity, reduce errors, and enable you to tackle more sophisticated analytical questions.
The 80/20 Rule of Data Analysis
In football analytics, 80% of your time will be spent wrangling data, and 20% on modeling and visualization. While this might seem backwards—shouldn't analysis be the main activity?—it reflects reality. Clean, well-structured data makes analysis straightforward. Messy, poorly understood data makes even simple analyses error-prone and time-consuming. The good news: time invested in learning data wrangling pays dividends. Techniques you'll learn in this chapter—filtering, grouping, joining, handling missing data—apply to virtually every analysis you'll ever conduct. Master these fundamentals, and more complex analyses become manageable.
Tools of the Trade
We'll use two primary ecosystems for data wrangling, each with distinct strengths and philosophies:
R (tidyverse): The tidyverse is a collection of R packages designed for data science with a consistent, human-readable syntax. Key packages include:
- dplyr: Data manipulation (filter, select, mutate, summarize) using verb-based functions that express intent clearly
- tidyr: Data reshaping (pivot_longer, pivot_wider) for converting between wide and long formats
- stringr: String manipulation with consistent function names and arguments
- lubridate: Date/time handling that simplifies common temporal operations
The tidyverse philosophy emphasizes readability and consistency. Functions have verb names (filter, select, arrange) that describe what they do. The pipe operator (%>% or |>) chains operations together in a readable left-to-right sequence. This makes tidyverse code relatively easy to read and understand even months after writing it.
Python (pandas): pandas is the foundational data manipulation library in Python, providing:
- DataFrames for tabular data with powerful indexing and querying capabilities
- Rich methods for filtering, grouping, and transforming with both method chaining and direct operations
- Integration with NumPy for numerical operations, enabling efficient vectorized computations
- Flexibility in syntax—multiple ways to accomplish the same task, allowing you to choose the most appropriate for your situation
pandas draws inspiration from R but adds Python's general-purpose programming capabilities. This makes it ideal for analyses that combine data manipulation with other tasks (web scraping, machine learning, automation). While pandas syntax can be less consistent than tidyverse (multiple ways to filter, select, or group), it's extremely powerful and widely used in industry.
Which Should You Choose? Both ecosystems are excellent. R/tidyverse excels at interactive data analysis and visualization, particularly for statistical work. Python/pandas excels at data pipelines, integration with machine learning libraries, and production systems. Many professional analysts use both, choosing based on the specific task. This textbook provides examples in both languages, allowing you to learn either or both.
Learning Both Languages: A Strategic Advantage
While you can accomplish nearly any football analytics task in either R or Python, learning both provides significant advantages:
1. **Career Flexibility**: Some organizations prefer R (common in research and statistics-heavy teams), others prefer Python (common in engineering-heavy teams). Knowing both makes you valuable to more organizations.
2. **Tool Selection**: Some tasks are easier in R (statistical modeling, publication-quality visualizations), others easier in Python (machine learning, data pipelines, web applications). Choose the right tool for each task.
3. **Cross-Pollination**: Concepts learned in one language deepen understanding of the other. Understanding tidyverse's dplyr helps you appreciate pandas, and vice versa.
4. **Reading Others' Code**: Much football analytics work is shared publicly in either R or Python. Being bilingual lets you learn from all sources, not just those in your preferred language.
Start with whichever language you're more comfortable with, but consider learning the other as your skills develop. The conceptual understanding—what operations you need to perform—matters more than the specific syntax.
Loading Required Libraries
Before we can wrangle data, we need to load the libraries that provide data manipulation functionality. This is a necessary first step in every R or Python session where you'll be working with data.
#| label: setup-r
#| message: false
#| warning: false
# Load core tidyverse packages
# The tidyverse is actually a meta-package that loads multiple packages:
# - dplyr (data manipulation)
# - ggplot2 (visualization)
# - tidyr (data reshaping)
# - readr (data import)
# - purrr (functional programming)
# - tibble (modern data frames)
# - stringr (string manipulation)
# - forcats (factor handling)
library(tidyverse) # Data manipulation and visualization
# Load NFL data packages
# nflfastR provides play-by-play data with advanced metrics pre-calculated
# This includes EPA, WPA, success rate, and 370+ other variables
library(nflfastR) # NFL play-by-play data
# Load additional packages for specific tasks
# lubridate simplifies date/time operations (parsing, arithmetic, extraction)
library(lubridate) # Date/time handling
# gt creates publication-quality tables with extensive formatting options
library(gt) # Professional table formatting
# nflplotR adds NFL team logos, colors, and styling to ggplot2 visualizations
library(nflplotR) # NFL-specific plotting tools
# Set default ggplot2 theme for consistent appearance
# theme_minimal() provides clean, uncluttered plots suitable for presentations
theme_set(theme_minimal())
# Display confirmation
cat("✓ All R packages loaded successfully\n")
cat("✓ Ready to wrangle football data!\n")
#| label: setup-py
#| message: false
#| warning: false
# Import core data manipulation packages
# pandas is the primary library for working with tabular data (DataFrames)
# By convention, we import pandas as 'pd' - this is universal in Python data science
import pandas as pd
# numpy provides numerical computing capabilities (arrays, math functions, etc.)
# By convention, we import numpy as 'np'
import numpy as np
# Import NFL data package
# nfl_data_py is the Python equivalent of nflfastR
# By convention, we import it as 'nfl'
import nfl_data_py as nfl
# Import visualization packages
# matplotlib.pyplot is the core plotting library (like base R graphics)
import matplotlib.pyplot as plt
# seaborn provides statistical visualization built on matplotlib
# It creates more attractive plots with less code
import seaborn as sns
# Import datetime for temporal operations
from datetime import datetime, timedelta
# Configure pandas display options for better readability
# These settings control how DataFrames are displayed in notebooks/console
pd.set_option('display.max_columns', 20) # Show up to 20 columns before truncating
pd.set_option('display.width', 120) # Wider display width
pd.set_option('display.precision', 3) # Show 3 decimal places for floats
# Display confirmation
print("✓ All Python packages loaded successfully")
print("✓ Ready to wrangle football data!")
Loading Sample Data
Throughout this chapter, we'll work with 2023 NFL play-by-play data as our primary dataset. This provides a rich, real-world dataset for demonstrating data wrangling techniques. We'll also load roster and schedule data to demonstrate joining multiple datasets.
Understanding what data you're working with—its structure, size, and contents—is the first step in any data wrangling task. Let's load data and examine what we have.
#| label: load-sample-data-r
#| message: false
#| warning: false
#| cache: true
# Load 2023 play-by-play data
# load_pbp() is the primary nflfastR function for accessing play-by-play data
# It accepts a season year or vector of years, e.g., load_pbp(2020:2023)
# The function automatically caches data locally, so subsequent loads are fast
pbp <- load_pbp(2023)
# Load roster data for the 2023 season
# Rosters include player names, positions, height, weight, college, draft info
# This data is essential for player-level analyses and demographics
rosters <- load_rosters(2023)
# Load schedule data
# Schedules include game dates, locations, outcomes, and other game metadata
# This is crucial for aggregating play-level data to game level
schedules <- load_schedules(2023)
# Display summary information about loaded datasets
cat(strrep("=", 50), "\n")
cat("LOADED DATA SUMMARY\n")
cat(strrep("=", 50), "\n\n")
cat("Play-by-Play Data:\n")
cat(sprintf(" - %s plays (rows)\n", format(nrow(pbp), big.mark = ",")))
cat(sprintf(" - %s variables (columns)\n", ncol(pbp)))
cat(sprintf(" - Memory size: %.1f MB\n",
as.numeric(object.size(pbp)) / 1024^2))
cat(sprintf(" - Weeks: %d through %d\n",
min(pbp$week, na.rm = TRUE), max(pbp$week, na.rm = TRUE)))
cat("\nRoster Data:\n")
cat(sprintf(" - %s player-season records (rows)\n",
format(nrow(rosters), big.mark = ",")))
cat(sprintf(" - %s variables (columns)\n", ncol(rosters)))
cat("\nSchedule Data:\n")
cat(sprintf(" - %s games (rows)\n", nrow(schedules)))
cat(sprintf(" - %s variables (columns)\n", ncol(schedules)))
# Display sample of key variables from play-by-play data
cat("\n", strrep("=", 50), "\n", sep = "")
cat("SAMPLE PLAY-BY-PLAY DATA (First 3 Plays)\n")
cat(strrep("=", 50), "\n")
pbp %>%
select(game_id, posteam, defteam, down, ydstogo, yardline_100,
desc, yards_gained, epa) %>%
head(3) %>%
print(width = 120)
#| label: load-sample-data-py
#| message: false
#| warning: false
#| cache: true
# Load 2023 play-by-play data
# import_pbp_data() is the primary nfl_data_py function for play-by-play data
# It accepts a list of season years, e.g., [2020, 2021, 2022, 2023]
# Like nflfastR, it caches data locally for fast subsequent loads
pbp = nfl.import_pbp_data([2023])
# Load roster data for 2023 season
# Rosters include player demographics, positions, and background information
rosters = nfl.import_rosters([2023])
# Load schedule data
# Schedules provide game-level information for joining and aggregation
schedules = nfl.import_schedules([2023])
# Display summary information about loaded datasets
print("=" * 70)
print("LOADED DATA SUMMARY")
print("=" * 70)
print("\nPlay-by-Play Data:")
print(f" - {len(pbp):,} plays (rows)")
print(f" - {len(pbp.columns)} variables (columns)")
print(f" - Memory size: {pbp.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f" - Weeks: {pbp['week'].min()} through {pbp['week'].max()}")
print("\nRoster Data:")
print(f" - {len(rosters):,} player-season records (rows)")
print(f" - {len(rosters.columns)} variables (columns)")
print("\nSchedule Data:")
print(f" - {len(schedules)} games (rows)")
print(f" - {len(schedules.columns)} variables (columns)")
# Display sample of key variables from play-by-play data
print("\n" + "=" * 70)
print("SAMPLE PLAY-BY-PLAY DATA (First 3 Plays)")
print("=" * 70)
sample_cols = ['game_id', 'posteam', 'defteam', 'down', 'ydstogo',
'yardline_100', 'desc', 'yards_gained', 'epa']
print(pbp[sample_cols].head(3).to_string(index=False))
Understanding Data Dimensions
The dimensions of your dataset—number of rows and columns—provide crucial context for all subsequent operations:
**Rows (Observations)**: Each row in play-by-play data represents one play. With ~43,000 plays across 272 games, this averages to ~158 plays per game, which matches typical NFL totals (about 130-140 offensive plays plus special teams and penalties).
**Columns (Variables)**: The 370+ variables might seem overwhelming, but you'll typically use only 10-20 for any specific analysis. Think of the comprehensive variable set as insurance—whatever question you want to answer, the data probably includes the variables you need.
**Working with Dimensions**:
- When filtering rows, track how many remain to ensure your filters work as intended
- When selecting columns, choose only what you need for clarity and performance
- When creating new variables, keep track of what you're adding and why
- When joining datasets, understand how dimensions change (rows might increase if it's a one-to-many join)
Always print dimension summaries after major operations to verify your code is working as expected. Unexpected dimension changes often indicate errors in your data wrangling logic.
Selecting and Filtering Data: Focusing Your Analysis
The first step in most analyses is narrowing your dataset to the relevant subset. With 370+ columns and 43,000+ rows, you need to select only the columns you need and filter to only the rows relevant to your question. This section covers the essential operations for focusing your analysis.
Selecting and filtering serve different purposes:
- Selecting columns reduces clutter, improves performance, and helps you focus on relevant variables
- Filtering rows removes irrelevant observations, allowing you to analyze specific subsets (e.g., only pass plays, only specific teams, only competitive situations)
Mastering these operations is foundational—you'll use them in virtually every analysis you ever conduct.
Selecting Columns: Working with What You Need
Play-by-play data includes 370+ variables, but most analyses use only 10-20. Selecting just the columns you need offers several advantages:
Clarity: With fewer columns, your data is easier to inspect and understand. When you print a dataset with 10 columns, you can see all of them. With 370 columns, most are hidden, making it hard to verify your data looks correct.
Performance: Many operations (printing, filtering, grouping) are faster with fewer columns because there's less data to process. While the performance difference is negligible for single-season data, it becomes significant when working with multiple seasons or very large datasets.
Reduced Errors: With fewer columns, you're less likely to accidentally use the wrong variable. If you have both passer_player_name and receiver_player_name in your dataset, you might accidentally use the wrong one. If you've selected only the variables you need, such mistakes become impossible.
Documentation: Your column selection serves as documentation of what variables your analysis uses. Someone reading your code (including future you) can look at the select() call and immediately see what data you're working with.
Let's look at various approaches to selecting columns:
#| label: select-columns-r
# =====================================
# Basic Column Selection
# =====================================
# Select specific columns by name
# This is the most explicit and self-documenting approach
# Use this when you know exactly which columns you need
pbp_basic <- pbp %>%
select(
# Identifiers
game_id, play_id, game_date,
# Teams
posteam, defteam,
# Situation
down, ydstogo, yardline_100,
# Outcome
play_type, yards_gained,
# Advanced metrics
epa, wpa, success
)
cat("Basic selection:", ncol(pbp_basic), "columns\n")
# =====================================
# Pattern-Based Selection
# =====================================
# Select columns matching patterns
# This is useful when you want all columns related to a specific concept
# The tidyverse provides helper functions for pattern matching:
# - starts_with(): columns that start with a prefix
# - ends_with(): columns that end with a suffix
# - contains(): columns that contain a string
# - matches(): columns matching a regular expression
pbp_pass <- pbp %>%
select(
game_id, play_id, # Always include identifiers
starts_with("pass"), # All passing-related columns
ends_with("_epa"), # All EPA variants
contains("air_yards") # All air yards columns
)
cat("Pass-focused selection:", ncol(pbp_pass), "columns\n")
# =====================================
# Selection by Range
# =====================================
# Select consecutive columns
# Use the colon operator to select all columns between two endpoints
# This is less common because column order might not be meaningful
pbp_range <- pbp %>%
select(game_id:play_id, down:ydstogo, epa:wpa)
cat("Range selection:", ncol(pbp_range), "columns\n")
# =====================================
# Negative Selection (Dropping Columns)
# =====================================
# Sometimes it's easier to specify what to remove rather than what to keep
# Use the minus sign (-) to drop columns
pbp_no_ids <- pbp %>%
select(
-old_game_id, # Drop individual columns
-ends_with("_player_id"), # Drop pattern-matched columns
-starts_with("td_") # Drop all touchdown probability columns
)
cat("After dropping columns:", ncol(pbp_no_ids), "columns\n")
# =====================================
# Selection with Renaming
# =====================================
# Select and rename simultaneously for clearer variable names
pbp_renamed <- pbp %>%
select(
game = game_id,
offense = posteam,
defense = defteam,
down,
distance = ydstogo,
field_position = yardline_100,
play_description = desc,
yards = yards_gained,
expected_points_added = epa
)
cat("Selection with renaming:", ncol(pbp_renamed), "columns\n")
# Display column names from each selection approach
cat("\n========================================\n")
cat("COLUMN NAMES FROM BASIC SELECTION:\n")
cat("========================================\n")
cat(paste(names(pbp_basic), collapse = ", "), "\n")
cat("\n========================================\n")
cat("SAMPLE OF PASS-FOCUSED COLUMNS:\n")
cat("========================================\n")
cat(paste(head(names(pbp_pass), 15), collapse = ", "), "\n")
cat("... and", ncol(pbp_pass) - 15, "more columns\n")
#| label: select-columns-py
# =====================================
# Basic Column Selection
# =====================================
# Select specific columns by name using list indexing
# This is the most explicit approach in pandas
# Pass a list of column names inside square brackets
pbp_basic = pbp[[
# Identifiers
'game_id', 'play_id', 'game_date',
# Teams
'posteam', 'defteam',
# Situation
'down', 'ydstogo', 'yardline_100',
# Outcome
'play_type', 'yards_gained',
# Advanced metrics
'epa', 'wpa', 'success'
]]
print(f"Basic selection: {len(pbp_basic.columns)} columns")
# =====================================
# Pattern-Based Selection
# =====================================
# Select columns matching patterns using list comprehension
# Python doesn't have built-in pattern matching like R's dplyr,
# but list comprehensions provide flexible pattern matching
# Columns starting with 'pass'
pass_cols = [col for col in pbp.columns if col.startswith('pass')]
# Columns ending with '_epa'
epa_cols = [col for col in pbp.columns if col.endswith('_epa')]
# Columns containing 'air_yards'
air_cols = [col for col in pbp.columns if 'air_yards' in col]
# Combine all selected columns with identifiers
pbp_pass = pbp[['game_id', 'play_id'] + pass_cols + epa_cols + air_cols]
print(f"Pass-focused selection: {len(pbp_pass.columns)} columns")
# =====================================
# Selection by Position (less common)
# =====================================
# Select columns by position using iloc
# Rarely used because position-based selection is brittle
# (column order might change, breaking your code)
pbp_range = pbp.iloc[:, 0:10] # First 10 columns
print(f"Position-based selection: {len(pbp_range.columns)} columns")
# =====================================
# Negative Selection (Dropping Columns)
# =====================================
# Drop columns using the drop() method
# axis=1 indicates we're dropping columns (not rows)
# Use this when it's easier to specify what to remove
# Drop specific columns
pbp_no_ids = pbp.drop(columns=['old_game_id'])
# Drop columns matching patterns
player_id_cols = [col for col in pbp.columns if col.endswith('_player_id')]
td_cols = [col for col in pbp.columns if col.startswith('td_')]
pbp_no_ids = pbp_no_ids.drop(columns=player_id_cols + td_cols)
print(f"After dropping columns: {len(pbp_no_ids.columns)} columns")
# =====================================
# Selection with Renaming
# =====================================
# Select and rename using rename() method
# First select columns, then rename them
pbp_renamed = pbp[[
'game_id', 'posteam', 'defteam', 'down', 'ydstogo',
'yardline_100', 'desc', 'yards_gained', 'epa'
]].rename(columns={
'game_id': 'game',
'posteam': 'offense',
'defteam': 'defense',
'ydstogo': 'distance',
'yardline_100': 'field_position',
'desc': 'play_description',
'yards_gained': 'yards',
'epa': 'expected_points_added'
})
print(f"Selection with renaming: {len(pbp_renamed.columns)} columns")
# Display column names from each selection approach
print("\n" + "=" * 70)
print("COLUMN NAMES FROM BASIC SELECTION:")
print("=" * 70)
print(", ".join(pbp_basic.columns))
print("\n" + "=" * 70)
print("SAMPLE OF PASS-FOCUSED COLUMNS:")
print("=" * 70)
print(", ".join(list(pbp_pass.columns)[:15]))
print(f"... and {len(pbp_pass.columns) - 15} more columns")
The explicit-selection idiom is nearly identical across languages: R's select(game_id, posteam, epa) corresponds to pandas' pbp[['game_id', 'posteam', 'epa']]. Pattern helpers save the most typing when a concept spans many columns: rather than spelling out select(pass_attempt, pass_touchdown, pass_length, pass_location, ...), write select(starts_with("pass")).
Column Selection Best Practices
1. **Select early**: Narrow your dataset to needed columns as soon as possible. This improves performance and reduces clutter throughout your analysis.
2. **Be explicit in production code**: For analyses that will be shared or revisited, explicitly list columns rather than using patterns. This makes dependencies clear and prevents breaking changes if the dataset structure changes.
3. **Use patterns for exploration**: When you're exploring data and don't know exactly what you need, pattern matching helps you quickly grab related variables.
4. **Document your selections**: Add comments explaining why you selected specific columns, especially if the choice isn't obvious.
5. **Verify your selection**: After selecting columns, print the column names or dimensions to verify you got what you expected. This catches typos and logic errors.
6. **Consider creating a standard selection**: If you repeatedly use the same set of columns, create a vector/list of column names that you can reuse:
# R
standard_cols <- c("game_id", "posteam", "down", "ydstogo", "epa")
pbp_subset <- pbp %>% select(all_of(standard_cols))
# Python
standard_cols = ['game_id', 'posteam', 'down', 'ydstogo', 'epa']
pbp_subset = pbp[standard_cols]
Filtering Rows: Subsetting to Relevant Observations
While selecting columns narrows the variables you're working with, filtering rows narrows the observations. Filtering is how you focus your analysis on specific situations, teams, players, or time periods.
Effective filtering requires understanding:
- What conditions matter for your analysis question
- How to express those conditions in code
- What happens to missing values when filtering
- How to verify your filter worked as intended
Let's explore filtering with increasingly complex examples:
#| label: filter-basic-r
# =====================================
# Single-Condition Filtering
# =====================================
# Filter to only pass plays
# The == operator tests for equality
# This creates a logical vector (TRUE/FALSE for each row)
# filter() keeps rows where the condition is TRUE
pass_plays <- pbp %>%
filter(play_type == "pass")
cat("Total plays:", format(nrow(pbp), big.mark = ","), "\n")
cat("Pass plays:", format(nrow(pass_plays), big.mark = ","),
sprintf("(%.1f%%)\n", 100 * nrow(pass_plays) / nrow(pbp)))
# =====================================
# Multiple-Condition Filtering (AND logic)
# =====================================
# Filter to passes on 3rd down
# Multiple conditions in filter() are combined with AND
# A row must satisfy ALL conditions to be kept
third_down_passes <- pbp %>%
filter(
play_type == "pass", # Must be a pass AND
down == 3 # Must be 3rd down
)
cat("\n3rd down passes:", format(nrow(third_down_passes), big.mark = ","), "\n")
# =====================================
# Complex Filtering with Multiple Operators
# =====================================
# Filter to critical situations: 3rd/4th down, medium/long distance,
# 4th quarter, close game
# This demonstrates combining multiple logical operators:
# - %in% tests if value is in a set
# - >= tests greater than or equal
# - <= tests less than or equal
critical_plays <- pbp %>%
filter(
down %in% c(3, 4), # 3rd or 4th down
ydstogo >= 5, # 5+ yards to go
qtr == 4, # 4th quarter
score_differential >= -8, # Within 8 points (trailing)
score_differential <= 8 # Within 8 points (leading)
)
cat("\nCritical situation plays:", format(nrow(critical_plays), big.mark = ","), "\n")
# =====================================
# Filtering with OR Logic
# =====================================
# Filter to plays that are either touchdowns OR explosive gains (20+ yards)
# Use the | operator for OR logic
# A row is kept if it satisfies ANY of the conditions
explosive_plays <- pbp %>%
filter(
touchdown == 1 | # Touchdown, OR
yards_gained >= 20, # 20+ yard gain
!is.na(epa) # Must have valid EPA
)
cat("\nExplosive plays:", format(nrow(explosive_plays), big.mark = ","), "\n")
# =====================================
# Negative Filtering (Exclusion)
# =====================================
# Filter to non-penalty, regular offensive plays with EPA
# Use ! for negation (NOT)
# !is.na(x) means "is not missing"
# ! (condition) means "not condition"
clean_plays <- pbp %>%
filter(
!is.na(epa), # EPA must not be missing
play_type %in% c("pass", "run"), # Must be pass or run
penalty == 0 | is.na(penalty) # No penalty (or penalty is NA)
)
cat("\nClean offensive plays:", format(nrow(clean_plays), big.mark = ","), "\n")
# =====================================
# Filtering by Field Position Zones
# =====================================
# Filter to red zone plays (inside opponent's 20-yard line)
# Field position in nflfastR is yards to opponent's goal
# yardline_100 = 20 means 20 yards from opponent's goal (own 80-yard line)
# yardline_100 = 10 means 10 yards from opponent's goal (opponent's 10)
red_zone <- pbp %>%
filter(
yardline_100 <= 20, # 20 or fewer yards to goal
yardline_100 > 0, # Not in end zone
play_type %in% c("pass", "run"),
!is.na(epa)
)
cat("\nRed zone plays:", format(nrow(red_zone), big.mark = ","), "\n")
# =====================================
# Filtering by Team
# =====================================
# Filter to Kansas City Chiefs offensive plays
# posteam is the team with possession (offensive team)
chiefs_offense <- pbp %>%
filter(
posteam == "KC",
play_type %in% c("pass", "run"),
!is.na(epa)
)
cat("\nKC offensive plays:", format(nrow(chiefs_offense), big.mark = ","), "\n")
# =====================================
# Demonstrate Filter Logic with Counts
# =====================================
cat("\n========================================\n")
cat("FILTERING IMPACT SUMMARY\n")
cat("========================================\n")
filter_summary <- tibble(
Filter = c(
"All plays",
"Pass plays only",
"3rd down passes",
"Critical situations",
"Explosive plays",
"Clean offensive plays",
"Red zone plays",
"KC offensive plays"
),
Rows = c(
nrow(pbp),
nrow(pass_plays),
nrow(third_down_passes),
nrow(critical_plays),
nrow(explosive_plays),
nrow(clean_plays),
nrow(red_zone),
nrow(chiefs_offense)
)
) %>%
mutate(
`Percent of Total` = Rows / nrow(pbp) * 100,
`Rows Kept` = format(Rows, big.mark = ","),
`% Kept` = sprintf("%.1f%%", `Percent of Total`)
) %>%
select(Filter, `Rows Kept`, `% Kept`)
print(filter_summary)
#| label: filter-basic-py
# =====================================
# Single-Condition Filtering
# =====================================
# Filter to only pass plays
# pandas provides multiple filtering syntaxes:
# 1. Boolean indexing: df[df['column'] == value]
# 2. query() method: df.query("column == 'value'")
# We'll demonstrate both
# Boolean indexing approach
pass_plays = pbp[pbp['play_type'] == 'pass']
# query() approach (often more readable)
# pass_plays = pbp.query("play_type == 'pass'")
print(f"Total plays: {len(pbp):,}")
print(f"Pass plays: {len(pass_plays):,} ({100*len(pass_plays)/len(pbp):.1f}%)")
# =====================================
# Multiple-Condition Filtering (AND logic)
# =====================================
# Filter to passes on 3rd down
# Multiple conditions with & (AND) operator
# Note: Must wrap each condition in parentheses
third_down_passes = pbp[
(pbp['play_type'] == 'pass') & # Must be a pass AND
(pbp['down'] == 3) # Must be 3rd down
]
# Alternative using query() method (often more readable)
# third_down_passes = pbp.query("play_type == 'pass' and down == 3")
print(f"\n3rd down passes: {len(third_down_passes):,}")
# =====================================
# Complex Filtering with Multiple Operators
# =====================================
# Filter to critical situations
# Demonstrates multiple comparison operators
critical_plays = pbp[
(pbp['down'].isin([3, 4])) & # 3rd or 4th down
(pbp['ydstogo'] >= 5) & # 5+ yards to go
(pbp['qtr'] == 4) & # 4th quarter
(pbp['score_differential'] >= -8) & # Trailing by 8 or fewer points
(pbp['score_differential'] <= 8) # Leading by 8 or fewer (one-score game)
]
# Alternative using query() method
# critical_plays = pbp.query(
# "down in [3, 4] and ydstogo >= 5 and qtr == 4 and "
# "score_differential >= -8 and score_differential <= 8"
# )
print(f"\nCritical situation plays: {len(critical_plays):,}")
# =====================================
# Filtering with OR Logic
# =====================================
# Filter to touchdowns OR explosive gains
# Use the | operator for OR logic
# Note: Must wrap each condition in parentheses
explosive_plays = pbp[
((pbp['touchdown'] == 1) | # Touchdown, OR
(pbp['yards_gained'] >= 20)) & # 20+ yard gain
(pbp['epa'].notna()) # Must have valid EPA
]
# Alternative using query() method
# explosive_plays = pbp.query(
# "(touchdown == 1 or yards_gained >= 20) and epa.notna()"
# )
print(f"\nExplosive plays: {len(explosive_plays):,}")
# =====================================
# Negative Filtering (Exclusion)
# =====================================
# Filter to clean offensive plays
# The ~ operator negates a mask (NOT); here .notna() handles exclusion directly
# .notna() checks for non-missing values (opposite of .isna())
clean_plays = pbp[
(pbp['epa'].notna()) & # EPA must not be missing
(pbp['play_type'].isin(['pass', 'run'])) # Must be pass or run
]
# Alternative using query() method
# clean_plays = pbp.query(
# "epa.notna() and play_type in ['pass', 'run']"
# )
print(f"\nClean offensive plays: {len(clean_plays):,}")
# =====================================
# Filtering by Field Position Zones
# =====================================
# Filter to red zone plays
red_zone = pbp[
(pbp['yardline_100'] <= 20) & # 20 or fewer yards to goal
(pbp['yardline_100'] > 0) & # Not in end zone
(pbp['play_type'].isin(['pass', 'run'])) &
(pbp['epa'].notna())
]
print(f"\nRed zone plays: {len(red_zone):,}")
# =====================================
# Filtering by Team
# =====================================
# Filter to Kansas City Chiefs offensive plays
chiefs_offense = pbp[
(pbp['posteam'] == 'KC') &
(pbp['play_type'].isin(['pass', 'run'])) &
(pbp['epa'].notna())
]
print(f"\nKC offensive plays: {len(chiefs_offense):,}")
# =====================================
# Demonstrate Filter Logic with Summary
# =====================================
print("\n" + "=" * 70)
print("FILTERING IMPACT SUMMARY")
print("=" * 70)
filter_summary = pd.DataFrame({
'Filter': [
'All plays',
'Pass plays only',
'3rd down passes',
'Critical situations',
'Explosive plays',
'Clean offensive plays',
'Red zone plays',
'KC offensive plays'
],
'Rows': [
len(pbp),
len(pass_plays),
len(third_down_passes),
len(critical_plays),
len(explosive_plays),
len(clean_plays),
len(red_zone),
len(chiefs_offense)
]
})
filter_summary['Percent of Total'] = (
filter_summary['Rows'] / len(pbp) * 100
)
filter_summary['Rows Kept'] = filter_summary['Rows'].apply(
lambda x: f"{x:,}"
)
filter_summary['% Kept'] = filter_summary['Percent of Total'].apply(
lambda x: f"{x:.1f}%"
)
print(filter_summary[['Filter', 'Rows Kept', '% Kept']].to_string(index=False))
pbp %>% filter(down == 3)
# R: Comma means AND
filter(down == 3, ydstogo >= 5)
# Python: & operator means AND
pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]
# R: | operator means OR
filter(down == 3 | down == 4)
# Python: | operator means OR
pbp[(pbp['down'] == 3) | (pbp['down'] == 4)]
# WRONG: Comma means AND, so this returns zero rows
filter(down == 3, down == 4) # A play can't be both 3rd AND 4th down!
# CORRECT:
filter(down == 3 | down == 4) # OR logic
# WRONG: & binds before ==, so this raises an error (ambiguous truth value)
pbp[pbp['down'] == 3 & pbp['ydstogo'] >= 5] # Precedence issue!
# CORRECT: Parentheses ensure proper order
pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]
# This DOESN'T work as you might expect:
filter(epa == NA) # Returns zero rows!
# This is correct:
filter(is.na(epa)) # Returns rows where epa is missing
# To keep non-missing:
filter(!is.na(epa)) # Returns rows where epa is NOT missing
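The pandas counterpart behaves analogously: comparisons with NaN are always False, so `== np.nan` matches nothing, and `.isna()` / `.notna()` are the reliable tests. A minimal sketch on a toy frame (this `pbp` is a hypothetical stand-in, not the nflfastR data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({'epa': [0.4, np.nan, -0.2]})

# This DOESN'T work: NaN never compares equal, even to itself
broken = pbp[pbp['epa'] == np.nan]    # zero rows

# These are correct:
missing = pbp[pbp['epa'].isna()]      # rows where epa IS missing
present = pbp[pbp['epa'].notna()]     # rows where epa is NOT missing
print(len(broken), len(missing), len(present))  # 0 1 2
```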
# R: Efficient way to test multiple values
filter(down %in% c(3, 4))
# Python: .isin() method
pbp[pbp['down'].isin([3, 4])]
# This is equivalent to but cleaner than:
# R: filter(down == 3 | down == 4)
# Python: pbp[(pbp['down'] == 3) | (pbp['down'] == 4)]
# Option 1: Two conditions with AND
filter(yardline_100 >= 20, yardline_100 <= 50)
# Option 2: between() helper function (R only)
filter(between(yardline_100, 20, 50))
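pandas offers a similar helper: `Series.between()`, which is inclusive on both endpoints by default, mirroring dplyr's `between()`. A quick sketch with toy data (hypothetical `pbp`):

```python
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({'yardline_100': [10, 25, 45, 60, 35]})

# Both endpoints inclusive by default (inclusive='both')
midfield = pbp[pbp['yardline_100'].between(20, 50)]
print(midfield['yardline_100'].tolist())  # [25, 45, 35]
```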
# Verify filter
filtered_data <- pbp %>% filter(down == 3)
# Check that all rows have down == 3
all(filtered_data$down == 3) # Should be TRUE
# Check range of other variables makes sense
summary(filtered_data$ydstogo)
# Boolean indexing (traditional approach)
result = pbp[
(pbp['down'] == 3) &
(pbp['ydstogo'] >= 5) &
(pbp['play_type'] == 'pass')
]
# query() method (often more readable)
result = pbp.query("down == 3 and ydstogo >= 5 and play_type == 'pass'")
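The same verification habit carries over to pandas; a sketch on a toy frame (hypothetical data):

```python
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({'down': [1, 3, 3, 4], 'ydstogo': [10, 7, 2, 5]})

filtered = pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]

# Check that every surviving row meets the criteria
assert (filtered['down'] == 3).all()
assert (filtered['ydstogo'] >= 5).all()

# Check that the row count makes sense
print(f"{len(filtered)} of {len(pbp)} rows kept")  # 1 of 4 rows kept
```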
### Common Filtering Pitfalls
1. **Forgetting Missing Values**: In R, comparisons involving NA return NA (not TRUE or FALSE), and filter() silently drops rows where the condition evaluates to NA. Always handle missing values explicitly:
# WRONG: This silently drops rows with missing epa
filter(epa > 0) # Drops NA values silently
# CORRECT: Explicit about NAs
filter(epa > 0 | is.na(epa)) # Keeps positive EPA and NAs
# OR
filter(!is.na(epa), epa > 0) # Explicitly requires non-missing
2. **Operator Precedence (Python)**: & and | bind more tightly than comparisons, so each condition needs its own parentheses:
# WRONG: Precedence issues
pbp[pbp['down'] == 3 & pbp['ydstogo'] >= 5]
# CORRECT: Explicit parentheses
pbp[(pbp['down'] == 3) & (pbp['ydstogo'] >= 5)]
3. **Floating-Point Equality**: Exact equality on computed decimal values can silently fail:
# WRONG: May not work due to floating point precision
filter(epa == 0.5)
# BETTER: Use tolerance
filter(abs(epa - 0.5) < 0.0001)
# OR: Use near() function (R)
filter(near(epa, 0.5))
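Python shares this pitfall; numpy's `isclose()` plays the role of R's `near()`. A sketch with toy values:

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is not exactly 0.3 in floating point
epa = pd.Series([0.1 + 0.2, 0.3])

exact = epa[epa == 0.3]             # misses the computed value
close = epa[np.isclose(epa, 0.3)]   # matches both within tolerance
print(len(exact), len(close))       # 1 2
```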
4. **Skipping Verification**: Always sanity-check a filter's output:
# After filtering, verify results
filtered_data <- pbp %>% filter(down == 3, ydstogo >= 5)
# Check: Do all rows meet criteria?
# Check: Does row count make sense?
# Check: Do value ranges make sense?
5. **Confusing Offense and Defense**: posteam and defteam select opposite sides of the ball:
# Kansas City's offensive plays
filter(posteam == "KC")
# Kansas City's defensive plays (plays against KC)
filter(defteam == "KC")
# These are different datasets!
```</div>
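To get every play from a team's games regardless of which side has the ball, combine the two columns with OR. A sketch on a hypothetical frame (toy data, not nflfastR):

```python
import pandas as pd

# Toy stand-in for the real play-by-play frame
pbp = pd.DataFrame({
    'posteam': ['KC', 'BUF', 'DAL', 'KC'],
    'defteam': ['BUF', 'KC', 'PHI', 'LV'],
})

# All plays in KC games: KC on offense OR on defense
kc_games = pbp[(pbp['posteam'] == 'KC') | (pbp['defteam'] == 'KC')]
print(len(kc_games))  # 3
```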
### Practical Example: Red Zone Efficiency Analysis
Let's combine selection and filtering to answer a real analytical question: **How do pass and run efficiency differ in the red zone?**
The red zone—inside the opponent's 20-yard line—is strategically important because the compressed field changes play dynamics. Defenses don't have to defend as much space, offensive playbooks are more limited, and scoring opportunities are valuable. Understanding red zone efficiency helps teams optimize play-calling in these high-value situations.
This example demonstrates how to:
1. Filter to a specific field position zone
2. Calculate efficiency metrics by group
3. Present results professionally
4. Interpret findings in football context
<div class="panel-tabset">
<ul class="nav nav-tabs" role="tablist">
<li class="nav-item"><a class="nav-link active" data-bs-toggle="tab" href="#tab-r-6262">R</a></li>
<li class="nav-item"><a class="nav-link" data-bs-toggle="tab" href="#tab-python-6262">Python</a></li>
<li class="nav-item"><a class="nav-link" data-bs-toggle="tab" href="#tab-interpretation-6262">Interpretation</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane active" id="tab-r-6262">
```{r}
#| label: red-zone-analysis-r
# =====================================
# Define and Filter Red Zone Plays
# =====================================
# Red zone is defined as inside opponent's 20-yard line
# In nflfastR, yardline_100 represents yards to opponent's goal:
# - yardline_100 = 20 means you're AT your opponent's 20-yard line
# - yardline_100 = 10 means you're at your opponent's 10-yard line
# - yardline_100 = 1 means you're at the 1-yard line
# We exclude yardline_100 = 0 (plays in end zone, usually PATs)
red_zone <- pbp %>%
filter(
# Only pass and run plays (exclude special teams)
play_type %in% c("pass", "run"),
# Inside the 20-yard line but not in end zone
yardline_100 <= 20,
yardline_100 > 0,
# Only plays with valid EPA
!is.na(epa)
)
cat("Red Zone Analysis\n")
cat(strrep("=", 50), "\n")
cat("Total plays in red zone:", format(nrow(red_zone), big.mark = ","), "\n")
cat("Percentage of all offensive plays:",
sprintf("%.1f%%\n", 100 * nrow(red_zone) / sum(pbp$play_type %in% c("pass", "run") & !is.na(pbp$epa))))
# =====================================
# Calculate Efficiency Metrics by Play Type
# =====================================
# For each play type (pass/run), calculate:
# - Total plays (volume)
# - Touchdowns (scoring outcomes)
# - Touchdown rate (efficiency)
# - Average EPA (comprehensive efficiency measure)
# - Success rate (consistency)
red_zone_summary <- red_zone %>%
group_by(play_type) %>%
summarise(
# Count total plays
plays = n(),
# Count touchdowns
# touchdown variable is 1 for TD, 0 otherwise
# sum() counts the 1s
touchdowns = sum(touchdown == 1, na.rm = TRUE),
# Calculate TD rate
# This is the primary outcome we care about in red zone
td_rate = touchdowns / plays,
# Calculate average EPA
# EPA captures full value including field position changes
avg_epa = mean(epa, na.rm = TRUE),
# Calculate success rate
# Success = positive EPA
success_rate = mean(epa > 0, na.rm = TRUE),
# Calculate median EPA for robustness to outliers
median_epa = median(epa, na.rm = TRUE),
.groups = "drop"
) %>%
# Calculate percentage of total red zone plays
mutate(
pct_of_rz_plays = plays / sum(plays) * 100
)
# =====================================
# Display Results in Professional Table
# =====================================
red_zone_summary %>%
# Create publication-quality table
gt() %>%
# Add title and context
tab_header(
title = "Red Zone Efficiency by Play Type",
subtitle = "2023 NFL Regular Season | Plays inside opponent's 20-yard line"
) %>%
# Readable column labels
cols_label(
play_type = "Play Type",
plays = "Plays",
pct_of_rz_plays = "% of RZ Plays",
touchdowns = "Touchdowns",
td_rate = "TD Rate",
avg_epa = "Avg EPA",
median_epa = "Median EPA",
success_rate = "Success Rate"
) %>%
# Format numbers appropriately
fmt_number(
columns = c(plays, touchdowns),
decimals = 0,
use_seps = TRUE
) %>%
fmt_percent(
columns = c(td_rate, success_rate, pct_of_rz_plays),
decimals = 1
) %>%
fmt_number(
columns = c(avg_epa, median_epa),
decimals = 3
) %>%
# Add source note
tab_source_note(
source_note = "Data: nflfastR | Red zone defined as inside opponent's 20-yard line"
) %>%
# Highlight best values
# Highest values are best for avg_epa, td_rate, success_rate
data_color(
columns = c(avg_epa, td_rate, success_rate),
palette = "Greens"
)
# =====================================
# Narrative Summary of Findings
# =====================================
cat("\n\n", strrep("=", 50), "\n", sep = "")
cat("KEY FINDINGS\n")
cat(strrep("=", 50), "\n\n")
pass_data <- red_zone_summary %>% filter(play_type == "pass")
run_data <- red_zone_summary %>% filter(play_type == "run")
cat(sprintf("VOLUME:\n"))
cat(sprintf("- Passes: %s plays (%.1f%% of red zone plays)\n",
format(pass_data$plays, big.mark = ","),
pass_data$pct_of_rz_plays))
cat(sprintf("- Runs: %s plays (%.1f%% of red zone plays)\n",
format(run_data$plays, big.mark = ","),
run_data$pct_of_rz_plays))
cat(sprintf("\nTOUCHDOWN EFFICIENCY:\n"))
cat(sprintf("- Pass TD rate: %.1f%%\n", pass_data$td_rate * 100))
cat(sprintf("- Run TD rate: %.1f%%\n", run_data$td_rate * 100))
cat(sprintf("- Advantage: %s by %.1f percentage points\n",
ifelse(pass_data$td_rate > run_data$td_rate, "Pass", "Run"),
abs(pass_data$td_rate - run_data$td_rate) * 100))
cat(sprintf("\nEPA EFFICIENCY:\n"))
cat(sprintf("- Pass avg EPA: %.3f\n", pass_data$avg_epa))
cat(sprintf("- Run avg EPA: %.3f\n", run_data$avg_epa))
cat(sprintf("- Advantage: %s by %.3f EPA\n",
ifelse(pass_data$avg_epa > run_data$avg_epa, "Pass", "Run"),
abs(pass_data$avg_epa - run_data$avg_epa)))
cat(sprintf("\nCONSISTENCY (Success Rate):\n"))
cat(sprintf("- Pass success rate: %.1f%%\n", pass_data$success_rate * 100))
cat(sprintf("- Run success rate: %.1f%%\n", run_data$success_rate * 100))
#| label: red-zone-analysis-py
# =====================================
# Define and Filter Red Zone Plays
# =====================================
# Red zone is inside opponent's 20-yard line
red_zone = pbp[
(pbp['play_type'].isin(['pass', 'run'])) &
(pbp['yardline_100'] <= 20) &
(pbp['yardline_100'] > 0) &
(pbp['epa'].notna())
]
print("Red Zone Analysis")
print("=" * 70)
print(f"Total plays in red zone: {len(red_zone):,}")
# Calculate percentage of all offensive plays
all_offensive = pbp[
pbp['play_type'].isin(['pass', 'run']) & pbp['epa'].notna()
]
print(f"Percentage of all offensive plays: {100*len(red_zone)/len(all_offensive):.1f}%")
# =====================================
# Calculate Efficiency Metrics by Play Type
# =====================================
# Group by play type and calculate multiple metrics
red_zone_summary = red_zone.groupby('play_type').agg(
# Count plays
plays=('play_id', 'count'),
# Count touchdowns (touchdown column is 1 for TD, 0 otherwise)
touchdowns=('touchdown', lambda x: (x == 1).sum()),
# Calculate average EPA
avg_epa=('epa', 'mean'),
# Calculate median EPA
median_epa=('epa', 'median'),
# Calculate success rate (EPA > 0)
success_rate=('epa', lambda x: (x > 0).mean())
).reset_index()
# Calculate TD rate
red_zone_summary['td_rate'] = (
red_zone_summary['touchdowns'] / red_zone_summary['plays']
)
# Calculate percentage of red zone plays
red_zone_summary['pct_of_rz_plays'] = (
red_zone_summary['plays'] / red_zone_summary['plays'].sum() * 100
)
# =====================================
# Display Results
# =====================================
print("\n" + "=" * 70)
print("RED ZONE EFFICIENCY BY PLAY TYPE")
print("2023 NFL Regular Season | Plays inside opponent's 20-yard line")
print("=" * 70)
print()
# Format for display
display_summary = red_zone_summary.copy()
display_summary['Plays'] = display_summary['plays'].apply(lambda x: f"{x:,}")
display_summary['% of RZ'] = display_summary['pct_of_rz_plays'].apply(lambda x: f"{x:.1f}%")
display_summary['TDs'] = display_summary['touchdowns'].apply(lambda x: f"{x:,}")
display_summary['TD Rate'] = display_summary['td_rate'].apply(lambda x: f"{x:.1%}")
display_summary['Avg EPA'] = display_summary['avg_epa'].apply(lambda x: f"{x:.3f}")
display_summary['Median EPA'] = display_summary['median_epa'].apply(lambda x: f"{x:.3f}")
display_summary['Success Rate'] = display_summary['success_rate'].apply(lambda x: f"{x:.1%}")
display_cols = ['play_type', 'Plays', '% of RZ', 'TDs', 'TD Rate',
'Avg EPA', 'Median EPA', 'Success Rate']
print(display_summary[display_cols].to_string(index=False))
# =====================================
# Narrative Summary of Findings
# =====================================
print("\n" + "=" * 70)
print("KEY FINDINGS")
print("=" * 70)
pass_data = red_zone_summary[red_zone_summary['play_type'] == 'pass'].iloc[0]
run_data = red_zone_summary[red_zone_summary['play_type'] == 'run'].iloc[0]
print("\nVOLUME:")
print(f"- Passes: {pass_data['plays']:,} plays ({pass_data['pct_of_rz_plays']:.1f}% of red zone plays)")
print(f"- Runs: {run_data['plays']:,} plays ({run_data['pct_of_rz_plays']:.1f}% of red zone plays)")
print("\nTOUCHDOWN EFFICIENCY:")
print(f"- Pass TD rate: {pass_data['td_rate']:.1%}")
print(f"- Run TD rate: {run_data['td_rate']:.1%}")
advantage = "Pass" if pass_data['td_rate'] > run_data['td_rate'] else "Run"
diff = abs(pass_data['td_rate'] - run_data['td_rate']) * 100
print(f"- Advantage: {advantage} by {diff:.1f} percentage points")
print("\nEPA EFFICIENCY:")
print(f"- Pass avg EPA: {pass_data['avg_epa']:.3f}")
print(f"- Run avg EPA: {run_data['avg_epa']:.3f}")
advantage = "Pass" if pass_data['avg_epa'] > run_data['avg_epa'] else "Run"
diff = abs(pass_data['avg_epa'] - run_data['avg_epa'])
print(f"- Advantage: {advantage} by {diff:.3f} EPA")
print("\nCONSISTENCY (Success Rate):")
print(f"- Pass success rate: {pass_data['success_rate']:.1%}")
print(f"- Run success rate: {run_data['success_rate']:.1%}")