Learning Objectives

By the end of this chapter, you will be able to:


  1. Understand the historical evolution of football analytics and its impact on modern NFL strategy
  2. Identify and explain key metrics and concepts in modern football analysis, including EPA and win probability
  3. Set up your development environment for football analytics with both R and Python
  4. Load and explore basic NFL play-by-play data using modern data tools
  5. Recognize the impact of analytics on team decision-making and identify real-world applications
  6. Perform basic statistical analyses on NFL data and interpret results in football context

Introduction

American football has undergone a remarkable transformation over the past two decades. What was once a sport driven primarily by intuition, traditional wisdom, and coaches' gut feelings has evolved into a sophisticated data-driven enterprise where every play, every decision, and every player movement can be quantified, analyzed, and optimized. This revolution mirrors changes seen across professional sports, but football's complexity—with its distinct plays, discrete game states, and multifaceted strategies—makes it particularly well-suited to analytical approaches.

The change has been profound. In 2003, most NFL teams had no dedicated analytics staff. Decisions about play-calling, fourth-down attempts, and two-point conversions were made largely on instinct and conventional wisdom passed down through coaching generations. By 2023, every NFL team employed analytics professionals, and data-driven insights influenced decisions at every level—from draft selections to in-game play-calling to salary cap management. What was once a competitive advantage for early adopters has become a necessity for competitive survival.

This textbook will guide you through the complete landscape of football analytics, from foundational concepts to cutting-edge techniques. Whether you're a student learning data science and looking for engaging real-world applications, an analyst working in sports or considering a career transition, or simply a football fan curious about the numbers behind the game, this book provides the tools, knowledge, and practical skills you need. We assume basic programming knowledge in either R or Python, but we'll explain statistical concepts and football context as we go.

The journey we're embarking on is both technical and strategic. You'll learn not just how to analyze football data, but how to think like an analyst—asking the right questions, choosing appropriate methods, interpreting results critically, and communicating insights effectively. Along the way, you'll work with real NFL data, replicate analyses that have influenced actual team decisions, and develop the skills that professional football analysts use daily.

What is Football Analytics?

Football analytics is the systematic application of statistical methods, machine learning algorithms, and data science techniques to understand, evaluate, and optimize football performance. It encompasses a wide range of applications:

- **Play-level analysis**: Evaluating the efficiency and effectiveness of individual plays
- **Player evaluation**: Assessing player contributions beyond traditional statistics
- **Strategic decision-making**: Optimizing fourth-down decisions, two-point attempts, and clock management
- **Draft strategy**: Identifying undervalued prospects and projecting college-to-pro transitions
- **Salary cap management**: Optimizing resource allocation across roster positions
- **Game planning**: Identifying opponent tendencies and exploitable weaknesses

The field draws on multiple disciplines including statistics, computer science, economics, and—crucially—deep football knowledge. The best analytics work combines technical sophistication with understanding of game context.

The Analytics Revolution in Football

Understanding where football analytics came from helps us appreciate where it's going. The evolution of football analytics isn't just a story of technological advancement—it's a story of changing mindsets, institutional resistance, gradual acceptance, and ultimate transformation of how the sport is played and understood.

Early Pioneers (1970s-2000s)

The intellectual foundations of football analytics were laid decades before computers made large-scale analysis practical. These early pioneers worked with limited data and no spreadsheets, and they faced skepticism from the football establishment, yet their insights proved remarkably prescient.

The roots of football analytics can be traced to pioneering work beginning in the 1970s and 1980s:

Bill James and the Sabermetric Revolution - Though focused on baseball, Bill James's sabermetric approach—using statistical analysis to challenge conventional baseball wisdom—inspired analysts across sports. His methodology demonstrated that rigorous analysis could reveal hidden truths about sports that intuition missed. Football analysts borrowed his empirical approach: start with a question, collect data systematically, analyze objectively, and let the data tell you what's true rather than confirming what you believe.

Pete Palmer's Rating Systems - Palmer, a statistician and baseball sabermetrician, created some of the first systematic football rating systems in the 1970s. His work on quantifying player and team value laid groundwork for modern advanced metrics. Palmer recognized that raw statistics like passing yards or rushing touchdowns missed crucial context—a 25-yard pass on 3rd-and-30 still leaves the offense short of a first down, while a 15-yard pass on 3rd-and-10 moves the chains, even though the former gains more yards.

The Hidden Game of Football - In 1988, Bob Carroll, Pete Palmer, and John Thorn published The Hidden Game of Football, one of the first books to apply systematic analytical approaches to football. The book introduced concepts like yards per play adjusted for situation, expected points, and systematic evaluation of coaching decisions. While the authors lacked modern computational tools, their conceptual framework anticipated many ideas that would later become central to football analytics.

Early Academic Work - In the 1970s and 1980s, academic researchers began applying operations research and statistical methods to football questions. Work on optimal fourth-down strategies, for instance, appeared in academic journals decades before NFL teams began implementing such insights. However, this work remained largely isolated in academia, with minimal penetration into football practice.

Why Did Early Analytics Struggle to Gain Acceptance?

Despite sound methodology, early football analytics faced several barriers:

  1. **Data Scarcity**: Play-by-play data existed but was not easily accessible or standardized
  2. **Computational Limitations**: Complex analyses required significant computing power not widely available
  3. **Cultural Resistance**: Football coaching culture valued playing experience over analytical credentials
  4. **Risk Aversion**: Analytics-driven decisions that failed faced harsher criticism than conventional decisions that failed
  5. **Communication Gaps**: Analysts struggled to translate findings into actionable coaching insights

These barriers would persist until the 2000s, when multiple factors converged to enable the analytics revolution.

The Moneyball Era (2000s)

The publication of Michael Lewis's Moneyball in 2003 marked a turning point for sports analytics broadly and football analytics specifically. Though the book focused on baseball's Oakland Athletics, its core message resonated across sports: systematic analysis could identify market inefficiencies and provide competitive advantages even with limited resources.

The success of Billy Beane's Oakland Athletics demonstrated that data-driven decision-making could overcome significant resource disparities. While the Athletics had one of the lowest payrolls in baseball, they competed with wealthier franchises by identifying undervalued players and exploiting market inefficiencies. NFL teams, facing a salary cap that created a relatively level financial playing field, recognized that similar analytical advantages could prove even more valuable in football.

Several key developments emerged during this era:

Football Outsiders and DVOA - In 2003, Aaron Schatz founded Football Outsiders and introduced DVOA (Defense-adjusted Value Over Average), one of the first widely adopted advanced football metrics. DVOA measures team and player efficiency by comparing performance on each play to a baseline average, adjusting for situation (down, distance, field position) and opponent quality. This metric provided a more sophisticated alternative to raw yardage statistics and offered insights that traditional statistics missed.

DVOA demonstrated that situational context matters enormously. A team that gains 4 yards per play might be excellent or terrible depending on when and how those yards are gained. DVOA captured these nuances, and its predictive power—DVOA through mid-season predicts end-of-season success better than win-loss record—gradually convinced skeptics of analytics' value.

Win Probability Models - Researchers including Brian Burke (Advanced NFL Stats) began developing and publicizing win probability models. These models estimate the likelihood that a team will win based on current game situation (score differential, time remaining, field position, possession, down and distance). Win probability provides crucial context for evaluating decisions—a play that gains 20 yards is dramatically different if it changes win probability from 10% to 30% versus from 70% to 75%.

The Internet and Data Democratization - The mid-2000s saw increasing availability of play-by-play data through websites like Pro Football Reference. Combined with the internet's ability to share analyses widely, this created a vibrant community of amateur analysts. Blogs and websites like Advanced NFL Stats, Football Outsiders, and The Fifth Down (New York Times) brought analytics insights to wider audiences and demonstrated their value.

Initial Team Adoption - Some NFL teams began hiring analysts in the mid-2000s, though often in limited roles. The Philadelphia Eagles, New England Patriots, and a few other teams quietly built analytics departments. These early-adopting teams gained competitive advantages, though the full impact wouldn't be visible for years.

The Moneyball Effect: Perception vs. Reality

"Moneyball" is often misunderstood. It's commonly perceived as "ignore scouts, only use stats," but the true message is more nuanced: combine statistical insights with domain expertise, question conventional wisdom, and systematically identify market inefficiencies. The best football analytics departments don't replace scouts and coaches—they complement them, providing additional information and perspective.

The Modern Era (2010s-Present)

The 2010s and 2020s have witnessed explosive growth in football analytics, driven by technological advances, data availability, computational power, and—crucially—increased acceptance within the football establishment. What was once a competitive advantage for early adopters has become table stakes for all teams.

2009: Advanced NFL Stats Launch - Brian Burke launched Advanced NFL Stats (later acquired by ESPN), providing sophisticated statistical analysis and introducing concepts like Expected Points Added and Win Probability Added to wider audiences. Burke's clear explanations and practical applications helped bridge the gap between technical analysis and football practitioners.

2015: NFL Next Gen Stats Launch - The NFL began tracking player movement data using RFID chips embedded in player equipment. This "Next Gen Stats" data captured player locations 10 times per second, enabling entirely new forms of analysis. Questions about separation, route running, pass rush win rate, and defensive coverage could be addressed with unprecedented precision.

2017: The nflscrapR Revolution - The launch of nflscrapR (the predecessor to nflfastR) by Maksim Horowitz, Ron Yurko, and Sam Ventura marked a watershed moment. For the first time, researchers and analysts had easy access to clean, comprehensive play-by-play data with advanced metrics already calculated. This democratized football analytics—suddenly, anyone with basic programming skills could perform analyses that previously required data engineering expertise.

The nflscrapR package (and its Python equivalent, nfl_data_py) meant that loading data went from a multi-day data engineering project to a single line of code. This dramatically lowered barriers to entry and enabled the explosion of public analytics work we see today.

2018-2020: Big Data Bowl - The NFL launched an annual analytics competition, the Big Data Bowl, providing participants with Next Gen Stats tracking data. The competition accelerated methodological innovation, showcased talented analysts, and signaled the league's official embrace of analytics.

2020+: Universal Adoption - By the early 2020s, every NFL team employed dedicated analytics staff, though the size and influence of these departments vary significantly. Some teams give analytics directors direct access to head coaches and general managers; others still treat analytics as supplementary. The competitive advantage now comes not from having analytics, but from effectively integrating analytics into decision-making processes.

The Analytics-Driven Game - The influence of analytics is visible in how the game is played:
- Fourth-down attempts have increased dramatically (teams went for it on 4th down about 30% more often in 2023 than in 2015)
- Two-point conversion decisions are increasingly dynamic and win-probability-driven
- Play-calling has shifted toward passing, especially on early downs
- Defensive strategies have adapted, with more zone coverage and lighter boxes

The nflverse Ecosystem

Today's football analytics is built on the **nflverse**, a suite of R packages and Python libraries maintained by a community of contributors:

- **nflfastR** (R) / **nfl_data_py** (Python): Core play-by-play data and advanced metrics
- **nflreadr** (R): Fast data loading with caching
- **nflplotR** (R): Team-themed visualizations with logos and colors
- **nfl4th** (R): Fourth-down decision models
- **nflseedR** (R): Playoff picture simulation

These tools, all open-source and freely available, provide the data infrastructure for this textbook. The nflverse represents one of the most successful examples of open-source collaboration in sports analytics.

Key Concepts in Football Analytics

Before we dive into coding and data analysis, let's establish the conceptual foundation. Several key concepts form the vocabulary of modern football analytics. Understanding these concepts is crucial because they'll appear throughout this textbook and in virtually all football analytics discussions.

Expected Points (EP)

One of the most fundamental concepts in modern football analytics is Expected Points (EP). This framework assigns a point value to every possible game situation (down, distance, field position) based on historical data about what typically happens next.

The Core Idea: Imagine it's 1st-and-10 at your own 20-yard line. Historically, teams in this situation net an average of about 0.4 points on the next score, after accounting for the times the opponent scores first. Now imagine 1st-and-10 at the opponent's 20-yard line. Teams in this situation average about 4.0 points. The expected points framework captures this relationship between field position and scoring expectation.

Why Expected Points Matters: Traditional statistics like yards gained miss crucial context. A 15-yard gain from your own 20 to your own 35 is less valuable than a 15-yard gain from the opponent's 30 to the opponent's 15—the latter is close to a scoring opportunity, while the former merely improves field position. Expected Points captures this difference by measuring how much closer to scoring a play moves the offense.

How It's Calculated: Expected Points models are built by:
1. Collecting all plays from historical data (typically 10-20 years)
2. For each unique situation (down, distance, yard line), recording what happened next
3. Calculating the average value of the next score in each situation (positive when the offense scores next, negative when the opponent does)
4. Smoothing these averages to create continuous estimates

For example, if we look at all 1st-and-10 plays at the opponent's 25-yard line over the past 10 seasons, we might find that the offensive team went on to score touchdowns 40% of the time (7 points), field goals 30% of the time (3 points), and their opponents scored next 30% of the time (-2 points on average, accounting for what the opponent scored). The expected points would be: (0.40 × 7) + (0.30 × 3) + (0.30 × -2) = 3.1 points.
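The worked example above is just a probability-weighted average. Here is a minimal Python sketch using the illustrative outcome probabilities from the text (hand-set numbers, not real model output):

```python
# Expected points as a probability-weighted average of next-score outcomes.
# Probabilities and point values are the illustrative numbers from the text.
outcomes = {
    "touchdown": (0.40, 7),             # (probability, points for the offense)
    "field_goal": (0.30, 3),
    "opponent_scores_next": (0.30, -2), # average value when the opponent scores next
}

expected_points = sum(p * pts for p, pts in outcomes.values())
print(round(expected_points, 1))  # 3.1
```

A real EP model estimates these probabilities from millions of historical plays rather than setting them by hand, but the final step is the same weighted average.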

Expected Points Across the Field

Expected points vary substantially based on field position:

- **Own 5-yard line (1st-and-10)**: ≈ -0.5 EP (more likely to punt or turn it over than score next)
- **Own 20-yard line (1st-and-10)**: ≈ 0.4 EP
- **Midfield (1st-and-10)**: ≈ 2.0 EP
- **Opponent's 35 (1st-and-10)**: ≈ 3.5 EP (field goal range)
- **Opponent's 10 (1st-and-10)**: ≈ 5.5 EP (very likely to score)

The exact values depend on the specific model and era, but the pattern is consistent: the closer to the opponent's goal, the higher the expected points.

Expected Points Added (EPA)

Building on Expected Points, EPA (Expected Points Added) measures the value of each individual play by comparing the EP before and after the play.

The Formula:
$$ \text{EPA} = \text{EP}_{\text{end}} - \text{EP}_{\text{start}} $$

An Example: Suppose it's 2nd-and-7 at your own 23-yard line (EP ≈ 0.5). You complete a pass for 9 yards, giving you 1st-and-10 at your own 32 (EP ≈ 1.2). The EPA for this play is 1.2 - 0.5 = 0.7. Despite not scoring any actual points, this play added 0.7 expected points to your scoring expectation.
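In code, EPA is simply a difference of two model lookups. A minimal Python sketch of the example play, using the approximate EP values from the text (a real model would supply EP for each down-distance-yard-line state):

```python
# EPA for the example play above, with the approximate EP values from the text.
ep_start = 0.5   # 2nd-and-7 at your own 23
ep_end = 1.2     # 1st-and-10 at your own 32 after the 9-yard completion

epa = ep_end - ep_start
print(round(epa, 2))  # 0.7
```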

Why EPA is Revolutionary: EPA solves multiple problems with traditional statistics:

  1. Context-Aware: A 5-yard gain on 3rd-and-4 (conversion, positive EPA) is more valuable than a 5-yard gain on 3rd-and-10 (punt, negative EPA)
  2. Accounts for Turnovers: An interception doesn't just lose yards, it gives the opponent good field position—EPA captures this full cost
  3. Comparable Across Situations: EPA lets us compare plays from different situations on a common scale
  4. Predictive: EPA predicts future success better than traditional metrics like yards per play

Interpreting EPA Values:
- +2.0 EPA or higher: Excellent play (touchdown, explosive gain, turnover forced in good field position)
- +0.5 to +2.0 EPA: Good play (first down conversion, solid gain)
- -0.5 to +0.5 EPA: Neutral play (small gain or loss that doesn't change situation much)
- -2.0 to -0.5 EPA: Bad play (incompletion on 3rd down, sack, short of first down)
- -2.0 EPA or worse: Disastrous play (turnover, safety, fumble returned for a touchdown)
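These rules of thumb translate into a small helper function. A sketch in Python, noting that the exact boundary handling is our choice since the text gives ranges rather than precise cutoffs:

```python
def epa_category(epa):
    """Rough label for a play's EPA, using the rule-of-thumb ranges above.
    Boundary behavior (e.g., exactly -0.5) is a judgment call."""
    if epa >= 2.0:
        return "excellent"
    if epa >= 0.5:
        return "good"
    if epa > -0.5:
        return "neutral"
    if epa > -2.0:
        return "bad"
    return "disastrous"

print(epa_category(0.7))   # good
print(epa_category(-3.2))  # disastrous
```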

Why EPA Matters: A Practical Example

Consider two running backs:

- **RB A**: 150 carries, 750 yards (5.0 yards per carry), 6 TDs
- **RB B**: 150 carries, 675 yards (4.5 yards per carry), 4 TDs

Traditional statistics suggest RB A is better—more yards per carry and more touchdowns. But what if:

- RB A averaged 0.02 EPA per play (slightly above average)
- RB B averaged 0.15 EPA per play (excellent)

The EPA tells us that RB B was actually more valuable. Perhaps RB B converted more first downs in crucial situations, while RB A gained yards in less critical situations or on plays that still resulted in punts. EPA captures this difference that yards per carry misses. This isn't hypothetical—EPA-based evaluations frequently contradict traditional statistics, and EPA consistently proves more predictive of future performance and team success.
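Scaling the per-play figures to a full season makes the gap concrete. A quick Python check using the illustrative numbers above (not real player data):

```python
# Season-level value implied by the per-play EPA figures above:
# total EPA = carries × EPA per play.
carries = 150
rb_a_total_epa = carries * 0.02   # roughly 3 expected points over the season
rb_b_total_epa = carries * 0.15   # roughly 22.5 expected points

print(rb_a_total_epa, rb_b_total_epa)
# RB B's small per-play edge compounds into a gap of nearly 20 expected points,
# despite RB A's better yards per carry.
```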

Win Probability (WP)

While EPA measures how plays affect scoring expectations, Win Probability (WP) measures how plays affect the likelihood of winning the game. WP models estimate the probability that a team will win based on the current game situation.

Factors That Influence Win Probability:
- Score differential: The current point difference between teams
- Time remaining: How much game time is left (in the half and in the game)
- Field position: Where the ball is located
- Down and distance: The current down and yards to go
- Timeouts remaining: How many timeouts each team has
- Possession: Which team has the ball

An Example: Imagine your team leads 21-17 with 5 minutes left in the 4th quarter, with the ball at midfield on 2nd-and-5. Your win probability might be around 75%. If you throw an interception that the opponent returns to your 30-yard line, your win probability might plummet to 30%. The Win Probability Added (WPA) for that play would be -45%.
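The WPA arithmetic behind that swing is a simple difference of win probabilities. A minimal Python sketch using the text's rough estimates:

```python
# Win Probability Added for the interception example above.
wp_before = 0.75  # leading 21-17, 2nd-and-5 at midfield, 5:00 left
wp_after = 0.30   # after the interception is returned to your 30

wpa = wp_after - wp_before
print(f"WPA: {wpa:+.0%}")  # WPA: -45%
```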

Why Win Probability Matters:

Win probability provides crucial context for evaluating plays and decisions. A play that gains 30 yards has very different value depending on win probability context:
- When winning 35-10 late in a game, that 30-yard gain barely changes win probability
- When trailing 17-14 with 2 minutes left, that 30-yard gain might change win probability from 15% to 55%

Win probability is particularly valuable for:
1. Evaluating clutch performance: Which players and teams perform best in high-leverage situations?
2. Assessing coaching decisions: Was going for it on 4th down the right call given win probability?
3. Understanding game flow: When did the game really change?
  4. Making in-game decisions: Should we kick a field goal or go for a touchdown?

EPA vs. WPA: When to Use Each

Both EPA and WPA measure play value, but they serve different purposes:

**Use EPA when:**
- Evaluating player or team performance across multiple games
- Comparing players or plays in different game contexts
- Assessing general offensive/defensive efficiency
- You want a stable, repeatable measure of quality

**Use WPA when:**
- Evaluating specific game situations and decisions
- Identifying clutch performance
- Understanding game flow and key momentum shifts
- Making in-game strategic decisions

EPA is more stable and better for player evaluation, while WPA is more game-specific and better for understanding particular contests. Many analyses use both metrics to provide complementary perspectives.

Success Rate

Success Rate measures the percentage of plays that positively contribute to the goal of scoring. It provides a simpler, more intuitive complement to EPA.

Success Definitions:
- 1st down: Gain ≥ 40% of yards to go (e.g., 4+ yards on 1st-and-10)
- 2nd down: Gain ≥ 60% of yards to go (e.g., 6+ yards on 2nd-and-10)
- 3rd/4th down: Gain ≥ 100% of yards to go (conversion)

Alternative Definition: Many analysts simply define success as EPA > 0, which aligns with the expected points framework.
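The down-based thresholds translate directly into a small function. A Python sketch (the function name and exact boundary handling are our choices):

```python
def is_successful(down, yards_to_go, yards_gained):
    """Apply the standard success-rate thresholds described above:
    40% of the distance on 1st down, 60% on 2nd, 100% on 3rd/4th."""
    threshold = {1: 0.40, 2: 0.60, 3: 1.00, 4: 1.00}[down]
    return yards_gained >= threshold * yards_to_go

print(is_successful(1, 10, 4))  # True: 4 yards on 1st-and-10 meets the 40% bar
print(is_successful(2, 10, 5))  # False: 2nd-and-10 requires 6 yards
print(is_successful(3, 3, 3))   # True: conversion
```

Applied over every play in a game or season, the mean of this indicator is the success rate.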

Why Success Rate Matters: While EPA measures the magnitude of success, Success Rate asks a binary question: did this play help? This has several advantages:

  1. Intuitive: Easy to explain to non-technical audiences
  2. Stable: Less affected by a few explosive plays than mean EPA
  3. Process-Oriented: Focuses on consistently moving the chains rather than big plays
  4. Complementary: Provides different information than EPA

Example: Consider two offenses:

Offense A:
- 50% success rate
- 0.10 EPA per play
- Many explosive plays (20+ yards) mixed with failures

Offense B:
- 48% success rate
- 0.08 EPA per play
- Consistent, methodical drives with few big plays

Both offenses are effective but in different ways. Offense A is more explosive and boom-or-bust. Offense B is more steady and reliable. Success rate helps us understand these different styles.

The Success Rate-EPA Relationship

Success rate and EPA are correlated but measure different aspects of performance:

- **High success rate, high EPA**: Elite offense that consistently moves the ball and produces explosive plays
- **High success rate, modest EPA**: Methodical offense that sustains drives but lacks explosiveness
- **Low success rate, high EPA**: Boom-or-bust offense that succeeds with big plays but fails frequently
- **Low success rate, low EPA**: Struggling offense that can't consistently move the ball

The best offenses typically excel at both—they convert first downs consistently AND produce explosive plays regularly.

Setting Up Your Analytics Environment

Before we can analyze football data, we need to set up our development environment. This section will guide you through installing the necessary software and packages for football analytics work. We'll cover both R and Python, as both languages are widely used in football analytics, and this textbook provides examples in both languages.

Choosing Between R and Python: Both languages are excellent for football analytics:

  • R is particularly strong for statistical analysis and visualization, with exceptional packages for data manipulation (tidyverse) and football-specific analysis (nflfastR, nflplotR)
  • Python is more versatile for general programming, machine learning, and integration with other systems, with strong support for data analysis (pandas) and football data (nfl_data_py)

You don't need to choose—many analysts use both languages, selecting whichever is best for each specific task. Throughout this book, we provide code examples in both languages so you can learn either or both.

Prerequisites

This textbook assumes you have:

  1. **Basic programming knowledge**: Understanding of variables, functions, loops, and conditionals
  2. **Basic statistics**: Understanding of mean, median, standard deviation, correlation
  3. **A computer with internet access**: For downloading data and packages

We do NOT assume:

- Advanced statistical or machine learning knowledge (we'll teach these concepts)
- Prior sports analytics experience
- Expert-level programming skills

If you're new to programming, we recommend completing a basic introduction to either R or Python before starting this textbook. Resources like "R for Data Science" (Wickham & Grolemund) or "Python for Data Analysis" (McKinney) provide excellent foundations.

R Setup

R is a programming language specifically designed for statistical computing and graphics. It's widely used in academia and industry for data analysis, and it's particularly popular in sports analytics.

Installing R and RStudio

Before installing packages, you need R itself and RStudio (an integrated development environment that makes working with R much easier):

  1. Install R: Download from https://cran.r-project.org/
    - Choose the version for your operating system (Windows, Mac, or Linux)
    - Follow the installation instructions
    - Latest stable version is recommended (currently 4.3+)

  2. Install RStudio: Download from https://posit.co/download/rstudio-desktop/
    - RStudio provides a user-friendly interface for R
    - Free desktop version is sufficient for all work in this textbook
    - Install after installing R

Installing Required Packages

Once R and RStudio are installed, you need to install packages that provide football analytics functionality. Packages are collections of functions and data that extend R's capabilities.

#| eval: false
#| echo: true

# Install core data manipulation and visualization packages
# The tidyverse is a collection of packages that work together
# It includes dplyr (data manipulation), ggplot2 (visualization),
# tidyr (data cleaning), and several other essential packages
install.packages("tidyverse")

# Install nflfastR for NFL play-by-play data
# This package provides access to play-by-play data from 1999-present
# It includes advanced metrics like EPA, WPA, and success rate pre-calculated
install.packages("nflfastR")

# Install nflplotR for NFL-themed visualizations
# This package provides team colors, logos, and styling for plots
install.packages("nflplotR")

# Install gt for creating publication-quality tables
# Useful for presenting results in a polished format
install.packages("gt")

# Install additional useful packages
install.packages("ggrepel")  # For better text labels in plots
install.packages("patchwork")  # For combining multiple plots
install.packages("scales")  # For formatting axis labels and numbers

**What Each Package Does**:

- **tidyverse**: Your primary toolkit for data manipulation and visualization
- **nflfastR**: The core package for accessing NFL data with advanced metrics
- **nflplotR**: Adds NFL team logos, colors, and styling to visualizations
- **gt**: Creates beautiful, formatted tables from data
- **ggrepel**: Prevents text labels from overlapping in plots
- **patchwork**: Combines multiple plots into sophisticated layouts
- **scales**: Provides formatting functions for axis labels and legends
#| message: false
#| warning: false

# Load libraries to verify they're installed correctly
# If this code runs without errors, your installation succeeded
library(tidyverse)    # Data manipulation and visualization
library(nflfastR)     # NFL play-by-play data
library(nflplotR)     # NFL plotting tools
library(gt)           # Table formatting

# Verify installation by printing version information
cat("✓ R packages loaded successfully\n")
cat("✓ R version:", R.version.string, "\n")
cat("✓ tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
cat("✓ nflfastR version:", as.character(packageVersion("nflfastR")), "\n")
cat("✓ nflplotR version:", as.character(packageVersion("nflplotR")), "\n")

# Test basic functionality
cat("\n✓ Testing data access...\n")
# Load a small sample of team data to confirm everything works
teams <- nflfastR::load_teams()
cat("✓ Successfully loaded data for", nrow(teams), "teams\n")

**What's Happening Here:**

  1. **Library Loading**: The `library()` function loads installed packages into your current R session, making their functions available. You need to load packages at the start of each R session (or script) where you'll use them.
  2. **Version Checking**: The `packageVersion()` function returns the version number of installed packages. This is useful for confirming successful installation, debugging when code doesn't work as expected (version mismatches), and documenting your analysis environment for reproducibility.
  3. **Functionality Test**: We load a small dataset (team information) to verify that nflfastR can access data. If this succeeds, you're ready to start analyzing.

**Expected Output**: You should see output confirming successful loading, showing version numbers, and indicating that team data was loaded. If you see any errors, troubleshoot by:

- Ensuring you have an active internet connection (required for first-time data downloads)
- Updating packages with `update.packages()`
- Restarting RStudio
- Checking the error message for specific issues

Common Installation Issues

**Issue**: "Package 'X' is not available"
**Solution**: Make sure you're connected to the internet and try again. Some networks (school, work) may block package downloads.

**Issue**: "Package 'X' required by 'Y' is not installed"
**Solution**: Install the required package first, or install all dependencies with `install.packages("package_name", dependencies = TRUE)`

**Issue**: Code runs in RStudio but not in a Quarto document
**Solution**: Make sure to load libraries at the beginning of your Quarto document, not just in the console

**Issue**: "Cannot load nflfastR data"
**Solution**: Check your internet connection. Data is downloaded from cloud storage and requires internet access.

Python Setup

Python is a general-purpose programming language that has become dominant in data science and machine learning. Its extensive ecosystem of libraries makes it excellent for football analytics, especially when combining analytics with other tasks like web scraping, automation, or deploying models.

Installing Python and Jupyter

You need Python itself and tools for running Python code interactively:

  1. Install Python: We recommend using Anaconda, which includes Python plus many useful packages
    - Download from https://www.anaconda.com/download
    - Anaconda includes Jupyter, pandas, numpy, and many other packages
    - Alternatively, install Python directly from https://python.org and install packages individually

  2. Install Jupyter: If not using Anaconda, install Jupyter for interactive analysis
    - Jupyter notebooks provide an interactive environment for data analysis
    - Install with: pip install jupyter jupyterlab

Installing Required Packages

Python packages are installed using pip (Python's package installer) or conda (Anaconda's package manager).

#| eval: false
#| echo: true

# Install core data manipulation packages
# pandas: Primary library for working with tabular data (like R's data frames)
# numpy: Fundamental package for numerical computing
pip install pandas numpy

# Install nfl_data_py for NFL data access
# Python equivalent of nflfastR, provides play-by-play data
pip install nfl_data_py

# Install visualization packages
# matplotlib: Core plotting library (similar to base R graphics)
# seaborn: Statistical visualization library built on matplotlib
pip install matplotlib seaborn

# Install additional useful packages
pip install scikit-learn  # Machine learning library
pip install jupyter  # Interactive notebook environment
pip install plotly  # Interactive plotting library
**Alternative Installation with Conda** (if using Anaconda):
# Conda can sometimes handle dependencies better than pip
conda install pandas numpy matplotlib seaborn scikit-learn jupyter

# nfl_data_py is not in conda channels, so use pip
pip install nfl_data_py
#| message: false
#| warning: false

# Import libraries to verify they're installed correctly
import pandas as pd              # Data manipulation (convention: import as pd)
import numpy as np               # Numerical computing (convention: import as np)
import nfl_data_py as nfl        # NFL data access (convention: import as nfl)
import matplotlib.pyplot as plt  # Plotting (convention: import pyplot as plt)
import seaborn as sns            # Statistical visualization

# Verify installation by printing version information
print("✓ Python packages loaded successfully")
print(f"✓ Python version: {pd.__version__.split('.')[0]}.{pd.__version__.split('.')[1]}")
print(f"✓ pandas version: {pd.__version__}")
print(f"✓ numpy version: {np.__version__}")
print(f"✓ matplotlib version: {plt.matplotlib.__version__}")

# Test basic functionality
print("\n✓ Testing data access...")
# Load schedule data to confirm nfl_data_py works
schedule = nfl.import_schedules([2023])
print(f"✓ Successfully loaded data for {len(schedule)} games")
**Import Conventions**: Python has standard conventions for importing common packages:

- `import pandas as pd`: Everyone uses `pd` as the pandas alias
- `import numpy as np`: Everyone uses `np` as the numpy alias
- `import matplotlib.pyplot as plt`: Standard way to import matplotlib's pyplot module

Following these conventions makes your code immediately recognizable to other Python analysts.

**Version Checking**: The `__version__` attribute (note the double underscores) provides version information for Python packages. Version checking is important for:

- Reproducibility (documenting your environment)
- Debugging (ensuring you have compatible versions)
- Feature availability (new features require newer versions)

**Functionality Test**: We import schedule data (game dates, times, and matchups) to verify that nfl_data_py can successfully download data. This requires:

- Active internet connection
- Access to the data repository (usually GitHub)
- Proper package installation

**Expected Output**: You should see confirmation messages showing version numbers and successful data loading. If you encounter errors:

- Check your internet connection
- Try updating packages: `pip install --upgrade nfl_data_py pandas numpy`
- If behind a firewall/proxy, you may need to configure network settings
- Search error messages online—the Python community is very active and most issues are documented
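Since `__version__` is just a string, you can parse it for programmatic checks, for example gating a feature on a major release. A minimal sketch using numpy:

```python
import numpy as np

# __version__ is a plain string like "1.26.4"; split it for comparisons
version = np.__version__
major = int(version.split(".")[0])
print(f"numpy {version} (major release {major})")
```

The same pattern works for pandas, matplotlib, and most other scientific Python packages.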

Python Environments: Virtual Environments vs. Anaconda

**Virtual Environments**: If using standard Python (not Anaconda), create a virtual environment for your football analytics work. This isolates packages and prevents conflicts:
# Create a virtual environment
python -m venv football_env

# Activate it (Windows)
football_env\Scripts\activate

# Activate it (Mac/Linux)
source football_env/bin/activate

# Install packages in the environment
pip install pandas numpy nfl_data_py matplotlib seaborn
**Anaconda Environments**: If using Anaconda, create a conda environment:
# Create and activate a conda environment
conda create -n football python=3.11
conda activate football
conda install pandas numpy matplotlib seaborn jupyter
pip install nfl_data_py
Using environments prevents version conflicts and makes your setup reproducible.

Your First Football Analysis

Now that we understand the conceptual foundations of football analytics and have our environment set up, it's time to work with real data. In this section, we'll walk through a complete analysis from start to finish, demonstrating the typical workflow that analysts use daily in NFL front offices, media organizations, and research settings.

This analysis will introduce you to several critical skills that you'll use throughout this textbook and in practical football analytics work:
- Loading and inspecting NFL play-by-play data to understand its structure
- Filtering and cleaning data for specific analyses
- Calculating summary statistics by group to compare teams or play types
- Creating visualizations to communicate patterns and insights
- Interpreting results in football context, connecting statistical findings to strategic implications

Don't worry if you're new to programming or data analysis—we'll explain every step in detail, covering both the "how" (the code syntax and functions) and the "why" (the analytical reasoning and football context). By the end of this section, you'll have completed a legitimate analysis that would have been impossible without modern data infrastructure just a decade ago. You'll understand not just what the code does, but why we analyze data this way and what the results mean for understanding football strategy.

Loading Play-by-Play Data

The foundation of nearly all football analytics is play-by-play (PBP) data—a detailed record of every play in every game. This dataset is remarkably comprehensive, capturing not just basic statistics like yards gained and final scores, but also rich contextual information including down and distance, field position, time remaining, score differential, weather conditions, and much more. Most importantly for our purposes, modern play-by-play datasets include advanced metrics like Expected Points Added (EPA), Win Probability Added (WPA), and Success Rate already calculated for each play.

Understanding Data Sources

Before we load data, it's worth understanding where this data comes from, how it gets to us, and why we can trust it. The data infrastructure underlying modern football analytics represents years of community effort and technical development.

Data Collection: The NFL collects detailed information about every play through multiple systems:

  • Official Game Statistics: Professional statisticians at every game record official play-by-play data, including down and distance, yards gained, players involved, and play outcomes. This data flows into the league's official database.

  • RFID Tracking Systems: Since 2015, the NFL has equipped player shoulder pads and the football itself with RFID chips that track position 10 times per second. This "Next Gen Stats" data captures player speeds, routes, separation, and more.

  • Video Analysis: The NFL films every game from multiple angles, and this video is used to verify statistics, identify personnel groupings, and generate additional insights.

  • Third-Party Data Providers: Organizations like Pro Football Focus manually chart additional information from video, including blocking assignments, route concepts, coverage schemes, and more.

Data Distribution: Several organizations make NFL data publicly available:

  • Pro Football Reference: Comprehensive historical statistics and play-by-play data
  • ESPN/NFL.com: Official statistics with some advanced metrics
  • NFL Next Gen Stats: Player tracking data with derived metrics
  • Official NFL API: JSON endpoints providing play-by-play data

However, accessing and parsing data from multiple sources would be time-consuming, error-prone, and would require significant data engineering skills. Each source has different formatting, different variable names, and different update schedules.

nflverse Ecosystem: This is where the nflverse project becomes invaluable. Created and maintained by community contributors (including Ben Baldwin, Sebastian Carl, and many others), nflverse provides clean, consistent, and easy-to-use play-by-play data through the nflfastR package for R and nfl_data_py library for Python.

The nflverse data pipeline:
1. Downloads data from official NFL sources
2. Cleans and standardizes variable names
3. Calculates advanced metrics (EPA, WPA, Success Rate, etc.)
4. Validates data quality and fixes known issues
5. Hosts the processed data in a public repository for fast access
6. Updates automatically after each week of games

This means that loading comprehensive, analysis-ready data is literally a single line of code. What once required days of data engineering work now takes seconds.

The Power of Open Source Data

The availability of high-quality, free NFL data through projects like nflverse has democratized football analytics. What once required expensive data subscriptions or manual data collection is now accessible to anyone with an internet connection and basic programming skills. This democratization has had several important effects:

1. **Increased Participation**: Thousands of analysts now publish football analytics work
2. **Higher Quality Discourse**: Public discussion of football strategy is more sophisticated
3. **Team Pressure**: Teams must adopt analytics to remain competitive
4. **Educational Opportunity**: Students can learn with real, professional-grade data
5. **Innovation**: More people analyzing data leads to more methodological innovations

The nflverse project exemplifies the power of open-source collaboration. Dozens of contributors have donated time to build tools that benefit the entire community. This textbook exists because of their work.

Loading the Data

Let's load play-by-play data for the 2023 NFL season. We'll use this dataset throughout the chapter to demonstrate various concepts. The 2023 season is recent enough to be relevant but complete enough that all data has been finalized and verified.

#| label: load-pbp-data-r
#| message: false
#| warning: false
#| cache: true

# Load required libraries
# These must be loaded at the start of every R session where you'll use them
library(tidyverse)    # Collection of data manipulation packages
library(nflfastR)     # NFL play-by-play data with advanced metrics
library(nflplotR)     # NFL team logos and styling for plots
library(gt)           # Publication-quality table formatting

# Load play-by-play data for 2023 season
# The load_pbp() function:
# - Downloads data from the nflverse repository if not already cached
# - Returns a tibble (modern data frame) where each row = one play
# - Includes 372 variables with everything from basic info to advanced metrics
pbp_2023 <- load_pbp(2023)

# Display information about what we've loaded
# This helps us understand the scope and structure of the data
cat("===== DATASET OVERVIEW =====\n")
cat("Rows (plays):", format(nrow(pbp_2023), big.mark = ","), "\n")
cat("Columns (variables):", ncol(pbp_2023), "\n")
cat("Season:", unique(pbp_2023$season), "\n")
cat("Weeks:", min(pbp_2023$week), "through", max(pbp_2023$week), "\n")
cat("Unique games:", n_distinct(pbp_2023$game_id), "\n")
cat("Unique teams:", n_distinct(pbp_2023$posteam), "\n")

# Show a small sample of the data
# Select key columns that illustrate the data structure
cat("\n===== SAMPLE DATA =====\n")
pbp_2023 %>%
  # Select subset of interesting columns
  # posteam = possessing team (offense), defteam = defending team
  select(game_id, week, posteam, defteam, down, ydstogo,
         desc, yards_gained, epa) %>%
  # Take first 3 plays only (for brevity)
  head(3) %>%
  # Print in readable format
  print(width = 100)
#| label: load-pbp-data-py
#| message: false
#| warning: false
#| cache: true

# Import required libraries
# These must be imported in every Python script/notebook where you'll use them
import pandas as pd              # Data manipulation (like R data frames)
import numpy as np               # Numerical computing
import nfl_data_py as nfl        # NFL data access (Python equivalent of nflfastR)
import matplotlib.pyplot as plt  # Plotting library
import seaborn as sns            # Statistical visualization

# Configure pandas display options for better readability
# These settings control how DataFrames are displayed in output
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 120)         # Wider display
pd.set_option('display.precision', 3)       # 3 decimal places

# Load play-by-play data for 2023 season
# import_pbp_data() takes a list of seasons (can load multiple at once)
# Returns a pandas DataFrame where each row = one play
pbp_2023 = nfl.import_pbp_data([2023])

# Display information about what we've loaded
print("===== DATASET OVERVIEW =====")
print(f"Rows (plays): {len(pbp_2023):,}")
print(f"Columns (variables): {len(pbp_2023.columns)}")
print(f"Season: {pbp_2023['season'].unique()[0]}")
print(f"Weeks: {pbp_2023['week'].min()} through {pbp_2023['week'].max()}")
print(f"Unique games: {pbp_2023['game_id'].nunique()}")
print(f"Unique teams: {pbp_2023['posteam'].nunique()}")

# Show a small sample of the data
print("\n===== SAMPLE DATA =====")
sample_columns = ['game_id', 'week', 'posteam', 'defteam', 'down',
                  'ydstogo', 'desc', 'yards_gained', 'epa']
print(pbp_2023[sample_columns].head(3).to_string(index=False))
Let's break down what's happening in this code step by step, explaining both the technical mechanics and the analytical reasoning.

**1. Library Imports/Loading** (First section)

In R, we use `library()` to load packages; in Python, we use `import` statements. Despite the syntax difference, the concept is identical: we're bringing pre-written functions and tools into our current session. Key packages and their purposes:

- **tidyverse/pandas**: Core data manipulation tools. The tidyverse includes dplyr (data manipulation), ggplot2 (visualization), tidyr (data reshaping), and more. Pandas provides similar functionality for Python.
- **nflfastR/nfl_data_py**: The crucial packages that provide access to NFL data. Without these, we'd need to manually download, parse, and clean data from multiple sources.
- **nflplotR**: Adds NFL-specific plotting capabilities like team logos and official colors
- **gt**: Creates beautiful, formatted tables (R only; Python has alternatives like plotly)

**2. Data Loading** (Main function call)

The key function differs between languages:

- R: `pbp_2023 <- load_pbp(2023)`
- Python: `pbp_2023 = nfl.import_pbp_data([2023])`

Both functions do the same thing:

1. Check if 2023 data is already cached locally on your computer
2. If not, download from the nflverse data repository (hosted on GitHub)
3. Parse the data into a structured format (data frame/DataFrame)
4. Return the data for use in your analysis

**First-time vs. Subsequent Loads**: The first time you load data for a season, it downloads from the internet (takes 30-60 seconds). Subsequent loads use cached data (takes 2-3 seconds). This is why we use `cache: true` in code chunk options—it tells Quarto to cache results and only re-run if code changes.

**3. Data Inspection** (cat/print statements)

The inspection code prints several pieces of information to help us understand what we're working with:

- **Rows (plays)**: Typically 43,000-45,000 plays per season. This includes all regular season plays (offensive plays, special teams, penalties, etc.). With 272 regular season games, this averages to about 160 plays per game, which matches typical NFL totals.
- **Columns (variables)**: Over 370 variables! This might seem overwhelming, but most analyses use only 10-20 variables. The comprehensive variable set means the data can support nearly any analysis without requiring external data joins.
- **Season/Weeks**: Confirms we loaded the correct data. Regular season is weeks 1-18 (as of 2021, when the NFL expanded to 17 games).
- **Unique games**: Should be 272 games (32 teams × 17 games ÷ 2, since each game involves two teams). A number far from 272 suggests data loading issues.
- **Unique teams**: Should always be 32. Checking this helps catch data loading errors.

**4. Sample Display** (Last section)

We display the first few plays with key columns to see what the data actually looks like:

- `game_id`: Unique identifier for each game (format: season_week_away_home)
- `week`: Week number of the season
- `posteam`: Team with possession (offensive team)
- `defteam`: Defensive team
- `down`: Down number (1st, 2nd, 3rd, or 4th)
- `ydstogo`: Yards needed for a first down
- `desc`: Text description of the play (what actually happened)
- `yards_gained`: Net yards gained (accounts for sacks, fumbles, etc.)
- `epa`: Expected Points Added (the value of the play)

This sample gives us immediate insight into data structure and confirms successful loading.
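The `game_id` format noted above (season_week_away_home) can be unpacked with a plain string split; the id below is a hypothetical example following that pattern:

```python
# Unpack an nflverse-style game_id (hypothetical example id)
game_id = "2023_01_DET_KC"

season, week, away, home = game_id.split("_")
print(season, week, away, home)  # → 2023 01 DET KC
```

This is handy when you need season or team columns but only have the id.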

Interpreting the Output

When you run this code, you should see output similar to this:

===== DATASET OVERVIEW =====
Rows (plays): 43,871
Columns (variables): 372
Season: 2023
Weeks: 1 through 18
Unique games: 272
Unique teams: 32

What This Tells Us:

The ~44,000 plays represent every play from the 2023 regular season (playoffs are separate data). Let's put this in context:

  • Play Volume: 43,871 plays ÷ 272 games = approximately 161 plays per game. This includes offensive plays (typically 130-140 combined for both teams), plus special teams plays (kickoffs, punts, field goals) and penalty plays.

  • Data Completeness: The fact that we have exactly 272 games (32 teams × 17 games ÷ 2) confirms complete data. Missing games would indicate data loading issues.

  • Variable Count: The 372 variables are organized into categories:

  • Basic identifiers (game_id, play_id, team names)
  • Situation variables (down, distance, field position, time)
  • Play description (play type, description text, personnel)
  • Outcomes (yards gained, touchdown, turnover, penalty)
  • Advanced metrics (epa, wpa, cpoe, success, qb_epa)
  • Player identifiers (passer_id, rusher_id, receiver_id, etc.)
  • Tracking data (air_yards, yards_after_catch, time_to_throw)
  • Game context (score, win_probability, vegas_line)

You'll typically use 10-20 variables for any given analysis, but the comprehensive set ensures you can address nearly any research question.
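The completeness checks described above reduce to arithmetic you can verify in a couple of lines:

```python
# Expected regular-season game count: each game involves two teams
teams, games_per_team = 32, 17
total_games = teams * games_per_team // 2
print(total_games)  # → 272

# Plays per game implied by the dataset overview shown earlier
plays = 43_871
print(round(plays / total_games))  # → 161
```

If your loaded data reports a game count far from 272, re-download before analyzing further.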

Common Issue: Data Size and Memory

The play-by-play dataset for a single season is approximately 250-300 MB when loaded into memory. This is manageable on most modern computers (anything with 4+ GB RAM), but you should be aware of memory constraints.

**Working with multiple seasons**: Loading 10 seasons means 2.5-3 GB of data. This is still manageable on most computers, but older systems may struggle.

**Memory-Constrained Environments**: If working on older hardware or cloud environments with limited memory:

- Load only the columns you need: `select()` immediately after loading
- Filter to specific teams or time periods: `filter()` reduces data size
- Work with data in chunks: load one season at a time, process it, then load the next
- Use more memory-efficient data types: convert text columns to factors (R) or categories (Python)

Example of selective loading:
# Load only essential columns
pbp_2023 <- load_pbp(2023) %>%
  select(game_id, posteam, defteam, down, ydstogo, play_type,
         yards_gained, epa, wpa, success)
This reduces memory usage by ~80% while retaining data for most analyses.
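To see why dropping columns helps, here is a small sketch that measures memory before and after column selection. It uses synthetic data, not the real play-by-play table, so the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for play-by-play data: 10,000 rows × 50 numeric columns
# (the real dataset has ~44,000 rows and 372 mixed-type columns)
rng = np.random.default_rng(42)
pbp = pd.DataFrame({f"col_{i}": rng.normal(size=10_000) for i in range(50)})

full_mb = pbp.memory_usage(deep=True).sum() / 1e6
slim = pbp[["col_0", "col_1", "col_2"]].copy()  # keep only the columns you need
slim_mb = slim.memory_usage(deep=True).sum() / 1e6

print(f"full: {full_mb:.1f} MB, slim: {slim_mb:.1f} MB")
# Keeping 3 of 50 columns cuts memory roughly in proportion
```

With real play-by-play data the savings are even larger, because many of the 372 columns are memory-hungry text fields.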

Basic EPA Analysis

Now that we have data loaded, let's perform our first real analysis. We'll examine a fundamental question that gets at the heart of modern offensive strategy: Which play types are more effective on average—passes or runs?

The Analysis Question

This might seem like a simple question with an obvious answer, but it has profound implications for football strategy. For decades, conventional wisdom in football emphasized "establishing the run game," "controlling the line of scrimmage," and maintaining "balance" between run and pass. Coaches and analysts believed that running the ball effectively was essential for winning, and that pass-heavy offenses were risky and unsustainable.

However, modern analytics has revealed a more complex picture. While running the ball has value in specific situations (running out the clock when leading, short-yardage conversions, etc.), passing is dramatically more efficient than running on a per-play basis in most game situations. This finding has fundamentally changed how modern NFL offenses operate.

Specifically, we want to answer these questions:
1. Volume: How many pass plays vs. run plays occurred in 2023?
2. Mean EPA: What was the average EPA for each play type?
3. Median EPA: What was the median EPA (to account for outliers and skewness)?
4. Variability: How much do EPA values vary for each play type?
5. Success Rate: What percentage of each play type were "successful" (EPA > 0)?

These questions give us a comprehensive view of play type efficiency, covering both central tendency (mean, median) and distribution shape (standard deviation, success rate).
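Question 5 hinges on a small trick used throughout this book: the mean of a boolean vector equals the proportion of `True` values. A tiny example with made-up EPA values:

```python
import numpy as np

# Six made-up EPA values; three are positive
epa = np.array([0.5, -0.2, 1.1, -0.4, -0.1, 2.0])

# True counts as 1 and False as 0, so the mean is the success rate
success_rate = (epa > 0).mean()
print(success_rate)  # → 0.5
```

The same idiom appears in the R analysis below as `mean(epa > 0)`.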

Understanding EPA in This Context

Recall that EPA (Expected Points Added) measures how much a play changes the expected points before the next score. Let's review key aspects:

  • Positive EPA = Good for offense (moved team closer to scoring)
  • Negative EPA = Bad for offense (moved team further from scoring)
  • EPA = 0 = No change in scoring expectation
  • Typical range: Most plays fall between -3.0 and +3.0 EPA
  • Extreme values: Turnovers can be -4 to -6 EPA; long touchdowns can be +5 to +7 EPA

When comparing passes and runs, we expect to find:
- Different average EPA (one play type might be more efficient)
- Different variability (one play type might be more consistent)
- Different success rates (one might succeed more often)

These differences inform strategic decisions about when to pass vs. run.
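The extreme values described above are exactly why we report a median alongside the mean. A made-up five-play example shows how a single long touchdown distorts one statistic but not the other:

```python
import numpy as np

# Illustrative (made-up) EPA values: one long touchdown skews the mean
epa = np.array([-0.5, -0.3, -0.2, 0.1, 6.0])

print(np.mean(epa))    # 1.02 — pulled upward by the single big play
print(np.median(epa))  # -0.2 — unaffected by the outlier
```

Four of the five plays lost expected points, yet the mean is strongly positive; the median tells that story more honestly.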

#| label: epa-analysis-r
#| message: false
#| warning: false
#| cache: true

# Calculate EPA statistics by play type
# This analysis compares pass and run efficiency
epa_summary <- pbp_2023 %>%
  # Filter to only pass and run plays with valid EPA values
  # Exclude special teams (punts, FGs, kickoffs) and plays without EPA
  filter(
    play_type %in% c("pass", "run"),  # Only offensive plays
    !is.na(epa)                         # Only plays with EPA calculated
  ) %>%
  # Group by play type to calculate separate statistics for each
  # This splits the data into pass and run groups
  group_by(play_type) %>%
  # Calculate summary statistics for each group
  summarise(
    # Count: total number of plays
    total_plays = n(),

    # Mean EPA: average value per play
    # This is the most commonly used measure of efficiency
    mean_epa = mean(epa, na.rm = TRUE),

    # Median EPA: middle value when all plays are sorted
    # Less affected by outliers than mean
    median_epa = median(epa, na.rm = TRUE),

    # Standard deviation: measure of variability
    # Higher SD = more variable outcomes
    sd_epa = sd(epa, na.rm = TRUE),

    # Success rate: percentage of plays with positive EPA
    # Mean of boolean (TRUE/FALSE) gives proportion of TRUE
    success_rate = mean(epa > 0),

    # Drop grouping for cleaner output
    .groups = "drop"
  ) %>%
  # Calculate percentage of total plays for each type
  mutate(
    pct_of_total = total_plays / sum(total_plays) * 100
  )

# Display the results in a formatted table
epa_summary %>%
  # Create a gt table object for publication-quality formatting
  gt() %>%
  # Add title and subtitle
  tab_header(
    title = "EPA Analysis by Play Type",
    subtitle = "2023 NFL Regular Season"
  ) %>%
  # Format column labels to be more readable
  cols_label(
    play_type = "Play Type",
    total_plays = "Total Plays",
    pct_of_total = "% of Total",
    mean_epa = "Mean EPA",
    median_epa = "Median EPA",
    sd_epa = "Std Dev",
    success_rate = "Success Rate"
  ) %>%
  # Format EPA columns to 3 decimal places
  fmt_number(
    columns = c(mean_epa, median_epa, sd_epa),
    decimals = 3
  ) %>%
  # Format success rate (a 0-1 proportion) as a percentage
  fmt_percent(
    columns = success_rate,
    decimals = 1
  ) %>%
  # pct_of_total is already on a 0-100 scale, so don't rescale it
  fmt_percent(
    columns = pct_of_total,
    decimals = 1,
    scale_values = FALSE
  ) %>%
  # Format count column with comma separators
  fmt_number(
    columns = total_plays,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  # Add source note for transparency
  tab_source_note(
    source_note = "Data source: nflfastR | Excludes special teams and penalties"
  )

# Print a narrative summary with key findings
cat("\n===== KEY FINDINGS =====\n")
pass_epa <- epa_summary %>% filter(play_type == "pass") %>% pull(mean_epa)
run_epa <- epa_summary %>% filter(play_type == "run") %>% pull(mean_epa)
epa_diff <- pass_epa - run_epa

cat(sprintf("Pass plays averaged %.3f EPA\n", pass_epa))
cat(sprintf("Run plays averaged %.3f EPA\n", run_epa))
cat(sprintf("Passes were %.3f EPA more valuable per play\n", epa_diff))
cat(sprintf("This represents a %.1f%% advantage for passing\n",
            (epa_diff / abs(run_epa)) * 100))
#| label: epa-analysis-py
#| message: false
#| warning: false
#| cache: true

# Calculate EPA statistics by play type
# This analysis compares pass and run efficiency
epa_summary = (
    pbp_2023
    # Filter to only pass and run plays with valid EPA
    # .query() provides SQL-like filtering syntax
    # epa == epa is a trick to filter out NaN (NaN != NaN in Python)
    .query("play_type in ['pass', 'run'] and epa == epa")
    # Group by play type
    .groupby('play_type')
    # Calculate summary statistics using .agg()
    # Each line specifies (output_column = (input_column, function))
    .agg(
        total_plays=('epa', 'count'),        # Count of plays
        mean_epa=('epa', 'mean'),            # Average EPA
        median_epa=('epa', 'median'),        # Median EPA
        sd_epa=('epa', 'std'),               # Standard deviation
        # Custom function using lambda for success rate
        success_rate=('epa', lambda x: (x > 0).mean())
    )
    # Reset index to make play_type a regular column
    .reset_index()
)

# Calculate percentage of total plays
epa_summary['pct_of_total'] = (
    epa_summary['total_plays'] / epa_summary['total_plays'].sum() * 100
)

# Display the results in a formatted table
print("\n" + "="*70)
print("EPA ANALYSIS BY PLAY TYPE - 2023 NFL REGULAR SEASON")
print("="*70)
print()
# to_string() creates clean console output
print(epa_summary.to_string(index=False))

# Print narrative summary with key findings
print("\n" + "="*70)
print("KEY FINDINGS")
print("="*70)

# Extract specific values for narrative summary
pass_epa = epa_summary.loc[epa_summary['play_type'] == 'pass', 'mean_epa'].values[0]
run_epa = epa_summary.loc[epa_summary['play_type'] == 'run', 'mean_epa'].values[0]
epa_diff = pass_epa - run_epa

print(f"Pass plays averaged {pass_epa:.3f} EPA")
print(f"Run plays averaged {run_epa:.3f} EPA")
print(f"Passes were {epa_diff:.3f} EPA more valuable per play")
print(f"This represents a {(epa_diff/abs(run_epa))*100:.1f}% advantage for passing")

# Display success rates
pass_success = epa_summary.loc[epa_summary['play_type'] == 'pass', 'success_rate'].values[0]
run_success = epa_summary.loc[epa_summary['play_type'] == 'run', 'success_rate'].values[0]

print(f"\nPass success rate: {pass_success:.1%}")
print(f"Run success rate: {run_success:.1%}")
print(f"Passes succeed {(pass_success - run_success):.1%} more often")
This analysis involves several important steps and concepts. Let's examine each one in detail, explaining both the technical mechanics and the analytical reasoning.

**1. Data Filtering** (First operation after loading data)

filter(play_type %in% c("pass", "run"), !is.na(epa))

We filter for two critical reasons:

**Play Type Filtering**: `play_type %in% c("pass", "run")` keeps only offensive plays. The play-by-play data includes many play types:

- Offensive: pass, run
- Special teams: punt, kickoff, field_goal, extra_point
- Administrative: timeout, quarter_end, two_minute_warning
- Other: penalty, no_play, spike, kneel

We exclude non-offensive plays because we're comparing offensive efficiency. Including punts or field goals would contaminate our analysis.

**Missing Value Filtering**: `!is.na(epa)` removes plays without EPA calculated. EPA isn't calculated for:

- Penalties that don't result in a play
- Two-point conversions (special EPA models exist for these)
- Some special teams plays
- End-of-quarter administrative plays

Including NA values would cause calculation errors or misleading results. Always check for and handle missing values appropriately.

**2. Grouping** (Organizing data for group-wise calculations)

group_by(play_type)

Grouping tells R/Python to perform subsequent calculations separately for each play type. Think of it as:

1. Splitting the data into separate datasets (one for passes, one for runs)
2. Running the same calculations on each separate dataset
3. Combining the results back together

This is a fundamental pattern in data analysis. Instead of manually subsetting data, writing separate code for each subset, and combining results, we do it all in one pipeline.

**3. Summary Calculations** (Computing statistics for each group)

For each play type, we calculate multiple statistics to build a complete picture:

**total_plays**: Simple count using `n()` (R) or `'count'` (Python). Shows the volume of each play type. Higher volume might indicate preference or game situations.

**mean_epa**: Arithmetic average EPA. This is the primary measure of efficiency—on average, how much value does each play type add? Mean is sensitive to outliers (one 80-yard touchdown significantly pulls up the average), which is why we also calculate median.

**median_epa**: Middle value when all EPA values are sorted. If you had 1,001 plays and sorted them by EPA, the median is the 501st value. Median is more robust to outliers—a few big plays don't dramatically affect it.

**sd_epa**: Standard deviation measures spread. Higher SD means more variable outcomes. For passes, we expect higher SD because outcomes range from incompletions (-1.5 EPA) to long touchdowns (+6 EPA). Runs cluster more tightly around small gains.

**success_rate**: Proportion of plays with EPA > 0. Calculated as `mean(epa > 0)` because the mean of a boolean (TRUE/FALSE in R, True/False in Python) equals the proportion of TRUE values. This metric asks: "How often does this play type succeed?" rather than "By how much does it succeed?"

**4. Percentage Calculation** (Adding context to raw counts)

mutate(pct_of_total = total_plays / sum(total_plays) * 100)

This adds a column showing what percentage of total plays each play type represents. It provides context—if passes are 70% of plays, that tells us about modern offensive philosophy. The calculation divides each play type's count by the sum of all counts, then multiplies by 100 for percentage format.

**5. Formatting and Display** (Making results readable and professional)

The code uses different formatting approaches:

**R (gt package)**: Creates publication-quality HTML tables with:

- Title and subtitle for context
- Formatted numbers (decimals, commas, percentages)
- Readable column labels
- Source notes for transparency

**Python (to_string method)**: Creates clean console output. While less visually polished than gt, it's simpler and works in any environment. For published analyses, Python analysts might use plotly, seaborn, or export to HTML/LaTeX.

Both approaches prioritize clarity and professionalism. Data analysis isn't just about calculating numbers—it's about communicating findings effectively.

Interpreting the Results

When you run this analysis, you'll see results similar to this (exact values may vary slightly depending on data updates):

Expected Output:

| Play Type | Total Plays | % of Total | Mean EPA | Median EPA | Std Dev | Success Rate |
|-----------|-------------|------------|----------|------------|---------|--------------|
| pass      | 18,234      | 59.2%      | 0.097    | -0.190     | 2.347   | 45.8%        |
| run       | 12,567      | 40.8%      | -0.038   | -0.430     | 1.586   | 42.7%        |

Let's carefully interpret each column and discuss what these numbers mean for football strategy:

1. Volume (Total Plays and % of Total)

Pass plays outnumber run plays by nearly 3:2 (59.2% vs 40.8%). This reflects the modern NFL's pass-heavy approach and represents a dramatic shift from earlier eras. In the 1970s and 1980s, teams ran the ball 55-60% of the time. By the 2000s, this had shifted to roughly 50-50. Today, teams pass significantly more than they run.

Why This Shift Occurred: The volume shift reflects teams responding to efficiency differences. As coaches and analysts recognized that passing is more efficient, play-calling adjusted accordingly. Rules changes that protect quarterbacks and limit defensive contact with receivers have also facilitated this shift.

Context Matters: The 59-41 split doesn't mean teams pass on 59% of first downs—situational factors matter enormously. Teams pass more on early downs when behind, less when ahead late in games. The aggregate numbers mask important situational nuances.

2. Mean EPA: The Core Efficiency Metric

Passes average +0.097 EPA while runs average -0.038 EPA. This 0.135 EPA difference is substantial—let's put it in context:

Over a Full Game: A typical team runs about 65 offensive plays. If an offense passed on every play vs. ran on every play, the EPA difference would be: 65 plays × 0.135 EPA difference = 8.775 expected points per game. That's more than a touchdown! This illustrates why pass-heavy offenses tend to score more.

Cumulative Season Impact: Over a 17-game season, this compounds to approximately 149 expected points—equivalent to about 21 touchdowns. Teams that pass more efficiently gain massive advantages.

Why Passes Are More Efficient: Several factors contribute:
- Passing can gain yards more quickly (no defensive line penetration needed)
- Incomplete passes are less costly than stuffed runs (clock stops, same down)
- Passing schemes can exploit favorable matchups more easily
- The threat of passing stretches defenses vertically, creating space

3. Median EPA: Understanding Distribution Shape

Both medians are negative (pass: -0.190, run: -0.430), which initially seems counterintuitive. If the median play has negative EPA, how can offenses score points?

The Explanation: EPA distributions are right-skewed (have a long right tail). This means:
- Most plays gain small amounts or lose small amounts (negative EPA)
- Occasional big plays create large positive EPA values
- These big plays pull the mean positive while the median stays negative
- Scoring doesn't happen on most plays—it happens through sustained drives with occasional explosive plays

Passes vs. Runs: Pass median (-0.190) is higher than run median (-0.430), meaning the "typical" pass play is better than the "typical" run play. However, both are negative because neither play type succeeds most of the time.
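The mean-above-median effect is easy to reproduce with a toy simulation. The values below are illustrative "EPA-like" numbers, not real play data: most draws lose a little, a small minority gain a lot, and the mean ends up well above the median.

```python
import random
import statistics

random.seed(1)

# Toy right-skewed sample: 950 routine plays that hover slightly negative,
# plus 50 explosive plays with large positive values (illustrative only)
plays = [random.gauss(-0.4, 0.6) for _ in range(950)]
plays += [random.uniform(3, 6) for _ in range(50)]

print(f"mean:   {statistics.mean(plays):+.3f}")
print(f"median: {statistics.median(plays):+.3f}")
# The handful of big plays pulls the mean well above the median,
# just as explosive plays do in real EPA distributions.
```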

4. Standard Deviation: Consistency vs. Explosiveness

Passes have much higher standard deviation (2.347 vs 1.586), meaning pass outcomes are more variable. This reflects different play type characteristics:

Pass Variability: Pass plays range from:
- Interceptions returned for TDs (-7 to -9 EPA)
- Sacks (-2 to -4 EPA)
- Incompletions (-1 to -2 EPA)
- Short completions (-0.5 to +1.5 EPA)
- Long completions (+2 to +4 EPA)
- Touchdowns (+4 to +7 EPA)

Run Consistency: Run plays cluster more tightly:
- Big losses (-3 to -4 EPA) are rare
- Most runs gain 1-5 yards (-0.5 to +0.5 EPA)
- Long runs (+3 to +6 EPA) are less common than long passes

Strategic Implications: Higher pass variability means:
- Passing is higher risk, higher reward
- Pass-heavy offenses have more volatile performance
- Running provides more predictable (but lower) returns
- In high-leverage situations, the choice depends on risk tolerance
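The variability contrast can also be made concrete with a quick simulation. The distributions below are toy stand-ins (normal draws with made-up parameters, not real EPA): two samples with similar centers but very different spreads, mimicking the pass/run contrast in the table.

```python
import random
import statistics

random.seed(7)

# "Pass-like" outcomes: wide spread around a slightly positive center.
# "Run-like" outcomes: narrow spread around a slightly negative center.
# (Parameters loosely echo the table's means/SDs but are illustrative.)
pass_like = [random.gauss(0.10, 2.3) for _ in range(10_000)]
run_like = [random.gauss(-0.04, 1.6) for _ in range(10_000)]

print(f"pass-like SD: {statistics.stdev(pass_like):.2f}")
print(f"run-like  SD: {statistics.stdev(run_like):.2f}")
# Similar centers, very different spreads: the same efficiency edge
# can come packaged with much more game-to-game volatility.
```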

5. Success Rate: How Often Does Each Play Type Work?

Passes succeed 45.8% of the time, runs succeed 42.7% of the time—passes succeed about 3 percentage points more often.

Neither Play Type Succeeds Most of the Time: This is crucial to understand. Even passes, the more efficient play type, fail 54.2% of the time! This reflects how difficult it is to gain positive EPA on any given play. Football is a game where most plays fail; success comes from:
- Succeeding slightly more often than failing
- Having bigger successes than failures
- Sustaining drives through multiple plays

The Success Rate Difference: While 3 percentage points seems small, over 65 plays per game, this means about 2 additional successful plays per game for passing. Over a season, this compounds significantly.

Success Rate vs. Mean EPA: These metrics provide complementary information:
- Success rate = how often you succeed
- Mean EPA = how much you gain when you do succeed
- Passes win on both dimensions, but especially on magnitude of success
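Because True and False behave as 1 and 0, the success-rate calculation really is just an average of a boolean condition. A minimal illustration with made-up EPA values:

```python
# Success rate as the mean of a boolean: True counts as 1, False as 0,
# so averaging (epa > 0) yields the proportion of successful plays.
epa_values = [0.8, -0.5, 1.2, -1.1, -0.3, 2.4, -0.7, 0.1]

success_rate = sum(e > 0 for e in epa_values) / len(epa_values)
print(f"Success rate: {success_rate:.1%}")  # 4 of 8 plays succeed -> 50.0%
```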

The Pass-Run EPA Gap: Football's Most Robust Finding

The ~0.13-0.15 EPA advantage for passing is one of the most consistent findings in football analytics. It has held true:

- Across multiple seasons (1999-present)
- Across different teams and offensive styles
- After controlling for down, distance, and field position
- In both regular season and playoffs

This robustness suggests a fundamental efficiency difference, not random variation or cherry-picked results.

**However, This Doesn't Mean Teams Should Pass Every Down**:

- Game theory requires maintaining the run threat (defenses would adjust if teams passed 100% of the time)
- Situational factors matter (run when leading late to consume clock)
- Weather conditions affect passing efficiency
- Short-yardage situations favor running
- Individual team strengths vary

The key insight is that **many teams historically under-utilized passing relative to its efficiency**. The modern shift toward more passing reflects teams responding to this analytical insight.

Visualizing EPA Distribution

Numbers in tables provide precision, but visualizations reveal patterns and relationships that tables obscure. Let's create a visualization that shows the full distribution of EPA values for passes and runs. This will help us understand not just the average efficiency, but how outcomes are distributed—the full range of possibilities for each play type.

A distribution plot shows:
- Shape: Is the distribution symmetric or skewed?
- Center: Where is the typical value?
- Spread: How variable are the outcomes?
- Outliers: Are there extreme values?
- Overlap: How much do pass and run distributions overlap?

#| label: fig-epa-distribution-r
#| fig-cap: "Distribution of EPA by play type for the 2023 NFL season. The plot shows density curves for pass and run plays, with a vertical dashed line at EPA = 0 separating positive (successful) from negative (unsuccessful) plays. Pass plays show both higher mean EPA and greater variability compared to run plays."
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
#| cache: true

# Create EPA distribution visualization
pbp_2023 %>%
  # Filter to pass and run plays with valid EPA
  filter(!is.na(epa), play_type %in% c("pass", "run")) %>%
  # Create the plot
  ggplot(aes(x = epa, fill = play_type)) +
  # Density curves show smoothed distribution of values
  # alpha controls transparency (0.6 = 60% opaque)
  geom_density(alpha = 0.6) +
  # Add vertical line at EPA = 0 (success/failure threshold)
  geom_vline(xintercept = 0, linetype = "dashed",
             color = "black", linewidth = 1) +
  # Manually specify colors for each play type
  scale_fill_manual(
    values = c("pass" = "#00BFC4", "run" = "#F8766D"),
    labels = c("Pass", "Run")
  ) +
  # Limit x-axis to focus on main distribution
  # (-5 to +5 captures 99%+ of plays)
  scale_x_continuous(
    limits = c(-5, 5),
    breaks = seq(-5, 5, 1)  # Axis tick marks every 1 EPA
  ) +
  # Labels and title
  labs(
    title = "EPA Distribution by Play Type",
    subtitle = "2023 NFL Regular Season | Density curves showing distribution shape",
    x = "Expected Points Added (EPA)",
    y = "Density",
    fill = "Play Type",
    caption = "Data: nflfastR | Negative EPA = unsuccessful play, Positive EPA = successful play"
  ) +
  # Use minimal theme for clean appearance
  theme_minimal() +
  # Customize theme elements
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray30"),
    legend.position = "top",
    panel.grid.minor = element_blank()  # Remove minor grid lines
  )


#| label: fig-epa-distribution-py
#| fig-cap: "Distribution of EPA by play type for the 2023 NFL season (Python version). The histogram shows the distribution of EPA values for pass and run plays, with a vertical dashed line at EPA = 0 indicating the success threshold."
#| fig-width: 10
#| fig-height: 6
#| message: false
#| warning: false
#| cache: true

# Filter data for plotting
plot_data = pbp_2023[
    pbp_2023["epa"].notna() & pbp_2023["play_type"].isin(["pass", "run"])
]

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot density curves for each play type
# We'll use histogram with density=True to create density plot
for play_type, color in [('pass', '#00BFC4'), ('run', '#F8766D')]:
    # Filter to specific play type
    data = plot_data[plot_data['play_type'] == play_type]['epa']
    # Create histogram with many bins and density=True for smooth appearance
    ax.hist(data, bins=50, alpha=0.6, label=play_type.title(),
            color=color, density=True)

# Add vertical line at EPA = 0
ax.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.7)

# Set axis limits to focus on main distribution
ax.set_xlim(-5, 5)

# Labels and title
ax.set_xlabel('Expected Points Added (EPA)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('EPA Distribution by Play Type\n2023 NFL Regular Season',
             fontsize=14, fontweight='bold')

# Add legend
ax.legend(title='Play Type', loc='upper right', fontsize=10)

# Add caption
fig.text(0.99, 0.01, 'Data: nfl_data_py | Negative EPA = unsuccessful, Positive EPA = successful',
         ha='right', fontsize=8, style='italic', color='gray')

# Improve layout
plt.tight_layout()
plt.show()
This visualization reveals several important patterns that weren't obvious from the summary statistics:

**1. Shape of Distributions (Right-Skewed)**

Both distributions are **right-skewed** (long tail on the right), meaning:

- Most plays cluster around slightly negative EPA values
- A smaller number of plays have large positive EPA (big gains, touchdowns)
- Very few plays have extreme negative EPA (though turnovers can reach -5 or worse)

This shape explains why median EPA is negative while mean EPA is higher—the mean gets pulled toward the positive tail by big plays.

**2. Pass Distribution is More Spread Out**

The pass distribution (blue) is shorter and wider than the run distribution (red). This visualizes what the standard deviation told us: pass outcomes are more variable. Notice:

- The pass distribution extends further in both directions
- The run distribution is taller and narrower (outcomes cluster more tightly)
- The pass distribution has a longer right tail (more big plays)

**3. Center Locations (Mean EPA)**

The center of the pass distribution is clearly to the right (more positive) of the run distribution. You can see this by comparing where each distribution peaks and where the bulk of each distribution's mass lies relative to the EPA = 0 line.

**4. Success Rates (Area Right of Zero)**

The dashed vertical line at EPA = 0 divides successful (right) from unsuccessful (left) plays. Notice:

- Both distributions have more mass to the left of zero than the right (most plays fail)
- The pass distribution has slightly more area to the right of zero (higher success rate)
- The difference is modest but meaningful

**5. Overlap Between Distributions**

There's substantial overlap between the pass and run distributions. This means:

- Many individual runs are better than many individual passes
- The difference is statistical (an average across many plays), not deterministic
- You can't judge play-calling on individual plays; only over larger samples
- Context matters: some runs are excellent, some passes are terrible

**6. Extreme Values**

The distributions show that extreme negative EPA values (worse than -3) are relatively rare for both play types, but extreme positive values (+3 or higher) occur more frequently for passes. This asymmetry—big successes more common than big disasters—is part of why passing is efficient despite its variability.

Why This Visualization Matters:

This plot communicates at a glance what would take paragraphs to explain in text. It shows not just that passing is more efficient on average, but how and why: passes have both higher central tendency and more upside potential. At the same time, it shows that passing is riskier—more variability means less predictability.

For coaches and analysts, this visualization helps answer questions like:
- "Should we pass more?" → Yes, it's more efficient on average
- "Is passing always better?" → No, substantial overlap means context matters
- "Why are pass-heavy offenses inconsistent?" → Higher variability creates boom-bust performance

Visualization Best Practices

This plot demonstrates several visualization best practices:

1. **Clear Labels**: Every axis, title, and legend is labeled clearly
2. **Reference Line**: The EPA = 0 line provides crucial context
3. **Color Choice**: Colors are distinguishable and colorblind-friendly
4. **Appropriate Limits**: X-axis limits (-5 to +5) focus on the relevant range
5. **Caption**: Explains what the plot shows and the data source
6. **Simplicity**: No unnecessary embellishment; focus on the data

Good visualizations communicate effectively without requiring extensive explanation. If you need multiple paragraphs to explain a plot, consider whether the plot is designed optimally.

The Impact of Analytics on Football

Understanding analytics isn't just an academic exercise—it has fundamentally changed how football is played, coached, and managed. Over the past 15 years, analytics has moved from the margins to the mainstream, influencing decisions at every level of the sport. Let's examine the concrete ways analytics has transformed football.

Team Decision-Making

Analytics has fundamentally changed how teams make decisions across multiple domains. What were once gut-feeling decisions informed by experience and conventional wisdom are now data-driven processes backed by rigorous analysis.

Fourth Down Decisions

Perhaps no area has seen more visible transformation than fourth-down strategy. Historically, teams were extremely conservative on fourth down—punting in almost all situations and only attempting conversions in obvious circumstances (4th-and-1 near the goal line). Analytics revealed that teams were punting far too often, leaving expected points on the field.

The Analytical Insight: Models comparing expected points from attempting vs. punting show that teams should "go for it" much more frequently than traditional coaching suggested. For example:
- 4th-and-3 at opponent's 35-yard line: Going for it adds ~0.5 expected points vs. punting
- 4th-and-1 at midfield: Going for it adds ~0.8 expected points vs. punting
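The structure of such a comparison is a simple weighted average. The sketch below is not a real model—every probability and EP value is an assumption chosen so the example lands near the ~0.5 EP edge quoted above for 4th-and-3 at the opponent's 35.

```python
# Sketch of the expected-points logic behind a fourth-down decision.
# All numbers are illustrative assumptions, not model outputs.

def ep_go_for_it(p_convert, ep_if_convert, ep_if_fail):
    """EP of attempting the conversion: a weighted average of the EP
    after a success and the EP after handing the ball over on a failure."""
    return p_convert * ep_if_convert + (1 - p_convert) * ep_if_fail

# 4th-and-3 at the opponent's 35 (hypothetical inputs):
ep_attempt = ep_go_for_it(
    p_convert=0.50,      # assumed conversion probability
    ep_if_convert=2.5,   # assumed EP with a fresh set of downs near the 32
    ep_if_fail=-1.2,     # assumed EP when the opponent takes over on downs
)
ep_punt = 0.15           # assumed EP after a punt pins the opponent deep

print(f"Go for it: {ep_attempt:+.2f} EP | Punt: {ep_punt:+.2f} EP")
print(f"Edge from going: {ep_attempt - ep_punt:+.2f} EP")
```

Real decision models estimate each of these inputs from historical data conditioned on down, distance, and field position; the arithmetic on top is exactly this simple.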

The Transformation: From 2013-2023, fourth-down attempt rates increased dramatically:
- 2013: Teams attempted conversion on ~25% of fourth downs in "go" situations
- 2023: Teams attempted conversion on ~35%+ of fourth downs in "go" situations
- Some teams (Eagles, Ravens, Lions) now attempt conversions 50%+ of the time

Results: Teams that adopted aggressive fourth-down strategies gained competitive advantages. The Baltimore Ravens, under coach John Harbaugh, have been particularly aggressive and successful on fourth down.

Two-Point Conversion Decisions

Analytics has also transformed two-point conversion strategy. The traditional approach was simple: kick the extra point except in very specific circumstances (down 8 late in game, etc.). Analytics revealed this was suboptimal.

The Analytical Insight: Extra points succeed ~94% of the time (1 point × 0.94 = 0.94 expected points). Two-point conversions succeed ~48-50% of the time (2 points × 0.48-0.50 = 0.96-1.00 expected points). Mathematically, two-point conversions are at least as valuable as extra points.
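The expected-points comparison in that paragraph is one line of arithmetic per option, using the success rates stated above:

```python
# Expected points of each try, using the success rates quoted in the text
ep_kick = 1 * 0.94       # extra point: 1 point at ~94% success
ep_two_low = 2 * 0.48    # two-point try at the low end of the quoted range
ep_two_high = 2 * 0.50   # ...and at the high end

print(f"Kick:      {ep_kick:.2f} expected points")
print(f"Two-point: {ep_two_low:.2f} to {ep_two_high:.2f} expected points")
```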

More sophisticated analysis considers game situation. Win probability models show optimal two-point decisions based on:
- Current score differential
- Time remaining
- Opponent's offensive strength
- Likelihood of future scoring opportunities

The Transformation: Teams increasingly use dynamic two-point decisions based on game state rather than fixed rules. The New England Patriots, Cleveland Browns, and other teams have made surprising two-point attempts that analytics supported but traditional wisdom opposed.

Play-Calling

Analytics has influenced not just special situations but every-down play-calling. The key insights:

Pass More on Early Downs: Historically, teams followed a "run on first down, pass on third down" approach. Analytics showed this was predictable and inefficient. Passing on first down, especially in neutral situations, is more efficient than running.

Situational Awareness: Analytics quantifies the value of different plays in different situations. For example:
- In goal-to-go situations, specific play types have much higher EPA than others
- On 3rd-and-long, certain play designs convert more often
- Against specific defensive structures, certain plays exploit weaknesses

Personnel Optimization: Analytics helps identify which personnel groupings are most effective against different defensive alignments, leading to more strategic substitution patterns.

The Transformation: Modern NFL offenses pass more on early downs, use more pre-snap motion (which analytics shows correlates with efficiency), and employ more diverse personnel groupings than historical offenses.

Personnel Decisions

Analytics has transformed how teams evaluate players, make draft decisions, and allocate salary cap resources.

Player Evaluation: Advanced metrics provide more complete player evaluation than traditional statistics:
- Quarterbacks: EPA, CPOE (Completion % Over Expected), pressure-adjusted metrics
- Receivers: Separation metrics, catch rate over expected, yards after catch
- Defensive players: EPA allowed, coverage metrics, pass rush win rate

Draft Strategy: Teams use analytics to:
- Project college performance to NFL (controlling for competition level)
- Identify undervalued positions (e.g., guards typically provide more value per draft slot than tackles, while running backs are often overvalued)
- Optimize draft pick trading (quantifying pick value)

Salary Cap Management: Analytics helps teams allocate cap space efficiently:
- Quantifying positional value (quarterbacks provide 2-3× more value than most positions)
- Identifying under-market contracts (players whose performance exceeds their salary)
- Timing contracts optimally (paying players before or after their peak?)

Analytics as Competitive Advantage

Early analytics adopters gained substantial competitive advantages:

**Philadelphia Eagles (2010s)**: The Eagles built one of the NFL's first comprehensive analytics departments, leading to aggressive fourth-down strategy, creative player evaluation, and ultimately a Super Bowl championship (2017).

**Baltimore Ravens**: Under GM Ozzie Newsome and coach John Harbaugh, the Ravens embraced analytics early, particularly for fourth-down decisions and player evaluation.

**Cleveland Browns**: Under chief strategy officer Paul DePodesta (featured in Moneyball for his baseball analytics work), the Browns built a substantial analytics department focused on integrating analytics into all decision-making processes.

As analytics has become universal, the competitive advantage now comes not from having analytics, but from:

1. **Integration**: How effectively is analytics integrated into decision-making?
2. **Sophistication**: Are you using cutting-edge methods or basic descriptive statistics?
3. **Communication**: Can analysts communicate effectively with coaches and executives?
4. **Culture**: Does the organization value data-driven decision-making?

Teams that excel in these areas maintain advantages even as analytics itself becomes universal.

Real-World Examples

Let's examine specific instances where analytics influenced outcomes:

1. Philadelphia Eagles' Fourth-Down Aggression (2017 Super Bowl)

In Super Bowl LII, the Eagles faced 4th-and-goal from the 1-yard line with 38 seconds left in the first half, leading 15-12. Traditional wisdom: kick the field goal for a 6-point lead and ball after halftime.

The Analytics: Models suggested going for touchdown:
- TD adds ~1.5 expected points vs. FG
- Win probability increases ~2-3% by going for it
- Psychological impact of scoring TD before half

The Result: The Eagles ran the famous "Philly Special" trick play—backup QB Nick Foles caught a touchdown pass. The Eagles won 41-33, and many credited their analytics-driven aggression as a key factor.

2. Fourth-Down Conversion Rate Changes

Between 2015 and 2023, league-wide fourth-down attempt rates increased by over 40%. Teams now attempt conversions in situations that would have been automatic punts a decade ago, and the expected points recovered by that shift accrued first to the teams that adopted analytics earliest.

3. Two-Point Conversion Chart

Multiple teams now use dynamic two-point conversion charts based on score, time, and win probability. For example, when trailing by 14 points after scoring a touchdown, some teams immediately attempt two-point conversion. If successful, they need only a TD+PAT to tie rather than TD+2pt. If unsuccessful, they need TD+2pt, but the earlier attempt provides information that informs later strategy.
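The trailing-by-14 case can be enumerated directly. The sketch below is a deliberately simplified model: it assumes the trailing team scores exactly two more touchdowns and the opponent doesn't score again, and the PAT and two-point probabilities are the illustrative ~94% and ~48% figures used earlier in the chapter.

```python
# Enumerate outcomes when trailing by 14, assuming the trailing team
# scores two touchdowns and the opponent doesn't score again.
# Probabilities are assumptions for illustration: PAT ~94%, 2-pt ~48%.
P_PAT, P_TWO = 0.94, 0.48

# Strategy A: kick after both touchdowns -> tie only if both PATs are good
a_tie = P_PAT * P_PAT
a_win = 0.0

# Strategy B: go for two after the first TD.
#   Convert (now down 6): second TD + PAT wins outright.
#   Fail (still down 8): second TD + 2-pt ties; a second failure loses.
b_win = P_TWO * P_PAT
b_tie = (1 - P_TWO) * P_TWO

# Value each strategy, counting a tie as a 50/50 overtime
value_a = a_win + 0.5 * a_tie
value_b = b_win + 0.5 * b_tie
print(f"Kick twice: win {a_win:.1%}, tie {a_tie:.1%} -> value {value_a:.3f}")
print(f"Go early:   win {b_win:.1%}, tie {b_tie:.1%} -> value {value_b:.3f}")
```

Under these assumptions, going for two early outperforms kicking twice, which is exactly the information-value argument in the text: attempting early converts some overtime coin flips into outright wins and tells you what you need on the second touchdown.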

What You'll Learn in This Book

This introduction has provided a foundation in football analytics concepts, history, and impact. But we've only scratched the surface. This textbook will guide you through the complete landscape of football analytics over 45 chapters organized into 10 thematic parts.

Each part builds on previous parts, gradually increasing in technical sophistication while maintaining focus on practical application and football context. Here's what awaits:

Part I: Foundations (Chapters 1-5)
- Data infrastructure and sources
- Programming fundamentals in R and Python
- Data manipulation and cleaning
- Visualization principles
- Statistical foundations

Part II: Offensive Analytics (Chapters 6-12)
- Passing analytics: QB evaluation and receiver metrics
- Rushing analytics: RB evaluation, offensive line impact
- Expected Points and EPA in depth
- Play-calling optimization
- Offensive line analytics

Part III: Defensive Analytics (Chapters 13-17)
- Coverage metrics and defensive back evaluation
- Pass rush analytics and line evaluation
- Run defense analysis
- Defensive play-calling
- Pressure and its impact

Part IV: Special Teams Analytics (Chapters 18-21)
- Field goal and PAT decisions
- Punting strategy and analytics
- Kickoff and return game
- Special teams value quantification

Part V: Game Theory and Strategy (Chapters 22-26)
- Fourth-down decision models
- Two-point conversion optimization
- Win probability models
- Clock management
- End-game strategy

Part VI: Personnel and Roster Management (Chapters 27-30)
- Player evaluation frameworks
- Draft analytics and prospect evaluation
- Salary cap optimization
- Free agency and trades
- Team building strategies

Part VII: Advanced Methods (Chapters 31-35)
- Machine learning for football
- Bayesian methods and hierarchical models
- Time series and forecasting
- Computer vision and tracking data
- Causal inference in football

Part VIII: College Football Analytics (Chapters 36-38)
- College football data sources
- Recruiting analytics
- College-to-pro projections
- Conference and competition adjustments

Part IX: Implementation and Communication (Chapters 39-42)
- Building analytics departments
- Communicating with coaches and executives
- Deploying models and dashboards
- Ethics and responsibilities in sports analytics

Part X: Future Directions (Chapters 43-45)
- Emerging technologies (AI, computer vision, wearables)
- Next-generation metrics
- The future of football analytics

Each chapter follows a consistent structure:
- Learning objectives
- Conceptual explanations with football context
- Code implementations in both R and Python
- Detailed interpretations connecting analysis to strategy
- Exercises to reinforce concepts
- Additional resources for deeper learning

Summary

Football analytics has evolved from a niche interest to an essential component of modern football. Understanding data science techniques and their application to football provides insights that traditional analysis misses and enables data-driven decision-making that creates competitive advantages.

In this introductory chapter, we've covered substantial ground:

Historical Context: We traced football analytics from early pioneers in the 1970s-1980s, through the Moneyball era of the 2000s, to the modern era where every team employs analytics professionals. This evolution reflects both technological advances (data availability, computational power) and cultural shifts (increasing acceptance of quantitative analysis).

Key Concepts: We introduced the fundamental concepts that underpin modern football analytics:
- Expected Points (EP): Point values assigned to every game situation based on historical scoring
- Expected Points Added (EPA): The value of each play measured by change in expected points
- Win Probability (WP): Likelihood of winning based on game situation
- Success Rate: Percentage of plays that positively contribute to scoring

These concepts provide the vocabulary and framework for all subsequent analyses in this textbook.

Technical Foundation: We set up development environments in both R and Python, installing packages and verifying functionality. You now have the tools needed to work with NFL data and perform professional-grade analyses.

First Analysis: We completed a legitimate analysis examining pass vs. run efficiency using 2023 NFL data. This analysis revealed that:
- Passes outnumber runs 59% to 41% in modern NFL
- Passes average +0.097 EPA vs. -0.038 EPA for runs
- This ~0.13 EPA gap represents 8-9 expected points per game
- Pass outcomes are more variable (higher risk, higher reward)
- Both play types fail more often than they succeed

Real-World Impact: We examined how analytics has transformed football decision-making in fourth-down situations, two-point conversions, play-calling, and personnel management. Teams that embrace analytics gain competitive advantages, and the sport continues to evolve as analytical insights become mainstream.

Looking Forward: This textbook will guide you through 45 chapters covering the complete spectrum of football analytics, from foundational concepts to cutting-edge techniques, from offensive to defensive to special teams analytics, from game theory to personnel management to advanced methods.

You're now ready to begin this journey into football analytics. The concepts and skills you'll learn have practical applications in football (team analytics departments, media, research) and beyond (any domain requiring data-driven decision-making under uncertainty).

Exercises

These exercises will reinforce concepts from the chapter and give you hands-on practice with football analytics. Start with conceptual questions to ensure understanding, then move to coding exercises that apply your skills.

Conceptual Questions

1. Historical Perspective

Research and describe one specific example (not covered in this chapter) where analytics led to a significant change in NFL strategy or team decision-making. Include:
- What the traditional approach was
- What analytics revealed
- How teams changed their behavior
- What the results were

Hint: Consider researching topics like: QB running on designed runs, going for it on 4th down in specific situations, trading draft picks, paying running backs.

2. EPA vs. Traditional Stats

Why is EPA considered a better metric than total yards for evaluating offensive performance? In your answer:
- Explain what EPA captures that yards don't
- Give a specific example where EPA and yards disagree
- Discuss when yards might still be useful despite EPA being "better"

3. Adoption Barriers

What factors might prevent teams from fully adopting analytics-driven decision-making? Consider:
- Organizational/cultural factors
- Risk and accountability issues
- Communication challenges
- Limitations of analytics itself

Discuss whether these barriers are justified or represent irrational resistance to change.

Coding Exercises

Exercise 1: Team EPA Analysis

**Objective**: Practice filtering, grouping, and aggregating data while learning about team-level performance.

**Task**: Load the 2023 season data and calculate:

a) The average offensive EPA per play for each team (use `posteam` to identify the offensive team)
b) The average defensive EPA per play for each team (use `defteam` to identify the defensive team)
c) Create a table ranking teams by offensive EPA
d) Create a table ranking teams by defensive EPA (remember: lower defensive EPA is better)

**Hints**:

- Filter to only pass and run plays with valid EPA
- For offense, group by `posteam`; for defense, group by `defteam`
- Use `arrange()` (R) or `.sort_values()` (Python) to sort
- Consider calculating both total plays and EPA per play

**Expected Output**: Two tables showing all 32 teams ranked by offensive and defensive efficiency.

**Extension**: Which teams rank highly on both offense and defense? Which teams rank poorly on both? Does this match your perception of good vs. bad teams?

Exercise 2: Success Rate by Down

**Objective**: Practice calculating percentages and understand how play efficiency varies by situation.

**Task**: Calculate the success rate (EPA > 0) for each down (1st, 2nd, 3rd, 4th) and create a bar chart visualizing the results.

**Requirements**:

- Filter to pass and run plays with valid EPA
- Group by down number
- Calculate success rate and total plays for each down
- Create a bar chart showing success rate by down
- Add appropriate labels and a title

**Hints**:

- Use `geom_col()` (R) or `ax.bar()` (Python) for bar charts
- Format the y-axis as a percentage
- Consider why success rates might vary by down

**Expected Output**: A bar chart with 4 bars (one per down) showing success rates ranging from roughly 35-48%.

**Extension**: Calculate separate success rates by down AND play type (pass vs. run). Do passes maintain their advantage on all downs, or does it vary?

Exercise 3: Win Probability Analysis

**Objective**: Learn about win probability and practice filtering by game situation.

**Task**:

a) Find all plays in the 2023 season where win probability changed by more than 20% (absolute value)
b) Examine the top 10 plays by WP swing
c) What types of plays create the biggest WP swings?

**Hints**:

- Use the `wpa` (Win Probability Added) variable
- Filter to `abs(wpa) > 0.20`
- Look at variables like `play_type`, `desc`, `qtr`, and `score_differential`

**Expected Output**: A table showing the most high-leverage plays of the season with their descriptions.

**Extension**: During what parts of games (quarter, score situation) do the biggest WP swings occur? Create a visualization showing the WP swing distribution by quarter.

Further Reading

To deepen your understanding of football analytics concepts and methods, we recommend these resources:

Books

  • Burke, B. (2019). The Numbers Game: Why Everything You Know About Football Is Wrong. New York: Regan Arts. - Accessible introduction to football analytics concepts for general audiences.

  • Alamar, B. (2013). Sports Analytics: A Guide for Coaches, Managers, and Other Decision Makers. Columbia University Press. - Covers sports analytics broadly with football examples throughout.

  • Lopez, M., & Matthews, G. (2020). Big Data and Baseball. CRC Press. - Though focused on baseball, methodological lessons translate directly to football.

Academic Papers

  • Lopez, M., Baumer, B., & Matthews, G. (2018). "How often does the best team win? A unified approach to understanding randomness in North American sport." The Annals of Applied Statistics, 12(4), 2483-2516. - Examines how much outcome vs. process matters across sports.

  • Romer, D. (2006). "Do Firms Maximize? Evidence from Professional Football." Journal of Political Economy, 114(2), 340-365. - Classic paper showing teams punt too often on fourth down.

  • Yurko, R., Ventura, S., & Horowitz, M. (2019). "nflWAR: A Reproducible Method for Offensive Player Evaluation in Football." Journal of Quantitative Analysis in Sports, 15(3), 163-183. - Develops player value metric similar to baseball's WAR.

Online Resources

Continuing Education

  • DataCamp/Coursera: Both platforms offer courses on R, Python, data visualization, and statistics using sports examples.

  • R for Data Science: https://r4ds.hadley.nz/ - Free online book covering tidyverse fundamentals (Hadley Wickham & Garrett Grolemund).

  • Python for Data Analysis: Book by Wes McKinney (creator of pandas) covering data manipulation in Python.

References

:::