Chapter 5: Football Data Visualization

[Figure: Play Type Distribution. A pie chart showing the breakdown of different play types throughout an NFL season.]

[Figure: Success Rate by Down & Distance. A heatmap showing how success rates vary by down and yards to go. Darker green indicates higher success.]

Learning Objectives

By the end of this chapter, you will be able to:

Understand and apply principles of effective data visualization to football analytics
Master ggplot2 (R) and matplotlib/seaborn (Python) for creating football visualizations
Select appropriate chart types for different data relationships and analytical questions
Create publication-quality football visualizations with proper color, labels, and annotations
Design interactive visualizations for presentations and web-based reports
Apply color theory and design principles specifically for football data
Use team colors and logos effectively with nflplotR and related tools
Tell compelling stories with football data through visual narratives

Introduction

In the high-stakes world of football analytics, your ability to communicate insights through effective visualization is just as critical as your analytical skills. Imagine you've discovered that a team's fourth-down decision-making costs them three expected points per game, a significant tactical disadvantage. If you present this finding in a dense spreadsheet with hundreds of rows, decision-makers' eyes will glaze over. But if you create a compelling visualization that instantly reveals the pattern, you can change coaching strategy and impact game outcomes.

This chapter explores the art and science of data visualization specifically tailored for football analytics. We'll move beyond generic charting to create visualizations that speak the language of football—using team colors, incorporating logos, and designing graphics that resonate with coaches, front office executives, and fans alike. Whether you're preparing a presentation for an NFL analytics department, writing a research paper, or creating content for social media, the visualization skills you develop here will be essential.

Effective football visualization requires balancing multiple concerns. Your graphics must be technically accurate, visually appealing, and immediately interpretable. They need to work for audiences ranging from statistically sophisticated analysts to coaches who think in terms of yards and points, not regression coefficients. They must look professional in PowerPoint presentations, academic papers, and Twitter posts. Most importantly, they must reveal insights—not just display data.

Throughout this chapter, we'll build from fundamental principles to advanced techniques. We'll start by understanding what makes a visualization effective, explore the grammar of graphics that underlies modern visualization tools, and then work through practical examples using real NFL data. By the end, you'll be able to create visualizations that would be at home in an NFL front office, a peer-reviewed journal, or on the desk of an ESPN analyst.

Why Visualization Matters in Football Analytics

Good visualization serves multiple critical functions in football analytics:

- **Pattern Recognition**: Makes complex patterns immediately apparent that would be hidden in tables of numbers
- **Decision Support**: Facilitates rapid decision-making under pressure (in-game analytics, draft decisions)
- **Communication**: Bridges the gap between analysts and non-technical stakeholders like coaches and executives
- **Credibility**: Enhances professionalism and trust in your analytical work
- **Exploration**: Enables interactive discovery of insights in your data
- **Persuasion**: Convinces decision-makers to act on your recommendations
- **Memory**: Creates memorable insights that stick with audiences long after presentations end

The most brilliant analysis is worthless if it can't be communicated effectively. Visualization is your primary tool for turning data into decisions.

Principles of Effective Visualization

Before we write a single line of code, we must internalize the principles that separate exceptional visualizations from mediocre ones. These principles transcend any particular tool or programming language—they apply whether you're using R, Python, Tableau, or even hand-drawing charts.

The Three Pillars of Good Visualization

Every effective data visualization rests on three fundamental pillars: clarity, accuracy, and aesthetics. Understanding these principles will guide every visualization decision you make throughout your career.

Pillar 1: Clarity – The Message Must Be Immediate

Clarity means your visualization's primary message should be apparent within 3-5 seconds of viewing. Your audience shouldn't need to study the graphic, squint at tiny labels, or puzzle over what the axes represent. This is especially critical in football analytics, where coaches may be reviewing your work between meetings or executives scanning reports before important decisions.

How to Achieve Clarity:

Remove Chartjunk: Eliminate decorative elements that don't convey information. Edward Tufte, the visualization pioneer, calls these elements "chartjunk"—grid lines you don't need, background colors that distract, or 3D effects that obscure data. Every element should serve a purpose.
Use Appropriate Chart Types: A scatter plot reveals correlations; a line chart shows trends over time; a bar chart compares categories. Choosing the wrong type creates confusion. We'll explore chart type selection in detail shortly.
Provide Clear Labels: Axes should have descriptive labels with units specified. Titles should be informative, not just "Figure 1." A good title tells readers what they're seeing: "Pass EPA Outperforms Rush EPA by 0.13 Points per Play" is better than "EPA by Play Type."
Ensure Readability: Use font sizes large enough to read (minimum 10-12 points for body text). Maintain sufficient contrast between text and background. Avoid light yellow text on white backgrounds or dark blue on black.

The Five-Second Test

After creating a visualization, step away from your computer for a few minutes. Then return and look at your graphic for exactly five seconds. Can you immediately identify:

1. What the visualization is about?
2. What the main pattern or finding is?
3. What the axes represent?

If not, your visualization needs more clarity. This simple test prevents countless hours of confusion for your audience.

Pillar 2: Accuracy – The Visualization Must Not Mislead

Accuracy means your visualization represents the data truthfully. This goes beyond having correct numbers—it means the visual encoding (how data maps to visual properties) doesn't distort perception. Misleading visualizations, whether intentional or accidental, undermine trust and can lead to terrible decisions.

Common Accuracy Pitfalls and Solutions:

Truncated Axes: When comparing magnitudes (like team totals), axes should generally start at zero. If Team A has 350 total points and Team B has 340, showing only the 340-350 range makes the difference appear enormous when it's actually modest (2.9%). However, when showing changes or differences, non-zero axes may be appropriate—just be explicit about it.
Inappropriate Scales: Using logarithmic scales when your audience expects linear scales can confuse interpretation. Be explicit about scale choice in your labels.
Distorted Aspect Ratios: Stretching or squashing charts can exaggerate or minimize trends. Line charts are particularly sensitive to this—a 45-degree slope suggests balanced change, but you can make any trend look dramatic by manipulating the aspect ratio.
Cherry-Picked Data: Showing only a subset of data without disclosure misleads readers. If you're showing "Top 10 Offenses," make that clear—don't imply you're showing all teams.
Missing Context: Showing raw statistics without accounting for game situation, opponent quality, or sample size can mislead. Always provide necessary context.

Show Uncertainty When Relevant:

Football is probabilistic, not deterministic. When your analysis involves estimates (like player projections) or small sample sizes (like a quarterback's first three games), show confidence intervals or error bars. This honesty builds trust and prevents overconfident decision-making.
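To make the sample-size point concrete, the sketch below computes a mean and an approximate 95% interval for a small and a larger set of per-play EPA values. The numbers are hypothetical, `mean_with_ci` is an illustrative helper rather than part of any library, and the interval uses a normal approximation (z = 1.96); for very small samples a t-based interval would be somewhat wider.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * se, mean + z * se


# Hypothetical per-play EPA values for a small sample (illustrative only)
small_sample = [0.8, -0.5, 1.2, -1.1, 0.3, 0.6, -0.2, 0.9]
# The same values repeated ten times: identical mean, ten times as many plays
large_sample = small_sample * 10

for label, sample in [("8 plays", small_sample), ("80 plays", large_sample)]:
    mean, lower, upper = mean_with_ci(sample)
    print(f"{label}: mean EPA = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Both samples have the same mean, but the interval for the larger sample is much narrower. Drawing those intervals as error bars shows viewers how much weight an estimate can bear.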

The Misleading Y-Axis

One of the most common visualization mistakes in sports media is the truncated y-axis used to exaggerate small differences.

**Example**: Imagine two quarterbacks with passer ratings of 98.5 and 96.2. If you create a bar chart with a y-axis running from 95 to 100, the difference looks massive: the bars rise 3.5 and 1.2 units above the baseline, so one bar is nearly three times as tall as the other, even though the ratings differ by only about 2.4%. But if the axis runs from 0 to 158.3 (the maximum possible passer rating), the difference appears appropriately modest.

**When to Truncate**: It's acceptable to truncate axes when showing *changes* or *differences* rather than total magnitudes, but always label clearly so readers understand what they're seeing.
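The size of the distortion in the passer-rating example can be computed directly. The sketch below compares the ratio of drawn bar heights when the axis starts at 0 and when it starts at 95; `apparent_ratio` is a hypothetical helper written for this illustration.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the axis begins at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)


rating_a, rating_b = 98.5, 96.2

# Bars measured from zero: heights are the ratings themselves
full_axis = apparent_ratio(rating_a, rating_b, axis_start=0.0)
# Bars measured from 95: heights are 3.5 and 1.2
truncated = apparent_ratio(rating_a, rating_b, axis_start=95.0)

print(f"Axis from 0:  taller bar is {full_axis:.2f}x the shorter one")
print(f"Axis from 95: taller bar is {truncated:.2f}x the shorter one")
```

Running a check like this before publishing a bar chart is a quick way to see whether the picture overstates the underlying numbers.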

Pillar 3: Aesthetics – The Visualization Should Be Visually Appealing

Aesthetics might seem superficial compared to accuracy, but visual appeal directly impacts how seriously your work is taken. A polished, professionally designed visualization signals that you care about quality and details. It makes your analysis more persuasive and memorable.

Elements of Aesthetic Excellence:

Thoughtful Color Choices: Colors should be purposeful, not arbitrary. Use team colors for authenticity, diverging color schemes (red-white-green) to show bad/neutral/good performance, and sequential schemes (light to dark blue) to show increasing quantities. Aim for 3-5 colors per visualization to avoid overwhelming viewers.
Consistent Styling: All visualizations in a report or presentation should share a consistent aesthetic—same fonts, same color schemes, same axis formatting. This consistency appears professional and makes it easier for audiences to switch between graphics.
Balanced White Space: Don't cram too much into one graphic. White space (empty space around and within visualizations) helps viewers focus and prevents visual clutter. It's okay to have margins and padding—they make your content more digestible.
Audience Consideration: Design for your specific audience. Academic journals typically prefer conservative, clean designs. Social media graphics can be bolder and more stylized. Presentations to coaches might emphasize practical takeaways over statistical nuance.

The Power of Polish:

Professional sports organizations evaluate analysts partly on presentation skills. Two analysts might produce identical statistical findings, but the one who can package those findings in compelling visualizations will have greater impact. Aesthetics matter.

Professional Quality Standards

In NFL analytics roles, your visualizations will be seen by head coaches, general managers, and owners, people who make million-dollar decisions. Your graphics should meet professional standards:

- Use high-resolution outputs (300 DPI for print, 150 DPI for presentations)
- Eliminate typos and grammatical errors in labels
- Maintain brand consistency (use team colors and logos appropriately)
- Test visualizations with diverse viewers before finalizing
- Export in appropriate formats (vector formats like PDF for publications, PNG for presentations)

The difference between good and great is often just 15 minutes of polishing, but that difference can determine whether your recommendations are implemented.
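The DPI guidance above translates into pixel dimensions by simple multiplication. As a minimal sketch, assuming an illustrative 10 x 6 inch figure, the hypothetical helper `pixel_dimensions` shows what each setting produces for a raster export such as PNG:

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel size of a raster export for a figure measured in inches."""
    return round(width_in * dpi), round(height_in * dpi)


# Illustrative figure size in inches; the DPI values follow the guidance above
for use, dpi in [("print", 300), ("presentation", 150)]:
    width_px, height_px = pixel_dimensions(10, 6, dpi)
    print(f"{use}: {dpi} DPI -> {width_px} x {height_px} pixels")
```

Vector formats such as PDF are not tied to a pixel grid, which is why they scale cleanly in publications.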

Choosing the Right Chart Type

One of the most important visualization decisions is selecting the appropriate chart type for your data and message. Different chart types excel at revealing different patterns. Using the wrong type obscures insights or confuses viewers.

The following table provides guidance for common data relationships in football analytics:

Data Relationship	Best Chart Type	Football Analytics Example	When to Use
Distribution of a single variable	Histogram, Density Plot	Distribution of EPA values across all plays	When you want to understand the shape, spread, and central tendency of a variable
Comparison of categories	Bar Chart (vertical or horizontal)	Team offensive EPA rankings; QB completion percentages by team	When comparing discrete categories on a single metric
Relationship between two continuous variables	Scatter Plot	Pass EPA vs. Rush EPA by team; QB attempts vs. yards per attempt	When exploring correlations or patterns between two numeric variables
Trend over time	Line Chart	Win probability throughout a game; team performance across season	When showing how something changes sequentially over time or progression
Part-to-whole composition	Stacked Bar Chart, Pie Chart (use sparingly)	Play type distribution (% pass vs. run) by team	When showing how categories combine to make a whole
Comparing distributions	Box Plot, Violin Plot, Overlaid Density Plots	Comparing EPA distributions for pass vs. run plays	When comparing the full distribution shape across categories
Multiple variables across categories	Faceted (Small Multiples) Plots	EPA distributions by down (1st, 2nd, 3rd, 4th) and play type	When showing the same relationship across many subgroups
Statistical uncertainty	Error Bars, Confidence Intervals	Team EPA with 95% confidence intervals	When sample size is small or estimates have uncertainty
Geographic/spatial patterns	Heat Maps, Geographic Maps	Field position analysis; where on the field plays succeed	When location or position matters to the analysis

Principles for Chart Selection:

Match the Visual to the Data Structure: Categorical data (teams, play types) work well with bar charts. Continuous data (EPA, yards gained) suit histograms, density plots, or scatter plots.
Consider Your Primary Question: If asking "which team is best?", use a ranked bar chart. If asking "how are these two stats related?", use a scatter plot. The question guides the choice.
Think About Comparisons: Bar charts make categorical comparisons easy. Line charts work better for temporal comparisons. Scatter plots excel at multidimensional comparisons.
Avoid Overused Types: Pie charts are almost always inferior to bar charts for showing proportions (human vision is better at comparing lengths than angles). 3D charts add no information while making data harder to read. Avoid both.

Why Avoid Pie Charts?

Pie charts are ubiquitous in business presentations but rarely the best choice. Here's why:

**The Problem**: Human visual perception is much better at comparing lengths (as in bar charts) than angles or areas (as in pie charts). When you have more than 2-3 categories, pie charts become very difficult to read accurately.

**Example**: Try to determine if a pie slice representing 22% is bigger than one representing 20%. Now imagine the same comparison with bar lengths: it's instantly obvious.

**When They're Acceptable**: Pie charts work for showing simple binary splits (win percentage: 65% wins, 35% losses) or when precise comparison isn't needed (general sense of distribution). Even then, a bar chart or simple statistic usually works better.

**Better Alternative**: Use horizontal bar charts to show part-to-whole relationships. They're easier to label, easier to compare, and can display many more categories without becoming cluttered.

Color Theory for Football Visualizations

Color is one of the most powerful, and most frequently misused, elements of data visualization. In football analytics, color serves double duty: it must encode data accurately while also connecting to the visual language of football (team colors, and conventional associations such as red for poor performance and green for good performance).

Understanding Color Types

Sequential Colors: Use when data has a natural order from low to high. Examples include light blue to dark blue, or white to dark green. Sequential colors work well for continuous variables like EPA, yards gained, or win probability.

Football Example: Showing team success rates from low (light gray) to high (dark green)
When to Use: Any metric where "more" has a consistent meaning (more EPA, more wins, higher efficiency)

Diverging Colors: Use when data has a meaningful midpoint with extremes in both directions. Typically combines two sequential scales meeting at a neutral center. Common schemes include red-white-blue or red-yellow-green.

Football Example: EPA (negative to positive), or performance relative to league average
When to Use: Metrics where zero is meaningful (above/below average, positive/negative), or when comparing to a benchmark
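One way to see how a diverging scale works is to build one by hand. The sketch below maps a value such as EPA per play to a color by interpolating from a white midpoint at zero toward a dark red (negative) or a dark blue (positive). The helper names, the hex colors, and the `limit` of 0.3 are assumptions chosen for illustration; plotting libraries provide ready-made diverging palettes that do this for you.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of integers to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def diverging_color(value, limit, low="#b2182b", mid="#ffffff", high="#2166ac"):
    """Map value to a color on a scale that is symmetric around zero."""
    # Clamp to [-1, 1] so outliers beyond +/- limit do not stretch the scale
    t = max(-1.0, min(1.0, value / limit))
    start = hex_to_rgb(mid)
    end = hex_to_rgb(high if t >= 0 else low)
    # Linear interpolation from the neutral midpoint toward one extreme
    return rgb_to_hex(tuple(round(s + (e - s) * abs(t)) for s, e in zip(start, end)))


for epa in (-0.3, -0.1, 0.0, 0.1, 0.3):
    print(f"EPA/play {epa:+.1f} -> {diverging_color(epa, limit=0.3)}")
```

Using the same `limit` on both sides keeps the midpoint anchored at zero, so equal distances above and below average receive equally intense colors.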

Categorical Colors: Use distinct hues for different categories with no inherent order. Each category gets a clearly different color. Limit to 5-7 categories maximum for visibility.

Football Example: Different colors for AFC vs. NFC, or for different positions
When to Use: Nominal categories (team names, positions, play types) where there's no natural ordering

Team Colors: NFL teams have established color identities. Using authentic team colors adds recognition value and professionalism to your visualizations.

Football Example: Using Kansas City red for Chiefs data points, Miami teal for Dolphins
When to Use: Any team-specific visualization, especially when showing multiple teams simultaneously

Best Practices for Color Use

1. Limit Your Palette

Use no more than 5-7 colors in a single visualization. More colors make it impossible for viewers to distinguish categories or track patterns. If you need to show 32 NFL teams, consider:

Showing only top 10 and bottom 10
Using grayscale for most teams and highlighting 2-3 teams of interest in color
Creating small multiples (separate panels for different groups)

2. Ensure Colorblind Accessibility

Approximately 8% of males and 0.5% of females have some form of color vision deficiency (most commonly red-green colorblindness). Your visualizations should be readable for these viewers.

Solutions:
- Use colorblind-friendly palettes (many visualization libraries include these: viridis, ColorBrewer sets)
- Don't rely solely on color to convey information—use shapes, line types, or labels as well
- Test your visualizations with colorblind simulation tools
- Avoid red-green combinations for critical comparisons

3. Use Color to Highlight, Not Decorate

Every color should have a purpose. Don't add colors just because your software offers 256 options. Strategic use of color directs attention:

Gray out less important elements, use color for your key finding
Use a single bright color to highlight teams or players of interest
Reserve red for warning or bad performance, green for good (matching common conventions)
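The gray-out-and-highlight approach amounts to a simple color assignment. The sketch below is one minimal way to express it; `highlight_colors`, the team list, and the two hex colors are assumptions made for this example.

```python
def highlight_colors(teams, highlighted, accent="#1f77b4", muted="#bdbdbd"):
    """Assign the accent color to highlighted teams and gray to the rest."""
    return {team: (accent if team in highlighted else muted) for team in teams}


# Hypothetical selection: emphasize two teams, mute the others
teams = ["KC", "SF", "BAL", "DAL", "MIA", "NYJ"]
colors = highlight_colors(teams, highlighted={"KC", "SF"})

for team in teams:
    print(team, colors[team])
```

The resulting mapping can be handed to a plotting call as a list of per-bar or per-point colors, for example `[colors[t] for t in teams]`.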

4. Consider Grayscale Printing

Many journals and reports are printed in black and white. Your visualizations should remain interpretable when printed without color:

Test by converting to grayscale before finalizing
Use different line styles (solid, dashed, dotted) not just colors
Include clear labels so color isn't the only distinguishing feature
Use patterns or fills in addition to colors for bars
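A quick numeric check can flag color pairs that will collapse into the same gray. The sketch below approximates grayscale conversion with the common luma weights 0.299, 0.587, and 0.114 for red, green, and blue; actual printers and converters may use a different formula, so treat the result as a rough guide. `gray_level` and the sample colors are assumptions for illustration.

```python
def gray_level(hex_color):
    """Approximate gray value (0-255) of a '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    # Weighted sum of channels; green contributes most to perceived brightness
    return 0.299 * r + 0.587 * g + 0.114 * b


samples = {"red": "#ff0000", "green": "#008000", "navy": "#000080", "gold": "#ffd700"}

for name, hex_color in samples.items():
    print(f"{name:>5}: gray level {gray_level(hex_color):.0f}")
```

In this example the red and the green land on nearly the same gray level, so they would be hard to tell apart in a black-and-white printout, while navy and gold remain clearly distinct.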

Using NFL Team Colors

The nflplotR package in R and similar tools in Python provide access to official NFL team colors. This adds authenticity and immediate recognition to your visualizations.

Benefits:
- Instant Recognition: Viewers immediately identify teams by their colors
- Professional Appearance: Shows attention to detail and domain knowledge
- Brand Consistency: Aligns with how teams are presented in all NFL media

Considerations:
- Some team colors are very similar (multiple teams use red, blue, or black)
- Team logos work better than just colors when showing many teams simultaneously
- Ensure sufficient contrast between team colors and backgrounds

Common Color Mistakes in Football Analytics

**Rainbow Color Schemes for Continuous Data**: Using rainbow colors (red-orange-yellow-green-blue-violet) for continuous metrics is problematic because:
- The perceptual distance between colors is uneven (yellow appears brighter than blue)
- There's no natural "middle" to a rainbow
- They're not colorblind-friendly
- **Better Alternative**: Use sequential (light-to-dark) or diverging (red-white-blue) schemes

**Too Many Colors**: Showing all 32 NFL teams in different colors creates visual chaos. Viewers can't track which color belongs to which team.
- **Better Alternative**: Focus on top/bottom teams, or use team logos instead of colors

**Poor Contrast**: Light yellow on white backgrounds, or dark blue on black. These combinations are nearly invisible.
- **Better Alternative**: Test contrast ratios (aim for at least 4.5:1), use darker or lighter variants

**Ignoring Colorblind Accessibility**: About 1 in 12 males has some form of color vision deficiency, most often difficulty distinguishing red from green.
- **Better Alternative**: Use colorblind-safe palettes, or supplement color with shapes/patterns
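The 4.5:1 contrast target can be checked in a few lines. The sketch below follows the relative-luminance and contrast-ratio definitions from the Web Content Accessibility Guidelines (WCAG) 2.x as best understood here; the function names and sample colors are assumptions for illustration, and a dedicated accessibility checker is the safer choice for final review.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a '#rrggbb' color (0 = black, 1 = white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black/white)."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


checks = [
    ("black on white", "#000000", "#ffffff"),
    ("light yellow on white", "#ffffcc", "#ffffff"),
    ("dark blue on black", "#00008b", "#000000"),
]
for label, fg, bg in checks:
    ratio = contrast_ratio(fg, bg)
    verdict = "ok" if ratio >= 4.5 else "too low"
    print(f"{label}: {ratio:.2f}:1 ({verdict})")
```

Both problem combinations named above fall well short of 4.5:1, while black on white reaches the maximum possible ratio.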

Grammar of Graphics: The ggplot2 Philosophy

Before diving into code, we need to understand the conceptual framework that underlies modern data visualization. The Grammar of Graphics, developed by statistician Leland Wilkinson and implemented beautifully in R's ggplot2 package, provides a systematic way to think about building visualizations.

Even if you primarily use Python, understanding this grammar will make you a better data visualizer. It shifts your thinking from "what chart template should I use?" to "how should I map my data to visual properties?" This shift enables you to create novel, customized visualizations that perfectly match your analytical needs.

The Core Components

The Grammar of Graphics breaks every visualization into seven fundamental components. Like combining words in sentences, you combine these components to create meaningful graphics:

1. Data: The dataset you want to visualize. In football analytics, this is typically play-by-play data, team statistics, or player performance metrics.

2. Aesthetics (aes): Mappings from data variables to visual properties. Common aesthetics include:
- x and y: position on the plot
- color: the color of elements
- fill: the fill color (for bars, areas)
- size: the size of points or lines
- shape: the shape of points (circle, triangle, square)
- alpha: transparency (0 = invisible, 1 = opaque)

3. Geometries (geom): The visual representations of data—points, lines, bars, areas, etc. Each geometry corresponds to a type of mark you draw on the plot:
- geom_point(): scatter plot points
- geom_line(): lines connecting points
- geom_bar() or geom_col(): bar charts
- geom_histogram(): histograms
- geom_density(): density curves
- geom_boxplot(): box plots

4. Scales: Control how data values map to aesthetic properties. Scales manage:
- Axis limits and breaks (tick marks)
- Color palettes
- Size ranges
- Transformations (log scale, square root scale)

5. Facets: Create small multiples—separate panels for different subsets of data. Faceting allows you to show how patterns vary across categories without overlapping too much data.

6. Coordinate Systems: Define how data positions map to the plot plane. Usually Cartesian (standard x/y axes), but can be polar (for radial plots) or geographic (for maps).

7. Themes: Control the overall visual appearance—fonts, background colors, grid lines, axis styling. Themes handle the "polish" that makes visualizations professional.

The Layered Approach to Building Graphics

The power of the Grammar of Graphics is its compositional nature. You build complex visualizations by layering simple components. This is like constructing sentences: you start with a subject and verb, then add modifiers, clauses, and punctuation.

Basic Template:

ggplot(data = <DATA>) +                      # Initialize with data
  aes(x = <VARIABLE>, y = <VARIABLE>) +      # Define aesthetic mappings
  geom_<TYPE>() +                            # Add geometric layer
  scale_<AESTHETIC>_<TYPE>() +               # Customize scales
  facet_<TYPE>(~<VARIABLE>) +                # Create facets (optional)
  labs(title = "...", x = "...", y = "...") + # Add labels
  theme_<NAME>()                             # Apply theme

Example Build-Up:

Let's see how we build a complex visualization layer by layer to understand EPA distribution:

# Layer 1: Just the data and axes (blank canvas)
ggplot(pbp_data, aes(x = epa))

# Layer 2: Add geometry (now we see the histogram)
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram()

# Layer 3: Customize the bins and colors
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7)

# Layer 4: Add a reference line at zero
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red")

# Layer 5: Improve labels
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Expected Points Added",
    x = "EPA",
    y = "Number of Plays"
  )

# Layer 6: Apply professional theme
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Expected Points Added",
    subtitle = "2023 NFL Regular Season",
    x = "EPA",
    y = "Number of Plays",
    caption = "Data: nflfastR"
  ) +
  theme_minimal()

Each layer adds clarity and professionalism. This iterative approach lets you refine visualizations systematically rather than guessing at parameters.

Think in Layers, Not Templates

Many people approach visualization by searching for templates: "bar chart with error bars" or "scatter plot with trend line." This works for simple cases but limits creativity. Instead, think in layers:

1. What data do I want to show?
2. What visual properties should encode which variables?
3. What geometric marks best represent my data?
4. What refinements make the message clearer?

This layered thinking lets you create custom visualizations perfectly suited to your specific analytical question. You're not constrained by pre-existing templates; you compose your own visual arguments.

Why This Matters for Football Analytics

The Grammar of Graphics is particularly valuable for football analytics because football data is complex and multidimensional. You're rarely just plotting one variable. Instead, you're typically showing:

EPA by play type (two variables: one continuous, one categorical)
Performance across down and distance situations (three or more variables)
Team statistics with uncertainty estimates (data + error)
Time-series patterns with seasonal trends (temporal data)
Spatial patterns on the field (coordinate data)

The grammar gives you a systematic framework for handling this complexity. Instead of forcing your data into predefined chart types, you map your variables to appropriate aesthetic properties and geometries. This flexibility is essential for revealing the patterns hidden in football's intricate data.

Grammar of Graphics in Python

While the Grammar of Graphics was popularized through R's `ggplot2`, similar principles apply in Python:

- **plotnine**: A direct port of ggplot2 to Python, using almost identical syntax
- **Altair**: Implements the Vega-Lite grammar (similar conceptual framework)
- **Seaborn**: Higher-level interface that handles common cases while still being compositional
- **Matplotlib**: The foundation; more imperative than declarative, but still compositional

Even when using matplotlib's more procedural approach, thinking in terms of data-to-aesthetic mappings will improve your visualizations. The grammar is a mental framework, not just a library.

Setting Up Your Visualization Environment

Now that we understand the principles and theory, let's set up our programming environment for creating football visualizations. We'll load the necessary packages and configure sensible defaults that will save time throughout this chapter.

Loading Visualization Libraries

Different packages serve different purposes in the visualization ecosystem. Understanding what each contributes will help you choose the right tool for each situation.

#| label: setup-r
#| message: false
#| warning: false
#| cache: true

# Core data manipulation and visualization
# tidyverse: Meta-package including ggplot2, dplyr, tidyr, and more
# This is the foundation for modern R data analysis
library(tidyverse)

# Football data access
# nflfastR: Provides clean, analysis-ready NFL play-by-play data
# This package is maintained by the nflverse community
library(nflfastR)

# NFL-specific visualization enhancements
# nflplotR: Adds team logos, colors, and NFL-specific geometries to ggplot2
# Makes it easy to create professional NFL graphics
library(nflplotR)

# Additional visualization packages
# scales: Provides formatting functions for axes (percent, comma, dollar signs)
library(scales)

# patchwork: Combines multiple ggplot2 plots into composed layouts
# Essential for creating multi-panel figures
library(patchwork)

# plotly: Converts ggplot2 graphics to interactive web-based visualizations
# Great for presentations and exploratory analysis
library(plotly)

# ggrepel: Intelligent label placement that avoids overlaps
# Crucial for scatter plots with many labeled points
library(ggrepel)

# gt: Creates publication-quality tables
# While not strictly visualization, tables often accompany graphics
library(gt)

# Set global ggplot2 theme for all subsequent plots
# theme_minimal() provides a clean, professional appearance
# base_size = 12 ensures readable text
theme_set(theme_minimal(base_size = 12))

# Confirm successful loading
cat("✓ R visualization packages loaded successfully\n")
cat("✓ Default theme set to theme_minimal() with 12pt base font\n")
cat("✓ Ready to create NFL visualizations\n")

#| label: setup-py
#| message: false
#| warning: false
#| cache: true

# Core data manipulation
# pandas: The standard library for working with tabular data in Python
import pandas as pd

# numpy: Numerical computing library, essential for array operations
import numpy as np

# Football data access
# nfl_data_py: Python equivalent of nflfastR
# Provides access to nflverse data through Python
import nfl_data_py as nfl

# Visualization packages
# matplotlib: The foundational plotting library in Python
# Most other visualization libraries build on matplotlib
import matplotlib.pyplot as plt

# seaborn: Higher-level interface for statistical graphics
# Provides beautiful default styles and complex plot types
import seaborn as sns

# plotly for interactive visualizations
# plotly.express: High-level interface for quick interactive plots
import plotly.express as px
# plotly.graph_objects: Lower-level control for custom interactive plots
import plotly.graph_objects as go

# Configure matplotlib style
# Use seaborn's style system for better-looking default plots
plt.style.use('seaborn-v0_8-darkgrid')

# Set seaborn color palette
# 'husl' spaces hues evenly so categories are easy to tell apart
# It is not tuned for colorblind viewers; sns.set_palette("colorblind") is an alternative
sns.set_palette("husl")

# Configure pandas display options
# Show up to 50 columns when printing DataFrames before pandas truncates the display
pd.options.display.max_columns = 50
# Increase width to prevent line wrapping in console
pd.options.display.width = 120

# Set default figure size for all matplotlib plots
# (10, 6) provides good aspect ratio for most screens and presentations
plt.rcParams['figure.figsize'] = (10, 6)

# Set DPI (dots per inch) for sharper on-screen display
# 100 DPI is good for screens; increase to 300 for print-quality
plt.rcParams['figure.dpi'] = 100

# Confirm successful setup
print("✓ Python visualization packages loaded successfully")
print("✓ matplotlib configured with seaborn-darkgrid style")
print(f"✓ Default figure size set to {plt.rcParams['figure.figsize']}")
print("✓ Ready to create NFL visualizations")

**Why These Specific Packages?**

In **R**, the visualization ecosystem centers on `ggplot2` (part of the `tidyverse`). This package implements the Grammar of Graphics we discussed earlier. We supplement it with:
- `nflplotR` for football-specific elements (team logos, colors)
- `scales` for formatting axes professionally
- `patchwork` for combining multiple plots
- `plotly` for adding interactivity
- `ggrepel` for smart label placement

In **Python**, the ecosystem is more fragmented. We use:
- `matplotlib` as the foundation (nearly everything builds on it)
- `seaborn` for attractive statistical graphics
- `plotly` for interactive visualizations
- `nfl_data_py` for accessing the same nflverse data available in R

**Configuration Choices:**

We set sensible defaults that will make all subsequent visualizations look professional:
- **Theme/Style**: Clean, minimal themes avoid visual clutter
- **Figure Size**: (10, 6) inches works well for most screens and presentations
- **Font Sizes**: 12-point base font ensures readability
- **Color Palettes**: The Python setup uses seaborn's `husl` palette, which spaces hues evenly but is not specifically colorblind-safe; switch to a colorblind-friendly palette (such as viridis or a ColorBrewer set) when accessibility matters

These defaults save time and ensure consistency. You can always override them for specific plots that need different settings.

Loading Sample Data

Throughout this chapter, we'll use play-by-play data from the 2023 NFL season. This dataset contains every play from every regular season game, along with contextual information and advanced metrics. Let's load it now and examine its structure.

#| label: load-data-r
#| message: false
#| warning: false
#| cache: true

# Load play-by-play data for 2023 season
# The load_pbp() function automatically downloads and caches data
# It returns a tibble (enhanced data frame) with one row per play
pbp <- load_pbp(2023) %>%
  # Filter to regular season only (exclude preseason and playoffs)
  # season_type: "REG" = regular season, "PRE" = preseason, "POST" = playoffs
  filter(season_type == "REG")

# Load team information (logos, colors, abbreviations)
# This tibble contains NFL team metadata we'll use for visualization
teams <- nflfastR::teams_colors_logos

# Display summary information
cat("===== DATA LOADED =====\n")
cat("Plays loaded:", format(nrow(pbp), big.mark = ","), "\n")
cat("Variables available:", ncol(pbp), "\n")
cat("Season:", unique(pbp$season), "\n")
cat("Weeks covered:", min(pbp$week), "to", max(pbp$week), "\n")
cat("Teams:", n_distinct(pbp$posteam), "(offensive teams tracked)\n")
cat("\n")

# Show a few key variables to understand data structure
cat("Sample of key variables:\n")
pbp %>%
  select(
    game_id,           # Unique game identifier
    week,              # Week of season (1-18)
    posteam,           # Team on offense (possessing team)
    defteam,           # Team on defense
    down,              # Down (1, 2, 3, or 4)
    ydstogo,           # Yards needed for first down
    play_type,         # Type of play (pass, run, punt, etc.)
    yards_gained,      # Yards gained on play
    epa                # Expected Points Added
  ) %>%
  head(5) %>%
  print(width = 120)

#| label: load-data-py
#| message: false
#| warning: false
#| cache: true

# Load play-by-play data for 2023 season
# import_pbp_data() takes a list of seasons (can load multiple years)
# Returns a pandas DataFrame with one row per play
pbp = nfl.import_pbp_data([2023])

# Filter to regular season only
# season_type: 'REG' = regular season, 'PRE' = preseason, 'POST' = playoffs
pbp = pbp[pbp['season_type'] == 'REG']

# Load team information (colors, logos, abbreviations)
# This DataFrame contains metadata about all NFL teams
teams = nfl.import_team_desc()

# Display summary information
print("===== DATA LOADED =====")
print(f"Plays loaded: {len(pbp):,}")
print(f"Variables available: {len(pbp.columns)}")
print(f"Season: {pbp['season'].unique()[0]}")
print(f"Weeks covered: {pbp['week'].min()} to {pbp['week'].max()}")
print(f"Teams: {pbp['posteam'].nunique()} (offensive teams tracked)")
print()

# Show a few key variables to understand data structure
print("Sample of key variables:")
key_vars = [
    'game_id',         # Unique game identifier
    'week',            # Week of season (1-18)
    'posteam',         # Team on offense (possessing team)
    'defteam',         # Team on defense
    'down',            # Down (1, 2, 3, or 4)
    'ydstogo',         # Yards needed for first down
    'play_type',       # Type of play (pass, run, punt, etc.)
    'yards_gained',    # Yards gained on play
    'epa'              # Expected Points Added
]
print(pbp[key_vars].head(5).to_string(index=False))

**What We Just Loaded:** The play-by-play dataset is the foundation of modern football analytics. Each row represents a single play, and each column represents one of the following:

- **Situational context**: What was the game situation? (down, distance, time, score)
- **Participants**: Which teams and players were involved?
- **Actions**: What happened on the play? (play type, result)
- **Outcomes**: What were the consequences? (yards gained, EPA, win probability change)

**Key Variables for Visualization:** From the 370+ variables available, these are most commonly used in visualizations:

- **epa** (Expected Points Added): The value of the play in terms of expected points. This is our primary performance metric and will appear in most visualizations.
- **play_type**: Categorical variable (pass, run, punt, field_goal, etc.). Essential for comparing different types of plays.
- **posteam** / **defteam**: Team abbreviations. We'll use these for team-specific analyses and to apply team colors.
- **down** / **ydstogo**: Down and distance information. Critical for situational analysis.
- **game_id** / **week**: Identifiers for temporal analysis and grouping plays by game.

**Data Quality Notes:**

- Not all plays have EPA calculated (penalties, some special teams plays). We'll need to filter for `!is.na(epa)` in R, or `.notna()` in Python, in most analyses.
- The dataset includes all play types, but many analyses focus on offensive plays (passes and runs).
- Each season contains approximately 40,000-45,000 plays from regular season games.
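As a quick illustration of the data quality note about missing EPA, the sketch below checks how many rows lack EPA and then keeps only pass and run plays that have it. The small table is made up for the example (it is not real NFL data); the column names `play_type` and `epa` follow the dataset described above.

```python
import numpy as np
import pandas as pd

# Hypothetical mini play-by-play table (made-up values, not real NFL data)
sample = pd.DataFrame({
    "play_type": ["pass", "run", "no_play", "pass", "punt", "run"],
    "epa": [0.45, -0.30, np.nan, 1.80, -0.10, np.nan],
})

# Share of rows with no EPA value
missing_share = sample["epa"].isna().mean()

# Keep only pass/run plays that have EPA, a common starting point for analysis
offense = sample[
    sample["play_type"].isin(["pass", "run"]) & sample["epa"].notna()
]

print(f"Rows missing EPA: {missing_share:.0%}")
print(f"Offensive plays with EPA: {len(offense)}")
```

Running the same two checks on the full `pbp` DataFrame is a reasonable first step before any EPA-based plot.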

This data will power all visualizations in this chapter. With it loaded, we're ready to create our first plots.

Basic Plot Types

Now we'll explore the fundamental plot types you'll use repeatedly in football analytics. For each type, we'll explain when to use it, create an example with real NFL data, and interpret what the visualization reveals. Understanding these building blocks will enable you to create more complex, customized visualizations later.

Histograms and Distributions

Histograms and density plots are essential for understanding how data is distributed. In football analytics, distributions help us understand:

The typical range of outcomes (what's normal vs. exceptional)
How variable performance is (tight clustering vs. wide spread)
Whether data follows expected patterns (normal distribution, skewness)
Where important thresholds lie (what percentage of plays are positive EPA)

When to Use Histograms:

Exploring a new dataset to understand variable ranges
Checking if data meets assumptions (e.g., normality for certain statistical tests)
Communicating how common different outcomes are
Identifying outliers or unusual patterns

Histograms divide the data range into bins and count how many observations fall in each bin. The height of each bar shows the frequency or count of observations in that range.
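To make the binning step concrete, here is a minimal sketch using `numpy.histogram` on a handful of made-up EPA values (illustrative only, not drawn from the 2023 data). It shows the bin edges and the count that would become each bar's height.

```python
import numpy as np

# Made-up EPA values for illustration only
epa = np.array([-1.2, -0.4, -0.3, 0.1, 0.2, 0.2, 0.6, 1.5, 3.1])

# Four equal-width bins spanning -2 to 4 (edges at -2, -0.5, 1, 2.5, 4)
counts, edges = np.histogram(epa, bins=4, range=(-2, 4))

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n} plays")
```

Changing `bins` changes the edges and therefore the counts, which is why the same data can look different at different bin settings.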

Let's examine the distribution of EPA (Expected Points Added) for pass plays. This will show us how valuable (or costly) different passing plays are.

#| label: fig-histogram-r
#| fig-cap: "Distribution of EPA for pass plays in the 2023 NFL regular season. The histogram shows a right-skewed distribution with a long right tail indicating occasional explosive passing plays, while the left tail shows catastrophic plays like interceptions and sacks."
#| fig-width: 10
#| fig-height: 6
#| cache: true

# Filter the data to pass plays only
# We need to exclude NA values in EPA, as some plays don't have EPA calculated
pass_plays <- pbp %>%
  filter(
    play_type == "pass",     # Only passing plays
    !is.na(epa)              # Exclude plays without EPA (penalties, etc.)
  )

# Create the histogram
ggplot(pass_plays, aes(x = epa)) +
  # geom_histogram creates the histogram bars
  # bins = 50: divide the EPA range into 50 equally-spaced bins
  # fill: bar color (using a professional blue)
  # color: outline color for each bar (white creates separation)
  # alpha: transparency (0.8 = slightly transparent for softer appearance)
  geom_histogram(
    bins = 50,
    fill = "#0066CC",
    color = "white",
    alpha = 0.8
  ) +

  # Add a vertical reference line at EPA = 0
  # This divides successful plays (positive EPA) from unsuccessful ones
  # linetype = "dashed": creates a dashed line
  # linewidth = 1: makes the line moderately thick and visible
  # (linewidth replaced size for lines in ggplot2 3.4.0)
  geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "red",
    linewidth = 1
  ) +

  # Customize the x-axis scale
  # limits: set the range we want to display (-5 to 10 EPA)
  #   (plays outside these limits are dropped, and ggplot2 warns about them)
  # breaks: where to place tick marks on the axis
  scale_x_continuous(
    limits = c(-5, 10),
    breaks = seq(-5, 10, 2.5)
  ) +

  # Add descriptive labels
  # A good title explains what the reader is seeing
  # Subtitle adds context or key interpretation
  # Caption cites the data source and clarifies elements
  labs(
    title = "Distribution of EPA on Pass Plays",
    subtitle = "2023 NFL Regular Season | Most passes cluster near zero, with occasional explosive gains",
    x = "Expected Points Added (EPA)",
    y = "Number of Plays",
    caption = "Data: nflfastR | Dashed line indicates EPA = 0 (neutral plays)"
  ) +

  # Apply a clean theme and customize text
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray40"),
    plot.caption = element_text(size = 9, color = "gray50", hjust = 0)
  )

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: fig-histogram-py
#| fig-cap: "Distribution of EPA for pass plays - Python implementation. Same data as R version, showing the characteristic right-skewed distribution of passing play outcomes."
#| fig-width: 10
#| fig-height: 6
#| cache: true

# Filter data to pass plays with valid EPA
# .notna() excludes missing values (equivalent to !is.na() in R)
pass_plays = pbp[(pbp['play_type'] == 'pass') & (pbp['epa'].notna())]

# Create figure and axis objects
# figsize=(10, 6): dimensions in inches (matches R output)
fig, ax = plt.subplots(figsize=(10, 6))

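# Create the histogram
# bins=50: number of bins (same as R version)
# range=(-5, 10): compute the bins over the displayed EPA range so they
#   line up with the R version (plays outside this range are not counted)
# color: bar color (same blue as R version)
# alpha: transparency level
# edgecolor: outline color for bars
ax.hist(
    pass_plays['epa'],
    bins=50,
    range=(-5, 10),
    color='#0066CC',
    alpha=0.8,
    edgecolor='white'
)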

# Add vertical reference line at EPA = 0
# linestyle='--': dashed line
# linewidth: thickness of the line
# label: legend label (though we won't show legend here)
ax.axvline(
    x=0,
    color='red',
    linestyle='--',
    linewidth=2,
    label='EPA = 0'
)

# Set x-axis limits to match R version
ax.set_xlim(-5, 10)

# Add axis labels with appropriate font sizes
ax.set_xlabel('Expected Points Added (EPA)', fontsize=12)
ax.set_ylabel('Number of Plays', fontsize=12)

# Add title (matplotlib uses \n for line breaks)
ax.set_title(
    'Distribution of EPA on Pass Plays\n2023 NFL Regular Season | Most passes cluster near zero, with occasional explosive gains',
    fontsize=14,
    fontweight='bold'
)

# Add caption as text in the bottom corner
# transform=ax.transAxes means coordinates are relative (0-1 range)
# (0.02, 0.02) places text in the bottom-left
# verticalalignment='bottom': anchor the text by its bottom edge
ax.text(
    0.02, 0.02,
    'Data: nfl_data_py | Dashed line indicates EPA = 0 (neutral plays)',
    transform=ax.transAxes,
    fontsize=9,
    verticalalignment='bottom',
    color='gray'
)

# Adjust layout to prevent label cutoff
plt.tight_layout()

# Display the plot
plt.show()

**What This Code Does:** The R and Python versions build closely matching visualizations through the same steps:

1. **Filter the Data**: We extract only passing plays with valid EPA values. This reduces our ~43,000 total plays to ~18,000 pass plays.
2. **Create the Histogram**: The histogram divides the EPA range into 50 bins and counts how many pass plays fall in each bin. More bins show finer detail; fewer bins show broader patterns.
3. **Add Reference Line**: The vertical line at EPA = 0 is crucial for interpretation. It divides successful plays (positive EPA, to the right) from unsuccessful plays (negative EPA, to the left).
4. **Customize Aesthetics**: We choose professional colors (blue for data, red for the reference line), add appropriate transparency, and ensure bars have white outlines for clarity.
5. **Label Comprehensively**: Every element is labeled: axes have units, the title explains what we're seeing, and the caption cites the data source.

**Code Comparison (R vs. Python):**

- **R (ggplot2)**: Uses the `+` operator to layer components. Each layer adds an element to the plot.
- **Python (matplotlib)**: Uses method calls on the axis object (`ax`). Each method modifies the plot.

Both approaches are compositional: you build up visualizations piece by piece.

Interpreting the Output:

When you run this code, you'll see a histogram that reveals several important patterns about NFL passing:

1. Right Skewness (Positive Skew):

The distribution has a long tail extending to the right. This means that while most passes result in modest EPA (clustered near zero), occasional passes generate enormous positive EPA—think of a 75-yard touchdown bomb. These rare explosive plays pull the mean EPA higher than the median.

2. Central Clustering:

The highest bars appear near zero and slightly positive. This tells us that the most common passing outcome is a modest gain or loss. Routine passes on first down that gain 5-7 yards fall into this range.

3. Left Tail (Negative Events):

The left tail extends to about -4 or -5 EPA. These are catastrophic passing plays—interceptions, sacks for big losses, or sacks that force a team out of field goal range. While less common than the explosive positive plays, they're still frequent enough to matter.

4. Success Rate:

Looking at the area on either side of the red line, you can visually estimate that slightly less than half of passes have positive EPA. (We calculated earlier that pass success rate is about 46%.) This means passing is inherently risky—even on successful drives, many individual passes lose expected points.

5. What This Means for Strategy:

The combination of high variance (wide spread) and right skew (big-play potential) explains why NFL offenses pass so frequently despite the risk. A completion that gains 8 yards on 3rd-and-7 might only add 0.3 EPA, but a 40-yard completion can add 3-4 EPA in one play. Running plays produce gains of that size far less often.
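The skew and success-rate points above can be checked numerically as well as visually. The sketch below uses a small made-up sample shaped like the pattern described (many modest outcomes, a few large gains); the values are illustrative and are not taken from the 2023 data. Success is defined here as positive EPA, as in the text.

```python
import numpy as np

# Made-up EPA values: many small outcomes, a few big gains, one big loss
epa = np.array([-2.5, -0.8, -0.6, -0.4, -0.3, -0.1, 0.2, 0.4, 0.9, 2.6, 5.1])

mean_epa = epa.mean()            # pulled upward by the rare explosive plays
median_epa = np.median(epa)      # the "typical" play
success_rate = (epa > 0).mean()  # share of plays with positive EPA

print(f"Mean EPA:     {mean_epa:.2f}")
print(f"Median EPA:   {median_epa:.2f}")
print(f"Success rate: {success_rate:.0%}")
```

A mean above the median is the numerical signature of right skew. On real data, the same three lines applied to `pass_plays['epa']` give the values the histogram only lets you estimate by eye.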

Interpreting EPA Distributions

When looking at EPA histograms, ask yourself:

**Shape:**

- Is it symmetric (normal distribution) or skewed?
- Where is the peak (mode)?
- How wide is the spread (variance)?

**Position:**

- Is the bulk of the distribution above or below zero?
- How far into negative territory does the left tail extend?
- How high into positive territory does the right tail reach?

**Football Implications:**

- Right skew → potential for explosive plays
- Wide spread → high variance in outcomes
- Heavy left tail → catastrophic plays possible
- Peak location → typical play outcome

These patterns tell you about risk, upside, and consistency, all crucial for decision-making.

Density Plots

Density plots smooth histograms into continuous curves, making them excellent for comparing distributions across categories. Instead of counting observations in discrete bins, density plots estimate the probability distribution function—showing where data is concentrated.

When to Use Density Plots:

Comparing distributions across multiple groups (pass vs. run, different teams)
When exact counts aren't important but the shape of the distribution is
Presenting to audiences who find smooth curves more intuitive than bars
When you want to overlay multiple distributions without cluttering

Advantage Over Histograms:

Density plots avoid the arbitrary choice of bin width that affects histogram appearance. They also make overlaying multiple distributions clearer—overlapping histograms can be visually confusing, but overlapping density curves are easy to read.
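Density plots trade the bin-width choice for a bandwidth choice. To show what a density estimate computes, here is a minimal hand-written Gaussian kernel density estimate. The helper `gaussian_kde_1d` is a hypothetical name written for this example (plotting libraries use their own implementations), and the EPA values are made up.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (a basic KDE)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z has one row per grid point and one column per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Made-up EPA values for illustration only
epa = np.array([-1.0, -0.4, -0.2, 0.0, 0.1, 0.3, 0.5, 2.0])
grid = np.linspace(-6, 7, 1301)

narrow = gaussian_kde_1d(epa, grid, bandwidth=0.2)  # less smoothing
wide = gaussian_kde_1d(epa, grid, bandwidth=1.0)    # more smoothing

# Each curve encloses an area of about 1, whatever the bandwidth
step = grid[1] - grid[0]
print("Area (narrow):", round(narrow.sum() * step, 3))
print("Area (wide):  ", round(wide.sum() * step, 3))

# More smoothing spreads the same area out, so the peak is lower
print("Narrow peak higher than wide peak:", narrow.max() > wide.max())
```

Because the area is always about 1, the y-axis of a density plot shows relative concentration rather than play counts.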

Let's compare EPA distributions for passing vs. rushing plays to see how these two play types differ in their outcome patterns.

#| label: fig-density-r
#| fig-cap: "Comparing EPA distributions for pass and run plays using density curves. Pass plays show higher variance with both more explosive gains and more catastrophic losses, while run plays cluster more tightly around small negative values."
#| fig-width: 10
#| fig-height: 6
#| cache: true

# Filter to both pass and run plays with valid EPA
offensive_plays <- pbp %>%
  filter(
    play_type %in% c("pass", "run"),  # Include both types
    !is.na(epa)                        # Exclude missing EPA
  )

# Create density plot
ggplot(offensive_plays, aes(x = epa, fill = play_type)) +

  # geom_density creates the smooth density curves
  # alpha = 0.6: make curves semi-transparent so we can see overlap
  # adjust = 1.5: smoothing parameter (higher = smoother)
  geom_density(alpha = 0.6, adjust = 1.5) +

  # Add reference line at zero
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +

  # Use distinct colors for pass vs. run
  # These colors are colorblind-friendly and distinct
  scale_fill_manual(
    values = c("pass" = "#00BFC4", "run" = "#F8766D"),
    labels = c("Pass", "Run"),
    name = "Play Type"
  ) +

  # Limit x-axis to focus on the bulk of the data
  # Plays with EPA below -5 or above 8 are rare
  scale_x_continuous(limits = c(-5, 8)) +

  # Add comprehensive labels
  labs(
    title = "EPA Distribution: Pass vs. Run",
    subtitle = "Pass plays show higher variance with more big gains and big losses | 2023 NFL Regular Season",
    x = "Expected Points Added",
    y = "Density",
    caption = "Data: nflfastR | Density curves are smoothed histograms"
  ) +

  # Apply theme and customizations
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "top",           # Put legend at top for visibility
    panel.grid.minor = element_blank() # Remove minor grid lines for cleaner look
  )

#| label: fig-density-py
#| fig-cap: "EPA density comparison by play type - Python. Same patterns as R version, showing pass variance vs. run consistency."
#| fig-width: 10
#| fig-height: 6
#| cache: true

# Filter data to pass and run plays with valid EPA
plot_data = pbp[
    pbp['play_type'].isin(['pass', 'run']) &
    pbp['epa'].notna()
]

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Create density plots for each play type
# We iterate through play types with their associated colors and labels
for play_type, color, label in [
    ('pass', '#00BFC4', 'Pass'),
    ('run', '#F8766D', 'Run')
]:
    # Extract EPA data for this play type
    data = plot_data[plot_data['play_type'] == play_type]['epa']

    # Create density plot (kernel density estimate)
    # bw_method sets the bandwidth factor passed to scipy's gaussian_kde;
    # it plays the same role as ggplot2's adjust but is not on the same scale
    data.plot.kde(
        ax=ax,
        color=color,
        alpha=0.6,
        linewidth=2,
        label=label,
        bw_method=0.3  # Smoothing bandwidth
    )

# Add vertical reference line at zero
ax.axvline(x=0, color='black', linestyle='--', alpha=0.5)

# Set x-axis limits to match R version
ax.set_xlim(-5, 8)

# Add labels
ax.set_xlabel('Expected Points Added', fontsize=12)
ax.set_ylabel('Density', fontsize=12)

# Add title with subtitle (using \n for line break)
ax.set_title(
    'EPA Distribution: Pass vs. Run\n' +
    'Pass plays show higher variance with more big gains and big losses | 2023 NFL Regular Season',
    fontsize=14,
    fontweight='bold',
    pad=20  # Add padding above title
)

# Add legend
# loc='upper right': position in upper right corner
# title: label for the legend
ax.legend(title='Play Type', loc='upper right', frameon=True, shadow=True)

# Add caption
ax.text(
    0.98, 0.02,
    'Data: nfl_data_py | Density curves are smoothed histograms',
    transform=ax.transAxes,
    ha='right',
    fontsize=9,
    style='italic',
    color='gray'
)

# Adjust layout and display
plt.tight_layout()
plt.show()

**What This Code Does:**

1. **Filter for Comparison**: We include both pass and run plays in the same dataset, keeping the `play_type` variable so we can distinguish them.
2. **Create Overlapping Densities**: The `fill = play_type` aesthetic (R) or separate plotting for each type (Python) creates two density curves, one for each play type.
3. **Choose Distinct Colors**: We use teal (`#00BFC4`) for passes and coral (`#F8766D`) for runs. These colors are:
   - Highly distinct (easy to tell apart)
   - Colorblind-friendly (distinguishable even for red-green colorblind viewers)
   - Professional and modern
4. **Add Transparency**: `alpha = 0.6` makes curves semi-transparent so we can see where they overlap. This overlap reveals which ranges are similar vs. different.
5. **Adjust Smoothing**: The `adjust` parameter (R) or `bw_method` (Python) controls smoothness. Higher values create smoother curves but may obscure details. We use moderate smoothing to show the overall shape clearly.

Interpreting the Comparison:

This visualization reveals fundamental differences between passing and rushing in the NFL:

1. Central Tendency (Where the Peaks Are):

Run plays: The peak (mode) is slightly left of zero, around -0.4 to -0.5 EPA. This means the most common running play loses a small amount of expected points. Think of a 2-yard gain on 1st-and-10 that leaves the offense at 2nd-and-8, a slightly worse situation.
Pass plays: The peak is closer to zero, around -0.2 EPA. Passes more frequently result in neutral or slightly positive EPA.

2. Variance (Width of the Distribution):

Run plays: The distribution is narrow and tightly clustered. Most runs result in small gains or losses (roughly -1 to +2 EPA). Very few runs result in extreme EPA values.
Pass plays: The distribution is much wider, extending from about -4 EPA (interceptions, sacks) to +7 EPA (long touchdown passes). This reflects the high-variance nature of passing.

3. Skewness (Tail Behavior):

Run plays: Modest right skew—occasional long runs, but even a 40-yard run might only add 2-3 EPA.
Pass plays: Pronounced right skew—the right tail extends much further, indicating the potential for explosive plays. A 60-yard touchdown pass can add 5-7 EPA instantly.

4. Catastrophic Left Tail:

Run plays: The left tail is short. Even fumbles (the worst running outcome) rarely lose more than 2-3 EPA.
Pass plays: The left tail extends to -4 or -5 EPA, representing interceptions in the opponent's territory or drive-killing sacks.

5. Overlap Analysis:

The overlapping area (where both curves are visible) shows EPA ranges where both play types are common. The non-overlapping areas highlight where one play type dominates:

Around -2 to -1 EPA: More passes than runs (incompletions, short sacks)
Around +3 to +7 EPA: Almost exclusively passes (explosive plays)
Around -0.5 to 0 EPA: More runs (modest gains/losses)

Strategic Implications:

These distribution differences explain modern NFL strategy:

Why teams pass more: Despite higher risk (wider distribution = more variance), the right-skewed distribution means passes offer explosive-play potential that runs can't match.
Why runs are still valuable: Lower variance makes runs more predictable. In situations where avoiding catastrophic plays matters (protecting a lead, running out the clock), the tighter distribution of runs is advantageous.
Risk-reward trade-off: Passing offers higher mean EPA (+0.097) but with much higher variance. Running offers lower mean EPA (-0.038) but more consistent outcomes.
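The mean-versus-variance comparison above comes from a simple grouped summary. The sketch below shows the calculation on a made-up table (the EPA values are illustrative and will not reproduce the season figures quoted above); the column names match the play-by-play data.

```python
import pandas as pd

# Made-up plays for illustration only
plays = pd.DataFrame({
    "play_type": ["pass"] * 6 + ["run"] * 6,
    "epa": [-3.0, -0.6, -0.4, 0.5, 1.2, 4.1,    # passes: wide spread
            -0.9, -0.5, -0.3, 0.0, 0.4, 1.0],   # runs: tight spread
})

# Mean captures the average payoff; std captures how variable outcomes are
summary = plays.groupby("play_type")["epa"].agg(["mean", "std", "min", "max"])
print(summary.round(2))
```

Reporting the spread alongside the mean keeps the risk side of the trade-off visible, which a single average would hide.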

Key Insight: Variance vs. Mean

A common misconception is that "running is safer than passing." While running has less variance (smaller spread), that doesn't make it safer in all situations:

**Scenario 1**: 3rd-and-15 from your own 20-yard line.

- A run will almost certainly gain 2-5 yards (low variance, high probability of failure)
- A pass might be incomplete (bad), but could also gain 15+ yards for a first down (good)
- **Verdict**: Passing is "safer" here because the low variance of running means certain failure

**Scenario 2**: 3rd-and-1 protecting a lead late in the game.

- A run will gain -1 to +3 yards (low variance, moderate success rate)
- A pass could be a sack, interception, or incomplete (higher variance, more catastrophic outcomes possible)
- **Verdict**: Running is "safer" here because we can tolerate failure (punt) but can't afford catastrophe (turnover)

The lesson: "Safe" depends on context. Sometimes predictability is risky, and sometimes variance is risky.

Line Charts

Line charts excel at showing trends over time or ordered sequences. They're ideal for visualizing how things change—win probability during a game, team performance across a season, or cumulative statistics building up week by week.

When to Use Line Charts:

Showing trends over time (season progression, game flow)
Displaying cumulative totals (cumulative EPA, points scored)
Comparing trends across groups (multiple teams' season trajectories)
Showing continuous change in ordered data

Key Design Considerations:

Connect points with lines only when the order is meaningful (time, sequence)
Use different line styles or colors to distinguish multiple series
Add reference lines for important thresholds
Consider smoothing when data is noisy

Let's create a line chart showing how win probability changes throughout a game—this reveals the flow and key momentum shifts.

#| label: fig-line-chart-r
#| fig-cap: "Win probability chart for a specific 2023 NFL game. The line shows how the home team's win probability evolved from kickoff to final whistle, with steep changes indicating high-leverage plays like touchdowns or turnovers."
#| fig-width: 10
#| fig-height: 6
#| cache: true

# Select an exciting game from 2023 season
# Let's pick a close game with multiple lead changes
# We'll examine all plays from one game to show win probability evolution
game_plays <- pbp %>%
  filter(
    # Pick a specific game (you can change this to any game_id)
    # game_id format is season_week_away_home
    game_id == "2023_01_DET_KC",  # Week 1: Lions at Chiefs
    !is.na(home_wp)                # Ensure we have win probability data
  ) %>%
  # Create a play sequence number for the x-axis
  mutate(play_number = row_number())

# Create line chart
ggplot(game_plays, aes(x = play_number, y = home_wp)) +

  # Add a filled area under the curve to emphasize the probability
  # This makes it visually clear what the home team's chances are
  geom_ribbon(
    aes(ymin = 0, ymax = home_wp),
    fill = "#0066CC",
    alpha = 0.3
  ) +

  # Add the line showing win probability trajectory
  # linewidth = 1.5: makes the line prominent and easy to follow
  geom_line(color = "#0066CC", linewidth = 1.5) +

  # Add a horizontal reference line at 50% (even odds)
  # This divides "more likely to win" from "more likely to lose"
  geom_hline(
    yintercept = 0.5,
    linetype = "dashed",
    color = "red",
    alpha = 0.5
  ) +

  # Format y-axis as percentage
  # expand = c(0, 0): remove the default padding so the axis runs exactly 0% to 100%
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0, 1),
    expand = c(0, 0)
  ) +

  # Add descriptive labels
  labs(
    title = "Win Probability Chart: Lions at Chiefs",
    subtitle = "2023 NFL Season, Week 1 | Shows home team (Chiefs) win probability throughout game",
    x = "Play Number (Game Progression)",
    y = "Home Team Win Probability",
    caption = "Data: nflfastR | Red dashed line indicates 50% probability (even odds)"
  ) +

  # Apply theme
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    panel.grid.minor = element_blank()
  )

#| label: fig-line-chart-py
#| fig-cap: "Win probability chart - Python implementation. Shows the ebb and flow of the game through changing win probabilities."
#| fig-width: 10
#| fig-height: 6
#| cache: true

# Select game data
# game_id format is season_week_away_home
game_plays = (
    pbp[
        (pbp['game_id'] == '2023_01_DET_KC') &  # Week 1: Lions at Chiefs
        pbp['home_wp'].notna()                   # Valid win probability
    ]
    .reset_index(drop=True)
)

# Add play number for x-axis
game_plays['play_number'] = range(1, len(game_plays) + 1)

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the line
ax.plot(
    game_plays['play_number'],
    game_plays['home_wp'],
    color='#0066CC',
    linewidth=2,
    label='Home Team Win Probability'
)

# Fill area under the curve
ax.fill_between(
    game_plays['play_number'],
    0,
    game_plays['home_wp'],
    color='#0066CC',
    alpha=0.3
)

# Add horizontal reference line at 50%
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='50% (Even Odds)')

# Set y-axis limits and format as percentage
ax.set_ylim(0, 1)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))

# Add labels
ax.set_xlabel('Play Number (Game Progression)', fontsize=12)
ax.set_ylabel('Home Team Win Probability', fontsize=12)

# Add title
ax.set_title(
    'Win Probability Chart: Lions at Chiefs\n' +
    '2023 NFL Season, Week 1 | Shows home team (Chiefs) win probability throughout game',
    fontsize=14,
    fontweight='bold',
    pad=20
)

# Add legend
ax.legend(loc='best', frameon=True)

# Add caption
ax.text(
    0.98, 0.02,
    'Data: nfl_data_py | Red dashed line indicates 50% probability (even odds)',
    transform=ax.transAxes,
    ha='right',
    fontsize=9,
    style='italic',
    color='gray'
)

# Adjust layout and display
plt.tight_layout()
plt.show()

**What This Code Does:**

1. **Filter to Single Game**: We select all plays from one specific game using the `game_id` variable. Each game has a unique identifier like "2023_01_DET_KC" (season_week_away_home).
2. **Create Play Sequence**: We add a `play_number` variable that simply counts from 1 to the total number of plays. This gives us a sequential x-axis showing game progression.
3. **Plot Win Probability**: The y-axis shows `home_wp` (home team win probability), which ranges from 0 (0% chance) to 1 (100% chance).
4. **Add Visual Enhancements**:
   - **Ribbon/Fill**: The filled area under the curve helps viewers quickly see whether the home team is favored (large area) or not (small area)
   - **Reference Line at 50%**: Divides "home team favored" from "away team favored"
   - **Percentage Formatting**: Y-axis shows percentages (0% to 100%) instead of decimals
5. **Style and Label**: Professional styling with clear labels explaining what viewers are seeing.

**Design Choices:**

- **Sequential X-Axis**: Using play number (rather than game time) ensures evenly-spaced points even when plays are unevenly distributed over time
- **Blue Color**: Professional, neutral color that doesn't imply favoritism for either team
- **Filled Area**: Makes it easier to see at a glance which team is favored
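Because the identifier follows the season_week_away_home pattern, it can be split into its parts when you need to label a chart or confirm which team is at home. The helper below, `parse_game_id`, is a hypothetical convenience function written for this example and is not part of any nflverse package.

```python
def parse_game_id(game_id):
    """Split an id like '2023_01_DET_KC' into season, week, away, and home."""
    season, week, away, home = game_id.split("_")
    return {
        "season": int(season),
        "week": int(week),
        "away": away,
        "home": home,
    }

info = parse_game_id("2023_01_DET_KC")
print(f"Week {info['week']}, {info['season']}: {info['away']} at {info['home']}")
```

Checking the home team this way helps avoid mislabeling `home_wp`, which always refers to the last abbreviation in the identifier.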

Interpreting the Output:

A win probability chart tells the story of a game through numbers:

1. Game Flow Narrative:

Kickoff: Most games start near 50% (slight home field advantage might make it 52-53% for the home team)
Momentum Shifts: Steep upward or downward movements indicate high-leverage plays (touchdowns, turnovers, 4th down conversions)
Late-Game Tension: In close games, win probability oscillates around 50% until late, when one team pulls away
Blowouts: If one team dominates, win probability quickly approaches 95-100% and stays there

2. Identifying Key Plays:

The steepest segments of the line mark the most important plays. Because the chart tracks the home team's win probability:

Large positive jumps: Home-team touchdowns, defensive scores, and takeaways in good field position
Large negative drops: Home-team turnovers, opponent touchdowns, failed 4th down conversions
Gradual increases: Methodical drives that steadily improve field position

You can identify the exact plays that changed the game by looking for these dramatic shifts.
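One way to find those shifts programmatically is to difference the win probability series and rank plays by the size of the change. The sketch below uses a short made-up `home_wp` sequence (not real game data); on a real game you would apply the same steps to `game_plays`.

```python
import pandas as pd

# Made-up win probability sequence for illustration only
game = pd.DataFrame({
    "play_number": range(1, 9),
    "home_wp": [0.55, 0.57, 0.56, 0.72, 0.70, 0.48, 0.50, 0.81],
})

# Change in home win probability relative to the previous play
game["wp_change"] = game["home_wp"].diff()

# The three plays with the largest swings in either direction
top_swings = game.loc[game["wp_change"].abs().nlargest(3).index]
print(top_swings[["play_number", "home_wp", "wp_change"]].round(2))
```

Positive `wp_change` values favor the home team and negative values favor the away team, so the sign tells you who benefited from each swing.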

3. Game Excitement:

Win probability charts also quantify how exciting a game was:

Back-and-forth games: Multiple crossings of the 50% line indicate changes in which team is favored, which often coincide with lead changes
Close finishes: Win probability near 50% in the 4th quarter means an uncertain outcome
Blowouts: Win probability above 90% for most of the game means it was never in doubt

Some analysts create an "excitement index" based on how much win probability changed and when those changes occurred.
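One simple way to sketch such an index is to add up the absolute play-to-play changes in win probability and to count how often the curve crosses 50%. This is an assumed, simplified definition: it ignores when the changes occurred, and the two win probability sequences are invented for illustration.

```python
import numpy as np

def excitement_summary(home_wp):
    """Return (total WP movement, number of 50% crossings) for one game."""
    wp = np.asarray(home_wp, dtype=float)
    # Total movement: sum of absolute play-to-play changes
    total_movement = np.abs(np.diff(wp)).sum()
    # Which side of 50% each value is on (+1 home favored, -1 away favored)
    side = np.sign(wp - 0.5)
    side = side[side != 0]  # ignore values sitting exactly on 50%
    crossings = int(np.sum(side[1:] != side[:-1]))
    return total_movement, crossings

# Hypothetical win probability paths
close_game = [0.55, 0.45, 0.60, 0.40, 0.65, 0.35, 0.90]
blowout = [0.55, 0.70, 0.85, 0.93, 0.97, 0.99, 1.00]

close_total, close_crossings = excitement_summary(close_game)
blowout_total, blowout_crossings = excitement_summary(blowout)
print(f"Close game: movement={close_total:.2f}, crossings={close_crossings}")
print(f"Blowout:    movement={blowout_total:.2f}, crossings={blowout_crossings}")
```

A weighted variant could multiply each change by a factor that grows late in the game, which would bring in the timing component mentioned above.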

4. Strategic Decisions:

Coaches and analysts use win probability charts to evaluate decisions:

Going for it on 4th down: Did it significantly increase win probability?
Two-point conversions: What was the WP impact?
Clock management: Did decisions maximize win probability?

By comparing the actual WP change to the expected WP change from different decisions, we can evaluate coaching choices.
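The arithmetic behind such a comparison can be sketched as an expected value. Every number below is hypothetical (the conversion probability and the three win probabilities would come from a model in practice), and `expected_wp` is an illustrative helper, not a function from any package.

```python
def expected_wp(p_success, wp_success, wp_failure):
    """Expected win probability of a risky decision with two outcomes."""
    return p_success * wp_success + (1 - p_success) * wp_failure

# Hypothetical 4th-and-2 situation
go_for_it = expected_wp(
    p_success=0.55,   # assumed chance of converting
    wp_success=0.62,  # assumed WP after a conversion
    wp_failure=0.41,  # assumed WP after a failed attempt
)
punt = 0.48           # assumed WP after a punt

advantage = go_for_it - punt
print(f"Go for it: {go_for_it:.4f} | Punt: {punt:.4f} | Difference: {advantage:+.4f}")
```

Under these assumed inputs, going for it is worth about 4.5 percentage points of win probability more than punting. Change the inputs and the recommendation can flip, which is why the quality of the underlying estimates matters.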

Reading Win Probability Charts

**What to Look For:**

1. **Starting point**: Home teams typically start around 52-55% WP due to home field advantage
2. **Crossings of 50%**: Each crossing represents a lead change or shift in who's favored
3. **Steepness of changes**: Steeper = more impactful play
4. **Late-game behavior**: Does WP converge to 0% or 100%, or stay uncertain until the end?
5. **Plateaus**: Periods where WP stays relatively flat indicate consistent play without dramatic events

**Common Patterns:**

- **Gradual rise/fall**: One team steadily takes control through consistent execution
- **Sawtooth pattern**: Back-and-forth scoring, exciting game
- **Step function**: Blowout where WP jumps to 90%+ and stays there
- **Late collapse**: WP high for one team most of game, then dramatic late reversal

Scatter Plots

Scatter plots reveal relationships between two continuous variables. They're essential for exploring correlations, identifying outliers, and understanding multidimensional patterns. In football analytics, scatter plots help us understand questions like: "Do teams that pass well also run well?" or "Is quarterback experience related to efficiency?"

When to Use Scatter Plots:

Exploring relationships between two performance metrics
Identifying teams or players with unusual combinations of traits
Checking for correlations before modeling
Showing performance across two dimensions simultaneously
Creating quadrant charts (dividing by thresholds to create categories)

Key Elements:

Point position: Each point represents one observation (team, player, game)
Additional aesthetics: Point size, color, or shape can encode a third variable
Trend lines: Add linear or smoothed trend lines to highlight relationships
Reference lines: Horizontal/vertical lines at means or thresholds divide the space

Let's create a scatter plot comparing teams' passing EPA to their rushing EPA. This reveals whether offensive efficiency in one area relates to efficiency in the other.

#| label: fig-scatter-r
#| fig-cap: "Team offensive efficiency: passing vs. rushing EPA per play. Each team logo represents one team's average performance in both dimensions. The scatter plot reveals whether teams that excel at passing also excel at rushing."
#| fig-width: 10
#| fig-height: 8
#| cache: true

# Calculate separate pass and rush EPA for each team
team_pass_rush <- pbp %>%
  filter(
    !is.na(epa),                           # Valid EPA values
    play_type %in% c("pass", "run")        # Only pass and run plays
  ) %>%
  # Group by both team and play type
  group_by(posteam, play_type) %>%
  # Calculate mean EPA for each team-playtype combination
  summarise(mean_epa = mean(epa), .groups = "drop") %>%

  # Reshape from long to wide format
  # Creates separate columns for pass and run EPA
  # Before: rows like (KC, pass, 0.15), (KC, run, -0.02)
  # After: row like (KC, 0.15, -0.02)
  pivot_wider(
    names_from = play_type,
    values_from = mean_epa
  )

# Create scatter plot with team logos
team_pass_rush %>%
  ggplot(aes(x = run, y = pass)) +

  # Add reference lines at zero for both dimensions
  # These divide the plot into quadrants
  # linetype = "dashed": creates dashed lines
  # alpha = 0.5: semi-transparent so they don't dominate
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +

  # Add smooth trend line
  # method = "lm": linear model (straight line)
  # se = TRUE: show confidence interval as a shaded area
  # color/alpha: make it subtle (not the focus)
  geom_smooth(
    method = "lm",
    se = TRUE,
    color = "gray50",
    alpha = 0.3
  ) +

  # Add team logos as points
  # aes(team_abbr = posteam): tells nflplotR which team each point is
  # width = 0.06: size of logos
  # alpha = 0.8: slight transparency
  nflplotR::geom_nfl_logos(aes(team_abbr = posteam), width = 0.06, alpha = 0.8) +

  # Add comprehensive labels
  labs(
    title = "Team Offensive Efficiency: Passing vs. Rushing",
    subtitle = "Teams in upper-right quadrant excel at both | 2023 NFL Regular Season",
    x = "Rush EPA per Play",
    y = "Pass EPA per Play",
    caption = "Data: nflfastR | Line shows linear relationship between pass and rush efficiency"
  ) +

  # Apply theme with customizations
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    panel.grid.minor = element_blank(),  # Remove minor grid for cleaner look
    aspect.ratio = 1                     # Square panel (height equals width)
  )

📊 Visualization Output

The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.

#| label: fig-scatter-py
#| fig-cap: "Pass EPA vs. Rush EPA by team - Python implementation. Team abbreviations label each point, showing the relationship between passing and rushing efficiency."
#| fig-width: 10
#| fig-height: 8
#| cache: true

# Calculate team pass and rush EPA
team_pass_rush = (
    pbp[pbp['epa'].notna() & pbp['play_type'].isin(['pass', 'run'])]
    .groupby(['posteam', 'play_type'])  # Group by team and play type
    .agg(mean_epa=('epa', 'mean'))      # Calculate mean EPA
    .reset_index()                       # Convert to regular DataFrame
    # Pivot to wide format (separate columns for pass and run)
    .pivot(index='posteam', columns='play_type', values='mean_epa')
    .reset_index()
)

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 8))

# Create scatter plot
scatter = ax.scatter(
    team_pass_rush['run'],   # x-axis: rushing EPA
    team_pass_rush['pass'],  # y-axis: passing EPA
    s=100,                   # size of points
    alpha=0.6,               # transparency
    color='#0066CC'          # point color
)

# Add team labels to each point
# This makes it clear which team each point represents
for idx, row in team_pass_rush.iterrows():
    ax.annotate(
        row['posteam'],               # Team abbreviation
        (row['run'], row['pass']),    # Position (x, y)
        fontsize=8,                   # Small font size
        ha='center',                  # Horizontal alignment: center
        va='center',                  # Vertical alignment: center
        fontweight='bold'             # Bold text for readability
    )

# Add reference lines at zero
# Divide plot into quadrants
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Add trend line
# Drop teams missing either value so x and y stay paired row by row
fit_data = team_pass_rush.dropna(subset=['run', 'pass'])

# Calculate linear fit
z = np.polyfit(
    fit_data['run'],    # x values
    fit_data['pass'],   # y values
    1                   # degree = 1 (linear)
)
p = np.poly1d(z)  # Create polynomial function from coefficients

# Create x values for plotting the line
x_line = np.linspace(
    team_pass_rush['run'].min(),
    team_pass_rush['run'].max(),
    100
)

# Plot the trend line
ax.plot(
    x_line,
    p(x_line),
    color='gray',
    linestyle='-',
    alpha=0.3,
    label='Linear trend'
)

# Add labels
ax.set_xlabel('Rush EPA per Play', fontsize=12)
ax.set_ylabel('Pass EPA per Play', fontsize=12)

# Add title
ax.set_title(
    'Team Offensive Efficiency: Passing vs. Rushing\n' +
    'Teams in upper-right quadrant excel at both | 2023 NFL Regular Season',
    fontsize=14,
    fontweight='bold',
    pad=20
)

# Add caption
ax.text(
    0.98, 0.02,
    'Data: nfl_data_py | Line shows linear relationship between pass and rush efficiency',
    transform=ax.transAxes,
    ha='right',
    fontsize=9,
    style='italic',
    color='gray'
)

# Equal aspect ratio: one EPA unit spans the same length on both axes
ax.set_aspect('equal', adjustable='box')

# Adjust layout and display
plt.tight_layout()
plt.show()

**What This Code Does:**

1. **Calculate Separate Metrics**: We group plays by team AND play type, then calculate mean EPA for each combination. This gives us two EPA values per team (one for passing, one for rushing).
2. **Reshape Data**: The `pivot_wider()` (R) or `.pivot()` (Python) operation transforms the data from long format (multiple rows per team) to wide format (one row per team with separate pass/run columns). This structure is necessary for scatter plots where x and y are different variables.
3. **Add Reference Lines**: Lines at x=0 and y=0 divide the plot into four quadrants:
   - **Upper-right**: Good at both passing and rushing
   - **Upper-left**: Good at passing, poor at rushing
   - **Lower-right**: Poor at passing, good at rushing
   - **Lower-left**: Poor at both
4. **Add Trend Line**: The `geom_smooth()` (R) or `np.polyfit()` (Python) creates a linear trend line showing the overall relationship. The shaded confidence interval (R) shows statistical uncertainty.
5. **Use Logos as Points** (R only): Instead of generic points, we plot team logos. This makes identification instant and adds professional polish.

**Key Statistical Concept: Correlation**

The trend line reveals whether pass and rush EPA are correlated:

- **Positive slope**: Teams good at passing tend to also be good at rushing
- **Negative slope**: Teams good at one tend to be poor at the other
- **Flat (zero slope)**: No relationship between the two

Interpreting the Output:

This scatter plot reveals several important patterns about NFL offenses:

1. Weak Positive Correlation:

The trend line has a gentle positive slope, suggesting that teams that pass well tend to rush slightly better as well. This makes sense: good offensive lines help both passing (protection) and rushing (creating holes). However, the correlation is weak—plenty of teams deviate from the trend.
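To put a number on how weak or strong the relationship is, you can compute the Pearson correlation alongside the fitted slope. The sketch below uses a small invented table with placeholder team codes in place of the real `team_pass_rush` data, so the printed values say nothing about actual NFL teams.

```python
import numpy as np
import pandas as pd

# Hypothetical per-team EPA values (placeholder team codes and numbers)
team_pass_rush = pd.DataFrame({
    "posteam": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "run": [-0.10, -0.05, 0.00, 0.02, 0.05, -0.08],
    "pass": [0.05, -0.02, 0.10, 0.15, 0.08, -0.10],
})

# Keep only teams with both values so the pairs stay aligned
clean = team_pass_rush.dropna(subset=["run", "pass"])

# Pearson correlation coefficient between rush and pass EPA
r = np.corrcoef(clean["run"], clean["pass"])[0, 1]

# Slope and intercept of the least-squares trend line (pass as a function of run)
slope, intercept = np.polyfit(clean["run"], clean["pass"], 1)

print(f"r = {r:.2f}, r^2 = {r**2:.2f}, slope = {slope:.2f}")
```

The slope and the correlation always share a sign, but only the correlation tells you how tightly teams cluster around the line. With 32 teams, even a moderate `r` comes with wide uncertainty.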

2. Quadrant Analysis:

Looking at which quadrant teams fall into reveals offensive identity:

Upper-right quadrant (positive pass EPA, positive rush EPA): Balanced, elite offenses. These are the most complete offensive teams.
Upper-left quadrant (positive pass EPA, negative rush EPA): Pass-first offenses. These teams win through the air despite struggling on the ground. Many modern offenses fall here.
Lower-right quadrant (negative pass EPA, positive rush EPA): Run-first offenses. Rare in the modern NFL, but some teams still emphasize rushing.
Lower-left quadrant (negative pass EPA, negative rush EPA): Struggling offenses. These teams are inefficient in both facets.

3. Outliers and Clusters:

Some teams deviate significantly from the trend line. These are interesting cases:
Teams far above the trend line: passing much better than their rushing would predict
Teams far below the trend line: passing much worse than their rushing would predict
Tight clusters: teams with similar offensive profiles
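Distance from the trend line can be measured as a residual: actual pass EPA minus the pass EPA the line predicts from rush EPA. The following sketch uses invented values and placeholder team codes, and the choice to flag the two largest residuals is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical per-team EPA values (placeholder team codes and numbers)
teams = pd.DataFrame({
    "posteam": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "run": [-0.10, -0.05, 0.00, 0.02, 0.05, -0.08],
    "pass": [0.05, -0.02, 0.10, 0.15, 0.08, -0.10],
})

# Fit the linear trend: pass EPA as a function of rush EPA
slope, intercept = np.polyfit(teams["run"], teams["pass"], 1)

# Residual > 0: above the line; residual < 0: below the line
teams["predicted_pass"] = intercept + slope * teams["run"]
teams["residual"] = teams["pass"] - teams["predicted_pass"]

# Flag the teams that sit farthest from the line in either direction
order = teams["residual"].abs().sort_values(ascending=False).index
outliers = teams.loc[order].head(2)
print(outliers[["posteam", "run", "pass", "residual"]])
```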

4. The Pass-Run Continuum:

This visualization shows that the pass-vs-run debate isn't binary. Teams exist along a continuum of passing/rushing efficiency. Some excel at one, some at both, and some at neither.

5. Strategic Implications:

The weak correlation suggests that passing and rushing ability are somewhat independent. You can build an effective offense by excelling at one dimension (usually passing) without necessarily being good at the other. This supports the modern pass-first approach: because passing generally produces higher EPA per play than rushing, teams optimize for passing even if it means accepting poor rushing numbers.

Reading Quadrant Charts

Quadrant charts (scatter plots with reference lines dividing the space) are powerful for categorizing observations:

**How to Read:**

1. Identify the reference lines (often at zero or mean values)
2. Understand what each quadrant represents
3. Look for clusters (groups of similar cases)
4. Identify outliers (unusual combinations)
5. Examine the trend line for overall relationships

**When to Use:**

- Comparing performance on two dimensions
- Creating typologies (categories based on two criteria)
- Identifying balanced vs. specialized entities
- Making draft or roster decisions (combine two scouting metrics)

**Labeling Quadrants:**

Consider adding text labels to each quadrant explaining what it represents:

- "Elite" (high on both)
- "Struggling" (low on both)
- "Pass-first" / "Run-first" (high on one, low on other)

This helps audiences who aren't statistically sophisticated understand the plot instantly.
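Quadrant labels can be added in matplotlib with a small helper that writes text into the four corners of the axes. `label_quadrants` is a hypothetical helper written for this sketch. It places labels in axes coordinates, so it assumes the reference lines fall inside the visible plot area.

```python
import matplotlib.pyplot as plt

def label_quadrants(ax, labels, pad=0.03, **text_kwargs):
    """Write one label into each corner of `ax`; returns the Text objects."""
    style = {"fontsize": 10, "color": "gray", "alpha": 0.8, **text_kwargs}
    # (x, y) in axes coordinates plus the alignment that keeps text inside the plot
    corners = {
        "upper_right": (1 - pad, 1 - pad, "right", "top"),
        "upper_left": (pad, 1 - pad, "left", "top"),
        "lower_right": (1 - pad, pad, "right", "bottom"),
        "lower_left": (pad, pad, "left", "bottom"),
    }
    return [
        ax.text(x, y, labels[name], transform=ax.transAxes, ha=ha, va=va, **style)
        for name, (x, y, ha, va) in corners.items()
    ]

# Empty quadrant chart with rush EPA on x and pass EPA on y
fig, ax = plt.subplots(figsize=(6, 6))
ax.set_xlim(-0.2, 0.2)
ax.set_ylim(-0.2, 0.2)
ax.axhline(0, color="gray", linestyle="--", alpha=0.5)
ax.axvline(0, color="gray", linestyle="--", alpha=0.5)

quadrant_texts = label_quadrants(ax, {
    "upper_right": "Elite",
    "upper_left": "Pass-first",
    "lower_right": "Run-first",
    "lower_left": "Struggling",
})
```

Call `plt.show()` or `fig.savefig(...)` afterward to view the result. Keeping the labels gray and small stops them from competing with the data points.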

Advanced Visualization Techniques

Beyond basic plot types, advanced techniques help you create publication-quality visualizations that tell compelling stories with your data. These techniques combine multiple elements, use sophisticated color schemes, and incorporate NFL-specific branding to create professional graphics.

Using NFL Team Colors and Logos

One of the most effective ways to make football visualizations immediately recognizable and professional is to incorporate official team colors and logos. The nflplotR package in R makes this remarkably easy.

Benefits of Team Branding:

Instant Recognition: Viewers immediately identify teams without reading labels
Professional Appearance: Signals domain expertise and attention to detail
Visual Appeal: Team colors and logos make graphics more engaging
Brand Consistency: Aligns with how fans and media already think about teams

Let's create a visualization that showcases these capabilities—a ranked bar chart with team logos and colors.

#| label: fig-team-logos-r
#| fig-cap: "Top 10 NFL offenses by EPA per play with team logos and colors. This visualization demonstrates how nflplotR enhances football graphics with authentic branding elements."
#| fig-width: 10
#| fig-height: 8
#| cache: true

# Calculate team offensive EPA
team_offense <- pbp %>%
  filter(
    !is.na(epa),
    play_type %in% c("pass", "run")
  ) %>%
  group_by(posteam) %>%
  summarise(
    plays = n(),
    epa_per_play = mean(epa),
    .groups = "drop"
  ) %>%
  # Get top 10 teams
  arrange(desc(epa_per_play)) %>%
  head(10)

# Create visualization with team logos and colors
team_offense %>%
  # Reorder teams by EPA for proper ranking
  mutate(posteam = fct_reorder(posteam, epa_per_play)) %>%
  ggplot(aes(x = posteam, y = epa_per_play)) +

  # Create bars, mapping fill to team
  # The fill scale below supplies each team's actual color
  geom_col(aes(fill = posteam), width = 0.7, show.legend = FALSE) +

  # Use official NFL team colors for the bars
  # type = "primary": uses each team's primary brand color
  nflplotR::scale_fill_nfl(type = "primary") +

  # Add team logos at the end of each bar (at the EPA value)
  # This adds immediate visual recognition
  # width = 0.05: logos are 5% of plot width
  nflplotR::geom_nfl_logos(
    aes(team_abbr = posteam),
    width = 0.05,
    alpha = 0.9
  ) +

  # Add reference line at zero
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +

  # Format y-axis
  scale_y_continuous(
    labels = scales::number_format(accuracy = 0.01),
    expand = expansion(mult = c(0.1, 0.15))
  ) +

  # Add labels
  labs(
    title = "Top 10 NFL Offenses by EPA per Play",
    subtitle = "2023 Regular Season | Team colors and logos show offensive rankings",
    x = NULL,
    y = "EPA per Play",
    caption = "Data: nflfastR | Logos and colors from nflplotR package"
  ) +

  # Flip to horizontal for better readability
  coord_flip() +

  # Apply minimal theme
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    axis.text.y = element_text(size = 10, face = "bold"),
    panel.grid.major.y = element_blank()
  )

#| label: fig-team-logos-py
#| fig-cap: "Top 10 offenses - Python version. While Python doesn't have direct logo support like nflplotR, we can still create professional team-based visualizations."
#| fig-width: 10
#| fig-height: 8
#| cache: true

# Calculate team offensive EPA
team_offense = (
    pbp[pbp['epa'].notna() & pbp['play_type'].isin(['pass', 'run'])]
    .groupby('posteam')
    .agg(
        plays=('epa', 'count'),
        epa_per_play=('epa', 'mean')
    )
    .reset_index()
    .nlargest(10, 'epa_per_play')
    .sort_values('epa_per_play')  # Sort for horizontal bar chart
)

# Create figure
fig, ax = plt.subplots(figsize=(10, 8))

# Create horizontal bar chart
# Using a gradient color scheme
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(team_offense)))

bars = ax.barh(
    team_offense['posteam'],
    team_offense['epa_per_play'],
    color=colors,
    alpha=0.8
)

# Add value labels on bars
for idx, (team, epa) in enumerate(zip(team_offense['posteam'], team_offense['epa_per_play'])):
    ax.text(
        epa + 0.001,  # Slightly to the right of bar
        idx,
        f'{epa:.3f}',
        va='center',
        fontsize=9,
        fontweight='bold'
    )

# Add reference line at zero
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Add labels
ax.set_xlabel('EPA per Play', fontsize=12)
ax.set_ylabel('Team', fontsize=12)

# Add title
ax.set_title(
    'Top 10 NFL Offenses by EPA per Play\n' +
    '2023 Regular Season | Ranked by offensive efficiency',
    fontsize=14,
    fontweight='bold',
    pad=20
)

# Add caption
ax.text(
    0.98, 0.02,
    'Data: nfl_data_py | Higher EPA = More efficient offense',
    transform=ax.transAxes,
    ha='right',
    fontsize=9,
    style='italic',
    color='gray'
)

# Grid for easier reading
ax.grid(axis='x', alpha=0.3)

# Adjust layout
plt.tight_layout()
plt.show()

**nflplotR Capabilities (R):**

The nflplotR package provides several powerful functions:

1. **`scale_fill_nfl()` / `scale_color_nfl()`**: Automatically assigns official team colors to plot elements. The `type` argument selects primary or secondary colors.
2. **`geom_nfl_logos()`**: Adds team logos as geometric elements. Logos automatically resize based on plot dimensions.
3. **`geom_nfl_wordmarks()`**: Adds team wordmarks (team names in official fonts).
4. **`element_nfl_logo()`**: Uses logos as theme elements (backgrounds, axis labels).

**Design Choices:**

- **Horizontal Orientation**: Team names read more easily horizontally
- **Logos on Bars**: Provides immediate team identification
- **Team Colors**: Each bar uses that team's official primary color
- **Clean Background**: Minimal theme keeps focus on data

**Python Considerations:**

Python doesn't have a direct equivalent to nflplotR, but you can:

- Use custom color palettes based on team colors
- Import and display logos using `matplotlib.image`
- Create custom functions to map teams to colors
- Pull team colors and logo URLs from a team metadata table, such as the one returned by `nfl_data_py.import_team_desc()` (note that the `sportypy` package draws playing surfaces rather than team branding)
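A custom team-to-color mapping in Python can be as simple as a dictionary with a fallback. In this sketch the hex codes are illustrative and should be checked against an authoritative source, the EPA values are invented, and `colors_for` is a hypothetical helper. The dictionary is hard-coded so the example runs offline.

```python
import matplotlib.pyplot as plt

# Illustrative primary colors for a few teams (verify before publishing)
TEAM_COLORS = {
    "KC": "#E31837",
    "SF": "#AA0000",
    "BUF": "#00338D",
    "MIA": "#008E97",
}
FALLBACK = "#999999"  # neutral gray for any team missing from the palette

def colors_for(teams, palette=TEAM_COLORS, fallback=FALLBACK):
    """Return one color per team, in the same order as `teams`."""
    return [palette.get(team, fallback) for team in teams]

# Hypothetical EPA values, sorted so the best team ends up on top
teams = ["KC", "BUF", "SF", "MIA"]
epa_per_play = [0.07, 0.09, 0.11, 0.12]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(teams, epa_per_play, color=colors_for(teams))
ax.set_xlabel("EPA per Play")
ax.set_title("Team-colored bars (illustrative data)")
```

The fallback color keeps the plot from failing when an abbreviation is missing or spelled differently in the data.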

Why This Matters:

Team branding transforms a generic bar chart into a professional NFL graphic. Compare these two versions:

Generic version: Blue bars, team abbreviations as labels
Branded version: Team-colored bars, team logos, instant recognition

The branded version:
- Takes almost no extra time to create (one or two additional lines of code)
- Looks dramatically more professional
- Communicates more effectively (visual recognition is faster than reading)
- Shows domain expertise (you know football, not just statistics)

For presentations to coaches, executives, or fans, this polish makes a significant difference in how your work is received.

Best Practices for Team Branding

**When to Use Team Colors/Logos:**

- **Always**: When showing team-specific data
- **Rankings**: Makes it easy to find specific teams
- **Comparisons**: Helps viewers track teams across multiple plots

**When to Be Careful:**

- **Many teams at once**: 32 teams with different colors can be overwhelming
- **Colorblind accessibility**: Some team color combinations are problematic
- **Small graphics**: Logos may not be legible if too small

**Solutions:**

- Show subsets (top 10, bottom 10, specific division)
- Use grayscale for most teams, color for teams of interest
- Ensure logos are at least 0.05-0.08 of plot width
- Test printed versions (colors may not print well)
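The "grayscale for most teams, color for teams of interest" approach can be sketched with a helper that only colors a chosen set of teams. `highlight_colors` is a hypothetical function, the hex codes are illustrative, and the EPA values are invented.

```python
import matplotlib.pyplot as plt

def highlight_colors(teams, highlight, palette, muted="#BDBDBD"):
    """Color only the highlighted teams; everyone else gets a muted gray."""
    return [
        palette.get(team, muted) if team in highlight else muted
        for team in teams
    ]

# Illustrative palette and data
palette = {"KC": "#E31837", "BUF": "#00338D"}
teams = ["KC", "BUF", "SF", "MIA", "DET"]
epa_per_play = [0.07, 0.09, 0.11, 0.12, 0.10]

bar_colors = highlight_colors(teams, highlight={"KC"}, palette=palette)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(teams, epa_per_play, color=bar_colors)
ax.set_xlabel("EPA per Play")
ax.set_title("One team highlighted, others muted (illustrative data)")
```

Because only one bar carries color, the viewer's eye goes to it first, and the chart stays readable even with all 32 teams.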

Small Multiples

Small multiples (also called faceting or trellis charts) show the same visualization repeated for different subsets of data. This technique is powerful for revealing patterns across categories without overplotting.

When to Use Small Multiples:

Comparing patterns across many categories (teams, positions, downs)
Showing how relationships vary by group
Avoiding overplotting when you have many categories
Creating grid layouts that facilitate comparison

Edward Tufte, the visualization pioneer, argues that for a wide range of problems in data presentation, small multiples are the best design solution, because they let viewers compare patterns across groups naturally.

Let's create small multiples showing EPA distributions for each down.

#| label: fig-facets-r
#| fig-cap: "EPA distributions by down using faceted density plots. Each panel shows the distribution for a different down, revealing how EPA patterns change as downs progress."
#| fig-width: 10
#| fig-height: 8
#| cache: true

# Prepare data
down_epa <- pbp %>%
  filter(
    !is.na(epa),
    play_type %in% c("pass", "run"),
    !is.na(down),
    down %in% 1:4  # Only regular downs
  )

# Create faceted density plot
down_epa %>%
  ggplot(aes(x = epa, fill = play_type)) +

  # Create density curves
  geom_density(alpha = 0.6, adjust = 1.5) +

  # Add reference line at zero
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +

  # Create separate panel for each down
  # ncol = 2: arrange in 2 columns
  # scales = "free_y": let each panel set its own y-axis range (peak density differs by down)
  facet_wrap(
    ~down,
    ncol = 2,
    scales = "free_y",
    labeller = labeller(down = c(
      "1" = "1st Down",
      "2" = "2nd Down",
      "3" = "3rd Down",
      "4" = "4th Down"
    ))
  ) +

  # Use consistent colors
  scale_fill_manual(
    values = c("pass" = "#00BFC4", "run" = "#F8766D"),
    labels = c("Pass", "Run"),
    name = "Play Type"
  ) +

  # Limit x-axis
  scale_x_continuous(limits = c(-5, 8)) +

  # Add labels
  labs(
    title = "EPA Distributions by Down and Play Type",
    subtitle = "2023 NFL Regular Season | Patterns change as downs progress",
    x = "Expected Points Added",
    y = "Density",
    caption = "Data: nflfastR | Each panel shows a different down"
  ) +

  # Apply theme
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    legend.position = "top",
    strip.text = element_text(face = "bold", size = 12),  # Facet labels
    panel.spacing = unit(1, "lines")  # Space between panels
  )

#| label: fig-facets-py
#| fig-cap: "EPA by down - Python faceted version. Shows how EPA distributions vary across downs for both pass and run plays."
#| fig-width: 10
#| fig-height: 8
#| cache: true

# Prepare data
down_epa = pbp[
    pbp['epa'].notna() &
    pbp['play_type'].isin(['pass', 'run']) &
    pbp['down'].notna() &
    pbp['down'].isin([1, 2, 3, 4])
].copy()

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
axes = axes.flatten()

# Plot for each down
for idx, down_num in enumerate([1, 2, 3, 4]):
    ax = axes[idx]

    # Filter to this down
    down_data = down_epa[down_epa['down'] == down_num]

    # Plot density for each play type
    for play_type, color, label in [
        ('pass', '#00BFC4', 'Pass'),
        ('run', '#F8766D', 'Run')
    ]:
        data = down_data[down_data['play_type'] == play_type]['epa']
        data.plot.kde(
            ax=ax,
            color=color,
            alpha=0.6,
            linewidth=2,
            label=label,
            bw_method=0.3
        )

    # Add reference line
    ax.axvline(x=0, color='black', linestyle='--', alpha=0.5)

    # Set title for this panel
    ax.set_title(f'{down_num}{"st" if down_num==1 else "nd" if down_num==2 else "rd" if down_num==3 else "th"} Down',
                 fontweight='bold', fontsize=12)

    # Set x-axis limits
    ax.set_xlim(-5, 8)

    # Add legend to first panel only
    if idx == 0:
        ax.legend(title='Play Type', loc='upper right')

    # Labels
    ax.set_xlabel('Expected Points Added')
    ax.set_ylabel('Density')

# Overall title
fig.suptitle(
    'EPA Distributions by Down and Play Type\n2023 NFL Regular Season',
    fontsize=14,
    fontweight='bold',
    y=0.995
)

# Adjust layout
plt.tight_layout()
plt.show()

**What Faceting Does:**

1. **Splits Data**: Divides data into subsets based on a categorical variable (here, down)
2. **Creates Panels**: Makes a separate plot for each subset
3. **Uses Common Scales**: By default, `facet_wrap()` gives all panels the same x and y scales (makes comparison easier); this example overrides the default for the y-axis
4. **Arranges in Grid**: Organizes panels in rows and columns

**Design Choices:**

- **2x2 Grid**: Four downs arranged in a 2-column grid
- **Shared X-Axis**: EPA scale is same across panels for easy comparison
- **Free Y-Axis**: Each panel sets its own y-range (`scales = "free_y"` in R, unshared y-axes in Python) because peak density differs by down. Densities are normalized, so panel height reflects how concentrated outcomes are, not how many plays each down has
- **Consistent Colors**: Pass and run use same colors across all panels
- **Bold Panel Labels**: Clear labels for each down

**Code Comparison:**

- **R**: `facet_wrap(~down)` creates facets automatically
- **Python**: Manual subplot creation with a loop: more code but more control

Interpreting the Patterns:

Examining EPA distributions across downs reveals how situational pressure changes play outcomes:

1st Down Patterns:

Most EPA values cluster near zero
Wide spread for both passes and runs
Success rates are relatively balanced

2nd Down Patterns:

Similar to 1st down but slightly more variable
Distance-to-go affects play selection and outcomes
Mix of short-yardage and long-yardage situations

3rd Down Patterns:

EPA distributions become more bimodal (two peaks)
One peak below zero (failed conversions, which usually lead to punts)
Another peak at positive values (successful conversions)
Pass plays show much higher variance than runs

4th Down Patterns:

Most extreme distributions
Strong bimodality: plays either succeed dramatically or fail dramatically
Fewer total plays (many 4th downs are punts/field goals, not included here)
Very high stakes visible in the distribution shape

Strategic Insights:

The changing distributions across downs reflect increasing pressure and decreasing margin for error. On 1st down, offenses have flexibility—mistakes can be overcome. By 4th down, plays must succeed or the drive ends. This pressure manifests in the distribution shapes: more bimodal (success or failure, little middle ground) and more extreme outcomes.

Small Multiples Best Practices

**Effective Use:**

1. **Consistent Scales**: Use the same scales across panels unless there's a good reason not to
2. **Logical Ordering**: Arrange panels in a meaningful order (chronological, by value, by category)
3. **Readable Labels**: Make panel labels clear and descriptive
4. **Limited Panels**: Don't exceed 12-16 panels (becomes hard to process)
5. **Common Elements**: Use same colors, styles, and reference lines across panels

**When to Facet:**

- **Many categories**: Too many to show on one plot without overplotting
- **Comparing patterns**: Want to see if relationships differ across groups
- **Temporal progression**: Showing how patterns evolve over time

**Alternatives:**

- **Color/shape encoding**: If you have 2-3 categories, use color instead
- **Animation**: For temporal data, animated plots can show changes
- **Interactive filters**: Web-based dashboards with category selectors
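The consistent-scales and common-elements guidelines can be sketched in matplotlib with `sharex` and `sharey`, which lock every panel to the same axis ranges. The data here are synthetic random draws standing in for EPA by down, and histograms are used in place of density curves to keep the example dependency-free.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for EPA by down (wider spread on later downs, by assumption)
rng = np.random.default_rng(42)
samples = {
    down: rng.normal(loc=0.0, scale=1.0 + 0.3 * down, size=500)
    for down in [1, 2, 3, 4]
}

# sharex/sharey force identical scales, so panel heights are directly comparable
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
bins = np.linspace(-6, 6, 41)  # same bin edges in every panel

for ax, (down, values) in zip(axes.flat, samples.items()):
    ax.hist(values, bins=bins, density=True, color="#0066CC", alpha=0.7)
    ax.axvline(0, color="black", linestyle="--", alpha=0.5)  # common reference line
    ax.set_title(f"Down {down}", fontweight="bold")

# Label only the outer axes to reduce clutter
for ax in axes[-1, :]:
    ax.set_xlabel("Expected Points Added")
for ax in axes[:, 0]:
    ax.set_ylabel("Density")

fig.tight_layout()
```

With shared y-axes, a flatter panel really does mean a more spread-out distribution. With free y-axes, the viewer has to read the tick labels to make that comparison.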

Exporting and Sharing Visualizations

Creating great visualizations is only half the battle; you also need to export and share them effectively. Different contexts require different formats, resolutions, and styling.

Export Formats and Best Practices

Common Export Formats:

| Format | Best For | Pros | Cons |
|--------|----------|------|------|
| PNG | Presentations, web, social media | Universal support, good compression | Raster (pixelates when scaled) |
| PDF | Publications, print, reports | Vector (scales infinitely), professional | Files can grow large for plots with many points; some compatibility issues |
| SVG | Web graphics, further editing | Vector, editable in design tools | Not supported in some contexts |
| JPEG | Photos, web (rarely for data viz) | Small file size | Lossy compression, not ideal for text/lines |

Resolution Guidelines:

Screen/Web: 100-150 DPI (dots per inch)
Presentations: 150 DPI
Print: 300 DPI minimum
Publications: 600 DPI for line graphics
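For raster formats, these DPI values translate into pixel dimensions by simple multiplication: pixels = inches × DPI. A quick sketch, where `pixel_size` is a hypothetical helper:

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a raster export: inches multiplied by dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)

# A 10 x 6 inch figure at the resolutions listed above
for dpi in (100, 150, 300, 600):
    width_px, height_px = pixel_size(10, 6, dpi)
    print(f"{dpi:>3} DPI -> {width_px} x {height_px} px")
```

Doubling the DPI quadruples the pixel count, which is why a 600 DPI export is so much larger on disk than a 150 DPI one.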

#| label: export-r
#| eval: false
#| echo: true

# Create a plot to export
my_plot <- pbp %>%
  filter(!is.na(epa), play_type %in% c("pass", "run")) %>%
  ggplot(aes(x = epa, fill = play_type)) +
  geom_density(alpha = 0.6) +
  theme_minimal() +
  labs(title = "EPA Distribution by Play Type")

# Export as PNG for presentations
ggsave(
  filename = "epa_distribution.png",
  plot = my_plot,
  width = 10,          # Width in inches
  height = 6,          # Height in inches
  dpi = 150,           # Resolution (dots per inch)
  bg = "white"         # Background color
)

# Export as PDF for publication
ggsave(
  filename = "epa_distribution.pdf",
  plot = my_plot,
  width = 10,
  height = 6,
  device = cairo_pdf  # Better font rendering
)

# Export high-resolution PNG for print
ggsave(
  filename = "epa_distribution_print.png",
  plot = my_plot,
  width = 10,
  height = 6,
  dpi = 300,          # High resolution
  bg = "white"
)

# Export for Twitter/social media
# Twitter recommends 2:1 aspect ratio
ggsave(
  filename = "epa_distribution_twitter.png",
  plot = my_plot,
  width = 10,
  height = 5,         # 2:1 aspect ratio
  dpi = 150
)

#| label: export-py
#| eval: false
#| echo: true

# Create a plot to export
fig, ax = plt.subplots(figsize=(10, 6))

# Plot code here...
# (assuming plot is created on ax)

# Export as PNG for presentations
fig.savefig(
    'epa_distribution.png',
    dpi=150,              # Resolution
    bbox_inches='tight',  # Remove extra whitespace
    facecolor='white',    # Background color
    edgecolor='none'      # No border
)

# Export as PDF for publication
fig.savefig(
    'epa_distribution.pdf',
    format='pdf',
    bbox_inches='tight',
    facecolor='white'
)

# Export high-resolution PNG for print
fig.savefig(
    'epa_distribution_print.png',
    dpi=300,              # High resolution
    bbox_inches='tight',
    facecolor='white'
)

# Export for Twitter/social media
# Create new figure with 2:1 aspect ratio
fig_twitter, ax_twitter = plt.subplots(figsize=(10, 5))
# Recreate plot on this figure...
fig_twitter.savefig(
    'epa_distribution_twitter.png',
    dpi=150,
    bbox_inches='tight',
    facecolor='white'
)

plt.close('all')  # Close all figures to free memory

**Key Export Parameters:**

1. **Dimensions** (`width`, `height`): Specify in inches. Most journals and presentations expect specific dimensions. Common sizes:
   - Full-width single column: 7-8 inches
   - Full-width two column: 3.5-4 inches per column
   - Presentation slide: 10-12 inches wide
2. **Resolution** (`dpi`): Higher DPI = sharper image but larger file
   - 100 DPI: Quick exports for review
   - 150 DPI: Presentations, web
   - 300 DPI: Print, publications
   - 600 DPI: High-quality print (journals often request)
3. **Background** (`bg`, `facecolor`): Set to white to avoid transparent backgrounds that may render poorly in some contexts
4. **Cropping** (`bbox_inches='tight'` in Python): Removes extra whitespace around the plot, so the saved image can differ slightly from the `figsize` dimensions

**Format Selection:**

- **PNG**: Most versatile, works everywhere, good compression
- **PDF**: Best for documents that will be printed or edited
- **SVG**: Best if you need to edit in Adobe Illustrator or similar tools
- **JPEG**: Avoid for data visualizations (lossy compression blurs text)
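When the same figure is needed in several formats, a small wrapper around `fig.savefig()` keeps the export settings consistent. This is a minimal sketch: `export_figure` is a hypothetical helper, the plotted values are placeholders, and the files go to a temporary directory so the example leaves nothing behind in your project.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

def export_figure(fig, stem, out_dir, formats=("png", "pdf"), dpi=150):
    """Save `fig` once per format with shared settings; return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for fmt in formats:
        path = out_dir / f"{stem}.{fmt}"
        # Format is inferred from the file extension
        fig.savefig(path, dpi=dpi, bbox_inches="tight", facecolor="white")
        paths.append(path)
    return paths

# Placeholder plot standing in for a real visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot([1, 2, 3], [0.50, 0.62, 0.41])
ax.set_title("Placeholder plot")

out_dir = Path(tempfile.mkdtemp())  # swap in a real folder such as "figures/"
saved = export_figure(fig, "epa_distribution", out_dir, formats=("png", "pdf", "svg"))
plt.close(fig)  # free memory once the files are written

print([p.name for p in saved])
```

Note that `dpi` affects only the raster output (PNG here). The PDF and SVG files store the lines and text as vectors.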

Publication Checklist

Before exporting final visualizations for publication:

**Technical:**

- [ ] Resolution is 300+ DPI for print, 150+ for web
- [ ] Dimensions match publication requirements
- [ ] All text is legible at final size
- [ ] Colors work in grayscale (if printed in B&W)

**Content:**

- [ ] All axes are labeled with units
- [ ] Title is informative and complete
- [ ] Legend is clear and positioned well
- [ ] Data source is cited in caption
- [ ] No typos or errors in labels

**Style:**

- [ ] Fonts are professional and readable
- [ ] Colors are consistent with other figures
- [ ] White space is balanced
- [ ] No unnecessary decorations (chartjunk removed)

Taking 10 minutes to check these items prevents countless revisions later.
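The grayscale item on the checklist can be checked roughly in code before printing. The sketch below converts each color to a single gray level using the common luma weights (0.299, 0.587, 0.114) and flags pairs that land too close together. The function names, the `min_gap` threshold of 0.15, and the three-color palette are all illustrative assumptions, not a standard.

```python
from itertools import combinations

from matplotlib.colors import to_rgb


def gray_level(color):
    """Approximate gray level (0 = black, 1 = white) using common luma weights."""
    r, g, b = to_rgb(color)
    return 0.299 * r + 0.587 * g + 0.114 * b


def grayscale_conflicts(palette, min_gap=0.15):
    """Return label pairs whose gray levels differ by less than `min_gap`.

    `min_gap` is an arbitrary threshold chosen for illustration.
    """
    levels = {label: gray_level(color) for label, color in palette.items()}
    return [
        (a, b)
        for a, b in combinations(levels, 2)
        if abs(levels[a] - levels[b]) < min_gap
    ]


# Illustrative palette (not official team colors)
palette = {"Pass": "#ff0000", "Rush": "#008000", "Kick": "#ffff00"}
conflicts = grayscale_conflicts(palette)
print(conflicts)
```

In this palette the red and green map to almost the same gray level, so the pair is flagged even though the two colors look distinct on screen. A flagged pair is a prompt to change lightness, or to add a second encoding such as line style or direct labels.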

Summary

Effective data visualization is a critical skill for football analytics. This chapter covered the principles, chart types, techniques, and tools needed to create professional, informative football visualizations:

Core Principles:

Clarity: Visualizations should communicate their message within 3-5 seconds
Accuracy: Visual encoding must truthfully represent data without distortion
Aesthetics: Professional appearance enhances credibility and impact

Fundamental Chart Types:

Histograms: Show distributions, reveal shape and spread
Density Plots: Compare distributions smoothly across groups
Bar Charts: Compare categories, show rankings
Line Charts: Display trends over time or ordered sequences
Scatter Plots: Reveal relationships between two variables

Advanced Techniques:

Team Branding: Using official colors and logos for recognition and professionalism
Small Multiples: Comparing patterns across many categories
Color Theory: Choosing colors that encode data accurately and work for all viewers
Grammar of Graphics: Thinking in layers to build custom visualizations

Technical Skills:

Loading and configuring visualization packages in R and Python
Creating plots with both ggplot2 and matplotlib/seaborn
Customizing aesthetics for publication quality
Exporting in appropriate formats and resolutions

Football-Specific Applications:

Visualizing EPA distributions to understand play type variance
Creating win probability charts to show game flow
Using team colors and logos for authentic NFL graphics
Comparing offensive efficiency across multiple dimensions

The visualization skills you've developed here will serve you throughout your football analytics career. Whether you're presenting to coaches, publishing research, or sharing insights on social media, your ability to create clear, accurate, and beautiful visualizations will set your work apart.

Remember: the goal of visualization is insight, not decoration. Every element should serve a purpose. Every choice should enhance understanding. And every visualization should tell a story that matters.

Exercises

Exercise 1: Team Performance Dashboard

**Objective**: Create a multi-panel dashboard showing different aspects of a single team's performance.

**Task**: Choose one team and create a 2x2 grid of visualizations showing:

1. EPA distribution (histogram or density plot)
2. EPA by quarter (bar chart showing how they perform in each quarter)
3. Success rate by down (bar chart)
4. Pass vs. rush EPA comparison (scatter plot with league average reference lines)

**Requirements**:

- Use team colors consistently across all plots
- Add appropriate labels and titles
- Include a main title for the entire dashboard
- Make it publication-ready (could be handed to a coach)

**Hints**:

- Use `patchwork` in R or `plt.subplots()` in Python
- Filter data to a single team early
- Consider using nflplotR's `scale_fill_nfl()` or custom colors
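One possible starting scaffold for the Python version of this exercise is sketched below. It only sets up the 2x2 layout with `plt.subplots()`. The data are simulated, the column names (`week`, `qtr`, `down`, `play_type`, `epa`) are assumed to follow play-by-play conventions, `TEAM_COLOR` is a placeholder rather than an official team color, and the reference lines at zero stand in for league averages that you would compute from real data.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated plays for one team; replace with filtered play-by-play data
rng = np.random.default_rng(7)
n = 800
plays = pd.DataFrame({
    "week": rng.integers(1, 18, n),   # weeks 1 through 17
    "qtr": rng.integers(1, 5, n),     # quarters 1 through 4
    "down": rng.integers(1, 5, n),    # downs 1 through 4
    "play_type": rng.choice(["pass", "run"], n),
    "epa": rng.normal(0.02, 1.2, n),
})
plays["success"] = plays["epa"] > 0

TEAM_COLOR = "#1f4e79"  # placeholder; swap in the team's primary color

fig, axes = plt.subplots(2, 2, figsize=(12, 9))

# Panel 1: EPA distribution
axes[0, 0].hist(plays["epa"], bins=30, color=TEAM_COLOR)
axes[0, 0].set(title="EPA distribution", xlabel="EPA per play", ylabel="Plays")

# Panel 2: mean EPA by quarter
epa_by_qtr = plays.groupby("qtr")["epa"].mean()
axes[0, 1].bar(epa_by_qtr.index.astype(str), epa_by_qtr.values, color=TEAM_COLOR)
axes[0, 1].set(title="EPA by quarter", xlabel="Quarter", ylabel="Mean EPA per play")

# Panel 3: success rate by down
sr_by_down = plays.groupby("down")["success"].mean()
axes[1, 0].bar(sr_by_down.index.astype(str), sr_by_down.values, color=TEAM_COLOR)
axes[1, 0].set(title="Success rate by down", xlabel="Down", ylabel="Success rate")

# Panel 4: weekly mean rush EPA vs. pass EPA, one point per week
weekly = plays.pivot_table(index="week", columns="play_type", values="epa", aggfunc="mean")
axes[1, 1].scatter(weekly["run"], weekly["pass"], color=TEAM_COLOR)
axes[1, 1].axhline(0, color="gray", linestyle="--")  # stand-in for league-average pass EPA
axes[1, 1].axvline(0, color="gray", linestyle="--")  # stand-in for league-average rush EPA
axes[1, 1].set(title="Pass vs. rush EPA by week", xlabel="Rush EPA per play", ylabel="Pass EPA per play")

fig.suptitle("Team Performance Dashboard (simulated data)", fontsize=16)
fig.tight_layout()

out_path = Path("team_dashboard.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight", facecolor="white")
plt.close(fig)
```

The scaffold leaves the analytical choices open: whether panel 4 uses one point per week or per game, and how league averages are computed, are part of the exercise.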

Exercise 2: Interactive Win Probability

**Objective**: Create an interactive win probability chart using plotly.

**Task**:

1. Select an exciting close game from 2023
2. Create an interactive win probability chart where:
   - Hovering shows play description, quarter, time, score
   - Clicking a point highlights that play
   - The chart is zoomable and pannable
3. Export as an HTML file that can be shared

**Requirements**:

- Use `plotly` (available in both R and Python)
- Include informative hover text
- Make it visually appealing
- Add a title and caption

**Hints**:

- `ggplotly()` in R converts ggplot to plotly
- `plotly.express` in Python provides quick interactive plots
- In R, map the hover content to the `text` aesthetic and pass `tooltip = "text"` to `ggplotly()`

Exercise 3: Positional EPA Comparison

**Objective**: Compare EPA across player positions using advanced visualization.

**Task**: Create a visualization comparing quarterback EPA by team:

1. Calculate average EPA for each team's quarterbacks
2. Create a visualization that shows:
   - Team rankings
   - Confidence intervals (if you have enough data)
   - Highlight playoff teams vs. non-playoff teams
3. Add team logos or colors

**Requirements**:

- Filter to only QB pass plays
- Calculate both mean EPA and sample size
- Consider showing uncertainty if appropriate
- Make it professional and polished

**Extensions**:

- Compare to receiver EPA
- Show EPA by target (which receivers get targeted)
- Break down by quarter or game situation
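The "mean EPA and sample size" and "confidence intervals" steps can be sketched with a pandas group-by. The example below uses simulated data with placeholder team codes, and the column names `posteam` and `epa` are assumptions. The interval is a simple normal approximation (mean ± 1.96 standard errors) that treats plays as independent, which is a simplification for real play-by-play data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated QB pass plays; team codes and effect sizes are placeholders
rng = np.random.default_rng(11)
team_effect = {"AAA": 0.15, "BBB": 0.05, "CCC": -0.05, "DDD": -0.15}
pass_plays = pd.DataFrame({"posteam": rng.choice(list(team_effect), 2000)})
pass_plays["epa"] = rng.normal(pass_plays["posteam"].map(team_effect).to_numpy(), 1.4)

# Mean EPA, standard deviation, and sample size per team
summary = (
    pass_plays.groupby("posteam")["epa"]
    .agg(mean_epa="mean", sd="std", n="count")
    .reset_index()
)

# Normal-approximation 95% interval around each team's mean
summary["se"] = summary["sd"] / np.sqrt(summary["n"])
summary["ci_low"] = summary["mean_epa"] - 1.96 * summary["se"]
summary["ci_high"] = summary["mean_epa"] + 1.96 * summary["se"]

# Rank teams from best to worst mean EPA
summary = summary.sort_values("mean_epa", ascending=False).reset_index(drop=True)

# Dot-and-whisker chart: one row per team, best team at the top
fig, ax = plt.subplots(figsize=(8, 4))
ypos = np.arange(len(summary))
ax.errorbar(summary["mean_epa"], ypos, xerr=1.96 * summary["se"], fmt="o", capsize=3)
ax.set_yticks(ypos)
ax.set_yticklabels(summary["posteam"])
ax.invert_yaxis()
ax.axvline(0, color="gray", linestyle="--")
ax.set_xlabel("Mean EPA per pass play (95% interval)")
fig.tight_layout()
plt.close(fig)
```

Reporting `n` alongside each mean makes it easier to judge when an interval is wide because of a small sample rather than inconsistent play.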
