Learning Objectives
By the end of this chapter, you will be able to:
- Understand and apply principles of effective data visualization to football analytics
- Master ggplot2 (R) and matplotlib/seaborn (Python) for creating football visualizations
- Select appropriate chart types for different data relationships and analytical questions
- Create publication-quality football visualizations with proper color, labels, and annotations
- Design interactive visualizations for presentations and web-based reports
- Apply color theory and design principles specifically for football data
- Use team colors and logos effectively with nflplotR and related tools
- Tell compelling stories with football data through visual narratives
Introduction
In the high-stakes world of football analytics, your ability to communicate insights through effective visualization is just as critical as your analytical skills. Imagine you've discovered that a team's fourth-down decision-making costs them three expected points per game, a significant tactical disadvantage. If you present this finding in a dense spreadsheet with hundreds of rows, decision-makers' eyes will glaze over. But if you create a compelling visualization that instantly reveals the pattern, you can change coaching strategy and impact game outcomes.
This chapter explores the art and science of data visualization specifically tailored for football analytics. We'll move beyond generic charting to create visualizations that speak the language of football—using team colors, incorporating logos, and designing graphics that resonate with coaches, front office executives, and fans alike. Whether you're preparing a presentation for an NFL analytics department, writing a research paper, or creating content for social media, the visualization skills you develop here will be essential.
Effective football visualization requires balancing multiple concerns. Your graphics must be technically accurate, visually appealing, and immediately interpretable. They need to work for audiences ranging from statistically sophisticated analysts to coaches who think in terms of yards and points, not regression coefficients. They must look professional in PowerPoint presentations, academic papers, and Twitter posts. Most importantly, they must reveal insights—not just display data.
Throughout this chapter, we'll build from fundamental principles to advanced techniques. We'll start by understanding what makes a visualization effective, explore the grammar of graphics that underlies modern visualization tools, and then work through practical examples using real NFL data. By the end, you'll be able to create visualizations that would be at home in an NFL front office, a peer-reviewed journal, or on the desk of an ESPN analyst.
Why Visualization Matters in Football Analytics
Good visualization serves multiple critical functions in football analytics:
- **Pattern Recognition**: Makes complex patterns immediately apparent that would be hidden in tables of numbers
- **Decision Support**: Facilitates rapid decision-making under pressure (in-game analytics, draft decisions)
- **Communication**: Bridges the gap between analysts and non-technical stakeholders like coaches and executives
- **Credibility**: Enhances professionalism and trust in your analytical work
- **Exploration**: Enables interactive discovery of insights in your data
- **Persuasion**: Convinces decision-makers to act on your recommendations
- **Memory**: Creates memorable insights that stick with audiences long after presentations end
The most brilliant analysis is worthless if it can't be communicated effectively. Visualization is your primary tool for turning data into decisions.
Principles of Effective Visualization
Before we write a single line of code, we must internalize the principles that separate exceptional visualizations from mediocre ones. These principles transcend any particular tool or programming language—they apply whether you're using R, Python, Tableau, or even hand-drawing charts.
The Three Pillars of Good Visualization
Every effective data visualization rests on three fundamental pillars: clarity, accuracy, and aesthetics. Understanding these principles will guide every visualization decision you make throughout your career.
Pillar 1: Clarity – The Message Must Be Immediate
Clarity means your visualization's primary message should be apparent within 3-5 seconds of viewing. Your audience shouldn't need to study the graphic, squint at tiny labels, or puzzle over what the axes represent. This is especially critical in football analytics, where coaches may be reviewing your work between meetings or executives scanning reports before important decisions.
How to Achieve Clarity:
- Remove Chartjunk: Eliminate decorative elements that don't convey information. Edward Tufte, the visualization pioneer, calls these elements "chartjunk": grid lines you don't need, background colors that distract, or 3D effects that obscure data. Every element should serve a purpose.
- Use Appropriate Chart Types: A scatter plot reveals correlations; a line chart shows trends over time; a bar chart compares categories. Choosing the wrong type creates confusion. We'll explore chart type selection in detail shortly.
- Provide Clear Labels: Axes should have descriptive labels with units specified. Titles should be informative, not just "Figure 1." A good title tells readers what they're seeing: "Pass EPA Outperforms Rush EPA by 0.13 Points per Play" is better than "EPA by Play Type."
- Ensure Readability: Use font sizes large enough to read (minimum 10-12 points for body text). Maintain sufficient contrast between text and background. Avoid light yellow text on white backgrounds or dark blue on black.
The Five-Second Test
After creating a visualization, step away from your computer for a few minutes. Then return and look at your graphic for exactly five seconds. Can you immediately identify:
1. What the visualization is about?
2. What the main pattern or finding is?
3. What the axes represent?
If not, your visualization needs more clarity. This simple test prevents countless hours of confusion for your audience.
Pillar 2: Accuracy – The Visualization Must Not Mislead
Accuracy means your visualization represents the data truthfully. This goes beyond having correct numbers—it means the visual encoding (how data maps to visual properties) doesn't distort perception. Misleading visualizations, whether intentional or accidental, undermine trust and can lead to terrible decisions.
Common Accuracy Pitfalls and Solutions:
- Truncated Axes: When comparing magnitudes (like team totals), axes should generally start at zero. If Team A has 350 total points and Team B has 340, showing only the 340-350 range makes the difference appear enormous when it's actually modest (2.9%). However, when showing changes or differences, non-zero axes may be appropriate; just be explicit about it.
- Inappropriate Scales: Using logarithmic scales when your audience expects linear scales can confuse interpretation. Be explicit about scale choice in your labels.
- Distorted Aspect Ratios: Stretching or squashing charts can exaggerate or minimize trends. Line charts are particularly sensitive to this: a 45-degree slope suggests balanced change, but you can make any trend look dramatic by manipulating the aspect ratio.
- Cherry-Picked Data: Showing only a subset of data without disclosure misleads readers. If you're showing "Top 10 Offenses," make that clear; don't imply you're showing all teams.
- Missing Context: Showing raw statistics without accounting for game situation, opponent quality, or sample size can mislead. Always provide necessary context.
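The truncated-axis pitfall above can be quantified. The short sketch below compares how tall Team A's bar looks relative to Team B's when the axis starts at zero versus at a higher value. The function name `apparent_ratio` and the axis start of 330 are illustrative assumptions, not part of any plotting library.

```python
def apparent_ratio(value_a: float, value_b: float, axis_start: float = 0.0) -> float:
    """Ratio of the drawn bar heights when the value axis begins at axis_start."""
    height_a = value_a - axis_start
    height_b = value_b - axis_start
    if height_a <= 0 or height_b <= 0:
        raise ValueError("axis_start must be below both values")
    return height_a / height_b


# Season point totals from the example in the text
team_a_points, team_b_points = 350, 340

# Honest comparison: axis starts at zero
true_ratio = apparent_ratio(team_a_points, team_b_points)

# Truncated comparison: axis starts at 330 (hypothetical choice for illustration)
truncated_ratio = apparent_ratio(team_a_points, team_b_points, axis_start=330)

print(f"Axis from 0:   Team A's bar is {true_ratio:.3f}x the height of Team B's")
print(f"Axis from 330: Team A's bar is {truncated_ratio:.3f}x the height of Team B's")
```

With a zero-based axis the bars differ by about 3%, matching the data. With the truncated axis the same data draws one bar twice as tall as the other.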
Show Uncertainty When Relevant:
Football is probabilistic, not deterministic. When your analysis involves estimates (like player projections) or small sample sizes (like a quarterback's first three games), show confidence intervals or error bars. This honesty builds trust and prevents overconfident decision-making.
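As a rough illustration of what an error bar encodes, the sketch below computes a mean and an approximate 95% confidence interval for a small sample. The EPA values are made up for demonstration, the helper name `mean_with_ci` is hypothetical, and the interval uses a normal approximation (z = 1.96); with samples this small, a t-based interval would be somewhat wider.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean, based on the sample standard deviation
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * standard_error, mean + z * standard_error


# Illustrative (made-up) per-play EPA values for a small sample of plays
sample_epa = [0.42, -0.31, 1.10, -0.85, 0.05, 0.67, -0.12, 0.30]

mean, lower, upper = mean_with_ci(sample_epa)
print(f"Mean EPA: {mean:.3f}, approx. 95% CI: [{lower:.3f}, {upper:.3f}]")
```

The half-width `upper - mean` is the kind of value you would pass as `yerr` to matplotlib's `ax.errorbar()` or map to `geom_errorbar()` in ggplot2.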
The Misleading Y-Axis
One of the most common visualization mistakes in sports media is the truncated y-axis used to exaggerate small differences.
**Example**: Imagine two quarterbacks with passer ratings of 98.5 and 96.2. If you create a bar chart with a y-axis running from 95 to 100, the difference looks massive: one bar is almost three times as tall as the other (3.5 units versus 1.2). But if the axis runs from 0 to 158.3 (the maximum possible passer rating), the difference appears appropriately modest.
**When to Truncate**: It's acceptable to truncate axes when showing *changes* or *differences* rather than total magnitudes, but always label clearly so readers understand what they're seeing.
Pillar 3: Aesthetics – The Visualization Should Be Visually Appealing
Aesthetics might seem superficial compared to accuracy, but visual appeal directly impacts how seriously your work is taken. A polished, professionally designed visualization signals that you care about quality and details. It makes your analysis more persuasive and memorable.
Elements of Aesthetic Excellence:
- Thoughtful Color Choices: Colors should be purposeful, not arbitrary. Use team colors for authenticity, diverging color schemes (red-white-green) to show good/neutral/bad performance, and sequential schemes (light to dark blue) to show increasing quantities. Limit your palette to 3-5 colors per visualization to avoid overwhelming viewers.
- Consistent Styling: All visualizations in a report or presentation should share a consistent aesthetic: same fonts, same color schemes, same axis formatting. This consistency appears professional and makes it easier for audiences to switch between graphics.
- Balanced White Space: Don't cram too much into one graphic. White space (empty space around and within visualizations) helps viewers focus and prevents visual clutter. It's okay to have margins and padding; they make your content more digestible.
- Audience Consideration: Design for your specific audience. Academic journals typically prefer conservative, clean designs. Social media graphics can be bolder and more stylized. Presentations to coaches might emphasize practical takeaways over statistical nuance.
The Power of Polish:
Professional sports organizations evaluate analysts partly on presentation skills. Two analysts might produce identical statistical findings, but the one who can package those findings in compelling visualizations will have greater impact. Aesthetics matter.
Professional Quality Standards
In NFL analytics roles, your visualizations will be seen by head coaches, general managers, and owners, people who make million-dollar decisions. Your graphics should meet professional standards:
- Use high-resolution outputs (300 DPI for print, 150 DPI for presentations)
- Eliminate typos and grammatical errors in labels
- Maintain brand consistency (use team colors and logos appropriately)
- Test visualizations with diverse viewers before finalizing
- Export in appropriate formats (vector formats like PDF for publications, PNG for presentations)
The difference between good and great is often just 15 minutes of polishing, but that difference can determine whether your recommendations are implemented.
Choosing the Right Chart Type
One of the most important visualization decisions is selecting the appropriate chart type for your data and message. Different chart types excel at revealing different patterns. Using the wrong type obscures insights or confuses viewers.
The following table provides guidance for common data relationships in football analytics:
| Data Relationship | Best Chart Type | Football Analytics Example | When to Use |
|---|---|---|---|
| Distribution of a single variable | Histogram, Density Plot | Distribution of EPA values across all plays | When you want to understand the shape, spread, and central tendency of a variable |
| Comparison of categories | Bar Chart (vertical or horizontal) | Team offensive EPA rankings; QB completion percentages by team | When comparing discrete categories on a single metric |
| Relationship between two continuous variables | Scatter Plot | Pass EPA vs. Rush EPA by team; QB attempts vs. yards per attempt | When exploring correlations or patterns between two numeric variables |
| Trend over time | Line Chart | Win probability throughout a game; team performance across season | When showing how something changes over time or across an ordered sequence |
| Part-to-whole composition | Stacked Bar Chart, Pie Chart (use sparingly) | Play type distribution (% pass vs. run) by team | When showing how categories combine to make a whole |
| Comparing distributions | Box Plot, Violin Plot, Overlaid Density Plots | Comparing EPA distributions for pass vs. run plays | When comparing the full distribution shape across categories |
| Multiple variables across categories | Faceted (Small Multiples) Plots | EPA distributions by down (1st, 2nd, 3rd, 4th) and play type | When showing the same relationship across many subgroups |
| Statistical uncertainty | Error Bars, Confidence Intervals | Team EPA with 95% confidence intervals | When sample size is small or estimates have uncertainty |
| Geographic/spatial patterns | Heat Maps, Geographic Maps | Field position analysis; where on the field plays succeed | When location or position matters to the analysis |
Principles for Chart Selection:
- Match the Visual to the Data Structure: Categorical data (teams, play types) work well with bar charts. Continuous data (EPA, yards gained) suit histograms, density plots, or scatter plots.
- Consider Your Primary Question: If asking "which team is best?", use a ranked bar chart. If asking "how are these two stats related?", use a scatter plot. The question guides the choice.
- Think About Comparisons: Bar charts make categorical comparisons easy. Line charts work better for temporal comparisons. Scatter plots excel at multidimensional comparisons.
- Avoid Overused Types: Pie charts are almost always inferior to bar charts for showing proportions (human vision is better at comparing lengths than angles). 3D charts add no information while making data harder to read. Avoid both.
Why Avoid Pie Charts?
Pie charts are ubiquitous in business presentations but rarely the best choice. Here's why:
**The Problem**: Human visual perception is much better at comparing lengths (as in bar charts) than angles or areas (as in pie charts). When you have more than 2-3 categories, pie charts become very difficult to read accurately.
**Example**: Try to determine if a pie slice representing 22% is bigger than one representing 20%. Now imagine the same comparison with bar lengths: it's instantly obvious.
**When They're Acceptable**: Pie charts work for showing simple binary splits (win percentage: 65% wins, 35% losses) or when precise comparison isn't needed (general sense of distribution). Even then, a bar chart or simple statistic usually works better.
**Better Alternative**: Use horizontal bar charts to show part-to-whole relationships. They're easier to label, easier to compare, and can display many more categories without becoming cluttered.
Color Theory for Football Visualizations
Color is one of the most powerful, and most frequently misused, elements of data visualization. In football analytics, color serves double duty: it must encode data accurately while also connecting to the visual language of football (team colors, and conventional associations such as red for poor performance and green for good performance).
Understanding Color Types
Sequential Colors: Use when data has a natural order from low to high. Examples include light blue to dark blue, or white to dark green. Sequential colors work well for continuous variables like EPA, yards gained, or win probability.
- Football Example: Showing team success rates from low (light gray) to high (dark green)
- When to Use: Any metric where "more" has a consistent meaning (more EPA, more wins, higher efficiency)
Diverging Colors: Use when data has a meaningful midpoint with extremes in both directions. Typically combines two sequential scales meeting at a neutral center. Common schemes include red-white-blue or red-yellow-green.
- Football Example: EPA (negative to positive), or performance relative to league average
- When to Use: Metrics where zero is meaningful (above/below average, positive/negative), or when comparing to a benchmark
Categorical Colors: Use distinct hues for different categories with no inherent order. Each category gets a clearly different color. Limit to 5-7 categories maximum for visibility.
- Football Example: Different colors for AFC vs. NFC, or for different positions
- When to Use: Nominal categories (team names, positions, play types) where there's no natural ordering
Team Colors: NFL teams have established color identities. Using authentic team colors adds recognition value and professionalism to your visualizations.
- Football Example: Using Kansas City red for Chiefs data points, Miami teal for Dolphins
- When to Use: Any team-specific visualization, especially when showing multiple teams simultaneously
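To make the diverging case concrete, here is a minimal sketch that centers a red-white-blue scale on zero so that negative and positive EPA land on opposite sides of a neutral midpoint. It uses matplotlib's `TwoSlopeNorm` and the built-in `"RdBu"` colormap; the team labels and EPA values are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm, to_hex

# Illustrative (made-up) EPA-per-play values; note the range is not symmetric
team_epa = {"Team A": -0.30, "Team B": -0.05, "Team C": 0.00, "Team D": 0.20}

# Pin the middle of the color scale to zero, even though |min| != |max|
norm = TwoSlopeNorm(
    vmin=min(team_epa.values()),
    vcenter=0.0,
    vmax=max(team_epa.values()),
)
cmap = plt.get_cmap("RdBu")  # red for low values, blue for high values

# Map each value to a hex color that can be passed to any plotting call
team_hex = {team: to_hex(cmap(norm(epa))) for team, epa in team_epa.items()}

for team, epa in team_epa.items():
    print(f"{team}: EPA {epa:+.2f} -> {team_hex[team]}")
```

Without the explicit center, a value of zero would be colored according to where it falls between the minimum and maximum, and the neutral color would no longer mean "league-average" or "break-even."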
Best Practices for Color Use
1. Limit Your Palette
Use no more than 5-7 colors in a single visualization. More colors make it impossible for viewers to distinguish categories or track patterns. If you need to show 32 NFL teams, consider:
- Showing only top 10 and bottom 10
- Using grayscale for most teams and highlighting 2-3 teams of interest in color
- Creating small multiples (separate panels for different groups)
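The "gray out most teams, highlight a few" option can be sketched as a small helper that builds a color list for a plotting call. The function name `highlight_palette` is hypothetical, and the two hex codes are assumed approximations of the Chiefs' red and Dolphins' aqua; in practice you would look colors up from a team metadata table.

```python
def highlight_palette(teams, highlights, muted="#B0B0B0"):
    """Return one color per team: its highlight color if given, otherwise gray."""
    return [highlights.get(team, muted) for team in teams]


# Team abbreviations in the order they will be plotted
teams = ["KC", "MIA", "SF", "DAL", "BUF", "NYJ"]

# Teams of interest and their colors (hex values are assumptions for illustration)
highlights = {"KC": "#E31837", "MIA": "#008E97"}

colors = highlight_palette(teams, highlights)
print(list(zip(teams, colors)))
```

The resulting list can be passed as the `color` argument to a matplotlib call such as `ax.bar()` or `ax.scatter()`, so only the highlighted teams draw the eye.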
2. Ensure Colorblind Accessibility
Approximately 8% of males and 0.5% of females have some form of color vision deficiency (most commonly red-green colorblindness). Your visualizations should be readable for these viewers.
Solutions:
- Use colorblind-friendly palettes (many visualization libraries include these: viridis, ColorBrewer sets)
- Don't rely solely on color to convey information—use shapes, line types, or labels as well
- Test your visualizations with colorblind simulation tools
- Avoid red-green combinations for critical comparisons
3. Use Color to Highlight, Not Decorate
Every color should have a purpose. Don't add colors just because your software offers 256 options. Strategic use of color directs attention:
- Gray out less important elements, use color for your key finding
- Use a single bright color to highlight teams or players of interest
- Reserve red for warning or bad performance, green for good (matching common conventions)
4. Consider Grayscale Printing
Many journals and reports are printed in black and white. Your visualizations should remain interpretable when printed without color:
- Test by converting to grayscale before finalizing
- Use different line styles (solid, dashed, dotted) not just colors
- Include clear labels so color isn't the only distinguishing feature
- Use patterns or fills in addition to colors for bars
Using NFL Team Colors
The nflplotR package in R and similar tools in Python provide access to official NFL team colors. This adds authenticity and immediate recognition to your visualizations.
Benefits:
- Instant Recognition: Viewers immediately identify teams by their colors
- Professional Appearance: Shows attention to detail and domain knowledge
- Brand Consistency: Aligns with how teams are presented in all NFL media
Considerations:
- Some team colors are very similar (multiple teams use red, blue, or black)
- Team logos work better than just colors when showing many teams simultaneously
- Ensure sufficient contrast between team colors and backgrounds
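One way to check contrast between a color and its background is the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG), where 4.5:1 is the usual threshold for normal text. The sketch below implements that formula with the standard library only; the function names are my own, and the colors tested are generic examples, not official team colors.

```python
def _linearize(channel: float) -> float:
    """Convert an sRGB channel in [0, 1] to linear light (WCAG definition)."""
    if channel <= 0.03928:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color, from 0 (black) to 1 (white)."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio between two colors, from 1 (identical) to 21."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Generic example pairs: (foreground, background)
pairs = [("#FFFF00", "#FFFFFF"), ("#000000", "#FFFFFF"), ("#00008B", "#000000")]
for foreground, background in pairs:
    ratio = contrast_ratio(foreground, background)
    verdict = "OK" if ratio >= 4.5 else "too low"
    print(f"{foreground} on {background}: {ratio:.2f}:1 ({verdict})")
```

Running a team's primary color against your plot background through a check like this flags the combinations that need a darker or lighter variant.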
Common Color Mistakes in Football Analytics
**Rainbow Color Schemes for Continuous Data**: Using rainbow colors (red-orange-yellow-green-blue-violet) for continuous metrics is problematic because:
- The perceptual distance between colors is uneven (yellow appears brighter than blue)
- There's no natural "middle" to a rainbow
- They're not colorblind-friendly
- **Better Alternative**: Use sequential (light-to-dark) or diverging (red-white-blue) schemes
**Too Many Colors**: Showing all 32 NFL teams in different colors creates visual chaos. Viewers can't track which color belongs to which team.
- **Better Alternative**: Focus on top/bottom teams, or use team logos instead of colors
**Poor Contrast**: Light yellow on white backgrounds, or dark blue on black. These combinations are nearly invisible.
- **Better Alternative**: Test contrast ratios (aim for at least 4.5:1), use darker or lighter variants
**Ignoring Colorblind Accessibility**: 1 in 12 males can't distinguish red from green reliably.
- **Better Alternative**: Use colorblind-safe palettes, or supplement color with shapes/patterns
Grammar of Graphics: The ggplot2 Philosophy
Before diving into code, we need to understand the conceptual framework that underlies modern data visualization. The Grammar of Graphics, developed by statistician Leland Wilkinson and implemented beautifully in R's ggplot2 package, provides a systematic way to think about building visualizations.
Even if you primarily use Python, understanding this grammar will make you a better data visualizer. It shifts your thinking from "what chart template should I use?" to "how should I map my data to visual properties?" This shift enables you to create novel, customized visualizations that perfectly match your analytical needs.
The Core Components
The Grammar of Graphics breaks every visualization into seven fundamental components. Like combining words in sentences, you combine these components to create meaningful graphics:
1. Data: The dataset you want to visualize. In football analytics, this is typically play-by-play data, team statistics, or player performance metrics.
2. Aesthetics (aes): Mappings from data variables to visual properties. Common aesthetics include:
- x and y: position on the plot
- color: the color of elements
- fill: the fill color (for bars, areas)
- size: the size of points or lines
- shape: the shape of points (circle, triangle, square)
- alpha: transparency (0 = invisible, 1 = opaque)
3. Geometries (geom): The visual representations of data—points, lines, bars, areas, etc. Each geometry corresponds to a type of mark you draw on the plot:
- geom_point(): scatter plot points
- geom_line(): lines connecting points
- geom_bar() or geom_col(): bar charts
- geom_histogram(): histograms
- geom_density(): density curves
- geom_boxplot(): box plots
4. Scales: Control how data values map to aesthetic properties. Scales manage:
- Axis limits and breaks (tick marks)
- Color palettes
- Size ranges
- Transformations (log scale, square root scale)
5. Facets: Create small multiples—separate panels for different subsets of data. Faceting allows you to show how patterns vary across categories without overlapping too much data.
6. Coordinate Systems: Define how data positions map to the plot plane. Usually Cartesian (standard x/y axes), but can be polar (for radial plots) or geographic (for maps).
7. Themes: Control the overall visual appearance—fonts, background colors, grid lines, axis styling. Themes handle the "polish" that makes visualizations professional.
The Layered Approach to Building Graphics
The power of the Grammar of Graphics is its compositional nature. You build complex visualizations by layering simple components. This is like constructing sentences: you start with a subject and verb, then add modifiers, clauses, and punctuation.
Basic Template:
ggplot(data = <DATA>) + # Initialize with data
aes(x = <VARIABLE>, y = <VARIABLE>) + # Define aesthetic mappings
geom_<TYPE>() + # Add geometric layer
scale_<AESTHETIC>_<TYPE>() + # Customize scales
facet_<TYPE>(~<VARIABLE>) + # Create facets (optional)
labs(title = "...", x = "...", y = "...") + # Add labels
theme_<NAME>() # Apply theme
Example Build-Up:
Let's see how we build a complex visualization layer by layer to understand EPA distribution:
# Layer 1: Just the data and axes (blank canvas)
ggplot(pbp_data, aes(x = epa))
# Layer 2: Add geometry (now we see the histogram)
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram()
# Layer 3: Customize the bins and colors
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7)
# Layer 4: Add a reference line at zero
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red")
# Layer 5: Improve labels
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Expected Points Added",
    x = "EPA",
    y = "Number of Plays"
  )
# Layer 6: Apply professional theme
ggplot(pbp_data, aes(x = epa)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Distribution of Expected Points Added",
    subtitle = "2023 NFL Regular Season",
    x = "EPA",
    y = "Number of Plays",
    caption = "Data: nflfastR"
  ) +
  theme_minimal()
Each layer adds clarity and professionalism. This iterative approach lets you refine visualizations systematically rather than guessing at parameters.
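For readers working in Python, the same layer-by-layer build-up can be approximated with matplotlib. This is a sketch under stated assumptions: it uses synthetic, normally distributed values as a stand-in for real EPA (the mean and spread are arbitrary), so the shape is illustrative only.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for play-level EPA (arbitrary mean and spread, not real data)
random.seed(42)
synthetic_epa = [random.gauss(0.0, 1.3) for _ in range(5000)]

# Layer 1: the canvas (figure and axes)
fig, ax = plt.subplots(figsize=(10, 6))

# Layers 2-3: the geometry, with customized bins and colors
ax.hist(synthetic_epa, bins=50, color="steelblue", alpha=0.7)

# Layer 4: a reference line at zero
ax.axvline(0, linestyle="--", color="red")

# Layer 5: informative labels
ax.set_title("Distribution of Expected Points Added (synthetic data)")
ax.set_xlabel("EPA")
ax.set_ylabel("Number of Plays")

# Layer 6: light theming, removing the top and right borders
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

fig.savefig("epa_histogram_sketch.png", dpi=100)
```

matplotlib is more imperative than ggplot2, but the order of the calls mirrors the same progression: canvas, geometry, reference marks, labels, then styling.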
Think in Layers, Not Templates
Many people approach visualization by searching for templates: "bar chart with error bars" or "scatter plot with trend line." This works for simple cases but limits creativity. Instead, think in layers:
1. What data do I want to show?
2. What visual properties should encode which variables?
3. What geometric marks best represent my data?
4. What refinements make the message clearer?
This layered thinking lets you create custom visualizations perfectly suited to your specific analytical question. You're not constrained by pre-existing templates; you compose your own visual arguments.
Why This Matters for Football Analytics
The Grammar of Graphics is particularly valuable for football analytics because football data is complex and multidimensional. You're rarely just plotting one variable. Instead, you're typically showing:
- EPA by play type (two variables: one continuous, one categorical)
- Performance across down and distance situations (three or more variables)
- Team statistics with uncertainty estimates (data + error)
- Time-series patterns with seasonal trends (temporal data)
- Spatial patterns on the field (coordinate data)
The grammar gives you a systematic framework for handling this complexity. Instead of forcing your data into predefined chart types, you map your variables to appropriate aesthetic properties and geometries. This flexibility is essential for revealing the patterns hidden in football's intricate data.
Grammar of Graphics in Python
While the Grammar of Graphics was popularized through R's `ggplot2`, similar principles apply in Python:
- **plotnine**: A direct port of ggplot2 to Python, using almost identical syntax
- **Altair**: Implements the Vega-Lite grammar (similar conceptual framework)
- **Seaborn**: Higher-level interface that handles common cases while still being compositional
- **Matplotlib**: The foundation; more imperative than declarative, but still compositional
Even when using matplotlib's more procedural approach, thinking in terms of data-to-aesthetic mappings will improve your visualizations. The grammar is a mental framework, not just a library.
Setting Up Your Visualization Environment
Now that we understand the principles and theory, let's set up our programming environment for creating football visualizations. We'll load the necessary packages and configure sensible defaults that will save time throughout this chapter.
Loading Visualization Libraries
Different packages serve different purposes in the visualization ecosystem. Understanding what each contributes will help you choose the right tool for each situation.
#| label: setup-r
#| message: false
#| warning: false
#| cache: true
# Core data manipulation and visualization
# tidyverse: Meta-package including ggplot2, dplyr, tidyr, and more
# This is the foundation for modern R data analysis
library(tidyverse)
# Football data access
# nflfastR: Provides clean, analysis-ready NFL play-by-play data
# This package is maintained by the nflverse community
library(nflfastR)
# NFL-specific visualization enhancements
# nflplotR: Adds team logos, colors, and NFL-specific geometries to ggplot2
# Makes it easy to create professional NFL graphics
library(nflplotR)
# Additional visualization packages
# scales: Provides formatting functions for axes (percent, comma, dollar signs)
library(scales)
# patchwork: Combines multiple ggplot2 plots into composed layouts
# Essential for creating multi-panel figures
library(patchwork)
# plotly: Converts ggplot2 graphics to interactive web-based visualizations
# Great for presentations and exploratory analysis
library(plotly)
# ggrepel: Intelligent label placement that avoids overlaps
# Crucial for scatter plots with many labeled points
library(ggrepel)
# gt: Creates publication-quality tables
# While not strictly visualization, tables often accompany graphics
library(gt)
# Set global ggplot2 theme for all subsequent plots
# theme_minimal() provides a clean, professional appearance
# base_size = 12 ensures readable text
theme_set(theme_minimal(base_size = 12))
# Confirm successful loading
cat("✓ R visualization packages loaded successfully\n")
cat("✓ Default theme set to theme_minimal() with 12pt base font\n")
cat("✓ Ready to create NFL visualizations\n")
#| label: setup-py
#| message: false
#| warning: false
#| cache: true
# Core data manipulation
# pandas: The standard library for working with tabular data in Python
import pandas as pd
# numpy: Numerical computing library, essential for array operations
import numpy as np
# Football data access
# nfl_data_py: Python equivalent of nflfastR
# Provides access to nflverse data through Python
import nfl_data_py as nfl
# Visualization packages
# matplotlib: The foundational plotting library in Python
# Most other visualization libraries build on matplotlib
import matplotlib.pyplot as plt
# seaborn: Higher-level interface for statistical graphics
# Provides beautiful default styles and complex plot types
import seaborn as sns
# plotly for interactive visualizations
# plotly.express: High-level interface for quick interactive plots
import plotly.express as px
# plotly.graph_objects: Lower-level control for custom interactive plots
import plotly.graph_objects as go
# Configure matplotlib style
# Use seaborn's style system for better-looking default plots
plt.style.use('seaborn-v0_8-darkgrid')
# Set seaborn color palette
# 'husl' provides evenly-spaced colors that are perceptually distinct
sns.set_palette("husl")
# Configure pandas display options
# Show up to 50 columns when printing DataFrames (instead of truncating early)
pd.options.display.max_columns = 50
# Increase width to prevent line wrapping in console
pd.options.display.width = 120
# Set default figure size for all matplotlib plots
# (10, 6) provides good aspect ratio for most screens and presentations
plt.rcParams['figure.figsize'] = (10, 6)
# Set DPI (dots per inch) for sharper on-screen display
# 100 DPI is good for screens; increase to 300 for print-quality
plt.rcParams['figure.dpi'] = 100
# Confirm successful setup
print("✓ Python visualization packages loaded successfully")
print("✓ matplotlib configured with seaborn-darkgrid style")
print(f"✓ Default figure size set to {plt.rcParams['figure.figsize']}")
print("✓ Ready to create NFL visualizations")
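The relationship between figure size, DPI, and output resolution can be verified directly. The sketch below saves the same 10 x 6 inch figure at two DPI settings and reads the pixel dimensions back from the PNG header. The plotted values are placeholders and the helper name `png_pixel_size` is hypothetical.

```python
import io
import struct

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# A 10 x 6 inch figure, matching the default size configured above
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot([1, 2, 3], [0.05, 0.12, 0.08])  # placeholder values, not real data


def png_pixel_size(figure, dpi):
    """Render the figure to an in-memory PNG and return (width, height) in pixels."""
    buffer = io.BytesIO()
    figure.savefig(buffer, format="png", dpi=dpi)
    # In a PNG file, bytes 16-23 hold the width and height as big-endian integers
    return struct.unpack(">II", buffer.getvalue()[16:24])


screen_size = png_pixel_size(fig, dpi=100)
print_size = png_pixel_size(fig, dpi=300)

print(f"100 DPI: {screen_size[0]} x {screen_size[1]} pixels")
print(f"300 DPI: {print_size[0]} x {print_size[1]} pixels")
```

Passing `dpi` to `savefig()` overrides `figure.dpi` for that export only, so you can keep a screen-friendly default while still producing print-resolution files.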
Loading Sample Data
Throughout this chapter, we'll use play-by-play data from the 2023 NFL season. This dataset contains every play from every regular season game, along with contextual information and advanced metrics. Let's load it now and examine its structure.
#| label: load-data-r
#| message: false
#| warning: false
#| cache: true
# Load play-by-play data for 2023 season
# The load_pbp() function automatically downloads and caches data
# It returns a tibble (enhanced data frame) with one row per play
pbp <- load_pbp(2023) %>%
# Filter to regular season only (exclude playoffs)
# season_type: "REG" = regular season, "POST" = playoffs (nflfastR play-by-play does not include preseason games)
filter(season_type == "REG")
# Load team information (logos, colors, abbreviations)
# This tibble contains NFL team metadata we'll use for visualization
teams <- nflfastR::teams_colors_logos
# Display summary information
cat("===== DATA LOADED =====\n")
cat("Plays loaded:", format(nrow(pbp), big.mark = ","), "\n")
cat("Variables available:", ncol(pbp), "\n")
cat("Season:", unique(pbp$season), "\n")
cat("Weeks covered:", min(pbp$week), "to", max(pbp$week), "\n")
cat("Teams:", n_distinct(pbp$posteam), "(offensive teams tracked)\n")
cat("\n")
# Show a few key variables to understand data structure
cat("Sample of key variables:\n")
pbp %>%
select(
game_id, # Unique game identifier
week, # Week of season (1-18)
posteam, # Team on offense (possessing team)
defteam, # Team on defense
down, # Down (1, 2, 3, or 4)
ydstogo, # Yards needed for first down
play_type, # Type of play (pass, run, punt, etc.)
yards_gained, # Yards gained on play
epa # Expected Points Added
) %>%
head(5) %>%
print(width = 120)
#| label: load-data-py
#| message: false
#| warning: false
#| cache: true
# Load play-by-play data for 2023 season
# import_pbp_data() takes a list of seasons (can load multiple years)
# Returns a pandas DataFrame with one row per play
pbp = nfl.import_pbp_data([2023])
# Filter to regular season only
# season_type: 'REG' = regular season, 'POST' = playoffs (the play-by-play data does not include preseason games)
pbp = pbp[pbp['season_type'] == 'REG']
# Load team information (colors, logos, abbreviations)
# This DataFrame contains metadata about all NFL teams
teams = nfl.import_team_desc()
# Display summary information
print("===== DATA LOADED =====")
print(f"Plays loaded: {len(pbp):,}")
print(f"Variables available: {len(pbp.columns)}")
print(f"Season: {pbp['season'].unique()[0]}")
print(f"Weeks covered: {pbp['week'].min()} to {pbp['week'].max()}")
print(f"Teams: {pbp['posteam'].nunique()} (offensive teams tracked)")
print()
# Show a few key variables to understand data structure
print("Sample of key variables:")
key_vars = [
'game_id', # Unique game identifier
'week', # Week of season (1-18)
'posteam', # Team on offense (possessing team)
'defteam', # Team on defense
'down', # Down (1, 2, 3, or 4)
'ydstogo', # Yards needed for first down
'play_type', # Type of play (pass, run, punt, etc.)
'yards_gained', # Yards gained on play
'epa' # Expected Points Added
]
print(pbp[key_vars].head(5).to_string(index=False))
This data will power all visualizations in this chapter. With it loaded, we're ready to create our first plots.
Basic Plot Types
Now we'll explore the fundamental plot types you'll use repeatedly in football analytics. For each type, we'll explain when to use it, create an example with real NFL data, and interpret what the visualization reveals. Understanding these building blocks will enable you to create more complex, customized visualizations later.
Histograms and Distributions
Histograms and density plots are essential for understanding how data is distributed. In football analytics, distributions help us understand:
- The typical range of outcomes (what's normal vs. exceptional)
- How variable performance is (tight clustering vs. wide spread)
- Whether data follows expected patterns (normal distribution, skewness)
- Where important thresholds lie (what percentage of plays are positive EPA)
When to Use Histograms:
- Exploring a new dataset to understand variable ranges
- Checking if data meets assumptions (e.g., normality for certain statistical tests)
- Communicating how common different outcomes are
- Identifying outliers or unusual patterns
Histograms divide the data range into bins and count how many observations fall in each bin. The height of each bar shows the frequency or count of observations in that range.
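The binning step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the helper name `histogram_counts` and the EPA values are hypothetical, and plotting libraries handle bin edges with more care than this.

```python
def histogram_counts(values, n_bins, lo, hi):
    """Count how many values fall in each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if v < lo or v > hi:
            continue  # values outside the plotted range are dropped
        # The top edge belongs to the last bin, so clamp the index
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical EPA values for illustration only (not real play-by-play data)
epa_values = [-2.1, -0.6, -0.4, -0.1, 0.0, 0.2, 0.3, 0.9, 1.4, 3.8]

counts = histogram_counts(epa_values, n_bins=4, lo=-4.0, hi=4.0)
for i, count in enumerate(counts):
    left = -4.0 + i * 2.0
    print(f"[{left:+.1f}, {left + 2.0:+.1f}): {'#' * count} ({count})")
```

Changing `n_bins` changes the picture without changing the data, which is why it is worth trying a few bin counts before settling on one.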
Let's examine the distribution of EPA (Expected Points Added) for pass plays. This will show us how valuable (or costly) different passing plays are.
#| label: fig-histogram-r
#| fig-cap: "Distribution of EPA for pass plays in the 2023 NFL regular season. The histogram shows a right-skewed distribution with a long right tail indicating occasional explosive passing plays, while the left tail shows catastrophic plays like interceptions and sacks."
#| fig-width: 10
#| fig-height: 6
#| cache: true
# Filter the data to pass plays only
# We need to exclude NA values in EPA, as some plays don't have EPA calculated
pass_plays <- pbp %>%
filter(
play_type == "pass", # Only passing plays
!is.na(epa) # Exclude plays without EPA (penalties, etc.)
)
# Create the histogram
ggplot(pass_plays, aes(x = epa)) +
# geom_histogram creates the histogram bars
# bins = 50: divide the EPA range into 50 equally-spaced bins
# fill: bar color (using a professional blue)
# color: outline color for each bar (white creates separation)
# alpha: transparency (0.8 = slightly transparent for softer appearance)
geom_histogram(
bins = 50,
fill = "#0066CC",
color = "white",
alpha = 0.8
) +
# Add a vertical reference line at EPA = 0
# This divides successful plays (positive EPA) from unsuccessful ones
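# linetype = "dashed": creates a dashed line
# linewidth = 1: makes the line moderately thick and visible
# (for lines, linewidth replaces the size argument, which is deprecated since ggplot2 3.4.0)
geom_vline(
xintercept = 0,
linetype = "dashed",
color = "red",
linewidth = 1
) +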
# Customize the x-axis scale
# limits: set the range we want to display (-5 to 10 EPA)
# breaks: where to place tick marks on the axis
scale_x_continuous(
limits = c(-5, 10),
breaks = seq(-5, 10, 2.5)
) +
# Add descriptive labels
# A good title explains what the reader is seeing
# Subtitle adds context or key interpretation
# Caption cites the data source and clarifies elements
labs(
title = "Distribution of EPA on Pass Plays",
subtitle = "2023 NFL Regular Season | Most passes cluster near zero, with occasional explosive gains",
x = "Expected Points Added (EPA)",
y = "Number of Plays",
caption = "Data: nflfastR | Dashed line indicates EPA = 0 (neutral plays)"
) +
# Apply a clean theme and customize text
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12, color = "gray40"),
plot.caption = element_text(size = 9, color = "gray50", hjust = 0)
)
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
#| label: fig-histogram-py
#| fig-cap: "Distribution of EPA for pass plays - Python implementation. Same data as R version, showing the characteristic right-skewed distribution of passing play outcomes."
#| fig-width: 10
#| fig-height: 6
#| cache: true
# Filter data to pass plays with valid EPA
# .notna() excludes missing values (equivalent to !is.na() in R)
pass_plays = pbp[(pbp['play_type'] == 'pass') & (pbp['epa'].notna())]
# Create figure and axis objects
# figsize=(10, 6): dimensions in inches (matches R output)
fig, ax = plt.subplots(figsize=(10, 6))
# Create the histogram
# bins=50: number of bins (same as R version)
# color: bar color (same blue as R version)
# alpha: transparency level
# edgecolor: outline color for bars
ax.hist(
pass_plays['epa'],
bins=50,
color='#0066CC',
alpha=0.8,
edgecolor='white'
)
# Add vertical reference line at EPA = 0
# linestyle='--': dashed line
# linewidth: thickness of the line
# label: legend label (though we won't show legend here)
ax.axvline(
x=0,
color='red',
linestyle='--',
linewidth=2,
label='EPA = 0'
)
# Set x-axis limits to match R version
ax.set_xlim(-5, 10)
# Add axis labels with appropriate font sizes
ax.set_xlabel('Expected Points Added (EPA)', fontsize=12)
ax.set_ylabel('Number of Plays', fontsize=12)
# Add title (matplotlib uses \n for line breaks)
ax.set_title(
'Distribution of EPA on Pass Plays\n2023 NFL Regular Season | Most passes cluster near zero, with occasional explosive gains',
fontsize=14,
fontweight='bold'
)
# Add caption as text in the bottom-left corner
# transform=ax.transAxes means coordinates are relative (0-1 range)
# (0.02, 0.02) places text in the bottom-left
# verticalalignment='bottom': anchor the text by its bottom edge
ax.text(
0.02, 0.02,
'Data: nfl_data_py | Dashed line indicates EPA = 0 (neutral plays)',
transform=ax.transAxes,
fontsize=9,
verticalalignment='bottom',
color='gray'
)
# Adjust layout to prevent label cutoff
plt.tight_layout()
# Display the plot
plt.show()
Interpreting the Output:
When you run this code, you'll see a histogram that reveals several important patterns about NFL passing:
1. Right Skewness (Positive Skew):
The distribution has a long tail extending to the right. This means that while most passes result in modest EPA (clustered near zero), occasional passes generate enormous positive EPA—think of a 75-yard touchdown bomb. These rare explosive plays pull the mean EPA higher than the median.
2. Central Clustering:
The highest bars appear near zero and slightly positive. This tells us that the most common passing outcome is a modest gain or loss. Routine passes on first down that gain 5-7 yards fall into this range.
3. Left Tail (Negative Events):
The left tail extends to about -4 or -5 EPA. These are catastrophic passing plays—interceptions, sacks for big losses, or sacks that force a team out of field goal range. While less common than the explosive positive plays, they're still frequent enough to matter.
4. Success Rate:
Looking at the area on either side of the red line, you can visually estimate that slightly less than half of passes have positive EPA. (We calculated earlier that pass success rate is about 46%.) This means passing is inherently risky—even on successful drives, many individual passes lose expected points.
5. What This Means for Strategy:
The combination of high variance (wide spread) and right skew (big-play potential) explains why NFL offenses pass so frequently despite the risk. A completion that gains 8 yards on 3rd-and-7 might only add 0.3 EPA, but a 40-yard completion can add 3-4 EPA in one play. Running plays rarely offer that explosive potential.
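Two of the claims above (the mean sitting above the median under right skew, and success rate as the share of plays with positive EPA) are easy to check numerically. The sketch below uses ten made-up EPA values rather than real play-by-play data, so the specific numbers are illustrative only.

```python
from statistics import mean, median

# Hypothetical EPA values for ten pass plays (illustration only, not real data):
# mostly small gains and losses, one interception, one long touchdown
pass_epa = [-4.2, -0.8, -0.6, -0.5, -0.3, -0.1, 0.4, 0.7, 1.1, 6.0]

mean_epa = mean(pass_epa)
median_epa = median(pass_epa)

# Success rate here means the share of plays with EPA above zero
share_positive = sum(1 for e in pass_epa if e > 0) / len(pass_epa)

print(f"Mean EPA:     {mean_epa:+.3f}")
print(f"Median EPA:   {median_epa:+.3f}")
print(f"Success rate: {share_positive:.0%}")
```

In this toy sample a single explosive play pulls the mean above zero even though most plays lose expected points, which is the same pattern the histogram shows at full scale.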
Interpreting EPA Distributions
When looking at EPA histograms, ask yourself:
Shape:
- Is it symmetric (normal distribution) or skewed?
- Where is the peak (mode)?
- How wide is the spread (variance)?
Position:
- Is the bulk of the distribution above or below zero?
- How far into negative territory does the left tail extend?
- How high into positive territory does the right tail reach?
Football Implications:
- Right skew → potential for explosive plays
- Wide spread → high variance in outcomes
- Heavy left tail → catastrophic plays possible
- Peak location → typical play outcome
These patterns tell you about risk, upside, and consistency, all crucial for decision-making.
Density Plots
Density plots smooth histograms into continuous curves, making them excellent for comparing distributions across categories. Instead of counting observations in discrete bins, density plots estimate the probability distribution function—showing where data is concentrated.
When to Use Density Plots:
- Comparing distributions across multiple groups (pass vs. run, different teams)
- When exact counts aren't important but the shape of the distribution is
- Presenting to audiences who find smooth curves more intuitive than bars
- When you want to overlay multiple distributions without cluttering
Advantage Over Histograms:
Density plots avoid the arbitrary choice of bin width that affects histogram appearance. They also make overlaying multiple distributions clearer—overlapping histograms can be visually confusing, but overlapping density curves are easy to read.
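Density plots trade the bin-width choice for a bandwidth choice. To make that concrete, here is a minimal hand-rolled Gaussian kernel density estimate using only the standard library. The function name, the sample values, and the two bandwidths are assumptions chosen for illustration; real plotting code would rely on the library's own estimator.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical EPA values (illustration only)
epa_values = [-1.5, -0.6, -0.4, -0.2, 0.0, 0.1, 0.3, 0.8, 2.5]

narrow = gaussian_kde(epa_values, bandwidth=0.2)  # follows individual points
wide = gaussian_kde(epa_values, bandwidth=0.8)    # smoother, flatter curve

for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  narrow={narrow(x):.3f}  wide={wide(x):.3f}")
```

A small bandwidth produces a spiky curve that chases individual plays, while a large one can smooth away real features, so the smoothing parameter deserves the same scrutiny as a histogram's bin count.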
Let's compare EPA distributions for passing vs. rushing plays to see how these two play types differ in their outcome patterns.
#| label: fig-density-r
#| fig-cap: "Comparing EPA distributions for pass and run plays using density curves. Pass plays show higher variance with both more explosive gains and more catastrophic losses, while run plays cluster more tightly around small negative values."
#| fig-width: 10
#| fig-height: 6
#| cache: true
# Filter to both pass and run plays with valid EPA
offensive_plays <- pbp %>%
filter(
play_type %in% c("pass", "run"), # Include both types
!is.na(epa) # Exclude missing EPA
)
# Create density plot
ggplot(offensive_plays, aes(x = epa, fill = play_type)) +
# geom_density creates the smooth density curves
# alpha = 0.6: make curves semi-transparent so we can see overlap
# adjust = 1.5: smoothing parameter (higher = smoother)
geom_density(alpha = 0.6, adjust = 1.5) +
# Add reference line at zero
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
# Use distinct colors for pass vs. run
# These are ggplot2's default two-level hues (teal and salmon); check them with a colorblind simulator if accessibility matters
scale_fill_manual(
values = c("pass" = "#00BFC4", "run" = "#F8766D"),
labels = c("Pass", "Run"),
name = "Play Type"
) +
# Limit x-axis to focus on the bulk of the data
# Plays outside roughly -5 to +8 EPA are rare; note that scale limits drop
# those plays (with a warning) before the density is estimated
scale_x_continuous(limits = c(-5, 8)) +
# Add comprehensive labels
labs(
title = "EPA Distribution: Pass vs. Run",
subtitle = "Pass plays show higher variance with more big gains and big losses | 2023 NFL Regular Season",
x = "Expected Points Added",
y = "Density",
caption = "Data: nflfastR | Density curves are smoothed histograms"
) +
# Apply theme and customizations
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11, color = "gray40"),
legend.position = "top", # Put legend at top for visibility
panel.grid.minor = element_blank() # Remove minor grid lines for cleaner look
)
#| label: fig-density-py
#| fig-cap: "EPA density comparison by play type - Python. Same patterns as R version, showing pass variance vs. run consistency."
#| fig-width: 10
#| fig-height: 6
#| cache: true
# Filter data to pass and run plays with valid EPA
plot_data = pbp[
pbp['play_type'].isin(['pass', 'run']) &
pbp['epa'].notna()
]
# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
# Create density plots for each play type
# We iterate through play types with their associated colors and labels
for play_type, color, label in [
    ('pass', '#00BFC4', 'Pass'),
    ('run', '#F8766D', 'Run')
]:
    # Extract EPA data for this play type
    data = plot_data[plot_data['play_type'] == play_type]['epa']
    # Create density plot (kernel density estimate; requires scipy)
    # bw_method is passed to scipy's gaussian_kde as the bandwidth factor,
    # so it is not on the same scale as ggplot2's adjust multiplier;
    # tune it by eye if the curves should match the R version
    data.plot.kde(
        ax=ax,
        color=color,
        alpha=0.6,
        linewidth=2,
        label=label,
        bw_method=0.3  # Smoothing bandwidth factor
    )
# Add vertical reference line at zero
ax.axvline(x=0, color='black', linestyle='--', alpha=0.5)
# Set x-axis limits to match R version
ax.set_xlim(-5, 8)
# Add labels
ax.set_xlabel('Expected Points Added', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
# Add title with subtitle (using \n for line break)
ax.set_title(
'EPA Distribution: Pass vs. Run\n' +
'Pass plays show higher variance with more big gains and big losses | 2023 NFL Regular Season',
fontsize=14,
fontweight='bold',
pad=20 # Add padding above title
)
# Add legend
# loc='upper right': position in upper right corner
# title: label for the legend
ax.legend(title='Play Type', loc='upper right', frameon=True, shadow=True)
# Add caption
ax.text(
0.98, 0.02,
'Data: nfl_data_py | Density curves are smoothed histograms',
transform=ax.transAxes,
ha='right',
fontsize=9,
style='italic',
color='gray'
)
# Adjust layout and display
plt.tight_layout()
plt.show()
Interpreting the Comparison:
This visualization reveals fundamental differences between passing and rushing in the NFL:
1. Central Tendency (Where the Peaks Are):
- Run plays: The peak (mode) is slightly left of zero, around -0.4 to -0.5 EPA. This means the most common running play loses a small amount of expected points: think of a 2-yard gain that turns 1st-and-10 into 2nd-and-8, a slightly worse situation.
- Pass plays: The peak is closer to zero, around -0.2 EPA. Passes more frequently result in neutral or slightly positive EPA.
2. Variance (Width of the Distribution):
- Run plays: The distribution is narrow and tightly clustered. Most runs result in small gains or losses (roughly -1 to +2 EPA). Very few runs result in extreme EPA values.
- Pass plays: The distribution is much wider, extending from about -4 EPA (interceptions, sacks) to +7 EPA (long touchdown passes). This reflects the high-variance nature of passing.
3. Skewness (Tail Behavior):
- Run plays: Modest right skew: occasional long runs, but even a 40-yard run might only add 2-3 EPA.
- Pass plays: Pronounced right skew: the right tail extends much further, indicating the potential for explosive plays. A 60-yard touchdown pass can add 5-7 EPA instantly.
4. Catastrophic Left Tail:
- Run plays: The left tail is short. Even fumbles (the worst running outcome) rarely lose more than 2-3 EPA.
- Pass plays: The left tail extends to -4 or -5 EPA, representing interceptions in the opponent's territory or drive-killing sacks.
5. Overlap Analysis:
The overlapping area (where both curves are visible) shows EPA ranges where both play types are common. The non-overlapping areas highlight where one play type dominates:
- Around -2 to -1 EPA: More passes than runs (incompletions, short sacks)
- Around +3 to +7 EPA: Almost exclusively passes (explosive plays)
- Around -0.5 to 0 EPA: More runs (modest gains/losses)
Strategic Implications:
These distribution differences explain modern NFL strategy:
- Why teams pass more: Despite higher risk (wider distribution = more variance), the right-skewed distribution means passes offer explosive-play potential that runs can't match.
- Why runs are still valuable: Lower variance makes runs more predictable. In situations where avoiding catastrophic plays matters (protecting a lead, running out the clock), the tighter distribution of runs is advantageous.
- Risk-reward trade-off: Passing offers higher mean EPA (+0.097) but with much higher variance. Running offers lower mean EPA (-0.038) but more consistent outcomes.
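The mean-versus-spread comparison behind these points can be expressed as a short summary table. The sketch below uses small made-up samples, not the season data, so its numbers will not match the figures quoted above; it only shows the shape of the calculation.

```python
from statistics import mean, pstdev

# Hypothetical EPA samples (illustration only, not real play-by-play data)
epa_by_type = {
    "pass": [-4.0, -0.9, -0.7, -0.5, 0.3, 0.6, 1.2, 1.5, 3.5],
    "run": [-0.9, -0.6, -0.5, -0.4, -0.2, 0.1, 0.3, 0.5, 1.1],
}

# Mean captures the typical payoff; standard deviation captures the risk
summary = {
    play_type: {"mean": mean(values), "sd": pstdev(values)}
    for play_type, values in epa_by_type.items()
}

for play_type, stats in summary.items():
    print(f"{play_type:>4}: mean EPA = {stats['mean']:+.3f}, sd = {stats['sd']:.3f}")
```

Reporting both columns side by side keeps the trade-off visible: a higher mean with a larger standard deviation is a different proposition from a lower mean with a tight spread.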
Key Insight: Variance vs. Mean
A common misconception is that "running is safer than passing." While running has less variance (smaller spread), that doesn't make it safer in all situations:
Scenario 1: 3rd-and-15 from your own 20-yard line.
- A run will almost certainly gain 2-5 yards (low variance, high probability of failure)
- A pass might be incomplete (bad), but could also gain 15+ yards for a first down (good)
- Verdict: Passing is "safer" here because the low variance of running means certain failure
Scenario 2: 3rd-and-1 protecting a lead late in the game.
- A run will gain -1 to +3 yards (low variance, moderate success rate)
- A pass could be a sack, interception, or incomplete (higher variance, more catastrophic outcomes possible)
- Verdict: Running is "safer" here because we can tolerate failure (punt) but can't afford catastrophe (turnover)
The lesson: "Safe" depends on context. Sometimes predictability is risky, and sometimes variance is risky.
Line Charts
Line charts excel at showing trends over time or ordered sequences. They're ideal for visualizing how things change—win probability during a game, team performance across a season, or cumulative statistics building up week by week.
When to Use Line Charts:
- Showing trends over time (season progression, game flow)
- Displaying cumulative totals (cumulative EPA, points scored)
- Comparing trends across groups (multiple teams' season trajectories)
- Showing continuous change in ordered data
Key Design Considerations:
- Connect points with lines only when the order is meaningful (time, sequence)
- Use different line styles or colors to distinguish multiple series
- Add reference lines for important thresholds
- Consider smoothing when data is noisy
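Two of the ideas above, cumulative totals and smoothing, are simple transformations applied before plotting. The sketch below shows one way to compute them with the standard library; the weekly EPA values are invented and `rolling_mean` is a hypothetical helper, with a trailing window as the assumed smoothing method.

```python
from itertools import accumulate

def rolling_mean(values, window):
    """Trailing moving average; early points use however many values exist."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical weekly EPA totals for one team (illustration only)
weekly_epa = [4.0, -2.0, 6.0, 1.0, -5.0, 3.0]

cumulative_epa = list(accumulate(weekly_epa))      # running season total
smoothed_epa = rolling_mean(weekly_epa, window=3)  # 3-week trailing average

for week, (raw, cum, smooth) in enumerate(
    zip(weekly_epa, cumulative_epa, smoothed_epa), start=1
):
    print(f"Week {week}: raw={raw:+.1f}  cumulative={cum:+.1f}  smoothed={smooth:+.2f}")
```

The cumulative series answers "where does the team stand so far", while the smoothed series answers "how has the team been playing lately"; plotting the raw weekly values alone tends to obscure both.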
Let's create a line chart showing how win probability changes throughout a game—this reveals the flow and key momentum shifts.
#| label: fig-line-chart-r
#| fig-cap: "Win probability chart for a specific 2023 NFL game. The line shows how the home team's win probability evolved from kickoff to final whistle, with steep changes indicating high-leverage plays like touchdowns or turnovers."
#| fig-width: 10
#| fig-height: 6
#| cache: true
# Select an exciting game from 2023 season
# Let's pick a close game with multiple lead changes
# We'll examine all plays from one game to show win probability evolution
game_plays <- pbp %>%
filter(
# Pick a specific game (you can change this to any game_id)
# game_id format is season_week_away_home
game_id == "2023_01_DET_KC", # Week 1: Lions at Chiefs (Chiefs are the home team)
!is.na(home_wp) # Ensure we have win probability data
) %>%
# Create a play sequence number for the x-axis
mutate(play_number = row_number())
# Create line chart
ggplot(game_plays, aes(x = play_number, y = home_wp)) +
# Add a filled area under the curve to emphasize the probability
# This makes it visually clear what the home team's chances are
geom_ribbon(
aes(ymin = 0, ymax = home_wp),
fill = "#0066CC",
alpha = 0.3
) +
# Add the line showing win probability trajectory
# linewidth = 1.5: makes the line prominent and easy to follow
geom_line(color = "#0066CC", linewidth = 1.5) +
# Add a horizontal reference line at 50% (even odds)
# This divides "more likely to win" from "more likely to lose"
geom_hline(
yintercept = 0.5,
linetype = "dashed",
color = "red",
alpha = 0.5
) +
# Format y-axis as percentage
# expand = c(0, 0): remove the default padding so the axis runs exactly from 0% to 100%
scale_y_continuous(
labels = scales::percent_format(),
limits = c(0, 1),
expand = c(0, 0)
) +
# Add descriptive labels
labs(
title = "Win Probability Chart: Lions at Chiefs",
subtitle = "2023 NFL Season, Week 1 | Shows home team (Chiefs) win probability throughout game",
x = "Play Number (Game Progression)",
y = "Home Team Win Probability",
caption = "Data: nflfastR | Red dashed line indicates 50% probability (even odds)"
) +
# Apply theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11, color = "gray40"),
panel.grid.minor = element_blank()
)
#| label: fig-line-chart-py
#| fig-cap: "Win probability chart - Python implementation. Shows the ebb and flow of the game through changing win probabilities."
#| fig-width: 10
#| fig-height: 6
#| cache: true
# Select game data
game_plays = (
pbp[
(pbp['game_id'] == '2023_01_DET_KC') & # Same game as R version (format: season_week_away_home)
pbp['home_wp'].notna() # Valid win probability
]
.reset_index(drop=True)
)
# Add play number for x-axis
game_plays['play_number'] = range(1, len(game_plays) + 1)
# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the line
ax.plot(
game_plays['play_number'],
game_plays['home_wp'],
color='#0066CC',
linewidth=2,
label='Home Team Win Probability'
)
# Fill area under the curve
ax.fill_between(
game_plays['play_number'],
0,
game_plays['home_wp'],
color='#0066CC',
alpha=0.3
)
# Add horizontal reference line at 50%
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='50% (Even Odds)')
# Set y-axis limits and format as percentage
ax.set_ylim(0, 1)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
# Add labels
ax.set_xlabel('Play Number (Game Progression)', fontsize=12)
ax.set_ylabel('Home Team Win Probability', fontsize=12)
# Add title
ax.set_title(
'Win Probability Chart: Lions at Chiefs\n' +
'2023 NFL Season, Week 1 | Shows home team (Chiefs) win probability throughout game',
fontsize=14,
fontweight='bold',
pad=20
)
# Add legend
ax.legend(loc='best', frameon=True)
# Add caption
ax.text(
0.98, 0.02,
'Data: nfl_data_py | Red dashed line indicates 50% probability (even odds)',
transform=ax.transAxes,
ha='right',
fontsize=9,
style='italic',
color='gray'
)
# Adjust layout and display
plt.tight_layout()
plt.show()
Interpreting the Output:
A win probability chart tells the story of a game through numbers:
1. Game Flow Narrative:
- Kickoff: Most games start near 50% (slight home field advantage might make it 52-53% for the home team)
- Momentum Shifts: Steep upward or downward movements indicate high-leverage plays (touchdowns, turnovers, 4th down conversions)
- Late-Game Tension: In close games, win probability oscillates around 50% until late, when one team pulls away
- Blowouts: If one team dominates, win probability quickly approaches 95-100% and stays there
2. Identifying Key Plays:
The steepest slope changes indicate the most important plays:
- Large positive jumps (for the charted team): Its own touchdowns, defensive touchdowns, takeaways in good field position
- Large negative drops: Giveaways, opponent touchdowns, failed 4th down conversions
- Gradual increases: Methodical drives that steadily improve field position
You can identify the exact plays that changed the game by looking for these dramatic shifts.
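Finding those shifts programmatically amounts to differencing the win probability series and ranking the changes by size. The sketch below assumes a plain list of home-team win probabilities recorded after each play; the values are made up for illustration.

```python
# Hypothetical home-team win probabilities after each play (illustration only)
home_wp = [0.55, 0.57, 0.52, 0.71, 0.69, 0.40, 0.43, 0.38, 0.62, 0.90]

# Change in win probability produced by each play:
# play i moves WP from home_wp[i - 1] to home_wp[i]
swings = [
    (i, round(home_wp[i] - home_wp[i - 1], 2))
    for i in range(1, len(home_wp))
]

# Rank plays by the size of the swing, regardless of direction
top_swings = sorted(swings, key=lambda s: abs(s[1]), reverse=True)[:3]

for play_index, delta in top_swings:
    direction = "toward" if delta > 0 else "away from"
    print(f"Play {play_index}: {delta:+.2f} ({direction} the home team)")
```

With real play-by-play data the same ranking could be joined back to the play descriptions, so the largest swings can be annotated directly on the chart.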
3. Game Excitement:
Win probability charts also quantify how exciting a game was:
- Back-and-forth games: Multiple crossings of the 50% line indicate repeated swings in which team is favored, often accompanying lead changes
- Close finishes: Win probability near 50% in the 4th quarter means an uncertain outcome
- Blowouts: Win probability above 90% for most of the game means it was never in doubt
Some analysts create an "excitement index" based on how much win probability changed and when those changes occurred.
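As a rough sketch of the idea, one simple version of such an index sums the absolute play-to-play changes in win probability. This definition is an assumption made for illustration; it ignores when the changes happen, which published versions may weight more heavily late in games. The function names and both win probability paths are hypothetical.

```python
def excitement_index(wp_series):
    """Total absolute movement in win probability across the game."""
    return sum(abs(b - a) for a, b in zip(wp_series, wp_series[1:]))

def favorite_changes(wp_series, threshold=0.5):
    """Count how many times the series crosses the threshold."""
    sides = [wp > threshold for wp in wp_series if wp != threshold]
    return sum(1 for a, b in zip(sides, sides[1:]) if a != b)

# Hypothetical win probability paths (illustration only)
blowout = [0.55, 0.70, 0.85, 0.93, 0.97, 0.99]
thriller = [0.55, 0.40, 0.62, 0.35, 0.58, 0.45, 0.80]

for name, series in [("Blowout", blowout), ("Thriller", thriller)]:
    print(
        f"{name}: excitement = {excitement_index(series):.2f}, "
        f"50% crossings = {favorite_changes(series)}"
    )
```

The blowout path moves steadily in one direction and accumulates little total movement, while the back-and-forth path accumulates far more and crosses 50% repeatedly.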
4. Strategic Decisions:
Coaches and analysts use win probability charts to evaluate decisions:
- Going for it on 4th down: Did it significantly increase win probability?
- Two-point conversions: What was the WP impact?
- Clock management: Did decisions maximize win probability?
By comparing the actual WP change to the expected WP change from different decisions, we can evaluate coaching choices.
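The comparison described here reduces to an expected-value calculation: weight the win probability after each possible outcome by how likely that outcome is. The sketch below uses invented inputs throughout; in practice the conversion probability and the resulting win probabilities would come from a fitted model, not from hand-picked numbers.

```python
def expected_wp(p_success, wp_if_success, wp_if_failure):
    """Expected win probability of a decision with two possible outcomes."""
    return p_success * wp_if_success + (1 - p_success) * wp_if_failure

# All numbers below are hypothetical inputs, not model output
go_for_it = expected_wp(p_success=0.60, wp_if_success=0.58, wp_if_failure=0.41)
punt = 0.47  # assumed win probability after a typical punt

decision_value = go_for_it - punt

print(f"Go for it: {go_for_it:.3f}")
print(f"Punt:      {punt:.3f}")
print(f"Edge from going for it: {decision_value:+.3f}")
```

Evaluating the decision this way separates the quality of the choice from its result: a failed conversion can still have been the higher expected win probability option.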
Reading Win Probability Charts
What to Look For:
1. Starting point: Home teams typically start around 52-55% WP due to home field advantage
2. Crossings of 50%: Each crossing represents a lead change or shift in who's favored
3. Steepness of changes: Steeper = more impactful play
4. Late-game behavior: Does WP converge to 0% or 100%, or stay uncertain until the end?
5. Plateaus: Periods where WP stays relatively flat indicate consistent play without dramatic events
Common Patterns:
- Gradual rise/fall: One team steadily takes control through consistent execution
- Sawtooth pattern: Back-and-forth scoring, exciting game
- Step function: Blowout where WP jumps to 90%+ and stays there
- Late collapse: WP high for one team most of game, then dramatic late reversal
Scatter Plots
Scatter plots reveal relationships between two continuous variables. They're essential for exploring correlations, identifying outliers, and understanding multidimensional patterns. In football analytics, scatter plots help us understand questions like: "Do teams that pass well also run well?" or "Is quarterback experience related to efficiency?"
When to Use Scatter Plots:
- Exploring relationships between two performance metrics
- Identifying teams or players with unusual combinations of traits
- Checking for correlations before modeling
- Showing performance across two dimensions simultaneously
- Creating quadrant charts (dividing by thresholds to create categories)
Key Elements:
- Point position: Each point represents one observation (team, player, game)
- Additional aesthetics: Point size, color, or shape can encode a third variable
- Trend lines: Add linear or smoothed trend lines to highlight relationships
- Reference lines: Horizontal/vertical lines at means or thresholds divide the space
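Two of these elements have a direct numerical counterpart: the trend line corresponds to a correlation coefficient, and the reference lines correspond to a quadrant label for each point. The sketch below computes both with the standard library. The team abbreviations and EPA values are invented, and `pearson_r` and `quadrant` are hypothetical helpers written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def quadrant(rush_epa, pass_epa):
    """Label a team by which side of zero each efficiency falls on."""
    rush = "good rush" if rush_epa > 0 else "poor rush"
    passing = "good pass" if pass_epa > 0 else "poor pass"
    return f"{passing}, {rush}"

# Hypothetical teams: (rush EPA/play, pass EPA/play); names and values are made up
teams = {
    "AAA": (0.05, 0.20),
    "BBB": (-0.10, 0.12),
    "CCC": (0.02, -0.08),
    "DDD": (-0.12, -0.15),
    "EEE": (-0.03, 0.04),
}

rush = [values[0] for values in teams.values()]
passing = [values[1] for values in teams.values()]
r = pearson_r(rush, passing)

print(f"Correlation between rush and pass EPA: {r:.2f}")
for team, (rush_epa, pass_epa) in teams.items():
    print(f"{team}: {quadrant(rush_epa, pass_epa)}")
```

Zero is used as the dividing line here to match reference lines drawn at zero; league averages are an equally reasonable threshold and would shift some teams between quadrants.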
Let's create a scatter plot comparing teams' passing EPA to their rushing EPA. This reveals whether offensive efficiency in one area relates to efficiency in the other.
#| label: fig-scatter-r
#| fig-cap: "Team offensive efficiency: passing vs. rushing EPA per play. Each team logo represents one team's average performance in both dimensions. The scatter plot reveals whether teams that excel at passing also excel at rushing."
#| fig-width: 10
#| fig-height: 8
#| cache: true
# Calculate separate pass and rush EPA for each team
team_pass_rush <- pbp %>%
filter(
!is.na(epa), # Valid EPA values
play_type %in% c("pass", "run") # Only pass and run plays
) %>%
# Group by both team and play type
group_by(posteam, play_type) %>%
# Calculate mean EPA for each team-playtype combination
summarise(mean_epa = mean(epa), .groups = "drop") %>%
# Reshape from long to wide format
# Creates separate columns for pass and run EPA
# Before: rows like (KC, pass, 0.15), (KC, run, -0.02)
# After: row like (KC, 0.15, -0.02)
pivot_wider(
names_from = play_type,
values_from = mean_epa
)
# Create scatter plot with team logos
team_pass_rush %>%
ggplot(aes(x = run, y = pass)) +
# Add reference lines at zero for both dimensions
# These divide the plot into quadrants
# linetype = "dashed": creates dashed lines
# alpha = 0.5: semi-transparent so they don't dominate
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
# Add smooth trend line
# method = "lm": linear model (straight line)
# se = TRUE: show confidence interval as a shaded area
# color/alpha: make it subtle (not the focus)
geom_smooth(
method = "lm",
se = TRUE,
color = "gray50",
alpha = 0.3
) +
# Add team logos as points
# aes(team_abbr = posteam): tells nflplotR which team each point is
# width = 0.06: size of logos
# alpha = 0.8: slight transparency
nflplotR::geom_nfl_logos(aes(team_abbr = posteam), width = 0.06, alpha = 0.8) +
# Add comprehensive labels
labs(
title = "Team Offensive Efficiency: Passing vs. Rushing",
subtitle = "Teams in upper-right quadrant excel at both | 2023 NFL Regular Season",
x = "Rush EPA per Play",
y = "Pass EPA per Play",
caption = "Data: nflfastR | Line shows linear relationship between pass and rush efficiency"
) +
# Apply theme with customizations
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11, color = "gray40"),
panel.grid.minor = element_blank(), # Remove minor grid for cleaner look
aspect.ratio = 1 # Square plot panel (equal height and width)
)
#| label: fig-scatter-py
#| fig-cap: "Pass EPA vs. Rush EPA by team - Python implementation. Team abbreviations label each point, showing the relationship between passing and rushing efficiency."
#| fig-width: 10
#| fig-height: 8
#| cache: true
# Calculate team pass and rush EPA
team_pass_rush = (
pbp[pbp['epa'].notna() & pbp['play_type'].isin(['pass', 'run'])]
.groupby(['posteam', 'play_type']) # Group by team and play type
.agg(mean_epa=('epa', 'mean')) # Calculate mean EPA
.reset_index() # Convert to regular DataFrame
# Pivot to wide format (separate columns for pass and run)
.pivot(index='posteam', columns='play_type', values='mean_epa')
.reset_index()
)
# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 8))
# Create scatter plot
scatter = ax.scatter(
team_pass_rush['run'], # x-axis: rushing EPA
team_pass_rush['pass'], # y-axis: passing EPA
s=100, # size of points
alpha=0.6, # transparency
color='#0066CC' # point color
)
# Add team labels to each point
# This makes it clear which team each point represents
for idx, row in team_pass_rush.iterrows():
    ax.annotate(
        row['posteam'], # Team abbreviation
        (row['run'], row['pass']), # Position (x, y)
        fontsize=8, # Small font size
        ha='center', # Horizontal alignment: center
        va='center', # Vertical alignment: center
        fontweight='bold' # Bold text for readability
    )
# Add reference lines at zero
# Divide plot into quadrants
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
# Add trend line
# Fit only on teams that have both values; dropping NaNs column by column
# could leave x and y with different lengths
fit_data = team_pass_rush[['run', 'pass']].dropna()
z = np.polyfit(
    fit_data['run'], # x values
    fit_data['pass'], # y values
    1 # degree = 1 (linear)
)
p = np.poly1d(z) # Create polynomial function from coefficients
# Create x values for plotting the line
x_line = np.linspace(
team_pass_rush['run'].min(),
team_pass_rush['run'].max(),
100
)
# Plot the trend line
ax.plot(
x_line,
p(x_line),
color='gray',
linestyle='-',
alpha=0.3,
label='Linear trend'
)
# Add labels
ax.set_xlabel('Rush EPA per Play', fontsize=12)
ax.set_ylabel('Pass EPA per Play', fontsize=12)
# Add title
ax.set_title(
'Team Offensive Efficiency: Passing vs. Rushing\n' +
'Teams in upper-right quadrant excel at both | 2023 NFL Regular Season',
fontsize=14,
fontweight='bold',
pad=20
)
# Add caption
ax.text(
0.98, 0.02,
'Data: nfl_data_py | Line shows linear relationship between pass and rush efficiency',
transform=ax.transAxes,
ha='right',
fontsize=9,
style='italic',
color='gray'
)
# Use equal scaling: one unit of EPA covers the same distance on both axes
ax.set_aspect('equal', adjustable='box')
# Adjust layout and display
plt.tight_layout()
plt.show()
Interpreting the Output:
This scatter plot reveals several important patterns about NFL offenses:
1. Weak Positive Correlation:
The trend line has a gentle positive slope, suggesting that teams that pass well tend to rush slightly better as well. This makes sense: good offensive lines help both passing (protection) and rushing (creating holes). However, the correlation is weak—plenty of teams deviate from the trend.
2. Quadrant Analysis:
Looking at which quadrant teams fall into reveals offensive identity:
- Upper-right quadrant (positive pass EPA, positive rush EPA): Balanced, elite offenses. These are the most complete offensive teams.
- Upper-left quadrant (positive pass EPA, negative rush EPA): Pass-first offenses. These teams win through the air despite struggling on the ground. Many modern offenses fall here.
- Lower-right quadrant (negative pass EPA, positive rush EPA): Run-first offenses. Rare in the modern NFL, but some teams still emphasize rushing.
- Lower-left quadrant (negative pass EPA, negative rush EPA): Struggling offenses. These teams are inefficient in both facets.
3. Outliers and Clusters:
Some teams deviate significantly from the trend line. These are interesting cases:
- Teams far above the trend line: passing much better than their rushing would predict
- Teams far below: rushing much better than their passing would predict
- Tight clusters: similar offensive profiles
4. The Pass-Run Continuum:
This visualization shows that the pass-vs-run debate isn't binary. Teams exist along a continuum of passing/rushing efficiency. Some excel at one, some at both, and some at neither.
5. Strategic Implications:
The weak correlation suggests that passing and rushing ability are somewhat independent. You can build an effective offense by excelling at one dimension (usually passing) without necessarily being good at the other. This supports the modern pass-first approach—since passing has higher EPA, teams optimize for passing even if it means accepting poor rushing numbers.
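The quadrant and outlier readings described above can also be computed directly rather than read off the chart. The following is a minimal sketch using only the standard library; the team codes and EPA values are made up for illustration, and the helper names (`quadrant`, `fit_line`) are hypothetical rather than part of any package used in this chapter.

```python
from statistics import mean

# Hypothetical (rush EPA, pass EPA) per play for five made-up teams
teams = {
    "AAA": (0.05, 0.20),
    "BBB": (-0.10, 0.15),
    "CCC": (0.02, -0.05),
    "DDD": (-0.12, -0.10),
    "EEE": (-0.02, 0.05),
}

def quadrant(rush_epa, pass_epa):
    """Label a team by the sign of its rush and pass EPA (zero counts as positive)."""
    if pass_epa >= 0:
        return "Balanced" if rush_epa >= 0 else "Pass-first"
    return "Run-first" if rush_epa >= 0 else "Struggling"

def fit_line(xs, ys):
    """Ordinary least squares fit: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

rush = [r for r, _ in teams.values()]
passing = [p for _, p in teams.values()]
slope, intercept = fit_line(rush, passing)

# Residual > 0: the team passes better than its rushing would predict
summary = {
    team: (quadrant(r, p), p - (slope * r + intercept))
    for team, (r, p) in teams.items()
}

for team, (label, resid) in summary.items():
    print(f"{team}: {label:<10} residual = {resid:+.3f}")
```

Sorting teams by the absolute value of the residual gives a quick shortlist of candidates to annotate on the scatter plot. With real data, the same logic applies to the `run` and `pass` columns of `team_pass_rush`.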
Reading Quadrant Charts
Quadrant charts (scatter plots with reference lines dividing the space) are powerful for categorizing observations.
How to Read:
1. Identify the reference lines (often at zero or mean values)
2. Understand what each quadrant represents
3. Look for clusters (groups of similar cases)
4. Identify outliers (unusual combinations)
5. Examine the trend line for overall relationships
When to Use:
- Comparing performance on two dimensions
- Creating typologies (categories based on two criteria)
- Identifying balanced vs. specialized entities
- Making draft or roster decisions (combine two scouting metrics)
Labeling Quadrants:
Consider adding text labels to each quadrant explaining what it represents:
- "Elite" (high on both)
- "Struggling" (low on both)
- "Pass-first" / "Run-first" (high on one, low on other)
This helps audiences who aren't statistically sophisticated understand the plot instantly.
Advanced Visualization Techniques
Beyond basic plot types, advanced techniques help you create publication-quality visualizations that tell compelling stories with your data. These techniques combine multiple elements, use sophisticated color schemes, and incorporate NFL-specific branding to create professional graphics.
Using NFL Team Colors and Logos
One of the most effective ways to make football visualizations immediately recognizable and professional is to incorporate official team colors and logos. The nflplotR package in R makes this remarkably easy.
Benefits of Team Branding:
- Instant Recognition: Viewers immediately identify teams without reading labels
- Professional Appearance: Signals domain expertise and attention to detail
- Visual Appeal: Team colors and logos make graphics more engaging
- Brand Consistency: Aligns with how fans and media already think about teams
Let's create a visualization that showcases these capabilities—a ranked bar chart with team logos and colors.
#| label: fig-team-logos-r
#| fig-cap: "Top 10 NFL offenses by EPA per play with team logos and colors. This visualization demonstrates how nflplotR enhances football graphics with authentic branding elements."
#| fig-width: 10
#| fig-height: 8
#| cache: true
# Calculate team offensive EPA
team_offense <- pbp %>%
filter(
!is.na(epa),
play_type %in% c("pass", "run")
) %>%
group_by(posteam) %>%
summarise(
plays = n(),
epa_per_play = mean(epa),
.groups = "drop"
) %>%
# Get top 10 teams
arrange(desc(epa_per_play)) %>%
head(10)
# Create visualization with team logos and colors
team_offense %>%
# Reorder teams by EPA for proper ranking
mutate(posteam = fct_reorder(posteam, epa_per_play)) %>%
ggplot(aes(x = posteam, y = epa_per_play)) +
# Create bars using team colors
# nflplotR automatically assigns each team its primary color
geom_col(aes(fill = posteam), width = 0.7, show.legend = FALSE) +
# Use official NFL team colors for the bars
# type = "primary": uses each team's primary brand color
nflplotR::scale_fill_nfl(type = "primary") +
# Add team logos on top of bars
# This adds immediate visual recognition
# width = 0.05: logos are 5% of plot width
nflplotR::geom_nfl_logos(
aes(team_abbr = posteam),
width = 0.05,
alpha = 0.9
) +
# Add reference line at zero
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
# Format y-axis
scale_y_continuous(
labels = scales::number_format(accuracy = 0.01),
expand = expansion(mult = c(0.1, 0.15))
) +
# Add labels
labs(
title = "Top 10 NFL Offenses by EPA per Play",
subtitle = "2023 Regular Season | Team colors and logos show offensive rankings",
x = NULL,
y = "EPA per Play",
caption = "Data: nflfastR | Logos and colors from nflplotR package"
) +
# Flip to horizontal for better readability
coord_flip() +
# Apply minimal theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.text.y = element_text(size = 10, face = "bold"),
panel.grid.major.y = element_blank()
)
#| label: fig-team-logos-py
#| fig-cap: "Top 10 offenses - Python version. While Python doesn't have direct logo support like nflplotR, we can still create professional team-based visualizations."
#| fig-width: 10
#| fig-height: 8
#| cache: true
# Calculate team offensive EPA
team_offense = (
pbp[pbp['epa'].notna() & pbp['play_type'].isin(['pass', 'run'])]
.groupby('posteam')
.agg(
plays=('epa', 'count'),
epa_per_play=('epa', 'mean')
)
.reset_index()
.nlargest(10, 'epa_per_play')
.sort_values('epa_per_play') # Sort for horizontal bar chart
)
# Create figure
fig, ax = plt.subplots(figsize=(10, 8))
# Create horizontal bar chart
# Using a gradient color scheme
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(team_offense)))
bars = ax.barh(
team_offense['posteam'],
team_offense['epa_per_play'],
color=colors,
alpha=0.8
)
# Add value labels on bars
for idx, (team, epa) in enumerate(zip(team_offense['posteam'], team_offense['epa_per_play'])):
    ax.text(
        epa + 0.001, # Slightly to the right of bar
        idx,
        f'{epa:.3f}',
        va='center',
        fontsize=9,
        fontweight='bold'
    )
# Add reference line at zero
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
# Add labels
ax.set_xlabel('EPA per Play', fontsize=12)
ax.set_ylabel('Team', fontsize=12)
# Add title
ax.set_title(
'Top 10 NFL Offenses by EPA per Play\n' +
'2023 Regular Season | Ranked by offensive efficiency',
fontsize=14,
fontweight='bold',
pad=20
)
# Add caption
ax.text(
0.98, 0.02,
'Data: nfl_data_py | Higher EPA = More efficient offense',
transform=ax.transAxes,
ha='right',
fontsize=9,
style='italic',
color='gray'
)
# Grid for easier reading
ax.grid(axis='x', alpha=0.3)
# Adjust layout
plt.tight_layout()
plt.show()
Why This Matters:
Team branding transforms a generic bar chart into a professional NFL graphic. Compare these two versions:
Generic version: Blue bars, team abbreviations as labels
Branded version: Team-colored bars, team logos, instant recognition
The branded version:
- Takes almost no extra time to create (two additional nflplotR calls in the R example)
- Looks dramatically more professional
- Communicates more effectively (visual recognition is faster than reading)
- Shows domain expertise (you know football, not just statistics)
For presentations to coaches, executives, or fans, this polish makes a significant difference in how your work is received.
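The Python example above falls back to a viridis gradient, but team colors can be applied in matplotlib as well by mapping abbreviations to hex codes. The sketch below is one way to do that, and it also supports coloring only the teams of interest while leaving the rest gray. The hex codes are illustrative values that should be checked against an official source, and `bar_colors` and `plot_ranked_bars` are hypothetical helper names.

```python
# Illustrative hex codes only: verify against an official source before publishing
TEAM_COLORS = {
    "KC": "#E31837",
    "SF": "#AA0000",
    "BUF": "#00338D",
    "DAL": "#002244",
}
NEUTRAL = "#B0B0B0"

def bar_colors(teams, highlight=None, palette=TEAM_COLORS, default=NEUTRAL):
    """Return one color per team.

    With highlight=None every team with a palette entry gets its own color;
    otherwise only the highlighted teams are colored and the rest are gray.
    """
    colors = []
    for team in teams:
        use_color = highlight is None or team in highlight
        colors.append(palette.get(team, default) if use_color else default)
    return colors

def plot_ranked_bars(teams, values, highlight=None):
    """Draw a horizontal bar chart; matplotlib is imported here so the
    color helper above stays usable without it."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.barh(teams, values, color=bar_colors(teams, highlight))
    ax.axvline(x=0, color="gray", linestyle="--", alpha=0.5)
    ax.set_xlabel("EPA per Play")
    return fig, ax

print(bar_colors(["KC", "SF", "BUF"], highlight={"SF"}))
```

With the `team_offense` table from the example above, a call such as `plot_ranked_bars(list(team_offense['posteam']), list(team_offense['epa_per_play']))` would draw the ranked bars. Teams missing from the palette fall back to the neutral gray, so an incomplete dictionary does not raise an error.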
Best Practices for Team Branding
When to Use Team Colors/Logos:
- Always: When showing team-specific data
- Rankings: Makes it easy to find specific teams
- Comparisons: Helps viewers track teams across multiple plots
When to Be Careful:
- Many teams at once: 32 teams with different colors can be overwhelming
- Colorblind accessibility: Some team color combinations are problematic
- Small graphics: Logos may not be legible if too small
Solutions:
- Show subsets (top 10, bottom 10, specific division)
- Use grayscale for most teams, color for teams of interest
- Ensure logos are at least 0.05-0.08 of plot width
- Test printed versions (colors may not print well)
Creating Small Multiples (Facets)
Small multiples—also called faceting or trellis charts—show the same visualization repeated for different subsets of data. This technique is powerful for revealing patterns across categories without overplotting.
When to Use Small Multiples:
- Comparing patterns across many categories (teams, positions, downs)
- Showing how relationships vary by group
- Avoiding overplotting when you have many categories
- Creating grid layouts that facilitate comparison
Edward Tufte, the visualization pioneer, argues that small multiples are the best design solution for a wide range of problems in data presentation, because they let viewers compare patterns across groups at a glance.
Let's create small multiples showing EPA distributions for each down.
#| label: fig-facets-r
#| fig-cap: "EPA distributions by down using faceted density plots. Each panel shows the distribution for a different down, revealing how EPA patterns change as downs progress."
#| fig-width: 10
#| fig-height: 8
#| cache: true
# Prepare data
down_epa <- pbp %>%
filter(
!is.na(epa),
play_type %in% c("pass", "run"),
!is.na(down),
down %in% 1:4 # Only regular downs
)
# Create faceted density plot
down_epa %>%
ggplot(aes(x = epa, fill = play_type)) +
# Create density curves
geom_density(alpha = 0.6, adjust = 1.5) +
# Add reference line at zero
geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
# Create separate panel for each down
# ncol = 2: arrange in 2 columns
# Panels share the same axes by default, which keeps the densities comparable across downs
facet_wrap(
~down,
ncol = 2,
labeller = labeller(down = c(
"1" = "1st Down",
"2" = "2nd Down",
"3" = "3rd Down",
"4" = "4th Down"
))
) +
# Use consistent colors
scale_fill_manual(
values = c("pass" = "#00BFC4", "run" = "#F8766D"),
labels = c("Pass", "Run"),
name = "Play Type"
) +
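# Zoom the x-axis without dropping plays outside the range
# (limits set on scale_x_continuous would remove those plays before the density is estimated)
coord_cartesian(xlim = c(-5, 8)) +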
# Add labels
labs(
title = "EPA Distributions by Down and Play Type",
subtitle = "2023 NFL Regular Season | Patterns change as downs progress",
x = "Expected Points Added",
y = "Density",
caption = "Data: nflfastR | Each panel shows a different down"
) +
# Apply theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 11, color = "gray40"),
legend.position = "top",
strip.text = element_text(face = "bold", size = 12), # Facet labels
panel.spacing = unit(1, "lines") # Space between panels
)
#| label: fig-facets-py
#| fig-cap: "EPA by down - Python faceted version. Shows how EPA distributions vary across downs for both pass and run plays."
#| fig-width: 10
#| fig-height: 8
#| cache: true
# Prepare data
down_epa = pbp[
pbp['epa'].notna() &
pbp['play_type'].isin(['pass', 'run']) &
pbp['down'].notna() &
pbp['down'].isin([1, 2, 3, 4])
].copy()
# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
axes = axes.flatten()
# Plot for each down
for idx, down_num in enumerate([1, 2, 3, 4]):
    ax = axes[idx]
    # Filter to this down
    down_data = down_epa[down_epa['down'] == down_num]
    # Plot density for each play type
    for play_type, color, label in [
        ('pass', '#00BFC4', 'Pass'),
        ('run', '#F8766D', 'Run')
    ]:
        data = down_data[down_data['play_type'] == play_type]['epa']
        data.plot.kde(
            ax=ax,
            color=color,
            alpha=0.6,
            linewidth=2,
            label=label,
            bw_method=0.3
        )
    # Add reference line
    ax.axvline(x=0, color='black', linestyle='--', alpha=0.5)
    # Set title for this panel (1st, 2nd, 3rd, 4th)
    suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(down_num, 'th')
    ax.set_title(f'{down_num}{suffix} Down', fontweight='bold', fontsize=12)
    # Set x-axis limits
    ax.set_xlim(-5, 8)
    # Add legend to first panel only
    if idx == 0:
        ax.legend(title='Play Type', loc='upper right')
    # Labels
    ax.set_xlabel('Expected Points Added')
    ax.set_ylabel('Density')
# Overall title
fig.suptitle(
'EPA Distributions by Down and Play Type\n2023 NFL Regular Season',
fontsize=14,
fontweight='bold',
y=0.995
)
# Adjust layout
plt.tight_layout()
plt.show()
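The 2x2 grid and the panel titles above are hard-coded for four downs. When the number of panels varies (teams, weeks, positions), it can help to derive the grid shape and the labels. The following is a small sketch using only the standard library; `ordinal` and `grid_shape` are hypothetical helper names, not functions from matplotlib or pandas.

```python
import math

def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ..., handling 11th-13th correctly."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def grid_shape(n_panels, ncol=2):
    """Rows and columns needed to lay out n_panels in at most ncol columns."""
    if n_panels < 1 or ncol < 1:
        raise ValueError("n_panels and ncol must be positive")
    ncol = min(ncol, n_panels)
    return math.ceil(n_panels / ncol), ncol

downs = [1, 2, 3, 4]
nrows, ncols = grid_shape(len(downs), ncol=2)
titles = [f"{ordinal(d)} Down" for d in downs]
print(nrows, ncols, titles)
```

The resulting `nrows` and `ncols` can be passed to `plt.subplots(nrows, ncols, ...)`. When the grid has more cells than panels, the unused axes can be hidden with `ax.set_visible(False)`.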
Interpreting the Patterns:
Examining EPA distributions across downs reveals how situational pressure changes play outcomes:
1st Down Patterns:
- Most EPA values cluster near zero
- Wide spread for both passes and runs
- Success rates are relatively balanced
2nd Down Patterns:
- Similar to 1st down but slightly more variable
- Distance-to-go affects play selection and outcomes
- Mix of short-yardage and long-yardage situations
3rd Down Patterns:
- EPA distributions become more bimodal (two peaks)
- One peak at negative values (failed conversions, which usually end the drive with a punt or field goal attempt)
- Another peak at positive values (successful conversions)
- Pass plays show much higher variance than runs
4th Down Patterns:
- Most extreme distributions
- Strong bimodality: plays either succeed dramatically or fail dramatically
- Fewer total plays (many 4th downs are punts/field goals, not included here)
- Very high stakes visible in the distribution shape
Strategic Insights:
The changing distributions across downs reflect increasing pressure and decreasing margin for error. On 1st down, offenses have flexibility—mistakes can be overcome. By 4th down, plays must succeed or the drive ends. This pressure manifests in the distribution shapes: more bimodal (success or failure, little middle ground) and more extreme outcomes.
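The "little middle ground" reading can be checked numerically instead of by eye. One simple approach, sketched below, is to measure the share of plays that land clearly below zero, near zero, and clearly above zero for each down. The EPA values here are made up for illustration, and the 0.5-point band is an arbitrary assumption rather than a standard threshold.

```python
def outcome_shares(epa_values, band=0.5):
    """Split plays into clearly negative, near-zero, and clearly positive EPA.

    band is a hypothetical cutoff: |EPA| <= band counts as 'near zero'.
    """
    n = len(epa_values)
    if n == 0:
        raise ValueError("epa_values is empty")
    negative = sum(1 for e in epa_values if e < -band)
    positive = sum(1 for e in epa_values if e > band)
    return {
        "negative": negative / n,
        "near_zero": (n - negative - positive) / n,
        "positive": positive / n,
    }

# Made-up EPA values, for illustration only (not real play-by-play data)
first_down = [-0.4, -0.2, 0.1, 0.3, -0.1, 0.8, -0.9, 0.2]
fourth_down = [-2.1, 1.9, -1.8, 2.4, -2.5, 0.2, 1.6, -1.7]

for name, values in [("1st down", first_down), ("4th down", fourth_down)]:
    shares = outcome_shares(values)
    print(name, {k: round(v, 3) for k, v in shares.items()})
```

A small `near_zero` share together with large `negative` and `positive` shares is the numeric signature of a bimodal panel. Applied to the real `down_epa` data, the function could be called once per down and play type.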
Small Multiples Best Practices
Effective Use:
1. Consistent Scales: Use the same scales across panels unless there's a good reason not to
2. Logical Ordering: Arrange panels in a meaningful order (chronological, by value, by category)
3. Readable Labels: Make panel labels clear and descriptive
4. Limited Panels: Don't exceed 12-16 panels (becomes hard to process)
5. Common Elements: Use same colors, styles, and reference lines across panels
When to Facet:
- Many categories: Too many to show on one plot without overplotting
- Comparing patterns: Want to see if relationships differ across groups
- Temporal progression: Showing how patterns evolve over time
Alternatives:
- Color/shape encoding: If you have 2-3 categories, use color instead
- Animation: For temporal data, animated plots can show changes
- Interactive filters: Web-based dashboards with category selectors
Exporting and Sharing Visualizations
Creating great visualizations is only half the battle—you also need to export and share them effectively. Different contexts require different formats, resolutions, and styling.
Export Formats and Best Practices
Common Export Formats:
| Format | Best For | Pros | Cons |
|---|---|---|---|
| PNG | Presentations, web, social media | Universal support, good compression | Raster (pixelates when scaled) |
| PDF | Publications, print, reports | Vector (scales infinitely), professional | Large file size, some compatibility issues |
| SVG | Web graphics, further editing | Vector, editable in design tools | Not supported in some contexts |
| JPEG | Photos, web (rarely for data viz) | Small file size | Lossy compression, not ideal for text/lines |
Resolution Guidelines:
- Screen/Web: 100-150 DPI (dots per inch)
- Presentations: 150 DPI
- Print: 300 DPI minimum
- Publications: 600 DPI for line graphics
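For raster formats, the pixel dimensions of an export are the figure size in inches multiplied by the DPI. A short sketch of that arithmetic follows, with hypothetical helper names:

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a raster export: inches multiplied by DPI."""
    return round(width_in * dpi), round(height_in * dpi)

def inches_for_pixels(width_px, height_px, dpi):
    """Figure size in inches that yields the requested pixel dimensions."""
    return width_px / dpi, height_px / dpi

print(pixel_size(10, 6, 150))             # (1500, 900)
print(pixel_size(10, 6, 300))             # (3000, 1800)
print(inches_for_pixels(1200, 600, 150))  # (8.0, 4.0)
```

Working backward from a required pixel size is often easier than guessing: pick the DPI for the target medium, then compute the width and height in inches. In matplotlib, `bbox_inches='tight'` crops whitespace, so the saved image can differ slightly from these numbers.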
#| label: export-r
#| eval: false
#| echo: true
# Create a plot to export
my_plot <- pbp %>%
filter(!is.na(epa), play_type %in% c("pass", "run")) %>%
ggplot(aes(x = epa, fill = play_type)) +
geom_density(alpha = 0.6) +
theme_minimal() +
labs(title = "EPA Distribution by Play Type")
# Export as PNG for presentations
ggsave(
filename = "epa_distribution.png",
plot = my_plot,
width = 10, # Width in inches
height = 6, # Height in inches
dpi = 150, # Resolution (dots per inch)
bg = "white" # Background color
)
# Export as PDF for publication
ggsave(
filename = "epa_distribution.pdf",
plot = my_plot,
width = 10,
height = 6,
device = cairo_pdf # Better font rendering
)
# Export high-resolution PNG for print
ggsave(
filename = "epa_distribution_print.png",
plot = my_plot,
width = 10,
height = 6,
dpi = 300, # High resolution
bg = "white"
)
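# Export for Twitter/social media
# A 2:1 aspect ratio is a common choice for in-feed images; check the platform's current guidance
ggsave(
  filename = "epa_distribution_twitter.png",
  plot = my_plot,
  width = 10,
  height = 5, # 2:1 aspect ratio
  dpi = 150,
  bg = "white" # Avoid a transparent background
)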
#| label: export-py
#| eval: false
#| echo: true
# Create a plot to export
fig, ax = plt.subplots(figsize=(10, 6))
# Plot code here...
# (assuming plot is created on ax)
# Export as PNG for presentations
fig.savefig(
'epa_distribution.png',
dpi=150, # Resolution
bbox_inches='tight', # Remove extra whitespace
facecolor='white', # Background color
edgecolor='none' # No border
)
# Export as PDF for publication
fig.savefig(
'epa_distribution.pdf',
format='pdf',
bbox_inches='tight',
facecolor='white'
)
# Export high-resolution PNG for print
fig.savefig(
'epa_distribution_print.png',
dpi=300, # High resolution
bbox_inches='tight',
facecolor='white'
)
# Export for Twitter/social media
# Create new figure with 2:1 aspect ratio
fig_twitter, ax_twitter = plt.subplots(figsize=(10, 5))
# Recreate plot on this figure...
fig_twitter.savefig(
'epa_distribution_twitter.png',
dpi=150,
bbox_inches='tight',
facecolor='white'
)
plt.close('all') # Close all figures to free memory
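The repeated `savefig` calls above can be driven from a small table of presets, so that every figure in a project is exported the same way. The following is a minimal sketch: the preset names and values are assumptions to adapt, and `export_jobs` and `save_all` are hypothetical helpers. `save_all` only requires an object with a matplotlib-style `savefig(filename, **kwargs)` method.

```python
# Hypothetical export presets; adjust sizes and DPI to your outlet's requirements
EXPORT_PRESETS = {
    "slides": {"ext": "png", "dpi": 150},
    "print": {"ext": "png", "dpi": 300},
    "paper": {"ext": "pdf", "dpi": None},  # vector output; DPI left at the default
}

def export_jobs(stem, presets=EXPORT_PRESETS):
    """Build (filename, savefig keyword arguments) pairs, one per preset."""
    jobs = []
    for name, spec in presets.items():
        kwargs = {"bbox_inches": "tight", "facecolor": "white"}
        if spec["dpi"] is not None:
            kwargs["dpi"] = spec["dpi"]
        jobs.append((f"{stem}_{name}.{spec['ext']}", kwargs))
    return jobs

def save_all(fig, stem, presets=EXPORT_PRESETS):
    """Call fig.savefig once per preset and return the filenames written.

    fig is any object with a matplotlib-style savefig(filename, **kwargs).
    """
    written = []
    for filename, kwargs in export_jobs(stem, presets):
        fig.savefig(filename, **kwargs)
        written.append(filename)
    return written

for filename, kwargs in export_jobs("epa_distribution"):
    print(filename, kwargs)
```

With a real figure, `save_all(fig, "epa_distribution")` would write one file per preset. Keeping the presets in one place makes it harder for figures in the same report to drift apart in resolution or background color.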
Publication Checklist
Before exporting final visualizations for publication:
Technical:
- [ ] Resolution is 300+ DPI for print, 150+ for web
- [ ] Dimensions match publication requirements
- [ ] All text is legible at final size
- [ ] Colors work in grayscale (if printed in B&W)
Content:
- [ ] All axes are labeled with units
- [ ] Title is informative and complete
- [ ] Legend is clear and positioned well
- [ ] Data source is cited in caption
- [ ] No typos or errors in labels
Style:
- [ ] Fonts are professional and readable
- [ ] Colors are consistent with other figures
- [ ] White space is balanced
- [ ] No unnecessary decorations (chartjunk removed)
Taking 10 minutes to check these items prevents countless revisions later.
Summary
Effective data visualization is a critical skill for football analytics. In this chapter, we've covered the complete landscape of creating professional, informative football visualizations:
Core Principles:
- Clarity: Visualizations should communicate their message within 3-5 seconds
- Accuracy: Visual encoding must truthfully represent data without distortion
- Aesthetics: Professional appearance enhances credibility and impact
Fundamental Chart Types:
- Histograms: Show distributions, reveal shape and spread
- Density Plots: Compare distributions smoothly across groups
- Bar Charts: Compare categories, show rankings
- Line Charts: Display trends over time or ordered sequences
- Scatter Plots: Reveal relationships between two variables
Advanced Techniques:
- Team Branding: Using official colors and logos for recognition and professionalism
- Small Multiples: Comparing patterns across many categories
- Color Theory: Choosing colors that encode data accurately and work for all viewers
- Grammar of Graphics: Thinking in layers to build custom visualizations
Technical Skills:
- Loading and configuring visualization packages in R and Python
- Creating plots with both ggplot2 and matplotlib/seaborn
- Customizing aesthetics for publication quality
- Exporting in appropriate formats and resolutions
Football-Specific Applications:
- Visualizing EPA distributions to understand play type variance
- Creating win probability charts to show game flow
- Using team colors and logos for authentic NFL graphics
- Comparing offensive efficiency across multiple dimensions
The visualization skills you've developed here will serve you throughout your football analytics career. Whether you're presenting to coaches, publishing research, or sharing insights on social media, your ability to create clear, accurate, and beautiful visualizations will set your work apart.
Remember: the goal of visualization is insight, not decoration. Every element should serve a purpose. Every choice should enhance understanding. And every visualization should tell a story that matters.
Exercises
Exercise 1: Team Performance Dashboard
Objective: Create a multi-panel dashboard showing different aspects of a single team's performance.
Task: Choose one team and create a 2x2 grid of visualizations showing:
1. EPA distribution (histogram or density plot)
2. EPA by quarter (bar chart showing how they perform in each quarter)
3. Success rate by down (bar chart)
4. Pass vs. rush EPA comparison (scatter plot with league average reference lines)
Requirements:
- Use team colors consistently across all plots
- Add appropriate labels and titles
- Include a main title for the entire dashboard
- Make it publication-ready (could be handed to a coach)
Hints:
- Use `patchwork` in R or `plt.subplots()` in Python
- Filter data to single team early
- Consider using `nflplotR::scale_fill_nfl()` or custom colors
Exercise 2: Interactive Win Probability
Objective: Create an interactive win probability chart using plotly.
Task:
1. Select an exciting close game from 2023
2. Create an interactive win probability chart where:
   - Hovering shows play description, quarter, time, score
   - Clicking a point highlights that play
   - The chart is zoomable and pannable
3. Export as an HTML file that can be shared
Requirements:
- Use `plotly` (available in both R and Python)
- Include informative hover text
- Make it visually appealing
- Add a title and caption
Hints:
- `ggplotly()` in R converts ggplot to plotly
- `plotly.express` in Python provides quick interactive plots
- In R, map hover content to the `text` aesthetic and call `ggplotly(tooltip = "text")`
Exercise 3: Positional EPA Comparison
Objective: Compare EPA across player positions using advanced visualization.
Task: Create a visualization comparing quarterback EPA by team:
1. Calculate average EPA for each team's quarterbacks
2. Create a visualization that shows:
   - Team rankings
   - Confidence intervals (if you have enough data)
   - Highlight playoff teams vs. non-playoff teams
3. Add team logos or colors
Requirements:
- Filter to only QB pass plays
- Calculate both mean EPA and sample size
- Consider showing uncertainty if appropriate
- Make it professional and polished
Extensions:
- Compare to receiver EPA
- Show EPA by target (which receivers get targeted)
- Break down by quarter or game situation
Further Reading
Books:
- Wilke, C.O. (2019). Fundamentals of Data Visualization. O'Reilly. - Comprehensive guide to visualization principles with many examples.
- Tufte, E.R. (2001). The Visual Display of Quantitative Information. Graphics Press. - Classic text on visualization design principles.
- Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press. - Excellent introduction using R and ggplot2.
Online Resources:
- ggplot2 Documentation: https://ggplot2.tidyverse.org/ - Official documentation with examples
- Python Graph Gallery: https://python-graph-gallery.com/ - Hundreds of examples with code
- ColorBrewer: https://colorbrewer2.org/ - Excellent color palette tool for maps and data visualization
- nflplotR Gallery: https://nflplotr.nflverse.com/ - Examples of NFL-specific visualizations
Tools and Packages:
- Quarto: https://quarto.org/ - Publishing system used to create this textbook
- Observable: https://observablehq.com/ - Interactive JavaScript visualizations
- Tableau Public: Free version of Tableau for creating interactive dashboards