Learning Objectives

By the end of this chapter, you will be able to:
- Access college football data using cfbfastR and the College Football Data API
- Understand key differences between NFL and college football data structures
- Handle multiple divisions (FBS, FCS, Division II, Division III)
- Work with recruiting data from 247Sports and other services
- Integrate transfer portal and coaching change data
- Navigate conference realignment and historical data challenges
- Build comprehensive college football analytics pipelines
- Implement data quality checks specific to college football
Introduction
Imagine you're analyzing a game where Alabama, with a roster valued at over $150 million in recruiting rankings, faces an FCS opponent whose entire roster's recruiting valuation might not crack the top 100 nationally. Or consider tracking 130+ FBS teams across 10+ conferences, each with different schedules, recruiting territories, and competitive levels. Now add the complexity of 1,000+ players entering the transfer portal annually, frequent coaching changes, and conference realignment that reshapes the competitive landscape every few years.
Welcome to college football analytics.
College football analytics presents unique challenges and opportunities compared to NFL analysis. With 130+ FBS teams, multiple divisions, recruiting classes, transfer portals, and constant conference realignment, the college football data landscape is significantly more complex than the professional game. Unlike the NFL's 32 teams playing relatively balanced schedules, college football features massive competitive disparities, regional recruiting battles, and a constantly evolving organizational structure.
Yet this complexity creates rich analytical opportunities. College football's structural differences—larger talent disparities, diverse playing styles, recruiting dynamics, and the importance of player development—make it a fascinating domain for data science applications. Questions like "How much does recruiting quality predict future success?" or "What's the true value of a transfer portal addition?" simply don't exist in professional sports, where all players come through a centralized draft system.
The data infrastructure supporting college football analytics has matured significantly over the past decade. The College Football Data (CFBD) API now provides comprehensive play-by-play data dating back to 2013, recruiting rankings, transfer portal tracking, and team metrics. The cfbfastR package brings this data into R with a familiar interface similar to nflfastR, making the transition from NFL to college analytics relatively smooth for analysts already familiar with the nflverse ecosystem.
However, college football data comes with unique challenges. Coverage varies dramatically by conference and era—Power 5 conferences have excellent data quality since 2013, but Group of 5 coverage can be spotty, and FCS data is limited. Teams change conferences, coaches turn over frequently, and roster composition changes dramatically year-over-year through recruiting and transfers. These factors require analysts to think carefully about data validation, historical comparisons, and the contextual factors that make college football analytics both challenging and rewarding.
This chapter provides a comprehensive guide to college football data infrastructure. We'll explore the cfbfastR package ecosystem, the College Football Data (CFBD) API, recruiting data sources, transfer portal tracking, and techniques for building robust analytics pipelines that handle the unique aspects of college football. Whether you're transitioning from NFL analytics or starting fresh with college football, this chapter will equip you with the tools and knowledge to navigate this complex but fascinating analytical landscape.
Why College Football Analytics is Different
College football analytics differs from NFL analytics in fundamental ways:

- **Team parity**: Massive talent gaps between teams (Alabama vs. FCS opponents)
- **Recruiting**: 15-30 new players join each team annually
- **Transfer portal**: Hundreds of players change schools each year
- **Conference structure**: Multiple divisions with varying competition levels
- **Roster size**: 85 scholarship players vs. 53 in NFL
- **Eligibility**: 4-5 years of eligibility, redshirts, COVID years
- **Coaching changes**: Higher turnover than NFL (20-30 changes annually)
- **Limited data history**: Modern analytics data available primarily from 2013 onward
- **Data quality**: Less consistent than NFL due to multiple sources

The cfbfastR Ecosystem
If you've worked with NFL data using nflfastR, you'll find cfbfastR immediately familiar. Both packages share similar design philosophies, function naming conventions, and data structures. However, cfbfastR extends beyond simple play-by-play data to encompass the unique elements of college football: recruiting rankings, transfer portal movements, coaching changes, and the complex web of conference affiliations.
cfbfastR is the premier R package for college football analytics, providing access to the College Football Data API with a convenient interface similar to nflfastR. Developed and maintained by the SportsDataverse community (the same team behind nflfastR), it's the college football equivalent of the nflverse ecosystem. The package serves as a wrapper around the CFBD API, handling authentication, rate limiting, data formatting, and error handling so you can focus on analysis rather than data engineering.
Package Overview
The breadth of data available through cfbfastR reflects the complexity of college football itself. Unlike the NFL, where you might primarily work with play-by-play and roster data, college football analysis requires integrating multiple data streams to build a complete picture of team performance and program health.
cfbfastR provides access to:
- Play-by-play data (2013-present for major conferences): Every play from every game, including down, distance, yards gained, and advanced metrics like EPA
- Team statistics and advanced metrics: Season and game-level team stats, including efficiency metrics, tempo data, and success rates
- Recruiting class data (247Sports composite): Individual recruit rankings, team class rankings, and historical recruiting trends
- Transfer portal information: Players entering the portal, their destinations, and ratings
- Coaching data and staff changes: Head coach records, hire dates, and career statistics
- Team talent composites: 247Sports talent ratings that aggregate recruiting classes over multiple years
- Betting lines and spreads: Opening and closing lines from major sportsbooks
- Conference and division information: Team affiliations, accounting for conference realignment
- Historical records and rankings: AP Poll, Coaches Poll, and CFP rankings over time
- Venue and attendance data: Stadium information, capacity, and actual attendance figures
Transitioning from nflfastR to cfbfastR
If you're already familiar with nflfastR, cfbfastR will feel like home:

- Similar function naming: `cfbd_pbp_data()` vs `load_pbp()`
- Consistent data structures: Both use tidyverse-friendly data frames
- Shared concepts: EPA, success rate, and win probability exist in both
- API key management: Both use environment variables for authentication

**Key differences**:

- College data requires a CFBD API key (free registration)
- More data sources beyond play-by-play (recruiting, transfers)
- Greater attention to data quality and coverage gaps
- Conference and division tracking not needed in the NFL

Data Availability and Quality Considerations
Before diving into analysis, understand these important limitations:

**Temporal coverage**:

- 2013-present: Good coverage for Power 5 conferences
- 2010-2012: Spotty coverage, significant missing data
- Pre-2010: Very limited play-by-play availability

**Conference coverage**:

- Power 5 (SEC, Big Ten, Big 12, ACC, Pac-12): Excellent
- Group of 5 (AAC, Sun Belt, MAC, C-USA, MWC): Good from 2015+, gaps before
- FCS: Limited play-by-play, basic game results available
- Division II/III: Minimal data

**Game types**:

- Regular season: Best coverage
- Bowl games: Good coverage
- Early season (Week 0-1): Sometimes delayed
- FCS vs FBS: Variable quality

Always validate your data before drawing conclusions!

#| eval: false
#| echo: true
# Install from CRAN
install.packages("cfbfastR")
# Or install development version from GitHub
# install.packages("devtools")
devtools::install_github("sportsdataverse/cfbfastR")
# Install companion packages
install.packages("tidyverse")
install.packages("gt")
install.packages("ggplot2")
install.packages("httr")
install.packages("jsonlite")
#| eval: false
#| echo: true
# Install cfbd package (College Football Data API wrapper)
pip install cfbd
# Install data science packages
pip install pandas numpy matplotlib seaborn
# Alternative: Use requests to call API directly
pip install requests
# For advanced features
pip install scipy scikit-learn
Setting Up API Authentication
The College Football Data API requires a free API key for most endpoints:
#| eval: false
#| echo: true
# Set API key as environment variable
# Option 1: In .Renviron file (recommended)
# Add this line to .Renviron:
# CFBD_API_KEY=your_key_here
# Option 2: Set in session
Sys.setenv(CFBD_API_KEY = "your_key_here")
# Option 3: Pass directly to functions
pbp <- cfbd_pbp_data(
year = 2023,
api_key = "your_key_here"
)
#| eval: false
#| echo: true
import cfbd
import os
# Configure API
configuration = cfbd.Configuration()
# Option 1: From environment variable (recommended)
configuration.api_key['Authorization'] = os.environ.get('CFBD_API_KEY')
configuration.api_key_prefix['Authorization'] = 'Bearer'
# Option 2: Direct assignment (not recommended for production)
configuration.api_key['Authorization'] = 'your_key_here'
configuration.api_key_prefix['Authorization'] = 'Bearer'
# Create API client
api_client = cfbd.ApiClient(configuration)
Getting a CFBD API Key
To obtain a free API key:

1. Visit https://collegefootballdata.com/
2. Click "Get Started" or "API Keys"
3. Create an account (free)
4. Navigate to the API Keys section
5. Generate a new key
6. Store it securely (use environment variables; never commit keys to version control)

**Rate Limits**: The free tier allows 200 requests per hour. For higher-volume needs, consider supporting the project through Patreon.

Loading Play-by-Play Data
Play-by-play data forms the foundation of modern college football analytics. Each row represents a single play, containing information about the game situation (down, distance, field position), the play outcome (yards gained, play type), and advanced metrics (EPA, win probability). Understanding how to load and validate this data is the first step toward meaningful analysis.
Unlike NFL data, which comes from a single centralized source (NFL.com via nflfastR), college football play-by-play data is compiled from multiple sources and aggregated by the College Football Data API. This means you'll encounter more variability in data quality, coverage gaps for certain games or conferences, and occasional inconsistencies that require careful validation.
When loading college football play-by-play data, several important processes occur behind the scenes:
- API Authentication: Your API key is validated against the CFBD servers
- Data Retrieval: The function queries the CFBD database for matching plays
- Rate Limiting: Requests are throttled to comply with API limits (200 requests/hour for free tier)
- Data Parsing: JSON responses are converted to structured data frames
- Metric Calculation: Advanced metrics like EPA and success rate are pre-calculated
- Error Handling: Missing or invalid data is flagged appropriately
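Step 3 (rate limiting) can also be enforced client-side, so a long loop of requests never trips the API's hourly cap. A minimal sketch of a sliding-window throttle; the decorator and the `fetch_week` placeholder are illustrative, not part of the cfbd package:

```python
import time
from collections import deque

def rate_limited(max_calls, per_seconds):
    """Decorator that blocks until a call slot is free.

    Keeps timestamps of recent calls and sleeps when the window
    (e.g., 200 calls per hour for the free CFBD tier) is full.
    """
    calls = deque()

    def decorator(func):
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # Drop timestamps that have aged out of the window
            while calls and now - calls[0] > per_seconds:
                calls.popleft()
            if len(calls) >= max_calls:
                # Wait until the oldest call exits the window
                time.sleep(per_seconds - (now - calls[0]))
            calls.append(time.monotonic())
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(max_calls=200, per_seconds=3600)
def fetch_week(year, week):
    # Placeholder for a real API request (e.g., via the cfbd client)
    return f"plays for {year} week {week}"
```

Wrapping each API-calling helper this way keeps retry and throttling logic in one place instead of scattered across analysis scripts.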
Let's start with the fundamental task of loading college football play-by-play data:
#| label: load-pbp-r
#| message: false
#| warning: false
#| cache: true
# Load required packages for data manipulation and presentation
library(tidyverse) # Data manipulation and visualization
library(cfbfastR) # College football data access
library(gt) # Professional table formatting
# Load play-by-play data for the 2023 season
# This function queries the CFBD API and returns a data frame
# where each row represents a single play from the season
pbp_2023 <- cfbd_pbp_data(
year = 2023, # Season to load
season_type = "regular", # "regular", "postseason", or "both"
week = NULL # NULL loads all weeks; specify 1-15 for single week
)
# Display basic information about the loaded dataset
# This helps verify the data loaded successfully
cat("Loaded", nrow(pbp_2023), "plays from 2023 season\n")
cat("Number of columns:", ncol(pbp_2023), "\n")
# Count unique teams (offensive and defensive sides)
# College football has 130+ FBS teams, so we expect ~130 unique teams
unique_teams <- n_distinct(c(pbp_2023$pos_team, pbp_2023$def_pos_team))
cat("Unique teams:", unique_teams, "\n")
# Show date range to verify we have the full season
cat("Date range:",
min(pbp_2023$game_date, na.rm = TRUE), "to",
max(pbp_2023$game_date, na.rm = TRUE), "\n")
# Display sample plays to understand the data structure
# We select key variables that analysts most commonly use
pbp_2023 %>%
select(
game_id, # Unique identifier for the game
pos_team, # Team with possession (offense)
def_pos_team, # Defensive team
down, # Down (1-4)
distance, # Yards needed for first down
yards_gained, # Result of the play
play_type, # pass, rush, punt, etc.
EPA # Expected Points Added
) %>%
head(10) %>%
gt() %>%
tab_header(
title = "Sample College Football Plays",
subtitle = "2023 Season"
) %>%
fmt_number(
columns = EPA,
decimals = 3
)
#| label: load-pbp-py
#| message: false
#| warning: false
#| cache: true
# Import required libraries for data analysis and API access
import pandas as pd # Data manipulation
import numpy as np # Numerical operations
import cfbd # College Football Data API wrapper
from cfbd.rest import ApiException # Error handling
import os # Environment variable access
# Configure API authentication
# The CFBD API requires an API key for most endpoints
configuration = cfbd.Configuration()
# Retrieve API key from environment variable (recommended approach)
# Never hardcode API keys in scripts or commit them to version control
configuration.api_key['Authorization'] = os.environ.get('CFBD_API_KEY', 'YOUR_KEY')
configuration.api_key_prefix['Authorization'] = 'Bearer'
# Create API instance for play-by-play data access
# This handles the connection to CFBD servers
api_instance = cfbd.PlaysApi(cfbd.ApiClient(configuration))
try:
# Request play-by-play data for 2023 regular season
# This may take 30-60 seconds for a full season
plays = api_instance.get_plays(
year=2023, # Season year
season_type='regular' # Regular season only (excludes bowls)
)
# Convert API response objects to pandas DataFrame
# The to_dict() method extracts data from each play object
pbp_2023 = pd.DataFrame([p.to_dict() for p in plays])
# Display dataset summary with thousands separator for readability
print(f"Loaded {len(pbp_2023):,} plays from 2023 season")
print(f"Number of columns: {len(pbp_2023.columns)}")
# Count unique offensive teams (should be ~130 for FBS)
print(f"Unique teams: {pbp_2023['offense'].nunique()}")
# Display sample plays to verify data structure
# Select most commonly used variables for initial inspection
print("\nSample Plays:")
sample_columns = ['id', 'offense', 'defense', 'down',
'distance', 'yards_gained', 'play_type']
print(pbp_2023[sample_columns].head(10))
except ApiException as e:
# Handle API errors gracefully (rate limits, authentication issues, etc.)
print(f"Exception when calling PlaysApi: {e}")
When you successfully load play-by-play data, you've accessed the core dataset for college football analytics. This data enables you to calculate team efficiency metrics, analyze coaching decisions, evaluate player performance, and build predictive models. However, understanding the structure and meaning of the variables in this dataset is essential before performing any analysis.
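As a preview of what that enables, team-level efficiency metrics reduce to simple grouped aggregations over the play-level rows. A sketch using invented plays; the `offense` and `ppa` column names follow the CFBD Python conventions shown below, but the numbers are made up:

```python
import pandas as pd

# Toy play-by-play rows; in practice these come from the CFBD API
pbp = pd.DataFrame({
    'offense': ['Georgia', 'Georgia', 'Alabama', 'Alabama', 'Georgia'],
    'play_type': ['Pass', 'Rush', 'Pass', 'Rush', 'Pass'],
    'ppa': [0.45, -0.20, 1.10, 0.05, 0.30],  # predicted points added per play
})

# Offensive efficiency: play count and average PPA per team
team_epa = (
    pbp.groupby('offense')['ppa']
       .agg(plays='count', avg_ppa='mean')
       .sort_values('avg_ppa', ascending=False)
)
print(team_epa)
```

The same pattern extends naturally to splits by down, play type, or opponent once the real columns are available.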
Understanding Play-by-Play Variables
College football play-by-play data contains different variables than NFL data:
#| label: pbp-vars-r
#| message: false
#| warning: false
#| cache: true
# Examine key variables
key_variables <- tibble(
Category = c(
"Game Info", "Game Info", "Game Info",
"Situation", "Situation", "Situation", "Situation",
"Play Info", "Play Info", "Play Info",
"Metrics", "Metrics", "Metrics"
),
Variable = c(
"game_id", "week", "home",
"down", "distance", "yards_to_goal", "period",
"play_type", "yards_gained", "play_text",
"EPA", "success", "wpa"
),
Description = c(
"Unique game identifier",
"Week of season (1-15+)",
"Home team abbreviation",
"Down (1-4)",
"Yards to first down",
"Yards from goal line",
"Quarter (1-4+, includes OT)",
"Type of play (pass, rush, etc)",
"Net yards gained",
"Play description text",
"Expected Points Added",
"Binary success indicator",
"Win Probability Added"
)
)
key_variables %>%
gt() %>%
tab_header(
title = "Key College Football Play-by-Play Variables"
) %>%
cols_label(
Category = "Category",
Variable = "Variable",
Description = "Description"
)
#| label: pbp-vars-py
#| message: false
#| warning: false
# Key variables in college football PBP data
key_variables = pd.DataFrame({
'Category': [
'Game Info', 'Game Info', 'Game Info',
'Situation', 'Situation', 'Situation', 'Situation',
'Play Info', 'Play Info', 'Play Info',
'Metrics', 'Metrics', 'Metrics'
],
'Variable': [
'id', 'week', 'home',
'down', 'distance', 'yards_to_goal', 'period',
'play_type', 'yards_gained', 'play_text',
'ppa', 'success', 'wpa'
],
'Description': [
'Unique play identifier',
'Week of season (1-15+)',
'Home team name',
'Down (1-4)',
'Yards to first down',
'Yards from goal line',
'Quarter (1-4+, includes OT)',
'Type of play (pass, rush, etc)',
'Net yards gained',
'Play description text',
'Predicted Points Added',
'Binary success indicator',
'Win Probability Added'
]
})
print("\nKey College Football Play-by-Play Variables:")
print(key_variables.to_string(index=False))
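The `success` flag in the table above is derived from down and distance. A widely used convention (thresholds vary slightly by source) counts a play as successful if it gains roughly 50% of the yards needed on 1st down, 70% on 2nd, and 100% on 3rd or 4th. A sketch of that logic on toy plays:

```python
import pandas as pd

plays = pd.DataFrame({
    'down': [1, 2, 3, 1, 4],
    'distance': [10, 6, 3, 10, 1],
    'yards_gained': [6, 3, 3, 2, 0],
})

def is_success(row):
    """Common success-rate convention; exact thresholds differ by source."""
    if row['down'] == 1:
        return row['yards_gained'] >= 0.5 * row['distance']
    if row['down'] == 2:
        return row['yards_gained'] >= 0.7 * row['distance']
    # 3rd and 4th down: must gain the full distance
    return row['yards_gained'] >= row['distance']

plays['success'] = plays.apply(is_success, axis=1)
print(f"Success rate: {plays['success'].mean():.1%}")  # 2 of 5 plays succeed
```

Recomputing the flag yourself is a useful validation step when comparing data across seasons, since pre-calculated fields can reflect different threshold choices.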
Play-by-Play Data Differences from NFL
If you're transitioning from NFL analytics to college football, you might initially assume the data structures and analytical approaches are identical. After all, both sports feature the same basic concept: teams advance the ball down the field through a series of plays, with downs resetting at first down markers or turnovers.
However, significant rule differences between the NFL and college football create important analytical distinctions. These aren't just trivia—they fundamentally affect how plays unfold, how games are managed, and how analysts should interpret the data. For example, college football's wider hash marks (40 feet apart vs. 18.5 feet in the NFL) create different field geometries that affect optimal play calling. The absence of a two-minute warning changes late-game strategy dramatically. Even overtime rules differ completely, creating unique analytical situations that don't exist in professional football.
Understanding these differences is crucial for accurate analysis. If you apply NFL-based models or assumptions directly to college data without accounting for rule differences, you'll likely draw incorrect conclusions. For instance, a fourth-down decision model trained on NFL data won't translate perfectly to college football because college teams face different strategic considerations with wider hash marks and different clock management rules.
College football play-by-play data has several important differences from NFL data that affect analysis. Let's examine the key structural and rule differences you need to account for in your analytical work.
Structural Differences
#| label: cfb-nfl-diff-r
#| message: false
#| warning: false
#| cache: true
# Create comparison table
differences <- tribble(
~Aspect, ~NFL, ~College,
"Overtime", "Modified sudden death", "Alternating possessions from 25",
"Clock rules", "2-minute warning", "No 2-minute warning",
"Hash marks", "18.5 feet apart", "40 feet apart (wider)",
"Pass interference", "Spot foul", "15-yard penalty maximum",
"Targeting", "15 yards", "15 yards + ejection",
"Down after penalty", "Can award untimed down", "Untimed down on def penalty",
"Overtime start", "N/A (new rules)", "Starts at 25-yard line",
"Player names", "Required on jerseys", "Optional in many conferences",
"Clock stops", "Only on incompletions late", "Stops on 1st down (varies)",
"Replay review", "Booth review limited", "Booth can review scoring plays"
)
differences %>%
gt() %>%
tab_header(
title = "Key Rule Differences: NFL vs College Football"
) %>%
cols_label(
Aspect = "Rule Aspect",
NFL = "NFL Rule",
College = "College Rule"
) %>%
tab_options(
table.font.size = px(12)
)
#| label: cfb-nfl-diff-py
#| message: false
#| warning: false
# Create comparison table
differences = pd.DataFrame({
'Aspect': [
'Overtime', 'Clock rules', 'Hash marks',
'Pass interference', 'Targeting', 'Down after penalty',
'Overtime start', 'Player names', 'Clock stops', 'Replay review'
],
'NFL': [
'Modified sudden death', '2-minute warning', '18.5 feet apart',
'Spot foul', '15 yards', 'Can award untimed down',
'N/A (new rules)', 'Required on jerseys', 'Only on incompletions late',
'Booth review limited'
],
'College': [
'Alternating possessions from 25', 'No 2-minute warning', '40 feet apart (wider)',
'15-yard penalty maximum', '15 yards + ejection', 'Untimed down on def penalty',
'Starts at 25-yard line', 'Optional in many conferences',
'Stops on 1st down (varies)', 'Booth can review scoring plays'
]
})
print("\nKey Rule Differences: NFL vs College Football")
print(differences.to_string(index=False))
Data Quality Considerations
College football data quality varies by conference and era:
Play-by-Play Coverage:
- 2013-Present: Good coverage for Power 5 conferences
- 2010-2012: Spotty coverage, missing data common
- Pre-2010: Very limited play-by-play data
Conference Differences:
- Power 5 conferences: Most complete data
- Group of 5: Good data for recent years, gaps in older data
- FCS: Limited play-by-play coverage
- Division II/III: Minimal play-by-play data
Data Availability Varies by Conference
When analyzing college football data, always check coverage:

# Check data availability by conference
pbp %>%
group_by(season, conference) %>%
summarise(
games = n_distinct(game_id),
plays = n(),
.groups = "drop"
) %>%
arrange(season, desc(plays))
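The same coverage check translates directly to pandas. The rows below are synthetic stand-ins; a real `pbp` DataFrame from the API would supply the `season`, `conference`, and `game_id` columns (names here follow the conventions used above but are assumptions):

```python
import pandas as pd

# Stand-in for API-loaded play-by-play data
pbp = pd.DataFrame({
    'season': [2023] * 4 + [2022] * 2,
    'conference': ['SEC', 'SEC', 'MAC', 'MAC', 'SEC', 'MAC'],
    'game_id': [1, 1, 2, 2, 3, 4],
})

# Games and plays per season/conference, mirroring the R snippet
coverage = (
    pbp.groupby(['season', 'conference'])
       .agg(games=('game_id', 'nunique'), plays=('game_id', 'size'))
       .reset_index()
       .sort_values(['season', 'plays'], ascending=[True, False])
)
print(coverage)
```

A conference whose plays-per-game count drops sharply in some season is a signal of coverage gaps rather than a real change in tempo.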
Team and Conference Data
Understanding team affiliations and conference structure is essential for college football analysis:
#| label: team-info-r
#| message: false
#| warning: false
#| cache: true
# Get FBS teams
fbs_teams <- cfbd_team_info(
year = 2023,
only_fbs = TRUE
)
# Display team information
fbs_teams %>%
select(school, conference, division, color, mascot, abbreviation) %>%
arrange(conference, school) %>%
head(20) %>%
gt() %>%
tab_header(
title = "FBS Team Information",
subtitle = "2023 Season - Sample"
) %>%
cols_label(
school = "School",
conference = "Conference",
division = "Division",
color = "Primary Color",
mascot = "Mascot",
abbreviation = "Abbr"
)
# Count teams by conference
conference_counts <- fbs_teams %>%
count(conference, sort = TRUE, name = "teams")
cat("\nTeams by Conference (2023):\n")
print(conference_counts)
#| label: team-info-py
#| message: false
#| warning: false
#| cache: true
# Get team information
teams_api = cfbd.TeamsApi(cfbd.ApiClient(configuration))
try:
fbs_teams = teams_api.get_fbs_teams(year=2023)
# Convert to DataFrame
teams_df = pd.DataFrame([{
'school': t.school,
'conference': t.conference,
'division': t.division,
'color': t.color,
'mascot': t.mascot,
'abbreviation': t.abbreviation
} for t in fbs_teams])
print("FBS Teams (2023) - Sample:")
print(teams_df.sort_values(['conference', 'school']).head(20).to_string(index=False))
print("\n\nTeams by Conference (2023):")
print(teams_df['conference'].value_counts())
except ApiException as e:
print(f"Exception: {e}")
Conference Realignment
Conference realignment is a major challenge in college football analytics. Teams frequently change conferences, affecting:
- Strength of schedule calculations
- Historical comparisons
- Recruiting territories
- Travel patterns
#| label: realignment-r
#| message: false
#| warning: false
#| cache: true
# Track conference changes over time
# Example: Texas and Oklahoma to SEC
conference_history <- map_df(2020:2024, function(yr) {
cfbd_team_info(year = yr) %>%
filter(school %in% c("Texas", "Oklahoma", "UCLA", "USC")) %>%
select(school, conference) %>%
mutate(year = yr)
})
conference_history %>%
pivot_wider(
names_from = year,
values_from = conference
) %>%
gt() %>%
tab_header(
title = "Conference Realignment Examples",
subtitle = "Selected schools, 2020-2024"
)
#| label: realignment-py
#| message: false
#| warning: false
#| cache: true
# Track conference changes
schools_to_track = ['Texas', 'Oklahoma', 'UCLA', 'USC']
conference_history = []
for year in range(2020, 2025):
try:
teams = teams_api.get_teams(year=year)
for t in teams:
if t.school in schools_to_track:
conference_history.append({
'school': t.school,
'year': year,
'conference': t.conference
})
except ApiException:
pass  # skip years where the request fails (rate limits, missing data)
conf_df = pd.DataFrame(conference_history)
conf_pivot = conf_df.pivot(index='school', columns='year', values='conference')
print("\nConference Realignment Examples (2020-2024):")
print(conf_pivot)
FBS vs FCS and Division Structure
One of the most distinctive features of college football—and a major analytical challenge—is the multi-tiered division structure. Unlike the NFL, where all 32 teams theoretically compete on a level playing field (with parity enforced through the draft and salary cap), college football features dramatic competitive stratification across divisions and even within divisions.
The NCAA divides college football into multiple competitive tiers:
- FBS (Football Bowl Subdivision): The top tier with ~130 teams, including Power 5 and Group of 5 conferences
- FCS (Football Championship Subdivision): A lower tier with ~125 teams, previously called Division I-AA
- Division II: ~170 teams with partial scholarship limits
- Division III: ~250 teams with no athletic scholarships
These divisions don't just represent different competitive levels—they represent fundamentally different financial ecosystems, recruiting approaches, and analytical challenges. An FBS team like Alabama might have an athletic department budget exceeding $200 million, while an FCS school might operate on $5-10 million. This translates to vastly different talent levels, which creates unique analytical considerations when teams from different divisions play each other.
Why this matters for analytics: When FBS teams schedule FCS opponents (commonly called "buy games" because the FBS school pays the FCS school a guarantee), the talent gap is so large that traditional efficiency metrics become less meaningful. A team dominating an FCS opponent doesn't tell you much about how they'll perform against SEC competition. Analysts must account for opponent strength when evaluating team performance—a challenge that doesn't exist in the NFL's more balanced structure.
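One standard way to account for opponent strength is a simple-rating-system (SRS) style adjustment: iterate between each team's average scoring margin and its opponents' ratings until the ratings stabilize. A minimal sketch on invented results (team names and margins are made up; this is the general technique, not a CFBD feature):

```python
import pandas as pd

# Each row: one game from the perspective of `team`
games = pd.DataFrame({
    'team':     ['A', 'A', 'B', 'B', 'C', 'C'],
    'opponent': ['B', 'C', 'A', 'C', 'A', 'B'],
    'margin':   [7, 21, -7, 10, -21, -10],
})

ratings = {t: 0.0 for t in games['team'].unique()}
for _ in range(100):  # iterate until (approximately) converged
    new = {}
    for team, grp in games.groupby('team'):
        # Rating = average margin + average opponent rating
        opp_avg = grp['opponent'].map(ratings).mean()
        new[team] = grp['margin'].mean() + opp_avg
    # Center ratings at zero so they don't drift
    shift = sum(new.values()) / len(new)
    ratings = {t: r - shift for t, r in new.items()}

print({t: round(r, 1) for t, r in sorted(ratings.items())})
```

The effect is exactly the correction described above: a big margin against a weak opponent is discounted, while the same margin against a strong opponent raises a team's rating.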
Division Hierarchy
Let's examine the division structure and the competitive gaps between divisions using actual data:
#| label: divisions-r
#| message: false
#| warning: false
#| cache: true
# Get teams from all divisions
all_teams <- cfbd_team_info(year = 2023)
# Analyze division distribution
division_summary <- all_teams %>%
mutate(
division_level = case_when(
classification == "fbs" ~ "FBS (Division I-A)",
classification == "fcs" ~ "FCS (Division I-AA)",
TRUE ~ "Other"
)
) %>%
count(division_level, name = "teams") %>%
arrange(desc(teams))
division_summary %>%
gt() %>%
tab_header(
title = "College Football Division Structure",
subtitle = "2023 Season"
) %>%
cols_label(
division_level = "Division",
teams = "Number of Teams"
)
# Analyze FBS vs FCS matchups
games_2023 <- cfbd_game_info(year = 2023)
# Identify cross-division games
fbs_schools <- all_teams %>%
filter(classification == "fbs") %>%
pull(school)
cross_division_games <- games_2023 %>%
filter(
(home_team %in% fbs_schools & !away_team %in% fbs_schools) |
(!home_team %in% fbs_schools & away_team %in% fbs_schools)
)
cat("\nCross-division FBS vs FCS games in 2023:", nrow(cross_division_games), "\n")
cat("Average point differential:",
mean(abs(cross_division_games$home_points - cross_division_games$away_points), na.rm = TRUE),
"points\n")
#| label: divisions-py
#| message: false
#| warning: false
#| cache: true
try:
# Get all teams
all_teams = teams_api.get_teams()
# Convert to DataFrame
all_teams_df = pd.DataFrame([{
'school': t.school,
'classification': t.classification if hasattr(t, 'classification') else 'unknown',
'conference': t.conference
} for t in all_teams])
# Division distribution
print("Division Distribution (2023):")
print(all_teams_df['classification'].value_counts())
# FBS schools list
fbs_schools = all_teams_df[
all_teams_df['classification'] == 'fbs'
]['school'].tolist()
print(f"\nNumber of FBS schools: {len(fbs_schools)}")
print(f"Number of FCS schools: {len(all_teams_df[all_teams_df['classification'] == 'fcs'])}")
except ApiException as e:
print(f"Exception: {e}")
Competitive Balance Analysis
College football's competitive imbalance requires special analytical considerations:
#| label: competitive-balance-r
#| message: false
#| warning: false
#| cache: true
# Calculate team talent composites
talent <- cfbd_team_talent(year = 2023)
# Analyze talent distribution
talent_summary <- talent %>%
summarise(
teams = n(),
mean_talent = mean(talent, na.rm = TRUE),
median_talent = median(talent, na.rm = TRUE),
sd_talent = sd(talent, na.rm = TRUE),
min_talent = min(talent, na.rm = TRUE),
max_talent = max(talent, na.rm = TRUE),
range = max_talent - min_talent,
top10_avg = mean(talent[order(-talent)][1:10], na.rm = TRUE),
bottom10_avg = mean(talent[order(talent)][1:10], na.rm = TRUE),
gap = top10_avg - bottom10_avg
)
talent_summary %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
gt() %>%
tab_header(
title = "FBS Talent Distribution",
subtitle = "2023 Season - 247Sports Composite"
) %>%
fmt_number(decimals = 2) %>%
cols_label(
metric = "Metric",
value = "Value"
)
# Identify talent tiers
talent_tiers <- talent %>%
mutate(
tier = case_when(
talent >= quantile(talent, 0.90, na.rm = TRUE) ~ "Elite (Top 10%)",
talent >= quantile(talent, 0.75, na.rm = TRUE) ~ "Very Good (75-90%)",
talent >= quantile(talent, 0.50, na.rm = TRUE) ~ "Above Average (50-75%)",
talent >= quantile(talent, 0.25, na.rm = TRUE) ~ "Below Average (25-50%)",
TRUE ~ "Low (Bottom 25%)"
)
) %>%
count(tier, name = "teams")
talent_tiers %>%
gt() %>%
tab_header(
title = "FBS Talent Tiers",
subtitle = "Distribution of teams by talent level"
)
#| label: competitive-balance-py
#| message: false
#| warning: false
#| cache: true
import matplotlib.pyplot as plt
# Get talent data
try:
talent_data = teams_api.get_talent(year=2023)
# Convert to DataFrame
talent_df = pd.DataFrame([{
'school': t.school,
'talent': t.talent
} for t in talent_data])
# Summary statistics
print("FBS Talent Distribution Summary:")
print(f"Teams: {len(talent_df)}")
print(f"Mean: {talent_df['talent'].mean():.2f}")
print(f"Median: {talent_df['talent'].median():.2f}")
print(f"Std Dev: {talent_df['talent'].std():.2f}")
print(f"Min: {talent_df['talent'].min():.2f}")
print(f"Max: {talent_df['talent'].max():.2f}")
print(f"Range: {talent_df['talent'].max() - talent_df['talent'].min():.2f}")
# Top 10 vs Bottom 10
top10_avg = talent_df.nlargest(10, 'talent')['talent'].mean()
bottom10_avg = talent_df.nsmallest(10, 'talent')['talent'].mean()
print(f"\nTop 10 Average: {top10_avg:.2f}")
print(f"Bottom 10 Average: {bottom10_avg:.2f}")
print(f"Gap: {top10_avg - bottom10_avg:.2f}")
# Create talent tiers
talent_df['tier'] = pd.cut(
talent_df['talent'],
bins=[0, talent_df['talent'].quantile(0.25),
talent_df['talent'].quantile(0.50),
talent_df['talent'].quantile(0.75),
talent_df['talent'].quantile(0.90),
talent_df['talent'].max()],
labels=['Low (Bottom 25%)', 'Below Average (25-50%)',
'Above Average (50-75%)', 'Very Good (75-90%)',
'Elite (Top 10%)']
)
print("\n\nFBS Talent Tiers:")
print(talent_df['tier'].value_counts().sort_index())
except ApiException as e:
print(f"Exception: {e}")
Adjusting for Competition Level
When comparing teams across divisions or talent tiers:
1. **Use opponent-adjusted metrics**: Account for strength of schedule
2. **Separate divisions in analysis**: Don't directly compare FBS and FCS stats
3. **Consider recruiting rankings**: Talent disparities explain much variance
4. **Weight by competition**: Games vs stronger opponents should carry more weight
5. **Use relative metrics**: Compare teams within their conference/division first
6. **Context matters**: A 30-point win over an FCS team ≠ a 30-point win over a ranked FBS team
Recruiting Data Sources
In the NFL, all teams acquire talent through the same mechanism: a centralized draft that distributes players in reverse order of finish, promoting competitive balance. The worst team gets the first pick. Free agency operates under a salary cap that prevents wealthy teams from hoarding talent. This creates parity.
College football works completely differently. There is no draft. There is no salary cap (though NIL rules have created a quasi-market). Instead, teams recruit high school players in an open competition where blue-blood programs with national brands, elite facilities, and winning traditions have enormous advantages. A five-star recruit from Texas might choose between Alabama, Ohio State, Texas, and USC—but probably won't even consider a Group of 5 school.
This makes recruiting data absolutely fundamental to college football analytics. While NFL teams acquire talent through a mechanism designed to promote parity, college football's recruiting system amplifies existing advantages. The teams that recruit well tend to stay good, while teams that recruit poorly struggle to break through. This creates persistent competitive hierarchies that make college football strategically different from the NFL.
Recruiting data allows analysts to:
- Predict future performance: Teams with top recruiting classes tend to win more games 2-3 years later
- Identify development: Teams that win more than their recruiting suggests are "developing" players effectively
- Assess coaching: Coaches are often evaluated partly on recruiting class rankings
- Project competitive landscape: Conference realignment and NIL changes affect recruiting territories
- Quantify talent gaps: Understanding roster talent helps adjust performance metrics appropriately
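The first of these claims, that recruiting predicts wins a few seasons out, is usually checked with a lagged correlation: class rank in year t against wins in year t+2. A minimal sketch on invented numbers (a real analysis would pull both series from the CFBD API):

```python
import pandas as pd

# Invented data: recruiting class rank (lower is better) and wins
# two seasons later. These numbers are illustrative only.
df = pd.DataFrame({
    "recruiting_rank": [1, 5, 12, 30, 45, 60, 80, 100],
    "wins_two_years_later": [13, 11, 10, 8, 7, 6, 5, 3],
})

# A negative correlation is expected: better (lower) rank, more wins.
r = df["recruiting_rank"].corr(df["wins_two_years_later"])
print(f"Pearson r: {r:.2f}")
```

Teams that sit well above the fitted relationship are candidates for the "developing players effectively" label from the second bullet.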
Multiple services track recruiting rankings and commitments, with varying methodologies and coverage.
Primary Recruiting Services
247Sports Composite
- Industry-standard rankings combining multiple sources
- Star ratings (2-5 stars)
- Numerical ratings (0.0000-1.0000)
- Position rankings
- Available through CFBD API
Other Services
- Rivals: Long-standing recruiting service
- ESPN: National recruiting rankings
- On3: Newer service with NIL valuations
Accessing Recruiting Data
#| label: recruiting-r
#| message: false
#| warning: false
#| cache: true
# Get recruiting rankings
recruiting_2024 <- cfbd_recruiting_player(
year = 2024,
classification = "HighSchool"
)
# Top recruits
top_recruits <- recruiting_2024 %>%
arrange(desc(rating)) %>%
select(name, position, school = committed_to,
rating, ranking, stars, hometown = city, state = state_province) %>%
head(25)
top_recruits %>%
gt() %>%
tab_header(
title = "Top 2024 Recruits",
subtitle = "247Sports Composite Rankings"
) %>%
cols_label(
name = "Player",
position = "Pos",
school = "Committed To",
rating = "Rating",
ranking = "National Rank",
stars = "Stars",
hometown = "City",
state = "State"
) %>%
fmt_number(
columns = rating,
decimals = 4
) %>%
data_color(
columns = stars,
colors = scales::col_numeric(
palette = c("lightgray", "gold"),
domain = c(3, 5)
)
)
# Team recruiting class rankings
team_recruiting <- cfbd_recruiting_team(year = 2024)
team_recruiting %>%
arrange(rank) %>%
select(team, rank, points, avg_rating = average_rating) %>%
head(20) %>%
gt() %>%
tab_header(
title = "2024 Recruiting Class Rankings"
) %>%
fmt_number(
columns = c(points, avg_rating),
decimals = 2
) %>%
cols_label(
team = "Team",
rank = "Rank",
points = "Points",
avg_rating = "Avg Rating"
)
#| label: recruiting-py
#| message: false
#| warning: false
#| cache: true
# Get recruiting data
recruiting_api = cfbd.RecruitingApi(cfbd.ApiClient(configuration))
try:
# Player rankings
players = recruiting_api.get_recruiting_players(
year=2024,
classification='HighSchool'
)
# Convert to DataFrame
recruits_df = pd.DataFrame([{
'name': p.name,
'position': p.position,
'school': p.committed_to,
'rating': p.rating,
'ranking': p.ranking,
'stars': p.stars,
'city': p.city if hasattr(p, 'city') else '',
'state': p.state_province if hasattr(p, 'state_province') else ''
} for p in players])
# Top recruits
top_recruits = recruits_df.sort_values('rating', ascending=False).head(25)
print("Top 2024 Recruits (247Sports Composite):")
print(top_recruits[['name', 'position', 'school', 'rating',
'ranking', 'stars']].to_string(index=False))
# Team rankings
teams = recruiting_api.get_recruiting_teams(year=2024)
teams_df = pd.DataFrame([{
'team': t.team,
'rank': t.rank,
'points': t.points if hasattr(t, 'points') else 0
} for t in teams]).sort_values('rank').head(20)
print("\n\n2024 Recruiting Class Rankings:")
print(teams_df.to_string(index=False))
except ApiException as e:
print(f"Exception: {e}")
Analyzing Recruiting Success Over Time
#| label: recruiting-trends-r
#| message: false
#| warning: false
#| cache: true
# Get multiple years of recruiting data
recruiting_multi <- map_df(2020:2024, function(yr) {
cfbd_recruiting_team(year = yr) %>%
mutate(year = yr)
})
# Top programs by average recruiting rank
top_recruiters <- recruiting_multi %>%
group_by(team) %>%
summarise(
classes = n(),
avg_rank = mean(rank, na.rm = TRUE),
avg_points = mean(points, na.rm = TRUE),
top_10_classes = sum(rank <= 10, na.rm = TRUE),
top_25_classes = sum(rank <= 25, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(avg_rank) %>%
head(20)
top_recruiters %>%
gt() %>%
tab_header(
title = "Best Recruiting Programs (2020-2024)",
subtitle = "Based on average class ranking over 5 years"
) %>%
cols_label(
team = "Team",
classes = "Classes",
avg_rank = "Avg Rank",
avg_points = "Avg Points",
top_10_classes = "Top 10",
top_25_classes = "Top 25"
) %>%
fmt_number(
columns = c(avg_rank, avg_points),
decimals = 1
)
#| label: recruiting-trends-py
#| message: false
#| warning: false
#| cache: true
# Get multiple years
all_recruiting = []
for year in range(2020, 2025):
try:
teams = recruiting_api.get_recruiting_teams(year=year)
for t in teams:
all_recruiting.append({
'year': year,
'team': t.team,
'rank': t.rank,
'points': t.points if hasattr(t, 'points') else 0
})
    except ApiException:
        pass  # skip years the endpoint cannot serve
recruiting_df = pd.DataFrame(all_recruiting)
# Top programs
top_programs = (recruiting_df
.groupby('team')
.agg(
classes=('year', 'count'),
avg_rank=('rank', 'mean'),
avg_points=('points', 'mean'),
top_10=('rank', lambda x: (x <= 10).sum()),
top_25=('rank', lambda x: (x <= 25).sum())
)
.sort_values('avg_rank')
.head(20)
)
print("Best Recruiting Programs (2020-2024):")
print(top_programs.to_string())
The "Blue Chip Ratio"
A key recruiting concept is the "Blue Chip Ratio"—the percentage of 4 and 5-star recruits on a roster. This metric, popularized by analyst Bud Elliott, has proven to be one of the most predictive indicators of championship-level success in college football.
The insight is simple but powerful: since 2011, every national champion has had a Blue Chip Ratio above 50%. That is, more than half of their scholarship players were rated as 4- or 5-star recruits coming out of high school. This threshold has held remarkably consistent across multiple championship teams and different eras.
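Concretely, with the FBS scholarship limit of 85 players, "more than half" translates into a hard count:

```python
# Smallest number of former 4- or 5-star recruits that puts an
# 85-man scholarship roster strictly above the 50% threshold.
scholarship_limit = 85
min_blue_chips = scholarship_limit // 2 + 1
print(min_blue_chips)  # 43
```

So a championship-profile roster needs at least 43 blue-chip players, which is more than many programs sign across four full recruiting cycles.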
The Blue Chip Ratio: A Championship Threshold
**Definition**: The percentage of a team's roster composed of players who were rated as 4-star or 5-star recruits.
**The 50% Rule**: Every national champion since 2011 has had a Blue Chip Ratio exceeding 50%.
**Why it matters**:
- Identifies championship-caliber rosters before the season starts
- Separates "contenders" from "pretenders" based on talent accumulation
- Highlights teams that must develop players exceptionally well to compete
- Demonstrates the importance of multi-year recruiting success
**Historical evidence** (recent champions):
- 2022 Georgia: 65% Blue Chip Ratio
- 2021 Georgia: 58% Blue Chip Ratio
- 2020 Alabama: 71% Blue Chip Ratio
- 2019 LSU: 58% Blue Chip Ratio
- 2018 Clemson: 54% Blue Chip Ratio
**Current threshold**: Only 10-15 teams nationally exceed the 50% threshold in any given year, creating a clear separation between championship contenders and the rest of FBS.
**Limitations**: The Blue Chip Ratio is **necessary but not sufficient** for championship success. Having elite talent is required, but coaching, player development, scheme, and injury luck still matter enormously.
Let's calculate the Blue Chip Ratio for selected teams:
#| label: blue-chip-r
#| eval: false
#| echo: true
# Calculate Blue Chip Ratio
# Teams with >50% blue chips have won every national title since 2011
calculate_blue_chip_ratio <- function(team, years) {
# Get recruiting data for past 4-5 years (typical roster composition)
recruiting_data <- map_df(years, ~cfbd_recruiting_player(
year = .x,
team = team,
classification = "HighSchool"
))
blue_chips <- recruiting_data %>%
filter(stars >= 4) %>%
nrow()
total_recruits <- nrow(recruiting_data)
ratio <- if (total_recruits > 0) blue_chips / total_recruits else NA_real_
tibble(
team = team,
blue_chip_ratio = ratio,
blue_chips = blue_chips,
total_recruits = total_recruits
)
}
# Example for top teams
top_teams <- c("Georgia", "Alabama", "Ohio State", "Michigan")
blue_chip_analysis <- map_df(
top_teams,
~calculate_blue_chip_ratio(.x, 2020:2023)
)
blue_chip_analysis %>%
arrange(desc(blue_chip_ratio)) %>%
gt() %>%
fmt_percent(columns = blue_chip_ratio, decimals = 1)
#| label: blue-chip-py
#| eval: false
#| echo: true
def calculate_blue_chip_ratio(team, years):
"""
Calculate blue chip ratio for a team
Blue chips = 4 and 5 star recruits
"""
blue_chips = 0
total = 0
for year in years:
try:
players = recruiting_api.get_recruiting_players(
year=year,
team=team,
classification='HighSchool'
)
for p in players:
total += 1
if p.stars >= 4:
blue_chips += 1
        except ApiException:
            pass  # skip years with no data for this team
ratio = blue_chips / total if total > 0 else 0
return {
'team': team,
'blue_chip_ratio': ratio,
'blue_chips': blue_chips,
'total_recruits': total
}
# Example
top_teams = ['Georgia', 'Alabama', 'Ohio State', 'Michigan']
blue_chip_analysis = [
calculate_blue_chip_ratio(team, range(2020, 2024))
for team in top_teams
]
bcr_df = pd.DataFrame(blue_chip_analysis)
print("\nBlue Chip Ratio Analysis:")
print(bcr_df.sort_values('blue_chip_ratio', ascending=False))
Transfer Portal Data
The transfer portal has transformed college football since its introduction in 2018, fundamentally changing how teams build and maintain rosters. Before the portal, transferring between schools required permission from your current school and often involved sitting out a year. Many coaches would "block" transfers to rival schools or conference opponents. The system favored coaches and institutions over players.
The transfer portal changed everything. Now, any player can enter the portal at any time (with peak activity during designated windows), and once in the portal, they can communicate with any school without restrictions. The one-time transfer exception (implemented in 2021) allows players to transfer once without sitting out a year. Combined with the COVID eligibility waiver and NIL (Name, Image, Likeness) rules, these changes created a free agency system in college football that didn't exist before.
The magnitude of this change is staggering: Over 1,000 FBS players enter the transfer portal annually. Some programs lose 20-30 players while gaining 15-20 transfers. Rosters turn over much faster than in the past. Coaches must now recruit their own roster constantly, not just high school players. A team might lose its starting quarterback to a better program, then replace him with a transfer from a lower-tier school—all in the same offseason.
For analysts, the transfer portal creates both challenges and opportunities:
- Roster prediction becomes harder: You can't assume players will stay for their full eligibility
- New talent evaluation needs: Transfer talent is different from high school recruiting
- Program health indicators: Transfer balance (in vs. out) reveals program momentum
- NIL complications: Wealthy programs can use NIL to attract elite transfers
- Coaching change impacts: New coaches trigger massive transfer activity
Tracking transfers is now essential for modern college football analytics. Let's explore how to access and analyze this data.
Understanding the Transfer Portal
The transfer portal operates on specific rules and timelines that analysts need to understand:
Key Facts:
- Introduced: October 2018 as a centralized database
- Entry process: Players can enter portal without losing eligibility
- Transfer windows: Primary windows in December and April/May (30-45 days)
- Immediate eligibility: One-time transfer exception (2021) allows immediate play
- Graduate transfers: Players with degrees can transfer and play immediately (pre-dates portal)
- Multiple transfers: Second+ transfers typically require sitting out unless granted waiver
- Portal != committed: Entering portal doesn't mean player has a destination yet
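Because entries cluster in the designated windows, analyses often need to classify portal activity by date. A minimal sketch follows; the window dates below are placeholders, since the actual windows are set by the NCAA and change from year to year:

```python
from datetime import date

# Illustrative portal windows (placeholder dates, not official ones).
WINDOWS = [
    (date(2023, 12, 4), date(2024, 1, 2)),   # winter window
    (date(2024, 4, 16), date(2024, 4, 30)),  # spring window
]

def in_portal_window(d, windows=WINDOWS):
    """True if date d falls inside any designated transfer window."""
    return any(start <= d <= end for start, end in windows)

print(in_portal_window(date(2023, 12, 15)))  # mid-December: inside
print(in_portal_window(date(2024, 2, 1)))    # February: outside
```

Graduate transfers and waiver cases can enter outside these windows, so out-of-window entries are not necessarily data errors.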
Transfer Portal Terminology
Understanding the portal requires knowing key terms:
- **Entered portal**: Player has officially declared intent to transfer via the portal database
- **Committed**: Player has chosen a destination school (can happen before or after entering portal)
- **Processing**: When schools encourage players to enter the portal to free up scholarships
- **Portal window**: Designated periods when players can enter without penalty
- **Transfer tampering**: Illegal contact with players before they enter the portal (common but hard to prove)
- **Portal recruitment**: Schools actively recruiting players in the portal
- **Net transfer balance**: Incoming transfers minus outgoing transfers
Transfer Portal Data Quality Issues
Transfer portal data has several quality challenges:
1. **Timing lags**: Players may enter the portal before a destination is known
2. **Uncommitted players**: Many enter the portal without finding new schools
3. **Down-transfers**: FBS players transferring to FCS or lower aren't always tracked
4. **Rating inconsistencies**: Some services rate transfers, others don't
5. **Position changes**: Players may switch positions when transferring
6. **Academic issues**: Some portal entries are due to academic ineligibility
7. **Recruitment overlap**: Hard to separate high school recruiting from portal recruiting in team-building analysis
Always verify portal data with multiple sources and be cautious about drawing conclusions from incomplete data.
Accessing Transfer Data
#| label: transfers-r
#| message: false
#| warning: false
#| cache: true
# Get transfer portal data
transfers_2024 <- cfbd_recruiting_transfer_portal(
year = 2024
)
# Summary of transfer activity
transfer_summary <- transfers_2024 %>%
summarise(
total_transfers = n(),
avg_rating = mean(rating, na.rm = TRUE),
five_stars = sum(stars == 5, na.rm = TRUE),
four_stars = sum(stars == 4, na.rm = TRUE),
three_stars = sum(stars == 3, na.rm = TRUE),
with_destination = sum(!is.na(destination)),
still_uncommitted = sum(is.na(destination))
)
transfer_summary %>%
gt() %>%
tab_header(
title = "2024 Transfer Portal Summary"
) %>%
fmt_number(
columns = avg_rating,
decimals = 4
) %>%
cols_label(
total_transfers = "Total",
avg_rating = "Avg Rating",
five_stars = "5 Stars",
four_stars = "4 Stars",
three_stars = "3 Stars",
with_destination = "Committed",
still_uncommitted = "Uncommitted"
)
# Teams gaining most transfers
transfer_destinations <- transfers_2024 %>%
filter(!is.na(destination)) %>%
count(destination, sort = TRUE, name = "transfers_in") %>%
head(20)
transfer_destinations %>%
gt() %>%
tab_header(
title = "Top Transfer Destinations",
subtitle = "2024 Transfer Portal"
) %>%
cols_label(
destination = "School",
transfers_in = "Transfers In"
)
# Teams losing most players
transfer_origins <- transfers_2024 %>%
count(origin, sort = TRUE, name = "transfers_out") %>%
head(20)
# Net transfer balance
transfer_balance <- transfers_2024 %>%
filter(!is.na(destination)) %>%
count(destination, name = "in") %>%
full_join(
transfers_2024 %>% count(origin, name = "out"),
by = c("destination" = "origin")
) %>%
mutate(
`in` = replace_na(`in`, 0),
out = replace_na(out, 0),
net = `in` - out
) %>%
arrange(desc(net)) %>%
rename(team = destination)
transfer_balance %>%
head(15) %>%
gt() %>%
tab_header(
title = "Net Transfer Balance",
subtitle = "Incoming minus outgoing transfers (2024)"
) %>%
cols_label(
team = "Team",
`in` = "In",
out = "Out",
net = "Net"
) %>%
data_color(
columns = net,
colors = scales::col_numeric(
palette = c("red", "white", "green"),
domain = c(-20, 20)
)
)
#| label: transfers-py
#| message: false
#| warning: false
#| cache: true
try:
# Get transfer data
transfers = recruiting_api.search_recruiting_players(
year=2024,
classification='Transfer'
)
# Convert to DataFrame
transfers_df = pd.DataFrame([{
'name': t.name,
'position': t.position,
'origin': t.origin if hasattr(t, 'origin') else None,
'destination': t.committed_to,
'rating': t.rating,
'stars': t.stars
} for t in transfers])
# Summary
print("2024 Transfer Portal Summary:")
print(f"Total transfers: {len(transfers_df)}")
print(f"Average rating: {transfers_df['rating'].mean():.4f}")
print(f"5-star: {(transfers_df['stars'] == 5).sum()}")
print(f"4-star: {(transfers_df['stars'] == 4).sum()}")
print(f"3-star: {(transfers_df['stars'] == 3).sum()}")
print(f"With destination: {transfers_df['destination'].notna().sum()}")
print(f"Uncommitted: {transfers_df['destination'].isna().sum()}")
# Top destinations
print("\n\nTop Transfer Destinations:")
print(transfers_df['destination'].value_counts().head(20))
# Net transfer balance
transfers_in = transfers_df[transfers_df['destination'].notna()].groupby('destination').size()
transfers_out = transfers_df[transfers_df['origin'].notna()].groupby('origin').size()
transfer_balance = pd.DataFrame({
'in': transfers_in,
'out': transfers_out
}).fillna(0)
transfer_balance['net'] = transfer_balance['in'] - transfer_balance['out']
transfer_balance = transfer_balance.sort_values('net', ascending=False)
print("\n\nNet Transfer Balance (Top 15):")
print(transfer_balance.head(15))
except ApiException as e:
print(f"Exception: {e}")
Transfer Portal Analytics
Key metrics to track:
#| label: transfer-metrics-r
#| message: false
#| warning: false
#| cache: true
# Position breakdown
position_transfers <- transfers_2024 %>%
count(position, sort = TRUE) %>%
head(15)
position_transfers %>%
ggplot(aes(x = reorder(position, n), y = n)) +
geom_col(fill = "#D50032", alpha = 0.8) +
coord_flip() +
labs(
title = "Transfer Portal by Position",
subtitle = "2024 Season",
x = "Position",
y = "Number of Transfers",
caption = "Data: 247Sports via CFBD API"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
panel.grid.major.y = element_blank()
)
#| label: transfer-metrics-py
#| message: false
#| warning: false
#| cache: true
if len(transfers_df) > 0:
# Position breakdown
print("\n\nTransfers by Position (Top 15):")
position_counts = transfers_df['position'].value_counts().head(15)
print(position_counts)
# Visualize
plt.figure(figsize=(10, 6))
position_counts.plot(kind='barh', color='#D50032', alpha=0.8)
plt.xlabel('Number of Transfers', fontsize=12)
plt.ylabel('Position', fontsize=12)
plt.title('Transfer Portal by Position\n2024 Season',
fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
Transfer Portal Analytics Best Practices
1. **Track timing**: Early vs. late transfer window matters
2. **Monitor position needs**: Which positions see the most movement?
3. **Quality vs. quantity**: Net transfer counts mean little without quality metrics
4. **Multi-year patterns**: A one-year snapshot can be misleading
5. **Coaching changes**: New coaches trigger transfer waves
6. **Portal + recruiting**: Combine both for a full picture of roster building
Coaching Data
Coaching changes are frequent in college football and have major impacts on program performance:
#| label: coaching-r
#| message: false
#| warning: false
#| cache: true
# Get coaching records
coaches_2023 <- cfbd_coaches(year = 2023)
# Current head coaches with longest tenure
tenured_coaches <- coaches_2023 %>%
mutate(
tenure = 2023 - first_year + 1,
win_pct = wins / (wins + losses)
) %>%
arrange(desc(tenure)) %>%
select(first_name, last_name, school,
first_year, tenure, wins, losses, win_pct) %>%
head(20)
tenured_coaches %>%
gt() %>%
tab_header(
title = "Longest-Tenured Active FBS Coaches",
subtitle = "Through 2023 Season"
) %>%
cols_label(
first_name = "First",
last_name = "Last",
school = "School",
first_year = "Start",
tenure = "Years",
wins = "Wins",
losses = "Losses",
win_pct = "Win %"
) %>%
fmt_percent(
columns = win_pct,
decimals = 1
)
# Coaching changes analysis
coaches_2022 <- cfbd_coaches(year = 2022)
# Identify new coaches in 2023
new_coaches_2023 <- coaches_2023 %>%
filter(first_year == 2023)
cat("\n\nNew head coaches in 2023:", nrow(new_coaches_2023), "\n")
# Show new hires
new_coaches_2023 %>%
select(first_name, last_name, school) %>%
gt() %>%
tab_header(
title = "New Head Coaches in 2023"
)
#| label: coaching-py
#| message: false
#| warning: false
#| cache: true
# Get coaching data
coaches_api = cfbd.CoachesApi(cfbd.ApiClient(configuration))
try:
coaches = coaches_api.get_coaches(year=2023)
# Convert to DataFrame
coaches_df = pd.DataFrame([{
'name': f"{c.first_name} {c.last_name}",
'first_name': c.first_name,
'last_name': c.last_name,
'school': c.school,
'first_year': c.first_year,
'wins': c.wins if hasattr(c, 'wins') else 0,
'losses': c.losses if hasattr(c, 'losses') else 0
} for c in coaches])
# Calculate tenure and win percentage
coaches_df['tenure'] = 2023 - coaches_df['first_year'] + 1
coaches_df['win_pct'] = coaches_df['wins'] / (
coaches_df['wins'] + coaches_df['losses']
)
# Longest tenured
print("Longest-Tenured Active FBS Coaches (Through 2023):")
tenured = coaches_df.nlargest(20, 'tenure')[
['name', 'school', 'first_year', 'tenure',
'wins', 'losses', 'win_pct']
]
print(tenured.to_string(index=False))
# New coaches
new_coaches = coaches_df[coaches_df['first_year'] == 2023]
print(f"\n\nNew head coaches in 2023: {len(new_coaches)}")
if len(new_coaches) > 0:
print(new_coaches[['name', 'school']].to_string(index=False))
except ApiException as e:
print(f"Exception: {e}")
Coaching Change Impact
Coaching changes significantly affect:
- Recruiting (decommits, new targets)
- Transfer portal (players following coaches)
- Scheme changes (offensive/defensive philosophy)
- Staff continuity
- Short-term performance volatility
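That last point, short-term volatility, is why coaching-change analyses usually compare multi-year averages rather than single seasons. A minimal sketch on invented win totals around a hypothetical change:

```python
import pandas as pd

# Invented win totals around a hypothetical 2021 coaching change.
seasons = pd.DataFrame({
    "year":  [2019, 2020, 2021, 2022, 2023],
    "wins":  [9, 8, 4, 7, 10],
    "coach": ["Old", "Old", "New", "New", "New"],
})

# Average wins under each coach; a year-one dip followed by recovery
# is a common (though far from universal) pattern.
summary = seasons.groupby("coach")["wins"].agg(["mean", "count"])
print(summary)
```

With real data you would build the `coach` column from `cfbd_coaches()` tenures and repeat this across all programs that changed coaches in a given year.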
Building Integrated College Data Pipelines
Up to this point, we've explored individual data sources in isolation: play-by-play data, recruiting rankings, transfer portal, coaching records, and team metrics. While each provides valuable insights on its own, the real power of college football analytics emerges when you integrate these diverse data streams into a unified analytical pipeline.
Why integration matters:
- Holistic team evaluation: Combine on-field performance (PBP data) with roster quality (recruiting/talent) and roster changes (transfers)
- Context-aware metrics: Adjust efficiency stats for opponent quality and talent differentials
- Predictive modeling: Build models that incorporate both current performance and future talent influx
- Program health indicators: Track multi-year trends in recruiting, transfers, coaching, and performance
- Comparative analysis: Benchmark teams against peers with similar resources and talent levels
However, building integrated pipelines for college football presents unique challenges compared to NFL analytics:
- Multiple data sources: CFBD API alone has 40+ endpoints, each with different structures and update frequencies
- Temporal alignment: Recruiting data (by class year), season data (by year), and coaching data (by tenure) use different time scales
- Conference realignment: Teams change conferences, requiring dynamic joins rather than static lookups
- Data quality variance: Different endpoints have different coverage, requiring validation at each step
- Rate limiting: Free tier limits require careful request management across multiple data types
- Storage considerations: Full historical datasets can be large (multiple GB uncompressed)
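The temporal-alignment challenge can be made concrete with a small helper that maps a season to the recruiting class years feeding its roster; making this mapping explicit before joining recruiting tables to season tables avoids off-by-one-class errors. The function name and the five-class depth are our own choices, not part of any API:

```python
def roster_class_years(season, depth=5):
    """Recruiting class years that plausibly feed the roster for a season.

    A roster in season Y is drawn mostly from classes Y-depth+1 .. Y
    (redshirts and COVID eligibility can stretch this further).
    """
    return list(range(season - depth + 1, season + 1))

print(roster_class_years(2023))  # [2019, 2020, 2021, 2022, 2023]
```

The Blue Chip Ratio code earlier in this chapter does exactly this implicitly when it aggregates `cfbd_recruiting_player()` results over 2020:2023 for a 2023 roster.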
Best practices for college football data pipelines:
- Cache aggressively: Store downloaded data locally to avoid redundant API calls
- Validate incrementally: Check data quality at each integration step, not just at the end
- Handle missing data gracefully: Use left joins and explicit NA handling
- Document conference changes: Track historical conference affiliations in lookup tables
- Version your data: Save timestamped snapshots to reproduce historical analyses
- Separate raw from processed: Keep original API responses separate from cleaned/integrated data
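The caching practice can be sketched as a small file-based wrapper: one JSON file per endpoint and year, refreshed only when stale. All names here are ours; adapt the fetch function to whichever CFBD wrapper you use:

```python
import json
import tempfile
import time
from pathlib import Path

def cached_fetch(fetch_fn, name, year, cache_dir, max_age_days=7):
    """Return cached JSON for (name, year) if fresh, else call fetch_fn."""
    path = Path(cache_dir) / f"{name}_{year}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400
        if age_days < max_age_days:
            return json.loads(path.read_text())
    data = fetch_fn(year)              # hit the API only on a cache miss
    path.write_text(json.dumps(data))
    return data

# Demo with a fake fetcher so the sketch runs without an API key.
calls = []
def fake_fetch(year):
    calls.append(year)
    return {"year": year, "teams": ["A", "B"]}

cache_dir = tempfile.mkdtemp()
first = cached_fetch(fake_fetch, "teams", 2023, cache_dir)
second = cached_fetch(fake_fetch, "teams", 2023, cache_dir)
print(len(calls))  # 1 -- the second call was served from disk
```

Timestamped filenames (for example `teams_2023_20240115.json`) extend the same idea to versioned snapshots.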
Let's build a comprehensive pipeline that integrates multiple college football data sources into a unified analytical database:
#| label: pipeline-r
#| eval: false
#| echo: true
# Complete College Football Data Pipeline
library(tidyverse)
library(cfbfastR)
build_college_football_database <- function(years, output_dir = "data/cfb") {
# Create output directory
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
# Initialize list to store data
cfb_data <- list()
for (year in years) {
cat("Processing", year, "season...\n")
# 1. Play-by-play data
cat(" Loading play-by-play data...\n")
pbp <- safely(cfbd_pbp_data)(year = year)
if (!is.null(pbp$result)) {
cfb_data[[paste0("pbp_", year)]] <- pbp$result
}
# 2. Team information
cat(" Loading team information...\n")
teams <- safely(cfbd_team_info)(year = year, only_fbs = TRUE)
if (!is.null(teams$result)) {
cfb_data[[paste0("teams_", year)]] <- teams$result
}
# 3. Game information
cat(" Loading game data...\n")
games <- safely(cfbd_game_info)(year = year)
if (!is.null(games$result)) {
cfb_data[[paste0("games_", year)]] <- games$result
}
# 4. Team statistics
cat(" Loading team statistics...\n")
team_stats <- safely(cfbd_stats_season_team)(year = year)
if (!is.null(team_stats$result)) {
cfb_data[[paste0("team_stats_", year)]] <- team_stats$result
}
# 5. Recruiting data
cat(" Loading recruiting data...\n")
recruiting <- safely(cfbd_recruiting_team)(year = year)
if (!is.null(recruiting$result)) {
cfb_data[[paste0("recruiting_", year)]] <- recruiting$result
}
# 6. Talent composites
cat(" Loading talent data...\n")
talent <- safely(cfbd_team_talent)(year = year)
if (!is.null(talent$result)) {
cfb_data[[paste0("talent_", year)]] <- talent$result
}
# 7. Transfer portal (2021 onwards)
if (year >= 2021) {
cat(" Loading transfer portal data...\n")
transfers <- safely(cfbd_recruiting_transfer_portal)(year = year)
if (!is.null(transfers$result)) {
cfb_data[[paste0("transfers_", year)]] <- transfers$result
}
}
# 8. Rankings
cat(" Loading rankings...\n")
rankings <- safely(cfbd_rankings)(year = year)
if (!is.null(rankings$result)) {
cfb_data[[paste0("rankings_", year)]] <- rankings$result
}
# 9. Betting lines
cat(" Loading betting lines...\n")
lines <- safely(cfbd_betting_lines)(year = year)
if (!is.null(lines$result)) {
cfb_data[[paste0("lines_", year)]] <- lines$result
}
# 10. Coaches
cat(" Loading coaching data...\n")
coaches <- safely(cfbd_coaches)(year = year)
if (!is.null(coaches$result)) {
cfb_data[[paste0("coaches_", year)]] <- coaches$result
}
# Save individual year
saveRDS(
cfb_data[grepl(paste0("_", year, "$"), names(cfb_data))],
file.path(output_dir, paste0("cfb_data_", year, ".rds"))
)
cat(" Year", year, "complete.\n\n")
# Rate limiting
Sys.sleep(1)
}
# Save combined dataset
saveRDS(cfb_data, file.path(output_dir, "cfb_data_complete.rds"))
cat("Pipeline complete! Data saved to:", output_dir, "\n")
return(cfb_data)
}
# Build database for recent years
cfb_db <- build_college_football_database(
years = 2019:2023,
output_dir = "data/college_football"
)
# Create integrated analysis dataset
create_integrated_dataset <- function(cfb_data, year) {
# Extract year-specific data
teams <- cfb_data[[paste0("teams_", year)]]
games <- cfb_data[[paste0("games_", year)]]
team_stats <- cfb_data[[paste0("team_stats_", year)]]
recruiting <- cfb_data[[paste0("recruiting_", year)]]
talent <- cfb_data[[paste0("talent_", year)]]
# Combine game results
team_records <- games %>%
mutate(
home_win = home_points > away_points,
away_win = away_points > home_points
) %>%
select(home_team, away_team, home_win, away_win,
home_points, away_points) %>%
pivot_longer(
cols = c(home_team, away_team),
names_to = "location",
values_to = "team"
) %>%
mutate(
win = if_else(location == "home_team", home_win, away_win),
points_for = if_else(location == "home_team", home_points, away_points),
points_against = if_else(location == "home_team", away_points, home_points)
) %>%
group_by(team) %>%
summarise(
games = n(),
wins = sum(win, na.rm = TRUE),
losses = games - wins,
total_points = sum(points_for, na.rm = TRUE),
total_points_allowed = sum(points_against, na.rm = TRUE),
avg_points = total_points / games,
avg_points_allowed = total_points_allowed / games,
.groups = "drop"
)
# Combine into integrated dataset
integrated <- teams %>%
left_join(
team_records,
by = c("school" = "team")
) %>%
left_join(
recruiting %>%
select(team, recruiting_rank = rank, recruiting_points = points),
by = c("school" = "team")
) %>%
left_join(
talent %>% select(school, talent),
by = "school"
)
return(integrated)
}
# Example: Create 2023 integrated dataset
integrated_2023 <- create_integrated_dataset(cfb_db, 2023)
# Save integrated dataset
write_csv(integrated_2023, "data/college_football/integrated_2023.csv")
#| label: pipeline-py
#| eval: false
#| echo: true
import pandas as pd
import cfbd
import pickle
import os
from pathlib import Path
import time
class CollegeFootballPipeline:
"""
Comprehensive college football data pipeline
"""
def __init__(self, api_key):
"""Initialize API configuration"""
self.configuration = cfbd.Configuration()
self.configuration.api_key['Authorization'] = api_key
self.configuration.api_key_prefix['Authorization'] = 'Bearer'
# Initialize API instances
self.plays_api = cfbd.PlaysApi(cfbd.ApiClient(self.configuration))
self.teams_api = cfbd.TeamsApi(cfbd.ApiClient(self.configuration))
self.games_api = cfbd.GamesApi(cfbd.ApiClient(self.configuration))
self.recruiting_api = cfbd.RecruitingApi(cfbd.ApiClient(self.configuration))
self.rankings_api = cfbd.RankingsApi(cfbd.ApiClient(self.configuration))
self.coaches_api = cfbd.CoachesApi(cfbd.ApiClient(self.configuration))
self.betting_api = cfbd.BettingApi(cfbd.ApiClient(self.configuration))
def load_season_data(self, year):
"""Load all data for a single season"""
season_data = {}
print(f"Loading data for {year} season...")
try:
# 1. Teams
print(" Loading teams...")
teams = self.teams_api.get_fbs_teams(year=year)
season_data['teams'] = pd.DataFrame([t.to_dict() for t in teams])
time.sleep(0.5)
# 2. Games
print(" Loading games...")
games = self.games_api.get_games(year=year)
season_data['games'] = pd.DataFrame([g.to_dict() for g in games])
time.sleep(0.5)
# 3. Play-by-play
print(" Loading play-by-play...")
plays = self.plays_api.get_plays(year=year, season_type='regular')
season_data['plays'] = pd.DataFrame([p.to_dict() for p in plays])
time.sleep(0.5)
# 4. Recruiting
print(" Loading recruiting...")
recruiting = self.recruiting_api.get_recruiting_teams(year=year)
season_data['recruiting'] = pd.DataFrame([r.to_dict() for r in recruiting])
time.sleep(0.5)
# 5. Talent
print(" Loading talent...")
talent = self.teams_api.get_talent(year=year)
season_data['talent'] = pd.DataFrame([t.to_dict() for t in talent])
time.sleep(0.5)
# 6. Rankings
print(" Loading rankings...")
rankings = self.rankings_api.get_rankings(year=year)
if rankings:
all_ranks = []
for poll in rankings:
for rank in poll.ranks:
rank_dict = rank.to_dict()
rank_dict['poll'] = poll.poll
rank_dict['week'] = poll.week
all_ranks.append(rank_dict)
season_data['rankings'] = pd.DataFrame(all_ranks)
time.sleep(0.5)
# 7. Coaches
print(" Loading coaches...")
coaches = self.coaches_api.get_coaches(year=year)
season_data['coaches'] = pd.DataFrame([c.to_dict() for c in coaches])
time.sleep(0.5)
# 8. Transfer portal (2021+)
if year >= 2021:
print(" Loading transfer portal...")
try:
transfers = self.recruiting_api.search_recruiting_players(
year=year,
classification='Transfer'
)
season_data['transfers'] = pd.DataFrame([t.to_dict() for t in transfers])
time.sleep(0.5)
except Exception:
print(" Transfer data not available")
print(f" {year} season complete!\n")
except Exception as e:
print(f"Error loading {year}: {e}")
return season_data
def build_database(self, years, output_dir='data/cfb'):
"""Build complete database for multiple years"""
# Create output directory
Path(output_dir).mkdir(parents=True, exist_ok=True)
all_data = {}
for year in years:
season_data = self.load_season_data(year)
all_data[year] = season_data
# Save individual year
with open(f"{output_dir}/cfb_data_{year}.pkl", 'wb') as f:
pickle.dump(season_data, f)
# Save complete database
with open(f"{output_dir}/cfb_data_complete.pkl", 'wb') as f:
pickle.dump(all_data, f)
print(f"Pipeline complete! Data saved to: {output_dir}")
return all_data
def create_integrated_dataset(self, season_data):
"""Create integrated dataset from season data"""
# Start with teams
integrated = season_data['teams'].copy()
# Add recruiting
if 'recruiting' in season_data:
integrated = integrated.merge(
season_data['recruiting'][['team', 'rank', 'points']],
left_on='school',
right_on='team',
how='left',
suffixes=('', '_recruiting')
)
integrated.rename(columns={
'rank': 'recruiting_rank',
'points': 'recruiting_points'
}, inplace=True)
# Add talent
if 'talent' in season_data:
integrated = integrated.merge(
season_data['talent'][['school', 'talent']],
on='school',
how='left'
)
# Add win-loss records from games
if 'games' in season_data:
games = season_data['games']
# Calculate records
home_games = games[['home_team', 'home_points', 'away_points']].copy()
home_games['win'] = home_games['home_points'] > home_games['away_points']
home_games.rename(columns={'home_team': 'team'}, inplace=True)
away_games = games[['away_team', 'home_points', 'away_points']].copy()
away_games['win'] = away_games['away_points'] > away_games['home_points']
away_games.rename(columns={'away_team': 'team'}, inplace=True)
all_games = pd.concat([home_games, away_games])
records = all_games.groupby('team').agg(
games=('win', 'count'),
wins=('win', 'sum')
).reset_index()
records['losses'] = records['games'] - records['wins']
integrated = integrated.merge(
records,
left_on='school',
right_on='team',
how='left',
suffixes=('', '_record')
)
return integrated
# Example usage
if __name__ == "__main__":
# Initialize pipeline
api_key = os.environ.get('CFBD_API_KEY')
pipeline = CollegeFootballPipeline(api_key)
# Build database
cfb_db = pipeline.build_database(
years=range(2019, 2024),
output_dir='data/college_football'
)
# Create integrated dataset for 2023
integrated_2023 = pipeline.create_integrated_dataset(cfb_db[2023])
print("\nIntegrated 2023 Dataset:")
print(integrated_2023.head())
# Save to CSV
integrated_2023.to_csv('data/college_football/integrated_2023.csv', index=False)
Conference Strength Metrics and Tempo-Free Statistics
One of college football's most persistent analytical challenges is quantifying conference strength and adjusting team statistics for pace of play. Unlike the NFL, where schedule imbalances are relatively minor, college football's conference-based structure creates dramatic differences in strength of schedule that can make raw statistics misleading.
Conference Strength: The Fundamental Challenge
Consider two teams with identical 8-4 records:
- Team A: Plays in the SEC, facing Alabama, Georgia, LSU, and Florida
- Team B: Plays in the MAC, facing Toledo, Miami (OH), Bowling Green, and Northern Illinois
Are these teams equally good? Obviously not. Team A faced far superior competition. But how do we quantify this difference?
Conference strength metrics attempt to solve this problem by aggregating team-level performance indicators across entire conferences. Common approaches include:
- Average team rating: Mean SP+ or FPI for all conference teams
- Top-to-bottom depth: Standard deviation of team ratings (lower = more balanced)
- Non-conference performance: How conference teams perform against outside opponents
- Computer ranking aggregates: Sagarin, Massey, and other computer ratings
- Historical performance: Bowl records, playoff appearances, NFL draft picks
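The first two approaches above can be sketched in a few lines. Here is a minimal illustration using made-up team ratings (not real SP+ or FPI values):

```python
import statistics

# Hypothetical team ratings by conference (illustrative, not real SP+/FPI values)
ratings = {
    "SEC": [25.1, 22.4, 18.0, 9.5, 4.2, -1.3],
    "MAC": [6.0, 3.2, 1.1, -4.5, -8.0, -12.3],
}

for conf, team_ratings in ratings.items():
    avg = statistics.mean(team_ratings)     # average team rating
    depth = statistics.stdev(team_ratings)  # top-to-bottom spread; lower = more balanced
    print(f"{conf}: average rating {avg:.2f}, rating SD {depth:.2f}")
```

With real ratings, these two aggregates separate conferences by both overall quality (the mean) and internal balance (the standard deviation).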
Why Conference Strength Matters for Analytics
Conference strength affects nearly every analytical task in college football:
**1. Team Evaluation**
- A 9-3 Big Ten team may be better than an 11-1 Mountain West team
- Raw win totals don't reflect quality of competition
- Rankings and playoff selection depend on strength of schedule
**2. Statistical Adjustment**
- A team averaging 35 points per game against weak competition isn't as impressive as 30 PPG against elite defenses
- Need opponent-adjusted efficiency metrics
**3. Recruiting Territory**
- Conference realignment affects recruiting pipelines
- SEC and Big Ten footprints overlap with talent-rich states
**4. Predictive Modeling**
- Models must account for conference strength when predicting outcomes
- Cross-conference games require careful calibration
**5. Player Evaluation**
- A quarterback's stats against Group of 5 defenses aren't predictive of success against SEC defenses
- NFL scouts heavily discount statistics from weak conferences
Tempo-Free Statistics
Another critical adjustment for college football analytics is accounting for pace of play or tempo. Some teams play extremely fast (80+ plays per game), while others play deliberately (55-60 plays per game). This creates a major confound when comparing traditional counting statistics.
Example: Team A averages 450 yards per game over 80 plays, while Team B averages 400 yards per game over 60 plays. Which offense is better?
- Team A: 450 yards / 80 plays = 5.63 yards per play
- Team B: 400 yards / 60 plays = 6.67 yards per play
Team B is actually more efficient despite lower total yardage. This is why per-play metrics (yards per play, points per play, EPA per play) are essential in college football analytics.
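The arithmetic above is trivial to verify in code; this sketch simply recomputes the two hypothetical offenses on a per-play basis:

```python
# Recompute the Team A vs Team B example on a per-play basis
teams = {
    "Team A": {"yards": 450, "plays": 80},
    "Team B": {"yards": 400, "plays": 60},
}

for name, totals in teams.items():
    ypp = totals["yards"] / totals["plays"]  # tempo-free yards per play
    print(f"{name}: {ypp:.3f} yards per play")

# Team B wins the efficiency comparison (6.667 vs 5.625) despite fewer total yards
```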
Tempo-Free Statistics Explained
**Tempo-free statistics** adjust for pace of play to enable fair comparisons between teams with different playing styles.
**Key tempo-free metrics**:
1. **Yards per play** (YPP): Total yards / total plays
2. **Points per play** (PPP): Total points / total plays
3. **EPA per play**: Total EPA / total plays
4. **Success rate**: Percentage of successful plays (not tempo-dependent)
5. **Plays per game**: Direct measure of pace
6. **Seconds per play**: Average time between plays
7. **Possessions per game**: Number of offensive possessions
**Why tempo matters**:
- **Fast tempo teams** (Oregon, Auburn historically) play 75-85+ plays per game
- **Slow tempo teams** (Wisconsin, Army, Navy) play 55-65 plays per game
- A 30% difference in plays means 15-20 more opportunities to score
- Total yardage and points are misleading without tempo adjustment
**Example application**:
- Team A: 35 points on 75 plays = 0.47 points per play
- Team B: 28 points on 60 plays = 0.47 points per play
Despite Team A scoring 7 more points, both teams have identical efficiency. Team A just had more opportunities.
**Best practice**: Always use per-play metrics when comparing teams, and include pace of play (plays per game) as a separate variable in your analysis.
Implementing Conference-Adjusted and Tempo-Free Metrics
#| label: conference-tempo-adjusted-r
#| eval: false
#| echo: true
#| message: false
#| warning: false
# Calculate conference-adjusted and tempo-free metrics
library(tidyverse)
library(cfbfastR)
library(gt)  # for the formatted tables below
# Load play-by-play and team data
pbp_2023 <- cfbd_pbp_data(year = 2023)
teams_2023 <- cfbd_team_info(year = 2023, only_fbs = TRUE)
# Calculate team-level tempo-free offensive metrics
team_offense <- pbp_2023 %>%
filter(
play_type %in% c("pass", "run"), # Regular offensive plays only
!is.na(EPA) # Valid EPA values only
) %>%
group_by(pos_team) %>%
summarise(
# Counting stats (tempo-dependent)
total_plays = n(),
total_yards = sum(yards_gained, na.rm = TRUE),
total_epa = sum(EPA, na.rm = TRUE),
# Tempo-free stats (per-play efficiency)
yards_per_play = mean(yards_gained, na.rm = TRUE),
epa_per_play = mean(EPA, na.rm = TRUE),
success_rate = mean(EPA > 0, na.rm = TRUE),
# Tempo metric
plays_per_game = n() / n_distinct(game_id),
.groups = "drop"
) %>%
rename(team = pos_team)
# Join with conference information
team_metrics <- team_offense %>%
left_join(
teams_2023 %>% select(school, conference),
by = c("team" = "school")
)
# Calculate conference strength (average EPA per play)
conference_strength <- team_metrics %>%
group_by(conference) %>%
summarise(
teams = n(),
avg_epa_per_play = mean(epa_per_play, na.rm = TRUE),
median_epa_per_play = median(epa_per_play, na.rm = TRUE),
sd_epa = sd(epa_per_play, na.rm = TRUE),
avg_tempo = mean(plays_per_game, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(avg_epa_per_play))
# Display conference strength rankings
conference_strength %>%
gt() %>%
tab_header(
title = "Conference Offensive Strength Rankings",
subtitle = "Tempo-free efficiency metrics, 2023"
) %>%
fmt_number(
columns = c(avg_epa_per_play, median_epa_per_play, sd_epa, avg_tempo),
decimals = 3
) %>%
cols_label(
conference = "Conference",
teams = "Teams",
avg_epa_per_play = "Avg EPA/Play",
median_epa_per_play = "Median EPA/Play",
sd_epa = "SD EPA",
avg_tempo = "Avg Plays/Game"
)
# Calculate opponent-adjusted EPA for each team
team_adjusted <- team_metrics %>%
left_join(
conference_strength %>% select(conference, conf_avg_epa = avg_epa_per_play),
by = "conference"
) %>%
mutate(
# Adjusted EPA = actual EPA minus conference average
# Positive = performing above conference average
epa_above_conference = epa_per_play - conf_avg_epa,
# Classify tempo
tempo_category = case_when(
plays_per_game >= 75 ~ "Fast",
plays_per_game >= 65 ~ "Average",
TRUE ~ "Slow"
)
)
# Show top teams by adjusted EPA
team_adjusted %>%
arrange(desc(epa_above_conference)) %>%
select(team, conference, epa_per_play, epa_above_conference,
plays_per_game, tempo_category) %>%
head(20) %>%
gt() %>%
tab_header(
title = "Top Teams by Conference-Adjusted Efficiency",
subtitle = "EPA per play relative to conference average"
) %>%
fmt_number(
columns = c(epa_per_play, epa_above_conference, plays_per_game),
decimals = 3
)
#| label: conference-tempo-adjusted-py
#| eval: false
#| echo: true
#| message: false
#| warning: false
import pandas as pd
import numpy as np
import cfbd
# Initialize API (assuming configuration already set up)
plays_api = cfbd.PlaysApi(cfbd.ApiClient(configuration))
teams_api = cfbd.TeamsApi(cfbd.ApiClient(configuration))
# Load play-by-play data
plays = plays_api.get_plays(year=2023, season_type='regular')
pbp_2023 = pd.DataFrame([p.to_dict() for p in plays])
# Load team data for conference information
teams = teams_api.get_fbs_teams(year=2023)
teams_df = pd.DataFrame([{
'school': t.school,
'conference': t.conference
} for t in teams])
# Calculate team-level tempo-free offensive metrics
# Filter to regular offensive plays with valid EPA
offensive_plays = pbp_2023[
(pbp_2023['play_type'].isin(['pass', 'run'])) &
(pbp_2023['ppa'].notna())
].copy()
team_offense = offensive_plays.groupby('offense').agg(
# Counting stats (tempo-dependent)
total_plays=('id', 'count'),
total_yards=('yards_gained', 'sum'),
total_epa=('ppa', 'sum'),
games=('game_id', 'nunique'),
# Tempo-free stats (per-play efficiency)
yards_per_play=('yards_gained', 'mean'),
epa_per_play=('ppa', 'mean'),
success_rate=('ppa', lambda x: (x > 0).mean())
).reset_index()
# Calculate plays per game (tempo)
team_offense['plays_per_game'] = team_offense['total_plays'] / team_offense['games']
# Join with conference information
team_metrics = team_offense.merge(
teams_df,
left_on='offense',
right_on='school',
how='left'
)
# Calculate conference strength
conference_strength = (team_metrics
.groupby('conference')
.agg(
teams=('school', 'count'),
avg_epa_per_play=('epa_per_play', 'mean'),
median_epa_per_play=('epa_per_play', 'median'),
sd_epa=('epa_per_play', 'std'),
avg_tempo=('plays_per_game', 'mean')
)
.sort_values('avg_epa_per_play', ascending=False)
.reset_index()
)
print("Conference Offensive Strength Rankings (2023):")
print(conference_strength.to_string(index=False))
# Calculate opponent-adjusted EPA
team_adjusted = team_metrics.merge(
conference_strength[['conference', 'avg_epa_per_play']],
on='conference',
how='left',
suffixes=('', '_conf')
)
# Calculate EPA above conference average
team_adjusted['epa_above_conference'] = (
team_adjusted['epa_per_play'] - team_adjusted['avg_epa_per_play_conf']
)
# Classify tempo
team_adjusted['tempo_category'] = pd.cut(
team_adjusted['plays_per_game'],
bins=[0, 65, 75, 100],
labels=['Slow', 'Average', 'Fast']
)
# Show top teams
top_teams = (team_adjusted
.sort_values('epa_above_conference', ascending=False)
[['school', 'conference', 'epa_per_play', 'epa_above_conference',
'plays_per_game', 'tempo_category']]
.head(20)
)
print("\n\nTop Teams by Conference-Adjusted Efficiency:")
print(top_teams.to_string(index=False))
This analysis reveals which teams are truly elite after accounting for conference strength and tempo. A team that dominates a weak conference might not look as impressive when compared to conference peers, while a team with a moderate record in a strong conference might actually be very efficient.
Data Quality and Validation
With an understanding of the complexities of conference strength and tempo-adjusted metrics, we now turn to ensuring the data underlying these analyses is accurate and complete. College football data requires careful quality checks:
#| label: data-quality-r
#| eval: false
#| echo: true
# Data quality validation function
validate_cfb_data <- function(pbp_data, year) {
validation_results <- list()
# Check 1: Missing values
missing_check <- pbp_data %>%
summarise(
total_plays = n(),
missing_epa = sum(is.na(EPA)),
missing_team = sum(is.na(pos_team)),
missing_down = sum(is.na(down)),
missing_distance = sum(is.na(distance))
)
validation_results$missing <- missing_check
# Check 2: Value ranges
range_check <- pbp_data %>%
summarise(
invalid_down = sum(down < 1 | down > 4, na.rm = TRUE),
invalid_distance = sum(distance < 0 | distance > 99, na.rm = TRUE),
invalid_yards_to_goal = sum(yards_to_goal < 0 | yards_to_goal > 100, na.rm = TRUE)
)
validation_results$ranges <- range_check
# Check 3: Expected number of games
game_count <- pbp_data %>%
summarise(
unique_games = n_distinct(game_id),
expected_min = 130 * 12 / 2, # ~130 FBS teams x 12 games each, two teams per game
expected_max = 130 * 15 / 2 # up to 15 games with championships and bowls
)
validation_results$games <- game_count
# Check 4: Conference coverage
conference_coverage <- pbp_data %>%
left_join(
cfbd_team_info(year = year) %>% select(school, conference),
by = c("pos_team" = "school")
) %>%
group_by(conference) %>%
summarise(
teams = n_distinct(pos_team),
plays = n()
) %>%
arrange(desc(plays))
validation_results$conferences <- conference_coverage
return(validation_results)
}
# Run validation
validation <- validate_cfb_data(pbp_2023, 2023)
# Print results
cat("Data Quality Report\n")
cat("===================\n\n")
cat("Missing Values:\n")
print(validation$missing)
cat("\n\nRange Violations:\n")
print(validation$ranges)
cat("\n\nGame Count:\n")
print(validation$games)
cat("\n\nConference Coverage:\n")
print(validation$conferences)
#| label: data-quality-py
#| eval: false
#| echo: true
def validate_cfb_data(pbp_data, year):
"""
Validate college football data quality
"""
validation_results = {}
# Check 1: Missing values
missing_check = {
'total_plays': len(pbp_data),
'missing_offense': pbp_data['offense'].isna().sum(),
'missing_down': pbp_data['down'].isna().sum(),
'missing_distance': pbp_data['distance'].isna().sum()
}
validation_results['missing'] = missing_check
# Check 2: Value ranges
range_check = {
'invalid_down': ((pbp_data['down'] < 1) | (pbp_data['down'] > 4)).sum(),
'invalid_distance': ((pbp_data['distance'] < 0) | (pbp_data['distance'] > 99)).sum(),
'invalid_yards_to_goal': ((pbp_data['yards_to_goal'] < 0) | (pbp_data['yards_to_goal'] > 100)).sum()
}
validation_results['ranges'] = range_check
# Check 3: Expected number of games
game_count = {
'unique_games': pbp_data['game_id'].nunique(),
'expected_min': 130 * 12 // 2,  # ~130 FBS teams x 12 games each, two teams per game
'expected_max': 130 * 15 // 2   # up to 15 games with championships and bowls
}
validation_results['games'] = game_count
# Check 4: Play type distribution
play_types = pbp_data['play_type'].value_counts()
validation_results['play_types'] = play_types
return validation_results
# Run validation
validation = validate_cfb_data(pbp_2023, 2023)
# Print results
print("Data Quality Report")
print("===================\n")
print("Missing Values:")
for k, v in validation['missing'].items():
print(f" {k}: {v}")
print("\n\nRange Violations:")
for k, v in validation['ranges'].items():
print(f" {k}: {v}")
print("\n\nGame Count:")
for k, v in validation['games'].items():
print(f" {k}: {v}")
print("\n\nPlay Type Distribution:")
print(validation['play_types'].head(10))
Visualizing College Football Data
College football's unique characteristics require specialized visualizations:
Conference Strength Comparison
#| label: fig-conference-strength-r
#| fig-cap: "Conference strength by average team talent (2023)"
#| fig-width: 12
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
#| eval: false
# Conference strength visualization
conference_talent <- talent %>%
left_join(
fbs_teams %>% select(school, conference),
by = "school"
) %>%
filter(!is.na(conference)) %>%
group_by(conference) %>%
summarise(
teams = n(),
avg_talent = mean(talent, na.rm = TRUE),
median_talent = median(talent, na.rm = TRUE),
total_talent = sum(talent, na.rm = TRUE),
.groups = "drop"
)
conference_talent %>%
ggplot(aes(x = reorder(conference, avg_talent), y = avg_talent)) +
geom_col(aes(fill = avg_talent), show.legend = FALSE, alpha = 0.8) +
geom_text(
aes(label = round(avg_talent, 1)),
hjust = -0.2,
size = 3.5
) +
scale_fill_viridis_c(option = "plasma") +
coord_flip() +
labs(
title = "FBS Conference Strength Rankings",
subtitle = "Average team talent composite, 2023",
x = NULL,
y = "Average Talent Rating",
caption = "Data: 247Sports Talent Composite via CFBD API"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
axis.text.y = element_text(size = 11),
panel.grid.major.y = element_blank()
)
📊 Visualization Output
The code above generates a visualization. To see the output, run this code in your R or Python environment. The resulting plot will help illustrate the concepts discussed in this section.
#| label: fig-conference-strength-py
#| fig-cap: "Conference strength visualization"
#| fig-width: 12
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
#| eval: false
# Conference strength visualization
import numpy as np
import matplotlib.pyplot as plt
if 'talent_df' in locals() and 'teams_df' in locals():
conf_talent = talent_df.merge(
teams_df[['school', 'conference']],
on='school'
)
conf_avg = (conf_talent
.groupby('conference')
.agg(
teams=('school', 'count'),
avg_talent=('talent', 'mean'),
median_talent=('talent', 'median'),
total_talent=('talent', 'sum')
)
.sort_values('avg_talent')
)
plt.figure(figsize=(12, 8))
colors = plt.cm.plasma(np.linspace(0, 1, len(conf_avg)))
plt.barh(range(len(conf_avg)), conf_avg['avg_talent'],
color=colors, alpha=0.8)
plt.yticks(range(len(conf_avg)), conf_avg.index)
# Add value labels
for i, v in enumerate(conf_avg['avg_talent']):
plt.text(v, i, f' {v:.1f}', va='center')
plt.xlabel('Average Talent Rating', fontsize=12)
plt.title('FBS Conference Strength Rankings\nAverage Team Talent Composite, 2023',
fontsize=16, fontweight='bold', pad=20)
plt.text(0.99, 0.01, 'Data: 247Sports via CFBD API',
transform=plt.gca().transAxes,
ha='right', fontsize=8, style='italic')
plt.tight_layout()
plt.show()
Recruiting vs Performance
#| label: fig-recruiting-performance-r
#| fig-cap: "Relationship between recruiting and wins"
#| fig-width: 10
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
#| eval: false
# Analyze recruiting vs performance
recruiting_performance <- integrated_2023 %>%
filter(!is.na(recruiting_rank), !is.na(wins)) %>%
mutate(
win_pct = wins / (wins + losses),
recruiting_tier = case_when(
recruiting_rank <= 10 ~ "Top 10",
recruiting_rank <= 25 ~ "Top 25",
recruiting_rank <= 50 ~ "Top 50",
TRUE ~ "Below 50"
)
)
recruiting_performance %>%
ggplot(aes(x = recruiting_rank, y = win_pct)) +
geom_point(aes(color = recruiting_tier), size = 3, alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
scale_x_reverse() +
scale_y_continuous(labels = scales::percent) +
scale_color_manual(
values = c("Top 10" = "#D4AF37", "Top 25" = "#C0C0C0",
"Top 50" = "#CD7F32", "Below 50" = "gray60")
) +
labs(
title = "Recruiting Class Rank vs Winning Percentage",
subtitle = "2023 Season - FBS Teams",
x = "Recruiting Class Rank (lower = better)",
y = "Winning Percentage",
color = "Recruiting Tier",
caption = "Data: CFBD API"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right"
)
# Calculate correlation
cor_result <- cor.test(
recruiting_performance$recruiting_rank,
recruiting_performance$win_pct,
method = "spearman"
)
cat("\nSpearman correlation between recruiting rank and win %:",
round(cor_result$estimate, 3), "\n")
cat("p-value:", format.pval(cor_result$p.value), "\n")
#| label: fig-recruiting-performance-py
#| fig-cap: "Recruiting vs performance correlation"
#| fig-width: 10
#| fig-height: 8
#| message: false
#| warning: false
#| cache: true
#| eval: false
import matplotlib.pyplot as plt
from scipy import stats
# Analyze recruiting vs performance
if 'integrated_2023' in locals():
recruit_perf = integrated_2023[
integrated_2023['recruiting_rank'].notna() &
integrated_2023['wins'].notna()
].copy()
recruit_perf['win_pct'] = recruit_perf['wins'] / (
recruit_perf['wins'] + recruit_perf['losses']
)
# Create recruiting tiers
recruit_perf['recruiting_tier'] = pd.cut(
recruit_perf['recruiting_rank'],
bins=[0, 10, 25, 50, 200],
labels=['Top 10', 'Top 25', 'Top 50', 'Below 50']
)
# Create plot
plt.figure(figsize=(10, 8))
colors = {'Top 10': '#D4AF37', 'Top 25': '#C0C0C0',
'Top 50': '#CD7F32', 'Below 50': 'gray'}
for tier in ['Top 10', 'Top 25', 'Top 50', 'Below 50']:
mask = recruit_perf['recruiting_tier'] == tier
plt.scatter(
recruit_perf[mask]['recruiting_rank'],
recruit_perf[mask]['win_pct'],
c=colors[tier],
label=tier,
s=100,
alpha=0.6
)
# Add regression line
z = np.polyfit(recruit_perf['recruiting_rank'],
recruit_perf['win_pct'], 1)
p = np.poly1d(z)
plt.plot(recruit_perf['recruiting_rank'].sort_values(),
p(recruit_perf['recruiting_rank'].sort_values()),
"k--", alpha=0.8)
plt.gca().invert_xaxis()
plt.xlabel('Recruiting Class Rank (lower = better)', fontsize=12)
plt.ylabel('Winning Percentage', fontsize=12)
plt.title('Recruiting Class Rank vs Winning Percentage\n2023 Season - FBS Teams',
fontsize=14, fontweight='bold', pad=20)
plt.legend(title='Recruiting Tier', loc='best')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate correlation
correlation = stats.spearmanr(
recruit_perf['recruiting_rank'],
recruit_perf['win_pct']
)
print(f"\nSpearman correlation: {correlation.correlation:.3f}")
print(f"p-value: {correlation.pvalue:.4e}")
Summary
College football analytics requires navigating a complex data landscape with unique challenges that don't exist in professional football. While NFL analytics can focus primarily on play-by-play data and relatively stable rosters, college football demands a multi-dimensional approach that integrates on-field performance, recruiting, transfers, coaching changes, and conference dynamics.
Throughout this chapter, we've built a comprehensive foundation for college football data analysis:
Data Infrastructure:
- cfbfastR ecosystem: The primary R package for college football data, providing access to 40+ API endpoints through familiar tidyverse-compatible functions
- CFBD API: The centralized source for play-by-play data (2013-present), recruiting rankings, transfer portal tracking, and team metrics
- Python alternatives: The cfbd package provides similar functionality for Python users, with parallel code examples throughout the chapter
Unique Challenges:
- Division structure: FBS vs FCS competitive separation, with talent gaps that require opponent-adjusted metrics rather than raw statistics
- Conference realignment: Teams changing conferences create temporal challenges for historical analysis and require dynamic rather than static data joins
- Data quality variance: Coverage differs dramatically by conference, era, and game type, requiring extensive validation
- Multi-source integration: Effective analysis requires combining play-by-play data, recruiting rankings, transfer information, and coaching records
College-Specific Analytics:
- Recruiting data: The Blue Chip Ratio (50% threshold for championship contention), class rankings, and multi-year talent accumulation patterns
- Transfer portal: Net transfer balance, position-specific movement, and the impact of coaching changes on roster turnover
- Coaching data: Tenure analysis, hire/fire cycles, and the relationship between coaching stability and program success
- Conference strength metrics: Opponent-adjusted efficiency measures that account for schedule quality
- Tempo-free statistics: Per-play metrics that enable fair comparisons between teams with different paces of play
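The Blue Chip Ratio mentioned above is simple to compute: the share of a roster's signees rated four or five stars, compared against the 50% threshold. A minimal sketch with hypothetical star counts:

```python
# Blue Chip Ratio: fraction of signees rated 4+ stars (hypothetical star counts)
signees = [5, 4, 4, 3, 3, 3, 4, 5, 3, 2, 4, 4]

blue_chips = sum(1 for stars in signees if stars >= 4)
bcr = blue_chips / len(signees)
contender = bcr >= 0.50  # the threshold historically associated with title contention
print(f"Blue Chip Ratio: {bcr:.2f} (contender pool: {contender})")
```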
Practical Implementation:
- Integrated pipelines: Building comprehensive databases that combine multiple data streams with proper error handling and rate limiting
- Data validation: Implementing quality checks specific to college football's data quirks
- Visualization approaches: Presenting complex multi-dimensional data (talent, performance, recruiting) in coherent visual formats
Key Analytical Insights:
- Recruiting predicts success: The Blue Chip Ratio has perfectly predicted championship-level programs, though it's necessary but not sufficient
- Conference matters enormously: Raw statistics without opponent adjustment are misleading in college football's stratified competitive structure
- Tempo must be adjusted: Teams playing 75 vs. 60 plays per game can't be compared on counting stats alone
- Transfers have changed everything: The portal has created NFL-style free agency, making roster management more dynamic
- Data quality requires vigilance: Unlike NFL data from a single source, college football data has gaps, inconsistencies, and conference-specific coverage issues
Looking Ahead:
This chapter provided the data infrastructure foundation. In Chapter 37, we'll apply these data sources to build college-specific models for team evaluation, game prediction, and recruiting analysis. Chapter 38 will explore advanced topics like strength of schedule adjustment, playoff selection analytics, and the intersection of NIL with traditional recruiting metrics.
The college game's unique characteristics—recruiting, transfers, conference realignment, and competitive imbalance—create both challenges and opportunities for analytics. Understanding these nuances and building robust data infrastructure is essential for effective college football analysis. With the tools and concepts from this chapter, you're equipped to navigate the complex landscape of college football data and extract meaningful insights that account for the sport's distinctive structure.
Key Takeaways
1. **Data quality varies**: Always validate and check coverage
2. **Context matters**: Account for strength of schedule and opponent quality
3. **Recruiting is fundamental**: Talent acquisition drives long-term success
4. **Change is constant**: Conference realignment and coaching changes require flexible data structures
5. **Multiple data sources**: Integrate play-by-play, recruiting, transfers, and coaching data for a complete picture
6. **Historical limitations**: Pre-2013 data is sparse for many teams
Exercises
Conceptual Questions
1. Competition Levels: Explain why directly comparing statistics between FBS and FCS teams can be misleading. What adjustments would you make to enable fair comparisons?
2. Recruiting vs Performance: Discuss the relationship between recruiting rankings and on-field success. What factors besides recruiting affect team performance? Why doesn't the highest-ranked recruiting class always win the national championship?
3. Transfer Portal Impact: How has the transfer portal changed college football analytics? What new metrics should be tracked that weren't relevant before 2018?
4. Conference Realignment: What analytical challenges does conference realignment create? How would you handle comparing teams across different eras when conference affiliations have changed?
5. Data Quality: Why is college football data quality more variable than NFL data? What steps can analysts take to ensure they're working with reliable data?
Coding Exercises
Exercise 1: Conference EPA Analysis
Load 2023 play-by-play data and calculate:
a) Average offensive EPA per play for each conference
b) Average defensive EPA per play for each conference
c) Success rate by conference
d) Create a scatter plot of offensive vs defensive EPA by conference
e) Determine if Power 5 conferences significantly outperform Group of 5
**Hint**: You'll need to join play-by-play data with team-conference mappings.
**Advanced**: Calculate opponent-adjusted EPA to account for strength of schedule.
Exercise 2: Recruiting-Performance Correlation
For the past 5 years (2019-2023):
a) Get recruiting class rankings for all FBS teams
b) Calculate season win totals for each team
c) Analyze the correlation between recruiting rank and wins
d) Create visualizations showing this relationship
e) Identify teams that consistently over/underperform relative to recruiting
f) Calculate the "Blue Chip Ratio" for top teams
**Advanced**: Build a regression model predicting wins from recruiting rank, controlling for prior year performance and coaching changes.
Exercise 3: Transfer Portal Analysis
Using transfer portal data from 2021-2024:
a) Calculate net transfers (in minus out) for each team
b) Determine average star rating of incoming vs outgoing transfers
c) Analyze which position groups see most transfer activity
d) Compare transfer activity between Power 5 and Group of 5 conferences
e) Identify teams that are "transfer portal winners" vs "losers"
f) Examine correlation between transfer activity and coaching changes
**Hint**: You'll need to combine multiple years of transfer data.
**Advanced**: Build a model predicting team success based on net transfer balance and transfer quality.
Exercise 4: Division Comparison
Compare FBS and FCS teams that played each other in 2023:

a) Identify all FBS vs FCS games
b) Calculate win rate for FBS teams
c) Analyze point differentials and scoring patterns
d) If available, compare EPA metrics
e) Determine if home/away matters in cross-division games
f) Calculate expected FBS margin of victory based on talent differential

**Goal**: Quantify the competitive gap between divisions.

Exercise 5: Build Complete Team Profile
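A useful way to structure this exercise is one assembler function composed of small helpers, each wrapping a single data source. The skeleton below is plain Python with placeholder helpers returning canned values (every number is invented); in a real pipeline each helper would wrap a cfbfastR or CFBD API call:

```python
# Placeholder helpers: each would wrap one real data-source call
def get_record(team, year):
    return {"wins": 11, "losses": 2}   # canned toy value

def get_recruiting_rank(team, year):
    return 5                           # canned toy value

def get_net_transfers(team, year):
    return 3                           # canned toy value

def team_profile(team, year):
    """Assemble one profile dict per school-season from the helpers."""
    profile = {"team": team, "year": year}
    profile.update(get_record(team, year))
    profile["recruiting_rank"] = get_recruiting_rank(team, year)
    profile["net_transfers"] = get_net_transfers(team, year)
    return profile

print(team_profile("Georgia", 2023))
```

Keeping each data source behind its own helper makes it easy to cache API responses and to add the historical-trend and strength-of-schedule pieces incrementally.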
Create a comprehensive team profile function that takes a school name and year and returns:

a) Season record and key statistics (wins, losses, points scored/allowed)
b) Recruiting class ranking (current year and past 4 years)
c) Roster talent composite and comparison to conference
d) Transfer portal activity (incoming/outgoing)
e) Conference standing and rankings
f) Coaching information (name, tenure, record)
g) Historical context (3-5 year trends in all of the above metrics)
h) Strength of schedule analysis

**Advanced**: Create a polished HTML report or dashboard using Quarto, gt, or Plotly.

Exercise 6: Conference Realignment Impact
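Tracking affiliations (part a) is easiest as a long table with one row per team-season, pivoted into a timeline. The sketch below hand-enters two publicly known moves (USC's 2024 move from the Pac-12 to the Big Ten, Texas's 2024 move from the Big 12 to the SEC); a real analysis would pull per-season team info from the CFBD API for all four schools across 2015-2024:

```python
import pandas as pd

# Long format: one row per team-season with its conference that year
moves = pd.DataFrame({
    "team": ["USC", "USC", "Texas", "Texas"],
    "year": [2023, 2024, 2023, 2024],
    "conference": ["Pac-12", "Big Ten", "Big 12", "SEC"],
})

# Pivot to one row per team, one column per year: the affiliation timeline
timeline = moves.pivot(index="team", columns="year", values="conference")
print(timeline)
```

The long format is also the right input shape for joining recruiting ranks or transfer counts by team and year when examining the before/after comparisons in parts (b) and (c).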
Analyze the impact of recent conference realignment:

a) Track UCLA, USC, Texas, and Oklahoma's conference affiliations from 2015-2024
b) Compare their recruiting rankings before and after conference change announcements
c) Analyze transfer portal activity around realignment periods
d) Examine changes in strength of schedule
e) Create visualizations showing these trends

**Goal**: Understand how conference realignment affects program dynamics.

Further Reading
Official Documentation
- cfbfastR Documentation: https://cfbfastr.sportsdataverse.org/
- College Football Data API: https://collegefootballdata.com/
- API Documentation: https://api.collegefootballdata.com/api/docs/
- SportsDataverse: https://sportsdataverse.org/
Research and Analysis
- Connelly, B. (2023). "SP+ Ratings Methodology." ESPN Analytics.
- Elliott, B. "Blue Chip Ratio" - Relationship between recruiting and championships
- Massey, K. & Thaler, K. "Massey Ratings" - Computer ranking methodology
- Sagarin, J. "Sagarin Ratings" - Statistical team ratings
Data Sources
- Recruiting Services:
  - 247Sports: https://247sports.com/
  - Rivals: https://rivals.com/
  - ESPN Recruiting: https://espn.com/college-sports/football/recruiting/
  - On3: https://on3.com/
- Statistics and Data:
  - Sports Reference CFB: https://www.sports-reference.com/cfb/
  - College Football Data: https://collegefootballdata.com/
  - ESPN College Football: https://espn.com/college-football/
Community Resources
- r/CFBAnalysis on Reddit
- College Football Data Discord: Community support and discussions
- Open Source Football: https://www.opensourcefootball.com/
- SportsDataverse Discord: Developer community
Academic Research
- Frey, J. & Eitzen, D.S. (1991). "Sport and society." Annual Review of Sociology, 17(1), 503-522.
- Loh, W.Y. & Shin, H.S. (1997). "Split selection methods for classification trees." Statistica Sinica, 815-840.
- Stekler, H.O., Sendor, D., & Verlander, R. (2010). "Issues in sports forecasting." International Journal of Forecasting, 26(3), 606-621.
:::