Package Documentation

Overview

This project is organized into two main components:

  1. Data acquisition and construction (fetch_data.py)
    Pulls raw Statcast data, aggregates hitter-level summaries, merges in barrel metrics, and saves a cleaned CSV used throughout the project.

  2. Analysis utilities (analysis.py)
    Loads the combined dataset, applies consistent filtering, and provides functions for generating tables, correlations, outlier summaries, and plots used in both the report and Streamlit app.

In most cases you will not need to run fetch_data.py directly; you can work from the saved dataset using the analysis functions.

Data Acquisition and Dataset Construction (fetch_data.py)

This project includes a standalone script, fetch_data.py, which constructs the season-level hitter dataset used by the analysis module and Streamlit app. The script collects Statcast event-level data for the 2025 season, filters to home runs, aggregates hitter-level summaries, and merges in barrel statistics from a separate Baseball Savant leaderboard export. The final output is a single combined CSV used throughout the project.

What the script produces

Running fetch_data.py creates the following files:

  • data/hr_distance_leaders_2025.csv
    Player-level home run summaries computed directly from Statcast event data (home run count, average/max distance, average/max exit velocity).

  • data/combined_leaders_2025.csv
    The final analysis dataset containing the home run summaries merged with barrel metrics (barrels and barrel percentage).

Data sources

The dataset is assembled from two sources:

  1. Event-level Statcast data via pybaseball
    The script uses pybaseball.statcast() to collect pitch-by-pitch data for the 2025 season and filters to events == "home_run". This provides the raw inputs for home run distance and exit velocity metrics.

  2. Barrel leaderboard data from Baseball Savant
    Barrel totals and barrel percentage are imported from data/exit_velocity.csv, which is a leaderboard-style CSV downloaded from Baseball Savant.

Processing steps

The script follows these steps:

  1. Pull Statcast data for the full season
    • statcast(start_dt="2025-03-20", end_dt="2025-11-01")
  2. Filter to home runs only
    • hr = data[data["events"] == "home_run"]
  3. Aggregate to hitter-season level. Grouping by hitter ID (batter), the script computes:
    • hr_count: number of home runs
    • avg_hr_distance, max_hr_distance: mean and max of hit_distance_sc
    • avg_launch_speed, max_launch_speed: mean and max of launch_speed
  4. Apply minimum sample size threshold
    • Only hitters with hr_count >= 5 are kept.
  5. Round numeric columns for readability
    • Distance and exit velocity are rounded to 1 decimal place.
  6. Load and clean barrel leaderboard data
    • The leaderboard file exit_velocity.csv is loaded.
    • A player_name field is created from the "last_name, first_name" column.
    • Only player_id, player_name, barrels, and brl_percent are retained.
  7. Merge datasets
    • The home run summary table is merged with the barrel table using IDs:
      • leaders.batter matched to savant_barrels.player_id
    • The merge uses how="inner" to keep only hitters present in both sources.
  8. Finalize columns and save
    • Columns are ordered so that player_name appears first.
    • The final dataset is saved to data/combined_leaders_2025.csv.
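
The filter, aggregate, threshold, and merge steps above can be sketched on a small hand-made frame. This is a minimal illustration under stated assumptions, not the contents of fetch_data.py: the toy values are invented, the threshold is lowered to 2 so the toy rows survive (the script uses 5), and creating player_name with rename is only one plausible reading of step 6.

```python
import pandas as pd

# Toy stand-in for the frame returned by pybaseball.statcast(); values are made up.
events = pd.DataFrame({
    "batter": [1, 1, 1, 2, 2, 3],
    "events": ["home_run", "home_run", "single", "home_run", "home_run", "home_run"],
    "hit_distance_sc": [410.0, 425.0, 180.0, 390.0, 401.0, 440.0],
    "launch_speed": [105.2, 108.9, 88.0, 101.4, 103.3, 111.0],
})

# Step 2: keep home runs only.
hr = events[events["events"] == "home_run"]

# Step 3: one row per hitter with count, mean, and max summaries.
leaders = (
    hr.groupby("batter")
    .agg(
        hr_count=("events", "size"),
        avg_hr_distance=("hit_distance_sc", "mean"),
        max_hr_distance=("hit_distance_sc", "max"),
        avg_launch_speed=("launch_speed", "mean"),
        max_launch_speed=("launch_speed", "max"),
    )
    .reset_index()
)

# Step 4: minimum sample size (assumption: 2 here instead of the script's 5).
MIN_HR = 2
leaders = leaders[leaders["hr_count"] >= MIN_HR]

# Step 5: round distance and exit velocity columns to 1 decimal place.
metric_cols = ["avg_hr_distance", "max_hr_distance", "avg_launch_speed", "max_launch_speed"]
leaders = leaders.round({col: 1 for col in metric_cols})

# Step 6: toy stand-in for data/exit_velocity.csv, trimmed to the retained columns.
savant_barrels = pd.DataFrame({
    "player_id": [1, 2, 4],
    "last_name, first_name": ["Alpha, Ann", "Beta, Bob", "Delta, Dee"],
    "barrels": [30, 22, 15],
    "brl_percent": [14.2, 9.8, 7.1],
}).rename(columns={"last_name, first_name": "player_name"})
savant_barrels = savant_barrels[["player_id", "player_name", "barrels", "brl_percent"]]

# Step 7: inner merge keeps only hitters present in both sources.
combined = leaders.merge(
    savant_barrels, left_on="batter", right_on="player_id", how="inner"
).drop(columns="player_id")

# Step 8: put player_name first.
combined = combined[["player_name"] + [c for c in combined.columns if c != "player_name"]]
print(combined)
```

In this toy run, batter 3 is dropped by the sample-size threshold and player_id 4 is dropped by the inner merge, leaving two rows.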

How to run it

From the repository root:

python fetch_data.py

Analysis Module (analysis.py)

The analysis.py module contains the core analysis functionality used throughout this project. It is designed to operate on the cleaned, season-level Statcast dataset produced by fetch_data.py and saved as combined_leaders_2025.csv.

The functions in this module support:

  • loading the combined dataset
  • applying consistent filtering rules
  • generating summary and ranking tables
  • computing correlations among power metrics
  • identifying outliers using z-scores
  • producing scatter plots used in both the report and Streamlit app

All analysis is performed at the player-season level, meaning each row represents a hitter summarized across the season.


Design

The analysis module is intentionally lightweight and modular. Each function:

  • performs a single, well-defined task,
  • does not modify inputs in place,
  • returns either a pandas DataFrame or a Matplotlib figure,
  • can be reused in scripts, reports, or interactive applications.

This design allows the same code to power the static report and the interactive Streamlit app without duplication.


Loading and Preparing Data

load_combined(path=COMBINED_CSV)

Loads the combined hitter dataset from disk.

Purpose
  • Provide a single entry point for loading the cleaned Statcast leaderboard dataset.

Parameters
  • path (Path | str, optional): Path to the combined CSV file. Defaults to data/combined_leaders_2025.csv.

Returns
  • pd.DataFrame: DataFrame containing hitter-level statistics.

Example

from stat386_project.analysis import load_combined

df_raw = load_combined()
df_raw.head()
         player_name  batter  hr_count  avg_hr_distance  max_hr_distance  avg_launch_speed  max_launch_speed  barrels  brl_percent
0  Acuña Jr., Ronald  660670        21            418.5              468             109.2             115.5       37         15.7
1        Pham, Tommy  502054        10            417.9              446             106.4             110.6       20          6.6
2      Wagaman, Eric  676572         9            417.4              453             106.3             110.9       26          6.8
3     Arias, Gabriel  672356        12            415.5              440             108.4             112.3       31         11.3
4     Walker, Jordan  691023         6            415.3              434             108.0             112.0       26         10.9

prepare_data(df, min_hr=5, dropna_cols=None)

Filters and cleans the dataset used for analysis.

Purpose
  • Remove hitters with very small sample sizes.
  • Ensure all required metrics are present before analysis.

Parameters
  • df (pd.DataFrame): Input DataFrame.
  • min_hr (int, default 5): Minimum number of home runs required.
  • dropna_cols (Iterable[str] | None): Columns that must not contain missing values. If None, defaults to:
    • avg_hr_distance, max_hr_distance
    • avg_launch_speed, max_launch_speed
    • barrels, brl_percent

Returns
  • pd.DataFrame: Filtered copy of the DataFrame.

Notes
  • The function returns a copy and does not modify the input DataFrame.
  • Most downstream functions assume this filtering step has already been applied.

Example

from stat386_project.analysis import load_combined, prepare_data

df = prepare_data(load_combined(), min_hr=5)
df.shape
(240, 9)
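
The filtering behavior described above can be approximated in a few lines. The function below is a hypothetical stand-in named prepare_data_sketch, written from the documented parameters; it is not the source of prepare_data, and the toy rows are invented.

```python
import pandas as pd

# Default required columns, as listed in the parameter description above.
DEFAULT_REQUIRED = [
    "avg_hr_distance", "max_hr_distance",
    "avg_launch_speed", "max_launch_speed",
    "barrels", "brl_percent",
]

def prepare_data_sketch(df, min_hr=5, dropna_cols=None):
    """Hypothetical re-implementation: threshold on hr_count, then drop missing rows."""
    required = list(DEFAULT_REQUIRED if dropna_cols is None else dropna_cols)
    kept = df[df["hr_count"] >= min_hr].dropna(subset=required)
    # Return an explicit copy so later edits never touch the caller's frame.
    return kept.copy()

# Invented rows: B fails the home run threshold, C has a missing barrel percentage.
toy = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "hr_count": [12, 3, 8],
    "avg_hr_distance": [405.0, 398.0, 401.5],
    "max_hr_distance": [440, 420, 431],
    "avg_launch_speed": [106.1, 103.0, 104.8],
    "max_launch_speed": [112.0, 108.5, 110.2],
    "barrels": [30, 9, 21],
    "brl_percent": [12.5, 6.0, None],
})

clean = prepare_data_sketch(toy)
print(clean["player_name"].tolist())
```

Only hitter A survives both filters here, and toy itself still has all three rows afterwards.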

Summary and Ranking Tables

longest_vs_avg_distance(df, n=20)

Creates a ranking table of hitters by maximum home run distance, with average distance and exit velocity context.

Purpose
  • Compare extreme peak power with typical home run distance.

Parameters
  • df (pd.DataFrame): Cleaned dataset.
  • n (int, default 20): Number of hitters to return.

Returns
  • pd.DataFrame: Top n hitters sorted by max_hr_distance.

Example

from stat386_project.analysis import longest_vs_avg_distance

longest_vs_avg_distance(df, n=10)
          player_name  hr_count  avg_hr_distance  max_hr_distance  avg_launch_speed  max_launch_speed
60        Kurtz, Nick        37            403.6              493             105.7             114.6
30        Trout, Mike        27            407.6              485             107.1             115.4
35      Buxton, Byron        37            406.9              479             106.4             113.7
42    Carroll, Corbin        31            405.4              474             105.6             111.8
85      Greene, Riley        38            399.2              471             107.6             114.3
68     O'Hoppe, Logan        19            401.7              470             104.9             109.6
34       Judge, Aaron        55            407.1              469             108.0             117.9
31     Ohtani, Shohei        62            407.5              469             109.5             120.0
 0  Acuña Jr., Ronald        21            418.5              468             109.2             115.5
18    Schwarber, Kyle        59            409.8              468             107.8             117.2
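
A ranking table of this kind is typically a sort followed by a column selection. The sketch below uses a hypothetical helper, top_by_max_distance, with invented rows and only a subset of the columns shown above; the actual implementation may differ.

```python
import pandas as pd

def top_by_max_distance(df, n=20):
    """Hypothetical helper: top-n hitters by max_hr_distance."""
    cols = ["player_name", "hr_count", "avg_hr_distance", "max_hr_distance"]
    # sort_values returns a new frame, so the input is left untouched,
    # and the original index labels are preserved in the result.
    return df.sort_values("max_hr_distance", ascending=False)[cols].head(n)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "hr_count": [20, 35, 12, 27],
    "avg_hr_distance": [401.2, 398.7, 410.3, 405.0],
    "max_hr_distance": [455, 471, 448, 462],
})

top2 = top_by_max_distance(toy, n=2)
print(top2)
```

Because the sort keeps the original index, the left-hand labels in the output above (60, 30, 35, ...) are consistent with row positions in the cleaned dataset rather than ranks.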

barrel_power_table(df, n=20)

Ranks hitters by barrel percentage and reports related power metrics.

Purpose
  • Examine how contact quality aligns with distance and exit velocity metrics.

Parameters
  • df (pd.DataFrame): Cleaned dataset.
  • n (int, default 20): Number of hitters to return.

Returns
  • pd.DataFrame: Top n hitters sorted by brl_percent.

Example

from stat386_project.analysis import barrel_power_table

barrel_power_table(df, n=10)
         player_name  hr_count  avg_hr_distance  max_hr_distance  avg_launch_speed  barrels  brl_percent
 34     Judge, Aaron        55            407.1              469             108.0       96         24.7
 31   Ohtani, Shohei        62            407.5              469             109.5      100         23.5
 18  Schwarber, Kyle        59            409.8              468             107.8       85         20.8
155     Raleigh, Cal        67            390.6              448             105.6       80         19.5
118    Stowers, Kyle        25            395.1              440             104.6       53         19.0
 32     Alonso, Pete        39            407.3              447             107.6       89         18.9
 60      Kurtz, Nick        37            403.6              493             105.7       50         18.4
 78       Soto, Juan        43            400.3              437             106.9       81         18.1
 19      Cruz, Oneil        20            409.7              463             110.2       54         17.9
 35    Buxton, Byron        37            406.9              479             106.4       61         17.6

workload_vs_distance(df)

Returns a table relating home run workload to average distance.

Purpose
  • Explore whether hitters with more home runs tend to hit longer home runs on average.

Parameters
  • df (pd.DataFrame): Cleaned dataset.

Returns
  • pd.DataFrame with columns:
    • player_name
    • hr_count
    • avg_hr_distance

Example

from stat386_project.analysis import workload_vs_distance

workload_vs_distance(df).head()
         player_name  hr_count  avg_hr_distance
0  Acuña Jr., Ronald        21            418.5
1        Pham, Tommy        10            417.9
2      Wagaman, Eric         9            417.4
3     Arias, Gabriel        12            415.5
4     Walker, Jordan         6            415.3

Correlation and Outlier Analysis

correlation_table(df)

Computes a Pearson correlation matrix for key power-related metrics.

Purpose
  • Quantify linear relationships among distance, exit velocity, barrel metrics, and home run totals.

Included metrics
  • avg_hr_distance
  • max_hr_distance
  • avg_launch_speed
  • max_launch_speed
  • barrels
  • brl_percent
  • hr_count

Parameters
  • df (pd.DataFrame): Cleaned dataset.

Returns
  • pd.DataFrame: Correlation matrix.

Example

from stat386_project.analysis import correlation_table

correlation_table(df).round(3)
                  avg_hr_distance  max_hr_distance  avg_launch_speed  max_launch_speed  barrels  brl_percent  hr_count
avg_hr_distance             1.000            0.695             0.803             0.607    0.502        0.572     0.318
max_hr_distance             0.695            1.000             0.690             0.697    0.648        0.703     0.582
avg_launch_speed            0.803            0.690             1.000             0.852    0.673        0.760     0.472
max_launch_speed            0.607            0.697             0.852             1.000    0.726        0.782     0.597
barrels                     0.502            0.648             0.673             0.726    1.000        0.889     0.889
brl_percent                 0.572            0.703             0.760             0.782    0.889        1.000     0.772
hr_count                    0.318            0.582             0.472             0.597    0.889        0.772     1.000
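
A Pearson matrix like the one above can be produced with pandas' DataFrame.corr. The sketch below assumes a hypothetical helper, correlation_sketch, and uses three invented columns rather than the seven metrics listed; barrels is constructed as an exact linear function of hr_count so that the expected correlation for that pair is known.

```python
import pandas as pd

def correlation_sketch(df, metrics):
    """Hypothetical helper: Pearson correlation matrix over the chosen columns."""
    return df[list(metrics)].corr(method="pearson")

# Invented data: barrels = 2 * hr_count + 5, so that pair correlates perfectly.
toy = pd.DataFrame({
    "avg_hr_distance": [402.0, 399.5, 407.1, 404.3],
    "barrels": [25, 45, 65, 85],
    "hr_count": [10, 20, 30, 40],
})

corr = correlation_sketch(toy, ["avg_hr_distance", "barrels", "hr_count"])
print(corr.round(3))
```

The result is symmetric with ones on the diagonal, which is a quick sanity check for any table produced this way.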

find_outliers(df, columns=('avg_hr_distance','max_hr_distance','avg_launch_speed'), z_thresh=2.5)

Identifies hitters with extreme values on selected metrics using z-scores.

Purpose
  • Highlight standout performance profiles rather than treat them as errors.

Parameters
  • df (pd.DataFrame): Cleaned dataset.
  • columns (Iterable[str]): Metrics used for z-score computation.
  • z_thresh (float, default 2.5): Threshold for flagging outliers.

Returns
  • pd.DataFrame: Subset of hitters flagged as outliers.

Notes
  • Z-scores are computed using population standard deviation (ddof=0).
  • A hitter is flagged if any selected metric exceeds the threshold in absolute value.

Example

from stat386_project.analysis import find_outliers

outliers = find_outliers(df)
outliers[["player_name", "avg_hr_distance", "avg_hr_distance_z"]].head()
         player_name  avg_hr_distance  avg_hr_distance_z
 19      Cruz, Oneil            409.7           1.406094
 30      Trout, Mike            407.6           1.204904
 60      Kurtz, Nick            403.6           0.821685
232  McKinstry, Zach            375.5          -1.870428
237     Arraez, Luis            369.7          -2.426095
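
The flagging rule in the notes (population z-scores, flag if any metric exceeds the threshold) can be sketched as follows. The helper find_outliers_sketch is hypothetical and the data are invented; only the ddof=0 convention, the "any column" rule, and the _z column suffix come from the documentation above.

```python
import pandas as pd

def find_outliers_sketch(df, columns, z_thresh=2.5):
    """Hypothetical helper: flag rows where any |z| exceeds z_thresh."""
    out = df.copy()  # work on a copy so the input is not modified
    z_cols = []
    for col in columns:
        # Population z-score: divide by the standard deviation with ddof=0.
        out[f"{col}_z"] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
        z_cols.append(f"{col}_z")
    # A row is flagged if ANY selected metric is extreme in absolute value.
    mask = (out[z_cols].abs() > z_thresh).any(axis=1)
    return out[mask]

# Invented data: nine hitters at 400 ft and one at 460 ft; exit velocity alternates.
toy = pd.DataFrame({
    "avg_hr_distance": [400.0] * 9 + [460.0],
    "avg_launch_speed": [105.0, 106.0] * 5,
})

flagged = find_outliers_sketch(toy, ["avg_hr_distance", "avg_launch_speed"])
print(flagged)
```

Only the last row is flagged: its distance z-score is 3.0, while its exit velocity z-score is 1.0. This is also why rows in the example output above can show avg_hr_distance_z values below 2.5 in absolute value; under the "any metric" rule, a hitter can be flagged on a different column.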

Plotting Functions

All plotting functions return Matplotlib figure objects.


plot_max_vs_avg_distance(df)

Scatter plot of average home run distance versus maximum home run distance.

Purpose
  • Visualize the relationship between peak and typical power.

Returns
  • Matplotlib figure

Example

from stat386_project.analysis import plot_max_vs_avg_distance

fig = plot_max_vs_avg_distance(df)
fig


plot_launch_speed_vs_distance(df)

Scatter plot of average exit velocity versus average home run distance.

Purpose
  • Examine whether harder contact corresponds to longer average home runs.

Returns
  • Matplotlib figure


plot_barrel_percent_vs_distance(df)

Scatter plot of barrel percentage versus average home run distance.

Purpose
  • Explore the relationship between contact quality and average power.

Returns
  • Matplotlib figure


plot_hr_count_vs_distance(df)

Scatter plot of home run count versus average home run distance.

Purpose
  • Examine how workload relates to typical power outcomes.

Returns
  • Matplotlib figure
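
The four plotting functions above share one pattern: draw a scatter of two columns and return the figure rather than showing it. A minimal sketch of that pattern follows; scatter_sketch is a hypothetical helper, the data are invented, and the styling choices (figure size, alpha, the Agg backend) are assumptions rather than the module's actual settings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def scatter_sketch(df, x, y, title):
    """Hypothetical helper: scatter of df[x] against df[y], returned as a Figure."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[x], df[y], alpha=0.7)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(title)
    fig.tight_layout()
    return fig  # returning the figure lets a report or Streamlit app render it

# Invented values for illustration only.
toy = pd.DataFrame({
    "avg_hr_distance": [401.2, 398.7, 410.3, 405.0],
    "max_hr_distance": [455, 471, 448, 462],
})

fig = scatter_sketch(toy, "avg_hr_distance", "max_hr_distance", "Max vs. average HR distance")
```

Returning the figure keeps display decisions with the caller: a notebook can show it inline, a script can call fig.savefig, and a Streamlit app can pass it to st.pyplot.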