Package Documentation

Overview

This project is organized into two main components:

Data acquisition and construction (fetch_data.py)
Pulls raw Statcast data, aggregates hitter-level summaries, merges in barrel metrics, and saves a cleaned CSV used throughout the project.
Analysis utilities (analysis.py)
Loads the combined dataset, applies consistent filtering, and provides functions for generating tables, correlations, outlier summaries, and plots used in both the report and Streamlit app.

In most cases you will not need to run fetch_data.py directly; you can work from the saved dataset using the analysis functions.

Data Acquisition and Dataset Construction (`fetch_data.py`)

This project includes a standalone script, fetch_data.py, which constructs the season-level hitter dataset used by the analysis module and Streamlit app. The script collects Statcast event-level data for the 2025 season, filters to home runs, aggregates hitter-level summaries, and merges in barrel statistics from a separate Baseball Savant leaderboard export. The final output is a single combined CSV used throughout the project.

What the script produces

Running fetch_data.py creates the following files:

data/hr_distance_leaders_2025.csv
Player-level home run summaries computed directly from Statcast event data (home run count, average/max distance, average/max exit velocity).
data/combined_leaders_2025.csv
The final analysis dataset containing the home run summaries merged with barrel metrics (barrels and barrel percentage).

Data sources

The dataset is assembled from two sources:

Event-level Statcast data via pybaseball
The script uses pybaseball.statcast() to collect pitch-by-pitch data for the 2025 season and filters to events == "home_run". This provides the raw inputs for home run distance and exit velocity metrics.
Barrel leaderboard data from Baseball Savant
Barrel totals and barrel percentage are imported from data/exit_velocity.csv, which is a leaderboard-style CSV downloaded from Baseball Savant.

Processing steps

The script follows these steps:

Pull Statcast data for the full season
- statcast(start_dt="2025-03-20", end_dt="2025-11-01")
Filter to home runs only
- hr = data[data["events"] == "home_run"]
Aggregate to hitter-season level
- Grouping by hitter ID (batter), the script computes:
  - hr_count: number of home runs
  - avg_hr_distance, max_hr_distance: mean and max of hit_distance_sc
  - avg_launch_speed, max_launch_speed: mean and max of launch_speed
Apply minimum sample size threshold
- Only hitters with hr_count >= 5 are kept.
Round numeric columns for readability
- Distance and exit velocity are rounded to 1 decimal place.
Load and clean barrel leaderboard data
- The leaderboard file exit_velocity.csv is loaded.
- A player_name field is created from the "last_name, first_name" column.
- Only player_id, player_name, barrels, and brl_percent are retained.
Merge datasets
- The home run summary table is merged with the barrel table using IDs:
  - leaders.batter matched to savant_barrels.player_id
- The merge uses how="inner" to keep only hitters present in both sources.
Finalize columns and save
- Columns are ordered so that player_name appears first.
- The final dataset is saved to data/combined_leaders_2025.csv.
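
The processing steps above can be sketched end to end on a small synthetic table. This is a minimal illustration under stated assumptions, not the script itself: the network call to pybaseball.statcast() and the read of data/exit_velocity.csv are replaced by hand-made DataFrames, every value is invented, and the threshold MIN_HR is lowered from 5 to 2 so the toy data survives the filter.

```python
import pandas as pd

# Synthetic stand-in for the Statcast event table (all values are made up).
events = pd.DataFrame({
    "batter": [1, 1, 1, 2, 2, 3],
    "events": ["home_run", "home_run", "single", "home_run", "home_run", "home_run"],
    "hit_distance_sc": [400.0, 420.0, 250.0, 380.0, 390.0, 410.0],
    "launch_speed": [105.0, 110.0, 90.0, 101.0, 103.0, 108.0],
})

# Synthetic stand-in for the cleaned barrel leaderboard (batter 3 is absent on purpose).
savant_barrels = pd.DataFrame({
    "player_id": [1, 2],
    "player_name": ["Alpha, Ann", "Beta, Bob"],
    "barrels": [30, 12],
    "brl_percent": [15.0, 6.5],
})

MIN_HR = 2  # assumption for the toy data; the documented threshold is 5

# Filter to home runs only.
hr = events[events["events"] == "home_run"]

# Aggregate to one row per hitter.
leaders = (
    hr.groupby("batter")
    .agg(
        hr_count=("events", "size"),
        avg_hr_distance=("hit_distance_sc", "mean"),
        max_hr_distance=("hit_distance_sc", "max"),
        avg_launch_speed=("launch_speed", "mean"),
        max_launch_speed=("launch_speed", "max"),
    )
    .reset_index()
)

# Apply the minimum sample size and round for readability.
leaders = leaders[leaders["hr_count"] >= MIN_HR].round(1)

# Inner merge keeps only hitters present in both sources.
combined = leaders.merge(
    savant_barrels, left_on="batter", right_on="player_id", how="inner"
).drop(columns="player_id")

# Order columns so that player_name appears first.
cols = ["player_name"] + [c for c in combined.columns if c != "player_name"]
combined = combined[cols]
print(combined)
```

Batter 3 is dropped by the sample-size filter here; a hitter missing from the leaderboard would likewise be dropped by the inner merge.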

How to run it

From the repository root:

python fetch_data.py

Analysis Module (`analysis.py`)

The analysis.py module contains the core analysis functionality used throughout this project. It is designed to operate on the cleaned, season-level Statcast dataset produced by fetch_data.py and saved as combined_leaders_2025.csv.

The functions in this module support:

loading the combined dataset
applying consistent filtering rules
generating summary and ranking tables
computing correlations among power metrics
identifying outliers using z-scores
producing scatter plots used in both the report and Streamlit app

All analysis is performed at the player-season level, meaning each row represents a hitter summarized across the season.

Design

The analysis module is intentionally lightweight and modular. Each function:

performs a single, well-defined task,
does not modify inputs in place,
returns either a pandas DataFrame or a Matplotlib figure,
can be reused in scripts, reports, or interactive applications.

This design allows the same code to power the static report and the interactive Streamlit app without duplication.

Loading and Preparing Data

`load_combined(path=COMBINED_CSV)`

Loads the combined hitter dataset from disk.

Purpose - Provide a single entry point for loading the cleaned Statcast leaderboard dataset.

Parameters - path (Path | str, optional): Path to the combined CSV file. Defaults to data/combined_leaders_2025.csv.

Returns - pd.DataFrame: DataFrame containing hitter-level statistics.

Example

from stat386_project.analysis import load_combined

df_raw = load_combined()
df_raw.head()

	player_name	batter	hr_count	avg_hr_distance	max_hr_distance	avg_launch_speed	max_launch_speed	barrels	brl_percent
0	Acuña Jr., Ronald	660670	21	418.5	468	109.2	115.5	37	15.7
1	Pham, Tommy	502054	10	417.9	446	106.4	110.6	20	6.6
2	Wagaman, Eric	676572	9	417.4	453	106.3	110.9	26	6.8
3	Arias, Gabriel	672356	12	415.5	440	108.4	112.3	31	11.3
4	Walker, Jordan	691023	6	415.3	434	108.0	112.0	26	10.9

`prepare_data(df, min_hr=5, dropna_cols=None)`

Filters and cleans the dataset used for analysis.

Purpose
- Remove hitters with very small sample sizes.
- Ensure all required metrics are present before analysis.

Parameters
- df (pd.DataFrame): Input DataFrame.
- min_hr (int, default 5): Minimum number of home runs required.
- dropna_cols (Iterable[str] | None): Columns that must not contain missing values. If None, defaults to:
  - avg_hr_distance, max_hr_distance
  - avg_launch_speed, max_launch_speed
  - barrels, brl_percent

Returns - pd.DataFrame: Filtered copy of the DataFrame.

Notes
- The function returns a copy and does not modify the input DataFrame.
- Most downstream functions assume this filtering step has already been applied.
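
The filtering behavior described above could be sketched as follows. This is a hypothetical re-implementation for illustration only, not the module's actual code: the name prepare_data_sketch and the toy values are assumptions, while the default column list is taken from the parameter description.

```python
import pandas as pd

# Default required columns, as listed in the parameter description.
DEFAULT_REQUIRED = [
    "avg_hr_distance", "max_hr_distance",
    "avg_launch_speed", "max_launch_speed",
    "barrels", "brl_percent",
]

def prepare_data_sketch(df, min_hr=5, dropna_cols=None):
    """Hypothetical stand-in for prepare_data (illustration only)."""
    required = DEFAULT_REQUIRED if dropna_cols is None else list(dropna_cols)
    out = df[df["hr_count"] >= min_hr]   # drop small samples
    out = out.dropna(subset=required)    # drop rows missing a required metric
    return out.copy()                    # never hand back a view of the input

# Toy data (invented): B has too few home runs, C is missing a metric.
toy = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "hr_count": [10, 3, 8],
    "avg_hr_distance": [400.0, 395.0, None],
    "max_hr_distance": [440.0, 420.0, 430.0],
    "avg_launch_speed": [105.0, 103.0, 104.0],
    "max_launch_speed": [110.0, 108.0, 109.0],
    "barrels": [30, 10, 20],
    "brl_percent": [12.0, 5.0, 9.0],
})

cleaned = prepare_data_sketch(toy)
print(cleaned["player_name"].tolist())
```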

Example

from stat386_project.analysis import load_combined, prepare_data

df = prepare_data(load_combined(), min_hr=5)
df.shape

(240, 9)

Summary and Ranking Tables

`longest_vs_avg_distance(df, n=20)`

Creates a ranking table of hitters by maximum home run distance, with average distance and exit velocity context.

Purpose - Compare extreme peak power with typical home run distance.

Parameters
- df (pd.DataFrame): Cleaned dataset.
- n (int, default 20): Number of hitters to return.

Returns - pd.DataFrame: Top n hitters sorted by max_hr_distance.

Example

from stat386_project.analysis import longest_vs_avg_distance

longest_vs_avg_distance(df, n=10)

	player_name	hr_count	avg_hr_distance	max_hr_distance	avg_launch_speed	max_launch_speed
60	Kurtz, Nick	37	403.6	493	105.7	114.6
30	Trout, Mike	27	407.6	485	107.1	115.4
35	Buxton, Byron	37	406.9	479	106.4	113.7
42	Carroll, Corbin	31	405.4	474	105.6	111.8
85	Greene, Riley	38	399.2	471	107.6	114.3
68	O'Hoppe, Logan	19	401.7	470	104.9	109.6
34	Judge, Aaron	55	407.1	469	108.0	117.9
31	Ohtani, Shohei	62	407.5	469	109.5	120.0
0	Acuña Jr., Ronald	21	418.5	468	109.2	115.5
18	Schwarber, Kyle	59	409.8	468	107.8	117.2
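
A ranking table of this kind amounts to a descending sort followed by taking the first n rows. The sketch below shows that pattern on invented data; the helper name top_by is hypothetical and is not part of analysis.py.

```python
import pandas as pd

# Invented toy data for illustration.
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "avg_hr_distance": [410.0, 400.0, 405.0, 395.0],
    "max_hr_distance": [450, 480, 465, 430],
})

def top_by(df, column, n=20):
    """Hypothetical helper: top n rows by `column`, largest first."""
    # sort_values returns a new DataFrame, so the input is left untouched.
    return df.sort_values(column, ascending=False).head(n)

top2 = top_by(toy, "max_hr_distance", n=2)
print(top2)
```

Because the index is not reset, the original row labels are preserved in the result, as they are in the output shown above.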

`barrel_power_table(df, n=20)`

Ranks hitters by barrel percentage and reports related power metrics.

Purpose - Examine how contact quality aligns with distance and exit velocity metrics.

Parameters
- df (pd.DataFrame): Cleaned dataset.
- n (int, default 20): Number of hitters to return.

Returns - pd.DataFrame: Top n hitters sorted by brl_percent.

Example

from stat386_project.analysis import barrel_power_table

barrel_power_table(df, n=10)

	player_name	hr_count	avg_hr_distance	max_hr_distance	avg_launch_speed	barrels	brl_percent
34	Judge, Aaron	55	407.1	469	108.0	96	24.7
31	Ohtani, Shohei	62	407.5	469	109.5	100	23.5
18	Schwarber, Kyle	59	409.8	468	107.8	85	20.8
155	Raleigh, Cal	67	390.6	448	105.6	80	19.5
118	Stowers, Kyle	25	395.1	440	104.6	53	19.0
32	Alonso, Pete	39	407.3	447	107.6	89	18.9
60	Kurtz, Nick	37	403.6	493	105.7	50	18.4
78	Soto, Juan	43	400.3	437	106.9	81	18.1
19	Cruz, Oneil	20	409.7	463	110.2	54	17.9
35	Buxton, Byron	37	406.9	479	106.4	61	17.6

`workload_vs_distance(df)`

Returns a table relating home run workload to average distance.

Purpose - Explore whether hitters with more home runs tend to hit longer home runs on average.

Parameters - df (pd.DataFrame): Cleaned dataset.

Returns - pd.DataFrame with columns:
- player_name
- hr_count
- avg_hr_distance

Example

from stat386_project.analysis import workload_vs_distance

workload_vs_distance(df).head()

	player_name	hr_count	avg_hr_distance
0	Acuña Jr., Ronald	21	418.5
1	Pham, Tommy	10	417.9
2	Wagaman, Eric	9	417.4
3	Arias, Gabriel	12	415.5
4	Walker, Jordan	6	415.3

Correlation and Outlier Analysis

`correlation_table(df)`

Computes a Pearson correlation matrix for key power-related metrics.

Purpose - Quantify linear relationships among distance, exit velocity, barrel metrics, and home run totals.

Included metrics
- avg_hr_distance
- max_hr_distance
- avg_launch_speed
- max_launch_speed
- barrels
- brl_percent
- hr_count

Parameters - df (pd.DataFrame): Cleaned dataset.

Returns - pd.DataFrame: Correlation matrix.

Example

from stat386_project.analysis import correlation_table

correlation_table(df).round(3)

	avg_hr_distance	max_hr_distance	avg_launch_speed	max_launch_speed	barrels	brl_percent	hr_count
avg_hr_distance	1.000	0.695	0.803	0.607	0.502	0.572	0.318
max_hr_distance	0.695	1.000	0.690	0.697	0.648	0.703	0.582
avg_launch_speed	0.803	0.690	1.000	0.852	0.673	0.760	0.472
max_launch_speed	0.607	0.697	0.852	1.000	0.726	0.782	0.597
barrels	0.502	0.648	0.673	0.726	1.000	0.889	0.889
brl_percent	0.572	0.703	0.760	0.782	0.889	1.000	0.772
hr_count	0.318	0.582	0.472	0.597	0.889	0.772	1.000
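
A Pearson matrix like the one above can be produced with pandas' DataFrame.corr. The sketch below assumes that approach (the module's actual implementation is not shown in this documentation) and uses invented toy values with only three of the seven metrics.

```python
import pandas as pd

# Invented toy data; avg_launch_speed is an exact linear function of avg_hr_distance.
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "avg_hr_distance": [395.0, 400.0, 405.0, 410.0],
    "avg_launch_speed": [102.0, 104.0, 106.0, 108.0],
    "hr_count": [30, 10, 25, 12],
})

metrics = ["avg_hr_distance", "avg_launch_speed", "hr_count"]

# Select the numeric metrics explicitly so the name column is excluded.
corr = toy[metrics].corr(method="pearson")
print(corr.round(3))
```

The result is symmetric with ones on the diagonal, which is why each off-diagonal value in the table above appears twice.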

`find_outliers(df, columns=('avg_hr_distance','max_hr_distance','avg_launch_speed'), z_thresh=2.5)`

Identifies hitters with extreme values on selected metrics using z-scores.

Purpose - Highlight standout performance profiles rather than treat them as errors.

Parameters
- df (pd.DataFrame): Cleaned dataset.
- columns (Iterable[str]): Metrics used for z-score computation.
- z_thresh (float, default 2.5): Threshold for flagging outliers.

Returns - pd.DataFrame: Subset of hitters flagged as outliers.

Notes
- Z-scores are computed using population standard deviation (ddof=0).
- A hitter is flagged if any selected metric exceeds the threshold in absolute value.
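
The flagging rule in the notes can be sketched as below. This is a hypothetical re-implementation, not the module's code: the name find_outliers_sketch and the toy values are assumptions, and the `_z` column suffix follows the avg_hr_distance_z column that appears in the example output.

```python
import pandas as pd

def find_outliers_sketch(df, columns, z_thresh=2.5):
    """Hypothetical stand-in for find_outliers (illustration only)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    flagged = pd.Series(False, index=out.index)
    for col in columns:
        # Population standard deviation (ddof=0), as stated in the notes.
        z = (out[col] - out[col].mean()) / out[col].std(ddof=0)
        out[f"{col}_z"] = z
        # Flag a row if ANY selected metric is extreme in either direction.
        flagged = flagged | (z.abs() > z_thresh)
    return out[flagged]

# Invented toy data: nine identical hitters and one extreme value.
toy = pd.DataFrame({
    "player_name": [f"P{i}" for i in range(10)],
    "avg_hr_distance": [400.0] * 9 + [460.0],
})

outliers = find_outliers_sketch(toy, columns=["avg_hr_distance"])
print(outliers)
```

Under this rule a returned row only needs one metric beyond the threshold, so the z-score of any single column in the output can sit well inside ±2.5, as in the example below.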

Example

from stat386_project.analysis import find_outliers

outliers = find_outliers(df)
outliers[["player_name", "avg_hr_distance", "avg_hr_distance_z"]].head()

	player_name	avg_hr_distance	avg_hr_distance_z
19	Cruz, Oneil	409.7	1.406094
30	Trout, Mike	407.6	1.204904
60	Kurtz, Nick	403.6	0.821685
232	McKinstry, Zach	375.5	-1.870428
237	Arraez, Luis	369.7	-2.426095

Plotting Functions

All plotting functions return Matplotlib figure objects.
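
Returning a figure rather than drawing to the screen lets the caller decide what to do with it (display it in a notebook, save it, or hand it to Streamlit). A minimal sketch of that pattern follows; scatter_sketch and the toy data are hypothetical and are not the module's plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
import pandas as pd

def scatter_sketch(df, x, y):
    """Hypothetical helper: build a scatter plot and return the Figure."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[x], df[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return fig  # the caller chooses how to render or save it

# Invented toy data for illustration.
toy = pd.DataFrame({
    "avg_hr_distance": [395.0, 400.0, 405.0, 410.0],
    "max_hr_distance": [430, 480, 465, 450],
})

fig = scatter_sketch(toy, "avg_hr_distance", "max_hr_distance")
print(type(fig).__name__)
```

A figure returned this way can be written to disk with fig.savefig(...) or passed to Streamlit's st.pyplot(fig).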

`plot_max_vs_avg_distance(df)`

Scatter plot of average home run distance versus maximum home run distance.

Purpose - Visualize the relationship between peak and typical power.

Returns - Matplotlib figure object.

Example

from stat386_project.analysis import plot_max_vs_avg_distance

fig = plot_max_vs_avg_distance(df)
fig

`plot_launch_speed_vs_distance(df)`

Scatter plot of average exit velocity versus average home run distance.

Purpose - Examine whether harder contact corresponds to longer average home runs.

Returns - Matplotlib figure object.

`plot_barrel_percent_vs_distance(df)`

Scatter plot of barrel percentage versus average home run distance.

Purpose - Explore the relationship between contact quality and average power.

Returns - Matplotlib figure object.

`plot_hr_count_vs_distance(df)`

Scatter plot of home run count versus average home run distance.

Purpose - Examine how workload relates to typical power outcomes.

Returns - Matplotlib figure object.
