January 1st, 2025 - Scraping Fantasy Data

This notebook scrapes advanced stats for offensive fantasy football positions from FantasyPros. Additionally, I calculate Average Depth of Target, or ADOT, for WRs and TEs; I'll explain this statistic at the bottom of the notebook. First, let's build a web scraper using BeautifulSoup, which is a standard approach in Python.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import concurrent.futures
import logging
import os
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
os.environ['NUMEXPR_MAX_THREADS'] = '4'
os.environ['NUMEXPR_NUM_THREADS'] = '2'

I decided to use logging in this example to help me track any errors that might arise while building. This is something I'm usually not great at, but I'm trying to get better. Below you can see that I use a series of try/except blocks. These helped me build out this code and track down the exact line that was failing. I highly recommend it.
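To make the pattern concrete before the real scraper below, here is a minimal sketch (parse_stat is a made-up helper for illustration, not part of the scraper): a failure inside the try block gets logged with enough context to pinpoint it, and the function degrades gracefully instead of crashing.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def parse_stat(raw):
    # Toy illustration of the try/except + logging pattern used in this notebook.
    # The except block names the exact input and error, which makes it much
    # easier to find the line that is failing.
    try:
        return float(raw.replace(',', '').rstrip('%'))
    except (ValueError, AttributeError) as e:
        logging.error(f"Could not parse {raw!r}: {e}")
        return None
```

Calling `parse_stat("1,234")` returns 1234.0, while `parse_stat(None)` logs an error and returns None rather than raising.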

First thing, which isn't best practice, is hardcoding the FantasyPros URL into my scrape_fantasypros() function. Its arguments are position and season. I could just run this function to get a single season and position's dataframe, but I use it to scrape multiple seasons and positions in parallel. In a few cells I'll explain how I extract multiple years and positions into a list of dataframes.

I make sure to create two columns for the season and player position. This will help later when breaking up into separate analyses. I also remove the first row and column of the scraped data. This data is not useful to the analysis.

def scrape_fantasypros(position, season):
    url = f"https://www.fantasypros.com/nfl/advanced-stats-{position}.php?year={season}"
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find('table', {'id': 'data'})

        headers = [th.text for th in table.find_all('th')]
        rows = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')[1:]]

        df = pd.DataFrame(rows, columns=headers)
        df['Season'] = season
        df['Position'] = position.upper()
        df['Player'] = df['Player'].astype(str)
        df = df.iloc[1:,1:]
        
        # QBs need rushing stats merged in from the standard stats page
        if position == 'qb':
            df = get_qb_rushing(season, position, df)
        return df
            
    except requests.RequestException as e:
        logging.error(f"Error scraping {position.upper()} data for {season}: {str(e)}")
        return None
    except AttributeError as e:
        logging.error(f"Error parsing {position.upper()} data for {season}: {str(e)}")
        return None

I've chosen to pull the data from the advanced stats positional pages on FantasyPros. I find these pages rich with the kind of information you may hear referenced by people who play a lot of fantasy football. One of the issues I found with pulling the advanced stats data occurs on the QB page. If you navigate to the Advanced QB Stats page, there is no rushing information available; rushing information lives on the Standard QB Stats page instead. To work around this, I wrote the function below, get_qb_rushing(). It extracts the important statistics from the standard table, like rush attempts, rush yards, and rushing touchdowns.

Why put so much effort into getting the QB rushing information? As you will see in future analyses, a quarterback with mobility can be more valuable to your team than one without. We will explore this in a future post.

def get_qb_rushing(season, position, dataframe):
    url = f"https://www.fantasypros.com/nfl/stats/qb.php?year={season}"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'id': 'data'})

    headers = [th.text for th in table.find_all('th')]
    rows = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')[1:]]

    rushing_list_df = pd.DataFrame(rows, columns=headers)
    rushing_list_df['Season'] = season
    rushing_list_df['Position'] = position.upper()

    # Keep Player, the rushing ATT/YDS/TD columns, and Season (selected by position)
    col_nums = [1, 10, 11, 12, 18]
    out_df = rushing_list_df.iloc[1:, col_nums]
    out_df = out_df.rename(columns={'ATT': 'Rush_Att', 'YDS': 'Rush_Yds', 'TD': 'Rush_TDs'})

    merged_df = dataframe.merge(out_df, how='inner', on=['Player', 'Season'])
    return merged_df

As I alluded to above, what happens if I want to extract multiple positions and multiple seasons' worth of data? Well, I could manually input each year and position into the scrape_fantasypros() function, then extract the data into a CSV for analysis. That sounds time consuming, and I'm kind of a lazy person. Let's write some code to speed up the process! The next cells demonstrate how I use the standard library's concurrent.futures module to parallelize the web scraping and get data faster. First, I create a wrapper function called scrape_worker() that takes a position and season. I use it inside a function called extract_data() to get data for multiple seasons and multiple positions. The inputs are lists of the positions and seasons needed; the output is a list of dataframes, one per combination!

def scrape_worker(args):
    position, season = args
    return scrape_fantasypros(position, season)
def extract_data(positions, seasons):
    scrape_args = [(position, season) for position in positions for season in seasons]

    all_data = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        future_to_args = {executor.submit(scrape_worker, arg): arg for arg in scrape_args}
        for future in concurrent.futures.as_completed(future_to_args):
            args = future_to_args[future]
            try:
                df = future.result()
                if df is not None:
                    all_data.append(df)
                    logging.info(f"Scraped {args[0].upper()} data for {args[1]}")
            except Exception as e:
                logging.error(f"Error processing {args[0].upper()} data for {args[1]}: {str(e)}")
                
    return all_data
positions=['qb','rb','wr','te']
seasons = [2023]
all_data = extract_data(positions, seasons)

Boom baby! We have data for QB, RB, WR, and TE from 2023. Let's take a look at the data to see its quality.

all_data[3].dtypes
Player       object
G            object
REC          object
YDS          object
Y/R          object
YBC          object
YBC/R        object
AIR          object
AIR/R        object
YAC          object
YAC/R        object
YACON        object
YACON/R      object
BRKTKL       object
TGT          object
% TM         object
CATCHABLE    object
DROP         object
RZ TGT       object
10+ YDS      object
20+ YDS      object
30+ YDS      object
40+ YDS      object
50+ YDS      object
LNG          object
Season        int64
Position     object
dtype: object

Looks like we need to do a bit of cleanup here. When the data was scraped into a pandas dataframe, it took on object data types instead of numerical data types. One reason for this is the special characters, like % in PCT and , in YDS and AIR. These need to be stripped before we can convert to numerical columns. This is a critical step for future analysis because we will want to calculate our own metrics from the scraped stats. I clean up this data with a function called clean_dataframes(). This function removes the special characters identified above. It also helps us identify some issues in the RB dataframe.

def clean_dataframes(dataframe_list):
    cleaned_dataframes = []
    for df in dataframe_list:
        # Ensure unique column names by appending .1, .2, ... to repeats.
        # (pd.io.parsers.ParserBase._maybe_dedup_names is a private API that
        # was removed in newer pandas versions, so dedupe manually.)
        seen = {}
        new_cols = []
        for col in df.columns:
            if col in seen:
                seen[col] += 1
                new_cols.append(f"{col}.{seen[col]}")
            else:
                seen[col] = 0
                new_cols.append(col)
        df.columns = new_cols
        
        if 'Position' in df.columns and len(df) > 0:
            position = df['Position'].iloc[0]
            
            if position == 'QB':
                if 'PCT' in df.columns:
                    df['PCT'] = pd.to_numeric(df['PCT'].astype(str).str.rstrip('%'), errors='coerce') / 100.0
                if 'YDS' in df.columns:
                    df['YDS'] = pd.to_numeric(df['YDS'].astype(str).str.replace(',', ''), errors='coerce')
                if 'AIR' in df.columns:
                    df['AIR'] = pd.to_numeric(df['AIR'].astype(str).str.replace(',', ''), errors='coerce')
            
            elif position in ['RB', 'WR', 'TE']:
                if 'YDS' in df.columns:
                    df['YDS'] = pd.to_numeric(df['YDS'].astype(str).str.replace(',', ''), errors='coerce')
                if 'AIR' in df.columns:
                    df['AIR'] = pd.to_numeric(df['AIR'].astype(str).str.replace(',', ''), errors='coerce')
                if '% TM' in df.columns:
                    df['% TM'] = pd.to_numeric(df['% TM'].astype(str).str.rstrip('%'), errors='coerce') / 100.0
            
            # Convert remaining numeric columns
            numeric_cols = [col for col in df.columns if col not in ['Player', 'Position']]
            df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
        
        # Reset index to avoid potential issues with duplicate indices
        df = df.reset_index(drop=True)
        cleaned_dataframes.append(df)
    
    return cleaned_dataframes
cleaned_full_dataframe_list = clean_dataframes(all_data)
cleaned_full_dataframe_list[1]
Player G COMP ATT PCT YDS Y/A AIR AIR/A 10+ YDS ... BLITZ POOR DROP RZ ATT RTG Season Position Rush_Att Rush_Yds Rush_TDs
0 Josh Allen (BUF) 17 385 579 0.66 4306 7.4 2533 4.4 164 ... 182 78 31 68 92 2023 QB 111 524 15
1 Jalen Hurts (PHI) 17 352 538 0.65 3858 7.2 2374 4.4 149 ... 185 72 19 50 88 2023 QB 157 605 15
2 Dak Prescott (DAL) 17 410 590 0.69 4516 7.7 2768 4.7 179 ... 163 69 38 104 104 2023 QB 55 242 2
3 Lamar Jackson (BAL) 16 307 457 0.67 3678 8.0 2140 4.7 150 ... 169 73 22 61 102 2023 QB 148 821 5
4 Jordan Love (GB) 17 372 579 0.64 4159 7.2 2534 4.4 162 ... 215 98 29 93 98 2023 QB 50 247 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82 Kyle Trask (TB) 2 0 1 0.00 0 0.0 0 0.0 0 ... 0 0 1 1 20 2023 QB 1 -1 0
83 Teddy Bridgewater (DET) 1 0 0 0.00 0 0.0 0 0.0 0 ... 0 0 0 0 0 2023 QB 2 -2 0
84 Matt Barkley (FA) 1 0 0 0.00 0 0.0 0 0.0 0 ... 0 0 0 0 0 2023 QB 3 -3 0
85 Nathan Peterman (ATL) 2 0 0 0.00 0 0.0 0 0.0 0 ... 1 0 0 0 0 2023 QB 2 -4 0
86 Kyle Allen (PIT) 7 0 0 0.00 0 0.0 0 0.0 0 ... 0 0 0 0 0 2023 QB 13 -13 0

87 rows × 28 columns

cleaned_full_dataframe_list[1].dtypes
Player       object
G             int64
COMP          int64
ATT           int64
PCT         float64
YDS           int64
Y/A         float64
AIR           int64
AIR/A       float64
10+ YDS       int64
20+ YDS       int64
30+ YDS       int64
40+ YDS       int64
50+ YDS       int64
PKT TIME    float64
SACK          int64
KNCK          int64
HRRY          int64
BLITZ         int64
POOR          int64
DROP          int64
RZ ATT        int64
RTG           int64
Season        int64
Position     object
Rush_Att      int64
Rush_Yds      int64
Rush_TDs      int64
dtype: object

Ahh. So fresh and clean! Looks like we resolved the issues with the QB dataframe. I did notice something weird about the RB dataframe.

cleaned_full_dataframe_list[0]
Player G ATT YDS Y/ATT YBCON YBCON/ATT YACON YACON/ATT BRKTKL ... 30+ YDS 40+ YDS 50+ YDS LNG REC TGT RZ TGT YACON.1 Season Position
0 Christian McCaffrey (SF) 16 272 1459 5.4 886 3.3 573 2.1 13 ... 4 3 3 72 67 83 16 73 2023 RB
1 Raheem Mostert (MIA) 15 209 1012 4.8 620 3.0 392 1.9 15 ... 3 2 0 49 25 32 6 59 2023 RB
2 Travis Etienne Jr. (JAC) 17 267 1008 3.8 583 2.2 425 1.6 31 ... 3 1 1 62 58 73 2 145 2023 RB
3 Kyren Williams (LAR) 12 228 1144 5.0 791 3.5 353 1.5 20 ... 2 1 1 56 32 48 12 46 2023 RB
4 Derrick Henry (BAL) 17 280 1167 4.2 597 2.1 570 2.0 23 ... 2 2 2 69 28 36 2 46 2023 RB
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
134 Evan Hull (IND) 1 1 1 1.0 0 0.0 0 0.0 0 ... 0 0 0 1 1 1 0 0 2023 RB
135 Adam Prentice (NO) 13 2 12 6.0 10 5.0 2 1.0 0 ... 0 0 0 7 2 3 0 0 2023 RB
136 James Robinson (FA) 1 1 2 2.0 0 0.0 2 2.0 0 ... 0 0 0 2 1 1 0 0 2023 RB
137 Jonathan Williams (FA) 1 1 -2 -2.0 0 0.0 0 0.0 0 ... 0 0 0 0 0 1 0 0 2023 RB
138 Deon Jackson (FA) 2 14 16 1.1 4 0.3 12 0.9 0 ... 0 0 0 7 5 6 0 5 2023 RB

139 rows × 25 columns

Notice that there are two YACON (Yards After Contact) columns. This gave me some trouble while I was writing the clean_dataframes() function. The column-deduplication step at the top of that function appends a numeric suffix to repeated column names so they can be told apart. In this case we are looking at RB data, and the second YACON actually refers to YACON on receiving plays. I'll clean this up by renaming YACON.1 to REC_YACON.

cleaned_full_dataframe_list[0] = cleaned_full_dataframe_list[0].rename(columns={'YACON.1':'REC_YACON'})
cleaned_full_dataframe_list[0]
Player G ATT YDS Y/ATT YBCON YBCON/ATT YACON YACON/ATT BRKTKL ... 30+ YDS 40+ YDS 50+ YDS LNG REC TGT RZ TGT REC_YACON Season Position
0 Christian McCaffrey (SF) 16 272 1459 5.4 886 3.3 573 2.1 13 ... 4 3 3 72 67 83 16 73 2023 RB
1 Raheem Mostert (MIA) 15 209 1012 4.8 620 3.0 392 1.9 15 ... 3 2 0 49 25 32 6 59 2023 RB
2 Travis Etienne Jr. (JAC) 17 267 1008 3.8 583 2.2 425 1.6 31 ... 3 1 1 62 58 73 2 145 2023 RB
3 Kyren Williams (LAR) 12 228 1144 5.0 791 3.5 353 1.5 20 ... 2 1 1 56 32 48 12 46 2023 RB
4 Derrick Henry (BAL) 17 280 1167 4.2 597 2.1 570 2.0 23 ... 2 2 2 69 28 36 2 46 2023 RB
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
134 Evan Hull (IND) 1 1 1 1.0 0 0.0 0 0.0 0 ... 0 0 0 1 1 1 0 0 2023 RB
135 Adam Prentice (NO) 13 2 12 6.0 10 5.0 2 1.0 0 ... 0 0 0 7 2 3 0 0 2023 RB
136 James Robinson (FA) 1 1 2 2.0 0 0.0 2 2.0 0 ... 0 0 0 2 1 1 0 0 2023 RB
137 Jonathan Williams (FA) 1 1 -2 -2.0 0 0.0 0 0.0 0 ... 0 0 0 0 0 1 0 0 2023 RB
138 Deon Jackson (FA) 2 14 16 1.1 4 0.3 12 0.9 0 ... 0 0 0 7 5 6 0 5 2023 RB

139 rows × 25 columns

Much better! Now let’s check WR and TE.

cleaned_full_dataframe_list[2]
Player G REC YDS Y/R YBC YBC/R AIR AIR/R YAC ... DROP RZ TGT 10+ YDS 20+ YDS 30+ YDS 40+ YDS 50+ YDS LNG Season Position
0 Sam LaPorta (DET) 17 86 889 10.3 531 6.2 851 9.9 358 ... 5 15 35 8 5 2 0 48 2023 TE
1 George Kittle (SF) 16 65 1020 15.7 537 8.3 852 13.1 483 ... 4 12 40 18 8 3 2 66 2023 TE
2 Travis Kelce (KC) 15 93 984 10.6 515 5.5 808 8.7 469 ... 7 19 39 12 2 2 1 53 2023 TE
3 T.J. Hockenson (MIN) 15 95 960 10.1 624 6.6 976 10.3 336 ... 4 10 40 13 0 0 0 29 2023 TE
4 David Njoku (CLE) 16 81 882 10.9 283 3.5 551 6.8 599 ... 11 17 29 12 7 2 0 43 2023 TE
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94 Kenny Yeboah (NYJ) 5 2 28 14.0 8 4.0 19 9.5 20 ... 0 0 1 1 0 0 0 20 2023 TE
95 Greg Dulcich (NYG) 2 3 25 8.3 20 6.7 23 7.7 5 ... 0 0 1 0 0 0 0 13 2023 TE
96 Nate Adkins (DEN) 10 4 22 5.5 -2 -0.5 -5 -1.3 24 ... 1 0 1 0 0 0 0 11 2023 TE
97 Chris Manhertz (NYG) 16 2 16 8.0 6 3.0 12 6.0 10 ... 0 1 1 0 0 0 0 10 2023 TE
98 Eric Saubert (SF) 8 3 12 4.0 6 2.0 6 2.0 6 ... 0 0 0 0 0 0 0 5 2023 TE

99 rows × 27 columns

cleaned_full_dataframe_list[3]
Player G REC YDS Y/R YBC YBC/R AIR AIR/R YAC ... DROP RZ TGT 10+ YDS 20+ YDS 30+ YDS 40+ YDS 50+ YDS LNG Season Position
0 CeeDee Lamb (DAL) 17 135 1749 13.0 NaN 7.9 1726 12.8 676 ... 6 31 73 29 8 3 1 92 2023 WR
1 Tyreek Hill (MIA) 16 119 1799 15.1 NaN 9.6 1847 15.5 653 ... 12 24 64 29 14 9 5 78 2023 WR
2 Amon-Ra St. Brown (DET) 16 119 1515 12.7 847.0 7.1 1297 10.9 668 ... 8 23 60 24 6 3 1 70 2023 WR
3 Mike Evans (TB) 17 79 1255 15.9 933.0 11.8 1899 24.0 322 ... 7 14 46 20 10 6 3 75 2023 WR
4 Puka Nacua (LAR) 17 105 1486 14.2 854.0 8.1 1453 13.8 632 ... 13 16 59 25 10 3 2 80 2023 WR
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
185 Miles Boykin (SEA) 16 3 17 5.7 15.0 5.0 19 6.3 2 ... 1 0 0 0 0 0 0 6 2023 WR
186 Marvin Jones Jr. (FA) 6 5 35 7.0 26.0 5.2 80 16.0 9 ... 2 2 1 0 0 0 0 16 2023 WR
187 Malik Taylor (NYJ) 3 2 13 6.5 6.0 3.0 89 44.5 7 ... 0 0 0 0 0 0 0 7 2023 WR
188 Austin Trammell (JAC) 15 4 29 7.3 -6.0 -1.5 25 6.3 35 ... 0 0 2 0 0 0 0 14 2023 WR
189 James Proche II (CLE) 10 0 0 0.0 0.0 0.0 86 0.0 0 ... 0 0 0 0 0 0 0 0 2023 WR

190 rows × 27 columns

Perfect! Now we’re ready to do some analysis in the next post!
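As promised at the top, a quick word on ADOT. Average Depth of Target is air yards divided by targets: how far downfield a player is typically targeted when the ball is thrown their way. A minimal sketch with made-up numbers, under the assumption that the AIR column holds total air yards across all targets and TGT holds targets (if AIR only counts completed passes, this would understate true ADOT):

```python
import pandas as pd

# Toy numbers, not real player stats
wr = pd.DataFrame({
    'Player': ['Receiver A', 'Receiver B'],
    'AIR': [1500, 800],   # assumed: total air yards
    'TGT': [150, 100],    # targets
})

# ADOT: air yards per target
wr['ADOT'] = wr['AIR'] / wr['TGT']
print(wr[['Player', 'ADOT']])
```

Here Receiver A comes out to an ADOT of 10.0 and Receiver B to 8.0: with equal volume, Receiver A is the deeper threat.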