January 1st, 2025 - Scraping Fantasy Data

This notebook scrapes advanced stats for offensive fantasy football positions from FantasyPros. Additionally, I calculate Average Depth of Target, or ADOT, for WRs and TEs; I'll explain this statistic at the bottom of the notebook. First, let's build a web scraper using BeautifulSoup, which is a standard approach in Python.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import concurrent.futures
import logging
import os
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
os.environ['NUMEXPR_MAX_THREADS'] = '4'
os.environ['NUMEXPR_NUM_THREADS'] = '2'

I decided to use logging in this example to help me track any errors that might arise while building. This is something I'm usually not great at, but I'm trying to get better. Below you can see that I use a series of try/except blocks. These helped me build out this code and track down the exact line that was failing. I highly recommend it.
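To make the pattern concrete before the real scraper below, here is a minimal sketch (parse_stat is a made-up helper for illustration, not part of the scraper): a failure inside the try block gets logged with enough context to pinpoint it, and the function degrades gracefully instead of crashing.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def parse_stat(raw):
    # Toy illustration of the try/except + logging pattern used in this notebook.
    # The except block names the exact input and error, which makes it much
    # easier to find the line that is failing.
    try:
        return float(raw.replace(',', '').rstrip('%'))
    except (ValueError, AttributeError) as e:
        logging.error(f"Could not parse {raw!r}: {e}")
        return None
```

Calling `parse_stat("1,234")` returns 1234.0, while `parse_stat(None)` logs an error and returns None rather than raising.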

First thing, which isn't best practice, is hardcoding the FantasyPros URL into my scrape_fantasypros() function. Its arguments are position and season. I could just run this function to get a single season and position's dataframe, but I use it to scrape multiple seasons and positions in parallel. In a few cells I'll explain how I extract multiple years and positions into a list of dataframes.

I make sure to create two columns for the season and player position. This will help later when breaking up into separate analyses. I also remove the first row and column of the scraped data. This data is not useful to the analysis.

def scrape_fantasypros(position, season):
    url = f"https://www.fantasypros.com/nfl/advanced-stats-{position}.php?year={season}"
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find('table', {'id': 'data'})

        headers = [th.text for th in table.find_all('th')]
        rows = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')[1:]]

        df = pd.DataFrame(rows, columns=headers)
        df['Season'] = season
        df['Position'] = position.upper()
        df['Player'] = df['Player'].astype(str)
        df = df.iloc[1:,1:]
        
        # QBs need rushing stats merged in from the standard stats page
        if position == 'qb':
            df = get_qb_rushing(season, position, df)
        return df
            
    except requests.RequestException as e:
        logging.error(f"Error scraping {position.upper()} data for {season}: {str(e)}")
        return None
    except AttributeError as e:
        logging.error(f"Error parsing {position.upper()} data for {season}: {str(e)}")
        return None

I've chosen to pull the data from the advanced stats positional pages on FantasyPros. I find these pages rich with the kind of information you may hear referenced by people who play a lot of fantasy football. One of the issues I found with pulling the advanced stats data occurs on the QB page. If you navigate to the Advanced QB Stats page, there is no rushing information available; rushing information lives on the Standard QB Stats page instead. To work around this, I wrote the function below, get_qb_rushing(). It extracts the important statistics from the standard table, like rush attempts, rush yards, and rushing touchdowns.

Why put so much effort into getting the QB rushing information? As you will see in future analyses, a quarterback with mobility can be more valuable to your team than one without. We will explore this in a future post.

def get_qb_rushing(season, position, dataframe):
    url = f"https://www.fantasypros.com/nfl/stats/qb.php?year={season}"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find('table', {'id': 'data'})

    headers = [th.text for th in table.find_all('th')]
    rows = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')[1:]]

    rushing_list_df = pd.DataFrame(rows, columns=headers)
    rushing_list_df['Season'] = season
    rushing_list_df['Position'] = position.upper()

    # Keep Player, the rushing ATT/YDS/TD columns, and Season (selected by position)
    col_nums = [1, 10, 11, 12, 18]
    out_df = rushing_list_df.iloc[1:, col_nums]
    out_df = out_df.rename(columns={'ATT': 'Rush_Att', 'YDS': 'Rush_Yds', 'TD': 'Rush_TDs'})

    merged_df = dataframe.merge(out_df, how='inner', on=['Player', 'Season'])
    return merged_df

As I alluded to above, what happens if I want to extract multiple positions and multiple seasons' worth of data? Well, I could manually input each year and position into the scrape_fantasypros() function, then extract the data into a CSV for analysis. That sounds time consuming, and I'm kind of a lazy person. Let's write some code to speed up the process! The next cells demonstrate how I use the standard library's concurrent.futures module to parallelize the web scraping and get data faster. First, I create a wrapper function called scrape_worker() that takes a position and season. I use it inside a function called extract_data() to get data for multiple seasons and multiple positions. The inputs are lists of the positions and seasons needed; the output is a list of dataframes, one per combination!

def scrape_worker(args):
    position, season = args
    return scrape_fantasypros(position, season)
def extract_data(positions, seasons):
    scrape_args = [(position, season) for position in positions for season in seasons]

    all_data = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        future_to_args = {executor.submit(scrape_worker, arg): arg for arg in scrape_args}
        for future in concurrent.futures.as_completed(future_to_args):
            args = future_to_args[future]
            try:
                df = future.result()
                if df is not None:
                    all_data.append(df)
                    logging.info(f"Scraped {args[0].upper()} data for {args[1]}")
            except Exception as e:
                logging.error(f"Error processing {args[0].upper()} data for {args[1]}: {str(e)}")
                
    return all_data
positions=['qb','rb','wr','te']
seasons = [2023]
all_data = extract_data(positions, seasons)

Boom baby! We have data for QB, RB, WR, and TE from 2023. Let's take a look at the data to see its quality.

all_data[3].dtypes
Player       object
G            object
REC          object
YDS          object
Y/R          object
YBC          object
YBC/R        object
AIR          object
AIR/R        object
YAC          object
YAC/R        object
YACON        object
YACON/R      object
BRKTKL       object
TGT          object
% TM         object
CATCHABLE    object
DROP         object
RZ TGT       object
10+ YDS      object
20+ YDS      object
30+ YDS      object
40+ YDS      object
50+ YDS      object
LNG          object
Season        int64
Position     object
dtype: object

Looks like we need to do a bit of cleanup here. When the data was scraped into a pandas dataframe, it took on object data types instead of numerical data types. One reason for this is the special characters, like % in PCT and , in YDS and AIR. These need to be stripped before we can convert to numerical columns. This is a critical step for future analysis because we will want to calculate our own metrics from the scraped stats. I clean up this data with a function called clean_dataframes(). This function removes the special characters identified above. It also helps us identify some issues in the RB dataframe.

def clean_dataframes(dataframe_list):
    cleaned_dataframes = []
    for df in dataframe_list:
        # Ensure unique column names by appending .1, .2, ... to repeats.
        # (pd.io.parsers.ParserBase._maybe_dedup_names is a private API that
        # was removed in newer pandas versions, so dedupe manually.)
        seen = {}
        new_cols = []
        for col in df.columns:
            if col in seen:
                seen[col] += 1
                new_cols.append(f"{col}.{seen[col]}")
            else:
                seen[col] = 0
                new_cols.append(col)
        df.columns = new_cols
        
        if 'Position' in df.columns and len(df) > 0:
            position = df['Position'].iloc[0]
            
            if position == 'QB':
                if 'PCT' in df.columns:
                    df['PCT'] = pd.to_numeric(df['PCT'].astype(str).str.rstrip('%'), errors='coerce') / 100.0
                if 'YDS' in df.columns:
                    df['YDS'] = pd.to_numeric(df['YDS'].astype(str).str.replace(',', ''), errors='coerce')
                if 'AIR' in df.columns:
                    df['AIR'] = pd.to_numeric(df['AIR'].astype(str).str.replace(',', ''), errors='coerce')
            
            elif position in ['RB', 'WR', 'TE']:
                if 'YDS' in df.columns:
                    df['YDS'] = pd.to_numeric(df['YDS'].astype(str).str.replace(',', ''), errors='coerce')
                if 'AIR' in df.columns:
                    df['AIR'] = pd.to_numeric(df['AIR'].astype(str).str.replace(',', ''), errors='coerce')
                if '% TM' in df.columns:
                    df['% TM'] = pd.to_numeric(df['% TM'].astype(str).str.rstrip('%'), errors='coerce') / 100.0
            
            # Convert remaining numeric columns
            numeric_cols = [col for col in df.columns if col not in ['Player', 'Position']]
            df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
        
        # Reset index to avoid potential issues with duplicate indices
        df = df.reset_index(drop=True)
        cleaned_dataframes.append(df)
    
    return cleaned_dataframes
cleaned_full_dataframe_list = clean_dataframes(all_data)
cleaned_full_dataframe_list[1]
Player G COMP ATT PCT YDS Y/A AIR AIR/A 10+ YDS ... BLITZ POOR DROP RZ ATT RTG Season Position Rush_Att Rush_Yds Rush_TDs
0 Josh Allen (BUF) 17 385 579 0.66 4306 7.4 2533 4.4 164 ... 182 78 31 68 92 2023 QB 111 524 15
1 Jalen Hurts (PHI) 17 352 538 0.65 3858 7.2 2374 4.4 149 ... 185 72 19 50 88 2023 QB 157 605 15
2 Dak Prescott (DAL) 17 410 590 0.69 4516 7.7 2768 4.7 179 ... 163 69 38 104 104 2023 QB 55 242 2
3 Lamar Jackson (BAL) 16 307 457 0.67 3678 8.0 2140 4.7 150 ... 169 73 22 61 102 2023 QB 148 821 5
4 Jordan Love (GB) 17 372 579 0.64 4159 7.2 2534 4.4 162 ... 215 98 29 93 98 2023 QB 50 247 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82 Kyle Trask (TB) 2 0 1 0.00 0 0.0 0 0.0 0 ... 0 0 1 1 20 2023 QB 1 -1 0
83 Teddy Bridgewater (DET) 1 0 0 0.00 0 0.0 0 0.0 0 ... 0 0 0 0 0 2023 QB 2 -2 0
84 Matt Barkley (FA) 1 0 0 0.00 0 0.0 0 0.0 0 ... 0 0 0 0 0 2023 QB 3 -3 0
85 Nathan Peterman (ATL) 2 0 0 0.00 0 0.0 0 0.0 0 ... 1 0 0 0 0 2023 QB 2 -4 0
86 Kyle Allen (PIT) 7 0 0 0.00 0 0.0 0 0.0 0 ... 0 0 0 0 0 2023 QB 13 -13 0

87 rows × 28 columns

cleaned_full_dataframe_list[1].dtypes
Player       object
G             int64
COMP          int64
ATT           int64
PCT         float64
YDS           int64
Y/A         float64
AIR           int64
AIR/A       float64
10+ YDS       int64
20+ YDS       int64
30+ YDS       int64
40+ YDS       int64
50+ YDS       int64
PKT TIME    float64
SACK          int64
KNCK          int64
HRRY          int64
BLITZ         int64
POOR          int64
DROP          int64
RZ ATT        int64
RTG           int64
Season        int64
Position     object
Rush_Att      int64
Rush_Yds      int64
Rush_TDs      int64
dtype: object

Ahh. So fresh and clean! Looks like we resolved the issues with the QB dataframe. I did notice something weird about the RB dataframe.

cleaned_full_dataframe_list[0]
Player G ATT YDS Y/ATT YBCON YBCON/ATT YACON YACON/ATT BRKTKL ... 30+ YDS 40+ YDS 50+ YDS LNG REC TGT RZ TGT YACON.1 Season Position
0 Christian McCaffrey (SF) 16 272 1459 5.4 886 3.3 573 2.1 13 ... 4 3 3 72 67 83 16 73 2023 RB
1 Raheem Mostert (MIA) 15 209 1012 4.8 620 3.0 392 1.9 15 ... 3 2 0 49 25 32 6 59 2023 RB
2 Travis Etienne Jr. (JAC) 17 267 1008 3.8 583 2.2 425 1.6 31 ... 3 1 1 62 58 73 2 145 2023 RB
3 Kyren Williams (LAR) 12 228 1144 5.0 791 3.5 353 1.5 20 ... 2 1 1 56 32 48 12 46 2023 RB
4 Derrick Henry (BAL) 17 280 1167 4.2 597 2.1 570 2.0 23 ... 2 2 2 69 28 36 2 46 2023 RB
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
134 Evan Hull (IND) 1 1 1 1.0 0 0.0 0 0.0 0 ... 0 0 0 1 1 1 0 0 2023 RB
135 Adam Prentice (NO) 13 2 12 6.0 10 5.0 2 1.0 0 ... 0 0 0 7 2 3 0 0 2023 RB
136 James Robinson (FA) 1 1 2 2.0 0 0.0 2 2.0 0 ... 0 0 0 2 1 1 0 0 2023 RB
137 Jonathan Williams (FA) 1 1 -2 -2.0 0 0.0 0 0.0 0 ... 0 0 0 0 0 1 0 0 2023 RB
138 Deon Jackson (FA) 2 14 16 1.1 4 0.3 12 0.9 0 ... 0 0 0 7 5 6 0 5 2023 RB

139 rows × 25 columns

Notice that there are two YACON (Yards After Contact) columns. This gave me some trouble while I was writing the clean_dataframes() function. The column-deduplication step at the top of that function appends a numeric suffix to repeated column names so they can be told apart. In this case we are looking at RB data, and the second YACON actually refers to YACON on receiving plays. I'll clean this up by renaming YACON.1 to REC_YACON.

cleaned_full_dataframe_list[0] = cleaned_full_dataframe_list[0].rename(columns={'YACON.1':'REC_YACON'})
cleaned_full_dataframe_list[0]
Player G ATT YDS Y/ATT YBCON YBCON/ATT YACON YACON/ATT BRKTKL ... 30+ YDS 40+ YDS 50+ YDS LNG REC TGT RZ TGT REC_YACON Season Position
0 Christian McCaffrey (SF) 16 272 1459 5.4 886 3.3 573 2.1 13 ... 4 3 3 72 67 83 16 73 2023 RB
1 Raheem Mostert (MIA) 15 209 1012 4.8 620 3.0 392 1.9 15 ... 3 2 0 49 25 32 6 59 2023 RB
2 Travis Etienne Jr. (JAC) 17 267 1008 3.8 583 2.2 425 1.6 31 ... 3 1 1 62 58 73 2 145 2023 RB
3 Kyren Williams (LAR) 12 228 1144 5.0 791 3.5 353 1.5 20 ... 2 1 1 56 32 48 12 46 2023 RB
4 Derrick Henry (BAL) 17 280 1167 4.2 597 2.1 570 2.0 23 ... 2 2 2 69 28 36 2 46 2023 RB
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
134 Evan Hull (IND) 1 1 1 1.0 0 0.0 0 0.0 0 ... 0 0 0 1 1 1 0 0 2023 RB
135 Adam Prentice (NO) 13 2 12 6.0 10 5.0 2 1.0 0 ... 0 0 0 7 2 3 0 0 2023 RB
136 James Robinson (FA) 1 1 2 2.0 0 0.0 2 2.0 0 ... 0 0 0 2 1 1 0 0 2023 RB
137 Jonathan Williams (FA) 1 1 -2 -2.0 0 0.0 0 0.0 0 ... 0 0 0 0 0 1 0 0 2023 RB
138 Deon Jackson (FA) 2 14 16 1.1 4 0.3 12 0.9 0 ... 0 0 0 7 5 6 0 5 2023 RB

139 rows × 25 columns

Much better! Now let’s check WR and TE.

cleaned_full_dataframe_list[2]
Player G REC YDS Y/R YBC YBC/R AIR AIR/R YAC ... DROP RZ TGT 10+ YDS 20+ YDS 30+ YDS 40+ YDS 50+ YDS LNG Season Position
0 Sam LaPorta (DET) 17 86 889 10.3 531 6.2 851 9.9 358 ... 5 15 35 8 5 2 0 48 2023 TE
1 George Kittle (SF) 16 65 1020 15.7 537 8.3 852 13.1 483 ... 4 12 40 18 8 3 2 66 2023 TE
2 Travis Kelce (KC) 15 93 984 10.6 515 5.5 808 8.7 469 ... 7 19 39 12 2 2 1 53 2023 TE
3 T.J. Hockenson (MIN) 15 95 960 10.1 624 6.6 976 10.3 336 ... 4 10 40 13 0 0 0 29 2023 TE
4 David Njoku (CLE) 16 81 882 10.9 283 3.5 551 6.8 599 ... 11 17 29 12 7 2 0 43 2023 TE
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
94 Kenny Yeboah (NYJ) 5 2 28 14.0 8 4.0 19 9.5 20 ... 0 0 1 1 0 0 0 20 2023 TE
95 Greg Dulcich (NYG) 2 3 25 8.3 20 6.7 23 7.7 5 ... 0 0 1 0 0 0 0 13 2023 TE
96 Nate Adkins (DEN) 10 4 22 5.5 -2 -0.5 -5 -1.3 24 ... 1 0 1 0 0 0 0 11 2023 TE
97 Chris Manhertz (NYG) 16 2 16 8.0 6 3.0 12 6.0 10 ... 0 1 1 0 0 0 0 10 2023 TE
98 Eric Saubert (SF) 8 3 12 4.0 6 2.0 6 2.0 6 ... 0 0 0 0 0 0 0 5 2023 TE

99 rows × 27 columns

cleaned_full_dataframe_list[3]
Player G REC YDS Y/R YBC YBC/R AIR AIR/R YAC ... DROP RZ TGT 10+ YDS 20+ YDS 30+ YDS 40+ YDS 50+ YDS LNG Season Position
0 CeeDee Lamb (DAL) 17 135 1749 13.0 NaN 7.9 1726 12.8 676 ... 6 31 73 29 8 3 1 92 2023 WR
1 Tyreek Hill (MIA) 16 119 1799 15.1 NaN 9.6 1847 15.5 653 ... 12 24 64 29 14 9 5 78 2023 WR
2 Amon-Ra St. Brown (DET) 16 119 1515 12.7 847.0 7.1 1297 10.9 668 ... 8 23 60 24 6 3 1 70 2023 WR
3 Mike Evans (TB) 17 79 1255 15.9 933.0 11.8 1899 24.0 322 ... 7 14 46 20 10 6 3 75 2023 WR
4 Puka Nacua (LAR) 17 105 1486 14.2 854.0 8.1 1453 13.8 632 ... 13 16 59 25 10 3 2 80 2023 WR
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
185 Miles Boykin (SEA) 16 3 17 5.7 15.0 5.0 19 6.3 2 ... 1 0 0 0 0 0 0 6 2023 WR
186 Marvin Jones Jr. (FA) 6 5 35 7.0 26.0 5.2 80 16.0 9 ... 2 2 1 0 0 0 0 16 2023 WR
187 Malik Taylor (NYJ) 3 2 13 6.5 6.0 3.0 89 44.5 7 ... 0 0 0 0 0 0 0 7 2023 WR
188 Austin Trammell (JAC) 15 4 29 7.3 -6.0 -1.5 25 6.3 35 ... 0 0 2 0 0 0 0 14 2023 WR
189 James Proche II (CLE) 10 0 0 0.0 0.0 0.0 86 0.0 0 ... 0 0 0 0 0 0 0 0 2023 WR

190 rows × 27 columns

Perfect! Now we’re ready to do some analysis in the next post!
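As promised at the top, a quick word on ADOT. Average Depth of Target is air yards divided by targets: how far downfield a player is typically targeted when the ball is thrown their way. A minimal sketch with made-up numbers, under the assumption that the AIR column holds total air yards across all targets and TGT holds targets (if AIR only counts completed passes, this would understate true ADOT):

```python
import pandas as pd

# Toy numbers, not real player stats
wr = pd.DataFrame({
    'Player': ['Receiver A', 'Receiver B'],
    'AIR': [1500, 800],   # assumed: total air yards
    'TGT': [150, 100],    # targets
})

# ADOT: air yards per target
wr['ADOT'] = wr['AIR'] / wr['TGT']
print(wr[['Player', 'ADOT']])
```

Here Receiver A comes out to an ADOT of 10.0 and Receiver B to 8.0: with equal volume, Receiver A is the deeper threat.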