Scraping the MLB for Science & Glory

but mostly science... or math

If you’ve been following baseball long enough you might have noticed nobody likes baseball anymore, and everyone has an opinion or stance or view or insight into how to fix it.

As someone who likes baseball (and statistics), I want to analyze these trends myself and find out if I need a new “national pastime”, or if these charlatans are just talking out their …er… padding their word counts.

So in this essay we'll begin a multi-step journey: collecting baseball statistics from the MLB so that, in later essays, we can analyze the cause of "baseball's declining fan base".

The Plan

Just like landing on the moon, blind dates, or playing with data science, having a plan before you start is usually a good idea.

As for our plan, we'll need to define a scientific question from the internet's "opinions" about baseball's decline before we can find sources of information we'll use to answer it.

The Question

While there are literally TONS [1] of articles with opinions on "how to fix baseball", the main question I want to answer is:

Are strikeouts making baseball boring?

This question seems to have been asked every baseball season, or at least every season back to 1999, where I stopped searching the internet at a Chicago Tribune article interviewing Ted Williams about the rise in strikeouts with the "new wave" of power hitters like Mark McGwire and Sammy Sosa.

Which I get, if no one can hit the ball then baseball is really 9 dudes throwing the ball to themselves for 3 hours, which is boring.

So, with a solid question defined, let's start looking for sources of information that could answer it.

[1] "literally TONS" is measured by page count from a Google search, not by weight; the internet is estimated to weigh only ~50 grams, or 0.00005 metric tons.

What Data Do We Need

Our question is dealing with trends in pitching and hitting statistics. So we'll need to find a source of baseball statistics going back to the beginning of baseball (anno domini ~1876) or as far back as we can find.

While there are many websites to choose from, I feel the data from the MLB is a good choice, mainly so we can say "according to the official MLB data…" before we start arguing with strangers on the internet.

Unfortunately though, as is often the case, the MLB doesn't give us a link to download their statistics for further analysis.

So we'll need to build a "web scraper" to download the data directly from their website, essentially making our own "download" button.

Some Rules Before We Start

Before we begin collecting data from the MLB, I feel I should give a few words (warnings) about internet etiquette:

  1. Respect the MLB's terms and conditions. The MLB are savages in the box and can defy the laws of physics to destroy you if you mess with their copyright.
  2. Don't stress their servers. Our scrapers can easily make thousands of requests per second which their servers will struggle to fulfill. (See rule #1 for why)
  3. The code will break. Websites change their layouts like I change my …er… socks. Make sure you save the data you've downloaded so you only have to do this once.
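Rule #3 can be pictured as a tiny cache: before asking the server for anything, check whether we've already saved it. This is only a minimal sketch and not part of the scraper built later; the helper name load_or_fetch and the file name are hypothetical, and fake_download stands in for whatever function would actually make the request.

```python
import json
import os
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached JSON data if it exists, otherwise fetch and save it.

    `fetch` is any zero-argument function returning JSON-serializable
    data (a hypothetical stand-in for a real request).
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)

    data = fetch()
    with open(cache_path, 'w') as f:
        json.dump(data, f)
    return data


# a fake "download" that records how often it gets called
calls = []


def fake_download():
    calls.append(1)
    return {'season': 2019, 'rows': []}


cache_file = os.path.join(tempfile.mkdtemp(), 'season-2019.json')
first = load_or_fetch(cache_file, fake_download)   # calls fake_download
second = load_or_fetch(cache_file, fake_download)  # reads the saved file
```

The second call never touches fake_download, which is the whole point: if the website changes tomorrow, the data we already saved is still on disk.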

Now with all the etiquette out of the way, let's start collecting some data.

Collecting Some Data

Back "in the day" this step would require "web scraping", using Python libraries like BeautifulSoup to parse the raw HTML for the data we want.

Now-a-days, with the increasing popularity of tools like AngularJS and RESTful APIs, websites are using JavaScript to render the web-page on our browsers (client-side) instead of on their servers (server-side).

The MLB is one of these "now-a-days" websites.

Source code of mlb.com's team pitching stats page, which is missing its content because the JavaScript hasn't downloaded it yet.

With a client-side app, the static HTML our browsers download is a template (skin) for the data. The JavaScript will download the data using the RESTful API and convert it into something we can easily read by applying the template.

As complicated as that sounds, because the data returned by the API is in a form our computers can use, it's often easier than scraping the web-page directly.

We just have to find the API.

Finding The Data

Because our browsers are doing all the work of building the MLB's client-side web-page, we can use our browser's built-in developer tools to find where the data is coming from.

Open your browser's developer tools (how you get there depends on which browser you prefer).

With the developer tools open:

  • Find and open the Network tab.

For Firefox 69.0 the Network tab is in Tools → Web Developer → Network or if you're into key bindings Ctrl+Shift+E.

  • Next, go to the MLB's stats page and refresh the page with the Network tab open.

You should see a list of all the requests your browser made while building the MLB's web-page. Something similar to this (I'm using Firefox 69.0):

Screenshot of the Network tab in Firefox's Developer Tools showing the 128 requests made by the mlb.com client side web application.

If you're in the year 2019, our browsers made 128 individual requests to their servers (54 more than average) for information to build the web-page. That's 127 more requests than I'm willing to look through by hand, so let's start filtering these requests down:

  • Click on the XHR icon to filter out unrelated requests.

The requests we're interested in are XHR (XMLHttpRequest) objects. These are requests made by the web-page's JavaScript to get data (usually in XML or JSON form) from an API.

Screenshot of the Network tab in Firefox's Developer Tools showing the 23 filtered requests after applying the XHR filter.

With the requests filtered, we can start exploring for the data we want. We know we're looking for data which usually comes in either XML or JSON format, so we can ignore the "js" types.

  • Click on each Response looking for the data.

I usually look at responses with the largest size first and work my way down. After a few attempts, we'll find a JSON blob with keys that match the data we want.

Screenshot of the JSON response from the MLB's "team_pitching_season_leader_master" endpoint.

Huzzah! We've found the API endpoint with the data we need.

Before we move on, take note of the data's shape in the response. Notice that the list of rows (under the row key), with all the baseball stats, is inside queryResults, which is inside team_pitching_season_leader_master. We'll need this when we start building the web scraper later on.
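That nesting can be pictured with a trimmed-down stand-in for the response. Only the three nested key names come from the real response; the team abbreviations, the field names team_abbrev and so, and the numbers are made up for illustration, and the real rows carry many more keys.

```python
# a made-up, trimmed-down stand-in for the API response
sample = {
    'team_pitching_season_leader_master': {
        'queryResults': {
            'row': [
                {'team_abbrev': 'AAA', 'so': '1500'},  # hypothetical values
                {'team_abbrev': 'BBB', 'so': '1400'},
            ],
        },
    },
}

# walk down the nesting, one key at a time
results = sample['team_pitching_season_leader_master']['queryResults']
rows = results['row']

for team in rows:
    print(team['team_abbrev'], team['so'])
```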

Our next step is to find how we can recreate the request using Python:

  • Switch to that request's Headers tab and copy the URL.

This will show us the URL and the Headers we sent to the server.

Screenshot of the details of the request made to the API to download the stats data we need to answer our question.

If everything went according to plan, when you

  • Copy the URL into a new browser tab,

you should see the same information we found in the Developer Tools just a moment ago, telling us we've found the URL to the API we want.

Good Job!

The Code

With the API discovered, we can begin building the web scraper to download the pitching statistics that will help us answer our initial question.

To make things easier on ourselves, let's install some libraries to help us download the data.

We'll be using requests to make the HTTP requests and pandas to organize the data we download.

Both can easily be installed via pip like this:

$ pip install requests pandas

Writing the Code

From our exploration above, we know the URL to our endpoint is this:

team_pitching = (
    'http://lookup-service-prod.mlb.com/'
    'json/named.team_pitching_season_leader_master.bam?'
    'season=2019&sort_order=%27asc%27&sort_column=%27whip%27'
    '&game_type=%27R%27&sport_code=%27mlb%27&recSP=1&recPP=50'
)

However, this will only download each team's stats for the 2019 season. This API uses the query string parameters as a kind of filter for the content it provides to us.

For example, changing season from 2019 to 2018 would give us each team's pitching statistics for the 2018 season, and changing game_type to "%27S%27" would give us statistics for spring training games.
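To see those filters without squinting at the URL, Python's standard urllib.parse module can split the query string into its parameters. A small sketch using the same URL as above (the variable names url, parts and params are just for this example):

```python
from urllib.parse import urlsplit, parse_qs

url = (
    'http://lookup-service-prod.mlb.com/'
    'json/named.team_pitching_season_leader_master.bam?'
    'season=2019&sort_order=%27asc%27&sort_column=%27whip%27'
    '&game_type=%27R%27&sport_code=%27mlb%27&recSP=1&recPP=50'
)

# split the URL into scheme, domain, path, and query string
parts = urlsplit(url)

# parse_qs decodes the %27 escapes and returns a list of values per key
params = parse_qs(parts.query)

for name, values in params.items():
    print(name, '=', values[0])
```

The text values come back wrapped in apostrophes, which is what the %27 escapes decode to.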

So let's break team_pitching down into its component parts, allowing us to make requests for the different seasons the MLB has in their database.

We can start by separating the domain, path, and query string parameters into separate variables like this:

domain = 'https://lookup-service-prod.mlb.com'
path = '/json/named.team_pitching_season_leader_master.bam'
query_string = {
    'season': 2019,
    'sort_order': '%27asc%27',
    'sort_column': '%27whip%27',
    'game_type': '%27R%27',
    'sport_code': '%27mlb%27',
    'recSP': 1,
    'recPP': 50,
}

After looking up what the %27 means (they're URL encoded ' apostrophes), let's make our future selves happy by making a simple function to replace the %27 (with a note) and turn our query_string into this:

# words are quoted with apostrophes
quote = "'{}'".format

query_string = {
    'season': 2019,
    'sort_order': quote('asc'),
    'sort_column': quote('whip'),
    'game_type': quote('R'),
    'sport_code': quote('mlb'),
    'recSP': 1,
    'recPP': 50,
}
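As a quick sanity check that the apostrophes turn back into %27 on the way out, the standard library's urlencode can build a query string from a dictionary. This is only an illustration with a shortened, hypothetical params dictionary; requests does its own encoding when we hand it params, so the scraper never needs to call this directly.

```python
from urllib.parse import urlencode

# words are quoted with apostrophes (same helper as above)
quote = "'{}'".format

# a shortened dictionary, just for this check
params = {
    'season': 2019,
    'sort_order': quote('asc'),
    'game_type': quote('R'),
}

encoded = urlencode(params)
print(encoded)
```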

With our URL broken down, we can make a simple for loop to go through each year (1876 - 2019), making a request to the MLB to download the pitching statistics for that season and save the results into a list creatively called data.

import time

import requests

data = []  # we'll save all the stats here

# go through each season and make a request
for season in range(1876, 2019 + 1):
    time.sleep(1)  # rule #2 don't hammer the servers

    # update the query_string for the correct season
    query_string['season'] = season

    # make the request to the MLB
    response = requests.get(
        domain + path,
        params=query_string
    )

    # raise an error if the MLB gives
    # an invalid response back
    response.raise_for_status()

    # pull the list of stats out of the response
    stats = response.json()[
        'team_pitching_season_leader_master'
    ]['queryResults']['row']

    # add the season to each row, and
    # append it to our 'data' list
    for team in stats:
        team['season'] = season
        data.append(team)
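The loop assumes every season's response holds a list under row. How the endpoint behaves for a season with no data, or with only one team, isn't covered here, so the following is only a defensive sketch. The helper extract_rows is hypothetical, and the two edge cases it handles (a missing row key, and a single dictionary instead of a list) are assumptions rather than documented behavior.

```python
def extract_rows(payload):
    """Return the list of stat rows from a response payload.

    Assumed edge cases (not confirmed against the real API):
    - 'row' may be missing when a season has no data
    - 'row' may be a single dict when there is only one result
    """
    results = payload['team_pitching_season_leader_master']['queryResults']
    rows = results.get('row', [])
    if isinstance(rows, dict):
        rows = [rows]
    return rows


# made-up payloads covering the three shapes
empty = {'team_pitching_season_leader_master': {'queryResults': {}}}
single = {'team_pitching_season_leader_master': {
    'queryResults': {'row': {'team_abbrev': 'AAA'}}}}
many = {'team_pitching_season_leader_master': {
    'queryResults': {'row': [{'team_abbrev': 'AAA'},
                             {'team_abbrev': 'BBB'}]}}}

for payload in (empty, single, many):
    print(len(extract_rows(payload)))
```

Inside the loop, stats = extract_rows(response.json()) would take the place of the three-line lookup.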

And finally, after we've collected all the data from the MLB, we put the data into a pandas DataFrame to start the "data wrangling" phase of our journey.

import pandas as pd
df = pd.DataFrame(data)

For now though, remembering rule #3 above, we'll save our data to disk and wait until we have more time to answer the question "Are strikeouts making baseball boring?".

df.to_pickle('mlb-team-pitching-statistics.pkl')

Good Job Everyone!