In order to determine if we can identify the Sterpid era by Looking at Player Performance data, we look at Historical Baseball Data available on the Internet. The specific source of data chosen here is a database of baseball statistics over the years 1870 to 2016. http://www.seanlahman.com/baseball-database.html
This database has 27 tables. However to obtain the answer for our query above, we need to cross reference data from 2 tables in this database. The Master.csv table lists every player that has played the game from 1870 to 2016, along with their year of birth . Its schema is listed below.
Field | Description |
---|---|
playerID | A unique code asssigned to each player |
birthYear | Year player was born |
birthMonth | Month player was born |
birthDay | Day player was born |
birthCount | Country where player was born |
birthState | State where player was born |
birthCity | City where player was born |
deathYear | Year player died |
deathMonth | Month player died |
deathDay | Day player died |
deathCount | Country where player died |
deathState | State where player died |
deathCity | City where player died |
nameFirst | Player's first name |
nameLast | Player's last name |
nameGiven | Player's given name |
weight | Player's weight in pounds |
height | Player's height in inches |
bats | Player's batting hand (left, right) |
throws | Player's throwing hand (left or right) |
debut | Date that player made first appearance |
finalGame | Date that player made last appearance |
retroID | ID used by retrosheet |
bbrefID | ID used by Baseball Reference website |
The Batting.csv table lists the batting statistics for every player, for every year that he played the game of baseball between 1870 and 2016. Its schema is listed below
Field | Description |
---|---|
playerID | A unique code asssigned to each player |
yearID | Year |
stint | players stint |
teamID | Team |
lgID | League |
G | Games Played |
AB | At Bats |
R | Runs Scored |
H | Hits |
2B | Doubles |
3B | Triples |
HR | Homeruns |
RBI | Runs Batted In |
SB | Stolen Bases |
CS | Caught Stealing |
BB | Base on Balls |
SO | Strike Outs |
IBB | Intentional Wals |
HBP | Hit by Pitch |
SH | Sacrifice Hits |
SF | Sacrifice Flies |
GIDP | Grounded into Double Plays |
We Utilize Apache Spark to perform the required database operations to answer our questions. The Code below explains the process of answering these questions, and shows how easy it is to use Spark to analyze Big Data. The Code to implement this query is implemented in Python, and can either be run on a local server or a cluster of servers. The example below was run on an Amazon EC2 Free Tier Ubuntu Server instance. The EC2 instance was set up with Python (Anaconda 3-4.1.1), Java, Scala, py4j, Spark and Hadoop. The code was written and executed in a Jupyter Notebook. Several guides are available on the internet describing how to install and run spark on an EC2 instance. One that particularly covers all these facets is https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297
Import the pyspark libraries to allow python to interact with spark. A description of the basic functionality of each of these libaries is provided in the code comments below. A more detailed explanation of the functionality of each of these libraries can be found in Apache's documentation on Spark https://spark.apache.org/docs/latest/api/python/index.html
# Import SparkContext. This is the main entry point for Spark functionality
# Import Sparkconf. We use Spark Conf to easily change the configuration settings when changing between local mode cluster mode.
# Import SQLContext from pyspark.sql. We use the libraries here to read in data in csv format. The format of our native database
# Import count, avg, round from pyspark.sql.functions. This is used for the math operations needed to answer our questions
# Import Window from pyspark.sql to allow us to effectively partition and analyze data
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import count
from pyspark.sql.functions import avg
from pyspark.sql.functions import round
from pyspark.sql.functions import cume_dist
from pyspark.sql.window import Window
import time
We configure spark for local mode or cluster mode, configure our application name, and configure logging. Several other configuration settings can be programmed as well. A detailed explanation of these can be found at https://spark.apache.org/docs/latest/configuration.html
We pass the configuration to an instance of a SparkContext object, so that we can begin using Apache Spark
# The Master will need to change when running on a cluster.
# If we need to specify multiple cores we can list something like local[2] for 2 cores, or local[*] to use all available cores.
# All the available Configuration settings can be found at https://spark.apache.org/docs/latest/configuration.html
sc_conf = SparkConf().setMaster('local[*]').setAppName('Question3b').set('spark.logConf', True)
# We instantiate a SparkContext object with the SparkConfig
sc = SparkContext(conf=sc_conf)
We use the SQLContext library to easily allow us to read the csv files 'Salaries.csv' and 'Teams.csv'. These files are currently stored in Amazon s3 storage (s3://cs498ccafinalproject/) and are publicly available for download. They were copied over to a local EC2 instance by using the AWS command line interace command
aws s3 cp s3://cs498ccafinalproject . --recursive
# We create a sql context object, so that we can read in csv files easily, and create a data frame
sqlContext = SQLContext(sc)
masterData = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Master.csv')
battingData = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Batting.csv')
The Steroid Era in Major League Baseball was a time where Performance Enhacing Drugs were banned, but tests were not performed on Players to determine if they were taking Steroids. Steroids were banned in 1991, and it was suspected that Players were engaging in steroid usage from 1995. MLB enforced Testing for Steroids in 2003. So we will consider the Steroid era as 1995-2003.
From 1870 to 2016, we look at the Average number of Home runs hit by players over the age of 30, and try to see if the years Steroids were prevalent in Baseball stand out. In order to do this, we perform the following opertions
1) We join the Batting Table data with the Master Table data.
2) We calculate the Age of every player entry in the merged table.
3) We clean the data to remove null entries and filter out entries where a player did not have enough bats in a season to qualify for a batting title (502). This makes sure we remove statistically insignificant entries. We also remove any data points where a players age could not be calculated, due to a missing birth year.
3) We filter out players who did not spend enough time in the major leagues, or did not Hit enough to qualify for a batting title.
4) We filter our Data for Players over the Age of 30.
5) For every year from 1870 to 2016, we calculate the Average number Home runs per year, for players over the age of 30.
6) We print the Top 20 entries in the Table, by Descending order of Average number of Home runs hit.
Our data set spans 146 years. We see that Years 1995-2003 are all in the Top 20 for average number of Home runs hit, for players over the Age of 30. In fact 1999 and 2001 are the Top 2 entries. This correctly identifies the Height of the Steroid era based on player performance
# Merge the two data frames
questionData = battingData.join(masterData, masterData.playerID == battingData.playerID, 'left')
# Calculate the players age (Using the year only since we aren't calculating the value for every single game and it is summarized in the playing year)
questionData = questionData.withColumn('playerAge', questionData.yearID - questionData.birthYear)
# Filter out lines where the players age could not be calculated (likely because the birth year was blank)
questionData = questionData.na.drop(subset=["playerAge"])
# Filter out players who didn't have at least 502 batting chances which qualifies them for a batting title
questionData = questionData.filter(questionData.AB >= 502)
# Focus on players over 30 years of age
steroidEra = questionData.filter(questionData.playerAge >= 30)
# Query the averge batting value for players by year
steroidEra = steroidEra.createOrReplaceTempView('steroidEra')
sqlDF = sqlContext.sql('select yearID, avg(HR) from steroidEra group by yearID order by yearID asc')
sqlDF= sqlDF.withColumnRenamed('avg(HR)', 'avgHR')
sqlDF.orderBy(sqlDF.avgHR.desc()).show()
We convert our spark data frames to pandas data frames, so it is easy to save them in a human readable csv format. These files contain the answers to the questions we posed.
# Examples to show how to print the results to an output file
pandas_sqlDF = sqlDF.toPandas()
pandas_sqlDF.to_csv('spark_question3b_avg_hr_players_over_30.csv')
sc.stop()