To determine the effect team salary expenditure has on Major League Baseball team performance, we look at historical baseball data available on the Internet. The specific source of data chosen here is a database of baseball statistics covering the years 1870 to 2016. http://www.seanlahman.com/baseball-database.html
This database has 27 tables; however, to answer the question above, we need to cross-reference data from 2 of them. The Salaries.csv table lists every player who has played in Major League Baseball, along with their team and their associated salary. This data is only provided for the years 1985 and later. Its schema is listed below.
Field | Description |
---|---|
yearID | Year |
teamID | Team |
lgID | League |
playerID | A unique code assigned to each player |
salary | Player Salary |
Note: At the time of writing, the teamID values in the Salaries.csv table for the year 2016 did not follow the teamID convention used throughout the rest of the table and the entire database. Specifically, 12 teams had teamIDs that did not match the codes used for them in previous years. This data was manually cleaned to make sure it did not affect the results obtained.
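As an illustration, this cleaning step can be scripted rather than done by hand. The sketch below assumes a hypothetical mapping from the non-standard 2016 codes to the conventional ones; the actual 12 code pairs were found by comparing the distinct 2016 teamIDs in Salaries.csv against those in Teams.csv.
# A minimal sketch of the manual cleaning step. The mapping below contains
# hypothetical example codes only; substitute the real pairs found by inspection.
import csv

teamid_fixes = {'WSH': 'WAS', 'CHC': 'CHN'}  # hypothetical examples

with open('Salaries.csv') as src, open('Salaries_clean.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Remap only the non-standard 2016 codes; leave all other rows untouched
        if row['yearID'] == '2016' and row['teamID'] in teamid_fixes:
            row['teamID'] = teamid_fixes[row['teamID']]
        writer.writerow(row)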
The Teams.csv table lists the team statistics for every team that has played the game of baseball from 1870 to 2016, along with the year those statistics were recorded. Its schema is listed below.
Field | Description |
---|---|
yearID | Year |
lgID | League |
teamID | Team |
franchID | Franchise |
divID | Team's Division |
Rank | Position in Final Standings |
G | Games Played |
GHome | Games Played at Home |
W | Wins |
L | Losses |
DivWin | Division Winner |
WCWin | Wild Card Winner |
LgWin | League Champion |
WSWin | World Series Champion |
R | Runs Scored |
AB | At Bats |
H | Hits |
2B | Doubles |
3B | Triples |
HR | Homeruns |
BB | Batters Walked |
SO | Strike Outs |
SB | Stolen Bases |
CS | Caught Stealing |
HBP | Hit by Pitch |
SF | Sacrifice Flies |
RA | Opponent Runs Scored |
ER | Earned Runs Allowed |
ERA | Earned Run Average |
CG | Complete Games |
SHO | Shutouts |
SV | Saves |
IPOuts | Outs Pitched |
HA | Hits Allowed |
HRA | Home Runs Allowed |
BBA | Walks Allowed |
SOA | Strikeouts by Pitchers |
E | Errors |
DP | Double Plays |
FP | Fielding Percentage |
name | Team's Full Name |
park | Park |
attendance | Home Attendance Total |
BPF | Three-Year Park Factor for Batters |
PPF | Three-Year Park Factor for Pitchers |
teamIDBR | Team ID used by Baseball Reference |
teamIDlahman45 | Team ID used in Lahman database version 4.5 |
teamIDretro | Team ID used by Retrosheet |
We utilize Apache Spark to perform the database operations required to answer our questions. The code below explains the process of answering these questions, and shows how easy it is to use Spark to analyze big data. The query is implemented in Python, and can be run either on a local server or on a cluster of servers. The example below was run on an Amazon EC2 Free Tier Ubuntu Server instance. The EC2 instance was set up with Python (Anaconda 3-4.1.1), Java, Scala, py4j, Spark and Hadoop. The code was written and executed in a Jupyter Notebook. Several guides are available on the internet describing how to install and run Spark on an EC2 instance. One that covers all of these facets is https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297
Import the pyspark libraries to allow Python to interact with Spark. A description of the basic functionality of each of these libraries is provided in the code comments below. A more detailed explanation of the functionality of each of these libraries can be found in Apache's documentation on Spark: https://spark.apache.org/docs/latest/api/python/index.html
# Import SparkContext. This is the main entry point for Spark functionality
# Import SparkConf. We use SparkConf to easily change the configuration settings when switching between local mode and cluster mode.
# Import SQLContext from pyspark.sql. We use this to read in data in csv format, the format of our source database.
# Import count, sum, avg, rank from pyspark.sql.functions. This is used for the math operations needed to answer our questions
# Import Window from pyspark.sql to allow us to effectively partition and analyze data
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import count
from pyspark.sql.functions import sum
from pyspark.sql.functions import rank
from pyspark.sql.functions import avg
from pyspark.sql.window import Window
We configure Spark for local mode or cluster mode, set our application name, and configure logging. Several other configuration settings can be programmed as well. A detailed explanation of these can be found at https://spark.apache.org/docs/latest/configuration.html
We pass the configuration to an instance of a SparkContext object, so that we can begin using Apache Spark.
# The Master will need to change when running on a cluster.
# If we need to specify multiple cores we can list something like local[2] for 2 cores, or local[*] to use all available cores.
# All the available Configuration settings can be found at https://spark.apache.org/docs/latest/configuration.html
sc_conf = SparkConf().setMaster('local[*]').setAppName('Question2').set('spark.logConf', True)
# We instantiate a SparkContext object with the SparkConfig
sc = SparkContext(conf=sc_conf)
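For reference, only the master URL needs to change to run the same code on a cluster. A sketch, assuming a hypothetical standalone cluster whose master runs at master-host:7077 (left commented out, since only one SparkContext may be active at a time):
# Example cluster-mode configuration (hypothetical master URL):
# sc_conf = SparkConf().setMaster('spark://master-host:7077').setAppName('Question2').set('spark.logConf', True)
# sc = SparkContext(conf=sc_conf)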
We use the SQLContext library to easily read the csv files 'Salaries.csv' and 'Teams.csv'. These files are currently stored in Amazon S3 storage (s3://cs498ccafinalproject/) and are publicly available for download. They were copied over to the local EC2 instance by using the AWS command line interface command
aws s3 cp s3://cs498ccafinalproject . --recursive
# We create a sql context object, so that we can read in csv files easily, and create a data frame
sqlContext = SQLContext(sc)
df_salary = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Salaries.csv')
df_team = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Teams.csv')
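Because the schemas are inferred from the csv files, it is worth confirming that the columns were typed as expected (for example, that salary was read as a numeric type rather than a string) before aggregating. A quick check:
# Print the inferred schema of each dataframe as a sanity check
df_salary.printSchema()
df_team.printSchema()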
To determine the effect of team salary expenditure on team performance, we perform the following operations:
1) We sum up the salary of every player, for every team, for every year, so we can obtain the yearly payroll for each team.
2) We perform an inner join on the aggregated salary data and the Teams.csv table, using the yearID and teamID as unique keys.
3) We select only the columns that we need (yearID, teamID, salary, W, DivWin, WCWin, LgWin, WSWin, R, H, HR, RA, ERA, HA and HRA) to answer our question.
4) We then sort the data by yearID and teamID.
This gives us a dataframe that lists every team, for the years 1985 and beyond, and their corresponding Salary, Wins, Division Champion status, Wild Card Winner status, League Champion status, World Series Champion status, Runs, Hits, Home Runs, Runs Allowed, Earned Run Average, Hits Allowed and Home Runs Allowed.
# Keep the year, team and salary data from the salary table
# Group the data by year and team, then sum up all the salaries
# Sort the data by year and then team
keep = [df_salary.yearID, df_salary.teamID, df_salary.salary ]
df_salary_filt = df_salary.select(*keep)
df_salary_agg_raw = df_salary_filt.groupBy(df_salary_filt.yearID, df_salary_filt.teamID).\
agg(sum(df_salary_filt.salary)).\
orderBy(df_salary_filt.yearID, df_salary_filt.teamID)
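# Rename the join keys to 'year' and 'team' so they do not collide with the Teams table's yearID and teamID columns after the join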
df_salary_agg = df_salary_agg_raw.withColumnRenamed('yearID', 'year').withColumnRenamed('teamID', 'team')
# From the Team table select the Wins, Division Champion status, Wild Card Winner status, League Champion status,
# World Series Champion status, Runs, Hits, Home Runs, Runs Allowed, Earned Run Average, Hits Allowed and Home Runs Allowed
keep2 = [df_team.yearID, df_team.teamID, df_team.W, df_team.DivWin, df_team.WCWin, df_team.LgWin, df_team.WSWin, df_team.R, \
df_team.H, df_team.HR, df_team.RA, df_team.ERA, df_team.HA, df_team.HRA]
df_team_filt = df_team.select(*keep2)
cond = [df_team_filt.yearID == df_salary_agg.year, df_team_filt.teamID == df_salary_agg.team]
# Join the Team table and the Salary table
df_merge_raw = df_team_filt.join(df_salary_agg, cond, 'inner')
df_merge = df_merge_raw.withColumnRenamed('sum(salary)', 'salary')
df_merge.show()
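Since df_merge is reused by every query that follows, it can optionally be cached so Spark does not recompute the join each time an action is run. A minimal sketch:
# Optional: keep the joined dataframe in memory, since it is reused below
df_merge.cache()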
To put our data into context, we can also look up the following information
1) Extract the highest spending team each year (1985 and later), and examine their performance
a) Find the Number and Percentage of times the Top Spending Team has won the World Series
b) Find the Number and Percentage of times the Top Spending Team has won their League (AL/NL)
c) Find the Number and Percentage of times the Top Spending Team has won their Division
d) Find the Number and Percentage of times the Top Spending Team has made the Playoffs
2) Extract the World Series winning team each year, and examine their salary rank
a) Find the Number and Percentage of times the World Series Winner is in the Top 5 spenders in the league
b) Find the Number and Percentage of times the World Series Winner is in the Top 10 spenders in the league
# Rank each team by the amount of money spent every year
windowSpec = Window.partitionBy(df_merge['yearID']).orderBy(df_merge['salary'].desc())
df_big_spender = df_merge.withColumn("yearRank", rank().over(windowSpec))
# Extract the Top Spending Team Every Year, and look at their performance
df_top_spender = df_big_spender.filter(df_big_spender.yearRank == 1).orderBy(df_big_spender['yearID'], ascending=False)
df_top_spender.show()
# Number of times the Top Spending Team has won the World Series after 1984
# Ans: 5
topSpendWs = df_top_spender.filter(df_top_spender.WSWin=="Y").count()
teamCount = df_top_spender.count()
print (topSpendWs)
# Percentage of times the top spending team has won the World Series after 1984
# Ans: 15.625 %
print ((topSpendWs/teamCount)*100)
# Number of times the top spending team has won their League (AL/NL) after 1984
# Ans: 7
topSpendLg = df_top_spender.filter(df_top_spender.LgWin=="Y").count()
print (topSpendLg)
# Percentage of times the top spending team has won their League (AL/NL) after 1984
# Ans: 21.875 %
print ((topSpendLg/teamCount)*100)
# Number of times the top spending team has won their Division after 1984
# Ans: 15
topSpendDiv = df_top_spender.filter(df_top_spender.DivWin=="Y").count()
print (topSpendDiv)
# Percentage of times the top spending team has won their Division after 1984
# Ans: 46.875%
print ((topSpendDiv/teamCount)*100)
# Number of times the top spending team has made it to the Playoffs
# Ans: 19
topSpendPl = df_top_spender.filter("DivWin=='Y' OR WCWin=='Y'").count()
# Percentage of times the top spending team has made it to the Playoffs
# Ans: 59.375%
print((topSpendPl/teamCount)*100)
# Extract the World Series winning team every year, and look at their spending rank
df_ws_winner = df_big_spender.filter(df_big_spender.WSWin == "Y").orderBy(df_big_spender['yearID'], ascending=False)
df_ws_winner.show()
# Number of times the World Series Winner is in the Top 5 spenders in the League (After 1984)
# Ans: 14
wsTop5 = df_ws_winner.filter(df_ws_winner.yearRank <=5).count()
teamCount = df_ws_winner.count()
print (wsTop5)
# Percentage of times the World Series Winner is in the Top 5 spenders in the League (After 1984)
# Ans: 45.16%
print ((wsTop5/teamCount)*100)
# Number of times the World Series Winner is in the Top 10 spenders in the League (After 1984)
# Ans: 21
wsTop10 = df_ws_winner.filter(df_ws_winner.yearRank <=10).count()
print (wsTop10)
# Percentage of times the World Series Winner is in the Top 10 spenders in the League (After 1984)
# Ans: 67.74%
print ((wsTop10/teamCount)*100)
We can partition the data by a team's spending rank, and look at the average number of wins achieved by teams at each spending rank, along with their average salary expenditure. From the data we see that the teams that spend the most on average also win the most on average. The highest spending teams since 1985 have averaged 89 wins per season, while spending an average of roughly $118 million on salary.
There is a large gap between the average spending of the top team and the average spending of the fifth-highest spender: roughly a 50 percent increase. However, there is only an average difference of 5 wins per season between these teams.
# Calculate the Average number of wins by Spending Rank
windowSpec2 = Window.partitionBy(df_big_spender['yearRank']).orderBy(df_big_spender['yearRank'].desc())
df_avg_win = df_big_spender.withColumn("avgWin", avg(df_big_spender['W']).over(windowSpec2)).\
withColumn("avgSal", 0.000001*avg(df_big_spender['salary']).over(windowSpec2))
df_avg_win_fin = df_avg_win.filter(df_avg_win.yearID == 2016).orderBy(df_avg_win['yearRank'])
keep4 = [df_avg_win_fin.yearRank, df_avg_win_fin.avgWin, df_avg_win_fin.avgSal]
df_avg_win_fin_filt = df_avg_win_fin.select(*keep4)
df_avg_win_fin_filt.show()
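Note that the window-based computation above relies on the default window frame treating all rows with the same yearRank as peers, so the average covers the whole partition. A nearly equivalent and arguably more direct formulation is a plain groupBy aggregation, sketched below (it keeps every rank, whereas the filter on 2016 keeps only the ranks present that season):
# Average wins and average salary (in millions) at each spending rank
df_avg_by_rank = df_big_spender.groupBy('yearRank').\
                                agg(avg('W').alias('avgWin'), (0.000001*avg('salary')).alias('avgSal')).\
                                orderBy('yearRank')
df_avg_by_rank.show()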
We convert our Spark data frames to pandas data frames, so it is easy to save them in a human-readable csv format. These files contain the answers to the questions we posed.
# Examples to show how to print the results to an output file
keep3 = [df_merge.yearID, df_merge.teamID, df_merge.W, df_merge.salary]
df_merge_wins_sal = df_merge.select(*keep3).filter(df_merge.yearID>2006)
pandas_merge_wins_sal = df_merge_wins_sal.toPandas()
pandas_merge = df_merge.toPandas()
pandas_top_spender = df_top_spender.toPandas()
pandas_ws_winner = df_ws_winner.toPandas()
pandas_avg_sal_wins = df_avg_win_fin_filt.toPandas()
pandas_merge_wins_sal.to_csv('spark_question2_wins_salary_last_10.csv')
pandas_merge.to_csv('spark_question2_salary.csv')
pandas_top_spender.to_csv('spark_question2_top_spender.csv')
pandas_ws_winner.to_csv('spark_question2_ws_winner.csv')
pandas_avg_sal_wins.to_csv('spark_question2_avg_sal_wins.csv')
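If the results were too large to collect onto the driver, an alternative would be to write them out with Spark itself, using the same csv package used for reading. A sketch (note that Spark writes a directory of part files rather than a single csv):
# Alternative: write directly from Spark instead of converting to pandas
df_merge.write.format('com.databricks.spark.csv').options(header='true').save('spark_question2_salary_out')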
sc.stop()