At What Age to Players provide most Value? After how many years in the league are players most productive, and when do their skills start to decline?


In order to determine how the ages at which Players hit their Peak, we look at Historical Baseball Data available on the Internet. The specific source of data chosen here is a database of baseball statistics over the years 1870 to 2016. http://www.seanlahman.com/baseball-database.html

This database has 27 tables. However to obtain the answer for our query above, we need to cross reference data from 2 tables in this database. The Master.csv table lists every player that has played the game from 1870 to 2016, along with their year of birth . Its schema is listed below.

Table 1: Master Table Schema

Field Description
playerID A unique code asssigned to each player
birthYear Year player was born
birthMonth Month player was born
birthDay Day player was born
birthCount Country where player was born
birthState State where player was born
birthCity City where player was born
deathYear Year player died
deathMonth Month player died
deathDay Day player died
deathCount Country where player died
deathState State where player died
deathCity City where player died
nameFirst Player's first name
nameLast Player's last name
nameGiven Player's given name
weight Player's weight in pounds
height Player's height in inches
bats Player's batting hand (left, right)
throws Player's throwing hand (left or right)
debut Date that player made first appearance
finalGame Date that player made last appearance
retroID ID used by retrosheet
bbrefID ID used by Baseball Reference website

The Batting.csv table lists the batting statistics for every player, for every year that he played the game of baseball between 1870 and 2016. Its schema is listed below

Table 2 Batting Table schema

Field Description
playerID A unique code asssigned to each player
yearID Year
stint players stint
teamID Team
lgID League
G Games Played
AB At Bats
R Runs Scored
H Hits
2B Doubles
3B Triples
HR Homeruns
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base on Balls
SO Strike Outs
IBB Intentional Wals
HBP Hit by Pitch
SH Sacrifice Hits
SF Sacrifice Flies
GIDP Grounded into Double Plays

We Utilize Apache Spark to perform the required database operations to answer our questions. The Code below explains the process of answering these questions, and shows how easy it is to use Spark to analyze Big Data. The Code to implement this query is implemented in Python, and can either be run on a local server or a cluster of servers. The example below was run on an Amazon EC2 Free Tier Ubuntu Server instance. The EC2 instance was set up with Python (Anaconda 3-4.1.1), Java, Scala, py4j, Spark and Hadoop. The code was written and executed in a Jupyter Notebook. Several guides are available on the internet describing how to install and run spark on an EC2 instance. One that particularly covers all these facets is https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297

Pyspark Libraries

Import the pyspark libraries to allow python to interact with spark. A description of the basic functionality of each of these libaries is provided in the code comments below. A more detailed explanation of the functionality of each of these libraries can be found in Apache's documentation on Spark https://spark.apache.org/docs/latest/api/python/index.html

In [308]:
# Import SparkContext. This is the main entry point for Spark functionality
# Import Sparkconf. We use Spark Conf to easily change the configuration settings when changing between local mode cluster mode. 
# Import SQLContext from pyspark.sql. We use the libraries here to read in data in csv format. The format of our native database
# Import count, avg, round from pyspark.sql.functions. This is used for the math operations needed to answer our questions
# Import Window from pyspark.sql to allow us to effectively partition and analyze data

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import count
from pyspark.sql.functions import avg
from pyspark.sql.functions import round
from pyspark.sql.functions import cume_dist


from pyspark.sql.window import Window

Pyspark Configuration & Instantiation

We configure spark for local mode or cluster mode, configure our application name, and configure logging. Several other configuration settings can be programmed as well. A detailed explanation of these can be found at https://spark.apache.org/docs/latest/configuration.html

We pass the configuration to an instance of a SparkContext object, so that we can begin using Apache Spark

In [309]:
# The Master will need to change when running on a cluster. 
# If we need to specify multiple cores we can list something like local[2] for 2 cores, or local[*] to use all available cores. 
# All the available Configuration settings can be found at https://spark.apache.org/docs/latest/configuration.html

sc_conf = SparkConf().setMaster('local[*]').setAppName('Question3').set('spark.logConf', True)
In [310]:
# We instantiate a SparkContext object with the SparkConfig

sc = SparkContext(conf=sc_conf)

Pyspark CSV file Processing

We use the SQLContext library to easily allow us to read the csv files 'Salaries.csv' and 'Teams.csv'. These files are currently stored in Amazon s3 storage (s3://cs498ccafinalproject/) and are publicly available for download. They were copied over to a local EC2 instance by using the AWS command line interace command

aws s3 cp s3://cs498ccafinalproject . --recursive

In [311]:
# We create a sql context object, so that we can read in csv files easily, and create a data frame
sqlContext = SQLContext(sc)

df_master = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Master.csv')
df_bat = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Batting.csv')

Pyspark Data Operations to Determine the age a Player is most effective.

In order to determine how the Age a player is most effective, we perform the following operations.

1) We extract columns of data from the 2 tables that we need.

2) We clean the data to remove null entries and filter out entries where a player did not have enough bats in a season to qualify for a batting title (502). This makes sure we remove statistically insignificant entries.

3) We calculate additional batting statistics about the player such as Batting Average, Slugging Percentage, On base Perentage, On base Plus Slugging Percentage .

4) We then perform a merge on the 2 tables, based on the playerID

5) We calculate the age of a player, based on the year he was born and the year he was in the major

NOTE:

Batting Average = Hits/At Bats

Slugging Percentage = total bases divided by at bats: where AB is the number of at-bats for a given player, and 1B, 2B, 3B, and HR are the number of singles, doubles, triples, and home runs, respectively.

On base Percentage = (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies)

On base plus Slugging = On base percentage plus Slugging Percentage

In [312]:
# Keep the playerID and the birthYear from the Master table

keep = [df_master.playerID, df_master.birthYear ]
df_master_data = df_master.select(*keep).filter(df_master.birthYear != "")

# Keep the playerID, yearID, AB, R, H, 2B, 3B, HR, RBI, SB, BB, HBP, SF from the Master Table

keep2 = ['playerID', 'yearID', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'HBP', 'SF']
df_bat_data = df_bat.select(*keep2)

# Replace null entries with Zero in the batting stats
df_bat_no_null = df_bat_data.na.fill(0)

# Filter out statistically insnificant entries for batting
df_bat_filt = df_bat_no_null.filter((df_bat_no_null.AB >= 502)).withColumnRenamed('2B', 'DB').withColumnRenamed('3B', 'TR')

         
In [313]:
# Calculate Advanced batting stats, average, slugging pct, on base pct, on base plus slug pct 

df_bat_stats = df_bat_filt.withColumn("AVG",  round(df_bat_filt.H/df_bat_filt.AB,3)).\
withColumn("SLG", round(((df_bat_filt.H -(df_bat_filt.DB + df_bat_filt.TR + df_bat_filt.HR))+ (2*df_bat_filt.DB) + \
                   (3*df_bat_filt.TR) + (4*df_bat_filt.HR)) /(df_bat_filt.AB),3)).\
withColumn("OBP", round((df_bat_filt.H + df_bat_filt.BB + df_bat_filt.HBP)/(df_bat_filt.AB + df_bat_filt.BB + df_bat_filt.HBP + \
                                                                     df_bat_filt.SF),3))
df_bats_adv_stats = df_bat_stats.withColumn("OPS", round(df_bat_stats.OBP + df_bat_stats.SLG,3))


# Merge the two tables

cond = [df_master_data.playerID == df_bats_adv_stats.playerID]

# Join the Player Table and the Batting Table

df_bats_merge = df_bats_adv_stats.join(df_master_data, cond, 'inner')

# Calculate age of every player in the merged table

df_bats_merge_age = df_bats_merge.withColumn("age", df_bats_merge.yearID - df_bats_merge.birthYear)



df_bats_merge_age.show()
+---------+------+---+---+---+---+---+---+---+---+---+---+---+-----+-----+-----+-----+---------+---------+---+
| playerID|yearID| AB|  R|  H| DB| TR| HR|RBI| SB| BB|HBP| SF|  AVG|  SLG|  OBP|  OPS| playerID|birthYear|age|
+---------+------+---+---+---+---+---+---+---+---+---+---+---+-----+-----+-----+-----+---------+---------+---+
|dalryab01|  1884|521|111|161| 18|  9| 22| 69|  0| 14|  0|  0|0.309|0.505|0.327|0.832|dalryab01|     1857| 27|
|hornujo01|  1884|518|119|139| 27| 10|  7| 51|  0| 17|  0|  0|0.268|  0.4|0.292|0.692|hornujo01|     1857| 27|
|ansonca01|  1886|504|117|187| 35| 11| 10|147| 29| 55|  0|  0|0.371|0.544|0.433|0.977|ansonca01|     1852| 34|
|bierblo01|  1886|522| 56|118| 17|  5|  2| 47| 19| 21|  0|  0|0.226|0.289|0.256|0.545|bierblo01|     1865| 21|
|comisch01|  1886|578| 95|147| 15|  9|  3| 76| 41| 10|  0|  0|0.254|0.327|0.267|0.594|comisch01|     1859| 27|
|corkhpo01|  1886|540| 81|143|  9|  7|  5| 97| 24| 23|  6|  0|0.265|0.335|0.302|0.637|corkhpo01|     1858| 28|
|gleasbi01|  1886|524| 97|141| 18|  5|  0| 61| 19| 43|  7|  0|0.269|0.323|0.333|0.656|gleasbi01|     1858| 28|
|hankifr01|  1886|522| 66|126| 14|  5|  2| 63| 10| 49|  0|  0|0.241|0.299|0.306|0.605|hankifr01|     1856| 30|
|larkihe01|  1886|565|133|180| 36| 16|  2| 74| 32| 59|  7|  0|0.319| 0.45| 0.39| 0.84|larkihe01|     1860| 26|
|lathaar01|  1886|578|152|174| 23|  8|  1| 47| 60| 55|  6|  0|0.301|0.374|0.368|0.742|lathaar01|     1860| 26|
|mannija01|  1886|556| 78|124| 18|  7|  1| 45| 24| 50|  3|  0|0.223|0.286|0.291|0.577|mannija01|     1853| 33|
|mcclebi01|  1886|595|131|152| 33|  9|  1| 68| 43| 56|  2|  0|0.255|0.346|0.322|0.668|mcclebi01|     1856| 30|
|mcphebi01|  1886|560|139|150| 23| 12|  8| 70| 40| 59|  5|  0|0.268|0.395|0.343|0.738|mcphebi01|     1859| 27|
|oneilti01|  1886|579|106|190| 28| 14|  3|107|  9| 47|  7|  0|0.328| 0.44|0.385|0.825|oneilti01|     1858| 28|
|  orrda01|  1886|571| 93|193| 25| 31|  7| 91| 16| 17|  5|  0|0.338|0.527|0.363| 0.89|  orrda01|     1859| 27|
|phillbi01|  1886|585| 68|160| 26| 15|  0| 72| 13| 33|  1|  0|0.274|0.369|0.313|0.682|phillbi01|     1857| 29|
|pinknge01|  1886|597|119|156| 22|  7|  0| 37| 32| 70|  0|  0|0.261|0.322|0.339|0.661|pinknge01|     1859| 27|
|richaha01|  1886|538|125|189| 27| 11| 11| 61| 42| 46|  0|  0|0.351|0.504|0.402|0.906|richaha01|     1855| 31|
|rosemch01|  1886|559| 90|127| 19| 10|  5| 53|  6| 24|  8|  0|0.227|0.324|0.269|0.593|rosemch01|     1856| 30|
|sommejo01|  1886|560| 79|117| 18|  4|  1| 52| 31| 24|  3|  0|0.209|0.261|0.245|0.506|sommejo01|     1858| 28|
+---------+------+---+---+---+---+---+---+---+---+---+---+---+-----+-----+-----+-----+---------+---------+---+
only showing top 20 rows

Pyspark Average Statistics

To put our data into context, we can also look up the following information

1) Group players by their ages

a) Find the Average of the Batting Average for each age group
b) Find the Average of the On base Percentage for each age group
c) Find the Average of the Slugging Percentage for each age group
d) Find the Average of the On Base plus Slugging Percentage for each age group
In [314]:
# Group all the players by their age, and calculate the average batting average for each age group



df_avg_stats = df_bats_merge_age.groupBy(df_bats_merge_age.age).agg({"AVG": "avg","SLG": "avg","OBP": "avg" ,\
                                                                     "OPS": "avg", "age": "count"}).\
orderBy(df_bats_merge_age.age)


df_avg_stats.show()
+---+-------------------+----------+------------------+-------------------+-------------------+
|age|           avg(AVG)|count(age)|          avg(OPS)|           avg(OBP)|           avg(SLG)|
+---+-------------------+----------+------------------+-------------------+-------------------+
| 19| 0.2373333333333333|         3|0.6283333333333333| 0.3113333333333333|              0.317|
| 20| 0.2751428571428572|        14|0.7508571428571428| 0.3387857142857143|0.41207142857142853|
| 21|0.29393846153846154|        65|0.7951692307692305|0.35260000000000014| 0.4425692307692307|
| 22| 0.2806197916666665|       192|0.7581093750000001| 0.3390989583333334|  0.419010416666667|
| 23| 0.2853333333333335|       345|0.7773362318840581|0.34508695652173893|0.43224927536231883|
| 24| 0.2846509803921571|       510|0.7778235294117648| 0.3482215686274508|0.42960196078431384|
| 25| 0.2834105409153954|       721|0.7738072122052704|0.34668099861303786|0.42712621359223296|
| 26| 0.2848235995232417|       839|0.7803325387365907| 0.3491370679380218|0.43119547079856946|
| 27|0.28539377682403416|       932|0.7817263948497853|0.35005042918454915|0.43167596566523614|
| 28|0.28476233183856475|       892|0.7852500000000002| 0.3514417040358744| 0.4338082959641255|
| 29| 0.2860531520395549|       809|0.7887317676143385|0.35305438813349793|0.43567737948084045|
| 30| 0.2844603825136613|       732|0.7843360655737708|0.35189480874316936| 0.4324412568306009|
| 31| 0.2871370481927708|       664|0.7888765060240969|0.35455120481927704| 0.4343253012048188|
| 32|0.28595264116575586|       549|0.7847377049180324|0.35433333333333317| 0.4304043715846994|
| 33| 0.2875866666666671|       450|0.7919955555555558| 0.3560666666666666|0.43592888888888864|
| 34| 0.2866299694189601|       327|0.7902415902140679| 0.3548654434250762| 0.4353761467889906|
| 35|0.28528326180257524|       233|0.7918497854077253| 0.3559570815450642| 0.4358927038626608|
| 36|  0.289639240506329|       158|0.8030316455696203|  0.359981012658228|0.44305063291139263|
| 37| 0.2829727272727273|       110|            0.7764|0.35191818181818174| 0.4244818181818182|
| 38| 0.2794166666666667|        60|0.7771833333333334| 0.3464166666666666| 0.4307666666666666|
+---+-------------------+----------+------------------+-------------------+-------------------+
only showing top 20 rows

Pyspark Median and Quantile Statistics

To put our data into furhter context, we can also look up the following information

1) Group players by their ages

a) Find the Median of the Batting Average for each age group
b) Find the Median of the On base Percentage for each age group
c) Find the Median of the Slugging Percentage for each age group
d) Find the Median of the On Base plus Slugging Percentage for each age group

NOTE: It appears Spark is not able to calculate Quantile information without HIVE. We did not install HIVE on our clusters, so instead we group players by age groups, and calculate a cumulative distribution for the batting average, slugging percentage, ob base percentage and On Base plus slugging percentage. This will allow us to divide players into quantiles. We use these quantiles to provide examples of how to look up the median data for a specific age group

In [315]:
# Calculating median data with HIVE is easy
#df_bats_merge_age.registerTempTable("df")
#df_quant = sqlContext.sql("select age, percentile_approx(AVG,0.5) as approxQuantile from df group by age")

# Without Hive we have to improvise

keep3 = ['yearID', 'AVG', 'SLG', 'OBP', 'OPS', 'age' ]
df_filt_bat_data = df_bats_merge_age.select(*keep3)

windowSpec = Window.partitionBy(df_filt_bat_data['age']).orderBy(df_filt_bat_data['AVG'].desc())
windowSpec2 = Window.partitionBy(df_filt_bat_data['age']).orderBy(df_filt_bat_data['SLG'].desc())
windowSpec3 = Window.partitionBy(df_filt_bat_data['age']).orderBy(df_filt_bat_data['OBP'].desc())
windowSpec4 = Window.partitionBy(df_filt_bat_data['age']).orderBy(df_filt_bat_data['OPS'].desc())

df_med_stats = df_filt_bat_data.withColumn("cumDistAvg", cume_dist().over(windowSpec)).\
withColumn("cumDistSlg", cume_dist().over(windowSpec2)).\
withColumn("cumDistObp", cume_dist().over(windowSpec3)).\
withColumn("cumDistOps", cume_dist().over(windowSpec4))

df_med_stats.show()
+------+-----+-----+-----+-----+---+--------------------+--------------------+--------------------+--------------------+
|yearID|  AVG|  SLG|  OBP|  OPS|age|          cumDistAvg|          cumDistSlg|          cumDistObp|          cumDistOps|
+------+-----+-----+-----+-----+---+--------------------+--------------------+--------------------+--------------------+
|  1997|0.366| 0.72|0.452|1.172| 31|0.010542168674698794|0.001506024096385...|0.016566265060240965|0.001506024096385...|
|  1934|0.363|0.706|0.465|1.171| 31|0.015060240963855422|0.003012048192771...|0.004518072289156626|0.003012048192771...|
|  1938|0.349|0.704|0.462|1.166| 31|  0.0391566265060241|0.004518072289156626|0.007530120481927711|0.004518072289156626|
|  1949|0.343| 0.65| 0.49| 1.14| 31| 0.04819277108433735|0.006024096385542169|0.001506024096385...|0.006024096385542169|
|  2004|0.347| 0.62|0.469|1.089| 31|0.043674698795180725|0.016566265060240965|0.003012048192771...|0.007530120481927711|
|  1951|0.355|0.614|0.449|1.063| 31|  0.0286144578313253| 0.01957831325301205| 0.01957831325301205|0.009036144578313253|
|  2011|0.302|0.608|0.447|1.055| 31| 0.29066265060240964|0.022590361445783132|0.022590361445783132|0.010542168674698794|
|  2006|0.287|0.636|0.413|1.049| 31| 0.48493975903614456|0.009036144578313253| 0.07981927710843373|0.012048192771084338|
|  1966|0.316|0.637| 0.41|1.047| 31| 0.16415662650602408|0.007530120481927711| 0.08734939759036145| 0.01355421686746988|
|  1999|0.304|0.591|0.454|1.045| 31|  0.2740963855421687| 0.03162650602409638|0.012048192771084338|0.015060240963855422|
|  1970|0.329|0.592|0.452|1.044| 31| 0.09036144578313253|  0.0286144578313253|0.016566265060240965|0.016566265060240965|
|  2001|0.291|0.624|0.416| 1.04| 31|  0.4246987951807229|0.015060240963855422| 0.07379518072289157|0.018072289156626505|
|  1927|0.361|0.586|0.448|1.034| 31|0.018072289156626505| 0.03765060240963856| 0.02108433734939759| 0.01957831325301205|
|  2002|0.314|0.598|0.435|1.033| 31| 0.18373493975903615|0.024096385542168676| 0.03463855421686747| 0.02108433734939759|
|  1925|0.393|0.569|0.457|1.026| 31|0.003012048192771...| 0.05120481927710843|0.010542168674698794|0.024096385542168676|
|  1971|0.295|0.628|0.398|1.026| 31| 0.35993975903614456|0.012048192771084338| 0.13102409638554216|0.024096385542168676|
|  2003|0.302|0.593|0.426|1.019| 31| 0.29066265060240964| 0.02710843373493976| 0.05120481927710843|0.025602409638554216|
|  1917|0.383| 0.57|0.444|1.014| 31|0.004518072289156626| 0.04969879518072289|0.024096385542168676|  0.0286144578313253|
|  2003|0.325|0.587|0.427|1.014| 31| 0.10542168674698796| 0.03614457831325301|0.046686746987951805|  0.0286144578313253|
|  1995|0.294|0.577|0.431|1.008| 31| 0.37801204819277107|0.046686746987951805| 0.03614457831325301|0.030120481927710843|
+------+-----+-----+-----+-----+---+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

In [316]:
# Approx Median Batting Average for players of Age 27
# Answer: 0.283

df_med_stats.filter(df_med_stats.age==27).filter("cumDistAvg> 0.495 AND cumDistAvg<0.505").show()
+------+-----+-----+-----+-----+---+------------------+-------------------+-------------------+-------------------+
|yearID|  AVG|  SLG|  OBP|  OPS|age|        cumDistAvg|         cumDistSlg|         cumDistObp|         cumDistOps|
+------+-----+-----+-----+-----+---+------------------+-------------------+-------------------+-------------------+
|  1973|0.283| 0.53| 0.37|  0.9| 27|0.5021459227467812| 0.1072961373390558| 0.2982832618025751|0.13197424892703863|
|  1975|0.283|0.473|0.372|0.845| 27|0.5021459227467812| 0.2800429184549356| 0.2821888412017167| 0.2532188841201717|
|  1961|0.283| 0.47|0.362|0.832| 27|0.5021459227467812|0.28969957081545067|0.36587982832618027| 0.2939914163090129|
|  1977|0.283|0.438|0.362|  0.8| 27|0.5021459227467812| 0.4281115879828326|0.36587982832618027| 0.4012875536480687|
|  1943|0.283|0.395|0.395| 0.79| 27|0.5021459227467812| 0.6459227467811158|0.12124463519313304|0.43776824034334766|
|  2001|0.283|0.448| 0.34|0.788| 27|0.5021459227467812| 0.3927038626609442| 0.5890557939914163|0.44742489270386265|
|  1963|0.283|0.419|0.361| 0.78| 27|0.5021459227467812| 0.5332618025751072|0.38197424892703863|0.48390557939914164|
|  1974|0.283|0.415|0.346|0.761| 27|0.5021459227467812| 0.5515021459227468| 0.5311158798283262|  0.555793991416309|
|  1996|0.283|0.392|0.342|0.734| 27|0.5021459227467812| 0.6630901287553648| 0.5718884120171673|  0.657725321888412|
|  1955|0.283|0.377|0.351|0.728| 27|0.5021459227467812| 0.7317596566523605|0.48068669527896996| 0.6802575107296137|
|  1944|0.283|0.377|0.347|0.724| 27|0.5021459227467812| 0.7317596566523605| 0.5236051502145923| 0.6920600858369099|
|  1972|0.283|0.386|0.334| 0.72| 27|0.5021459227467812| 0.6909871244635193| 0.6523605150214592| 0.7081545064377682|
+------+-----+-----+-----+-----+---+------------------+-------------------+-------------------+-------------------+

In [317]:
# Approx Median OPS for players of Age 30
# Answer: 0.776

df_med_stats.filter(df_med_stats.age==30).filter("cumDistOps> 0.495 AND cumDistOps<0.505").show()
+------+-----+-----+-----+-----+---+-------------------+-------------------+-------------------+-------------------+
|yearID|  AVG|  SLG|  OBP|  OPS|age|         cumDistAvg|         cumDistSlg|         cumDistObp|         cumDistOps|
+------+-----+-----+-----+-----+---+-------------------+-------------------+-------------------+-------------------+
|  1947|0.309|0.416| 0.36|0.776| 30| 0.1953551912568306| 0.5614754098360656| 0.4139344262295082|0.49863387978142076|
|  1899|0.305|0.421|0.355|0.776| 30|0.23770491803278687| 0.5273224043715847|0.46994535519125685|0.49863387978142076|
|  1990|0.273|0.446| 0.33|0.776| 30| 0.6434426229508197|0.39344262295081966| 0.7144808743169399|0.49863387978142076|
|  1989|0.272| 0.45|0.326|0.776| 30| 0.6598360655737705| 0.3770491803278688| 0.7595628415300546|0.49863387978142076|
+------+-----+-----+-----+-----+---+-------------------+-------------------+-------------------+-------------------+

In [318]:
# Approx Median OBP for players of Age 25
# Answer: 0.334

df_med_stats.filter(df_med_stats.age==25).filter("cumDistObp> 0.495 AND cumDistObp<0.505").show()
+------+-----+-----+-----+-----+---+-------------------+-------------------+-------------------+------------------+
|yearID|  AVG|  SLG|  OBP|  OPS|age|         cumDistAvg|         cumDistSlg|         cumDistObp|        cumDistOps|
+------+-----+-----+-----+-----+---+-------------------+-------------------+-------------------+------------------+
|  1986|0.265|0.493|0.344|0.837| 25| 0.7281553398058253|0.20388349514563106|0.49653259361997226|0.2621359223300971|
|  1951|0.277|0.483|0.344|0.827| 25| 0.5617198335644937|0.23855755894590847|0.49653259361997226|0.2912621359223301|
|  1959|0.302| 0.47|0.344|0.814| 25|0.24549237170596394| 0.2843273231622746|0.49653259361997226|0.3342579750346741|
|  1914|0.295|0.404|0.344|0.748| 25| 0.3079056865464632| 0.5839112343966713|0.49653259361997226|0.5644937586685159|
|  1898| 0.27|0.371|0.344|0.715| 25| 0.6601941747572816| 0.7309292649098474|0.49653259361997226|0.7018030513176144|
|  1895|0.254| 0.37|0.344|0.714| 25| 0.8377253814147018|  0.739251040221914|0.49653259361997226|0.7059639389736477|
|  1943| 0.28|0.367|0.344|0.711| 25| 0.5228848821081831| 0.7558945908460472|0.49653259361997226|0.7115117891816921|
+------+-----+-----+-----+-----+---+-------------------+-------------------+-------------------+------------------+

In [319]:
# Approx Median SLG for players of Age 32, find the median Batting average
# Answer: 0.427

df_med_stats.filter(df_med_stats.age==32).filter("cumDistSlg> 0.495 AND cumDistSlg<0.505").show()
+------+-----+-----+-----+-----+---+-------------------+------------------+--------------------+-------------------+
|yearID|  AVG|  SLG|  OBP|  OPS|age|         cumDistAvg|        cumDistSlg|          cumDistObp|         cumDistOps|
+------+-----+-----+-----+-----+---+-------------------+------------------+--------------------+-------------------+
|  1941|0.317|0.427|0.417|0.844| 32| 0.1493624772313297|0.5027322404371585|0.060109289617486336| 0.2568306010928962|
|  1920|0.314|0.427|0.372|0.799| 32| 0.1766848816029144|0.5027322404371585| 0.30965391621129323|0.44990892531876137|
|  1982|0.281|0.427|0.349|0.776| 32| 0.5500910746812386|0.5027322404371585|  0.5428051001821493| 0.5209471766848816|
|  1911|0.291|0.427|0.348|0.775| 32|0.40255009107468126|0.5027322404371585|  0.5573770491803278| 0.5264116575591985|
|  1976| 0.24|0.427|0.335|0.762| 32| 0.9489981785063752|0.5027322404371585|  0.6830601092896175| 0.5737704918032787|
|  2007| 0.29|0.427|0.332|0.759| 32| 0.4207650273224044|0.5027322404371585|  0.7085610200364298|  0.581056466302368|
+------+-----+-----+-----+-----+---+-------------------+------------------+--------------------+-------------------+

Pyspark Test Results

We convert our spark data frames to pandas data frames, so it is easy to save them in a human readable csv format. These files contain the answers to the questions we posed.

In [335]:
# Examples to show how to print the results to an output file



pandas_bats_merge_age = df_bats_merge_age.toPandas()
pandas_avg_stats = df_avg_stats.toPandas()
pandas_med_stats = df_med_stats.toPandas()


pandas_bats_merge_age.to_csv('spark_question3_bat_stats_with_age.csv')
pandas_avg_stats.to_csv('spark_question3_bat_stats_averages_by_age.csv')
pandas_med_stats.to_csv('spark_question3_bat_stats_quantile_by_age.csv')
In [336]:
sc.stop()