Commit badf00d8 authored by Nischol Antao
Code, ipython notebooks, webpage and results for question 7

%% Cell type:markdown id: tags:
## Which players have shown the most improvement in batting average in the post season? Which players have shown the most regression?
____
In order to determine the difference between a player's regular-season and post-season performance, we look at historical baseball data available on the Internet. The specific source chosen here is Sean Lahman's database of baseball statistics covering the years 1870 to 2016: http://www.seanlahman.com/baseball-database.html
This database has 27 tables, but to answer the question above we only need to cross-reference data from 3 of them. The Master.csv table lists every player who has played the game from 1870 to 2016, along with biographical details such as year of birth. Its schema is listed below.
#### Table 1: Master Table Schema
| Field | Description |
| ---------- | -------------------------------------- |
| playerID   | A unique code assigned to each player  |
| birthYear | Year player was born |
| birthMonth | Month player was born |
| birthDay | Day player was born |
| birthCountry | Country where player was born |
| birthState | State where player was born |
| birthCity | City where player was born |
| deathYear | Year player died |
| deathMonth | Month player died |
| deathDay | Day player died |
| deathCountry | Country where player died |
| deathState | State where player died |
| deathCity | City where player died |
| nameFirst | Player's first name |
| nameLast | Player's last name |
| nameGiven | Player's given name |
| weight | Player's weight in pounds |
| height | Player's height in inches |
| bats | Player's batting hand (left, right) |
| throws | Player's throwing hand (left or right) |
| debut | Date that player made first appearance |
| finalGame | Date that player made last appearance |
| retroID | ID used by retrosheet |
| bbrefID | ID used by Baseball Reference website |
The Batting.csv table lists the regular-season batting statistics of every player, for every year that he played between 1870 and 2016. Its schema is listed below.
#### Table 2: Batting Table Schema
| Field | Description |
| -------------- | -------------------------------------- |
| playerID       | A unique code assigned to each player  |
| yearID | Year |
| stint          | Player's stint (order of appearances within a season) |
| teamID | Team |
| lgID | League |
| G | Games Played |
| AB | At Bats |
| R | Runs Scored |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR             | Home Runs                              |
| RBI | Runs Batted In |
| SB | Stolen Bases |
| CS | Caught Stealing |
| BB | Base on Balls |
| SO             | Strikeouts                             |
| IBB            | Intentional Walks                      |
| HBP | Hit by Pitch |
| SH | Sacrifice Hits |
| SF | Sacrifice Flies |
| GIDP | Grounded into Double Plays |
#### Table 3: Post-Season Batting Table Schema
| Field | Description |
| -------------- | -------------------------------------- |
| yearID | Year |
| round | Level of playoffs |
| playerID       | A unique code assigned to each player  |
| teamID | Team |
| lgID | League |
| G | Games Played |
| AB | At Bats |
| R | Runs Scored |
| H | Hits |
| 2B | Doubles |
| 3B | Triples |
| HR             | Home Runs                              |
| RBI | Runs Batted In |
| SB | Stolen Bases |
| CS | Caught Stealing |
| BB | Base on Balls |
| SO             | Strikeouts                             |
| IBB            | Intentional Walks                      |
| HBP | Hit by Pitch |
| SH | Sacrifice Hits |
| SF | Sacrifice Flies |
| GIDP | Grounded into Double Plays |
We use Apache Spark to perform the required database operations. The code below walks through the process of answering these questions and shows how straightforward it is to analyze a dataset like this with Spark. The query is implemented in Python and can be run either on a single server or on a cluster. The example below was run on an Amazon EC2 Free Tier Ubuntu Server instance, set up with Python (Anaconda 3-4.1.1), Java, Scala, py4j, Spark and Hadoop. The code was written and executed in a Jupyter Notebook. Several guides describe how to install and run Spark on an EC2 instance; one that covers all of these steps is https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297
%% Cell type:markdown id: tags:
#### Pyspark Libraries
Import the pyspark libraries that allow Python to interact with Spark. A description of the basic functionality of each of these libraries is provided in the code comments below. A more detailed explanation of each can be found in Apache's Spark documentation: https://spark.apache.org/docs/latest/api/python/index.html
%% Cell type:code id: tags:
``` python
# Import SparkContext. This is the main entry point for Spark functionality.
# Import SparkConf. We use it to change configuration settings easily when switching between local mode and cluster mode.
# Import SQLContext from pyspark.sql. We use it to read in data in CSV format, the native format of our database.
# Import avg, round and sum from pyspark.sql.functions. These are used for the aggregations needed to answer our questions.
# (Note: round and sum shadow the Python built-ins of the same name.)
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import avg
from pyspark.sql.functions import round
from pyspark.sql.functions import sum
```
%% Cell type:markdown id: tags:
#### Pyspark Configuration & Instantiation
We configure Spark for local mode or cluster mode, set our application name, and enable logging of the effective configuration. Several other configuration settings can be programmed as well; a detailed explanation of these can be found at https://spark.apache.org/docs/latest/configuration.html
We pass the configuration to an instance of a SparkContext object, so that we can begin using Apache Spark.
%% Cell type:code id: tags:
``` python
# The master will need to change when running on a cluster.
# To use a specific number of cores we can write local[2] for 2 cores, or local[*] for all available cores.
# All the available configuration settings can be found at https://spark.apache.org/docs/latest/configuration.html
sc_conf = SparkConf().setMaster('local[*]').setAppName('Question7').set('spark.logConf', True)
```
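%% Cell type:markdown id: tags:
As a minimal sketch of what the same configuration might look like against a standalone cluster (the master URL and resource sizes below are illustrative placeholders, not values from this project), additional settings can be chained onto the same `SparkConf` object:
%% Cell type:code id: tags:
``` python
# Hypothetical cluster configuration; the host name and resource sizes are placeholders.
cluster_conf = (SparkConf()
    .setMaster('spark://master-node:7077')  # standalone cluster master URL (placeholder)
    .setAppName('Question7')
    .set('spark.executor.memory', '2g')     # memory per executor
    .set('spark.executor.cores', '2')       # cores per executor
    .set('spark.logConf', True))
```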
%% Cell type:code id: tags:
``` python
# We instantiate a SparkContext object with the SparkConf
sc = SparkContext(conf=sc_conf)
```
%% Cell type:markdown id: tags:
#### Pyspark CSV file Processing
We use the SQLContext to read the CSV files 'BattingPost.csv', 'Batting.csv' and 'Master.csv'. These files are currently stored in Amazon S3 (s3://cs498ccafinalproject/) and are publicly available for download. They were copied over to the local EC2 instance using the AWS command line interface command
```aws s3 cp s3://cs498ccafinalproject . --recursive```
%% Cell type:code id: tags:
``` python
sqlContext = SQLContext(sc)
df_bat_post = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('BattingPost.csv')
df_bat = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Batting.csv')
df_master = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('Master.csv')
```
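%% Cell type:markdown id: tags:
As a quick sanity check (not part of the original analysis), the schema that `inferschema` produced and a few sample rows can be inspected before any transformations:
%% Cell type:code id: tags:
``` python
# Confirm that the batting columns were inferred as numeric types
df_bat.printSchema()
df_bat.select('playerID', 'yearID', 'AB', 'H').show(5)
```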
%% Cell type:markdown id: tags:
#### Pyspark Data Operations to Determine the Difference Between Regular Season and Post Season Batting Averages
In order to determine which players improved or regressed the most in the post season, we perform the following operations (a small worked example follows the list):
1) We select the playerID, Hits (H) and At Bats (AB) columns from the regular season and post season batting tables
2) We clean the data, replacing any null entries with zero
3) We perform an inner join between the regular season batting table and the distinct playerIDs of the post season batting table, removing players who never made the playoffs in their careers. We consider this our new regular season data set
4) We group the regular season and post season tables by playerID and calculate the sum of at bats and the sum of hits for each player in these data frames
5) We filter the post season and regular season data frames to only include players with a meaningful number of career at bats (60 for the post season; 502, the modern single-season batting-title qualification threshold, for the regular season)
6) We perform an inner join between the post season data frame and the regular season data frame, then calculate the difference between post season batting average and regular season batting average in this merged data frame
7) We select only a player's name fields and his playerID from the master table
8) We then perform an inner join between the data frame holding our batting average difference and the filtered master table, so that each playerID is paired with a readable name
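For example, a hypothetical player with 300 hits in 1,000 regular season at bats has AVG = 300/1000 = .300; with 28 hits in 80 post season at bats he has PAVG = 28/80 = .350, giving DIFF = .350 - .300 = +.050, an improvement of fifty points.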
%% Cell type:code id: tags:
``` python
# Keep only the columns we need to calculate a player's batting average,
# replacing null entries with zero
keep = ['playerID', 'AB', 'H']
df_bat_post_data = df_bat_post.select(*keep).na.fill(0)
df_bat_data = df_bat.select(*keep).na.fill(0)
# Keep only the regular season rows of players who also appear in the post season data
df_bat_data = df_bat_data.join(df_bat_post_data.select('playerID').distinct(), ['playerID'], 'inner')
# Sum the H and AB for each player
df_bat_post_data_agg = df_bat_post_data.groupBy(df_bat_post_data.playerID).agg({"H": "sum", "AB": "sum"})
df_bat_data_agg = df_bat_data.groupBy(df_bat_data.playerID).agg({"H": "sum", "AB": "sum"})
# Rename the columns for easier use later
df_bat_post_data_agg = df_bat_post_data_agg.withColumnRenamed('sum(H)', 'sumH').withColumnRenamed('sum(AB)', 'sumAB')
# Keep players with at least 60 post season at bats, then calculate their post season batting average
df_bat_post_data_agg = df_bat_post_data_agg.filter(df_bat_post_data_agg.sumAB >= 60)
df_bat_post_stats = df_bat_post_data_agg.withColumn("PAVG", round(df_bat_post_data_agg.sumH/df_bat_post_data_agg.sumAB, 3))
# Keep players with at least 502 regular season at bats, then calculate their regular season batting average
df_bat_data_agg = df_bat_data_agg.withColumnRenamed('sum(H)', 'sumH').withColumnRenamed('sum(AB)', 'sumAB')
df_bat_data_agg = df_bat_data_agg.filter(df_bat_data_agg.sumAB >= 502)
df_bat_stats = df_bat_data_agg.withColumn("AVG", round(df_bat_data_agg.sumH/df_bat_data_agg.sumAB, 3))
# Calculate the batting average difference between post season and regular season
df_bat_diff = df_bat_post_stats.join(df_bat_stats, ['playerID'], 'inner')
df_bat_diff = df_bat_diff.withColumn("DIFF", round(df_bat_diff.PAVG - df_bat_diff.AVG, 3))
# Attach each player's first and last name
keep = ['playerID', 'nameFirst', 'nameLast']
df_master = df_master.select(*keep)
df_bat_diff = df_bat_diff.join(df_master, ['playerID'], 'inner')
# Keep only the columns we care about
keep = ['playerID', 'nameFirst', 'nameLast', 'DIFF']
df_bat_diff = df_bat_diff.select(*keep)
```
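%% Cell type:markdown id: tags:
The dictionary form of `agg` produces column names like `sum(H)` that must then be renamed. As a sketch of an equivalent, arguably tidier formulation (shown for the post season frame only; not part of the original notebook), explicit aliases from `pyspark.sql.functions` avoid the renaming step:
%% Cell type:code id: tags:
``` python
from pyspark.sql import functions as F

# Same aggregation and filtering as above, naming the sums directly with alias()
df_bat_post_stats_alt = (df_bat_post_data
    .groupBy('playerID')
    .agg(F.sum('H').alias('sumH'), F.sum('AB').alias('sumAB'))
    .filter(F.col('sumAB') >= 60)
    .withColumn('PAVG', F.round(F.col('sumH') / F.col('sumAB'), 3)))
```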
%% Cell type:code id: tags:
``` python
# Display the players that showed the most improvement
df_bat_diff.orderBy(df_bat_diff['DIFF'].desc()).show()
```
%% Output
+---------+---------+-----------+-----+
| playerID|nameFirst| nameLast| DIFF|
+---------+---------+-----------+-----+
| wardjo01| John| Ward|0.125|
|brocklo01| Lou| Brock|0.098|
|stanlmi02| Mike| Stanley|0.086|
|yastrca01| Carl|Yastrzemski|0.084|
| penato01| Tony| Pena|0.078|
|watsobo01| Bob| Watson|0.076|
|martibi02| Billy| Martin|0.076|
|castivi02| Vinny| Castilla|0.074|
|dempsri01| Rick| Dempsey| 0.07|
|valenjo02| John| Valentin|0.068|
|glaustr01| Troy| Glaus|0.067|
|loneyja01| James| Loney|0.066|
|munsoth01| Thurman| Munson|0.065|
|bordepa01| Pat| Borders|0.062|
|molitpa01| Paul| Molitor|0.062|
|ripkeca01| Cal| Ripken| 0.06|
|collihu01| Hub| Collins| 0.06|
| snowjt01| J. T.| Snow|0.059|
|yountro01| Robin| Yount|0.059|
|guillca01| Carlos| Guillen|0.059|
+---------+---------+-----------+-----+
only showing top 20 rows
%% Cell type:code id: tags:
``` python
# Display the players that showed the most regression
df_bat_diff.orderBy(df_bat_diff['DIFF']).show()
```
%% Output
+---------+---------+----------+------+
| playerID|nameFirst| nameLast| DIFF|
+---------+---------+----------+------+
|wilsoda01| Dan| Wilson|-0.171|
|jackstr01| Travis| Jackson|-0.142|
|bumbral01| Al| Bumbry| -0.14|
| haasmu01| Mule| Haas|-0.131|
|hrbekke01| Kent| Hrbek|-0.128|
|hafeych01| Chick| Hafey|-0.112|
|bordimi01| Mike| Bordick|-0.112|
|seageco01| Corey| Seager|-0.112|
|bottoji01| Jim| Bottomley| -0.11|
|lowrije01| Jed| Lowrie|-0.108|
|mcinnst01| Stuffy| McInnis|-0.107|
|bancrda01| Dave| Bancroft|-0.107|
|mclemma01| Mark| McLemore|-0.107|
|galaran01| Andres| Galarraga|-0.106|
| corajo01| Joey| Cora|-0.104|
| cobbty01| Ty| Cobb|-0.104|
|heywaja01| Jason| Heyward|-0.104|
|figgich01| Chone| Figgins|-0.104|
|maxvida01| Dal| Maxvill|-0.103|
|richaha01| Hardy|Richardson|-0.102|
+---------+---------+----------+------+
only showing top 20 rows
%% Cell type:markdown id: tags:
#### Pyspark Test Results
We convert our Spark data frame to a pandas data frame, so that it is easy to save it in a human-readable CSV format. This file contains the answers to the questions we posed.
%% Cell type:code id: tags:
``` python
# Save the results in CSV format
pandas_bat_diff = df_bat_diff.toPandas()
pandas_bat_diff.to_csv('spark_question7_post_season_bat_diff.csv')
```
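%% Cell type:markdown id: tags:
Alternatively, Spark can write the CSV itself, without a round trip through pandas. A minimal sketch using the same com.databricks.spark.csv package in reverse (the output path is illustrative, and Spark writes a directory of part files rather than a single file):
%% Cell type:code id: tags:
``` python
# Write the result directly from Spark; produces a directory of part files
(df_bat_diff
    .coalesce(1)  # collapse to one partition so there is a single part file
    .write.format('com.databricks.spark.csv')
    .option('header', 'true')
    .save('spark_question7_post_season_bat_diff_dir'))
```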
%% Cell type:code id: tags:
``` python
sc.stop()
```
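#### Zeppelin Version
The same pipeline, condensed into a single script for Apache Zeppelin, this time reading the CSV files directly from S3:
``` python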
# Uncomment the line below when running in Zeppelin
#%spark2.pyspark
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import round
# In Zeppelin, sc and sqlContext are provided by the interpreter, so we only need to
# configure S3 access here. The credentials are read from environment variables
# rather than being hardcoded in the script.
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", os.environ['AWS_ACCESS_KEY_ID'])
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", os.environ['AWS_SECRET_ACCESS_KEY'])
df_bat_post = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://cs498ccafinalproject/BattingPost.csv')
df_bat = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://cs498ccafinalproject/Batting.csv')
df_master = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('s3n://cs498ccafinalproject/Master.csv')
# 1) Only keep the playerID, AB, H
# 2) Replace null entries with Zero in the batting stats
keep = ['playerID', 'AB', 'H']
df_bat_post_data = df_bat_post.select(*keep).na.fill(0)
df_bat_data = df_bat.select(*keep).na.fill(0)
# Keep only the regular season rows of players who also appear in the post season data
df_bat_data = df_bat_data.join(df_bat_post_data.select('playerID').distinct(), ['playerID'], 'inner')
# Sum the H and AB for each player
df_bat_post_data_agg = df_bat_post_data.groupBy(df_bat_post_data.playerID).agg({"H": "sum", "AB": "sum"})
df_bat_data_agg = df_bat_data.groupBy(df_bat_data.playerID).agg({"H": "sum", "AB": "sum"})
# Rename the columns for easier use later
df_bat_post_data_agg = df_bat_post_data_agg.withColumnRenamed('sum(H)', 'sumH').withColumnRenamed('sum(AB)', 'sumAB')
df_bat_post_data_agg = df_bat_post_data_agg.filter(df_bat_post_data_agg.sumAB >= 60)
df_bat_post_stats = df_bat_post_data_agg.withColumn("PAVG", round(df_bat_post_data_agg.sumH/df_bat_post_data_agg.sumAB,3))
# Calculate the batting average for each player
df_bat_data_agg = df_bat_data_agg.withColumnRenamed('sum(H)', 'sumH').withColumnRenamed('sum(AB)', 'sumAB')
df_bat_data_agg = df_bat_data_agg.filter(df_bat_data_agg.sumAB >= 502)
df_bat_stats = df_bat_data_agg.withColumn("AVG", round(df_bat_data_agg.sumH/df_bat_data_agg.sumAB,3))
# Calculate the batting average difference between post and regular season
df_bat_diff = df_bat_post_stats.join(df_bat_stats,['playerID'],'inner')
df_bat_diff = df_bat_diff.withColumn("DIFF", round(df_bat_diff.PAVG - df_bat_diff.AVG, 3))
#df_bat_diff.filter(df_bat_diff.playerID == 'soriaal01' ).show()
# Add first and last name to list
keep = ['playerID', 'nameFirst', 'nameLast']
df_master = df_master.select(*keep)
df_bat_diff = df_bat_diff.join(df_master,['playerID'],'inner')
# Only show the stuff we care about
keep = ['playerID', 'nameFirst', 'nameLast', 'DIFF']
df_bat_diff = df_bat_diff.select(*keep)
# Display the values
df_bat_diff.orderBy(df_bat_diff['DIFF'].desc()).show()
df_bat_diff.orderBy(df_bat_diff['DIFF']).show()
# Convert to pandas for visualization
pandas_df = df_bat_diff.toPandas()
```