We used Apache Spark to extract the data needed to answer our questions about the age at which a player's performance starts to peak, and exported the results to pandas data frames and CSV files. We can now visualize these data frames with matplotlib and seaborn, which makes trends in the data easier to see.
# Import the libraries needed to load and visualize a pandas data frame
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Our data frame contains the batting statistics of players at many different ages. We first draw a box plot of the batting average (AVG) of all players, grouped by age. We overlay a swarm plot of the individual data points to show the distribution within each age group and the number of samples in each group. The data shows that most players are between 22 and 37 years old. There are a few outliers, but they make up a very small portion of the sample. The median batting average appears to increase from age 22 to age 29, stays steady for three to four years, and then starts to decline. However, the median batting averages for the different ages are all very close to each other. Players therefore seem to be most productive between the ages of 29 and 33, after which their skills start to decline, although little separates the age groups overall.
# Set the plot style before the figure is created so that it applies to this plot
sns.set_style(style="ticks")
# Read in the data file that contains the data we wish to visualize
df = pd.read_csv('spark_question3_bat_stats_quantile_by_age.csv')
# Create a box plot of batting average by age and overlay it with a swarm plot
dims = (20, 15)
fig, ax = plt.subplots(figsize=dims)
sns.boxplot(x='age', y='AVG', data=df, orient="v", color='crimson', saturation=1, ax=ax)
sns.swarmplot(x='age', y='AVG', data=df, orient="v", color="black", size=1, ax=ax)
plt.show()
plt.close("all")
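The medians and sample counts read off the box plot can also be tabulated directly. The following is a minimal sketch, assuming a frame with the `age` and `AVG` columns used above. The helper name `median_by_age` and the small sample frame are hypothetical and exist only to illustrate the call; they are not part of the original analysis.

```python
import pandas as pd


def median_by_age(frame, stat):
    """Return the sample count and median of `stat` for each age."""
    grouped = frame.groupby("age")[stat]
    return grouped.agg(["count", "median"])


# Hypothetical values for illustration only, not taken from the real data set.
sample = pd.DataFrame({
    "age": [22, 22, 22, 29, 29, 29],
    "AVG": [0.240, 0.250, 0.260, 0.270, 0.280, 0.290],
})
summary = median_by_age(sample, "AVG")
print(summary)
```

With the real data, `median_by_age(df, 'AVG')` would return one row per age, which makes it easier to check how close the medians are and how few samples the outlying ages contain.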
When we look at on-base percentage (OBP) alone, there is no definite trend relating age to performance. The median on-base percentages of players between the ages of 22 and 37 are all very close to each other. The window between the first and third quartiles does not show a definite rising or falling trend either.
# Reuse the data frame that was loaded from the CSV file earlier
# Create a box plot of on-base percentage by age and overlay it with a swarm plot
dims = (20, 15)
fig, ax = plt.subplots(figsize=dims)
sns.boxplot(x='age', y='OBP', data=df, orient="v", color='c', saturation=1, ax=ax)
sns.swarmplot(x='age', y='OBP', data=df, orient="v", color="black", size=1, ax=ax)
plt.show()
plt.close("all")
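The quartile window discussed for on-base percentage can be computed per age instead of being judged by eye. This is a rough sketch under the assumption that the frame has `age` and `OBP` columns as in the plots. The helper `quartile_window` and the sample values are hypothetical.

```python
import pandas as pd


def quartile_window(frame, stat):
    """Return the first quartile, third quartile, and their difference per age."""
    grouped = frame.groupby("age")[stat]
    window = pd.DataFrame({
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    # Width of the box drawn by the box plot for each age
    window["iqr"] = window["q3"] - window["q1"]
    return window


# Hypothetical values for illustration only, not taken from the real data set.
sample = pd.DataFrame({
    "age": [25, 25, 25, 25, 25, 30, 30, 30, 30, 30],
    "OBP": [0.300, 0.310, 0.320, 0.330, 0.340,
            0.310, 0.320, 0.330, 0.340, 0.350],
})
window = quartile_window(sample, "OBP")
print(window)
```

Calling `quartile_window(df, 'OBP')` on the real frame would show whether the `q1` and `q3` columns drift upward or downward with age.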
When we look at OPS (on-base plus slugging) by age, this time with a violin plot, we do not see a definite trend either. A player's OPS slowly increases from age 25 to 29, dips from age 31 to 34, rises again from age 34 to 36, and then starts to dip again. So there is no definite correlation between age and on-base plus slugging.
# Create a violin plot of OPS by age, reusing the data frame loaded earlier
dims = (20, 12)
fig, ax = plt.subplots(figsize=dims)
sns.violinplot(x='age', y='OPS', data=df, orient="v", ax=ax)
plt.show()
plt.close("all")
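The conclusion above is drawn from the plots alone. One way to put a number on it is a correlation coefficient between age and OPS. The sketch below uses the Pearson coefficient from pandas, which is an addition to the original analysis and only measures linear association, so a rise-dip-rise pattern would not be captured by it. The helper `age_correlation` and the sample values are hypothetical; the `age` and `OPS` column names are assumed to match the data frame used in the plots.

```python
import pandas as pd


def age_correlation(frame, stat):
    """Return the Pearson correlation between age and the given statistic."""
    return frame["age"].corr(frame[stat])


# Hypothetical values for illustration only, not taken from the real data set.
sample = pd.DataFrame({
    "age": [25, 27, 29, 31, 34, 36],
    "OPS": [0.700, 0.720, 0.740, 0.720, 0.700, 0.720],
})
r = age_correlation(sample, "OPS")
print(f"Pearson r between age and OPS: {r:.3f}")
```

A value of `r` near zero on the real data would be consistent with the lack of a clear linear trend seen in the violin plot, while a value near 1 or -1 would contradict it.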