We were able to use Apache Spark to Extract the Data needed to answer our questions related to the Global Representation of Major League Baseball players, and the growth of the Sport over the Past 15 years. We can now visualize these, so it is easier to see Data Trends. We used Apache Spark to export data to pandas data frames, and csv files. We can now visualize these dataframes using plot functionality built into pandas, that is based on matplotlib.
# Import the necessary libraries to visualize a pandas data frame
# Read the raw data from csv files
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df_raw = pd.read_csv('spark_question1_global_change_last_15.csv', index_col=2)
print (df_raw.filter(items=['country2016', 'countNum2016', 'countNum2001', 'diff', 'percentChange']))
We filter the Country of Origin, and the Change in player representation from our pandas dataframe. We then plot a bar chart showing the Change in Player representation by country from 2001 to 2016. The Data shows that the United States had the greatest increase in number of players (68), with Venezuela in second (52). Puerto Rico showed the greates decline in Numbers (-27)
# Extract only the columns we need
df = df_raw.filter(items=['country2016', 'diff'])
# Color Scale
color_list = plt.cm.tab20c(np.linspace(0,1,20))
# Plot a bar chart, and label the axes
ax = df['diff'].plot(kind='bar', title ="MLB Global Player Representation Change 2001-2016", color=color_list, figsize=(15, 10), fontsize=12)
ax.set_xlabel("Country", fontsize=12)
ax.set_ylabel("Change in Player Representation (count)", fontsize=12)
plt.show()
We can also visualize the Change in Number of players, as a percentage. This highlights countries where the sport may be growing rapidly, as opposed to countries where the sport is already established and may be growing more slowly. Germany and Colombia show large growth, however that is based on the fact that they had very low representation in 2001. Venezuela shows the greatest significant increase over the past 15 years (104%). Aruba shows the greatest numerical percentage decrease, over this time period, due to the fact that they had low representation in 2001 and 2016. Puerto Rico on the other hand shows the greatest significant percentage decrease over this time period (-51%)
# Extract only the columns we need
df = df_raw.filter(items=['country2016', 'percentChange'])
# Color Scale
color_list = plt.cm.tab20c(np.linspace(0,1,20))
# Plot a bar chart, and label the axes
ax = df['percentChange'].plot(kind='bar', title ="MLB Global Player Representation Change Percentage 2001-2016", color=color_list, figsize=(15, 10), fontsize=12)
ax.set_xlabel("Country", fontsize=12)
ax.set_ylabel("Change in Player Representation (Percent)", fontsize=12)
plt.show()
We can track the growth of Baseball in Latin American countries, where the sport is popular, by looking at the Number of Major League Baseball players from some of thes countries from 2001 to 2016. We extract dataframes showing the number of players represented by Venezuela, the Dominican Republic, Puerto Rico and Cuba for the years between 2001 and 2016.
# Read in data that shows each Countries major league baseball representation for the past 15 year.
# Extract individual axes for each country whose data we wish to plot.
df_lam = pd.read_csv('spark_question1_last_15.csv')
ven_ax = df_lam.loc[df_lam['birthCountry'] == 'Venezuela'].rename(index=str, columns={"count":"Venezuela"})
dr_ax = df_lam.loc[df_lam['birthCountry'] == 'D.R.'].rename(index=str, columns={"count":"Dominican Rep."})
pr_ax = df_lam.loc[df_lam['birthCountry'] == 'P.R.'].rename(index=str, columns={"count":"Puerto Rico"})
cub_ax = df_lam.loc[df_lam['birthCountry'] == 'Cuba'].rename(index=str, columns={"count":"Cuba"})
We can visualize the growth by plotting a line chart, that shows the number of Major League Baseball players from Venezuela, Dominican Republic, Puerto Rico and Cuba every year from 2001 to 2016.
plt.close("all")
plt.title("Number of Major League Baseball Players from Latin America (2001-2016)")
plt.ylabel("Players (Count)")
plt.xlabel("Year")
plt.plot( ven_ax['yearID'], ven_ax['Venezuela'], marker='.', markersize=6, color='Coral', linewidth=4)
plt.plot( dr_ax['yearID'], dr_ax['Dominican Rep.'], marker='.', markersize=6, color='MediumAquamarine', linewidth=4)
plt.plot( pr_ax['yearID'], pr_ax['Puerto Rico'], marker='.', markersize=6, color='Plum', linewidth=4)
plt.plot( cub_ax['yearID'], cub_ax['Cuba'], marker='.', markersize=6, color='SkyBlue', linewidth=4)
plt.legend()
plt.show()