diff --git a/notebooks/question2.ipynb b/notebooks/question2.ipynb
index c9aec23c57e40bb19873e7512ecfffdc1f05ee6b..211de4212cfe734ec60ffbdfb3dfa27f4ba22f58 100644
--- a/notebooks/question2.ipynb
+++ b/notebooks/question2.ipynb
@@ -12,7 +12,7 @@
  "\n",
  "This database has 27 tables. However to obtain the answer for our query above, we need to cross reference data from 2 tables in this database. The Salaries.csv table lists every player that played in major league baseball, along with their team, and their associated salary. This data is only provided for the years 1985 and later. Its schema is listed below. \n",
  "\n",
- "#### Table 1: Master Table Schema\n",
+ "#### Table 1: Salary Table Schema\n",
  "\n",
  "\n",
  "| Field | Description |\n",
@@ -31,7 +31,7 @@
  "\n",
  "The Teams.csv table lists the Team statistics for every team, that has played the game of baseball from 1870 to 2016, along with the year those statistics were recorded. Its schema is listed below\n",
  "\n",
- "#### Table 2 Fielding Table schema\n",
+ "#### Table 2 Team Table schema\n",
  "\n",
  "\n",
  "| Field | Description |\n",
@@ -98,7 +98,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 1,
+ "execution_count": 11,
  "metadata": {
  "collapsed": false
  },
@@ -137,7 +137,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 2,
+ "execution_count": 12,
  "metadata": {
  "collapsed": true
  },
@@ -152,7 +152,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 13,
  "metadata": {
  "collapsed": false
  },
@@ -175,9 +175,9 @@
  },
  {
  "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 14,
  "metadata": {
- "collapsed": true
+ "collapsed": false
  },
  "outputs": [],
  "source": [
@@ -211,7 +211,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 15,
  "metadata": {
  "collapsed": false
  },
@@ -784,7 +784,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 23,
+ "execution_count": 17,
  "metadata": {
  "collapsed": false
  },
diff --git a/notebooks/question2.md b/notebooks/question2.md
index 4934d095e5cb4de5445c05bf2ad6f4aca9ebc81b..75eddd5a833985000817294e34fe6f2a1c545f0c 100644
--- a/notebooks/question2.md
+++ b/notebooks/question2.md
@@ -7,7 +7,7 @@ In order to determine how the effect Team Salary expenditure has on Major League

 This database has 27 tables. However to obtain the answer for our query above, we need to cross reference data from 2 tables in this database. The Salaries.csv table lists every player that played in major league baseball, along with their team, and their associated salary. This data is only provided for the years 1985 and later. Its schema is listed below.

-#### Table 1: Master Table Schema
+#### Table 1: Salary Table Schema


 | Field | Description |
@@ -26,7 +26,7 @@ This database has 27 tables. However to obtain the answer for our query above, w

 The Teams.csv table lists the Team statistics for every team, that has played the game of baseball from 1870 to 2016, along with the year those statistics were recorded. Its schema is listed below

-#### Table 2 Fielding Table schema
+#### Table 2 Team Table schema


 | Field | Description |
diff --git a/results/question2.html b/results/question2.html
index 4cb63158a35b6a0b005a5903aee5da538d397579..b3021eda7ca7db80c7f4f8b7278ca396566aad17 100644
--- a/results/question2.html
+++ b/results/question2.html
@@ -11754,7 +11754,7 @@ div#notebook {
 <h2 id="Does-money-buy-Championships?-How-have-the-Highest-Spending-Major-League-Baseball-Teams-performed-over-Time?">Does money buy Championships? How have the Highest Spending Major League Baseball Teams performed over Time?<a class="anchor-link" href="#Does-money-buy-Championships?-How-have-the-Highest-Spending-Major-League-Baseball-Teams-performed-over-Time?">¶</a></h2><hr>
 <p>In order to determine how the effect Team Salary expenditure has on Major League Baseball Team Performance, we look at Historical Baseball Data available on the Internet. The specific source of data chosen here is a database of baseball statistics over the years 1870 to 2016. <a href="http://www.seanlahman.com/baseball-database.html">http://www.seanlahman.com/baseball-database.html</a></p>
 <p>This database has 27 tables. However to obtain the answer for our query above, we need to cross reference data from 2 tables in this database. The Salaries.csv table lists every player that played in major league baseball, along with their team, and their associated salary. This data is only provided for the years 1985 and later. Its schema is listed below.</p>
-<h4 id="Table-1:-Master-Table-Schema">Table 1: Master Table Schema<a class="anchor-link" href="#Table-1:-Master-Table-Schema">¶</a></h4><table>
+<h4 id="Table-1:-Salary-Table-Schema">Table 1: Salary Table Schema<a class="anchor-link" href="#Table-1:-Salary-Table-Schema">¶</a></h4><table>
 <thead><tr>
 <th>Field</th>
 <th>Description</th>
@@ -11785,7 +11785,7 @@ div#notebook {
 </table>
 <p><em>Note: At the Time of writing, the teamID in the Salaries.csv table for the year 2016 did not follow the convention of teamID's used throughout the rest of the table, and the entire database. Specifically 12 teams had teamIDs that did not match the code that had been used for their teamIDs in previous years. This data was manually cleaned to make sure it did not affect the Results obtained.</em></p>
 <p>The Teams.csv table lists the Team statistics for every team, that has played the game of baseball from 1870 to 2016, along with the year those statistics were recorded. Its schema is listed below</p>
-<h4 id="Table-2-Fielding-Table-schema">Table 2 Fielding Table schema<a class="anchor-link" href="#Table-2-Fielding-Table-schema">¶</a></h4><table>
+<h4 id="Table-2-Team-Table-schema">Table 2 Team Table schema<a class="anchor-link" href="#Table-2-Team-Table-schema">¶</a></h4><table>
 <thead><tr>
 <th>Field</th>
 <th>Description</th>
@@ -12003,7 +12003,7 @@ div#notebook {
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In [1]:</div>
+<div class="prompt input_prompt">In [11]:</div>
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Import SparkContext. This is the main entry point for Spark functionality</span>
@@ -12040,7 +12040,7 @@ div#notebook {
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In [2]:</div>
+<div class="prompt input_prompt">In [12]:</div>
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># The Master will need to change when running on a cluster. </span>
@@ -12057,7 +12057,7 @@ div#notebook {
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In [3]:</div>
+<div class="prompt input_prompt">In [13]:</div>
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># We instantiate a SparkContext object with the SparkConfig</span>
@@ -12083,7 +12083,7 @@ div#notebook {
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In [4]:</div>
+<div class="prompt input_prompt">In [14]:</div>
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># We create a sql context object, so that we can read in csv files easily, and create a data frame</span>
@@ -12115,7 +12115,7 @@ div#notebook {
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In [5]:</div>
+<div class="prompt input_prompt">In [15]:</div>
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Keep the year, team and salary data from the salary table</span>
@@ -12816,7 +12816,7 @@ only showing top 20 rows
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In [23]:</div>
+<div class="prompt input_prompt">In [17]:</div>
 <div class="inner_cell">
 <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="n">sc</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
diff --git a/src/question_2_pyspark.py b/src/question_2_pyspark.py
index 0c51425fa3039605064c9fac28676f4c53f1b231..e31200b19659417eda9c92b948831c31d98383c2 100644
--- a/src/question_2_pyspark.py
+++ b/src/question_2_pyspark.py
@@ -10,7 +10,7 @@
 # tables in this database. The Salaries.csv table lists every player that played in major league baseball, along with their
 # team, and their associated salary. This data is only provided for the years 1985 and later. Its schema is listed below.
 #
-# #### Table 1: Master Table Schema
+# #### Table 1: Salary Table Schema
 #
 #
 # | Field | Description |
@@ -33,7 +33,7 @@
 # The Teams.csv table lists the Team statistics for every team, that has played the game of baseball from 1870 to 2016,
 # along with the year those statistics were recorded. Its schema is listed below
 #
-# #### Table 2 Fielding Table schema
+# #### Table 2 Team Table schema
 #
 #
 # | Field | Description |