Commit 12037bde authored by Nischol Antao's avatar Nischol Antao

Challenges_Encountered for Question 3

parent 9e5811ac
......@@ -19,4 +19,12 @@ for some of the teams in the database. The Entire Database used a fixed conventi
however, this convention was changed for the year 2016 in the Salaries.csv file. When a database
join was performed between the data in the Salaries.csv file and the Teams.csv file, the salary
data for 12 teams was omitted from the results. The data in the Salaries.csv file
had to be manually cleaned to match the convention in the rest of the database to fix this.
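
A mismatch like this can be caught before the join by listing the key values that appear on one side only. The following is a minimal sketch in plain Python, not the code used in the project. It assumes the convention in question concerns a team identifier column; the column name `teamID`, the helper `unmatched_keys`, and all sample rows are hypothetical.

```python
# Hypothetical sample rows; identifiers and amounts are made up for illustration.
teams = [
    {"teamID": "AAA", "name": "Team A"},
    {"teamID": "BBB", "name": "Team B"},
]
salaries = [
    {"teamID": "AAA", "yearID": 2015, "salary": 100},
    {"teamID": "BBX", "yearID": 2016, "salary": 120},  # follows a different convention
    {"teamID": "BBB", "yearID": 2016, "salary": 90},
]


def unmatched_keys(left_rows, right_rows, key):
    """Return the sorted key values found in left_rows but missing from right_rows."""
    right_keys = {row[key] for row in right_rows}
    return sorted({row[key] for row in left_rows if row[key] not in right_keys})


# Salary rows whose team identifier has no partner in the teams data
# would be silently dropped by an inner join.
missing = unmatched_keys(salaries, teams, "teamID")
print(missing)
```

An empty list means every salary row has a matching team. Any value that is listed points at rows that need cleaning before the join.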
IV] Question 4
a) Spark offers no convenient way to calculate percentile data without Hive. The Hive context provides
percentile_approx as a function, but we did not install Hive on our clusters, so we could not
use it. We were able to calculate percentiles for each age group with a more roundabout
approach: windowing the data and using a cumulative distribution. This would still be easier
to code and visualize in a statistical programming language such as R.
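
The windowing idea can be illustrated without a cluster. The sketch below is plain Python and is not the project's Spark code: it partitions rows by group, computes for each row the fraction of rows in its group with a value less than or equal to its own (the quantity that Spark SQL's `cume_dist` window function returns), and then takes the smallest value whose cumulative distribution reaches the requested percentile. The column names `age_group` and `value`, both helper functions, and the sample data are hypothetical.

```python
import bisect
from collections import defaultdict


def cume_dist_by_group(rows, group_key, value_key):
    """Attach to each row the fraction of rows in its group with value <= its own."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    result = []
    for group_rows in groups.values():
        values = sorted(r[value_key] for r in group_rows)
        n = len(values)
        for r in group_rows:
            # bisect_right gives the count of values <= r[value_key]
            rank = bisect.bisect_right(values, r[value_key])
            result.append({**r, "cume_dist": rank / n})
    return result


def percentile_by_group(rows, group_key, value_key, p):
    """For each group, return the smallest value whose cumulative distribution is >= p."""
    out = {}
    for r in cume_dist_by_group(rows, group_key, value_key):
        if r["cume_dist"] >= p:
            g = r[group_key]
            out[g] = min(out.get(g, r[value_key]), r[value_key])
    return out


# Hypothetical data: two age groups with made-up values.
rows = [{"age_group": "20-24", "value": v} for v in (10, 20, 30, 40)]
rows += [{"age_group": "25-29", "value": v} for v in (5, 15, 25, 35, 45)]

medians = percentile_by_group(rows, "age_group", "value", 0.5)
print(medians)
```

This returns an observed value rather than an interpolated one, so results can differ slightly from percentile functions that interpolate between neighbouring values.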