Challenges_Encountered for Question 3

Commit 12037bde authored 6 years ago by Nischol Antao
--- a/docs/Challenges_Encountered.txt
+++ b/docs/Challenges_Encountered.txt
@@ -19,4 +19,12 @@ for some of the teams in the database. The Entire Database used a fixed conventi
 however this convention was changed for the Year 2016 in the Salaries.csv file. When a database
 join was performed between the data in the Salaries.csv file and the Teams.csv file, this resulted 
 in the salary data for 12 teams being omitted from the results. The data in the Salaries.csv file
-had to be manually cleaned to match the convention in the rest of the database to fix this. 
\ No newline at end of file
+had to be manually cleaned to match the convention in the rest of the database to fix this. 
+
+IV] Question 4
+
+a) There is no good way to calculate percentile data in Spark without Hive. The Hive context
+provides a percentile_approx function, but we did not install Hive on our clusters and so could
+not use it. We were able to calculate percentiles for each age group using a more roundabout
+approach: windowing the data and using a cumulative distribution. This would still be easier
+to code and visualize in a statistical programming language such as R.
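
The cumulative-distribution workaround described in the added note can be illustrated with a minimal plain-Python sketch. This is not the project's Spark code. The function name `percentile_by_group`, the `(group, value)` row shape, and the sample numbers are all hypothetical. The sketch mirrors what a window partitioned by age group and ordered by value would compute with a cumulative distribution (Spark exposes this as the `cume_dist` window function), then takes the smallest value whose cumulative share reaches the requested percentile.

```python
from bisect import bisect_right
from collections import defaultdict


def percentile_by_group(rows, p):
    """Return {group: percentile value} for rows of (group, value).

    Hypothetical helper: for each group, pick the smallest value whose
    cumulative distribution (share of rows <= value) is at least p.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")

    # "Partition" the rows by group, as a window's PARTITION BY would.
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)

    result = {}
    for group, values in groups.items():
        values.sort()  # the window's ORDER BY
        n = len(values)
        for v in values:
            # Cumulative distribution of v: rows with value <= v, over n.
            cume_dist = bisect_right(values, v) / n
            if cume_dist >= p:
                result[group] = v
                break
    return result


# Made-up sample data: (age group, value)
sample = [(20, 100), (20, 200), (20, 300), (20, 400), (30, 50), (30, 150)]
medians = percentile_by_group(sample, 0.5)
```

This selects an actual data point rather than interpolating between neighbours, so results can differ slightly from an interpolating percentile function on small groups.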