Replace Project.Rmd

8b177999 · kbdaly2 · c8d74a94 · 8b177999
Commit 8b177999 authored 2 years ago by kbdaly2
--- a/Project.Rmd
+++ b/Project.Rmd
@@ -29,7 +29,7 @@ opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
 Name           | NetID
 -------------- | -------------
-Kieran Daly    | 
+Kieran Daly    | kbdaly2
 Jack Meyers    | jsmeyrs2
 Serhat Tuncay  | stuncay2
@@ -46,7 +46,8 @@ library(dplyr)
 ```{r, warning=FALSE}
 #We have subsetted the original dataset to only have rows where `Pyrl Fiscal Yr` == 2022 for ease of use
-state_employee_payroll_2022_data = read_csv('./dataset/State_Employee_Payroll_Data_Calendar_Year_2022.zip')
+state_employee_payroll_2022_data = read_csv("State_Employee_Payroll_Data_Calendar_Year_2022.csv")
 ```
 This section should mostly be a text introduction where we broadly explain the dataset and what our goals are in fitting a model to it. Our proposal is below; we can use it to inform the intro text. I moved the exploratory analysis and data cleaning code down to the methods section.
@@ -106,10 +107,62 @@ We can now consider transforming some of the data. Looking at the `Term Date` va
 We should do an approach with and without names to see if first/last/anonymous influences the regression. We can try creating factor variables of the names or some other approach.
+```{r}
+payroll_data
+```
 **Exploring Collinearity and Correlation of Predictors**
 **Model Selection**
+1. Base Model  
+This base model starts from the features that we think could be the most influential. 
+The main factor that determines the wage of a CT state employee should be agency. `Agency` is a factor variable with dozens of levels. Before any further data manipulation, it may be helpful to view the largest and smallest coefficients. 
+```{r}
+mod_start = lm(`Annual Rate`~ Agency, data = payroll_data)
+```
+```{r}
+sort(mod_start$coefficients)[1:5]
+```
+```{r}
+sort(mod_start$coefficients)[80:85]
+```
+The agencies with the lowest average annual salary appear to be CCC in Three Rivers, Middlesex, Gateway, Manchester, and Norwalk. The agencies with the highest average annual salary are the State Board of Ed, Comm on WomenChilSenEquityOpty, CT Innovations Inc, CT Board of Regents, and the Judicial Branch. Note that each agency coefficient is an offset from the reference agency, so this ranking is relative to that baseline. 
+2. Besides agency, age (as a proxy for seniority) might be a statistically significant factor. 
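The ranking step above can be sketched on a small made-up data frame. Everything here (`toy`, its agencies, and the salary values) is hypothetical and only stands in for `payroll_data`. The sketch assumes R's default treatment contrasts, drops the intercept before sorting, and uses `head()`/`tail()` so the result does not depend on hard-coded positions such as `80:85`.

```r
# Hypothetical toy data standing in for payroll_data (values are made up).
toy <- data.frame(
  Agency = factor(rep(c("A", "B", "C", "D"), each = 3)),
  Rate   = c(50, 52, 54,  70, 72, 74,  40, 41, 42,  90, 95, 100)
)

fit <- lm(Rate ~ Agency, data = toy)

# Agency coefficients are offsets from the reference level ("A"),
# so drop the intercept before ranking them.
agency_coefs <- sort(coef(fit)[-1])

head(agency_coefs, 2)  # lowest offsets relative to the reference agency
tail(agency_coefs, 2)  # highest offsets, however many levels exist

# Group means give the ranking on the original scale and, unlike the
# coefficients, include the reference agency itself.
agency_means <- sort(tapply(toy$Rate, toy$Agency, mean))
agency_means
```

Comparing `agency_coefs` with `agency_means` is a quick check that the reference agency has not been silently left out of the "lowest" or "highest" list.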
+```{r}
+mod_2 = lm(`Annual Rate`~ Agency+Age, data = payroll_data)
+```
+From the summary below we can see that age is a statistically significant variable. We need to be wary of predicting outside the range of ages the model was fit on. For example, the regression predicts that a 100-year-old working in the Judicial Branch would make $354906 every year.
+```{r}
+predict(mod_2, newdata=data.frame(Age=100,Agency='Judicial Branch'))
+```
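One way to guard against this kind of extrapolation is to flag new observations whose age falls outside the range seen during fitting. The sketch below uses simulated data; `toy`, `new_obs`, and the simulated salary formula are assumptions for illustration, not values from the payroll dataset.

```r
# Simulated stand-in for payroll_data (all values are made up).
set.seed(1)
toy <- data.frame(Age = round(runif(200, 22, 70)))
toy$Rate <- 30000 + 1500 * toy$Age + rnorm(200, sd = 5000)

fit <- lm(Rate ~ Age, data = toy)

# Hypothetical new employees: one inside the observed ages, one far outside.
new_obs <- data.frame(Age = c(45, 100))

# Flag rows that fall outside the range of ages used to fit the model.
age_range <- range(toy$Age)
new_obs$extrapolated <- new_obs$Age < age_range[1] | new_obs$Age > age_range[2]

# predict() still returns a number for Age = 100; the flag marks it as unreliable.
new_obs$predicted <- predict(fit, newdata = new_obs)
new_obs
```

The same check could be applied to `payroll_data$Age` before interpreting any prediction from `mod_2`.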
+```{r}
+summary(mod_2)
+```
+```{r}
+anova(mod_start,mod_2)
+```
+Based on the ANOVA comparison above, we want to keep the model that adds age: the F statistic is 9377 and the p-value is < 2e-16. 
+(Eventually, backward stepwise selection with `step` using AIC and BIC will go here.)
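As a placeholder for that step, here is a minimal sketch of backward selection with `step()` on simulated data. The data frame `toy` and the irrelevant predictor `Noise` are hypothetical. `step()` uses the AIC penalty `k = 2` by default, and setting `k = log(n)` gives the BIC penalty.

```r
# Simulated stand-in for payroll_data (all values are made up).
set.seed(42)
n <- 300
toy <- data.frame(
  Age    = round(runif(n, 22, 70)),
  Agency = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  Noise  = rnorm(n)  # predictor with no real effect on Rate
)
toy$Rate <- 30000 + 1200 * toy$Age + 8000 * (toy$Agency == "B") +
  rnorm(n, sd = 6000)

# Start from the full model and drop terms one at a time.
full <- lm(Rate ~ Agency + Age + Noise, data = toy)

# Backward selection with AIC (default penalty k = 2).
back_aic <- step(full, direction = "backward", trace = 0)

# Backward selection with BIC (penalty k = log(n)).
back_bic <- step(full, direction = "backward", k = log(n), trace = 0)

formula(back_aic)
formula(back_bic)
```

BIC penalizes each extra parameter more heavily than AIC once `n` exceeds about 7, so it tends to select the smaller model. With a factor as large as `Agency`, that difference could matter.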
+We may run into collinearity between agency and any location zip codes, because many of the agencies have regional salaries. 
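The extreme case of this concern can be sketched with made-up data in which each agency sits in exactly one zip code. The data frame `toy`, its zip codes, and its salary values are all hypothetical. In the real data the overlap is probably only partial, in which case a variance inflation measure such as `car::vif` (which requires the `car` package) would be more informative than this exact-aliasing check.

```r
# Hypothetical data where Zip is fully determined by Agency (values are made up).
toy <- data.frame(
  Agency = factor(rep(c("A", "B", "C"), each = 4)),
  Zip    = factor(rep(c("06101", "06102", "06103"), each = 4)),
  Age    = c(30, 41, 52, 63, 28, 39, 47, 58, 33, 44, 50, 61)
)
toy$Rate <- 40000 + 900 * toy$Age +
  c(0, 5000, -3000)[as.integer(toy$Agency)] +
  c(120, -340, 560, -80, 210, -450, 90, 300, -150, 400, -260, 30)

# A cross-tabulation with one nonzero cell per row shows the nesting directly.
table(toy$Agency, toy$Zip)

fit <- lm(Rate ~ Agency + Zip + Age, data = toy)

# Perfectly collinear terms cannot be estimated, so lm() reports them as NA.
coef(fit)
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased
```

Here the `Zip` dummies carry no information beyond `Agency`, so `lm()` drops them. With partial overlap the coefficients would still be estimated, but with inflated standard errors.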
 **Model Analysis**