While that model performed adequately, the goal was to use some of the categorical predictors to inform the regression as well. We ran a backward search using AIC, starting from all of the factor predictors, to find the factor variables that influence the regression.
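The backward AIC search described above can be sketched on a small synthetic example. Everything here is an assumption made for illustration: `toy_payroll`, its columns (`agency`, `union`, `shift`, `annual_rate`) and the simulated effect sizes are hypothetical and are not taken from `payroll_data`.

```{r}
# Hypothetical toy data: names and values are illustrative only
set.seed(42)
n = 300
toy_payroll = data.frame(
  agency = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  union  = factor(sample(c("yes", "no"), n, replace = TRUE)),
  shift  = factor(sample(c("day", "night"), n, replace = TRUE))
)

# By construction only agency drives the response; union and shift are noise
toy_payroll$annual_rate = 50000 +
  15000 * (toy_payroll$agency == "B") -
  8000 * (toy_payroll$agency == "C") +
  rnorm(n, sd = 5000)

# Start from the model with every factor predictor
toy_full = lm(annual_rate ~ agency + union + shift, data = toy_payroll)

# step() penalizes with k = 2 by default, which corresponds to AIC;
# each backward step drops a whole factor term, not individual levels
toy_reduced = step(toy_full, direction = "backward", trace = 0)

# Inspect which factor terms survived the search
formula(toy_reduced)
```

Because a backward search removes a term only when doing so lowers AIC, the reduced model never has a higher AIC than the starting model.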
```{r}
reduced_numeric_model = step(numeric_model, direction = "backward", trace = 0)
```
**Exploring Collinearity and Correlation of Predictors**
...
@@ -170,7 +172,7 @@ This base model will use several features that we think could be the most influe
We expect agency to be the main determinant of a CT state employee's wage. The `Agency` variable is a factor with dozens of levels, one per agency. Before any further data manipulation, it is helpful to view the largest and smallest coefficients.
```{r}
# Baseline model: annual rate explained by agency alone
mod_start = lm(`AnnualRate` ~ Agency, data = payroll_data)
```
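One way to view the largest and smallest agency coefficients is to sort the fitted coefficients and look at both ends. This is a minimal sketch on a hypothetical toy data set (`toy_agency_data`, with made-up agencies A to D and made-up salaries), not the report's actual code. Note that with R's default treatment contrasts, each coefficient is the difference from the reference level (here agency A), not an absolute average.

```{r}
# Hypothetical toy data: four made-up agencies with three employees each
toy_agency_data = data.frame(
  agency = factor(rep(c("A", "B", "C", "D"), each = 3)),
  annual_rate = c(50, 51, 52, 80, 81, 82, 30, 31, 32, 60, 61, 62) * 1000
)

toy_agency_model = lm(annual_rate ~ agency, data = toy_agency_data)

# Drop the intercept so only the agency effects remain,
# then sort from smallest to largest
agency_coefs = sort(coef(toy_agency_model)[-1])

head(agency_coefs, 2)  # agencies furthest below the reference level
tail(agency_coefs, 2)  # agencies furthest above the reference level
```

The same pattern should apply to `mod_start` by replacing the toy model with it and widening `head()` and `tail()` to however many agencies are worth listing.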
```{r}
...
@@ -186,7 +188,7 @@ It seems like the agencies with the lowest average annual income are CCC in Thre
2. Besides agency, age (as a proxy for seniority) might be a statistically significant factor.
```{r}
# Add age to the agency-only model
mod_2 = lm(`AnnualRate` ~ Agency + Age, data = payroll_data)
```
The summary shows that age is a statistically significant predictor. We need to be wary, however, of the range of ages the model was trained on, because extrapolating beyond it gives implausible results. For example, the regression predicts that a 100-year-old working in the Judicial Branch would make $354,906 every year.
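A simple guard against this kind of extrapolation is to compare the ages being predicted with the range seen during fitting. The sketch below uses a hypothetical toy data set (`toy_age_data`, with made-up ages and salaries) and a model on age alone; it is meant only to illustrate the check, not to reproduce `mod_2`.

```{r}
# Hypothetical toy data: ages and salaries are made up for illustration
toy_age_data = data.frame(
  age = c(22, 30, 38, 45, 53, 61, 65),
  annual_rate = c(42000, 50000, 57000, 66000, 73000, 80000, 86000)
)

toy_age_model = lm(annual_rate ~ age, data = toy_age_data)

# Range of ages the model was actually fitted on
age_range = range(toy_age_data$age)

# New cases to predict: one inside the fitted range, one far outside it
new_employees = data.frame(age = c(40, 100))
new_employees$predicted = predict(toy_age_model, newdata = new_employees)

# Flag predictions that rely on extrapolation beyond the fitted range
new_employees$extrapolated =
  new_employees$age < age_range[1] | new_employees$age > age_range[2]

new_employees
```

Rows flagged as extrapolated should be read with caution, since the linear trend has no data supporting it at those ages.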