This section should mostly be a text introduction that broadly explains the dataset and our goals in fitting a model to it. Our proposal is below; we can use it to inform the intro text. I moved the exploratory analysis and data cleaning code down to the methods section.
...
@@ -106,10 +107,62 @@ We can now consider transforming some of the data. Looking at the `Term Date` va
We should fit the model both with and without the name variables to see whether first/last/anonymous names influence the regression. One option is to encode the names as factor variables; other encodings are worth considering too.
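One way to run the with/without comparison is a partial F-test on nested models. The chunk below is only a sketch on synthetic data: the `toy` data frame and the columns `agency`, `name_type`, and `annual_rate` are hypothetical stand-ins, not columns confirmed to exist in `payroll_data`.

```{r}
# Synthetic stand-in for payroll_data; every column here is hypothetical.
set.seed(1)
n <- 300
toy <- data.frame(
  agency    = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  name_type = factor(sample(c("first", "last", "anonymous"), n, replace = TRUE))
)
# Pay is simulated from agency only, so name_type should add little.
toy$annual_rate <- 50000 + 8000 * as.integer(toy$agency) + rnorm(n, sd = 5000)

mod_without <- lm(annual_rate ~ agency, data = toy)
mod_with    <- lm(annual_rate ~ agency + name_type, data = toy)

# Partial F-test: does the name factor explain variance beyond agency?
name_test <- anova(mod_without, mod_with)
name_test
```

A small p-value in the second row would suggest the name factor carries information beyond agency. On the real data the same pattern should apply once the name variable has been encoded.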
```{r}
payroll_data
```
**Exploring Collinearity and Correlation of Predictors**
**Model Selection**
1. Base Model
The base model starts from the features we expect to be most influential.
We expect agency to be the main determinant of a CT state employee's wage. `Agency` is a factor variable with dozens of levels. Before any further data manipulation, it is helpful to look at the largest and smallest coefficients.
```{r}
mod_start = lm(`Annual Rate`~ Agency, data = payroll_data)
```
```{r}
# Five smallest coefficients (agencies furthest below the reference agency)
sort(mod_start$coefficients)[1:5]
```
```{r}
# Six largest values, without hard-coding the coefficient count; the intercept may be among them
tail(sort(mod_start$coefficients), 6)
```
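Because `Agency` is the only predictor, each coefficient is the difference between that agency's mean `Annual Rate` and the mean of the reference agency, so sorting the coefficients ranks agencies by average pay. The agencies with the lowest average annual rate appear to be CCC in Three Rivers, Middlesex, Gateway, Manchester, and Norwalk. The agencies with the highest average annual rate are the State Board of Ed, Comm on WomenChilSenEquityOpty, CT Innovations Inc, CT Board of Regents, and the Judicial Branch.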
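To make the link between coefficients and agency averages concrete, here is a minimal sketch on toy data. The agency labels and pay values are invented for illustration, and it assumes R's default treatment contrasts.

```{r}
# Toy data with hypothetical agency labels and pay values.
toy <- data.frame(
  agency = factor(rep(c("A", "B", "C"), each = 4)),
  annual_rate = c(40, 42, 44, 46,  60, 62, 64, 66,  50, 52, 54, 56) * 1000
)
mod_toy <- lm(annual_rate ~ agency, data = toy)

# Intercept = mean of reference level "A"; other coefficients are offsets from it.
coefs <- coef(mod_toy)
fitted_means <- c(A = coefs[["(Intercept)"]],
                  B = coefs[["(Intercept)"]] + coefs[["agencyB"]],
                  C = coefs[["(Intercept)"]] + coefs[["agencyC"]])

# Direct group means for comparison.
group_means <- tapply(toy$annual_rate, toy$agency, mean)
all.equal(unname(fitted_means), as.numeric(group_means))
```

Note that the intercept is on a different scale from the offsets: it is a mean, not a difference. When ranking agencies from sorted coefficients, it is probably safest to set the intercept aside and treat the reference agency as having an offset of zero.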
2. Aside from agency, age (as a proxy for seniority) might be a statistically significant predictor.
```{r}
mod_2 = lm(`Annual Rate`~ Agency+Age, data = payroll_data)
```
The model summary shows that age is a statistically significant predictor. We need to be wary of the range of ages the model was fit on, since predictions outside that range are extrapolations. For example, the regression predicts that a 100-year-old Judicial Branch employee would make $354,906 every year.
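A simple guard against this kind of extrapolation is to flag any prediction whose age falls outside the range seen during fitting. The chunk below is a sketch on synthetic data; `toy`, `age`, `annual_rate`, and the simulated pay formula are all assumptions for illustration, not the real `payroll_data`.

```{r}
# Synthetic data; column names and values are hypothetical.
set.seed(2)
toy <- data.frame(age = sample(22:70, 200, replace = TRUE))
toy$annual_rate <- 30000 + 900 * toy$age + rnorm(200, sd = 4000)
mod_age <- lm(annual_rate ~ age, data = toy)

new_people <- data.frame(age = c(45, 100))
age_range <- range(toy$age)

# Flag rows whose age lies outside the range seen when fitting.
new_people$extrapolated <- new_people$age < age_range[1] | new_people$age > age_range[2]
new_people$predicted <- predict(mod_age, newdata = new_people)
new_people
```

Rows with `extrapolated` set to `TRUE` still get a prediction, but it should be read with caution or dropped from any reported results.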