First, we subset the original dataset to the rows where `Pyrl Fiscal Yr` equals 2022 and saved the result to a separate file to speed up development. After importing the data, we checked it for issues.
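The subsetting step could look roughly like the sketch below. It runs on a small made-up data frame; only the `Pyrl Fiscal Yr` column name comes from the text, while `Empl Id`, `Gross Pay`, the object names, and the temporary file path are assumptions for illustration.

```r
# Toy stand-in for the full payroll export (hypothetical values).
full_payroll <- data.frame(
  `Pyrl Fiscal Yr` = c(2021, 2022, 2022, 2023),
  `Empl Id`        = c("A1", "A1", "B2", "B2"),   # hypothetical column
  `Gross Pay`      = c(1000, 1100, 2000, 2100),   # hypothetical column
  check.names = FALSE                             # keep spaces in column names
)

# Keep only the rows from fiscal year 2022.
payroll_2022 <- full_payroll[full_payroll[["Pyrl Fiscal Yr"]] == 2022, ]

# Save the smaller extract so later sessions can load it directly.
subset_path <- tempfile(fileext = ".csv")         # hypothetical location
write.csv(payroll_2022, subset_path, row.names = FALSE)

# Reload the extract, again preserving the original column names.
payroll_subset <- read.csv(subset_path, check.names = FALSE)
```

Reading with `check.names = FALSE` keeps names such as `Pyrl Fiscal Yr` intact, at the cost of needing `[[ ]]` or backticks to reference them.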
The `Chk Status` column appeared to have an issue, but it contained little useful information, so we removed it. A few other columns contributed nothing meaningful to a regression, so we removed those here as well.
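A minimal sketch of this kind of check and column removal is shown below. The text does not say what the issue with `Chk Status` was, so counting missing values is only one generic way to look for problems; the toy data and every column other than `Chk Status` are hypothetical.

```r
# Toy data; only `Chk Status` is a column named in the text.
toy_payroll <- data.frame(
  `Chk Status` = c("ISSUED", NA, "ISSUED"),
  `Dept Name`  = c("Parks", "Parks", "Revenue"),  # hypothetical column
  `Gross Pay`  = c(1200, 1250, 2100),             # hypothetical column
  check.names = FALSE
)

# Count missing values per column as a quick data-quality check.
missing_per_column <- colSums(is.na(toy_payroll))

# Columns to drop; extend this vector with any other uninformative columns.
drop_cols <- c("Chk Status")
toy_payroll_trimmed <- toy_payroll[, !(names(toy_payroll) %in% drop_cols), drop = FALSE]
```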
Since each row of the dataset represented a single paystub, an employee would show up multiple times in different rows. We chose to transform the data to have only one row per employee, since that made our analysis more straightforward.
Like the example employee below, most state employees have multiple rows in the data set, each corresponding to a payroll stub.
Since we wanted each row of our data to represent a single employee, we kept only the paychecks from the single date with the most paychecks, which maximized the number of employees retained. The specific date was arbitrary; what matters more is using the same date for every employee so that raise and bonus cycles are captured consistently.
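The date selection described above can be sketched as follows, using a made-up table of paystubs. The column names `empl_id`, `check_date`, and `gross_pay` are hypothetical stand-ins for the real payroll columns.

```r
# Toy paystub table (hypothetical column names and values).
toy_stubs <- data.frame(
  empl_id    = c("A1", "A1", "B2", "B2", "C3"),
  check_date = as.Date(c("2022-03-15", "2022-06-15", "2022-06-15",
                         "2022-09-15", "2022-06-15")),
  gross_pay  = c(1000, 1050, 2000, 2100, 1500)
)

# Count paychecks per date and pick the date with the most of them.
checks_per_date <- table(toy_stubs$check_date)
busiest_date <- as.Date(names(checks_per_date)[which.max(checks_per_date)])

# Keep only the stubs issued on that date.
one_per_employee <- toy_stubs[toy_stubs$check_date == busiest_date, ]
```

This yields one row per employee only if nobody received two stubs on the chosen date; otherwise a further de-duplication step would be needed.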
We were able to reduce our data set to `r nrow(payroll_data)` rows, which made it much easier to train a regression. Because the data had already been filtered, we also removed the check date and employee ID columns at this point.
With the transformed data, we were finally able to look at the columns and look for any un
While that model performed reasonably well, the goal was to use all of the categorical predictors to inform the regression as well. We ran a backward search using AIC, starting from all of the predictors, to find a smaller model with more predictive power.
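As a self-contained illustration of a backward AIC search over mixed numeric and categorical predictors, the sketch below uses the built-in `mtcars` data as a stand-in for the payroll data. The object names are hypothetical and the results say nothing about the payroll models.

```r
# mtcars ships with R; treat two of its columns as categorical predictors.
cars_data <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Full model: every numeric and categorical predictor together.
full_model <- lm(mpg ~ ., data = cars_data)

# Backward elimination; step() penalizes with k = 2 by default, i.e. AIC.
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Compare model size and AIC before and after the search.
n_coef_full    <- length(coef(full_model))
n_coef_reduced <- length(coef(reduced_model))
aic_full       <- AIC(full_model)
aic_reduced    <- AIC(reduced_model)
```

Because a backward search only drops terms that lower the criterion, the reduced model never has a higher AIC than the starting model, though a lower AIC does not by itself guarantee better out-of-sample predictions.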
//We can now consider transforming some of the data. The `Term Date` variable describes when an employee was terminated; we transform it into a binary predictor indicating whether or not an employee was terminated in 2022, since the date itself is less important than the state of employment. (We also don't have to do that.)
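One possible way to derive such an indicator is sketched below. It assumes that `Term Date` is stored as ISO-formatted text, that it is missing for employees who were never terminated, and that "2022" means the calendar year; none of these is confirmed by the text, and the new column name is hypothetical.

```r
# Toy data; the date format and the use of NA are assumptions.
toy_employees <- data.frame(
  `Term Date` = c("2022-05-31", NA, "2021-11-30", "2022-12-15"),
  check.names = FALSE
)

# Parse the termination dates (assumes "YYYY-MM-DD" strings).
term_date <- as.Date(toy_employees[["Term Date"]])

# TRUE only when a termination date exists and falls in calendar year 2022.
toy_employees$terminated_2022 <-
  !is.na(term_date) & format(term_date, "%Y") == "2022"
```

Checking `!is.na(term_date)` first keeps employees without a termination date as `FALSE` rather than `NA`.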
** data and adding predictors**
**How to deal with names**
We should fit the regression with and without names to see whether first/last/anonymous influences it. We can try creating factor variables from the names, or some other approach.
```{r}
# Preview the transformed payroll data.
payroll_data

# Backward selection by AIC, starting from the numeric-only model.
reduced_numeric_model = step(numeric_model, direction = "backward", trace = 0)
summary(reduced_numeric_model)
```
**Exploring Collinearity and Correlation of Predictors**