This section should mostly be a text introduction that broadly explains the dataset and our goals in fitting a model to it. Our proposal is below; we can use it to inform the intro text. I moved the exploratory analysis and data cleaning code down to the Methods section.
Our proposal is to study trends in Connecticut state employees’ pay using the state employee payroll data provided by the Office of the State Comptroller. Each row in the dataset describes an individual payroll check issued to a state employee in Connecticut, with records starting in 2015, and the dataset has 38 columns. These include many features that can help us identify wage trends, such as ethnicity, sex, age, government agency, and location. The dataset contains over 14 million rows, which gives us plenty of data to work with; to tighten the scope of our study, we plan to isolate a single year (2022). Investigating these wages is of particular interest to us, as Kieran is from Connecticut. From just sorting the data, one state employee appears to make over $11,000,000 a year! That is a strikingly large salary for a government employee, so we can't wait to look into it further.
The data has 38 columns, with `Tot Gross` as the dependent variable.
## Methods
**Cleaning Dataset**
We will start off by fixing some issues with the data.
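```{r}
library(readr)  # problems() comes from readr (assuming the data was read with read_csv())

# Inspect the structure of the 2022 payroll data.
str(state_employee_payroll_2022_data)

# Looks like there is an issue with the `Chk Status` column, which we probably don't need.
problems(state_employee_payroll_2022_data)

# The values are just FALSE and NA, so let's get rid of that column.
state_employee_payroll_2022_data$`Chk Status` <- NULL
```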
* **Observation:** The data has 38 columns, and `Tot Gross` is the dependent variable.
### Exploratory Analysis
Once we had fixed the issues with the data and removed some of the uninfluential variables, we cleaned up some of the factor variables in the dataset.
We can now consider transforming some of the data. The `Term Date` variable records when an employee was terminated. We transform it into a binary predictor indicating whether or not an employee was terminated in 2022, since the date itself is less important than the employment status. (This step is optional.)
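The `Term Date` transformation could look roughly like the sketch below. It runs on a small toy data frame rather than the real payroll data, and both the date format (`YYYY-MM-DD`) and the new column name `terminated_2022` are assumptions made for illustration; the real column may need a different `format` string.

```r
# Toy stand-in for the payroll data. The date format is an assumption.
payroll <- data.frame(
  `Term Date` = c("2022-03-15", NA, "2021-11-30", "2022-12-31"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Parse the dates; rows with no termination date stay NA.
term_date <- as.Date(payroll$`Term Date`, format = "%Y-%m-%d")

# TRUE only when a termination date exists and falls in 2022.
# `terminated_2022` is a hypothetical column name.
payroll$terminated_2022 <- !is.na(term_date) & format(term_date, "%Y") == "2022"

payroll
```

Employees with no termination date are coded `FALSE` here, which treats "still employed" and "terminated in another year" the same way. If that distinction matters for the model, a three-level factor may be a better choice.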
**How to deal with names**
We should fit the model with and without names to see whether first/last/anonymous names influence the regression. We can try creating factor variables from the names or some other approach.
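One possible way to run the with/without comparison is to fit two nested linear models and compare them with an F-test. The sketch below uses made-up toy data: the column names (`tot_gross`, `agency`, `last_name`), the `"Anonymous"` placeholder value, and the `is_anonymous` indicator are all hypothetical and would need to be matched to the real dataset.

```r
# Toy data for illustration only; none of these values come from the payroll data.
toy <- data.frame(
  tot_gross = c(2100, 2500, 1800, 3200, 2950, 2250, 1990, 3100),
  agency    = factor(c("A", "A", "B", "B", "A", "B", "A", "B")),
  last_name = c("Smith", "Anonymous", "Jones", "Anonymous",
                "Lee", "Anonymous", "Chen", "Diaz"),
  stringsAsFactors = FALSE
)

# Hypothetical name-derived predictor: is the name withheld?
toy$is_anonymous <- toy$last_name == "Anonymous"

# Nested models: the second adds only the name-derived predictor.
fit_without <- lm(tot_gross ~ agency, data = toy)
fit_with    <- lm(tot_gross ~ agency + is_anonymous, data = toy)

# F-test for whether the name-derived predictor improves the fit.
comparison <- anova(fit_without, fit_with)
comparison
```

Using a single indicator keeps the model small. Turning every distinct name into a factor level would add a very large number of coefficients on the full dataset, so some form of grouping is likely needed if we go that route.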
**Exploring Collinearity and Correlation of Predictors**
**Model Selection**
**Model Analysis**
## Discussion
## Appendix
**Bad models and old code**
We can use this section to store old models that didn't work or ideas that led to a dead end.