cleaned up the data a bit

Commit c8d74a94 authored 2 years ago by Jack Meyers
--- a/Project.Rmd
+++ b/Project.Rmd
@@ -17,8 +17,8 @@ output:
 ```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
-options(scipen = 1, digits = 4, width = 80)
 library(knitr)
+options(scipen = 1, digits = 4, width = 80)
 opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
 ```
@@ -30,33 +30,95 @@ opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
 Name           | NetID
 -------------- | -------------
 Kieran Daly    | 
-Jack Meyers    | 
+Jack Meyers    | jsmeyrs2
 Serhat Tuncay  | stuncay2
 ## Introduction
-###DataSet
+### DataSet
 ```{r kable,message=FALSE,echo=FALSE,warning=FALSE}
 # Libraries, Helpers and read the data.
 library(readr)
+library(dplyr)
 ```
-```{r}
+```{r, warning=FALSE}
+# We subset the original dataset to rows where `Pyrl Fiscal Yr` == 2022 for ease of use.
 state_employee_payroll_2022_data = read_csv('./dataset/State_Employee_Payroll_Data_Calendar_Year_2022.zip')
 ```
-* **Summary:**
+This section should mostly be a text introduction where we broadly explain the dataset and our goals in fitting a model to it. Our proposal is below; we can use it to inform the intro text. I moved the exploratory analysis and data cleaning code down to the Methods section.
+Our proposal is to use the Connecticut state employee payroll data, provided by the Office of the State Comptroller, to study trends in state employees’ pay. Each row records an individual payroll check issued to a Connecticut state employee, with records starting in 2015, and the dataset has 38 columns. It includes many features that can help us identify wage trends, such as ethnicity, sex, age, government agency, and location. The dataset contains over 14 million rows, which gives us plenty of data to work with. Our plan is to isolate a single year (2022) to tighten the scope of our study. Investigating these wages is of particular interest to us because Kieran is from Connecticut. From just sorting the data, one state employee makes over $11,000,000 a year! This sounds like a large salary for a government employee, so we can't wait to look into it further.
+The data has 38 columns, with `Tot Gross` as the dependent variable.
+## Methods
+**Cleaning Dataset**
+We will start off by fixing some issues with the data. 
 ```{r}
-str(state_employee_payroll_2022_data)
+#Looks like there is an issue with the `Chk Status` column which we probably don't need.
+problems(state_employee_payroll_2022_data)
+# The only values are FALSE and NA, so we can drop this column
+unique(state_employee_payroll_2022_data$`Chk Status`)
+#There are a few columns with data that won't help with training a regression model
+unique(state_employee_payroll_2022_data$`Pyrl Fiscal Yr`)
+unique(state_employee_payroll_2022_data$`Calendar Year`)
+unique(state_employee_payroll_2022_data$`State`)
+#We can remove the problem column and some other insignificant columns
+payroll_data = subset(state_employee_payroll_2022_data, select = -c(`Chk Status`, `Pyrl Fiscal Yr`, `Calendar Year`, `Check #`, `Check Dt`, `State`))
+#Now that's looking better
+problems(payroll_data)
 ```
-* **Observation** There are 38 columns in data. Variable Tot Gross being the dependent variable.
-###Exploratory Analysis
+After fixing the issues with the data and removing some of the uninformative variables, we convert the categorical columns in the dataset to factors.
 ```{r}
+payroll_data$Agency = as.factor(payroll_data$Agency)
+payroll_data$`Chk Option` = as.factor(payroll_data$`Chk Option`)
+payroll_data$City = as.factor(payroll_data$City)
+payroll_data$`EE Class Descr` = as.factor(payroll_data$`EE Class Descr`)
+payroll_data$`Ethnic Grp` = as.factor(payroll_data$`Ethnic Grp`)
+payroll_data$`Full/Part` = as.factor(payroll_data$`Full/Part`)
+payroll_data$`Job Cd Descr` = as.factor(payroll_data$`Job Cd Descr`)
+payroll_data$`Job Indicator` = as.factor(payroll_data$`Job Indicator`)
+payroll_data$`Name Suffix` = as.factor(payroll_data$`Name Suffix`)
+payroll_data$Postal = as.factor(payroll_data$Postal)
+payroll_data$Sex = as.factor(payroll_data$Sex)
+payroll_data$`Union Descr` = as.factor(payroll_data$`Union Descr`)
+```
+We can now consider transforming some of the data. The `Term Date` variable records when an employee was terminated. Since the exact date matters less than the employment status, we could transform it into a binary predictor indicating whether or not the employee was terminated in 2022. (This step is optional.)
+**How to deal with names**
+We should fit models with and without the name columns to see whether first, last, or anonymous names influence the regression. We could encode the names as factor variables or try some other approach.
+**Exploring Collinearity and Correlation of Predictors**
+**Model Selection**
+**Model Analysis**
+## Discussion
+## Appendix
+**Bad models and old code**
+We can use this section to store old models that didn't work or ideas that led to a dead end.
-```
\ No newline at end of file
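A minimal sketch of two steps described in the Methods text above: converting several columns to factors in one call, and deriving the optional "terminated in 2022" indicator from `Term Date`. It runs on a small made-up table rather than the real payroll file. The toy values, the `factor_cols` subset, the `terminated_2022` column name, and the assumption that `Term Date` is already a `Date` are all illustrative and not taken from the commit.

```r
library(dplyr)

# Toy stand-in for payroll_data; the values and column types are assumptions.
toy_payroll <- tibble(
  Sex = c("F", "M", "F"),
  `Full/Part` = c("F", "P", "F"),
  `Term Date` = as.Date(c(NA, "2022-03-15", "2019-07-01")),
  `Tot Gross` = c(2500, 1800, 3100)
)

# Hypothetical subset of the categorical columns listed in the diff.
factor_cols <- c("Sex", "Full/Part")

toy_payroll <- toy_payroll %>%
  # Convert every listed column to a factor in a single step.
  mutate(across(all_of(factor_cols), as.factor)) %>%
  # TRUE only when a termination date exists and falls in calendar year 2022.
  mutate(terminated_2022 = !is.na(`Term Date`) &
           format(`Term Date`, "%Y") == "2022")

str(toy_payroll)
```

If `read_csv` leaves `Term Date` as a character column in the real data, it would first need parsing with a format string matching the file, which this sketch does not attempt.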
--- a/Project.html
+++ b/Project.html