cleaned up data, created an initial model

Commit afab1261 authored 1 year ago by Jack Meyers
--- a/Project.Rmd
+++ b/Project.Rmd
@@ -43,130 +43,114 @@ As a part of a state data plan, Connecticut publishes datasets from different ex
 ## Methods
-**Cleaning Dataset**
 ```{r kable,message=FALSE,echo=FALSE,warning=FALSE}
 # Libraries, Helpers and read the data.
 library(readr)
 library(dplyr)
 library(r2r)
+library(faraway)
 ```
+**Importing and Fixing Data**
 ```{r, warning=FALSE}
 state_employee_payroll_2022_data = read_csv("dataset/State_Employee_Payroll_Data_Calendar_Year_2022.zip")
+colnames(state_employee_payroll_2022_data) <- gsub(" ", "", colnames(state_employee_payroll_2022_data))
 ```
 First we subsetted the original dataset to only have rows where `Pyrl Fiscal Yr` was equal to 2022 and saved that to a separate file in order to speed up development. After the data was imported, we decided to see if there were any issues with it.
 ```{r}
 problems(state_employee_payroll_2022_data) 
-unique(state_employee_payroll_2022_data$`Chk Status`)
+unique(state_employee_payroll_2022_data$`ChkStatus`)
 ```
 It looked like there was an issue with the `Chk Status` column, but it didn't seem to contain much useful information anyway so we chose to remove it. There were a few other columns that didn't contribute any meaningful information to a regression so we chose to remove those here too. 
 ```{r}
-unique(state_employee_payroll_2022_data$`Pyrl Fiscal Yr`)
+unique(state_employee_payroll_2022_data$`PyrlFiscalYr`)
-unique(state_employee_payroll_2022_data$`Calendar Year`)
+unique(state_employee_payroll_2022_data$`CalendarYear`)
 unique(state_employee_payroll_2022_data$`State`)
-payroll_data = subset(state_employee_payroll_2022_data, select = -c(`Chk Status`, `Pyrl Fiscal Yr`, `Calendar Year`, `Check #`, `Check Dt`, `State`))
+payroll_data = subset(state_employee_payroll_2022_data, select = -c(`ChkStatus`, `PyrlFiscalYr`, `CalendarYear`, `State`, `TermDate`, `Check#`, `FirstName`, `MiddleInitial`, `LastName`, `Other`, `Bi-WeeklyCompRate`, `Salaries&Wages`, `TotGross`))
+payroll_data[is.na(payroll_data)] <- ""
 problems(payroll_data)
 ```
+**Data Preparation**
 Once the problem with the dataset was solved we converted the categorical predictors into factor variables so that they can be used to train a model.
 ```{r}
 payroll_data$Agency = as.factor(payroll_data$Agency)
-payroll_data$`Chk Option` = as.factor(payroll_data$`Chk Option`)
+payroll_data$CheckDt = as.factor(payroll_data$CheckDt)
+payroll_data$`ChkOption` = as.factor(payroll_data$`ChkOption`)
 payroll_data$City = as.factor(payroll_data$City)
-payroll_data$`EE Class Descr` = as.factor(payroll_data$`EE Class Descr`)
+payroll_data$DeptID = as.factor(payroll_data$DeptID)
-payroll_data$`Ethnic Grp` = as.factor(payroll_data$`Ethnic Grp`)
+payroll_data$`EEClassDescr` = as.factor(payroll_data$`EEClassDescr`)
+payroll_data$`EthnicGrp` = as.factor(payroll_data$`EthnicGrp`)
 payroll_data$`Full/Part` = as.factor(payroll_data$`Full/Part`)
-payroll_data$`Job Cd Descr` = as.factor(payroll_data$`Job Cd Descr`)
+payroll_data$`JobCdDescr` = as.factor(payroll_data$`JobCdDescr`)
-payroll_data$`Job Indicator` = as.factor(payroll_data$`Job Indicator`)
+payroll_data$`JobIndicator` = as.factor(payroll_data$`JobIndicator`)
-payroll_data$`Name Suffix` = as.factor(payroll_data$`Name Suffix`)
+payroll_data$`NameSuffix` = as.factor(payroll_data$`NameSuffix`)
 payroll_data$Postal = as.factor(payroll_data$Postal)
 payroll_data$Sex = as.factor(payroll_data$Sex)
-payroll_data$`Union Descr` = as.factor(payroll_data$`Union Descr`)
+payroll_data$`UnionDescr` = as.factor(payroll_data$`UnionDescr`)
+payroll_data$OrigHire = substr(payroll_data$OrigHire,7,10) #select the year
+payroll_data$OrigHire = strtoi(payroll_data$OrigHire)
 ```
-Since each row of the dataset represented a single paystub, an employee would show up multiple times in different rows. We chose to transform the data in order have only one row per employee since that made our analysis more straightforward.
+Like the example employee below, most state employees have multiple rows in the data set, each corresponding to a single pay stub.
 ```{r}
-empId = payroll_data$`EmplId-Empl Rcd`[1]
+empId = payroll_data$`EmplId-EmplRcd`[1]
-(empRecord = payroll_data[payroll_data$`EmplId-Empl Rcd` == empId, ])
+(empRecord = payroll_data[payroll_data$`EmplId-EmplRcd` == empId, ])
 ```
-Here we filtered the data.
+Since we wanted each row of our data to represent a single employee, we decided to sample the check date with the most paychecks in order to keep as many employees as possible. The particular date was an arbitrary decision; the more influential factor is choosing a consistent date for every employee in order to accurately capture raise or bonus cycles.
 ```{r}
-length(unique(payroll_data$`EmplId-Empl Rcd`))
+dates = as.data.frame(unique(payroll_data$CheckDt))
-ct_employees = hashmap()
+biggest_paycheck = 0
-for(i in 1:nrow(payroll_data)){
+biggest_paycheck_idx = 1
-  employee_id = payroll_data[i,]$`EmplId-Empl Rcd`
+for (i in 1:nrow(dates)){
-  if (is.null(ct_employees[[employee_id]])){
+  paycheck_count = nrow(payroll_data[payroll_data$CheckDt == dates[i,1],])
-    ct_employees[[employee_id]] = payroll_data[i,]
+  if(paycheck_count > biggest_paycheck){
-  }else{
+    biggest_paycheck = paycheck_count
-    if (ct_employees[[employee_id]]$`Annual Rate` < payroll_data[i,]$`Annual Rate`){
+    biggest_paycheck_idx = i
-      ct_employees[[employee_id]] = payroll_data[i,]
-    }
  }
 }
-```
+dates[biggest_paycheck_idx,1]
+biggest_paycheck
-```{r}
-ct_employees_df = as.data.frame(ct_employees)
-test = as.data.frame(values(ct_employees))
-emp = data.frame(oof[1])
-colnames(emp) <- gsub(".", " ", colnames(emp))
-#do I gsub earlier, ughhhh
-oof = values(ct_employees)
-for(i in 2:length(oof)){
-  emp <- rbind(emp, oof[i])
-}
-emp <- rbind(emp, list(oof[3]))
-oof[3]
+biggest_payroll = payroll_data[payroll_data$CheckDt == dates[biggest_paycheck_idx,1],]
-oof[i]
 ```
+We've been able to reduce our data set down to `r nrow(biggest_payroll)` rows, which makes it much easier to train a regression. We decided to remove the check date and employee ID columns at this point as well, since the data has already been filtered.
-With the transformed data, we were finally able to look at the columns and look for any un
 ```{r}
+payroll = subset(biggest_payroll, select= -c(`CheckDt`, `EmplId-EmplRcd`))
 ```
+**Model Selection**
+Once the data was finally organized we were able to test some models. Below we created a model based solely on the remaining numeric predictors.
+```{r}
+payroll_numeric = payroll %>% select_if(~class(.) != 'factor')
+numeric_model = lm(AnnualRate ~ ., data = payroll_numeric)
+```
+While that model performed alright, the goal was to use all of the categorical predictors to inform the regression as well. We ran a backward search using AIC with all of the predictors in order to find a smaller model that would have more predictive power.
-//We can now consider transforming some of the data. Looking at the `Term Date` variable which describes when an employee was terminated, we transform that to a binary predictor which describes whether or not an employee was terminated in 2022 since the date is less important than the state of employment. (We also don't have to do that)
-** data and adding predictors**
-**How to deal with names**
-We should do an approach with and without names to see if first/last/anonymous influences the regression. We can try creating factor variables of the names or some other approach.
 ```{r}
-payroll_data
+reduced_numeric_model = step(numeric_model, direction = "backward", trace = 0)
+summary(reduced_numeric_model)
 ```
 **Exploring Collinearity and Correlation of Predictors**
-**Model Selection**
 **Model Analysis**
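
Note on the date-selection step in `Project.Rmd`: the loop that counts paychecks per `CheckDt` could probably be expressed more compactly with `table()` and `which.max()`. The sketch below is only an illustration on a made-up data frame (`toy_payroll` and its values are hypothetical, not taken from the state dataset), and it assumes each employee receives at most one check per date. On ties, `which.max()` returns the first date in sorted order, which may differ from the order in which the loop encounters dates.

```r
# Toy stand-in for the payroll data; column names mirror the diff, values are invented.
toy_payroll = data.frame(
  `EmplId-EmplRcd` = c("A-0", "A-0", "B-0", "B-0", "C-0"),
  CheckDt = c("01/14/2022", "01/28/2022", "01/14/2022", "01/28/2022", "01/28/2022"),
  AnnualRate = c(50000, 52000, 61000, 61000, 45000),
  check.names = FALSE
)

# Count paychecks per check date, then pick the date with the highest count.
checks_per_date = table(toy_payroll$CheckDt)
busiest_date = names(checks_per_date)[which.max(checks_per_date)]

# Keep only the rows issued on that date.
toy_biggest_payroll = toy_payroll[toy_payroll$CheckDt == busiest_date, ]

# Sanity check for the one-check-per-employee-per-date assumption.
one_row_per_employee = anyDuplicated(toy_biggest_payroll$`EmplId-EmplRcd`) == 0

busiest_date
nrow(toy_biggest_payroll)
```

If the duplicate check fails on the real data, a further per-employee reduction would be needed before treating each row as one employee.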

--- a/Week 12 - Final Data Project.pdf
+++ b/Week 12 - Final Data Project.pdf