Commit c8d74a94 authored by Jack Meyers

cleaned up the data a bit

parent ce52741e
@@ -17,8 +17,8 @@ output:
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80)
library(knitr)
options(scipen = 1, digits = 4, width = 80)
opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
```
@@ -30,33 +30,95 @@ opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
Name | NetID
-------------- | -------------
Kieran Daly |
Jack Meyers | jsmeyrs2
Serhat Tuncay | stuncay2
## Introduction
### DataSet
```{r kable,message=FALSE,echo=FALSE,warning=FALSE}
# Libraries, Helpers and read the data.
library(readr)
library(dplyr)
```
```{r, warning=FALSE}
#We have subsetted the original dataset to only have rows where `Pyrl Fiscal Yr` == 2022 for ease of use
state_employee_payroll_2022_data = read_csv('./dataset/State_Employee_Payroll_Data_Calendar_Year_2022.zip')
```
* **Summary:**
This section should mostly be a text introduction where we broadly explain the dataset and what our goals are in fitting a model to it. Our proposal is below; we can use it to inform the intro text. I moved the exploratory analysis and data cleaning code down to the methods section.
Our proposal is to study the Connecticut state employee payroll data, provided by the Office of the State Comptroller, to identify trends in state employees' pay. Each row details an individual payroll check issued to a state employee in Connecticut, with records starting in 2015, and the dataset has 38 columns. Features such as ethnicity, sex, age, government agency, and location can help us identify wage trends. The full dataset contains over 14 million rows, which gives us plenty of data to work with; our plan is to isolate a single year (2022) to tighten the scope of our study. Investigating these wages is of particular interest to us because Kieran is from Connecticut. From just sorting the data, one state employee makes over $11,000,000 a year! That sounds like a large salary for a government employee, so we can't wait to look into it further.
The data has 38 columns, with `Tot Gross` as the dependent variable.
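As a rough illustration of isolating a single year, the sketch below filters a toy data frame on `Pyrl Fiscal Yr`. The toy values and the names `full_payroll` and `payroll_2022` are assumptions made for the example; the real subsetting was done before the file in `./dataset/` was created and may have been done differently.

```r
# Toy stand-in for the full payroll data (hypothetical values);
# the real dataset has 38 columns and millions of rows.
full_payroll = data.frame(
  `Pyrl Fiscal Yr` = c(2021, 2022, 2022, 2023),
  `Tot Gross` = c(1500.25, 2100.00, 1875.50, 1990.75),
  check.names = FALSE  # keep the spaces in the column names
)

# Keep only the rows whose fiscal year is 2022.
payroll_2022 = full_payroll[full_payroll$`Pyrl Fiscal Yr` == 2022, ]

nrow(payroll_2022)
```

The same filter could be written with `dplyr::filter()`, since `dplyr` is already loaded in the report.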
## Methods
**Cleaning Dataset**
We will start off by fixing some issues with the data.
```{r}
str(state_employee_payroll_2022_data) #Looks like there is an issue with the `Chk Status` column which we probably don't need.
problems(state_employee_payroll_2022_data)
#The values are just FALSE and N/A so let's get rid of that
unique(state_employee_payroll_2022_data$`Chk Status`)
#There are a few columns with data that won't help with training a regression model
unique(state_employee_payroll_2022_data$`Pyrl Fiscal Yr`)
unique(state_employee_payroll_2022_data$`Calendar Year`)
unique(state_employee_payroll_2022_data$`State`)
#We can remove the problem column and some other insignificant columns
payroll_data = subset(state_employee_payroll_2022_data, select = -c(`Chk Status`, `Pyrl Fiscal Yr`, `Calendar Year`, `Check #`, `Check Dt`, `State`))
#Now that's looking better
problems(payroll_data)
```
* **Observation:** The data has 38 columns, with `Tot Gross` as the dependent variable.
Once we fixed the issues with the data and removed some of the uninformative variables, we converted the categorical columns in the dataset to factors.
```{r}
payroll_data$Agency = as.factor(payroll_data$Agency)
payroll_data$`Chk Option` = as.factor(payroll_data$`Chk Option`)
payroll_data$City = as.factor(payroll_data$City)
payroll_data$`EE Class Descr` = as.factor(payroll_data$`EE Class Descr`)
payroll_data$`Ethnic Grp` = as.factor(payroll_data$`Ethnic Grp`)
payroll_data$`Full/Part` = as.factor(payroll_data$`Full/Part`)
payroll_data$`Job Cd Descr` = as.factor(payroll_data$`Job Cd Descr`)
payroll_data$`Job Indicator` = as.factor(payroll_data$`Job Indicator`)
payroll_data$`Name Suffix` = as.factor(payroll_data$`Name Suffix`)
payroll_data$Postal = as.factor(payroll_data$Postal)
payroll_data$Sex = as.factor(payroll_data$Sex)
payroll_data$`Union Descr` = as.factor(payroll_data$`Union Descr`)
```
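The twelve `as.factor()` assignments above could also be collapsed into a single step by listing the column names once and applying `as.factor` over them. The sketch below shows the idea on a toy data frame; `toy_payroll`, `factor_cols`, and the values are made up for illustration, and only three of the columns are included.

```r
# Toy data frame using a few of the column names from the chunk above
# (values are hypothetical).
toy_payroll = data.frame(
  Agency = c("A", "B", "A"),
  `Full/Part` = c("F", "P", "F"),
  Sex = c("M", "F", "F"),
  `Tot Gross` = c(2100.00, 950.50, 1875.25),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns to treat as categorical.
factor_cols = c("Agency", "Full/Part", "Sex")

# Convert all listed columns at once; other columns are left untouched.
toy_payroll[factor_cols] = lapply(toy_payroll[factor_cols], as.factor)

sapply(toy_payroll, class)
```

Keeping the names in one vector makes it easier to add or drop a categorical column later without editing a dozen lines.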
We can now consider transforming some of the data. The `Term Date` variable records when an employee was terminated. We could transform it into a binary predictor indicating whether or not an employee was terminated in 2022, since the date itself is less important than the employment status. (This step is optional.)
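One possible way to build that binary predictor is sketched below. It assumes `Term Date` is already parsed as a `Date` and that a missing value means no termination on record; neither has been checked against the real data, and the name `terminated_2022` is hypothetical.

```r
# Hypothetical `Term Date` values; NA is assumed to mean "not terminated".
toy_payroll = data.frame(
  `Term Date` = as.Date(c("2022-03-15", NA, "2019-07-01", "2022-11-30")),
  check.names = FALSE
)

# Extract the year as text; missing dates stay NA.
term_year = format(toy_payroll$`Term Date`, "%Y")

# TRUE only when a termination date exists and falls in 2022.
toy_payroll$terminated_2022 = !is.na(term_year) & term_year == "2022"

toy_payroll
```

If the real column is read in as text, it would first need `as.Date()` with a matching `format` argument.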
**How to deal with names**
We should fit models with and without the name columns to see whether first, last, or anonymous names influence the regression. We could encode the names as factor variables or try some other approach.
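A minimal sketch of the with/without comparison, using simulated data rather than the payroll file: two nested linear models are fit and compared with an F-test via `anova()`. The predictors `hours` and `name_type` and the response `tot_gross` are hypothetical stand-ins, not columns confirmed to exist in the dataset.

```r
set.seed(42)
n = 60

# Simulated toy data: one numeric predictor and one name-derived factor.
toy = data.frame(
  hours = runif(n, 20, 80),
  name_type = factor(sample(c("named", "anonymous"), n, replace = TRUE))
)
toy$tot_gross = 500 + 25 * toy$hours + rnorm(n, sd = 100)

# Model without and with the name-derived factor.
fit_without = lm(tot_gross ~ hours, data = toy)
fit_with = lm(tot_gross ~ hours + name_type, data = toy)

# Nested-model F-test: does adding the name factor improve the fit?
anova(fit_without, fit_with)
```

With the real data, a name factor with thousands of levels would make this comparison expensive, so a coarser encoding may be needed.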
**Exploring Collinearity and Correlation of Predictors**
**Model Selection**
**Model Analysis**
## Discussion
## Appendix
**Bad models and old code**
We can use this section to store old models that didn't work or ideas that led to a dead end.
```
\ No newline at end of file