Commit dd4fefb8 authored by Jack Meyers's avatar Jack Meyers

added an intro and kept cleaning up the data, now ready, I'll keep trying to figure it out

Serhat Tuncay | stuncay2
## Introduction
### Dataset
As part of a state data plan, Connecticut publishes datasets from its executive branch agencies to a website where citizens and other agencies can download and interact with them. Our team explored the site and chose to analyze the state employee payroll dataset, provided by the office of the state comptroller, in order to learn which factors influence a state employee's salary. The full dataset contained 38 columns and over 15 million rows, so we focused only on the year 2022 to reduce the number of rows we needed to analyze. After filtering for 2022, we were left with almost 1.7 million rows, each containing a single payroll stub with columns such as annual rate, age, job code description, sex, and ethnic group. (Simply sorting the data revealed that one state employee makes over $11,000,000 a year, which sounds large for a government employee and made us eager to dig further.) Our goal was to use the remaining predictors in each payroll row to predict the `Annual Rate` variable, which represents an employee's salary. To do this we transformed the dataset from rows of payroll stubs into employee-level data, so that each row was a unique employee and could serve as a single observation of that employee's salary, demographics, and employment. To understand how a Connecticut state employee's salary is influenced by their demographics and specific job, we analyzed this dataset using a variety of methods and produced a regression model that helps describe those influences.
## Methods
**Cleaning Dataset**
```{r kable,message=FALSE,echo=FALSE,warning=FALSE}
# Libraries, Helpers and read the data.
library(readr)
library(dplyr)
library(r2r)
```
```{r, warning=FALSE}
# We have subsetted the original dataset to only the rows where `Pyrl Fiscal Yr` == 2022 for ease of use
state_employee_payroll_2022_data = read_csv("dataset/State_Employee_Payroll_Data_Calendar_Year_2022.zip")
```
We will start off by fixing some issues with the data.
First we subsetted the original dataset to only have rows where `Pyrl Fiscal Yr` was equal to 2022 and saved that to a separate file in order to speed up development. After the data was imported, we decided to see if there were any issues with it.
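That one-time preprocessing step can be sketched as follows. This chunk is not evaluated when knitting (the subset was only saved once), and the file name of the full download is an assumption:

```{r, eval=FALSE}
# One-time preprocessing (not run when knitting): filter the full download,
# whose file name is assumed here, down to the 2022 fiscal year and save it.
full_payroll = readr::read_csv("State_Employee_Payroll_Data.csv")
payroll_2022 = dplyr::filter(full_payroll, `Pyrl Fiscal Yr` == 2022)
readr::write_csv(payroll_2022, "dataset/State_Employee_Payroll_Data_Calendar_Year_2022.csv")
```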
```{r}
# `problems()` reports parsing issues; they point at the `Chk Status` column,
# which we probably don't need anyway.
problems(state_employee_payroll_2022_data)
# Its values are just FALSE and NA, so let's get rid of it.
unique(state_employee_payroll_2022_data$`Chk Status`)
```
It looked like there was an issue with the `Chk Status` column, but since it didn't contain much useful information anyway, we chose to remove it, along with a few other columns that contributed nothing meaningful to a regression.
```{r}
# There are a few columns with constant values that won't help with training a regression model
unique(state_employee_payroll_2022_data$`Pyrl Fiscal Yr`)
unique(state_employee_payroll_2022_data$`Calendar Year`)
unique(state_employee_payroll_2022_data$`State`)
#We can remove the problem column and some other insignificant columns
payroll_data = subset(state_employee_payroll_2022_data, select = -c(`Chk Status`, `Pyrl Fiscal Yr`, `Calendar Year`, `Check #`, `Check Dt`, `State`))
#Now that's looking better
problems(payroll_data)
```
Once we fixed the issues with the data and removed the uninformative variables, we converted the categorical predictors into factor variables so that they could be used to train a model.
```{r}
payroll_data$Agency = as.factor(payroll_data$Agency)
payroll_data$Sex = as.factor(payroll_data$Sex)
payroll_data$`Union Descr` = as.factor(payroll_data$`Union Descr`)
```
We can now consider transforming some of the data. The `Term Date` variable records when an employee was terminated; we can transform it into a binary predictor indicating whether or not the employee was terminated in 2022, since the state of employment matters more than the exact date.
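A minimal sketch of that transformation, assuming `Term Date` is `NA` for employees who were never terminated (the exact encoding is an assumption, so the chunk is not evaluated):

```{r, eval=FALSE}
# Assumed encoding: `Term Date` is NA unless the employee was terminated
payroll_data$Terminated = as.factor(!is.na(payroll_data$`Term Date`))
```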
Since each row of the dataset represented a single paystub, an employee could show up in multiple rows. We chose to transform the data to have only one row per employee, since that made our analysis more straightforward.
```{r}
# Example: all of the paystub rows belonging to a single employee
empId = payroll_data$`EmplId-Empl Rcd`[1]
(empRecord = payroll_data[payroll_data$`EmplId-Empl Rcd` == empId, ])
```
The example above confirms that a single employee appears in many rows. To collapse the data, we kept, for each employee, the row with the highest `Annual Rate`.
```{r}
length(unique(payroll_data$`EmplId-Empl Rcd`))
# Collapse to one row per employee, keeping the paystub with the highest `Annual Rate`
ct_employees = hashmap()
for (i in 1:nrow(payroll_data)) {
  employee_id = payroll_data[i, ]$`EmplId-Empl Rcd`
  if (is.null(ct_employees[[employee_id]])) {
    ct_employees[[employee_id]] = payroll_data[i, ]
  } else if (ct_employees[[employee_id]]$`Annual Rate` < payroll_data[i, ]$`Annual Rate`) {
    ct_employees[[employee_id]] = payroll_data[i, ]
  }
}
```
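The loop above works but is slow on roughly 1.7 million rows; an equivalent `dplyr` approach (a sketch, not the code we ran) keeps the highest-`Annual Rate` row per employee in one pipeline:

```{r, eval=FALSE}
# Equivalent, vectorized version of the per-employee collapse above
ct_employees_df_alt = payroll_data %>%
  group_by(`EmplId-Empl Rcd`) %>%
  slice_max(`Annual Rate`, n = 1, with_ties = FALSE) %>%
  ungroup()
```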
```{r}
# Combine the per-employee records stored in the hashmap back into a data frame
employee_rows = values(ct_employees)
ct_employees_df = do.call(rbind, employee_rows)
```
With the transformed data, we were finally able to look through the columns for any remaining issues.
**Transforming data and adding predictors**
**How to deal with names**
**Model Selection**
**Model Analysis**
## Discussion
## Appendix
**Alternative models**
We can use this section to store old models that didn't work or ideas that led to a dead end.
1. Base Model
This base model will use several features that we think could be the most influential.
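As a placeholder, a base model along those lines might look like the following. The predictor set is illustrative only (an assumption pending further cleaning), and the chunk is not evaluated:

```{r, eval=FALSE}
# Illustrative base model; the chosen predictors are assumptions at this point
base_model = lm(`Annual Rate` ~ Age + Sex + Agency, data = ct_employees_df)
summary(base_model)
```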
From the ANOVA model above we would want to keep the model that adds age.
(Eventually we will add backwards stepwise selection using AIC and BIC here.)
We may experience collinearity between agency and any location zip codes, because many of the agencies have regional salaries.