Erin Grand
December 12, 2017
R-Ladies NYC
A charter school is an independently run public school granted greater flexibility in its operations, in return for greater accountability for performance. The “charter” establishing each school is a performance contract detailing the school's mission, program, students served, performance goals, and methods of assessment.
Janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with janitor and save their thinking for more fun tasks. (Sam Firke)
If you're experienced with the tidyverse, you could do everything janitor does on your own, but we don't always have the time to clean up data without help.
Image credit to Sam Firke
read_excel(filepath, sheet = "Sheet1", col_types = "text") %>%
  clean_names() %>%
  remove_empty_cols() %>%
  remove_empty_rows() %>%
  mutate_at(vars(entrydate, exitdate, student_id, yearsinuncommon), as.numeric) %>%
  mutate_at(vars(entrydate, exitdate), excel_numeric_to_date)
library(tidyverse)
library(janitor)
library(readxl)
students <- read_excel(filepath, sheet = "Sheet1", col_types = "text") %>%
  clean_names() %>%
  remove_empty_cols() %>%
  mutate_at(vars(entrydate, exitdate, student_id, yearsinuncommon), as.numeric) %>%
  mutate_at(vars(entrydate, exitdate), excel_numeric_to_date)
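Because the sheet is read with col_types = "text", Excel dates arrive as serial numbers stored as strings, which is why the pipeline converts those columns to numeric before calling excel_numeric_to_date. A minimal sketch of that two-step conversion follows; the data frame raw_demo and its values are made up for illustration and are not from the talk.

```r
library(janitor)

# Hypothetical stand-in for what read_excel(..., col_types = "text") returns:
# every column is character, including Excel's date serial numbers.
raw_demo <- data.frame(
  `Student ID` = c("7851976", "1234567"),
  entrydate = c("43051", "43081"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# clean_names() turns "Student ID" into a syntactic snake_case name
cleaned_demo <- clean_names(raw_demo)

# Step 1: text -> numeric; step 2: Excel serial number -> Date
cleaned_demo$entrydate <- excel_numeric_to_date(as.numeric(cleaned_demo$entrydate))

names(cleaned_demo)
format(cleaned_demo$entrydate)
```

Skipping the as.numeric step would hand excel_numeric_to_date a character vector, which it is not designed to accept.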
students %>%
  get_dupes(student_id)
# A tibble: 2 x 6
  student_id dupe_count grade yearsinuncommon entrydate  exitdate
       <dbl>      <int> <dbl>           <dbl> <date>     <date>
1    7851976          2     5               1 2017-11-12 2017-12-12
2    7851976          2     6               1 2017-11-12 2017-12-12
if_else or case_when
mutate(students, grade = if_else(student_id == 7851976, 5, grade))
group_by(students, student_id) %>% summarize(grade = min(grade))
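Correcting the grade alone still leaves two rows for the student; they only become identical. One possible follow-up, sketched here with case_when and distinct() on made-up data (students_demo and the ID 1111111 are hypothetical), is to fix the value and then drop the now-identical rows:

```r
library(dplyr)

# Hypothetical data: one student appears twice with conflicting grades
students_demo <- tibble(
  student_id = c(7851976, 7851976, 1111111),
  grade = c(5, 6, 7)
)

fixed_demo <- students_demo %>%
  mutate(grade = case_when(
    student_id == 7851976 ~ 5,  # known correction for this ID
    TRUE ~ grade                # everyone else keeps their grade
  )) %>%
  distinct()                    # collapse rows that are now identical

fixed_demo
```

case_when scales better than nested if_else calls once more than one ID needs a manual correction.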
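library(assertr)  # provides assert() and not_na
dupes_correct <- read_csv("dupes_correct.csv")
left_join(students, dupes_correct) %>%
  replace_na(list(keep = 0)) %>%
  assert(not_na, keep) %>%
  filter(keep == 0)  # == tests equality; a single = is an error inside filter()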
Using get_dupes() from janitor together with verify() from the assertr package is a great way to put in checks in case the data changes (which it will).
check <- students %>%
  get_dupes(student_id) %>%
  verify(nrow(.) == 0)
If a student ID changes or new duplicates appear, the code will HALT at this step, alerting you that something is off.
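To see the halting behavior without stopping a whole script, the check can be wrapped in tryCatch. This is only a sketch on made-up data (students_demo is hypothetical); in a real pipeline you would usually let the error propagate so the run stops.

```r
library(dplyr)
library(janitor)
library(assertr)

# Hypothetical data with one duplicated student_id
students_demo <- tibble(
  student_id = c(1, 2, 2),
  grade = c(5, 5, 6)
)

# verify() raises an error when its condition is FALSE; tryCatch() captures
# that error here so the demo can report what happened.
result <- tryCatch(
  {
    students_demo %>%
      get_dupes(student_id) %>%
      verify(nrow(.) == 0)
    "no duplicates"
  },
  error = function(e) "halted: duplicates found"
)

result
```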
Learnings Along the Way
Entire state test analyses, from raw data to dashboard, are done with scripts (push-button analysis)
# pattern is a regular expression: escape the dot and anchor to the end of the name
files <- list.files("../Input/", pattern = "\\.xlsx$", full.names = TRUE)
nys <- map_dfr(files, prep_nys_files)
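The talk does not show prep_nys_files, so the sketch below uses a hypothetical stand-in, prep_file_demo, and temporary CSV files instead of the real Excel inputs. It only illustrates the pattern: one function that prepares a single file, mapped over every file in a folder and row-bound into one data frame.

```r
library(purrr)

# Hypothetical per-file prep function (not the author's prep_nys_files)
prep_file_demo <- function(path) {
  dat <- read.csv(path, stringsAsFactors = FALSE)
  dat$source_file <- basename(path)  # remember where each row came from
  dat
}

# Create two small made-up input files in a temporary folder
tmp_dir <- tempfile("input_")
dir.create(tmp_dir)
write.csv(data.frame(student_id = 1:2, score = c(310, 295)),
          file.path(tmp_dir, "grade5.csv"), row.names = FALSE)
write.csv(data.frame(student_id = 3:4, score = c(322, 301)),
          file.path(tmp_dir, "grade6.csv"), row.names = FALSE)

# Same shape as the call above: list the files, then map and row-bind
files_demo <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
combined_demo <- map_dfr(files_demo, prep_file_demo)

combined_demo
```

Keeping all per-file logic inside one function means a new year's file only needs to be dropped into the input folder.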
library(rpart)
library(rpart.plot)
# Model Fit
fit <- rpart(proficient ~ ia_score, data = dat, maxdepth = 1, method = "class")
# Accuracy Calculation
# Root node error: share of students misclassified before any split
root.node.error <- fit$frame[1, "dev"] / fit$frame[1, "n"]
# The third cptable column is "rel error" (training error relative to the root),
# not the cross-validated "xerror"
xerr <- min(fit$cptable[, "rel error"])
cp.err <- root.node.error * xerr
acc <- round(1 - cp.err, 3)
# First Split
cut_score <- as.data.frame(fit$splits)$index
# Plot Tree
prp(fit, fallen.leaves = TRUE, type = 3,
    extra = 1, under = TRUE, varlen = 0, faclen = 0)
title(sub = paste("Accuracy:", paste(100 * acc, "%", sep = "")))
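The accuracy above is built from the root node error and the relative error in the complexity table, so it should match the plain training accuracy of the one-split tree. A small sketch on simulated data (dat_demo and its score distribution are invented here, not the school's data) checks that and pulls out the cut score:

```r
library(rpart)

set.seed(1)
# Simulated, hypothetical data: proficiency becomes likely above a score of ~60
dat_demo <- data.frame(ia_score = round(runif(200, 0, 100)))
dat_demo$proficient <- factor(
  ifelse(dat_demo$ia_score + rnorm(200, sd = 10) > 60, "yes", "no")
)

# One-split classification tree, as in the code above
fit_demo <- rpart(proficient ~ ia_score, data = dat_demo,
                  maxdepth = 1, method = "class")

# Accuracy from the fitted object: 1 - (root node error * relative error)
root_err_demo <- fit_demo$frame[1, "dev"] / fit_demo$frame[1, "n"]
acc_formula <- 1 - root_err_demo * min(fit_demo$cptable[, "rel error"])

# Accuracy computed directly from predictions on the training data
pred_demo <- predict(fit_demo, dat_demo, type = "class")
acc_direct <- mean(pred_demo == dat_demo$proficient)

# The score at which the tree splits
cut_demo <- fit_demo$splits[1, "index"]

c(formula = acc_formula, direct = acc_direct, cut = cut_demo)
```

Both numbers describe fit on the training data only; a cross-validated estimate would use the xerror column instead.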