R in the World of Education

Erin Grand
December 12, 2017

R-Ladies NYC

About Me

Studied Astronomy and Physics at University of Maryland
Data Science MS from Columbia University 2011
Currently Associate Director of Data Analytics at Uncommon Schools

What is a charter school?

A charter school is an independently run public school granted greater flexibility in its operations, in return for greater accountability for performance. The “charter” establishing each school is a performance contract detailing the school's mission, program, students served, performance goals, and methods of assessment.

Uncommon Schools

Established in 1997 (oldest school opened - North Star Academy in Newark); 2005 - CMO formed
83% Free and reduced meal population
Intensive PD for teachers & leaders; Partnerships with local districts to deliver PD; Summer Teaching Fellows diversity recruitment program; Camp Uncommon

What kind of data do we work with?

Assessments: Interim assessments
Exams: Common Core aligned state exams, SAT, PSAT, APs, …etc
Classroom: Grades, attendance, suspensions, …etc
Teacher: student - course - teacher linkage
Staff Data: HR and Recruitment

Data Challenges

Missing/Incomplete data
Different data sources without matching IDs (i.e HR to Teacher to Student)
Students and teachers may move between schools or courses
Alignment of data and data processes across all schools and regions
Changing student IDs (not many)
Human data reporting error
Historical data quality

Data Challenges We Can Fix!

Messy excel sheets (historical or human entered)
Column names that don't apply anymore
Lack of historical documentation
Finding duplicate exams
Students take half of one test and the other half of another
Vanishing leading zeros

Tracking of student IDs that change
Common definitions (i.e “cohort”)
How to refer to school years or school abbreviations
Data audits

Looking at Duplicates with Janitor

Janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with janitor and save their thinking for more fun tasks. (Sam Firke)

If you're experienced with Tidyverse in general, you should be able to do everything inside janitor on your own, but we don't have the time to always clean up data without help.

Benefits to using Janitor over writing your own code

Functions are tested
Generally obeys Hadley's official style guide
Turn many lines of code into one or two
Pipe-able functions
Written for the education data space

Messy Excel Sheets

Image credit to Sam Firke

Using Janitor to Clean Excel

read_excel(filepath, sheet="Sheet1", col_types = "text") %>%
  clean_names() %>%
  remove_empty_cols() %>%
  remove_empty_rows() %>%
  mutate_at(vars(entrydate, exitdate, student_id, yearsinuncommon), as.numeric) %>%
  mutate_at(vars(entrydate, exitdate), excel_numeric_to_date)

Finding Duplicates

library(tidyverse)
library(janitor)
library(readxl)

students <- read_excel(filepath, sheet="Sheet1", col_types = "text") %>%
  clean_names() %>%
  remove_empty_cols() %>%
  mutate_at(vars(entrydate, exitdate, student_id, yearsinuncommon), as.numeric) %>%
  mutate_at(vars(entrydate, exitdate), excel_numeric_to_date)

students %>% 
  get_dupes(student_id)

# A tibble: 2 x 6
  student_id dupe_count grade yearsinuncommon  entrydate   exitdate
       <dbl>      <int> <dbl>           <dbl>     <date>     <date>
1    7851976          2     5               1 2017-11-12 2017-12-12
2    7851976          2     6               1 2017-11-12 2017-12-12

Now What?

Correct the dupes individually with if_else or case_when

mutate(students, grade = if_else(student_id == 7851976, 5, grade))

Summarize by taking minimum date / grade, if that is causing the problem

group_by(students, student_id) %>% summarize(grade = min(grade))

Output the duplicates and manually choose which version to keep

dupes_correct <- read_csv("dupes_correct.csv")
left_join(students, dupes_correct) %>%
  replace_na(list(keep = 0)) %>%
  assert(not_na, keep) %>%
  filter(keep = 0)

Managing Data Changes

Using get_dupes and verify() from the assertr package is a great way to put in checks in case the data changes (which it will).

check <- students %>% 
  get_dupes(student_id) %>% 
  verify(nrow(.) == 0)

If a student ID changes, or new duplicates occur, the code will HALT at this step alerting that something is off.

Introducing R to my Team

Learnings Along the Way

Choose the packages that are needed every day
Have someone that is active in R community, so that you can be on the cutting edge.
The more practice someone has, the faster they'll learn. Pair PD sessions with coding projects.

Model Building

Automation of State Tests

Entire state test analyses from raw data to dashboard is done with scripts (push button analysis)

Gathering, cleaning and basic processing

files <- list.files("../Input/", pattern = ".xlsx", full.names =  TRUE)
nys <- map_dfr(files, prep_nys_files)

Combine historical and new data from all three states
Data is outputted into Tableau dashboards

Cut Scores: Predicted Pass Rates

library(rpart)
library(rpart.plot)

# Model Fit
fit <- rpart(proficient ~ ia_score, data=dat,  maxdepth = 1, method= "class")

# Accuracy Calculation
root.node.error <- fit$frame[1, 'dev']/fit$frame[1, 'n']
xerr <- min(fit$cptable[, 3])
cp.err <- root.node.error * xerr
acc <- round(1 - cp.err, 3)

# First Split
cut_score <- as.data.frame(fit$splits)$index

# Plot Tree
prp(fit, fallen.leaves = TRUE, type = 3, 
    extra = 1, under = TRUE, varlen=0, faclen=0)
title(sub = paste("Accuracy:", paste(100 * acc, "%", sep="")))

Other Model Projects

SGI (small group learning)
Recruitment projections: Projected Attrition + Projected Growth
Teacher Effect : Value added model(s)

Q & A