More cool things with mutate

June 6, 2017

Using mutate can sometimes feel like…

But soon you will be a mutate expert and everything will be…


awesome.

Mutate - itself

No more nested Ifs!

mutate(grade_group = 
         ifelse(Grade >= 5 & Grade <= 8 & Subject == "ELA", "5-8", 
                  ifelse(Grade >= 5 & Grade <= 8 & Subject == "Math", "5-7", 
                         ifelse(Grade < 5, "3-4", "Unknown")
                        )))

Correction 1 - multiple if statements, on their own:

I find this works really well for small data, otherwise you’re adding time to the process.

mutate(grade_group = ifelse(Grade >= 5 & Grade <= 8 & Subject == "ELA", "5-8", NA),
       grade_group = ifelse(Grade >= 5 & Grade <= 8 & Subject == "Math", "5-7", grade_group),
       grade_group = ifelse(Grade < 5, "3-4", "grade_group"))

Correction 2 - translation table

For larger data, joining against a translation table may be faster.

grade_group <- tibble(grade = 3:8, 
                      ela_group = c("3-4", "3-4", rep("5-8", 4)), 
                      math_group = c("3-4", "3-4", rep("5-7", 4))
                      )
grade_group
## # A tibble: 6 x 3
##   grade ela_group math_group
##   <int>     <chr>      <chr>
## 1     3       3-4        3-4
## 2     4       3-4        3-4
## 3     5       5-8        5-7
## 4     6       5-8        5-7
## 5     7       5-8        5-7
## 6     8       5-8        5-7
results <- results %>% left_join(grade_group, by="grade")

Correction 3 - case_when()

THANK YOU DPLYR! Now that this is in full working order, this is the best solution to the nested if problem.

mutate(grade_group = case_when(
  Grade < 5 ~ "3-4",
  Grade <=8 & Subject == "ELA" ~ "5-8",
  Grade <= 8 & Subject == "Math" ~ "5-7",
  TRUE ~ NA
)

group_by and mutate

A grouped data frame is really useful to work with mutate statements. You can do ranking, summarizing, filtering..etc on your groups without losing data.

Example: So I have this file with student profile info, but each student has one record per school year, so Ying has 9th grade, 10th grade, and 11th grade. I want to create a file with one row per student, taking only the latest year.

df %>%
  group_by(student_id) %>% 
  mutate(max_year = max(school_year) %>% 
  filter(school_year == max_year) %>% 
  ungroup()

Example: I want to rank schools by their sccores on a particular subject.

schools %>% 
  group_by(subject) %>%
  mutate(rank = min_rank(percent_correct))

Update Multiple Columns at Once

mutate_at/summarise_at:

  • Useful for when you need to apply the same function to many columns (i.e as.numeric())
  • It will automatically mutate all your columns at once.
  • You need to tell mutate which columns you want it to look at, and what to do with them.
  • You can give it multiple columns and multiple functions. All functions will act on all columns.
mutate_at(vars(), funs())

Example

results %>%
  mutate_at(vars(score, grade), as.numeric)

mutate_if/summarise_if:

inputs: test, function(s)

summarise_if(is.numeric, funs(mean, min, max, sd, n))
summarise_if(is.numeric, as.character)
summarise_if(is.numeric, as.factor)

Task 1

Find the the mean of ALL the columns in the PARCC scores data set with “SCORE” in the name AND all columns that R knows are numbers.

parcc_scores <- read_rds("S:/Data Analytics/State Test Analysis/2016-2017/PARCC/Results Release/~Data/Processed/Uncommon Student-level results/Output/PARCC_SY1617_results_ESMS.rds") 
## Warning: package 'bindrcpp' was built under R version 3.3.3
## # A tibble: 1 x 8
##      score proficient voidscorecode voidscorereason testreadingscalescore
##      <dbl>      <dbl>         <dbl>           <dbl>                 <dbl>
## 1 766.2965    0.70374           NaN             NaN              53.25137
## # ... with 3 more variables: testwritingscalescore <dbl>,
## #   tested_diff_school <dbl>, testscorecomplete <dbl>

Task 2

  • Take the PARCC state test data and create a table of percent correct per exam AND average scaled scores. Use summarise_at or summarise_if NOT summarise.
  • Extra credit: Use SPREAD to create ELA and Math versions of BOTH proficient and score.
parcc_scores <- read_rds("S:/Data Analytics/State Test Analysis/2016-2017/PARCC/Results Release/~Data/Processed/Uncommon Student-level results/Output/PARCC_SY1617_results_ESMS.rds")
## # A tibble: 6 x 4
## # Groups:   grade [3]
##   grade report_subject proficient    score
##   <chr>          <chr>      <dbl>    <dbl>
## 1     3            ELA  0.7145877 774.9471
## 2     3           Math  0.7864693 770.5455
## 3     4            ELA  0.7816594 774.5917
## 4     4           Math  0.6965066 762.2009
## 5     5            ELA  0.7325349 769.6747
## 6     5           Math  0.5588822 753.7784

Task 3

Clean up this set of code to use summarise_at and mutate_at where useful. Do not worry about all the column names being the same as they are now. :)

combinedSurveyTransp <- read_rds("S:/Data Analytics/Staff Analyses/Staff Survey/temp/combined_survey_tranporse.rds")

schoolSummary<-combinedSurveyTransp %>%
  group_by(Region,School,Grade.Span, School.Year, Please.select.your.role, Q.Display, Q.Category,For.how.long.have.you.been.with.Uncommon.Schools, Principal, DOO) %>%
  summarise(
  LikertAvg=sum(Likert,na.rm=TRUE),
  Likert5=sum(Likert5,na.rm=TRUE),
  Likert4=sum(Likert4,na.rm=TRUE),
  Likert3=sum(Likert3,na.rm=TRUE),
  Likert2=sum(Likert2,na.rm=TRUE),
  Likert1=sum(Likert1,na.rm=TRUE),
  Likert54=sum(Likert54,na.rm=TRUE),
  Likert12=sum(Likert12,na.rm=TRUE),
  LikertN=sum(ifelse(is.na(Likert), 0, 1)),
  NPSAvg=sum(NPS,na.rm=TRUE),
  NPS9.10=sum(NPS9.10,na.rm=TRUE),
  NPS7.8=sum(NPS7.8,na.rm=TRUE),
  NPS0.6=sum(NPS0.6,na.rm=TRUE),
  NPSN=sum(ifelse(is.na(NPS), 0, 1)),
  workloadLow=sum(workloadLow,na.rm=TRUE),
  workloadRight=sum(workloadRight, na.rm=TRUE),
  workloadHigh=sum(workloadHigh, na.rm=TRUE),
  workloadTooHigh=sum(workloadTooHigh, na.rm=TRUE),
  workloadN=sum(ifelse(is.na(workloadN), 0, 1)),
  leaveEOY=sum(leaveEOY,na.rm=TRUE),
  leave1.2=sum(leave1.2,na.rm=TRUE),
  leave3.4=sum(leave3.4,na.rm=TRUE),
  leave5.6=sum(leave5.6,na.rm=TRUE),
  leave.Many=sum(leave.Many,na.rm=TRUE),
  leaveN=sum(ifelse(is.na(expectedYrsN), 0, 1)),
  reasonCareer=sum(reasonCareer,na.rm=TRUE),
  reasonLearn=sum(reasonLearn,na.rm=TRUE),
  reasonLeader=sum(reasonLeader,na.rm=TRUE),
  reasonLifestyle=sum(reasonLifestyle,na.rm=TRUE),
  reasonMoving=sum(reasonMoving,na.rm=TRUE),
  reasonAdvance=sum(reasonAdvance,na.rm=TRUE),
  reasonSalary=sum(reasonSalary,na.rm=TRUE),
  reasonStaffCulture=sum(reasonStaffCulture,na.rm=TRUE),
  reasonStudentCulture=sum(reasonStudentCulture,na.rm=TRUE),
  reasonOther=sum(reasonOther,na.rm=TRUE),
  reasonN=sum(ifelse(is.na(whyLeaveN), 0, 1)),
  desiredChangeAvg=sum(desiredChange, na.rm=TRUE),
  desiredChangeN=sum(ifelse(is.na(desiredChange), 0, 1))
  )