Introduction to R: Data Transformation

Erin Grand

3/16/2017

Introduction to R: Data Manipulation

Packages

Download tidyverse. Imports: dplyr, ggplot2, lubricate, readr, readxl, stringr, tibble, tidyr, magrittr

library(readxl)
library(tidyverse)

Tidy Data

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Learn to think in Code

How to do you currently think about a task in excel or tableau? Learn how to translate and use that process in code.

  1. What is your process for creating a pivot table?
  2. What is your process for creating a new column?
  3. How do you sort data?

DPLYR

Five Most Important DPLYR Functions

  1. Filter
  2. Select
  3. Mutate
  4. Group_by, Summarize (Summarise)
  5. Arrange

All these function work similarly, the first argument is a data frame, the next describe what you want to do with the data frame, and the result is a new data frame.

Real world example

You’re going to use of each these functions to look at the NY schools with the highest predicted passing rates for students taking the Regents exam.

Filter

Task: Filter the IA data to only include NY 8th grade students’ math tests. This way you’re only looking at Regents test takers.

Select

  • It’s not uncommon to get datasets with many variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on just those.
  • You select columns by name without quotes

Select Continued

There are a number of other different ways to select columns. You can find out more about each of these on the Tidy Data cheat sheet, or by looking at ?select

  • column1:column3: selects all the column between column1 and column3
  • starts_with("abc"): matches column names that begin with “abc”
  • ends_with("abc"): matches column names that end with “abc”
  • contains("ID"): will select only columns that contain ID (not case-sensitive)
  • num_range("x", 1:3): selects columns titled x1, x2, x3
  • matches(): selects variables that match a regular expression, don’t use this unless you know regular expressions
  • -X3 will select all column expect X3
  • one_of() takes a vector of characters and will select all the columns that have the same name as you provide

Task: The IA data has SO MANY columns which aren’t really useful if we only want to the passing rate of these students. Find the ones we need and select only them.

Mutate

  • Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
  • You can refer to column you’ve just created too.
library(fivethirtyeight)
avengers <- mutate(fivethirtyeight::avengers, 
       death1 = ifelse(!is.na(death1) & death1, 1, 0 ),
       death2 = ifelse(!is.na(death2) & death2, 1, 0 ),
       death3 = ifelse(!is.na(death3) & death3, 1, 0 ),
       death4 = ifelse(!is.na(death4) & death4, 1, 0 ),
       death5 = ifelse(!is.na(death5) & death5, 1, 0 ),
       total.deaths = death1 + death2 + death3 + death4 + death5) 

Task: Create a column called Pass that describes if the student is predicted to pass the exam (1) or not (0).

Group by and Summarize

  • summarize() summarizes the data to a single row
  • it is not really useful on it’s own, but becomes super powerful when you pair it with group_by().
  • This changes the “unit” of analysis from the complete dataset to individual groups.
  • So instead of a a single row, you get a single row per group.
by_year <- group_by(avengers, year, gender)
summarise(by_year, avgDeaths = mean(total.deaths, na.rm = TRUE))
## Source: local data frame [64 x 3]
## Groups: year [?]
## 
##     year gender avgDeaths
##    <int>  <chr>     <dbl>
## 1   1900 FEMALE 0.0000000
## 2   1900   MALE 0.1428571
## 3   1963 FEMALE 1.0000000
## 4   1963   MALE 1.0000000
## 5   1964   MALE 1.0000000
## 6   1965 FEMALE 1.0000000
## 7   1965   MALE 1.3333333
## 8   1967   MALE 0.0000000
## 9   1968   MALE 0.5000000
## 10  1969   MALE 0.0000000
## # ... with 54 more rows
## Source: local data frame [42 x 4]
## Groups: year [42]
## 
##     year FEMALE      MALE gender.difference
##    <int>  <dbl>     <dbl>             <dbl>
## 1   1900    0.0 0.1428571        -0.1428571
## 2   1963    1.0 1.0000000         0.0000000
## 3   1964     NA 1.0000000                NA
## 4   1965    1.0 1.3333333        -0.3333333
## 5   1967     NA 0.0000000                NA
## 6   1968     NA 0.5000000                NA
## 7   1969     NA 0.0000000                NA
## 8   1973    1.0        NA                NA
## 9   1975     NA 2.0000000                NA
## 10  1976    1.5 0.0000000         1.5000000
## # ... with 32 more rows

Arrange

  • Arrange works by you providing the column names you’d like to sort by. The order you provide them will determine the ties in the preceding columns.
  • You can use desc() to order in descending order
  • Missing values (NAs) are always sorted at the end

Task: Sort the final table to look at the schools with the highest predicted pass rate for the Regents Math exam.

Result:

## # A tibble: 15 × 2
##     School Pass.Rate
##      <chr>     <chr>
## 1  LPOH-MA     92.8%
## 2  LPBS-MA     91.9%
## 3     TPMS     89.1%
## 4     RPBC     85.7%
## 5      OHC     85.5%
## 6     KCCS       84%
## 7  LPBV-MA     77.8%
## 8  LPCN-MA     75.9%
## 9     WCCS     74.7%
## 10     BEC     73.4%
## 11    RPWC     70.7%
## 12     BSC     69.9%
## 13     BVC     67.2%
## 14 EGCS-MA     59.5%
## 15 EBCS-MA     56.5%

Questions

Discussion Questions

  1. When do you think you should you use mutate() vs creating a new column with df$x?
  2. What would be a use case to have a group_by() WITHOUT summarize()?
  3. These steps are great for analysis! What about data cleaning and QCing? Discuss how you might use dplyr functions in data cleaning and/or QCing.

Joins

Which Join is which?

Type of Join What it does Example Code
Left Join Keeps rows in 1st table, joins matching rows from 2nd left_join(x, y, by="ID")
Right Join Keeps rows in 2nd table, joins matching rows from 1st right_join(x, y, by="ID")
Type of Join What it does Example Code
Full Join Keeps the rows from both tables, filling in missing rows with NA full_join(x, y, by="ID")
Inner Join Keeps ONLY the rows in both tables inner_join(x, y, by="ID")
Semi Join keeps one row for every row in the 1st table that matches the 2nd semi_join(x, y, by="ID")
Anti Join Removes rows that match the 2nd table anti_row(x, y, by="ID")

Joining by more than one Column

left_join(newTE, oldTE, by = c("Teacher"="Teacher.Name", "School"), suffix=c(".new", ".old") 

When to use what Join?

Type of Join Use case
Left Join Default
Right Join You’re map is the first in the pipe
Full Join You don’t want to lose any rows in either data set
Inner Join You don’t want any NAs
Semi Join You don’t want any “duplicates” of the first data set
Anti Join You want to filter out rows

Advantages of Anti-Join

  • It can be really useful to “filter” out data from one data set that is in another
  • For example, in text analysis you want to ignore stop words (and, but, the..etc).

What about Right-Join?

  • There aren’t many use cases for right joining data, HOWEVER if you are piping a set of set which creates your map and you want to join it back to the original data, to continue the pipe you would use a right join.

Putting it Together

Assigment

Task: Talent Dev has requested a rundown on the racial demographics of our NY students

  1. Read in the enrollment data file
  2. Explore the file, is there anything missing that you need?
  3. Join any data that you may need that was missing from the enrollment set.
  4. QC and clean the enrollment data
  5. Find how many many students are in each ethnicity bucket
  6. Arrange the data for ease of requester