Introduction to R: Data Transformation

Erin Grand

3/16/2017

Introduction to R: Data Manipulation

Packages

Download tidyverse. Imports: dplyr, ggplot2, lubricate, readr, readxl, stringr, tibble, tidyr, magrittr

library(readxl)
library(tidyverse)

Tidy Data

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

Learn to think in Code

How to do you currently think about a task in excel or tableau? Learn how to translate and use that process in code.

What is your process for creating a pivot table?
What is your process for creating a new column?
How do you sort data?

DPLYR

Five Most Important DPLYR Functions

Filter
Select
Mutate
Group_by, Summarize (Summarise)
Arrange

All these function work similarly, the first argument is a data frame, the next describe what you want to do with the data frame, and the result is a new data frame.

Real world example

You’re going to use of each these functions to look at the NY schools with the highest predicted passing rates for students taking the Regents exam.

Filter

Task: Filter the IA data to only include NY 8th grade students’ math tests. This way you’re only looking at Regents test takers.

Select

It’s not uncommon to get datasets with many variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on just those.
You select columns by name without quotes

Select Continued

There are a number of other different ways to select columns. You can find out more about each of these on the Tidy Data cheat sheet, or by looking at ?select

column1:column3: selects all the column between column1 and column3
starts_with("abc"): matches column names that begin with “abc”
ends_with("abc"): matches column names that end with “abc”
contains("ID"): will select only columns that contain ID (not case-sensitive)

num_range("x", 1:3): selects columns titled x1, x2, x3
matches(): selects variables that match a regular expression, don’t use this unless you know regular expressions
-X3 will select all column expect X3
one_of() takes a vector of characters and will select all the columns that have the same name as you provide

Task: The IA data has SO MANY columns which aren’t really useful if we only want to the passing rate of these students. Find the ones we need and select only them.

Mutate

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
You can refer to column you’ve just created too.

library(fivethirtyeight)
avengers <- mutate(fivethirtyeight::avengers, 
       death1 = ifelse(!is.na(death1) & death1, 1, 0 ),
       death2 = ifelse(!is.na(death2) & death2, 1, 0 ),
       death3 = ifelse(!is.na(death3) & death3, 1, 0 ),
       death4 = ifelse(!is.na(death4) & death4, 1, 0 ),
       death5 = ifelse(!is.na(death5) & death5, 1, 0 ),
       total.deaths = death1 + death2 + death3 + death4 + death5)

Task: Create a column called Pass that describes if the student is predicted to pass the exam (1) or not (0).

Group by and Summarize

summarize() summarizes the data to a single row
it is not really useful on it’s own, but becomes super powerful when you pair it with group_by().
This changes the “unit” of analysis from the complete dataset to individual groups.
So instead of a a single row, you get a single row per group.

by_year <- group_by(avengers, year, gender)
summarise(by_year, avgDeaths = mean(total.deaths, na.rm = TRUE))

## Source: local data frame [64 x 3]
## Groups: year [?]
## 
##     year gender avgDeaths
##    <int>  <chr>     <dbl>
## 1   1900 FEMALE 0.0000000
## 2   1900   MALE 0.1428571
## 3   1963 FEMALE 1.0000000
## 4   1963   MALE 1.0000000
## 5   1964   MALE 1.0000000
## 6   1965 FEMALE 1.0000000
## 7   1965   MALE 1.3333333
## 8   1967   MALE 0.0000000
## 9   1968   MALE 0.5000000
## 10  1969   MALE 0.0000000
## # ... with 54 more rows

## Source: local data frame [42 x 4]
## Groups: year [42]
## 
##     year FEMALE      MALE gender.difference
##    <int>  <dbl>     <dbl>             <dbl>
## 1   1900    0.0 0.1428571        -0.1428571
## 2   1963    1.0 1.0000000         0.0000000
## 3   1964     NA 1.0000000                NA
## 4   1965    1.0 1.3333333        -0.3333333
## 5   1967     NA 0.0000000                NA
## 6   1968     NA 0.5000000                NA
## 7   1969     NA 0.0000000                NA
## 8   1973    1.0        NA                NA
## 9   1975     NA 2.0000000                NA
## 10  1976    1.5 0.0000000         1.5000000
## # ... with 32 more rows

Arrange

Arrange works by you providing the column names you’d like to sort by. The order you provide them will determine the ties in the preceding columns.
You can use desc() to order in descending order
Missing values (NAs) are always sorted at the end

Task: Sort the final table to look at the schools with the highest predicted pass rate for the Regents Math exam.

Result:

## # A tibble: 15 × 2
##     School Pass.Rate
##      <chr>     <chr>
## 1  LPOH-MA     92.8%
## 2  LPBS-MA     91.9%
## 3     TPMS     89.1%
## 4     RPBC     85.7%
## 5      OHC     85.5%
## 6     KCCS       84%
## 7  LPBV-MA     77.8%
## 8  LPCN-MA     75.9%
## 9     WCCS     74.7%
## 10     BEC     73.4%
## 11    RPWC     70.7%
## 12     BSC     69.9%
## 13     BVC     67.2%
## 14 EGCS-MA     59.5%
## 15 EBCS-MA     56.5%

Questions

Discussion Questions

When do you think you should you use mutate() vs creating a new column with df$x?
What would be a use case to have a group_by() WITHOUT summarize()?
These steps are great for analysis! What about data cleaning and QCing? Discuss how you might use dplyr functions in data cleaning and/or QCing.

Joins

Which Join is which?

Type of Join	What it does	Example Code
Left Join	Keeps rows in 1st table, joins matching rows from 2nd	`left_join(x, y, by="ID")`
Right Join	Keeps rows in 2nd table, joins matching rows from 1st	`right_join(x, y, by="ID")`

Type of Join	What it does	Example Code
Full Join	Keeps the rows from both tables, filling in missing rows with NA	`full_join(x, y, by="ID")`
Inner Join	Keeps ONLY the rows in both tables	`inner_join(x, y, by="ID")`
Semi Join	keeps one row for every row in the 1st table that matches the 2nd	`semi_join(x, y, by="ID")`
Anti Join	Removes rows that match the 2nd table	`anti_row(x, y, by="ID")`

Joining by more than one Column

left_join(newTE, oldTE, by = c("Teacher"="Teacher.Name", "School"), suffix=c(".new", ".old")

When to use what Join?

Type of Join	Use case
Left Join	Default
Right Join	You’re map is the first in the pipe
Full Join	You don’t want to lose any rows in either data set
Inner Join	You don’t want any NAs
Semi Join	You don’t want any “duplicates” of the first data set
Anti Join	You want to filter out rows

Advantages of Anti-Join

It can be really useful to “filter” out data from one data set that is in another
For example, in text analysis you want to ignore stop words (and, but, the..etc).

What about Right-Join?

There aren’t many use cases for right joining data, HOWEVER if you are piping a set of set which creates your map and you want to join it back to the original data, to continue the pipe you would use a right join.

Putting it Together

Assigment

Task: Talent Dev has requested a rundown on the racial demographics of our NY students

Read in the enrollment data file
Explore the file, is there anything missing that you need?
Join any data that you may need that was missing from the enrollment set.
QC and clean the enrollment data
Find how many many students are in each ethnicity bucket
Arrange the data for ease of requester