Download tidyverse. Imports: dplyr, ggplot2, lubricate, readr, readxl, stringr, tibble, tidyr, magrittr
library(readxl)
library(tidyverse)
How to do you currently think about a task in excel or tableau? Learn how to translate and use that process in code.
Filter
Select
Mutate
Group_by
, Summarize
(Summarise
)Arrange
All these function work similarly, the first argument is a data frame, the next describe what you want to do with the data frame, and the result is a new data frame.
You’re going to use of each these functions to look at the NY schools with the highest predicted passing rates for students taking the Regents exam.
Task: Filter the IA data to only include NY 8th grade students’ math tests. This way you’re only looking at Regents test takers.
select()
allows you to rapidly zoom in on just those.There are a number of other different ways to select columns. You can find out more about each of these on the Tidy Data cheat sheet, or by looking at ?select
column1:column3
: selects all the column between column1 and column3starts_with("abc")
: matches column names that begin with “abc”ends_with("abc")
: matches column names that end with “abc”contains("ID")
: will select only columns that contain ID (not case-sensitive)num_range("x", 1:3)
: selects columns titled x1
, x2
, x3
matches()
: selects variables that match a regular expression, don’t use this unless you know regular expressions-X3
will select all column expect X3
one_of()
takes a vector of characters and will select all the columns that have the same name as you provideTask: The IA data has SO MANY columns which aren’t really useful if we only want to the passing rate of these students. Find the ones we need and select only them.
mutate()
.library(fivethirtyeight)
avengers <- mutate(fivethirtyeight::avengers,
death1 = ifelse(!is.na(death1) & death1, 1, 0 ),
death2 = ifelse(!is.na(death2) & death2, 1, 0 ),
death3 = ifelse(!is.na(death3) & death3, 1, 0 ),
death4 = ifelse(!is.na(death4) & death4, 1, 0 ),
death5 = ifelse(!is.na(death5) & death5, 1, 0 ),
total.deaths = death1 + death2 + death3 + death4 + death5)
Task: Create a column called Pass
that describes if the student is predicted to pass the exam (1) or not (0).
summarize()
summarizes the data to a single rowgroup_by()
.by_year <- group_by(avengers, year, gender)
summarise(by_year, avgDeaths = mean(total.deaths, na.rm = TRUE))
## Source: local data frame [64 x 3]
## Groups: year [?]
##
## year gender avgDeaths
## <int> <chr> <dbl>
## 1 1900 FEMALE 0.0000000
## 2 1900 MALE 0.1428571
## 3 1963 FEMALE 1.0000000
## 4 1963 MALE 1.0000000
## 5 1964 MALE 1.0000000
## 6 1965 FEMALE 1.0000000
## 7 1965 MALE 1.3333333
## 8 1967 MALE 0.0000000
## 9 1968 MALE 0.5000000
## 10 1969 MALE 0.0000000
## # ... with 54 more rows
## Source: local data frame [42 x 4]
## Groups: year [42]
##
## year FEMALE MALE gender.difference
## <int> <dbl> <dbl> <dbl>
## 1 1900 0.0 0.1428571 -0.1428571
## 2 1963 1.0 1.0000000 0.0000000
## 3 1964 NA 1.0000000 NA
## 4 1965 1.0 1.3333333 -0.3333333
## 5 1967 NA 0.0000000 NA
## 6 1968 NA 0.5000000 NA
## 7 1969 NA 0.0000000 NA
## 8 1973 1.0 NA NA
## 9 1975 NA 2.0000000 NA
## 10 1976 1.5 0.0000000 1.5000000
## # ... with 32 more rows
desc()
to order in descending orderNA
s) are always sorted at the endTask: Sort the final table to look at the schools with the highest predicted pass rate for the Regents Math exam.
Result:
## # A tibble: 15 × 2
## School Pass.Rate
## <chr> <chr>
## 1 LPOH-MA 92.8%
## 2 LPBS-MA 91.9%
## 3 TPMS 89.1%
## 4 RPBC 85.7%
## 5 OHC 85.5%
## 6 KCCS 84%
## 7 LPBV-MA 77.8%
## 8 LPCN-MA 75.9%
## 9 WCCS 74.7%
## 10 BEC 73.4%
## 11 RPWC 70.7%
## 12 BSC 69.9%
## 13 BVC 67.2%
## 14 EGCS-MA 59.5%
## 15 EBCS-MA 56.5%
mutate()
vs creating a new column with df$x
?group_by()
WITHOUT summarize()
?dplyr
functions in data cleaning and/or QCing.Type of Join | What it does | Example Code |
---|---|---|
Left Join | Keeps rows in 1st table, joins matching rows from 2nd | left_join(x, y, by="ID") |
Right Join | Keeps rows in 2nd table, joins matching rows from 1st | right_join(x, y, by="ID") |
Type of Join | What it does | Example Code |
---|---|---|
Full Join | Keeps the rows from both tables, filling in missing rows with NA | full_join(x, y, by="ID") |
Inner Join | Keeps ONLY the rows in both tables | inner_join(x, y, by="ID") |
Semi Join | keeps one row for every row in the 1st table that matches the 2nd | semi_join(x, y, by="ID") |
Anti Join | Removes rows that match the 2nd table | anti_row(x, y, by="ID") |
left_join(newTE, oldTE, by = c("Teacher"="Teacher.Name", "School"), suffix=c(".new", ".old")
Type of Join | Use case |
---|---|
Left Join | Default |
Right Join | You’re map is the first in the pipe |
Full Join | You don’t want to lose any rows in either data set |
Inner Join | You don’t want any NAs |
Semi Join | You don’t want any “duplicates” of the first data set |
Anti Join | You want to filter out rows |
Task: Talent Dev has requested a rundown on the racial demographics of our NY students