Example: Is this untidy or tidy data?
## Classes 'tbl_df', 'tbl' and 'data.frame': 403 obs. of 71 variables:
## $ episode : chr "S01E01" "S01E02" "S01E03" "S01E04" ...
## $ season : num 1 1 1 1 1 1 1 1 1 1 ...
## $ episode_num : num 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "A WALK IN THE WOODS" "MT. MCKINLEY" "EBONY SUNSET" "WINTER MIST" ...
## $ apple_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ aurora_borealis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ barn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ beach : int 0 0 0 0 0 0 0 0 1 0 ...
## $ boat : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge : int 0 0 0 0 0 0 0 0 0 0 ...
## $ building : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bushes : int 1 0 0 1 0 0 0 1 0 1 ...
## $ cabin : int 0 1 1 0 0 1 0 0 0 0 ...
## $ cactus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ circle_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cirrus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cliff : int 0 0 0 0 0 0 0 0 0 0 ...
## $ clouds : int 0 1 0 1 0 0 0 0 1 0 ...
## $ conifer : int 0 1 1 1 0 1 0 1 0 1 ...
## $ cumulus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ deciduous : int 1 0 0 0 1 0 1 0 0 1 ...
## $ diane_andre : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dock : int 0 0 0 0 0 0 0 0 0 0 ...
## $ double_oval_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ farm : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fence : int 0 0 1 0 0 0 0 0 1 0 ...
## $ fire : int 0 0 0 0 0 0 0 0 0 0 ...
## $ florida_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ flowers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fog : int 0 0 0 0 0 0 0 0 0 0 ...
## $ framed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ grass : int 1 0 0 0 0 0 0 0 0 0 ...
## $ guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ half_circle_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ half_oval_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hills : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lake : int 0 0 0 1 0 1 1 1 0 1 ...
## $ lakes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lighthouse : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mill : int 0 0 0 0 0 0 0 0 0 0 ...
## $ moon : int 0 0 0 0 0 1 0 0 0 0 ...
## $ mountain : int 0 1 1 1 0 1 1 1 0 1 ...
## $ mountains : int 0 0 1 0 0 1 1 1 0 0 ...
## $ night : int 0 0 0 0 0 1 0 0 0 0 ...
## $ ocean : int 0 0 0 0 0 0 0 0 1 0 ...
## $ oval_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ palm_trees : int 0 0 0 0 0 0 0 0 0 0 ...
## $ path : int 0 0 0 0 0 0 0 0 0 0 ...
## $ person : int 0 0 0 0 0 0 0 0 0 0 ...
## $ portrait : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rectangle_3d_frame: int 0 0 0 0 0 0 0 0 0 0 ...
## $ rectangular_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ river : int 1 0 0 0 1 0 0 0 0 0 ...
## $ rocks : int 0 0 0 0 1 0 0 0 0 0 ...
## $ seashell_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ snow : int 0 1 0 0 0 1 0 0 0 0 ...
## $ snowy_mountain : int 0 1 0 1 0 1 1 0 0 0 ...
## $ split_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ steve_ross : int 0 0 0 0 0 0 0 0 0 0 ...
## $ structure : int 0 0 1 0 0 1 0 0 0 0 ...
## $ sun : int 0 0 1 0 0 0 0 0 0 0 ...
## $ tomb_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tree : int 1 1 1 1 1 1 1 1 0 1 ...
## $ trees : int 1 1 1 1 1 1 1 1 0 1 ...
## $ triple_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ waterfall : int 0 0 0 0 0 0 0 0 0 0 ...
## $ waves : int 0 0 0 0 0 0 0 0 0 0 ...
## $ windmill : int 0 0 0 0 0 0 0 0 0 0 ...
## $ window_frame : int 0 0 0 0 0 0 0 0 0 0 ...
## $ winter : int 0 1 1 0 0 1 0 0 0 0 ...
## $ wood_framed : int 0 0 0 0 0 0 0 0 0 0 ...
Is this tidy data?
## # A tibble: 6 x 13
## name height mass hair_color skin_color eye_color birth_year
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
## 1 Luke Skywalker 172 77 blond fair blue 19.0
## 2 C-3PO 167 75 <NA> gold yellow 112.0
## 3 R2-D2 96 32 <NA> white, blue red 33.0
## 4 Darth Vader 202 136 none white yellow 41.9
## 5 Leia Organa 150 49 brown light brown 19.0
## 6 Owen Lars 178 120 brown, grey light blue 52.0
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
tidyr::gather()
and tidyr::spread()
to appropriately transform your datatidyr | gather |
spread |
---|---|---|
reshape2 | melt |
cast |
spreadsheets | unpivot | pivot |
databases | fold | unfold |
select
gather(key, value, ...)
gather(fivethirtyeight::bob_ross, object, included, apple_frame:wood_framed) %>%
arrange(episode, season, episode_num, title)
## # A tibble: 27,001 x 6
## episode season episode_num title object included
## <chr> <dbl> <dbl> <chr> <chr> <int>
## 1 S01E01 1 1 A WALK IN THE WOODS apple_frame 0
## 2 S01E01 1 1 A WALK IN THE WOODS aurora_borealis 0
## 3 S01E01 1 1 A WALK IN THE WOODS barn 0
## 4 S01E01 1 1 A WALK IN THE WOODS beach 0
## 5 S01E01 1 1 A WALK IN THE WOODS boat 0
## 6 S01E01 1 1 A WALK IN THE WOODS bridge 0
## 7 S01E01 1 1 A WALK IN THE WOODS building 0
## 8 S01E01 1 1 A WALK IN THE WOODS bushes 1
## 9 S01E01 1 1 A WALK IN THE WOODS cabin 0
## 10 S01E01 1 1 A WALK IN THE WOODS cactus 0
## # ... with 26,991 more rows
spread(key, value, fill=0)
mutate
2015-16 New York State Exams
read_rds("S:/Data Analytics/State Test Analysis/2016-2017/New York State Exams/Results Release/~Data/Processed/Uncommon Student-level results/Output/NYS_SY1516_results.rds") %>%
filter(grade != 13) %>%
clean_names() %>%
group_by(school, school_type, region, assessment_description) %>%
summarise(percent_proficient = mean(proficient)) %>%
spread(assessment_description, percent_proficient)
## Warning: package 'bindrcpp' was built under R version 3.3.3
## # A tibble: 22 x 14
## # Groups: school, school_type, region [22]
## school school_type region `Grade 3 ELA` `Grade 3 Math`
## * <chr> <chr> <chr> <dbl> <dbl>
## 1 BEC MS Uncommon NYC NA NA
## 2 BSC MS Uncommon NYC NA NA
## 3 BVC MS Uncommon NYC NA NA
## 4 EBCS-EA ES Uncommon NYC 0.4000000 0.6705882
## 5 EBCS-MA MS Uncommon NYC NA NA
## 6 EGCS-EA ES Uncommon NYC 0.8068182 0.8750000
## 7 EGCS-MA MS Uncommon NYC NA NA
## 8 KCCS MS Uncommon NYC NA NA
## 9 LPBS-EA ES Uncommon NYC 0.6091954 0.8160920
## 10 LPBS-MA MS Uncommon NYC NA NA
## # ... with 12 more rows, and 9 more variables: `Grade 4 ELA` <dbl>, `Grade
## # 4 Math` <dbl>, `Grade 5 ELA` <dbl>, `Grade 5 Math` <dbl>, `Grade 6
## # ELA` <dbl>, `Grade 6 Math` <dbl>, `Grade 7 ELA` <dbl>, `Grade 7
## # Math` <dbl>, `Grade 8 ELA` <dbl>
unite(temp, school, grade)
Creates columns called “temp” from school and grade.
unite(temp, school, grade, sep="")
The opposite of unite! - separates values in a single column into new columns - Works best when every value in the column has the same number of possible elements to separate - If not, fill parameter - Define what the split is with sep=“_“. The default value is any non-alpha-numeric values (.,_…etc)
Subject, Raw Score, Scaled Score -> ELA Raw Score, ELA Scaled Score, Math Raw Score, Math Scaled Score
(df <- tibble(student = c("bob", "bob"), subject = c("ELA", "Math"), scaled_score = c(100, 200), raw_score = c(20, 50)))
## # A tibble: 2 x 4
## student subject scaled_score raw_score
## <chr> <chr> <dbl> <dbl>
## 1 bob ELA 100 20
## 2 bob Math 200 50
df %>% unite(temp, raw_score, scaled_score) %>%
spread(subject, temp) %>%
separate(ELA, c("ela_raw_score", "ela_scaled_score")) %>%
separate(Math, c("math_raw_score", "math_scaled_score"))
## # A tibble: 1 x 5
## student ela_raw_score ela_scaled_score math_raw_score math_scaled_score
## * <chr> <chr> <chr> <chr> <chr>
## 1 bob 20 100 50 200
Using the AP Input student scores, create a table which shows how many student received each level per grade and school. We want a column for each level and a total.
## # A tibble: 6 x 10
## year school grade subject l1 l2 l3
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2009 NSA-WPHS 11th Grade English Lit and Composition 0 3 0
## 2 2009 NSA-WPHS 12th Grade Calc AB 4 3 3
## 3 2009 NSA-WPHS 12th Grade Calc AB&BC 4 3 3
## 4 2009 NSA-WPHS 12th Grade English Lit and Composition 0 4 1
## 5 2009 NSA-WPHS 12th Grade US History 3 3 3
## 6 2010 NSA-WPHS 11th Grade English Lang and Composition 0 3 4
## # ... with 3 more variables: l4 <dbl>, l5 <dbl>, total <dbl>