Tidy Data

Erin Grand

July 7, 2017

Tidy Data

Tiday Data

Tidy Data

Hadley
Hadley

The essence of tidy data is:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Why?

  • Standardized
  • Easier for you and others to analyze
  • Easier for the computer to analyze
  • Easier to go from Tidy format into other formats if needed

Example: Is this untidy or tidy data?

## Classes 'tbl_df', 'tbl' and 'data.frame':    403 obs. of  71 variables:
##  $ episode           : chr  "S01E01" "S01E02" "S01E03" "S01E04" ...
##  $ season            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ episode_num       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ title             : chr  "A WALK IN THE WOODS" "MT. MCKINLEY" "EBONY SUNSET" "WINTER MIST" ...
##  $ apple_frame       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ aurora_borealis   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ barn              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ beach             : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ boat              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bridge            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ building          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bushes            : int  1 0 0 1 0 0 0 1 0 1 ...
##  $ cabin             : int  0 1 1 0 0 1 0 0 0 0 ...
##  $ cactus            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ circle_frame      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cirrus            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cliff             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ clouds            : int  0 1 0 1 0 0 0 0 1 0 ...
##  $ conifer           : int  0 1 1 1 0 1 0 1 0 1 ...
##  $ cumulus           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ deciduous         : int  1 0 0 0 1 0 1 0 0 1 ...
##  $ diane_andre       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dock              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ double_oval_frame : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ farm              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fence             : int  0 0 1 0 0 0 0 0 1 0 ...
##  $ fire              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ florida_frame     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ flowers           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fog               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ framed            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ grass             : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ half_circle_frame : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ half_oval_frame   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hills             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lake              : int  0 0 0 1 0 1 1 1 0 1 ...
##  $ lakes             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lighthouse        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mill              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ moon              : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ mountain          : int  0 1 1 1 0 1 1 1 0 1 ...
##  $ mountains         : int  0 0 1 0 0 1 1 1 0 0 ...
##  $ night             : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ ocean             : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ oval_frame        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ palm_trees        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ path              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ person            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ portrait          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rectangle_3d_frame: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rectangular_frame : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ river             : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ rocks             : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ seashell_frame    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ snow              : int  0 1 0 0 0 1 0 0 0 0 ...
##  $ snowy_mountain    : int  0 1 0 1 0 1 1 0 0 0 ...
##  $ split_frame       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ steve_ross        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ structure         : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ sun               : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ tomb_frame        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tree              : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ trees             : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ triple_frame      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ waterfall         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ waves             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ windmill          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ window_frame      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ winter            : int  0 1 1 0 0 1 0 0 0 0 ...
##  $ wood_framed       : int  0 0 0 0 0 0 0 0 0 0 ...

Is this tidy data?

## # A tibble: 6 x 13
##             name height  mass  hair_color  skin_color eye_color birth_year
##            <chr>  <int> <dbl>       <chr>       <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77       blond        fair      blue       19.0
## 2          C-3PO    167    75        <NA>        gold    yellow      112.0
## 3          R2-D2     96    32        <NA> white, blue       red       33.0
## 4    Darth Vader    202   136        none       white    yellow       41.9
## 5    Leia Organa    150    49       brown       light     brown       19.0
## 6      Owen Lars    178   120 brown, grey       light      blue       52.0
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Cleaning and restructuring data

Step 1: Make sure it imported correctly

  • Are your column types correct?
  • Are all the columns and rows there?
  • How about your headers (if you have them)?
  • Did R give you any warnings?
  • Look at your data!

Step 2: Tidy it up

  • Think about how you want your data structured for analysis
  • What are your observations? What are your variables? Do you need multiple tables to capture your data better?
  • Use tidyr::gather() and tidyr::spread() to appropriately transform your data
  • If you have super messy data with lots of code needed to clean it up, save the cleaned data separately for the final analysis output.

Gather and Spread

tidyr gather spread
reshape2 melt cast
spreadsheets unpivot pivot
databases fold unfold

Gather

  • moves data into long form
  • syntax is similar to select
  • gather(key, value, ...)

Example

gather(fivethirtyeight::bob_ross, object, included, apple_frame:wood_framed) %>%
  arrange(episode, season, episode_num, title) 
## # A tibble: 27,001 x 6
##    episode season episode_num               title          object included
##      <chr>  <dbl>       <dbl>               <chr>           <chr>    <int>
##  1  S01E01      1           1 A WALK IN THE WOODS     apple_frame        0
##  2  S01E01      1           1 A WALK IN THE WOODS aurora_borealis        0
##  3  S01E01      1           1 A WALK IN THE WOODS            barn        0
##  4  S01E01      1           1 A WALK IN THE WOODS           beach        0
##  5  S01E01      1           1 A WALK IN THE WOODS            boat        0
##  6  S01E01      1           1 A WALK IN THE WOODS          bridge        0
##  7  S01E01      1           1 A WALK IN THE WOODS        building        0
##  8  S01E01      1           1 A WALK IN THE WOODS          bushes        1
##  9  S01E01      1           1 A WALK IN THE WOODS           cabin        0
## 10  S01E01      1           1 A WALK IN THE WOODS          cactus        0
## # ... with 26,991 more rows

Spread

  • moves data into long form
  • you can only easily spread one column at a time
  • if values for groups are missing, there are filled in with NA by default
  • you can change the fill parameter in the spread call to fill in missing results with something else spread(key, value, fill=0)
  • sometimes you don’t have a value column to spread out, but you can always make one with mutate

Example

2015-16 New York State Exams

read_rds("S:/Data Analytics/State Test Analysis/2016-2017/New York State Exams/Results Release/~Data/Processed/Uncommon Student-level results/Output/NYS_SY1516_results.rds") %>% 
  filter(grade != 13) %>% 
  clean_names() %>%
  group_by(school, school_type, region, assessment_description) %>%
  summarise(percent_proficient = mean(proficient)) %>%
  spread(assessment_description, percent_proficient) 
## Warning: package 'bindrcpp' was built under R version 3.3.3
## # A tibble: 22 x 14
## # Groups:   school, school_type, region [22]
##     school school_type       region `Grade 3 ELA` `Grade 3 Math`
##  *   <chr>       <chr>        <chr>         <dbl>          <dbl>
##  1     BEC          MS Uncommon NYC            NA             NA
##  2     BSC          MS Uncommon NYC            NA             NA
##  3     BVC          MS Uncommon NYC            NA             NA
##  4 EBCS-EA          ES Uncommon NYC     0.4000000      0.6705882
##  5 EBCS-MA          MS Uncommon NYC            NA             NA
##  6 EGCS-EA          ES Uncommon NYC     0.8068182      0.8750000
##  7 EGCS-MA          MS Uncommon NYC            NA             NA
##  8    KCCS          MS Uncommon NYC            NA             NA
##  9 LPBS-EA          ES Uncommon NYC     0.6091954      0.8160920
## 10 LPBS-MA          MS Uncommon NYC            NA             NA
## # ... with 12 more rows, and 9 more variables: `Grade 4 ELA` <dbl>, `Grade
## #   4 Math` <dbl>, `Grade 5 ELA` <dbl>, `Grade 5 Math` <dbl>, `Grade 6
## #   ELA` <dbl>, `Grade 6 Math` <dbl>, `Grade 7 ELA` <dbl>, `Grade 7
## #   Math` <dbl>, `Grade 8 ELA` <dbl>

Unite and Seperate

Unite is a concatenate function.

  • You define the name of the new column, and the ones you want to combine, in the order you wish to do so
unite(temp, school, grade)

Creates columns called “temp” from school and grade.

  • You can also set the separator value with sep=" “, the default is”snake case" i.e sep=“_"
unite(temp, school, grade, sep="")

Separate

The opposite of unite! - separates values in a single column into new columns - Works best when every value in the column has the same number of possible elements to separate - If not, fill parameter - Define what the split is with sep=“_“. The default value is any non-alpha-numeric values (.,_…etc)

Example: Spead two columns

Subject, Raw Score, Scaled Score -> ELA Raw Score, ELA Scaled Score, Math Raw Score, Math Scaled Score

(df <- tibble(student = c("bob", "bob"), subject = c("ELA", "Math"), scaled_score = c(100, 200), raw_score = c(20, 50)))
## # A tibble: 2 x 4
##   student subject scaled_score raw_score
##     <chr>   <chr>        <dbl>     <dbl>
## 1     bob     ELA          100        20
## 2     bob    Math          200        50
df %>% unite(temp, raw_score, scaled_score) %>%
  spread(subject, temp) %>%
  separate(ELA, c("ela_raw_score", "ela_scaled_score")) %>%
  separate(Math, c("math_raw_score", "math_scaled_score")) 
## # A tibble: 1 x 5
##   student ela_raw_score ela_scaled_score math_raw_score math_scaled_score
## *   <chr>         <chr>            <chr>          <chr>             <chr>
## 1     bob            20              100             50               200

You try

Task 1:

Using the AP Input student scores, create a table which shows how many student received each level per grade and school. We want a column for each level and a total.

## # A tibble: 6 x 10
##    year   school      grade                      subject    l1    l2    l3
##   <chr>    <chr>      <chr>                        <chr> <dbl> <dbl> <dbl>
## 1  2009 NSA-WPHS 11th Grade  English Lit and Composition     0     3     0
## 2  2009 NSA-WPHS 12th Grade                      Calc AB     4     3     3
## 3  2009 NSA-WPHS 12th Grade                   Calc AB&BC     4     3     3
## 4  2009 NSA-WPHS 12th Grade  English Lit and Composition     0     4     1
## 5  2009 NSA-WPHS 12th Grade                   US History     3     3     3
## 6  2010 NSA-WPHS 11th Grade English Lang and Composition     0     3     4
## # ... with 3 more variables: l4 <dbl>, l5 <dbl>, total <dbl>