Focusing on filter, select, group_by and summarise

Filter

The most important thing you can do to do big data is filter is down to smaller pieces. You can filter the data in different ways, as detailed in the image below.

** Example: Find all the cars from Audi or Volkswagen: **

mpg %>% filter(manufacturer %in% c("audi", "volkswagen"))

Select

Select columns of data based on the name of the column. - Select groups of columns by col1:coln, where the columns you want are col1, coln and all the columns in between. - Select columns by calling them out by name (the most common way) - You can deselect columns by using a negative sign. I.E. select(-col5) will select everything except col5.

Example: Select manuafacturer, model, year, cty and hwy. Two ways to do the same thing!

select(mpg, manufacturer, model, year, cty, hwy)
select(mpg, manufacturer:hwy, -displ, -trans, -cyl, -drv)

Arrange

Arrange the data to be sorted by a particular column

Example: What model / year of car get’s the worst highway MPG?

arrange(mpg, hwy)

Mutate: Make new variables

Mutate the original data frame to have additional columns. You are always given the SAME amount of rows back.

Example: Convert the MPG to KPL for both highway and city speeds.

mutate(mpg, hwy.kpl = hwy * 0.4251, hwy.cty = cty * 0.4251)

You do: Add a column to the mpg data set with a flag for “non-energy efficient” cars. Define this as a HWY MPG below 30 mpg.

Group By / Summarize

This is how you might create a pivot_table like object. You group columns together to get a summary value. The summarise function create the summary statistic.

Summary Statistic R expression
Average / mean mean()
Maximum max()
Minimum min()
Median median()
Standard deviation sd()
Count n()
Count Distinct n_distinct()
# use dplyr to group by species and compute the average and standard deviation of sepal length for each species
iris %>%
  group_by(Species) %>%
  summarize(mean_sepal_length=mean(Sepal.Length),
            sd_sepal_length=sd(Sepal.Length))

Relationship of summarize to table:

Remember This command?

# count the number of rows for each species
table(iris$Species)

It’s helpful to look at, but you will probably want the same information in a data-frame so we can use it later.

# use dplyr to do the same, returning a dataframe (instead of table)
iris %>%
  group_by(Species) %>%
  summarize(num_rows=n())
LS0tDQp0aXRsZTogJ0ludHJvZHVjdGlvbiB0byBSOiBEYXRhIE1hbmlwdWxhdGlvbicNCm91dHB1dDoNCiAgaHRtbF9ub3RlYm9vazoNCiAgICB0aGVtZTogY2VydWxlYW4NCiAgICB0b2M6IHllcw0KICAgIHRvY19kZXB0aDogMg0KICBodG1sX2RvY3VtZW50Og0KICAgIHRvYzogeWVzDQogICAgdG9jX2RlcHRoOiAnMicNCi0tLQ0KDQpgYGB7ciwgbWVzc2FnZT1GQUxTRSwgd2FybmluZz1GQUxTRSwgaW5jbHVkZT1GQUxTRX0NCmxpYnJhcnkodGlkeXZlcnNlKQ0KYGBgDQojIEZvY3VzaW5nIG9uIGZpbHRlciwgc2VsZWN0LCBncm91cF9ieSBhbmQgc3VtbWFyaXNlDQoNCiMjIEZpbHRlcg0KVGhlIG1vc3QgaW1wb3J0YW50IHRoaW5nIHlvdSBjYW4gZG8gdG8gZG8gYmlnIGRhdGEgaXMgZmlsdGVyIGlzIGRvd24gdG8gc21hbGxlciBwaWVjZXMuDQpZb3UgY2FuIGZpbHRlciB0aGUgZGF0YSBpbiBkaWZmZXJlbnQgd2F5cywgYXMgZGV0YWlsZWQgaW4gdGhlIGltYWdlIGJlbG93Lg0KDQohW10oaW1hZ2VzL2ZpbHRlcnMucG5nKQ0KDQoqKiBFeGFtcGxlOiBGaW5kIGFsbCB0aGUgY2FycyBmcm9tIEF1ZGkgb3IgVm9sa3N3YWdlbjogKioNCmBgYHtyfQ0KbXBnICU+JSBmaWx0ZXIobWFudWZhY3R1cmVyICVpbiUgYygiYXVkaSIsICJ2b2xrc3dhZ2VuIikpDQpgYGANCg0KIyMgU2VsZWN0DQpTZWxlY3QgY29sdW1ucyBvZiBkYXRhIGJhc2VkIG9uIHRoZSBuYW1lIG9mIHRoZSBjb2x1bW4uDQotIFNlbGVjdCBncm91cHMgb2YgY29sdW1ucyBieSBjb2wxOmNvbG4sIHdoZXJlIHRoZSBjb2x1bW5zIHlvdSB3YW50IGFyZSBjb2wxLCBjb2xuIGFuZCBhbGwgdGhlIGNvbHVtbnMgaW4gYmV0d2Vlbi4NCi0gU2VsZWN0IGNvbHVtbnMgYnkgY2FsbGluZyB0aGVtIG91dCBieSBuYW1lICh0aGUgbW9zdCBjb21tb24gd2F5KQ0KLSBZb3UgY2FuIGRlc2VsZWN0IGNvbHVtbnMgYnkgdXNpbmcgYSBuZWdhdGl2ZSBzaWduLiBJLkUuIHNlbGVjdCgtY29sNSkgd2lsbCBzZWxlY3QgZXZlcnl0aGluZyBleGNlcHQgY29sNS4NCg0KKipFeGFtcGxlOiBTZWxlY3QgIG1hbnVhZmFjdHVyZXIsIG1vZGVsLCB5ZWFyLCBjdHkgYW5kIGh3eS4qKg0KVHdvIHdheXMgdG8gZG8gdGhlIHNhbWUgdGhpbmchDQpgYGB7cn0NCnNlbGVjdChtcGcsIG1hbnVmYWN0dXJlciwgbW9kZWwsIHllYXIsIGN0eSwgaHd5KQ0Kc2VsZWN0KG1wZywgbWFudWZhY3R1cmVyOmh3eSwgLWRpc3BsLCAtdHJhbnMsIC1jeWwsIC1kcnYpDQpgYGANCg0KIyMgQXJyYW5nZQ0KQXJyYW5nZSB0aGUgZGF0YSB0byBiZSBzb3J0ZWQgYnkgYSBwYXJ0aWN1bGFyIGNvbHVtbg0KDQoqKkV4YW1wbGU6IFdoYXQgbW9kZWwgLyB5ZWFyIG9mIGNhciBnZXQncyB0aGUgd29yc3QgaGlnaHdheSBNUEc/KioNCmBgYHtyfQ0KYXJyYW5nZShtcGcsIGh3eSkNCmBgYA0KDQoNCiMjIE11dGF0ZTogTWFrZSBuZXcgdmFyaWFibGVzDQpNdXRhdGUgdGhlIG9yaWdpbmFsIGRhdGEgZnJhbWUgdG8gaGF2ZSBhZGRpdGlvbmFsIGNvbHVtbnMuDQpZb3UgYXJlIGFsd2F5cyBnaXZlbiB0aGUgU0FNRSBhbW91bnQgb2Ygcm93cyBiYWNrLg0KDQoqKkV4YW1wbGU6IENvbnZlcnQgdGhlIE1QRyB0byBLUEwgZm9yIGJvdGggaGlnaHdheSBhbmQgY2l0eSBzcGVlZHMuKioNCg0KYGBge3J9DQptdXRhdGUobXBnLCBod3kua3BsID0gaHd5ICogMC40MjUxLCBod3kuY3R5ID0gY3R5ICogMC40MjUxKQ0KYGBgDQoNCioqWW91IGRvOiBBZGQgYSBjb2x1bW4gdG8gdGhlIG1wZyBkYXRhIHNldCB3aXRoIGEgZmxhZyBmb3IgIm5vbi1lbmVyZ3kgZWZmaWNpZW50IiBjYXJzLiBEZWZpbmUgDQp0aGlzIGFzIGEgSFdZIE1QRyBiZWxvdyAzMCBtcGcuKioNCg0KYGBge3IsIGVjaG89RiwgaW5jbHVkZSA9RiwgZXZhbD1GfQ0KbXBnICU+JSBtdXRhdGUobm90LmVuZXJnLmVmZiA9IGlmZWxzZShod3kgPCAzMCwgMSwgMCkpDQpgYGANCg0KDQojIyBHcm91cCBCeSAvIFN1bW1hcml6ZQ0KVGhpcyBpcyBob3cgeW91IG1pZ2h0IGNyZWF0ZSBhIHBpdm90X3RhYmxlIGxpa2Ugb2JqZWN0Lg0KWW91IGdyb3VwIGNvbHVtbnMgdG9nZXRoZXIgdG8gZ2V0IGEgc3VtbWFyeSB2YWx1ZS4NClRoZSBzdW1tYXJpc2UgZnVuY3Rpb24gY3JlYXRlIHRoZSBzdW1tYXJ5IHN0YXRpc3RpYy4NCg0KU3VtbWFyeSBTdGF0aXN0aWMgfCBSIGV4cHJlc3Npb24NCi0tLS0tLS0tLS0tLS0gfCAtLS0tLS0tLS0tLS0tDQpBdmVyYWdlIC8gbWVhbiB8IGBtZWFuKClgDQpNYXhpbXVtIHwgYG1heCgpYA0KTWluaW11bSB8IGBtaW4oKWANCk1lZGlhbiB8IGBtZWRpYW4oKWANClN0YW5kYXJkIGRldmlhdGlvbiB8IGBzZCgpYA0KQ291bnQgfCBgbigpYA0KQ291bnQgRGlzdGluY3QgfCBgbl9kaXN0aW5jdCgpYA0KDQoNCmBgYHtyfQ0KIyB1c2UgZHBseXIgdG8gZ3JvdXAgYnkgc3BlY2llcyBhbmQgY29tcHV0ZSB0aGUgYXZlcmFnZSBhbmQgc3RhbmRhcmQgZGV2aWF0aW9uIG9mIHNlcGFsIGxlbmd0aCBmb3IgZWFjaCBzcGVjaWVzDQppcmlzICU+JQ0KICBncm91cF9ieShTcGVjaWVzKSAlPiUNCiAgc3VtbWFyaXplKG1lYW5fc2VwYWxfbGVuZ3RoPW1lYW4oU2VwYWwuTGVuZ3RoKSwNCiAgICAgICAgICAgIHNkX3NlcGFsX2xlbmd0aD1zZChTZXBhbC5MZW5ndGgpKQ0KYGBgDQoNCiMjIyBSZWxhdGlvbnNoaXAgb2YgYHN1bW1hcml6ZWAgdG8gYHRhYmxlYDoNClJlbWVtYmVyIFRoaXMgY29tbWFuZD8NCmBgYHtyfQ0KIyBjb3VudCB0aGUgbnVtYmVyIG9mIHJvd3MgZm9yIGVhY2ggc3BlY2llcw0KdGFibGUoaXJpcyRTcGVjaWVzKQ0KYGBgDQpJdCdzIGhlbHBmdWwgdG8gbG9vayBhdCwgYnV0IHlvdSB3aWxsIHByb2JhYmx5IHdhbnQgdGhlIHNhbWUgaW5mb3JtYXRpb24gaW4gYSBkYXRhLWZyYW1lIHNvIHdlIGNhbiB1c2UgaXQgbGF0ZXIuDQpgYGB7cn0NCiMgdXNlIGRwbHlyIHRvIGRvIHRoZSBzYW1lLCByZXR1cm5pbmcgYSBkYXRhZnJhbWUgKGluc3RlYWQgb2YgdGFibGUpDQppcmlzICU+JQ0KICBncm91cF9ieShTcGVjaWVzKSAlPiUNCiAgc3VtbWFyaXplKG51bV9yb3dzPW4oKSkNCmBgYA0KDQo=