Introduction to R: The Basics

Erin

2018-02-01

Welcome and Introduction

Before we jump into it, make sure that you have R, RStudio, and the “tidyverse” package installed.

All set? Great, let’s begin!

Which language do people use for what?

What do you want to do?    Language
Building a website         HTML, CSS
Web application            JavaScript
Phone application          Swift, Objective-C, Java
Numerical computation      IDL, Python, MATLAB
Stats                      R, Stata
Visualization              R, JavaScript (D3)
Typesetting                LaTeX, Markdown
Big data                   SQL, Scala, Python

Why R?

What problems have you seen or come across in the way you currently do data analyses and manipulation that you wish you could change?

Discuss with team.

Why R?

“Struggling through programming helps you learn” - Chester Ismay

  • Free: R and RStudio are free and open source!
  • Reproducible: Analyses done using R are easy to run again with updated data.
  • Collaboration: Results are easy to share as a PDF.
  • Manipulation: Data manipulation is much easier due to cool packages.
  • Understandable: You can follow how new variables are created in someone else’s code, and you can remember what past-you did.

Why R?

  • Spot Checking: You can also QC yourself along the way, instead of all at the end. PLUS code is often easier to QC because it’ll yell at you when something goes wrong.
  • Visualizations: Play with visualizations before throwing into a dashboard to see if it looks interesting.
  • Help: Answers to your programming questions are a quick Google search away.
  • Ease: You can create multiple data tables (files) in R to be read into Tableau, OR you can join them together in R and have Tableau read in only one file.
  • Community: R is actively developed and has a very active user community.

R History

What is R?

R is a calculator

2 + 2

NO!!

R is a programming language. Specifically, it’s a programming language built for statistics.
And that’s what it’s best at.

  • R is a dialect of the S language, which was developed by John Chambers at Bell Labs in 1976. S still exists today, although it hasn’t changed much since 1998.
  • The philosophy behind S (and R) was to allow users to begin in an interactive environment that didn’t explicitly feel like programming.
  • As their needs and skills grew they could move into more of the programming aspects. This helps us understand some of why R is the way it is.

Packages

  • Packages are simply bits of code, external to the core R code, that are designed to perform a specific function.
  • The vast majority of the usefulness and functionality of R resides in packages.
  • These packages live in online repositories and can be installed on your own system to be used.

Installing packages

  • Packages need only be installed once, although you may have to re-install when upgrading R or when you want to use a newer version of a package.
  • To install from CRAN all one needs to do is:
install.packages("tidyverse")

Using packages

Once a package is installed and loaded, all of its functions are available to be used.

library(tidyverse)
filter(iris, Species == "setosa")
  • library() loads the package into memory and allows you to use the functions within without naming the package directly every time.
  • Technically, when you attach a package, R puts its functions on your search path, the list of places R looks for objects and functions.
  • This may cause problems if packages have functions with the same name. R will choose the version from the package loaded last.
  • Packages are attached in your current session and need to be attached every time you start a new session.
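
The search-path behaviour described above can be seen without any extra packages. This is a minimal sketch using only base R: instead of attaching two packages that clash, it defines its own function called sd (an illustrative stand-in for a package attached later), which then masks stats::sd. The pkg::function form always reaches the version you name.

```r
# The search path: the global environment comes first, then attached packages
head(search(), 3)

# Illustrative stand-in for a clashing package: our own function named sd()
sd <- function(x) "my own sd"

masked   <- sd(c(1, 2, 3))         # R finds our version first
explicit <- stats::sd(c(1, 2, 3))  # pkg::function names the package version directly

print(masked)    # "my own sd"
print(explicit)  # 1

# Removing our version makes stats::sd visible again
rm(sd)
restored <- sd(c(1, 2, 3))
print(restored)  # 1
```

The same pkg::function form works for installed packages you have not attached, which is handy when you only need one function from a package.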

Data Types

Data Types

Data type                     Example
Integer                       1L
Logical                       TRUE
Numeric                       1.1
String / character            "Red"
Factor (enumerated string)    "Amber" (stored as 2) with levels c("Red","Amber","Green")
Complex                       1i
Date                          as.Date("2015-04-24")
NA                            NA
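
A quick way to check what type a value really has is class(). The sketch below builds one value of each type listed above and records its class; the variable names (types, signal) are made up for illustration. Note that a bare 1 is numeric, and only 1L is an integer.

```r
# class() of one example value per data type
types <- c(
  integer   = class(1L),
  plain_one = class(1),                       # a bare 1 is numeric, not integer
  numeric   = class(1.1),
  logical   = class(TRUE),
  character = class("Red"),
  complex   = class(1i),
  date      = class(as.Date("2015-04-24")),
  missing   = class(NA)                       # a plain NA is logical
)
print(types)

# A factor stores each value as an integer code pointing into its levels
signal <- factor("Amber", levels = c("Red", "Amber", "Green"))
class(signal)         # "factor"
as.integer(signal)    # 2, because "Amber" is the second level
as.character(signal)  # "Amber"
```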

Data Functions

Function          Use
is.[data type]    Whether a vector is of a particular type
as.[data type]    Attempts to coerce a vector to a data type
str               Structure of an object, including class/data type and dimensions
class             The class(es)/data type(s) an object belongs to
summary           Summarizes an object
dput              Get R code that recreates an object
unlist            Simplify a list to a vector
dim               Dimensions of an object (such as a data frame or matrix)
length            Length of a vector or list
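
Here is a short sketch that exercises most of these functions on small made-up objects (scores, df, and nested are illustrative names, not part of any dataset used later).

```r
scores <- c("10", "20", "thirty")

is.character(scores)   # TRUE
is.numeric(scores)     # FALSE

# as.numeric() converts what it can; "thirty" becomes NA (R also warns about this)
scores_num <- suppressWarnings(as.numeric(scores))
print(scores_num)

df <- data.frame(a = 1:3, b = c("x", "y", "z"))
str(df)          # compact overview of columns and their types
class(df)        # "data.frame"
dim(df)          # rows, then columns
length(df)       # number of columns, because a data frame is a list of columns
summary(df$a)

# unlist() flattens a list into a single vector
nested <- list(first = 1:2, second = 3)
flat <- unlist(nested)
print(flat)

# dput() prints R code that would recreate the object
dput(flat)
```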

Compound data types

Data type     Info                                             Construction example(s)
Vector        A 1D set of values of the same data type         c(1,"a"), 1:3
List          A collection of objects of various data types    list(vector=c(1,"a"), df=data.frame(a=1:6))
Data frame    A 2D set of values of different data types       data.frame(a=1:26, b=11:36)
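
The "same data type" rule for vectors is easy to see in practice: mixing a number and a string in c() silently converts everything to character, while a list or data frame keeps each piece's own type. A minimal sketch with made-up objects:

```r
# A vector holds one type only, so the number 1 is coerced to the string "1"
mixed_vector <- c(1, "a")
class(mixed_vector)   # "character"
print(mixed_vector)

# A list keeps each element's own type
mixed_list <- list(number = 1, letter = "a")
class(mixed_list$number)  # "numeric"
class(mixed_list$letter)  # "character"

# A data frame keeps one type per column
# (stringsAsFactors = FALSE keeps names as character on older R versions too)
people <- data.frame(id = 1:3, name = c("Ana", "Ben", "Cy"),
                     stringsAsFactors = FALSE)
column_types <- sapply(people, class)
print(column_types)
```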

Other important key R functions

Generating and manipulating sequences

seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 20, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19

seq(from = 2, by = 2.5, length.out = 10)
##  [1]  2.0  4.5  7.0  9.5 12.0 14.5 17.0 19.5 22.0 24.5
rep(2, 3)
## [1] 2 2 2

rep(1:3, 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

rep(1:3, each = 2)
## [1] 1 1 2 2 3 3

rep(c("A", "B", "C"), each = 6)
##  [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C"
## [18] "C"

Basic statistics

set.seed(3823)
x <- sample(1:1000, size = 50, replace = TRUE)
max(x)
## [1] 982

min(x)
## [1] 4

range(x)
## [1]   4 982

mean(x)
## [1] 511.7

median(x)
## [1] 511

sum(x)
## [1] 25585

sd(x)
## [1] 265.8911

var(x)
## [1] 70698.09
sum(ifelse(x > 400, 1, 0))
## [1] 31
y <- rnorm(x, 1, 0.2 * x) + x
plot(x,y)

Simple linear model

my_model <- lm(y ~ x)
summary(my_model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -278.58  -50.71    9.29   53.68  326.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.1253    35.2760  -1.024    0.311    
## x             1.1210     0.0613  18.286   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.1 on 48 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.8719 
## F-statistic: 334.4 on 1 and 48 DF,  p-value: < 2.2e-16
plot(x, y, main = "Linear model")
abline(my_model, col = "red")

Assignment and Subsetting

Variables are assigned with <-

x <- 5 
x <- c(1, 3, 4, 5)
df <- data.frame(x = 1:5, y = 2:6) 

Shortcut in RStudio: Alt + - (Option + - on a Mac)

Columns in a data frame can be accessed or created with $

df$z <- df$x + df$y

Though if you need to create more than one new column, it’s much more concise to use mutate().
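
As a rough comparison, here is the same data frame extended both ways. The mutate() call assumes the dplyr package (part of the tidyverse) is installed, so the sketch checks for it first; the column names total and doubled are made up for illustration.

```r
df <- data.frame(x = 1:5, y = 2:6)

# One new column at a time with $
df$z <- df$x + df$y

# Several new columns in one call with mutate(), if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  # A later column (doubled) can use one created earlier in the same call (total)
  df2 <- dplyr::mutate(df, total = x + y, doubled = total * 2)
  print(df2)
}
```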

Accessing elements

From @hadleywickham

x <- 1:5
x[3]
## [1] 3
x[3] <- 42
x
## [1]  1  2 42  4  5

Different Ways to Subset Vectors

Subset type          Example                                  Example output
Nothing              x[]                                      1, 2, 42, 4, 5
Positive integers    x[5]                                     5
Negative integers    x[-1]                                    2, 42, 4, 5
Logical vectors      x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]     1, 42, 5
Zero                 x[0]                                     numeric(0) (an empty vector)

Subsetting Data frames

  • $ + column name to create or access a column
  • df[row, column]
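
A few common patterns for the two forms above, sketched on the small df from the assignment slide (the result names are only for illustration):

```r
df <- data.frame(x = 1:5, y = 2:6)

x_column   <- df$x               # one column as a vector
second_row <- df[2, ]            # row 2, all columns (still a data frame)
y_column   <- df[, "y"]          # all rows of column y (a vector)
one_value  <- df[2, "y"]         # a single cell: row 2, column y
big_x      <- df[df$x > 3, ]     # rows where x is greater than 3
some_rows  <- df[c(1, 3), c("x", "y")]  # rows 1 and 3, both columns

print(one_value)
print(big_x)
```

Leaving the row or column position empty means "everything", which is why df[2, ] returns a whole row.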

Note that a much more detailed overview of subsetting can be found in the Subsetting chapter of Advanced R.

Where to get help

Built-in help

R package authors are required to document their functions, although the documentation varies in usefulness.

  • Simply type ?function_name to get help on a function.
  • Look carefully at what parameters the function requires and what type they are.
  • Some are required (listed first, no default) and some are optional (a default value is usually listed).
  • Most function help will also indicate what the function returns.
  • Good documentation also has more information on what the function is doing.
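
You can also inspect a function's parameters from the console. This sketch uses sd() from base R as an example: x has no default, so it is required, while na.rm defaults to FALSE and is optional.

```r
# Open the help page (run one of these in an interactive session):
# ?sd
# help("sd")

# args() prints the function's signature
args(sd)

# formals() returns the parameters as a list you can inspect
sd_args <- formals(sd)
arg_names <- names(sd_args)
print(arg_names)        # "x" "na.rm"

# The default value of the optional parameter
na_rm_default <- sd_args$na.rm
print(na_rm_default)    # FALSE
```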

Elsewhere

  • Sometimes authors will provide more detailed documentation online.
  • This is more common for more recent packages where the authors may have a github repository and associated webpage.
  • Discussion pages (Google Groups, Stack Overflow) are often a useful source of help too.

Errors

GOOGLE IT!!!

If I get an error I haven’t seen, the first thing I will do is Google it. Usually within a few clicks I can find what went wrong.

But sometimes an easy answer can’t be found so here’s a quick process to walk through:

  1. Re-read the error and then think about it for a minute. See if you can’t get a grasp on what’s really going wrong.
  2. Check your code for errors. Spelling errors, misplaced commas, and forgotten parentheses can all cause problems.
  3. Look it up - I very, very rarely get an error that someone else hasn’t seen before.
  4. If you still can’t find a solution then you can ask for help. I can answer brief questions or you can post questions online at Stack Overflow.

To get you started, here are a few of the more common errors you might see:

Think about what is going wrong for each of these.

my_object
## Error in eval(expr, envir, enclos): object 'my_object' not found
iris[, 6]
## Error in `[.data.frame`(iris, , 6): undefined columns selected
  • Hint: How many columns does the iris data frame have?
a < - 5
## Error in eval(expr, envir, enclos): object 'a' not found
  • Hint: look carefully
sample[1:10,]
## Error in sample[1:10, ]: object of type 'closure' is not subsettable
  • Hint: What does typeof(sample) give you? What about sample(10)? Or ?sample
pet_a_cat()
## Error in eval(expr, envir, enclos): could not find function "pet_a_cat"
nothing = NA
if (nothing == NA) {
    print("empty")
}
## Error in if (nothing == NA) {: missing value where TRUE/FALSE needed
  • Hint: What does nothing == NA give you? How about is.na(nothing)?
my_data <- read.table("mydata.txt")
## Warning in file(file, "rt"): cannot open file 'mydata.txt': No such file or
## directory
## Error in file(file, "rt"): cannot open the connection
  • Hint: Read the error message carefully.
x <- data.frame(y = NULL)
x$y = 1:4
## Error in `$<-.data.frame`(`*tmp*`, "y", value = 1:4): replacement has 4 rows, data has 0
  • Hint: How many rows does x have?
mean(c(NA, 4, 2))
## [1] NA
  • Hint: Is this an error?
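
Once you have thought about the two NA examples above, this sketch shows one way to handle them. Comparing anything to NA with == gives NA rather than TRUE or FALSE, so is.na() is the usual test, and many summary functions accept na.rm = TRUE to skip missing values. The variable names are illustrative.

```r
nothing <- NA

comparison <- nothing == NA     # NA: "is an unknown value equal to an unknown value?"
is_missing <- is.na(nothing)    # TRUE
print(comparison)
print(is_missing)

# is.na() gives if() the TRUE/FALSE it needs
if (is.na(nothing)) {
  print("empty")
}

values <- c(NA, 4, 2)
mean_with_na <- mean(values)                # NA, because one value is unknown
mean_clean   <- mean(values, na.rm = TRUE)  # 3, the mean of 4 and 2
print(mean_with_na)
print(mean_clean)
```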

Questions so Far

Scripting

What are the important parts of a program?

  • Set working directory
  • Load packages
  • Load data
  • Functions
  • Code!

How do you write fast, good, clean code?

  • Comment comment comment
  • Use tabs and spaces to the best of your ability
  • THINK VERY HARD ABOUT EVERY FOR LOOP YOU USE. Do you really need it? (We’ll talk about this more in a bit.)
  • If you’re using the same chunk of code over and over again, write it into a function.
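
To make the last two points concrete, here is a small sketch. The first part computes squares with a for loop and then with a single vectorized expression; the second wraps a repeated calculation in a function. The function name rescale01 and the sample values are made up for illustration.

```r
values <- 1:10

# For loop version: fill a result vector one element at a time
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized version: the same result in one line, no loop needed
squares_vec <- values^2

# A calculation you repeat often belongs in a function
# (rescale01 is a hypothetical helper that maps a vector onto the range 0 to 1)
rescale01 <- function(v) {
  (v - min(v)) / (max(v) - min(v))
}

print(rescale01(values))
print(rescale01(c(5, 10, 20)))
```

Most arithmetic and comparison operators in R already work on whole vectors, so the vectorized form is usually both shorter and faster.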

Sample Script…

# Set working directory
setwd("S:/Data Analytics/State Test Analysis/2015-2016/New York State Exams/Results Release/Tableau Data Files/")

# Load packages
library(readxl)
library(tidyverse)

# Define functions
# file = file name, sheetname = sheet number or name, cn = TRUE or FALSE
loaddata <- function(file, sheetname, cn) {
    x <- read_excel(file, sheet = sheetname, na = "", 
        col_names = cn)
    valid_names <- make.names(names = names(x), unique = TRUE, 
        allow_ = TRUE)
    names(x) <- valid_names
    return(x)
}

# Load Data
statetestNYS <- loaddata("Uncommon Student Level Results 15-16.xlsx", 
    "Student Scores", TRUE)

# Look at historical state test proficiency by school
# for ELA and Math in NY schools
plotdat <- statetestNYS %>%
    filter(Subject != "Science", Region == "NY", !(Grade == 8 & Subject == "Math")) %>%
    mutate(prof = ifelse(Standard.Achieved %in% c("Level 3", "Level 4"), 1, 0)) %>%
    group_by(Year, Uncommon.school.abbreviation, Subject, Grade) %>%
    summarise(avgP = mean(prof))

ggplot(plotdat, aes(Year, avgP, color = Subject)) + 
    geom_boxplot() + facet_grid(Subject ~ Grade)

Now you try!

Using the comma_survey dataset in the fivethirtyeight package…

  1. How do you load the fivethirtyeight package to find the comma_survey data set?
  2. What is the income, education and location of the respondent in the 42nd row?
  3. How many respondents are in this data set? How many locations?
  4. How many respondents care about the Oxford comma A lot?
  5. What is the average care rate of the singular vs plural form of “data”?
  6. How many Male respondents in the age range of 45-60 think that caring about proper grammar is Very important?

Answers!

  1. How do you load the fivethirtyeight package to find the comma_survey dataset?
library(tidyverse)
library(fivethirtyeight)
df <- fivethirtyeight::comma_survey
  2. What is the income, education and location of the respondent in the 42nd row?
## # A tibble: 1 x 3
##   household_income                        education location
##             <fctr>                           <fctr>    <chr>
## 1     $0 - $24,999 Some college or Associate degree  Pacific
  3. How many respondents are in this data set? How many locations?
## [1] "1129, 10"
  4. How many respondents care about the Oxford comma A lot?
## [1] 291
  5. What is the average care rate of the singular vs plural form of “data”?
## [1] "49.9%"
  6. How many Male respondents in the age range of 45-60 think that caring about proper grammar is Very important?
## [1] 88