Introduction to R: The Basics

Erin

2018-02-01

Welcome and Introduction

Before we jump in, make sure that you have R, RStudio, and the “tidyverse” package installed.

All set? Great, let’s begin!

Which language do people use for what?

What do you want to do?	Language
Building a website	HTML, CSS
Web application	JavaScript
Phone Application	Swift, Objective C, Java
Numerical Computation	IDL, Python, Matlab
Stats	R, Stata
Visualization	R, JavaScript (D3)
Typesetting	LaTeX, Markdown
Big Data	SQL, Scala, Python

Why R?

What problems have you seen or come across in the way you currently do data analyses and manipulation that you wish you could change?

Discuss with team.

Why R?

“Struggling through programming helps you learn” - Chester Ismay

Free: R and RStudio are free and open source!
Reproducible: Analyses done using R are easy to run again with updated data.
Collaboration: Results are easy to share as a PDF.
Manipulation: Data manipulation is much easier thanks to well-designed packages.
Understandable: You can follow how new variables were created in someone else’s code, and remember what past-you did.

Why R?

Spot Checking: You can QC yourself along the way, instead of all at the end. PLUS code is often easier to QC because it’ll yell at you when something goes wrong.
Visualizations: Play with visualizations to see whether they look interesting before putting them into a dashboard.
Help: Answers to your programming questions are a quick Google search away.
Ease: You can create multiple data tables (files) in R to be read into Tableau, OR you can join them together in R and have Tableau read in only one file.
Community: R is actively developed and has a very active user community.

R History

What is R?

R is a calculator

2 + 2

NO!!

R is a programming language. Specifically, it’s a programming language built for statistics.
And that’s what it’s best at.

R is a dialect of the S language, which was developed by John Chambers at Bell Labs in 1976. S still exists today, although it hasn’t changed much since 1998.
The philosophy behind S (and R) was to allow users to begin in an interactive environment that didn’t explicitly feel like programming.
As their needs and skills grew, they could move into more of the programming aspects. This helps explain why R is the way it is.

Packages

Packages are simply bundles of code, external to core R, that are designed to perform a specific task.
The vast majority of the usefulness and functionality of R resides in packages.
These packages live in online repositories and can be installed on your own system to be used.

Installing packages

Packages need only be installed once, although you may have to re-install when upgrading R or when you want to use a newer version of a package.
To install from CRAN all one needs to do is:

install.packages("tidyverse")
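
A common pattern is to install a package only when it is missing, so that a script can be re-run without reinstalling every time. The sketch below assumes a helper named `install_if_missing` (a hypothetical name, not part of R) and uses the always-present `stats` package as a stand-in so that nothing is actually downloaded.

```r
# Hypothetical helper: install a package from CRAN only if it is not already installed.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available after the check.
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call installs nothing and returns TRUE.
install_if_missing("stats")

# Check which version of an installed package you have.
packageVersion("stats")
```

For this workshop you would call `install_if_missing("tidyverse")` instead.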

Using packages

Once a package is installed and attached with library(), all of its functions are available to use.

library(tidyverse)
filter(iris, Species == "setosa")

library() loads the package into memory and lets you use its functions without naming the package every time.
Technically, attaching a package puts its functions on your search path, the ordered list of places where R looks for objects and functions.
This may cause problems if two packages have functions with the same name. R will use the version from the package attached most recently.
Packages are attached only for your current session and need to be attached again every time you start a new session.
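
When two attached packages share a function name, you can sidestep the ambiguity by naming the package explicitly with `::`. The sketch below uses only packages that ship with R; the `filter` name is chosen because dplyr (part of the tidyverse) also defines a `filter`.

```r
# Call a function by its fully qualified name; no library() call is needed.
stats::median(c(1, 5, 9))

# search() lists what is attached, in the order R looks things up.
search()

# find() reports every attached package that defines a given name.
# With dplyr attached, this would list both dplyr and stats.
utils::find("filter")
```

Writing `dplyr::filter(...)` or `stats::filter(...)` makes it explicit which version you mean, whatever order the packages were attached in.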

Data Types

Data type	Example
Integer	`1L`
Logical	`TRUE`
Numeric	`1.1`
String / character	`"Red"`
Factor (enumerated string)	`"Amber"`, stored as 2 when the levels are `c("Red","Amber","Green")`
Complex	`1i`
Date	`as.Date("2015-04-24")`
NA	NA
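
To see these types for yourself, ask R what class each value has. This is a small sketch; the variable name `light` is just for illustration.

```r
class(1L)                      # "integer"  (the L suffix makes an integer)
class(1)                       # "numeric"  (a plain 1 is stored as a double)
class(TRUE)                    # "logical"
class("Red")                   # "character"
class(1i)                      # "complex"
class(as.Date("2015-04-24"))   # "Date"
class(NA)                      # "logical"  (the default type of NA)

# A factor stores each value as an integer code pointing into its levels.
light <- factor("Amber", levels = c("Red", "Amber", "Green"))
class(light)                   # "factor"
as.integer(light)              # 2
```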

Data Functions

Function	Use
`is.[data type]`	Whether a vector is of a particular type
`as.[data type]`	Attempts to coerce a vector to a data type
`str`	Structure of an object including class/data type, dimensions
`class`	The class(es)/data type(s) an object belongs to
`summary`	Summarizes an object
`dput`	Get R code that recreates an object
`unlist`	Simplify a list to a vector
`dim`	Dimensions (rows, columns) of an object such as a data frame or matrix
`length`	Length of a vector or list
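
Here is a short sketch of several of these functions in action. The objects `v`, `nums`, `scores`, and `flat` are made-up examples.

```r
v <- c("1", "2", "three")

is.character(v)                          # TRUE
is.numeric(v)                            # FALSE

# Coercion that fails yields NA (with a warning, silenced here).
nums <- suppressWarnings(as.numeric(v))
nums                                     # 1 2 NA

scores <- data.frame(id = 1:3, score = c(90, 85, 77))
str(scores)       # compact view of each column's type and first values
class(scores)     # "data.frame"
dim(scores)       # 3 2  (rows, columns)
length(scores)    # 2    (a data frame's length is its number of columns)
summary(scores$score)

dput(c(a = 1, b = 2))                    # prints code that recreates the vector
flat <- unlist(list(a = 1, b = 2:3))     # named vector: a = 1, b1 = 2, b2 = 3
flat
```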

Compound data types

Data type	Info	Construction example(s)
Vector	A 1D set of values of the same data type (mixed inputs are coerced to one type)	`c(1,"a")` , `1:3`
List	A collection of objects of various data types	`list(vector=c(1,"a"), df=data.frame(a=1:6))`
Data.frame	A 2D set of values where each column can have a different data type	`data.frame(a=1:26, b=11:36)`
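
The difference between the three shows up as soon as you mix types. A minimal sketch, with `v`, `l`, and `d` as illustrative names:

```r
# A vector holds one type, so mixing a number and a string coerces both to character.
v <- c(1, "a")
class(v)          # "character"
v                 # "1" "a"

# A list keeps each element's own type.
l <- list(vector = c(1, "a"), df = data.frame(a = 1:6))
class(l$vector)   # "character"
class(l$df)       # "data.frame"

# A data frame is a list of equal-length columns, each with its own type.
d <- data.frame(a = 1:3, b = c("x", "y", "z"))
sapply(d, class)  # a is "integer"; b is "character" in R >= 4.0 ("factor" in older versions)
```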

Other important key R functions

Generating and manipulating sequences

seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 20, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19

seq(from = 2, by = 2.5, length.out = 10)
##  [1]  2.0  4.5  7.0  9.5 12.0 14.5 17.0 19.5 22.0 24.5

rep(2, 3)
## [1] 2 2 2

rep(1:3, 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

rep(1:3, each = 2)
## [1] 1 1 2 2 3 3

rep(c("A", "B", "C"), each = 6)
##  [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C"
## [18] "C"

Basic statistics

set.seed(3823)
x <- sample(1:1000, size = 50, replace = TRUE)

max(x)
## [1] 982

min(x)
## [1] 4

range(x)
## [1]   4 982

mean(x)
## [1] 511.7

median(x)
## [1] 511

sum(x)
## [1] 25585

sd(x)
## [1] 265.8911

var(x)
## [1] 70698.09

sum(ifelse(x > 400, 1, 0))
## [1] 31

y <- rnorm(x, 1, 0.2 * x) + x
plot(x,y)

Simple linear model

my_model <- lm(y ~ x)
summary(my_model)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -278.58  -50.71    9.29   53.68  326.62 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.1253    35.2760  -1.024    0.311    
## x             1.1210     0.0613  18.286   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.1 on 48 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.8719 
## F-statistic: 334.4 on 1 and 48 DF,  p-value: < 2.2e-16

plot(x, y, main = "Linear model")
abline(my_model, col = "red")
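
Beyond printing the summary, you will often want to pull numbers out of a fitted model. The sketch below uses its own toy data (a known intercept of 3 and slope of 2, chosen only for illustration) so that it runs on its own; `fit` is an illustrative name.

```r
set.seed(1)                      # arbitrary seed, for repeatability only
x <- 1:50
y <- 3 + 2 * x + rnorm(50)       # toy data: true intercept 3, true slope 2

fit <- lm(y ~ x)

coef(fit)                        # named vector: (Intercept) and x
head(residuals(fit))             # observed minus fitted values
summary(fit)$r.squared           # proportion of variance explained

# Predict y for new x values; newdata must use the same column name as the formula.
predict(fit, newdata = data.frame(x = c(60, 70)))
```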

Assignment and Subsetting

Variables are assigned with `<-`

x <- 5 
x <- c(1, 3, 4, 5)
df <- data.frame(x = 1:5, y = 2:6)

Shortcut: Alt + -

Columns in a data frame can be accessed or created with `$`

df$z <- df$x + df$y

Though if you need to create more than one new column, it’s much more concise to use mutate().
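
As a rough sketch of what that looks like, here is the same data frame with three columns added in a single `mutate()` call. It assumes the dplyr package (installed as part of the tidyverse); the column names `ratio` and `z_doubled` are made up for illustration.

```r
library(dplyr)  # part of the tidyverse

df <- data.frame(x = 1:5, y = 2:6)

# Create several columns in one call; later columns can use earlier ones.
df <- df %>%
  mutate(
    z = x + y,
    ratio = y / x,
    z_doubled = z * 2
  )
df
```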

Accessing elements

From @hadleywickham

x <- 1:5
x[3]
## [1] 3
x[3] <- 42
x
## [1]  1  2 42  4  5

Different Ways to Subset Vectors

Subset Type	Example	Example Output
Nothing	`x[]`	1, 2, 42, 4, 5
Positive integers	`x[5]`	5
Negative integers	`x[-1]`	2, 42, 4, 5
Logical vectors	`x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]`	1, 42, 5
Zero	`x[0]`	numeric(0) (an empty vector)
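
The same ideas extend to selecting several elements at once, or selecting by a condition. A short sketch using the vector from the table:

```r
x <- c(1, 2, 42, 4, 5)

x[c(1, 3)]        # several positions at once: 1 42
x[-c(1, 2)]       # drop the first two: 42 4 5
x[x > 3]          # logical condition: 42 4 5
x[0]              # zero selects nothing: numeric(0)
length(x[0])      # 0
which(x > 3)      # positions where the condition holds: 3 4 5
```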

Subsetting Data frames

$ + column name to create or access a column
df[row, column]
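
A few examples of the `df[row, column]` form, using a small data frame built just for this sketch. Leaving the row or column slot empty means "all of them".

```r
df <- data.frame(x = 1:5, y = 2:6)

df$y                  # one column as a vector: 2 3 4 5 6
df[2, "y"]            # row 2 of column y: 3
df[1:2, ]             # first two rows, all columns
df[, "x"]             # all rows of column x, returned as a vector
df[df$x > 3, ]        # rows where x is greater than 3 (rows 4 and 5)
df[, c("x", "y")]     # several columns, returned as a data frame
```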

Note that a much more detailed overview of subsetting can be found in the Subsetting chapter of Advanced R.

Where to get help

Built-in help

R package authors are required to document their functions, although the documentation varies in usefulness.

Simply type ?function_name to get help on a function.
Look carefully at which parameters the function requires and what type they are.
Some are required (listed first, no default) and some are optional (a default value is usually listed).
Most help pages also indicate what the function returns.
Good documentation also explains what the function is doing.
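
You can also inspect a function's arguments and defaults straight from the console. A small sketch using `sample()`; the commented-out lines open help pages and are meant for an interactive session.

```r
# args() prints a function's signature; arguments without "= value" have no default.
args(sample)

# formals() returns the arguments as a list, which is handy for checking names.
names(formals(sample))       # "x" "size" "replace" "prob"
formals(sample)$replace      # FALSE (the default)

# In an interactive session, these open documentation:
# ?sample                    # help page for one function
# help(package = "stats")    # index of a package's help pages
# ??"linear model"           # search all installed help pages
```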

Elsewhere

Sometimes authors will provide more detailed documentation online.
This is more common for more recent packages where the authors may have a github repository and associated webpage.
Discussion pages (Google Groups, Stack Overflow) can often be a useful source of help too.

Errors

GOOGLE IT!!!

If I get an error I haven’t seen, the first thing I will do is Google it. Usually within a few clicks I can find what went wrong.

But sometimes an easy answer can’t be found so here’s a quick process to walk through:

Re-read the error and then think about it for a minute. See if you can’t get a grasp on what’s really going wrong.
Check your code for errors. Spelling errors, misplaced commas, and forgotten parentheses can all cause problems.
Look it up - I very, very rarely get an error that someone else hasn’t seen before.
If you still can’t find a solution then you can ask for help. I can answer brief questions or you can post questions online at Stack Overflow.

To get you started, here are a few of the more common errors you might see:

Think about what is going wrong for each of these.

my_object

## Error in eval(expr, envir, enclos): object 'my_object' not found

iris[, 6]

## Error in `[.data.frame`(iris, , 6): undefined columns selected

Hint: How many columns does the iris data frame have?

a < - 5

## Error in eval(expr, envir, enclos): object 'a' not found

Hint: look carefully

sample[1:10,]

## Error in sample[1:10, ]: object of type 'closure' is not subsettable

Hint: What does typeof(sample) give you? What about sample(10)? Or ?sample

pet_a_cat()

## Error in eval(expr, envir, enclos): could not find function "pet_a_cat"

nothing = NA
if (nothing == NA) {
    print("empty")
}

## Error in if (nothing == NA) {: missing value where TRUE/FALSE needed

Hint: What does nothing == NA give you? How about is.na(nothing)?

my_data <- read.table("mydata.txt")

## Warning in file(file, "rt"): cannot open file 'mydata.txt': No such file or
## directory

## Error in file(file, "rt"): cannot open the connection

Hint: Read the error message carefully.

x <- data.frame(y = NULL)
x$y = 1:4

## Error in `$<-.data.frame`(`*tmp*`, "y", value = 1:4): replacement has 4 rows, data has 0

Hint: How many rows does x have?

mean(c(NA, 4, 2))

## [1] NA

Hint: Is this an error?
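
Once you have thought about the two NA examples, here is one way to work through them. This is a sketch of the usual idiom, not the only fix.

```r
nothing <- NA

nothing == NA                 # NA, not TRUE: comparing with NA is itself unknown
is.na(nothing)                # TRUE: the right way to test for a missing value

if (is.na(nothing)) {
  print("empty")
}

mean(c(NA, 4, 2))                 # NA: one missing value makes the mean unknown
mean(c(NA, 4, 2), na.rm = TRUE)   # 3: drop missing values first
```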

Questions so Far

Scripting

What are the important parts of a program?

Set working directory
Load packages
Load data
Functions
Code!

How do you write fast, good, clean code?

Comment comment comment
Use consistent indentation and spacing so the structure of your code is easy to see
THINK VERY HARD ABOUT EVERY FOR LOOP YOU USE. Do you really need it? (We’ll talk about this more in a bit.)
If you’re using the same chunk of code over and over again, write it into a function.
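
To make the last two points concrete, here is a small sketch comparing a for loop with a vectorized version, followed by a reusable function. The data and the helper name `share_above` are made up for illustration.

```r
x <- c(120, 450, 980, 33)

# Loop version: build the result one element at a time.
over_400_loop <- 0
for (value in x) {
  if (value > 400) {
    over_400_loop <- over_400_loop + 1
  }
}

# Vectorized version: the comparison runs on the whole vector at once.
over_400 <- sum(x > 400)

# Hypothetical helper: wrap repeated logic in a function with a clear name.
share_above <- function(values, cutoff) {
  mean(values > cutoff, na.rm = TRUE)
}

share_above(x, 400)    # 0.5
share_above(x, 100)    # 0.75
```

Both versions count two values above 400; the vectorized one is shorter and usually faster on large vectors.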

Sample Script…

# Set working directory
setwd("S:/Data Analytics/State Test Analysis/2015-2016/New York State Exams/Results Release/Tableau Data Files/")

# Load packages
library(readxl)
library(tidyverse)

# Define functions
# file = file name, sheetname = sheet number or name,
# cn = TRUE or FALSE (use the first row as column names?)
loaddata <- function(file, sheetname, cn) {
    x <- read_excel(file, sheet = sheetname, na = "", 
        col_names = cn)
    valid_names <- make.names(names = names(x), unique = TRUE, 
        allow_ = TRUE)
    names(x) <- valid_names
    return(x)
}

# Load Data
statetestNYS <- loaddata("Uncommon Student Level Results 15-16.xlsx", 
    "Student Scores", TRUE)

# Look at historical state test proficiency by school
# for ELA and Math
plotdat <- statetestNYS %>%
    filter(Subject != "Science", Region == "NY",
        !(Grade == 8 & Subject == "Math")) %>%
    mutate(prof = ifelse(Standard.Achieved %in% c("Level 3", "Level 4"), 1, 0)) %>%
    group_by(Year, Uncommon.school.abbreviation, Subject, Grade) %>%
    summarise(avgP = mean(prof))

ggplot(plotdat, aes(Year, avgP, color = Subject)) + 
    geom_boxplot() + facet_grid(Subject ~ Grade)

Now you try!

Using the comma_survey dataset in the fivethirtyeight package…

How do you load the fivethirtyeight package to find the comma_survey dataset?
What is the income, education and location of the respondent in the 42nd row?
How many respondents are in this data set? How many locations?
How many respondents care about the Oxford comma A lot?
What is the average care rate of the singular vs plural form of “data”?
How many Male respondents in the age range of 45-60 think that caring about proper grammar is Very important?

Answers!

How do you load the fivethirtyeight package to find the comma_survey dataset?

library(tidyverse)
library(fivethirtyeight)
df <- fivethirtyeight::comma_survey

What is the income, education and location of the respondent in the 42nd row?

## # A tibble: 1 x 3
##   household_income                        education location
##             <fctr>                           <fctr>    <chr>
## 1     $0 - $24,999 Some college or Associate degree  Pacific

How many respondents are in this data set? How many locations?

## [1] "1129, 10"

How many respondents care about the Oxford comma A lot?

## [1] 291

What is the average care rate of the singular vs plural form of “data”?

## [1] "49.9%"

How many Male respondents in the age range of 45-60 think that caring about proper grammar is Very important?

## [1] 88
