All set? Great, let’s begin!
What do you want to do? | Language |
---|---|
Building a website | HTML, CSS |
Web application | JavaScript |
Phone Application | Swift, Objective-C, Java |
Numerical Computation | IDL, Python, MATLAB |
Stats | R, Stata |
Visualization | R, JavaScript (D3) |
Typesetting | LaTeX, Markdown |
Big Data | SQL, Scala, Python |
Discuss with team.
“Struggling through programming helps you learn” - Chester Ismay
R is a calculator
2 + 2
R is a programming language. Specifically it’s a programming language built for statistics.
And that’s what it’s best at.
install.packages("tidyverse")
Once installed, all the functions in a package are available to be used.
library(tidyverse)
filter(iris, Species == "setosa")
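To make the difference between an installed package and a loaded one concrete, here is a minimal sketch. It uses the tools package, which ships with R but is not attached by default; that choice is an assumption made only so the example runs without downloading anything.

```r
# Before library(): the function must be prefixed with its package name
ext_prefixed <- tools::file_ext("scores.csv")

# After library(): the function can be called by its bare name
library(tools)
ext_direct <- file_ext("scores.csv")

ext_prefixed
ext_direct
```

Both calls should return the same file extension; only the amount of typing differs.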
library() loads the package into memory and allows you to use the functions within without naming the package directly every time.
Data type | Example |
---|---|
Integer | 1L |
Logical | TRUE |
Numeric | 1.1 |
String / character | "Red" |
Factor (enumerated string) | "Amber" or 2 in c("Red","Amber","Green") |
Complex | 1i |
Date | 2015-04-24 |
NA | NA |
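One way to check the data types listed above is to ask R for the class of each example value. This is a minimal sketch; the variable name `types` is only for illustration.

```r
# class() reports the data type of each example value
types <- c(
  integer   = class(1L),                   # the L suffix makes an integer; a plain 1 is numeric
  logical   = class(TRUE),
  numeric   = class(1.1),
  character = class("Red"),
  factor    = class(factor("Amber", levels = c("Red", "Amber", "Green"))),
  complex   = class(1i),
  date      = class(as.Date("2015-04-24")),
  missing   = class(NA)                    # a plain NA is logical by default
)
types
```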
Function | Use |
---|---|
is.[data type] | Whether a vector is of a particular type |
as.[data type] | Attempts to coerce a vector to a data type |
str | Structure of an object including class/data type, dimensions |
class | The class(es)/data type(s) an object belongs to |
summary | Summarizes an object |
dput | Get R code that recreates an object |
unlist | Simplify a list to a vector |
dim | Dimensions of a data type |
length | Length of a vector or list |
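A short sketch of several of these helper functions in action. The objects `x`, `y`, `df` and `flat` are made up for this example.

```r
x <- c("1", "2", "3")          # a character vector
is.numeric(x)                  # FALSE: the values are strings
y <- as.numeric(x)             # coerce the strings to numbers
class(y)                       # "numeric"
length(y)                      # 3

df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
dim(df)                        # 3 rows, 2 columns
str(df)                        # compact overview of each column
summary(df)                    # per-column summary statistics
dput(y)                        # prints R code that recreates y

flat <- unlist(list(a = 1, b = 2:3))   # list -> named numeric vector
flat
```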
Data type | Info | Construction example(s) |
---|---|---|
Vector | A 1D set of values of the same data type | c(1,"a") , 1:3 |
List | A collection of objects of various data types | list(vector=c(1,"a") , df=data.frame(a=1:6)) |
Data.frame | A 2D set of values of different data types | data.frame(a=1:26, b=11:36) |
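The difference between these three structures is easiest to see by mixing types. A minimal sketch with made-up objects `v`, `l` and `df`; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
v <- c(1, "a")                   # mixing types in a vector forces coercion
class(v)                         # "character": the 1 became "1"

l <- list(num = 1, chr = "a")    # a list keeps each element's own type
list_types <- sapply(l, class)
list_types

df <- data.frame(id = 1:3, name = c("x", "y", "z"), stringsAsFactors = FALSE)
col_types <- sapply(df, class)   # each column has its own type
col_types
```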
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 20, by = 2)
## [1] 1 3 5 7 9 11 13 15 17 19
seq(from = 2, by = 2.5, length.out = 10)
## [1] 2.0 4.5 7.0 9.5 12.0 14.5 17.0 19.5 22.0 24.5
rep(2, 3)
## [1] 2 2 2
rep(1:3, 4)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
rep(c("A", "B", "C"), each = 6)
## [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C"
## [18] "C"
set.seed(3823)
x <- sample(1:1000, size = 50, replace = TRUE)
max(x)
## [1] 982
min(x)
## [1] 4
range(x)
## [1] 4 982
mean(x)
## [1] 511.7
median(x)
## [1] 511
sum(x)
## [1] 25585
sd(x)
## [1] 265.8911
var(x)
## [1] 70698.09
sum(ifelse(x > 400, 1, 0))
## [1] 31
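The previous line counts how many values exceed 400 by turning each comparison into a 1 or a 0. Because R treats TRUE as 1 and FALSE as 0 in arithmetic, the same count can be written more directly. A small sketch with a made-up vector `scores` (not the sampled `x` above):

```r
scores <- c(120, 450, 399, 401, 980)

count_ifelse <- sum(ifelse(scores > 400, 1, 0))  # explicit 1/0 conversion
count_direct <- sum(scores > 400)                # TRUE counts as 1
prop_above   <- mean(scores > 400)               # share of values above 400

count_ifelse
count_direct
prop_above
```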
y <- rnorm(x, 1, 0.2 * x) + x
plot(x,y)
my_model <- lm(y ~ x)
summary(my_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -278.58 -50.71 9.29 53.68 326.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.1253 35.2760 -1.024 0.311
## x 1.1210 0.0613 18.286 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114.1 on 48 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.8719
## F-statistic: 334.4 on 1 and 48 DF, p-value: < 2.2e-16
plot(x, y, main = "Linear model")
abline(my_model, col = "red")
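Once a model is fitted, its pieces can be pulled out rather than read off the printed summary. The sketch below uses made-up data (`hours`, `score`) that lie exactly on a line, so the fitted coefficients are easy to check by eye; it is not the noisy `x` and `y` from above.

```r
hours <- 1:10
score <- 2 * hours + 1                 # exact line: intercept 1, slope 2

fit <- lm(score ~ hours)
fit_coefs <- coef(fit)                 # named vector: (Intercept), hours
fit_coefs

# Predict the response for a new value of the predictor
new_pred <- predict(fit, newdata = data.frame(hours = 12))
new_pred
```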
<-
x <- 5
x <- c(1, 3, 4, 5)
df <- data.frame(x = 1:5, y = 2:6)
Shortcut: Alt + -
$
df$z <- df$x + df$y
Though if you need to create more than one new column, it’s much more concise to use mutate().
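As a rough illustration of creating several columns at once with mutate(), here is a minimal sketch. It assumes the dplyr package (part of the tidyverse) is installed; the column names `ratio` and `z_sq` are invented for the example.

```r
library(dplyr)

df <- data.frame(x = 1:5, y = 2:6)

df2 <- mutate(df,
  z     = x + y,   # same column as df$z <- df$x + df$y
  ratio = y / x,
  z_sq  = z^2      # later columns can use ones created earlier in the same call
)
df2
```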
From @hadleywickham
x <- 1:5
x[3]
## [1] 3
x[3] <- 42
x
## [1] 1 2 42 4 5
Subset Type | Example | Example Output |
---|---|---|
Nothing | x[] | 1, 2, 42, 4, 5 |
Positive integers | x[5] | 5 |
Negative integers | x[-1] | 2, 42, 4, 5 |
Logical vectors | x[c(TRUE, FALSE, TRUE, FALSE, TRUE)] | 1, 42, 5 |
Zero | x[0] | numeric(0) (an empty vector) |
$ + column name to create or access a column
df[row, column] to subset a data frame by row and column
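A minimal sketch of both styles of data frame subsetting, using the same small `df` as earlier; the names `cell`, `big` and `xy` are only for illustration.

```r
df <- data.frame(x = 1:5, y = 2:6)

df$z <- df$x + df$y          # $ creates a column
df$z                         # $ also reads it back

cell <- df[2, "z"]           # row 2, column "z"
big  <- df[df$x > 3, ]       # rows where x > 3, all columns
xy   <- df[, c("x", "y")]    # all rows, two columns

cell
big
xy
```

Leaving the row or column position empty means "keep everything" along that dimension.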
Note that a much more detailed overview of subsetting can be found in the Subsetting chapter of Advanced R.
R package authors are required to document their functions, although the documentation varies in usefulness.
Use ?function_name to get help on a function.
If I get an error I haven’t seen, the first thing I do is Google it. Usually within a few clicks I can find what went wrong.
But sometimes an easy answer can’t be found so here’s a quick process to walk through:
To get you started, here are a few of the more common errors you might see:
Think about what is going wrong for each of these.
my_object
## Error in eval(expr, envir, enclos): object 'my_object' not found
iris[, 6]
## Error in `[.data.frame`(iris, , 6): undefined columns selected
How many columns does the iris data frame have?
a < - 5
## Error in eval(expr, envir, enclos): object 'a' not found
sample[1:10,]
## Error in sample[1:10, ]: object of type 'closure' is not subsettable
What does typeof(sample) give you? What about sample(10)? Or ?sample?
pet_a_cat()
## Error in eval(expr, envir, enclos): could not find function "pet_a_cat"
nothing = NA
if (nothing == NA) {
print("empty")
}
## Error in if (nothing == NA) {: missing value where TRUE/FALSE needed
What does nothing == NA give you? How about is.na(nothing)?
my_data <- read.table("mydata.txt")
## Warning in file(file, "rt"): cannot open file 'mydata.txt': No such file or
## directory
## Error in file(file, "rt"): cannot open the connection
x <- data.frame(y = NULL)
x$y = 1:4
## Error in `$<-.data.frame`(`*tmp*`, "y", value = 1:4): replacement has 4 rows, data has 0
How many rows does x have?
mean(c(NA, 4, 2))
## [1] NA
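Several of the errors above come down to how R handles missing values. The sketch below is one possible way to work around them, not the only one: test for missingness with is.na() rather than ==, and tell summary functions to drop missing values with na.rm = TRUE.

```r
nothing <- NA

cmp <- nothing == NA         # NA, not TRUE: comparing with NA is itself unknown
cmp
is.na(nothing)               # TRUE

if (is.na(nothing)) {        # is.na() always returns TRUE or FALSE, so if() is happy
  print("empty")
}

avg <- mean(c(NA, 4, 2), na.rm = TRUE)   # drop the NA before averaging
avg
```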
# Set working directory
setwd("S:/Data Analytics/State Test Analysis/2015-2016/New York State Exams/Results Release/Tableau Data Files/")
# Load packages
library(readxl)
library(tidyverse)
# Define functions
# loaddata() reads one sheet of an Excel file and makes its column names valid R names
#   file      = file name
#   sheetname = sheet number or name
#   cn        = TRUE or FALSE (use the first row as column names?)
loaddata <- function(file, sheetname, cn) {
  x <- read_excel(file, sheet = sheetname, na = "", col_names = cn)
  valid_names <- make.names(names = names(x), unique = TRUE, allow_ = TRUE)
  names(x) <- valid_names
  return(x)
}
# Load Data
statetestNYS <- loaddata("Uncommon Student Level Results 15-16.xlsx",
                         "Student Scores", TRUE)
# Look at historical state test proficiency by school
# for ELA and Math in schools
plotdat <- statetestNYS %>%
  filter(Subject != "Science", Region == "NY", !(Grade == 8 & Subject == "Math")) %>%
  mutate(prof = ifelse(Standard.Achieved %in% c("Level 3", "Level 4"), 1, 0)) %>%
  group_by(Year, Uncommon.school.abbreviation, Subject, Grade) %>%
  summarise(avgP = mean(prof))
ggplot(plotdat, aes(Year, avgP, color = Subject)) +
  geom_boxplot() +
  facet_grid(Subject ~ Grade)
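The script above depends on a spreadsheet that is not available here, so the following is only a sketch of the same steps on a tiny, invented data frame. The `scores` data and its values are hypothetical, and dplyr is assumed to be installed. The first lines also show why the real column names contain dots: make.names() replaces spaces to produce valid R names.

```r
library(dplyr)

# make.names() turns headers with spaces into syntactically valid names
clean <- make.names(c("Standard Achieved", "Uncommon school abbreviation"),
                    unique = TRUE)
clean

# Hypothetical data, invented for illustration only
scores <- data.frame(
  Year = c(2015, 2015, 2016, 2016),
  Subject = c("Math", "Math", "ELA", "Science"),
  Standard.Achieved = c("Level 3", "Level 1", "Level 4", "Level 2"),
  stringsAsFactors = FALSE
)

# Same pattern as the script: filter, flag proficiency, group, average
prof_by_group <- scores %>%
  filter(Subject != "Science") %>%
  mutate(prof = ifelse(Standard.Achieved %in% c("Level 3", "Level 4"), 1, 0)) %>%
  group_by(Year, Subject) %>%
  summarise(avgP = mean(prof)) %>%
  ungroup()
prof_by_group
```

Averaging a 1/0 flag gives the proportion of rows that meet the condition, which is why `avgP` can be read as a proficiency rate.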
Using the comma_survey dataset in the fivethirtyeight package…
library(tidyverse)
library(fivethirtyeight)
df <- fivethirtyeight::comma_survey
## # A tibble: 1 x 3
## household_income education location
## <fctr> <fctr> <chr>
## 1 $0 - $24,999 Some college or Associate degree Pacific
## [1] "1129, 10"
## [1] 291
## [1] "49.9%"
## [1] 88