Sept. 7-11, 2015
General syntax:
if (condition) true_expression
if (condition) true_expression else false_expression
if (condition) { true_expression } else { false_expression }
if (condition) true_expression else if (condition) true_expression else false_expression
if (FALSE) "this will not be printed" else "this will be printed"
## [1] "this will be printed"

x <- 3
if (is(x, "numeric")) x/2 else print("x is not numeric")
## [1] 1.5

# The condition statement should be logical and of length one
if (1:3 < 2:4) "It works but will be discussed later"
## Warning in if (1:3 < 2:4) "It works but will be discussed later": the
## condition has length > 1 and only the first element will be used
## [1] "It works but will be discussed later"

# For compact assignment
y <- if (x == 2) x else x + 1
if (y < 0) {
  print("Negative number")
} else if (y > 0) {
  print("Positive number")
} else print("Zero")
## [1] "Positive number"
Syntax: ifelse(test, yes, no)
test must be a logical vector (or an object that can be coerced to logical).
The returned value is a vector of the same length as test, with element i taken from yes[i] if the corresponding value of test is TRUE, and from no[i] if the corresponding value of test is FALSE.
yes and no are recycled whenever necessary.

a <- c(5, 7, 2, 9)
ifelse(a %% 2 == 0, "even", "odd") # yes and no are recycled
## [1] "odd" "odd" "even" "odd"
x <- c(4:-3)
sqrt(x) #- gives warnings!
## Warning in sqrt(x): NaNs produced
## [1] 2.000000 1.732051 1.414214 1.000000 0.000000 NaN NaN NaN

recodedX <- ifelse(x >= 0, x, NA)
sqrt(recodedX) # no warning
## [1] 2.000000 1.732051 1.414214 1.000000 0.000000 NA NA NA
ifelse() is often used for elementwise changes to a vector, but most of the time it is better to build constructs based on logical indexing and assignment operators. The previous ifelse() call can be rewritten as follows:
x[x < 0] <- NA
sqrt(x)
## [1] 2.000000 1.732051 1.414214 1.000000 0.000000 NA NA NA
A for loop repeats a chunk of code for each element in a set of inputs. Syntax: for(var in seq) expr
s <- c("My", "first", "for", "loop") for (value in s) { print(value) } ## [1] "My" ## [1] "first" ## [1] "for" ## [1] "loop"
Can you rewrite this using integer indexing?

for (i in 1:length(vecLet)) {
…
i <- 3
s <- c("My", "first", "for", "loop")
for (i in seq_along(s)) print(s[i])
## [1] "My"
## [1] "first"
## [1] "for"
## [1] "loop"
Is the value of i different after the loop? What happens if you remove the call to print()?
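An aside of our own, not in the original notes: seq_along() is safer than 1:length() when a vector may be empty, because 1:length(x) then counts down from 1 to 0.

empty <- character(0)
1:length(empty)  # c(1, 0): a loop over this runs twice -- usually a bug
## [1] 1 0
seq_along(empty) # integer(0): a loop over this never runs
## integer(0)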
for loops can be nested: the multiplication table example (Source: The DataCamp Blog).

mymat <- matrix(nrow = 10, ncol = 10) # create a 10 x 10 matrix (10 rows and 10 columns)
for (i in 1:dim(mymat)[1]) {   # for each row
  for (j in 1:dim(mymat)[2]) { # for each column
    mymat[i, j] <- i * j       # assign values based on position: product of the two indexes
  }
}
head(mymat)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    3    4    5    6    7    8    9    10
## [2,]    2    4    6    8   10   12   14   16   18    20
## [3,]    3    6    9   12   15   18   21   24   27    30
## [4,]    4    8   12   16   20   24   28   32   36    40
## [5,]    5   10   15   20   25   30   35   40   45    50
## [6,]    6   12   18   24   30   36   42   48   54    60
BTW, using a specialized R function, this can be coded in one line:
mymat2 <- outer(1:10, 1:10, FUN = `*`)
identical(mymat, mymat2)
## [1] TRUE
A while loop repeats expr as long as cond evaluates to TRUE. Syntax: while(cond) expr
i <- 6
while (i < 10) {
  i <- i + 1
  print(i)
}
## [1] 7
## [1] 8
## [1] 9
## [1] 10
A repeat loop keeps evaluating expr until it is explicitly interrupted. Syntax: repeat expr
repeat {
  print(i)
  i <- i + 1
  if (i >= 13) break
}
## [1] 10
## [1] 11
## [1] 12
The only way to exit a repeat loop is a call to break (or something like return()). Of course, break can be used with for and while loops, too. next tells the interpreter to skip the remainder of the current iteration of the loop and proceed directly to the next one:

i <- 6
while (i < 10) {
  i <- i + 1
  if (identical(i, 8)) next
  print(i)
}
## [1] 7
## [1] 9
## [1] 10
A call to identical() is the recommended way to test exact equality in if and while statements, as well as with the && and || operators, because it returns a single logical value and exerts stricter control over type comparison.
if (1 != NULL) cat("Too Bad!")
## Error in if (1 != NULL) cat("Too Bad!"): argument is of length zero
if (!identical(1, NULL)) cat("Too Bad!")
## Too Bad!
1 == as.integer(1)
## [1] TRUE
identical(1, as.integer(1)) # stored as different types
## [1] FALSE
1L == as.integer(1)
## [1] TRUE
typeof(1)
## [1] "double"
typeof(1L)
## [1] "integer"
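A related aside of our own: identical() tests exact equality, so for numeric values subject to floating-point error, base R's all.equal() (which compares within a small tolerance) is often the better tool.

identical(0.1 + 0.2, 0.3)         # FALSE: the floating-point representations differ
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3)) # TRUE: equal within a tolerance
## [1] TRUE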
Can you think of a better way to write

for (i in 1:length(b)) { a[i] <- cos(b[i]) }

using a simple assignment?
b <- seq(1, by = pi, length.out = 4) # create b
b
## [1] 1.000000 4.141593 7.283185 10.424778
a <- numeric(0)
for (i in 1:length(b)) {
  a[i] <- cos(b[i])
}
a
## [1] 0.5403023 -0.5403023 0.5403023 -0.5403023
a <- cos(b) # this is the trick!
a
## [1] 0.5403023 -0.5403023 0.5403023 -0.5403023
Whenever you feel the need for a looping construct, first ask yourself whether there is an easier, more elegant, and more computationally efficient way to write it using R's distinctive vectorization paradigm.
Along the same lines, consider the following code:
set.seed(123)
y <- numeric(10) # create y first (indexed assignment needs an existing object)
for (i in 1:10) {
  x <- runif(1, max = 10)
  if (x > 5) y[i] <- factorial(as.integer(x)) else y[i] <- x
}
y
## [1] 2.875775e+00 5.040000e+03 4.089769e+00 4.032000e+04 3.628800e+05
## [6] 4.555650e-01 1.200000e+02 4.032000e+04 1.200000e+02 4.566147e+00
Can you rewrite it using the ifelse() function?
set.seed(123)
x <- runif(10, max = 10) # generate all necessary x values at once
z <- ifelse(x > 5, factorial(as.integer(x)), x) # ifelse() substitutes for the 'if' in the loop
identical(y, z) # check that this is indeed the same
## [1] TRUE
Source: R for Biologists - Prof. Daniel Wegmann
Write a function do.it() that takes two arguments p and q (scalar numeric vectors) and returns q/p if p > 0 and returns p/q otherwise.
do.it <- function(p, q) {
  if (p > 0) {
    return(q/p)
  } else {
    return(p/q)
  }
}
Apply it to the x and y elements of the matrix defined below, using a loop that populates a vector of results at each iteration.
d <- cbind(x = round(runif(n = 4, min = -5, max = 5)),
           y = round(runif(n = 4, min = -2, max = 2)))
t(d) # print the transposed matrix because of space constraints
##   [,1] [,2] [,3] [,4]
## x    5    0    2    1
## y   -2    2   -1   -2
r <- numeric(0)
for (i in 1:nrow(d)) {
  r[i] <- do.it(d[i, "x"], d[i, "y"])
}
r
## [1] -0.4 0.0 -0.5 -2.0
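As a teaser for the apply family covered next, the same loop collapses to a one-liner with mapply() (introduced below):

r2 <- mapply(do.it, d[, "x"], d[, "y"]) # one function call per row of d
identical(r, r2) # should be TRUE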
The apply family of functions

Many for loops can be replaced by calls to apply functions:

| Function | Usage |
|---|---|
| base::lapply | Apply a Function over a List or Vector |
| base::apply | Apply Functions Over Array Margins |
| base::mapply | Apply a Function to Multiple List or Vector Arguments |
| base::by | Apply a Function to a Data Frame Split by Factors |
| base::tapply | Apply a Function Over Groups of Values (Ragged Array) |
| base::eapply | Apply a Function Over Values in an Environment |
| base::rapply | Recursively Apply a Function to a List |
There are also other related functions such as aggregate(), Map(), sweep(), etc.
We have no time to cover them all; it would take days…
You can find info in the references and online. Here is a sampler of what can be found on the web:

http://blog.datacamp.com/r-tutorial-apply-family/
https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
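As a quick taste of one entry in the table, here is a minimal tapply() sketch (the built-in iris dataset is our choice of example, not from the original notes):

# mean Sepal.Length for each level of the Species factor
tapply(iris$Sepal.Length, iris$Species, mean)
##     setosa versicolor  virginica
##      5.006      5.936      6.588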
lapply()

lapply() takes a function, applies it to each element in a list (or vector), and returns the results in the form of a list.
Typical use cases (like loops):
lapply(xs, function(x) {})
lapply(seq_along(xs), function(i) {})
lapply(names(xs), function(nm) {})
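To make those three templates concrete, here is a tiny illustration of our own (the list xs below is a made-up example):

xs <- list(a = 1:3, b = 4:6)
lapply(xs, function(x) sum(x))                           # loop over the elements
lapply(seq_along(xs), function(i) i * sum(xs[[i]]))      # loop over the numeric indices
lapply(names(xs), function(nm) paste(nm, sum(xs[[nm]]))) # loop over the names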
To illustrate a use of lapply(), let's create some random data and look at the number of values in each random set.
lst <- replicate(5, rnorm(sample(1:50, 1)), simplify = FALSE)
With a for loop:
# preallocate an empty list of the proper length to save computing time
out <- vector("list", length(lst))
# here is the loop:
for (i in seq_along(lst)) {
  out[[i]] <- length(lst[[i]]) # assign value to an element of out
}
unlist(out) # simplify to a vector rather than a list
## [1] 17 7 40 37 3
With lapply():
unlist(lapply(X = lst, FUN = length))
## [1] 17 7 40 37 3
The code is far more concise, isn't it?
What if, rather than determining the number of values in each random vector, we now wanted to compute the quantiles? How would you do that? Does it still make sense to unlist the result?
lapply(X = lst, FUN = quantile)
However, we would like to override the default probs argument of quantile() to get only the min, median, and max values. How can we do that?
lapply(X = lst, FUN = quantile, probs = seq(0, 1, 0.50))
# or alternatively:
lapply(X = lst, FUN = function(vect) quantile(vect, probs = seq(0, 1, 0.50)))
Additional arguments to FUN can be passed via .... The function FUN can be defined elsewhere, or directly in the call to apply via an anonymous function that takes additional arguments.
We would also like to keep track of the number of values on which the quantiles are calculated. Can you append the value of length(vect) to the vector of quantiles?
lapply(X = lst, FUN = function(vect) {
  quants <- quantile(vect, probs = seq(0, 1, 0.50))
  count <- c(n = length(vect))
  return(c(count, quants))
})
Why does count have a fractional part?
sapply() and vapply() are very similar to lapply(), except they simplify their output to produce an atomic vector. What happens when you use sapply()? Do you get this:
##             [,1]       [,2]       [,3]        [,4]       [,5]
## n    17.00000000  7.0000000 40.0000000 37.00000000  3.0000000
## 0%   -1.08569914 -1.1381369 -3.2273228 -1.66794194 -0.2657389
## 50%  -0.05568601  0.4264642 -0.1519033 -0.04502772  0.1516114
## 100%  2.52833655  1.2538149  2.1499193  2.18733299  1.3766098
What is it? What would you do to make it easier to read?
Can you also calculate the standard deviation of the values and append it to the rest of the statistics?
mat <- sapply(X = lst, FUN = function(vect) {
  quants <- quantile(vect, probs = seq(0, 1, 0.50))
  count <- c(n = length(vect))
  sd <- c(sd = sd(vect))
  return(c(count, sd, quants))
})
t(mat)
##       n        sd         0%         50%     100%
## [1,] 17 0.9982778 -1.0856991 -0.05568601 2.528337
## [2,]  7 0.8223179 -1.1381369  0.42646422 1.253815
## [3,] 40 1.0751375 -3.2273228 -0.15190330 2.149919
## [4,] 37 0.8338020 -1.6679419 -0.04502772 2.187333
## [5,]  3 0.8536307 -0.2657389  0.15161137 1.376610
Does this make sense considering the way this set of random vectors was created?
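vapply(), mentioned above but not demonstrated, is essentially a safer sapply(): you must supply a template (FUN.VALUE) that every result has to match. A minimal sketch of our own, reusing lst:

# each call to length() must return a single integer, as declared
vapply(lst, length, FUN.VALUE = integer(1))
# a result that does not match the template raises an error, e.g.:
# vapply(lst, quantile, FUN.VALUE = numeric(1)) # error: values must be length 1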
apply() applies a function to array margins

An example:
a <- matrix(1:20, nrow = 5)
a
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
apply(X = a, MARGIN = 1, FUN = mean) # by rows
## [1] 8.5 9.5 10.5 11.5 12.5
apply(a, 2, mean) # by columns
## [1] 3 8 13 18
When MARGIN = c(1,2), the function is applied to every entry of the array X. So, for example, one may want to calculate modulo ten of the values in a matrix. How would you do it?
m <- matrix(c(seq(from = -23, to = 25, by = 2)), nrow = 5, ncol = 5)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  -23  -13   -3    7   17
## [2,]  -21  -11   -1    9   19
## [3,]  -19   -9    1   11   21
## [4,]  -17   -7    3   13   23
## [5,]  -15   -5    5   15   25
apply(m, MARGIN = c(1,2), function(x) x %% 10) # One option
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    7    7    7    7    7
## [2,]    9    9    9    9    9
## [3,]    1    1    1    1    1
## [4,]    3    3    3    3    3
## [5,]    5    5    5    5    5
m %% 10 # is another one
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    7    7    7    7    7
## [2,]    9    9    9    9    9
## [3,]    1    1    1    1    1
## [4,]    3    3    3    3    3
## [5,]    5    5    5    5    5
But the vectorized version is much faster:
system.time(for (i in 1:10000) apply(m, MARGIN = c(1,2), function(x) x %% 10))
##    user  system elapsed
##   2.170   0.008   2.181
system.time(for (i in 1:10000) m %% 10)
##    user  system elapsed
##   0.022   0.004   0.026
Conclusion: think about using vectorization before using apply-like functions. Other notable examples include rowMeans(), colSums(), and so on.
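For instance, with the matrix a from the apply() example above, these specialised vectorized functions agree with their apply() counterparts:

identical(rowMeans(a), apply(a, 1, mean)) # TRUE
identical(colSums(a), apply(a, 2, sum))   # TRUE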
But in the end, the most important thing is to get things done, even if it is a bit slow…
mapply() and Map()

With lapply(), only one argument to the function varies; the others are fixed. This makes it poorly suited for some problems.
Map() is useful whenever you have two (or more) lists (or data frames) that you need to process in parallel.
# First compute the means of the columns
mtmeans <- lapply(mtcars, mean)
# Then divide each element of a column by the corresponding column mean
mtWeightedMeans <- mapply(`/`, mtcars, mtmeans, SIMPLIFY = TRUE)
BTW, R has the scale() function to do this…
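Map() itself is essentially mapply() with SIMPLIFY = FALSE: the same computation, but the result stays a list. A short sketch of our own, reusing the objects above:

mtWeightedMeansList <- Map(`/`, mtcars, mtmeans)
class(mtWeightedMeansList) # one list element per column of mtcars
## [1] "list"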
Two nice but sometimes confusing things about some apply-like functions are that they may simplify their output and they may reuse the names of the first argument. What happens if you set the SIMPLIFY and USE.NAMES arguments to FALSE?
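One way to find out, reusing mtcars and mtmeans from above (our own sketch):

res <- mapply(`/`, mtcars, mtmeans, SIMPLIFY = FALSE, USE.NAMES = FALSE)
class(res) # a plain list instead of a matrix
names(res) # NULL: the names of the first argument are no longer reused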
To finish, here is an example from the mapply() documentation that illustrates how arguments are passed to FUN.
mapply(rep, 1:3, 3:1)
## [[1]]
## [1] 1 1 1
##
## [[2]]
## [1] 2 2
##
## [[3]]
## [1] 3
mapply(rep, times = 1:3, x = 3:1)
## [[1]]
## [1] 3
##
## [[2]]
## [1] 2 2
##
## [[3]]
## [1] 1 1 1
Do you understand why they produce different outputs? Look at the help of rep() to refresh your memory about its arguments.
Write a function that calculates the cumulative sum of the elements of a numeric vector. For example, 1, 2, 3 produces 1, 3, 6. First use a for loop:
MyCumsumWithFor <- function(v) {
  cumSumRes <- rep(v[1], length(v)) # first create the result vector
  for (i in 2:length(v)) {
    cumSumRes[i] <- cumSumRes[i-1] + v[i]
  }
  cumSumRes
}
v <- as.numeric(1:5)
MyCumsumWithFor(v)
## [1] 1 3 6 10 15
What happens if you omit the coercion to numeric when assigning to v? Improve your function to guard against that.
MyCumsumWithFor <- function(v) {
  if (!is.numeric(v)) v <- as.numeric(v)
  cumSumRes <- rep(v[1], length(v)) # first create the result vector
  for (i in 2:length(v)) {
    cumSumRes[i] <- cumSumRes[i-1] + v[i]
  }
  cumSumRes
}
v <- 1:5
MyCumsumWithFor(v)
## [1] 1 3 6 10 15
Can you think of a way to implement the same computation using sapply()?
MyCumsumWithSapply <- function(v) {
  if (!is.numeric(v)) v <- as.numeric(v)
  cumSumRes <- rep(v[1], length(v))
  sapply(2:length(v), function(i) cumSumRes[i] <<- cumSumRes[i-1] + v[i])
  cumSumRes
}
identical(MyCumsumWithSapply(v), MyCumsumWithFor(v))
## [1] TRUE
A bit awkward: it requires the "superassignment" operator <<-.
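A less awkward functional alternative, not covered in these notes, is base R's Reduce() with accumulate = TRUE:

# cumulative sum without loops or superassignment
MyCumsumWithReduce <- function(v) Reduce(`+`, as.numeric(v), accumulate = TRUE)
MyCumsumWithReduce(1:5)
## [1] 1 3 6 10 15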
Base R comes with a built-in cumsum(). Make sure your function(s) and cumsum() produce the same output. Which one is the fastest?
identical(cumsum(v), MyCumsumWithFor(v))
## [1] TRUE
v <- 1:1e5
system.time(MyCumsumWithFor(v))
## Warning in cumSumRes[i - 1] + v[i]: NAs produced by integer overflow
##    user  system elapsed
##     0.4     0.0     0.4
system.time(MyCumsumWithSapply(v))
## Warning in cumSumRes[i - 1] + v[i]: NAs produced by integer overflow
##    user  system elapsed
##   0.610   0.004   0.614
system.time(cumsum(v))
## Warning in system.time(cumsum(v)): integer overflow in 'cumsum'; use
## 'cumsum(as.numeric(.))'
##    user  system elapsed
##   0.001   0.000   0.001
Conclusion: apply-like functions are not always faster than for loops.

For a bootstrapping approach, you need to extract 10 random samples of rows of the quakes dataset. Each random sample will have the same number of rows as the original data frame. How would you do that?
boots <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(quakes), replace = TRUE)
  quakes[rows, ]
})
Can you generalise your code into a function that takes any input data frame and lets you specify the number of random samples in the output?
BootstrapIt <- function(df, n = 10) {
  boots <- lapply(1:n, function(i) {
    rows <- sample(1:nrow(df), replace = TRUE)
    df[rows, ]
  })
  return(boots)
}
mySample <- BootstrapIt(airquality, n = 2)
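A quick sanity check of our own: each bootstrap sample should keep the number of rows of the input data frame.

sapply(mySample, nrow) # airquality has 153 rows
## [1] 153 153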
With the diamonds data frame, for each level of the cut factor, calculate the mean of each numerical column. The output should be a rectangular object (matrix or data frame). Note that it can be done with the help of by() or aggregate().
This is one way to do it:
library(ggplot2)
# select only the columns of numeric type
diamondsNum <- diamonds[, sapply(diamonds, is.numeric)]
meansByCutBy <- by(diamondsNum, diamonds$cut, function(x) sapply(x, mean))
# are you familiar with this:
meansByCutBy <- do.call(rbind, meansByCutBy)
meansByCutBy
##               carat    depth    table    price        x        y        z
## Fair      1.0461366 64.04168 59.05379 4358.758 6.246894 6.182652 3.982770
## Good      0.8491847 62.36588 58.69464 3928.864 5.838785 5.850744 3.639507
## Very Good 0.8063814 61.81828 57.95615 3981.760 5.740696 5.770026 3.559801
## Premium   0.8919549 61.26467 58.74610 4584.258 5.973887 5.944879 3.647124
## Ideal     0.7028370 61.70940 55.95167 3457.542 5.507451 5.520080 3.401448
Here is another one:
meansByCutAg <- aggregate(diamondsNum, list(diamonds$cut), mean)
meansByCutAg
##     Group.1     carat    depth    table    price        x        y        z
## 1      Fair 1.0461366 64.04168 59.05379 4358.758 6.246894 6.182652 3.982770
## 2      Good 0.8491847 62.36588 58.69464 3928.864 5.838785 5.850744 3.639507
## 3 Very Good 0.8063814 61.81828 57.95615 3981.760 5.740696 5.770026 3.559801
## 4   Premium 0.8919549 61.26467 58.74610 4584.258 5.973887 5.944879 3.647124
## 5     Ideal 0.7028370 61.70940 55.95167 3457.542 5.507451 5.520080 3.401448
Bottom line: it is important that you get familiar with the functions of the apply family, if only to understand the code of others, and because apply flavours often percolate into packages that deal with specific object types. Also, they ship with base R and do not require any additional packages.
However, this set of functions is overall relatively disparate, and it is sometimes tricky to decide which function to use (if one is available); the choice depends, among other things, on the structure of your input and the kind of output you want.
The plyr package, which is related to dplyr, offers a more consistent and extensive infrastructure based on the Split-Apply-Combine strategy. Although plyr can be slow at times, it offers a very easy way to parallelise the execution of your code, which may help speed things up.
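As a taste of plyr (a sketch, assuming the package is installed): the by-cut means computed earlier can be written as a single split-apply-combine call, where plyr's numcolwise(mean) turns mean into a function over all numeric columns.

library(plyr)
# split diamonds by cut, average every numeric column, recombine
ddply(diamonds, "cut", numcolwise(mean))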