Sept. 7-11, 2015

Loops and conditional execution.

Conditional constructs

General syntax:

if (condition) true_expression

if (condition) true_expression else false_expression

if (condition) {
    true_expression
} else {
    false_expression
}

if (condition) true_expression else if (condition) true_expression else false_expression
  • Here again, braces can be omitted if the code block consists of a single statement.
  • For readability, do not be stingy with braces when things get complicated: they act as visible "fence posts".

  • A few examples
if (FALSE) "this will not be printed" else "this will be printed"
## [1] "this will be printed"

x <- 3
if (is(x, "numeric")) x/2 else print("x is not numeric")
## [1] 1.5

# The condition should be a logical vector of length one
if (1:3 < 2:4) "It works but will be discussed later"
## Warning in if (1:3 < 2:4) "It works but will be discussed later": the
## condition has length > 1 and only the first element will be used
## [1] "It works but will be discussed later"

# For compact assignment
y <- if(x == 2) x else x + 1
  • Nested statements
if (y < 0) {
   print("Negative number")
} else if (y > 0) {
   print("Positive number")
} else
   print("Zero")
## [1] "Positive number"

Vectorized conditional constructs

Syntax: ifelse(test, yes, no)

  • test must be a logical vector (or an object that can be coerced to logical).
  • The return value is a vector with the same length as test.
  • This returned vector has element i from:
    • yes[i] if the corresponding value of test is TRUE
    • no[i] if the corresponding value of test is FALSE.
  • BEWARE, the vectors yes and no are recycled whenever necessary.
a = c(5,7,2,9)
ifelse(a %% 2 == 0, "even", "odd") # yes and no are recycled
## [1] "odd"  "odd"  "even" "odd"

x <- c(4:-3)
sqrt(x)  #- gives warnings!
## Warning in sqrt(x): NaNs produced
## [1] 2.000000 1.732051 1.414214 1.000000 0.000000      NaN      NaN      NaN
recodedX <- ifelse(x >= 0, x, NA)
sqrt(recodedX)  # no warning
## [1] 2.000000 1.732051 1.414214 1.000000 0.000000       NA       NA       NA

ifelse() is often used for elementwise changes to a vector, but most of the time it is better to build constructs based on logical indexing and assignment operators. The previous ifelse() call can be rewritten as:

x[x < 0] <- NA
sqrt(x)
## [1] 2.000000 1.732051 1.414214 1.000000 0.000000       NA       NA       NA

Conventional iteration constructs

  • A for loop repeats a chunk of code for each element in a set of inputs.

Syntax: for(var in seq) expr

s <- c("My", "first", "for", "loop")
for (value in s) {
  print(value)
}
## [1] "My"
## [1] "first"
## [1] "for"
## [1] "loop"

Can you rewrite this using integer indexing: for (i in 1:length(s)) {…

i <- 3
s <- c("My", "first", "for", "loop")
for (i in seq_along(s)) print(s[i])
## [1] "My"
## [1] "first"
## [1] "for"
## [1] "loop"

Is the value different? What happens if you remove the call to print()?
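A quick sketch to check both points: expressions inside a loop are not auto-printed, and the loop variable keeps its last value after the loop ends.

for (i in seq_along(s)) s[i] # prints nothing: auto-printing is off inside loops
i # the loop overwrote i; it now holds the last index
## [1] 4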

mymat = matrix(nrow = 10, ncol = 10) # create a 10 x 10 matrix (of 10 rows and 10 columns)

for(i in 1:dim(mymat)[1]) {      # for each row
  for(j in 1:dim(mymat)[2]) {    # for each column
    mymat[i,j] = i*j             # assign values based on position: product of two indexes
  }
}
head(mymat)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    3    4    5    6    7    8    9    10
## [2,]    2    4    6    8   10   12   14   16   18    20
## [3,]    3    6    9   12   15   18   21   24   27    30
## [4,]    4    8   12   16   20   24   28   32   36    40
## [5,]    5   10   15   20   25   30   35   40   45    50
## [6,]    6   12   18   24   30   36   42   48   54    60

BTW, using a specialized R function, this can be coded in one line:

mymat2 <- outer(1:10, 1:10, FUN = `*`)
identical(mymat, mymat2)
## [1] TRUE

Conventional iteration constructs

  • Syntax: while(cond) expr
i <- 6
while (i < 10) { i = i + 1; print(i) ; }
## [1] 7
## [1] 8
## [1] 9
## [1] 10
  • Syntax: repeat expr
repeat {
  print(i) ; i = i + 1
  if (i >= 13) break
}
## [1] 10
## [1] 11
## [1] 12
  • Note that repeat has no Boolean exit condition. It will brutally iterate forever unless you use the control statement break (or something like return()). Of course, break can be used with for and while loops, too.
    BTW, how do you force your code to stop running in RStudio? In the console?

  • next tells the interpreter to skip the remainder of the current iteration of the loop and proceed directly to the next one:
i <- 6
while (i < 10) {
  i = i + 1
  if (identical(i, 8)) next
  print(i)
}
## [1] 7
## [1] 9
## [1] 10

A subtle note on deciding about equality of objects

A call to identical() is the recommended way to test exact equality in if and while statements, as well as with the && and || operators, because it always returns a single logical value and applies stricter control to type comparison.

if(1 != NULL) cat("Too Bad!")
## Error in if (1 != NULL) cat("Too Bad!"): argument is of length zero
if(!identical(1, NULL)) cat("Too Bad!")
## Too Bad!

1 == as.integer(1)
## [1] TRUE
identical(1, as.integer(1)) # stored as different types
## [1] FALSE
1L == as.integer(1) 
## [1] TRUE
typeof(1)
## [1] "double"
typeof(1L)
## [1] "integer"

Exercise:

Can you think of a better way to write:
for (i in 1:length(b)) {a[i] <- cos(b[i])}
using a simple assignment?

b <- seq(1, by = pi, length.out = 4) # create b
b
## [1]  1.000000  4.141593  7.283185 10.424778

a <- numeric(0)
for (i in 1:length(b)) { a[i] = cos(b[i]) }
a
## [1]  0.5403023 -0.5403023  0.5403023 -0.5403023

a <- cos(b) # this is the trick!
a
## [1]  0.5403023 -0.5403023  0.5403023 -0.5403023

Whenever you feel the need for a looping construct, first ask yourself whether there is an easier, more elegant, and more computationally efficient way to write it using R's distinctive vectorization paradigm.

Exercise:

Along the same lines, take the following code:

set.seed(123)
y <- numeric(10) # y must exist before y[i] can be assigned to
for (i in 1:10) {
  x <- runif(1, max = 10)
  if (x > 5) y[i] <- factorial(as.integer(x))
  else y[i] <- x
}
y
##  [1] 2.875775e+00 5.040000e+03 4.089769e+00 4.032000e+04 3.628800e+05
##  [6] 4.555650e-01 1.200000e+02 4.032000e+04 1.200000e+02 4.566147e+00

Can you rewrite it using the ifelse() function?

set.seed(123)
x <- runif(10, max = 10) # generating all necessary x values
z <- ifelse(x > 5, factorial(as.integer(x)), x) # ifelse() replaces the if/else inside the loop
identical(y, z) # check that this is indeed the same
## [1] TRUE

Exercise:

Source: R for Biologists - Prof. Daniel Wegmann

Write a function do.it() that takes two arguments p and q (numeric scalars) and returns q/p if p > 0, and p/q otherwise.

do.it <- function(p, q){
  if(p > 0){
    return(q/p);
  } else {
    return(p/q);
  }
}

Apply it to the x and y elements of the matrix defined below, using a loop that populates a vector of results at each iteration.

d <- cbind(x = round(runif(n = 4, min = -5, max = 5)),
            y = round(runif(n = 4, min = -2, max = 2)))
t(d) # print the transposed matrix because of space constraints
##   [,1] [,2] [,3] [,4]
## x    5    0    2    1
## y   -2    2   -1   -2

r <- numeric(0)
for(i in 1:nrow(d)){
  r[i] <- do.it(d[i, "x"], d[i, "y"])
}
r
## [1] -0.4  0.0 -0.5 -2.0
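Following the vectorization advice above, the same result can be obtained without an explicit loop (a sketch; note that ifelse() evaluates both alternatives for all elements):

r2 <- ifelse(d[, "x"] > 0, d[, "y"] / d[, "x"], d[, "x"] / d[, "y"])
identical(r, r2)
## [1] TRUE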

The apply family of functions

  • for loops can be:
    • complicated to code and understand, especially when they operate on complex objects or do elaborate processing (see the nested loop example).
    • supposedly slow to run in R due to intrinsic language mechanisms.
  • The very rich, powerful, but also confusing family of apply functions:
    • manipulates and computes on slices of data from typical R data structures (vectors, matrices, lists, data frames).
    • is easier to read and code after a bit of practice.
    • is made of intrinsically vectorized functions which are faster at iterating.
    • but the computation in each iteration has to be independent of previous iterations.

The apply family of functions

Function         Usage
base::lapply     Apply a Function over a List or Vector
base::sapply     Apply a Function over a List or Vector (simplifies the output)
base::vapply     Apply a Function over a List or Vector (with a pre-specified return type)
base::apply      Apply Functions Over Array Margins
base::mapply     Apply a Function to Multiple List or Vector Arguments
base::by         Apply a Function to a Data Frame Split by Factors
base::tapply     Apply a Function Over Groups of Values (Ragged Array)
base::eapply     Apply a Function Over Values in an Environment
base::rapply     Recursively Apply a Function to a List

Let's start with lapply()

lapply() takes a function, applies it to each element in a list (or vector), and returns the results in the form of a list.

Typical use cases (like loops):

  • Apply a function on individual elements: lapply(xs, function(x) {})
  • Iterate over a numeric index: lapply(seq_along(xs), function(i) {})
  • Iterate over names as indexes: lapply(names(xs), function(nm) {})
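For instance, the names pattern gives the function access to both the name and the corresponding element (a tiny sketch with a made-up list xs):

xs <- list(a = 1:3, b = 4:9)
unlist(lapply(names(xs), function(nm) paste0(nm, " has ", length(xs[[nm]]), " values")))
## [1] "a has 3 values" "b has 6 values"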

To illustrate a use of lapply(), let's create some random data and look at the number of values in each random set.

lst <- replicate(5, rnorm(sample(1:50, 1)), simplify = FALSE)

With a for loop:

# preallocate an empty list of the proper length to save computing time
out <- vector("list", length(lst))
# here is the loop:
for (i in seq_along(lst)) {
  out[[i]] <- length(lst[[i]]) # assign value to an element of out
}
unlist(out) # simplify to a vector rather than a list
## [1] 17  7 40 37  3

With lapply()

unlist(lapply(X = lst, FUN = length))
## [1] 17  7 40 37  3

The code is far more concise, isn't it?

What if now, rather than determining the number of values in each random vector, we would like to compute quantiles? How would you do that? Does it still make sense to unlist the result?

lapply(X = lst, FUN = quantile)

We would like, however, to override the default probs argument of quantile() to get only the min, median, and max values. How can we do that?

lapply(X = lst, FUN = quantile, probs = seq(0, 1, 0.50))
# or alternatively:
lapply(X = lst, FUN = function(vect) quantile(vect, probs = seq(0, 1, 0.50)))
  • Additional arguments to FUN can be passed via ...

  • The function FUN can be defined elsewhere or directly in the call to lapply() via an anonymous function that takes additional arguments.

We would also like to keep track of the number of values on which the quantiles are calculated. Can you append the value of length(vect) to the vector of quantiles?

lapply(X = lst, FUN = function(vect) {
  quants <- quantile(vect, probs = seq(0, 1, 0.50))
  count <- c(n = length(vect))
  return(c(count, quants))
  }
)

Why does count have a fractional part?

sapply() and vapply() are very similar to lapply() except that they try to simplify their output, producing an atomic vector or matrix when possible. What happens when you use sapply()? Do you get this:

##             [,1]       [,2]       [,3]        [,4]       [,5]
## n    17.00000000  7.0000000 40.0000000 37.00000000  3.0000000
## 0%   -1.08569914 -1.1381369 -3.2273228 -1.66794194 -0.2657389
## 50%  -0.05568601  0.4264642 -0.1519033 -0.04502772  0.1516114
## 100%  2.52833655  1.2538149  2.1499193  2.18733299  1.3766098

What is it? What would you do to make it easier to read?

Can you also calculate the standard deviation of the values and append it to the rest of the statistics?

mat <- sapply(X = lst, FUN = function(vect) {
  quants <- quantile(vect, probs = seq(0, 1, 0.50))
  count <- c(n = length(vect))
  sd <- c(sd = sd(vect))
  return(c(count, sd, quants))
  }
)
t(mat)
##       n        sd         0%         50%     100%
## [1,] 17 0.9982778 -1.0856991 -0.05568601 2.528337
## [2,]  7 0.8223179 -1.1381369  0.42646422 1.253815
## [3,] 40 1.0751375 -3.2273228 -0.15190330 2.149919
## [4,] 37 0.8338020 -1.6679419 -0.04502772 2.187333
## [5,]  3 0.8536307 -0.2657389  0.15161137 1.376610

Does this make sense considering the way this set of random vectors was created?

apply() applies a function to array margins

Source: Altuna Akalin

An example:

a <- matrix(1:20, nrow = 5)
a
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
apply(X = a, MARGIN = 1, FUN = mean) # by rows
## [1]  8.5  9.5 10.5 11.5 12.5
apply(a, 2, mean) # by columns
## [1]  3  8 13 18
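For row and column means in particular, base R ships dedicated vectorized functions (see also the conclusion below):

colMeans(a) # same result as apply(a, 2, mean), but vectorized
## [1]  3  8 13 18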

When MARGIN = c(1, 2), the function is applied to every entry of the array X.

So, for example, one may want to compute the values of a matrix modulo ten. How would you do it?

m <- matrix(c(seq(from = -23, to = 25, by = 2)), nrow = 5, ncol = 5)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  -23  -13   -3    7   17
## [2,]  -21  -11   -1    9   19
## [3,]  -19   -9    1   11   21
## [4,]  -17   -7    3   13   23
## [5,]  -15   -5    5   15   25
apply(m, MARGIN = c(1,2), function(x) x%%10) # One option
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    7    7    7    7    7
## [2,]    9    9    9    9    9
## [3,]    1    1    1    1    1
## [4,]    3    3    3    3    3
## [5,]    5    5    5    5    5
m%%10 # is another one
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    7    7    7    7    7
## [2,]    9    9    9    9    9
## [3,]    1    1    1    1    1
## [4,]    3    3    3    3    3
## [5,]    5    5    5    5    5

But the vectorized version is much faster:

system.time(for (i in 1:10000) apply(m, MARGIN = c(1,2), function(x) x%%10))
##    user  system elapsed 
##   2.170   0.008   2.181
system.time(for (i in 1:10000) m%%10)
##    user  system elapsed 
##   0.022   0.004   0.026

Conclusion: think about vectorization before reaching for apply-like functions. Other notable vectorized functions include rowMeans(), colSums(), and so on.

But in the end, the most important thing is to get things done, even if it is a bit slow…

Multiple inputs: mapply() and Map()

With lapply(), only one argument to the function varies; the others are fixed. This makes it poorly suited for some problems.

Map() is useful whenever you have two (or more) lists (or data frames) that you need to process in parallel.

Source: Altuna Akalin
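A minimal sketch of Map(), which is simply mapply() with SIMPLIFY = FALSE hard-wired, so it always returns a list:

Map(`+`, 1:3, 4:6)
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] 9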

  • As an example from Advanced R: another way of standardising columns is to first compute the column means and then divide each column by its mean. How would you do that?
# First compute the means of the columns
mtmeans <- lapply(mtcars, mean)

# Then divide each element of a column by the corresponding column mean
mtWeightedMeans <- mapply(`/`, mtcars, mtmeans, SIMPLIFY = TRUE)

BTW, R has the scale() function to do this…
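For the record, a sketch of that route (an addition to the slides): with center = FALSE, scale() divides each column by the corresponding element of its scale argument.

mtScaled <- scale(mtcars, center = FALSE, scale = unlist(mtmeans))
all.equal(as.vector(mtScaled), as.vector(mtWeightedMeans))
## [1] TRUE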

  • Two nice but sometimes confusing things about some apply-like functions are that:

    • they simplify their output.
    • when inputs have names, they try hard to propagate them properly to their output.

What happens if you set the SIMPLIFY and USE.NAMES arguments to FALSE?

To finish, here is an example from the mapply() documentation that illustrates how arguments are passed to FUN.

mapply(rep, 1:3, 3:1)
## [[1]]
## [1] 1 1 1
## 
## [[2]]
## [1] 2 2
## 
## [[3]]
## [1] 3
mapply(rep, times = 1:3, x = 3:1)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 2 2
## 
## [[3]]
## [1] 1 1 1

Do you understand why they produce different outputs? Look at the help page of rep() to refresh your memory about its arguments.

Exercise

Write a function that calculates the cumulative sum of the elements of a numeric vector. For example, 1, 2, 3 produces 1, 3, 6. First use a for loop:

MyCumsumWithFor <- function(v) {
  cumSumRes <- rep(v[1], length(v)) # first create a result vector of the right length
  for (i in 2:length(v)) {cumSumRes[i] <- cumSumRes[i-1] + v[i]}
  cumSumRes
}
v <- as.numeric(1:5)
MyCumsumWithFor(v)
## [1]  1  3  6 10 15

What happens if you omit the coercion to numeric when assigning to v? Improve your function to guard against that.

MyCumsumWithFor <- function(v) {
  if(!is.numeric(v)) v <- as.numeric(v)
  cumSumRes <- rep(v[1], length(v)) # first create a result vector of the right length
  for (i in 2:length(v)) {cumSumRes[i] <- cumSumRes[i-1] + v[i]}
  cumSumRes
}
v <- 1:5
MyCumsumWithFor(v)
## [1]  1  3  6 10 15

Can you think of a way to implement the same computation using sapply()?

MyCumsumWithSapply <- function(v) {
  if(!is.numeric(v)) v <- as.numeric(v)
  cumSumRes <- rep(v[1], length(v))
  sapply(2:length(v), function(i) cumSumRes[i] <<- cumSumRes[i-1] + v[i])
  cumSumRes
}

identical(MyCumsumWithSapply(v), MyCumsumWithFor(v))
## [1] TRUE

A bit awkward: it requires the "super assignment" operator <<-.
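A cleaner functional alternative (an aside, not part of the exercise) is Reduce() with accumulate = TRUE, which carries the running sum forward without any explicit state:

MyCumsumWithReduce <- function(v) Reduce(`+`, as.numeric(v), accumulate = TRUE)
identical(MyCumsumWithReduce(v), MyCumsumWithFor(v))
## [1] TRUE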

Base R comes with a built-in cumsum(). Make sure your function(s) and cumsum() produce the same output. Which one is the fastest?

identical(cumsum(v), MyCumsumWithFor(v))
## [1] TRUE

v <- 1:1e5

system.time(MyCumsumWithFor(v))
## Warning in cumSumRes[i - 1] + v[i]: NAs produced by integer overflow
##    user  system elapsed 
##     0.4     0.0     0.4
system.time(MyCumsumWithSapply(v))
## Warning in cumSumRes[i - 1] + v[i]: NAs produced by integer overflow
##    user  system elapsed 
##   0.610   0.004   0.614
system.time(cumsum(v))
## Warning in system.time(cumsum(v)): integer overflow in 'cumsum'; use
## 'cumsum(as.numeric(.))'
##    user  system elapsed 
##   0.001   0.000   0.001

Conclusion:

  • apply-like functions are not always faster than for loops.
  • Whenever one exists, use a built-in R function, or you will waste both your time and computing time.

Exercise

For a bootstrapping approach, you need to extract 10 random samples of rows from the quakes dataset. Each random sample must have the same number of rows as the original data frame. How would you do that?

boots <- lapply(1:10, function(i) {
  rows <- sample(1:nrow(quakes), replace = TRUE)
  quakes[rows, ]
})

Can you generalise your code into a function that takes any data frame as input and lets you specify the number of random samples to return?

BootstrapIt <- function(df, n = 10) {
  boots <- lapply(1:n, function(i) {
    rows <- sample(1:nrow(df), replace = TRUE)
    df[rows, ]
  })
  return(boots)
}

mySample <- BootstrapIt(airquality, n = 2)

Exercise

With the diamonds data frame, for each level of the cut factor, calculate the mean of each numeric column. The output should be a rectangular object (matrix or data frame). Note that it can be done with the help of by() or aggregate().

This is one way to do it:

library(ggplot2)
# select only columns of numeric type
diamondsNum <- diamonds[, sapply(diamonds, is.numeric)]

meansByCutBy <- by(diamondsNum, diamonds$cut, function(x) sapply(x, mean))

# are you familiar with this:
meansByCutBy <- do.call(rbind, meansByCutBy)
meansByCutBy
##               carat    depth    table    price        x        y        z
## Fair      1.0461366 64.04168 59.05379 4358.758 6.246894 6.182652 3.982770
## Good      0.8491847 62.36588 58.69464 3928.864 5.838785 5.850744 3.639507
## Very Good 0.8063814 61.81828 57.95615 3981.760 5.740696 5.770026 3.559801
## Premium   0.8919549 61.26467 58.74610 4584.258 5.973887 5.944879 3.647124
## Ideal     0.7028370 61.70940 55.95167 3457.542 5.507451 5.520080 3.401448

Here is another one:

meansByCutAg <- aggregate(diamondsNum, list(diamonds$cut), mean)
meansByCutAg
##     Group.1     carat    depth    table    price        x        y
## 1      Fair 1.0461366 64.04168 59.05379 4358.758 6.246894 6.182652
## 2      Good 0.8491847 62.36588 58.69464 3928.864 5.838785 5.850744
## 3 Very Good 0.8063814 61.81828 57.95615 3981.760 5.740696 5.770026
## 4   Premium 0.8919549 61.26467 58.74610 4584.258 5.973887 5.944879
## 5     Ideal 0.7028370 61.70940 55.95167 3457.542 5.507451 5.520080
##          z
## 1 3.982770
## 2 3.639507
## 3 3.559801
## 4 3.647124
## 5 3.401448

Bottom line:

  • It is not always trivial to decide which function to use, because that requires very intimate knowledge of their clockwork…
  • Always experiment with small data sets (toy examples) and make sure that what you have is what you want.

The plyr package

It is important that you get familiar with the functions of the apply family, if only to understand the code of others, and because apply flavors often percolate into packages that deal with specific object types. Also, they are shipped with base R and do not require any additional package.

However, overall this set of functions is relatively disparate, and it is sometimes tricky to decide which function to use (if a suitable one is available). The choice depends on:

  • the class of your input data
  • the kind of processing you want to do
  • the form of output you want.

The plyr package, to which dplyr is related, offers a more consistent and extensive infrastructure based on the Split-Apply-Combine strategy. Although plyr can be slow at times, it offers a very easy way to parallelise the execution of your code, which may help speed things up.
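As a teaser, the diamonds exercise above collapses to a single plyr call (a sketch, assuming plyr is installed; numcolwise(mean) turns mean into a function applied to every numeric column of each piece):

library(plyr)
# split diamonds by cut, apply column-wise means, combine into a data frame
ddply(diamonds, "cut", numcolwise(mean))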