Data manipulation I: play with the vector and make it your toy.

Indexing and subsetting vectors

To subset a vector x, place an indexing vector idx (which may be a single value) inside the [] operator; idx designates the elements of x to return: x[idx]

idx can be one of three types:

  • integer
  • logical
  • character

To illustrate these alternatives, let’s first create an integer vector:

x <- 101:105 # the ':' is a shorthand for the function seq() that we will see later.
# Name the elements
names(x) <- c("A", "B", "C", "D", "E")
# OR
x <- setNames(101:105, c("A", "B", "C", "D", "E"))
x
##   A   B   C   D   E 
## 101 102 103 104 105

Indexing with integer values

  • Positive integers select elements at specific positions; the same position can be repeated.

Source: Hands-On Programming with R

x[c(2, 3, 3)] # several elements, the same position(s) can occur several times
x[3:5] # generate a sequence of integers and subset with it
x[c(1, 105)] # out-of-bounds index values return NA
  • Negative integers exclude the corresponding elements
x[-2] # Everything but the second element.

NB: One cannot mix positive and negative indices.
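
The restriction above can be checked with a small sketch. The vector z and the use of tryCatch() are illustrative choices, not part of the original material; the exact wording of the error message depends on your R version, so it is captured rather than shown.

```r
# Hypothetical named vector, built like x above
z <- setNames(101:105, c("A", "B", "C", "D", "E"))

# Mixing a negative and a positive index raises an error;
# tryCatch() captures the message instead of stopping the script
res <- tryCatch(z[c(-1, 2)], error = function(e) conditionMessage(e))
res

# Workaround: apply the exclusion and the selection in two steps
two_step <- z[-1][1:2]   # drop the first element, then keep the first two
two_step
```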

Exercise

  • What happens if you subset with zero?
  • Extract all the elements of x but the first.
  • Extract all elements but the last two, using the length() function.

Logical Indexing

A logical vector keeps the elements at the positions where it is TRUE. If it is shorter than x, it is recycled without warning.

Source: Hands-On Programming with R

x[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
##   A   D   E 
## 101 104 105

x[c(T, F)] # if the logical vector is too short, it is recycled
##   A   C   E 
## 101 103 105

v <- x[c(T, T, T, T, T, T, T)]
v # if the logical vector is too long, NAs are returned
##    A    B    C    D    E <NA> <NA> 
##  101  102  103  104  105   NA   NA

Logical indexing is often employed to select specific subsets meeting some condition of interest:

idx <- x > 102 & x <= 104 # logical vector
idx
##     A     B     C     D     E 
## FALSE FALSE  TRUE  TRUE FALSE

x[idx]
##   C   D 
## 103 104

Exercise

  • Select all values in x that are ±1 standard deviations away from the mean.

  • Select all elements in v that are not NA.

  • When you get the chance, type apropos("^is.") and demo("is.things") to get a sense of the many functions that let you test the nature of R objects.

Name indexing

Character vectors select elements with matching names. Note that partial matching is not allowed.

x[c("A", "A", "D")]
##   A   A   D 
## 101 101 104
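
To make the remark on partial matching concrete, here is a minimal sketch. The vector y and its names are hypothetical; startsWith() is used only as one possible way to select by name prefix when that is what you want.

```r
# Hypothetical named vector
y <- c(alpha = 1, beta = 2)

y["alpha"]   # exact name: the element is returned
y["al"]      # partial name: no exact match, so the result is NA

# If matching on a prefix is really wanted, build a logical index explicitly
by_prefix <- y[startsWith(names(y), "al")]
by_prefix
```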

Can you ‘emulate’ name indexing by calling match() on the vector of names?

idx <- match(c("A", "A", "D"), names(x))
idx
## [1] 1 1 4
x[idx]
##   A   A   D 
## 101 101 104

Name indexing is also very handy for creating look-up tables (in the form oldValue = "new value") to recode a variable:

v <- c("three", "four", "one", "two", "three", "one", "four") # Vector to be recoded.
lookUp <- c(one = "un", two = "deux", three = "trois", four = "quatre") # look-up table

v
## [1] "three" "four"  "one"   "two"   "three" "one"   "four"

unname(lookUp[v]) # recoding and getting rid of names
## [1] "trois"  "quatre" "un"     "deux"   "trois"  "un"     "quatre"

Can you think of another way to do that with utilities for factors?

w <- as.factor(v)
levels(w)
## [1] "four"  "one"   "three" "two"

levels(w) <- c("quatre", "un", "trois", "deux")

as.character(w)
## [1] "trois"  "quatre" "un"     "deux"   "trois"  "un"     "quatre"

Modifying vectors (and other objects) in place via subsetting and the assignment operator

All subsetting operators can be combined with assignment to modify selected values of the input vector. The rest of the vector is unaffected.

  • Basic examples:
oldX <- x
x[1] <- 1

x < 103
##     A     B     C     D     E 
##  TRUE  TRUE FALSE FALSE FALSE
x[ x < 103 ] <- 0

x
##   A   B   C   D   E 
##   0   0 103 104 105

  • Be aware of what happens when the vectors on either side of the assignment have different lengths.
x[1:4] <- 0:1 # The right-hand side vector is recycled to match the length of the subsetted vector
x
##   A   B   C   D   E 
##   0   1   0   1 105

x[1:2] <- 10:14 # WARNING + the subset of x is modified with the first elements of the replacement vector
## Warning in x[1:2] <- 10:14: number of items to replace is not a multiple of
## replacement length
x
##   A   B   C   D   E 
##  10  11   0   1 105
  • Indices and NAs
x[c(1, NA)] <- c(1, 2) # You CAN'T use NA in an integer index when assigning several values
## Error in x[c(1, NA)] <- c(1, 2): NAs are not allowed in subscripted assignments
x
##   A   B   C   D   E 
##  10  11   0   1 105

x[c(T, F, NA)] <- 1000 # in logical indices, NA is treated as FALSE
x
##    A    B    C    D    E 
## 1000   11    0 1000  105

  • To delete elements, just subset what you want and re-assign the name of your object to it.

  • Assignment with a logical vector is widely used as a substitute for for/if or ifelse() constructs (described later).

x <- 1:4
isOddX <- as.logical(x %% 2) # modulo 2 is not 0

x[which(isOddX)] # odd numbers
## [1] 1 3

x[isOddX] <- x[isOddX] + 1 # do something about odd numbers
x
## [1] 2 2 4 4

This will be illustrated in some exercises further down the road!
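
In the meantime, the following sketch compares the three styles mentioned above on the same task. The data and the task (replacing negative values by 0) are hypothetical and chosen only for illustration.

```r
# Hypothetical data: replace every negative value by 0
y <- c(3, -1, 4, -1, 5)

# 1. Explicit loop with an if test
y1 <- y
for (i in seq_along(y1)) {
  if (y1[i] < 0) y1[i] <- 0
}

# 2. Assignment through a logical index
y2 <- y
y2[y2 < 0] <- 0

# 3. ifelse(): returns a new vector instead of modifying in place
y3 <- ifelse(y < 0, 0, y)

# All three approaches give the same result
identical(y1, y2) && identical(y2, y3)
```

The logical-index version is usually the shortest to write and to read, which is one reason it is so common in R code.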

Exercise

The rev() function returns a reversed version of its argument.

rev(LETTERS)
##  [1] "Z" "Y" "X" "W" "V" "U" "T" "S" "R" "Q" "P" "O" "N" "M" "L" "K" "J"
## [18] "I" "H" "G" "F" "E" "D" "C" "B" "A"

Can you think of a way to reverse the LETTERS vector without this function?

Exercise

How would you append the value(s) in x to the right end of vector v? There is a tedious way and a simple way to do that.

x <- 5:8
v <- 1:4
  • What happens if you remove the parentheses around the ‘+’ operations?
    This is relevant to operator precedence. See: ?Syntax

How would you insert the values of x at a specific location within v, rather than the end?

  • Use append()

  • If you are curious, have a look at what append() actually does, for example with the F2 key.

Exercise

How can you extract consonants with the vector of vowels? Tip: the built-in constant letters contains the 26 lower-case letters of the Roman alphabet.

Generating regular sequences

It is often necessary to generate regular sequences or patterns of values, for example when you want to assign replicated levels of factors to experimental units.

In R there are at least two base functions to do this kind of work:

  • seq(from, to, by, length.out, along.with, ...), which is a generalization of the from:to operator.
    As you will see in its documentation, this function is quite versatile. Typical usages include:
seq(from = 1, to = 6)
## [1] 1 2 3 4 5 6
seq(from = 1, to = 6, by = 2)
## [1] 1 3 5
seq(from = 1, by = 2, length.out = 3)
## [1] 1 3 5
seq(along.with = v)
## [1] 1 2 3 4
seq(5)
## [1] 1 2 3 4 5
seq(length.out = 4)
## [1] 1 2 3 4

  • rep(x, times, each, length.out), which replicates the values in x.
    Again, this function is quite versatile and all kinds of patterns can be generated.
s <- c("a", "b", "c")
rep(s, times = 2)
## [1] "a" "b" "c" "a" "b" "c"
rep(s, each = 2)
## [1] "a" "a" "b" "b" "c" "c"
rep(s, times = 1:length(s))
## [1] "a" "b" "b" "c" "c" "c"
rep(s, each = 3, times = 2)
##  [1] "a" "a" "a" "b" "b" "b" "c" "c" "c" "a" "a" "a" "b" "b" "b" "c" "c"
## [18] "c"
rep(s, each = 2, length.out = 4)
## [1] "a" "a" "b" "b"
rep(s, each = 2, length.out = 10)
##  [1] "a" "a" "b" "b" "c" "c" "a" "a" "b" "b"
  • Along the same lines, gl() generates factors by specifying the pattern of their levels.
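
As a minimal sketch of gl(), assuming a hypothetical experiment with two treatments (the labels below are made up for illustration):

```r
# n = number of levels, k = number of consecutive replications of each level
f <- gl(n = 2, k = 3, labels = c("control", "treated"))
f
## [1] control control control treated treated treated
## Levels: control treated

# With k = 1 and a total length, the levels alternate across the units
g <- gl(n = 2, k = 1, length = 6, labels = c("control", "treated"))
as.character(g)
## [1] "control" "treated" "control" "treated" "control" "treated"
```

Compared with rep(), the result is directly a factor, so no separate call to factor() is needed.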

Exercise

Source

Write the expressions that generated the following patterns:

1 2 3 1 2 3 1 2 3
4 3 2 1 4 3 2 1 4 3 2 1
1 1 1 2 2 2 3 3 3 4 4 4
"un"   "un"   "un"   "deux" "deux" "deux" "deux" "deux" "deux"
1.0 1.0 1.5 1.5 2.0 2.0 2.5 2.5 1.0 1.0 1.5 1.5 2.0 2.0 2.5 2.5

Exercise

Let’s take:

x <- seq(1, 20, by = 2)
x
##  [1]  1  3  5  7  9 11 13 15 17 19

Extract every third element of x. Do it twice: once with a logical index and once with an integer index.

Integer indexing:

Logical indexing:

Creating vectors of random numbers

Source: R pour les débutants

  • R boasts one of the best random number generators and offers functions to easily generate random numbers from various distributions.

  • The density (function prefix d), cumulative distribution function (p), quantile function (q) and random variate generation (r) for many standard probability distributions are available in the stats package. Look at ?Distributions.

  • If you want to reproduce your work later, call set.seed() to set the seed of R's random number generator. This is useful for creating simulations or random objects that can be reproduced.
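
The four prefixes can be sketched for the normal distribution. The values passed to the functions and the seed are arbitrary choices for illustration.

```r
dnorm(0)        # density of the standard normal at 0, i.e. 1 / sqrt(2 * pi)
pnorm(1.96)     # P(X <= 1.96), about 0.975
qnorm(0.975)    # quantile of order 0.975, about 1.96 (the inverse of pnorm())

set.seed(1)     # arbitrary seed, so that the draws below can be reproduced
draws <- rnorm(3)   # three random draws from the standard normal
draws
```

The same d/p/q/r naming applies to the other distributions in stats, for example dunif(), punif(), qunif() and runif() for the uniform distribution.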

An example, generating random samples from a normal distribution:

set.seed(124)
rnorm(n = 5 , mean = 10, sd = 3)
## [1]  5.844788 10.114970  7.710910 10.636918 14.276614

set.seed(421)
rnorm(n = 5 , mean = 10, sd = 3)
## [1] 12.41448 11.70813 13.04790 13.85125 10.22423

set.seed(124)
rnorm(n = 5 , mean = 10, sd = 3)
## [1]  5.844788 10.114970  7.710910 10.636918 14.276614

Or from a uniform distribution:

runif(n = 5, min = 0, max = 10)
## [1] 7.717069 8.568504 7.581080 8.503020 4.092967

Randomly sampling objects from a vector.

The sample() function draws a random sample from a given population. It can sample with or without replacement, depending on the replace argument (the default is FALSE).

A few examples:

sample(x = month.abb, size = 5)
## [1] "Jan" "Jul" "Aug" "Oct" "Dec"
sample(x = month.abb, size = 13)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
sample(x = month.abb, size = 13, replace = TRUE )
##  [1] "Aug" "Aug" "Jan" "May" "May" "Mar" "Nov" "Apr" "Oct" "May" "Oct"
## [12] "Apr" "Jun"
sample(x = c(0,1), size = 20, replace = TRUE, prob = c(0.1, 0.9))
##  [1] 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1

Sorting vectors

  • To sort vectors or factors, the most intuitive function is sort().
    It returns the elements of the input vector in increasing or decreasing order, depending on the decreasing argument.
x <- c(13,5,12,5)
sort(x, decreasing = TRUE)
## [1] 13 12  5  5
  • order() is actually more flexible, in the sense that it can sort objects based on several sorting keys. We will use it for data frames later.

In contrast to sort(), it does not return the sorted input but a vector of integers holding the indices of the elements of the input. These indices are permuted to reflect the increasing or decreasing order of the input object.

Let’s see an example…

someMonths <- c(sample(x = month.abb, size = 13, replace = TRUE ), NA)
someMonths
##  [1] "Mar" "Mar" "May" "Mar" "Sep" "Mar" "Jul" "Jun" "Jun" "Jan" "Apr"
## [12] "Jul" "Jan" NA
idx <- order(someMonths, na.last = FALSE, decreasing = FALSE) # Note the optional arguments!!
idx
##  [1] 14 11 10 13  7 12  8  9  1  2  4  6  3  5
  • This index can be used to actually re-order the original object:
someMonths[idx]
##  [1] NA    "Apr" "Jan" "Jan" "Jul" "Jul" "Jun" "Jun" "Mar" "Mar" "Mar"
## [12] "Mar" "May" "Sep"
  • The last function of interest is rank() that returns the sample ranks of the values in a vector:
x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)
names(x) <- letters[1:11]
rank(x, ties.method = "first")
##  a  b  c  d  e  f  g  h  i  j  k 
##  4  1  6  2  7 11  3 10  8  5  9
rank(x, ties.method = "average")
##    a    b    c    d    e    f    g    h    i    j    k 
##  4.5  1.5  6.0  1.5  8.0 11.0  3.0 10.0  8.0  4.5  8.0

### ALWAYS BE AWARE OF HOW TIES ARE HANDLED!! ###
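
Sorting on several keys with order(), mentioned earlier in this section, could look like the sketch below. The vectors grp and score are hypothetical; negating a numeric key is one common way to sort that key in decreasing order.

```r
# Hypothetical data: a grouping key and a numeric score
grp   <- c("b", "a", "b", "a", "c")
score <- c(2, 5, 9, 1, 4)

# First key: grp in increasing order; ties are broken by score, largest first
idx2 <- order(grp, -score)
idx2
## [1] 2 4 3 1 5

grp[idx2]
## [1] "a" "a" "b" "b" "c"
score[idx2]
## [1] 5 1 9 2 4
```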

Comparing vectors: set operations

R includes some handy set operations, including these:

Function                          Description
union(x,y)                        Union of the sets x and y
intersect(x,y)                    Intersection of the sets x and y
setdiff(x,y)                      Set difference between x and y, consisting of all elements of x that are not in y
setequal(x,y)                     Test for equality between x and y
is.element(el, set) ; c %in% y    Membership, testing whether c is an element of the set y
choose(n,k)                       Number of possible subsets of size k chosen from a set of size n

Note that x and y should be vectors of the same mode, preferably without duplicated values. Duplicates are not returned in the result.

Here are some simple examples of using these functions:

x <- 1:10
y <- c(3:6, 12, 12, 15, 18)
union(x, y)
##  [1]  1  2  3  4  5  6  7  8  9 10 12 15 18
intersect(x, y)
## [1] 3 4 5 6
setdiff(x, y)
## [1]  1  2  7  8  9 10
setdiff(y, x)
## [1] 12 15 18
is.element(2, x)
## [1] TRUE
is.element(y, x)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

let <- letters[1:2]
union(y, let) # Note the implicit type coercion
## [1] "3"  "4"  "5"  "6"  "12" "15" "18" "a"  "b"
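
The functions from the table that are not exercised above can be sketched as follows. The vectors a and b are hypothetical copies of x and y, used here so that the objects above are left untouched.

```r
a <- 1:10
b <- c(3:6, 12, 12, 15, 18)

# %in% gives the same answer as is.element()
identical(b %in% a, is.element(b, a))
## [1] TRUE

# setequal() ignores order and duplicates
setequal(c(1, 2, 2), c(2, 1))
## [1] TRUE

# choose() counts subsets; combn() enumerates them, one per column
choose(5, 2)
## [1] 10
ncol(combn(5, 2))
## [1] 10
```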

Exercise

  • Generate a numeric vector of 400 random values sampled from a uniform distribution with a maximum of 100.
  • Display a summary of this vector.
  • How many values are greater than the first quartile but less than the median? Tip: use quantile().

Exercise

  • Create a character vector of 10 month names randomly sampled (with replacement) from the built-in variable month.name.

  • Replace values in this vector with the numbers of the corresponding months (e.g. March with 3).

Exercise

Source: R for Biologists - Prof. Daniel Wegmann

Create a numerical vector f containing the elements 1, −1, 2, −2, . . . , 100, −100

Create a vector of 100 elements that contains the numbers 1, 2 and 3 in random order, but with twice as many 1s as 2s or 3s.

Exercise

Source: R for Biologists - Prof. Daniel Wegmann

  1. Create two vectors x and y containing 1000 random numbers normally distributed with sd=1 and mean=0 and mean=1, respectively.
  2. Calculate the number of pairs (x[i], y[i]) where y[i]>x[i].
  3. Calculate the number of values in y that are larger than the largest value in x.
  4. Calculate the number of values in x that are larger than the 200th smallest value in y and less than two standard deviations away from the mean of x.
  5. Create a vector z with all 999 differences between the neighboring elements of x such that z[1]=x[2]-x[1], z[2]=x[3]-x[2], . . ..