Data manipulation II: be the master of arrays (matrices), lists and data frames.

Pretty long section. Will cover:

Basics of indexing and subsetting

- arrays (matrices)

- lists

- data frames

Indexing and subsetting for matrices and arrays

  • The basic subsetting syntax for arrays is the bracket operator containing one vector of indices per dimension, separated by commas. Examples for matrices:


Source: Altuna Akalin

#  a matrix m
m <- matrix(data = 1:20, nrow = 4, byrow = FALSE)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
m[c(2, 4), ] # returns a matrix: two rows, all columns
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    6   10   14   18
## [2,]    4    8   12   16   20
m[ ,c(2, 4)] # a matrix: two columns, all rows
##      [,1] [,2]
## [1,]    5   13
## [2,]    6   14
## [3,]    7   15
## [4,]    8   16
m[2:6] # vector-like indexing by column for matrices
## [1] 2 3 4 5 6

  • Matrices and arrays can also be subset with an indexing integer matrix having as many columns as the object has dimensions; each row gives the coordinates of one element to extract.
idx <- cbind(rows = 1:4, columns = 1:4) # bind two numeric vectors as columns of an idx matrix
m[idx] # indexing returns a vector of values in the diagonal
## [1]  1  6 11 16
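The same indexing matrix can also be used on the left-hand side of an assignment to modify those cells in place (a quick sketch on a copy):

m2 <- m      # work on a copy
m2[idx] <- 0 # overwrite the diagonal elements
m2[idx]
## [1] 0 0 0 0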
  • Matrices and arrays can also be subset with logical matrices/arrays/vectors
# Use a logical vector to find rows that have a value > 5 in the second column
# and extract columns two to four for these rows
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
m[m[, 2] > 5, 2:4]
##      [,1] [,2] [,3]
## [1,]    6   10   14
## [2,]    7   11   15
## [3,]    8   12   16

We will do a lot of this with data frames.

  • row() and col() return a matrix of integers containing, for each cell, its row or column number in a matrix-like object.
row(m)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## [3,]    3    3    3    3    3
## [4,]    4    4    4    4    4
col(m)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    1    2    3    4    5
## [3,]    1    2    3    4    5
## [4,]    1    2    3    4    5

This is handy to extract specific regions of the matrix using a logical matrix of the same shape:

row(m) == col(m) # the diagonal
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE FALSE FALSE
## [3,] FALSE FALSE  TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE  TRUE FALSE

m[row(m) < col(m)] # extract the upper triangle by column without the diag
##  [1]  5  9 10 13 14 15 17 18 19 20

Not surprisingly, R has dedicated functions for that: lower.tri(), upper.tri() and diag() (which works both for extracting and setting).
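For example (diag() also works on non-square matrices, using the shorter dimension):

diag(m)         # same as m[row(m) == col(m)]
## [1]  1  6 11 16
m[upper.tri(m)] # same as m[row(m) < col(m)]
##  [1]  5  9 10 13 14 15 17 18 19 20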

  • which() also works for arrays and can return a numeric indexing matrix. Let's illustrate a use case where the goal is to find the locations of values of interest in a matrix:
# One wants to find values 2, 5 and 10 in `m`:
logiV <- m %in% c(2, 5, 10)
logiV # a logical vector by column
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
logiM <- matrix(data = logiV, nrow = 4, byrow = FALSE) # coerce into a matrix
logiM
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE  TRUE FALSE FALSE FALSE
## [2,]  TRUE FALSE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
m[logiM] # we did find them; apparently the values are not duplicated
## [1]  2  5 10

# To find the coordinates of the values of interest in the matrix:
idx <- which(logiM, arr.ind = TRUE)

val <- m[idx]

cbind(idx, val = val) # "column" bind the indexing matrix with the vector of values of interest.
##      row col val
## [1,]   2   1   2
## [2,]   1   2   5
## [3,]   2   3  10

Handy Matrix Functions

t(): transposes a matrix
colSums(): Sum of the column values
rowSums(): Sum of the row values
colMeans(): Mean of the column values
rowMeans(): Mean of the row values
%*%: matrix multiplication

t(m)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20
m + m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2   10   18   26   34
## [2,]    4   12   20   28   36
## [3,]    6   14   22   30   38
## [4,]    8   16   24   32   40
m-m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
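The operators above work element-wise; %*% performs true matrix multiplication and needs conformable dimensions, hence the t(m) in this sketch:

m %*% t(m) # (4 x 5) %*% (5 x 4) gives a 4 x 4 matrix
##      [,1] [,2] [,3] [,4]
## [1,]  565  610  655  700
## [2,]  610  660  710  760
## [3,]  655  710  765  820
## [4,]  700  760  820  880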

Exercise

Take the m matrix. How would you change all the values of the diagonal to 1?

Exercise

In biochemistry, people often use 96-well microplates, which are in essence a matrix of wells with 12 columns and 8 rows:


Here is a matrix containing the measurements output by an instrument:

measures <- matrix(data = round(runif(n = 96, max = 10), digits = 1), nrow = 8, ncol = 12)

In order to do computations on the values, you would need to reshape it into a long format, i.e. of the form: row#, column#, value. Can you do that?

Indexing and subsetting for lists

List indexing can be achieved with one of three indexing operators:

  • [: selects sub-lists. It always returns a list. Used with a single positive integer, it returns a list of length one.
  • [[: returns the actual element within a list at the specified single index.
  • $: a shorthand for name indexing, where x$y is equivalent to x[["y", exact = FALSE]]
  • again, integer, logical and name indexing are possible.

Source: Hands-On Programming with R

fileCabinet <- list(drawer1 = c(length = 1, width = 3),
                    drawer2 = FALSE,
                    top = c(item1 = "flowers", item2 = "sharpener"))
str(fileCabinet)
## List of 3
##  $ drawer1: Named num [1:2] 1 3
##   ..- attr(*, "names")= chr [1:2] "length" "width"
##  $ drawer2: logi FALSE
##  $ top    : Named chr [1:2] "flowers" "sharpener"
##   ..- attr(*, "names")= chr [1:2] "item1" "item2"

The [ operator. These expressions all return the same list:

fileCabinet[1:2]
fileCabinet[c(T, T, F)]
fileCabinet[c("drawer1", "drawer2")] # Partial name matching is not possible

The [[ and $ operators. These expressions all return the same element:

fileCabinet[[3]]
fileCabinet[["top"]]
fileCabinet[["t", exact = NA]] # the exact argument controls partial matching, TRUE or NA (return a warning)
fileCabinet$top
fileCabinet$t

identical(fileCabinet[1], fileCabinet[[1]])
## [1] FALSE
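To see why, compare their structures:

str(fileCabinet[1])   # a list of length one
## List of 1
##  $ drawer1: Named num [1:2] 1 3
##   ..- attr(*, "names")= chr [1:2] "length" "width"
str(fileCabinet[[1]]) # the element itself, a named numeric vector
##  Named num [1:2] 1 3
##  - attr(*, "names")= chr [1:2] "length" "width"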

Take a couple of minutes to run these expressions and understand why they return the same thing.

  • Recursive indexing:
str(fileCabinet)
## List of 3
##  $ drawer1: Named num [1:2] 1 3
##   ..- attr(*, "names")= chr [1:2] "length" "width"
##  $ drawer2: logi FALSE
##  $ top    : Named chr [1:2] "flowers" "sharpener"
##   ..- attr(*, "names")= chr [1:2] "item1" "item2"
fileCabinet[[c(1,2)]]
## [1] 3
fileCabinet[[c(3,2)]]
## [1] "sharpener"
fileCabinet[[c("top", "item2")]]
## [1] "sharpener"
fileCabinet$top[["item2"]]
## [1] "sharpener"
fileCabinet$top["item2"]
##       item2 
## "sharpener"
# Combinations of subsetting operators are possible but not recommended because the code is hard to read:
fileCabinet[[3]][[2]]
## [1] "sharpener"
fileCabinet[[3]][2]
##       item2 
## "sharpener"

Do you see a difference in the results of the last four expressions? This illustrates a case of simplifying the result of an indexing operation. We will come back to this shortly.

  • Just like with other basic R structures, in place modification of a list is possible with indexing and the <- operator.

For example, to delete elements in a list, assign NULL:

lst <- list(a = c(TRUE, FALSE), b = NULL, c = list(list()))
lst[c("a", "c")] <- NULL

Exercise

How would you replace the last two elements of fileCabinet with the first element of fileCabinet and a sequence of numbers?

How would you change the value of width in drawer1 to 2?

Basics of indexing and subsetting for data frames

Data frames are hybrid objects at the intersection between a list and a matrix. Not unexpectedly, both matrix and list indexing techniques apply to data frames.

df <- data.frame(A = letters[1:4],
                 `B-2` = rep(c(TRUE, FALSE), each = 2),
                 `C with space` = seq(10, 20, length.out = 4),
                 D = c(runif(3), NA)
)
df
##   A   B.2 C.with.space         D
## 1 a  TRUE     10.00000 0.7670464
## 2 b  TRUE     13.33333 0.1255563
## 3 c FALSE     16.66667 0.2556670
## 4 d FALSE     20.00000        NA

See what happened to the weird column names?

On a general note, if you want to make your life easier, just prohibit spaces, +, -, * and other special characters in column/row names, even in tabulated files that you intend to import into R.

  • List-like indexing:
df[c(1,4)] # To select one or several columns. Returns a data frame
##   A         D
## 1 a 0.7670464
## 2 b 0.1255563
## 3 c 0.2556670
## 4 d        NA
df[3]
##   C.with.space
## 1     10.00000
## 2     13.33333
## 3     16.66667
## 4     20.00000
df[[3]] # To select the CONTENT of a single column using the list double brackets.
## [1] 10.00000 13.33333 16.66667 20.00000
df$C # note: $ does partial name matching on data frames
## [1] 10.00000 13.33333 16.66667 20.00000
# And so on...

  • Matrix-like indexing:
df[,c(1,4)] # extracts columns
##   A         D
## 1 a 0.7670464
## 2 b 0.1255563
## 3 c 0.2556670
## 4 d        NA
df[, c("A", "D")]
##   A         D
## 1 a 0.7670464
## 2 b 0.1255563
## 3 c 0.2556670
## 4 d        NA
df[2:3, ]
##   A   B.2 C.with.space         D
## 2 b  TRUE     13.33333 0.1255563
## 3 c FALSE     16.66667 0.2556670

  • With logical values:
df[df$B.2, ] # a df
##   A  B.2 C.with.space         D
## 1 a TRUE     10.00000 0.7670464
## 2 b TRUE     13.33333 0.1255563
df[, colnames(df) %in% "D"] # column turned to a vector !!!
## [1] 0.7670464 0.1255563 0.2556670        NA
df[, colnames(df) %in% "D", drop = FALSE] # column remains a df !!!
##           D
## 1 0.7670464
## 2 0.1255563
## 3 0.2556670
## 4        NA

df[df$C.with.space < 20 & df$D > 0.4, ] # use comparison and logical operators to subset
##   A  B.2 C.with.space         D
## 1 a TRUE           10 0.7670464
  • This subsetting technique, with $ to refer to columns, can become tedious to type. So in interactive mode, one can use with():
with(data = df, df[C.with.space < 20 & D > 0.4, ])
##   A  B.2 C.with.space         D
## 1 a TRUE           10 0.7670464

Creating data frames with a list as a column

# This works as expected
DF <- data.frame(a = 1:3)
DF$b <- list(1:1, 1:2, 1:3)
str(DF)
## 'data.frame':    3 obs. of  2 variables:
##  $ a: int  1 2 3
##  $ b:List of 3
##   ..$ : int 1
##   ..$ : int  1 2
##   ..$ : int  1 2 3
# This alternative produces an error
DF <- data.frame(a = 1:3, b = list(1:1, 1:2, 1:3))
## Error in data.frame(1L, 1:2, 1:3, check.names = FALSE, stringsAsFactors = TRUE): arguments imply differing number of rows: 1, 2, 3
  • If a list or data frame or matrix is passed to data.frame() it is as if each component or column had been passed as a separate argument.
  • A solution if you need this type of construct? Use I() to protect your object and request that it be treated as is.
DF <- data.frame(a = 1:3, b = I(list(1,1:2,1:3)))

Simplifying vs. preserving

Remember from previous slides?

class(df[[3]]) == class(df[3])
## [1] FALSE
  • [[ and, for matrices and data frames, [ with its default drop = TRUE simplify their return value; [ with drop = FALSE preserves it.

  • Simplifying subsets returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want.

  • Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type.

  • Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors. (It will work for your test cases, but then someone will pass in a single column data frame and it will fail in an unexpected and unclear way.)

The sentences above are direct copies from http://adv-r.had.co.nz/Subsetting.html; follow the link if you want a thorough treatment of this aspect.
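A minimal defensive idiom along these lines (getCol() is a hypothetical helper, not from the source):

getCol <- function(d, cols) d[, cols, drop = FALSE] # always returns a data frame
class(getCol(df, "D"))
## [1] "data.frame"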

Exercise

Take df and keep all columns but “A” and “D”. Do that in at least two different ways (think <- list(NULL))

Exercise

The function installed.packages() returns a data frame with info on the R packages installed in your system.

  • How would you programmatically check whether the packages required for this training are installed on the system?
    We will need c("ape", "reshape2", "dplyr", "lattice", "ggplot2", "VennDiagram", "Biostrings").
  • For those that are installed, generate a simple data frame with only the columns “Package”, “LibPath” and “Version”

Exercise

Re-assign df to its original value and calculate the mean of the content of column “D”.

Data frame manipulations: transformations, aggregation, … using built-in R tools

Subsetting rows and columns with subset()

Let’s take our toy data frame:

df <- data.frame(A = letters[1:4],
                 B = rep(c(TRUE, FALSE), each = 2),
                 C = seq(10, 20, length.out = 4),
                 D = c(runif(3), NA)
)

We have already encountered multiple ways to extract rows and columns from a data frame.

  • When using logical indexing of rows, the subset(x, subset, select, drop = FALSE, ...) function can save you some substantial typing in interactive sessions and make the code easier to understand.
  • The select= argument is also pretty convenient for keeping only the columns of interest

subset(x = df, subset = C < 20 & D > 0.4)
subset(x = df, subset = C < 20 & D > 0.4, select = c("A", "B", "C"))
subset(x = df, subset = C < 20 & D > 0.4, select = !colnames(df) %in% c("A", "D"))
subset(x = df, subset = C < 20 & D > 0.4, select = A:C) # "sequence" of column names
subset(x = df, subset = C < 20 & D > 0.4, select = -(C:D)) # "Negated" column names
  • While subset= must have a logical value, select= accepts more types.

  • Note that in order to allow us to refer to data frame columns without explicit use of the data frame name (as in df$ColA), subset(), like with() and other functions (especially in dplyr and reshape2), uses Non-Standard Evaluation (NSE).

It is a fairly advanced topic, but you should remember that, because of this usage of NSE, it is not recommended to call these functions inside your own functions: it may simply not work or cause unexpected behavior.
For a discussion of NSE, see Advanced R

Renaming rows and columns in a data frame

To change these attributes, there are two dedicated functions that work both for getting and modifying names: rownames() and colnames().

colnames(df)
## [1] "A" "B" "C" "D"
colnames(df) <- month.abb[1:ncol(df)]
rownames(df) <- sample(state.name, size = nrow(df))
df
##               Jan   Feb      Mar       Apr
## Louisiana       a  TRUE 10.00000 0.6436652
## West Virginia   b  TRUE 13.33333 0.2687169
## Alaska          c FALSE 16.66667 0.4745574
## Nebraska        d FALSE 20.00000        NA

In a data frame, values for:

  • rownames() should be a character vector of non-duplicated and non-missing names.
  • colnames() should be a character vector of (preferably) unique, syntactically-valid names.
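make.names() is the sanitization that data.frame() silently applied to the weird column names earlier:

make.names(c("B-2", "C with space"))
## [1] "B.2"          "C.with.space"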

Handling Not Available (NA) values

We have seen before that one should always be wary of the presence of NAs in the data under analysis.

  • One informal way to look for them is to scrutinize the info displayed by summary()

  • To look for and exclude them programmatically, test for their presence with is.na() and use logical indexing.

  • na.omit() and related functions are useful to conveniently deal with NAs in data frames. summary(df) indicates the number of NAs for each variable.

  • Remember also that many (summary) functions have an argument like na.rm= that controls how to handle them.

Let’s reset our df to factory values
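The reset code is not shown on the slide; it simply re-runs the constructor from above (the random values in D will differ between runs):

df <- data.frame(A = letters[1:4],
                 B = rep(c(TRUE, FALSE), each = 2),
                 C = seq(10, 20, length.out = 4),
                 D = c(runif(3), NA)
)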

summary(df)
##  A         B                 C              D          
##  a:1   Mode :logical   Min.   :10.0   Min.   :0.03823  
##  b:1   FALSE:2         1st Qu.:12.5   1st Qu.:0.07693  
##  c:1   TRUE :2         Median :15.0   Median :0.11563  
##  d:1   NA's :0         Mean   :15.0   Mean   :0.35267  
##                        3rd Qu.:17.5   3rd Qu.:0.50989  
##                        Max.   :20.0   Max.   :0.90415  
##                                       NA's   :1

df[!is.na(df$D), ]
##   A     B        C          D
## 1 a  TRUE 10.00000 0.90415435
## 2 b  TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775

na.omit(df)
##   A     B        C          D
## 1 a  TRUE 10.00000 0.90415435
## 2 b  TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
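Relatedly, complete.cases() returns the row-wise logical vector that this filtering relies on, and na.rm= handles NAs inside summary functions:

complete.cases(df) # TRUE where a row contains no NA at all
## [1]  TRUE  TRUE  TRUE FALSE
mean(df$D)               # NA propagates by default
## [1] NA
mean(df$D, na.rm = TRUE) # drops the NA before computing (value depends on the random draw)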

Adding / Removing rows and columns

  • Adding / Removing rows

    • To delete rows, use negative integer indices or (negated) logical indexing and assign the result to your df.
    • To add rows, use rbind() and re-assign (column names must match), or use subsetting and assignment for in-place modification.
df <- df[!is.na(df$D), ] # delete
rbind(df, df[1, ]) # bind
##   A     B        C          D
## 1 a  TRUE 10.00000 0.90415435
## 2 b  TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
## 4 a  TRUE 10.00000 0.90415435
df[nrow(df) + 1, ] <- df[1, ] # bind with another method
df
##   A     B        C          D
## 1 a  TRUE 10.00000 0.90415435
## 2 b  TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
## 4 a  TRUE 10.00000 0.90415435

  • Adding / Removing columns

    • To delete columns, assign NULL to the column(s) (the list(NULL) variant works too), or return only the columns you want and assign the result to your df.
    • To add columns, use subsetting and assignment for in-place modification.
    • To add columns, you can also use cbind() and re-assign (if binding a character vector, it is converted to a factor unless stringsAsFactors = FALSE; note that since R 4.0 the default is FALSE).
subset(df, select = A:B) # could be re-assigned to df to delete col C and D
##   A     B
## 1 a  TRUE
## 2 b  TRUE
## 3 c FALSE
## 4 a  TRUE
df <- cbind(df, E = state.name[1:nrow(df)], F = rnorm(n = nrow(df))) # does convert to factor
df[ , "G"] <- state.name[1:nrow(df)] # does NOT 
str(df)
## 'data.frame':    4 obs. of  7 variables:
##  $ A: Factor w/ 4 levels "a","b","c","d": 1 2 3 1
##  $ B: logi  TRUE TRUE FALSE TRUE
##  $ C: num  10 13.3 16.7 10
##  $ D: num  0.9042 0.1156 0.0382 0.9042
##  $ E: Factor w/ 4 levels "Alabama","Alaska",..: 1 2 3 4
##  $ F: num  -1.0387 0.5046 -0.3716 -0.0255
##  $ G: chr  "Alabama" "Alaska" "Arizona" "Arkansas"
df["G"] <- NULL

  • Creating new columns whose values are calculated by applying a function to other column(s):
    • compute your new columns and add them to your df with your technique of choice, or
    • use the convenience function transform(), which also saves you from referring to the df name in expressions, and re-assign.
df <- cbind(df, G = sqrt(df$D)* (df$C + pi)) # calculate values with vectors of equal length
df <- transform(df, H = ceiling(G - mean(C))) # mean(C) is recycled
df
##   A     B        C          D        E           F         G  H
## 1 a  TRUE 10.00000 0.90415435  Alabama -1.03870301 12.495950  0
## 2 b  TRUE 13.33333 0.11562937   Alaska  0.50458974  5.602186 -6
## 3 c FALSE 16.66667 0.03822775  Arizona -0.37158670  3.872895 -8
## 4 a  TRUE 10.00000 0.90415435 Arkansas -0.02548523 12.495950  0

Reordering the rows and columns in a data frame

  • General principle: permute the indices of columns or rows and reassign to your df.
  • As introduced for vectors, order() returns a permutation of the indices of the elements of x such that the corresponding elements are ordered.
  • order() can work with a combination of sorting keys provided as vectors of the same length. So one can sort a data frame based on several columns:
df
##   A     B        C          D        E           F         G  H
## 1 a  TRUE 10.00000 0.90415435  Alabama -1.03870301 12.495950  0
## 2 b  TRUE 13.33333 0.11562937   Alaska  0.50458974  5.602186 -6
## 3 c FALSE 16.66667 0.03822775  Arizona -0.37158670  3.872895 -8
## 4 a  TRUE 10.00000 0.90415435 Arkansas -0.02548523 12.495950  0
df[ order(df$B, df$H, -df$F), ]
##   A     B        C          D        E           F         G  H
## 3 c FALSE 16.66667 0.03822775  Arizona -0.37158670  3.872895 -8
## 2 b  TRUE 13.33333 0.11562937   Alaska  0.50458974  5.602186 -6
## 4 a  TRUE 10.00000 0.90415435 Arkansas -0.02548523 12.495950  0
## 1 a  TRUE 10.00000 0.90415435  Alabama -1.03870301 12.495950  0

Note that specifying the increasing/decreasing behavior for individual columns is achieved by negating the column values (this works for numeric columns; see also the stackoverflow discussion).

Finding unique/duplicated rows

  • unique() returns only the unique elements
  • duplicated() returns a logical vector where TRUE indicates that the corresponding element is a duplicate.
  • they both work with vectors, arrays and data frames.
df <- subset(df, , A:D)
unique(df)
##   A     B        C          D
## 1 a  TRUE 10.00000 0.90415435
## 2 b  TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
duplicated(df)
## [1] FALSE FALSE FALSE  TRUE
df[duplicated(df), ]
##   A    B  C         D
## 4 a TRUE 10 0.9041543
df[!duplicated(df), ] # note: NOT -duplicated(df); negating a logical gives 0/-1 indices
##   A     B        C          D
## 1 a  TRUE 10.00000 0.90415435
## 2 b  TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775

Merging data frames

Merging data frames is also a recurring necessity and a powerful technique in data analysis. It involves combining the information of two data frames based on common variable(s) or key(s). This is reminiscent of running SQL queries across tables in relational databases.

In base R, merge() lets you combine such data frames:
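(The construction of the two toy data frames is not shown on the slides; this minimal sketch matches their printed output.)

products <- data.frame(ProdID   = c("A", "B", "C", "D"),
                       category = c("vegetable", "meat", "can", "iceCream"),
                       price    = c(4, 3, 2, 1))
sales <- data.frame(CustomID = c("a", "a", "b", "b", "b"),
                    ProdID   = c("C", "A", "B", "C", "E"),
                    Date     = c("2015-09-01", "2015-09-01", "2015-09-02",
                                 "2015-09-03", "2015-09-03"))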

# The grocery store example...
products
##   ProdID  category price
## 1      A vegetable     4
## 2      B      meat     3
## 3      C       can     2
## 4      D  iceCream     1

sales
##   CustomID ProdID       Date
## 1        a      C 2015-09-01
## 2        a      A 2015-09-01
## 3        b      B 2015-09-02
## 4        b      C 2015-09-03
## 5        b      E 2015-09-03

How can we generate some kind of bill?

  • The natural output would be to combine records from sales with info from products and sort by CustomID:
bill <- merge(x = sales, y = products, by = "ProdID", all.x = TRUE, sort = FALSE)
bill[order(bill$CustomID, bill$Date, bill$ProdID), ]
##   ProdID CustomID       Date  category price
## 3      A        a 2015-09-01 vegetable     4
## 1      C        a 2015-09-01       can     2
## 4      B        b 2015-09-02      meat     3
## 2      C        b 2015-09-03       can     2
## 5      E        b 2015-09-03      <NA>    NA
# There is no info about product E. What did the cashier scan!!?
  • There is something weird, so let's look at all records combined by ProdID:
merge(x = sales, y = products, by = "ProdID", all = TRUE, sort = FALSE)
##   ProdID CustomID       Date  category price
## 1      C        a 2015-09-01       can     2
## 2      C        b 2015-09-03       can     2
## 3      A        a 2015-09-01 vegetable     4
## 4      B        b 2015-09-02      meat     3
## 5      E        b 2015-09-03      <NA>    NA
## 6      D     <NA>       <NA>  iceCream     1
# Product D did not sell well these last days.

  • Here is the full list of arguments to merge():

merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)

  • It is very flexible and enables all kind of combinations:
    • use several variables for by
    • do full, left, right and natural joins
    • sort the output on by
  • Be careful to properly specify the arguments to get what you want…
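For instance, here is a left join when the key column is named differently on each side (products2 is a made-up variant of products, not from the source):

products2 <- products
colnames(products2)[1] <- "productCode" # pretend the key has another name
merge(x = sales, y = products2, by.x = "ProdID", by.y = "productCode", all.x = TRUE)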

Aggregating data

  • Suppose now that we would like to calculate the total bill per customer. How would we do that?
  • This question relates to data aggregation, which notably involves calculating summary statistics for specific groups of records defined by combinations of the levels of (categorical) variables (factors).

  • Probably the most user-friendly base R function for these tasks is aggregate().

cleanBill <- na.omit(bill)
aggregate(x = cleanBill$price, by = cleanBill[c("CustomID", "Date")], FUN = sum)
##   CustomID       Date x
## 1        a 2015-09-01 6
## 2        b 2015-09-02 3
## 3        b 2015-09-03 2
# That answers our first question

aggregate() comes in two flavors:

  • aggregate(x, by, FUN, ..., simplify = TRUE)
    • x = is your data
    • by = is a list of grouping factors
    • FUN = the function used for aggregation
    • ... = optional arguments to FUN
    • simplify = whether results should be simplified to a vector or matrix/data frame if possible.
  • aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
    • formula = a formula, e.g. y ~ x or cbind(y1, y2) ~ x1 + x2, meaning: summarise y with x as a grouping factor. We will learn more about formulas when discussing plotting and statistical modelling.
    • data = the data frame where x and y can be found…

If time allows, you can run the examples in the documentation for aggregate() to get a better sense of what you can do with the formula notation.

Let’s look at other examples using the famous iris dataset:

First we want to calculate the means of the various measures conditioned on species:

aggregate(. ~ Species, data = iris, FUN = mean)
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

# The dot in a formula means "all columns from data that are not otherwise used"

We are interested in determining the correlation between the dimensions of sepals and petals conditioned on the plant species. For this we can use cor().

aggregate(iris[1:4], by = iris["Species"], FUN = cor, method = "spearman")
## Error in FUN(X[[i]], ...): supply both 'x' and 'y' or a matrix-like 'x'
# cor() complains...

This does not work because aggregate() first separates the columns and then, for each column, applies FUN to each subset of values (vectors) defined by by. The problem is that cor() here needs a 2-D object…

Let's try by(), which splits a data frame into subsets of data frames based on the values of one or more factors; the function FUN is then applied to each sub-data frame in turn.

irisCorel <- by(iris[1:4], INDICES = iris["Species"], FUN = cor, method = "spearman")

Note that additional arguments to FUN are passed after the FUN= in the call.

str(irisCorel)
## List of 3
##  $ setosa    : num [1:4, 1:4] 1 0.755 0.279 0.299 0.755 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ versicolor: num [1:4, 1:4] 1 0.518 0.737 0.549 0.518 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ virginica : num [1:4, 1:4] 1 0.427 0.824 0.316 0.427 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  - attr(*, "dim")= int 3
##  - attr(*, "dimnames")=List of 1
##   ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
##  - attr(*, "call")= language by.data.frame(data = iris[1:4], INDICES = iris["Species"], FUN = cor,      method = "spearman")
##  - attr(*, "class")= chr "by"

  • To sum up:

    • aggregate() is OK but for example, there is no way to provide names to the results of summarization in the output other than manually changing them afterwards.
    • by() is OK, but we may need to further process its output to format it in a table-like fashion.
    • aggregate() and by() pass the same function to all subsets. What if we wanted to calculate different summaries depending on the variable(s)?
  • The dplyr package will help you do some of that fairly easily.

  • For total control, you will have to write your own functions and apply them iteratively with by() or other functions of the apply() family.

Exercise

Source: Advanced R

  • How would you select a random sample of m rows from a data frame (write a general code and use the iris dataset to test)?

  • What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

Exercise

Run the code below:

library(ggplot2)
mySimpleSumy <- aggregate(diamonds, list(diamonds$color), summary)
  • What is this call doing? Take a look at the structure of the returned object. What is it?

  • Run the same aggregate call but change the value of the simplify argument. What happens?

Exercise

Source: Advanced R

  • How could you put the columns in a data frame in alphabetical order?
  • Delete the last one.
  • Write a general code to do that and use movies from the ggplot2 package to test it.

Exercise

Adapted and extended from an Introduction to R

  • Let's first take a close look at what is in the movies dataset from the ggplot2 package (in recent ggplot2 versions it lives in the separate ggplot2movies package). What can you tell about it?
library(ggplot2)
?movies
  • How can you tell whether there are variables that have NA values?

  • How many movies are there for which the budget is known (don't use subset())?

  • How many movies are there where the rating is less than or equal to 2.5 or greater than 7.5 (don’t use subset())?

  • How many movies were made in 1980 and have a rating above 5.0?

  • How many movies were classified both as “Action” and “Animation?”

  • Don't sleep! What are the “precious” movies with the maximum number of votes or the longest length? How many votes for the latter? You may use subset().

  • Append a new column named rating_zscore to movies by standardizing the Average IMDB user rating using the scale() function.

  • Extract only the last binary columns used for genre classification.

  • Create an aggregated version of it that will count the number of movies belonging to each unique combination of genre.

  • Can you tell how many unique combinations of genre are present in this df?

  • What are the three most abundant combinations of genre?

  • How would you extract all the rows in movies that fit in only one genre?

  • Which genre contains, on average, the most costly movies? Does this guarantee a high IMDB user rating?

Contingency (cross-classification) tables

  • Sometimes we need to check the number of observations having specific combinations of levels of categorical variables.

  • To calculate cross-classification tables, the table() and xtabs() functions are our friends. There are a few differences between them, the main one being that xtabs() takes a formula to specify the cross-tabulation.

color <- c("red", "green", "green", "red", "green", "green", "red")
shape <- c("round", "sharp", "sharp", "round", "round", "sharp", "round")
size <- c("tall", "tall", "small", "tall", "small", "small", "tall")
objects <- data.frame(color, shape, size)
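For comparison, table() computes the same kind of counts directly from the vectors, without a formula:

table(color, shape)
##        shape
## color   round sharp
##   green     1     3
##   red       3     0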

xclass <- xtabs(data = objects)
str(xclass) # an integer array with extra attributes, not a plain list
##  int [1:2, 1:2, 1:2] 1 0 2 0 0 3 1 0
##  - attr(*, "dimnames")=List of 3
##   ..$ color: chr [1:2] "green" "red"
##   ..$ shape: chr [1:2] "round" "sharp"
##   ..$ size : chr [1:2] "small" "tall"
##  - attr(*, "class")= chr [1:2] "xtabs" "table"
##  - attr(*, "call")= language xtabs(data = objects)

summary(xclass)
## Call: xtabs(data = objects)
## Number of cases in table: 7 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 9.674, df = 4, p-value = 0.0463
##  Chi-squared approximation may be incorrect

as.data.frame(xclass) # ALL combinations between levels are shown
##   color shape  size Freq
## 1 green round small    1
## 2   red round small    0
## 3 green sharp small    2
## 4   red sharp small    0
## 5 green round  tall    0
## 6   red round  tall    3
## 7 green sharp  tall    1
## 8   red sharp  tall    0

ftable(xclass) # Another way to flatten cross-classification tables with more than 2 dimensions
##             size small tall
## color shape                
## green round          1    0
##       sharp          2    1
## red   round          0    3
##       sharp          0    0

# to compute margin sums or other functions
addmargins(xtabs(formula = ~ color + shape, data = objects), FUN = sum)
## Margins computed over dimensions
## in the following order:
## 1: color
## 2: shape
##        shape
## color   round sharp sum
##   green     1     3   4
##   red       3     0   3
##   sum       4     3   7
# to compute proportions (margin = 1, row-wise; 2, column-wise; not specified, whole table)
prop.table(xtabs(formula = ~ color + shape, data = objects))
##        shape
## color       round     sharp
##   green 0.1428571 0.4285714
##   red   0.4285714 0.0000000

This is interesting, but we can do some more with reshape2.

Exercise

  • In the movies data set, convert rating to a factor with 5 levels. Use the pretty() function to generate the value of the breaks argument of cut().
  • Calculate a contingency table between these recoded ratings and the mpaa ratings, and add sums on both margins.
  • Is there any movie forbidden to people under 17 (NC-17) with a rating above 8?

Introduction to packages reshape2 and dplyr.

Tidy up and forget Excel’s PivotTables

Source: Data Wrangling Cheatsheet

  • R follows a set of conventions that makes one layout of tabular data much easier to work with than others.
    Your data will be easier to work with in R if it follows three rules:

    • Each variable in the data set is placed in its own column
    • Each observation/case/individual/experimental unit is placed in its own row
    • Each value is placed in its own cell
  • Oftentimes, the data you receive or gather is messy, i.e. the way the information is structured is not appropriate for efficient and straightforward analysis in R, and you need to tidy it up.

  • Occasionally, even though you have a tidy dataset, you will, as part of the analysis, want to represent the data differently and possibly aggregate it, as in the contingency tables we just saw.

melt() and dcast() with reshape2

In short, you need to reshape your data and swing seamlessly between long and wide table shapes.

The standard R distribution has facilities to do this type of transformations: stack() and unstack() from the utils package and reshape() from the stats package.

But, in my experience, they are difficult to use or not as flexible as the two simple but powerful functions provided by the reshape2 package.

reshape2 is based around two key functions: melt and cast:

  • melt() takes wide-format data frames and melts them into long-format data.
  • dcast() takes long-format data frames and casts them into wide-format data.

Let’s see how to do that with a slightly more explicit toy data set:

magicDrug <- data.frame(Name = rep(LETTERS[1:2], each = 2, times = 2),
                        When = rep(c("before", "after"), times = 4),
                        Speed = c(1, 2, 3, 5, 0, 1, 4, 5),
                        Weight = c(10, 3, 20, 6, 11, 2, 23, 9)
)

First melt:


library(reshape2)
mdata <- melt(data = magicDrug,
              id.vars = c("Name", "When"), # Specified but unecessary because they're factors
              measure.vars = c("Speed", "Weight"), # Specified but unecessary because they're numerics
              variable.name = "IndPerformances", # What does these variables have in common?
              na.rm = TRUE, # as a reminder for NAs
              value.name = "value") # the default is good
str(mdata)
## 'data.frame':    16 obs. of  4 variables:
##  $ Name           : Factor w/ 2 levels "A","B": 1 1 2 2 1 1 2 2 1 1 ...
##  $ When           : Factor w/ 2 levels "after","before": 2 1 2 1 2 1 2 1 2 1 ...
##  $ IndPerformances: Factor w/ 2 levels "Speed","Weight": 1 1 1 1 1 1 1 1 2 2 ...
##  $ value          : num  1 2 3 5 0 1 4 5 10 3 ...

Now, we can dcast() the melted data in various ways to get some insight:

  • MagicDrug helps lose weight and run faster!!
# again formula specification of the layout
# mean() for aggregation
dcast(data = mdata, formula = IndPerformances ~ When, fun.aggregate = mean)
##   IndPerformances after before
## 1           Speed  3.25      2
## 2          Weight  5.00     16
###  !!! Note that the fun.aggregate function should take a vector of numbers !!!
###  !!! and return a single summary statistic                                !!!
  • B runs faster than A and is heavier!!?
dcast(mdata, Name ~ IndPerformances, mean)
##   Name Speed Weight
## 1    A  1.00    6.5
## 2    B  4.25   14.5
  • This is a contingency table…
dcast(mdata, Name + When ~ IndPerformances, length)
##   Name   When Speed Weight
## 1    A  after     2      2
## 2    A before     2      2
## 3    B  after     2      2
## 4    B before     2      2

Exercise:

In some regards, the iris dataset seems a little untidy. Let’s try to tidy it up.

library(reshape2)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

  • First add a flowerID variable to enable keeping track of what was measured on what flower.
  • Melt the iris dataset with variable.name = “Object.Dimension” and value.name = “Length”, and sort it on these variables.

  • Split the variable Object.Dimension into two variables, object and dimension, and bind them to the melted iris data frame.
    Tip: merge() the melted iris data frame with a custom-made recoding data frame on the Object.Dimension column.

  • Cast the resulting data frame into a matrix with Species and object as rows and dimension as columns, containing mean values.
    Note that a preliminary melt operation MAY be necessary.

dplyr, a unifying framework for data frame manipulation and aggregation

  • Fundamental data-processing functions exist in R, but they are rather disparate and lack a consistent interface and the ability to flow together easily.

  • This makes it difficult to read, understand and write concise code.

  • The dplyr package attempts to overcome these limitations:
    • a structured vocabulary for tabular data manipulation
    • underlying manipulations are carried out in C++ -> speed is good
    • transparently use the same syntax to access/process data in a remote database (you barely need to know any SQL…)
    • great for interactive data analysis but it is more tricky to program with it because it uses NSE. See this vignette.

  • There are alternatives to dplyr (beside base distribution):
    • The data.table package has many functionalities in common with dplyr and is a serious competitor.
    • Unfortunately, I am not yet familiar with it, but apparently it is faster and can handle huge tables (billions of rows). Its syntax is, however, allegedly less smooth than dplyr's.
  • The dplyr package philosophy is that preparing data in the form of a data frame for specific analysis always involves applying a combination of elementary manipulations.

  • Each of these basic manipulations has been formalized with a function named with a verb:

Main dplyr function          Effect                                      Default R equivalent (pseudo code)
filter()                     keep rows matching criteria                 subset()
slice()                      pick rows using index(es)                   df[1:10, ]
distinct()                   extract distinct (unique) rows              unique()
sample_n()                   randomly sample rows                        df[sample(1:nrow(df), n), ]
arrange()                    reorder rows                                df[order(df$key), ]
select()                     pick columns by name                        subset() or df[ , c("ColB", "ColD")]
rename()                     rename specific columns                     colnames(df)[colnames(df) == "ColA"] <- "ColX"
mutate()                     add columns that are f() of existing ones   transform() or cbind(df, X = f(df$ColA))
xxx_join()                   combine data frames                         merge()
intersect(), etc.            set operations between data frames          (none)
group_by() %>% summarise()   reduce variables to values                  aggregate() and by()

  • verb() syntax:
    • First argument is a data frame
    • Subsequent arguments say what to do with data frame.
    • No use of $ to refer to df columns: like subset() and transform().
    • Never modify in place
    • Always return a data frame
  • The dplyr Cheat Sheet is warmly recommended.
  • We will review selected examples mainly from the dplyr tutorial to highlight neat features that are more tedious to emulate with base R functions.
  • Note that we will only scratch the surface of what can be done. Look at the vignettes, tutorial and cheatsheet to learn about other features.

# download the 'flights.csv' file from the dropbox page and save it to your current working dir
library(dplyr)
flights <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE))
# or 
# install.packages("hflights")
# library(hflights) ; flights <- hflights

# contains flights that departed from Houston in 2011
glimpse(flights)

  • Find all flights to SFO or OAK:
filter(flights, dest %in% c("SFO", "OAK"))
  • Where the arrival delay was more than twice the departure delay
filter(flights, arr_delay > 2 * dep_delay)
  • Select the two delay variables (columns):
select(flights, arr_delay:dep_delay)
  • Compute speed in mph from time (in minutes) and distance (in miles). Which flight flew the fastest?
flights <- mutate(flights, speed = dist / (time / 60))
arrange(flights, desc(speed))

The %>% (pipe) operator

  • The pipe operator %>% comes from the R package magrittr.
  • It is extremely convenient when chaining multiple manipulation steps, to avoid either deeply nested function calls or lines and lines of variable assignments:
    • Nested: f( g( h(x), z = 1), y = 1 )
    • Piped: h(x) %>% g(z = 1) %>% f(y = 1)
  • Chaining dplyr functions:

Select the two delay variables for flights where the arrival delay was more than twice the departure delay:

flights %>% filter(arr_delay > 2 * dep_delay) %>% select(arr_delay:dep_delay)

group_by() for operating on subsets of the table

  • To specify a grouping scheme, use group_by() on a table specifying grouping variables.
  • Operate on the grouped table to achieve various effects:
    • with filter: select rows meeting criteria by group
    • with summarise: apply an aggregation function to summarise each group
    • with mutate: calculate a summary value by group and append it to the table

Is there a destination where the mean arrival delay is uncommonly high?

flights %>%
    group_by(dest) %>%
    summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
    arrange(-avg_delay)

Calculate the mean time by dest and by carrier and divide time by this value to create a new column carrierRelativeTime.

flights %>%
    group_by(dest, carrier) %>% 
    mutate(meanTimeByDByC = mean(time, na.rm = TRUE),
           carrierRelativeTime = time / meanTimeByDByC) %>%
    select(dest:carrierRelativeTime)

Do you understand the different behaviour of summarise and mutate when used alongside group_by? Compare the nrow of the respective tables.

do() is a by()-like function for fine control on computed values

  • do(), another verb of the dplyr vocabulary, performs arbitrary computations, returning either a data frame or arbitrary objects that will be stored in a list.

  • Let's try to rerun the previous iris example that could not run with aggregate() but could with by(), although the output was a list.

# determining the correlation between dimensions of flower parts.
aggregate(iris[1:4], by = iris["Species"], FUN = cor, method = "spearman")
## Error in FUN(X[[i]], ...): supply both 'x' and 'y' or a matrix-like 'x'

# in dplyr jargon that translates to:
myCors <- iris %>%
          group_by(Species) %>%
          do(cors = cor(subset(., , Sepal.Length:Petal.Width)))

myCors
## Source: local data frame [3 x 2]
## Groups: <by row>
## 
##      Species       cors
## 1     setosa <dbl[4,4]>
## 2 versicolor <dbl[4,4]>
## 3  virginica <dbl[4,4]>
str(myCors[1, ])
## Classes 'tbl_df' and 'data.frame':   1 obs. of  2 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1
##  $ cors   :List of 1
##   ..$ : num [1:4, 1:4] 1 0.743 0.267 0.278 0.743 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. .. ..$ : chr  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

We can do better than that:

# Here, do() returns several results joint together in a data frame row
myCors <- iris %>%
          group_by(Species) %>%
          do(data.frame(mpl = mean(.$Petal.Length),
                        msl = mean(.$Sepal.Length),
                        cors = I(list(cor(subset(., , Sepal.Length:Petal.Width))))
                        )
             )
myCors
## Source: local data frame [3 x 4]
## Groups: Species
## 
##      Species   mpl   msl       cors
## 1     setosa 1.462 5.006 <dbl[4,4]>
## 2 versicolor 4.260 5.936 <dbl[4,4]>
## 3  virginica 5.552 6.588 <dbl[4,4]>
str(myCors[1, ])
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  1 obs. of  4 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1
##  $ mpl    : num 1.46
##  $ msl    : num 5.01
##  $ cors   :List of 1
##   ..$ : num [1:4, 1:4] 1 0.743 0.267 0.278 0.743 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. .. ..$ : chr  "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  - attr(*, "vars")=List of 1
##   ..$ : symbol Species
##  - attr(*, "drop")= logi TRUE
##  - attr(*, "indices")=List of 1
##   ..$ : int 0
##  - attr(*, "group_sizes")= int 1
##  - attr(*, "biggest_group_size")= int 1
##  - attr(*, "labels")='data.frame':   1 obs. of  1 variable:
##   ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1
##   ..- attr(*, "vars")=List of 1
##   .. ..$ : symbol Species

  • Take-home message: you are not limited in which function you apply to which variable, and the results are formatted in a data frame.
  • No need to write custom functions to do that but it is obviously possible to use custom functions inside verbs like summarise() and do().
  • Overall, dplyr helps efficiently solve most routine data frame manipulation tasks.
  • BUY IT!! (;o

Exercise

Find all flights that departed at midnight.

Calculate the min and max values of time and the standard deviation of speed for each destination (beware of NAs).

Try to guess (no code) the cause of the NA value of sdS for the flight(s) to Augusta.

Find the number of flights that departed for JFK on each day of the year (tip: use n()).

Pick the three most delayed flights at departure for each carrier (tip: slice() on carrier groups)

Create two subsamples of flights by randomly sampling 30% of the flights that went to SFO or LAX. Figure out how many flights they have in common and how many are specific to each of them.