Pretty long section. Will cover:
- The basics of indexing and subsetting for arrays, lists and data frames.
- More intensive data frame manipulation using R's default tools.
- Introduction to packages reshape2 and dplyr.
Sept. 7-11, 2015
# a matrix m
m <- matrix(data = 1:20, nrow = 4, byrow = FALSE)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
m[c(2, 4), ] # returns a matrix: two rows, all columns
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    6   10   14   18
## [2,]    4    8   12   16   20
m[ , c(2, 4)] # a matrix: two columns, all rows
##      [,1] [,2]
## [1,]    5   13
## [2,]    6   14
## [3,]    7   15
## [4,]    8   16
m[2:6] # vector-like indexing by column for matrices
## [1] 2 3 4 5 6
idx <- cbind(rows = 1:4, columns = 1:4) # bind two numeric vectors as columns of an idx matrix
m[idx] # indexing returns the vector of values in the diagonal
## [1]  1  6 11 16
# Use a logical vector to find rows that have a value > 5 in the second column
# Extract columns two to four for these rows
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
m[m[, 2] > 5, 2:4]
##      [,1] [,2] [,3]
## [1,]    6   10   14
## [2,]    7   11   15
## [3,]    8   12   16
We will do a lot of this for data frames
row() and col() return a matrix of integers containing, respectively, the row or column number of each element of a matrix-like object.

row(m)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    2    2    2    2    2
## [3,]    3    3    3    3    3
## [4,]    4    4    4    4    4
col(m)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    1    2    3    4    5
## [3,]    1    2    3    4    5
## [4,]    1    2    3    4    5
This is handy to extract specific regions of the matrix using a logical matrix of the same shape:
row(m) == col(m) # the diagonal
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE FALSE FALSE
## [3,] FALSE FALSE  TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE  TRUE FALSE
m[row(m) < col(m)] # extract the upper triangle by column, without the diagonal
##  [1]  5  9 10 13 14 15 17 18 19 20
Actually, not surprisingly, R has dedicated functions for that: lower.tri(), upper.tri() and diag() (for extracting and setting).
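A quick sketch of these helpers, using the same m as above (the specific values follow from that definition):

```r
m <- matrix(data = 1:20, nrow = 4, byrow = FALSE)
diag(m)                       # extract the diagonal: 1 6 11 16
m[upper.tri(m)]               # upper triangle, excluding the diagonal
m[lower.tri(m, diag = TRUE)]  # lower triangle, including the diagonal
diag(m) <- 0                  # diag() also works on the left-hand side of an assignment
```

Note that upper.tri(m) is simply a convenience equivalent of the row(m) < col(m) logical matrix above.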
which() also works for arrays and can return a numeric indexing matrix. Let's illustrate a use case where the goal is to find the locations of values of interest in a matrix:

# One wants to find values 2, 5 and 10 in `m`:
logiV <- m %in% c(2, 5, 10)
logiV # a logical vector by column
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
logiM <- matrix(data = logiV, nrow = 4, byrow = FALSE) # coerce into a matrix
logiM
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE  TRUE FALSE FALSE FALSE
## [2,]  TRUE FALSE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
m[logiM] # we did find them; apparently the values have no duplicates
## [1]  2  5 10
# To find the coordinates of the values of interest in the matrix:
idx <- which(logiM, arr.ind = TRUE)
val <- m[idx]
cbind(idx, val = val) # "column" bind the indexing matrix with the vector of values of interest
##      row col val
## [1,]   2   1   2
## [2,]   1   2   5
## [3,]   2   3  10
t(): transposes a matrix
colSums(): sum of the column values
rowSums(): sum of the row values
colMeans(): mean of the column values
rowMeans(): mean of the row values
%*%: matrix multiplication
…
t(m)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20
m + m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2   10   18   26   34
## [2,]    4   12   20   28   36
## [3,]    6   14   22   30   38
## [4,]    8   16   24   32   40
m - m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
Take the m matrix. How would you change all the values of the diagonal to 1?
m <- matrix(data = 1:20, nrow = 4, byrow = FALSE)
diag(m) <- 1
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    1   10   14   18
## [3,]    3    7    1   15   19
## [4,]    4    8   12    1   20
In biochemistry, people often use 96-well microplates, which are in essence a matrix of wells with 12 columns and 8 rows:
Here is a matrix containing the measures output by an instrument:
measures <- matrix(data = round(runif(n = 96, max = 10), digits = 1), nrow = 8, ncol = 12)
In order to do computations on the values you would need to reshape it in a long format, i.e. in the form: row#, column#, value. Can you do that?
dummyLogiMat <- matrix(TRUE, nrow = 8, ncol = 12)
head(cbind(which(dummyLogiMat, arr.ind = TRUE), measure = as.vector(measures)))
##      row col measure
## [1,]   1   1     1.2
## [2,]   2   1     1.6
## [3,]   3   1     3.7
## [4,]   4   1     6.8
## [5,]   5   1     1.1
## [6,]   6   1     2.4
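An alternative sketch that skips the dummy logical matrix by reusing the row()/col() helpers seen earlier (values come from runif(), so your numbers will differ):

```r
measures <- matrix(data = round(runif(n = 96, max = 10), digits = 1), nrow = 8, ncol = 12)
long <- cbind(row     = as.vector(row(measures)),
              col     = as.vector(col(measures)),
              measure = as.vector(measures))
head(long)  # one well per line: row#, column#, value
```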
List indexing can be achieved with one of three indexing operators:
[: selects sub-lists. It always returns a list. Used with a single positive integer, it returns a list of length one.
[[: returns the actual element within a list at the specified single index.
$: a shorthand for name indexing, where x$y is equivalent to x[["y", exact = FALSE]]
fileCabinet <- list(drawer1 = c(length = 1, width = 3), drawer2 = FALSE, top = c(item1 = "flowers", item2 ="sharpener")) str(fileCabinet) ## List of 3 ## $ drawer1: Named num [1:2] 1 3 ## ..- attr(*, "names")= chr [1:2] "length" "width" ## $ drawer2: logi FALSE ## $ top : Named chr [1:2] "flowers" "sharpener" ## ..- attr(*, "names")= chr [1:2] "item1" "item2"
The [ operator. These expressions all return the same list:
fileCabinet[1:2]
fileCabinet[c(T, T, F)]
fileCabinet[c("drawer1", "drawer2")] # Partial name matching is not possible
The [[ and $ operators. These expressions all return the same element:
fileCabinet[[3]]
fileCabinet[["top"]]
fileCabinet[["t", exact = NA]] # the exact argument controls partial matching; with NA a warning is returned
fileCabinet$top
fileCabinet$t
identical(fileCabinet[1], fileCabinet[[1]]) ## [1] FALSE
Take a couple of minutes to run these expressions and understand why they return the same thing.
str(fileCabinet) ## List of 3 ## $ drawer1: Named num [1:2] 1 3 ## ..- attr(*, "names")= chr [1:2] "length" "width" ## $ drawer2: logi FALSE ## $ top : Named chr [1:2] "flowers" "sharpener" ## ..- attr(*, "names")= chr [1:2] "item1" "item2" fileCabinet[[c(1,2)]] ## [1] 3 fileCabinet[[c(3,2)]] ## [1] "sharpener" fileCabinet[[c("top", "item2")]] ## [1] "sharpener" fileCabinet$top[["item2"]] ## [1] "sharpener" fileCabinet$top["item2"] ## item2 ## "sharpener" # Combinations of subsetting operators are possible but not recommended because the code is hard to read: fileCabinet[[3]][[2]] ## [1] "sharpener" fileCabinet[[3]][2] ## item2 ## "sharpener"
Do you see a difference in the results of the last four expressions? This illustrates a case of simplifying the result of an indexing operation. We will come back to this shortly.
Lists are modified with the <- assignment operator. For example, to delete elements in a list, assign NULL:
lst <- list(a = c(TRUE, FALSE), b = NULL, c = list(list()))
lst[c("a", "c")] <- NULL
How would you replace the last two elements of fileCabinet with the first element of fileCabinet and a sequence of numbers?
fileCabinet <- list(drawer1 = c(length = 1, width = 3), drawer2 = FALSE, top = c(item1 = "flowers", item2 = "sharpener"))
# Does not work as expected, why?
fileCabinet[2:3] <- c(fileCabinet[1], coins = 1:5)
## Warning in fileCabinet[2:3] <- c(fileCabinet[1], coins = 1:5): number of
## items to replace is not a multiple of replacement length
# Works
fileCabinet[2:3] <- c(fileCabinet[1], list(coins = 1:5))
How would you change the value of width in drawer1 to 2?
fileCabinet[[1:2]] <- 2
# OR
fileCabinet$drawer1[2] <- c(WIDTH = 2)
Data frames are hybrid objects at the intersection between a list and a matrix. Not unexpectedly, both matrix and list indexing techniques apply to data frames.
df <- data.frame(A = letters[1:4], `B-2` = rep(c(TRUE, FALSE), each = 2), `C with space` = seq(10, 20, length.out = 4), D = c(runif(3), NA) ) df ## A B.2 C.with.space D ## 1 a TRUE 10.00000 0.5950953 ## 2 b TRUE 13.33333 0.7316014 ## 3 c FALSE 16.66667 0.1582776 ## 4 d FALSE 20.00000 NA
See what happens with the weird column names?
On a general note, if you want to make your life easier, just avoid spaces, +, -, *, and other special characters in column/row names, even in tabulated files that you intend to import into R.
df[c(1,4)] # To select one or several columns. Returns a data frame
##   A         D
## 1 a 0.5950953
## 2 b 0.7316014
## 3 c 0.1582776
## 4 d        NA
df[3]
##   C.with.space
## 1     10.00000
## 2     13.33333
## 3     16.66667
## 4     20.00000
df[[3]] # To select the CONTENT of a single column, using the list double brackets
## [1] 10.00000 13.33333 16.66667 20.00000
df$C # partial name matching works with $ on data frames
## [1] 10.00000 13.33333 16.66667 20.00000
# And so on...
df[,c(1,4)] # extracts columns ## A D ## 1 a 0.5950953 ## 2 b 0.7316014 ## 3 c 0.1582776 ## 4 d NA df[, c("A", "D")] ## A D ## 1 a 0.5950953 ## 2 b 0.7316014 ## 3 c 0.1582776 ## 4 d NA df[2:3, ] ## A B.2 C.with.space D ## 2 b TRUE 13.33333 0.7316014 ## 3 c FALSE 16.66667 0.1582776
df[df$B.2, ] # a df ## A B.2 C.with.space D ## 1 a TRUE 10.00000 0.5950953 ## 2 b TRUE 13.33333 0.7316014 df[, colnames(df) %in% "D"] # column turned to a vector !!! ## [1] 0.5950953 0.7316014 0.1582776 NA df[, colnames(df) %in% "D", drop = FALSE] # column remains a df !!! ## D ## 1 0.5950953 ## 2 0.7316014 ## 3 0.1582776 ## 4 NA df[df$C.with.space < 20 & df$D > 0.4, ] # use comparison and logical operators to subset ## A B.2 C.with.space D ## 1 a TRUE 10.00000 0.5950953 ## 2 b TRUE 13.33333 0.7316014
Using $ to refer to columns can become tedious to type, so in interactive mode one can use with():
with(data = df, df[C.with.space < 20 & D > 0.4, ]) ## A B.2 C.with.space D ## 1 a TRUE 10.00000 0.5950953 ## 2 b TRUE 13.33333 0.7316014
# This works as expected DF <- data.frame(a = 1:3) DF$b <- list(1:1, 1:2, 1:3) str(DF) ## 'data.frame': 3 obs. of 2 variables: ## $ a: int 1 2 3 ## $ b:List of 3 ## ..$ : int 1 ## ..$ : int 1 2 ## ..$ : int 1 2 3
# This alternative produces an error
DF <- data.frame(a = 1:3, b = list(1:1, 1:2, 1:3))
## Error in data.frame(1L, 1:2, 1:3, check.names = FALSE, stringsAsFactors = TRUE): arguments imply differing number of rows: 1, 2, 3
When a list is passed to data.frame(), it is as if each component or column had been passed as a separate argument. Use I() to protect your object and request that it be treated "as is":

DF <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))
Remember from previous slides?
class(df[[3]]) == class(df[3]) ## [1] FALSE
Unlike [ with drop = FALSE, [[ and [ with the default drop = TRUE will simplify their return value.
Simplifying subsets returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want.
Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type.
Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors. (It will work for your test cases, but then someone will pass in a single column data frame and it will fail in an unexpected and unclear way.)
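A minimal sketch of the contrast, using a small matrix (the same behavior applies to data frames):

```r
m <- matrix(1:20, nrow = 4)
m[, 2]                     # simplified: a plain integer vector
m[, 2, drop = FALSE]       # preserved: a 4 x 1 matrix
dim(m[, 2])                # NULL: the dimensions were dropped
dim(m[, 2, drop = FALSE])  # 4 1: the matrix structure is kept
```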
Italics are direct copies from the source at http://adv-r.had.co.nz/Subsetting.html. If you want a thorough overview of this aspect, follow the link.
Take df and keep all columns but "A" and "D". Do that in at least two different ways (think <- list(NULL)).
df[, !colnames(df) %in% c("A", "D")] # Option 1 ## B.2 C.with.space ## 1 TRUE 10.00000 ## 2 TRUE 13.33333 ## 3 FALSE 16.66667 ## 4 FALSE 20.00000 df[, -match(c("A", "D"), colnames(df))] # Option 2 ## B.2 C.with.space ## 1 TRUE 10.00000 ## 2 TRUE 13.33333 ## 3 FALSE 16.66667 ## 4 FALSE 20.00000 df[c("A", "D")] <- list(NULL) # Option 3 df ## B.2 C.with.space ## 1 TRUE 10.00000 ## 2 TRUE 13.33333 ## 3 FALSE 16.66667 ## 4 FALSE 20.00000
For some reason…
df[c("A", "D")] <- NULL
…does not work for several columns of a df but it does for several elements of a list.
It does work to remove a single column:
df["B.2"] <- NULL df ## C.with.space ## 1 10.00000 ## 2 13.33333 ## 3 16.66667 ## 4 20.00000
The function installed.packages() returns a matrix with info on the R packages installed on your system.
Check whether the following packages are installed: c("ape", "reshape2", "dplyr", "lattice", "ggplot2", "VennDiagram", "Biostrings").

neededPacks <- c("ape", "reshape2", "dplyr", "lattice", "ggplot2", "VennDiagram", "Biostrings")
notInstalled <- neededPacks[!neededPacks %in% rownames(installed.packages())]
notInstalled
## character(0)
installed.packages()[neededPacks, c("Package", "LibPath", "Version")]
##             Package       LibPath
## ape         "ape"         "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
## reshape2    "reshape2"    "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
## dplyr       "dplyr"       "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
## lattice     "lattice"     "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
## ggplot2     "ggplot2"     "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
## VennDiagram "VennDiagram" "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
## Biostrings  "Biostrings"  "/home/cunnac/R/x86_64-pc-linux-gnu-library/3.2"
##             Version
## ape         "3.3"
## reshape2    "1.4.1"
## dplyr       "0.4.2"
## lattice     "0.20-33"
## ggplot2     "1.0.1"
## VennDiagram "1.6.9"
## Biostrings  "2.36.4"
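A common follow-up, sketched here with a hypothetical neededPacks vector: compute the missing packages, then install them.

```r
neededPacks <- c("reshape2", "dplyr")  # hypothetical list of required packages
notInstalled <- neededPacks[!neededPacks %in% rownames(installed.packages())]
notInstalled  # character(0) when everything is already there
# install.packages(notInstalled)  # uncomment to fetch the missing ones
```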
Re-assign df to its original value and calculate the mean of the content of column "D".
df <- data.frame(A = letters[1:4], `B-2` = rep(c(TRUE, FALSE), each = 2), `C with space` = seq(10, 20, length.out = 4), D = c(runif(3), NA) ) mean(df$D, na.rm = FALSE) ## [1] NA
Damned NAs…
subset()
Let's take our toy data frame:
df <- data.frame(A = letters[1:4], B = rep(c(TRUE, FALSE), each = 2), C = seq(10, 20, length.out = 4), D = c(runif(3), NA) )
We have already encountered multiple ways to extract rows and columns from a data frame.
The subset(x, subset, select, drop = FALSE, ...) function can save you substantial typing in interactive sessions and makes the code easier to understand. The select= argument is also pretty convenient to keep only the columns of interest:

subset(x = df, subset = C < 20 & D > 0.4)
subset(x = df, subset = C < 20 & D > 0.4, select = c("A", "B", "C"))
subset(x = df, subset = C < 20 & D > 0.4, select = !colnames(df) %in% c("A", "D"))
subset(x = df, subset = C < 20 & D > 0.4, select = A:C) # "sequence" of column names
subset(x = df, subset = C < 20 & D > 0.4, select = -(C:D)) # "negated" column names
While subset= must have a logical value, select= accepts more types.
Note that in order to let us refer to data frame columns without explicit use of the data frame name (like in df$ColA), subset(), like with() and other functions (especially in dplyr and reshape2), uses Non-Standard Evaluation.
It is a fairly advanced topic, but you should remember that, because of this use of NSE, it is not recommended to call these functions inside your own function definitions: it may just not work, or cause unexpected behavior.
For a discussion on NSE, see Advanced R
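A minimal sketch of the kind of surprise NSE can cause inside a function (the data frame and column names are made up):

```r
dat <- data.frame(x = 1:5)
f <- function(d, threshold) subset(d, x > threshold)
f(dat, 3)  # behaves as expected: the two rows where x > 3

# But if the data frame happens to contain a column named `threshold`,
# subset() resolves the name in the data frame first, not in the function:
dat$threshold <- 10
f(dat, 3)  # zero rows: x > 10 was evaluated instead of x > 3
```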
To change these attributes, there are two dedicated functions that work both for getting and modifying names: rownames() and colnames().
colnames(df) ## [1] "A" "B" "C" "D" colnames(df) <- month.abb[1:ncol(df)] rownames(df) <- sample(state.name, size = nrow(df)) df ## Jan Feb Mar Apr ## Ohio a TRUE 10.00000 0.4732034 ## West Virginia b TRUE 13.33333 0.5277445 ## Nebraska c FALSE 16.66667 0.0682988 ## Kentucky d FALSE 20.00000 NA
In a data frame, values for:
- rownames() should be a character vector of non-duplicated and non-missing names.
- colnames() should be a character vector of (preferably) unique, syntactically valid names.

We have seen before that one should always be wary of the presence of NAs in the data under analysis.
One informal way to look for them is to scrutinize the info displayed by summary(); summary(df) indicates the number of NAs for each variable.
To look for and exclude them programmatically, test for their presence with is.na() and use logical indexing.
na.omit() and related functions are useful for conveniently dealing with NAs in data frames.
Remember also that many (summary) functions have an argument like na.rm= that controls how to handle them.
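A minimal illustration of this convention:

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA: by default, any NA propagates to the result
mean(x, na.rm = TRUE)  # 2.333333: NAs are dropped before computing
sum(is.na(x))          # 1: a quick way to count the NAs
```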
Let's reset our df to factory values:
summary(df) ## A B C D ## a:1 Mode :logical Min. :10.0 Min. :0.2518 ## b:1 FALSE:2 1st Qu.:12.5 1st Qu.:0.3868 ## c:1 TRUE :2 Median :15.0 Median :0.5219 ## d:1 NA's :0 Mean :15.0 Mean :0.5146 ## 3rd Qu.:17.5 3rd Qu.:0.6461 ## Max. :20.0 Max. :0.7702 ## NA's :1 df[!is.na(df$D), ] ## A B C D ## 1 a TRUE 10.00000 0.7702153 ## 2 b TRUE 13.33333 0.5219329 ## 3 c FALSE 16.66667 0.2517598 na.omit(df) ## A B C D ## 1 a TRUE 10.00000 0.7702153 ## 2 b TRUE 13.33333 0.5219329 ## 3 c FALSE 16.66667 0.2517598
Adding / Removing rows
Use rbind() and re-assign (column names must match), or use subsetting and assignment for in-place modification.

df <- df[!is.na(df$D), ] # delete
rbind(df, df[1, ]) # bind
##   A     B        C         D
## 1 a  TRUE 10.00000 0.7702153
## 2 b  TRUE 13.33333 0.5219329
## 3 c FALSE 16.66667 0.2517598
## 4 a  TRUE 10.00000 0.7702153
df[nrow(df) + 1, ] <- df[1, ] # bind with another method
df
##   A     B        C         D
## 1 a  TRUE 10.00000 0.7702153
## 2 b  TRUE 13.33333 0.5219329
## 3 c FALSE 16.66667 0.2517598
## 4 a  TRUE 10.00000 0.7702153
Adding / Removing columns
Assign list(NULL) (or NULL for a single column), or return only the columns you want and assign the result to your df.
Use cbind() and re-assign (if binding a character vector, it will be converted to factor unless stringsAsFactors = FALSE).

subset(df, select = A:B) # could be re-assigned to df to delete col C and D
##   A     B
## 1 a  TRUE
## 2 b  TRUE
## 3 c FALSE
## 4 a  TRUE
df <- cbind(df, E = state.name[1:nrow(df)], F = rnorm(n = nrow(df))) # does convert to factor
df[ , "G"] <- state.name[1:nrow(df)] # does NOT
str(df)
## 'data.frame': 4 obs. of 7 variables:
##  $ A: Factor w/ 4 levels "a","b","c","d": 1 2 3 1
##  $ B: logi TRUE TRUE FALSE TRUE
##  $ C: num 10 13.3 16.7 10
##  $ D: num 0.77 0.522 0.252 0.77
##  $ E: Factor w/ 4 levels "Alabama","Alaska",..: 1 2 3 4
##  $ F: num -0.9 0.142 1.233 -2.911
##  $ G: chr "Alabama" "Alaska" "Arizona" "Arkansas"
df["G"] <- NULL
transform() also saves you from referring to the df name in expressions and re-assigning.

df <- cbind(df, G = sqrt(df$D) * (df$C + pi)) # calculate values with vectors of equal length
df <- transform(df, H = ceiling(G - mean(C))) # mean(C) is recycled
df
##   A     B        C         D        E          F         G  H
## 1 a  TRUE 10.00000 0.7702153  Alabama -0.9004528 11.533313  0
## 2 b  TRUE 13.33333 0.5219329   Alaska  0.1417637 11.902298  0
## 3 c FALSE 16.66667 0.2517598  Arizona  1.2329677  9.938927 -2
## 4 a  TRUE 10.00000 0.7702153 Arkansas -2.9107245 11.533313  0
order() returns a permutation of the indices of the elements of x so that the corresponding elements are ordered. order() can work with a combination of sorting keys provided as vectors of the same length, so one can sort a data frame based on several columns:

df
##   A     B        C         D        E          F         G  H
## 1 a  TRUE 10.00000 0.7702153  Alabama -0.9004528 11.533313  0
## 2 b  TRUE 13.33333 0.5219329   Alaska  0.1417637 11.902298  0
## 3 c FALSE 16.66667 0.2517598  Arizona  1.2329677  9.938927 -2
## 4 a  TRUE 10.00000 0.7702153 Arkansas -2.9107245 11.533313  0
df[order(df$B, df$H, -df$F), ]
##   A     B        C         D        E          F         G  H
## 3 c FALSE 16.66667 0.2517598  Arizona  1.2329677  9.938927 -2
## 2 b  TRUE 13.33333 0.5219329   Alaska  0.1417637 11.902298  0
## 1 a  TRUE 10.00000 0.7702153  Alabama -0.9004528 11.533313  0
## 4 a  TRUE 10.00000 0.7702153 Arkansas -2.9107245 11.533313  0
Note that specifying the increasing/decreasing behavior for individual columns is achieved by negating the column values (see also the Stack Overflow discussion).
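The unary minus trick only works for numeric keys. For character or factor columns, one workaround (a sketch with a made-up data frame) is xtfrm(), which converts any sortable key into a numeric rank that can then be negated:

```r
df2 <- data.frame(g = c("b", "a", "b"), v = c(1, 3, 2))
df2[order(-xtfrm(df2$g), df2$v), ]  # g descending, then v ascending
```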
unique() returns only the unique elements.
duplicated() returns a logical vector where TRUE indicates that the corresponding element is a duplicate.

df <- subset(df, , A:D)
unique(df)
##   A     B        C         D
## 1 a  TRUE 10.00000 0.7702153
## 2 b  TRUE 13.33333 0.5219329
## 3 c FALSE 16.66667 0.2517598
duplicated(df)
## [1] FALSE FALSE FALSE  TRUE
df[duplicated(df), ]
##   A    B  C         D
## 4 a TRUE 10 0.7702153
df[!duplicated(df), ] # note: logical negation, not the arithmetic minus
##   A     B        C         D
## 1 a  TRUE 10.00000 0.7702153
## 2 b  TRUE 13.33333 0.5219329
## 3 c FALSE 16.66667 0.2517598
Merging data frames is also a recurring need and a powerful technique in data analysis. It involves combining the information of two data frames based on common variable(s) or key(s). This is reminiscent of running SQL queries across tables in relational databases.
In base R, merge() lets you combine data frames:
# The grocery store example...
products
##   ProdID  category price
## 1      A vegetable     4
## 2      B      meat     3
## 3      C       can     2
## 4      D  iceCream     1
sales
##   CustomID ProdID       Date
## 1        a      C 2015-09-01
## 2        a      A 2015-09-01
## 3        b      B 2015-09-02
## 4        b      C 2015-09-03
## 5        b      E 2015-09-03
How can we generate some kind of bill?
Combine sales with info from products and sort by CustomID:

bill <- merge(x = sales, y = products, by = "ProdID", all.x = TRUE, sort = FALSE)
bill[order(bill$CustomID, bill$Date, bill$ProdID), ]
##   ProdID CustomID       Date  category price
## 3      A        a 2015-09-01 vegetable     4
## 1      C        a 2015-09-01       can     2
## 4      B        b 2015-09-02      meat     3
## 2      C        b 2015-09-03       can     2
## 5      E        b 2015-09-03      <NA>    NA
# There is no info about product E; what did the cashier do!!?
A full (outer) merge keeps all records from both data frames, matched on ProdID:

merge(x = sales, y = products, by = "ProdID", all = TRUE, sort = FALSE)
##   ProdID CustomID       Date  category price
## 1      C        a 2015-09-01       can     2
## 2      C        b 2015-09-03       can     2
## 3      A        a 2015-09-01 vegetable     4
## 4      B        b 2015-09-02      meat     3
## 5      E        b 2015-09-03      <NA>    NA
## 6      D     <NA>       <NA>  iceCream     1
# Product D did not sell well these last days.
The full signature of merge():

merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), incomparables = NULL, ...)

By default, by is the set of column names common to x and y; by.x and by.y let you merge on key columns that are named differently in the two data frames.
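A short sketch of by.x/by.y with hypothetical data frames whose key columns are named differently:

```r
orders  <- data.frame(prod   = c("A", "C"), qty = c(2, 1))
catalog <- data.frame(ProdID = c("A", "B", "C"), price = c(4, 3, 2))
merge(orders, catalog, by.x = "prod", by.y = "ProdID")  # inner join on the two keys
```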
This question relates to the issue of data aggregation, which notably involves calculating summary statistics for specific groups of records defined by combinations of the levels of (categorical) variables (factors).
Probably the most user-friendly base R function for these tasks is aggregate().
cleanBill <- na.omit(bill) aggregate(x = cleanBill$price, by = cleanBill[c("CustomID", "Date")], FUN = sum) ## CustomID Date x ## 1 a 2015-09-01 6 ## 2 b 2015-09-02 3 ## 3 b 2015-09-03 2 # That answers our first question
aggregate() comes in two flavors:
aggregate(x, by, FUN, ..., simplify = TRUE)
- x = is your data
- by = is a list of grouping factors
- FUN = the function used for aggregation
- ... = optional arguments to FUN
- simplify = whether results should be simplified to a vector or matrix/data frame if possible

aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
- formula = a formula, e.g. y ~ x or cbind(y1, y2) ~ x1 + x2, meaning summarise y by x as a grouping factor. We will learn more about formulas when discussing plotting and statistical modelling.
- data = the data frame where x and y can be found

If time allows, you can run the examples in the documentation for aggregate()
to get a better sense of what you can do with the formula notation.
Let's look at other examples using the famous iris dataset:
First we want to calculate the means of the various measures conditioned on species:
aggregate(. ~ Species, data = iris, FUN = mean) ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 # The dot in formula means "any columns from data that are otherwise not used"
We are interested in determining the correlation between the dimensions of sepals and petals conditioned on the plant species. For this we can use cor().
aggregate(iris[1:4], by = iris["Species"], FUN = cor, method = "spearman") ## Error in FUN(X[[i]], ...): supply both 'x' and 'y' or a matrix-like 'x' # cor() complains...
This does not work because aggregate() first separates the columns and then, for each column, applies FUN to each subset of values (vectors) defined by by. The problem is that here cor() takes a 2-D object…
Let's try by(), which splits a data frame into sub-data frames based on the values of one or more factors and applies the function FUN to each of them in turn.
irisCorel <- by(iris[1:4], INDICES = iris["Species"], FUN = cor, method = "spearman")
Note that additional arguments to FUN are passed after the FUN= in the call.
str(irisCorel) ## List of 3 ## $ setosa : num [1:4, 1:4] 1 0.755 0.279 0.299 0.755 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## $ versicolor: num [1:4, 1:4] 1 0.518 0.737 0.549 0.518 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## $ virginica : num [1:4, 1:4] 1 0.427 0.824 0.316 0.427 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## - attr(*, "dim")= int 3 ## - attr(*, "dimnames")=List of 1 ## ..$ Species: chr [1:3] "setosa" "versicolor" "virginica" ## - attr(*, "call")= language by.data.frame(data = iris[1:4], INDICES = iris["Species"], FUN = cor, method = "spearman") ## - attr(*, "class")= chr "by"
To sum up:
- aggregate() is OK but, for example, there is no way to provide names for the results of summarization in the output other than manually changing them afterwards.
- by() is OK, but we may need to further process its output to format it in a table-like fashion.
- aggregate() and by() pass the same function to all subsets. What if we wanted to calculate different summaries depending on the variable(s)?

The dplyr package will help you do some of that fairly easily. For total control, you will have to write your own functions and apply them iteratively with by() or other functions of the apply() family.
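As a sketch of that "total control" route, here is a hand-rolled per-group summary with by(), computing a different statistic for each variable (the column choices are illustrative):

```r
perGroup <- by(iris, iris$Species, function(d) {
  data.frame(Species    = unique(d$Species),
             meanSepalL = mean(d$Sepal.Length),  # mean for one variable...
             maxPetalL  = max(d$Petal.Length))   # ...max for another
})
do.call(rbind, perGroup)  # stack the per-group rows into one data frame
```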
How would you randomly extract m rows from a data frame (write general code and use the iris dataset to test)?

mydf <- iris
m <- 3
mydf[sample(1:nrow(mydf), size = m, replace = FALSE), ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 129          6.4         2.8          5.6         2.1  virginica
## 28           5.2         3.5          1.5         0.2     setosa
## 61           5.0         2.0          3.5         1.0 versicolor
rows <- sort(sample(1:nrow(mydf), size = 2, replace = FALSE)) # Is the sort() necessary?
head(mydf[rows[1]:rows[2], ])
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 47          5.1         3.8          1.6         0.2     setosa
## 48          4.6         3.2          1.4         0.2     setosa
## 49          5.3         3.7          1.5         0.2     setosa
## 50          5.0         3.3          1.4         0.2     setosa
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
Run the code below:
library(ggplot2)
mySimpleSumy <- aggregate(diamonds, list(diamonds$color), summary)
str(mySimpleSumy)
Now change the simplify argument. What happens?

mySumy <- aggregate(diamonds, list(diamonds$color), summary, simplify = FALSE)
str(mySumy)
How would you order the columns of a data frame alphabetically by name? Use movies from the ggplot2 package to test it.

library(ggplot2)
myDf <- movies
orderedColIdx <- order(colnames(myDf))
myNewDf <- myDf[orderedColIdx]
head(myNewDf, n = 2)
##   Action Animation budget Comedy Documentary Drama length mpaa  r1  r10
## 1      0         0     NA      1           0     1    121      4.5  4.5
## 2      0         0     NA      1           0     0     71      0.0 14.5
##     r2  r3   r4   r5   r6   r7   r8  r9 rating Romance Short
## 1  4.5 4.5  4.5 14.5 24.5 24.5 14.5 4.5    6.4       0     0
## 2 14.5 4.5 24.5 14.5 14.5 14.5  4.5 4.5    6.0       0     0
##                title votes year
## 1                  $   348 1971
## 2  $1000 a Touchdown    20 1939
myNewDf[ncol(myNewDf)] <- list(NULL)
head(myNewDf, n = 2)
##   Action Animation budget Comedy Documentary Drama length mpaa  r1  r10
## 1      0         0     NA      1           0     1    121      4.5  4.5
## 2      0         0     NA      1           0     0     71      0.0 14.5
##     r2  r3   r4   r5   r6   r7   r8  r9 rating Romance Short
## 1  4.5 4.5  4.5 14.5 24.5 24.5 14.5 4.5    6.4       0     0
## 2 14.5 4.5 24.5 14.5 14.5 14.5  4.5 4.5    6.0       0     0
##                title votes
## 1                  $   348
## 2  $1000 a Touchdown    20
Adapted and extended from an Introduction to R.

Look at the documentation of the movies dataset from the ggplot2 package. What can you tell about it?

library(ggplot2)
?movies
# Visual inspection # summary(movies) # Programmatically colnames(movies)[sapply(movies, function(x) sum(is.na(x))) > 0] ## we'll learn sapply() later ## [1] "budget"
How many movies have budget information (you could also use subset())?

sum(!is.na(movies$budget))
## [1] 5215
How many movies have a rating of at most 2.5 or above 7.5 (you could also use subset())?

sum(movies$rating <= 2.5 | movies$rating > 7.5)
## [1] 9533
How many movies released in 1980 have a rating above 5?

sum(movies$year == 1980 & movies$rating > 5)
## [1] 440
How many movies are classified as both Action and Animation?

sum(movies$Action & movies$Animation)
## [1] 84
Find the movies with the most votes or the longest running time, using subset().

subset(movies, votes == max(votes) | length == max(length), title:votes)
##                                                    title year length
## 11937                             Cure for Insomnia, The 1987   5220
## 30658 Lord of the Rings: The Fellowship of the Ring, The 2001    208
##         budget rating  votes
## 11937       NA    3.8     59
## 30658 93000000    8.8 157608
Add a column rating_zscore to movies by standardizing the average IMDB user rating using the scale() function.

movies <- transform(movies, rating_zscore = scale(rating)) # For interactive use
# OR
movies <- cbind(movies, rating_zscore = scale(movies$rating)) # When programming
head(movies, 1)
##   title year length budget rating votes  r1  r2  r3  r4   r5   r6   r7
## 1     $ 1971    121     NA    6.4   348 4.5 4.5 4.5 4.5 14.5 24.5 24.5
##     r8  r9 r10 mpaa Action Animation Comedy Drama Documentary Romance
## 1 14.5 4.5 4.5           0         0      1     1           0       0
##   Short rating_zscore rating_zscore
## 1     0     0.3007988     0.3007988
# Note: rating_zscore appears twice because both alternatives were run in sequence.
moviesGenre <- subset(movies, select = Action:Short) str(moviesGenre) ## 'data.frame': 58788 obs. of 7 variables: ## $ Action : int 0 0 0 0 0 0 1 0 0 0 ... ## $ Animation : int 0 0 1 0 0 0 0 0 0 0 ... ## $ Comedy : int 1 1 0 1 0 0 0 0 0 0 ... ## $ Drama : int 1 0 0 0 0 1 1 0 1 0 ... ## $ Documentary: int 0 0 0 0 0 0 0 1 0 0 ... ## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Short : int 0 0 1 0 0 0 0 1 0 0 ...
moviesGenreUnique <- aggregate(moviesGenre$Action, by = moviesGenre, FUN = length)
nrow(moviesGenreUnique) ## [1] 79
moviesGenreUnique[order(moviesGenreUnique$x, decreasing = TRUE)[1:3], ] ## Action Animation Comedy Drama Documentary Romance Short x ## 9 0 0 0 1 0 0 0 14235 ## 1 0 0 0 0 0 0 0 12786 ## 5 0 0 1 0 0 0 0 8237
How would you keep only the rows of movies that fit in only one genre?

moviesInUniqueGenre <- movies[rowSums(moviesGenre) == 1, , drop = FALSE]
genreBudget <- subset(moviesInUniqueGenre, !is.na(budget), select = c(budget, rating, Action:Short)) avgBudgetByGenre <- aggregate(. ~ Action + Animation + Comedy + Drama + Documentary + Romance + Short, data = genreBudget, FUN = mean) avgBudgetByGenre[order(-avgBudgetByGenre$budget), ] ## Action Animation Comedy Drama Documentary Romance Short budget ## 1 1 0 0 0 0 0 0 32698189.6 ## 2 0 1 0 0 0 0 0 32311451.6 ## 3 0 0 1 0 0 0 0 11921970.6 ## 4 0 0 0 1 0 0 0 10456690.5 ## 6 0 0 0 0 0 1 0 5603688.0 ## 5 0 0 0 0 1 0 0 729704.8 ## 7 0 0 0 0 0 0 1 396133.1 ## rating ## 1 5.504118 ## 2 6.645161 ## 3 5.707843 ## 4 6.516722 ## 6 6.639024 ## 5 6.900000 ## 7 7.060265
Sometimes we need to check the number of observations having specific combinations of categorical variable levels.
To calculate cross-classification tables, the table() and xtabs() functions are our friends. There are a few differences between them, the main one being that xtabs() takes formulas to specify the cross-tabulation.
color <- c("red", "green", "green", "red", "green", "green", "red") shape <- c("round", "sharp", "sharp", "round", "round", "sharp", "round") size <- c("tall", "tall", "small", "tall", "small", "small", "tall") objects <- data.frame(color, shape, size) xclass <- xtabs(data = objects) str(xclass) # a list with other attributes ## int [1:2, 1:2, 1:2] 1 0 2 0 0 3 1 0 ## - attr(*, "dimnames")=List of 3 ## ..$ color: chr [1:2] "green" "red" ## ..$ shape: chr [1:2] "round" "sharp" ## ..$ size : chr [1:2] "small" "tall" ## - attr(*, "class")= chr [1:2] "xtabs" "table" ## - attr(*, "call")= language xtabs(data = objects)
summary(xclass)
## Call: xtabs(data = objects)
## Number of cases in table: 7
## Number of factors: 3
## Test for independence of all factors:
##  Chisq = 9.674, df = 4, p-value = 0.0463
##  Chi-squared approximation may be incorrect
as.data.frame(xclass) # ALL combinations between levels are shown
##   color shape  size Freq
## 1 green round small    1
## 2   red round small    0
## 3 green sharp small    2
## 4   red sharp small    0
## 5 green round  tall    0
## 6   red round  tall    3
## 7 green sharp  tall    1
## 8   red sharp  tall    0
ftable(xclass) # Another way to flatten cross-classification tables with more than 2 dimensions
##             size small tall
## color shape
## green round        1    0
##       sharp        2    1
## red   round        0    3
##       sharp        0    0
# to compute margin sums or other functions addmargins(xtabs(formula = ~ color + shape, data = objects), FUN = sum) ## Margins computed over dimensions ## in the following order: ## 1: color ## 2: shape ## shape ## color round sharp sum ## green 1 3 4 ## red 3 0 3 ## sum 4 3 7 # to compute proportions (margin = 1, rowwise ; 2, colwise ; not specified, whole table) prop.table(xtabs(formula = ~ color + shape, data = objects)) ## shape ## color round sharp ## green 0.1428571 0.4285714 ## red 0.4285714 0.0000000
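For instance, with margin = 1 each row of the result is scaled to sum to one (a sketch that rebuilds the same toy color/shape vectors):

```r
color <- c("red", "green", "green", "red", "green", "green", "red")
shape <- c("round", "sharp", "sharp", "round", "round", "sharp", "round")
tab <- xtabs(~ color + shape, data = data.frame(color, shape))

# row-wise proportions: each row of the result sums to 1
prop.table(tab, margin = 1)
```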
This is interesting but we can do some more with reshape2.
Using the movies data set, convert rating to a factor with 5 levels. Use the pretty() function to generate the value of the breaks argument of cut(). Cross-tabulate the result against the mpaa ratings and add sums on both sides.
library(ggplot2) ratingCat <- cut(movies$rating, breaks = pretty(movies$rating, 5)) addmargins(table(movies$mpaa, ratingCat), margin = c(1, 2)) ## ratingCat ## (0,2] (2,4] (4,6] (6,8] (8,10] Sum ## 787 5536 19763 23793 3985 53864 ## NC-17 1 2 7 6 0 16 ## PG 5 70 238 201 14 528 ## PG-13 4 93 439 447 20 1003 ## R 40 611 1433 1240 53 3377 ## Sum 837 6312 21880 25687 4072 58788
Source: Data Wrangling Cheatsheet
R follows a set of conventions that makes one layout of tabular data much easier to work with than others.
Your data will be easier to work with in R if it follows three rules: each variable is in its own column, each observation is in its own row, and each value is in its own cell.
Often you receive or gather data that is messy, i.e. the way the information is structured is not appropriate for efficient and straightforward analysis in R, and you need to tidy it up.
Even when you have a tidy dataset, as part of the analysis you may want to represent the data differently and possibly aggregate it, as in the contingency tables we just saw.
melt() and dcast() with reshape2
In short, you need to reshape your data and swing seamlessly between long and wide table shapes.
The standard R distribution has facilities for this type of transformation: stack() and unstack() from the utils package and reshape() from the stats package. But, in my experience, they are difficult to use or not as flexible as the two simple but powerful functions provided in the reshape2 package:
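For reference, a minimal sketch of base-R stack(): it handles only the simplest wide-to-long case, gathering all columns into a values/ind pair with no id variables:

```r
wide <- data.frame(Speed = c(1, 2, 3), Weight = c(10, 3, 20))

# every column is gathered into 'values', the column names go into 'ind'
long <- stack(wide)
long  # 6 rows, columns 'values' and 'ind'
```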
reshape2 is based around two key functions, melt() and dcast(): melt() takes wide-format data frames and melts them into long format; dcast() takes long-format data frames and casts them into wide format.
Let's see how to do that with a slightly more explicit toy data set:
magicDrug <- data.frame(Name = rep(LETTERS[1:2], each = 2, times = 2), When = rep(c("before", "after"), times = 4), Speed = c(1, 2, 3, 5, 0, 1, 4, 5), Weight = c(10, 3, 20, 6, 11, 2, 23, 9) )
First melt:
library(reshape2) mdata <- melt(data = magicDrug, id.vars = c("Name", "When"), # Specified but unnecessary because they're factors measure.vars = c("Speed", "Weight"), # Specified but unnecessary because they're numerics variable.name = "IndPerformances", # What do these variables have in common? na.rm = TRUE, # as a reminder for NAs value.name = "value") # the default is good str(mdata) ## 'data.frame': 16 obs. of 4 variables: ## $ Name : Factor w/ 2 levels "A","B": 1 1 2 2 1 1 2 2 1 1 ... ## $ When : Factor w/ 2 levels "after","before": 2 1 2 1 2 1 2 1 2 1 ... ## $ IndPerformances: Factor w/ 2 levels "Speed","Weight": 1 1 1 1 1 1 1 1 2 2 ... ## $ value : num 1 2 3 5 0 1 4 5 10 3 ...
Now, we can dcast()
the melted data in various ways to get some insight:
# again formula specification of the layout # mean() for aggregation dcast(data = mdata, formula = IndPerformances ~ When, fun.aggregate = mean) ## IndPerformances after before ## 1 Speed 3.25 2 ## 2 Weight 5.00 16 ### !!! Note that the fun.aggregate function should take a vector of numbers !!! ### !!! and return a single summary statistic !!!
dcast(mdata, Name ~ IndPerformances, mean) ## Name Speed Weight ## 1 A 1.00 6.5 ## 2 B 4.25 14.5
dcast(mdata, Name + When ~ IndPerformances, length) ## Name When Speed Weight ## 1 A after 2 2 ## 2 A before 2 2 ## 3 B after 2 2 ## 4 B before 2 2
In some regards, the iris dataset seems a little untidy. Let's try to tidy it up.
library(reshape2) summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ##
Add a flowerID variable to enable keeping track of what was measured on which flower.
newIris <- cbind(iris, flowerID = as.character(1:nrow(iris))) meltedIris <- melt(data = newIris, variable.name = "Object.Dimension", value.name = "Length", factorsAsStrings = FALSE) ## Using Species, flowerID as id variables meltedIris <- meltedIris[order(meltedIris$Object.Dimension, meltedIris$Length), ] str(meltedIris) ## 'data.frame': 600 obs. of 4 variables: ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ flowerID : Factor w/ 150 levels "1","10","100",..: 46 140 84 89 88 85 118 67 94 74 ... ## $ Object.Dimension: Factor w/ 4 levels "Sepal.Length",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ Length : num 4.3 4.4 4.4 4.4 4.5 4.6 4.6 4.6 4.6 4.7 ...
Recode the measured variable into object and dimension and bind them to the melted iris data frame: merge() the melted iris data frame with a custom-made recoding data frame on the Object.Dimension column.
varRecodeDf <- cbind(Object.Dimension = levels(meltedIris$Object.Dimension), object = rep(c("sepal", "petal"), each = 2), dimension = rep(c("length", "width"), times = 2) ) reshapedIris <- subset(merge(varRecodeDf, meltedIris), select = -Object.Dimension) str(reshapedIris) ## 'data.frame': 600 obs. of 5 variables: ## $ object : Factor w/ 2 levels "petal","sepal": 1 1 1 1 1 1 1 1 1 1 ... ## $ dimension: Factor w/ 2 levels "length","width": 1 1 1 1 1 1 1 1 1 1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ flowerID : Factor w/ 150 levels "1","10","100",..: 67 46 57 81 74 60 82 84 87 88 ... ## $ Length : num 1 1.1 1.2 1.2 1.3 1.3 1.3 1.3 1.3 1.3 ...
Cast a table with Species and object as rows and dimension as columns, containing mean values.
# Melting first: meltedReshapedIris <- melt(data = reshapedIris) ## Using object, dimension, Species, flowerID as id variables str(meltedReshapedIris) ## 'data.frame': 600 obs. of 6 variables: ## $ object : Factor w/ 2 levels "petal","sepal": 1 1 1 1 1 1 1 1 1 1 ... ## $ dimension: Factor w/ 2 levels "length","width": 1 1 1 1 1 1 1 1 1 1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ flowerID : Factor w/ 150 levels "1","10","100",..: 67 46 57 81 74 60 82 84 87 88 ... ## $ variable : Factor w/ 1 level "Length": 1 1 1 1 1 1 1 1 1 1 ... ## $ value : num 1 1.1 1.2 1.2 1.3 1.3 1.3 1.3 1.3 1.3 ... dcast(meltedReshapedIris, Species + object ~ dimension, fun.aggregate = mean) ## Species object length width ## 1 setosa petal 1.462 0.246 ## 2 setosa sepal 5.006 3.428 ## 3 versicolor petal 4.260 1.326 ## 4 versicolor sepal 5.936 2.770 ## 5 virginica petal 5.552 2.026 ## 6 virginica sepal 6.588 2.974
# casting directly: dcast(reshapedIris, Species + object ~ dimension, value.var = "Length", fun.aggregate = mean) ## Species object length width ## 1 setosa petal 1.462 0.246 ## 2 setosa sepal 5.006 3.428 ## 3 versicolor petal 4.260 1.326 ## 4 versicolor sepal 5.936 2.770 ## 5 virginica petal 5.552 2.026 ## 6 virginica sepal 6.588 2.974
dcast(reshapedIris, Species + object + dimension ~ 1, value.var = "Length", fun.aggregate = mean) ## Species object dimension 1 ## 1 setosa petal length 1.462 ## 2 setosa petal width 0.246 ## 3 setosa sepal length 5.006 ## 4 setosa sepal width 3.428 ## 5 versicolor petal length 4.260 ## 6 versicolor petal width 1.326 ## 7 versicolor sepal length 5.936 ## 8 versicolor sepal width 2.770 ## 9 virginica petal length 5.552 ## 10 virginica petal width 2.026 ## 11 virginica sepal length 6.588 ## 12 virginica sepal width 2.974
This is basically the melted data frame summarized with mean.
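For comparison, a base-R sketch of the same kind of by-group summary with aggregate(), computed directly on the original iris columns:

```r
# mean of each measurement by Species; compare with the dcast() summaries above
res <- aggregate(cbind(Petal.Length, Petal.Width, Sepal.Length, Sepal.Width) ~ Species,
                 data = iris, FUN = mean)
res  # one row per species, one column per measurement
```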
dplyr, a unifying framework for data frame manipulation and aggregation
Fundamental data processing functions exist in R, but they are rather disparate and lack a consistent interface and the ability to flow together easily.
This makes it difficult to read, understand and write concise code.
The dplyr package attempts to overcome these limitations:
dplyr (beside the base distribution):
The dplyr package philosophy is that preparing data in the form of a data frame for a specific analysis always involves applying a combination of elementary manipulations.
Each of these basic manipulations has been formalized with a function named with a verb:
| Main dplyr function | Effect | Default R equivalent (pseudo code) |
|---|---|---|
| `filter()` | keep rows matching criteria | `subset()` |
| `slice()` | pick rows using index(es) | `df[1:10, ]` |
| `distinct()` | extract distinct (unique) rows | `unique()` |
| `sample_n()` | randomly sample rows | `df[sample(1:nrow(df), n), ]` |
| `arrange()` | reorder rows | `df[order(df$key), ]` |
| `select()` | pick columns by name | `subset()` or `df[, c("ColB", "ColD")]` |
| `rename()` | rename specific columns | `colnames(df)[colnames(df) == "ColA"] <- "ColX"` |
| `mutate()` | add columns that are f() of existing ones | `transform()` or `cbind(df, X = f(df$ColA))` |
| `xxx_join()` | combine data frames | `merge()` |
| `intersect()`, etc. | set operations between data frames | none |
| `group_by() %>% summarise()` | reduce variables to values | `aggregate()` and `by()` |
Common verb() syntax: no need to use $ to refer to df columns, like with subset() and transform().
Let's follow parts of the dplyr tutorial to highlight neat features that are more tedious to emulate with base R functions.
# download the 'flights.csv' file from the dropbox page and save to your current working dir library(dplyr) flights <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE)) ## Warning in file(file, "rt"): cannot open file 'flights.csv': No such file ## or directory ## Error in file(file, "rt"): cannot open the connection # or # install.packages("hflights") # library(hflights) ; flights <- hflights # contains flights that departed from Houston in 2011 glimpse(flights) ## Error in nrow(tbl): object 'flights' not found
filter(flights, dest %in% c("SFO", "OAK")) ## Error in filter_(.data, .dots = lazyeval::lazy_dots(...)): object 'flights' not found
filter(flights, arr_delay > 2 * dep_delay) ## Error in filter_(.data, .dots = lazyeval::lazy_dots(...)): object 'flights' not found
select(flights, arr_delay:dep_delay) ## Error in select_(.data, .dots = lazyeval::lazy_dots(...)): object 'flights' not found
flights <- mutate(flights, speed = dist / (time / 60)) ## Error in mutate_(.data, .dots = lazyeval::lazy_dots(...)): object 'flights' not found arrange(flights, desc(speed)) ## Error in arrange_(.data, .dots = lazyeval::lazy_dots(...)): object 'flights' not found
The %>% (pipe) operator
%>% was developed in the R package magrittr.
f( g( h(x), z=1), y=1 ) becomes:
h(x) %>% g(z=1) %>% f(y=1)
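A tiny sketch on plain numeric functions (assuming magrittr, which dplyr re-exports, is available):

```r
library(magrittr)

x <- c(4, 16, 25)

# nested form: read inside-out
r1 <- round(sqrt(sum(x)), 1)

# piped form: read left to right
r2 <- x %>% sum() %>% sqrt() %>% round(1)

identical(r1, r2)  # TRUE, both 6.7
```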
Chaining dplyr functions: select the two delay variables for flights whose arrival delay is more than twice the departure delay
flights %>% filter(arr_delay > 2 * dep_delay) %>% select(arr_delay:dep_delay) ## Error in eval(expr, envir, enclos): object 'flights' not found
group_by() for operating on subsets of the table
Call group_by() on a table specifying grouping variables; subsequent verbs then operate by group:
- filter: select rows meeting a criterion, by group
- summarise: apply an aggregation function to summarise each group
- mutate: calculate a summary value by group and append it to the table
Is there a destination where the mean arrival delay is uncommonly high?
flights %>% group_by(dest) %>% summarise(avg_delay = mean(arr_delay, na.rm=TRUE)) %>% arrange(-avg_delay) ## Error in eval(expr, envir, enclos): object 'flights' not found
Calculate the mean time by dest and by carrier, and divide time by this value to create a new column carrierRelativeTime.
flights %>% group_by(dest, carrier) %>% mutate(meanTimeByDByC = mean(time, na.rm = TRUE), carrierRelativeTime = time / meanTimeByDByC) %>% select(dest:carrierRelativeTime) ## Error in eval(expr, envir, enclos): object 'flights' not found
Do you understand the different behaviour of summarise and mutate when used alongside group_by? Compare the nrow of the respective tables.
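A small sketch on the built-in mtcars data (assuming dplyr): summarise() collapses each group to a single row, while mutate() keeps every row and recycles the group summary alongside it:

```r
library(dplyr)

# one row per cyl group
s <- mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))

# as many rows as mtcars, avg_mpg repeated within each group
m <- mtcars %>% group_by(cyl) %>% mutate(avg_mpg = mean(mpg))

c(nrow(s), nrow(m))  # 3 groups vs 32 original rows
```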
do() is a by()-like function for fine control over computed values
do(), another verb of the dplyr vocabulary, performs arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list.
Let's try to rerun the previous iris example that could not run with aggregate() but could with by(), although the output was a list.
# determining the correlation between dimensions of flower parts. aggregate(iris[1:4], by = iris["Species"], FUN = cor, method = "spearman") ## Error in FUN(X[[i]], ...): supply both 'x' and 'y' or a matrix-like 'x' # in dplyr jargon that translates to: myCors <- iris %>% group_by(Species) %>% do(cors = cor(subset(., , Sepal.Length:Petal.Width)))
myCors ## Source: local data frame [3 x 2] ## Groups: <by row> ## ## Species cors ## 1 setosa <dbl[4,4]> ## 2 versicolor <dbl[4,4]> ## 3 virginica <dbl[4,4]> str(myCors[1, ]) ## Classes 'tbl_df' and 'data.frame': 1 obs. of 2 variables: ## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 ## $ cors :List of 1 ## ..$ : num [1:4, 1:4] 1 0.743 0.267 0.278 0.743 ... ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
Better than that:
# Here, do() returns several results joint together in a data frame row myCors <- iris %>% group_by(Species) %>% do(data.frame(mpl = mean(.$Petal.Length), msl = mean(.$Sepal.Length), cors = I(list(cor(subset(., , Sepal.Length:Petal.Width)))) ) )
myCors ## Source: local data frame [3 x 4] ## Groups: Species ## ## Species mpl msl cors ## 1 setosa 1.462 5.006 <dbl[4,4]> ## 2 versicolor 4.260 5.936 <dbl[4,4]> ## 3 virginica 5.552 6.588 <dbl[4,4]> str(myCors[1, ]) ## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 4 variables: ## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 ## $ mpl : num 1.46 ## $ msl : num 5.01 ## $ cors :List of 1 ## ..$ : num [1:4, 1:4] 1 0.743 0.267 0.278 0.743 ... ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ## - attr(*, "vars")=List of 1 ## ..$ : symbol Species ## - attr(*, "drop")= logi TRUE ## - attr(*, "indices")=List of 1 ## ..$ : int 0 ## - attr(*, "group_sizes")= int 1 ## - attr(*, "biggest_group_size")= int 1 ## - attr(*, "labels")='data.frame': 1 obs. of 1 variable: ## ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 ## ..- attr(*, "vars")=List of 1 ## .. ..$ : symbol Species
Note the difference between summarise() and do().
dplyr helps efficiently solve most of the routine data frame manipulation tasks ;o)
Find all flights that departed at midnight
identical(filter(flights, hour == 0, minute == 0), filter(flights, hour == 0 & minute == 0) ) ## Error in filter_(.data, .dots = lazyeval::lazy_dots(...)): object 'flights' not found
Calculate the min and max values of time and the standard deviation of speed for each destination (beware of NAs).
flights %>% na.omit() %>% group_by(dest) %>% summarise(minT = min(time), maxT = max(time), sdS = sd(speed)) ## Error in eval(expr, envir, enclos): object 'flights' not found
Try to guess (no code) the cause of the NA value of sdS for the flight(s) to Augusta.
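A one-line hint: sd() is undefined for a single observation, so a destination with only one complete flight record yields NA:

```r
sd(120)          # NA: no dispersion can be estimated from one value
sd(c(120, 130))  # fine with two or more values
```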
Find the number of flights that departed to JFK for each day of the year (tip: use n()).
flights %>% filter(dest == "JFK") %>% group_by(date) %>% summarise(countsToJFK = n()) ## Error in eval(expr, envir, enclos): object 'flights' not found
Pick the three most delayed flights at departure for each carrier (tip: slice() on carrier groups)
flights %>% group_by(carrier) %>% arrange(-dep_delay) %>% slice(1:3) ## Error in eval(expr, envir, enclos): object 'flights' not found
Create two subsamples of flights by randomly sampling 30% of the flights that went to SFO or LAX. Figure out how many flights they have in common, and how many are specific to each of them.
spl1 <- flights %>% filter(dest == "SFO" | dest == "LAX") %>% sample_frac(size = 0.3) ## Error in eval(expr, envir, enclos): object 'flights' not found spl2 <- flights %>% filter(dest == "SFO" | dest == "LAX") %>% sample_frac(size = 0.3) ## Error in eval(expr, envir, enclos): object 'flights' not found nrow(intersect(spl1, spl2)) ## Error in intersect(spl1, spl2): object 'spl1' not found nrow(setdiff(spl1, spl2)) ## Error in setdiff(spl1, spl2): object 'spl1' not found nrow(setdiff(spl2, spl1)) ## Error in setdiff(spl2, spl1): object 'spl2' not found ### To illustrate the effect of NSE ## replicate(n, expr) runs the expression expr n times and returns a list with the results weird <- replicate(n = 2, flights %>% filter(dest == "SFO" | dest == "LAX") %>% sample_frac(size = 0.3)) ## Error in eval(expr, envir, enclos): object 'flights' not found weird ## Error in eval(expr, envir, enclos): object 'weird' not found