A pretty long section. It will cover indexing and manipulating matrices, lists and data frames, then aggregating and reshaping data.
# a matrix m
m <- matrix(data = 1:20, nrow = 4, byrow = FALSE)
m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
m[c(2, 4), ] # returns a matrix: two rows, all columns
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 6 10 14 18
## [2,] 4 8 12 16 20
m[, c(2, 4)] # a matrix: two columns, all rows
## [,1] [,2]
## [1,] 5 13
## [2,] 6 14
## [3,] 7 15
## [4,] 8 16
m[2:6] # vector-like indexing; a matrix is stored as a vector, column by column
## [1] 2 3 4 5 6
idx <- cbind(rows = 1:4, columns = 1:4) # bind two numeric vectors as columns of an idx matrix
m[idx] # indexing returns a vector of values in the diagonal
## [1] 1 6 11 16
# Use a logical vector to find rows that have a value > 5 in the second column
# and extract columns two to four for these rows
m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
m[m[, 2] > 5, 2:4]
## [,1] [,2] [,3]
## [1,] 6 10 14
## [2,] 7 11 15
## [3,] 8 12 16
We will do a lot of this for data frames.
row() and col() return a matrix of integers containing, respectively, the row or column number of each cell in a matrix-like object.
row(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 2 2 2 2 2
## [3,] 3 3 3 3 3
## [4,] 4 4 4 4 4
col(m)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 1 2 3 4 5
## [3,] 1 2 3 4 5
## [4,] 1 2 3 4 5
This is handy to extract specific regions of the matrix using a logical matrix of the same shape:
row(m) == col(m) # the diagonal
## [,1] [,2] [,3] [,4] [,5]
## [1,] TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE FALSE FALSE
## [3,] FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE TRUE FALSE
m[row(m) < col(m)] # extract the upper triangle by column without the diag
## [1] 5 9 10 13 14 15 17 18 19 20
Actually, not surprisingly, R has dedicated functions for that: lower.tri(), upper.tri() and diag() (the latter for both extracting and setting).
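For instance, a quick check that upper.tri() reproduces the logical matrix we built by hand:
identical(m[upper.tri(m)], m[row(m) < col(m)]) # same cells, same column-wise order
## [1] TRUE
diag(m) # extract the diagonal
## [1] 1 6 11 16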
which() also works for arrays and can return a numeric indexing matrix. Let's illustrate a use case where the goal is to find the locations of values of interest in a matrix:
# One wants to find values 2, 5 and 10 in `m`:
logiV <- m %in% c(2, 5, 10)
logiV # a logical vector by column
## [1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
logiM <- matrix(data = logiV, nrow = 4, byrow = FALSE) # coerce into a matrix
logiM
## [,1] [,2] [,3] [,4] [,5]
## [1,] FALSE TRUE FALSE FALSE FALSE
## [2,] TRUE FALSE TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
m[logiM] # we did find them; apparently the values have no duplicates
## [1] 2 5 10
# To find the coordinates of the values of interest in the matrix:
idx <- which(logiM, arr.ind = TRUE)
val <- m[idx]
cbind(idx, val = val) # "column" bind the indexing matrix with the vector of values of interest.
## row col val
## [1,] 2 1 2
## [2,] 1 2 5
## [3,] 2 3 10
t(): transposes a matrix
colSums(): sum of the column values
rowSums(): sum of the row values
colMeans(): mean of the column values
rowMeans(): mean of the row values
%*%: matrix multiplication
…
t(m)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [5,] 17 18 19 20
m + m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 10 18 26 34
## [2,] 4 12 20 28 36
## [3,] 6 14 22 30 38
## [4,] 8 16 24 32 40
m - m
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
Take the m matrix. How would you change all the values of the diagonal to 1?
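One possible answer (a sketch), using diag() in its assignment form:
diag(m) <- 1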
In biochemistry, people often use 96-well microplates, which are in essence a matrix of wells with 12 columns and 8 rows:
Here is a matrix containing the measures output by an instrument:
measures <- matrix(data = round(runif(n = 96, max = 10), digits = 1), nrow = 8, ncol = 12)
In order to do computations on the values, you would need to reshape it into a long format, i.e. in the form: row#, column#, value. Can you do that?
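One possible approach (a sketch), reusing the row() and col() functions introduced above:
long <- data.frame(row = as.vector(row(measures)),
                   column = as.vector(col(measures)),
                   value = as.vector(measures))
head(long, n = 3)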
List indexing can be achieved with one of three indexing operators:
[: selects sub-lists. It always returns a list. Used with a single positive integer, it returns a list of length one.
[[: returns the actual element within the list at the specified single index.
$: a shorthand for name indexing, where x$y is equivalent to x[["y", exact = FALSE]].
fileCabinet <- list(drawer1 = c(length = 1, width = 3),
drawer2 = FALSE,
top = c(item1 = "flowers", item2 = "sharpener"))
str(fileCabinet)
## List of 3
## $ drawer1: Named num [1:2] 1 3
## ..- attr(*, "names")= chr [1:2] "length" "width"
## $ drawer2: logi FALSE
## $ top : Named chr [1:2] "flowers" "sharpener"
## ..- attr(*, "names")= chr [1:2] "item1" "item2"
The [ operator. These expressions all return the same list:
fileCabinet[1:2]
fileCabinet[c(T, T, F)]
fileCabinet[c("drawer1", "drawer2")] # Partial name matching is not possible
The [[ and $ operators. These expressions all return the same element:
fileCabinet[[3]]
fileCabinet[["top"]]
fileCabinet[["t", exact = NA]] # the exact argument controls partial matching, TRUE or NA (return a warning)
fileCabinet$top
fileCabinet$t
identical(fileCabinet[1], fileCabinet[[1]])
## [1] FALSE
Take a couple of minutes to run these expressions and understand why they return the same thing.
str(fileCabinet)
## List of 3
## $ drawer1: Named num [1:2] 1 3
## ..- attr(*, "names")= chr [1:2] "length" "width"
## $ drawer2: logi FALSE
## $ top : Named chr [1:2] "flowers" "sharpener"
## ..- attr(*, "names")= chr [1:2] "item1" "item2"
fileCabinet[[c(1,2)]]
## [1] 3
fileCabinet[[c(3,2)]]
## [1] "sharpener"
fileCabinet[[c("top", "item2")]]
## [1] "sharpener"
fileCabinet$top[["item2"]]
## [1] "sharpener"
fileCabinet$top["item2"]
## item2
## "sharpener"
# Combinations of subsetting operators are possible but not recommended because the code is hard to read:
fileCabinet[[3]][[2]]
## [1] "sharpener"
fileCabinet[[3]][2]
## item2
## "sharpener"
Do you see a difference in the result of the last four expressions? This illustrates a case of simplifying the results of an indexing operation. We will come back to this shortly.
Lists are modified with the <- operator. For example, to delete elements in a list, assign NULL:
lst <- list(a = c(TRUE, FALSE), b = NULL, c = list(list()))
lst[c("a", "c")] <- NULL
How would you replace the last two elements of fileCabinet with the first element of fileCabinet and a sequence of numbers?
How would you change the value of width in drawer1 to 2?
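Possible answers (sketches; 1:5 stands in for any sequence of numbers):
fileCabinet[2:3] <- list(fileCabinet[[1]], 1:5) # replace the last two elements
fileCabinet$drawer1["width"] <- 2               # set width in drawer1 to 2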
Data frames are hybrid objects at the intersection between a list and a matrix. Not unexpectedly, both matrix and list indexing techniques apply to data frames.
df <- data.frame(A = letters[1:4],
`B-2` = rep(c(TRUE, FALSE), each = 2),
`C with space` = seq(10, 20, length.out = 4),
D = c(runif(3), NA)
)
df
## A B.2 C.with.space D
## 1 a TRUE 10.00000 0.7670464
## 2 b TRUE 13.33333 0.1255563
## 3 c FALSE 16.66667 0.2556670
## 4 d FALSE 20.00000 NA
See what happens with the weird column names?
On a general note, if you want to make your life easier, just prohibit spaces, +, -, * and other special characters in column/row names, even in tabulated files that you want to import into R.
df[c(1,4)] # To select one or several columns. Returns a data frame
## A D
## 1 a 0.7670464
## 2 b 0.1255563
## 3 c 0.2556670
## 4 d NA
df[3]
## C.with.space
## 1 10.00000
## 2 13.33333
## 3 16.66667
## 4 20.00000
df[[3]] # To select the CONTENT (a vector) of a single column, use the list double brackets.
## [1] 10.00000 13.33333 16.66667 20.00000
df$C # $ allows partial name matching
## [1] 10.00000 13.33333 16.66667 20.00000
# And so on...
df[,c(1,4)] # extracts columns
## A D
## 1 a 0.7670464
## 2 b 0.1255563
## 3 c 0.2556670
## 4 d NA
df[, c("A", "D")]
## A D
## 1 a 0.7670464
## 2 b 0.1255563
## 3 c 0.2556670
## 4 d NA
df[2:3, ]
## A B.2 C.with.space D
## 2 b TRUE 13.33333 0.1255563
## 3 c FALSE 16.66667 0.2556670
df[df$B.2, ] # a df
## A B.2 C.with.space D
## 1 a TRUE 10.00000 0.7670464
## 2 b TRUE 13.33333 0.1255563
df[, colnames(df) %in% "D"] # column turned to a vector !!!
## [1] 0.7670464 0.1255563 0.2556670 NA
df[, colnames(df) %in% "D", drop = FALSE] # column remains a df !!!
## D
## 1 0.7670464
## 2 0.1255563
## 3 0.2556670
## 4 NA
df[df$C.with.space < 20 & df$D > 0.4, ] # use comparison and logical operators to subset
## A B.2 C.with.space D
## 1 a TRUE 10 0.7670464
Typing $ to refer to columns can become tedious. So in interactive mode, one can use with():
with(data = df, df[C.with.space < 20 & D > 0.4, ])
## A B.2 C.with.space D
## 1 a TRUE 10 0.7670464
# This works as expected
DF <- data.frame(a = 1:3)
DF$b <- list(1:1, 1:2, 1:3)
str(DF)
## 'data.frame': 3 obs. of 2 variables:
## $ a: int 1 2 3
## $ b:List of 3
## ..$ : int 1
## ..$ : int 1 2
## ..$ : int 1 2 3
# This alternative produces an error
DF <- data.frame(a = 1:3, b = list(1:1, 1:2, 1:3))
## Error in data.frame(1L, 1:2, 1:3, check.names = FALSE, stringsAsFactors = TRUE): arguments imply differing number of rows: 1, 2, 3
When a list is passed to data.frame(), it is as if each component or column had been passed as a separate argument. Use I() to protect your object and request that it is treated as is:
DF <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))
Remember from previous slides?
class(df[[3]]) == class(df[3])
## [1] FALSE
Unlike [ with drop = FALSE, both [[ and [ with the default drop = TRUE will simplify their return value.
Simplifying subsets returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want.
Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type.
Omitting drop = FALSE when subsetting matrices and data frames is one of the most common sources of programming errors. (It will work for your test cases, but then someone will pass in a single column data frame and it will fail in an unexpected and unclear way.)
Italics are direct quotes from the source at http://adv-r.had.co.nz/Subsetting.html; follow the link if you want a thorough outlook on this aspect.
Take df and keep all columns but "A" and "D". Do that in at least two different ways (think <- list(NULL)).
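Two possible ways (sketches; each one starts from the original df):
df[c("A", "D")] <- list(NULL)        # way 1: delete by assigning NULL
df[, !colnames(df) %in% c("A", "D")] # way 2: select the complement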
The function installed.packages() returns a data frame with info on the R packages installed on your system. Use it to check whether the following packages are installed: c("ape", "reshape2", "dplyr", "lattice", "ggplot2", "VennDiagram", "Biostrings").
Re-assign df to its original value and calculate the mean of the content of column "D".
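A sketch for the last question (mind the NA in column D):
mean(df$D, na.rm = TRUE)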
The subset() function
Let's take our toy data frame:
df <- data.frame(A = letters[1:4],
B = rep(c(TRUE, FALSE), each = 2),
C = seq(10, 20, length.out = 4),
D = c(runif(3), NA)
)
We have already encountered multiple ways to extract rows and columns from a data frame.
The subset(x, subset, select, drop = FALSE, ...) function can save you some substantial typing in interactive sessions and makes the code faster to understand.
The select= argument is also pretty convenient to keep only the columns of interest.
subset(x = df, subset = C < 20 & D > 0.4)
subset(x = df, subset = C < 20 & D > 0.4, select = c("A", "B", "C"))
subset(x = df, subset = C < 20 & D > 0.4, select = !colnames(df) %in% c("A", "D"))
subset(x = df, subset = C < 20 & D > 0.4, select = A:C) # "sequence" of column names
subset(x = df, subset = C < 20 & D > 0.4, select = -(C:D)) # "Negated" column names
While subset= must have a logical value, select= accepts more types.
Note that in order to allow us to refer to data frame columns without explicit use of the data frame name (like in df$ColA), subset(), like with() and other functions (especially in dplyr and reshape2), uses Non-Standard Evaluation (NSE).
It is a fairly advanced topic, but you should remember that, because of this use of NSE, it is not recommended to call these functions inside your own functions: it may just not work or cause unexpected behavior.
For a discussion on NSE, see Advanced R.
To change these attributes, there are two dedicated functions that work both for getting and modifying names: rownames() and colnames().
colnames(df)
## [1] "A" "B" "C" "D"
colnames(df) <- month.abb[1:ncol(df)]
rownames(df) <- sample(state.name, size = nrow(df))
df
## Jan Feb Mar Apr
## Louisiana a TRUE 10.00000 0.6436652
## West Virginia b TRUE 13.33333 0.2687169
## Alaska c FALSE 16.66667 0.4745574
## Nebraska d FALSE 20.00000 NA
In a data frame, values for:
rownames() should be a character vector of non-duplicated and non-missing names.
colnames() should be a character vector of (preferably) unique, syntactically valid names.
We have seen before that one should always be wary of the presence of NAs in the data under analysis.
One informal way to look for them is to scrutinize the info displayed by summary().
To look for and exclude them programmatically, test for their presence with is.na() and use logical indexing.
na.omit() and related functions are useful for conveniently dealing with NAs in data frames. summary(df) indicates the number of NAs for each variable.
Remember also that many (summary) functions have an argument like na.rm= that controls how to handle them.
Let's reset our df to factory values.
summary(df)
## A B C D
## a:1 Mode :logical Min. :10.0 Min. :0.03823
## b:1 FALSE:2 1st Qu.:12.5 1st Qu.:0.07693
## c:1 TRUE :2 Median :15.0 Median :0.11563
## d:1 NA's :0 Mean :15.0 Mean :0.35267
## 3rd Qu.:17.5 3rd Qu.:0.50989
## Max. :20.0 Max. :0.90415
## NA's :1
df[!is.na(df$D), ]
## A B C D
## 1 a TRUE 10.00000 0.90415435
## 2 b TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
na.omit(df)
## A B C D
## 1 a TRUE 10.00000 0.90415435
## 2 b TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
Adding / Removing rows
Use rbind() and re-assign (column names must match), or use subsetting and assignment for in-place modification.
df <- df[!is.na(df$D), ] # delete
rbind(df, df[1, ]) # bind
## A B C D
## 1 a TRUE 10.00000 0.90415435
## 2 b TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
## 4 a TRUE 10.00000 0.90415435
df[nrow(df) + 1, ] <- df[1, ] # bind with another method
df
## A B C D
## 1 a TRUE 10.00000 0.90415435
## 2 b TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
## 4 a TRUE 10.00000 0.90415435
Adding / Removing columns
Assign list(NULL) (or NULL) to delete columns, or return only the columns you want and assign the result to your df.
Use cbind() and re-assign (if binding a character vector, it will be converted to a factor unless stringsAsFactors = FALSE).
subset(df, select = A:B) # could be re-assigned to df to delete col C and D
## A B
## 1 a TRUE
## 2 b TRUE
## 3 c FALSE
## 4 a TRUE
df <- cbind(df, E = state.name[1:nrow(df)], F = rnorm(n = nrow(df))) # does convert to factor
df[ , "G"] <- state.name[1:nrow(df)] # does NOT
str(df)
## 'data.frame': 4 obs. of 7 variables:
## $ A: Factor w/ 4 levels "a","b","c","d": 1 2 3 1
## $ B: logi TRUE TRUE FALSE TRUE
## $ C: num 10 13.3 16.7 10
## $ D: num 0.9042 0.1156 0.0382 0.9042
## $ E: Factor w/ 4 levels "Alabama","Alaska",..: 1 2 3 4
## $ F: num -1.0387 0.5046 -0.3716 -0.0255
## $ G: chr "Alabama" "Alaska" "Arizona" "Arkansas"
df["G"] <- NULL
There is also transform(), which saves you from referring to the df name in expressions and re-assigning.
df <- cbind(df, G = sqrt(df$D) * (df$C + pi)) # calculate values with vectors of equal length
df <- transform(df, H = ceiling(G - mean(C))) # mean(C) is recycled
df
## A B C D E F G H
## 1 a TRUE 10.00000 0.90415435 Alabama -1.03870301 12.495950 0
## 2 b TRUE 13.33333 0.11562937 Alaska 0.50458974 5.602186 -6
## 3 c FALSE 16.66667 0.03822775 Arizona -0.37158670 3.872895 -8
## 4 a TRUE 10.00000 0.90415435 Arkansas -0.02548523 12.495950 0
order() returns a permutation of the indices of the elements of x so that the corresponding elements are ordered.
order() can work with a combination of sorting keys provided as vectors of the same length. So one can sort a data frame based on several columns:
df
## A B C D E F G H
## 1 a TRUE 10.00000 0.90415435 Alabama -1.03870301 12.495950 0
## 2 b TRUE 13.33333 0.11562937 Alaska 0.50458974 5.602186 -6
## 3 c FALSE 16.66667 0.03822775 Arizona -0.37158670 3.872895 -8
## 4 a TRUE 10.00000 0.90415435 Arkansas -0.02548523 12.495950 0
df[ order(df$B, df$H, -df$F), ]
## A B C D E F G H
## 3 c FALSE 16.66667 0.03822775 Arizona -0.37158670 3.872895 -8
## 2 b TRUE 13.33333 0.11562937 Alaska 0.50458974 5.602186 -6
## 4 a TRUE 10.00000 0.90415435 Arkansas -0.02548523 12.495950 0
## 1 a TRUE 10.00000 0.90415435 Alabama -1.03870301 12.495950 0
Note that specifying the increasing/decreasing behavior for individual (numeric) columns is achieved by negating the column (see also the Stack Overflow discussion).
unique() returns only the unique elements.
duplicated() returns a logical vector where TRUE indicates that the corresponding element is a duplicate.
df <- subset(df, , A:D)
unique(df)
## A B C D
## 1 a TRUE 10.00000 0.90415435
## 2 b TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
duplicated(df)
## [1] FALSE FALSE FALSE TRUE
df[duplicated(df), ]
## A B C D
## 4 a TRUE 10 0.9041543
df[!duplicated(df), ] # use !, not -, to negate a logical vector
## A B C D
## 1 a TRUE 10.00000 0.90415435
## 2 b TRUE 13.33333 0.11562937
## 3 c FALSE 16.66667 0.03822775
Merging data frames is also a recurring necessity and a powerful technique in data analysis. It involves combining the information of two data frames based on common variable(s) or key(s). This is reminiscent of running SQL queries across tables in relational databases.
In base R, merge() lets you combine these data frames:
# The grocery store example...
products
## ProdID category price
## 1 A vegetable 4
## 2 B meat 3
## 3 C can 2
## 4 D iceCream 1
sales
## CustomID ProdID Date
## 1 a C 2015-09-01
## 2 a A 2015-09-01
## 3 b B 2015-09-02
## 4 b C 2015-09-03
## 5 b E 2015-09-03
How can we generate some kind of bill?
Left-join sales with the info from products and sort by CustomID:
bill <- merge(x = sales, y = products, by = "ProdID", all.x = TRUE, sort = FALSE)
bill[order(bill$CustomID, bill$Date, bill$ProdID), ]
## ProdID CustomID Date category price
## 3 A a 2015-09-01 vegetable 4
## 1 C a 2015-09-01 can 2
## 4 B b 2015-09-02 meat 3
## 2 C b 2015-09-03 can 2
## 5 E b 2015-09-03 <NA> NA
# There is no info about product E; what did the cashier do!!?
To keep unmatched rows from both tables, do a full (outer) join on ProdID:
merge(x = sales, y = products, by = "ProdID", all = TRUE, sort = FALSE)
## ProdID CustomID Date category price
## 1 C a 2015-09-01 can 2
## 2 C b 2015-09-03 can 2
## 3 A a 2015-09-01 vegetable 4
## 4 B b 2015-09-02 meat 3
## 5 E b 2015-09-03 <NA> NA
## 6 D <NA> <NA> iceCream 1
# Product D did not sell well these last days.
The full signature of merge():
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)
by, by.x and by.y specify the key column(s); all, all.x and all.y control which non-matching rows are kept.
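A small sketch of by.x/by.y with made-up tables (left and right are hypothetical), for when the key columns are named differently:
left <- data.frame(key = c("a", "b"), v1 = 1:2)
right <- data.frame(id = c("b", "c"), v2 = 3:4)
merge(x = left, y = right, by.x = "key", by.y = "id", all = TRUE)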
This question relates to the issue of data aggregation, which notably involves calculating summary statistics for specific groups of records defined by combinations of the levels of (categorical) variables (factors).
Probably the most user-friendly base R function for these tasks is aggregate().
cleanBill <- na.omit(bill)
aggregate(x = cleanBill$price, by = cleanBill[c("CustomID", "Date")], FUN = sum)
## CustomID Date x
## 1 a 2015-09-01 6
## 2 b 2015-09-02 3
## 3 b 2015-09-03 2
# That answers our first question
aggregate() comes in two flavors:
aggregate(x, by, FUN, ..., simplify = TRUE)
x = your data
by = a list of grouping factors
FUN = the function used for aggregation
... = optional arguments to FUN
simplify = whether results should be simplified to a vector or matrix/data frame if possible
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
formula = a formula, e.g. y ~ x or cbind(y1, y2) ~ x1 + x2, meaning summarise y by x as a grouping factor. We will learn more about formulas when discussing plotting and statistical modelling.
data = the data frame where x and y can be found
…
If time allows, you can run the examples in the documentation for aggregate() to get a better sense of what you can do with the formula notation.
Let's look at other examples using the famous iris dataset:
First we want to calculate the means of the various measures conditioned on species:
aggregate(. ~ Species, data = iris, FUN = mean)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
# The dot in formula means "any columns from data that are otherwise not used"
We are interested in determining the correlation between the dimensions of sepals and petals conditioned on the plant species. For this we can use cor().
aggregate(iris[1:4], by = iris["Species"], FUN = cor, method = "spearman")
## Error in FUN(X[[i]], ...): supply both 'x' and 'y' or a matrix-like 'x'
# cor() complains...
This does not work because aggregate() first separates the columns, and then, for each column, applies FUN to each subset of values (vectors) defined by by. The problem is that here cor() takes a 2-D object…
Let's try by(), which splits a data frame into subsets of data frames based on the values of one or more factors, applying the function FUN to each data frame in turn.
irisCorel <- by(iris[1:4], INDICES = iris["Species"], FUN = cor, method = "spearman")
Note that additional arguments to FUN are passed after the FUN= in the call.
str(irisCorel)
## List of 3
## $ setosa : num [1:4, 1:4] 1 0.755 0.279 0.299 0.755 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ versicolor: num [1:4, 1:4] 1 0.518 0.737 0.549 0.518 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## $ virginica : num [1:4, 1:4] 1 0.427 0.824 0.316 0.427 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## - attr(*, "dim")= int 3
## - attr(*, "dimnames")=List of 1
## ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
## - attr(*, "call")= language by.data.frame(data = iris[1:4], INDICES = iris["Species"], FUN = cor, method = "spearman")
## - attr(*, "class")= chr "by"
To sum up:
aggregate() is OK, but, for example, there is no way to provide names for the results of the summarization in the output other than manually changing them afterwards.
by() is OK, but we may need to further process its output to format it in a table-like fashion.
aggregate() and by() pass the same function to all subsets. What if we wanted to calculate different summaries depending on the variable(s)?
The dplyr package will help you do some of that fairly easily.
For total control, you will have to write your own functions and apply them iteratively with by() or other functions of the apply() family.
How would you randomly sample m rows from a data frame (write general code and use the iris dataset to test)?
Run the code below:
library(ggplot2)
mySimpleSumy <- aggregate(diamonds, list(diamonds$color), summary)
What is this call doing? Take a look at the structure of the returned object. What is it?
Run the same aggregate call but change the value of the simplify argument. What happens?
Use the movies dataset from the ggplot2 package to test it.
Adapted and extended from an Introduction to R.
Have a look at the movies dataset from the ggplot2 package. What can you tell about it?
library(ggplot2)
?movies
How many movies are there where the budget is known (don't use subset())?
How many movies are there where the rating is less than or equal to 2.5 or greater than 7.5 (don't use subset())?
How many movies were made in 1980 and have a rating above 5.0?
How many movies were classified both as "Action" and "Animation"?
Don't sleep! What are the "precious" movies with the maximum number of votes or the longest length? How many votes for the latter? You may use subset().
Append a new column named rating_zscore to movies by standardizing the average IMDB user rating using the scale() function.
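A sketch, assuming the average user rating sits in the rating column of movies:
movies$rating_zscore <- as.numeric(scale(movies$rating))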
Extract only the last binary columns used for genre classification.
Create an aggregated version of it that will count the number of movies belonging to each unique combination of genre.
Can you tell how many unique combinations of genre are present in this df?
What are the three most abundant combinations of genres?
How would you extract all the rows in movies that fit in only one genre?
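A sketch for the counting step, assuming the seven binary genre columns of the ggplot2 movies data:
genres <- movies[, c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance", "Short")]
combos <- aggregate(list(count = rep(1, nrow(genres))), by = genres, FUN = sum)
nrow(combos) # the number of unique genre combinations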
Sometimes we need to check the number of observations having specific combinations of categorical variable levels.
To calculate cross-classification tables, the table() and xtabs() functions are our friends. There are a few differences between them, the main one being that xtabs() takes formulas to specify the crosstabulation.
color <- c("red", "green", "green", "red", "green", "green", "red")
shape <- c("round", "sharp", "sharp", "round", "round", "sharp", "round")
size <- c("tall", "tall", "small", "tall", "small", "small", "tall")
objects <- data.frame(color, shape, size)
xclass <- xtabs(data = objects)
str(xclass) # an integer array with extra attributes
## int [1:2, 1:2, 1:2] 1 0 2 0 0 3 1 0
## - attr(*, "dimnames")=List of 3
## ..$ color: chr [1:2] "green" "red"
## ..$ shape: chr [1:2] "round" "sharp"
## ..$ size : chr [1:2] "small" "tall"
## - attr(*, "class")= chr [1:2] "xtabs" "table"
## - attr(*, "call")= language xtabs(data = objects)
summary(xclass)
## Call: xtabs(data = objects)
## Number of cases in table: 7
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 9.674, df = 4, p-value = 0.0463
## Chi-squared approximation may be incorrect
as.data.frame(xclass) # ALL combinations between levels are shown
## color shape size Freq
## 1 green round small 1
## 2 red round small 0
## 3 green sharp small 2
## 4 red sharp small 0
## 5 green round tall 0
## 6 red round tall 3
## 7 green sharp tall 1
## 8 red sharp tall 0
ftable(xclass) # Another way to flatten cross-classification tables with more than 2 dimensions
## size small tall
## color shape
## green round 1 0
## sharp 2 1
## red round 0 3
## sharp 0 0
# to compute margin sums or apply other functions
addmargins(xtabs(formula = ~ color + shape, data = objects), FUN = sum)
## Margins computed over dimensions
## in the following order:
## 1: color
## 2: shape
## shape
## color round sharp sum
## green 1 3 4
## red 3 0 3
## sum 4 3 7
# to compute proportions (margin = 1, row-wise; 2, column-wise; not specified, whole table)
prop.table(xtabs(formula = ~ color + shape, data = objects))
## shape
## color round sharp
## green 0.1428571 0.4285714
## red 0.4285714 0.0000000
This is interesting, but we can do some more with reshape2.
In the movies data set, convert rating to a factor with 5 levels. Use the pretty() function to generate the value of the breaks argument of cut().
Cross-tabulate this new factor with the mpaa ratings and add sums on both sides.
ratings and add sums on both sides.Source Data Wrangling Cheatsheet
R follows a set of conventions that makes one layout of tabular data much easier to work with than others.
Your data will be easier to work with in R if it follows three rules:
each variable is in its own column;
each observation is in its own row;
each value is in its own cell.
Oftentimes you receive or gather data that is messy, i.e. the way the information is structured is not appropriate for efficient and straightforward analysis in R, and you need to tidy it up.
Occasionally, although you have a tidy dataset, as a part of the analysis you may also want to represent the data differently and possibly aggregate it, like in the contingency tables we just saw.
melt() and dcast() with reshape2
In short, you need to reshape your data and swing seamlessly between long and wide table shapes.
The standard R distribution has facilities to do this type of transformation: stack() and unstack() from the utils package, and reshape() from the stats package.
But, in my experience, they are difficult to use or are not as flexible as the two simple but powerful functions provided in the reshape2 package:
reshape2 is based around two key functions, melt and cast:
melt() takes a wide-format data frame and melts it into long-format data.
dcast() takes a long-format data frame and casts it into wide-format data.
Let's see how to do that with a slightly more explicit toy data set:
magicDrug <- data.frame(Name = rep(LETTERS[1:2], each = 2, times = 2),
When = rep(c("before", "after"), times = 4),
Speed = c(1, 2, 3, 5, 0, 1, 4, 5),
Weight = c(10, 3, 20, 6, 11, 2, 23, 9)
)
First melt:
library(reshape2)
mdata <- melt(data = magicDrug,
              id.vars = c("Name", "When"), # Specified but unnecessary because they're factors
              measure.vars = c("Speed", "Weight"), # Specified but unnecessary because they're numeric
              variable.name = "IndPerformances", # What do these variables have in common?
na.rm = TRUE, # as a reminder for NAs
value.name = "value") # the default is good
str(mdata)
## 'data.frame': 16 obs. of 4 variables:
## $ Name : Factor w/ 2 levels "A","B": 1 1 2 2 1 1 2 2 1 1 ...
## $ When : Factor w/ 2 levels "after","before": 2 1 2 1 2 1 2 1 2 1 ...
## $ IndPerformances: Factor w/ 2 levels "Speed","Weight": 1 1 1 1 1 1 1 1 2 2 ...
## $ value : num 1 2 3 5 0 1 4 5 10 3 ...
Now, we can dcast() the melted data in various ways to get some insight:
# again formula specification of the layout
# mean() for aggregation
dcast(data = mdata, formula = IndPerformances ~ When, fun.aggregate = mean)
## IndPerformances after before
## 1 Speed 3.25 2
## 2 Weight 5.00 16
### !!! Note that the fun.aggregate function should take a vector of numbers !!!
### !!! and return a single summary statistic !!!
dcast(mdata, Name ~ IndPerformances, mean)
## Name Speed Weight
## 1 A 1.00 6.5
## 2 B 4.25 14.5
dcast(mdata, Name + When ~ IndPerformances, length)
## Name When Speed Weight
## 1 A after 2 2
## 2 A before 2 2
## 3 B after 2 2
## 4 B before 2 2
In some regards, the iris dataset seems a little untidy. Let’s try to tidy it up.
library(reshape2)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Add a flowerID variable to enable keeping track of what was measured on what flower.
Split the molten variable column into object and dimension and bind them to the melted iris data frame.
Alternatively, merge() the melted iris data frame with a custom made recoding data frame on the Object.Dimension column.
Cast a table with Species and object as rows and dimension as columns, containing mean values. (A sketch of the first steps follows.)
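A sketch of the first two steps (colsplit() is reshape2's helper for splitting a character column):
iris2 <- cbind(flowerID = seq_len(nrow(iris)), iris)
miris <- melt(iris2, id.vars = c("flowerID", "Species"))
miris <- cbind(miris, colsplit(miris$variable, pattern = "\\.", names = c("object", "dimension")))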
dplyr, a unifying framework for data frame manipulation and aggregation
Fundamental data processing functions exist in R, but they are rather disparate, lacking a consistent interface and the ability to easily flow together.
This makes it difficult to read, understand and write concise code.
The dplyr package attempts to overcome these limitations.
Package needed (beside the base distribution): dplyr
The dplyr package philosophy is that preparing data in the form of a data frame for a specific analysis always involves applying a combination of elementary manipulations.
Each of these basic manipulations has been formalized with a function named with a verb:
| Main dplyr function | Effect | Default R equivalent (pseudo code) |
|---|---|---|
| filter() | keep rows matching criteria | subset() |
| slice() | pick rows using index(es) | df[1:10, ] |
| distinct() | extract distinct (unique) rows | unique() |
| sample_n() | randomly sample rows | df[sample(1:nrow(df), n), ] |
| arrange() | reorder rows | df[order(df$factKey), ] |
| select() | pick columns by name | subset() or df[ , c("ColB", "ColD")] |
| rename() | rename specific columns | colnames(df)[colnames(df) == "ColA"] <- "ColX" |
| mutate() | add columns that are f() of existing ones | transform() or cbind(df, X = f(df$ColA)) |
| xxx_join() | combine data frames | merge() |
| intersect(), etc. | set operations between data frames | (none) |
| group_by() %>% summarise() | reduce variables to values | aggregate() and by() |
All verbs share the same verb() syntax: the first argument is a data frame, and there is no need for $ to refer to df columns, like with subset() and transform().
Below we follow a dplyr tutorial to highlight neat features that are more tedious to emulate with base R functions.
# download the 'flights.csv' file from the dropbox page and save it to your current working dir
library(dplyr)
flights <- tbl_df(read.csv("flights.csv", stringsAsFactors = FALSE))
## Warning in file(file, "rt"): cannot open file 'flights.csv': No such file
## or directory
## Error in file(file, "rt"): cannot open the connection
# or
# install.packages("hflights")
# library(hflights) ; flights <- hflights
# contains flights that departed from Houston in 2011
glimpse(flights)
filter(flights, dest %in% c("SFO", "OAK"))
filter(flights, arr_delay > 2 * dep_delay)
select(flights, arr_delay:dep_delay)
flights <- mutate(flights, speed = dist / (time / 60))
arrange(flights, desc(speed))
The %>% (pipe) operator
dplyr imports the %>% operator developed in the R package magrittr.
It turns nested calls like f( g( h(x), z = 1), y = 1 )
into readable chains: h(x) %>% g(z = 1) %>% f(y = 1)
Chaining dplyr functions: select the two delay variables for flights whose arrival delay is more than twice the departure delay.
flights %>% filter(arr_delay > 2 * dep_delay) %>% select(arr_delay:dep_delay)
group_by() for operating on subsets of the table
Call group_by() on a table, specifying the grouping variables. Subsequent verbs then operate by group:
filter: select rows meeting a criterion, by group
summarise: apply an aggregation function to summarise each group
mutate: calculate a summary value by group and append it to the table
flights %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(-avg_delay)
Calculate the mean time by dest and by carrier, and divide time by this value to create a new column carrierRelativeTime.
flights %>%
group_by(dest, carrier) %>%
mutate(meanTimeByDByC = mean(time, na.rm = TRUE),
carrierRelativeTime = time / meanTimeByDByC) %>%
select(dest:carrierRelativeTime)
Do you understand the different behaviour of summarise and mutate when used alongside group_by? Compare the nrow of the respective tables.
do() is a by()-like function for fine control over computed values
do(), another verb of the dplyr vocabulary, performs arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list.
Let's try to rerun the previous iris example that could not run with aggregate() but could with by(), although the output was a list.
# determining the correlation between dimensions of flower parts.
aggregate(iris[1:4], by = iris["Species"], FUN = cor, method = "spearman")
## Error in FUN(X[[i]], ...): supply both 'x' and 'y' or a matrix-like 'x'
# in dplyr jargon, that translates to:
myCors <- iris %>%
group_by(Species) %>%
do(cors = cor(subset(., , Sepal.Length:Petal.Width)))
myCors
## Source: local data frame [3 x 2]
## Groups: <by row>
##
## Species cors
## 1 setosa <dbl[4,4]>
## 2 versicolor <dbl[4,4]>
## 3 virginica <dbl[4,4]>
str(myCors[1, ])
## Classes 'tbl_df' and 'data.frame': 1 obs. of 2 variables:
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1
## $ cors :List of 1
## ..$ : num [1:4, 1:4] 1 0.743 0.267 0.278 0.743 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
Better than that:
# Here, do() returns several results joined together in a data frame row
myCors <- iris %>%
group_by(Species) %>%
do(data.frame(mpl = mean(.$Petal.Length),
msl = mean(.$Sepal.Length),
cors = I(list(cor(subset(., , Sepal.Length:Petal.Width))))
)
)
myCors
## Source: local data frame [3 x 4]
## Groups: Species
##
## Species mpl msl cors
## 1 setosa 1.462 5.006 <dbl[4,4]>
## 2 versicolor 4.260 5.936 <dbl[4,4]>
## 3 virginica 5.552 6.588 <dbl[4,4]>
str(myCors[1, ])
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 4 variables:
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1
## $ mpl : num 1.46
## $ msl : num 5.01
## $ cors :List of 1
## ..$ : num [1:4, 1:4] 1 0.743 0.267 0.278 0.743 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## .. .. ..$ : chr "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## - attr(*, "vars")=List of 1
## ..$ : symbol Species
## - attr(*, "drop")= logi TRUE
## - attr(*, "indices")=List of 1
## ..$ : int 0
## - attr(*, "group_sizes")= int 1
## - attr(*, "biggest_group_size")= int 1
## - attr(*, "labels")='data.frame': 1 obs. of 1 variable:
## ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1
## ..- attr(*, "vars")=List of 1
## .. ..$ : symbol Species
To sum up: with its verbs, group_by(), summarise() and do(), dplyr helps efficiently solve most of the routine data frame manipulation tasks ;o)
Find all flights that departed at midnight.
Calculate the min and max values of time and the standard deviation of speed for each destination (beware of NAs).
Try to guess (no code) the cause of the NA value of sdS for the flight(s) to Augusta.
Find the number of flights that departed to JFK for each day of the year (tip: use n()).
Pick the three most delayed flights at departure for each carrier (tip: slice() on carrier groups).
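One possible approach (a sketch, reusing the carrier and dep_delay columns from the examples above):
flights %>% arrange(desc(dep_delay)) %>% group_by(carrier) %>% slice(1:3)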
Create two subsamples of flights by randomly sampling 30% of the flights that went to SFO or LAX. Figure out how many flights they have in common, and how many are specific to each of them.
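A sketch using sample_frac() and dplyr's set operations:
sfo_lax <- filter(flights, dest %in% c("SFO", "LAX"))
s1 <- sample_frac(sfo_lax, size = 0.3)
s2 <- sample_frac(sfo_lax, size = 0.3)
nrow(intersect(s1, s2)) # flights in common
nrow(setdiff(s1, s2))   # flights specific to s1 (likewise setdiff(s2, s1))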