Sept. 7-11, 2015

Updated tentative program of the week

Day 1 - Generalities on R

  • Short overview of R distributions, installation and packages
  • Finding, requesting and understanding R help: surfing a sea of resources
  • Writing and managing your R code
  • Getting to know the R IDE RStudio
  • Introduction to Sweave and knitr
  • Data structures in R: vectors, factors, matrices, data frames and lists
  • Mathematical and statistical computations on numeric and logical vectors: basic functions and operators
  • Data manipulation I: play with the vector and make it your toy

Day 2 - R programming

  • Data manipulation II: be the master of arrays (matrices), lists and data frames
  • Writing and calling functions in R
  • An introduction to object-oriented programming in R: S3 classes and generic functions
  • Loops and conditional execution.
  • Character strings manipulation
  • The basics of R – operating system interactions

Day 3 - Graphics

  • R Graphics: low- and high-level graphical functions and graphical parameters (par())
  • The lattice and ggplot2 packages
  • A few examples of specific graphical functionalities

Day 4 - Statistics

  • Linear models: simple regression, multiple regression, ANOVA, ANCOVA, statistical tests, formulas, continuous and discrete predictors, interactions.
  • Generalized linear models (GLM): distributions, maximum likelihood, statistical tests

Day 5 -

  • Workshop on your own projects/data
  • Discussion on difficulties/common problems and possible solutions

Short overview of R distributions, installation and packages

Comprehensive R Archive Network (CRAN)

R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

https://www.r-project.org/

R version 3.2.2 (Fire Safety) was released on 2015-08-14.

There are R distributions for Linux, Mac, Windows… platforms.

The installation procedure for your favorite platform should be straightforward if you follow the instructions from CRAN.

On Windows, upgrading R versions may be a pain, but see the R Windows FAQ.

On Linux systems, in my experience, the most straightforward way to stay up to date is to register a CRAN repository in your favorite package manager.

Our first interactions with R

  • R is a lot of things, but from a user perspective, apart from being a language, it is primarily:

    • an interpreter that interactively executes the commands you write at the prompt when you press the Enter key.
    • a graphical display that opens windows (devices) to show graphics created by executing some code.
  • First, start R if not done already, and let's try to open a graphical window:

hist(rnorm(1000))

  • Now let's execute a few simple expressions in the interpreter to warm up your fingers
1 + 1

# 124é"famoi'_(ù*$=+" - Use a hash '#' to add comments that are just ignored 

a <- "hello" # assigning a value to a symbol/name
A <- 1

# If you simply enter the variable name or expression at the command prompt,
# R will PRINT its value.
a
A

2 <- "error" # fails on purpose: a number cannot be used as a name

?Reserved

B <- 2
A + B

C <- c(A, B)
C

mode(C)

log(x=64, base=4) # reminder about functions and parameters
log(64, 4)

# Missing data
y <- c(1,2,3,NA)
is.na(y) # returns a logical vector (FALSE FALSE FALSE TRUE)

# Not a Number (NaN)
sqrt(-9) # returns NaN, with a warning

# close your session
q()

Does this sound familiar to all of you?

R packages are the standardized way of extending R

  • The standard R distribution comes with an amazing number of tools, but you may need to perform more specific tasks.

  • To extend R's functionality, you will want to download, install and use additional packages.

  • Packages (libraries, modules, … in other languages) are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.

  • The standard distribution comes with several packages (base, compiler, datasets, grDevices, graphics, …)

  • Repositories hold collections of R packages and have mechanisms to download and install them on your system (99% of the time, it is really easy).

  • "Mainstream" repositories are CRAN, the Omega project and Bioconductor.
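To make the vocabulary above concrete, here is a minimal sketch that shows where the library lives on your system and which packages ship with the standard distribution. It only relies on functions from base and utils; the variable names (pkgs, basePkgs) are illustrative.

```r
# Directories that R searches for installed packages (the "library" trees)
.libPaths()

# installed.packages() returns a character matrix, one row per package.
# Packages shipped with the R distribution itself have Priority "base".
pkgs <- installed.packages()
basePkgs <- rownames(pkgs)[pkgs[, "Priority"] %in% "base"]
basePkgs
```

The exact paths and the number of additional packages will differ from one machine to another.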

Package repositories and installation

  • The primary repository is CRAN. It holds about 7000 (!!!) packages. The easiest way to install from CRAN is just to use install.packages("packageName")
# Try and install:
install.packages(c("ape", "devtools"))
  • Another popular repository is Bioconductor.

Bioconductor is bioinformatics-oriented

  • It supports many types of high-throughput sequencing data (DNA, RNA, ChIP, methylomes and ribosome profiling, …) and associated annotation resources, and covers microarrays, proteomic, metabolomic, flow cytometry, quantitative imaging, cheminformatic and other high-throughput data:
    • 1024 software packages
    • 241 experiment data packages
    • 917 up-to-date annotation packages.

source("http://bioconductor.org/biocLite.R") # fetch and execute the code that defines the biocLite() function:
biocLite("packageName")
  • GitHub is becoming increasingly popular as a repository for R packages:
    First install devtools from CRAN (!…) and then call the devtools::install_github() function.

  • To remove packages: remove.packages(pkgs="packageName")
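As a hedged sketch of how the installation step can be made repeatable in a script, the code below installs only the packages that are not yet present. The wish list is taken from the example above; the variable names and the interactive() guard are assumptions made for illustration (the guard avoids triggering downloads when the script runs in batch mode).

```r
# Wish list of CRAN packages (same names as in the example above)
wanted <- c("ape", "devtools")

# Keep only the ones that are not installed yet
missingPkgs <- setdiff(wanted, rownames(installed.packages()))
missingPkgs

# Install what is missing, but only in an interactive session
if (length(missingPkgs) > 0 && interactive()) {
  install.packages(missingPkgs)
}
```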

Using packages

  • To know what packages are installed on your system: installed.packages()

  • To use resources from a package, you generally attach it to your working environment with library() (or require()). This loads the resources from the package into memory and attaches the package to your search() path.

search() # list of attached R packages and objects
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"
library(ape) # or
library("ape")
search()
##  [1] ".GlobalEnv"        "package:ape"       "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
  • When would you want to just load a package without attaching it?
    • if you just need one function quickly: this avoids crowding your search path unnecessarily (which decreases performance and may create name conflicts).
    • to overcome name conflicts by specifying where to find the object.

session_info() # This function is from devtools, which is not attached
## Error in eval(expr, envir, enclos): could not find function "session_info"
devtools::session_info() # loads devtools if not already done and calls session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_US:en                    
##  collate  en_US.UTF-8                 
##  tz       <NA>
## Packages ------------------------------------------------------------------
##  package   * version date       source        
##  ape       * 3.3     2015-05-29 CRAN (R 3.2.1)
##  curl        0.9.2   2015-08-08 CRAN (R 3.2.2)
##  devtools    1.8.0   2015-05-09 CRAN (R 3.2.2)
##  digest      0.6.8   2014-12-31 CRAN (R 3.1.2)
##  evaluate    0.7     2015-04-21 CRAN (R 3.2.0)
##  formatR     1.2     2015-04-21 CRAN (R 3.2.0)
##  git2r       0.10.1  2015-05-07 CRAN (R 3.2.0)
##  htmltools   0.2.6   2014-09-08 CRAN (R 3.1.2)
##  knitr       1.10.5  2015-05-06 CRAN (R 3.2.0)
##  lattice     0.20-33 2015-07-14 CRAN (R 3.2.2)
##  magrittr    1.5     2014-11-22 CRAN (R 3.2.0)
##  memoise     0.2.1   2014-04-22 CRAN (R 3.1.1)
##  nlme        3.1-122 2015-08-19 CRAN (R 3.2.2)
##  Rcpp        0.12.0  2015-07-25 CRAN (R 3.2.2)
##  rmarkdown   0.7     2015-06-13 CRAN (R 3.2.1)
##  rversions   1.0.2   2015-07-13 CRAN (R 3.2.2)
##  stringi     0.5-5   2015-06-29 CRAN (R 3.2.1)
##  stringr     1.0.0   2015-04-30 CRAN (R 3.2.0)
##  xml2        0.1.1   2015-06-02 CRAN (R 3.2.1)
##  yaml        2.1.13  2014-06-12 CRAN (R 3.2.0)
# run ?"::" if you want.
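The name-conflict case can be reproduced without installing anything. In this small sketch, a user-defined function (hypothetical, written only for the demonstration) masks head() from the utils package, and the :: operator is used to reach the original.

```r
# A user-defined function that masks utils::head() on the search path
head <- function(x, ...) "my own head()"

maskedResult <- head(letters)               # finds the masking version first
explicitResult <- utils::head(letters, 3)   # explicitly asks for the utils version

maskedResult
explicitResult

rm(head)   # remove the masking function again
```

After rm(head), a plain call to head() reaches the utils version again.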

  • Calling detach("package:ape") only detaches a package; it does not unload it.
    To unload it, you will need unloadNamespace("ape").
    In practice this is rarely used.
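The difference between detaching and unloading can be checked in a fresh session. This is a minimal sketch that uses splines, a package shipped with R, instead of ape so that it runs without any installation; the variable names are illustrative, and it assumes that no other loaded package depends on splines.

```r
library(splines)                     # load and attach the package
"package:splines" %in% search()      # it is on the search path

detach("package:splines")            # remove it from the search path only
attachedAfter <- "package:splines" %in% search()
loadedAfter <- "splines" %in% loadedNamespaces()   # namespace still in memory

unloadNamespace("splines")           # now really unload it
stillLoaded <- "splines" %in% loadedNamespaces()

c(attachedAfter = attachedAfter, loadedAfter = loadedAfter, stillLoaded = stillLoaded)
```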

Exercise

We are probably going to need the following packages: "ape", "reshape2", "dplyr", "lattice", "ggplot2", "VennDiagram".

  • These should be hosted on CRAN. Install them on your system.
install.packages(c("ape", "reshape2", "dplyr", "lattice", "ggplot2", "VennDiagram"))
  • "Biostrings" is a Bioconductor package. Please, install it as well.
source("http://bioconductor.org/biocLite.R") # fetch and execute the code that defines the biocLite() function:
biocLite("Biostrings")
  • Test if you can use them with library()

Please, let us know if anything weird happened.

Finding, requesting and understanding R help: surfing a sea of resources.

Clearly, one of R's distinctive features is that documentation resources, in broad terms, are extremely abundant:

  • R's documentation: traditionally, the R developer community has emphasized writing extensive help documents.
  • Books, tutorials, discussions: R's user base is pretty large and the online community is very active.

Searching your local R documentation for help on a topic:

  • If you have a specific name in mind: documentation on a topic with that name (typically an R object or a data set)
    • help("name") or ?name : Beware, help() searches only in loaded packages.
    • help.start() : starts the HTML version of help()
?summary
  • If you look for a topic and do not know the function: Search the LOCAL help system documentation matching a given character string in the name, alias, title, concept or keyword fields
    • help.search("topic") or ??topic
??"\\{"
??DNA
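When only a fragment of a function name is known, apropos() and find() from the utils package can complement help.search(): they match the names of objects on the search path instead of help pages. A minimal sketch (the variable name readFuns is illustrative):

```r
# All visible objects whose name starts with "read."
readFuns <- apropos("^read\\.")
readFuns

# Which attached package provides read.csv()?
find("read.csv")
```

Like help(), these functions only see what is currently attached, so a function from a package that is installed but not attached will not show up.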

Anatomy of an R function documentation page

  • To illustrate the various sections of a function's documentation, let's type
?paste
  • To take a look at an example documentation for a dataset:
?iris
  • Cool functions:
example("paste") # Run the code in the "examples" section of a function doc
demo("graphics") # some packages offer a demo of their functionalities!
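The "Usage" and "Arguments" sections of a help page are easier to read with a small experiment at hand. As a sketch based on ?paste, the code below contrasts its sep and collapse arguments; the variable names are illustrative.

```r
# Show the argument list of paste() as a quick reminder
args(paste)

# 'sep' is placed between the arguments, element by element
pasted <- paste(c("a", "b"), 1:2, sep = "-")
pasted      # "a-1" "b-2"

# 'collapse' then joins the resulting elements into a single string
collapsed <- paste(c("a", "b"), 1:2, sep = "-", collapse = "+")
collapsed   # "a-1+b-2"
```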

R Reference Card or Cheat Sheet

Cheat sheets are very convenient especially when you are not familiar with a language or a package:

  • they remind you of specific aspects
  • they are a great overview of what you can do.

https://cran.r-project.org/doc/contrib/Short-refcard.pdf

If you can open it, find functions that may help you find help…

R manuals, tutorials, books : references used to prepare the slides

R manuals, tutorials, books…

  • If you do not know what manual to pick, you may want to start with this selection of 60+ R resources classified by topic and type

Reaching out for help: what mailing lists exist for R?

  • The most relevant ones for a user:
  • If you do not want to be politely patronized or humorously mocked read a posting guide (for example here or here) before sending anything to any mailing list:
    • do your homework
    • provide a clear explanation of what you want
    • provide a simple reproducible example of code if relevant
    • do not forget your sessionInfo() (very useful for diagnosis)
print(sessionInfo(), locale = FALSE)
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ape_3.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0      lattice_0.20-33  digest_0.6.8     grid_3.2.2      
##  [5] nlme_3.1-122     git2r_0.10.1     formatR_1.2      magrittr_1.5    
##  [9] evaluate_0.7     stringi_0.5-5    curl_0.9.2       rstudioapi_0.3.1
## [13] xml2_0.1.1       rmarkdown_0.7    devtools_1.8.0   tools_3.2.2     
## [17] stringr_1.0.0    yaml_2.1.13      rversions_1.0.2  memoise_0.2.1   
## [21] htmltools_0.2.6  knitr_1.10.5
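A "simple reproducible example" usually means a few lines that anyone can paste into a fresh session. Here is one possible sketch: a small made-up data frame (the object toy and its columns are hypothetical), a fixed random seed, and dput() to share the data as code.

```r
set.seed(42)   # make the random numbers reproducible

# A small, made-up data set that is enough to show the problem
toy <- data.frame(group = rep(c("a", "b"), each = 3),
                  value = round(rnorm(6), 2))

# dput() prints R code that recreates 'toy' exactly;
# paste that output into your message instead of attaching a file
dput(toy)
```

Together with the output of sessionInfo(), this gives readers of the mailing list everything they need to reproduce what you see.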

Misc. tools to find what you want

  • Task views are a great place to start when you want to do something but do not know what tool to use, since you get a fairly comprehensive overview of what’s available on a topic. Let's see for example what is available for "High-Performance and Parallel Computing with R".

  • Whenever available, READ the package vignette, which is a practical and concise guide to your package that illustrates its key functionalities.
    Go directly to the package page on CRAN or use the functions browseVignettes("packagename") and vignette(x)

  • For people interested in high-throughput genomic data, take a look at the Bioconductor website. It provides workflows, which are step-by-step guides to certain types of analysis. Bioconductor vignettes are of excellent quality and are the first tutorials to try.

  • Use online search engines:

Writing and managing your R code

Tips for efficiently writing understandable and reusable code

Save your code in script files.

  • No matter what you do, from a quick t-test on your last experiment data to a comprehensive analysis of RNA-seq data from a consortium of labs, save your code for later!

  • R script files are just plain text files with an .R extension, that is it!

  • Why keep records of your code?
    • Document for reproducible research
    • Re-use general purpose functions
  • How to organise functional units of code, from a single script to a package:
    • Start in one script and separate code into logically separable units, often using functions.
    • Group functions together in one section of the file (often the top)
    • Keep functions focused: each should do one thing. If one grows beyond ~15 lines of code, ask yourself whether you can split the task.

    • Consider breaking your script up into several files containing logical units.
      • If scrolling around among logically unrelated units is an annoyance.
      • If your functions are generic/mature enough to be reused in other scripts.
    • You can then use source() to execute the code from R scripts and load your functions in your session.
    • There are intermediate alternative mechanisms of organizing reusable code into units without necessarily writing a package.
    • Ultimately, you will want to organize your code into a genuine R package
  • If you have no idea what "writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation" means, read this accessible "Best Practices for Scientific Computing" PLoS Biology paper
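The step from "one script" to "functions in a separate file loaded with source()" can be sketched as follows. The file name myFunctions.R and the function standardError() are hypothetical, and the helper file is written to a temporary directory only so that the example is self-contained.

```r
# Write a tiny helper script to a temporary file
# (in real life you would create this file in your editor)
scriptFile <- file.path(tempdir(), "myFunctions.R")
writeLines(c(
  "# standardError(): standard error of the mean, ignoring NAs",
  "standardError <- function(x) {",
  "  x <- x[!is.na(x)]",
  "  sd(x) / sqrt(length(x))",
  "}"
), scriptFile)

# In the analysis script: load the helper functions, then use them
source(scriptFile)
standardError(c(1, 2, 3, NA))
```

Keeping such general-purpose functions in their own file makes it easy to reuse them in other scripts with a single source() call.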

Coding style

If you (or someone else) want to be able to quickly grasp what you wrote a few weeks after you actually wrote it, there are a few tips you should follow:

  • add comments, # even if you think you will never read it again

  • use descriptive and judicious variable names (e.g. samplingLocations rather than x); this is a difficult art.

  • use indentation to reflect code structure (in if construct, function definitions, …)

  • use a consistent formatting style (e.g. DeletedObservations or delete_observations but not both)

  • keep nesting to a strict minimum (f(g(h(x)))): hard balance between compactness vs. wordiness
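The trade-off between compact and wordy code can be seen on a toy computation. Both versions below give the same result; the data and variable names are made up for the illustration.

```r
x <- c(-4, 9, -16, 25)

# Compact, but three nested calls have to be read from the inside out
compact <- round(mean(sqrt(abs(x))), 1)

# Same computation with named intermediate steps
absoluteValues <- abs(x)
roots <- sqrt(absoluteValues)
stepwise <- round(mean(roots), 1)

identical(compact, stepwise)   # TRUE
```

Which version reads better depends on how meaningful the intermediate names are; here the nesting is still short enough to be acceptable.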

For more detailed, R-oriented advice, please try to browse one of these guides for tomorrow:

Exercise

Let's first fetch the zip archive and extract it in a directory named "Rtrainning_201509" in your user home.

Exercise

Type

history()

Save the commands of your session history into a file with:

savehistory(file = "Rtrainning_20150907.R")

Find out where this file has been created by executing

getwd()

With the file explorer, make sure it is there, move it into the RTrainning folder and open it with a text editor to look at its content.

Exercise

Run the code contained in the script file sourceMe.R

source(file = "sourceMe.R", echo = FALSE)

Getting to know the R IDE RStudio

An Integrated Development Environment (IDE) brightens up your day

  • Select and run code
  • Syntax highlighting
  • On the fly syntax check
  • Code autocompletion
  • Code indentation
  • Easy invocation of help for language elements
  • Variable name occurrence highlighting across a script
  • Variable renaming across a script
  • Project/Package management
  • Version Control System interactions
  • And so on…

A non-exhaustive list of IDEs and editors

  • RStudio - R-specific IDE
  • ESS (Emacs Speaks Statistics) - package for Emacs and XEmacs
  • Architect - a remix of the Eclipse IDE with the StatET plugin
  • TERR - commercial IDE with its own R engine
  • Live-R - R IDE in a browser
  • JGR - Java-based GUI for R
  • Tinn-R - R-specific code editor
  • Sciviews-K - Extension for the Komodo IDE
  • NppToR - plugin for Notepad++
  • Vim-R - plugin for Vim
  • Rgedit - plugin for gedit and pluma
  • Deducer R Editor

Let's start RStudio on your machines!

RStudio: selected keyboard shortcuts

For the complete list go there

Description                                Windows & Linux     Mac
Run current line/selection                 Ctrl+Enter          Command+Enter
Comment/uncomment current line/selection   Ctrl+Shift+C        Command+Shift+C
Insert assignment operator                 Alt+-               Option+-
Reformat Selection                         Ctrl+Shift+A        Command+Shift+A
Find and Replace                           Ctrl+F              Command+F
Attempt completion                         Tab or Ctrl+Space   Tab or Command+Space
Show help for function at cursor           F1                  F1
Show source code for function              F2                  F2

Exercise with RStudio

  • Open your "Rtrainning_20150907.R" file.

  • The graphical interface has 4 panels (layout customisable):
    • The code and data editor
    • The R console
    • Files, plots, packages and help.
    • Workspace and history.
  • Select some commands (preferably not the package installation ones) and run them (use the keyboard)

  • Go to the history tab and run these commands again by clicking 'To console'.

  • Find help on the function library() (use keyboard)

  • Explore the other tabs: "Files", "Plots", "Packages", "Viewer"

Exercise with RStudio

  • Run
head(iris)
  • Look at the "Environment" tab (grid vs list display).
  • Try to find USArrests in the package:datasets environment. What is it?
  • Go back to Global, remove one object.
  • Run these commands:
installed.packages()
ls()
str(iris)
ls.str()
rm(A)

Exercise

Source Prof. Daniel Wegmann

  • Assign the values 6.7 and −56.3 to variables a and b, respectively
  • Use R to calculate (2*a)/b+a*b and assign the results to variable x
  • Use help.search() to find out how to compute the square root of variables and compute the square root of a and b
  • Quit RStudio and save the work space
  • Start RStudio again and check that all variables you created are still there.
  • Use R to calculate log(x) and assign the result to variable y
  • Quit RStudio without saving the work space.
  • Restart RStudio and check that the variables a, b and x exist, but not y.
  • Write a new script to assign the values 75 and 0.1 to the variables u and v, respectively, and to print(u, v).
  • Execute the script in full and line by line.
  • Save the script on your computer, close the script, exit RStudio, start RStudio again, reopen the script and execute it again.
  • Go back to "Rtrainning_20150907.R" and type a command to run the previous script.

Exercise

Copy the line below and paste it in the code editor and execute it.

anticonstitucionalissimamente <- lm(iris)

Write the line below by typing only 1-3 letters (not counting $ signs). Tip: use Tab…

anticonstitucionalissimamente$model$Petal.Width

Exercise

  • Type head(iris) in your script and execute the code.
  • In RStudio, find the iris dataset in the datasets environment.
  • What is it?
  • How many columns?
  • What is the type of data in the columns?
  • How many observations?