R best practices

Daniel Himmelstein

doi:10.15363/thinklab.d83

Project:

Rephetio: Repurposing drugs on a hetnet [rephetio]

Daniel Himmelstein Researcher June 18, 2015

R is a data scientist's dream but a programmer's nightmare.

Here I'll describe the R programming principles and practices that @leobrueggeman and I will try to adhere to for this project. We are big believers in the Hadleyverse — a philosophy of R programming, data analysis, and visualization — spearheaded by Hadley Wickham. I'll describe the basics as well as our modifications below.

Style

We follow the Hadley style guide, which builds on the older Google style guide. When calling a function from a package (any non-base function), use double colons to make its provenance clear (e.g. dplyr::filter() rather than just filter()).
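To illustrate why the double colon matters, here is a minimal sketch. The data frame and its column names are made up for the example; the point is that a bare filter() resolves to stats::filter() or dplyr::filter() depending on which packages happen to be attached.

```r
# Toy data, invented for this example
df <- data.frame(drug = c("a", "b", "c"), score = c(0.2, 0.9, 0.5))

# Row filtering from dplyr, unambiguous regardless of what is attached
high <- dplyr::filter(df, score > 0.4)

# The unrelated time-series filter from stats, also named explicitly
smoothed <- stats::filter(df$score, rep(1 / 2, 2))
```

With explicit namespaces, the script behaves the same whether or not library(dplyr) was called earlier in the session.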

Data format

The preferred storage format for tabular data is tab-separated text with a .tsv extension. Always include column names. For compression, use gzip with a .gz extension.
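A minimal round trip might look like the sketch below. It assumes a reasonably recent readr, where write_tsv() compresses based on the file extension and read_tsv() decompresses .gz files transparently; the file name and columns are invented for the example.

```r
# Toy table, invented for this example
df <- data.frame(compound = c("aspirin", "caffeine"), n = c(3L, 5L))

# Write to a temporary location so the example leaves no files behind
path <- file.path(tempdir(), "compounds.tsv.gz")

# The .gz extension triggers gzip compression (assumes a recent readr)
readr::write_tsv(df, path)

# Reading infers the compression from the extension as well
roundtrip <- readr::read_tsv(path)
```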

Tabular data in R should be in data frames. Avoid relying on row names by making a dedicated column for the attribute. options(stringsAsFactors = FALSE) is imperative but may not need to be explicitly called if reading data with readr and manipulating data with dplyr and tidyr. Create data frames using dplyr::data_frame().
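As a sketch of the row-name advice, the built-in mtcars dataset stores car models as row names, and tibble::rownames_to_column() moves them into a regular column. The column name "model" is my choice. Note that in current releases tibble::tibble() is the successor to dplyr::data_frame(), so the sketch uses it for creating a new table.

```r
# Move row names into a dedicated column (the name "model" is arbitrary)
cars <- tibble::rownames_to_column(mtcars, var = "model")

# Build a small table directly; strings stay character, not factor
compounds <- tibble::tibble(
  compound = c("aspirin", "caffeine"),
  n = c(3L, 5L)
)
```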

Tables should be tidy [1]:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
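For example, a table with one column per dose is not tidy, because dose is a variable hiding in the column names. The sketch below reshapes it with tidyr::pivot_longer(), which assumes tidyr 1.0 or later (older versions would use tidyr::gather()). The data and column names are invented.

```r
# Untidy: each dose is its own column (toy data)
wide <- data.frame(
  compound = c("aspirin", "caffeine"),
  dose_1 = c(0.1, 0.4),
  dose_10 = c(0.3, 0.8)
)

# Tidy: one row per compound-dose observation
tidy <- tidyr::pivot_longer(
  wide,
  cols = c(dose_1, dose_10),
  names_to = "dose",
  values_to = "response"
)
```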

Piping

Consecutive commands should be chained together using the magrittr pipe (%>%) when possible. Piping improves readability and avoids unnecessary variables cluttering the workspace.
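Here is a small sketch of a piped chain on the built-in mtcars dataset. The pipe operator itself has to be attached, so magrittr is loaded with library(); everything else keeps its namespace prefix. The summary column name mean_mpg is my own.

```r
library(magrittr)  # attaches %>%

# Mean fuel economy of four-cylinder cars by number of gears,
# with no intermediate variables left in the workspace
result <- mtcars %>%
  dplyr::filter(cyl == 4) %>%
  dplyr::group_by(gear) %>%
  dplyr::summarize(mean_mpg = mean(mpg)) %>%
  dplyr::arrange(dplyr::desc(mean_mpg))
```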

Visualization

ggplot2 is the preferred plotting package [2, 3]. Avoid the shortcut function ggplot2::qplot(). Consider cowplot or ggplot2::theme_bw() to avoid the ugly default theme. Make sure all text is readable. If text is too small to read, remove it. Exporting to vector formats (PDF and SVG) is preferred unless rendering is so intensive that PNG is needed.
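A minimal sketch of these recommendations, using the built-in mtcars dataset. The axis labels, base font size, output file name, and dimensions are all choices made for the example, not project requirements.

```r
# Full ggplot() call rather than qplot(), with a cleaner theme
plot <- ggplot2::ggplot(mtcars, ggplot2::aes(x = wt, y = mpg)) +
  ggplot2::geom_point() +
  ggplot2::labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  ggplot2::theme_bw(base_size = 14)  # larger base size keeps text readable

# Export as a vector image; ggsave() picks the device from the extension
path <- file.path(tempdir(), "mpg-vs-weight.pdf")
ggplot2::ggsave(path, plot, width = 5, height = 4)
```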

Development environment

RStudio is a mature development environment. For most prototyping work, notebooks are more powerful and straightforward. Consider using Jupyter (IPython) notebooks with IRKernel.

Version control

All code should be version controlled. We use git hosted on GitHub. These short videos are recommended for inexperienced git users.

Link to specific files on GitHub with the commit hash for immutability and durability.
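As a sketch of what such a link looks like, the hypothetical helper below assembles a permalink from its parts. The function name, owner, repository, path, and hash are all placeholders. The point is that the commit hash takes the place of a branch name in the URL, so the link keeps pointing at the same file contents even after the branch moves on.

```r
# Hypothetical helper: build a GitHub link pinned to a specific commit
github_permalink <- function(owner, repo, commit, path) {
  # Accept abbreviated or full hexadecimal commit hashes
  stopifnot(grepl("^[0-9a-f]{7,40}$", commit))
  sprintf("https://github.com/%s/%s/blob/%s/%s", owner, repo, commit, path)
}

# Placeholder values, not a real repository
url <- github_permalink(
  owner = "example-user",
  repo = "example-repo",
  commit = "0123456789abcdef0123456789abcdef01234567",
  path = "analysis/plot.R"
)
```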

Additional materials

Check out the RStudio webinars as well as the cheatsheets.

Antoine Lizee: RStudio is getting better with every iteration and has improved significantly since this post. Many features (smart completion, tight git integration, ...) make it, in my opinion, a better IDE today than Jupyter notebooks.

Daniel Himmelstein Researcher Aug. 29, 2015

Conda for R

Conda is an awesome package manager that we've been using for Python. In most cases, conda alleviates the horror of installation errors and dependencies.

Now conda is available for R. In other words, you can install R using the following command:

conda install --channel r r

I found many of my favorite R packages were available in conda's r channel, usually with r- prepended to their lowercased name. For example, I installed r-ggplot2, r-tidyr, r-dplyr, r-caret, and r-glmnet. For notebook support, I installed rpy2 and r-irkernel.

Installing R packages that are not included in conda's channel became more difficult after the switch to conda management. For example, devtools::install_github() was failing, and when doing traditional package installation, I had to specify the repos argument because the GUI popup was broken:

install.packages('readr', repos='http://cran.us.r-project.org')
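One way to avoid passing repos on every call is to set the mirror once per session, or in an .Rprofile file. This is only a sketch of that workaround; the mirror URL below is the CRAN cloud mirror, chosen for the example, and I haven't verified that it sidesteps every conda-related issue.

```r
# Set a default CRAN mirror so install.packages() never opens the chooser
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Subsequent installs would then need no repos argument, e.g.:
# install.packages("readr")
```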

Conda can definitely save R users lots of frustration, but I suggest the general user wait for greater maturity before adoption.

Daniel Himmelstein: Continuum has created an r-essentials bundle containing many of the most common R packages. Installing r-essentials should also add the R kernel to your Jupyter notebook.

Topics

R Rstats Hadley Wickham Programming Coding Style Hadleyverse

Referenced by

Workshop to analyze LINCS data for the Systems Pharmacology course at UCSF

Cite this as

Daniel Himmelstein (2015) R best practices. Thinklab. doi:10.15363/thinklab.d83

License