- Pros and cons
- Installing R
- Installing packages
- Command-line scripting
- Recommended config
- Needful packages
- High performance R
- Interacting with Julia
- Intro help
- Saving and loading
- Subsetting hell
- Data exchange
- R for Pythonistas
- what files do I need?
- Path surgery
tl;dr R is a powerful, effective, diverse, well-supported, free, nightmarishly messy, inefficient, de-facto standard. As far as scientific computation goes, this is outstandingly good.
Pros and cons
- Free (beer/speech)
- Combines unparalleled breadth and community, at least as pertains to statisticians, data miners, machine learners and other such assorted folk as I call my colleagues. To get some sense of this thriving scene, check out R-bloggers. That community alone is enough to sell R as an ecosystem, whatever you think of the technical chaos of the language itself (cf “Your community is your best asset”). And believe me, I have reservations about everything else.
- Amazing, statistically-useful plotting (cf the awful battle to get error bars in python’s mayavi)
- Online web-app visualisation: shiny
- Integration into literate coding and reproducible research through knitr — see scientific writing workflow.
- Poetically, R has random scope amongst other parser and syntax weirdness.
- Call-by-value semantics (in a "big-data" processing language?)
- …ameliorated not even by array views,
- …exacerbated by bloaty design.
- Object model tacked on after the fact. In fact, several object models. Which is fine? I guess?
- One of the worst names to google for ever (cf Processing, Pure)
See the R package page.
Loads CSV from stdin into R as a data.frame, executes given commands, and gets the output as CSV or PNG on stdout
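The same idea can be roughed out in base R. This is a minimal sketch, not the tool described above: `run_on_csv` is a hypothetical helper name, and the in-memory `textConnection` stands in for a real stdin pipe.

```r
# Hypothetical helper (not part of any package): read CSV from `input`,
# evaluate `expr` with the data bound to the name `df`, and write the
# result as CSV to `output`.
run_on_csv <- function(expr, input = file("stdin"), output = stdout()) {
  df <- read.csv(input)
  result <- eval(expr, list(df = df))
  write.csv(result, output, row.names = FALSE)
  invisible(result)
}

# Demonstration with an in-memory connection standing in for stdin.
input <- textConnection("x,y\n1,2\n3,4\n5,6")
out_lines <- capture.output(
  res <- run_on_csv(quote(df[df$x > 1, ]), input = input)
)
close(input)
```

In a script run via `Rscript`, the defaults would read from the real stdin and write to the real stdout.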
Aaron recommends not starting a new x server to ask you to choose a menu item:
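Assuming the setting meant here is base R's `menu.graphics` option (my guess at the relevant knob), a line like this in `.Rprofile` keeps menus in the text console:

```r
# Assumption: menu.graphics is the option in question. Setting it to FALSE
# asks R to show menus (e.g. CRAN mirror selection) as text prompts
# instead of opening a graphical dialog.
options(menu.graphics = FALSE)
```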
Upon setting up a new machine I always run
install.packages(c("blogdown", "renv", "tinytex", "knitr", "devtools", "ggplot2"))
tinytex::install_tinytex()
devtools::install_github("r-lib/hugodown")
That gets the baseline tools I actually use.
The tidyverse is a miniature ecosystem within R which has coding conventions and tooling to make certain data analysis easier and prettier, although not necessarily more performant.
Blogging / reports / reproducible research
blogdown, the blogging tool, and knitr, the rendering engine, as mentioned elsewhere, comprise R’s entry into the academic blogoverse. Together they do reproducible research and miscellaneous scientific writing. This is the R killer feature that incorporates all the other killer features.
R now plugs into many machine-learning-style algorithms.
R does not define a standardized interface for its machine-learning algorithms. Therefore, for any non-trivial experiments, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective output.
Additionally you need to implement infrastructure to
- resample your models
- optimize hyperparameters
- select features
- cope with pre- and post-processing of data and compare models in a statistically meaningful way.
As this becomes computationally expensive, you might want to parallelize your experiments as well. This often forces users to make crummy trade-offs in their experiments due to time constraints or lacking expert programming skills.
mlr provides this infrastructure so that you can focus on your experiments! The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering. It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms.
I think this pitch is more or less the same for caret.
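To give a feel for the boilerplate these frameworks are meant to replace, here is a minimal hand-rolled k-fold cross-validation in base R. It is only a sketch: the data are simulated, `cv_mse` is a hypothetical helper, and it does not use mlr or caret at all.

```r
set.seed(1)

# Simulated data: y depends linearly on x, plus a little noise.
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.1)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: mean squared error of an lm() fit, averaged over
# folds. The response column is hard-coded as `y` to keep the sketch short.
cv_mse <- function(formula, data, fold) {
  errs <- vapply(sort(unique(fold)), function(i) {
    fit <- lm(formula, data = data[fold != i, , drop = FALSE])
    held_out <- data[fold == i, , drop = FALSE]
    mean((held_out$y - predict(fit, newdata = held_out))^2)
  }, numeric(1))
  mean(errs)
}

mse_linear <- cv_mse(y ~ x, dat, fold)  # model that uses x
mse_const  <- cv_mse(y ~ 1, dat, fold)  # intercept-only baseline
```

Every new model family needs its own variant of `cv_mse`, which is the tedium the quoted pitch is complaining about.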
There are also externally developed ML algorithms accessible from R that presumably have consistent interfaces by construction: h2o
refines data frames in various ways, including less dunderheaded default constructors, high performance dataset querying, and
approximately the same functionality as
but seems to be faster at e.g.
It has a slightly different syntax to built-in dataframes, usually better syntax.
Here is a tutorial and the introduction.
disk.frame is a friendly gigabyte-scale single machine disk-backed data store, for stuff too big for memory.
In the tidyverse
IMO, the real killer feature of R.
See Plotting in R.
High performance R
Rcpp seems to be how everyone invokes their favoured compiled C++ code.
There are higher-level tools that do this under the hood:
rstan compiles an inner loop this way for Bayesian posterior simulation and a little bit of basic variational inference.
If you want a little more freedom but still want to have automatic differentiation and linear algebra done by magic, try TMB, whose name and description are both awful but which manages pretty neat reduced rank matrix and optimization tricks for you.
Interacting with Julia
Julia is a nice language that can attain high performance.
I don’t know how to choose between these alternative methods. They both seem to have stalled, but XRJulia seems to be somewhat fresher.
This package provides an interface from R to Julia, based on the XR structure, as implemented in the XR package, in this repository.
rJulia provides an interface between R and Julia. It allows a user to run a script in Julia from R, and maps objects between the two languages.
RStudio is the most famed IDE for R. It happens to include a passable text editor, and a couple of neat features (equation preview! blog autogeneration! data explorer! interactive widgets!) and some misfeatures (bizarre and idiosyncratic keyboard shortcuts, bad integration for non-R languages…). Overall I find RStudio helpful for generating graphs and reports and slides, but I actually edit code in VS Code.
There is an rstudio addin addinslist which
As noted under spreadsheet interfaces,
Jamovi is a new “3rd generation” statistical spreadsheet. designed from the ground up to be easy to use, jamovi is a compelling alternative to costly statistical products such as SPSS and SAS.
jamovi is built on top of the R statistical language, giving you access to the best the statistics community has to offer. would you like the R code for your analyses? jamovi can provide that too.
The interface looks good. It is lacking certain features that I would like (generalized linear models, AFAICT interaction terms in regressions) but it does good classical testing and analysis, and nudges you towards best practice especially in hypothesis tests.
You can develop extra extensions.
VS Code for R
This requires, from the R side, that one installs R’s language server.
randy3k/radian: A 21 century R console
radian is an alternative console for the R program with multiline editing and rich syntax highlight. One would consider radian as a ipython clone for R, though its design is more aligned to julia.
Exploratory is an exploratory data analysis workbench built in R with lots of nice tools and things. At USD49/month for entry price it has an unfortunately steep price curve and so I have not tried it.
- Rstudio.com cheat sheets
- Monash University’s bioinformatics-focused intro.
- CSIRO’s introduction
- Drew Conway’s strata bootcamp
- R cookbook
- Jeremy Howard of Kaggle gives a virtuous and improving presentation
- Edwin de Jonge and Mark van der Loo, Data cleaning with R
- Bob Rudis: Using R to get data out of word documents
Saving and loading
Save my workspace (i.e. current scope and variable definitions) to
Load my workspace from .RData:
> rm(list=ls()) # clear current defs
> load(".RData") # actually load
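A self-contained round trip, as a sketch: it uses `save()` on a temporary file rather than the real `.RData`, so it will not clobber anything. For the whole workspace, `save.image()` is the usual call and writes to `.RData` by default.

```r
# Round trip through a temporary file; the path and variable are illustrative.
path <- tempfile(fileext = ".RData")

answer <- 42
save(answer, file = path)   # write the named object(s) to disk

rm(answer)                  # clear the definition...
load(path)                  # ...and bring it back from the file
```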
To subset a list-based object:
to subset and optionally downcast the same:
to subset a matrix-based object:
x[1, , drop=FALSE]
to subset and optionally downcast the same:
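The four cases above, spelled out with standard base R indexing on toy data of my own invention:

```r
x <- list(a = 1:3, b = "text")

sub_list <- x[1]     # subset: still a list, of length 1
elem     <- x[[1]]   # subset and downcast: the integer vector itself

m <- matrix(1:6, nrow = 2)

row_mat <- m[1, , drop = FALSE]  # subset: stays a 1 x 3 matrix
row_vec <- m[1, ]                # subset and downcast: plain vector, dims dropped
```

The `drop = FALSE` form is the one to reach for in code that must keep working when a selection happens to return a single row or column.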
How to pass sparse matrices between R and Python
Counter-intuitively, this FS-backed method was a couple of orders of magnitude faster than rpy2 last time I tried to pass more than a few MB of data.
Inspecting frames post hoc
Use recover. In fact, pro-tip, you can invoke it in 3rd party code gracefully:
options(error = utils::recover)
Basic interactive debugger
There is at least one, called browser.
Graphical interactive optionally-web-based debugger
Available in RStudio, and if it had any more buzzwords in it, it would socially tag your instagram and upload it to the NSA’s Internet Of Things to be 3D printed.
R for Pythonistas
Many things about R are surprising to me, coming as I do most recently from Python. I’m documenting my perpetual surprise here, in order that it may save someone else the inconvenience of going to all that trouble to be personally surprised.
Importing an R package, unlike importing a python module, brings in random cruft that may have little to do with the names of the thing you just imported. That is, IMO, poor planning, although history indicates that most language designers don’t agree with me on that:
> npreg
Error: object 'npreg' not found
> library("np")
Nonparametric Kernel Methods for Mixed Datatypes (version 0.40-4)
> npreg
function (bws, …) #etc
Further, data structures in R can, and are intended to, provide first-class scopes for looking up names. You are, in your explorations into data, as apt to bring the names of columns in a data set into scope as the names of functions in a library. This is kind of useful, although it leads to bizarre and unhelpful errors, so watch it.
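A small illustration of a data frame acting as a scope, using base R's `with()`; the data and variable names are made up for the example.

```r
df <- data.frame(height = c(1.5, 1.8), weight = c(50, 80))

# Column names are looked up inside the data frame first.
bmi <- with(df, weight / height^2)

# A variable of the same name in the calling scope is shadowed by the column.
height <- "not a column"
shadowed <- with(df, height)
```

The flip side is that a mistyped column name silently falls through to whatever variable of that name exists outside the data frame, which is one source of the unhelpful errors.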
No scalar types…
A float is a float vector of size 1:
> 5
[1] 5
…yet verbose vector literal syntax
You make vectors by calling a function called c:
> c('a', 'b', 'c', 'd')
[1] "a" "b" "c" "d"
If you type a literal vector in though, it will throw an error:
> 'a', 'b', 'c', 'd'
Error: unexpected ',' in "'a',"
I’m sure there are Reasons for this; it’s just that they are reasons that I don’t care about.
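The same points can be checked mechanically. A short sketch in base R:

```r
x <- 5

n_elems   <- length(x)            # a "scalar" has length 1
is_vec    <- is.vector(x)         # and is a vector
same_as_c <- identical(x, c(5))   # c() of one value is the same object

# c() concatenates and flattens rather than nesting.
nested <- c(c("a", "b"), "c")
```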
what files do I need?
Here is a good .gitignore file for R which keeps only what you need.
R is sensitive to which system libraries it picks up. If you are running, for example, homebrew then you probably want to make SURE that it does not show up in R as the first option.
Put the following in an .Rprofile startup file:
This puts the .linuxbrew paths last:
.pth = Sys.getenv("PATH")
.pths = unlist(strsplit(.pth, ":"))
.brewpthi = as.vector(unlist(lapply(.pths, function (x) grepl("brew", x))))
.nbrewpthi = as.vector(unlist(lapply(.pths, function (x) !grepl("brew", x))))
Sys.setenv(PATH=paste(paste(.pths[.nbrewpthi], collapse=":"), paste(.pths[.brewpthi], collapse=":"), sep=":"))
This one deletes them entirely:
.pth = Sys.getenv("PATH")
.pths = unlist(strsplit(.pth, ":"))
.nbrewpthi = as.vector(unlist(lapply(.pths, function (x) !grepl("brew", x))))
Sys.setenv(PATH=paste(.pths[.nbrewpthi], collapse=":"))
Mentioning that you have done this is probably helpful to the user. I note it with the following message:
print("Changed PATH")
print(.pth)
print("to")
print(Sys.getenv("PATH"))
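The two PATH manipulations can also be written as pure functions on strings, which makes them easy to try out before touching the real environment. This is a sketch; `move_matching_last` and `drop_matching` are hypothetical helper names, and the demo path is invented.

```r
# Hypothetical helpers: operate on a PATH-like string, leaving the real
# environment untouched until you pass the result to Sys.setenv().
move_matching_last <- function(path, pattern = "brew", sep = ":") {
  parts <- unlist(strsplit(path, sep, fixed = TRUE))
  is_match <- grepl(pattern, parts)   # grepl is already vectorised
  paste(c(parts[!is_match], parts[is_match]), collapse = sep)
}

drop_matching <- function(path, pattern = "brew", sep = ":") {
  parts <- unlist(strsplit(path, sep, fixed = TRUE))
  paste(parts[!grepl(pattern, parts)], collapse = sep)
}

demo <- "/home/me/.linuxbrew/bin:/usr/bin:/bin"
reordered <- move_matching_last(demo)
stripped  <- drop_matching(demo)
```

Concatenating the two groups before joining avoids a stray trailing separator when no entry matches. To apply it for real: `Sys.setenv(PATH = move_matching_last(Sys.getenv("PATH")))`.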