R in pRoduction: theRe be dRagons!

Tim Sweetser and Kyle Schmaus
- San Francisco, CA

Here at Stitch Fix, we have an ambivalent relationship with R. We love R for its bleeding-edge statistical models, e.g. mbest and bsts, and for the robust data science toolchain known as the tidyverse. When we’re working interactively, we love using R to explore new models and visualize our data with ggplot2. But when we’re running production jobs, such as fitting models inside Docker images on cloud computers, R has a few defaults that send shivers up our spine.

To illustrate these, we’ll use the diamonds dataset from ggplot2. We assume you’re familiar with base R’s data.frame class and some common packages like tibble and data.table.

library(ggplot2)
data(diamonds)  # a tibble
diamonds

Asking for nonexistent columns

For us, the biggest surprise when using an R data.frame is what happens when you try to access a nonexistent column. Suppose we wanted to do something with the prices of our diamonds. price is a valid column of diamonds, but say we forgot the name and thought it was title case. When we ask for diamonds[["Price"]], R returns NULL rather than throwing an error! This is the behavior not just for tibble, but for data.table and data.frame as well. For production jobs, we need things to fail loudly, i.e. throw errors, in order to get our attention. We’d like this loud failure to occur when, for example, some upstream data change breaks our script’s assumptions. Otherwise, we assume everything ran smoothly and as intended. This highlights the difference between interactive use, where R shines, and production use.
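To see this concretely:

diamonds[["Price"]]
# NULL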

Concretely, suppose we have a limited budget and want to know which diamonds we can afford.

diamonds[diamonds[["price"]] < 350, ]  # 17 rows
diamonds[diamonds[["Price"]] < 350, ]  # 0 rows

This small mix-up might lead us to conclude that we can’t buy any diamonds. (Note that dplyr::filter(diamonds, Price < 350) would instead throw an error here, since Price is not a column, which is exactly the loud failure we want.)

data.frame

If we use the $ operator to access columns of a data.frame, we also get NULL for nonexistent columns, such as diamonds$Price. However, we also get something worse: partial name matching!

# convert from tibble to data.frame
df_diamonds <- data.frame(diamonds)
df_diamonds$Price  # NULL, silently
df_diamonds$pri    # returns the price vector instead of throwing an error!

tibble / dplyr

Thankfully, tibble abolished partial name matching and issues a warning on a key miss when using $. But whether R returns NULL or its best guess at the column you wanted, our view is the same: any attempt to access a column of a data.frame whose name does not exactly match an existing column should raise an error. Python’s pandas behaves this way, throwing a KeyError in this situation.

data.table

A popular alternative to base R’s data.frame and the tidyverse’s tibble is data.table. It has some nice features, notably a blazingly fast file reader (data.table::fread) and the ability to modify data in place. However, one drawback of data.table is that very similar calls return inconsistent types. Consider the following snippet:

library(data.table)
packageVersion("data.table")  # 1.10.4
dt_diamonds <- data.table(diamonds)
x <- "price"
dt_diamonds[, price]  # vector
dt_diamonds[, "price"]  # data.table
dt_diamonds[, x]  # vector
dt_diamonds[, .(x)]  # data.table
dt_diamonds[[x]]  # vector

When we have multiple colleagues coding and committing to shared files, we prefer not to have to think hard about whether we’ll get a vector or a data.frame back. That said, data.table offers the ability to modify data in-place, whereas the tidyverse’s dplyr copies data with every function call. For this reason, we consider data.table to be better suited for memory-intensive production jobs.
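To make the contrast concrete, here’s a minimal sketch; price_per_carat is just an illustrative derived column:

# data.table: := adds the column by reference, without copying the table
dt_diamonds[, price_per_carat := price / carat]

# dplyr: mutate() leaves its input untouched and returns a modified copy
library(dplyr)
diamonds2 <- mutate(diamonds, price_per_carat = price / carat)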

stringsAsFactors

One common stumbling block in R is the parameter stringsAsFactors, which defaults to TRUE for functions like data.frame and read.csv. Factors are vectors that contain predefined values and are used to store categorical data; in themselves, they are a very useful concept in R. That said, automatic conversion of character vectors to factors has unintended consequences, in particular because factors are built on top of integer vectors, which invites casting bugs. For example, imagine you have a CSV file filled with user-ids, item-ids, and free-text user comments wrapped in double quotes, such as "Great color, I love these jeans!". The default behavior of read.csv (which outputs a data.frame) is to treat these comments as a categorical variable instead of a character variable! This is because R assumes you are a statistician who is going to plug your tabular data into a model like glm immediately after reading it in. Nowadays, many people use R for natural language processing, among other things, and this default makes less sense.
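Here’s a minimal sketch of the surprise, using an inline data.frame instead of a CSV:

comments <- c("Great color, I love these jeans!", "Runs small")
df <- data.frame(comment = comments)
class(df$comment)  # "factor" -- the default stringsAsFactors = TRUE kicked in

df <- data.frame(comment = comments, stringsAsFactors = FALSE)
class(df$comment)  # "character" -- what we actually wanted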

Moreover, because some datasets may have quotation marks around numeric data, or use a character code to represent a missing or censored observation, values intended to be numeric can be read in as factors. Casting such a factor back with as.integer then returns the underlying level codes rather than the original values! Thankfully, data.table::fread defaults to stringsAsFactors = FALSE, and readr::read_csv never converts strings to factors.
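This is the classic factor casting trap; the safe idiom is to convert to character first:

f <- factor(c("10", "20", "30"))  # quoted numbers read in as a factor
as.integer(f)                     # 1 2 3 -- the level codes, not our data!
as.integer(as.character(f))       # 10 20 30 -- the safe idiom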

What we actually do

In light of these surprises, how do we actually work with data.frames in R? We never, EVER use $ subsetting. Instead, we use double brackets. This leads to ugly code like the following, which is still vulnerable to key misses, but at least it absolutely avoids partial name matching.

# find affordable diamonds
x <- "price"
mask <- dt_diamonds[[x]] < 350
dt_diamonds[mask]

Obviously, this is not perfect. Here are some paths we are actively considering:

  • Greater use of data.table. However, its syntax is quite different from base R and the tidyverse, so there’s a learning curve. Additionally, there is the issue of vector vs. data.table return types mentioned above.
  • Using our own safe data.frame class, where we do throw exceptions for key misses. We don’t use this in production yet, but it would look something like this:
# safe_df class: make both [[ and $ throw an error on a key miss,
# since get() fails loudly when the requested object doesn't exist
`[[.safe_df` <- function(df, key) get(x = key, envir = list2env(df))
`$.safe_df` <- function(df, key) get(x = key, envir = list2env(df))

class(diamonds)
# [1] "tbl_df"     "tbl"        "data.frame"
diamonds_safe <- diamonds
class(diamonds_safe) <- c("safe_df", "data.frame")

diamonds[["Price"]] 
# NULL
diamonds_safe[["Price"]] 
# Error in get(x = key, envir = list2env(df)) : object 'Price' not found

diamonds$Price
# NULL
# Warning message:
# Unknown or uninitialised column: 'Price'. 
diamonds_safe$Price
# Error in get(x = key, envir = list2env(df)) : object 'Price' not found
  • Using R less. We think Python and Scala are generally better suited for production environments, but the task of translating our model fitting code from R to something else is pretty daunting. Currently, we only use R in the model-fitting step of our pipelines; every step up- and downstream of this is done in Python.

  • Calling R/Python subprocesses from each other, à la rpy2 or reticulate (see the sketch below).
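As a tiny sketch of that last option, reticulate lets an R session call into Python directly (numpy here is just for illustration):

library(reticulate)
np <- import("numpy")   # bind a Python module into the R session
np$median(c(1, 5, 9))   # 5 -- R vectors are converted to numpy arrays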

Other Things

Here are some miscellaneous pro tips for using R in production environments.

  • There is an R package for docopt, which is a nice way to define a command-line interface; see the sketch after this list.
  • The logging package is a nice way to keep track of what your program is doing; a minimal example also follows this list. See this discussion on SO for why we prefer logging to printing to stdout.
  • Docker containers provide stable environments for your R scripts to run in. With them, you don’t have to worry about a new package version breaking your existing script.
  • Thanks to our colleague Steven Troxler for this gem, which gives useful tracebacks on errors. That Python does this by default illustrates the different purposes of the two languages: R for interactive data analysis, Python for general-purpose scripting.
.enable_traceback <- function() {
  # on error, print a full traceback, then exit non-zero in batch mode
  options(error = function() {
    traceback(2)
    if (!interactive()) quit("no", status = 1, runLast = FALSE)
  })
}
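To flesh out the docopt bullet above, the command-line interface of a hypothetical fit_model.R script might look like this (the script name and options are invented for illustration):

#!/usr/bin/env Rscript
library(docopt)

doc <- "Usage: fit_model.R [--input=<path>] [--verbose]

Options:
  --input=<path>  Path to the training data [default: train.csv].
  --verbose       Log extra detail."

args <- docopt(doc)  # parses the command line against the usage string
if (args$verbose) message("reading ", args$input)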
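And a minimal example of the logging package in action:

library(logging)
basicConfig()  # attach the default console handler
loginfo("fitting model on %d rows", nrow(diamonds))
logwarn("column %s not found; skipping", "Price")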

Conclusion

Despite its production surprises, we love R and will continue to use it. There is no better interactive tool for data science, and as long as researchers continue to publish their work in R, we need it in production to stay on the cutting edge.

P.S.

While we were working on this post, Hadley Wickham released the strict package, which makes base functions more likely to throw errors in ambiguous situations. We think this is a big step in the right direction for the R language. Thanks, Hadley!
