This is mostly a running list of data.table tricks that took me a while to figure out either by digging into the official documentation, adapting StackOverflow posts, or more often than not, experimenting for hours. I’d like to persist these discoveries somewhere with more memory than my head (hello internet) so I can reuse them after my mental memory forgets them. A less organized and concise addition to DataCamp’s sweet cheat sheet for the basics.
Most, if not all of these techniques were developed for real data science projects and provided some value to my data engineering. I’ve generalized everything to the mtcars dataset which might not make this value immediately clear in this slightly contrived context. This list is not intended to be comprehensive as DataCamp’s data.table cheatsheet is. OK, enough disclaimers!
Some more advanced functionality from data.table creator Matt Dowle here.
1. DATA STRUCTURES & ASSIGNMENT
Columns of lists
summary table (long and narrow)
This could be useful, but is easily achievable using traditional methods.
summary table (short and narrow)
Add all categories of gear for each cyl to original data.table as a list.
This is more nifty. It’s so simple, I find myself using this trick to quickly explore data ad hoc at the command line.
Can also be useful for more serious data engineering.
Update 10/29/2015: Per these comments
on StackOverlow referencing my post, t[,gearsL:=list(list(unique(gear))), by=cyl] can be more elegantly written as t[,gearsL:=.(list(unique(gear))), by=cyl]. Thanks for pointing out my unnecessarily verbose and unusual syntax! I think I wrote the first thing that worked when I posted this, not realizing the normal .( syntax was equivalent to the outer list.
Accessing elements from a column of lists
Extract second element of each list in gearL1 and create row gearL1.
This isn’t that groundbreaking, but explores how to access elements of columns which are constructed of lists of lists. lapply is your friend.
Update 9/24/2015: Per Matt Dowle’s comments, a slightly more syntactically succinct way of doing this:
Calculate all the gears for all cars of each cyl (excluding the current current row).
This can be useful for comparing observations to the mean of groups, where the group mean is not biased by the observation of interest.
Update 9/24/2015: Per Matt Dowle’s comments, this achieves the same as above.
Suppressing intermediate output with {}
This is actually a base R trick that I didn’t discover until working with data.table. See ?`{` for some documentation and examples.
I’ve only used it within the J slot of data.table, it might be more generalizable. I find it pretty useful for generating columns
on the fly when I need to perform some multi-step vectorized operation. It can clean up code by allowing you to reference the same temporary variable
by a concise name rather than rewriting the code to re-compute it.
Defaults to just returning the last object defined in the braces unnamed.
We can be more explicit by passing a named list of what we want to keep.
Can also write it like this without semicolons.
This is trickier with := assignments… I don’t think := is intended to work when wrapped in {. Assigning multiple columns with := at once
does not allow you to use the first columns you create to use building the ones after it, as we did with = inside the { above. Chaining and then dropping unwanted variables is a messy workaround… still exploring this one.
Fast looping with set
I still haven’t worked much with the loop + set framework. I’ve been able to achieve pretty much everything with := which is more flexible and powerful.
However, if you must loop, set is orders of magnitude faster than native R assignments within loops. Here’s a snippet from data.table news a while back:
New function set(DT,i,j,value) allows fast assignment to elements
of DT. Similar to := but avoids the overhead of [.data.table, so is
much faster inside a loop. Less flexible than :=, but as flexible
as matrix sub-assignment. Similar in spirit to setnames(), setcolorder(),
setkey() and setattr(); i.e., assigns by reference with no copy at all.
M = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(M)
DT = as.data.table(M)
system.time(for (i in 1:1000) DF[i,1L] <- i) # 591.000s
system.time(for (i in 1:1000) DT[i,V1:=i]) # 1.158s
system.time(for (i in 1:1000) M[i,1L] <- i) # 0.016s
system.time(for (i in 1:1000) set(DT,i,1L,i)) # 0.027s
data.table creators do favor set for some things, like this task which can also be done w/ lapply and .SD. I was actually directed to this solution after I posed this question on StackOverflow. I was also pleased to learn that the
functionality I was looking for – applying a function to a subset of columns with .SDcols while preserving the untouched columns – was added as a feature request.
Using shift for to lead/lag vectors and lists
Note this feature is only available in version 1.9.5 (currently on Github, not CRAN)
Base R surprisingly does not have great tools for dealing with leads/lags of vectors that most social science
statistical software (Stata, SAS, even FAME which I used in my formative data years) come equipped with out of the box.
shift with by
Create multiple columns with := in one statement
This is useful, but note that that the columns operated on must be atomic vectors or lists. That is they must exist before running computation.
Building columns referencing other columns in this set need to be done individually or chained.
Assign a column with := named with a character object
This is the advised way to assign a new column whose name you already have determined and saved as a character. Simply surround the character object in parentheses.
This is old (now deprecated) way which still works for now. Not advised.
2. BY
Calculate a function over a group (using by) excluding each entity in a second category.
This title probably doesn’t immediately make much sense. Let me explain what I’m going to calculate and why with an example.
We want to compare the mpg of each car to the average mpg of cars in the same class (the same # of cylinders). However, we don’t want
to bias the group mean by including the car we want to compare to the average in that average.
This assumption doesn’t appear useful in this example, but assume that gear+cyl uniquely identify the cars. In the real project where I faced this
problem, I was calculating an indicator related to an appraiser relative to the average of all other appraisers in their zip3. (cyl was really zipcode
and gear was the appraiser’s ID).
METHOD 1: in-line
0.a Biased mean: simple mean by cyl
However we want to know for each row, what is the mean among all the other cars with the same # of cyls, excluding that car.
1.a .GRP without setting key
Update 9/24/2015: Per Matt Dowle’s comments, this also works with slightly less code. For my simple example, there was also a marginal speed gain. Time savings relative to the .GRP method will likely increase with the complexity of the problem.
1.b Same as 1.a, but a little faster
Why does this work?
1.b Setting key
METHOD 2: using {} and .SD
{} is used for to suppress intermediate operations.
Building up
No surprises here.
Nested data.tables and by statements
This chunk shows what happens with two by statements nested within two different data.tables. Explanatory purposes only - not necessary for our task.
n counts the # of cars in that cyl. N counts the number of cars by cyl and gear.
Calculating “unbiased mean”
This is in a summary table. This would need to be merged back onto dt if that is desired.
METHOD 3: Super Fast Mean calculation
Non-function direct way
Using a vectorized approach to calculate the unbiased mean for each combination of gear and cyl. Mechanically,
it calculates the “biased average” for all cars by cyl. Then subtract off the share of cars with the combination of gear and cyl
that we want to exclude from the average and add that share. Then extrapolate out this pared down mean.
Wrapping up code below into a function
Speed check
Method 3 is roughly 100x faster than the other two. Great for this narrow task with the vectorization built in,
but less generalizable; The other two methods allow any function to be passed.
Method 3:
Method 2:
Method 1:
keyby to key resulting aggregate table
Without keyby
Categories are not sorted
With keyby
Using [1], [.N], setkey and by for within group subsetting
take highest value of column A when column B is highest by group
Max of qsec for each category of cyl
(this is easy)
value of qsec when mpg is the highest per category of cyl
(this is trickier)
value of qsec when mpg is the lowest per category of cyl
value of qsec when mpg is the median per category of cyl
subset rows within by statement
V1 is the standard deviation of mpg by cyl. V2 is the standard deviation of mpg for just the first half of mpg.
3. FUNCTIONS
Passing data.table column names as function arguments
Method 1: No quotes, and deparse + substitute
This way seems more data.table-ish because it maintains the practice of not using quotes on variable names in most cases.
Method 2: quotes and get
However I tend to pass through column names as characters (quoted) and use get each time I reference that column. That can be annoying if you have a long function
repeatedly reference column names, but I often need to write such few lines of code with data.table, it hasn’t struck me as terribly unslick, yet.
Beware of scoping within data.table
data.frame way
When you add something to a data.frame within a function that exists in the global environment, it does not affect that object in the
global environment unless you return and reassign it as such, or you use the <<- operator.
When we call the function, we see addcol1 in the output. But not addcol2. That’s because it’s been added to the df in the global environment one level up.
Here is addcol2, but not addcol.
data.table way
Unlike data.frame, the := operator adds a column to both the object living in the global environment and used in the function. I think this is because
these objects are actually the same object. data.table shaves computation time by not making copies unless explicitly directed to.
So something like this renaming the local version using copy bypasses this behavior, but is likely somewhat less efficient (and elegant). I suspect there’s a cleaner and/or faster way to do this: keep some variables
local to the function while persisting and returning other columns.
4. PRINTING
Print data.table with []
Nothing groundbreaking here, but a small miscellaneous piece of functionality.
In data.frame world, wrapping an expression in () prints the output to the console. This also works with data.table, but there is another way.
In data.table this is achieved by appending [] to the end of the expression. I find this useful because when I’m exploring at the console, I
don’t usually decide to print the output until I’m almost done and I’m already at the end of the expression I’ve written.
Hide output from := with knitr
It used to be that assignments using the := operator printed the object to console when knitting documents with knitr and rmarkdown. This is actually fixed in data.table v1.9.5. However at the time of my writing, this currently not available on CRAN… only Github. For 1.9.4 users, this StackOverflow post has some hacky solutions. This least impedance approach I found was simply wrapping
the expression in invisible. Other solutions alter the way you use data.table which I didn’t like.