If your R or dplyr work is taking what you consider to be a too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour) then try data.table.
For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).
If your project is large (millions of rows, hundreds of columns) you really should rent an an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
This note shares an experiment comparing the performance of a number of data processing systems available in R. Our notional or example problem is finding the top ranking item per group (group defined by three string columns, and order defined by a single numeric column). This is a common and often needed task.
rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).
In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals).
An example would be: a diet that changes individual weight by an ounce on average with a standard deviation of a pound. With a large enough population the diet is statistically significant. It could also be used to shave an ounce off a national average weight. But, for any one individual: this diet is largely pointless.
The concept is teachable, but we have always stumbled of the naming “statistical significance” versus “practical clinical significance.”
I am suggesting trying the word “substantial” (and its antonym “insubstantial”) to describe if changes are physically small or large.
This comes down to having to remind people that “p-values are not effect sizes”. In this article we recommended reporting three statistics: a units-based effect size (such as expected delta pounds), a dimensionless effects size (such as Cohen’s d), and a reliability of experiment size measure (such as a statistical significance, which at best measures only one possible risk: re-sampling risk).
The merit is: if we don’t confound different meanings, we may be less confusing. A downside is: some of these measures are a bit technical to discuss. I’d be interested in hearing opinions and about teaching experiences along these distinctions.
cdata is a data manipulation package that subsumes many higher order data manipulation operations including pivot/un-pivot, spread/gather, or cast/melt. The record to record transforms are specified by drawing a table that expresses the record structure (called the “control table” and also the link between the key concepts of row-records and block-records).
What can be quickly specified and achieved using these concepts and notations is amazing and quite teachable. These transforms can be run in-memory or in remote database or big-data systems (such as Spark).
The 0.7.0 update adds local versions of the operators in addition to the Spark and database implementations. These methods should now be a bit safer for in-memory complex/annotated types such as dates and times.
Rtip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.
I know I write a lot about coding in R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.
R without data is like going to the theater to watch the curtain go up and down.
(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)
Usually you come to R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in much faster, explainable, and maintainable fashion.
If you are working with predictive modeling or machine learning in R this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.
R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.
When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:
Missing, invalid, or out of range values.
Categorical variables with large sets of possible levels.
Novel categorical levels discovered during test, cross-validation, or model application/deployment.
Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
Nested model bias poisoning results in non-trivial data processing pipelines.
Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.
vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.
If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.