Please read on for our handy hints on keeping your data handles neat. Continue reading Managing Spark data handles in R
Win-Vector LLC has recently been teaching how to use R with big data through sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big data using R, share some hints, and even admit to some of the warts found in this combination of systems.
The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).
J. Howard Miller, 1943.
The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.
Please read on for a brief description of our new articles series: “R and big data.” Continue reading New series: R and big data (concentrating on Spark and sparklyr)
I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark.
More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.
In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal of this article is to set up terminology so we can state in one or two sentences why decision trees tend to work well in practice.
Living in the age of big data, we ask: what should we do when we have the good fortune to be presented with a huge amount of supervised training data? Most often at large scale we are presented with the unsupervised problems of characterization and information extraction, but some problem domains offer an almost limitless supply of supervised training data (such as using older data to build models that predict the near future). Having too much training data is a good problem to have, and there are ways to use traditional methods (like logistic regression) at this scale. We present an “out of core” logistic regression implementation and a quick example in Apache Hadoop running on Amazon Elastic MapReduce. This presentation assumes familiarity with Unix-style command lines, Java, and Hadoop.
We have been living in the age of “big data” for some time now. This is an age where incredible things can be accomplished through the effective application of statistics and machine learning at large scale (for example see: “The Unreasonable Effectiveness of Data” Alon Halevy, Peter Norvig, Fernando Pereira, IEEE Intelligent Systems (2009)). But I have gotten to thinking about the period before this. The period before we had easy access to so much data, before most computation was aggregation and before we accepted numerical analysis style convergence as “efficient.” A small problem I needed to solve (as part of a bigger project) reminded me what theoretical computer scientists did then: we worried about provable worst case efficiency.