Nina Zumel introduced y-aware scaling in her recent article Principal Components Regression, Pt. 2: Y-Aware Methods. I really encourage you to read the article and add the technique to your repertoire. The method combines well with other methods and can drive better predictive modeling results.
From feedback I am not sure everybody noticed that in addition to being easy and effective, the method is actually novel (we haven’t yet found an academic reference to it or seen it already in use after visiting numerous clients). Likely it has been applied before (as it is a simple method), but it is not currently considered a standard method (something we would like to change).
In this example we are going to show what building a predictive model using vtreat best practices looks like assuming you were somehow already in the habit of using vtreat for your data preparation step. We are deliberately not going to explain any steps, but just show the small number of steps we advise routinely using. This is a simple schematic, but not a guide. Of course we do not advise use without understanding (and we work hard to teach the concepts in our writing), but want what small effort is required to add vtreat to your predictive modeling practice.
In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.
We have been recently working on and presenting on nested modeling issues. These are situations where the output of one trained machine learning model is part of the input of a later model or procedure. I am now of the opinion that correct treatment of nested models is one of the biggest opportunities for improvement in data science practice. Nested models can be more powerful than non-nested, but are easy to get wrong.
One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R.
We also came upon another cool approach, in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.
You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.
A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.
In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.
In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).
The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)
“Despite all my rage I am still just a rat in a cage!”
The reason we care is: by making the computer work harder (perform many calculations simultaneously) we wait less time for our experiments and can run more experiments. This is especially important when doing data science (as we often do using the R analysis platform) as we often need to repeat variations of large analyses to learn things, infer parameters, and estimate model stability.
Typically to get the computer to work a harder the analyst, programmer, or library designer must themselves work a bit hard to arrange calculations in a parallel friendly manner. In the best circumstances somebody has already done this for you:
Good parallel libraries, such as the multi-threaded BLAS/LAPACK libraries included in Revolution R Open (RRO, now Microsoft R Open) (see here).
Parallelization abstraction frameworks such as Thrust/Rth (see here).
Using R application libraries that dealt with parallelism on their own (examples include gbm, boot and our own vtreat). (Some of these libraries do not attempt parallel operation until you specify a parallel execution environment.)
In addition to having a task ready to “parallelize” you need a facility willing to work on it in a parallel manner. Examples include:
Your own machine. Even a laptop computer usually now has four our more cores. Potentially running four times faster, or equivalently waiting only one fourth the time, is big.
Graphics processing units (GPUs). Many machines have a one or more powerful graphics cards already installed. For some numerical task these cards are 10 to 100 times faster than the basic Central Processing Unit (CPU) you normally use for computation (see here).