Finding the K in K-means by Parametric Bootstrap

Posted on Categories data science, Exciting Techniques, Expository Writing, Mathematics, StatisticsTags , , , , , , Leave a comment on Finding the K in K-means by Parametric Bootstrap

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R.

We also came upon another cool approach, in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.

You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.

Continue reading Finding the K in K-means by Parametric Bootstrap

Databases in containers

Posted on Categories Coding, Exciting Techniques, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, RantsTags , , , , 5 Comments on Databases in containers

A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).


Containerized DB

NewImage

The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”

(image credit).

Continue reading Databases in containers

Neglected optimization topic: set diversity

Posted on Categories Applications, data science, Expository Writing, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , Leave a comment on Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

The problem

Consider the following problem: for a number of items U = {x_1, … x_n} pick a small set of them X = {x_i1, x_i2, ..., x_ik} such that there is a high probability one of the x in X is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

  • Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
  • Online advertising. The user is presented with a number of advertisements in enticements in the hope that one of them matches user taste.
  • Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
  • Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
  • Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure by both per-tree variable selections and re-sampling (some of these issues discussed here).

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).


Rplot01

Minimal spanning trees, the basis of one set diversity metric.

Continue reading Neglected optimization topic: set diversity

Free video course: applied Bayesian A/B testing in R

Posted on Categories Administrativia, Pragmatic Data Science, StatisticsTags , , Leave a comment on Free video course: applied Bayesian A/B testing in R

As a “thank you” to our blog, mailing list, and Twitter followers (@WinVectorLLC) we at Win-Vector LLC have decided to re-release our formerly fee-based A/B testing video course as a free (advertisement supported) video course here on Youtube.


CTvideoCourse

The course emphasizes how to design A/B tests using prior “guestimates” of effect sizes (often you have these from prior campaigns, or somebody claims an effect size and it is merely your job to confirm it). It is fairly technical, and the emphasis is Bayesian- where we are trying to get an actual estimate of the distribution unknown true expected payoff rate of the various campaigns (the so-called posteriors). We show how to design and evaluate a sales campaigns for a product at two different price points.

The solution is coded in R and Nina Zumel has contributed an updated Shiny user interface demonstrating the technique (for more on Shiny, please see here). The code for the calculation methods and older shiny app are shared here. Continue reading Free video course: applied Bayesian A/B testing in R

Using PostgreSQL in R: A quick how-to

Posted on Categories Coding, data science, Expository Writing, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , , , , , 4 Comments on Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.

NewImageImage: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

Continue reading Using PostgreSQL in R: A quick how-to

“Introduction to Data Science” video course contest is closed

Posted on Categories Administrativia, CodingTags , , Leave a comment on “Introduction to Data Science” video course contest is closed

Congratulations to all the winners of the Win-Vector “Introduction to Data Science” Video Course giveaway! We’ve emailed all of you your individual subscription coupons. Continue reading “Introduction to Data Science” video course contest is closed

Running R jobs quickly on many machines

Posted on Categories Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, ProgrammingTags , , 4 Comments on Running R jobs quickly on many machines

As we demonstrated in “A gentle introduction to parallel computing in R” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs multiple CPUs/cores to running jobs multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.


NewImage
Colossus supercomputer : The Forbin Project

Continue reading Running R jobs quickly on many machines

Win-Vector data science mailing list (and a give-away!)

Posted on Categories Administrativia, StatisticsTags 2 Comments on Win-Vector data science mailing list (and a give-away!)

Win-Vector LLC is starting a data science mailing list that we would like you to sign up for. It is going to be a (deliberately infrequent) set of updates including Win-Vector LLC notices, upcoming speaking events, and data science products.


To kick this off we will be awarding 5 free permanent subscriptions to our video course “Introduction to Data Science” to people who join the mailing list in January 2016 (people who have already signed up already eligible!).
(The contest is over, thank you all for entering!).

For more news/announcements please follow us on:

Prepping Data for Analysis using R

Posted on Categories Statistics, TutorialsTags , , Leave a comment on Prepping Data for Analysis using R

Nina and I are proud to share our lecture: “Prepping Data for Analysis using R” from ODSC West 2015.


Nina Zumel and John Mount ODSC WEST 2015

It is about 90 minutes, and covers a lot of the theory behind the vtreat data preparation library.

We also have a Github repository including all the lecture materials here. Continue reading Prepping Data for Analysis using R