We at Win-Vector LLC are very proud to announce that RStudio just inducted two more of our demonstration Shiny applications into their Shiny User Showcase gallery.

# Month: February 2016

## What does a “slow news day” look like at Win-Vector LLC?

We at Win-Vector LLC have been writing this blog for almost 9 years. In that time we have accumulated a lot of what we feel is very good writing on data science topics (277 posts and 687 contributed comments). Below is what a “slow news day” looks like in the WordPress page-view statistics summary. That is: the list of what was read yesterday from our site.

## Finding the K in K-means by Parametric Bootstrap

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book *Practical Data Science with R*.

We also came upon another cool approach, in the `mixtools` package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of gaussians) to your data, it helps to know how many components are in your mixture. The `boot.comp` function estimates the number of components (let’s call it *k*) by incrementally testing the hypothesis that there are *k+1* components against the null hypothesis that there are *k* components, via parametric bootstrap.

You can use a similar idea to estimate the number of clusters in a clustering problem, if you make a few assumptions about the shape of the clusters. This approach is only heuristic, and more ad-hoc in the clustering situation than it is in mixture modeling. Still, it’s another approach to add to your toolkit, and estimating the number of clusters via a variety of different heuristics isn’t a bad idea.
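To make the idea concrete, here is a minimal sketch of a parametric bootstrap for the number of clusters. It is not the code from the full article and not the `mixtools` implementation. It is written in Python rather than R, works only on one-dimensional data, and makes several assumptions of our own: clusters are treated as Gaussian with per-cluster mean and spread, the test statistic is the ratio of within-cluster sum of squares at *k* versus *k+1*, and all function names are hypothetical.

```python
import random
import statistics


def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on 1-D data with deterministic quantile starts."""
    ordered = sorted(xs)
    centers = [ordered[int((j + 0.5) * len(xs) / k)] for j in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    # Recompute labels so they always match the returned centers.
    labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
    return labels, centers


def wss(xs, k):
    """Total within-cluster sum of squares after a k-cluster fit."""
    labels, centers = kmeans_1d(xs, k)
    return sum((x - centers[lab]) ** 2 for x, lab in zip(xs, labels))


def improvement(xs, k):
    """Assumed test statistic: WSS(k) / WSS(k+1); large values favor k+1."""
    denom = wss(xs, k + 1)
    return wss(xs, k) / denom if denom > 0 else float("inf")


def simulate_null(xs, labels, centers, rng):
    """Draw synthetic data from Gaussian clusters fitted to the k-cluster fit."""
    sample = []
    for j, center in enumerate(centers):
        members = [x for x, lab in zip(xs, labels) if lab == j]
        spread = statistics.pstdev(members) if len(members) > 1 else 0.0
        sample.extend(rng.gauss(center, spread) for _ in members)
    return sample


def p_value(xs, k, n_boot, rng):
    """Fraction of null simulations that improve at least as much as the data."""
    observed = improvement(xs, k)
    labels, centers = kmeans_1d(xs, k)
    hits = sum(
        improvement(simulate_null(xs, labels, centers, rng), k) >= observed
        for _ in range(n_boot)
    )
    return hits / n_boot


def estimate_k(xs, max_k=6, n_boot=100, alpha=0.05, seed=0):
    """Smallest k for which 'k clusters' is not rejected in favor of k+1."""
    rng = random.Random(seed)
    for k in range(1, max_k):
        if p_value(xs, k, n_boot, rng) > alpha:
            return k
    return max_k
```

Calling `estimate_k(data)` on a list of numbers walks *k* upward from 1 and stops at the first *k* the bootstrap cannot reject. Because each step is a test at level `alpha`, the procedure will occasionally overshoot, which is one more reason to treat the result as a heuristic.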


## Databases in containers

A great number of readers reacted very positively to Nina Zumel’s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.

In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. Both happened to be implemented by a backing database system (PostgreSQL), but the database was not the center of attention for very long.

In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available even on Hadoop-scale systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server; a server is the *fee* paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).

Containerized DB

The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)

“Despite all my rage I am still just a rat in a cage!”


## Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

## The problem

Consider the following problem: for a number of items `U = {x_1, ..., x_n}` pick a small set of them `X = {x_i1, x_i2, ..., x_ik}` such that there is a high probability one of the `x in X` is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

- Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
- Online advertising. The user is presented with a number of advertisements and enticements in the hope that one of them matches user taste.
- Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
- Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
- Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure this through both per-tree variable selection and re-sampling (some of these issues are discussed here).
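The difference between scoring items one at a time and scoring the set as a whole can be sketched with a toy example. Everything below is a hypothetical illustration in Python, not the method developed in the full article: we assume the user has exactly one hidden intent, that each item succeeds only for the intents listed for it, and we use a simple greedy heuristic to build the set.

```python
# Hypothetical toy data: probability of each user intent, and which
# intents each candidate item would satisfy.
INTENT_PROB = {"jaguar car": 0.5, "jaguar animal": 0.3, "jaguar os": 0.2}
ITEM_COVERS = {
    "car_review": {"jaguar car"},
    "car_dealer": {"jaguar car"},
    "car_specs": {"jaguar car"},
    "zoo_page": {"jaguar animal"},
    "os_manual": {"jaguar os"},
}


def item_score(item):
    """Probability that this single item, shown alone, is a success."""
    return sum(INTENT_PROB[intent] for intent in ITEM_COVERS[item])


def set_score(items):
    """Probability that at least one item in the set is a success."""
    covered = set().union(*(ITEM_COVERS[item] for item in items))
    return sum(INTENT_PROB[intent] for intent in covered)


def top_k_by_item_score(k):
    """Pick the k individually best items, ignoring overlap."""
    return sorted(ITEM_COVERS, key=item_score, reverse=True)[:k]


def greedy_by_set_score(k):
    """Greedy heuristic: repeatedly add the item that most raises the set score."""
    chosen = []
    for _ in range(k):
        remaining = [item for item in ITEM_COVERS if item not in chosen]
        chosen.append(max(remaining, key=lambda item: set_score(chosen + [item])))
    return chosen
```

With `k = 3`, the per-item ranking returns three pages about the car, which all succeed or fail together, while the set-scored selection covers one page per intent. This is the same effect as in the sensor placement example: overlapping coverage does not make up for uncovered areas.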

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).

Minimal spanning trees, the basis of one set diversity metric.

## Free video course: applied Bayesian A/B testing in R

As a “thank you” to our blog, mailing list, and Twitter followers (@WinVectorLLC) we at Win-Vector LLC have decided to re-release our formerly fee-based A/B testing video course as a free (advertisement supported) video course here on YouTube.

The course emphasizes how to design A/B tests using prior “guesstimates” of effect sizes (often you have these from prior campaigns, or somebody claims an effect size and it is merely your job to confirm it). It is fairly technical, and the emphasis is Bayesian: we are trying to get an actual estimate of the distribution of the unknown true expected payoff rate of each campaign (the so-called posteriors). We show how to design and evaluate sales campaigns for a product at two different price points.
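As a rough illustration of what “estimating the posterior of the expected payoff rate” can mean, here is a minimal sketch. It is not the course's R code. It is written in Python and assumes a Beta-Binomial model for the conversion rate at each price point, with a uniform Beta(1, 1) prior standing in for the prior guesstimates the course discusses. The function names and the example numbers are hypothetical.

```python
import random


def expected_payoff_draws(price, sales, visitors, rng, draws=20000,
                          prior_sales=1.0, prior_misses=1.0):
    """Posterior draws of expected revenue per visitor under a Beta prior."""
    alpha = prior_sales + sales
    beta = prior_misses + (visitors - sales)
    return [price * rng.betavariate(alpha, beta) for _ in range(draws)]


def prob_b_beats_a(campaign_a, campaign_b, seed=2016, draws=20000):
    """Monte Carlo estimate of P(payoff of B > payoff of A | data).

    Each campaign is a (price, sales, visitors) tuple.
    """
    rng = random.Random(seed)
    draws_a = expected_payoff_draws(*campaign_a, rng=rng, draws=draws)
    draws_b = expected_payoff_draws(*campaign_b, rng=rng, draws=draws)
    return sum(b > a for a, b in zip(draws_a, draws_b)) / draws
```

For example, `prob_b_beats_a((10.0, 100, 1000), (20.0, 90, 1000))` compares a hypothetical $10 price that sold to 100 of 1000 visitors against a $20 price that sold to 90 of 1000. Replacing `prior_sales` and `prior_misses` with values derived from an earlier campaign is one way to encode a prior effect-size guess.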

The solution is coded in R and Nina Zumel has contributed an updated Shiny user interface demonstrating the technique (for more on Shiny, please see here). The code for the calculation methods and older Shiny app are shared here.

## Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a *serverless* SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.
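The pattern can be sketched in a few lines. This illustration is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module and an in-memory SQLite database rather than R and PostgreSQL, and the table, column, and function names are hypothetical. The point is only that the database is created, used for one data-manipulation step, and thrown away.

```python
import sqlite3


def mean_by_group(rows):
    """Summarize (group, value) pairs with SQL, using a throwaway database."""
    conn = sqlite3.connect(":memory:")  # transient: nothing is written to disk
    try:
        conn.execute("CREATE TABLE obs (grp TEXT, value REAL)")
        conn.executemany("INSERT INTO obs (grp, value) VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT grp, COUNT(*), AVG(value) FROM obs GROUP BY grp ORDER BY grp"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Calling `mean_by_group([("a", 1.0), ("a", 3.0), ("b", 10.0)])` returns one `(group, count, mean)` tuple per group. The same query text would run largely unchanged against a server-backed database, which is what makes the database interchangeable as a tool.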

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the `RPostgreSQL` package to interact with Postgres databases in R.

## “Introduction to Data Science” video course contest is closed

Congratulations to all the winners of the Win-Vector “Introduction to Data Science” Video Course giveaway! We’ve emailed all of you your individual subscription coupons.