Using PostgreSQL in R: A quick how-to

Posted on Categories Coding, data science, Expository Writing, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , , , , , 3 Comments on Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.

NewImageImage: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

Continue reading Using PostgreSQL in R: A quick how-to

Win-Vector news

Posted on Categories AdministrativiaTags , Leave a comment on Win-Vector news

Just an update of what we have been up to lately at Win-Vector LLC, and a reminder of some of our current offerings. It has been busy lately (and that is good).

Our current professional service offerings continue to be data science consulting (helping companies extract value from their data and data infrastructure) and on-site corporate training. We have been honored to recently deliver our training to teams at Salesforce and Genentech. Continue reading Win-Vector news

Practical Data Science with R examples

Posted on Categories Practical Data Science, Pragmatic Data Science, StatisticsTags , , , Leave a comment on Practical Data Science with R examples

One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book, and if they wanted to follow up on a data set or technique to find the matching worked examples in the project directory of our book support materials git repository.

Some readers want to work much closer to the sequence in the book. To make working along with book easier we extracted all book examples and shared them with our readers (in a Github directory, and a downloadable zip file, press “Raw” to download). The direct extraction from the book guarantees the files are in sync with our revised book. However there are trade-offs, sometimes (for legibility) the book mixed input and output without using R’s comment conventions. So you can’t always just paste everything. Also for a snippet to run you may need some libraries, data and results of previous snippets to be present in your R environment.

To help these readers we have added a new section to the book support materials: knitr markdown sheets that work all the book extracts from each chapter. Each chapter and appendix now has a matching markdown file that sets up the correct context to run each and every snippet extracted from the book. In principle you can now clone the entire zmPDSwR repository to your local machine and run all the from the CodeExamples directory by using the RStudio project in RunExamples. Correct execution also depens on having the right packages installed so we have also added a worksheet showing everything we expect to see installed in one place: InstallAll.Rmd (note some of the packages require external dependencies to work such as a C compiler, curl libraries, and a Java framework to run).

What was data science before it was called data science?

Posted on Categories Opinion, StatisticsTags , , , 3 Comments on What was data science before it was called data science?

“Data Science” is obviously a trendy term making it way through the hype cycle. Either nobody is good enough to be a data scientist (unicorns) or everybody is too good to be a data scientist (or the truth is somewhere in the middle).



NewImage

Gartner hype cycle (Wikipedia).

And there is a quarter that grumbles that we are merely talking about statistics under a new name (see here and here).

It has always been the case that advances in data engineering (such as punch cards, or data centers) make analysis practical at new scales (though I still suspect Map/Reduce was a plot designed to trick engineers into being excited about ETL and report generation).


NewImage

Data Science 1832: Semen Korsakov card.

However, in the 1940s and 1950s the field was called “operations research” (even when performed by statisticians). When you read John F. Magee, (2002) “Operations Research at Arthur D. Little, Inc.: The Early Years”, Operations Research 50(1):149-153 http://dx.doi.org/10.1287/opre.50.1.149.17796 you really come away with the impression you are reading about a study of online advertising performed in the 1940s (okay mail advertising, but mail was “the email of its time”).

In this spirit next week we will write about the sequential analysis solution for A/B-testing, invented in the 1940s by one of the greats of statistics and operations research: Abraham Wald (whom we have written about before).


NewImage

Abraham Wald

Who is allowed to call themselves a data scientist?

Posted on Categories Pragmatic Data Science, Quantitative Finance, RantsTags , , 3 Comments on Who is allowed to call themselves a data scientist?

It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists.



NewImage
Gartner hype cycle (Wikipedia).

Given we agree data science exists, who is allowed to call themselves a data scientist? Continue reading Who is allowed to call themselves a data scientist?

I was wrong about statistics

Posted on Categories Opinion, Rants, StatisticsTags , , , 9 Comments on I was wrong about statistics

I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).


5317820857 d1f6a5b8a9 b
Used wrong (image Justin Baeder, some rights reserved).

Continue reading I was wrong about statistics

Announcing: Introduction to Data Science video course

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, StatisticsTags , ,

Win-Vector LLC’s Nina Zumel and John Mount are proud to announce their new data science video course Introduction to Data Science is now available on Udemy.


Data Science 2
Continue reading Announcing: Introduction to Data Science video course

The gap between data mining and predictive models

Posted on Categories data science, Expository Writing, Opinion, Pragmatic Data Science, StatisticsTags , , , 3 Comments on The gap between data mining and predictive models

The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining results and usable predictive models. Data mining results like this (and the infamous “Beer and Diapers story”) face an expectation that one is immediately ready to implement something like what is claimed in: “Target Figured Out A Teen Girl Was Pregnant Before Her Father Did” once an association is plotted.

Producing a revenue improving predictive model is much harder than mining an interesting association. And this is what we will discuss here. Continue reading The gap between data mining and predictive models

Big News! “Practical Data Science with R” MEAP launched!

Posted on Categories Administrativia, data science, Exciting Techniques, StatisticsTags , , , , , 7 Comments on Big News! “Practical Data Science with R” MEAP launched!

Nina Zumel and I ( John Mount ) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.


Zumel PDSwithR 3

Please subscribe to our book, your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward). Continue reading Big News! “Practical Data Science with R” MEAP launched!

Data Science, Machine Learning, and Statistics: what is in a name?

Posted on Categories data science, Expository Writing, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , 4 Comments on Data Science, Machine Learning, and Statistics: what is in a name?

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.

I’ve written about statistics v.s. machine learning , but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well, I am going to take a swipe at explaining data science.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990’s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Continue reading Data Science, Machine Learning, and Statistics: what is in a name?