Kudos to Professor Andrew Gelman for telling a great joke at his own expense:
Stupid-ass statisticians don’t know what a goddam confidence interval is.
He brilliantly burlesqued a frustratingly common occurrence that many people claim they “have never seen happen.”
One of the pains of writing about data science is that there is a (small but vocal) sub-population of statisticians who will jump on your first mistake (we all make errors) and then expand it into an essay on how you: know nothing, are stupid, are ignorant, are unqualified, and are evil.
I get it: many people writing about data science do not know enough statistics. However, not every person writing from a data science point of view is statistically ignorant. That is not to say computer science (my original field) doesn’t have similar problems.
Trying to destroy a sweater by pulling on a loose thread in no way establishes that it wasn’t a nice sweater in the first place (or how nice a sweater it would be if the loose thread were fixed).
(BTW: the book in question is in fact excellent. Chapter 12 alone is worth at least ten times the list price of the book.)
We are excited to announce the rquery R package.

rquery is Win-Vector LLC‘s currently-in-development big data query tool for R. rquery supplies a set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with dplyr at big data scale in production).
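A minimal sketch of the operator style, assuming the current rquery API (which has continued to evolve since this announcement) and using an in-memory SQLite connection as a stand-in for a big-data back-end:

```r
library(rquery)
library(wrapr)

raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
d <- rq_copy_to(raw_connection, "d",
                data.frame(x = c(1, 2, 3), g = c("a", "a", "b")))

# relational operators compose into a query tree
ops <- d %.>%
  extend(., x2 := x + x) %.>%
  project(., sum_x2 := sum(x2), groupby = "g")

cat(to_sql(ops, raw_connection))  # inspect the generated SQL
execute(raw_connection, ops)      # run the query on the database
DBI::dbDisconnect(raw_connection)
```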
Continue reading Announcing rquery
I am excited to share a new deep learning model performance trajectory graph.
Here is an example produced from a Keras model trained in R, plotted with ggplot2.
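A minimal sketch (not the article's exact code) of how such a trajectory plot can be built, assuming the keras and ggplot2 packages are installed:

```r
library(keras)
library(ggplot2)

# assume a trained model with a captured per-epoch history, e.g.:
# history <- fit(model, x_train, y_train,
#                epochs = 30, validation_split = 0.2)

plot_trajectory <- function(history) {
  # keras supplies as.data.frame() for training histories:
  # one row per (epoch, metric, data-set) combination
  df <- as.data.frame(history)
  ggplot(df, aes(x = epoch, y = value, color = data)) +
    geom_line() +
    facet_wrap(~ metric, scales = "free_y") +
    ggtitle("deep learning model performance trajectory")
}
```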
Continue reading Plotting Deep Learning Model Performance Trajectories
For some time we have been teaching R users: "when working with wide tables on Spark or on databases, narrow to the columns you really want to work with early in your analysis."
The idea behind the advice is: working with fewer columns makes for quicker queries.
(photo: Jacques Henri Lartigue, 1912)
The issue arises because wide tables (200 to 1000 columns) are quite common in big-data analytics projects. Often these are "denormalized marts" that are used to drive many different projects. For any one project only a small subset of the columns may be relevant in a calculation.
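A minimal sketch of the advice in sparklyr/dplyr terms, assuming a local Spark install and using mtcars as a (hypothetical) stand-in for a wide mart:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
wide_tbl <- copy_to(sc, mtcars, "wide_example")

narrow_result <- wide_tbl %>%
  select(mpg, cyl, wt) %>%   # narrow early: only the needed columns
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```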
Continue reading How to Greatly Speed Up Your Spark Queries
I was enjoying Gabriel’s article Pipes in R Tutorial For Beginners and wanted to call attention to a few more pipes in R (not all of them for beginners).
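A minimal sketch of a few of the pipes in question, assuming the magrittr and wrapr packages are installed:

```r
library(magrittr)
library(wrapr)

# magrittr forward pipe
mtcars %>% head(2)

# wrapr "dot arrow" pipe: the piped value is referred to as `.`
mtcars %.>% head(., 2)

# "Bizarro pipe": not a true pipe, just base-R right-arrow
# assignment to `.` chained with a semicolon
mtcars ->.; head(., 2)
```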
Continue reading More Pipes in R
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.
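A minimal sketch, assuming the seplyr package is installed: seplyr exposes a standard-evaluation interface to dplyr verbs, so column names can be passed around as ordinary character vectors.

```r
library(seplyr)

grouping_cols <- c("cyl", "gear")  # column names as data

res <- summarize_se(
  group_by_se(mtcars, grouping_cols),
  c("mean_mpg" = "mean(mpg)"))
```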
Continue reading Getting started with seplyr
In our last article we pointed out a dangerous silent result corruption we have seen when using the dplyr package with databases.
To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free: never assign the same value twice, and never use a value in the same mutate() in which it is formed. We consider these key precautions to take when using dplyr with a database.
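A minimal sketch of the refactoring (with a small local tibble standing in for a database table): split a mutate() that both creates and then uses a value into two dependency-free stages.

```r
library(dplyr)

d <- tibble(x = c(1, 5), y = c(3, 2))

# risky on some database back-ends: `delta` is created and used
# inside the same mutate()
# d %>% mutate(delta = x - y, flag = delta > 0)

# safer: each mutate() only references values formed in earlier stages
d %>%
  mutate(delta = x - y) %>%
  mutate(flag = delta > 0)
```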
We would also like to point out that we are distributing free tools to do this automatically, along with a worked example of this solution.
A note to users of dplyr with databases: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements.

Continue reading Please inspect your dplyr+database code