I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.
John Mount discussing
SQL generation at LinkedIn.
If you have a group using
R at database or
Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.
Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.
Win-Vector LLC recently announced the
R package, an operator based query generator.
In this note I want to share some exciting and favorable initial rquery benchmark timings.
Continue reading rquery: Fast Data Manipulation in R
This note describes a useful
replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).
Continue reading Use a Join Controller to Document Your Work
I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join:
Please read on for my discussion of this diagram and teaching joins. Continue reading Visualizing relational joins
We have updated the errata for Practical Data Science with R to reflect that it is no longer worth the effort to use the Java version of SQLScrewdriver as described.
We are very sorry for any confusion, trouble, or wasted effort bringing in Java software (something we are very familiar with, but forget not everybody uses) has caused readers. Also, database adapters for R have greatly improved, so we feel more confident depending on them alone. Practical Data Science with R remains an excellent book and a good resource to learn from that we are very proud of and fully support (hence errata). Continue reading Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article
replyr is an
R package that contains extensions, adaptions, and work-arounds to make remote
dplyr data sources (including big data systems such as
Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory
Spark2 currently being the primary supported platforms).
Continue reading replyr: Get a Grip on Big Data in R
Consider the common following problem: compute for a data set (say the infamous
iris example data set) per-group ranks. Suppose we want the rank of
Sepal.Lengths on a per-
Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is not likely ever the analyst’s end goal but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
Iris, by Diliff – Own work, CC BY-SA 3.0, Link
In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”. Continue reading Organize your data manipulation in terms of “grouped ordered apply”
Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how
data.frames describe themselves (try “
str(data.frame(x=1:2))” in an
R-console to see this) and is part of the tidy data manifesto.
SQL (structured query language) and
dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below. Continue reading The case for index-free data manipulation
I have previously written on using containerized PostgreSQL with R. This show the steps for using containerized MySQL with R. Continue reading MySql in a container
A great number of readers reacted very positively to Nina Zumel‘s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool.
In her case the tools were the data manipulation grammars SQL (Structured Query Language) and dplyr. It happened to be the case that in both cases the implementation was supplied by a backing database system (PostgreSQL), but the database was not the center of attention for very long.
In this note we will concentrate on SQL (which itself can be used to implement dplyr operators, and is available on even Hadoop scaled systems such as Hive). Our point can be summarized as: SQL isn’t the price of admission to a server, a server is the fee paid to use SQL. We will try to reduce the fee and show how to containerize PostgreSQL on Microsoft Windows (as was already done for us on Apple OSX).
The Smashing Pumpkins “Bullet with Butterfly Wings” (start 2 minutes 6s)
“Despite all my rage I am still just a rat in a cage!”
(image credit). Continue reading Databases in containers