The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.
We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.
We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.
For the rest of this post, we give a quick how-to on using the
RpostgreSQL package to interact with Postgres databases in R.
Continue reading Using PostgreSQL in R: A quick how-to
One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book, and if they wanted to follow up on a data set or technique to find the matching worked examples in the project directory of our book support materials git repository.
Some readers want to work much closer to the sequence in the book. To make working along with book easier we extracted all book examples and shared them with our readers (in a Github directory, and a downloadable zip file, press “Raw” to download). The direct extraction from the book guarantees the files are in sync with our revised book. However there are trade-offs, sometimes (for legibility) the book mixed input and output without using R’s comment conventions. So you can’t always just paste everything. Also for a snippet to run you may need some libraries, data and results of previous snippets to be present in your R environment.
To help these readers we have added a new section to the book support materials: knitr markdown sheets that work all the book extracts from each chapter. Each chapter and appendix now has a matching markdown file that sets up the correct context to run each and every snippet extracted from the book. In principle you can now clone the entire zmPDSwR repository to your local machine and run all the from the CodeExamples directory by using the RStudio project in RunExamples. Correct execution also depens on having the right packages installed so we have also added a worksheet showing everything we expect to see installed in one place: InstallAll.Rmd (note some of the packages require external dependencies to work such as a C compiler, curl libraries, and a Java framework to run).
A bit of text we are proud to steal from our good friend Joseph Rickert:
Then, for some very readable background material on SVMs I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.
For more on SVMs see the original article on the Revolution Analytics blog.
When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:
- Missing values (
NA or blanks)
- Problematic numerical values (
NaN, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values
Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
Continue reading Vtreat: designing a package for variable treatment
The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.
Our plan to teach is to:
- Order the material by what is expected from the data scientist.
- Emphasize the already available bread and butter machine learning algorithms that most often work.
- Provide a large set of worked examples.
- Expose the reader to a number of realistic data sets.
Some of these choices may put-off some potential readers. But it is our goal to try and spend out time on what a data scientist needs to do. Our point: the data scientist is responsible for end to end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).
Once you define what a data scientist does, you find fewer people want to work as one.
We expand a few of our points below. Continue reading A bit of the agenda of Practical Data Science with R
It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.
If you haven’t yet, order it now!
(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)
Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.
Deal of the Day August 1: Half off my book Practical Data Science with R. Use code dotd0801au at www.manning.com/zumel/
A bit about our upcoming book “Practical Data Science with R”. Nina and I share our current draft of the front matter from the book, which is a description which will help you decide if this is the book for you (we hope that it is). Or this could be the book that helps explain what you do to others.
Continue reading What is “Practical Data Science with R”?
We have some great news for “Practical Data Science with R”:
- We have started an announcement page to point direct readers to the book, book forums, data and the free preview chapter.
Continue reading Practical Data Science with R news