This section is elementary, but things really pick up speed as later on (also available in a paid preview).
Some more Practical Data Science with R news.
Practical Data Science with R is the book we wish we had when we started in data science. Practical Data Science with R, Second Edition is the revision of that book with the packages we wish had been available at that time (in particular
wrapr). A second edition also lets us also correct some omissions, such as not demonstrating
For your part: please help us get the word out about this book. Practical Data Science with R, Second Edition, R in Action, Second Edition, and Think Like a Data Scientist are Manning’s August 20th 2018 “Deal of the Day” (use code
dotd082018au at https://www.manning.com/dotd).
For our part we are busy revising chapters and setting up a new Github repository for examples and code and other reader resources.
Four years ago today authors Nina Zumel and John Mount received our author’s copies of Practical Data Science with R!
Excited to see our new Hangul/Korean edition of “Practical Data Science with R” by Nina Zumel, John Mount, translated by Daekyoung Lim.
The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.
We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.
We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.
For the rest of this post, we give a quick how-to on using the
RpostgreSQL package to interact with Postgres databases in R.
One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book, and if they wanted to follow up on a data set or technique to find the matching worked examples in the project directory of our book support materials git repository.
Some readers want to work much closer to the sequence in the book. To make working along with book easier we extracted all book examples and shared them with our readers (in a Github directory, and a downloadable zip file, press “Raw” to download). The direct extraction from the book guarantees the files are in sync with our revised book. However there are trade-offs, sometimes (for legibility) the book mixed input and output without using R’s comment conventions. So you can’t always just paste everything. Also for a snippet to run you may need some libraries, data and results of previous snippets to be present in your R environment.
To help these readers we have added a new section to the book support materials: knitr markdown sheets that work all the book extracts from each chapter. Each chapter and appendix now has a matching markdown file that sets up the correct context to run each and every snippet extracted from the book. In principle you can now clone the entire zmPDSwR repository to your local machine and run all the from the CodeExamples directory by using the RStudio project in RunExamples. Correct execution also depens on having the right packages installed so we have also added a worksheet showing everything we expect to see installed in one place: InstallAll.Rmd (note some of the packages require external dependencies to work such as a C compiler, curl libraries, and a Java framework to run).
A bit of text we are proud to steal from our good friend Joseph Rickert:
Then, for some very readable background material on SVMs I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.
For more on SVMs see the original article on the Revolution Analytics blog.
When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:
- Missing values (
- Problematic numerical values (
NaN, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values
Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.
Our plan to teach is to:
- Order the material by what is expected from the data scientist.
- Emphasize the already available bread and butter machine learning algorithms that most often work.
- Provide a large set of worked examples.
- Expose the reader to a number of realistic data sets.
Some of these choices may put-off some potential readers. But it is our goal to try and spend out time on what a data scientist needs to do. Our point: the data scientist is responsible for end to end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).
Once you define what a data scientist does, you find fewer people want to work as one.
We expand a few of our points below. Continue reading A bit of the agenda of Practical Data Science with R
It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.
If you haven’t yet, order it now!
(softbound 416 pages, black and white; includes access to color PDF, ePub and Kindle when available)