# Tag: Practical Data Science

## Practical Data Science with R Book Update

Categories: Administrativia, Practical Data Science

## PDSwR2 Free Excerpt and New Discount Code

Categories: Administrativia, data science, Opinion, Statistics

## More Practical Data Science with R Book News

Categories: Administrativia, Practical Data Science, Statistics

## Four Years of Practical Data Science with R

Categories: Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, Statistics

## Hangul/Korean edition of Practical Data Science with R!

Categories: Administrativia, Practical Data Science, Statistics

## Using PostgreSQL in R: A quick how-to

Categories: Coding, data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Tutorials

## Practical Data Science with R examples

Categories: Practical Data Science, Pragmatic Data Science, Statistics

## Thank you Joseph Rickert!

Categories: Administrativia, Practical Data Science

## Vtreat: designing a package for variable treatment

Categories: Coding, data science, math programming, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

## A bit of the agenda of Practical Data Science with R

Categories: data science, Expository Writing, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Rants, Statistics

A good friend shared with us a great picture of Practical Data Science with R, 1st Edition hanging out in Cambridge at the MIT Press Bookstore.

This is as good an excuse as any to share a book update.

Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.

This section is elementary, but things really pick up speed later on (also available in a paid preview).

Some more *Practical Data Science with R* news.

*Practical Data Science with R* is the book we wish we had when we started in data science. *Practical Data Science with R, Second Edition* is the revision of that book with the packages we wish had been available at that time (in particular `vtreat`, `cdata`, and `wrapr`). A second edition also lets us correct some omissions, such as not demonstrating `data.table`.

For your part: please help us get the word out about this book. Practical Data Science with R, Second Edition, R in Action, Second Edition, and Think Like a Data Scientist are Manning’s August 20th, 2018 “Deal of the Day” (use code `dotd082018au` at https://www.manning.com/dotd).

For our part, we are busy revising chapters and setting up a new GitHub repository for examples, code, and other reader resources.

Four years ago today authors Nina Zumel and John Mount received our author’s copies of Practical Data Science with R!


Excited to see our new Hangul/Korean edition of “Practical Data Science with R” by Nina Zumel, John Mount, translated by Daekyoung Lim.


The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a *serverless* SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.
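A minimal sketch of the pattern, under stated assumptions: it uses an in-memory SQLite database through the `DBI` and `RSQLite` packages as a stand-in for a serverless SQL engine (SQLite is our choice for illustration and is not named in the post), and the table and column names are hypothetical. The heavy lifting (grouping and aggregation) happens in SQL, and only the small summary comes back to R.

```r
library(DBI)

# Assumption: RSQLite is installed; ":memory:" creates a throwaway database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical example data standing in for a larger table.
d <- data.frame(
  grp = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "measurements", d)

# Delegate the aggregation to SQL and bring back only the summary.
summary_df <- dbGetQuery(con, "
  SELECT grp, COUNT(*) AS n, AVG(value) AS mean_value
  FROM measurements
  GROUP BY grp
  ORDER BY grp
")

dbDisconnect(con)
print(summary_df)
```

The same `DBI` calls work against other back ends, so the R side of the code changes little when the database does.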

Image: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the `RPostgreSQL` package to interact with Postgres databases in R.
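As a rough sketch of what such a session can look like (not necessarily the exact steps of the full post): the function below connects, writes a small table, queries it back, and disconnects. The function name, database name, user, and table name are all hypothetical placeholders, and it assumes the `DBI` and `RPostgreSQL` packages are installed and a PostgreSQL server is reachable with those settings.

```r
# Hypothetical helper: a round trip through a PostgreSQL database.
# All connection settings are placeholders; replace them with your own.
postgres_round_trip <- function(dbname = "mydb", host = "localhost",
                                port = 5432, user = "myuser",
                                password = Sys.getenv("PGPASSWORD")) {
  con <- DBI::dbConnect(
    RPostgreSQL::PostgreSQL(),
    dbname = dbname, host = host, port = port,
    user = user, password = password
  )
  # Always release the connection, even if a later step fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Write a small example table (hypothetical name "scores").
  d <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9))
  DBI::dbWriteTable(con, "scores", d, overwrite = TRUE, row.names = FALSE)

  # Run SQL on the server and return the result as a data frame.
  DBI::dbGetQuery(con, "SELECT id, score FROM scores WHERE score > 0.6 ORDER BY id")
}

# Usage, once a database is available:
# result <- postgres_round_trip(dbname = "mydb", user = "myuser")
```

Reading the password from an environment variable is a design choice for the sketch; it keeps credentials out of the script.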

One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book and, if they wanted to follow up on a data set or technique, to find the matching worked examples in the project directory of our book support materials git repository.

Some readers want to work much closer to the sequence in the book. To make working along with the book easier we extracted all book examples and shared them with our readers (in a GitHub directory, and a downloadable zip file, press “Raw” to download). The direct extraction from the book guarantees the files are in sync with our revised book. However, there are trade-offs: sometimes (for legibility) the book mixed input and output without using R’s comment conventions, so you can’t always just paste everything. Also, for a snippet to run you may need some libraries, data, and results of previous snippets to be present in your R environment.

To help these readers we have added a new section to the book support materials: knitr markdown sheets that work through all the book extracts from each chapter. Each chapter and appendix now has a matching markdown file that sets up the correct context to run each and every snippet extracted from the book. In principle you can now clone the entire zmPDSwR repository to your local machine and run all the examples from the CodeExamples directory by using the RStudio project in RunExamples. Correct execution also depends on having the right packages installed, so we have also added a worksheet showing everything we expect to see installed in one place: InstallAll.Rmd (note that some of the packages require external dependencies to work, such as a C compiler, curl libraries, and a Java framework).

A bit of text we are proud to steal from our good friend Joseph Rickert:

> Then, for some very readable background material on SVMs I recommend section 13.4 of *Applied Predictive Modeling* and sections 9.3 and 9.4 of *Practical Data Science with R* by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.

For more on SVMs see the original article on the Revolution Analytics blog.

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

- Missing values (`NA` or blanks)
- Problematic numerical values (`Inf`, `NaN`, sentinel values like 999999999 or -1)
- Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
- Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.
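To make the kinds of treatment concrete, here is a small base-R sketch. It is an illustration under our own assumptions, not the package's API: the helper names `treat_numeric` and `treat_categorical`, the `_missing_` and `_novel_` codes, and the choice of mean as the fill value are all hypothetical. The idea is to replace unusable numeric values with a fill value while recording where that happened, and to give missing or previously unseen categorical levels explicit codes.

```r
# Hypothetical helpers for illustration only; these are NOT the vtreat API.

# Replace NA/NaN/Inf and sentinel codes with a fill value, and record
# which entries were replaced in an indicator column.
treat_numeric <- function(x, sentinels = numeric(0), fill = NULL) {
  # !is.finite() is TRUE for NA, NaN, Inf, and -Inf.
  bad <- !is.finite(x) | (x %in% sentinels)
  if (is.null(fill)) {
    # On training data: use the mean of the good values (0 if none are good).
    fill <- if (all(bad)) 0 else mean(x[!bad])
  }
  data.frame(clean = ifelse(bad, fill, x), isBad = as.numeric(bad))
}

# Map missing/blank entries and levels not seen in training to explicit codes.
treat_categorical <- function(x, train_levels) {
  x <- as.character(x)
  x[is.na(x) | x == ""] <- "_missing_"
  x[!(x %in% c(train_levels, "_missing_"))] <- "_novel_"
  factor(x, levels = c(train_levels, "_missing_", "_novel_"))
}

num <- treat_numeric(c(1, NA, Inf, 999999999, 3), sentinels = 999999999)
cat_var <- treat_categorical(c("a", "", NA, "z", "b"), train_levels = c("a", "b"))
print(num)
print(cat_var)
```

When applying the same treatment to new data, pass the `fill` value computed on the training data rather than recomputing it, so that training and application data are treated consistently.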


The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.

Our plan to teach is to:

- Order the material by what is expected from the data scientist.
- Emphasize the already available bread and butter machine learning algorithms that most often work.
- Provide a large set of worked examples.
- Expose the reader to a number of realistic data sets.

Some of these choices may put off some potential readers. But it is our goal to spend our time on what a data scientist needs to do. Our point: the data scientist is responsible for end-to-end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).

Once you define what a data scientist does, you find fewer people want to work as one.

We expand a few of our points below.