Posted on Categories Administrativia, Practical Data ScienceTags , , Leave a comment on Practical Data Science with R Book Update (April 2019)

Practical Data Science with R Book Update (April 2019)

I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.

Continue reading Practical Data Science with R Book Update (April 2019)

Posted on Categories Administrativia, Practical Data ScienceTags , , Leave a comment on Practical Data Science with R Book Update

Practical Data Science with R Book Update

A good friend shared with us a great picture of Practical Data Science with R, 1st Edition hanging out in Cambridge at the MIT Press Bookstore.

IMG 20190404 114957

This is as good an excuse as any to share a book update.

Continue reading Practical Data Science with R Book Update

Posted on Categories Administrativia, data science, Opinion, StatisticsTags , ,

PDSwR2 Free Excerpt and New Discount Code

Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.

This section is elementary, but things really pick up speed as later on (also available in a paid preview).

Posted on Categories Administrativia, Practical Data Science, StatisticsTags , , 1 Comment on More Practical Data Science with R Book News

More Practical Data Science with R Book News

Some more Practical Data Science with R news.

Practical Data Science with R is the book we wish we had when we started in data science. Practical Data Science with R, Second Edition is the revision of that book with the packages we wish had been available at that time (in particular vtreat, cdata, and wrapr). A second edition also lets us also correct some omissions, such as not demonstrating data.table.

For your part: please help us get the word out about this book. Practical Data Science with R, Second Edition, R in Action, Second Edition, and Think Like a Data Scientist are Manning’s August 20th 2018 “Deal of the Day” (use code dotd082018au at https://www.manning.com/dotd).

For our part we are busy revising chapters and setting up a new Github repository for examples and code and other reader resources.

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , ,

Four Years of Practical Data Science with R

Four years ago today authors Nina Zumel and John Mount received our author’s copies of Practical Data Science with R!

1960860 10203595069745403 608808262 o

Continue reading Four Years of Practical Data Science with R

Posted on Categories Administrativia, Practical Data Science, StatisticsTags , , 2 Comments on Hangul/Korean edition of Practical Data Science with R!

Hangul/Korean edition of Practical Data Science with R!

Excited to see our new Hangul/Korean edition of “Practical Data Science with R” by Nina Zumel, John Mount, translated by Daekyoung Lim.

IMG 0865

Continue reading Hangul/Korean edition of Practical Data Science with R!

Posted on Categories Coding, data science, Expository Writing, Practical Data Science, Pragmatic Data Science, TutorialsTags , , , , , , , , 4 Comments on Using PostgreSQL in R: A quick how-to

Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data handling to a lightweight infrastructure with the power of SQL for data manipulation.

NewImageImage: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

Continue reading Using PostgreSQL in R: A quick how-to

Posted on Categories Practical Data Science, Pragmatic Data Science, StatisticsTags , , ,

Practical Data Science with R examples

One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book, and if they wanted to follow up on a data set or technique to find the matching worked examples in the project directory of our book support materials git repository.

Some readers want to work much closer to the sequence in the book. To make working along with book easier we extracted all book examples and shared them with our readers (in a Github directory, and a downloadable zip file, press “Raw” to download). The direct extraction from the book guarantees the files are in sync with our revised book. However there are trade-offs, sometimes (for legibility) the book mixed input and output without using R’s comment conventions. So you can’t always just paste everything. Also for a snippet to run you may need some libraries, data and results of previous snippets to be present in your R environment.

To help these readers we have added a new section to the book support materials: knitr markdown sheets that work all the book extracts from each chapter. Each chapter and appendix now has a matching markdown file that sets up the correct context to run each and every snippet extracted from the book. In principle you can now clone the entire zmPDSwR repository to your local machine and run all the from the CodeExamples directory by using the RStudio project in RunExamples. Correct execution also depens on having the right packages installed so we have also added a worksheet showing everything we expect to see installed in one place: InstallAll.Rmd (note some of the packages require external dependencies to work such as a C compiler, curl libraries, and a Java framework to run).

Posted on Categories Administrativia, Practical Data ScienceTags , , , ,

Thank you Joseph Rickert!

A bit of text we are proud to steal from our good friend Joseph Rickert:

Then, for some very readable background material on SVMs I recommend section 13.4 of Applied Predictive Modeling and sections 9.3 and 9.4 of Practical Data Science with R by Nina Zumel and John Mount. You will be hard pressed to find an introduction to kernel methods and SVMs that is as clear and useful as this last reference.

For more on SVMs see the original article on the Revolution Analytics blog.

Posted on Categories Coding, data science, math programming, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , , , , , , 3 Comments on Vtreat: designing a package for variable treatment

Vtreat: designing a package for variable treatment

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again:

  • Missing values (NA or blanks)
  • Problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1)
  • Valid categorical levels that don’t appear in the training data (especially when there are rare levels, or a large number of levels)
  • Invalid values

Of course, you should examine the data to understand the nature of the data issues: are the missing values missing at random, or are they systematic? What are the valid ranges for the numerical data? Are there sentinel values, what are they, and what do they mean? What are the valid values for text fields? Do we know all the valid values for a categorical variable, and are there any missing? Is there any principled way to roll up category levels? In the end though, the steps you take to deal with these issues will often be the same from data set to data set, so having a package of ready-to-go functions for data treatment is useful. In this article, we will discuss some of our usual data treatment procedures, and describe a prototype R package that implements them.

Continue reading Vtreat: designing a package for variable treatment