
What is a Second Edition?

What is a second edition of a book to its authors?

In some sense it is the book the authors thought they were writing the first time.

With some good fortune a second edition can be much more than that.

For our example: Nina and I received a lot of positive and useful feedback from people who used the first edition of Practical Data Science with R to learn from, or even to teach from. This helped tell us what was working as we had hoped and what needed some improvement.

In the new second edition of Practical Data Science with R we were able to make some major improvements that we think both new readers and returning readers will be interested in.

  • The ad-hoc examples worked on the KDD 2009 data set have been removed and replaced with steps from the vtreat package that are easier to document and justify. We kept the first edition's structure of teaching the concepts directly (without a package) on smaller data before moving to our example problem. However, the example problem is now worked using the vtreat package, allowing us to separate essential issues from inessential difficulties.
  • We used the extra space to include an entirely new chapter on how to use vtreat to prepare messy data. This is the newest vtreat manual we have, and incorporates a lot we have learned about teaching the use of vtreat. So as not to “hold the package hostage to the book” we have also improved the free vtreat documentation here. (Note: vtreat is also now available for Python users here.)
  • We added a chapter on data wrangling, which shows how to perform the essential data arrangement steps in base-R, data.table, and dplyr. This gives the user a lot of options and the ability to compare options.
  • We begged, borrowed, and stole space to add a major section on regularization and how it improves model performance.
  • We were able to add gradient boosting, to complement random forests.
  • We were able to include a section on model explainability. These are new per-example diagnostics that complement global model diagnostics (such as variable importance) and are going to be more and more important as people finally start to devote more time to machine learning ethics.

Biggest current regret: we didn’t have space to do a lot more with data re-shaping using cdata (free material on this can be found here, also now available in Python via the data_algebra package).

As an aside: a big purpose of the book is teaching the terminology, so that the above statements seem familiar and make sense.

Please check out the book at Amazon.com or from the publisher Manning.com (half off at the time of this writing!).
