FeaturedPosted on Categories Opinion, Practical Data Science, StatisticsTags , 2 Comments on Practical Data Science with R2

Practical Data Science with R2

The secret is out: Nina Zumel and I are busy working on Practical Data Science with R2, the second edition of our best selling book on learning data science using the R language.

Our publisher, Manning, has a great slide deck describing the book (and a discount code!!!) here:

Pdsr2s

We also just got back our part-1 technical review for the new book. Here is a quote from the technical review we are particularly proud of:

The dot notation for base R and the dplyr package did make me stand up and think. Certain things suddenly made sense.

Continue reading Practical Data Science with R2

Posted on Categories Opinion, Programming, TutorialsTags , , 1 Comment on Make Teaching R Quasi-Quotation Easier

Make Teaching R Quasi-Quotation Easier

To make teaching R quasi-quotation easier it would be nice if R string-interpolation and quasi-quotation both used the same notation. They are related concepts. So some commonality of notation would actually be clarifying, and help teach the concepts. We will define both of the above terms, and demonstrate the relation between the two concepts.

Continue reading Make Teaching R Quasi-Quotation Easier

Posted on Categories Programming, TutorialsTags , , , Leave a comment on R Tip: Use Inline Operators For Legibility

R Tip: Use Inline Operators For Legibility

R Tip: use inline operators for legibility.

A Python feature I miss when working in R is the convenience of Python‘s inline + operator. In Python, + does the right thing for some built in data types:

  • It concatenates lists: [1,2] + [3] is [1, 2, 3].
  • It concatenates strings: 'a' + 'b' is 'ab'.

And, of course, it adds numbers: 1 + 2 is 3.

The inline notation is very convenient and legible. In this note we will show how to use a related notation R.

Continue reading R Tip: Use Inline Operators For Legibility

Posted on Categories Administrativia, data science, Practical Data Science, StatisticsTags , 1 Comment on Practical Data Science with R, 2nd Edition discount!

Practical Data Science with R, 2nd Edition discount!

Please help share our news and this discount.

The second edition of our best-selling book Practical Data Science with R2, Zumel, Mount is featured as deal of the day at Manning.

NewImage

The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.

The discount gets you half off. Also the 2nd edition comes with a free e-copy the first edition (so you can jump ahead).

Here are the details in Tweetable form:

Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.

Posted on Categories ProgrammingTags , , 2 Comments on R Tip: Use seqi() For Indexes

R Tip: Use seqi() For Indexes

R Tip: use seqi() for indexing.

R‘s 1:0 trap” is a mal-feature that confuses newcomers and is a reliable source of bugs. This note will show how to use seqi() to write more reliable code and document intent.

Continue reading R Tip: Use seqi() For Indexes

Posted on Categories Mathematics, Opinion, TutorialsTags , , Leave a comment on A Beautiful 2 by 2 Matrix Identity

A Beautiful 2 by 2 Matrix Identity

While working on a variation of the RcppDynProg algorithm we derived the following beautiful identity of 2 by 2 real matrices:

The superscript “top” denoting the transpose operation, the ||.||^2_2 denoting sum of squares norm, and the single |.| denoting determinant.

This is derived from one of the check equations for the Moore–Penrose inverse and we have details of the derivation here, and details of the messy algebra here.

Posted on Categories Coding, Opinion, TutorialsTags , , , 7 Comments on Timing the Same Algorithm in R, Python, and C++

Timing the Same Algorithm in R, Python, and C++

While developing the RcppDynProg R package I took a little extra time to port the core algorithm from C++ to both R and Python.

This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on for a summary of the results.

Continue reading Timing the Same Algorithm in R, Python, and C++

Posted on Categories Programming, Statistics, Tutorials, UncategorizedTags , 4 Comments on What does it mean to write “vectorized” code in R?

What does it mean to write “vectorized” code in R?

One often hears that R can not be fast (false), or more correctly that for fast code in R you may have to consider “vectorizing.”

A lot of knowledgable R users are not comfortable with the term “vectorize”, and not really familiar with the method.

“Vectorize” is just a slightly high-handed way of saying:

R naturally stores data in columns (or in column major order), so if you are not coding to that pattern you are fighting the language.

In this article we will make the above clear by working through a non-trivial example of writing vectorized code.

Continue reading What does it mean to write “vectorized” code in R?

Posted on Categories Exciting Techniques, math programming, TutorialsTags , , 3 Comments on Introducing RcppDynProg

Introducing RcppDynProg

RcppDynProg is a new Rcpp based R package that implements simple, but powerful, table-based dynamic programming. This package can be used to optimally solve the minimum cost partition into intervals problem (described below) and is useful in building piecewise estimates of functions (shown in this note).

Continue reading Introducing RcppDynProg

Posted on Categories UncategorizedLeave a comment on Rotary

Rotary

We try to keep this blog mostly technical and business (as we assume that is what our readers are here for).

However, this post is going to be an exception.

I’ve just got back from photographing the Rotary Club of San Francisco‘s 2018 Holiday Party. We had a special guest SF Mayor London Breed (shown here with Rotary Club of San Francisco President Rhonda Poppen).

IMG 0136

I am proud to say I have been a member of this organization for over 10 years. It is where I do my volunteer work both in San Francisco and internationally.

In particular I am thrilled to be supporting the efforts of a number of Rotarians and Roots of Peace in their latest effort to remediate farmland in Vietnam (with the help and permission of the Vietnamese government). These people are working hard to undo some of the pain and misery of unexploded ordinance (UXO). I’ll be helping with some administrative tasks and these people will be training hundreds of farmers to move into profitable world market crops.

IMG 0039

Pictured above Heidi Kuhn and Christian Kuhn of Roots of Peace.

Posted on Categories data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , Leave a comment on vtreat Variable Importance

vtreat Variable Importance

vtreat‘s purpose is to produce pure numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing-values re-encoded with indicators, and high-degree categorical re-encode by effects codes or impact codes).

In this note we will discuss a small aspect of the vtreat package: variable screening.

Continue reading vtreat Variable Importance