Categories: data science, Opinion, Pragmatic Data Science, Programming, Statistics

Data Manipulation Corner Cases

Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user runs up against the edge of what the package developer intended the package to handle, and thus where things often go wrong.

Let’s see what happens when we try to stick a fork in the power outlet.

Categories: data science, Programming, Tutorials

rquery Substitution

The rquery R package has several places where the user can ask for what they have typed to be replaced by a name or value stored in a variable.

This matters because many of the rquery commands capture column names from un-executed code. So it is important to know whether something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or as a character/string (which will be translated to a constant).
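The name-versus-string distinction can be seen in base R itself. The sketch below uses only base R (`bquote`, `as.name`, `eval`); it is not rquery's own substitution notation, and the data frame `d` and the variable `col` are made up for illustration.

```r
# Base R illustration only; this is NOT rquery's substitution mechanism.
d <- data.frame(x = c(1, 2, 3))   # hypothetical example data

col <- "x"                        # a column name stored in a variable

# Substituted as a character value: the expression is the constant "x".
expr_string <- bquote(.(col))

# Substituted as a symbol: the expression refers to the column named x.
expr_symbol <- bquote(.(as.name(col)) + 1)

res_string <- eval(expr_string, envir = d)   # evaluates to the string itself
res_symbol <- eval(expr_symbol, envir = d)   # evaluates to column x plus 1
```

The same stored value `"x"` gives a constant in one case and a column reference in the other, which is why it pays to know which treatment a given interface applies.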

Categories: data science, Exciting Techniques, Tutorials

Query Generation in R

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.

Categories: Administrativia, data science, Opinion, Statistics

PDSwR2 Free Excerpt and New Discount Code

Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.

This section is elementary, but things really pick up speed later on (later material is also available in a paid preview).

Categories: Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review!

The newly available chapters cover:

Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.

Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

Categories: data science, Exciting Techniques, Tutorials

Function Objects and Pipelines in R

Composing functions and sequencing operations are core programming concepts.

Some notable realizations of sequencing or pipelining operations include:

The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of each function’s arguments is primary, and the rest are parameters. For data science applications this is particularly common, so having convenient pipeline notation can be a plus. An example of a non-trivial data processing pipeline can be found here.
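The idea can be sketched in base R. This is plain function application, not the wrapr "dot arrow pipe" or `wrapr::UnaryFn`; the step functions (`scale_by`, `keep_above`), the helper `run_pipeline`, and the data are hypothetical names chosen for illustration.

```r
# Base R sketch; all function names and data here are hypothetical.
# Each step has one primary argument (the data) plus extra parameters.
scale_by <- function(d, factor) {
  d$y <- d$y * factor
  d
}
keep_above <- function(d, threshold) {
  d[d$y > threshold, , drop = FALSE]
}

# Fixing the parameters leaves single-argument functions of the data.
steps <- list(
  function(d) scale_by(d, factor = 10),
  function(d) keep_above(d, threshold = 15)
)

# Apply the steps left to right, feeding each result into the next step.
run_pipeline <- function(d, steps) {
  Reduce(function(acc, f) f(acc), steps, d)
}

d <- data.frame(y = c(1, 2, 3))
result <- run_pipeline(d, steps)   # rows with y of 20 and 30 remain
```

Because the parameters are bound ahead of time, the list `steps` can be stored and re-applied to new data, which is the property that makes pipeline notation convenient.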

In this note we will discuss the advanced R pipeline operator "dot arrow pipe" and an S4 class (wrapr::UnaryFn) that makes working with pipeline notation much more powerful and much easier.

Categories: data science, Exciting Techniques, Statistics, Tutorials

Fully General Record Transforms with cdata

One of the design goals of the cdata R package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact it is the goal to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data into arbitrary record shapes (please see here and here for the concepts).

But as with all general ideas, it is much easier to see what we mean by the above with a concrete example.

Categories: Administrativia, data science, Practical Data Science, Statistics

Practical Data Science with R, 2nd Edition discount!

Please help share our news and this discount.

The second edition of our best-selling book Practical Data Science with R (Zumel and Mount) is featured as the deal of the day at Manning.

The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.

The discount gets you half off. Also, the 2nd edition comes with a free e-copy of the first edition (so you can jump ahead).

Here are the details in Tweetable form:

Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.

Categories: data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

vtreat Variable Importance

vtreat’s purpose is to produce purely numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded by effects codes or impact codes).
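To make "missing values re-encoded with indicators" concrete, here is a minimal base R sketch of the idea. It does not call vtreat and is not vtreat's implementation: the column names `x_filled` and `x_was_missing` are hypothetical, and filling with the column mean is an assumption made for illustration.

```r
# Base R sketch of the idea only; NOT vtreat's implementation or naming.
d <- data.frame(x = c(1, NA, 3))   # hypothetical data with one missing value

# Indicator column: 1 where the original value was missing, 0 otherwise.
x_was_missing <- as.numeric(is.na(d$x))

# Stand-in value for the missing entries (assumption: the observed mean).
x_filled <- ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x)

# Result: purely numeric, no missing values, missingness kept as a signal.
prepared <- data.frame(x_filled = x_filled, x_was_missing = x_was_missing)
```

The indicator column keeps the fact that a value was missing available to the model, even after the gap itself has been filled.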

In this note we will discuss a small aspect of the vtreat package: variable screening.

Categories: data science, Exciting Techniques, Programming, Tutorials

Sharing Modeling Pipelines in R

Reusable modeling pipelines are a practical idea that gets re-developed many times in many contexts. wrapr supplies a particularly powerful pipeline notation, and a pipe-stage re-use system (notes here). We will demonstrate this with the vtreat data preparation system.
