Categories: data science, Statistics

Estimating Rates using Probability Theory: Chalk Talk

We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through estimating the sample size needed to reliably estimate event rates. The talk expands on the basic calculations, and then moves to the ideas of: Sample size and power for rare events.

Please check it out here.
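As a rough illustration of the kind of calculation involved, here is a minimal Python sketch of a sample-size estimate using the normal approximation to the binomial. It is not necessarily the derivation used in the talk; the function name `sample_size_for_rate` and its parameters are hypothetical, and the assumed true rate `p` is a guess the analyst supplies.

```python
import math
from statistics import NormalDist


def sample_size_for_rate(p, rel_error, confidence=0.95):
    """Approximate n so the observed rate lands within rel_error * p of p.

    Assumption: normal approximation to the binomial, with p an assumed
    (guessed) true event rate supplied by the analyst.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if rel_error <= 0.0 or not 0.0 < confidence < 1.0:
        raise ValueError("rel_error must be positive, confidence in (0, 1)")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Convert the relative tolerance into an absolute one.
    abs_error = rel_error * p
    # Solve z * sqrt(p * (1 - p) / n) <= abs_error for n.
    return math.ceil(z * z * p * (1.0 - p) / (abs_error * abs_error))


# A 1% event rate estimated to within 10% of its value (relative error).
n_needed = sample_size_for_rate(0.01, 0.1)
```

Under these assumptions a 1% rate needs on the order of tens of thousands of observations, and rarer events need more, which is why rare-event rates call for careful sample-size planning.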

Categories: data science, Exciting Techniques, Practical Data Science, Pragmatic Data Science, Tutorials

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they make a good set of examples or exercises to work through.


Categories: data science, Opinion, Pragmatic Data Science, Programming, Statistics

Data Manipulation Corner Cases

Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.

Let’s see what happens when we try to stick a fork in the power outlet.



Categories: data science, Programming, Tutorials

rquery Substitution

The rquery R package has several places where the user can ask for what they have typed in to be replaced by a name or value stored in a variable.

This matters because many of the rquery commands capture column names from un-executed code. It is therefore important to know whether something is treated as a symbol/name (which will be translated to a data.frame column name or a database column name) or as a character/string (which will be translated to a constant).


Categories: data science, Exciting Techniques, Tutorials

Query Generation in R

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.


Categories: Administrativia, data science, Opinion, Statistics

PDSwR2 Free Excerpt and New Discount Code

Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.

This section is elementary, but things really pick up speed later on (also available in a paid preview).

Categories: Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review!


The newly available chapters cover:

Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.

Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.

If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.

For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.

Categories: data science, Exciting Techniques, Tutorials

Function Objects and Pipelines in R

Composing functions and sequencing operations are core programming concepts.

Some notable realizations of sequencing or pipelining operations include:

The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of each function’s arguments is primary, and the rest are parameters. For data science applications this is particularly common, so having convenient pipeline notation can be a plus. An example of a non-trivial data processing pipeline can be found here.
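To make the "one primary argument, the rest are parameters" idea concrete, here is a minimal sketch in Python (not R, and not the wrapr operator discussed in the post). The helper names `scale`, `clip`, and `pipeline` are hypothetical and exist only for illustration: each step's parameters are fixed ahead of time, leaving a single-argument function that can be chained.

```python
from functools import partial, reduce


def scale(values, factor):
    """Multiply every value by factor; values is the primary argument."""
    return [v * factor for v in values]


def clip(values, lower, upper):
    """Limit every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def pipeline(data, *steps):
    """Apply each single-argument step to the result of the previous step."""
    return reduce(lambda acc, step: step(acc), steps, data)


# Fix the parameters with partial(); only the data flows through the steps.
result = pipeline(
    [1, 5, 12],
    partial(scale, factor=2),
    partial(clip, lower=0, upper=20),
)
print(result)  # [2, 10, 20]
```

Binding parameters first and passing only the data between steps is what lets a pipeline read top to bottom as a sequence of transforms.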

In this note we will discuss the advanced R pipeline operator "dot arrow pipe" and an S4 class (wrapr::UnaryFn) that makes working with pipeline notation much more powerful and much easier.


Categories: data science, Exciting Techniques, Statistics, Tutorials

Fully General Record Transforms with cdata

One of the design goals of the cdata R package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact, the goal is to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data into arbitrary record shapes (please see here and here for the concepts).

But as with all general ideas, it is much easier to see what we mean by the above with a concrete example.
