Categories: Opinion, Statistics, Tutorials

data.table is Much Better Than You Have Been Told

There is interest in converting relational query languages (that work both over SQL databases and on local data) into data.table commands, to take advantage of data.table's superior performance. Obviously, if one wants to use data.table, it is best to learn data.table. But if we want code that can run in multiple places, a translation layer may be in order.

In this note we look at how this translation is commonly done.

Continue reading data.table is Much Better Than You Have Been Told

Categories: Practical Data Science, Statistics, Statistics To English Translation, Tutorials

Cohen’s D for Experimental Planning

In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments.

Estimating sample size

Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. You’ll put a control group on the old plan, and a treatment group on the new plan, and after three months, you’ll measure how much weight the subjects lost, and see which plan does better on average.

The question is: how many subjects do you need to run a good experiment?

Continue reading Cohen’s D for Experimental Planning
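
To give a feel for the kind of calculation involved, here is a minimal sketch in Python (not taken from the article) of the common normal-approximation sample size for a two-sided, two-sample difference-of-means test, given a target Cohen's d. The function name `n_per_group` and the defaults (alpha = 0.05, power = 0.8) are illustrative assumptions; an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.8):
    """Approximate subjects needed per group to detect effect size d.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2.
    The name and defaults are assumptions for illustration only.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# A "medium" effect (d = 0.5) needs far fewer subjects than a "small" one (d = 0.2).
print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # 393
```

Because the required size scales as 1/d², halving the effect size you hope to detect roughly quadruples the number of subjects needed.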

Categories: Computer Science, Tutorials

Minimal Key Set is NP hard

It usually gives us a chuckle when we find that some natural and seemingly easy data science question is NP-hard. For instance, we have written that variable pruning is NP-hard when one insists on finding a minimal-sized set of variables (and also why there are no obvious methods for exact large permutation tests).

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.
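
To make the problem concrete, here is a small brute-force sketch in Python (an illustration, not the article's construction). It tries column subsets in order of increasing size and returns the first one whose values are unique across all rows. The function name `minimal_key` and the rows-as-dictionaries representation are assumptions. The search is exponential in the number of columns, which is consistent with the hardness claim above: no efficient general method is expected.

```python
from itertools import combinations


def minimal_key(rows, columns):
    """Return a smallest tuple of columns whose values identify every row.

    rows: list of dicts mapping column name -> hashable value.
    Returns None if even the full column set contains duplicate rows.
    """
    for size in range(len(columns) + 1):
        for subset in combinations(columns, size):
            seen = set()
            for row in rows:
                key = tuple(row[c] for c in subset)
                if key in seen:
                    break  # duplicate key value: subset is not a key
                seen.add(key)
            else:
                return subset  # no duplicates found: this subset is a key
    return None


rows = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "b": 1, "c": 2},
    {"a": 2, "b": 2, "c": 3},
]
print(minimal_key(rows, ["a", "b", "c"]))  # ('a', 'b')
```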

Continue reading Minimal Key Set is NP hard

Categories: data science, Statistics

Estimating Rates using Probability Theory: Chalk Talk

We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through the estimation of the sample size needed to reliably estimate event rates. The talk expands on basic calculations and then moves to the ideas of sample size and power for rare events.

Please check it out here.
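
As a rough illustration of this kind of sample-size calculation (a sketch under stated assumptions, not the derivation used in the talk), the Python snippet below uses the textbook normal approximation n = z² p(1 − p) / e² to estimate how many observations are needed so that a confidence interval for a rate p has half-width at most e. The function name `n_for_rate` and its parameters are hypothetical, and the normal approximation becomes less reliable for very rare events.

```python
import math
from statistics import NormalDist


def n_for_rate(p, half_width, confidence=0.95):
    """Approximate sample size to estimate a rate p to within +/- half_width.

    Uses the normal (Wald) approximation; treat the result as a planning
    estimate only, especially when p is close to 0 or 1.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)


# Estimating a 1% event rate to within +/- 0.2 percentage points
# requires on the order of ten thousand observations.
n = n_for_rate(0.01, 0.002)
print(n)
```

Note that the same absolute precision is cheaper for rarer events, but the same relative precision is more expensive, which is why rare-event rates call for large samples.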

Categories: Opinion, Statistics

Technical books are amazing opportunities

Nina and I have been sending out drafts of our book Practical Data Science with R 2nd Edition for technical review. A few of the reviews came back from reviewers who described themselves with variations of:

Senior Business Analyst for COMPANYNAME. I have been involved in presenting graphs of data for many years.

To us this reads as somebody with deep experience, confidence, and a bit of humility. They do something technical and valuable, but because they understand it they do not consider it to be arcane magic.

In this note we describe what can happen if such a person (or a junior version of such a person) acquires one or two technical books.

Continue reading Technical books are amazing opportunities

Categories: Administrativia, Pragmatic Data Science

Practical Data Science with R, half off sale!

Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day.

  • Fri: Half off all eBooks
  • Sat: Half off all MEAPs
  • Sun: Half off all pBooks and liveVideos
  • Mon: Half off everything

The discount code is: wm052419au.

Many great opportunities to get Practical Data Science with R 2nd Edition at a discount!!!

Categories: Mathematics, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, Tutorials

Free Video Lecture: Vectors for Programmers and Data Scientists

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.

Please check the lectures out.

Categories: Opinion, Pragmatic Data Science, Tutorials

Timing Working With a Row or a Column from a data.frame

In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.

We tend to work with medium-sized data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.

Continue reading Timing Working With a Row or a Column from a data.frame