Minimal Key Set is NP hard

It usually gives us a chuckle when we find that some natural and seemingly easy data science question is NP-hard. For instance, we have written that variable pruning is NP-hard when one insists on finding a minimal-sized set of variables (and also why there are no obvious methods for exact large permutation tests).

In this note we show that finding a minimal set of columns that form a primary key in a database is also NP-hard.
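
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is only an illustration of the question, not the construction used in the full note: the function names and the toy table are hypothetical, and the search tries column subsets in order of increasing size, so its running time grows exponentially with the number of columns.

```python
from itertools import combinations

def is_key(rows, cols):
    """Return True if the values in `cols` uniquely identify every row."""
    seen = set()
    for row in rows:
        projected = tuple(row[c] for c in cols)
        if projected in seen:
            return False
        seen.add(projected)
    return True

def minimal_key(rows, columns):
    """Exhaustively search for a smallest set of columns that forms a key.

    Tries every subset of size 0, 1, 2, ... and returns the first key found,
    or None when even the full column set has duplicate rows.
    """
    for size in range(len(columns) + 1):
        for cols in combinations(columns, size):
            if is_key(rows, cols):
                return cols
    return None

# Hypothetical toy table: no single column is unique, but a pair is.
rows = [
    {"first": "Ann", "last": "Lee", "city": "Oslo"},
    {"first": "Ann", "last": "Ray", "city": "Oslo"},
    {"first": "Bob", "last": "Lee", "city": "Rome"},
    {"first": "Bob", "last": "Ray", "city": "Oslo"},
]
key = minimal_key(rows, ["first", "last", "city"])
print(key)
```

Checking whether a given column set is a key is cheap (one pass over the rows). The expensive part is the number of candidate subsets, which is where the hardness shows up.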

Estimating Rates using Probability Theory: Chalk Talk

We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through estimating the sample size needed to reliably estimate event rates. The talk expands on basic calculations, and then moves on to the ideas of: Sample size and power for rare events.

Please check it out here.
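
As a rough illustration of the kind of calculation involved, here is a Python sketch using the textbook normal approximation to the binomial. This is an assumption on our part and not necessarily the derivation used in the talk; the function name and parameters are hypothetical. It estimates how many observations are needed so that the observed rate falls within a given relative error of the true rate `p` at a chosen confidence level.

```python
import math
from statistics import NormalDist

def sample_size_for_rate(p, rel_error, confidence=0.95):
    """Approximate sample size to estimate rate p within +/- rel_error * p.

    Uses the normal approximation: the standard error of the observed rate
    is sqrt(p * (1 - p) / n), and we require z * standard_error <= rel_error * p.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    return math.ceil(z * z * (1.0 - p) / (p * rel_error ** 2))

n_common = sample_size_for_rate(p=0.10, rel_error=0.10)
n_rare = sample_size_for_rate(p=0.01, rel_error=0.10)
print(n_common, n_rare)
```

The required sample size scales roughly as 1/p for small p, which is why rare events need so much more data for the same relative precision.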

Technical books are amazing opportunities

Nina and I have been sending out drafts of our book Practical Data Science with R 2nd Edition for technical review. A few of the reviews came back from reviewers who described themselves with variations of:

Senior Business Analyst for COMPANYNAME. I have been involved in presenting graphs of data for many years.

To us this reads as somebody with deep experience, confidence, and a bit of humility. They do something technical and valuable, but because they understand it they do not consider it to be arcane magic.

In this note we describe what can happen if such a person (or a junior version of such a person) acquires one or two technical books.

Practical Data Science with R, half off sale!

Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day.

  • Fri: Half off all eBooks
  • Sat: Half off all MEAPs
  • Sun: Half off all pBooks and liveVideos
  • Mon: Half off everything

The discount code is: wm052419au.

Many great opportunities to get Practical Data Science with R 2nd Edition at a discount!!!

Free Video Lecture: Vectors for Programmers and Data Scientists

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.

Please check the lectures out.

Timing Working With a Row or a Column from a data.frame

In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.

We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.

We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.
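
The study itself times R data.frames. As a loose analogy only, the Python sketch below times the same three tasks on a column-oriented table stored as a dictionary of lists. The layout, function names, and sizes are our own assumptions, scaled down so the example runs quickly. It does not reproduce the R measurements, but it does show why selecting a row from column-oriented storage has to touch every column.

```python
import timeit

# Hypothetical scaled-down table: 100 columns, 10,000 rows, stored by column.
n_rows, n_cols = 10_000, 100
table = {
    f"col_{j}": [i * n_cols + j for i in range(n_rows)]
    for j in range(n_cols)
}

def select_column(tbl, name):
    """Selecting a column is a single dictionary lookup."""
    return tbl[name]

def alter_column(tbl, name):
    """Return a new table with one column replaced; other columns are shared."""
    new_tbl = dict(tbl)  # shallow copy of the column map
    new_tbl[name] = [v + 1 for v in tbl[name]]
    return new_tbl

def select_row(tbl, i):
    """Selecting a row must visit every column."""
    return {name: values[i] for name, values in tbl.items()}

tasks = {
    "select column": lambda: select_column(table, "col_5"),
    "alter column": lambda: alter_column(table, "col_5"),
    "select row": lambda: select_row(table, 3),
}
for label, task in tasks.items():
    seconds = timeit.timeit(task, number=100)
    print(f"{label}: {seconds / 100:.2e} seconds per call")
```

Timings will vary by machine, so treat the printed numbers as relative comparisons rather than absolute costs.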

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC 2019-04-27

In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.
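
For readers unfamiliar with the term, a data layout transform moves the same values between a "wide" form (one row per record, one column per measurement) and a "long" form (one row per measurement). The Python sketch below illustrates the idea on a hypothetical toy table. It is not the cdata API, and the function and column names are our own assumptions.

```python
def to_long(wide_rows, id_col, value_cols, key_name="measure", value_name="value"):
    """Turn each wide row into one long row per measurement column."""
    return [
        {id_col: row[id_col], key_name: col, value_name: row[col]}
        for row in wide_rows
        for col in value_cols
    ]

def to_wide(long_rows, id_col, key_name="measure", value_name="value"):
    """Collect long rows sharing an id back into a single wide row."""
    wide = {}
    for row in long_rows:
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[row[key_name]] = row[value_name]
    return list(wide.values())

# Hypothetical example data: two records, two measurements each.
wide_table = [
    {"id": 1, "AUC": 0.6, "R2": 0.2},
    {"id": 2, "AUC": 0.7, "R2": 0.3},
]
long_table = to_long(wide_table, "id", ["AUC", "R2"])
round_trip = to_wide(long_table, "id")
print(long_table)
print(round_trip)
```

The two functions are inverses on this example: converting to long form and back recovers the original wide table.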

Practical Data Science with R Book Update (April 2019)

I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019.
