Posted on Categories Administrativia, data science, Exciting Techniques, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , Leave a comment on Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

Free R/datascience Extract: Evaluating a Classification Model with a Spam Filter

We are excited to share a free extract of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019: Evaluating a Classification Model with a Spam Filter.

Zumel eacmwasf 02

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

This teaching style has worked very well for us both in R and in Python (it is considered one of the merits of our LinkedIn AI Academy course design):

One of the best data science courses I’ve taken. The course focuses on model selection and evaluation which are usually underestimated. Thanks to John Mount, the teacher and the co-authors of Practical Data Science with R. hashtag#AI200

(Note: Nina Zumel, leads on the course design, which is the heavy lifting, John Mount just got tasked to be the one delivering it.)

Zumel, Mount, Practical Data Science with R, 2nd Edition is coming out in print very soon. Here is a discount code to help you get a good deal on the book:

Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.

Posted on Categories Administrativia, data science, OpinionTags , Leave a comment on AI for Engineers

AI for Engineers

For the last year we (Nina Zumel, and myself: John Mount) have had the honor of teaching the AI200 portion of LinkedIn’s AI Academy.

John Mount at LinkedIn

John Mount at the LinkedIn campus

Nina Zumel designed most of the material, and John Mount has been delivering it and bringing her feedback. We’ve just started our 9th cohort. We adjust the course each time. Our students teach us a lot about how one thinks about data science. We bring that forward to each round of the course.

Roughly the goal is the following.

If every engineer, product manager, and project manager had some hands-on experience with data science and AI (deep neural nets), then they are both more likely to think of using these techniques in their work and of introducing the instrumentation required to have useful data in the first place.

This will have huge downstream benefits for LinkedIn. Our group is thrilled to be a part of this.

We are looking for more companies that want an on-site data science intensive for their teams (either in Python or in R).

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine LearningTags , , , Leave a comment on vtreat Cross Validation

vtreat Cross Validation

Nina Zumel finished new documentation on how vtreat‘s cross validation works, which I want to share here.

vtreat is a system that makes data preparation for machine learning a “one-liner” (available in R or available in Python). We have a set of starting off points here. These documents describe what vtreat does for you, you just find the one that matches your task and you should have a good start for solving data science problems in R or in Python.

The latest documentation is a bit about how vtreat works, and how to control some of the details of the work it is doing for you.

The new documentation is:

Please give one of the examples a try, and consider adding vtreat to your data science workflow.

Posted on Categories Administrativia, data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , Leave a comment on New vtreat Documentation (Starting with Multinomial Classification)

New vtreat Documentation (Starting with Multinomial Classification)

Nina Zumel finished some great new documentation showing how to use Python vtreat to prepare data for multinomial classification mode. And I have finally finished porting the documentation to R vtreat. So we now have good introductions on how to use vtreat to prepare data for the common tasks of:

That is now 8 introductions to start with. To use vtreat you only have to work through one introduction (the one helping with the task you have at hand in the language you are using).

As I have said before:

  • vtreat helps with project blocking issues commonly seen in real world data: missing values, re-coding categorical variables, and dealing high cardinality categorical variables.
  • If you aren’t using a tool like vtreat in your data science projects: you are really missing out (and making more work for yourself).
Posted on Categories data science, Opinion, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , Leave a comment on How to Prepare Data

How to Prepare Data

Real world data can present a number of challenges to data science workflows. Even properly structured data (each interesting measurement already landed in distinct columns), can present problems, such as missing values and high cardinality categorical variables.

In this note we describe some great tools for working with such data.

Continue reading How to Prepare Data

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , Leave a comment on Preparing Data for Supervised Classification

Preparing Data for Supervised Classification

Nina Zumel has been polishing up new vtreat for Python documentation and tutorials. They are coming out so good that I find to be fair to the R community I must start to back-port this new documentation to vtreat for R.

Continue reading Preparing Data for Supervised Classification

Posted on Categories Practical Data Science, Statistics, TutorialsTags , , , , , Leave a comment on The Advantages of Record Transform Specifications

The Advantages of Record Transform Specifications

Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R.


Keras plot

I will use this example to show some of the advantages of cdata record transform specifications.

Continue reading The Advantages of Record Transform Specifications

Posted on Categories data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , , Leave a comment on Advanced Data Reshaping in Python and R

Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”).

The advantages of data_algebra and cdata are:

  • The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
  • The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
  • The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Continue reading Advanced Data Reshaping in Python and R

Posted on Categories Administrativia, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , Leave a comment on New Getting Started with vtreat Documentation

New Getting Started with vtreat Documentation

Win Vector LLC‘s Dr. Nina Zumel has just released some new vtreat documentation.

vtreat is a an all-in one step data preparation system that helps defend your machine learning algorithms from:

  • Missing values
  • Large cardinality categorical variables
  • Novel levels from categorical variables

I hoped she could get the Python vtreat documentation up to parity with the R vtreat documentation. But I think she really hit the ball out of the park, and went way past that.

The new documentation is 3 “getting started” guides. These guides deliberately overlap, so you don’t have to read them all. Just read the one suited to your problem and go.

The new guides:

Perhaps we can back-port the new guides to the R version at some point.

Posted on Categories Administrativia, data science, Opinion, Pragmatic Data Science, Pragmatic Machine Learning, TutorialsTags , , , , , , , , Leave a comment on Introducing data_algebra

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases.

In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable).

Continue reading Introducing data_algebra