Introduction to Data Science in R, Free for 3 days

To celebrate the new year and the recent release of Practical Data Science with R, 2nd Edition, we are offering a free coupon for our video course “Introduction to Data Science.”

The following URL and code should get you permanent free access to the video course, if used between now and January 1st, 2020:

https://www.udemy.com/course/introduction-to-data-science/ code: PDSWR2

PyData Los Angeles 2019 talk: Preparing Messy Real World Data for Supervised Machine Learning

Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk we describe how to use vtreat, a package available in R and in Python, to correctly re-code real world data for supervised machine learning tasks.

Please check it out.

(Slides are also here.)

What is a Second Edition?

What is a second edition of a book to its authors?

In some sense it is the book the authors thought they were writing the first time.

Why to try Practical Data Science with R, 2nd Edition

I thought we would try to express why somebody interested in using the R language (and package ecosystem) for supervised machine learning, data wrangling, analytics projects, and other data science topics should give Practical Data Science with R, 2nd Edition a try.

Nina Zumel and I shared the book with two incredible data scientists (Jeremy Howard and Rachel Thomas), and they helped answer the question by writing the following foreword to Practical Data Science with R, 2nd Edition:

Practical Data Science with R, Second Edition, is a hands-on guide to data science, with a focus on techniques for working with structured or tabular data, using the R language and statistical packages. The book emphasizes machine learning, but is unique in the number of chapters it devotes to topics such as the role of the data scientist in projects, managing results, and even designing presentations. In addition to working out how to code up models, the book shares how to collaborate with diverse teams, how to translate business goals into metrics, and how to organize work and reports. If you want to learn how to use R to work as a data scientist, get this book.

We have known Nina Zumel and John Mount for a number of years. We have invited them to teach with us at Singularity University. They are two of the best data scientists we know. We regularly recommend their original research on cross-validation and impact coding (also called target encoding). In fact, chapter 8 of Practical Data Science with R teaches the theory of impact coding and uses it through the authors’ own R package: vtreat.

Practical Data Science with R takes the time to describe what data science is, and how a data scientist solves problems and explains their work. It includes careful descriptions of classic supervised learning methods, such as linear and logistic regression. We liked the survey style of the book and extensively worked examples using contest-winning methodologies and packages such as random forests and xgboost. The book is full of useful, shared experience and practical advice. We notice they even include our own trick of using random forest variable importance for initial variable screening.

Overall, this is a great book, and we highly recommend it.

Jeremy Howard and Rachel Thomas

About the foreword authors.

Jeremy Howard is an entrepreneur, business strategist, developer, and educator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible. He is also a faculty member at the University of San Francisco, and is chief scientist at doc.ai and platform.ai.

Previously, Jeremy was the founding CEO of Enlitic, which was the first company to apply deep learning to medicine, and was selected as one of the world’s top 50 smartest companies by MIT Tech Review two years running. He was the president and chief scientist of the data science platform Kaggle, where he was the top-ranked participant in international machine learning competitions two years running.

Rachel Thomas is director of the USF Center for Applied Data Ethics and cofounder of fast.ai, which has been featured in The Economist, MIT Tech Review, and Forbes. She was selected by Forbes as one of 20 Incredible Women in AI, earned her math PhD at Duke, and was an early engineer at Uber. Rachel is a popular writer and keynote speaker. In her TEDx talk, she shares what scares her about AI and why we need people from all backgrounds involved with AI.

Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning, 2019 is available from:

A Richer Category for Data Wrangling

I’ve been writing a lot about category theory interpretations of data-processing pipelines and some of the improvements we feel they are driving in both data_algebra and rquery/rqdatatable.

I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

Better SQL Generation via the data_algebra

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra.

New rquery Vignette: Working with Many Columns

We have a new rquery vignette here: Working with Many Columns.

This is an attempt to get back to writing about how to use the package to work with data (versus the other day’s discussion of package design/implementation).

Please check it out.

data_algebra/rquery as a Category Over Table Descriptions

Introduction

I would like to talk about some of the design principles underlying the data_algebra package (and its sibling rquery package).

The data_algebra package is a query generator that can act on either Pandas data frames or on SQL tables. This is discussed on the project site and in the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are things that, when done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
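To make the "one description, two targets" idea concrete, here is a minimal toy sketch. It does not use the data_algebra API: the `Filter` class and its `to_sql`/`apply` methods are hypothetical names invented for this illustration, a plain list of dicts stands in for a Pandas data frame, and an in-memory SQLite database stands in for a SQL back end.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Hypothetical operator: keep rows where `column` > `threshold`."""
    table: str
    column: str
    threshold: float

    def to_sql(self) -> str:
        # Render the operator as SQL text.
        # Toy only: identifiers and values are assumed trusted.
        return (
            f'SELECT * FROM "{self.table}" '
            f'WHERE "{self.column}" > {self.threshold!r}'
        )

    def apply(self, rows):
        # Run the same operator directly on in-memory rows.
        return [r for r in rows if r[self.column] > self.threshold]


rows = [{"x": 1, "y": 10.0}, {"x": 2, "y": 25.0}, {"x": 3, "y": 40.0}]
op = Filter(table="d", column="y", threshold=20.0)

# Target 1: in-memory data (stand-in for a data frame).
in_memory = op.apply(rows)

# Target 2: a SQL table holding the same data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (x INTEGER, y REAL)")
conn.executemany(
    "INSERT INTO d VALUES (?, ?)", [(r["x"], r["y"]) for r in rows]
)
cur = conn.execute(op.to_sql() + ' ORDER BY "x"')
names = [c[0] for c in cur.description]
from_sql = [dict(zip(names, rec)) for rec in cur.fetchall()]
conn.close()

print(in_memory == from_sql)
```

The point of the sketch is only that the operator is data describing a transformation, separate from any one execution engine, so the same description can be either executed locally or translated to SQL.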

What is new for rquery December 2019

Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python).

Let’s see what rquery is good at, and what new features are making rquery better.

Slides for PyData LA 2019 vtreat Talk

Slides for PyData LA 2019 vtreat Talk are here!