
What is vtreat?

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”): the quantity to be predicted, which must not have missing values. The other input columns are possible explanatory variables that the user later wants to use to predict “y”; these are typically numeric or categorical/string-valued, and may have missing values. In practice such an input DataFrame may not be immediately suitable for machine learning procedures, which often expect only numeric explanatory variables and may not tolerate missing values.

To solve this, vtreat builds a transformed DataFrame in which all explanatory variable columns have been converted into a number of numeric explanatory variable columns without missing values. The derived numeric columns capture most of the information relating the explanatory columns to the specified “y” (dependent/outcome) column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.
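To make the kinds of transforms named above concrete, here is a minimal, dependency-free sketch. It is not vtreat's implementation or API: the function names, the derived column names, and the toy data are all hypothetical, and the impact code is the naive version (level mean of “y” minus the grand mean).

```python
from collections import defaultdict
from statistics import mean

# Toy data (hypothetical): one numeric and one categorical explanatory
# variable, both with missing values, and a complete outcome column "y".
rows = [
    {"x_num": 1.0, "x_cat": "a", "y": 1.0},
    {"x_num": None, "x_cat": "b", "y": 0.0},
    {"x_num": 3.0, "x_cat": "a", "y": 1.0},
    {"x_num": 5.0, "x_cat": None, "y": 0.0},
]


def fit_treatment(rows, outcome="y"):
    """Learn the statistics needed to treat rows (illustrative only)."""
    grand_mean = mean(r[outcome] for r in rows)
    # Value used to fill in missing numeric entries.
    num_fill = mean(r["x_num"] for r in rows if r["x_num"] is not None)
    # Naive impact code: mean outcome per level, minus the grand mean.
    # Missing (None) is treated as just another level.
    by_level = defaultdict(list)
    for r in rows:
        by_level[r["x_cat"]].append(r[outcome])
    impact = {lev: mean(ys) - grand_mean for lev, ys in by_level.items()}
    return {"num_fill": num_fill, "impact": impact}


def transform(rows, plan):
    """Produce all-numeric explanatory columns with no missing values."""
    treated = []
    for r in rows:
        missing = r["x_num"] is None
        treated.append({
            "x_num": plan["num_fill"] if missing else r["x_num"],
            # Indicator recording that the numeric value was missing.
            "x_num_is_bad": 1.0 if missing else 0.0,
            # Indicator (one-hot) column for the level "a".
            "x_cat_lev_a": 1.0 if r["x_cat"] == "a" else 0.0,
            # Levels not seen during fitting get a neutral code of 0.
            "x_cat_impact": plan["impact"].get(r["x_cat"], 0.0),
        })
    return treated


plan = fit_treatment(rows)
treated = transform(rows, plan)
```

One caveat about this sketch: it computes the impact code on the same rows it then transforms, so the derived column has already seen each row's “y”. A statistically careful procedure has to guard against that, for example by estimating the codes on held-out data.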

The idea is: using vtreat, you can take a DataFrame of messy real-world data and easily, faithfully, reliably, and repeatably prepare it for machine learning with documented methods. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.

Worked examples can be found here.

For more detail please see here: arXiv:1611.09477 stat.AP (the documentation describes the R version; however, all of the examples can be found worked in Python here).

vtreat is available as a Python/Pandas package, and also as an R package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)

Some operational examples can be found here.


Speaking at BARUG

We will be speaking at the Tuesday, September 3, 2019 BARUG meeting. If you are in the Bay Area, please come see us.

Nina Zumel & John Mount
Practical Data Science with R

Practical Data Science with R (Zumel and Mount) was one of the first, and most widely read, books on the practice of doing data science using R. We have been working hard on an improved and revised 2nd edition of our book (coming out this fall). The new edition reflects more experience with data science, with teaching, and with R itself. We will talk about what direction we think the R community has been taking, how this affected the book, and what is new in the upcoming edition.


Returning to Tides

Fred Viole shared a great “data only” R solution to the forecasting tides problem.


The methodology comes from a finance perspective, and has some great associated notes and articles.

This gives me a chance to comment on the odd relation between prediction and profit in finance.



Florence Nightingale, Data Scientist


In 1858 Florence Nightingale published her now famous “rose diagram” breaking down causes of mortality.

Figure: Nightingale’s “rose diagram” of causes of mortality. By Florence Nightingale (1820–1910), http://www.royal.gov.uk/output/Page3943.asp [dead link], Public Domain.

For more, please see here.


Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref).


Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876

The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations.

The tide calculating machine embodied ideas of Sir Isaac Newton and Pierre-Simon Laplace (ref), and could predict tide-driven water levels by means of wheels and gears.

The question is: can modern data science tools quickly forecast tides to similar accuracy?
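As a point of reference for that question, mechanical tide predictors of this kind work by summing periodic components. A minimal sketch of such a sum of cosines follows. The two angular speeds are the standard values for the M2 (principal lunar semidiurnal) and S2 (principal solar semidiurnal) constituents; the mean level, amplitudes, and phases are hypothetical numbers chosen for illustration, not values from the post or from any real tide station.

```python
import math

# Angular speeds in degrees per hour for two standard tidal constituents.
M2_SPEED = 28.9841042  # principal lunar semidiurnal
S2_SPEED = 30.0        # principal solar semidiurnal


def tide_height(t_hours, mean_level, constituents):
    """Return mean_level + sum of amplitude * cos(speed * t - phase)."""
    total = mean_level
    for amplitude, speed_deg_per_hour, phase_deg in constituents:
        angle = math.radians(speed_deg_per_hour * t_hours - phase_deg)
        total += amplitude * math.cos(angle)
    return total


# Hypothetical (amplitude in meters, speed, phase in degrees) triples.
example_constituents = [
    (1.0, M2_SPEED, 0.0),
    (0.3, S2_SPEED, 0.0),
]

# Predicted water level for each hour of one day, hypothetical mean of 2.0 m.
heights = [tide_height(t, 2.0, example_constituents) for t in range(25)]
```

In a data-driven setting the amplitudes and phases would be estimated from observed water levels rather than set by hand.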



PyCharm Video Review

My basic video review of the PyCharm integrated development environment for Python with Anaconda and Jupyter/IPython integration. I like the IDE extensions enough that I paid for them early in my evaluation. Highly recommended for data science projects; at least try one of the open-source or trial versions.


Some Notes on GNU Licenses in R Packages

I was recently asked if Win-Vector LLC would move the R wrapr package from a GPL-3 license to an LGPL license. In the end I decided to move wrapr distribution to a “GPL-2 | GPL-3” license. This means the package is now available under both GPL-2 and GPL-3 licensing, allowing the user to pick which of these two licenses they wish to accept the software under. I decided to stick to “GPL-*” style licensing because I endorse the values underlying these licenses, and because in my opinion (not legal advice, I am not a lawyer!) this is the licensing pattern closest to the license R itself is distributed under (and hence the closest to the values of the core R community).

Please read on for some background issues I found (not legal advice, I am not a lawyer!).



A Comment on Data Science Integrated Development Environments

One point in a recent note on doing data science in Python struck us, as it differs from our experience:

A development environment [for Python] specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist.

“What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA” Amit Ghosh

Actually, Python has a large number of very capable integrated development environments, some of which are specifically tailored for data science. Please read on for a small list of tools, and my recommendations for a specific toolchain for data science in Python.



A Kind Note That We Really Appreciate

The following really made my day.

I tell every data scientist I know about vtreat and urge them to read the paper.
Jason Wolosonovich

Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this).

For those interested, the R version of vtreat can be found here, the paper can be found here, and the in-development Python/Pandas version of vtreat can be found (with examples) here.

Chapter 8 of Zumel, Mount, Practical Data Science with R, 2nd Edition, Manning 2019 has a more operational discussion of vtreat (which itself uses concepts developed in chapter 4).