I am excited to announce that `vtreat` is now available for `Python` on PyPI, in addition to `R` on CRAN.

# Category: Opinion

## Florence Nightingale, Data Scientist


In 1858 Florence Nightingale published her now famous “rose diagram” breaking down causes of mortality.

Image by Florence Nightingale (1820–1910), http://www.royal.gov.uk/output/Page3943.asp (dead link), public domain.

For more, please see here.

## Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref).

The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations.

The tide calculating machine embodied ideas of Sir Isaac Newton, and Pierre-Simon Laplace (ref), and could predict tide driven water levels by the means of wheels and gears.

The question is: can modern data science tools quickly forecast tides to similar accuracy?

## PyCharm Video Review

My basic video review of the PyCharm integrated development environment for Python with Anaconda and Jupyter/iPython integration. I like the IDE extensions enough to pay for them early in my evaluation. Highly recommended for data science projects; at least try the open-source edition or one of the trial versions.

## Some Notes on GNU Licenses in R Packages

I was recently asked if Win-Vector LLC would move the R wrapr package from a GPL-3 license to an LGPL license. In the end I decided to move wrapr distribution to a “GPL-2 | GPL-3” license. This means the package is now available under both GPL-2 and GPL-3 licensing, allowing the user to pick which of these two licenses they wish to accept the software under. I decided to stick to “GPL-*” style licensing as I endorse the values underlying these licenses, and in my (not-legal advice, I am not a lawyer!) *opinion* this is the licensing pattern closest to the license R itself is distributed under (and hence the closest to the values of the core R community).

Please read on for some background issues I found (not-legal advice, I am not a lawyer!).

## A Comment on Data Science Integrated Development Environments

A point that differs from our experience struck us in the recent note regarding doing data science in Python:

A development environment [for Python] specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist.

“What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA” Amit Ghosh

Actually, Python has a large number of very capable integrated development environments, some of which are specifically tailored for data science. Please read on for a small list of tools, and my recommendations for a specific data science in Python toolchain.


## A Kind Note That We Really Appreciate

The following really made my day.

I tell every data scientist I know about vtreat and urge them to read the paper.

Jason Wolosonovich

Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this).

For those interested the R version of vtreat can be found here, the paper can be found here, and the in-development Python/Pandas version of vtreat can be found (with examples) here.

Chapter 8 of Zumel, Mount, *Practical Data Science with R, 2nd Edition*, Manning 2019 has a more operational discussion of vtreat (which itself uses concepts developed in chapter 4).

## R Books Discount!

We, the community of Manning R and data science authors, have talked Manning into offering a catalog-wide 40% discount on all books. Please take a look at some great deals on some great technical books here: http://mng.bz/adRj !

## Programming Over lm() in R

Here is a simple modeling problem in `R`.

We want to fit a linear model where the names of the data columns carrying the outcome to predict (`y`), the explanatory variables (`x1`, `x2`), and per-example row weights (`wt`) are given to us as string values in variables.
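As a minimal sketch of the problem setup, here is one possible base-`R` approach; it is not necessarily the approach the full article goes on to recommend. The data frame `d` and the variable names `outcome_name`, `explanatory_vars`, and `weight_name` are hypothetical, introduced only for illustration. The sketch builds the formula with `reformulate()` and uses `bquote()` to substitute the weight column's name into the `lm()` call before evaluating it.

```r
# Hypothetical example data (made up for illustration only).
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 4, 3, 6),
  y  = c(3.1, 2.9, 7.2, 6.8, 11.1),
  wt = c(1, 1, 2, 2, 1)
)

# Column names arrive as string values in variables (hypothetical names).
outcome_name <- "y"
explanatory_vars <- c("x1", "x2")
weight_name <- "wt"

# Build the formula y ~ x1 + x2 from the strings.
f <- reformulate(explanatory_vars, response = outcome_name)

# lm() captures its weights argument unevaluated, so substitute the
# column name in as a symbol, then evaluate the completed call.
model <- eval(bquote(
  lm(.(f), data = d, weights = .(as.name(weight_name)))
))

print(coef(model))
```

Because the names are substituted before evaluation, the call recorded in `model$call` shows the actual formula and weight column rather than the placeholder variables.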

## My Favorite data.table Feature

My favorite R data.table feature is the “`by`” grouping notation when combined with the `:=` notation.

Let’s take a look at this powerful notation.
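As a small sketch of what this combination looks like (the table `dt` and its columns `group` and `value` are hypothetical, chosen only for illustration): `:=` adds or updates a column in place, and `by` makes the right-hand side compute once per group, with each group's result written back to that group's rows.

```r
library(data.table)

# Hypothetical example data (made up for illustration only).
dt <- data.table(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Add a per-group mean as a new column; row count and order are unchanged.
dt[, group_mean := mean(value), by = group]

# Add each row's share of its group's total.
dt[, frac_of_group := value / sum(value), by = group]

print(dt)
```

The appeal of this form is that grouped summaries land directly on the original rows, with no separate aggregate-then-join step.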