R is a powerful data science language because, like Matlab, NumPy, and pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values.
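For instance, a minimal sketch of vectorized arithmetic in R (the vector and threshold here are invented for illustration):

```r
# Apply a 10% discount to every price at once -- no explicit loop needed
prices <- c(10.00, 12.50, 9.99, 20.00)
discounted <- prices * 0.9
# Vectorized comparison: which discounted prices still exceed 10?
over_10 <- discounted > 10
```

Both the arithmetic and the comparison apply to the whole vector in a single expression.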
Of course, sometimes it takes a while to figure out how to do this. Please read on for a great R matrix lookup problem and solution.
Continue reading R Tip: How To Look Up Matrix Values Quickly
wrapr 2.0.0 is now up on CRAN.
This means the := variant of unpack is now easy to install.
Please give it a try!
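As a sketch of what the new notation enables (assuming wrapr 2.0.0 is installed), the := form lets unpack assign several values in a single statement:

```r
library(wrapr)

# Unpack a named list into two variables in one assignment
unpack[a, b] := list(a = 1, b = 2)
# a is now 1, and b is now 2
```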
I would like to re-share the vtreat (R version, Python version) data preparation documentation for machine learning tasks.
vtreat is a system for preparing messy real world data for predictive modeling tasks (classification, regression, and so on). In particular it is very good at re-coding high-cardinality string-valued (or categorical) variables for later use.
Continue reading Re-Share: vtreat Data Preparation Documentation and Video
For data science projects I recommend using source control or version control, and committing changes at a very fine level of granularity. This means checking in possibly broken code with possibly weak commit messages (so, when working in a shared project, you may want a private branch or a second source control repository).
Please read on for our justification.
Continue reading Version Control is a Time Machine That Translates Common Hindsight Into Valuable Foresight
For all our remote learners, we are sharing a free coupon code for our R video course Introduction to Data Science. The code is ITDS2020 and can be used at https://www.udemy.com/course/introduction-to-data-science/?couponCode=ITDS2020. Please check it out and share it!
Here is a small quote from Practical Data Science with R Chapter 1.
It is often too much to ask for the data scientist to become a domain expert. However, in all cases the data scientist must develop strong domain empathy to help define and solve the right problems.
Interested? Please check it out.
A big thank you to Dmytro Perepolkin for sharing a “Keep Calm and Use vtreat” poster!
Also, we have translated the Python vtreat steps from our recent “Cross-Methods are a Leak/Variance Trade-Off” article into R vtreat steps here.
This R port demonstrates the fit/prepare notation, which is new to R!
We want vtreat to be a platform-agnostic (works in R, works in Python, works elsewhere), well-documented standard methodology.
To this end: Nina and I have re-organized the basic vtreat use documentation as follows:
- Regression:
R regression example, fit/prepare
R regression example, design/prepare/experiment
- Classification:
R classification example, fit/prepare
R classification example, design/prepare/experiment
- Unsupervised tasks:
R unsupervised example, fit/prepare
R unsupervised example, design/prepare/experiment
- Multinomial classification:
R multinomial classification
R multinomial classification example, design/prepare/experiment
Python multinomial classification
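As a minimal sketch of the classic design/prepare pattern in R (the data frame here is invented for illustration):

```r
library(vtreat)

# Small example frame: a categorical variable with a missing value, and a numeric outcome
d <- data.frame(x = c("a", "a", "b", "b", "c", NA),
                y = c(1, 2, 3, 4, 5, 6),
                stringsAsFactors = FALSE)

# design: learn a treatment plan relating x to the numeric outcome y
treatments <- designTreatmentsN(d, varlist = "x", outcomename = "y")

# prepare: apply the plan to produce an all-numeric, model-ready frame
d_treated <- prepare(treatments, d)
```

The fit/prepare notation organizes the same two steps (learn a transform, then apply it) in the fit/transform style familiar from Python's scikit-learn.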
We have a new Win Vector data science article to share:
Cross-Methods are a Leak/Variance Trade-Off
John Mount (Win Vector LLC), Nina Zumel (Win Vector LLC)
March 10, 2020
We work through some exciting examples of when cross-methods (cross-validation, and also cross-frames) work, and when they do not.
Cross-methods such as cross-validation and cross-prediction are effective tools for many machine learning, statistics, and data science applications. They are useful for parameter selection, model selection, impact/target encoding of high-cardinality variables, stacking models, and super learning. They are more statistically efficient than partitioning training data into calibration/training/holdout sets, but do not satisfy the full exchangeability conditions that full hold-out methods have. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.
Specifically, cross-methods can introduce an information leak into the modeling process. This information leak will be the subject of this post.
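To make the contrast concrete, here is a hedged base-R sketch (the data and fold count are invented for illustration). Naive impact coding encodes each row of a categorical variable with a per-level outcome mean computed from all rows, including the row itself, so each row "sees" its own outcome. The cross-frame version encodes each row using only the means from the other folds:

```r
set.seed(2020)
d <- data.frame(x = sample(letters[1:5], 100, replace = TRUE),
                y = rnorm(100),
                stringsAsFactors = FALSE)

# Naive impact coding: per-level mean of y over ALL rows (leaks the row's own outcome)
level_means <- tapply(d$y, d$x, mean)
d$x_naive <- as.numeric(level_means[d$x])

# Cross-frame impact coding: each row is encoded using means from the OTHER folds only
k <- 5
fold <- sample(rep(1:k, length.out = nrow(d)))
d$x_cross <- NA_real_
for (i in 1:k) {
  m <- tapply(d$y[fold != i], d$x[fold != i], mean)
  d$x_cross[fold == i] <- as.numeric(m[d$x[fold == i]])
}
```

The cross-frame encoding removes the direct leak of each row's own outcome, at the cost of extra variance from the fold-wise estimates; a rare level can even be absent from a fold's training rows, producing an NA.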
The entire article is a JupyterLab notebook, and can be found here. Please check it out, and share it with your favorite statisticians, machine learning researchers, and data scientists.
A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model has changed? This client, like many others who have faced the same problem, simply checked whether the mean and standard deviation of the data had changed more than some amount, where the threshold value they checked against was selected in a more or less ad-hoc manner. But they were curious whether there was some other, perhaps more principled, way to check for a change in distribution.
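One principled alternative is a permutation (resampling) test. As a sketch (the function name and data below are invented for illustration): pool the old and new samples, repeatedly re-split the pool at random, and see how often a random split produces a mean difference as large as the one actually observed:

```r
# Permutation test for a shift in means between two samples
perm_test_mean_shift <- function(old, new, n_perm = 2000) {
  observed <- abs(mean(new) - mean(old))
  pooled <- c(old, new)
  n_old <- length(old)
  stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n_old)         # random re-split of the pooled data
    abs(mean(pooled[-idx]) - mean(pooled[idx]))  # mean difference under the null
  })
  mean(stats >= observed)  # approximate p-value
}

set.seed(42)
p_same  <- perm_test_mean_shift(rnorm(200), rnorm(200))       # draws from the same distribution
p_shift <- perm_test_mean_shift(rnorm(200), rnorm(200, 0.5))  # draws with a shifted mean
```

A small p-value suggests the new data is unlikely to be a random re-shuffling of the old, without requiring an ad-hoc threshold on the raw statistics.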
Continue reading Monitoring for Changes in Distribution with Resampling Tests