We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas about vectors programmers find interesting, which concepts they consider safe starting points, and how to condense and present the material.
Please check the lectures out.
Starting With Data Science
A rigorous hands-on introduction to data science for software engineers.
Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. It is designed as an interactive in-person (not remote or video) course.
Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers
We have two new chapters of Practical Data Science with R, Second Edition online and available for review!
The newly available chapters cover:
Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.
Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.
If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.
For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.
vtreat’s purpose is to produce purely numeric data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded by effects codes or impact codes).
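To make the two re-encodings mentioned above concrete, here is a minimal Python sketch. It is not vtreat's API or implementation: the function names `treat_numeric` and `impact_code` are hypothetical, and the impact code shown is the naive, unsmoothed version (per-level mean outcome minus the grand mean), which would over-fit on rare levels if used as-is.

```python
from statistics import mean

def treat_numeric(values):
    """Fill missing entries (None) with the mean of the observed values
    and return an extra 0/1 indicator column marking what was missing."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    cleaned = [fill if v is None else v for v in values]
    is_missing = [1.0 if v is None else 0.0 for v in values]
    return cleaned, is_missing

def impact_code(levels, outcomes):
    """Naive impact code (illustration only): replace each categorical
    level by the mean outcome for that level minus the grand mean."""
    grand = mean(outcomes)
    by_level = {}
    for lvl, y in zip(levels, outcomes):
        by_level.setdefault(lvl, []).append(y)
    code = {lvl: mean(ys) - grand for lvl, ys in by_level.items()}
    return [code[lvl] for lvl in levels], code

cleaned, is_missing = treat_numeric([1.0, None, 3.0])
encoded, code_map = impact_code(["a", "a", "b", "b"], [1.0, 1.0, 0.0, 0.0])
```

After both steps every column is numeric and free of missing values, which is the "ready" condition described above.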
In this note we will discuss a small aspect of the vtreat package: variable screening.
Continue reading vtreat Variable Importance
This note is about removing the bias introduced by using sample standard deviation estimates to estimate the unknown true standard deviation of a population. We establish that there is a bias, explain why it is not important to remove it for reasonably sized samples, and (despite that) give a very complete bias management solution.
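As a rough illustration of the bias in question, the sketch below uses the classical correction factor for normally distributed data, where the expected value of the usual (n - 1 denominator) sample standard deviation is c4(n) times the true standard deviation. The normality assumption is ours and may not match the setting of the full note; the function names are hypothetical.

```python
import math
import statistics

def c4(n):
    """Correction factor for normal data: E[s] = c4(n) * sigma,
    where s is the sample standard deviation with n - 1 denominator."""
    if n < 2:
        raise ValueError("need at least two observations")
    # Use log-gamma so large n does not overflow.
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

def debiased_sd(xs):
    """Sample standard deviation divided by c4(n) (assumes normal data)."""
    return statistics.stdev(xs) / c4(len(xs))

# The factor moves toward 1 as n grows, so the correction shrinks.
factors = {n: c4(n) for n in (2, 10, 100)}
```

Because the factor approaches 1 as the sample size grows, the correction matters mostly for very small samples, which is consistent with the point made above.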
Continue reading How to de-Bias Standard Deviation Estimates
We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.
Authors: John Mount and Nina Zumel, 2018-10-25
As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.
cdata adheres to the so-called "Rule of Representation":
Fold knowledge into data, so program logic can be stupid and robust.
The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003
The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?
Let’s discuss that using the example from the previous post: "plotting the iris data faceted".
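The idea of letting a control table drive the reshaping can be sketched in plain Python. This is an illustration of the principle, not the cdata API: the table layout, the key column name `flower_part`, and the helpers `row_to_block` and `block_to_row` are assumptions made for this example.

```python
# Control table: each entry gives a key value plus, for every new
# column, the name of the source column in the wide "row record".
CONTROL = [
    {"flower_part": "Petal", "Length": "Petal.Length", "Width": "Petal.Width"},
    {"flower_part": "Sepal", "Length": "Sepal.Length", "Width": "Sepal.Width"},
]
KEY = "flower_part"

def row_to_block(row, control, key, copy_cols):
    """Turn one wide row record into a block of rows, one per control entry."""
    block = []
    for spec in control:
        out = {c: row[c] for c in copy_cols}  # columns carried along unchanged
        out[key] = spec[key]
        for new_col, src_col in spec.items():
            if new_col != key:
                out[new_col] = row[src_col]
        block.append(out)
    return block

def block_to_row(block, control, key, copy_cols):
    """Inverse transform: rebuild the wide row record from its block."""
    by_key = {b[key]: b for b in block}
    row = {c: block[0][c] for c in copy_cols}
    for spec in control:
        for new_col, src_col in spec.items():
            if new_col != key:
                row[src_col] = by_key[spec[key]][new_col]
    return row

iris_row = {"Sepal.Length": 5.1, "Sepal.Width": 3.5,
            "Petal.Length": 1.4, "Petal.Width": 0.2, "Species": "setosa"}
block = row_to_block(iris_row, CONTROL, KEY, ["Species"])
round_trip = block_to_row(block, CONTROL, KEY, ["Species"])
```

All of the knowledge about the reshaping lives in `CONTROL`; the two functions stay generic, and the same table drives both directions of the transform.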
Continue reading Designing Transforms for Data Reshaping with cdata
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
vtreat can now also effectively prepare data for multi-class classification or multinomial modeling.
Continue reading Modeling multi-category Outcomes With vtreat
rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on
rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the
rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by
Continue reading rqdatatable: rquery Powered by data.table
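To give a feel for what "piped Codd-style operators" means, here is a small Python sketch of relational operators composed into a pipeline that is defined separately from the data it runs on. The operator names and the `pipeline` helper are hypothetical stand-ins for illustration and are not the rquery or rqdatatable API.

```python
from functools import reduce

def select_rows(pred):
    """Relational selection: keep rows satisfying the predicate."""
    return lambda rows: [r for r in rows if pred(r)]

def extend(**new_cols):
    """Add derived columns computed from each row."""
    return lambda rows: [
        {**r, **{name: f(r) for name, f in new_cols.items()}} for r in rows
    ]

def select_columns(*cols):
    """Relational projection: keep only the named columns."""
    return lambda rows: [{c: r[c] for c in cols} for r in rows]

def pipeline(*steps):
    """Compose operators left to right into one reusable transform."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)

# The pipeline is specified once, independently of any particular data.
ops = pipeline(
    select_rows(lambda r: r["qty"] > 0),
    extend(total=lambda r: r["qty"] * r["price"]),
    select_columns("item", "total"),
)

data = [
    {"item": "a", "qty": 2, "price": 1.5},
    {"item": "b", "qty": 0, "price": 9.0},
]
result = ops(data)
```

Keeping the operator pipeline separate from the data is what makes it plausible to run one specification on different back ends, such as a database or an in-memory engine.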