Here is an example of how easy it is to use cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!
id xa xb ya yb
1 1 3 6 8
2 2 4 7 9
How can I get to:
id t x y
1 a 1 6
1 b 3 8
2 a 2 7
2 b 4 9
In Stata it's:
. reshape long x y, i(id) j(t) string
In R, it's:
. an hour of cursing followed by a desperate tweet 👆
Thanks for any help!
PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform into small steps makes the process very easy.
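As a rough sketch of what this can look like (not necessarily the exact code from the full post), one can write the desired block layout down as a small control table and hand it to cdata::rowrecs_to_blocks(). The data frame below is the table from the tweet; the variable names d, control_table, and res are our own choices for illustration, and the example assumes the cdata package is installed.

```r
# Sketch only: assumes the cdata package is installed.
library(cdata)

# The wide table from the tweet (first column named id).
d <- data.frame(
  id = c(1, 2),
  xa = c(1, 2),
  xb = c(3, 4),
  ya = c(6, 7),
  yb = c(8, 9)
)

# Control table: a picture of one output block.
# The first column (t) holds the new key values; the remaining cells
# name the wide columns that supply x and y for each key.
control_table <- data.frame(
  t = c("a", "b"),
  x = c("xa", "xb"),
  y = c("ya", "yb"),
  stringsAsFactors = FALSE
)

# Move each row record into a two-row block, copying id along.
res <- rowrecs_to_blocks(
  d,
  controlTable = control_table,
  columnsToCopy = "id"
)

# We do not rely on the returned row order, so sort explicitly.
res <- res[order(res$id, res$t), , drop = FALSE]
rownames(res) <- NULL
print(res)
```

The control table is literally a drawing of the desired result for one id, which is what makes the transform easy to reason about.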
Continue reading Controlling Data Layout With cdata
A good friend shared with us a great picture of Practical Data Science with R, 1st Edition hanging out in Cambridge at the MIT Press Bookstore.
This is as good an excuse as any to share a book update.
Continue reading Practical Data Science with R Book Update
Starting With Data Science
A rigorous hands-on introduction to data science for software engineers.
Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. This is designed as an interactive in-person (not remote or video) course.
Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers
We have two new chapters of Practical Data Science with R, Second Edition online and available for review!
The newly available chapters cover:
Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and more.
Choosing and Evaluating Models – The chapter starts with exploring machine learning approaches and then moves to studying key model evaluation topics like mapping business problems to machine learning tasks, evaluating model quality, and how to explain model predictions.
If you haven’t signed up for our book’s MEAP (Manning Early Access Program), we encourage you to do so. The MEAP includes a free copy of Practical Data Science with R, First Edition, as well as early access to chapter drafts of the second edition as we complete them.
For those of you who have already subscribed — thank you! We hope you enjoy the new chapters, and we look forward to your feedback.
Please help share our news and this discount.
The second edition of our best-selling book Practical Data Science with R (Zumel, Mount) is featured as deal of the day at Manning.
The second edition isn’t finished yet, but chapters 1 through 4 are available in the Manning Early Access Program (MEAP), and we have finished chapters 5 and 6 which are now in production at Manning (so they should be available soon). The authors are hard at work on chapters 7 and 8 right now.
The discount gets you half off. Also, the second edition comes with a free e-copy of the first edition (so you can jump ahead).
Here are the details in Tweetable form:
Deal of the Day January 13: Half off Practical Data Science with R, Second Edition. Use code dotd011319au at http://bit.ly/2SKAxe9.
vtreat’s purpose is to produce pure numeric data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded by effects codes or impact codes).
In this note we will discuss a small aspect of the vtreat package: variable screening.
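To make the idea concrete, here is a minimal sketch of variable screening on a regression problem, assuming the vtreat package is installed. The data is synthetic and invented for this illustration (x_signal drives the outcome, x_noise does not), and the 0.05 pruning threshold is an arbitrary choice, not a recommendation.

```r
# Sketch only: assumes the vtreat package is installed.
library(vtreat)

set.seed(2018)
n <- 200

# Synthetic data: one useful numeric variable, one pure-noise
# variable, and a categorical variable with some missing values.
d <- data.frame(
  x_signal = rnorm(n),
  x_noise = rnorm(n),
  x_cat = sample(c("a", "b", "c", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- 2 * d$x_signal + rnorm(n)
d$x_signal[sample.int(n, 10)] <- NA  # inject missing values

# Design the treatment plan on one part of the data, apply it to another.
d_train <- d[1:150, , drop = FALSE]
d_test <- d[151:n, , drop = FALSE]

treatments <- designTreatmentsN(
  d_train,
  varlist = c("x_signal", "x_noise", "x_cat"),
  outcomename = "y",
  verbose = FALSE
)

# The score frame carries a per-variable significance estimate (sig)
# that can be used for screening.
score_frame <- treatments$scoreFrame
print(score_frame[, c("varName", "rsq", "sig")])

# Keep only derived variables whose significance passes the threshold.
d_test_treated <- prepare(treatments, d_test, pruneSig = 0.05)
print(colnames(d_test_treated))
```

The treated frame is purely numeric with no missing values, and low-value derived columns have been screened out using the sig column.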
Continue reading vtreat Variable Importance
We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.
Authors: John Mount, and Nina Zumel 2018-10-25
As a followup to our previous post, this post goes a bit deeper into reasoning about data transforms using the cdata package. The cdata package demonstrates the "coordinatized data" theory and includes an implementation of the "fluid data" methodology for general data re-shaping.
cdata adheres to the so-called "Rule of Representation":
Fold knowledge into data, so program logic can be stupid and robust.
The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003
The design principle expressed by this rule is that it is much easier to reason about data than to try to reason about code, so using data to control your code is often a very good trade-off.
We showed in the last post how cdata takes a transform control table to specify how you want your data reshaped. The question then becomes: how do you come up with the transform control table?
Let’s discuss that using the example from the previous post: "plotting the iris data faceted".
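For orientation, here is a small sketch of what such a control table might look like for iris, assuming the cdata package is installed. The output column names (flower_part, Length, Width) and the added row id are our choices for this illustration; the point is that the table is just a drawing of the block we want each flower to become.

```r
# Sketch only: assumes the cdata package is installed.
library(cdata)

# Give each flower an explicit id so that the rows of a block
# can be tied back to the record they came from.
iris_wide <- iris
iris_wide$flower_id <- seq_len(nrow(iris_wide))

# Control table: one output row per flower part; the cells name the
# iris columns that supply Length and Width for that part.
control_table <- data.frame(
  flower_part = c("Petal", "Sepal"),
  Length = c("Petal.Length", "Sepal.Length"),
  Width = c("Petal.Width", "Sepal.Width"),
  stringsAsFactors = FALSE
)

# Each flower (row record) becomes a two-row block.
iris_long <- rowrecs_to_blocks(
  iris_wide,
  controlTable = control_table,
  columnsToCopy = c("flower_id", "Species")
)

print(head(iris_long))
```

With the data in this shape, flower_part is available as a faceting variable for a Length versus Width plot.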
Continue reading Designing Transforms for Data Reshaping with cdata
Let’s take a quick look at a very important and common experimental problem: checking if the difference in success rates of two Binomial experiments is statistically significant. This can arise in A/B testing situations such as online advertising, sales, and manufacturing.
We already share a free video course on a Bayesian treatment of planning and evaluating A/B tests (including a free Shiny application). Let’s now take a look at the should-be-simple task of building a summary statistic that includes a classic frequentist significance.
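As a baseline, here is a sketch of one standard way to get such a significance using only base R: fisher.test() and prop.test() from the stats package. This is a stand-in for illustration, not necessarily the method developed in the full post, and the counts are made up.

```r
# Made-up counts for two Binomial experiments (A and B).
a_successes <- 82
a_trials <- 1000
b_successes <- 75
b_trials <- 1200

# 2x2 table of successes and failures for each experiment.
counts <- matrix(
  c(a_successes, a_trials - a_successes,
    b_successes, b_trials - b_successes),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("A", "B"), c("success", "failure"))
)

# Fisher's exact test on the 2x2 table.
fisher_res <- fisher.test(counts)

# Two-sample test of equal proportions (chi-squared approximation).
prop_res <- prop.test(
  x = c(a_successes, b_successes),
  n = c(a_trials, b_trials)
)

cat("observed rates:", a_successes / a_trials, b_successes / b_trials, "\n")
cat("Fisher exact p-value:", format(fisher_res$p.value), "\n")
cat("prop.test p-value:   ", format(prop_res$p.value), "\n")
```

Both tests ask the same question: how surprising would the observed difference in rates be if A and B shared one underlying success rate?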
Continue reading Quick Significance Calculations for A/B Tests in R
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
vtreat can now also effectively prepare data for multi-class classification or multinomial modeling.
Continue reading Modeling multi-category Outcomes With vtreat