Starting With Data Science
A rigorous hands-on introduction to data science for software engineers.
Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. It is designed as an interactive in-person (not remote or video) course.
Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers
The secret is out: Nina Zumel and I are busy working on Practical Data Science with R2, the second edition of our best-selling book on learning data science using the R language.
Our publisher, Manning, has a great slide deck describing the book (and a discount code!!!) here:
We also just got back our part-1 technical review for the new book. Here is a quote from the technical review we are particularly proud of:
The dot notation for base R and the dplyr package did make me stand up and think. Certain things suddenly made sense.
Continue reading Practical Data Science with R2
Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day.
Fri: Half off all eBooks
Sat: Half off all MEAPs
Sun: Half off all pBooks and liveVideos
Mon: Half off everything
The discount code is:
Many great opportunities to get Practical Data Science with R 2nd Edition at a discount!!!
We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.
Please check the lectures out.
In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames.
We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.
We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.
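As a rough illustration of the kind of measurement described above, here is a minimal Python sketch that times the three tasks (select a column, alter a column, select a row). It is not the study from the note, which concerns R: the table here is a hypothetical dict of lists, the helper names are made up for this example, and the sizes are kept far smaller than the medium-size data mentioned so that it runs quickly.

```python
import timeit

# Hypothetical column-oriented table: a dict mapping column names to lists.
# Sizes are illustrative only, far below "hundreds of columns, millions of rows".
N_ROWS, N_COLS = 10_000, 20
table = {f"c{j}": [float(i + j) for i in range(N_ROWS)] for j in range(N_COLS)}


def select_column(t, name):
    """Return one column (a reference, no copy is made)."""
    return t[name]


def alter_column(t, name):
    """Replace one column with a transformed copy of itself."""
    t[name] = [v + 1.0 for v in t[name]]
    return t


def select_row(t, i):
    """Assemble row i by visiting every column."""
    return {name: col[i] for name, col in t.items()}


def time_task(fn, *args, repeats=20):
    """Average seconds per call over `repeats` calls."""
    return timeit.timeit(lambda: fn(*args), number=repeats) / repeats


timings = {
    "select column": time_task(select_column, table, "c3"),
    "alter column": time_task(alter_column, table, "c3"),
    "select row": time_task(select_row, table, 5),
}
for task, seconds in timings.items():
    print(f"{task}: {seconds:.2e} s per call")
```

In this layout, selecting a column is a single lookup while selecting a row touches every column, which is the sort of asymmetry worth measuring before planning code around either access pattern.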
Continue reading Timing Working With a Row or a Column from a data.frame
I would like to write a bit on the meaning and history of the phrase “tidy data.”
Continue reading What is “Tidy Data”?
Also, Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019 is now content complete! It is deep into editing and soon into production!
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27
In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr:demo/ (current when we shared this note on 2019-04-27, removed 2019-04-28), and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, so they are a good set of examples or exercises to work through.
Continue reading Data Layout Exercises
Here is an example of how easy it is to use cdata to re-layout your data.
Tim Morris recently tweeted the following problem (corrected).
Please will you take pity on me #rstats folks?
I only want to reshape two variables x & y from wide to long!
id xa xb ya yb
1 1 3 6 8
2 2 4 7 9
How can I get to:
id t x y
1 a 1 6
1 b 3 8
2 a 2 7
2 b 4 9
In Stata it's:
. reshape long x y, i(id) j(t) string
In R, it's:
. an hour of cursing followed by a desperate tweet 👆
Thanks for any help!
PS – I can make reshape() or gather() work when I have just x or just y.
This is not to make fun of Tim Morris: the above should be easy. Using diagrams and breaking the data transform into small steps makes the process very easy.
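To make the target of the transform concrete, here is a small plain-Python sketch of the same wide-to-long reshape taken one step at a time. This is not cdata and not the solution from the full post; `wide_to_long` and its parameters are hypothetical names for this illustration, and it assumes every wide column is named as a measure (`x`, `y`) followed by a key (`a`, `b`).

```python
# Wide records from the tweet: one row per id, with columns xa, xb, ya, yb.
wide = [
    {"id": 1, "xa": 1, "xb": 3, "ya": 6, "yb": 8},
    {"id": 2, "xa": 2, "xb": 4, "ya": 7, "yb": 9},
]


def wide_to_long(rows, id_col, measures, keys, key_col):
    """Emit one output row per (input row, key).

    Assumes each wide column name is measure + key, e.g. "x" + "a" == "xa".
    """
    long_rows = []
    for row in rows:
        for key in keys:
            # Start each long row with the id and the key value.
            out = {id_col: row[id_col], key_col: key}
            # Copy each measure from its key-specific wide column.
            for measure in measures:
                out[measure] = row[measure + key]
            long_rows.append(out)
    return long_rows


long_rows = wide_to_long(
    wide, id_col="id", measures=["x", "y"], keys=["a", "b"], key_col="t"
)
for r in long_rows:
    print(r)
```

Each wide row becomes one long row per key, so the two input rows yield the four `id t x y` rows asked for in the tweet.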
Continue reading Controlling Data Layout With cdata
What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called “method chaining” since the days of Smalltalk (~1972). Let’s take a look at method chaining in Python, in terms of pipe notation.
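As a minimal sketch of the correspondence, the following Python compares a method chain with a pipe-style chain of plain functions. The `Pipe` and `Numbers` classes and the helper functions are hypothetical names invented for this illustration (they are not from the full post or from any library), and using `|` as the pipe operator is simply one possible choice.

```python
class Numbers:
    """Tiny fluent wrapper: each method returns a new Numbers, so calls chain."""

    def __init__(self, items):
        self.items = list(items)

    def keep_even(self):
        return Numbers(x for x in self.items if x % 2 == 0)

    def square(self):
        return Numbers(x * x for x in self.items)

    def total(self):
        return sum(self.items)


class Pipe:
    """Wrap a value so that Pipe(x) | f | g evaluates g(f(x))."""

    def __init__(self, value):
        self.value = value

    def __or__(self, fn):
        # Apply the function on the right to the wrapped value, then re-wrap.
        return Pipe(fn(self.value))


def keep_even(xs):
    return [x for x in xs if x % 2 == 0]


def square(xs):
    return [x * x for x in xs]


# Method chaining: each step is a method on the object returned by the last.
chained = Numbers(range(6)).keep_even().square().total()

# Pipe notation: each step is a plain function applied to the previous result.
piped = (Pipe(range(6)) | keep_even | square | sum).value

print(chained, piped)
```

Both forms read left to right and compute the same value; the difference is whether each step is looked up as a method on the intermediate object or supplied as a free-standing function.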
Continue reading Piping is Method Chaining
A good friend is now a professor at the University of Auckland and knew to photograph and send us this. Thanks!!!