Posted on Categories OpinionTags ,

Florence Nightingale, Data Scientist

Florence Nightingale, Data Scientist.

In 1858 Florence Nightingale published her now famous “rose diagram” breaking down causes of mortality.

Nightingale mortality

By w:Florence Nightingale (1820–1910). – [dead link], Public Domain, Link

For more please here.

Posted on Categories OpinionTags , , , ,

PyCharm Video Review

My basic video review of the PyCharm integrated development environment for Python with Anaconda and Jupyter/iPython integration. I like the IDE extensions enough to pay for them early in my evaluation. Highly recommended for data science projects, at least try one of the open-source or the trial versions.

Posted on Categories OpinionTags , , 4 Comments on A Comment on Data Science Integrated Development Environments

A Comment on Data Science Integrated Development Environments

A point that differs from our experience struck us in the recent note regarding doing data science in Python:

A development environment [for Python] specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist.

“What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA” Amit Ghosh

Actually, Python has a large number of very capable integrated development environments, some of which are specifically tailored for data science. Please read on for a small list of tools, and my recommendations for a specific data science in Python toolchain.

Continue reading A Comment on Data Science Integrated Development Environments

Posted on Categories Opinion, StatisticsTags , ,

Technical books are amazing opportunities

Nina and I have been sending out drafts of our book Practical Data Science with R 2nd Edition for technical review. A few of the reviews came back from reviewers that described themselves with variations of:

Senior Business Analyst for COMPANYNAME. I have been involved in presenting graphs of data for many years.

To us this reads as somebody with deep experience, confidence, and bit of humility. They do something technical and valuable, but because they understand it they do not consider it to be arcane magic.

In this note we describe might can happen if such a person (or if a junior version of such a person) acquires 1 or 2 technical books.

Continue reading Technical books are amazing opportunities

Posted on Categories Administrativia, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, StatisticsTags , , , ,

Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers

Starting With Data Science

A rigorous hands-on introduction to data science for software engineers.

Win Vector LLC is now offering a 4 day on-site intensive data science course. The course targets software engineers familiar with Python and introduces them to the basics of current data science practice. This is designed as an interactive in-person (not remote or video) course.

Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Software Engineers

Posted on Categories Exciting Techniques, Opinion, TutorialsTags , ,

cdata Control Table Keys

In our cdata R package and training materials we emphasize the record-oriented thinking and how to design a transform control table. We now have an additional exciting new feature: control table keys.

The user can now control which columns of a cdata control table are the keys, including now using composite keys (that is keys that are spread across more than one column). This is easiest to demonstrate with an example.

Continue reading cdata Control Table Keys

Posted on Categories Administrativia, Opinion, Programming, StatisticsTags , , , , , 2 Comments on rquery: SQL from R

rquery: SQL from R

My BARUG rquery talk went very well, thank you very much to the attendees for being an attentive and generous audience.

IMG 5152

(John teaching rquery at BARUG, photo credit: Timothy Liu)

I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).

If your group is in the San Francisco Bay Area and using R to work with a SQL accessible data source, please reach out to me at, I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.

Posted on Categories data science, Opinion, Statistics, TutorialsTags , , , , ,

We Want to be Playing with a Moderate Number of Powerful Blocks

Many data scientists (and even statisticians) often suffer under one of the following misapprehensions:

  • They believe a technique doesn’t work in their current situation (when in fact it does), leading to useless precautions and missed opportunities.
  • They believe a technique does work in their current situation (when in fact it does not), leading to failed experiments or incorrect results.

I feel this happens less often if you are working with observable and composable tools of the proper scale. Somewhere between monolithic all in one systems, and ad-hoc one-off coding is a cognitive sweet spot where great work can be done.

Continue reading We Want to be Playing with a Moderate Number of Powerful Blocks

Posted on Categories Administrativia, Statistics, TutorialsTags , , , , , 8 Comments on Update on coordinatized or fluid data

Update on coordinatized or fluid data

We have just released a major update of the cdata R package to CRAN.


If you work with R and data, now is the time to check out the cdata package. Continue reading Update on coordinatized or fluid data