I have been doing a lot of writing lately (the book, clients, blog, status updates, and the occasional tweet). This has made me acutely aware of how different many of these writing tasks tend to be. Read more…
Bitcoin continues to surge in buzz and price (194,993 coin transfer, US Senate hearings and astronomical price and total capitalization). This gets me to thinking: what in finance terms is Bitcoin? It claims aspire to be a currency, but what is it actually behaving like? Read more…
We recently got this question from a subscriber to our book:
… will you in any way describe what subject areas, backgrounds, courses etc. would help a non data scientist prepare themselves to at least understand at a deeper level why they techniques you will discuss work…and also understand the boundary conditions and limits of the models etc….. ?
[…] I would love to understand what I could review first to better prepare to extract the most from it.
It’s a good question, and it raises an interesting philosophical point. To read our book, it will of course help to know a little bit about statistics and probability, and to be familiar with R and/or with programming in general. But we do plan on introducing the necessary concepts as needed into our discussion, so we don’t consider these subjects to be “pre-requisites” in a strict sense.
Part of our reason for writing this book is to make reading about statistics/probability and machine learning easier. That is, we hope that if you read our book, other reference books and textbooks will make more sense, because we have given you a concrete context for the abstract concepts that the reference books cover.
So, my advice to our subscriber was to keep his references handy as he read our book, rather than trying to brush up on all the “pre-requisite” subjects first.
Of course, everyone learns differently, and we’d like to know what other readers think. What (if anything) would you consider “pre-requisites” to our book? What would you consider good companion references?
If you are subscribed to our book, please join the conversation, or post other comments on the Practical Data Science with R author’s forum. Your input will help us write a better book; we look forward to hearing from you.
A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.
I’ve written about statistics v.s. machine learning , but I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well, I am going to take a swipe at explaining data science.
We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).
The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990′s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion. Read more…
In data science work you often run into cryptic sentences like the following:
Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).
(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et. al. BMJ 2008;337:a439.)
The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut and hope the paper explains what is actually claimed somewhere later on. We know the writer is claiming significance, but despite the technical terminology they have not actually said which test they actually ran (lm(), glm(), contingency table, normal test, t-test, f-test, g-test, chi-sq, permutation test, exact test and so on). I am going to go out on a limb here and say these type of sentences are gibberish and nobody actually understands them. From experience we know generally what to expect, but it isn’t until we read further we can precisely pin down what is actually being claimed. This isn’t the authors’ fault, they are likely good scientists, good statisticians, and good writers; but this incantation is required by publishing tradition and reviewers.
We argue you should worry about the correctness of your results (how likely a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.
The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance” and there are many different procedures that return p-values (so saying “p” does not limit you down to exactly one procedure like a good acronym might). At best the savings in time would be from having to spend 10 minutes thinking which interpretation of significance is most approbate to the actual problem at hand versus needing a mere 30 seconds to read about the “p.” However, if you don’t have 10 minutes to consider if the entire result a paper is likely an observation artifact due to chance or noise (the subject of significance) then you really don’t care much about the paper.
In our opinion “p-values” have degenerated from a useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut and dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes “p”s are significances, but there is no reason to not just say what sort of error you are claiming is unlikely. Read more…
This article is not on the usual technical topics of this blog, so you have my apology up front for that. And instead of trying to help you, we are asking for your help.
Nina Zumel has written a lot of important and helpful articles for this blog. I would call out in particular: her invention of and leadership in our Statistics to English category, clear writing on statistical significance, visualization and working as a data scientist. She has also written a bit more on the whole person: I Write, Therefore I Think and On Balance.
In this spirit I would like to call your attention to a KickStarter that is important to her and all of us at Win-Vector LLC to: the Non Stop Bhangra Documentary.
I am asking you to please consider promoting this KickStarter to anyone you know that cares about music, entertainment/culture in the San Francisco bay area, Indian culture or the possibility of having some identity outside of professional work. Nina’s story is only one among many of an incredible collective of people who all give a lot of their time to share what has been called “infections joy” with many (including local elementary and high schools). We would really like to see filmmaker Odell Hussey get the money to complete the documentary project he has been donating many hours to for years. This is exactly the kind of project KickStarter was designed for: finishing a larger work.
I ask that you consider supporting the Non Stop Bhangra Documentary. Please join us in supporting this amazing project.
One thing I have observed in multiple software engineering and data science projects is inconvenient steps get skipped. This is negative and happens despite rules, best intentions and effort. Recently I have noticed a positive re-formulation of this in that project quality increases rapidly when you make it take more effort to do the wrong thing. Read more…
Given the range of wants, diverse data sources, required innovation and methods it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a followup to the statistical ideas we shared in setting expectations in data science projects we share a few project planning ideas from software engineering. Read more…
A bit more on the ROC/AUC
The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but we would like to emphasize out some additional details. In my opinion while the ROC is a useful tool, the “area under the curve” (AUC) summary often read off it is not as intuitive and interpretable as one would hope or some writers assert. Read more…
I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain.
I would like to make a plea to my fellow data scientists to stop using Excel-like formats for informal data exchange and become much stricter in producing and insisting on truly open machine readable files. Open files are those in an open format (not proprietary like Microsoft Excel) and machine readable in this case means readable by a very simple program (preferring simple escaping strategies to complicated quoting strategies). A lot of commonly preferred formats surprisingly do not meet these conditions: for example Microsoft Excel, XML and quoted CSV all fail the test. A few formats that do meet these conditions: SQL dumps, JSON and what I call “strong TSV.” I will illustrate some of the difficulty in using ad-hoc formats in R and suggest work-arounds. Read more…