Check out: I Write, Therefore I Think
I am going to come out and say it: I am emotionally done with 32-bit machines and operating systems. My sympathy for them is at an end.
I know that ARM is still 32-bit, but in that case you get something big back in exchange: the ability to deploy on smartphones and tablets. For PCs and servers, the time of 32-bit addressing is long past, yet we still have to code for, and regularly run into, these machines and operating systems. The time and space savings of 32-bit representations are nothing compared to the loss of capability in sticking with that architecture and the wasted effort in coding around it. My work is largely data analysis in a server environment, and it is getting ridiculous not to be able to always assume at least a 64-bit machine. Continue reading I am done with 32 bit machines
When people asked me what it means to be a data scientist, I used to answer, “it means you don’t have to hold my hand.” By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the analysis, and the final presentation of results (both technical and for the business audience) with a minimal amount of assistance from my clients or their people. Not no assistance, of course, but little enough that I’m not interfering too much with their day-to-day job.
This used to be a key selling point, because people with all the necessary skills used to be relatively rare. This is less true now; data science is a hot new career track. Training courses and academic tracks are popping up all over the place. So there is the question: what should such courses teach? Or more to the heart of the question — what does a data scientist do, and what do they need to know?
This was originally posted at ninazumel.com. I’m re-blogging it here.
I came across a post from Emily Willingham the other day: “Is a PhD required for Good Science Writing?”. As a science writer with a science PhD, her answer is: it is not required, and it can often be an impediment. I saw a similar sentiment echoed once by Lee Gutkind, the founder and editor of the journal Creative Nonfiction. I don’t remember exactly what he wrote, but it was something to the effect that scientists are exactly the wrong people to produce literary, accessible writing about matters scientific.
I don’t agree with Gutkind’s point, but I can see where it comes from. Academic writing has a reputation for being deliberately obscure and prolix, jargonistic. Very few people read journal papers for fun (well, except me, but I’m weird). On the other hand, a science writer with a PhD has been trained for critical thinking, and should have a nose for bullpucky, even outside their field of expertise. This can come in handy when writing about medical research or controversial new scientific findings. Any scientist — any person — is going to hype up their work. It’s the writer’s job to see through that hype.
I’m not a science writer in the sense that Dr. Willingham is. I write statistics and data science articles (blog posts) for non-statisticians. Generally, the audience that I write for is professionally interested in the topic, but not necessarily expert in it. And as a writer, many of my concerns are the same as those of a popular science writer.
I want to cut through the bullpucky. I want you, the reader, to come away understanding something you thought you didn’t — or even couldn’t — understand. I want you, the analyst or data science practitioner, to understand your tools well enough to innovate, not just use them blindly. And if I’m writing about one of my innovations, I want you to understand it well enough to possibly use it, not just be awed at my supposed brilliance.
I don’t do these things perfectly; but in the process of trying, and of reading other writers with similar objectives, I’ve figured out a few things.
A recent run of too many articles on the same topic (exhibits: A, B and C) puts me in a position where I feel the need to explain my motivation, which itself becomes yet another article related to the original topic. The explanation I offer is: this is the way mathematicians think. To us mathematicians the tension is that there are far too many observable patterns in the world to be attributed to mere chance. So our dilemma is: for which patterns/regularities should we derive some underlying law, and which ones are not worth worrying about? Or which conjectures should we try to work all the way to proof or counter-example? Continue reading The Mathematician’s Dilemma
Hollywood movies are obsessed with outrunning explosions and outrunning crashing alien spaceships. For explosions the movies give the optimal (but unusable) solution: run straight away. For crashing alien spaceships they give the same advice, but in this case it is wrong. We demonstrate the correct angle to flee.
We are very excited to announce a new Win-Vector LLC blog category tag: Pragmatic Machine Learning. We don’t normally announce blog tags, but we feel this idea identifies an important theme common to a number of our articles and to what we are trying to help others achieve as data scientists. Please look for more news and offerings on this topic going forward. This is the stuff all data scientists need to know.
In both working with and thinking about machine learning and statistics, I am always amazed at the differences in perspective between these two fields. In caricature it boils down to: machine learning initiates expect to get rich and statistical initiates expect to get yelled at. You can see hints of what the practitioners expect to encounter by watching their preparations and initial steps. Continue reading The differing perspectives of statistics and machine learning
A big congratulations to Win-Vector LLC’s Dr. Nina Zumel for authoring and teaching portions of EMC’s new Data Science and Big Data Analytics training and certification program. A big congratulations to EMC, EMC Education Services and Greenplum for creating a great training course. Finally, a huge thank you to EMC, EMC Education Services and Greenplum for inviting Win-Vector LLC to contribute to this great project.
How is it even possible to set expectations and launch data science projects?
Data science projects vary from “executive dashboards” through “automate what my analysts are already doing well” to “here is some data, we would like some magic.” That is, you may be called on to produce visualizations, analytics, data mining, statistics, machine learning, method research or method invention. Given the wide range of wants, diverse data sources, required levels of innovation and methods, it often feels like you cannot even set goals for data science projects.
Many of these projects either fail or become open-ended (and so unmanageable).
As an alternative we describe some of our methods for setting quantifiable goals and front-loading risk in data science projects. Continue reading Setting expectations in data science projects