One of the things I like about R is that, because it is not used for systems programming, you can expect to install your own current version of R without interference from a system version that is deliberately being held back at some older release (for reasons of script compatibility). R is conveniently distributed as a single package (with automated install of additional libraries).
Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.
Python, being a popular general-purpose language, does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the pain imposed by the system environment. With Anaconda, trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant.
Software engineers are forced to decide early if their tools are designed to support complexity or enforce clarity (single install, only current version, and so on). Unfortunately for the end-users, complexity almost always wins. Seemingly clever ideas like late-binding and dependency injection move support costs from developers to users and everything becomes a variation of dll-hell.
It gets to the point where it is considered normal and acceptable to require provisioning of images, virtual machines, or docker containers just to try a piece of software. Things that were once linked now struggle to find each other over perverse message buses.
Lt. Gen. Brian Horrocks: “We will install all of the packages at once by air-dropping a combination of pre-prepared Amazon Machine Images, docker containers, VMWare virtual machines, Heroku Apps, and BSD jails. I’m not saying that this will be the easiest party that we’ve ever attended, but I still wouldn’t miss it for the world.”
However, these containers have to come from somewhere, and I feel it makes sense to try to figure out if a piece of software can be successfully installed or used at all (versus being unmaintainable but artificially kept alive through container services). This is why I have taken the trouble to run down references and actually work out how to install things (such as how to install Caffe deep neural nets from scratch, or R on EC2, or even basic Hadoop on EC2). Yes, there are ready-made containers and images for these things, but at some point you accumulate enough of these incompatible sub-universes that you have come “a bullshit too far”, and the machine stops.
This is why I was initially loath to use Anaconda for my Python installs. There is already too much learned helplessness in software engineering, and Anaconda would represent one more dependency that comes with its own costs and could break. However, after seeing Peter Wang’s talk at the 2015 Data Science Summit & DATO Conference I decided to give Anaconda a try.
Summary: right now it works. My long instructions involving Python, pip, and many package repositories get boiled down to something as short as the following (this example is on Apple OS X):
    # Install Anaconda from https://store.continuum.io/cshop/anaconda/
    # add a few more packages
    anaconda/bin/conda install -c https://conda.binstar.org/r rpy2
    anaconda/bin/conda install cvxopt
    # launch launcher
    open anaconda/Launcher.app
    # Then inside a Jupyter worksheet:
    %%R
    install.packages('ggplot2',repos='http://cran.us.r-project.org')
You now have your own copy of all the common Python scientific libraries and a second install of R (so you do have to re-install any libraries you want to use) ready to work together through the rpy2 library. With minor changes (I decided to take the time to jump from Python 2 to 3) my old worksheets worked (ex1, ex2).
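If you want a quick sanity check that the install really did give you these libraries, something like the following sketch may help. It only asks Python whether each module can be found, without importing it. The helper name `check_packages` and the exact package list are my own illustration (not part of Anaconda); note that scikit-learn is imported as `sklearn`.

```python
import importlib.util

# Top-level module names for the packages mentioned in this post.
# This list is an assumption for illustration; edit it to match your needs.
PACKAGES = ["numpy", "pandas", "sklearn", "sympy", "cvxopt", "bokeh", "rpy2"]


def check_packages(names):
    """Map each top-level module name to True if Python can locate it."""
    # find_spec returns None when a top-level module cannot be found,
    # and it does not actually import the module.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:10s} {'ok' if found else 'MISSING'}")
```

Run it with the Python that ships inside your Anaconda directory (not the system Python), otherwise you are checking the wrong install.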
I would say definitely give Anaconda a try. Anaconda is responsible for installing the entire ecosystem (including the copy of R it wants to use) so the Anaconda developers directly experience “integration debt” (and presumably act in their own interest and continuously reduce this debt).