Posted on Categories Coding, Pragmatic Machine Learning, Programming, Statistics, TutorialsTags , , 3 Comments on Error Handling in R

Error Handling in R

It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come back later and find the loop has crashed partway through, on an unanticipated error. Here’s a toy example:

> inputs = list(1, 2, 4, -5, 'oops', 0, 10)

> for(input in inputs) {
+   print(paste("log of", input, "=", log(input)))
+ }

[1] "log of 1 = 0"
[1] "log of 2 = 0.693147180559945"
[1] "log of 4 = 1.38629436111989"
[1] "log of -5 = NaN"
Error in log(input) : Non-numeric argument to mathematical function
In addition: Warning message:
In log(input) : NaNs produced

The loop handled the negative arguments more or less gracefully (depending on how you feel about NaN), but crashed on the non-numeric argument, and didn’t finish the list of inputs.

How are we going to handle this?

Continue reading Error Handling in R

Posted on Categories Coding, Computer Science, Computers, Opinion, Programming, RantsTags , ,

I am done with 32 bit machines

I am going to come-out and say it: I am emotionally done with 32 bit machines and operating systems. My sympathy for them is at an end.

I know that ARM is still 32 bit, but in that case you get something big back in exchange: the ability to deploy on smartphones and tablets. For PCs and servers 32 bit addressing’s time is long past, yet we still have to code for and regularly run into these machines and operating systems. The time/space savings of 32 bit representations is nothing compared to the loss of capability in sticking with that architecture and the wasted effort in coding around it. My work is largely data analysis in a server environment, and it is just getting ridiculous to not be able to always assume at least a 64 bit machine. Continue reading I am done with 32 bit machines

Posted on Categories Coding, Computer Science, Programming, TutorialsTags , , , , ,

What to do when you run out of memory

A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory.

Early computers were most limited by their paltry memory sizes. von Neumann himself commented that even a room full of genius mathematicians would not be capable of much if all they could communicate, think upon or remember were the characters on a single type written page (much more memory than the few hundred words available to the Eniac). The most visible portions of early computers are their external memories or secondary stores: card readers, paper tape readers and tape drives.


IMG 0062

SDC 920 computer, Computer History Museum, Mountain View CA

Historically computer scientists have concentrated on streaming or online algorithms (that is algorithms that work with the data in the order it is available and use limited memory). For many problems we have found this an insufficient model and it is much better to assume you can re-order and replicate data (such as scattering data to many processors and re-collecting it to sort). The scatter/gather paradigm is ubiquitous and is the underpinning of large scale sorting, databases and Map Reduce. So in one sense databases and Map Reduce different APIs on top of very related technologies (journaling, splitting and merging). Replicating data (or even delaying duplicate elimination) that is already “too large to handle” may seem counterintuitive; but it is exploiting the primary property of secondary storage: that secondary storage tends to be much larger than primary storage (typically by 2 orders of magnitude, compare a 2 terabyte drive to an 8 gigabyte memory stick). Continue reading What to do when you run out of memory

Posted on Categories Applications, Coding, Exciting Techniques, math programming, Mathematics, Programming, TutorialsTags , , , , , ,

Gradients via Reverse Accumulation

We extend the ideas of from Automatic Differentiation with Scala to include the reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients. Continue reading Gradients via Reverse Accumulation

Posted on Categories Applications, Coding, Computer Science, Exciting Techniques, Mathematics, Programming, TutorialsTags , , , , , , , 5 Comments on Automatic Differentiation with Scala

Automatic Differentiation with Scala

This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion. Continue reading Automatic Differentiation with Scala

Posted on Categories Coding, Rants, StatisticsTags , , , 10 Comments on R annoyances

R annoyances

Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor or R” or “professionally using R” (both statements are true). Some days we really don’t feel that way. Continue reading R annoyances

Posted on Categories Coding, Computer Science, RantsTags , , ,

Postel’s Law: Not Sure Who To Be Angry With

One of my research interests is finding the principles that underly the management of information, complexity and uncertainty. When something as simple as a web-form is called “technology” it is time to step back and examine your principles. One principle I am not sure about Postel’s law. It doesn’t hold often enough to be relied on and when it fails I am not sure who to be angry with. Continue reading Postel’s Law: Not Sure Who To Be Angry With

Posted on Categories Coding, Statistics, TutorialsTags , 4 Comments on R examine objects tutorial

R examine objects tutorial

This article is quick concrete example of how to use the techniques from Survive R to lower the steepness of The R Project for Statistical Computing‘s learning curve (so an apology to all readers who are not interested in R). What follows is for people who already use R and want to achieve more control of the software. Continue reading R examine objects tutorial