Did you know R’s `for()` loop control structure drops class annotations from vectors? Continue reading For loops in R can lose class information
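A minimal sketch of the effect, using a `Date` vector as an assumed example (the variable names are illustrative and not taken from the article). Looping over the values hands each element to the loop body as a bare number, while looping over indices and subsetting keeps the class:

```r
dates <- as.Date(c("2016-03-01", "2016-03-02"))

# Iterating over the values directly: each element arrives without its class.
seen_direct <- character(0)
for (d in dates) {
  seen_direct <- c(seen_direct, class(d))
}

# Iterating over indices and subsetting with [[ ]] keeps the class.
seen_indexed <- character(0)
for (i in seq_along(dates)) {
  seen_indexed <- c(seen_indexed, class(dates[[i]]))
}

print(seen_direct)   # "numeric" "numeric"
print(seen_indexed)  # "Date" "Date"
```

Indexing with `seq_along()` is one common way to avoid the surprise; it is offered here as a sketch, not as the article's recommended fix.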

# Month: March 2016

## More on “npm” leftpad

Being interested in code quality and software engineering practice, I have been following (with some relish) the current JavaScript tempest in a teapot: “NPM & left-pad: Have We Forgotten How To Program?” (see also here for more discussion).

Image: Ben Halpern @ThePracticalDev

What happened is:

- A corporate site called NPM decided to remove control of a project called “Kik” from its author and give it to a company that claimed to own the trademark on “Kik.” This isn’t actually how trademark law works or we would see the Coca-Cola Company successfully saying we can’t call certain types of coal “coke” (though it is the sort of world the United States’s “Digital Millennium Copyright Act” assumes).
- The author of “Kik” decided that, since he obviously never had true control of the distribution of his modules distributed through NPM, he would attempt to remove them (see here). This is the type of issue you worry about when you think about freedoms instead of mere discounts. We have been thinking more about this, as we recently had to “re-sign” an arbitrarily altered version of Apple’s software license just to run “git status” on our own code.
- Tons of code broke because it is currently more stylish to include dependencies than to write code.
- Egg is on a lot of faces when it is revealed one of the modules that is so critical to include is something called “leftpad.”
- NPM forcibly re-published some modules to try and mitigate the damage.

Everybody is rightly sick of this issue, but let’s pile on and look at the infamous leftpad. Continue reading More on “npm” leftpad

## Upcoming Win-Vector LLC appearances

Win-Vector LLC will be presenting on statistically validating models using R and data science at:

- Strata+Hadoop World “R Day” Tutorial 9:00am–5:00pm Tuesday, March 29 2016, San Jose, California.
- ODSC San Francisco Meetup, 6:30pm-9:00pm Thursday, March 31, 2016, San Francisco, California.

We will share code and examples.

Registration required (and Strata is a paid conference). Please Tweet/forward. We hope to see you soon!

## sample(): “Monkey’s Paw” style programming in R

The R functions `base::sample` and `base::sample.int` include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier-to-debug code.

“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.
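One documented instance of such a convenience, assumed here to be the kind of behavior the note has in mind: when `sample()` is given a single number `n >= 1`, it samples from `1:n` rather than from the one-element vector. The sketch below shows the surprise and a possible guard; `safe_sample` is a hypothetical helper name, not part of base R and not necessarily the workaround the note proposes.

```r
set.seed(2016)  # arbitrary seed so the run is repeatable

candidates <- c(10)  # a one-element vector we "wish" to draw from

# sample() treats a single number n >= 1 as shorthand for 1:n,
# so these draws come from 1..10 rather than always being 10.
surprising <- replicate(100, sample(candidates, 1))

# Hypothetical helper: sample positions with sample.int(), then index into x.
safe_sample <- function(x, size = length(x), replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}
expected <- replicate(100, safe_sample(candidates, 1))

print(range(surprising))
print(unique(expected))  # 10
```

Sampling indices and then subsetting behaves the same way for vectors of any length, which is what makes code that shrinks a vector down to one element easier to debug.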

## More on preparing data

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel’s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC’s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.

We feel Nina really hit the ball out of the park with over 400 new live viewers. Read more for links to even more free materials! Continue reading More on preparing data

## Bend or break: strings in R

A common complaint from new users of R is: the string processing notation is ugly.

- Using `paste(,,sep='')` to concatenate strings seems clumsy.
- You are never sure which regular expression dialect `grep()/gsub()` are really using.
- Remembering the difference between `length()` and `nchar()` is initially difficult.
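The three complaints can be sketched in a few lines of base R. The strings and variable names below are made up for illustration:

```r
first  <- "Win"
second <- "Vector"

# Concatenation: paste() needs an explicit empty separator; paste0() is the shorthand.
joined_paste  <- paste(first, second, sep = "")
joined_paste0 <- paste0(first, second)

# length() counts vector elements; nchar() counts characters in each element.
n_elements   <- length(joined_paste)
n_characters <- nchar(joined_paste)

# The same pattern means different things depending on the matching mode.
as_regex   <- gsub(".", "-", "a.b")                # "." matches any character
as_literal <- gsub(".", "-", "a.b", fixed = TRUE)  # "." is a literal dot

print(joined_paste)                  # "WinVector"
print(c(n_elements, n_characters))   # 1 9
print(c(as_regex, as_literal))       # "---" "a-b"
```

By default `grep()` and `gsub()` use extended regular expressions; `perl = TRUE` switches to Perl-compatible ones and `fixed = TRUE` matches the pattern literally, which is part of why the dialect in use is easy to lose track of.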

As always things can be improved by using additional libraries (for example: stringr). But this always evokes Python’s “There should be one– and preferably only one –obvious way to do it” or what I call the “rule 42” problem: “if it is the right way, why isn’t it the first way?”

From “Alice’s Adventures in Wonderland”:

Alice’s Adventures in Wonderland, drawn by John Tenniel.

At this moment the King, who had been for some time busily writing in his note-book, cackled out ‘Silence!’ and read out from his book, ‘Rule Forty-two. All persons more than a mile high to leave the court.’

Everybody looked at Alice.

‘I’m not a mile high,’ said Alice.

‘You are,’ said the King.

‘Nearly two miles high,’ added the Queen.

‘Well, I shan’t go, at any rate,’ said Alice: ‘besides, that’s not a regular rule: you invented it just now.’

‘It’s the oldest rule in the book,’ said the King.

‘Then it ought to be Number One,’ said Alice.

We will write a bit on evil ways that you should never actually use to try and weasel around the string concatenation notation issue in R. Continue reading Bend or break: strings in R

## Win-Vector video courses: price/status changes

Win-Vector LLC has been offering a couple of online video courses on the topics of data science and A/B testing (both using R). These are high quality courses and well worth the money and time needed to work through them closely (with all materials distributed on GitHub).

Our current distributor is Udemy, which has just announced a unilateral change in pricing policy (March 2, 2016). This note is about the current status of these courses. Continue reading Win-Vector video courses: price/status changes

## Reading and writing proofs

In my recent article on optimizing set diversity I mentioned that the primary abstraction was “diminishing returns,” which is formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in the literature shows that, when maximizing a monotone submodular function, the greedy algorithm run for k steps picks a set that scores at least `1-1/e` times the value of the unknown optimal pick (that is, it picks up at least `63%` of the possible value). This is significant, because naive optimization may only pick a set of value `1/k` of the value of the optimal selection.
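To make the greedy algorithm concrete, here is a minimal sketch for one standard monotone submodular objective, set coverage (the number of distinct items covered by the chosen sets). The function name `greedy_max_coverage` and the toy data are assumptions made for illustration; they do not come from the article.

```r
# Greedy selection: at each of k steps, add the set with the largest marginal gain.
greedy_max_coverage <- function(sets, k) {
  k <- min(k, length(sets))
  chosen <- integer(0)
  covered <- c()
  for (step in seq_len(k)) {
    gains <- vapply(seq_along(sets), function(i) {
      if (i %in% chosen) return(-1L)        # never pick the same set twice
      length(setdiff(sets[[i]], covered))   # items this set would newly cover
    }, integer(1))
    best <- which.max(gains)
    chosen <- c(chosen, best)
    covered <- union(covered, sets[[best]])
  }
  list(chosen = chosen, value = length(covered))
}

sets <- list(c(1, 2, 3, 4), c(1, 2, 5), c(3, 4, 6), c(5, 6, 7))
k <- 2
result <- greedy_max_coverage(sets, k)

# Brute-force optimum over all size-k choices (only feasible for toy inputs).
best_value <- max(combn(length(sets), k,
                        function(idx) length(unique(unlist(sets[idx])))))

print(result$chosen)  # 1 4
print(result$value)   # 7
print(result$value >= (1 - 1 / exp(1)) * best_value)  # TRUE
```

On this toy input the greedy pick happens to match the brute-force optimum; the `1-1/e` guarantee only promises that it can never fall below roughly 63% of it.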

The proof that the greedy algorithm does well in maximizing monotone increasing submodular functions is clever and a very good opportunity to teach about reading and writing mathematical proofs. The point is that one needs an active reading style: most of what is crucial to a proof isn’t written, and that which *is* written in a proof can’t *all* be pivotal (else proofs would be a lot more fragile than they actually are).

Uwe Kils “Iceberg”

In this article I am attempting to reproduce some fraction of the insight found in: Polya “How to Solve It” (1945) and Doron Zeilberger “The Method of Undetermined Generalization and Specialization Illustrated with Fred Galvin’s Amazing Proof of the Dinitz Conjecture” (1994).

So I repeat the proof here (with some annotations and commentary). Continue reading Reading and writing proofs