
For loops in R can lose class information

Did you know R's for() loop control structure drops class annotations from vectors?
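The behavior is easy to demonstrate. Here is a minimal sketch (the Date class is just one example of a classed vector, and the index-based loop shown is a common workaround, not necessarily the post's specific fix):

```r
# a Date vector carries a class attribute
d <- as.Date(c("2020-01-01", "2020-01-02"))
print(class(d))        # "Date"

# for() iterates over the underlying values and drops attributes
for (di in d) {
  print(class(di))     # "numeric" -- the Date class is gone
}

# iterating over indices preserves the class
for (i in seq_along(d)) {
  print(class(d[i]))   # "Date"
}
```

The same attribute-stripping applies to factors and POSIXct timestamps, which is why looping over indices (or using vapply with an explicit result type) is the safer habit.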


More on “npm” leftpad

Being interested in code quality and software engineering practice, I have been following (with some relish) the current JavaScript tempest in a teapot: “NPM & left-pad: Have We Forgotten How To Program?” (see also here for more discussion).

Image: Ben Halpern @ThePracticalDev

What happened is:

  1. A corporate site called NPM decided to remove control of a project called “Kik” from its author and give it to a company that claimed to own the trademark on “Kik.” This isn’t actually how trademark law works, or we would see the Coca-Cola Company successfully saying we can’t call certain types of coal “coke” (though it is the sort of world the United States’s “Digital Millennium Copyright Act” assumes).
  2. The author of “Kik” decided that, since he obviously never had true control of the distribution of his modules distributed through NPM, he would attempt to remove them (see here). This is the type of issue you worry about when you think about freedoms instead of mere discounts. We are thinking more about this as we recently had to “re-sign” an arbitrarily altered version of Apple’s software license just to run “git status” on our own code.
  3. Tons of code broke because it is currently more stylish to include dependencies than to write code.
  4. Egg is on a lot of faces when it is revealed one of the modules that is so critical to include is something called “leftpad.”
  5. NPM forcibly re-published some modules to try and mitigate the damage.

Everybody is rightly sick of this issue, but let’s pile on and look at the infamous leftpad.
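To see why there is egg on faces: left-padding is a few lines of code in almost any language. A base-R sketch (left_pad is a name invented here for illustration; strrep requires R 3.3.0 or later):

```r
# pad strings on the left to a given width, with no package dependency
left_pad <- function(s, width, pad = " ") {
  s <- as.character(s)
  n <- pmax(0, width - nchar(s))   # pad characters needed per entry
  paste0(strrep(pad, n), s)
}

left_pad("5", 3, "0")       # "005"
left_pad(c("a", "bb"), 4)   # "   a" "  bb"
```

The point is not that padding is trivial to write well in every corner case, but that pulling in an external dependency for it trades a few lines of auditable code for an unbounded supply-chain risk.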


Upcoming Win-Vector LLC appearances

Win-Vector LLC will be presenting on statistically validating models using R and data science at:

We will share code and examples.

Registration required (and Strata is a paid conference). Please Tweet/forward. We hope to see you soon!



sample(): “Monkey’s Paw” style programming in R

R’s base::sample function (and some of its relatives) includes extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer, easier-to-debug code.
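As a taste of the kind of “convenience” in question, here is a well-known sample() hazard (the safe_sample wrapper below is a workaround sketched here for illustration, not code from the post itself):

```r
# sample() treats a length-1 numeric vector x as the range 1:x,
# so code that worked on longer vectors silently changes meaning
x <- c(10)
sample(x, 1)   # can return any of 1..10, not necessarily 10

# sampling by index avoids the special case entirely
safe_sample <- function(v, size, replace = FALSE) {
  v[sample.int(length(v), size, replace = replace)]
}
safe_sample(x, 1)   # always 10
```

Like the Monkey’s Paw, the wish is granted (less typing in the common case) at a terrible price (a data-dependent change of behavior that only shows up when a vector happens to shrink to one element).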

“The Monkey’s Paw”, story: William Wymark Jacobs, 1902; illustration Maurice Greiffenhagen.



More on preparing data

The Microsoft Data Science User Group just sponsored Dr. Nina Zumel‘s presentation “Preparing Data for Analysis Using R”. Microsoft saw Win-Vector LLC‘s ODSC West 2015 presentation “Prepping Data for Analysis using R” and generously offered to sponsor improving it and disseminating it to a wider audience.


We feel Nina really hit the ball out of the park, with over 400 new live viewers. Read more for links to even more free materials!


Bend or break: strings in R

A common complaint from new users of R is: the string processing notation is ugly.

  • Using paste(a, b, sep='') to concatenate strings seems clumsy.
  • You are never sure which regular expression dialect grep()/gsub() are really using.
  • Remembering the difference between length() and nchar() is initially difficult.
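Each of these complaints can be reproduced in a couple of lines of base R:

```r
x <- "alpha"

# concatenation: paste() with sep = "" versus the shorter paste0()
paste("x is ", x, sep = "")        # "x is alpha"
paste0("x is ", x)                 # same result

# regex dialects: base R defaults to POSIX extended syntax;
# fixed = TRUE turns matching into a literal string comparison
grepl("a.b", "axb")                # TRUE  ("." matches any character)
grepl("a.b", "axb", fixed = TRUE)  # FALSE (only the literal "a.b" matches)

# length() counts vector elements, nchar() counts characters
length(x)   # 1
nchar(x)    # 5
```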

As always, things can be improved by using additional libraries (for example: stringr). But this always evokes Python’s “There should be one– and preferably only one –obvious way to do it” or what I call the “rule 42” problem: “if it is the right way, why isn’t it the first way?”

From “Alice’s Adventures in Wonderland”:

Alice’s Adventures in Wonderland, drawn by John Tenniel.

At this moment the King, who had been for some time busily writing in his note-book, cackled out ‘Silence!’ and read out from his book, ‘Rule Forty-two. All persons more than a mile high to leave the court.’

Everybody looked at Alice.

‘I’m not a mile high,’ said Alice.

‘You are,’ said the King.

‘Nearly two miles high,’ added the Queen.

‘Well, I shan’t go, at any rate,’ said Alice: ‘besides, that’s not a regular rule: you invented it just now.’

‘It’s the oldest rule in the book,’ said the King.

‘Then it ought to be Number One,’ said Alice.

We will write a bit about evil ways, which you should never actually use, to weasel around the string concatenation notation issue in R.
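One example of the sort of trick we mean (shown here only as an illustration, not a recommendation) is hiding concatenation behind a user-defined infix operator:

```r
# a private infix operator for concatenation: cute, but now every
# reader of the code has to learn your nonstandard notation
`%+%` <- function(a, b) paste0(a, b)

"left" %+% "-" %+% "pad"   # "left-pad"
```

This “works,” but it is exactly the bend-or-break trade: you bend the language’s notation to your taste, and in exchange your code breaks the expectations of everyone else who reads it.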


Win-Vector video courses: price/status changes

Win-Vector LLC has been offering a couple of online video courses on the topics of data science and A/B testing (both using R). These are high quality courses and well worth the money and time needed to work through them closely (with all materials distributed on GitHub).

Our current distributor is Udemy, which has just announced a unilateral change in pricing policy (March 2, 2016). This note is about the current status of these courses.


Reading and writing proofs

In my recent article on optimizing set diversity I mentioned that the primary abstraction was “diminishing returns,” formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in the literature shows that, when maximizing a monotone submodular function, the greedy algorithm run for k steps picks a set that scores no worse than a 1-1/e fraction of the unknown optimal pick (that is, it picks up at least about 63% of the possible value). This is significant, because a naive selection may achieve only 1/k of the value of the optimal selection.
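The core of the standard argument fits in a few lines (sketched here with f the monotone submodular function, S_i the greedy set after i steps, and OPT an optimal size-k set):

```latex
\text{Let } \delta_i = f(\mathrm{OPT}) - f(S_i).
\text{Monotonicity and submodularity guarantee some unchosen element of }
\mathrm{OPT} \text{ has marginal gain at least } \delta_i / k,
\text{ so each greedy step yields }
\delta_{i+1} \le \left(1 - \frac{1}{k}\right)\delta_i .
\text{Iterating } k \text{ times from } \delta_0 \le f(\mathrm{OPT}) :
\delta_k \le \left(1 - \frac{1}{k}\right)^{k} f(\mathrm{OPT})
         \le e^{-1} f(\mathrm{OPT}),
\text{hence }
f(S_k) \ge \left(1 - \frac{1}{e}\right) f(\mathrm{OPT})
        \approx 0.63 \, f(\mathrm{OPT}).
```

Notice how much of the work is in the one unwritten step: exhibiting the element of OPT whose marginal gain is at least δ_i/k, which is exactly where submodularity earns its keep.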

The proof that the greedy algorithm does well in maximizing monotone increasing submodular functions is clever, and a very good opportunity to teach about reading and writing mathematical proofs. The point is: one needs an active reading style, as most of what is crucial to a proof isn’t written, and that which is written can’t all be pivotal (else proofs would be a lot more fragile than they actually are).

Uwe Kils “Iceberg”

In this article I am attempting to reproduce some fraction of the insight found in: George Pólya, “How to Solve It” (1945), and Doron Zeilberger, “The Method of Undetermined Generalization and Specialization Illustrated with Fred Galvin’s Amazing Proof of the Dinitz Conjecture” (1994).

So I repeat the proof here (with some annotations and commentary).