vtreat is a very complete and rigorous tool for preparing messy real-world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
My BARUG rquery talk went very well. Thank you very much to the attendees for being an attentive and generous audience.
(John teaching rquery at BARUG, photo credit: Timothy Liu)
I am now looking for invitations to give a streamlined version of this talk privately to groups using R who want to work with SQL (with databases such as PostgreSQL or big data systems such as Apache Spark). rquery has a number of features that greatly improve team productivity in this environment (strong separation of concerns, strong error checking, high usability, specific debugging features, and high performance queries).
If your group is in the San Francisco Bay Area and using R to work with a SQL accessible data source, please reach out to me at firstname.lastname@example.org. I would be honored to show your team how to speed up their project and lower development costs with rquery. If you are a big data vendor and some of your clients use R, I am especially interested in getting in touch: our system can help R users start working with your installation.
If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.
Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.
In R, [[ ]] is the operator that (when supplied a simple scalar argument) pulls a single element out of a list (and the [ ] operator pulls out sub-lists).
For vectors, [[ ]] and [ ] appear to be synonyms (modulo the issue of names). However, for a vector [[ ]] checks that the indexing argument is a scalar, so if you intend to retrieve one element this is a good way of getting an extra check and documenting intent. Also, when writing reusable code you may not always be sure if your code is going to be applied to a vector or list in the future.
It is safer to get into the habit of always using [[ ]] when you intend to retrieve a single element.
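The scalar check mentioned above can be seen directly. This is a minimal sketch, assuming an illustrative three-element character vector named v (the name and values are not from the original article); tryCatch is used only so the script keeps running after the error.

```r
# Illustrative data; the name `v` is an assumption made for this sketch.
v <- c("a", "b", "c")

# [ ] accepts a multi-element index and quietly returns several values.
several <- v[c(1, 2)]
print(several)
#> [1] "a" "b"

# [[ ]] on an atomic vector rejects a multi-element index with an error,
# which is the extra check we want when we mean "exactly one element".
result <- tryCatch(
  v[[c(1, 2)]],
  error = function(e) "error signalled"
)
print(result)
#> [1] "error signalled"
```

Note that for lists a multi-element index to [[ ]] means recursive indexing rather than an error, which is why the tip is phrased in terms of a simple scalar argument.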
Example with lists:
#> [1] "a"
#> [1] "a"
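For the list case, a minimal sketch of the difference, assuming a two-element list named l (an illustrative name, not necessarily the code from the original article): [[ ]] returns the element itself, while [ ] returns a sub-list that still wraps it.

```r
# Illustrative list; the name `l` is an assumption made for this sketch.
l <- list("a", "b")

# [[ ]] pulls out the element itself (here a character vector).
one <- l[[1]]
print(one)
#> [1] "a"

# [ ] pulls out a sub-list of length one that contains the element.
sub <- l[1]
print(sub)
#> [[1]]
#> [1] "a"
#>

# The two results are different types.
print(is.list(one))
#> [1] FALSE
print(is.list(sub))
#> [1] TRUE
```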
Example with vectors:
#> [1] "a"
#> [1] "a"
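For the vector case, a minimal sketch of where the two operators agree and where names make them differ, assuming an illustrative named vector nv (not from the original article): both retrieve the same value, but [ ] keeps the name and [[ ]] drops it.

```r
# Illustrative named vector; the name `nv` is an assumption for this sketch.
nv <- c(x = "a", y = "b")

by_single <- nv[1]    # keeps the name "x"
by_double <- nv[[1]]  # returns the bare value, name dropped

print(names(by_single))
#> [1] "x"
print(names(by_double))
#> NULL

# Ignoring names, the retrieved values are the same.
print(unname(by_single) == by_double)
#> [1] TRUE
```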
The idea is: in situations where both [ ] and [[ ]] apply we rarely see [[ ]] being the worse choice.
Note on this article series.
This R tips series consists of short, simple notes on R best practices and additional packaged tools. The intent is to show both how to perform common tasks and how to avoid common pitfalls. I hope to share about 20 of these, about one every other day, to learn from the community which issues resonate and to introduce some of the features from some of our packages. It is an opinionated series that will sometimes touch on coding style and will also try to showcase appropriate Win-Vector LLC R tools.