Posted on Categories Opinion, Programming, StatisticsTags , , , 14 Comments on Neglected R Super Functions

Neglected R Super Functions

R has a lot of under-appreciated super powerful functions. I list a few of our favorites below.


6095431665 88664494f0 b

Atlas, carrying the sky. Royal Palace (Paleis op de Dam), Amsterdam.

Photo: Dominik Bartsch, CC some rights reserved.

Continue reading Neglected R Super Functions

Posted on Categories Administrativia, data science, Opinion, Practical Data Science, Pragmatic Data Science, StatisticsTags , , ,

Four Years of Practical Data Science with R

Four years ago today authors Nina Zumel and John Mount received our author’s copies of Practical Data Science with R!

1960860 10203595069745403 608808262 o

Continue reading Four Years of Practical Data Science with R

Posted on Categories Coding, Opinion, Pragmatic Data Science, Statistics, TutorialsTags , , , , , , ,

R Tip: Think in Terms of Values

R tip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.

I know I write a lot about coding in R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.

R without data is like going to the theater to watch the curtain go up and down.

(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)

Usually you come to R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in much faster, explainable, and maintainable fashion.

Continue reading R Tip: Think in Terms of Values

Posted on Categories Administrativia, Practical Data Science, StatisticsTags , , 2 Comments on Hangul/Korean edition of Practical Data Science with R!

Hangul/Korean edition of Practical Data Science with R!

Excited to see our new Hangul/Korean edition of “Practical Data Science with R” by Nina Zumel, John Mount, translated by Daekyoung Lim.

IMG 0865

Continue reading Hangul/Korean edition of Practical Data Science with R!

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , , , , 1 Comment on R Tip: Use let() to Re-Map Names

R Tip: Use let() to Re-Map Names

Another R tip. Need to replace a name in some R code or make R code re-usable? Use wrapr::let().



Continue reading R Tip: Use let() to Re-Map Names

Posted on Categories Coding, Opinion, Statistics, TutorialsTags , , , , , , , 13 Comments on R Tip: Break up Function Nesting for Legibility

R Tip: Break up Function Nesting for Legibility

There are a number of easy ways to avoid illegible code nesting problems in R.

In this R tip we will expand upon the above statement with a simple example.

Continue reading R Tip: Break up Function Nesting for Legibility

Posted on Categories Opinion, Rants, StatisticsTags , 3 Comments on The Many Faces of R

The Many Faces of R

Some days I see R as an eclectic programming language preferred by scientists.

“Programming languages as people.”

PP2

From Leftover Salad (David Marino).

Other days I see it more like the following.

Continue reading The Many Faces of R

Posted on Categories data science, Practical Data Science, Pragmatic Data Science, Pragmatic Machine Learning, Statistics, TutorialsTags , , , , , , 9 Comments on R Tip: Use the vtreat Package For Data Preparation

R Tip: Use the vtreat Package For Data Preparation

If you are working with predictive modeling or machine learning in R this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.

R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.

Vtreat

When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out of range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
  • Nested model bias poisoning results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real world projects encounter all of these issues, which are often ignored leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.

vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.

If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.

Continue reading R Tip: Use the vtreat Package For Data Preparation

Posted on Categories Coding, Statistics, TutorialsTags , , , 1 Comment on R Tip: Introduce Indices to Avoid for() Class Loss Issues

R Tip: Introduce Indices to Avoid for() Class Loss Issues

Here is an R tip. Use loop indices to avoid for()-loops damaging classes.

Below is an R annoyance that occurs again and again: vectors lose class attributes when you iterate over them in a for()-loop.

d <- c(Sys.time(), Sys.time())
print(d)
#> [1] "2018-02-18 10:16:16 PST" "2018-02-18 10:16:16 PST"

for(di in d) {
  print(di)
}
#> [1] 1518977777
#> [1] 1518977777

Notice we printed numbers, not dates/times. To avoid this problem introduce an index, and loop over that, not over the vector contents.

for(ii in seq_along(d)) {
  di <- d[[ii]]
  print(di)
}
#> [1] "2018-02-18 10:16:16 PST"
#> [1] "2018-02-18 10:16:16 PST"

Continue reading R Tip: Introduce Indices to Avoid for() Class Loss Issues

Posted on Categories Administrativia, StatisticsTags , , , , ,

Speaking on New Tools for R at Big Data Scale

I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts.


IMG 4606
John Mount discussing rquery SQL generation at LinkedIn.

If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have some great new tools I’d love to speak on and share. I’d love an invite, especially if your group is in the San Francisco Bay Area.

Note: we also now have a 1/2 to 1 day on-site “Spark for R Users” training offering. Again, please reach out if your team is interested.