R tip: use slices.
R has a very powerful array slicing ability that allows for some very slick data processing.
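As a small illustration of what slicing looks like in base R (a sketch on made-up data, not necessarily the example used in the full post):

```r
# Small example data frame (values invented for illustration).
d <- data.frame(
  x = c(1, 2, 3, 4, 5),
  group = c("a", "b", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# Slice rows with a logical vector and columns by name.
# drop = FALSE keeps the result a data frame even with one column.
d_a <- d[d$group == "a", c("x"), drop = FALSE]

# Slice a vector by position: a range, and a negative index to exclude.
first_three <- d$x[1:3]
all_but_first <- d$x[-1]
```

Rows, columns, and vector elements can each be selected by position, by name, or by a logical condition, which is what makes this style of processing compact.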
Continue reading R Tip: Use Slices
R tip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.
I know I write a lot about coding in R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.
R without data is like going to the theater to watch the curtain go up and down.
(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)
Usually you come to R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in a much faster, more explainable, and more maintainable fashion.
Continue reading R Tip: Think in Terms of Values
Here is an R tip. Want to re-map a column of values? Use a named vector as the mapping.
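A minimal sketch of the idea, using hypothetical codes and labels chosen only for illustration:

```r
# The mapping: names are the old values, elements are the new values.
# (These codes and labels are invented for this sketch.)
map <- c(a = "apple", b = "banana", c = "cherry")

d <- data.frame(
  code = c("a", "c", "b", "a"),
  stringsAsFactors = FALSE  # keep codes as strings so lookup is by name
)

# Index the named vector by the column to re-map every row at once.
# unname() drops the lookup keys from the result.
d$fruit <- unname(map[d$code])
```

Keeping the column as character matters here: indexing a vector with a factor uses the factor's integer codes rather than its labels.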
Continue reading R Tip: Use Named Vectors to Re-Map Values
Another R tip. Need to replace a name in some R code or make R code re-usable? Use let().
Continue reading R Tip: Use let() to Re-Map Names
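A rough sketch of name re-mapping, assuming the let() in question is the one from the wrapr package and that wrapr is installed. The data frame and the placeholder name COL are invented for illustration:

```r
library(wrapr)  # assumption: let() here is wrapr::let()

d <- data.frame(x = c(1, 2, 3))

# The column name arrives as a string, for example from a function argument.
column_to_sum <- "x"

# let() substitutes the placeholder name COL with the name given in the
# alias map, then evaluates the resulting expression.
total <- let(
  c(COL = column_to_sum),
  sum(d$COL)
)
```

The code inside let() is written once against the placeholder, so the same block can be re-used with a different column by changing only the alias map.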
There are a number of easy ways to avoid illegible code nesting problems in R.
In this R tip we will expand upon the above statement with a simple example.
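One such way, sketched here on made-up numbers (this may differ from the example in the full post), is to name the intermediate values instead of nesting the calls:

```r
x <- c(4, 9, 16, 25)

# Nested form: has to be read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Stepwise form: each intermediate value gets a name and its own line.
roots <- sqrt(x)
avg_root <- mean(roots)
stepwise <- round(avg_root, 1)
```

Both forms compute the same value. The stepwise form reads top to bottom and leaves intermediate results available for inspection while debugging.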
Continue reading R Tip: Break up Function Nesting for Legibility
R tip: use stringsAsFactors = FALSE.
R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.
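A short sketch of the difference, with the argument set explicitly both ways so the result does not depend on the R version's default (the default changed to FALSE in R 4.0.0):

```r
# Strings re-encoded as a factor.
d_factor <- data.frame(s = c("b", "a", "b"), stringsAsFactors = TRUE)

# Strings left as strings.
d_string <- data.frame(s = c("b", "a", "b"), stringsAsFactors = FALSE)

class_factor <- class(d_factor$s)
class_string <- class(d_string$s)

# A factor stores integer codes plus a set of levels, so converting it
# to integer yields the codes, not anything derived from the text.
codes <- as.integer(d_factor$s)
```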
It is often claimed Sigmund Freud said “Sometimes a cigar is just a cigar.”
Continue reading R Tip: Use stringsAsFactors = FALSE
If you are working with predictive modeling or machine learning in R, this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.
R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.
When attempting predictive modeling with real-world data you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:
- Missing, invalid, or out of range values.
- Categorical variables with large sets of possible levels.
- Novel categorical levels discovered during test, cross-validation, or model application/deployment.
- Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time consuming).
- Nested model bias poisoning results in non-trivial data processing pipelines.
Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real-world projects encounter all of these issues, which are often ignored, leading to degraded performance in production.
vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
vtreat can fix or mitigate these domain independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain dependent (or knowledge based) steps and checks.
If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.
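To give a feel for the workflow, here is a minimal sketch assuming vtreat is installed and the outcome is numeric. The data and column names are invented, and a real project would use far more rows:

```r
library(vtreat)  # assumes the vtreat package is installed

# Toy training data with missing values and a categorical column.
d_train <- data.frame(
  x_num = c(1, NA, 3, 4, 5, 6, NA, 8),
  x_cat = c("a", "b", "a", NA, "b", "a", "c", "a"),
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1),
  stringsAsFactors = FALSE
)

# Design a treatment plan for a numeric outcome.
plan <- designTreatmentsN(
  d_train,
  varlist = c("x_num", "x_cat"),
  outcomename = "y",
  verbose = FALSE
)

# New data with a missing value and a level ("z") not seen in training.
d_new <- data.frame(
  x_num = c(2, NA),
  x_cat = c("a", "z"),
  stringsAsFactors = FALSE
)

# Apply the plan to get an all-numeric frame without missing values.
d_new_treated <- prepare(plan, d_new)
```

The plan is designed once on training data and then applied to any later data, which is how missing values and novel levels at application time get handled consistently. For categorical outcomes, vtreat provides designTreatmentsC() in place of designTreatmentsN().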
Continue reading R Tip: Use the vtreat Package For Data Preparation