
A comment on preparing data for classifiers

I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim; Journal of Machine Learning Research, 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data-prep on 8 of the 123 shared data sets. Given that the paper is already out (not just in pre-print), I think it is appropriate to comment publicly.

The DWN paper is an interesting empirical study that measures the performance of a good number of popular classifiers (179 by their own account) on about 120 data sets (mostly from UCI).

This actually represents a bit of work, as the UCI data sets are not all in exactly the same format. The data sets have varying file names, varying separators, varying missing value symbols, varying quoting/escaping conventions, and non-machine readable headers; some data sets have row-ids, the column to be predicted appears in varying positions, some data is in zip files, and there are many other painful variations. I have always described UCI as “not quite machine readable.” Working with any one data set is easy, but the prospect of building an adapter for each of a large number of such data sets is unappealing. Combined with the fact that the data sets are often small, and often artificial/synthetic (designed to show off one particular inference method), few people work with more than a few of these data sets. The authors of the DWN paper worked with well over 100 and shared their fully machine readable results (.arff and apparently standardized *_R.dat files) in a convenient single downloadable tar-file (see their paper for the URL).

The stated conclusion of the paper is comforting, and not entirely unexpected: random forest methods are usually in the top 3 classifiers in terms of accuracy.

The problem is: we are always more accepting of an expected outcome. To confirm such a conclusion we will, of course, need more studies (on larger and more industry-typical data sets), better measures than accuracy (see here for some details), and a lot of digging into methodology (including data preparation).

To be clear: I like the paper. The authors (as good scientists) publicly shared their data and a bit of their preparation code. This is something most authors do not do, and should in fact be our standard for accepting work for evaluation.

But, let us get down to quibbles. Let’s unpack the data and look at an example. Suppose we start with “car”, a synthetic data set we have often used as an example. The UCI repository supplies 3 files: car.c45-names, car.data, and car.names

  • car.names Free-form description of the data-set and format.
  • car.data Comma separated data (without header).
  • car.c45-names Presumably machine readable header for a C4.5 package

The standard way to deal with this data is to (by hand) inspect car.names or car.c45-names and hand-build a custom command to load the data. Example R code to do this is given below:

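library(RCurl)
# URL of the UCI comma separated data file (elided in this copy of the post)
url <- ""
tab <- read.table(text=getURL(url,write=basicTextGatherer()),
   header=FALSE,sep=',',stringsAsFactors=TRUE)
colnames(tab) <- c('buying', 'maint', 'doors', 
   'persons', 'lug_boot', 'safety', 'class')
print(summary(tab))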

Which (assuming RCurl is properly installed) yields:

   buying      maint       doors     persons   
 high :432   high :432   2    :432   2   :576  
 low  :432   low  :432   3    :432   4   :576  
 med  :432   med  :432   4    :432   more:576  
 vhigh:432   vhigh:432   5more:432             
  lug_boot    safety      class     
 big  :576   high:576   acc  : 384  
 med  :576   low :576   good :  69  
 small:576   med :576   unacc:1210  
                        vgood:  65  

For any one data set having to read the documentation and adapt that into custom loading code is not a big deal. However, having to do this for over 100 data sets is an effort. Let’s look into how the DWN paper did this.

The DWN paper car directory has 9 items:

  • car.data original file from UCI.
  • car.names original file from UCI.
  • le_datos.m Matlab custom data loading code.
  • car.txt Facts about the data set.
  • car.arff Derived .arff format version of the data set.
  • car.cost Pricing of classification errors.
  • car_R.dat Derived standard tab separated values file with header.
  • conxuntos.dat Likely a result file.
  • conxuntos_kfold.dat Likely a result file.

The files I am interested in are car_R.dat and le_datos.m. car_R.dat looks to be a TSV (tab separated values) file with header, likely intended to be read into R. It looks like the file is in a very regular format with row numbers, feature columns first (and named f*) and category to be predicted last (named clase and re-encoded as an integer). Notice that all features (which in this case were originally strings or factors) have been re-encoded as floating point numbers. That is potentially a problem. Let’s try to dig into how this conversion may have been done. We look into le_datos.m and see the following code fragment:

for i_fich=1:n_fich
  f=fopen(fich{i_fich}, 'r');
  if -1==f
    error('erro en fopen abrindo %s\n', fich{i_fich});
  end
  for i=1:n_patrons(i_fich)
    fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
    for j = 1:n_entradas
      t= fscanf(f,'%s',1);
      if j==1 || j==2
        val={'vhigh', 'high', 'med', 'low'};
      elseif j==3
        val={'2', '3', '4', '5-more'};
      elseif j==4
        val={'2', '4', 'more'};
      elseif j==5
        val={'small', 'med', 'big'};
      elseif j==6
        val={'low', 'med', 'high'};
      end
      n=length(val); a=2/(n-1); b=(1+n)/(1-n);
      for k=1:n
        if strcmp(t,val{k})
          x(i_fich,i,j)=a*k+b; break
        end
      end
    end
    t = fscanf(f,'%s',1);   % read the class label
    for j=1:n_clases
      if strcmp(t,clase{j})
        cl(i_fich,i)=j; break
      end
    end
  end
end

It looks like for each categorical variable the researchers have hand-coded an ordered choice of levels. Each level is then replaced by an equally spaced code number from -1 through 1 (using the linear rule x(i_fich,i,j)=a*k+b). Then (in code not shown) possibly more transformations are applied to numeric variables (such as centering and scaling to unit variance). This changes the original data, which looks like this:

  buying maint doors persons lug_boot safety class
1  vhigh vhigh     2       2    small    low unacc
2  vhigh vhigh     2       2    small    med unacc
3  vhigh vhigh     2       2    small   high unacc
4  vhigh vhigh     2       2      med    low unacc
5  vhigh vhigh     2       2      med    med unacc
6  vhigh vhigh     2       2      med   high unacc

To this:

        f1        f2        f3        f4        f5        f6  clase
1 -1.34125  -1.34125  -1.52084  -1.22439  -1.22439  -1.22439      1
2 -1.34125  -1.34125  -1.52084  -1.22439  -1.22439         0      1
3 -1.34125  -1.34125  -1.52084  -1.22439  -1.22439   1.22439      1
4 -1.34125  -1.34125  -1.52084  -1.22439         0  -1.22439      1
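The a*k+b rule from the fragment can be sketched in R to see where numbers like these might come from. This is only an illustration on a toy column: the function name linear_code is made up here, and the centering and scaling step is an assumption (the post only guesses that such a step exists). Under that assumption the scaled codes should land near the ±1.34125 and ±0.447084 values seen in column f1.

```r
# Map the k-th of n hand-ordered levels to an equally spaced code in [-1, 1],
# using the same a*k+b rule as the le_datos.m fragment.
linear_code <- function(x, ordered_levels) {
  n <- length(ordered_levels)
  a <- 2 / (n - 1)
  b <- (1 + n) / (1 - n)
  k <- match(x, ordered_levels)  # position of each value in the chosen order
  a * k + b
}

# Toy column: the four 'buying' levels, 432 rows each (as in the car data).
buying <- rep(c('vhigh', 'high', 'med', 'low'), each = 432)
raw <- linear_code(buying, c('vhigh', 'high', 'med', 'low'))

# Assumption: the later, unseen step is centering and scaling to unit variance.
scaled <- as.numeric(scale(raw))

print(unique(raw))
print(round(unique(scaled), 5))
```

The first level always maps to -1 and the last to 1, whatever the number of levels, so the spacing between codes is fixed by the hand-chosen order alone.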

It appears as if one of the machine learning libraries the authors are using only accepts numeric features (I think some of the Python scikit-learn methods have this limitation) or the authors believe they are using such a package. Whoever prepared this data seemed to be unaware that the standard way to convert categorical variables to numeric is the introduction of multiple indicator variables (see page 33 of chapter 2 of Practical Data Science with R for more details).

[Figure: Indicator variables encoding US Census reported levels of education.]
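To make the idea concrete, here is a minimal base-R sketch that builds indicator columns by hand for a toy safety variable. The helper name indicator_columns and the is_ column prefix are made up for this illustration; in practice R's modeling functions do this for you.

```r
# Build one 0/1 indicator column per level of a categorical variable.
indicator_columns <- function(x) {
  lv <- sort(unique(as.character(x)))
  m <- outer(as.character(x), lv, '==')  # logical matrix: rows by levels
  storage.mode(m) <- 'integer'           # TRUE/FALSE become 1/0
  colnames(m) <- paste0('is_', lv)
  m
}

safety <- c('low', 'med', 'high', 'med')
ind <- indicator_columns(safety)
print(ind)
```

Each row has exactly one 1, so no order or spacing between levels is implied by the encoding.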

The point is: encoding multiple levels of a categorical variable into a single number may seem reversible to a person (as it is a 1-1 map), but some machine learning methods cannot undo the geometric detail lost in such an encoding. For example: with a linear method (be it regression, logistic regression, a linear SVM, or so on) we lose explanatory power unless the encoding has properly guessed both the correct order of the attributes and the relative magnitudes. Even tree-based methods (like decision trees, or even random forest) waste part of their explanatory power (roughly degrees of freedom) trying to invert the encoding (leaving less power remaining to explain the original relation in the data). This sort of ad-hoc encoding may not cause much harm in this one example, but it is exactly what you don’t want to do if there are a great number of levels, cases where the order isn’t obvious, or when you are comparing different methods (as different methods are damaged to different degrees by this encoding).
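A small synthetic example shows the effect for a linear method. The data here is invented: three levels whose outcome is not monotone in the guessed order. Ordinary least squares (lm) stands in for logistic regression only to keep the sketch deterministic.

```r
# Toy data: three levels whose outcome is not monotone in the guessed order.
level <- rep(c('a', 'b', 'c'), times = 20)
y <- rep(c(0, 1, 0), times = 20)

# Single-number encoding: a, b, c mapped to -1, 0, 1.
code <- unname(c(a = -1, b = 0, c = 1)[level])

fit_numeric <- lm(y ~ code)           # one slope for the guessed ordering
fit_factor  <- lm(y ~ factor(level))  # R expands the factor into indicators

rss_numeric <- sum(residuals(fit_numeric)^2)
rss_factor  <- sum(residuals(fit_factor)^2)
print(c(numeric_code = rss_numeric, indicators = rss_factor))
```

With the single-number encoding the fitted slope is zero and the model explains nothing, while the indicator version fits the level means exactly.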

This sort of “convert categorical features through an arbitrary function” step is something we have seen a few times. It is one of the reasons we explicitly discuss indicator variables in “Practical Data Science with R” despite the common wisdom that “everybody already knows about them.” When you are trying to get the best possible results for a client, you don’t want to inflict avoidable errors in your data transforms.

If you absolutely don’t want to use indicator variables consider impact coding or a safe automated transform such as vtreat. In both cases the actual training data is used to try and estimate the order and relative magnitudes of an encoding that would be useful for downstream modeling.
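As a rough illustration of the idea, one simple form of impact coding replaces each level with the mean outcome for that level minus the overall mean outcome. The sketch below is a simplified version written for this note (the function name impact_code is made up) and is not how vtreat is implemented. A careful version would also guard against over-fitting, for example by estimating the codes on data not used to fit the downstream model.

```r
# Replace each level by (mean outcome within the level) - (overall mean outcome).
impact_code <- function(x, y) {
  grand_mean <- mean(y)
  level_means <- tapply(y, as.character(x), mean)
  as.numeric(level_means[as.character(x)]) - grand_mean
}

x <- c('a', 'a', 'b', 'b', 'c', 'c')
y <- c(1, 1, 0, 1, 0, 0)
codes <- impact_code(x, y)
print(data.frame(x, y, codes))
```

Here the order and spacing of the codes come from the training outcome rather than from a hand-picked list of levels.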

Is there any actual damage in this encoding? Let’s load the processed data set and see.

url2 <- ''  # location of car_R.dat (elided in this copy of the post)
dTreated <- read.table(url2,
   sep='\t',header=TRUE)

The original data set supports a pretty good logistic regression model for unacceptable cars:

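train <- rbinom(dim(tab)[[1]],1,0.5)==1  # random split; counts below vary by draw
m1 <- glm(class=='unacc'~buying+maint+doors+persons+lug_boot+safety,
   family=binomial(link='logit'),data=tab[train,])
tab$pred <- predict(m1,newdata=tab,type="response")
# assumed evaluation: held-out rows, 0.5 threshold
print(table(class=tab[!train,'class'],unnacPred=tab[!train,'pred']>0.5))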
##        unnacPred
## class   FALSE TRUE
##   acc     181   18
##   good     30    0
##   unacc    22  577
##   vgood    35    0

The transformed data set does not support as good a logistic regression model.

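m2 <- glm(clase==1~f1+f2+f3+f4+f5+f6,
   family=binomial(link='logit'),data=dTreated[train,])
dTreated$pred <- predict(m2,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],unnacPred=dTreated[!train,'pred']>0.5))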

##      unnacPred
## class FALSE TRUE
##     0    35    0
##     1    43  556
##     2   118   81
##     3    28    2

Now obviously some modeling methods are more sensitive to this mis-coding than others. In fact, for a moderate number of levels you would expect random forest methods to actually invert the coding. But the fact that some methods are more affected than others is one reason why you don’t want to perform this encoding before making comparisons. As to the question of why one would ever use logistic regression: when you have a proper encoding of the data and the model structure is in fact somewhat linear, logistic regression can be a very good method.

In the DWN paper 8 data sets (out of 123) have the a*k+b fragment in their le_datos.m file. So likely the study was largely driven by data sets that natively have only numeric features. Also, we emphasize the DWN paper shared its data and a bit of its methods, which puts it light-years ahead of most published empirical studies. The only reason we can’t critique other authors in the same way is that many other authors don’t share their work.

It always surprises statisticians that the indicator variable trick is not always first in mind. This means we forget to teach and re-teach the method enough. We also need to do more to root-out the incorrect alternatives to the method. Indicator encoding is sometimes hard to point out as it is either not done correctly or done silently.

In R, strings and factors can be treated as single columns or variables and are silently converted during model training and application (or the indicators can be explicitly built using model.matrix()). Oddly enough, R also goes out of its way to provide a publicly visible “convert to numbers by using interior codes” method (data.matrix()), which in my opinion is almost always the wrong method and lures unsuspecting programmers and engineers into error. I have written on this before, but if anything I failed to fully appreciate the pervasive nature of the incorrect practice.
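The difference between the two functions is easy to see on a tiny made-up data frame. The sketch below uses only base R; the toy safety column is invented for the illustration.

```r
d <- data.frame(safety = factor(c('low', 'med', 'high', 'low')))

# data.matrix() exposes the factor's internal integer codes, which follow
# the level order (alphabetical by default: high, low, med).
codes <- data.matrix(d)
print(codes)

# model.matrix() builds indicator columns instead (here with no intercept).
ind <- model.matrix(~ 0 + safety, data = d)
print(ind)
```

Note that the internal codes put high before low, so the numbers from data.matrix() do not even reflect the natural order of the levels.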

Python’s scikit-learn supplies the correct encoding methods in sklearn.feature_extraction.DictVectorizer/sklearn.preprocessing.OneHotEncoder(). I think a lot of Python users get confused because they do not appreciate that Pandas (which deals so well with data representation) and scikit-learn (which really only wants to work with numbers) are two independent packages (and coded not to depend on each other) and some work is required to faithfully move data from one package to the other.

Note: as expected randomForest does better reversing the re-encoding. Also we accidentally left out the variable f6 in an early version of this post.


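library(randomForest)
m1F <- randomForest(as.factor(class=='unacc')~
   buying+maint+doors+persons+lug_boot+safety,data=tab[train,])
tab$predF <- predict(m1F,newdata=tab,type="response")
print(table(class=tab[!train,'class'],unnacPred=tab[!train,'predF']))

##       unnacPred
## class   FALSE TRUE
##   acc     193    6
##   good     30    0
##   unacc     9  590
##   vgood    35    0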

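m2F <- randomForest(as.factor(clase==1)~f1+f2+f3+f4+f5+f6,
   data=dTreated[train,])
dTreated$predF <- predict(m2F,newdata=dTreated,type="response")
print(table(class=dTreated[!train,'clase'],unnacPred=dTreated[!train,'predF']))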

##      unnacPred
## class FALSE TRUE
##     0    35    0
##     1    10  589
##     2   193    6
##     3    30    0

And we can confirm the encoding is in fact reversible by showing which variables and outcomes are in bijective correspondence. This means something as simple as changing the type/class declaration from real to string/factor would undo the coding problem. The machine learning doesn’t need to know the original names of the levels, it just needs to know to treat the data as levels.


##            0    1    2    3
##   acc      0    0  384    0
##   good     0    0    0   69
##   unacc    0 1210    0    0
##   vgood   65    0    0    0


##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0


##         -1.34125 -0.447084 0.447084 1.34125
##   high         0       432        0       0
##   low          0         0        0     432
##   med          0         0      432       0
##   vhigh      432         0        0       0


##         -1.52084 -0.168982 0.506946 1.18287
##   2          432         0        0       0
##   3            0       432        0       0
##   4            0         0        0     432
##   5more        0         0      432       0


##        -1.22439   0 1.22439
##   2         576   0       0
##   4           0 576       0
##   more        0   0     576

##         -1.22439   0 1.22439
##   big          0   0     576
##   med          0 576       0
##   small      576   0       0

##        -1.22439   0 1.22439
##   high        0   0     576
##   low       576   0       0
##   med         0 576       0
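Cross-tables like the ones above can be checked mechanically. The sketch below uses a made-up helper, is_bijective, on toy columns: a pairing is one-to-one when every row and every column of the cross-table has exactly one non-zero cell. It also shows the type change mentioned earlier, declaring the numeric column as a factor.

```r
# Two columns are in one-to-one correspondence when every row and every
# column of their cross-table has exactly one non-zero cell.
is_bijective <- function(a, b) {
  counts <- table(a, b)
  all(rowSums(counts > 0) == 1) && all(colSums(counts > 0) == 1)
}

original <- rep(c('vhigh', 'high', 'med', 'low'), each = 3)
encoded  <- rep(c(-1.34125, -0.447084, 0.447084, 1.34125), each = 3)
print(is_bijective(original, encoded))

# Declaring the numeric column as a factor gives back a plain categorical
# variable (with different level names, which the model does not care about).
recovered <- as.factor(encoded)
print(nlevels(recovered))
```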

2 thoughts on “A comment on preparing data for classifiers”

  1. Thank you for this article. I also questioned why not the use of dummy variables instead of a numerical transformation on categorical variables. One other question I have is given the model, aren’t there issues of running into linear contrast issues with logistic regression if dummy variables represent all levels of the previous categorical variable?

    1. There is an issue when different indicator/dummy variables end up being linearly dependent (or even nearly so). Basically you lose a lot of the interpretability of the coefficients (as you at best only get bounds on linear-subspaces, not on values). But if your only goal is to make predictions (as it often is for data scientists, though not always so for statisticians) then simple precautions like regularization give you good models.
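The linear dependence is easy to exhibit in R. In this small sketch on a made-up factor, an intercept plus indicators for all three levels gives four columns of rank three, while R's default contrasts drop one level so the design matrix has full column rank.

```r
f <- factor(c('low', 'med', 'high', 'low', 'med', 'high'))

# All three indicators plus an intercept: four columns, but only rank three,
# because the indicator columns sum to the intercept column.
full <- cbind(intercept = 1, model.matrix(~ 0 + f))
print(c(columns = ncol(full), rank = qr(full)$rank))

# R's default treatment contrasts drop one level to avoid this.
reduced <- model.matrix(~ f)
print(c(columns = ncol(reduced), rank = qr(reduced)$rank))
```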
