Jordi Casanellas

Missing data: to impute or not to impute? + R examples

11/7/2016

9 Comments

Very often the data we want to analyse and make predictions with is full of black holes of missing values. What should we do with them? Remove the entries (rows) with missing data? Remove the variables (predictors, columns) with missing values? Try to impute the missing values (that is, "guess" them)?

The right strategy depends on your (missing) data. The missing values can be distributed at random, or not...

Missing at random: the fact that a certain value is missing has nothing to do with its hypothetical value, and nothing to do with the values of the other variables (in the statistical literature this case is usually called "missing completely at random", MCAR).
  • The good news:
    • you can safely remove the entries with missing values, because the remaining data will keep the same distribution. That is a really good option if you have enough complete data.
    • your missing data is in general much easier to impute than if it were not missing at random.
  • The bad news:
    • I bet your data is not missing at random :(
By the way, if you want to randomly remove some data from your complete dataframe in R, you can easily do it with the functions in the missForest package:

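A minimal sketch (the exact code from the original post is not shown here): the prodNA() function of the missForest package introduces NA values completely at random into a data frame.

    library(missForest)
    data(iris)
    # Replace 10% of all values with NA, completely at random
    iris.mis <- prodNA(iris, noNA = 0.1)
    summary(iris.mis)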
Missing not at random: there is probably a reason behind your missing data, a pattern. It could be that:
  • The missing values in a variable depend on their hypothetical values. For example, in a survey the variable "income" may have many more missing values for high-income respondents, because people with high income do not want to give that information. Or another example: in your data measuring rainfall you have no values for the days when it rained the most, because your instrument has an upper limit on the amount of rain it can measure.
  • The fact that a certain value is missing depends on other variables which are not missing, like when you have customer data from different sources (let's say mobile and web data) and the variable "customer age" is missing when source=web and not when source=mobile.
In any case:
  • The bad news:
    • if you exclude the entries (rows) with missing values from your analysis, you are probably introducing some bias. In the first example above (the survey), if you want to calculate the average size of the respondents' apartments and you remove the entries with missing "income", your analysis will be restricted to a subset of the original data with, probably, a smaller average apartment size.
    • your missing data is in general much more difficult to impute. In the same survey example, a model imputing income may have to extrapolate to guess the missing values (it may have no other high-income data to learn from).
  • The good news:
    • as there is a reason why your data is missing, you may be able to do something about it. You can try to change how you gather your data, improve the sources, add new sources...
Strategies to deal with missing data
To impute or not to impute, that is the question...

1. Do not impute


Complete-case analysis:
If you have enough data, a good approach is to just remove the rows with missing values and work with the subsample of your data which is complete. Remember that you can only safely do that when your data is missing at random. If it is not, you can try to re-weight the complete cases, although that is not practical at all when working with many variables.

To obtain the complete subsample of your data in R:

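A minimal sketch (the original code block was lost), assuming your data frame is called df:

    # Keep only the rows without any missing value
    df.complete <- df[complete.cases(df), ]
    # Equivalent shortcut:
    df.complete <- na.omit(df)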

Available-case analysis:
The simplest version of this approach is to just remove (if you can afford it) the columns (predictors) with missing values. In R it would be something like:

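A minimal sketch, again assuming a data frame df:

    # Keep only the columns (predictors) without any missing value
    df.available <- df[, colSums(is.na(df)) == 0]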
Of course you may be losing a lot of predictive power with this approach. So what I recommend is to build different models and apply them depending on the availability of the data. Imagine that you have customer data and in 40% of the cases (not at random) you don't have information about their location. You could build two models: one that includes all the predictors, and another one that does not use the location-related predictors (see the sketch below). The same subsampling of the data happens both for the data you train on and for the data you compute predictions on. So even if the non-randomness of the missing values makes the two data samples intrinsically different, your predictions will not be biased, because you train and evaluate on data with the same shape.
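A hedged sketch of this two-model idea with the randomForest package; the response "spend" and the location columns "city" and "country" are hypothetical names, not from the original post:

    library(randomForest)
    # Hypothetical customer data: "city" and "country" are missing
    # (not at random) for ~40% of the customers
    has.loc <- !is.na(train$city) & !is.na(train$country)

    # Model 1: all predictors, trained only on rows with location data
    model.full    <- randomForest(spend ~ ., data = train[has.loc, ])
    # Model 2: drop the location predictors, train on all rows
    model.reduced <- randomForest(spend ~ . - city - country, data = train)

    # At prediction time, route each row to the model it has data for
    new.has.loc <- !is.na(test$city) & !is.na(test$country)
    pred <- numeric(nrow(test))
    pred[new.has.loc]  <- predict(model.full,    test[new.has.loc, ])
    pred[!new.has.loc] <- predict(model.reduced, test[!new.has.loc, ])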


2. Impute

When to impute?
That depends on your data. Set a good cross-validation score, try different imputation/non-imputation methods, and measure the performance of your models.
In general, if you have large percentages of missing data, I would recommend imputing as little as possible (see also the available-case analysis strategy). But sometimes you may be forced to impute. For example, if your model is already built and running in production, it has to compute error-free predictions at regular intervals, and (depending on your algorithm) you may need to avoid feeding null values into your trained model. Hopefully you have selected predictors that you expected to always be complete, but you never know what unexpected missing data will come, and your model needs to perform flawlessly in production. In this case you really need a strategy to impute missing data.

Simple imputation methods:
A simple way to fill your missing values is to set them equal to a central value of the variable: the mean or median if it is numerical, or the mode if it is categorical (factor). That is easily done in R with the na.roughfix function of the randomForest package, which fills numerical variables with the column median and factors with the most frequent level:

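A minimal sketch, assuming a data frame df with missing values:

    library(randomForest)
    # na.roughfix fills NAs with the column median (numerical variables)
    # or the most frequent level (factors)
    df.imputed <- na.roughfix(df)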
In some cases, it may also be a good strategy to interpolate within your numerical predictor to fill the missing values.

Alternatively, in categorical variables with missing values, you can create a new level called "missing" and impute all missing values with this new level. For numerical variables (e.g. Xvar and Yvar) you can roughly impute the missing values (for instance with the mean of the variable) and then create new variables ("wasXvarMissing" and "wasYvarMissing") which flag (0/1) the entries that originally had missing values, as sketched below.
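A hedged sketch of both ideas; SeatPref is a hypothetical factor column, Xvar comes from the text above:

    # Categorical: turn NA into an explicit "missing" level
    df$SeatPref <- addNA(df$SeatPref)
    levels(df$SeatPref)[is.na(levels(df$SeatPref))] <- "missing"

    # Numerical: flag the missing entries, then roughly impute the mean
    df$wasXvarMissing <- as.integer(is.na(df$Xvar))
    df$Xvar[is.na(df$Xvar)] <- mean(df$Xvar, na.rm = TRUE)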

Complex imputation methods:
First you should ask yourself if it would be possible to "manually" impute some of your missing values by defining logical rules. You need to fully understand your data and why it is missing, and this approach may only work for very specific variables. For example, if you have customer data with missing values in the variable "NumberOfChildren", you may try to impute it by copying the value from the customer's partner (husband or wife), which should be right most of the time.
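A hedged sketch of such a rule; the HouseholdID column linking partners is a hypothetical name:

    # Within each household, fill missing NumberOfChildren with the
    # partner's (non-missing) value, if there is one
    fill.from.household <- function(x) {
      known <- x[!is.na(x)]
      if (length(known) > 0) x[is.na(x)] <- known[1]
      x
    }
    customers$NumberOfChildren <- ave(customers$NumberOfChildren,
                                      customers$HouseholdID,
                                      FUN = fill.from.household)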

A second strategy is to use predictive models to impute your missing data. In R I have tried the mice and missForest packages. Although the mice package looks really good, I must say it didn't work for me: in order to apply it to medium-sized data (50k rows, 50 columns) I had to do a lot of tricks to avoid numerical and other errors: (i) I had to impute missing values one variable at a time; (ii) with the other variables, used as predictors for the imputation model, I had to one-hot-encode (create dummies for) the categorical ones and apply PCA to all of them. In the end I was left with a huge custom function to use mice, and even with my tricks I got numerical errors from time to time.

So I strongly recommend missForest, which uses random forests to predict the missing values, because it is very easy to use and, more importantly, it is robust: it successfully (and accurately) imputes missing data and you don't have to worry about numerical errors. It is as easy as:

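A minimal sketch; missForest() fits a random forest for each variable with missing values, using the other variables as predictors, and iterates until the imputations stabilise:

    library(missForest)
    imp <- missForest(df.mis)
    df.imputed <- imp$ximp   # the imputed data frame
    imp$OOBerror             # out-of-bag imputation error (NRMSE / PFC)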
Let's compare "simple" and "advanced" imputation methods on a sample dataset. It is a private dataset with about 10k entries and 30 other variables, where I have randomly removed 25% of the values of the variable "cost of flights". First, let's take a look at the actual distribution of the values that I have removed:
[Figure: Actual distribution of the data before removing values]
Now, in blue, we can see the result of a simple imputation method (na.roughfix), which attributes the same central value (the column median) to all missing values:
[Figure: In blue, imputed values using na.roughfix (all filled with the column median)]
And finally, now in blue, the values imputed using missForest. The distribution of the imputed values reproduces quite accurately the actual distribution of the values before they were artificially removed:
[Figure: In blue, imputed values using missForest]
In the plots above you can also see evaluation metrics of the imputation: MAE (mean absolute error, in €) and NRMSE (normalized root mean squared error). Both show a great improvement when using an advanced missing-value imputation method like random forests.
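If you want to compute such metrics yourself, missForest ships a helper for the case where the true values are known (one option, not necessarily how the plots above were produced):

    library(missForest)
    # NRMSE for continuous variables, PFC (proportion of falsely
    # classified) for categorical ones
    err <- mixError(ximp = df.imputed, xmis = df.mis, xtrue = df.true)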

A similar improvement occurs for categorical data. Here I have removed, in the same dataset, 25% of the values of the variable "Seat preference". The actual values before being removed are shown in red below. In the left panel, the imputation is done using na.roughfix, which attributes the mode of the variable (the most common class: "window") to all missing values. In the right panel, the imputation uses missForest, which builds a predictive model (a random forest) to attribute a seat preference depending on all the other variables. The improvement is obvious at first glance and is also measured by the evaluation metrics BA (balanced accuracy) and Sens (sensitivity).
[Figure: Imputation of missing values using a rough fix (left) and using a random forest (right)]

Do you have other strategies to deal with missing data? Please leave your thoughts in the comments section below.
9 Comments
David
26/9/2016 22:42:53

Hi. Nice article. Quick question: what's the difference between rfImpute and missForest? I have a dataset with 2 predictors that have missing data, both categorical, one with 3 levels and the other with 2. rfImpute seems to split across the 3-level predictor fine, but with the 2-level predictor it lumps all the NAs into one of the values instead of splitting them between the 2 levels. I understood rfImpute to use the random forest proximity matrix; does missForest work along those lines?

Jordi
1/10/2016 10:33:45

Hi David,

as far as I understand, the difference is:

missForest:
For each original predictor with missing data, it builds a new separate model (a random forest with the response being that predictor) that trains on the complete data. The predictions of that model are used to impute the NAs.

rfImpute:
It only uses one model, a random forest with the original response variable, trained after imputing with na.roughfix (medians and modes). This model is only used to calculate the proximity between data points: the proximity is a measure of how often two data points end up in the same leaf node across the different trees. Then it imputes:
- For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are the proximities.
- For categorical predictors, the imputed value is the category with the largest average proximity.

Refs:
- http://math.furman.edu/~dcs/courses/math47/R/library/randomForest/html/rfImpute.html
- https://cran.r-project.org/web/packages/missForest/missForest.pdf

To assess which method is imputing better, I would measure the performance of your predictive model on a test set using both methods. You can check what happens when you impute in your train set and when you just impute in your test set.

Good luck!

pablo
17/4/2017 03:04:38

Hi Jordi, thanks for the post, it's really interesting. I came to it looking for some thoughts -not just code- about missForest. Also, I was surprised by how we came to very similar conclusions in a book I'm writing; it's really important to explain from intuition.

I think the same regarding the mice package: too complex, too many tweaks, and it cannot be used with just any algorithm because it creates several data frames with imputed data, to be used later in a sort-of bagging model. Well, in the end, it is a complete framework; as it says in the paper, the final result is a combination of prior results.

I share with you one more imputation method: create bins in numerical variables, for example with equal frequency (so the variable turns into a categorical one), and then convert the NA values into the category "NA".

Cheers!

Jordi
17/4/2017 11:31:56

Indeed, binning a numerical variable and creating a "missing" category can be a very useful workaround to deal with missing values.

Your "Data Science Live Book" at http://livebook.datascienceheroes.com/ looks very interesting. Thanks for writing and sharing!

pablo casas
10/11/2017 14:55:44

Hi Jordi, finally the missing data chapter is out!
https://livebook.datascienceheroes.com/data-preparation.html#missing_data

Cheers

Su
6/2/2018 20:16:20

Hi Jordi, thanks for the comprehensive article! Very interesting indeed. I'm working on survey data with 1400 categorical variables, most of which are blank, but they seem to be left blank for a good reason rather than completely at random. I will try the methods you've suggested and see what happens. Thanks again!

Jordi
6/2/2018 22:56:50

I am glad the article helped you. Thanks for letting me know. Good luck with your attempts!

Tanushree pareek
23/7/2018 18:00:33

Would you share the code for how you made those plots of the imputed and observed data?

Redmond
7/3/2019 13:22:41

Is anyone aware of how I can train missForest on a dataset and then use that model to impute values in a different dataset?

Cheers
