predictions with is full of black holes of missing data. What can you do about it? Would you remove the entries (rows) with missing data? Would you remove the variables (predictors, columns) with missing values? Or would you try to impute the missing values, that is, to "guess" them?
The right strategy depends on your (missing) data: the missing values can be distributed at random, or not...
Missing at random:
- The good news:
- you can safely remove the entries with missing values, because the remaining complete data will keep the same distribution as the original. That is a really good option if you have enough complete data.
- your missing data is in general much easier to impute than if it were not missing at random.
- The bad news:
- I bet your data is not missing at random :(
Missing not at random: probably there is a reason behind your missing data, a pattern. It could be that:
- The missing values in a variable depend on their own (unobserved) values. For example, in a survey the variable "income" may have many more missing values for high-income respondents, because people with high incomes do not want to share that information. Or, in a dataset measuring rainfall, you may have no data for the days when it rained the most because your instrument has an upper limit on the amount of rain it can measure.
- The fact that a certain value is missing depends on other variables which are not missing. For example, when you have customer data from different sources (say, mobile and web), the variable "customer age" may be missing when source=web but not when source=mobile.
- The bad news:
- if you exclude the entries (rows) with missing values from your analysis, you are probably introducing bias. In the survey example above, if you want to calculate the average apartment size of the respondents and you remove the entries with missing "income", your analysis will be restricted to a subset of your original data, probably with a smaller average apartment size.
- your missing data is in general much more difficult to impute. In the same survey example, a model to impute the missing income values may have to extrapolate, since it may have no other high-income data to learn from.
- The good news:
- since there is a reason behind why your data is missing, you may be able to do something about it: change how you gather your data, improve the sources, add new sources...
To impute or not to impute, that is the question...
1. Do not impute
If you have enough data, a good approach is to simply remove the rows with missing values and work with the complete subsample of your data. Remember that you can only safely do that when your data is missing at random. If it is not, you can try to re-weight the complete data, although that is not practical at all when working with many variables.
To obtain the complete subsample of your data in R:
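A minimal sketch, assuming your data is in a data frame called `df` (the name is just an illustration):

```r
# Keep only the rows with no missing values in any column
complete_df <- df[complete.cases(df), ]

# Equivalent shortcut: na.omit() drops every row containing at least one NA
complete_df <- na.omit(df)
```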
The simpler version of this approach is to just remove (if you can afford it) the columns (predictors) with missing values. In R it would be something like:
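Again a sketch with a hypothetical data frame `df`:

```r
# Keep only the columns (predictors) that contain no NA at all
df_no_na_cols <- df[, colSums(is.na(df)) == 0]
```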
When to impute?
That depends on your data. Set up a good cross-validation scheme, try different imputation/non-imputation methods, and measure the performance of your models.
In general, if you have large percentages of missing data, I would recommend imputing as little as possible (see also the available-case analysis strategy). But sometimes you may be forced to impute. For example, your model may already be built and running in production, where it has to compute error-free predictions at regular intervals. In that case you may need to avoid (depending on your algorithm) feeding null values to your trained model. Hopefully you have already selected predictors that you expected to be always complete, but you never know what unexpected missing data will come, and your model needs to perform flawlessly in production. In that case you really need a strategy to impute missing data.
Simple imputation methods:
A simple way to fill your missing values is to set them equal to the mean or median of the variable, if it is numerical, or to the mode of the variable, if it is categorical (factor). That is easily done in R with the na.roughfix function of the randomForest package (which uses the column median for numerical variables and the most frequent level for factors):
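A sketch, again assuming a hypothetical data frame `df` with numerical and factor columns:

```r
library(randomForest)

# na.roughfix fills NAs with the column median (numerical variables)
# or the most frequent level (factor variables)
df_imputed <- na.roughfix(df)
```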
Alternatively, for categorical variables with missing values, you can create a new level called "missing" and impute all missing values with it. For numerical variables (e.g. Xvar and Yvar) you can roughly impute the missing values (for instance with the mean of the variable) and then create new variables ("wasXvarMissing" and "wasYvarMissing") which are flags (0/1) marking the entries that originally had missing values.
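Both ideas can be sketched like this; `df`, the factor column `Category` and the use of Xvar from the example above are illustrative assumptions:

```r
# Categorical: turn NA into an explicit "missing" level
f <- factor(df$Category, exclude = NULL)    # keep NA as a real level
levels(f)[is.na(levels(f))] <- "missing"
df$Category <- f

# Numerical: flag which entries were missing, then roughly impute
df$wasXvarMissing <- as.integer(is.na(df$Xvar))
df$Xvar[is.na(df$Xvar)] <- mean(df$Xvar, na.rm = TRUE)
```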
Complex imputation methods:
First you should ask yourself whether it would be possible to "manually" impute some of your missing values by defining logical rules. You need to fully understand your data and why it is missing, and this approach may only work for very specific variables. For example, if you have customer data with missing values in the variable "NumberOfChildren", you may try to impute it by copying the value from the customer's partner (husband or wife), which should be right most of the time.
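A sketch of such a rule; the columns `CustomerID` and `PartnerID` are hypothetical names assumed for the illustration:

```r
# For customers with a missing NumberOfChildren, look up their
# partner's row and copy the partner's value when it is available
partner_row <- match(df$PartnerID, df$CustomerID)
missing <- is.na(df$NumberOfChildren) & !is.na(partner_row)
df$NumberOfChildren[missing] <- df$NumberOfChildren[partner_row[missing]]
```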
A second strategy is to use predictive models to impute your missing data. In R I've tried it with the mice and the missForest packages. Although the mice package looks really good, I must say it didn't work for me: in order to apply it to medium-sized data (50k rows, 50 columns) I had to do a lot of tricks to avoid numerical and other errors: (i) I had to impute missing values one variable at a time; (ii) for the other variables, used as predictors for the imputation model, I had to one-hot-encode (get dummies for) the categorical ones and apply PCA to all of them. In the end I had a huge custom function to use mice, and even with these tricks I got numerical errors from time to time.
So I strongly recommend missForest, which uses random forests to predict the missing values. It is very easy to use and, more importantly, it is robust: it successfully (and accurately) imputes missing data and you don't have to worry about numerical errors. It is as easy as:
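A sketch with a hypothetical data frame `df` (numerical and factor columns, NAs included):

```r
library(missForest)

# missForest iteratively fits a random forest for each variable
# to predict that variable's missing values from all the others
imp <- missForest(df)

df_imputed <- imp$ximp   # the completed data frame
imp$OOBerror             # out-of-bag estimate of the imputation error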
A similar improvement occurs for categorical data. Here we have removed from the same dataset 25% of the values of the variable "Seat preference". The actual values before removal are shown in red below. In the left panel, the imputation is done with na.roughfix, which assigns the mode of the variable (the most common class: "window") to all missing values. In the right panel, the imputation uses missForest, which builds a predictive model (a random forest) to assign a seat preference depending on all the other variables. The improvement is obvious at first glance and is also measured by the evaluation metrics BA (balanced accuracy) and Sens (sensitivity).
Do you have other strategies to deal with missing data? Please leave your thoughts in the comments section below.