Aim
To give researchers information and a structured guideline for handling missing data.
Requirements
Clear documentation of the decisions that were made regarding handling missing data. If data were imputed, imputation methods are clearly documented.
Documentation
- Research protocol: Describe how missing data will be prevented during data collection;
- Data analysis plan: Describe the reasons for missing data, how missing data will be handled, and how missing data analyses will be performed (per questionnaire/instrument);
- Syntax: Documentation of the missing data analyses that were performed and of the imputation method, if any. Refer to data files with and without imputed data;
- Logbook: Decisions and arguments for the way missing data are handled.
Responsibilities
Executing researcher:
- Make sure you understand the pitfalls of ignoring missing data;
- Make sure you discuss the way you handle missing data with your supervisor and if needed with experts;
- Document the points mentioned under 3 (Documentation).
Project leaders:
- Advise the executing researcher to consult sources and experts before missing data are imputed or ignored;
- Inspect the research protocol for how missing data will be prevented during data collection;
- Inspect the analysis plan on the steps that will be taken to analyze the missing data;
- Ensure that the executing researcher properly documents the points that are mentioned under 3 (Documentation).
Research assistant:
- Ask the executing researcher, when there are no guidelines, how to handle missing data in questionnaires/on instruments.
How To
1. Introduction
Missing data are a common problem in all kinds of research. The way you deal with them depends on how much data is missing, the kind of missing data (single items, a full questionnaire, a measurement wave), why the data are missing, and how they are missing (at random or not at random). Handling missing data is an important step in several phases of your study.
2. Why do you need to do something with missing data?
The default option in standard software packages such as SPSS, Stata or SAS is that cases with missing values are not included in the analyses. Deleting cases or persons results in a smaller sample size and larger standard errors. As a result, the power to find a significant result decreases: the chance that you correctly accept the alternative hypothesis of an effect (compared to the null hypothesis of no effect) is smaller. Secondly, you introduce bias in effect estimates when there is a difference in characteristics between responders and non-responders. When the group of non-responders is large and you delete them, your sample characteristics differ from those of your original sample and of the population you study. Therefore, always inspect the missing data in your dataset before starting your analyses, and never simply delete persons with missing values.
3. What to do with missing data in different phases of your study
Data preparation:
If you work with questionnaires, make sure that all questions are clear and applicable to your respondents. If necessary, use the ‘not applicable’ answer option. In SPSS, missing values can be coded by the user (user-missing values) or automatically by SPSS itself (system-missing values). It is not necessary to code your missing values with numbers such as 999 or -9999. You can also leave the cells empty, because in both cases the missing values are excluded from the analyses. If you do use numeric codes, make sure they are defined as missing values in SPSS; otherwise they are treated as valid scores. To decrease the chance of missing data, use digital applications to collect your data, such as web-based questionnaires in which you can make answering a question mandatory. You can also use these applications for sending reminders and tracking the respondents’ progress. If you work with physical or physiological data, the most frequent cause of missing data is a technical problem with the instruments. Testing the instruments in a pilot study will partly prevent these problems.
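Outside SPSS, numeric missing codes have to be recoded explicitly before analysis. The following is a minimal Python sketch of that step, not a prescribed procedure; the codes 999 and -9999, the variable names and the helper `recode_missing` are assumptions made for illustration only.

```python
# Hypothetical sentinel codes used during data entry (assumption for illustration).
MISSING_CODES = {999, -9999}

def recode_missing(rows, missing_codes=MISSING_CODES):
    """Return a copy of rows in which sentinel codes and empty cells become None."""
    cleaned = []
    for row in rows:
        cleaned.append({
            name: (None if value == "" or value in missing_codes else value)
            for name, value in row.items()
        })
    return cleaned

# Toy data: respondent 2 has a coded missing age, respondent 3 an empty income cell.
raw = [
    {"id": 1, "age": 34, "income": 2500},
    {"id": 2, "age": 999, "income": 3100},
    {"id": 3, "age": 51, "income": ""},
]
clean = recode_missing(raw)
print(clean[1]["age"], clean[2]["income"])  # None None
```

If the recoding is skipped, a value such as 999 would enter means and regressions as if it were a real age.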
Data collection:
Closely monitor the completeness of the data when you receive or obtain the data. When you detect missing data during data collection, try to complete your data: look back in the raw data (questionnaires), or ask your respondents to fill out the missing items. Describe in your logbook why data are missing. This helps you to decide whether the data are missing completely at random or not.
Data processing:
Investigate how much data is missing (see 4.4), estimate the need for imputation, and think about the most adequate imputation method (see 4.5 and further).
Data analyses:
If you have missing values in your dataset when starting your analyses, remember that casewise or listwise deletion (the default in SPSS regression and ANOVA) may compromise the reliability and accuracy of your results (see 4.2).
4. How much data is missing?
SPSS can help you to identify the amount of missing data. When you are interested in the percentage of missing values for each variable separately (e.g. an item on a questionnaire), use the Frequencies option in SPSS:
- Select Analyze → Descriptive Statistics → Frequencies;
- Move all variables into the “Variable(s)” window;
- Click OK.
The “Statistics” box tells you the number of missing values for each variable. However, be aware that this only gives you information about the percentage of missing values for each variable separately. It is more important to study the full percentage of missing data, especially when you use more variables in your analysis.
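For readers working outside SPSS, the per-variable check can be sketched in a few lines of Python. This is an illustrative sketch only; the toy dataset, the variable names and the function `percent_missing_per_variable` are assumptions, and missing values are assumed to be stored as `None`.

```python
def percent_missing_per_variable(rows):
    """Map each variable name to the percentage of rows in which it is None."""
    variables = list(rows[0].keys())
    n = len(rows)
    return {
        var: 100.0 * sum(row[var] is None for row in rows) / n
        for var in variables
    }

# Toy dataset (assumption): four respondents, three variables.
data = [
    {"age": 34, "sbp": 120, "income": None},
    {"age": None, "sbp": 135, "income": 3100},
    {"age": 51, "sbp": 128, "income": None},
    {"age": 47, "sbp": 142, "income": 2800},
]
for var, pct in percent_missing_per_variable(data).items():
    print(f"{var}: {pct:.1f}% missing")
```

In this toy example no single variable has more than 50% missing values, yet only one of the four respondents has complete data, which is why the per-variable view alone is not enough.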
When you are interested in the full percentage of missing data use the following option:
- Select Analyze → Multiple Imputation → Analyze patterns;
- Move all variables into the “Variable(s)” window;
- Click OK.
The output tells you the percentage of variables with missing data, the percentage of cases with missing data, and the number of missing values. The final pie chart tells you the full percentage of missing data. Note the 5% borderline. Patterns of missing data are also presented.
- Tip: use the Help button, and click “show me” for more information about the options and output in SPSS.
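The three summary figures from this output (incomplete variables, incomplete cases, missing values) can be approximated by hand. The sketch below is an assumption-laden illustration, not a reproduction of the SPSS procedure; the function name `overall_missing_summary` and the toy data are hypothetical, and missing values are assumed to be stored as `None`.

```python
def overall_missing_summary(rows):
    """Summarise missingness over variables, cases and individual values (in %)."""
    variables = list(rows[0].keys())
    n_cases = len(rows)
    n_values = n_cases * len(variables)
    # A variable is incomplete if at least one case lacks it.
    vars_incomplete = sum(any(row[v] is None for row in rows) for v in variables)
    # A case is incomplete if at least one of its variables is missing.
    cases_incomplete = sum(any(row[v] is None for v in variables) for row in rows)
    values_missing = sum(row[v] is None for row in rows for v in variables)
    return {
        "pct_variables_incomplete": 100.0 * vars_incomplete / len(variables),
        "pct_cases_incomplete": 100.0 * cases_incomplete / n_cases,
        "pct_values_missing": 100.0 * values_missing / n_values,
    }

# Toy dataset (assumption): four respondents, three variables.
data = [
    {"age": 34, "sbp": 120, "income": None},
    {"age": None, "sbp": 135, "income": 3100},
    {"age": 51, "sbp": 128, "income": None},
    {"age": 47, "sbp": 142, "income": 2800},
]
summary = overall_missing_summary(data)
for label, pct in summary.items():
    print(f"{label}: {pct:.1f}%")
```

The percentage of incomplete cases is the figure that matters for listwise deletion, because every incomplete case is dropped from a multivariable analysis.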
When you want to find out more about the patterns of missing data and the relation between missing data on different variables, use the following option:
- Select Analyze → Missing Value Analysis;
- Move all variables of interest into the Quantitative or Categorical Variable(s) window;
- Use the ‘patterns’ button to get information about the relation between missing data on more variables.
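The idea behind a missing data pattern table can be sketched as follows: every case is reduced to a string that marks each variable as observed (O) or missing (M), and identical strings are counted. This is only an illustration under the assumption that missing values are stored as `None`; the function `missing_patterns` and the toy data are hypothetical.

```python
from collections import Counter

def missing_patterns(rows, variables):
    """Count how often each observed (O) / missing (M) pattern occurs."""
    return Counter(
        "".join("M" if row[v] is None else "O" for v in variables)
        for row in rows
    )

# Toy dataset (assumption): four respondents, three variables.
data = [
    {"age": 34, "sbp": 120, "income": None},
    {"age": None, "sbp": 135, "income": 3100},
    {"age": 51, "sbp": 128, "income": None},
    {"age": 47, "sbp": 142, "income": 2800},
]
variables = ["age", "sbp", "income"]
patterns = missing_patterns(data, variables)
for pattern, count in patterns.most_common():
    print(pattern, count)
```

A pattern that occurs often, such as several variables always missing together, can point to a common cause, for example a skipped questionnaire page.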
5. What kind of data is missing?
The next step is to identify the kind of data that is missing. You can derive this from the steps described in 4.
- A single item or several items of a questionnaire;
- A full questionnaire or a single variable (such as blood pressure);
- A measurement wave (in longitudinal / randomized studies).
6. What type of missings do you have?
Missing values are either random or non-random. Random missing values may occur because the subject accidentally did not answer some questions. For example, the subject may be tired and/or not paying attention, and misses the question. Random missing values may also result from data entry mistakes. Non-random missing values may occur because subjects purposefully do not answer some questions. Subjects may be reluctant to answer some questions because of social desirability concerns about the content of the question, such as questions about sensitive topics like income, past crimes, sexual history, or prejudice or bias toward certain groups. Think about your dataset. Could the missing values be non-random?
Table: Typology for missing data developed by Rubin [1] in 1976 (1,2).
Type of missing data | Description |
MCAR: Missing Completely At Random | The data are MCAR when the missing values in your dataset happened by coincidence. The observed values are then just a random sample of what your dataset would have been had it been complete. An example is when respondents accidentally skip questions. |
MAR: Missing At Random (most of the time) | The data are MAR when the probability of missing data can be explained by other variables in the dataset. An example is when older respondents have more missing values than younger respondents. However, within the groups of older and younger respondents the data are still MCAR, so age explains the missingness. Another example is when respondents with low scores on the first wave are not invited for a second wave. |
MNAR: Missing Not At Random | The data are MNAR when the probability that a value for a certain variable is missing is related to the scores on that variable itself. An example is that respondents with a low income intentionally skip the income question because they feel it violates their privacy. MNAR is a serious problem, which cannot be solved with a technique such as multiple imputation. |
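A small simulation can make the three mechanisms in the table concrete. The sketch below is purely illustrative: the income model, the missingness probabilities, the age cut-off and the function name `simulate` are all assumptions chosen for the example, not values from any study.

```python
import random
import statistics

def simulate(mechanism, n=5000, seed=1):
    """Return (mean of all incomes, mean of observed incomes) under one mechanism."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        age = rng.uniform(20, 80)
        income = 1000 + 30 * age + rng.gauss(0, 400)  # assumed: income rises with age
        if mechanism == "MCAR":
            p_missing = 0.3                            # unrelated to anything
        elif mechanism == "MAR":
            p_missing = 0.5 if age > 50 else 0.1       # depends on observed age only
        elif mechanism == "MNAR":
            p_missing = 0.5 if income < 2500 else 0.1  # depends on income itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        full.append(income)
        if rng.random() >= p_missing:
            observed.append(income)
    return statistics.mean(full), statistics.mean(observed)

for mechanism in ("MCAR", "MAR", "MNAR"):
    full_mean, observed_mean = simulate(mechanism)
    print(f"{mechanism}: full mean {full_mean:.0f}, observed mean {observed_mean:.0f}")
```

Under MCAR the observed mean stays close to the full mean. Under MAR and MNAR it shifts, but only under MAR can the shift be corrected with information that is actually in the dataset (here: age).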
7. How do you know what kind of missings you have?
There are three kinds of methods (see: https://bookdown.org/mwheymans/bookmi ).
- First, you can inspect the data yourself. Are the missing values equally distributed over the data? Are low and/or high scores missing? If the missing values are not equally spread, this might be an indication that the data are MNAR. With this method you must know a priori what the distribution of the variable normally looks like, i.e. is it normal or skewed? You need this information before you can judge which part of the data suffers from missing values. This method only applies if your dataset is large.
- Second, you can test whether the respondents with missing data differ from the respondents without missing data on important variables (in SPSS: Analyze -> Missing Value Analysis -> select important variables -> Descriptives -> t-tests formed by indicator variables). A significant difference is an indication that the data are not MCAR. Be aware that if your sample size is large (>500), this t-test may be significant even when the difference between the groups is very small. So, just looking at the means and their difference might be good enough. If this mean difference is very small, this might be an indication of MCAR.
- Third, it is possible to do a formal test for MCAR. This is called Little's test (in SPSS: Analyze -> Missing Value Analysis, EM button).
It is important to note that you are not able to test whether your missing data are MAR or MNAR. The above-mentioned procedures (1 and 2) will only give you an indication of MCAR versus MAR/MNAR. Pay attention to the possibility of MNAR, because all analyses have serious problems when the missing data are MNAR.
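The group comparison from the second method can be sketched as follows: split a fully observed variable (here age) by whether another variable (here income) is missing, and compare the two group means. The code is a minimal illustration that assumes missing values are stored as `None` and uses an unequal-variance (Welch) t statistic; the function names and the toy data are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Unequal-variance t statistic for the difference in means of two groups."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

def compare_by_missingness(rows, target, other):
    """Compare `other` between cases with and without a missing `target`."""
    when_missing = [r[other] for r in rows
                    if r[target] is None and r[other] is not None]
    when_observed = [r[other] for r in rows
                     if r[target] is not None and r[other] is not None]
    return {
        "mean_when_missing": statistics.mean(when_missing),
        "mean_when_observed": statistics.mean(when_observed),
        "t": welch_t(when_missing, when_observed),
    }

# Toy data (assumption): income is missing mainly for older respondents.
rows = (
    [{"age": a, "income": None} for a in (62, 70, 58, 66)]
    + [{"age": a, "income": 2500} for a in (31, 45, 38, 52, 41)]
)
result = compare_by_missingness(rows, target="income", other="age")
print(result)
```

A clear age difference between the two groups, as in this toy example, suggests that the income data are not MCAR. It does not tell you whether they are MAR or MNAR.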
8. How to handle missing data?
Missing data is random:
For MCAR and MAR, many missing data methods have been developed in the last two decades (3). Although MCAR seems to be the least problematic mechanism, deleting cases can still reduce the power of finding an effect or association. It is argued that the MAR mechanism is most frequently seen in practice. An argument for this is that in most research multifactorial or multivariable problems are studied, so when data on variables are missing it is mostly related to other variables in the dataset.
Missing data is not random:
For MNAR, imputation is not sufficient, because the missing data differ systematically from the available data, i.e. your complete cases have become a selective group of persons. If you think your data are MNAR, it is wise to consult a statistician.
For MCAR and MAR, there are roughly two kinds of techniques for imputation:
- Single (stochastic) imputation is possible in SPSS and is an easy way to handle missing values when just a few cases are missing (less than 5%) and you think your missing values are MCAR or MAR. However, after single imputation the cases are more similar, which may result in an underestimation of the standard errors, i.e. narrower confidence intervals. This increases the chance of a type 1 error (the null hypothesis of no effect is rejected, while there is truly no effect). Therefore, this method is less adequate when you have >5% missing data. This is also the case when item scores are missing in questionnaires (7, 4).
- Multiple imputation is more complex, but is implemented in SPSS (version 17.0 and later), R (package mice), Stata and SAS. Multiple imputation takes into account the uncertainty of missing values (present in all values of variables) and is therefore preferred over single imputation methods. When the amount of missing data is high (exceeds 5% in several variables and different persons), multiple imputation is more adequate. Multiple imputation works for total scores in questionnaires as well as for item scores in questionnaires (4,5,6).
Imputation techniques
Single (stochastic) imputation
Single stochastic regression imputation can be performed by applying multiple imputation once, i.e. by generating one imputed dataset. It may be an improvement over single regression imputation, because imputation uncertainty is accounted for by adding noise (error) to the imputed values. However, it is still not the best solution and will underestimate the standard errors, like all single imputation procedures such as mean imputation. These procedures are therefore not recommended.
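To show what "adding noise to the imputed values" means, the sketch below imputes a variable y from one predictor x with a fitted regression line plus a random draw from the residual distribution. It is a simplified illustration under stated assumptions: one predictor, at least three complete pairs, missing values stored as `None`, and fixed regression coefficients (proper multiple imputation would also draw the coefficients). The function name is hypothetical.

```python
import math
import random
import statistics

def stochastic_regression_impute(x, y, seed=0):
    """Fill None entries of y with a regression prediction from x plus random noise."""
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.mean(xi for xi, _ in pairs)
    mean_y = statistics.mean(yi for _, yi in pairs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs) / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation of the fitted line (n - 2 degrees of freedom).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in pairs)
    resid_sd = math.sqrt(sse / (len(pairs) - 2))
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0, resid_sd)
        for xi, yi in zip(x, y)
    ]

# Toy data (assumption): systolic blood pressure is missing for two respondents.
age = [30, 40, 50, 60, 70, 45, 65]
sbp = [118, 124, 131, 137, 146, None, None]
print(stochastic_regression_impute(age, sbp))
```

Without the noise term every imputed value would lie exactly on the regression line, which makes the data look more regular than they are. With the noise term the spread is more realistic, but a single imputed dataset still does not reflect the uncertainty about the imputed values themselves.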
Multiple imputation (MI)
In MI each missing value is imputed several times (5), so that several imputed datasets are created. The different imputations are based on random draws from different estimates of the underlying distribution in the source population. This creates extra variation in the data that reflects the uncertainty about the missing values. Therefore the standard error increases and becomes a better estimate of the correct standard error. The number of imputations depends on the amount of missing data (7). A working rule is: when 10% of the cases have missing data in a multivariable model, generate 10 imputed datasets; when 15% have missing data, generate 15 imputed datasets, etc. After imputation, the statistical analysis has to be repeated in each imputed dataset. Finally, the results have to be pooled into a summary measure. Most statistical packages can do this automatically for frequently used methods like t-tests or regression analyses. For more specialized pooling procedures in R see: https://mwheymans.github.io/psfmi/ and https://alexanderrobitzsch.github.io/miceadds/. Multiple imputation is possible in SPSS version 17 and later (Analyze → Multiple Imputation → Impute Missing Data Values), in R (using the mice package), in Stata (mi impute) and in SAS. For more information see https://bookdown.org/mwheymans/bookmi/.
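The pooling step is commonly done with Rubin's rules; that choice is assumed here, since the text above does not name a specific method. The sketch below pools one regression coefficient over the imputed datasets and also encodes the working rule for the number of imputations. Both function names and the example numbers are hypothetical.

```python
import math
import statistics

def pool_rubin(estimates, standard_errors):
    """Pool one estimate over m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in standard_errors])  # mean sampling variance
    between = statistics.variance(estimates)                       # variance between datasets
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

def imputations_rule_of_thumb(percent_incomplete_cases):
    """Working rule: one imputed dataset per percent of cases with missing data."""
    return math.ceil(percent_incomplete_cases)

# Made-up results for the same coefficient in three imputed datasets.
estimate, se = pool_rubin([1.0, 1.2, 0.8], [0.2, 0.2, 0.2])
print(f"pooled estimate {estimate:.2f}, pooled SE {se:.3f}")
print(imputations_rule_of_thumb(15))
```

The pooled standard error is larger than the standard error in any single imputed dataset, because the between-dataset variance is added. This is the extra uncertainty that single imputation ignores.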
Multiple imputation in Multilevel data
There are also special imputation methods developed for missing data in multilevel datasets. These methods are only available for R software. See reference 5 or consult an imputation expert for more information.
Sensitivity analysis
After imputation, sensitivity analysis is needed to determine how your substantive results depend on how you handled the missing data.
Follow these steps:
- Do a complete case analysis (default option in SPSS; cases with missings are not included);
- Do the same analysis on the imputed data;
- Compare substantive conclusions, decide how to report.
To read more about sensitivity analysis, go to Post-hoc and sensitivity analyses in this Handbook.
Imputation of descriptive statistics
A different approach may be used for descriptive studies. If you want to show the (observed) study data (means and standard deviations), for example to compare them with other countries/settings, without directly linking them to a conclusion, imputation is not immediately needed. However, to use statistics (t-tests, regressions, etc.) complete data analysis would certainly be needed. Be clear about imputation and point out why you choose to present imputed/non-imputed data. Also, take your missing data evaluation and solution seriously. The missing data and imputation analysis can take as long as the normal data analysis and for complex imputation models even longer!
9. When is imputation of missing data not necessary?
When missing data are MCAR or MAR and you use maximum likelihood estimation techniques in analyses such as Structural Equation Modelling (SEM) or Linear Mixed Models (LMM), imputation is not necessary when only outcome data are missing in longitudinal and multilevel study designs (8,9,10). Also, when you only have missing outcome data in an observational study or randomized controlled trial and your outcome is also assessed at baseline, imputation is not needed to get reliable results (14). These techniques use all available data, ignore the missing values, and still give correct results. In such situations you do not have to use an extra imputation technique to handle your missing values. This is different for missing data in covariates in longitudinal and multilevel study designs. In these situations multiple imputation is indicated; however, more complex imputation models have to be used (11). This is also the case for missing data that are MNAR. Specific models are available for this, such as selection or pattern-mixture models.
10. Summary
- Make every effort to avoid missing data, or failing that, to understand how much and why data is missing.
- Understand missing data mechanisms (MCAR, MAR, MNAR) and their implications.
- Avoid default methods (listwise deletion, pairwise deletion).
- Avoid default fix-ups (mean imputation, etc.) where possible.
- Use multiple imputation to take proper account of missings.
- Do a sensitivity analysis.
APH expert on Missing Data:
- Dr. Martijn Heymans: mw.heymans@amsterdamumc.nl
Course:
On the Epidm website you can find information about a course on missing data, see: https://www.epidm.nl/en/courses/missing-data-consequences-and-solutions/