2.1 Introduction
The IMPUTE module is a general-purpose multivariate imputation procedure that can handle relatively complex data structures when the data are missing at random (Rubin, 1976). Survey data sets often consist of large numbers of variables that have a variety of distributional forms. Typically, such data sets have hundreds of variables, some continuous, others counts, many dichotomous or polytomous, and semi-continuous or limited dependent variables. IMPUTE can handle such complex data structures.
IMPUTE produces imputed values for each individual in the data set conditional on all the values observed for that individual using the sequential regression approach (also called Chained Equations or Flexible Conditional Specifications). The basic strategy is to create imputations through a sequence of multiple regressions, varying the type of regression model by the type of variable being imputed. Covariates include all other variables observed or imputed for that individual. The imputations are defined as draws from the posterior predictive distribution specified by the regression model with a flat or non-informative prior distribution for the parameters in the regression model. The sequence of imputing missing values can be continued in a cyclical manner, each time overwriting previously drawn values, building interdependence among imputed values and exploiting the correlational structure among covariates. To generate multiple imputations, the same procedure can be applied with different random starting seeds or by taking every pth imputed set of values in the cycles mentioned above. For details see Raghunathan et al. (2001) and Raghunathan (2015).
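The cyclical strategy described above can be illustrated with a small sketch. This is not the IMPUTE implementation: it assumes only two continuous variables, uses a one-predictor least-squares fit, and adds residual noise without perturbing the coefficients. The function names `fit_line` and `sequential_impute` and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes xs is not constant
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / max(n - 2, 1)) ** 0.5

def sequential_impute(data, cycles=10, seed=1):
    """data: dict of two equal-length lists; None marks a missing value."""
    rng = random.Random(seed)
    names = list(data)
    missing = {v: [i for i, x in enumerate(data[v]) if x is None] for v in names}
    filled = {}
    for v in names:  # starting values: the observed mean
        obs = [x for x in data[v] if x is not None]
        mean = sum(obs) / len(obs)
        filled[v] = [mean if x is None else x for x in data[v]]
    for _ in range(cycles):
        for target in names:
            other = names[1] if target == names[0] else names[0]
            # Fit on rows where the target is observed; the predictor may be imputed.
            rows = [i for i, x in enumerate(data[target]) if x is not None]
            a, b, sd = fit_line([filled[other][i] for i in rows],
                                [filled[target][i] for i in rows])
            for i in missing[target]:  # overwrite the previously drawn values
                filled[target][i] = a + b * filled[other][i] + rng.gauss(0.0, sd)
    return filled

toy = {"age": [25, 32, None, 47, 51, 60], "income": [30, None, 42, 55, None, 70]}
completed = sequential_impute(toy, cycles=10, seed=1)
```

Observed values are never changed; only the positions marked `None` are redrawn in each cycle.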
IMPUTE assumes the variables in the data set are one of the following five types: continuous; binary; categorical (polytomous with more than two categories); counts; and mixed (a continuous variable with a non-zero probability mass at zero). The types of regression models used are linear, logistic, Poisson, generalized logit or mixed logistic/linear, depending on the type of variable being imputed.
IMPUTE can also accommodate two common features of survey data that add to the complexity of the modeling process: (1) the restriction of imputations to sub-populations; and (2) the bounding of imputed values. First, certain restrictions are imperative, requiring the sub-setting of sample individuals to satisfy particular criteria while fitting the regression models. For example, the variable ‘Number of Years Since Quit Smoking’ is defined only for former smokers; hence, the imputation process for this variable should be restricted only to former smokers. Restrictions also arise due to skip patterns in the questionnaire. For example, certain questions about income from a second job are asked only when the respondent indicates having a second job. The imputation of such variables has to be handled in a hierarchical manner.
Second, there are certain logical or consistency bounds for missing values that must be incorporated in the imputation process. Such interrelationships among the variables make the model specification difficult. For instance, ‘Years of Smoking’ should not only be restricted to current or past smokers but the imputed values might be required to be less than a specified number of years, based on other respondent characteristics, such as evidence of smoking as a teenager. In such a case, the upper bound for the imputed ‘Years of Smoking’ might be the respondent’s current age minus 12. This assumes that the respondent may have started smoking at 12 years of age. For a former smoker, ‘Years of Smoking’ would also have to take into account the years since the respondent stopped smoking. Another example of bounds is discussed in Heeringa, Little and Raghunathan (1997). They address imputation of bracketed response questions in which a respondent is unable or unwilling to provide an exact response (e.g., income and assets), but does define the bounds within which the imputed values must lie. The bounds involve drawing values from a truncated predictive distribution.
Any imputation software package is a tool that needs to be used judiciously. To obtain a valid imputation, each regression model needs to be carefully developed and specified by the user. Developing such good prediction models requires exploratory data analysis, model building and model checking through residual diagnostics. Thus, if there are p variables in the data set with missing values, then p regression models have to be developed appropriately for this software package to produce statistically valid results. There are many good books on regression that discuss model building strategies (for example, Weisberg (2013), Atkinson (1985), Vittinghoff, Glidden, Shiboski and McCulloch (2005) and Gelman and Hill (2006)). Raghunathan (2015) discusses model building and model checking strategies in the context of missing data.
2.2 Required IMPUTE Statements
2.2.1 Input and Output Data Sets
DATAIN filename;
This required statement identifies the location and name of the input data set. For example, in a SAS environment, the filename can be expressed as “libname.sasdata”. In other environments, read the data set and include the name of the data set in the filename.
DATAIN Mylib1.Mydata;
indicates that the SAS data file Mydata is located in the library Mylib1. Mylib1 is the name assigned to a directory with the SAS Libname statement. (See later sections for examples.)
DATAOUT outfile [ALL];
This statement identifies the location and name of the output data set containing the imputed data. The ALL keyword is optional. If it is specified and more than one imputation is generated (see keyword MULTIPLES), then the output data set will be a concatenation of the multiple imputed data sets. The system variable ‘_MULT_’, automatically added to the output file, can be used to distinguish each imputation. For example,
DATAOUT Mylib2.Impdata ALL;
will store the SAS file Impdata in the library Mylib2, the name assigned to a directory with the appropriate SAS Libname statement.
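When the ALL keyword is used, the stacked output has to be separated by ‘_MULT_’ before each imputed data set is analyzed. A minimal sketch, assuming the output file has already been read into a list of records (one dictionary per row); the helper name `split_by_mult` and the example rows are hypothetical:

```python
from collections import defaultdict

def split_by_mult(rows, mult_var="_MULT_"):
    """Group stacked rows into one list per imputation number."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[mult_var]].append(row)
    return dict(sorted(groups.items()))

# Two imputations of a two-person file, stacked as in a DATAOUT ... ALL output.
stacked = [
    {"_MULT_": 1, "id": 1, "income": 40.0},
    {"_MULT_": 1, "id": 2, "income": 52.5},
    {"_MULT_": 2, "id": 1, "income": 40.0},
    {"_MULT_": 2, "id": 2, "income": 47.1},
]
by_imputation = split_by_mult(stacked)
```

Each value of `by_imputation` is then one completed data set.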
2.2.2 Declaring Variable Types
IMPUTE requires that the SAS data set variables be defined by type. Six types of variables are recognized by the IMPUTE module: continuous, categorical (binary is included as categorical), count, mixed, transfer and drop. If no variable types are specified, all variables will be assumed to be continuous. Variable types should be declared before any BOUNDS, INTERACT, or RESTRICT statements (see below).
CONTINUOUS variable list;
Variables declared as CONTINUOUS may take on any value on a continuum. Income is an example of a continuous variable. A normal linear regression model is used to impute the missing values in these variables. You may want to transform the variable to achieve normality and then impute on the transformed scale. After imputation you may transform the variable back to its original scale.
CATEGORICAL variable list;
CATEGORICAL variables take values that represent discrete categories. Gender is a categorical variable. A logistic or generalized logistic model is used to impute missing categorical values.
MIXED variable list;
Variables declared as MIXED are both categorical and continuous. In a mixed variable a value of zero is treated as a discrete category, while values greater than zero are considered continuous. Alcohol consumption is an example of a mixed variable. A two-stage model is used to impute the missing values. First, a logistic regression model is used to impute zero vs. non-zero status. Conditional on imputing a non-zero status, a normal linear regression model is used to impute non-zero values.
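The two-stage draw for a mixed variable can be sketched as follows. This is an illustration under simplifying assumptions, not the IMPUTE code: the logit of the non-zero probability and the mean and standard deviation of the non-zero part are taken as given rather than estimated, and the sketch does not force the normal draw to be positive. The function name `draw_mixed` is hypothetical.

```python
import math
import random

def draw_mixed(logit_nonzero, mean_nonzero, sd_nonzero, rng):
    """Stage 1: zero vs. non-zero status. Stage 2: a value for the non-zero case."""
    p_nonzero = 1.0 / (1.0 + math.exp(-logit_nonzero))
    if rng.random() >= p_nonzero:
        return 0.0  # imputed as a zero
    return rng.gauss(mean_nonzero, sd_nonzero)  # imputed non-zero value

rng = random.Random(11)
# Hypothetical predicted values for one respondent: logit 0.4, mean 6 drinks, SD 2.
imputed_drinks = [draw_mixed(0.4, 6.0, 2.0, rng) for _ in range(5)]
```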
COUNT variable list;
COUNT variables have non-negative integer values. A Poisson regression model is used to impute the missing values. The number of annual doctor visits is an example of a COUNT variable.
Sometimes a normal linear regression model is not appropriate because, for example, the distribution of the residuals appears non-normal based on the residual diagnostics. For such variables there are two options (see He and Raghunathan (2006), Bondarenko and Raghunathan (2010) and Raghunathan, Berglund and Solenberger (2017)): ABB (Approximate Bayesian Bootstrap) and GH (Tukey’s gh-distribution). These can be specified as
ABB varlist;
or
GH varlist;
where varlist are the continuous or mixed variables declared earlier.
DROP variable list;
Variables listed after the DROP keyword will be excluded from the imputation procedure and will not appear in the imputed data set.
TRANSFER variable list;
Variables listed after the TRANSFER keyword are carried over to the imputed data set, but are neither imputed nor used as predictors in the imputation model. Transfer variables, however, can be used in the RESTRICT and BOUNDS statements (see below). An ID variable is a typical example of a transfer variable, as is any variable that should not be used as a predictor for any of the variables being imputed.
DEFAULT variable type;
variable type can be Continuous, Categorical, Count, Mixed, Transfer or Drop. This keyword declares that by default all the variables in the data set should be treated as the variable type. The most efficient use of the DEFAULT statement is to declare the most numerous variable type in your data set as the default type, eliminating the need to type a long list of variables. The DEFAULT statement must be given before declaring other variable types.
RUN;
This should be the last statement in your setup file.
2.3 Restrictions and Bounds
RESTRICT variable(logical expression);
This command is used to restrict the imputation of a variable to those observations that satisfy the logical expression. For instance, suppose that the variable yrssmoke indicates the number of years an individual smoked, and the variable smoke takes the value 1 for a current smoker, 2 for a former smoker or 3 for someone who never smoked. Then the declaration,
RESTRICT yrssmoke(smoke=1,2);
will impute yrssmoke values only for current and former smokers. It will automatically set yrssmoke equal to 0 for never smokers.
Restrictions on more than one variable may be combined as follows:
RESTRICT yrssmoke(smoke=1,2) births(female=1) income(employed=1);
When the restriction is not met, the value of the restricted variable will be set to zero for continuous and count variables. For a categorical variable, a separate category will be created with a response code one higher than the highest observed code for the restricted categorical variable. For example, the statement,
RESTRICT smoke(age>= 13);
where smoke has 3 categories as described above, will create a category 4 for those with age <= 12.
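The rule for observations that fail a restriction can be written out as a short sketch. It only restates the rule given above (zero for continuous and count variables, highest observed code plus one for categorical variables); the function name `restricted_fill_value` and the example records are hypothetical, and other variable types are deliberately left unhandled because the rule for them is not stated here.

```python
def restricted_fill_value(var_type, observed_codes=()):
    """Value assigned when the RESTRICT condition is not met."""
    if var_type in ("continuous", "count"):
        return 0
    if var_type == "categorical":
        return max(observed_codes) + 1  # one higher than the highest observed code
    raise ValueError("rule not described for variable type: " + var_type)

# Mirrors RESTRICT smoke(age>= 13); with smoke coded 1, 2, 3. None marks missing.
records = [{"age": 10, "smoke": None}, {"age": 35, "smoke": 2}, {"age": 50, "smoke": None}]
fill = restricted_fill_value("categorical", [1, 2, 3])
for rec in records:
    if not rec["age"] >= 13:
        rec["smoke"] = fill  # restriction not met: assign the new category
```

After this step the 10-year-old has smoke equal to 4, while the 50-year-old still has a missing value that remains to be imputed.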
BOUNDS variable (logical expression);
This keyword is useful for restricting the range of values to be imputed for a continuous variable.
For example,
BOUNDS yrssmoke (> 0,<= Age-12);
will ensure that the imputed values for yrssmoke are between 0 and the individual’s Age minus 12. Smoking is assumed not to begin before the age of 12.
Again, as in the RESTRICT statement more than one variable can be included in the BOUNDS statement.
For example,
BOUNDS yrssmoke (>0,<= Age-12) numcig(>0);
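Bounded imputation amounts to drawing from a truncated predictive distribution. One common way to do this for a normal distribution is the inverse-CDF method, sketched below. This is an illustration, not the IMPUTE algorithm: the predictive mean and standard deviation are assumed to be given, and `draw_truncated_normal` is a hypothetical helper.

```python
import random
from statistics import NormalDist

def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Draw from Normal(mean, sd) restricted to the interval [lower, upper]."""
    dist = NormalDist(mean, sd)
    p_lo, p_hi = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(p_lo, p_hi)              # uniform over the allowed probability range
    u = min(max(u, 1e-12), 1.0 - 1e-12)      # inv_cdf needs 0 < u < 1
    return min(max(dist.inv_cdf(u), lower), upper)  # clamp against rounding error

rng = random.Random(7)
age = 40
# Mirrors BOUNDS yrssmoke (> 0,<= Age-12) for a hypothetical 40-year-old respondent.
yrssmoke = draw_truncated_normal(mean=35.0, sd=10.0, lower=0.0, upper=age - 12, rng=rng)
```

Even though the untruncated mean (35) lies above the upper bound (28), every draw falls inside the bounds.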
2.3.1 Model-Building Statements
The fundamental idea behind the sequential regression approach is that the imputation for every variable should be conditional on all other variables as predictors (unless listed under DROP or TRANSFER statements). There are practical circumstances where this may not be possible. The following two commands are useful to select the predictors based on their predictive power.
MAXPRED number; OR MAXPRED varlist (number) ;
Specifies the maximum number of predictor variables to be included in the regression model. A step-wise regression procedure is used to select the best predictors subject to this maximum. Setting MAXPRED to a small number will greatly reduce the computational time, especially for very large data sets, but the imputations will not be fully conditional.
For example,
MAXPRED 5;
will include the five best predictor variables for every regression model, the five making the largest contribution to the r-square (for linear regression models) and Nagelkerke coefficient of determination for other models.
You can also restrict the number of predictors for selected variables.
MAXPRED Income (7) Educ (3);
will limit the number of predictors of Income to the seven largest contributors to the r-square, while the number of predictors of Educ is limited to the three largest contributors. For other variables, all variables will be used as predictors.
The second option for reducing the number of predictors is to set the minimum additional increase in r-square needed for a variable to be included as a predictor:
MINRSQD decimal;
Specifies the minimum marginal r-squared (or generalized r-squared), given by decimal, that a variable must contribute to be included as a predictor. This can reduce computation time. A small decimal number like 0.005 would build very large regression models, whereas 0.25 will include a smaller number of predictors in the regression models. If neither MAXPRED nor MINRSQD is set, then no variable selection will be performed.
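The effect of the two selection rules can be illustrated with a simple forward-selection sketch for a linear model. It is a simplified stand-in, not necessarily the step-wise procedure IMPUTE uses: predictors are added one at a time by their gain in r-square, stopping at `max_pred` predictors or when the best gain falls below `min_rsq`. The function name and its arguments are hypothetical, and the outcome is assumed not to be constant.

```python
def forward_select(y, predictors, max_pred=None, min_rsq=0.0):
    """predictors: dict of name -> list of values. Returns names in order of entry."""
    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    y_res = center(y)
    total_ss = dot(y_res, y_res)
    pool = {name: center(v) for name, v in predictors.items()}
    chosen = []
    while pool and (max_pred is None or len(chosen) < max_pred):
        # Gain in r-square from adding each remaining (residualized) predictor.
        gains = {}
        for name, x in pool.items():
            xx = dot(x, x)
            gains[name] = dot(x, y_res) ** 2 / (xx * total_ss) if xx > 1e-12 else 0.0
        best = max(gains, key=gains.get)
        if gains[best] <= 0.0 or gains[best] < min_rsq:
            break
        x_best = pool.pop(best)
        chosen.append(best)
        bb = dot(x_best, x_best)
        # Remove what the chosen predictor explains from y and from the other predictors.
        coef = dot(x_best, y_res) / bb
        y_res = [a - coef * b for a, b in zip(y_res, x_best)]
        for name, x in pool.items():
            c = dot(x_best, x) / bb
            pool[name] = [a - c * b for a, b in zip(x, x_best)]
    return chosen
```

For instance, `forward_select(y, predictors, max_pred=5)` mimics the intent of MAXPRED 5, and `forward_select(y, predictors, min_rsq=0.005)` mimics a MINRSQD threshold of 0.005.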
MAXLOGI number;
Specifies the maximum number of iterations to be performed when fitting a logistic or multilogit regression model. The default is 50. This is useful if the Newton-Raphson algorithm used in computing the maximum likelihood estimates does not converge after 50 iterations. This limit applies to the fitting of the logistic, polytomous and Poisson regression models. You can check whether you have such a non-convergence problem by inspecting the log file (e.g., mysetup.log).
MINCODI decimal;
Specifies the minimum proportional change in any regression coefficient to continue the logistic regression iteration process. This applies to the convergence criterion for the logistic, polytomous and Poisson regression models.
Sometimes one may want to include interaction terms as predictors in the model. These are derived variables. There are two possible options. The first option is to construct the product terms as new variables in the data set and impute them just like any other variable. The product term will be set to missing if either variable is missing. The second option is to impute the component variables separately but use their product as a predictor in the other regression models. The following statement implements the second approach.
INTERACT variable1*variable2;
This keyword enables the user to specify interaction terms to be included in the imputation regression model.
For example, a specification
INTERACT Income*Income, Age*Race;
will result in including a square term for Income and an interaction term of Age and Race in the imputation model for all the variables in the data set (except for the variables in the particular interaction term).
OFFSETS count variables (offset variable);
This statement is used to specify an offset variable when fitting a Poisson regression model. For example,
OFFSETS Injuries(Years);
will fit a model predicting the number of injuries occurring per year.
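In a Poisson regression with an offset, the usual convention is that the logarithm of the exposure enters the linear predictor with a coefficient fixed at one, so the model describes a rate per unit of exposure. The sketch below shows that arithmetic only. Whether IMPUTE takes the logarithm internally or expects an already-logged offset variable is not stated here, so treat this as an assumption; `expected_count` is a hypothetical helper.

```python
import math

def expected_count(linear_predictor, exposure):
    """Expected count = exposure * exp(linear predictor), via a log offset."""
    return math.exp(linear_predictor + math.log(exposure))

# A rate of 0.5 injuries per year observed over 4 years gives 2 expected injuries.
mu = expected_count(math.log(0.5), 4.0)
```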
Finally, the command,
DIAGNOSE variables/[all];
produces imputation diagnostic plots for all the listed variables. These plots are used to evaluate the imputation process. For more details about these plots see Bondarenko and Raghunathan (2016). By default, a recommended set of plots and numerical summaries is produced. The optional keyword ‘all’ will produce all the plots generated as a part of the program. As with the ‘all’ feature in the PRINT command described in Section 2.3.2, the output will be voluminous.
2.3.2 Other Commands
ITERATIONS number;
Specifies the number of cycles that the imputation program should iterate for each variable and imputation. You can specify any number greater than or equal to 2. Current investigations show that about 10 cycles are sufficient for most imputations. You may want to experiment with several values and check the differences in the resulting analysis.
MULTIPLES number;
Indicates the number of imputations to be performed. By default only a single imputation is generated. Multiples and iterations determine p, the total number of cycles for regression model fitting for each variable. If 5 multiples and 10 iterations were specified then a total of 50 cycles will be performed. After every 10th cycle an imputed data set will be created.
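The relationship between MULTIPLES, ITERATIONS and the total number of cycles is simple arithmetic, written out below. The function name `imputation_schedule` is hypothetical.

```python
def imputation_schedule(multiples, iterations):
    """Return the total number of cycles and the cycles after which a data set is saved."""
    total_cycles = multiples * iterations
    save_after = [iterations * m for m in range(1, multiples + 1)]
    return total_cycles, save_after

total, save_points = imputation_schedule(multiples=5, iterations=10)
# total is 50 and save_points is [10, 20, 30, 40, 50]
```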
BY varlist;
This command can be used to perform imputations separately for the distinct combination of values of the variables in the varlist. For example, if the variable race is coded as White/non-White and the variable gender is coded as Man/Woman then BY race gender; will create 4 subgroups and separately impute missing values in all other variables in each subgroup. No missing values in variables in the varlist are allowed.
PERTURB keyword;
The keyword PERTURB followed by a keyword (COEF/SIR) allows the user to control perturbations of imputed values. By default, the IMPUTE module will perturb model coefficients using a multivariate normal approximation of the posterior distribution of the parameters in the regression model and the predicted values using the appropriate regression model conditional on the perturbed coefficients. This is equivalent to using the COEF instruction. SIR uses the Sampling-Importance-Resampling algorithm to generate coefficients from the actual posterior distribution of parameters in the logistic, polytomous and Poisson regression models (see Rubin 1987a, Raghunathan and Rubin 1988, Raghunathan 1994, Gelman et al. 1995). This is appropriate in situations where the normal approximation to the posterior distribution is not appropriate. One example of this situation is a logistic regression with a low prevalence of the outcome variable (say, less than 1% or 2%).
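The idea behind Sampling-Importance-Resampling can be sketched for the simplest case, an intercept-only logistic model with a flat prior. This is an illustration under stated assumptions, not the IMPUTE implementation: candidates are drawn from the normal approximation, weighted by the ratio of the actual posterior to that approximation, and then resampled with replacement (resampling without replacement is also common). The function name and its arguments are hypothetical.

```python
import math
import random

def sir_logit_intercept(y, n, draws=2000, keep=200, seed=1):
    """SIR draws of the intercept given y events in n trials (requires 0 < y < n)."""
    rng = random.Random(seed)
    p_hat = y / n
    center = math.log(p_hat / (1.0 - p_hat))                # maximum likelihood estimate
    scale = 1.0 / math.sqrt(n * p_hat * (1.0 - p_hat))      # asymptotic standard error

    def log_post(b):      # log-likelihood; a flat prior adds only a constant
        return y * b - n * math.log1p(math.exp(b))

    def log_proposal(b):  # normal approximation, up to a constant
        return -0.5 * ((b - center) / scale) ** 2

    candidates = [rng.gauss(center, scale) for _ in range(draws)]
    log_w = [log_post(b) - log_proposal(b) for b in candidates]
    top = max(log_w)                                        # stabilize the exponentials
    weights = [math.exp(lw - top) for lw in log_w]
    return rng.choices(candidates, weights=weights, k=keep)

# Low-prevalence example: 2 events among 200 respondents (1%).
posterior_draws = sir_logit_intercept(y=2, n=200)
```

With a low prevalence the posterior is skewed, so the importance weights shift the retained draws away from the symmetric normal approximation.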
One should be able to reproduce the imputed data sets at a later time. The SEED option is used to generate the same random number sequence and, hence, regenerate the same set of imputed values.
SEED number;
Specifies a seed for the random draws from the posterior predictive distribution where number should be greater than zero. A zero seed will result in no perturbations in the regression coefficients or in the predicted values. If the SEED keyword is missing from the setup file then the seed will be determined by your computer’s internal clock. However, you may not be able to recreate the imputed data set at a later date. For replication of results at a later date, this option must be used and the seed number should be archived.
NOBS number;
Specifies the number of observations to be used in the analysis. By default all observations in the data set will be used. You might use NOBS to subset a large data set while testing your setup file.
PRINT instruction;
Indicates the printout desired. The options are STANDARD, DETAILS, COEF, and ALL. For the IMPUTE procedure, the STANDARD and DETAILS keywords instruct IVEware to print the number and distribution of observed values, imputed values, and combined observed and imputed values for each variable. If the keyword COEF is present, then IMPUTE will also print the unperturbed and perturbed coefficients for each iteration of each multiple imputation. When the ALL keyword is used, in addition to the above, the coefficient covariance matrix for each iteration of each multiple imputation is also printed. IMPUTE also prints a list of the variables used in the imputation model with columns indicating the number of observed cases and the number of imputed cases for each of the variables.
The output from IMPUTE has a column labeled ‘double counted,’ which is useful for diagnostic purposes. This entry should be zero. A non-zero entry indicates that the actual observations in the data set do not satisfy the restriction specified in the RESTRICT statement. This has caused the program to count such an observation twice (once as satisfying the restriction and once more as not satisfying the restriction). The IMPUTE command, therefore, changes the observed value of a restricted variable according to the restriction rule (zero for continuous variables, one higher than the highest observed code for categorical variables; see RESTRICT above) before proceeding with the imputation. In such situations, the data should be checked for consistency with respect to the specified restriction. For example, suppose the variable YRSSMK, indicating the number of years the respondent has smoked, is observed (say, 10); logically the respondent should be classified as a smoker. If, however, the value of the variable SMOKE, indicating whether or not a respondent smokes, is missing in the data set, this creates an inconsistency. The IMPUTE program changes SMOKE to smoker before proceeding with the imputation but then alerts the user to the problem in the data set. The user should correct the data (either 10 for YRSSMK is an error or setting SMOKE to missing is an error) and re-run the imputation. Since the restrictions can be complex, it is possible that for some subjects there could be no resolution. The user should, therefore, be made aware of the problems.
TITLE text \n text;
Indicates the title(s) to be printed at the top of each page of the printout. A \n indicates that the text that follows should be printed on the next line. For example,
TITLE This is the title on the first line \n This is the title on the second line;
2.4 PUTDATA
The IMPUTE module outputs a single data set, the one specified on the DATAOUT statement of your setup file. If you have requested more than one imputation with the keyword MULTIPLES and have included the keyword ALL in the DATAOUT statement, the imputations are concatenated in the single output file. The imputations can be distinguished by the system variable ‘_MULT_’. If you request more than one imputation with the keyword MULTIPLES and have not included the keyword ALL in the DATAOUT statement, only the first imputation will be included in the output file. The additional imputations are stored in an internal file and can be retrieved by submitting the PUTDATA statement. For example, suppose that
<impute name="myfile">
datain mydata;
dataout myoutdata1;
/* Other Impute commands are here */
multiples 5;
run;
</impute>
is the command file executed for creating multiple imputations. The following code uses PUTDATA to extract the remaining 4 data sets. These data sets are now available for further analysis simply by calling them into other commands.
/* extract the remaining four multiply imputed datasets */
<putdata name="myfile" mult="2" dataout="mydataout2" />
<putdata name="myfile" mult="3" dataout="mydataout3" />
<putdata name="myfile" mult="4" dataout="mydataout4" />
<putdata name="myfile" mult="5" dataout="mydataout5" />