Model Selection in Multivariate Analysis with Missing Data
A new model selection algorithm for missing data problems is developed, based on maximizing a penalized likelihood function with the smoothly clipped absolute deviation (SCAD) penalty. When the number of variables is much smaller than the sample size, the algorithm can be implemented efficiently as a one-step algorithm. A modified criterion based on the Bayesian Information Criterion (BIC) for missing data problems is proposed to select the optimal tuning parameter of the penalty function. One advantage of the proposed approach over the currently available one is that it uses the observed-data log-likelihood, so that it asymptotically selects the true model when the missing data mechanism is assumed to be Missing at Random (MAR). A new model selection scheme is also proposed that selects not only the covariates for the outcome variable but also the covariate models, which are important when high-dimensional covariates are subject to missing values. The proposed algorithm is implicitly applied to linear regression models and logistic regression models. Gauss-Hermite quadrature and Monte Carlo simulation are used to approximate the intractable integrals in the Expectation-Maximization (EM) algorithm. Several simulations are carried out to compare the performance of the proposed algorithm with that of other available variable selection methods for missing data. A real data set from a case-control study investigating potential risk factors for hip fracture is used to illustrate the application of the proposed method. Several selection procedures, including ones with interaction effects, are run on the data using the proposed method and imputation methods to confirm the optimal model.
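For readers unfamiliar with the SCAD penalty named above, the following is a minimal sketch of its commonly cited piecewise form and its derivative. The abstract does not state the formula, so the function names, the default shape parameter a = 3.7, and the scalar interface are assumptions made for illustration, not details taken from this work.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient (assumed standard form, a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: behaves like the L1 (lasso) penalty.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic transition region.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |theta|."""
    t = abs(theta)
    if t <= lam:
        return lam
    # Decays linearly to zero at |theta| = a * lam and stays zero beyond it.
    return max(a * lam - t, 0.0) / (a - 1)
```

The derivative is the quantity that one-step schemes of the local linear approximation kind typically use as coefficient-specific weights in a weighted L1 problem; whether that matches the one-step algorithm developed here is an assumption of this sketch.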