Impute-then-exclude versus exclude-then-impute: Lessons when imputing a variable used both in cohort creation and as an independent variable in the analysis model.
Publication date: 2023 Feb 19
Authors:
Peter C Austin, Daniele Giardiello, Stef van Buuren
Source:
Statistics in Medicine
Abstract:
We examined the setting in which a variable that is subject to missingness is used both as an inclusion/exclusion criterion for creating the analytic sample and subsequently as the primary exposure in the analysis model that is of scientific interest. An example is cancer stage, where patients with stage IV cancer are often excluded from the analytic sample, and cancer stage (I to III) is an exposure variable in the analysis model. We considered two analytic strategies. The first strategy, referred to as "exclude-then-impute," excludes subjects for whom the observed value of the target variable is equal to the specified value and then uses multiple imputation to complete the data in the resultant sample. The second strategy, referred to as "impute-then-exclude," first uses multiple imputation to complete the data and then excludes subjects based on the observed or filled-in values in the completed samples. Monte Carlo simulations were used to compare five methods (one based on "exclude-then-impute" and four based on "impute-then-exclude") along with the use of a complete case analysis. We considered both missing completely at random (MCAR) and missing at random (MAR) missing data mechanisms. We found that an impute-then-exclude strategy using substantive model compatible fully conditional specification tended to have superior performance across 72 different scenarios. We illustrated the application of these methods using empirical data on patients hospitalized with heart failure when heart failure subtype was used for cohort creation (excluding subjects with heart failure with preserved ejection fraction) and was also an exposure in the analysis model.
© 2023 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
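The difference between the two strategies comes down to the order of two operations, which a small sketch can make concrete. The Python below is illustrative only and is not the authors' code. The simulated data, the function names, and the nearest-neighbour hot-deck imputer are all assumptions introduced here; the hot-deck is a simple stand-in for a real multiple imputation method and is not substantive model compatible fully conditional specification. Pooling of estimates across the completed data sets is omitted.

```python
import bisect
import random

EXCLUDED_STAGE = 4  # stage IV is the exclusion value in the abstract's example


def simulate(n=1000, p_missing=0.3, seed=0):
    """Hypothetical data: stage (1-4), an outcome y, and MCAR missingness in stage."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stage = rng.choices([1, 2, 3, 4], weights=[3, 3, 2, 2])[0]
        y = float(stage) + rng.gauss(0.0, 1.0)
        observed = rng.random() >= p_missing
        rows.append({"stage": stage if observed else None, "y": y})
    return rows


def impute_stage(rows, rng, k=10):
    """Stand-in imputer: draw each missing stage from the k donors nearest in y."""
    donors = sorted((r["y"], r["stage"]) for r in rows if r["stage"] is not None)
    donor_ys = [d[0] for d in donors]
    completed = []
    for r in rows:
        stage = r["stage"]
        if stage is None:
            i = bisect.bisect_left(donor_ys, r["y"])
            lo = max(0, i - k // 2)
            hi = min(len(donors), lo + k)
            stage = rng.choice(donors[lo:hi])[1]
        completed.append({"stage": stage, "y": r["y"]})
    return completed


def exclude_then_impute(rows, rng, m=5, excluded=EXCLUDED_STAGE):
    """Drop subjects whose OBSERVED stage is excluded, then impute m times."""
    # Subjects with a missing stage stay in the sample, and because no donor
    # has the excluded stage, none of them can ever be imputed as excluded.
    kept = [r for r in rows if r["stage"] != excluded]
    return [impute_stage(kept, rng) for _ in range(m)]


def impute_then_exclude(rows, rng, m=5, excluded=EXCLUDED_STAGE):
    """Impute m times on the full sample, then drop observed or imputed excluded stages."""
    datasets = []
    for _ in range(m):
        completed = impute_stage(rows, rng)
        datasets.append([r for r in completed if r["stage"] != excluded])
    return datasets


rng = random.Random(1)
data = simulate()
eti = exclude_then_impute(data, rng)
ite = impute_then_exclude(data, rng)
print("exclude-then-impute sizes:", [len(d) for d in eti])
print("impute-then-exclude sizes:", [len(d) for d in ite])
```

In this sketch, exclude-then-impute always yields completed data sets of the same size, because subjects with a missing stage can never be removed. Under impute-then-exclude the analytic sample can differ from one imputation to the next, since some subjects with a missing stage are imputed as stage IV and then excluded.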