Research Updates

Multi-metric comparison of machine learning imputation methods with application to breast cancer survival.

Published: 2024 Aug 30
Authors: Imad El Badisy, Nathalie Graffeo, Mohamed Khalis, Roch Giorgi
Source: BMC Medical Research Methodology

Abstract:

Handling missing data in clinical prognostic studies is an essential yet challenging task. This study aimed to provide a comprehensive assessment of the effectiveness and reliability of different machine learning (ML) imputation methods across various analytical perspectives. Specifically, it focused on three distinct classes of performance metrics used to evaluate ML imputation methods: post-imputation bias of regression estimates, post-imputation predictive accuracy, and substantive model-free metrics. As an illustration, we applied the methods to data from a real-world breast cancer survival study. A simulated dataset with 30% Missing At Random (MAR) values was used. Several single imputation (SI) methods (KNN, missMDA, CART, missForest, missRanger, and missCforest) and two multiple imputation (MI) methods (miceCART and miceRF) were evaluated. The performance metrics used were Gower's distance, estimation bias, empirical standard error, coverage rate, length of confidence interval, predictive accuracy, proportion of falsely classified entries (PFC), normalized root mean squared error (NRMSE), AUC, and C-index scores. The analysis revealed that in terms of Gower's distance, CART and missForest were the most accurate, while missMDA and CART excelled for binary covariates; missForest and miceCART were superior for continuous covariates. When assessing bias and accuracy in regression estimates, miceCART and miceRF exhibited the least bias. Overall, the various imputation methods demonstrated greater efficiency than complete-case analysis (CCA), with MICE methods providing optimal confidence interval coverage. In terms of predictive accuracy for Cox models, missMDA and missForest had superior AUC and C-index scores.
Despite offering better predictive accuracy, the study found that SI methods introduced more bias into the regression coefficients compared to MI methods. This study underlines the importance of selecting appropriate imputation methods based on study goals and data types in time-to-event research. The varying effectiveness of methods across the different performance metrics studied highlights the value of using advanced machine learning algorithms within a multiple imputation framework to enhance research integrity and the robustness of findings. © 2024. The Author(s).
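
Two of the model-free metrics named in the abstract, NRMSE for continuous covariates and PFC for categorical ones, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `nrmse` and `pfc` are hypothetical, the metrics are computed only over entries that were artificially set to missing, and the NRMSE is normalized by the population variance of the true values (one common convention; the abstract does not state which normalization the study used).

```python
import numpy as np


def nrmse(x_true, x_imputed, missing_mask):
    """Normalized RMSE over the masked entries of a continuous variable.

    Assumption: normalization by the population variance (ddof=0) of the
    true values at the masked positions.
    """
    x_true = np.asarray(x_true, dtype=float)
    x_imputed = np.asarray(x_imputed, dtype=float)
    mask = np.asarray(missing_mask, dtype=bool)
    diff = x_true[mask] - x_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask])))


def pfc(x_true, x_imputed, missing_mask):
    """Proportion of masked categorical entries that were imputed wrongly."""
    x_true = np.asarray(x_true)
    x_imputed = np.asarray(x_imputed)
    mask = np.asarray(missing_mask, dtype=bool)
    return float(np.mean(x_true[mask] != x_imputed[mask]))


# Toy data (invented for illustration): positions 1 and 3 were masked.
mask = [False, True, False, True]
age_true = [1.0, 2.0, 3.0, 4.0]
age_imputed = [1.0, 3.0, 3.0, 5.0]
grade_true = ["a", "b", "a", "b"]
grade_imputed = ["a", "a", "a", "b"]

nrmse_value = nrmse(age_true, age_imputed, mask)
pfc_value = pfc(grade_true, grade_imputed, mask)
```

For both metrics, lower values indicate imputations closer to the true data, with 0 meaning the masked entries were recovered exactly. Neither metric depends on the downstream Cox model, which is why an imputation method can score well here and still bias the regression coefficients.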