Research Updates
The articles below are published ahead of final publication in an issue. Please cite them in the following format: authors, (year), title, journal, DOI.


Performance of a Breast Cancer Detection AI Algorithm Using the Personal Performance in Mammographic Screening Scheme.

Publication date: 2023 Sep
Authors: Yan Chen, Adnan G Taib, Iain T Darker, Jonathan J James
Source: RADIOLOGY

Abstract:

Background The Personal Performance in Mammographic Screening (PERFORMS) scheme is used to assess reader performance. Whether this scheme can assess the performance of artificial intelligence (AI) algorithms is unknown. Purpose To compare the performance of human readers and a commercially available AI algorithm interpreting PERFORMS test sets. Materials and Methods In this retrospective study, two PERFORMS test sets, each consisting of 60 challenging cases, were evaluated by human readers between May 2018 and March 2021 and were evaluated by an AI algorithm in 2022. AI considered each breast separately, assigning a suspicion of malignancy score to features detected. Performance was assessed using the highest score per breast. Performance metrics, including sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), were calculated for AI and humans. The study was powered to detect a medium-sized effect (odds ratio, 3.5 or 0.29) for sensitivity. Results A total of 552 human readers interpreted both PERFORMS test sets, consisting of 161 normal breasts, 70 malignant breasts, and nine benign breasts. No difference was observed at the breast level between the AUC for AI and the AUC for human readers (0.93 and 0.88, respectively; P = .15). When using the developer's suggested recall score threshold, no difference was observed for AI versus human reader sensitivity (84% and 90%, respectively; P = .34), but the specificity of AI was higher (89%) than that of the human readers (76%, P = .003). However, it was not possible to demonstrate equivalence due to the size of the test sets. When using recall thresholds to match mean human reader performance (90% sensitivity, 76% specificity), AI showed no difference in performance, with a sensitivity of 91% (P = .73) and a specificity of 77% (P = .85).
Conclusion Diagnostic performance of AI was comparable with that of the average human reader when evaluating cases from two enriched test sets from the PERFORMS scheme. © RSNA, 2023 See also the editorial by Philpotts in this issue.
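The metrics reported above follow standard definitions. As an illustrative sketch only (not the study's actual code, and using synthetic scores and a hypothetical recall threshold), the following shows how sensitivity, specificity, and AUC can be computed from per-breast suspicion-of-malignancy scores, mirroring the abstract's "highest score per breast" evaluation:

```python
# Illustrative sketch: computing sensitivity, specificity, and AUC from
# per-breast suspicion scores. All data here are synthetic; the threshold
# is a hypothetical recall cutoff, not the developer's suggested one.

def confusion_counts(scores, labels, threshold):
    """Count TP/FP/TN/FN at a recall threshold (label 1 = malignant)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp, fp, tn, fn

def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen malignant breast scores higher than a normal one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example with synthetic scores (two malignant, two normal breasts):
scores = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
# sens = 0.5, spec = 0.5; auc(scores, labels) = 0.75
```

Varying the threshold trades sensitivity against specificity, which is how the study matched the AI's recall threshold to mean human reader performance; AUC summarizes performance across all thresholds.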