Generation of a Melanoma and Nevus Data Set From Unstandardized Clinical Photographs on the Internet.
Publication date: 2023 Oct 04
Authors:
Soo Ick Cho, Cristian Navarrete-Dechent, Roxana Daneshjou, Hye Soo Cho, Sung Eun Chang, Seong Hwan Kim, Jung-Im Na, Seung Seog Han
Source:
JAMA Dermatology
Abstract:
Artificial intelligence (AI) training for diagnosing dermatologic images requires large amounts of clean data. Dermatologic images vary in composition, and many are inaccessible because of privacy concerns, which hinders the development of AI.
The objective was to build a training data set for discriminative and generative AI from unstandardized internet images of melanoma and nevus.
In this diagnostic study, a total of 5619 (CAN5600 data set) and 2006 (CAN2000 data set; a manually revised subset of CAN5600) cropped lesion images of either melanoma or nevus were semiautomatically annotated from approximately 500 000 photographs on the internet using convolutional neural networks (CNNs), region-based CNNs, and large mask inpainting. For unsupervised pretraining, 132 673 possible lesions (LESION130k data set) were also assembled, with diversity obtained by collecting images from 18 482 websites in approximately 80 countries. A total of 5000 synthetic images (GAN5000 data set) were generated using a generative adversarial network (StyleGAN2-ADA; training, CAN2000 data set; pretraining, LESION130k data set).
The area under the receiver operating characteristic curve (AUROC) for determining malignant neoplasms was analyzed. In each test, 1 of the 7 preexisting public data sets (2312 images in total: Edinburgh, an SNU subset, Asan test, Waterloo, 7-point criteria evaluation, PAD-UFES-20, and MED-NODE) was used as the test data set. The performance of an EfficientNet Lite0 CNN trained on the proposed data set was then compared with that of the same CNN trained on the remaining 6 preexisting data sets.
EfficientNet Lite0 CNNs trained on the annotated or synthetic images achieved mean (SD) AUROCs higher than or equivalent to that of the EfficientNet Lite0 trained on the pathologically confirmed public data sets combined (0.809 [0.063]): CAN5600 (0.874 [0.042]; P = .02), CAN2000 (0.848 [0.027]; P = .08), and GAN5000 (0.838 [0.040]; P = .31; Wilcoxon signed rank test), reflecting the benefit of a larger training data set.
The synthetic data set in this diagnostic study was created from internet images using various AI technologies. A neural network trained on the created data set (CAN5600) performed better than the same network trained on the preexisting data sets combined. Both the annotated (CAN5600 and LESION130k) and synthetic (GAN5000) data sets could be shared for AI training and for building consensus among physicians.
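The evaluation described above, one AUROC per held-out test data set followed by a paired Wilcoxon signed rank test across the 7 test sets, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `auroc` and `wilcoxon_signed_rank_exact` are hypothetical, the per-test-set AUROC values are made up for demonstration and are not the study's results, and the exact enumeration of sign assignments is one possible way to compute the P value (the abstract does not say which variant of the test was used).

```python
from itertools import product


def auroc(labels, scores):
    """AUROC as the probability that a random malignant case (label 1)
    scores higher than a random benign case (label 0); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed rank P value for paired samples.
    Zero differences are dropped; tied |differences| get average ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2**n patterns and count those at least as extreme.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-test-set AUROCs (NOT the study's values): one entry per
# held-out public data set, for a proposed and a baseline training set.
proposed = [0.90, 0.85, 0.88, 0.80, 0.92, 0.86, 0.91]
baseline = [0.84, 0.80, 0.81, 0.76, 0.83, 0.78, 0.89]
p_value = wilcoxon_signed_rank_exact(proposed, baseline)
print(p_value)  # 0.015625, i.e. 2/128: all 7 differences share one sign
```

With only 7 paired values, the smallest P value an exact two-sided test can return is 2/128, about .016, which is reached only when one training set wins on every test set. In practice, library routines such as scikit-learn's `roc_auc_score` and SciPy's `scipy.stats.wilcoxon` would normally replace these hand-written helpers.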