Rofo 2023; 195(S 01): S36
DOI: 10.1055/s-0043-1763039
Abstracts
Oral Presentation (Science)
IT/Image Processing/Software

Automatic Evaluation of Chest Radiographs – The Data Source Matters, But How Much Exactly?

S Tayebi Arasteh
1   Uniklinik RWTH Aachen, Diagnostische und Interventionelle Radiologie, Aachen
,
P Isfort
2   Uniklinik RWTH Aachen University, Diagnostische und Interventionelle Radiologie, Aachen
,
C Kuhl
2   Uniklinik RWTH Aachen University, Diagnostische und Interventionelle Radiologie, Aachen
,
S Nebelung
2   Uniklinik RWTH Aachen University, Diagnostische und Interventionelle Radiologie, Aachen
,
D Truhn
2   Uniklinik RWTH Aachen University, Diagnostische und Interventionelle Radiologie, Aachen
 
 

    Purpose Artificial intelligence (AI) models can support radiologists in their diagnoses, but they usually perform worse on external data, i.e., data from hospitals that did not participate in the initial training. In this study, we perform a large-scale analysis of domain transferability, i.e., of the performance of radiologic AI models on external data, using 550,000 publicly available chest radiographs from five institutions across the globe with differing annotations and imaging protocols.

    Materials and Methods We tested domain transferability on multicentric datasets comprising VinDr-CXR (n=18,000), ChestX-ray14 (n=112,120), CheXpert (n=157,676), MIMIC-CXR-JPG (n=215,187), and UKA-CXR (n=54,824), using 11 different labels: cardiomegaly, lung opacity, lung lesion, pneumonia, edema, enlarged cardiomediastinum, consolidation, pleural effusion, pneumothorax, atelectasis, and no finding. AI models based on the ResNet architecture were trained on each dataset and evaluated both on a held-out test set of the same dataset (original domain, OD) and cross-evaluated on the test sets of the external domains (ED) where labels were available. The area under the receiver operating characteristic curve (AUC) served as the primary evaluation metric. Bootstrapping was employed to determine the statistical spread and calculate p-values.
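    A minimal sketch of the described setup is given below, assuming a PyTorch/torchvision ResNet-50 backbone with a sigmoid multi-label head trained with binary cross-entropy; the abstract only specifies "the ResNet architecture", so the exact depth, input pipeline, optimizer, hyperparameters, and all function names (build_model, train, evaluate) are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch (not the authors' code): multi-label ResNet training and
# cross-domain evaluation with per-label AUC.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score

NUM_LABELS = 11  # cardiomegaly, lung opacity, ..., no finding


def build_model() -> nn.Module:
    # ResNet backbone (depth assumed) with a multi-label classification head.
    model = models.resnet50(weights=None)
    model.fc = nn.Linear(model.fc.in_features, NUM_LABELS)
    return model


def train(model, loader, epochs=30, lr=1e-4, device="cuda"):
    # Standard multi-label training with binary cross-entropy on logits.
    model.to(device)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model


@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    # Per-label AUCs on a test set; the same routine is run on the held-out
    # test set of the training dataset (OD) and on external test sets (ED).
    model.eval().to(device)
    scores, targets = [], []
    for images, labels in loader:
        scores.append(torch.sigmoid(model(images.to(device))).cpu().numpy())
        targets.append(labels.numpy())
    scores, targets = np.concatenate(scores), np.concatenate(targets)
    return [roc_auc_score(targets[:, k], scores[:, k]) for k in range(NUM_LABELS)]
```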

    Results The average AUC over all labels of each test cohort was: 1) CheXpert & MIMIC-CXR-JPG (EDs) vs. ChestX-ray14 (OD): 0.74±0.07 & 0.74±0.05 vs. 0.74±0.08, 2) MIMIC-CXR-JPG (ED) vs. CheXpert (OD): 0.72±0.11 vs. 0.76±0.09, 3) CheXpert (ED) vs. MIMIC-CXR-JPG (OD): 0.70±0.07 vs. 0.76±0.05, 4) UKA-CXR (ED) vs. VinDr-CXR (OD): 0.86±0.07 vs. 0.94±0.02.
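    The abstract does not state whether the ± values denote the variation across the 11 per-label AUCs or the bootstrap spread; the sketch below illustrates the bootstrap route named in the methods. The resampling scheme, the helper names (mean_auc, bootstrap_mean_auc), and the p-value approximation are assumptions for illustration.

```python
# Hedged sketch (assumed resampling scheme): non-parametric bootstrap over
# test-set cases to estimate the spread of the label-averaged AUC.
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_auc(targets, scores):
    # Average AUC over all labels for which both classes occur in the sample.
    aucs = [roc_auc_score(targets[:, k], scores[:, k])
            for k in range(targets.shape[1])
            if len(np.unique(targets[:, k])) == 2]
    return float(np.mean(aucs))


def bootstrap_mean_auc(targets, scores, n_boot=1000, seed=0):
    # Resample cases with replacement and recompute the label-averaged AUC;
    # the standard deviation of the resulting values quantifies the spread.
    rng = np.random.default_rng(seed)
    n = targets.shape[0]
    idx_sets = (rng.integers(0, n, n) for _ in range(n_boot))
    return np.array([mean_auc(targets[idx], scores[idx]) for idx in idx_sets])

# One possible p-value approximation for an OD-vs-ED comparison on the same
# test cohort: the fraction of bootstrap replicates in which the externally
# trained model matches or exceeds the in-domain model, e.g.
# p = np.mean(bootstrap_mean_auc(y, s_ed) >= bootstrap_mean_auc(y, s_od))
```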

    Conclusions Given sufficient high-quality data, models trained on data from a single institution can perform reasonably well on external data from institutions that did not participate in the initial training, as performance drops by no more than 9%.
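    As a worked check of the 9% figure, the label-averaged AUCs reported above can be turned into relative drops (OD value first, ED value second): the largest is (0.94 - 0.86) / 0.94 ≈ 8.5% for the VinDr-CXR model evaluated on UKA-CXR, and an absolute reading in AUC points (0.08) also stays below 9%.

```python
# Relative OD-to-ED drop of the label-averaged AUC, using the values
# reported in the results section (training set -> external test cohort).
pairs = {
    "ChestX-ray14 -> CheXpert": (0.74, 0.74),
    "ChestX-ray14 -> MIMIC-CXR-JPG": (0.74, 0.74),
    "CheXpert -> MIMIC-CXR-JPG": (0.76, 0.72),
    "MIMIC-CXR-JPG -> CheXpert": (0.76, 0.70),
    "VinDr-CXR -> UKA-CXR": (0.94, 0.86),
}
for setting, (od, ed) in pairs.items():
    print(f"{setting}: {100.0 * (od - ed) / od:.1f}% relative AUC drop")
# Largest drop: VinDr-CXR -> UKA-CXR with about 8.5%, i.e. below 9%.
```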



    Publication History

    Article published online:
    13 April 2023

    © 2023. Thieme. All rights reserved.

    Georg Thieme Verlag
    Rüdigerstraße 14, 70469 Stuttgart, Germany