DOI: 10.1055/a-2772-7798
Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine
Supported by: JF Senge and P Dlotko were supported by the Dioscuri program initiated by the Max Planck Society, jointly managed with the National Science Centre (Poland), and mutually funded by the Polish Ministry of Science and Higher Education and the German Federal Ministry of Education and Research.
Abstract
Purpose
The rising demand for radiology services calls for innovative solutions to sustain diagnostic quality and efficiency. This study evaluated the diagnostic agreement between two commercially available artificial intelligence (AI) chest X-ray systems and human radiologists during routine clinical practice.
Materials and Methods
We retrospectively analyzed 279 chest X-rays (204 standing, 63 supine, 12 sitting) from a Swiss university hospital. Seven thoracic pathologies – cardiomegaly, consolidation, mediastinal mass, nodule, pleural effusion, pneumothorax, and pulmonary oedema – were assessed. Radiologists' routine reports were compared against Rayvolve (AZmed) and ChestView (Gleamer; both companies based in Paris, France). Python code, provided as an open-access supplement, computed performance metrics, agreement measures, and effect sizes.
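The supplement itself is not reproduced on this page. Purely as a minimal sketch, assuming binary present/absent labels per pathology and using hypothetical example data, the reported per-pathology metrics can be computed with scikit-learn and statsmodels:

import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, matthews_corrcoef
from statsmodels.stats.proportion import proportion_confint

# Hypothetical binary labels for one pathology (1 = finding present),
# one entry per radiograph: radiologist report vs. AI output.
human = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
ai = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(human, ai, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)  # agreement on positive findings
specificity = tn / (tn + fp)  # agreement on negative findings

# Balanced accuracy is the mean of sensitivity and specificity.
print("balanced accuracy:", balanced_accuracy_score(human, ai))
print("MCC:", matthews_corrcoef(human, ai))

# 95% Wilson confidence interval for sensitivity.
print("sensitivity 95% CI:", proportion_confint(tp, tp + fn, alpha=0.05, method="wilson"))

Balanced accuracy is insensitive to class prevalence, which matters here because several of the seven findings are rare.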
Results
Agreement between radiologists and AI ranged from moderate to almost perfect: Human-AZmed (Gwet's AC1: 0.47–0.72, moderate to substantial) and Human-Gleamer (Gwet's AC1: 0.56–0.96, moderate to almost perfect). Balanced accuracies ranged from 0.67–0.85 for Human-AZmed and 0.71–0.85 for Human-Gleamer, with peak performance for pleural effusion (0.85 for both systems). Specificity consistently exceeded sensitivity across pathologies (0.70–0.98 vs. 0.45–0.85). Common findings showed strong performance: pleural effusion (MCC 0.70–0.73), cardiomegaly (MCC 0.51), and consolidation (MCC 0.45–0.46). Rare pathologies demonstrated lower agreement: mediastinal mass and nodules (MCC 0.23–0.31). Standing radiographs yielded superior agreement compared with supine studies. The two AI systems showed substantial inter-system agreement for consolidation and pleural effusion (balanced accuracy 0.81–0.84).
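The Gwet's AC1 values above are read on the Landis-Koch bands (0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect). The agreement measures were computed by the study's open-access Python supplement; purely as an illustrative sketch, for the special case of two raters and binary ratings, AC1 reduces to a short closed form:

import numpy as np

def gwet_ac1(r1, r2):
    # Gwet's AC1 for two raters on binary ratings (illustrative sketch):
    # AC1 = (pa - pe) / (1 - pe), with chance agreement pe = 2 * pi * (1 - pi),
    # where pi is the mean prevalence of the positive class across both raters.
    r1, r2 = np.asarray(r1), np.asarray(r2)
    pa = np.mean(r1 == r2)                # observed agreement
    pi = (np.mean(r1) + np.mean(r2)) / 2  # mean positive-class prevalence
    pe = 2 * pi * (1 - pi)                # AC1 chance-agreement term
    return (pa - pe) / (1 - pe)

# Hypothetical ratings: radiologist vs. one AI system for a single pathology.
human = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
ai = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print("Gwet's AC1:", gwet_ac1(human, ai))

Unlike Cohen's kappa, the AC1 chance term stays stable when a finding is rare, which suits low-prevalence pathologies such as mediastinal mass and nodules.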
Conclusion
Both commercial AI chest X-ray systems demonstrated comparable performance to human radiologists for common thoracic pathologies, with no meaningful differences between platforms. Performance was strongest for standing radiographs but declined for rare findings and supine studies. Position-dependent variability and reduced sensitivity for uncommon pathologies underscore the continued need for human oversight in clinical practice.
Key Points
- AI systems matched radiologists for common chest X-ray findings.
- Standing radiographs achieved the highest diagnostic agreement.
- Rare pathologies showed weaker AI-human agreement.
- Supine studies reduced diagnostic performance.
- Human oversight remains essential in clinical practice.
Citation Format
- Bosbach WA, Schoeni L, Senge JF et al. Novel Artificial Intelligence Chest X-ray Diagnostics: A Quality Assessment of Their Agreement with Human Doctors in Clinical Routine. Rofo 2025; DOI 10.1055/a-2778-3892
Keywords
Chest X-ray - Deep Learning - Multi-label Classification - Explainability - Medical Imaging
Publication History
Received: 09 April 2025
Accepted after revision: 11 December 2025
Article published online: 20 January 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
