A systematic evaluation of GPT-4V's multimodal capability for chest X-ray image analysis

Liu Yunyi
Li Yingshu
Wang Zhanyu
Liang Xinyu
Liu Lingqiao
Wang Lei
Cui Leyang
Tu Zhaopeng
Wang Longyue
Zhou Luping
Meta-Radiology, Volume 2, Issue 4, published online 2024-7-15

This work evaluates GPT-4V's multimodal capability for medical image analysis, focusing on three representative tasks: radiology report generation, medical visual question answering, and medical visual grounding. For each task, a set of prompts is designed to elicit the corresponding capability of GPT-4V and obtain sufficiently good outputs. Three evaluation approaches, namely quantitative analysis, human evaluation, and case study, are employed to make the evaluation both in-depth and extensive. Our evaluation shows that GPT-4V excels in understanding medical images: it can generate high-quality radiology reports and effectively answer questions about medical images. Its performance on medical visual grounding, however, needs to be substantially improved. In addition, we observe a discrepancy between the outcomes of quantitative analysis and those of human evaluation. This discrepancy suggests that conventional metrics are limited in assessing the performance of large language models like GPT-4V, and that new metrics for automatic quantitative analysis need to be developed.

GPT-4V; Medical image; Radiology report generation; Medical visual question answering; Medical visual grounding; Large language model evaluation