Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering

Yong Li, Qihao Yang, Fu Lee Wang, Lap Kei Lee, Yingying Qu, Tianyong Hao

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Insufficient training data is a common barrier to effectively learning multimodal information interactions and question semantics in existing medical Visual Question Answering (VQA) models. This paper proposes a new Asymmetric Cross-Modal Attention network, called ACMA, which constructs an image-guided attention and a question-guided attention to improve multimodal interactions learned from insufficient data. In addition, a newly designed Semantic Understanding Auxiliary (SUA) in the question-guided attention learns rich semantic embeddings by integrating word-level and sentence-level information, improving model performance on question understanding. Moreover, we propose a new data augmentation method called Multimodal Augmented Mixup (MAM) to train the ACMA, with the combination denoted as ACMA-MAM. MAM incorporates various data augmentations and a vanilla mixup strategy to generate more non-repetitive data, which avoids time-consuming manual data annotation and improves the model's generalization capability. Our ACMA-MAM outperforms state-of-the-art models on three publicly accessible medical VQA datasets (VQA-Rad, VQA-Slake, and PathVQA) with accuracies of 76.14%, 83.13%, and 53.83%, respectively, improvements of 2.00%, 1.32%, and 1.59%. Moreover, our model achieves F1 scores of 78.33%, 82.83%, and 51.86%, surpassing state-of-the-art models by 2.80%, 1.15%, and 1.37%, respectively.
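The vanilla mixup component of MAM can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of jointly mixing image inputs, question embeddings, and answer labels with a Beta-sampled coefficient; the tensor shapes, the function name multimodal_mixup, and the choice to mix both modalities with a single shared coefficient are assumptions for illustration, not the paper's exact MAM pipeline (which additionally combines other data augmentations).

import torch

def multimodal_mixup(images, questions, labels, alpha=0.2):
    # Hypothetical sketch of vanilla mixup applied jointly to both modalities.
    # images:    (B, C, H, W) image batch
    # questions: (B, L, D) question embedding batch
    # labels:    (B, num_answers) one-hot (or soft) answer labels
    # alpha:     Beta-distribution concentration controlling mixing strength
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_questions = lam * questions + (1.0 - lam) * questions[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_questions, mixed_labels

A training loop would then feed mixed_images and mixed_questions through the VQA model and compute the loss against mixed_labels, as in standard mixup training.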

Original language: English
Article number: 102667
Journal: Artificial Intelligence in Medicine
Volume: 144
DOIs
Publication status: Published - Oct 2023

Keywords

  • Cross modal attention
  • Data augmentation
  • Medical Visual Question Answering
  • Mixup
  • Multimodal interaction
