Effectiveness of symmetric metamorphic relations on validating the stability of code generation LLM

Research output: Contribution to journal › Article › peer-review

Abstract

Pre-trained large language models (LLMs) are increasingly used in software development for code generation, with a preference for private LLMs over public ones to avoid the risk of exposing corporate secrets. Validating the stability of these LLMs' outputs is crucial, and our study proposes using symmetric Metamorphic Relations (MRs) from Metamorphic Testing (MT) for this purpose. Our study involved an empirical experiment with ten LLMs (eight private and two public) and two publicly available datasets. We defined seven symmetric MRs to generate "Follow-up" datasets from "Source" datasets for testing. Our evaluation aimed to detect violations (inconsistent predictions) between "Source" and "Follow-up" datasets and to assess the effectiveness of the MRs in identifying correct and incorrect non-violated predictions from ground truths. Results showed that one public and four private LLMs did not violate the "Case transformation of prompts" MR. Furthermore, effectiveness and performance results indicated that the proposed MRs are effective tools for explaining the instability of LLMs' outputs under "Case transformation of prompts", "Duplication of prompts", and "Paraphrasing of prompts". The study underscored the importance of enhancing LLMs' semantic understanding of prompts for better stability and highlighted potential future research directions, including exploring different MRs, enhancing semantic understanding, and applying symmetry to prompt engineering.
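The violation-detection scheme the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `generate_code` stands in for any LLM call, the two MR functions are simplified readings of the named relations, and the equality-based violation check is an assumption about how "inconsistent predictions" are decided.

```python
# Hypothetical sketch of symmetric-MR stability testing for a code LLM.
# Assumptions: `generate_code` is any callable mapping prompt -> code;
# a violation is flagged when source and follow-up outputs differ.

def case_transform(prompt: str) -> str:
    """MR sketch: 'Case transformation of prompts' (uppercase the prompt)."""
    return prompt.upper()

def duplicate(prompt: str) -> str:
    """MR sketch: 'Duplication of prompts' (repeat the prompt)."""
    return prompt + " " + prompt

def violates(generate_code, prompt: str, mr) -> bool:
    """A symmetric MR is violated when the LLM's output for the
    follow-up prompt differs from its output for the source prompt."""
    source_output = generate_code(prompt)
    followup_output = generate_code(mr(prompt))
    return source_output != followup_output

# Usage with a deterministic stand-in "LLM" that ignores letter case,
# so it is stable under case transformation:
stub_llm = lambda p: "def solve(): pass  # " + p.lower()
print(violates(stub_llm, "sort a list", case_transform))  # False: no violation
```

A real experiment would run this check over every prompt in the "Source" dataset, count violations per MR and per model, and then compare non-violated predictions against the ground-truth solutions.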
Original language: English
Article number: 112330
Journal: Journal of Systems and Software
Volume: 222
DOIs
Publication status: Published - Apr 2025
Externally published: Yes

Keywords

  • Code generation
  • Large language model
  • Metamorphic relation
  • Metamorphic testing
  • True satisfaction

