TY - JOUR
T1 - Effectiveness of symmetric metamorphic relations on validating the stability of code generation LLM
AU - Chan, Pak Yuen Patrick
AU - Keung, Jacky
AU - Yang, Zhen
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2025/4
Y1 - 2025/4
N2 - Pre-trained large language models (LLMs) are increasingly used in software development for code generation, with a preference for private LLMs over public ones to avoid the risk of exposing corporate secrets. Validating the stability of these LLMs’ outputs is crucial, and our study proposes using symmetric Metamorphic Relations (MRs) from Metamorphic Testing (MT) for this purpose. Our study involved an empirical experiment with ten LLMs (eight private and two public) and two publicly available datasets. We defined seven symmetric MRs to generate “Follow-up” datasets from “Source” datasets for testing. Our evaluation aimed to detect violations (inconsistent predictions) between the “Source” and “Follow-up” datasets and to assess the effectiveness of the MRs in distinguishing correct from incorrect non-violated predictions against ground truths. Results showed that one public and four private LLMs did not violate the “Case transformation of prompts” MR. Furthermore, effectiveness and performance results indicated that the proposed MRs are effective tools for explaining the instability of LLMs’ outputs under “Case transformation of prompts”, “Duplication of prompts”, and “Paraphrasing of prompts”. The study underscored the importance of enhancing LLMs’ semantic understanding of prompts for better stability and highlighted potential future research directions, including exploring different MRs, enhancing semantic understanding, and applying symmetry to prompt engineering.
KW - Code generation
KW - Large language model
KW - Metamorphic relation
KW - Metamorphic testing
KW - True satisfaction
UR - https://www.scopus.com/pages/publications/85214303528
U2 - 10.1016/j.jss.2024.112330
DO - 10.1016/j.jss.2024.112330
M3 - Article
SN - 0164-1212
VL - 222
JO - Journal of Systems and Software
JF - Journal of Systems and Software
M1 - 112330
ER -