TY - GEN
T1 - Leveraging statistic and semantic features for similar question detection using fusion xgboost
AU - Liao, Siyuan
AU - Wong, Leung Pun
AU - Lee, Lap Kei
AU - Au, Oliver
AU - Hao, Tianyong
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - Question text similarity calculation is a fundamental and essential research problem for community question answering services. Different question text collections have various characteristics. Some frequently answered questions may have distinct statistical patterns, while some questions are syntactically different but semantically similar. To measure question similarity more adaptively to different kinds of question text, this paper proposes a method for identifying similar questions using a combination of statistical and semantic features based on XGBoost. The method extracts semantic and statistical features from question text. After that, a feature set generation method is proposed, along with a model fusion strategy. Three experiments were conducted on the standard Yahoo! dataset, which contains 25,569 questions with answers, to evaluate the performance of the method. Results show that it achieves a precision of 88.65% and a recall of 71.85%, outperforming a list of baseline methods.
AB - Question text similarity calculation is a fundamental and essential research problem for community question answering services. Different question text collections have various characteristics. Some frequently answered questions may have distinct statistical patterns, while some questions are syntactically different but semantically similar. To measure question similarity more adaptively to different kinds of question text, this paper proposes a method for identifying similar questions using a combination of statistical and semantic features based on XGBoost. The method extracts semantic and statistical features from question text. After that, a feature set generation method is proposed, along with a model fusion strategy. Three experiments were conducted on the standard Yahoo! dataset, which contains 25,569 questions with answers, to evaluate the performance of the method. Results show that it achieves a precision of 88.65% and a recall of 71.85%, outperforming a list of baseline methods.
KW - Feature set generation
KW - Question-answering
KW - Similar question detection
KW - XGBoost
UR - https://www.scopus.com/pages/publications/85092170802
U2 - 10.1007/978-3-030-59413-8_9
DO - 10.1007/978-3-030-59413-8_9
M3 - Conference contribution
AN - SCOPUS:85092170802
SN - 9783030594121
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 106
EP - 120
BT - Database Systems for Advanced Applications. DASFAA 2020 International Workshops - BDMS, SeCoP, BDQM, GDMA, and AIDE, Proceedings
A2 - Nah, Yunmook
A2 - Kim, Chulyun
A2 - Kim, Seon Ho
A2 - Moon, Yang-Sae
A2 - Whang, Steven Euijong
T2 - 7th International Workshop on Big Data Management and Service, BDMS 2020, 6th International Symposium on Semantic Computing and Personalization, SeCoP 2020, 5th Big Data Quality Management, BDQM 2020, 4th International Workshop on Graph Data Management and Analysis, GDMA 2020, 1st International Workshop on Artificial Intelligence for Data Engineering, AIDE 2020, held in conjunction with the 25th International Conference on Database Systems for Advanced Applications, DASFAA 2020
Y2 - 24 September 2020 through 27 September 2020
ER -