TY - JOUR
T1 - Fast medical concept normalization for biomedical literature based on stack and index optimized self-attention
AU - Liang, Likeng
AU - Hao, Tianyong
AU - Zhan, Choujun
AU - Qiu, Hong
AU - Wang, Fu Lee
AU - Yan, Jun
AU - Weng, Heng
AU - Qu, Yingying
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022/10
Y1 - 2022/10
N2 - Medical concept normalization aims to construct a semantic mapping between mentions and concepts and to uniformly represent mentions that belong to the same concept. In large-scale biomedical literature databases, a fast concept normalization method is essential for processing large numbers of requests and documents. To this end, we propose a hierarchical concept normalization method, named FastMCN, with much lower computational cost, together with a variant of the transformer encoder, named stack and index optimized self-attention (SISA), to improve efficiency and performance. During training, FastMCN uses SISA as a word encoder to encode word representations from character sequences and uses a mention encoder that summarizes the word representations to represent a mention. During inference, FastMCN indexes and summarizes word representations to represent a query mention and outputs the concept whose representation has the maximum cosine similarity with the query. To further improve performance, SISA was pre-trained using the continuous bag-of-words architecture on 18.6 million PubMed abstracts. All experiments were evaluated on two publicly available datasets: NCBI disease and BC5CDR disease. The results showed that SISA was three times faster than the transformer encoder at encoding word representations and had better performance. Benefiting from SISA, FastMCN was efficient in both training and inference: it achieved the peak performance of most baseline methods within 30 s and was 3000–5600 times faster than the state-of-the-art method in inference.
AB - Medical concept normalization aims to construct a semantic mapping between mentions and concepts and to uniformly represent mentions that belong to the same concept. In large-scale biomedical literature databases, a fast concept normalization method is essential for processing large numbers of requests and documents. To this end, we propose a hierarchical concept normalization method, named FastMCN, with much lower computational cost, together with a variant of the transformer encoder, named stack and index optimized self-attention (SISA), to improve efficiency and performance. During training, FastMCN uses SISA as a word encoder to encode word representations from character sequences and uses a mention encoder that summarizes the word representations to represent a mention. During inference, FastMCN indexes and summarizes word representations to represent a query mention and outputs the concept whose representation has the maximum cosine similarity with the query. To further improve performance, SISA was pre-trained using the continuous bag-of-words architecture on 18.6 million PubMed abstracts. All experiments were evaluated on two publicly available datasets: NCBI disease and BC5CDR disease. The results showed that SISA was three times faster than the transformer encoder at encoding word representations and had better performance. Benefiting from SISA, FastMCN was efficient in both training and inference: it achieved the peak performance of most baseline methods within 30 s and was 3000–5600 times faster than the state-of-the-art method in inference.
KW - Distributed word embedding
KW - Medical concept normalization
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85131081781&partnerID=8YFLogxK
U2 - 10.1007/s00521-022-07228-y
DO - 10.1007/s00521-022-07228-y
M3 - Article
AN - SCOPUS:85131081781
SN - 0941-0643
VL - 34
SP - 16311
EP - 16324
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 19
ER -