TY - JOUR
T1 - Learning Chinese word embeddings from semantic and phonetic components
AU - Wang, Fu Lee
AU - Lu, Yuyin
AU - Cheng, Gary
AU - Xie, Haoran
AU - Rao, Yanghui
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2022/12
Y1 - 2022/12
N2 - As an important task in Asian language information processing, Chinese word embedding learning has attracted much attention recently. Based on either Skip-gram or CBOW, several methods have been proposed to exploit Chinese characters and sub-character components for learning Chinese word embeddings. Chinese characters are combinations of meaning, structure, and phonetic information (pinyin). However, previous works cover only the first two aspects and cannot effectively explore the distinct semantics of characters. To address this issue, we develop a pinyin-enhanced Skip-gram model named rsp2vec, in addition to a radical and pinyin-enhanced Chinese word embedding (rPCWE) learning model based on CBOW. In our models, the phonetic information and semantic components of Chinese characters are encoded into embeddings simultaneously. Evaluations on word analogy reasoning, word relevance, text classification, named entity recognition, and case studies validate the effectiveness of our models.
AB - As an important task in Asian language information processing, Chinese word embedding learning has attracted much attention recently. Based on either Skip-gram or CBOW, several methods have been proposed to exploit Chinese characters and sub-character components for learning Chinese word embeddings. Chinese characters are combinations of meaning, structure, and phonetic information (pinyin). However, previous works cover only the first two aspects and cannot effectively explore the distinct semantics of characters. To address this issue, we develop a pinyin-enhanced Skip-gram model named rsp2vec, in addition to a radical and pinyin-enhanced Chinese word embedding (rPCWE) learning model based on CBOW. In our models, the phonetic information and semantic components of Chinese characters are encoded into embeddings simultaneously. Evaluations on word analogy reasoning, word relevance, text classification, named entity recognition, and case studies validate the effectiveness of our models.
KW - Chinese word embedding
KW - Phonetic information
KW - Semantic components
UR - http://www.scopus.com/inward/record.url?scp=85136984453&partnerID=8YFLogxK
U2 - 10.1007/s11042-022-13488-6
DO - 10.1007/s11042-022-13488-6
M3 - Article
AN - SCOPUS:85136984453
SN - 1380-7501
VL - 81
SP - 42805
EP - 42820
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 29
ER -