TY - GEN
T1 - Cantonese to Written Chinese Translation via HuggingFace Translation Pipeline
AU - Kwok, Raptor Yick Kan
AU - Au Yeung, Siu Kei
AU - Li, Zongxi
AU - Hung, Kevin
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/12/15
Y1 - 2023/12/15
N2 - Cantonese, a low-resource language [5] spoken in Southeastern China for hundreds of years with over 85 million native speakers worldwide, is poorly supported by the mainstream language models behind existing translation platforms such as Baidu, Google, and Bing. This paper presents a large parallel corpus of 130 thousand Cantonese and Written Chinese sentence pairs. The data are used to train a translation model using the translation pipeline of the Hugging Face Transformers architecture, a dominant architecture for natural language processing today [18]. Performance is evaluated using the BLEU score and manual assessment. The translation results achieve a BLEU score of 41.35 and a chrF++ score of 44.88 on the entire validation set. The model also works reasonably well with long sentences of over 20 Chinese characters, achieving a BLEU score of 48.61 and a chrF++ score of 39.87 on long sentences. These results are comparable with the existing Baidu Fanyi and Bing Translate. We also establish a Cantonese sentence evaluation metric under which professional translators classify the quality of the source Cantonese sentences. We then compare the BLEU and chrF++ scores with the corresponding evaluation scores and find that the better the quality of the source sentence, the higher the BLEU and chrF++ scores. Finally, we show that our corpus enables the Cantonese translation capability of the Chinese BART pre-trained model.
AB - Cantonese, a low-resource language [5] spoken in Southeastern China for hundreds of years with over 85 million native speakers worldwide, is poorly supported by the mainstream language models behind existing translation platforms such as Baidu, Google, and Bing. This paper presents a large parallel corpus of 130 thousand Cantonese and Written Chinese sentence pairs. The data are used to train a translation model using the translation pipeline of the Hugging Face Transformers architecture, a dominant architecture for natural language processing today [18]. Performance is evaluated using the BLEU score and manual assessment. The translation results achieve a BLEU score of 41.35 and a chrF++ score of 44.88 on the entire validation set. The model also works reasonably well with long sentences of over 20 Chinese characters, achieving a BLEU score of 48.61 and a chrF++ score of 39.87 on long sentences. These results are comparable with the existing Baidu Fanyi and Bing Translate. We also establish a Cantonese sentence evaluation metric under which professional translators classify the quality of the source Cantonese sentences. We then compare the BLEU and chrF++ scores with the corresponding evaluation scores and find that the better the quality of the source sentence, the higher the BLEU and chrF++ scores. Finally, we show that our corpus enables the Cantonese translation capability of the Chinese BART pre-trained model.
KW - Cantonese
KW - neural networks
KW - translation
KW - Written Chinese
UR - http://www.scopus.com/inward/record.url?scp=85187550528&partnerID=8YFLogxK
U2 - 10.1145/3639233.3639332
DO - 10.1145/3639233.3639332
M3 - Conference contribution
T3 - ACM International Conference Proceeding Series
SP - 77
EP - 84
BT - NLPIR 2023 - 2023 7th International Conference on Natural Language Processing and Information Retrieval
T2 - 7th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2023
Y2 - 15 December 2023 through 17 December 2023
ER -