Stop word list construction and application in Chinese language processing

Feng Zou, Fu Lee Wang, Xiaotie Deng, Song Han, Lu Sheng Wang

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

In modern information retrieval systems, indexing can be made more effective by removing stop words. Many stop word lists have been developed for the English language, but no standard stop word list has yet been constructed for Chinese. With the rapid development of Chinese information retrieval, building Chinese stop word lists has become critical. In this paper, to save time and relieve the burden of manual stop word selection, we propose an automatic aggregated methodology, based on statistical and information models, for extracting a stop word list for Chinese. Result analysis shows that our stop list is comparable with a general English stop word list and is much more general than other Chinese stop lists. Extensive experiments on Chinese segmentation were conducted to investigate the effectiveness of the extracted stop word list. The results show that our stop word list can significantly improve the accuracy of Chinese segmentation. Our stop word extraction algorithm is a promising technique that saves the time needed for manual generation and yields a standard list. It could be applied to other languages in the future.
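The abstract does not spell out the statistics or the aggregation rule, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes the input is already segmented into tokens, uses mean within-document probability as a stand-in for the statistical model, uses the entropy of a word's spread across documents as a stand-in for the information model, and combines the two rankings by summing rank positions. The function name `stop_word_candidates` and the parameter `top_k` are hypothetical.

```python
import math
from collections import Counter


def stop_word_candidates(docs, top_k=10):
    """Rank stop word candidates from segmented documents.

    docs: list of token lists. All scoring choices here are assumptions
    made for illustration; the paper's actual models may differ.
    """
    doc_counts = [Counter(d) for d in docs if d]
    if not doc_counts:
        return []
    vocab = set().union(*doc_counts)
    n_docs = len(doc_counts)
    doc_sizes = [sum(c.values()) for c in doc_counts]

    mean_prob = {}  # assumed statistical score
    entropy = {}    # assumed information score
    for w in vocab:
        # Average relative frequency of w over all documents.
        probs = [c[w] / size for c, size in zip(doc_counts, doc_sizes)]
        mean_prob[w] = sum(probs) / n_docs
        # Entropy of how w's occurrences are distributed over documents:
        # high when w appears evenly everywhere.
        total = sum(c[w] for c in doc_counts)
        h = 0.0
        for c in doc_counts:
            if c[w]:
                p = c[w] / total
                h -= p * math.log(p)
        entropy[w] = h

    def ranks(score):
        # Position 0 is the highest score; ties are broken alphabetically.
        ordered = sorted(score, key=lambda w: (-score[w], w))
        return {w: i for i, w in enumerate(ordered)}

    r_stat, r_info = ranks(mean_prob), ranks(entropy)
    # Aggregate by summing rank positions (a simple Borda-style rule).
    combined = sorted(vocab, key=lambda w: (r_stat[w] + r_info[w], w))
    return combined[:top_k]


if __name__ == "__main__":
    sample = [
        ["的", "猫", "的", "狗"],
        ["的", "书", "的", "是"],
        ["的", "是", "树", "的"],
    ]
    print(stop_word_candidates(sample, top_k=2))
```

In this toy input, a word that is both frequent and evenly spread across documents rises to the top of the combined ranking, which is the intuition behind treating it as a stop word.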

Original language: English
Pages (from-to): 1036-1044
Number of pages: 9
Journal: WSEAS Transactions on Information Science and Applications
Volume: 3
Issue number: 6
Publication status: Published - Jun 2006
Externally published: Yes

Keywords

  • Information theory
  • Statistical modeling
  • Stop word
