Maximal match Chinese segmentation augmented by resources generated from a very large dictionary for post-processing

Ka Po Chow, Andy C. Chin, Wing Fu Tsoi

Research output: Contribution to conference › Paper › peer-review

1 Citation (Scopus)

Abstract

We used a production segmentation system that draws heavily on a large dictionary derived from processing a large amount (over 150 million Chinese characters) of synchronous textual data gathered from various Chinese speech communities, including Beijing, Hong Kong, Taipei, and others. We ran this system in two tracks of the Second International Chinese Word Segmentation Bakeoff, with Backward Maximal Matching (right-to-left) as the primary mechanism. We also explored the use of a number of supplementary features offered by the large dictionary in post-processing, in an attempt to resolve ambiguities and detect unknown words. While the results might not have reached their fullest potential, they nevertheless reinforced the importance and usefulness of a large dictionary as a basis for segmentation, and showed the implications of following a uniform standard for segmentation performance on data from various sources.
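To illustrate the Backward Maximal Matching mechanism named in the abstract, here is a minimal sketch in Python. It is not the authors' production system: the function name `backward_max_match`, the `max_len` parameter, the single-character fallback for unmatched text, and the toy dictionary are all assumptions made for illustration, and none of the dictionary-based post-processing is shown.

```python
def backward_max_match(text, dictionary, max_len=None):
    """Segment `text` right-to-left, taking the longest dictionary word each time.

    `dictionary` is a set of known words (hypothetical toy data here).
    Characters that match no dictionary word are emitted as single tokens,
    which is an assumption of this sketch, not a detail from the paper.
    """
    if max_len is None:
        # Longest word length bounds the candidate window.
        max_len = max((len(word) for word in dictionary), default=1)

    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end`, then shrink the window.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break

    tokens.reverse()  # tokens were collected from right to left
    return tokens


toy_dictionary = {"研究", "研究生", "生命", "命", "起源"}
print(backward_max_match("研究生命起源", toy_dictionary))
# → ['研究', '生命', '起源']
```

Scanning from the right picks 起源 and then 生命, leaving 研究. A left-to-right scan over the same toy dictionary would instead take 研究生 first and leave 命 stranded, which is the kind of ambiguity that the scan direction affects.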

Original language: English
Pages: 176-179
Number of pages: 4
Publication status: Published - 2005
Externally published: Yes
Event: 4th SIGHAN Workshop on Chinese Language Processing at the 2nd International Joint Conference on Natural Language Processing, SIGHAN@IJCNLP 2005 - Jeju Island, Korea, Republic of
Duration: 14 Oct 2005 to 15 Oct 2005
