Abstract
We used a production segmentation system that draws heavily on a large dictionary derived from processing over 150 million Chinese characters of synchronous textual data gathered from various Chinese speech communities, including Beijing, Hong Kong, and Taipei, among others. We ran this system in two tracks of the Second International Chinese Word Segmentation Bakeoff, with Backward Maximal Matching (right-to-left) as the primary mechanism. We also explored a number of supplementary features offered by the large dictionary in post-processing, in an attempt to resolve ambiguities and detect unknown words. While the results might not have reached their fullest potential, they nevertheless reinforced the importance and usefulness of a large dictionary as a basis for segmentation, and the implications of following a uniform standard for segmentation performance on data from various sources.
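Backward Maximal Matching itself is a simple greedy procedure: starting from the right end of the input, repeatedly take the longest dictionary word that ends at the current position. The following is a minimal Python sketch under that description, not the paper's actual implementation; `bmm_segment`, `MAX_WORD_LEN`, and the toy dictionary are illustrative assumptions.

```python
# Minimal sketch of Backward Maximal Matching (BMM) segmentation.
# MAX_WORD_LEN and the dictionary contents are illustrative placeholders,
# not the resources used by the paper's production system.

MAX_WORD_LEN = 7  # assumed upper bound on dictionary word length


def bmm_segment(text: str, dictionary: set[str]) -> list[str]:
    """Segment `text` right-to-left, greedily taking the longest
    dictionary word that ends at the current position."""
    words = []
    end = len(text)
    while end > 0:
        match = None
        # Try the longest candidate first, shrinking toward one character.
        for length in range(min(MAX_WORD_LEN, end), 0, -1):
            candidate = text[end - length:end]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # Fall back to a single character for out-of-vocabulary input.
            match = text[end - 1:end]
        words.append(match)
        end -= len(match)
    words.reverse()  # collected right-to-left, so restore reading order
    return words


if __name__ == "__main__":
    toy_dict = {"中国", "人民", "中国人", "民"}  # toy dictionary
    print(bmm_segment("中国人民", toy_dict))  # ['中国', '人民']
```

The right-to-left direction matters on the toy example: a forward (left-to-right) maximal match would greedily take 中国人 first and be left with 民, whereas the backward pass recovers 中国 / 人民.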
| Original language | English |
|---|---|
| Pages | 176-179 |
| Number of pages | 4 |
| Publication status | Published - 2005 |
| Externally published | Yes |
| Event | 4th SIGHAN Workshop on Chinese Language Processing at the 2nd International Joint Conference on Natural Language Processing, SIGHAN@IJCNLP 2005, Jeju Island, Korea, Republic of. Duration: 14 Oct 2005 → 15 Oct 2005 |
Conference
| Conference | 4th SIGHAN Workshop on Chinese Language Processing at the 2nd International Joint Conference on Natural Language Processing, SIGHAN@IJCNLP 2005 |
|---|---|
| Country/Territory | Korea, Republic of |
| City | Jeju Island |
| Period | 14/10/05 → 15/10/05 |