A large synchronous corpus as monitoring corpus: Some comparative content analysis of Chinese and Japanese language developments

Benjamin K. Tsou, Andy C. Chin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Appropriate and large corpora are uncommon, but they provide important resources for a wide range of efforts in natural language processing, from contextualized or localized speech and text input to automatic patent translation. They also provide lesser-known but rich resources for human and automatic content analysis, such as sentiment analysis of texts and product reviews. Furthermore, they can function as a monitoring corpus and enhance the human-centered communication environment by allowing more substantive introspection and comparison of the content, rather than the linguistic form, of communication. This paper discusses the methodological background of LIVAC, a very large and unique synchronous corpus of Chinese, which regularly and synchronously samples news media texts from 6 major Chinese cities and occasionally from Japan. Over 16 continuous years, it has processed and analyzed more than 400 million characters of Chinese news media texts and culled more than 1.5 million basic lexical entries, together with useful information on their basic linguistic and usage characteristics. We attempt to capitalize on its synchronous nature and homothematic content, and use an innovative Windows approach to explore its use as a monitoring corpus by tracking salient cultural items and carrying out meaningful content analysis of them. These items include content-rich words such as BAR and VEHICLE and their differential derivative development and usage within windows of different sizes, up to 10 years apart. It will be shown that comparative analysis of the contents of the windows yields salient information on possible changes in the relative cultural orientations of, and mutual influences among, the Chinese communities, and between Chinese and Japanese societies, and that this innovative analysis has been made possible by using the LIVAC synchronous corpus as a monitoring corpus.

Original language: English
Title of host publication: 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
Pages: 90-96
Number of pages: 7
Publication status: Published - 2010
Externally published: Yes
Event: 2010 4th International Universal Communication Symposium, IUCS 2010 - Beijing, China
Duration: 18 Oct 2010 to 19 Oct 2010

Publication series

Name: 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings

Conference

Conference: 2010 4th International Universal Communication Symposium, IUCS 2010
Country/Territory: China
City: Beijing
Period: 18/10/10 to 19/10/10

Keywords

  • Chinese
  • Homothematic corpus
  • Japanese
  • Linguistic and social variation
  • Monitoring corpus
  • Synchronous corpus
