TY - JOUR
T1 - A learner corpus is born this way
T2 - From raw data to processed dataset
AU - Leung, Chung Hong Danny
AU - Chow, Mei Yung Vanliza
AU - Ge, Haoyan
N1 - Publisher Copyright: © 2022 The Authors
PY - 2022/10
Y1 - 2022/10
N2 - This data article presents the development of a learner corpus (i.e. a systematic, computerized, web-based repository of written texts produced by language learners), from the initial phase, in which written assignments were collected from language learners as raw data, to the critical phases, in which the processed text data and meta data were aligned and transformed for the web interface of the corpus. The resulting corpus, the CELL (Chinese and English Learner Language) Corpus, comprises: i) text data containing 4.2 million English words and 18 million Chinese characters; and ii) meta data including the demographic information of the participants whose text data were collected. The article first outlines the steps for collecting the text data and meta data and then explains the processes for cleaning, annotating and tagging the text data. It also discusses the problems the research team encountered in segmenting the Chinese text data and in checking the accuracy of the processed datasets. The CELL Corpus comes with concordance and word list features, which enable language teachers and researchers to investigate the frequency, accuracy and complexity of vocabulary use in learner language. The steps and processes reported in this article can inform the future development of learner corpora in other languages.
AB - This data article presents the development of a learner corpus (i.e. a systematic, computerized, web-based repository of written texts produced by language learners), from the initial phase, in which written assignments were collected from language learners as raw data, to the critical phases, in which the processed text data and meta data were aligned and transformed for the web interface of the corpus. The resulting corpus, the CELL (Chinese and English Learner Language) Corpus, comprises: i) text data containing 4.2 million English words and 18 million Chinese characters; and ii) meta data including the demographic information of the participants whose text data were collected. The article first outlines the steps for collecting the text data and meta data and then explains the processes for cleaning, annotating and tagging the text data. It also discusses the problems the research team encountered in segmenting the Chinese text data and in checking the accuracy of the processed datasets. The CELL Corpus comes with concordance and word list features, which enable language teachers and researchers to investigate the frequency, accuracy and complexity of vocabulary use in learner language. The steps and processes reported in this article can inform the future development of learner corpora in other languages.
KW - Data processing
KW - Learner language corpus
KW - Meta data
KW - Natural language toolkit
KW - Written data
KW - ‘Regular expression’ text processing technique
UR - http://www.scopus.com/inward/record.url?scp=85136460023&partnerID=8YFLogxK
U2 - 10.1016/j.dib.2022.108527
DO - 10.1016/j.dib.2022.108527
M3 - Article
AN - SCOPUS:85136460023
VL - 44
JO - Data in Brief
JF - Data in Brief
M1 - 108527
ER -