Education First - Cambridge Open Language Database

{{category Corpus}}
!!!EFCAMDAT
https://philarion.mml.cam.ac.uk/
{{outline}}
----
!!Overview
!Size
*About 83 million words
*About 1 million essays (scripts)
*About 174,000 learners
*CEFR A1-C2
!Annotations and metadata
*Errors
*Parts of speech
*Grammatical dependency relations
*Nationality
!User manual
https://corpus.mml.cam.ac.uk/faq/EFCamDat-Intro_release2.pdf
!!Usage
*User registration is all that is required. It is free.
!Selecting data
{{pre
The current selection contains 1180309 scripts (±83543589 words) from:
 - 174743 learners
 - All nationalities
 - All unit(s) from level(s): all levels
}}
*The basic unit is the "script".
*Teaching levels and units
**First choose from levels 1 to 16,
**then choose the units (themes) within those levels.
*Learner nationalities
**First choose an area,
**then choose a country.
!Specifying a search pattern (if you skip this step, you download all of the data)
*[word="word"]
*[pos="part of speech"]
*[lemma="lemma"]
*Several items can also be specified in sequence.
**[word="the"][pos="N"][word="of"]
!Downloading data
{{pre
Segment of interest:
  Whole scripts matching your criteria
  Sentences matching your criteria
Information included:
  Raw script text
  Syntactic annotations
  Error corrections
Export format:
  XML compressed (zipped)
  XML uncompressed
}}
!!Example
*Scripts written by Japanese learners: 21,374 (1,602,328 words)
*3,441 learners
*126 units, drawn from all levels
*Downloaded in XML format with Raw script text only
**3.5 MB compressed
**13 MB after decompression
!!XML format
{{pre
Education First - Cambridge Open Language Database
EFCamDat_2.0 (EF201403)
https://philarion.mml.cam.ac.uk/efcamdat/
two letters
matching unit numbers, two digits each, comma-separated
essay topic
date and time, down to the millisecond
two-digit number
essay body
Each essay forms one unit, a writing element.
}}
!Correcting the data
*One tagged part of the file is needed only when there are multiple selections; otherwise it is unnecessary.
**If unnecessary elements are left in, an error occurs.
*Delete that tag portion.
!Attributes and elements
*Things are simple when the content of an item is written as a tag name plus an element.
*However, the content (for example, a value such as 25) can also be written as an attribute of the tag.
*EFCAMDAT data is written using both methods.
**This makes conversion with xmlconvert more laborious.
!Script for converting to a data frame: xml2dfcamdat()
{{pre
library(xmlconvert)

xml2dfcamdat <- function(a){
  # 2021-10-26 copyleft sugiura@nagoya-u.jp
  # Attributes of each writing, learner, and topic tag
  writing.df <- xml_to_df(a, records.tag = "writing", fields = "attributes")
  learner.df <- xml_to_df(a, records.tag = "learner", fields = "attributes")
  topic.df   <- xml_to_df(a, records.tag = "topic", fields = "attributes")
  # Child elements (tag contents) of each writing
  tags.df    <- xml_to_df(a, records.tag = "writing", fields = "tags")
  # Combine the four data frames side by side
  tmp.df <- cbind(writing.df, learner.df, topic.df, tags.df)
  colnames(tmp.df) <- c("writing_ID", "level", "unit", "learner_ID",
                        "nationality", "topic_ID", "learner", "topic",
                        "date", "grade", "text")
  # Drop column 7 ("learner") and return the rest
  tmp.df[, -7]
}
}}
!!Sources
Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.

Geertzen, J., Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). Selected Proceedings of the 31st Second Language Research Forum (SLRF), Cascadilla Press, MA.