EFCAMDAT

https://philarion.mml.cam.ac.uk/

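EFCAMDAT

Overview

Size
Additional information
User manual

Usage

Selecting data
Specifying a search pattern (without one, the entire data set is downloaded)
Downloading data

Example
XML format

Editing the data
Attributes and elements
Script for converting to a data frame: xml2dfcamdat()

References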

Using the corpus requires only user registration, which is free.

Selecting data

The current selection contains 1180309 scripts (±83543589 words) from:
- 174743 learners
- All nationalities
- All unit(s) from level(s): all levels

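Data are counted in units called "scripts".

Teaching levels and units
- First choose from levels 1 to 16,
- then choose the units (themes) contained in those levels.

Learner nationalities
- First choose a region,
- then choose a country.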

Specifying a search pattern (without one, the entire data set is downloaded)

[word="word form"]
[pos="part of speech"]
[lemma="lemma"]

Several items can also be specified in sequence:
- [word="the"][pos="N"][word="of"]
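
A pattern like the one above is simply a run of bracketed attribute="value" tokens written one after another. As a minimal sketch, the R helper below assembles such a string so that longer patterns can be built without typing mistakes. query_token() is a hypothetical name introduced here for illustration; it is not part of the EFCAMDAT site, and the result still has to be pasted into the search box by hand.

```r
# Hypothetical helper: format one token as [attribute="value"]
query_token <- function(attribute, value) {
  sprintf('[%s="%s"]', attribute, value)
}

# Tokens written back to back form a sequence pattern
pattern <- paste0(
  query_token("word", "the"),
  query_token("pos", "N"),
  query_token("word", "of")
)

cat(pattern, "\n")
```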

Downloading data

Segment of interest: 	
	Whole scripts matching your criteria
	Sentences matching your criteria

Information included: 	
	Raw script text
	Syntactic annotations
	Error corrections

Export format: 	
	XML compressed (zipped)
	XML uncompressed

Example

Scripts written by Japanese learners: 21,374 (1,602,328 words)
3,441 learners
126 units, drawn from all levels
Downloaded as raw script text only, in XML format
- 3.5 MB compressed
- 13 MB uncompressed
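
To get a feel for these figures, the averages can be worked out directly. The short R calculation below uses only the numbers listed above; the variable names are chosen here for illustration.

```r
# Figures for the Japanese subset, as listed above
scripts  <- 21374
words    <- 1602328
learners <- 3441

words_per_script    <- words / scripts
scripts_per_learner <- scripts / learners

round(words_per_script, 1)     # about 75 words per script
round(scripts_per_learner, 1)  # about 6.2 scripts per learner
```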

XML format

<?xml version="1.0" encoding="UTF-8"?>

<selection id="32 alphanumeric characters">

  <meta>
    <title>Education First - Cambridge Open Language Database</title>
    <version>EFCamDat_2.0 (EF201403)</version>
    <url>https://philarion.mml.cam.ac.uk/efcamdat/</url>
    <nationalities>two-letter code</nationalities>
    <units>matching unit numbers, two digits each, comma-separated</units>
  </meta>

  <writings>

    <writing id="4-digit number" level="level number (1 to 16)" unit="1-digit number">
      <learner id="6-digit number" nationality="two-letter code"/>
      <topic id="2-digit number">essay topic</topic>
      <date>date and time, to the millisecond</date>
      <grade>2-digit number</grade>
      <text>
        essay text
      </text>
    </writing>

    <!-- Each essay is one <writing> element -->

  </writings>

</selection>

Editing the data

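The <selection id="32 alphanumeric characters"> part is unnecessary unless there are multiple selections.
- Leaving an unnecessary element in place causes an error
Delete the <selection> tags.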
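
This edit can be made in any text editor. As a sketch of doing it in R instead, the hypothetical helper strip_selection() below drops the opening and closing <selection> tag lines and leaves every other line untouched. It assumes each tag sits on its own line, and the sample lines and file names are made up; it has not been checked against a full download, so confirm that the result still loads before relying on it.

```r
# Hypothetical helper: remove the <selection ...> and </selection> lines
strip_selection <- function(lines) {
  is_selection_tag <- grepl("^[[:space:]]*</?selection[ >]", lines)
  lines[!is_selection_tag]
}

# Made-up sample standing in for a downloaded file
xml_lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<selection id="0123456789abcdef0123456789abcdef">',
  '  <writings>',
  '  </writings>',
  '</selection>'
)

cleaned <- strip_selection(xml_lines)
print(cleaned)

# With real files (names are placeholders):
# writeLines(strip_selection(readLines("efcamdat.xml", encoding = "UTF-8")),
#            "efcamdat_clean.xml")
```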

Attributes and elements

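If each item is written as a tag name plus element content, things are simple.
However, the "content" can also be written as an attribute of the tag.

<learner>25</learner>
<learner id="25"/>
<learner id="25" age="22"/>

EFCAMDAT writes its data in both styles.
- This makes conversion with xmlconvert more laborious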
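
The difference matters when reading the data, because element content and attribute values are retrieved in different ways. The small illustration below applies the xml2 package to the three forms shown above. Using xml2 is an assumption made here for clarity (the script on this page uses xmlconvert), and the <learners> wrapper is added only so that the sample has a single root.

```r
library(xml2)

# The three ways of recording a learner, wrapped in one root element
doc <- read_xml('<learners>
  <learner>25</learner>
  <learner id="25"/>
  <learner id="25" age="22"/>
</learners>')

nodes <- xml_find_all(doc, "//learner")

xml_text(nodes)         # element content:  "25" ""   ""
xml_attr(nodes, "id")   # attribute "id":   NA   "25" "25"
xml_attr(nodes, "age")  # attribute "age":  NA   NA   "22"
```

An empty element yields an empty string as content, and a missing attribute yields NA, so each style has to be read with the matching function.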

Script for converting to a data frame: xml2dfcamdat()

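library(xmlconvert)

# Convert an EFCAMDAT XML file into one data frame, one row per <writing>.
# a: path to the XML file
xml2dfcamdat <- function(a){
    # 2021-10-26 copyleft sugiura@nagoya-u.jp
    # Attributes of <writing>: id, level, unit
    writing.df <- xml_to_df(a,
                            records.tag = "writing", fields = "attributes")
    # Attributes of <learner>: id, nationality
    learner.df <- xml_to_df(a,
                            records.tag = "learner", fields = "attributes")
    # Attribute of <topic>: id
    topic.df <- xml_to_df(a,
                          records.tag = "topic", fields = "attributes")
    # Child elements of <writing>: learner, topic, date, grade, text
    tags.df <- xml_to_df(a,
                         records.tag = "writing", fields = "tags")

    # Each data frame has one row per writing, so bind them side by side
    tmp.df <- cbind(writing.df, learner.df, topic.df, tags.df)
    colnames(tmp.df) <- c("writing_ID", "level", "unit", "learner_ID", "nationality", "topic_ID", "learner", "topic", "date", "grade", "text")
    # Drop column 7 ("learner"): the element is empty, its data are in attributes
    tmp.df[, -7]
}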
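
For comparison, here is a sketch of the same conversion written with the xml2 package instead of xmlconvert. This is a swapped-in approach, not the script above: efcamdat_to_df() is a hypothetical name, and the two sample writings are made up to follow the format described on this page. Attributes and element content are read node by node, so no empty "learner" column has to be dropped afterwards.

```r
library(xml2)

# Hypothetical alternative to xml2dfcamdat(): one row per <writing>
efcamdat_to_df <- function(doc) {
  w       <- xml_find_all(doc, "//writing")
  learner <- xml_find_first(w, "./learner")
  topic   <- xml_find_first(w, "./topic")
  data.frame(
    writing_ID  = xml_attr(w, "id"),
    level       = as.integer(xml_attr(w, "level")),
    unit        = as.integer(xml_attr(w, "unit")),
    learner_ID  = xml_attr(learner, "id"),
    nationality = xml_attr(learner, "nationality"),
    topic_ID    = xml_attr(topic, "id"),
    topic       = xml_text(topic),
    date        = xml_text(xml_find_first(w, "./date")),
    grade       = as.integer(xml_text(xml_find_first(w, "./grade"))),
    text        = trimws(xml_text(xml_find_first(w, "./text"))),
    stringsAsFactors = FALSE
  )
}

# Made-up sample data in the documented layout
sample_doc <- read_xml('<writings>
  <writing id="1001" level="1" unit="2">
    <learner id="100001" nationality="jp"/>
    <topic id="10">Introducing yourself</topic>
    <date>2012-01-01 10:00:00.000</date>
    <grade>90</grade>
    <text>
      Hello, my name is Ken.
    </text>
  </writing>
  <writing id="1002" level="2" unit="1">
    <learner id="100002" nationality="jp"/>
    <topic id="11">Describing your family</topic>
    <date>2012-01-02 11:30:00.000</date>
    <grade>85</grade>
    <text>
      I have two sisters.
    </text>
  </writing>
</writings>')

essays <- efcamdat_to_df(sample_doc)
str(essays)
```

For a real file, read_xml() takes the file path in place of the literal string.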

References

Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.

Geertzen, J., Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). Selected Proceedings of the 31st Second Language Research Forum (SLRF), Cascadilla Press, MA.
