corpus

[R] [Corpus]
R.package

Prepare the raw data
Store it as a "corpus data frame object" with corpus_frame()
Tokenize with text_tokens()
text_filter()
text_ntoken()
text_ntype()
text_nsentence()
text_stats()
term_stats()
KWIC search with text_locate()
Random sampling of matches with text_sample()

https://cran.r-project.org/web/packages/corpus/index.html

Prepare the raw data

Store it as a "corpus data frame object" with corpus_frame()

title	text
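
As a minimal sketch of this step, the following builds a small corpus data frame with the title and text columns shown above. The two documents are invented for illustration and are not from the original notes.

```r
library(corpus)

# Hypothetical two-document corpus (titles and sentences are invented).
data <- corpus_frame(
  title = c("doc1", "doc2"),
  text  = c("Dorothy lived in Kansas.",
            "The road was paved with yellow brick.")
)

# The result is a data frame whose text column is stored as corpus text.
print(is_corpus_frame(data))
print(is_corpus_text(data$text))
```

The later functions (text_tokens(), text_stats(), term_stats() and so on) accept such a data frame directly and operate on its text column.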

Tokenize with text_tokens()
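
A small sketch of tokenization, assuming the default text filter (which lowercases tokens and keeps punctuation as separate tokens). The sentence is an invented example.

```r
library(corpus)

# text_tokens() returns a list with one character vector per input text.
tokens <- text_tokens("Dorothy lived in Kansas.")

# Tokens of the first (and only) text.
print(tokens[[1]])
```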

text_filter()

Its options control various kinds of text normalization (case mapping, dropping punctuation, stemming, and so on).
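
For illustration, here is a sketch that uses two of the filter options, map_case and drop_punct. The choice of options and the sentence are assumptions made for this example; dropped tokens show up as NA in the output of text_tokens().

```r
library(corpus)

# Keep the original case and drop punctuation tokens.
f <- text_filter(map_case = FALSE, drop_punct = TRUE)

tokens <- text_tokens("Dorothy lived in Kansas.", filter = f)

# The final period is dropped and therefore reported as NA.
print(tokens[[1]])
```

The same filter object can be passed to the other functions through their filter argument, so the normalization only has to be defined once.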

text_ntoken()

Number of tokens.

text_ntype()

Number of types.
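
To make the difference between tokens and types concrete, the sketch below counts both for one invented sentence with the default filter. Repeated words ("the", "cat") count once as types but every time as tokens.

```r
library(corpus)

x <- "The cat saw the other cat."

n_token <- text_ntoken(x)  # running words, including the final period
n_type  <- text_ntype(x)   # distinct normalized forms

print(n_token)
print(n_type)
```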

text_nsentence()

Number of sentences.

text_stats()

Computes the three counts above (tokens, types, sentences) in a single call.
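
A brief sketch of the combined call on an invented two-document corpus. The result is a data frame with one row per text, and its columns should agree with the individual counting functions.

```r
library(corpus)

# Hypothetical corpus; the second document contains two sentences.
data <- corpus_frame(
  title = c("doc1", "doc2"),
  text  = c("Dorothy lived in Kansas.",
            "The road was long. It was paved with yellow brick.")
)

stats <- text_stats(data)
print(stats)

# Cross-check the token column against text_ntoken().
print(all(stats$tokens == text_ntoken(data)))
```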

term_stats()

Reports, for each term, how often it occurs (count) and how many texts (sub-corpora) in the corpus data contain it (support).

term_stats(data)
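
As a worked example of the count and support columns, the sketch below uses three invented texts. The term "the" occurs three times but in only two of the texts.

```r
library(corpus)

# Hypothetical three-text corpus.
data <- corpus_frame(
  title = c("a", "b", "c"),
  text  = c("the cat sat", "the cat and the dog", "a bird")
)

stats <- term_stats(data)

# Row for the term "the": count is the total number of occurrences,
# support is the number of texts that contain the term.
the_row <- stats[stats$term == "the", ]
print(the_row)
```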

With the ngrams option, the same statistics are computed for n-grams.

term_stats(data, ngrams = 5)

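N-grams that contain a specific word can be extracted in the same way.
- A range of n-gram lengths can be specified
- The position of the word within the n-gram can be specified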

term_stats(data, ngrams = 2:3, types = TRUE,
          subset = type1 == "dorothy" & !type2 %in% stopwords_en)
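
The call above can be tried on a toy corpus. The sketch below is a simplified variant restricted to bigrams, with two invented texts; it keeps bigrams whose first type is "dorothy" and whose second type is not an English stop word.

```r
library(corpus)

# Hypothetical corpus for illustration only.
data <- corpus_frame(
  text = c("Dorothy looked around. Dorothy looked at the road.",
           "Dorothy and Toto walked on.")
)

# types = TRUE adds type1, type2, ... columns that subset can refer to.
stats <- term_stats(data, ngrams = 2, types = TRUE,
                    subset = type1 == "dorothy" & !type2 %in% stopwords_en)
print(stats)
```

Here "dorothy and" is filtered out because "and" is in stopwords_en, while "dorothy looked" is kept.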

KWIC search with text_locate()

The stemmer option enables stemming.
Multiple keywords can be specified.
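
A sketch combining both points, assuming the English Snowball stemmer (stemmer = "en") and an invented three-text corpus. With stemming, the keyword "walk" should also match "walks", "walked" and "walking".

```r
library(corpus)

# Hypothetical corpus for illustration only.
data <- corpus_frame(
  text = c("Dorothy walks home.",
           "She walked and kept walking.",
           "Toto barked.")
)

# Two keywords at once; each result row shows the match with its context.
hits <- text_locate(data, c("walk", "bark"), stemmer = "en")
print(hits)
```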

Random sampling of matches with text_sample()
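
A minimal sketch, assuming an invented corpus with three occurrences of the keyword. text_sample() takes the same kind of search terms as text_locate() and returns a random sample of the matches; size limits how many are returned.

```r
library(corpus)

# Hypothetical corpus for illustration only.
data <- corpus_frame(
  text = c("Dorothy lived in Kansas.",
           "Dorothy looked at the road.",
           "Toto followed Dorothy.")
)

set.seed(1)  # make the random sample reproducible
sampled <- text_sample(data, "dorothy", size = 2)
print(sampled)
```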

https://sugiura-ken.org/wiki/
