
quanteda2

Examples using quanteda

install.packages("quanteda", dependencies = T)
library(quanteda)

Examples using quanteda

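Sample: 78 learner files from the learner corpus NICEST 1.0
Reading texts with readtext() from the readtext package
Creating a corpus: corpus()
Subsetting a corpus by condition: corpus_subset(corpus, condition)
Getting an overview of a corpus: summary()
Adding document variables (attributes): docvars(corpus, variable)
Viewing the contents: texts()
KWIC search: kwic(corpus, pattern="string")
Tokenizing: tokens()
Creating a document-feature matrix (DFM): dfm()
Viewing the document-feature matrix: View()
Getting an overview of the document-feature matrix: summary(dfm)
Word frequency list: topfeatures()
Creating a word cloud: textplot_wordcloud()
Grouping several words together and counting the frequency of each group
Sorting the document-feature matrix into a word frequency list: dfm_sort()
Combining corpora: just use +
Selecting words characteristic of a document: keyness
Document similarity: textstat_simil(dfm)
Distance between documents: textstat_dist(dfm)
Observing lexical dispersion: textplot_xray()
Random sampling: corpus_sample(corpus, size=n, replace = FALSE)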

Sample: 78 learner files from the learner corpus NICEST 1.0

The folder contains multiple text files, each holding plain text only.
With the working directory set to that folder:

> getwd()
[1] "C:/Users/ /Documents/NICEST-samples//NICEST_JP10"
> list.files()
 [1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt" "JAN0001_P6B.txt" "JAN0001_P7B.txt"
 [8] "JAN0001_P8B.txt" "JAN0002_P1A.txt" "JAN0002_P2A.txt" "JAN0002_P3A.txt" "JAN0002_P5A.txt" "JAN0002_P6A.txt" "JAN0002_P8A.txt"
[15] "JAN0003_P1B.txt" "JAN0003_P2B.txt" "JAN0003_P3B.txt" "JAN0003_P4B.txt" "JAN0003_P5B.txt" "JAN0003_P6B.txt" "JAN0003_P7B.txt"
[22] "JAN0003_P8B.txt" "JAN0004_P1B.txt" "JAN0004_P2B.txt" "JAN0004_P3B.txt" "JAN0004_P4B.txt" "JAN0004_P5B.txt" "JAN0004_P6B.txt"
[29] "JAN0004_P7B.txt" "JAN0004_P8B.txt" "JAN0005_P1B.txt" "JAN0005_P2B.txt" "JAN0005_P3B.txt" "JAN0005_P4B.txt" "JAN0005_P5B.txt"
[36] "JAN0005_P6B.txt" "JAN0005_P7B.txt" "JAN0005_P8B.txt" "JAN0006_P1A.txt" "JAN0006_P2A.txt" "JAN0006_P3A.txt" "JAN0006_P4A.txt"
[43] "JAN0006_P5A.txt" "JAN0006_P6A.txt" "JAN0006_P7A.txt" "JAN0006_P8A.txt" "JAN0007_P1B.txt" "JAN0007_P2B.txt" "JAN0007_P3B.txt"
[50] "JAN0007_P4B.txt" "JAN0007_P5B.txt" "JAN0007_P6B.txt" "JAN0007_P7B.txt" "JAN0007_P8B.txt" "JAN0008_P1B.txt" "JAN0008_P2B.txt"
[57] "JAN0008_P3B.txt" "JAN0008_P4B.txt" "JAN0008_P5B.txt" "JAN0008_P6B.txt" "JAN0008_P7B.txt" "JAN0008_P8B.txt" "JAN0009_P1B.txt"
[64] "JAN0009_P2B.txt" "JAN0009_P3B.txt" "JAN0009_P4B.txt" "JAN0009_P5B.txt" "JAN0009_P6B.txt" "JAN0009_P7B.txt" "JAN0009_P8B.txt"
[71] "JAN0010_P1B.txt" "JAN0010_P2B.txt" "JAN0010_P3B.txt" "JAN0010_P4B.txt" "JAN0010_P5B.txt" "JAN0010_P6B.txt" "JAN0010_P7B.txt"
[78] "JAN0010_P8B.txt"

Reading texts with readtext() from the readtext package

install.packages("readtext", dependencies = T)
library(readtext)

nicest.tmp <- readtext("*.txt")
> nicest.tmp
readtext object consisting of 78 documents and 0 docvars.
# Description: df[,2] [78 x 2]
  doc_id          text               
  <chr>           <chr>              
1 JAN0001_P1B.txt "\"Some peopl\"..."
2 JAN0001_P2B.txt "\"You may th\"..."
3 JAN0001_P3B.txt "\"Compared w\"..."
4 JAN0001_P4B.txt "\"You may ha\"..."
5 JAN0001_P5B.txt "\"Elderly pe\"..."
6 JAN0001_P6B.txt "\"Group tour\"..."
# ... with 72 more rows

Creating a corpus: corpus()

nicestJ1 <- corpus(nicest.tmp)
nicestJ1
Corpus consisting of 78 documents and 0 docvars.
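To try corpus() without the NICEST files at hand, a small in-memory character vector can stand in for the output of readtext(). The following is only a minimal sketch: the two texts and the names toy_texts and toy_corpus are made up for illustration.

```r
library(quanteda)

# Hypothetical toy texts; the vector names play the role of the file names
toy_texts <- c(
  "doc1.txt" = "Some people say that knowledge is not important. However, I disagree.",
  "doc2.txt" = "Young people use computers. But they do not read many books."
)

toy_corpus <- corpus(toy_texts)  # the names become the document names
ndoc(toy_corpus)                 # number of documents
docnames(toy_corpus)             # document names
```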

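Adding or modifying document variables: docvars()
Adding or modifying document metadata: metadoc() (note-like information that is not itself a target of analysis)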

reference

https://github.com/koheiw/workshop-IJTA/blob/master/documents/corpus.md
https://quanteda.io/articles/pkgdown/examples/quickstart_ja.html

Subsetting a corpus by condition: corpus_subset(corpus, condition)

Examples of conditions
- docvar name > number
- docvar name == "string"
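The two kinds of condition can be sketched as follows. The corpus, the docvars lang and score, and their values are all hypothetical; they are not part of the NICEST data.

```r
library(quanteda)

# Hypothetical corpus with two document variables
toy_corpus <- corpus(
  c(a = "I like apples.", b = "I like tea very much.", c = "Tea is good."),
  docvars = data.frame(lang = c("jpn", "eng", "jpn"), score = c(3, 5, 4))
)

sub_num <- corpus_subset(toy_corpus, score > 3)       # numeric condition
sub_chr <- corpus_subset(toy_corpus, lang == "jpn")   # string condition

docnames(sub_num)
docnames(sub_chr)
```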

Getting an overview of a corpus: summary()

> summary(nicestJ1)
Corpus consisting of 78 documents:

           Text Types Tokens Sentences
JAN0001_P1B.txt   116    214        12
JAN0001_P2B.txt   138    268        17
JAN0001_P3B.txt    97    169        11
JAN0001_P4B.txt    68     99         8
JAN0001_P5B.txt   120    262        16
JAN0001_P6B.txt   114    224        13
JAN0001_P7B.txt   121    268        18
JAN0001_P8B.txt    71    108         8
JAN0002_P1A.txt    98    170        15

Adding document variables (attributes): docvars(corpus, variable)

> docvars(nicestJ1, "lang") <- "jpn"
> summary(nicestJ1)
Corpus consisting of 78 documents:

            Text Types Tokens Sentences lang
 JAN0001_P1B.txt   116    214        12  jpn
 JAN0001_P2B.txt   138    268        17  jpn
 JAN0001_P3B.txt    97    169        11  jpn
 JAN0001_P4B.txt    68     99         8  jpn
 JAN0001_P5B.txt   120    262        16  jpn
 JAN0001_P6B.txt   114    224        13  jpn
 JAN0001_P7B.txt   121    268        18  jpn
 JAN0001_P8B.txt    71    108         8  jpn
 JAN0002_P1A.txt    98    170        15  jpn
 JAN0002_P2A.txt   117    216        19  jpn
 JAN0002_P3A.txt   111    179        17  jpn
 JAN0002_P5A.txt   112    203        18  jpn
 JAN0002_P6A.txt   126    222        17  jpn
 JAN0002_P8A.txt   117    203        15  jpn
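A document variable does not have to be a single constant; it can also be a vector with one value per document. The sketch below derives one from the document names, assuming purely for illustration that the first seven characters of a file name identify the writer. That reading of the file names, and the variable name writer, are assumptions and are not taken from the NICEST documentation.

```r
library(quanteda)

# Hypothetical three-document corpus using NICEST-style file names
toy_corpus <- corpus(c(
  "JAN0001_P1B.txt" = "Some people say so.",
  "JAN0001_P2B.txt" = "You may think so.",
  "JAN0002_P1A.txt" = "I do not think so."
))

docvars(toy_corpus, "lang") <- "jpn"                               # same value for every document
docvars(toy_corpus, "writer") <- substr(docnames(toy_corpus), 1, 7) # one value per document (assumed meaning)

docvars(toy_corpus)
```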

Viewing the contents: texts()

Called as is, it prints every text.
To look at one particular text, give its position as [number].

> texts(nicestJ1)[3]
JAN0001_P3B.txt 
"Compared with past, young people nowadays do not give enough time to (rest omitted)
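In quanteda 3 and later, texts() is deprecated and as.character() is the replacement, to the best of my knowledge. A minimal sketch of the same lookup on a made-up corpus:

```r
library(quanteda)

# Hypothetical three-document corpus
toy_corpus <- corpus(c(d1 = "First text.", d2 = "Second text.", d3 = "Third text."))

# Named character vector; [3] picks the third document
as.character(toy_corpus)[3]
```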

KWIC search: kwic(corpus, pattern="string")

> kwic(nicestJ1, pattern="however")
                                                                                           
 [JAN0001_P1B.txt, 13]          not important for human, | however | , who make todays life
 [JAN0001_P2B.txt, 31]       compared with young people, | however | , I recognize that is 
 [JAN0001_P7B.txt, 71] misunderstanding about that area. | However | , if you only know
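The amount of context shown on each side can be set with window. The sketch below runs kwic() on a tokens object, which is the input newer quanteda releases expect as far as I know; the two texts are made up.

```r
library(quanteda)

# Hypothetical two-document corpus
toy_corpus <- corpus(c(
  d1 = "It is hard. However, I tried again.",
  d2 = "I agree, however, with most of it."
))

toks <- tokens(toy_corpus)

# Matching is case-insensitive by default; window = 3 shows three tokens on each side
kw <- kwic(toks, pattern = "however", window = 3)
kw
```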

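Tokenizing: tokens()

(In the quanteda version used on this page, this step can be skipped and the document-feature matrix created directly.)

The result is stored as a list with one element per document.

Option to remove punctuation: remove_punct=T
Option to remove numbers: remove_numbers=T
Removing stopwords: tokens_remove(tokens, stopwords("english")) (remove=stopwords("english") is an option of dfm(), not of tokens())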

> tokens(nicestJ1, remove_numbers=T, remove_punct=T)
tokens from 8 documents.
JAN0001_P1B.txt :
  [1] "Some"         "people"       "say"          "that"         "specialized"  "knowledge"    "is"           "not"          "important"    "for"         
 [11] "human"        "however"      "who"          "make"         "todays"       "life"         "such"         "a"            "convenience"  "are"         
 [21] "always"       "a"            "few"          "number"       "of"           "genius"       "with"         "very"         "specific"     "knowledges"  
 [31] "To"           "consider"     "this"         "it"           "can"          "be"           "said"         "that"         "to"           "specialized" 
 [41] "in"           "one"          "specific"     "subject"      "is"           "better"       "than"         "to"           "get"          "bload"       
 [51] "knowledge"    "of"           "many"         "academic"     "subjects"     "There"        "is"           "some"         "more"         "reasons"     
(rest omitted)

Convert all tokens to lower case: tokens_tolower()
Stem the tokens: tokens_wordstem()
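The tokenizing options and the follow-up functions can be chained as in this sketch. The sentence is made up, and the object names are only illustrative.

```r
library(quanteda)

# Hypothetical one-sentence document
toy <- c(d1 = "However, 3 people are using computers.")

toks <- tokens(toy, remove_punct = TRUE, remove_numbers = TRUE)  # drop punctuation and numbers
toks <- tokens_tolower(toks)                                     # lower-case everything
toks_nostop <- tokens_remove(toks, stopwords("english"))         # drop English stopwords
toks_stem <- tokens_wordstem(toks_nostop)                        # stem what is left

as.list(toks_stem)
```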

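Creating a document-feature matrix (DFM): dfm()

quanteda is newer than tm, so it is the better choice (Welbers et al. 2017).
Options
- Stemming: stem=T
- Removing punctuation: remove_punct=T
The input is a tokens object, but
- in the quanteda version used on this page, an untokenized corpus is tokenized automatically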

> dfm(nicestJ1)
> nicestJ1.dfm <- dfm(nicestJ1, stem=T, remove_punct=T)
> nicestJ1.dfm
Document-feature matrix of: 8 documents, 342 features (72.8% sparse).

Option to remove frequent words such as function words (stopwords): remove = stopwords("english")
Documents can also be grouped: groups = "docvar name"
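More recent quanteda releases (3 and later, as far as I know) expect dfm() to receive a tokens object, with punctuation removal, stopword removal, and stemming done at the tokens stage. A minimal sketch of that route, on two made-up documents:

```r
library(quanteda)

# Hypothetical two-document input
toy <- c(d1 = "You can use it. You can do it!", d2 = "They do not use it.")

toks <- tokens(toy, remove_punct = TRUE)  # options go to tokens(), not dfm()
toy_dfm <- dfm(toks)                      # dfm() lower-cases by default

ndoc(toy_dfm)            # number of documents
nfeat(toy_dfm)           # number of distinct features
topfeatures(toy_dfm, 3)  # most frequent features
```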

Viewing the document-feature matrix: View()

Note: capital V

Getting an overview of the document-feature matrix: summary(dfm)

Word frequency list: topfeatures()

The default is the top 10 words.
Give a number as an option to get that many.

> topfeatures(nicestJ1.dfm)
 to you not  in  is  of  it the can and 
 60  53  33  30  27  27  25  25  23  22 
> topfeatures(nicestJ1.dfm, 20)
   to   you   not    in    is    of    it   the   can   and     i  this  they peopl  have  that    do think   use   are 
   60    53    33    30    27    27    25    25    23    22    21    19    18    17    17    16    16    15    14    13

Creating a word cloud: textplot_wordcloud()

> textplot_wordcloud(nicestJ1.dfm)

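Grouping several words together and counting the frequency of each group

For example, prepare several groups of connectives and count how often the connectives belonging to each group occur.
A group is called a "dictionary". The command is dictionary()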

connectives <- dictionary(list(additive = c("moreover", "furthermore", "and"),
                               adversative =  c("however","but","conversely"),
                               resultative = c("therefore", "thus", "so")))

connectives.dfm <- dfm(corpus_data, dictionary = connectives)

View the result with View(connectives.dfm).

Document-feature matrix of: 10 documents, 3 features (10.0% sparse).
10 x 3 sparse Matrix of class "dfm"
                 features
docs              additive adversative resultative
  JAN0001_P1B.txt        1           1           2
  JAN0001_P2B.txt        4           4           3
  JAN0001_P3B.txt        3           1           1
  JAN0001_P4B.txt        2           1           1
  JAN0001_P5B.txt        4           1           3
  JAN0001_P6B.txt        2           2           2
  JAN0001_P7B.txt        5           3           0
  JAN0001_P8B.txt        1           1           0
  JAN0002_P1A.txt        3           0           1
  JAN0002_P2A.txt        4           3           2

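(Note: from here on the object names are different, so the code will not run if copied and pasted as is.)

Sorting the document-feature matrix into a word frequency list: dfm_sort()

dfm_sort(dfm)[rows, number of top features by frequency]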

> summary(nicestJPN.dfm)
Length  Class   Mode 
  4780    dfm     S4 

> dfm_sort(nicestJPN.dfm)
Document-feature matrix of: 10 documents, 478 features (79.4% sparse).

> dfm_sort(nicestJPN.dfm)[ , 1:20]

Document-feature matrix of: 10 documents, 20 features (15.5% sparse).

10 x 20 sparse Matrix of class "dfm"
                 features
docs              to you not in is of the people and i can this it have they do that young think are
  JAN0001_P1B.txt 11  11   6  2  5  3   2      1   1 2   3    1  4    2    0  4    3     0     1   2
  JAN0001_P2B.txt  4   2   7  2  3  5   1      3   4 5   8    4  3    6    8  2    4     8     2   4
  JAN0001_P3B.txt  8   0   4  3  4  3   4      3   3 1   1    2  3    1    1  5    0     3     1   0
  JAN0001_P4B.txt  3   3   0  2  3  0   0      1   2 1   1    2  2    1    4  0    0     0     1   0
  JAN0001_P5B.txt  7   0   5  9  3  7   8      7   4 5   2    5  0    3    2  1    6     2     4   3
  JAN0001_P6B.txt 10  14   5  5  5  5   5      1   2 2   2    2  3    1    0  0    1     0     2   3
  JAN0001_P7B.txt 13  16   6  6  4  2   4      0   5 3   5    3  5    1    3  2    1     0     1   0
  JAN0001_P8B.txt  4   7   0  1  0  2   1      1   1 2   1    0  2    1    0  1    1     0     1   1
  JAN0002_P1A.txt  7   1   3  6  4  3   2      7   3 5   0    3  0    1    0  1    2     0     2   0
  JAN0002_P2A.txt  1   0   3  3  4  3   5      6   4 1   1    1  1    5    4  5    1     5     2   3

Combining corpora: just use +

nicest20.corpus <- nicestJAN.corpus + nicestNTV.corpus
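A self-contained sketch of the same operation on two made-up corpora. Note that the document names must not clash between the two corpora; corp_a and corp_b are hypothetical names.

```r
library(quanteda)

# Two hypothetical corpora with distinct document names
corp_a <- corpus(c(a1 = "First learner text.", a2 = "Second learner text."))
corp_b <- corpus(c(b1 = "A native speaker text."))

combined <- corp_a + corp_b
ndoc(combined)
docnames(combined)
```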

Selecting words characteristic of a document: keyness
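No example is given for this step, so here is a minimal sketch using textstat_keyness(). It assumes quanteda 3 or later with the quanteda.textstats package installed, which is where the textstat_*() functions live in those versions. The three documents are made up; d1 is treated as the target and the other two as the reference.

```r
library(quanteda)
library(quanteda.textstats)  # assumed to be installed

# Hypothetical three-document input
toy <- c(
  d1 = "tea tea tea coffee milk",
  d2 = "coffee coffee milk juice",
  d3 = "milk juice juice water"
)
toy_dfm <- dfm(tokens(toy))

# Compare d1 (target) against all other documents (reference); default measure is chi-squared
key <- textstat_keyness(toy_dfm, target = "d1")
head(key)
```

Features with a positive score are over-represented in the target document; features with a negative score are over-represented in the reference.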

Document similarity: textstat_simil(dfm)

> textstat_simil(nicestJPN.dfm)
textstat_simil object; method = "correlation"
                JAN0001_P1B.txt JAN0001_P2B.txt JAN0001_P3B.txt JAN0001_P4B.txt JAN0001_P5B.txt JAN0001_P6B.txt
JAN0001_P1B.txt           1.000           0.339           0.450           0.361           0.344           0.622
JAN0001_P2B.txt           0.339           1.000           0.500           0.324           0.427           0.302
JAN0001_P3B.txt           0.450           0.500           1.000           0.284           0.453           0.377
JAN0001_P4B.txt           0.361           0.324           0.284           1.000           0.267           0.358
JAN0001_P5B.txt           0.344           0.427           0.453           0.267           1.000           0.333
JAN0001_P6B.txt           0.622           0.302           0.377           0.358           0.333           1.000
JAN0001_P7B.txt           0.669           0.319           0.386           0.422           0.310           0.620
JAN0001_P8B.txt           0.521           0.272           0.273           0.343           0.223           0.573
JAN0002_P1A.txt           0.424           0.299           0.425           0.277           0.498           0.360
JAN0002_P2A.txt           0.215           0.429           0.351           0.176           0.350           0.170
                JAN0001_P7B.txt JAN0001_P8B.txt JAN0002_P1A.txt JAN0002_P2A.txt
JAN0001_P1B.txt           0.669           0.521           0.424           0.215
JAN0001_P2B.txt           0.319           0.272           0.299           0.429
JAN0001_P3B.txt           0.386           0.273           0.425           0.351
JAN0001_P4B.txt           0.422           0.343           0.277           0.176
JAN0001_P5B.txt           0.310           0.223           0.498           0.350
JAN0001_P6B.txt           0.620           0.573           0.360           0.170
JAN0001_P7B.txt           1.000           0.554           0.363           0.187
JAN0001_P8B.txt           0.554           1.000           0.262           0.144
JAN0002_P1A.txt           0.363           0.262           1.000           0.310
JAN0002_P2A.txt           0.187           0.144           0.310           1.000

Several similarity measures are available as options.
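The measure is chosen with the method argument. A minimal sketch with cosine similarity, assuming quanteda 3 or later with quanteda.textstats installed; the three tiny documents are made up so that the expected values are obvious.

```r
library(quanteda)
library(quanteda.textstats)  # assumed to be installed

# Hypothetical input: d1 and d2 are identical, d3 shares no words with them
toy_dfm <- dfm(tokens(c(d1 = "a b c", d2 = "a b c", d3 = "x y z")))

sim <- textstat_simil(toy_dfm, method = "cosine", margin = "documents")
as.matrix(sim)
```

Identical documents get a cosine similarity of 1 and documents with no shared words get 0.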

Distance between documents: textstat_dist(dfm)

> textstat_dist(nicestJPN.dfm)
textstat_dist object; method = "euclidean"
                JAN0001_P1B.txt JAN0001_P2B.txt JAN0001_P3B.txt JAN0001_P4B.txt JAN0001_P5B.txt JAN0001_P6B.txt
JAN0001_P1B.txt               0            28.2            22.0            22.5            29.4            21.4
JAN0001_P2B.txt            28.2               0            22.9            25.3            28.5            30.3
JAN0001_P3B.txt            22.0            22.9               0            18.2            25.4            25.2
JAN0001_P4B.txt            22.5            25.3            18.2               0            27.8            24.7
JAN0001_P5B.txt            29.4            28.5            25.4            27.8               0            30.9
JAN0001_P6B.txt            21.4            30.3            25.2            24.7            30.9               0
JAN0001_P7B.txt            22.6            32.6            28.6            28.1            34.0            24.7
JAN0001_P8B.txt            20.4            26.0            18.8            13.9            28.5            21.8
JAN0002_P1A.txt            23.0            27.1            19.5            19.4            24.8            26.0
JAN0002_P2A.txt            27.8            25.2            22.2            22.6            28.3            30.3
                JAN0001_P7B.txt JAN0001_P8B.txt JAN0002_P1A.txt JAN0002_P2A.txt
JAN0001_P1B.txt            22.6            20.4            23.0            27.8
JAN0001_P2B.txt            32.6            26.0            27.1            25.2
JAN0001_P3B.txt            28.6            18.8            19.5            22.2
JAN0001_P4B.txt            28.1            13.9            19.4            22.6
JAN0001_P5B.txt            34.0            28.5            24.8            28.3
JAN0001_P6B.txt            24.7            21.8            26.0            30.3
JAN0001_P7B.txt               0            26.1            29.3            33.3
JAN0001_P8B.txt            26.1               0            19.9            23.3
JAN0002_P1A.txt            29.3            19.9               0            23.5
JAN0002_P2A.txt            33.3            23.3            23.5               0
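To see how the Euclidean distance is computed, it helps to run textstat_dist() on documents small enough to check by hand. This sketch assumes quanteda 3 or later with quanteda.textstats installed; the documents are made up.

```r
library(quanteda)
library(quanteda.textstats)  # assumed to be installed

# Hypothetical input: d1 and d2 are identical, d3 shares no words with them
toy_dfm <- dfm(tokens(c(d1 = "a b c", d2 = "a b c", d3 = "x y z")))

d <- textstat_dist(toy_dfm, method = "euclidean", margin = "documents")
as.matrix(d)
```

Here d1 and d3 differ by 1 on each of six features, so their distance is sqrt(6); d1 and d2 are identical, so their distance is 0.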

Observing lexical dispersion: textplot_xray()

Used in combination with kwic()

textplot_xray(
  kwic(nicestJAN.corpus, pattern = "and"),
  kwic(nicestJAN.corpus, pattern = "but"),
  kwic(nicestJAN.corpus, pattern = "so")
)

Option to plot the distribution on an absolute scale: scale = "absolute"

textplot_xray(
  kwic(nicestJAN.corpus, pattern = "and"),
  kwic(nicestJAN.corpus, pattern = "but"),
  kwic(nicestJAN.corpus, pattern = "so"),
  scale = "absolute"
)

Random sampling: corpus_sample(corpus, size=n, replace = FALSE)
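A minimal sketch of sampling without replacement from a made-up five-document corpus. set.seed() is added only so that the sample is reproducible; the seed value is arbitrary.

```r
library(quanteda)

# Hypothetical five-document corpus
toy_corpus <- corpus(c(d1 = "one", d2 = "two", d3 = "three", d4 = "four", d5 = "five"))

set.seed(123)  # arbitrary seed, for reproducibility only
smp <- corpus_sample(toy_corpus, size = 3, replace = FALSE)

docnames(smp)
```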

Reference

https://quanteda.io/articles/pkgdown/quickstart_ja.html
http://i.amcat.nl/lda/1_textanalysis.html
https://quanteda.io/articles/pkgdown/examples/plotting.html

https://sugiura-ken.org/wiki/
