
quanteda2


R

Examples using quanteda

install.packages("quanteda", dependencies = T)
library(quanteda)


Sample: 78 learner-data files from the learner corpus NICEST 1.0

  • The folder contains multiple plain-text files holding text data only
  • The working directory is already set to that folder
> getwd()
[1] "C:/Users/ /Documents/NICEST-samples//NICEST_JP10"
> list.files()
 [1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt" "JAN0001_P6B.txt" "JAN0001_P7B.txt"
 [8] "JAN0001_P8B.txt" "JAN0002_P1A.txt" "JAN0002_P2A.txt" "JAN0002_P3A.txt" "JAN0002_P5A.txt" "JAN0002_P6A.txt" "JAN0002_P8A.txt"
[15] "JAN0003_P1B.txt" "JAN0003_P2B.txt" "JAN0003_P3B.txt" "JAN0003_P4B.txt" "JAN0003_P5B.txt" "JAN0003_P6B.txt" "JAN0003_P7B.txt"
[22] "JAN0003_P8B.txt" "JAN0004_P1B.txt" "JAN0004_P2B.txt" "JAN0004_P3B.txt" "JAN0004_P4B.txt" "JAN0004_P5B.txt" "JAN0004_P6B.txt"
[29] "JAN0004_P7B.txt" "JAN0004_P8B.txt" "JAN0005_P1B.txt" "JAN0005_P2B.txt" "JAN0005_P3B.txt" "JAN0005_P4B.txt" "JAN0005_P5B.txt"
[36] "JAN0005_P6B.txt" "JAN0005_P7B.txt" "JAN0005_P8B.txt" "JAN0006_P1A.txt" "JAN0006_P2A.txt" "JAN0006_P3A.txt" "JAN0006_P4A.txt"
[43] "JAN0006_P5A.txt" "JAN0006_P6A.txt" "JAN0006_P7A.txt" "JAN0006_P8A.txt" "JAN0007_P1B.txt" "JAN0007_P2B.txt" "JAN0007_P3B.txt"
[50] "JAN0007_P4B.txt" "JAN0007_P5B.txt" "JAN0007_P6B.txt" "JAN0007_P7B.txt" "JAN0007_P8B.txt" "JAN0008_P1B.txt" "JAN0008_P2B.txt"
[57] "JAN0008_P3B.txt" "JAN0008_P4B.txt" "JAN0008_P5B.txt" "JAN0008_P6B.txt" "JAN0008_P7B.txt" "JAN0008_P8B.txt" "JAN0009_P1B.txt"
[64] "JAN0009_P2B.txt" "JAN0009_P3B.txt" "JAN0009_P4B.txt" "JAN0009_P5B.txt" "JAN0009_P6B.txt" "JAN0009_P7B.txt" "JAN0009_P8B.txt"
[71] "JAN0010_P1B.txt" "JAN0010_P2B.txt" "JAN0010_P3B.txt" "JAN0010_P4B.txt" "JAN0010_P5B.txt" "JAN0010_P6B.txt" "JAN0010_P7B.txt"
[78] "JAN0010_P8B.txt"

Read the texts with readtext() from the readtext package


install.packages("readtext", dependencies = T)
library(readtext)

nicest.tmp <- readtext("*.txt")
> nicest.tmp
readtext object consisting of 78 documents and 0 docvars.
# Description: df[,2] [78 x 2]
  doc_id          text               
  <chr>           <chr>              
1 JAN0001_P1B.txt "\"Some peopl\"..."
2 JAN0001_P2B.txt "\"You may th\"..."
3 JAN0001_P3B.txt "\"Compared w\"..."
4 JAN0001_P4B.txt "\"You may ha\"..."
5 JAN0001_P5B.txt "\"Elderly pe\"..."
6 JAN0001_P6B.txt "\"Group tour\"..."
# ... with 72 more rows

Convert to a corpus: corpus()

nicestJ1 <- corpus(nicest.tmp)
nicestJ1
Corpus consisting of 78 documents and 0 docvars.

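  • Add or modify document variables: docvars()
  • Add or modify document attributes: metadoc(). This is remarks-style information that is not itself a target of analysis.

  • Reference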

https://github.com/koheiw/workshop-IJTA/blob/master/documents/corpus.md
https://quanteda.io/articles/pkgdown/examples/quickstart_ja.html

Create a subset by filtering the corpus on a condition: corpus_subset(corpus, condition)

  • Filtering examples
    • docvar name > number
    • docvar name == "string"
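
The two filter patterns above can be sketched on a toy corpus. The texts, document names, and the docvars `lang` and `score` are made up for illustration; they are not part of the NICEST data.

```r
library(quanteda)

# Toy corpus standing in for the NICEST files (texts and docvars are invented)
toy <- corpus(
  c(d1 = "I think so.", d2 = "However, I do not agree.", d3 = "Thus we stop."),
  docvars = data.frame(lang = c("jpn", "jpn", "eng"), score = c(3, 5, 4))
)

# Pattern: docvar name == "string"
jpn_only <- corpus_subset(toy, lang == "jpn")

# Pattern: docvar name > number
high <- corpus_subset(toy, score > 3)

ndoc(jpn_only)   # number of documents kept by the first filter
docnames(high)   # names of documents kept by the second filter
```

The condition is evaluated against the document variables, so any column shown by docvars() can be used in it.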

View a summary of the corpus: summary()

> summary(nicestJ1)
Corpus consisting of 78 documents:

           Text Types Tokens Sentences
JAN0001_P1B.txt   116    214        12
JAN0001_P2B.txt   138    268        17
JAN0001_P3B.txt    97    169        11
JAN0001_P4B.txt    68     99         8
JAN0001_P5B.txt   120    262        16
JAN0001_P6B.txt   114    224        13
JAN0001_P7B.txt   121    268        18
JAN0001_P8B.txt    71    108         8
JAN0002_P1A.txt    98    170        15

Add a document variable (attribute information): docvars(corpus, "variable name")


> docvars(nicestJ1, "lang") <- "jpn"
> summary(nicestJ1)
Corpus consisting of 78 documents:

            Text Types Tokens Sentences lang
 JAN0001_P1B.txt   116    214        12  jpn
 JAN0001_P2B.txt   138    268        17  jpn
 JAN0001_P3B.txt    97    169        11  jpn
 JAN0001_P4B.txt    68     99         8  jpn
 JAN0001_P5B.txt   120    262        16  jpn
 JAN0001_P6B.txt   114    224        13  jpn
 JAN0001_P7B.txt   121    268        18  jpn
 JAN0001_P8B.txt    71    108         8  jpn
 JAN0002_P1A.txt    98    170        15  jpn
 JAN0002_P2A.txt   117    216        19  jpn
 JAN0002_P3A.txt   111    179        17  jpn
 JAN0002_P5A.txt   112    203        18  jpn
 JAN0002_P6A.txt   126    222        17  jpn
 JAN0002_P8A.txt   117    203        15  jpn
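
Assigning a single value, as above, gives every document the same label. A minimal sketch of per-document values, assuming (this is a guess about the naming scheme, not something the corpus documentation confirms here) that the part of the file name before the underscore identifies the writer and the part after it identifies the prompt. The toy texts and the docvar names `writer` and `prompt` are hypothetical.

```r
library(quanteda)

# Toy corpus whose document names mimic the NICEST file names
toy <- corpus(c("JAN0001_P1B.txt" = "Some people say so.",
                "JAN0001_P2B.txt" = "You may think so.",
                "JAN0002_P1A.txt" = "I do not think so."))

# Strip the extension, then split the name at the underscore
ids <- sub("\\.txt$", "", docnames(toy))
docvars(toy, "writer") <- sub("_.*$", "", ids)  # assumed meaning: writer ID
docvars(toy, "prompt") <- sub("^.*_", "", ids)  # assumed meaning: prompt code

docvars(toy)
```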


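View the contents: texts() (deprecated from quanteda 3 onward, where as.character() is the replacement)

  • Called as is, it prints every text
  • To view a specific document, give its position as [number].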
> texts(nicestJ1)[3]
JAN0001_P3B.txt 
"Compared with past, young people nowadays do not give enough time to (rest omitted)

KWIC search: kwic(corpus, pattern = "string")

> kwic(nicestJ1, pattern="however")
                                                                                           
 [JAN0001_P1B.txt, 13]          not important for human, | however | , who make todays life
 [JAN0001_P2B.txt, 31]       compared with young people, | however | , I recognize that is 
 [JAN0001_P7B.txt, 71] misunderstanding about that area. | However | , if you only know    
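
A small sketch of the same search on a tokens object, with the context width set explicitly through `window`. The two toy sentences are invented. Matching is case-insensitive by default, which is why both "however" and "However" appear in the output above.

```r
library(quanteda)

# Toy texts (invented) containing the search word in two different cases
toks <- tokens(c(d1 = "I agree. However, it is hard.",
                 d2 = "It is easy, however."))

# window = 3 keeps three tokens of context on each side of the match
hits <- kwic(toks, pattern = "however", window = 3)

hits
nrow(hits)   # one row per match
```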

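Tokenize: tokens()

(In the quanteda version used on this page you can skip this step and create the document-feature matrix directly. quanteda 3 and later deprecate that shortcut, so tokenizing first is the safer habit.)

  • The result is stored as a list with one element per document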

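  • Option to remove punctuation: remove_punct=T
  • Option to remove numbers: remove_numbers=T
  • To remove stopwords, apply tokens_remove(tokens, stopwords("english")) to the result (tokens() itself has no remove argument)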

> tokens(nicestJ1, remove_numbers=T, remove_punct=T)
tokens from 8 documents.
JAN0001_P1B.txt :
  [1] "Some"         "people"       "say"          "that"         "specialized"  "knowledge"    "is"           "not"          "important"    "for"         
 [11] "human"        "however"      "who"          "make"         "todays"       "life"         "such"         "a"            "convenience"  "are"         
 [21] "always"       "a"            "few"          "number"       "of"           "genius"       "with"         "very"         "specific"     "knowledges"  
 [31] "To"           "consider"     "this"         "it"           "can"          "be"           "said"         "that"         "to"           "specialized" 
 [41] "in"           "one"          "specific"     "subject"      "is"           "better"       "than"         "to"           "get"          "bload"       
 [51] "knowledge"    "of"           "many"         "academic"     "subjects"     "There"        "is"           "some"         "more"         "reasons"     
(rest omitted)
  • Convert all tokens to lowercase: tokens_tolower()
  • Stem the tokens: tokens_wordstem()
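
The tokenizing steps listed above can be chained as follows. This is a sketch on one invented sentence, not the NICEST data, and the order of the steps (lowercase, then stopword removal, then stemming) is one reasonable choice rather than a requirement.

```r
library(quanteda)

# One invented sentence; punctuation and numbers are dropped while tokenizing
toks <- tokens("Some people say that specialized knowledge is not important in 2020.",
               remove_punct = TRUE, remove_numbers = TRUE)

toks <- tokens_tolower(toks)                        # lowercase every token
toks <- tokens_remove(toks, stopwords("english"))   # drop English stopwords
toks <- tokens_wordstem(toks)                       # reduce tokens to stems

as.list(toks)
```

Stemming after stopword removal matters because the stopword list contains unstemmed forms.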

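Create a document-feature matrix (DFM): dfm()

  • Newer than tm, so this is the better choice (Welbers et al. 2017)
  • Options
    • Stemming: stem=T
    • Punctuation removal: remove_punct=T
  • The input is normally a tokenized object, but
    • if it has not been tokenized, dfm() tokenizes it automatically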

> dfm(nicestJ1)
> nicestJ1.dfm <- dfm(nicestJ1, stem=T, remove_punct=T)
> nicestJ1.dfm
Document-feature matrix of: 8 documents, 342 features (72.8% sparse).
  • Option to remove frequent words such as function words (stopwords): remove = stopwords("english")
  • Documents can also be grouped: groups = "docvar name"
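
In quanteda 3 and later, passing a corpus and the stem, remove, and groups arguments to dfm() is deprecated, so the same result is usually built from a tokens object. A minimal sketch with invented texts and a hypothetical docvar `writer`:

```r
library(quanteda)

# Toy corpus (invented) with a grouping variable
corp <- corpus(c(a1 = "You can do it.", a2 = "You think so.", b1 = "They can not."),
               docvars = data.frame(writer = c("A", "A", "B")))

toks <- tokens(corp, remove_punct = TRUE)   # tokenize first
mat  <- dfm(toks)                           # then build the DFM (lowercases by default)

# Collapse documents that share the same writer into one row each
mat_by_writer <- dfm_group(mat, groups = factor(docvars(mat)$writer))

as.matrix(mat_by_writer)
```

Stopword removal and stemming fit into this pipeline as tokens_remove() and tokens_wordstem() before the dfm() call.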

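Browse the DFM: View()

★ Note the capital V

View a summary of the DFM: summary(dfm)

Word frequency list: topfeatures()

  • The default is the top 10 words
  • Pass a number as an option to get that many words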
> topfeatures(nicestJ1.dfm)
 to you not  in  is  of  it the can and 
 60  53  33  30  27  27  25  25  23  22 
> topfeatures(nicestJ1.dfm, 20)
   to   you   not    in    is    of    it   the   can   and     i  this  they peopl  have  that    do think   use   are 
   60    53    33    30    27    27    25    25    23    22    21    19    18    17    17    16    16    15    14    13 


Create a word cloud: textplot_wordcloud()

> textplot_wordcloud(nicestJ1.dfm)

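Group several words together and count how often members of each group occur.

  • For example, prepare several groups of connectives and count the frequency of the connectives belonging to each group.
  • A group is called a "dictionary". The command is dictionary()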
connectives <- dictionary(list(additive = c("moreover", "furthermore", "and"),
                               adversative =  c("however","but","conversely"),
                               resultative = c("therefore", "thus", "so")))

connectives.dfm <- dfm(corpus_data, dictionary = connectives)  # corpus_data: your corpus object

  • View the result with View(connectives.dfm).

Document-feature matrix of: 10 documents, 3 features (10.0% sparse).
10 x 3 sparse Matrix of class "dfm"
                 features
docs              additive adversative resultative
  JAN0001_P1B.txt        1           1           2
  JAN0001_P2B.txt        4           4           3
  JAN0001_P3B.txt        3           1           1
  JAN0001_P4B.txt        2           1           1
  JAN0001_P5B.txt        4           1           3
  JAN0001_P6B.txt        2           2           2
  JAN0001_P7B.txt        5           3           0
  JAN0001_P8B.txt        1           1           0
  JAN0002_P1A.txt        3           0           1
  JAN0002_P2A.txt        4           3           2
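
The same count can be produced with dfm_lookup() on a DFM built from tokens, which avoids the dictionary argument of dfm() that newer quanteda releases deprecate. The two toy sentences are invented; the dictionary is the one defined above.

```r
library(quanteda)

connectives <- dictionary(list(additive = c("moreover", "furthermore", "and"),
                               adversative = c("however", "but", "conversely"),
                               resultative = c("therefore", "thus", "so")))

# Toy texts (invented), each containing one connective from every group
toks <- tokens(c(d1 = "I tried and tried, but failed. So I stopped.",
                 d2 = "However, it worked. Therefore I continued and won."),
               remove_punct = TRUE)

# Replace individual words by their dictionary key and count per document
conn_dfm <- dfm_lookup(dfm(toks), dictionary = connectives)

as.matrix(conn_dfm)
```

Words such as "and" or "so" are counted whenever they occur, whether or not they act as connectives in that sentence.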



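(★ Note: the object names below differ from those used above, so the code will not run if copied and pasted as is.)

Sort the DFM to build a vocabulary frequency list: dfm_sort()

dfm_sort(dfm)[which rows, how many words in frequency order]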
> summary(nicestJPN.dfm)
Length  Class   Mode 
  4780    dfm     S4 

> dfm_sort(nicestJPN.dfm)
Document-feature matrix of: 10 documents, 478 features (79.4% sparse).

> dfm_sort(nicestJPN.dfm)[ , 1:20]

Document-feature matrix of: 10 documents, 20 features (15.5% sparse).

10 x 20 sparse Matrix of class "dfm"
                 features
docs              to you not in is of the people and i can this it have they do that young think are
  JAN0001_P1B.txt 11  11   6  2  5  3   2      1   1 2   3    1  4    2    0  4    3     0     1   2
  JAN0001_P2B.txt  4   2   7  2  3  5   1      3   4 5   8    4  3    6    8  2    4     8     2   4
  JAN0001_P3B.txt  8   0   4  3  4  3   4      3   3 1   1    2  3    1    1  5    0     3     1   0
  JAN0001_P4B.txt  3   3   0  2  3  0   0      1   2 1   1    2  2    1    4  0    0     0     1   0
  JAN0001_P5B.txt  7   0   5  9  3  7   8      7   4 5   2    5  0    3    2  1    6     2     4   3
  JAN0001_P6B.txt 10  14   5  5  5  5   5      1   2 2   2    2  3    1    0  0    1     0     2   3
  JAN0001_P7B.txt 13  16   6  6  4  2   4      0   5 3   5    3  5    1    3  2    1     0     1   0
  JAN0001_P8B.txt  4   7   0  1  0  2   1      1   1 2   1    0  2    1    0  1    1     0     1   1
  JAN0002_P1A.txt  7   1   3  6  4  3   2      7   3 5   0    3  0    1    0  1    2     0     2   0
  JAN0002_P2A.txt  1   0   3  3  4  3   5      6   4 1   1    1  1    5    4  5    1     5     2   3
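
A compact sketch of the sort-then-slice idiom on an invented two-document DFM, with the sort direction and margin written out explicitly.

```r
library(quanteda)

# Tiny invented DFM: "to" occurs 3 times in total, "be" twice, the rest once
mat <- dfm(tokens(c(d1 = "to be or not to be", d2 = "to do")))

# Order the columns by total frequency, most frequent first
sorted <- dfm_sort(mat, decreasing = TRUE, margin = "features")

sorted[, 1:2]            # all rows, the two most frequent features
featnames(sorted)[1:2]   # the same two features by name
topfeatures(mat, 2)      # totals only, without the per-document breakdown
```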




Combine corpora: just use +

nicest20.corpus <- nicestJAN.corpus + nicestNTV.corpus
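
A self-contained sketch of the same operation with two invented corpora. Tagging each corpus with a docvar before combining (here a hypothetical `lang`) keeps the origin of each document recoverable afterwards; document names must be unique across the two corpora.

```r
library(quanteda)

# Two toy corpora (invented texts), each labelled before merging
jan <- corpus(c(JAN1 = "I think so."),
              docvars = data.frame(lang = "jpn"))
ntv <- corpus(c(NTV1 = "I agree.", NTV2 = "Indeed."),
              docvars = data.frame(lang = c("eng", "eng")))

both <- jan + ntv

ndoc(both)
table(docvars(both)$lang)
```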


Pick out words characteristic of a document: keyness
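
A minimal sketch using textstat_keyness(), which compares one target document (or group) against the rest. It assumes quanteda 3 or later, where the textstat_*() functions live in the separate quanteda.textstats package; in older versions the function is in quanteda itself. The two toy documents and their names are invented.

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3

# Two invented documents; grpA is the target, grpB the reference
mat <- dfm(tokens(c(grpA = "you can use it you can do it",
                    grpB = "they have used it they have done it")))

# Default measure is chi-squared; features are ranked by association with the target
key <- textstat_keyness(mat, target = "grpA")

head(key)
```

With real data the target is usually a group of documents, for example after dfm_group() on a docvar.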


Document similarity: textstat_simil(dfm)

> textstat_simil(nicestJPN.dfm)
textstat_simil object; method = "correlation"
                JAN0001_P1B.txt JAN0001_P2B.txt JAN0001_P3B.txt JAN0001_P4B.txt JAN0001_P5B.txt JAN0001_P6B.txt
JAN0001_P1B.txt           1.000           0.339           0.450           0.361           0.344           0.622
JAN0001_P2B.txt           0.339           1.000           0.500           0.324           0.427           0.302
JAN0001_P3B.txt           0.450           0.500           1.000           0.284           0.453           0.377
JAN0001_P4B.txt           0.361           0.324           0.284           1.000           0.267           0.358
JAN0001_P5B.txt           0.344           0.427           0.453           0.267           1.000           0.333
JAN0001_P6B.txt           0.622           0.302           0.377           0.358           0.333           1.000
JAN0001_P7B.txt           0.669           0.319           0.386           0.422           0.310           0.620
JAN0001_P8B.txt           0.521           0.272           0.273           0.343           0.223           0.573
JAN0002_P1A.txt           0.424           0.299           0.425           0.277           0.498           0.360
JAN0002_P2A.txt           0.215           0.429           0.351           0.176           0.350           0.170
                JAN0001_P7B.txt JAN0001_P8B.txt JAN0002_P1A.txt JAN0002_P2A.txt
JAN0001_P1B.txt           0.669           0.521           0.424           0.215
JAN0001_P2B.txt           0.319           0.272           0.299           0.429
JAN0001_P3B.txt           0.386           0.273           0.425           0.351
JAN0001_P4B.txt           0.422           0.343           0.277           0.176
JAN0001_P5B.txt           0.310           0.223           0.498           0.350
JAN0001_P6B.txt           0.620           0.573           0.360           0.170
JAN0001_P7B.txt           1.000           0.554           0.363           0.187
JAN0001_P8B.txt           0.554           1.000           0.262           0.144
JAN0002_P1A.txt           0.363           0.262           1.000           0.310
JAN0002_P2A.txt           0.187           0.144           0.310           1.000
  • Several similarity measures are available through options.
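
The measure is chosen with the `method` argument. A sketch with cosine similarity on invented documents, again assuming quanteda 3 or later with quanteda.textstats installed:

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3

# d1 and d2 are identical, d3 shares no words with them
mat <- dfm(tokens(c(d1 = "a b c", d2 = "a b c", d3 = "x y z")))

sim <- textstat_simil(mat, method = "cosine", margin = "documents")

as.matrix(sim)
```

Identical documents give a cosine of 1 and documents with no shared words give 0, which makes this a convenient sanity check before running on real data.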

Measure the distance between documents: textstat_dist(dfm)

> textstat_dist(nicestJPN.dfm)
textstat_dist object; method = "euclidean"
                JAN0001_P1B.txt JAN0001_P2B.txt JAN0001_P3B.txt JAN0001_P4B.txt JAN0001_P5B.txt JAN0001_P6B.txt
JAN0001_P1B.txt               0            28.2            22.0            22.5            29.4            21.4
JAN0001_P2B.txt            28.2               0            22.9            25.3            28.5            30.3
JAN0001_P3B.txt            22.0            22.9               0            18.2            25.4            25.2
JAN0001_P4B.txt            22.5            25.3            18.2               0            27.8            24.7
JAN0001_P5B.txt            29.4            28.5            25.4            27.8               0            30.9
JAN0001_P6B.txt            21.4            30.3            25.2            24.7            30.9               0
JAN0001_P7B.txt            22.6            32.6            28.6            28.1            34.0            24.7
JAN0001_P8B.txt            20.4            26.0            18.8            13.9            28.5            21.8
JAN0002_P1A.txt            23.0            27.1            19.5            19.4            24.8            26.0
JAN0002_P2A.txt            27.8            25.2            22.2            22.6            28.3            30.3
                JAN0001_P7B.txt JAN0001_P8B.txt JAN0002_P1A.txt JAN0002_P2A.txt
JAN0001_P1B.txt            22.6            20.4            23.0            27.8
JAN0001_P2B.txt            32.6            26.0            27.1            25.2
JAN0001_P3B.txt            28.6            18.8            19.5            22.2
JAN0001_P4B.txt            28.1            13.9            19.4            22.6
JAN0001_P5B.txt            34.0            28.5            24.8            28.3
JAN0001_P6B.txt            24.7            21.8            26.0            30.3
JAN0001_P7B.txt               0            26.1            29.3            33.3
JAN0001_P8B.txt            26.1               0            19.9            23.3
JAN0002_P1A.txt            29.3            19.9               0            23.5
JAN0002_P2A.txt            33.3            23.3            23.5               0

Observe lexical dispersion: textplot_xray()

  • Used in combination with kwic()
textplot_xray(
  kwic(nicestJAN.corpus, pattern = "and"),
  kwic(nicestJAN.corpus, pattern = "but"),
  kwic(nicestJAN.corpus, pattern = "so")
)

  • Option to plot the distribution on an absolute scale: scale = "absolute"
textplot_xray(
  kwic(nicestJAN.corpus, pattern = "and"),
  kwic(nicestJAN.corpus, pattern = "but"),
  kwic(nicestJAN.corpus, pattern = "so"),
  scale = "absolute"
)

Random sampling: corpus_sample(corpus, size = number, replace = FALSE)
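
A short sketch on an invented five-document corpus. Setting a seed first makes the draw reproducible; the seed value is arbitrary.

```r
library(quanteda)

# Toy corpus (invented) with five one-word documents
corp <- corpus(c(d1 = "a", d2 = "b", d3 = "c", d4 = "d", d5 = "e"))

set.seed(1)  # arbitrary seed so the same sample is drawn each run
smp <- corpus_sample(corp, size = 3, replace = FALSE)

ndoc(smp)
docnames(smp)
```

With replace = FALSE no document can be drawn twice, so size cannot exceed the number of documents.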







  • Reference

https://quanteda.io/articles/pkgdown/quickstart_ja.html
http://i.amcat.nl/lda/1_textanalysis.html
https://quanteda.io/articles/pkgdown/examples/plotting.html