R
!!!An example using quanteda
 install.packages("quanteda", dependencies = TRUE)
 library(quanteda)
{{outline}}
----
!Sample data: 78 learner files from the [NICEST 1.0 learner corpus|http://sgr.gsid.nagoya-u.ac.jp/wordpress/?page_id=1159]
*A folder contains multiple plain-text files (text data only)
*With the working directory set to that folder:
{{pre
> getwd()
[1] "C:/Users/ /Documents/NICEST-samples//NICEST_JP10"
> list.files()
 [1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt" "JAN0001_P6B.txt" "JAN0001_P7B.txt"
 [8] "JAN0001_P8B.txt" "JAN0002_P1A.txt" "JAN0002_P2A.txt" "JAN0002_P3A.txt" "JAN0002_P5A.txt" "JAN0002_P6A.txt" "JAN0002_P8A.txt"
[15] "JAN0003_P1B.txt" "JAN0003_P2B.txt" "JAN0003_P3B.txt" "JAN0003_P4B.txt" "JAN0003_P5B.txt" "JAN0003_P6B.txt" "JAN0003_P7B.txt"
[22] "JAN0003_P8B.txt" "JAN0004_P1B.txt" "JAN0004_P2B.txt" "JAN0004_P3B.txt" "JAN0004_P4B.txt" "JAN0004_P5B.txt" "JAN0004_P6B.txt"
[29] "JAN0004_P7B.txt" "JAN0004_P8B.txt" "JAN0005_P1B.txt" "JAN0005_P2B.txt" "JAN0005_P3B.txt" "JAN0005_P4B.txt" "JAN0005_P5B.txt"
[36] "JAN0005_P6B.txt" "JAN0005_P7B.txt" "JAN0005_P8B.txt" "JAN0006_P1A.txt" "JAN0006_P2A.txt" "JAN0006_P3A.txt" "JAN0006_P4A.txt"
[43] "JAN0006_P5A.txt" "JAN0006_P6A.txt" "JAN0006_P7A.txt" "JAN0006_P8A.txt" "JAN0007_P1B.txt" "JAN0007_P2B.txt" "JAN0007_P3B.txt"
[50] "JAN0007_P4B.txt" "JAN0007_P5B.txt" "JAN0007_P6B.txt" "JAN0007_P7B.txt" "JAN0007_P8B.txt" "JAN0008_P1B.txt" "JAN0008_P2B.txt"
[57] "JAN0008_P3B.txt" "JAN0008_P4B.txt" "JAN0008_P5B.txt" "JAN0008_P6B.txt" "JAN0008_P7B.txt" "JAN0008_P8B.txt" "JAN0009_P1B.txt"
[64] "JAN0009_P2B.txt" "JAN0009_P3B.txt" "JAN0009_P4B.txt" "JAN0009_P5B.txt" "JAN0009_P6B.txt" "JAN0009_P7B.txt" "JAN0009_P8B.txt"
[71] "JAN0010_P1B.txt" "JAN0010_P2B.txt" "JAN0010_P3B.txt" "JAN0010_P4B.txt" "JAN0010_P5B.txt" "JAN0010_P6B.txt" "JAN0010_P7B.txt"
[78] "JAN0010_P8B.txt"
}}
!Reading in the texts with readtext() from the readtext package
 install.packages("readtext", dependencies = TRUE)
 library(readtext)
{{pre
nicest.tmp <- readtext("*.txt")
> nicest.tmp
readtext object consisting of 78
documents and 0 docvars.
# Description: df[,2] [78 x 2]
  doc_id          text
1 JAN0001_P1B.txt "\"Some peopl\"..."
2 JAN0001_P2B.txt "\"You may th\"..."
3 JAN0001_P3B.txt "\"Compared w\"..."
4 JAN0001_P4B.txt "\"You may ha\"..."
5 JAN0001_P5B.txt "\"Elderly pe\"..."
6 JAN0001_P6B.txt "\"Group tour\"..."
# ... with 72 more rows
}}
!Converting to a corpus object with corpus()
{{pre
nicestJ1 <- corpus(nicest.tmp)
nicestJ1
Corpus consisting of 78 documents and 0 docvars.
}}
*Add or modify document variables: docvars()
*Add or modify document attributes: metadoc() — remark-style information that is not itself analyzed
*References
https://github.com/koheiw/workshop-IJTA/blob/master/documents/corpus.md
https://quanteda.io/articles/pkgdown/examples/quickstart_ja.html
!Creating a subset of the corpus by filtering on conditions with corpus_subset()
*Examples of filtering conditions:
** docvar > number
** docvar == "string"
!Viewing a summary of the corpus with summary()
{{pre
> summary(nicestJ1)
Corpus consisting of 78 documents:

            Text Types Tokens Sentences
 JAN0001_P1B.txt   116    214        12
 JAN0001_P2B.txt   138    268        17
 JAN0001_P3B.txt    97    169        11
 JAN0001_P4B.txt    68     99         8
 JAN0001_P5B.txt   120    262        16
 JAN0001_P6B.txt   114    224        13
 JAN0001_P7B.txt   121    268        18
 JAN0001_P8B.txt    71    108         8
 JAN0002_P1A.txt    98    170        15
}}
!Adding a document variable (attribute information) with docvars()
{{pre
> docvars(nicestJ1, "lang") <- "jpn"
> summary(nicestJ1)
Corpus consisting of 78 documents:

            Text Types Tokens Sentences lang
 JAN0001_P1B.txt   116    214        12  jpn
 JAN0001_P2B.txt   138    268        17  jpn
 JAN0001_P3B.txt    97    169        11  jpn
 JAN0001_P4B.txt    68     99         8  jpn
 JAN0001_P5B.txt   120    262        16  jpn
 JAN0001_P6B.txt   114    224        13  jpn
 JAN0001_P7B.txt   121    268        18  jpn
 JAN0001_P8B.txt    71    108         8  jpn
 JAN0002_P1A.txt    98    170        15  jpn
 JAN0002_P2A.txt   117    216        19  jpn
 JAN0002_P3A.txt   111    179        17  jpn
 JAN0002_P5A.txt   112    203        18  jpn
 JAN0002_P6A.txt   126    222        17  jpn
 JAN0002_P8A.txt   117    203        15  jpn
}}
!Inspecting the contents with texts()
*Called with no index, this prints every document
*To view one specific document, give its position as [number]
{{pre
> texts(nicestJ1)[3]
JAN0001_P3B.txt
"Compared with past, young people nowadays do not give enough time to
(rest omitted)
}}
!KWIC (keyword-in-context) search with kwic()
{{pre
> kwic(nicestJ1, pattern="however")
 [JAN0001_P1B.txt, 13]     not important for human, | however | , who make todays life
 [JAN0001_P2B.txt, 31]      compared with young
people, | however | , I recognize that is
 [JAN0001_P7B.txt, 71] misunderstanding about that area. | However | , if you only know
}}
!Tokenization with tokens() (in practice this step can be skipped: the document-feature matrix can be created directly)
*The result is stored as a list with one element per document
*Option to remove punctuation: remove_punct = TRUE
*Option to remove numbers: remove_numbers = TRUE
*Stopwords can be removed from a tokens object with tokens_remove(x, stopwords("english"))
{{pre
> tokens(nicestJ1, remove_numbers=T, remove_punct=T)
tokens from 8 documents.
JAN0001_P1B.txt :
 [1] "Some" "people" "say" "that" "specialized" "knowledge" "is" "not" "important" "for"
[11] "human" "however" "who" "make" "todays" "life" "such" "a" "convenience" "are"
[21] "always" "a" "few" "number" "of" "genius" "with" "very" "specific" "knowledges"
[31] "To" "consider" "this" "it" "can" "be" "said" "that" "to" "specialized"
[41] "in" "one" "specific" "subject" "is" "better" "than" "to" "get" "bload"
[51] "knowledge" "of" "many" "academic" "subjects" "There" "is" "some" "more" "reasons"
(rest omitted)
}}
*Lower-case all tokens: tokens_tolower()
*Stem all tokens: tokens_wordstem()
!Creating a document-feature matrix (DFM) with dfm()
*quanteda is newer than the tm package and is the better choice (Welbers et al. 2017)
*Options:
**stemming: stem = TRUE
**removing punctuation: remove_punct = TRUE
*The input is normally a tokens object,
**but an untokenized corpus is tokenized automatically
{{pre
> dfm(nicestJ1)
> nicestJ1.dfm <- dfm(nicestJ1, stem=T, remove_punct=T)
> nicestJ1.dfm
Document-feature matrix of: 8 documents, 342 features (72.8% sparse).
}}
*Option to remove frequent function words (stopwords): remove = stopwords("english")
*Documents can also be grouped: groups = "docvar"
!Browsing the document-feature matrix with View() (note the capital V)
!Viewing a summary of the document-feature matrix <>
!Word frequency list with topfeatures()
*By default, the top 10 words are shown
*Give a number as an option to list that many words
{{pre
> topfeatures(nicestJ1.dfm)
 to you not  in  is  of  it the can and
 60  53  33  30  27  27  25  25  23  22
> topfeatures(nicestJ1.dfm, 20)
   to   you   not    in    is    of    it   the   can   and     i  this  they peopl  have  that    do think   use   are
   60    53    33    30    27    27    25    25    23    22    21    19    18    17    17    16    16    15    14    13
}}
!Creating a word cloud with textplot_wordcloud()
{{pre
> textplot_wordcloud(nicestJ1.dfm)
}}
{{ref_image nicestJ1.png}}
!Grouping several words together and counting the frequency of each group
*For example, prepare several lists of connectives and count the frequency of the connectives belonging to each group.
*Such a group is called a "dictionary"; the command is dictionary()
{{pre
connectives <- dictionary(list(additive = c("moreover", "furthermore", "and"),
                               adversative = c("however", "but", "conversely"),
                               resultative = c("therefore", "thus", "so")))
connectives.dfm <- dfm(corpus_object, dictionary = connectives)
}}
*View the result with View(connectives.dfm).
{{pre
Document-feature matrix of: 10 documents, 3 features (10.0% sparse).
10 x 3 sparse Matrix of class "dfm"
                 features
docs              additive adversative resultative
  JAN0001_P1B.txt        1           1           2
  JAN0001_P2B.txt        4           4           3
  JAN0001_P3B.txt        3           1           1
  JAN0001_P4B.txt        2           1           1
  JAN0001_P5B.txt        4           1           3
  JAN0001_P6B.txt        2           2           2
  JAN0001_P7B.txt        5           3           0
  JAN0001_P8B.txt        1           1           0
  JAN0002_P1A.txt        3           0           1
  JAN0002_P2A.txt        4           3           2
}}
----
(★Note: from here on the object names differ, so the code will not run if copied and pasted as-is.)
!Sorting the document-feature matrix into a word frequency list with dfm_sort()
 dfm_sort(dfm)[rows, number of top-frequency features]
{{pre
> summary(nicestJPN.dfm)
Length  Class   Mode
  4780    dfm     S4
> dfm_sort(nicestJPN.dfm)
Document-feature matrix of: 10 documents, 478 features (79.4% sparse).
> dfm_sort(nicestJPN.dfm)[ , 1:20]
Document-feature matrix of: 10 documents, 20 features (15.5% sparse).
10 x 20 sparse Matrix of class "dfm"
                 features
docs              to you not in is of the people and  i can this it have they do that young think are
  JAN0001_P1B.txt 11  11   6  2  5  3   2      1   1  2   3    1  4    2    0  4    3     0     1   2
  JAN0001_P2B.txt  4   2   7  2  3  5   1      3   4  5   8    4  3    6    8  2    4     8     2   4
  JAN0001_P3B.txt  8   0   4  3  4  3   4      3   3  1   1    2  3    1    1  5    0     3     1   0
  JAN0001_P4B.txt  3   3   0  2  3  0   0      1   2  1   1    2  2    1    4  0    0     0     1   0
  JAN0001_P5B.txt  7   0   5  9  3  7   8      7   4  5   2    5  0    3    2  1    6     2     4   3
  JAN0001_P6B.txt 10  14   5  5  5  5   5      1   2  2   2    2  3    1    0  0    1     0     2   3
  JAN0001_P7B.txt 13  16   6  6  4  2   4      0   5  3   5    3  5    1    3  2    1     0     1   0
  JAN0001_P8B.txt  4   7   0  1  0  2   1      1   1  2   1    0  2    1    0  1    1     0     1   1
  JAN0002_P1A.txt  7   1   3  6  4  3   2      7   3  5   0    3  0    1    0  1    2     0     2   0
  JAN0002_P2A.txt  1   0   3  3  4  3   5      6   4  1   1    1  1    5    4  5    1     5     2   3
}}
!Combining corpora
*Just use +
 nicest20.corpus <- nicestJAN.corpus + nicestNTV.corpus
!Selecting words characteristic of a document: keyness (textstat_keyness())
!Document similarity: textstat_simil(dfm)
{{pre
> textstat_simil(nicestJPN.dfm)
textstat_simil object; method = "correlation"
                JAN0001_P1B.txt JAN0001_P2B.txt JAN0001_P3B.txt JAN0001_P4B.txt JAN0001_P5B.txt JAN0001_P6B.txt
JAN0001_P1B.txt           1.000           0.339           0.450           0.361           0.344           0.622
JAN0001_P2B.txt           0.339           1.000           0.500           0.324           0.427           0.302
JAN0001_P3B.txt           0.450           0.500           1.000           0.284           0.453           0.377
JAN0001_P4B.txt           0.361           0.324           0.284           1.000           0.267           0.358
JAN0001_P5B.txt           0.344           0.427           0.453           0.267           1.000           0.333
JAN0001_P6B.txt           0.622           0.302           0.377           0.358           0.333           1.000
JAN0001_P7B.txt           0.669           0.319           0.386           0.422           0.310           0.620
JAN0001_P8B.txt           0.521           0.272           0.273           0.343           0.223           0.573
JAN0002_P1A.txt           0.424           0.299           0.425           0.277           0.498           0.360
JAN0002_P2A.txt           0.215           0.429           0.351           0.176           0.350           0.170
                JAN0001_P7B.txt JAN0001_P8B.txt JAN0002_P1A.txt JAN0002_P2A.txt
JAN0001_P1B.txt           0.669           0.521           0.424           0.215
JAN0001_P2B.txt           0.319           0.272           0.299           0.429
JAN0001_P3B.txt           0.386           0.273           0.425           0.351
JAN0001_P4B.txt           0.422           0.343           0.277           0.176
JAN0001_P5B.txt           0.310           0.223           0.498           0.350
JAN0001_P6B.txt           0.620           0.573           0.360           0.170
JAN0001_P7B.txt           1.000           0.554           0.363           0.187
JAN0001_P8B.txt           0.554           1.000           0.262           0.144
JAN0002_P1A.txt           0.363           0.262           1.000           0.310
JAN0002_P2A.txt           0.187           0.144           0.310           1.000
}}
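The similarity measure can be switched with the method argument. A minimal sketch (assuming the nicestJPN.dfm object from above; note that in quanteda >= 3.0 textstat_simil() has moved to the separate quanteda.textstats package):

```r
library(quanteda)
# library(quanteda.textstats)  # required instead for quanteda >= 3.0

# Cosine similarity instead of the default "correlation"
sim <- textstat_simil(nicestJPN.dfm, method = "cosine")

# Other documented methods include "jaccard", "ejaccard",
# "dice", "edice", "hamman", and "simple matching".

# Convert to an ordinary matrix for indexing or further processing
as.matrix(sim)[1:3, 1:3]
```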
*There are several optional methods for measuring similarity.
!Measuring the distance between documents with textstat_dist()
{{pre
> textstat_dist(nicestJPN.dfm)
textstat_dist object; method = "euclidean"
                JAN0001_P1B.txt JAN0001_P2B.txt JAN0001_P3B.txt JAN0001_P4B.txt JAN0001_P5B.txt JAN0001_P6B.txt
JAN0001_P1B.txt               0            28.2            22.0            22.5            29.4            21.4
JAN0001_P2B.txt            28.2               0            22.9            25.3            28.5            30.3
JAN0001_P3B.txt            22.0            22.9               0            18.2            25.4            25.2
JAN0001_P4B.txt            22.5            25.3            18.2               0            27.8            24.7
JAN0001_P5B.txt            29.4            28.5            25.4            27.8               0            30.9
JAN0001_P6B.txt            21.4            30.3            25.2            24.7            30.9               0
JAN0001_P7B.txt            22.6            32.6            28.6            28.1            34.0            24.7
JAN0001_P8B.txt            20.4            26.0            18.8            13.9            28.5            21.8
JAN0002_P1A.txt            23.0            27.1            19.5            19.4            24.8            26.0
JAN0002_P2A.txt            27.8            25.2            22.2            22.6            28.3            30.3
                JAN0001_P7B.txt JAN0001_P8B.txt JAN0002_P1A.txt JAN0002_P2A.txt
JAN0001_P1B.txt            22.6            20.4            23.0            27.8
JAN0001_P2B.txt            32.6            26.0            27.1            25.2
JAN0001_P3B.txt            28.6            18.8            19.5            22.2
JAN0001_P4B.txt            28.1            13.9            19.4            22.6
JAN0001_P5B.txt            34.0            28.5            24.8            28.3
JAN0001_P6B.txt            24.7            21.8            26.0            30.3
JAN0001_P7B.txt               0            26.1            29.3            33.3
JAN0001_P8B.txt            26.1               0            19.9            23.3
JAN0002_P1A.txt            29.3            19.9               0            23.5
JAN0002_P2A.txt            33.3            23.3            23.5               0
}}
!Observing lexical dispersion with textplot_xray()
*Used in combination with kwic()
{{pre
textplot_xray(
  kwic(nicestJAN.corpus, pattern = "and"),
  kwic(nicestJAN.corpus, pattern = "but"),
  kwic(nicestJAN.corpus, pattern = "so")
)
}}
{{ref_image dispersion.png}}
*Option to put the plots on an absolute scale: scale = "absolute"
{{pre
textplot_xray(
  kwic(nicestJAN.corpus, pattern = "and"),
  kwic(nicestJAN.corpus, pattern = "but"),
  kwic(nicestJAN.corpus, pattern = "so"),
  scale = "absolute"
)
}}
{{ref_image dispersion-absolute.png}}
!Random sampling <>
----
*References
https://quanteda.io/articles/pkgdown/quickstart_ja.html
http://i.amcat.nl/lda/1_textanalysis.html
https://quanteda.io/articles/pkgdown/examples/plotting.html
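The random-sampling section above leaves the command name blank; quanteda provides corpus_sample() for this purpose. A minimal sketch, assuming the nicestJ1 corpus created earlier:

```r
library(quanteda)

set.seed(123)  # make the random sample reproducible

# Draw 10 documents at random from the corpus, without replacement
nicest.sample <- corpus_sample(nicestJ1, size = 10)
summary(nicest.sample)

# Sampling with replacement, e.g. as the basis for bootstrapping
nicest.boot <- corpus_sample(nicestJ1, size = ndoc(nicestJ1), replace = TRUE)
```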