R
!!!Example using quanteda
{{outline}}
----
!Sample: eight learner texts from <<学習者コーパスNICEST>> 1.0 (the NICEST learner corpus)
*The folder contains several plain-text files that hold text data only.
*The working directory is already set to that folder.
{{pre
> getwd()
[1] "C:/Users/ /Documents/NICEST-samples"
> list.files()
[1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt" "JAN0001_P6B.txt"
[7] "JAN0001_P7B.txt" "JAN0001_P8B.txt"
}}
!Read the texts with <> from the readtext package
{{pre
> library(readtext)
> nicest.tmp <- readtext("*.txt")
> nicest.tmp
readtext object consisting of 8 documents and 0 docvars.
# Description: df[,2] [8 x 2]
  doc_id          text
1 JAN0001_P1B.txt "\"Some peopl\"..."
2 JAN0001_P2B.txt "\"You may th\"..."
3 JAN0001_P3B.txt "\"Compared w\"..."
4 JAN0001_P4B.txt "\"You may ha\"..."
5 JAN0001_P5B.txt "\"Elderly pe\"..."
6 JAN0001_P6B.txt "\"Group tour\"..."
# ... with 2 more rows
}}
!Convert to a corpus <>
{{pre
> library(quanteda)
> nicestJ1 <- corpus(nicest.tmp)
> nicestJ1
Corpus consisting of 8 documents and 0 docvars.
}}
!View the summary <>
{{pre
> summary(nicestJ1)
Corpus consisting of 8 documents:

            Text Types Tokens Sentences
 JAN0001_P1B.txt   116    214        12
 JAN0001_P2B.txt   138    268        17
 JAN0001_P3B.txt    97    169        11
 JAN0001_P4B.txt    68     99         8
 JAN0001_P5B.txt   120    262        16
 JAN0001_P6B.txt   114    224        13
 JAN0001_P7B.txt   121    268        18
 JAN0001_P8B.txt    71    108         8

Source: C:/Users/ /Documents/NICEST-samples/* on x86-64 by sugiura
Created: Thu Nov 14 15:26:21 2019
Notes:
}}
!View the contents <>
*Called as is, it prints every text.
*To view one particular text, give its position as [number].
{{pre
> texts(nicestJ1)[3]
JAN0001_P3B.txt
"Compared with past, young people nowadays do not give enough time to (rest omitted)
}}
!Run a KWIC search <>
{{pre
> kwic(nicestJ1, pattern="however")

 [JAN0001_P1B.txt, 13] not important for human, | however | , who make todays life
 [JAN0001_P2B.txt, 31] compared with young people, | however | , I recognize that is
 [JAN0001_P7B.txt, 71] misunderstanding about that area. | However | , if you only know
}}
!Tokenize <>
*Option to remove punctuation <>
*Option to remove numbers <>
{{pre
> tokens(nicestJ1, remove_numbers=T, remove_punct=T)
tokens from 8 documents.
JAN0001_P1B.txt :
 [1] "Some" "people" "say" "that" "specialized" "knowledge" "is" "not" "important" "for"
[11] "human" "however" "who" "make" "todays" "life" "such" "a" "convenience" "are"
[21] "always" "a" "few" "number" "of" "genius" "with" "very" "specific" "knowledges"
[31] "To" "consider" "this" "it" "can" "be" "said" "that" "to" "specialized"
[41] "in" "one" "specific" "subject" "is" "better" "than" "to" "get" "bload"
[51] "knowledge" "of" "many" "academic" "subjects" "There" "is" "some" "more" "reasons"
(rest omitted)
}}
!Create a document-feature matrix <>
*Options
**Stemming <>
**Punctuation removal <>
{{pre
> dfm(nicestJ1)
> nicestJ1.dfm <- dfm(nicestJ1, stem=T, remove_punct=T)
> nicestJ1.dfm
Document-feature matrix of: 8 documents, 342 features (72.8% sparse).
}}
!Word frequency list <>
*By default, the top 10 words are shown.
*Pass a number as an option to list that many words.
{{pre
> topfeatures(nicestJ1.dfm)
 to you not  in  is  of  it the can and
 60  53  33  30  27  27  25  25  23  22
> topfeatures(nicestJ1.dfm, 20)
   to   you   not    in    is    of    it   the   can   and     i  this  they peopl  have  that    do think   use   are
   60    53    33    30    27    27    25    25    23    22    21    19    18    17    17    16    16    15    14    13
}}
!Create a word cloud <>
{{pre
> textplot_wordcloud(nicestJ1.dfm)
}}
{{ref_image nicestJ1.png}}
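!Sketch: the same steps in quanteda 3 or later
The session above dates from 2019. In quanteda 3 and later, texts() and the call dfm(corpus, stem=T, remove_punct=T) are deprecated, and later releases remove them. The block below is a minimal sketch of the equivalent steps, assuming quanteda 3 or later is installed. The two sample sentences and the object names (sample_texts, corp, toks, kw, dfmat) are made up for illustration and are not NICEST data.

```r
# Minimal sketch, assuming quanteda >= 3.
# The two texts below are invented stand-ins for the NICEST files.
library(quanteda)

sample_texts <- c(
  doc1 = "Some people say that knowledge is not important. However, I disagree.",
  doc2 = "You may think so, however, 2 reasons show that people use knowledge."
)

# Build the corpus (same call as in the session above)
corp <- corpus(sample_texts)

# as.character() takes the place of the deprecated texts()
second_text <- as.character(corp)[2]

# Tokenize once, dropping punctuation and numbers, and reuse the result
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)

# KWIC search on the tokens object (matching is case-insensitive by default)
kw <- kwic(toks, pattern = "however", window = 3)

# Build the document-feature matrix from tokens; stemming is a separate step
dfmat <- dfm_wordstem(dfm(toks))

# Most frequent features
topfeatures(dfmat, 5)
```

With real files, corp would come from corpus(readtext("*.txt")) as in the session above. In quanteda 3 and later, textplot_wordcloud() lives in the separate quanteda.textplots package, so the word cloud step would need library(quanteda.textplots) first.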