R
!!!Examples using quanteda
{{outline}}
----
!Sample: eight learner texts from <<the learner corpus NICEST>> 1.0
*The folder contains several plain-text files (text data only)
*The working directory has already been set to that folder
{{pre
> getwd()
[1] "C:/Users/ /Documents/NICEST-samples"
> list.files()
[1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt" "JAN0001_P6B.txt"
[7] "JAN0001_P7B.txt" "JAN0001_P8B.txt"
}}
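The setup above can be tried out in a throwaway folder. This is a minimal sketch using only base R; the folder name nicest-demo and the two short texts are made up for illustration, and the pattern argument simply restricts the listing to .txt files.

```r
# Hypothetical demo folder inside the session's temporary directory
demo_dir <- file.path(tempdir(), "nicest-demo")
dir.create(demo_dir, showWarnings = FALSE)

# Two tiny plain-text files standing in for the learner essays
writeLines("Some people say that knowledge is not important.",
           file.path(demo_dir, "JAN0001_P1B.txt"))
writeLines("You may think that young people are busy.",
           file.path(demo_dir, "JAN0001_P2B.txt"))

# Make the folder the working directory, then list only the .txt files
old_wd <- setwd(demo_dir)
txt_files <- list.files(pattern = "\\.txt$")
print(txt_files)

# Restore the previous working directory
setwd(old_wd)
```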

!Read the texts with <<readtext()>> from the readtext package
{{pre
> library(readtext)
> nicest.tmp <- readtext("*.txt")
> nicest.tmp
readtext object consisting of 8 documents and 0 docvars.
# Description: df[,2] [8 x 2]
  doc_id          text               
  <chr>           <chr>              
1 JAN0001_P1B.txt "\"Some peopl\"..."
2 JAN0001_P2B.txt "\"You may th\"..."
3 JAN0001_P3B.txt "\"Compared w\"..."
4 JAN0001_P4B.txt "\"You may ha\"..."
5 JAN0001_P5B.txt "\"Elderly pe\"..."
6 JAN0001_P6B.txt "\"Group tour\"..."
# ... with 2 more rows
}}
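The output above reports 0 docvars. Since the file names have two parts joined by an underscore, readtext() can optionally turn those parts into document variables. The sketch below assumes the readtext package is installed; the demo folder and texts are made up, and the variable names learner and prompt are only a guess at what the two parts of the file name mean.

```r
library(readtext)

# Hypothetical demo folder with two files named like the NICEST samples
demo_dir <- file.path(tempdir(), "nicest-docvars-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("Some people say that knowledge is not important.",
           file.path(demo_dir, "JAN0001_P1B.txt"))
writeLines("You may think that young people are busy.",
           file.path(demo_dir, "JAN0001_P2B.txt"))

# Split each file name at "_" and store the parts as document variables
# ("learner" and "prompt" are illustrative labels, not official NICEST terms)
demo_rt <- readtext(file.path(demo_dir, "*.txt"),
                    docvarsfrom = "filenames",
                    dvsep = "_",
                    docvarnames = c("learner", "prompt"))
print(demo_rt)
```

Document variables read this way are carried over when the object is passed to corpus().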

!Convert the data to a corpus with <<corpus()>>
{{pre
> library(quanteda)
> nicestJ1 <- corpus(nicest.tmp)
> nicestJ1
Corpus consisting of 8 documents and 0 docvars.
}}

!View a summary with <<summary()>>
{{pre
> summary(nicestJ1)
Corpus consisting of 8 documents:

            Text Types Tokens Sentences
 JAN0001_P1B.txt   116    214        12
 JAN0001_P2B.txt   138    268        17
 JAN0001_P3B.txt    97    169        11
 JAN0001_P4B.txt    68     99         8
 JAN0001_P5B.txt   120    262        16
 JAN0001_P6B.txt   114    224        13
 JAN0001_P7B.txt   121    268        18
 JAN0001_P8B.txt    71    108         8

Source: C:/Users/ /Documents/NICEST-samples/* on x86-64 by sugiura
Created: Thu Nov 14 15:26:21 2019
Notes: 
}}

!View the contents with <<texts()>>
*Called on the whole corpus, it prints every text in full
*To view one particular text, give its position in the corpus as [number].
{{pre
> texts(nicestJ1)[3]
JAN0001_P3B.txt 
"Compared with past, young people nowadays do not give enough time to (rest omitted)
}}
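The transcript on this page dates from 2019. In quanteda version 3 and later, texts() appears to have been deprecated in favour of as.character(), so the sketch below uses that instead. The two-document corpus is made up for illustration, and substr() is one simple way to avoid printing each text in full.

```r
library(quanteda)

# Made-up two-document corpus standing in for nicestJ1
demo_corp <- corpus(c(
  doc1 = "Some people say that knowledge is not important.",
  doc2 = "Compared with past, young people nowadays do not give enough time."
))

# as.character() returns the texts as a named character vector
demo_txt <- as.character(demo_corp)

# Pick one document by position or by name
print(demo_txt[2])
print(demo_txt["doc1"])

# Keep only the first 20 characters of each text to shorten the output
preview <- substr(demo_txt, 1, 20)
print(preview)
```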

!Run a KWIC search with <<kwic(corpus, pattern="string")>>
{{pre
> kwic(nicestJ1, pattern="however")
                                                                                           
 [JAN0001_P1B.txt, 13]          not important for human, | however | , who make todays life
 [JAN0001_P2B.txt, 31]       compared with young people, | however | , I recognize that is 
 [JAN0001_P7B.txt, 71] misunderstanding about that area. | However | , if you only know    

}}
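Note that the search above also found the capitalised "However": kwic() matches case-insensitively by default. Newer quanteda releases expect a tokens object rather than a corpus as the first argument, so a version-tolerant sketch tokenizes first. The corpus below is made up, and window = 3 is an arbitrary choice for the number of context words on each side.

```r
library(quanteda)

# Made-up corpus with one lower-case and one capitalised hit
demo_corp <- corpus(c(
  doc1 = "It is not important, however, who makes it.",
  doc2 = "However, if you only know one area, you may misunderstand."
))

# Tokenize first, then search the tokens
demo_toks <- tokens(demo_corp)
demo_kwic <- kwic(demo_toks, pattern = "however", window = 3)
print(demo_kwic)
```

Setting case_insensitive = FALSE would restrict the search to the exact spelling given in pattern.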

!Tokenize with <<tokens()>>
*Option to remove punctuation: <<remove_punct=T>>
*Option to remove numbers: <<remove_numbers=T>>

{{pre
> tokens(nicestJ1, remove_numbers=T, remove_punct=T)
tokens from 8 documents.
JAN0001_P1B.txt :
  [1] "Some"         "people"       "say"          "that"         "specialized"  "knowledge"    "is"           "not"          "important"    "for"         
 [11] "human"        "however"      "who"          "make"         "todays"       "life"         "such"         "a"            "convenience"  "are"         
 [21] "always"       "a"            "few"          "number"       "of"           "genius"       "with"         "very"         "specific"     "knowledges"  
 [31] "To"           "consider"     "this"         "it"           "can"          "be"           "said"         "that"         "to"           "specialized" 
 [41] "in"           "one"          "specific"     "subject"      "is"           "better"       "than"         "to"           "get"          "bload"       
 [51] "knowledge"    "of"           "many"         "academic"     "subjects"     "There"        "is"           "some"         "more"         "reasons"     
(rest omitted)
}}
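The effect of the two options is easiest to see on a single short sentence. A minimal sketch with a made-up input string; tokens_tolower() is added only to show that case is preserved until it is lowered explicitly.

```r
library(quanteda)

# Made-up sentence containing numbers and punctuation
demo_toks <- tokens("In 2019, we read 8 essays!",
                    remove_punct = TRUE,
                    remove_numbers = TRUE)

# Flatten the tokens object into a plain character vector for inspection
demo_words <- unlist(as.list(demo_toks), use.names = FALSE)
print(demo_words)

# Case is kept by tokens(); lower it explicitly if needed
demo_lower <- tokens_tolower(demo_toks)
print(demo_lower)
```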

!Create a document-feature matrix with <<dfm()>>
*Options
**Stemming: <<stem=T>>
**Removing punctuation: <<remove_punct=T>>
{{pre
> dfm(nicestJ1)
> nicestJ1.dfm <- dfm(nicestJ1, stem=T, remove_punct=T)
> nicestJ1.dfm
Document-feature matrix of: 8 documents, 342 features (72.8% sparse).

}}
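In quanteda version 3 and later, passing a corpus straight to dfm() and using the stem argument appear to be deprecated. A sketch of what seems to be the equivalent pipeline there: tokenize, build the matrix, then stem with dfm_wordstem(). The two short texts are made up for illustration.

```r
library(quanteda)

# Made-up corpus standing in for nicestJ1
demo_corp <- corpus(c(
  doc1 = "Some people say that people think.",
  doc2 = "Young people use phones."
))

# Step 1: tokenize and drop punctuation
demo_toks <- tokens(demo_corp, remove_punct = TRUE)

# Step 2: build the document-feature matrix (lower-cases by default)
demo_dfm <- dfm(demo_toks)

# Step 3: reduce each feature to its stem, e.g. "people" becomes "peopl"
demo_dfm <- dfm_wordstem(demo_dfm)

print(demo_dfm)
print(featnames(demo_dfm))
```

Stemming explains why the frequency list further down this page contains the form "peopl" rather than "people".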

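!List word frequencies with <<topfeatures()>>
*By default it returns the 10 most frequent words
*Pass a number as the second argument to get that many words instead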
{{pre
> topfeatures(nicestJ1.dfm)
 to you not  in  is  of  it the can and 
 60  53  33  30  27  27  25  25  23  22 
> topfeatures(nicestJ1.dfm, 20)
   to   you   not    in    is    of    it   the   can   and     i  this  they peopl  have  that    do think   use   are 
   60    53    33    30    27    27    25    25    23    22    21    19    18    17    17    16    16    15    14    13 
}}
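The lists above are dominated by function words such as "to" and "the". One possible way to focus on content words is to drop unwanted features with dfm_remove() before calling topfeatures(). The sketch below uses a made-up text and a hand-picked removal list that is purely illustrative.

```r
library(quanteda)

# Made-up text with clearly different word frequencies
demo_dfm <- dfm(tokens("people people people people to to to you you think"))

# The two most frequent features
top2 <- topfeatures(demo_dfm, 2)
print(top2)

# Drop a hand-picked list of function words (illustrative only)
content_dfm <- dfm_remove(demo_dfm, pattern = c("to", "you"))
top_content <- topfeatures(content_dfm, 2)
print(top_content)
```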

!Create a word cloud with <<textplot_wordcloud()>>
{{pre
> textplot_wordcloud(nicestJ1.dfm)
}}
{{ref_image nicestJ1.png}}
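To keep a word cloud as an image file, the plot can be drawn into a PNG device. This is a sketch under some assumptions: in quanteda version 3 and later textplot_wordcloud() seems to live in the separate quanteda.textplots package, the text and output file name are made up, and min_count = 1 is set only because the demo data is tiny.

```r
library(quanteda)
library(quanteda.textplots)  # assumption: home of textplot_wordcloud() in newer releases

# Made-up text standing in for the learner essays
demo_dfm <- dfm(tokens("people people people think think use phone time time time time"))

# Hypothetical output file in the temporary directory
out_file <- file.path(tempdir(), "demo-wordcloud.png")

set.seed(1)  # the word layout is random; fix the seed for a repeatable picture
png(out_file, width = 600, height = 600)
textplot_wordcloud(demo_dfm, min_count = 1, max_words = 50)
dev.off()
```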