!!!Processing multi-word expressions with quanteda
{{outline}}
----
!Searching for multi-word expressions
*Collect the specific multi-word expressions in a character vector
 multiword <- c("in addition", "on the other hand", "as a result")
*Use phrase() to specify that each expression is handled as a single "phrase" unit
 phrase(multiword)
*Example: a kwic search
{{pre
> multiword <- c("in addition", "on the other hand", "as a result")
> kwic(nicestJPN.corpus, pattern = phrase(multiword))
 [JAN0001_P2B.txt, 12:15]       are active and free, | on the other hand | olders are less active and
 [JAN0001_P5B.txt, 117:118]   is the biggest reason. | In addition       | above reason, people will
 [JAN0001_P7B.txt, 196:199] answer ten questions. But | on the other hand | , if you understanding ideas
}}
!Listing collocations by frequency and strength: textstat_collocations()
{{pre
> textstat_collocations(nicestJPN.corpus)
    collocation count count_nested length   lambda        z
1        do not    10            0      2 4.151357 8.325134
2  young people     8            0      2 4.283437 8.180578
3        if you    10            0      2 5.196013 7.994369
4  young person     6            0      2 4.946850 7.666668
5       can not     9            0      2 3.478602 7.612492
6       i think     7            0      2 3.939177 7.470671
7   enough time     5            0      2 6.849914 7.351494
8       want to     9            0      2 4.083226 6.925760
9    person can     5            0      2 4.331071 6.718328
10   enjoy life     3            0      2 5.297317 6.307093
11  have enough     4            0      2 4.753478 6.300479
12       of all     5            0      2 4.435229 6.243870
}}
*In the output, lambda is the estimated strength of association for each collocation and z is the corresponding z-statistic; the results are sorted by z, with larger values indicating stronger collocations.
*Option to specify the collocation length (number of grams): size = number
*Option to specify the minimum frequency: min_count = count
*Note: with quanteda v3 and later the call differs slightly; see the sketch after the output below.
{{pre
> textstat_collocations(nicestJPN.corpus, size = 3, min_count = 3)
          collocation count count_nested length     lambda          z
1         need not to     3            0      3  3.5022761  1.5067168
2        first of all     5            0      3  2.1345325  0.8123263
3     not have enough     4            0      3  1.4805931  0.6381457
4        can not have     3            0      3  0.9067169  0.5200749
5       to earn money     3            0      3  0.8133261  0.2775042
6   young people have     3            0      3  0.4902593  0.2650525
7          if you are     3            0      3  0.3020849  0.1372883
8       in the social     3            0      3 -0.4990624 -0.1942642
9           to sum up     5            0      3 -0.7190799 -0.2251418
10 ideas and concepts     4            0      3 -0.7777511 -0.2601743
11   young person can     3            0      3 -1.5590397 -0.8279029
12   have enough time     3            0      3 -2.0558982 -0.9839019
13        do not know     3            0      3 -2.1209809 -1.5745493

> textstat_collocations(nicestJPN.corpus, size = 4, min_count = 3)
           collocation count count_nested length    lambda          z
1  can not have enough     3            0      4  1.922831  0.4567948
2 not have enough time     3            0      4 -3.640576 -0.8486199

> textstat_collocations(nicestJPN.corpus, size = 5, min_count = 2)
                         collocation count count_nested length     lambda          z
1          young person can not have     2            0      5  4.5336736  0.7267201
2           can not have enough time     2            0      5  3.3482331  0.4893972
3          the best way of traveling     2            0      5  2.4255540  0.3364245
4         this is the biggest reason     2            0      5  1.5023694  0.2082819
5            not give enough time to     2            0      5  1.2160942  0.1723139
6  time to helping their communities     2            0      5  0.8492974  0.1155970
7            do not give enough time     2            0      5 -1.4512792 -0.2075141
8         person can not have enough     2            0      5 -1.8019596 -0.2712349
9        give enough time to helping     2            0      5 -2.3165466 -0.3140204
10      enough time to helping their     2            0      5 -2.4219786 -0.3338030
}}
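The transcripts above were produced with an older quanteda release, in which kwic() and textstat_collocations() could be applied directly to a corpus object. A minimal sketch of the equivalent calls for newer installations, assuming quanteda v3 or later with the companion quanteda.textstats package installed, and the same nicestJPN.corpus object:
{{pre
# Sketch for quanteda v3+ (assumptions: quanteda >= 3 and the
# quanteda.textstats package are installed; scores may differ slightly
# from the older-version transcripts above).
library(quanteda)
library(quanteda.textstats)

# v3 works on tokens objects rather than directly on a corpus
toks <- tokens(nicestJPN.corpus, remove_numbers = TRUE, remove_punct = TRUE)

# kwic() now expects tokens as input
multiword <- c("in addition", "on the other hand", "as a result")
kwic(toks, pattern = phrase(multiword))

# textstat_collocations() has moved to the quanteda.textstats package
textstat_collocations(toks, size = 3, min_count = 3)
}}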
!Creating a list of multi-word units as tokens (n-grams)
*This can be done with the same function used for word tokenization (tokens()), by specifying an option:
 tokens(corpus_data)
 tokens(corpus_data, ngrams = number_of_grams)
*In practice, also use the options that remove numbers and punctuation:
 remove_numbers=T, remove_punct=T
{{pre
nicest.tokens <- tokens(nicest.corpus, remove_numbers=T, remove_punct=T)
nicest.2gram <- tokens(nicest.corpus, ngrams = 2, remove_numbers=T, remove_punct=T)
nicest.3gram <- tokens(nicest.corpus, ngrams = 3, remove_numbers=T, remove_punct=T)
nicest.4gram <- tokens(nicest.corpus, ngrams = 4, remove_numbers=T, remove_punct=T)

> head(nicest.4gram, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_people_say_that"           "people_say_that_specialized"   
[3] "say_that_specialized_knowledge" "that_specialized_knowledge_is" 
[5] "specialized_knowledge_is_not"   "knowledge_is_not_important"    
[7] "is_not_important_for"           "not_important_for_human"       
}}
!Creating a collocation list that skips intervening words (skip-grams)
*Specify the number of words to skip with the option skip = number.
**A range of values can also be given:
*** skip = 0:1 (no skip, i.e. ordinary n-grams, plus skip-grams that skip one word)
*** skip = 1:2 (skip-grams that skip one word and that skip two words)
{{pre
nicest.4skip <- tokens(nicest.corpus, ngrams = 4, skip = 1, remove_numbers=T, remove_punct=T)

> head(nicest.4skip, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_say_specialized_is"        "people_that_knowledge_not"     
[3] "say_specialized_is_important"   "that_knowledge_not_for"        
[5] "specialized_is_important_human" "knowledge_not_for_however"     
[7] "is_important_human_who"         "not_for_however_make"          

nicest.4skip2 <- tokens(nicest.corpus, ngrams = 4, skip = 0:1, remove_numbers=T, remove_punct=T)

> head(nicest.4skip2, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_people_say_that"            "Some_people_say_specialized"   
[3] "Some_people_that_specialized"    "Some_people_that_knowledge"    
[5] "Some_say_that_specialized"       "Some_say_that_knowledge"       
[7] "Some_say_specialized_knowledge"  "Some_say_specialized_is"       

nicest.4skip3 <- tokens(nicest.corpus, ngrams = 4, skip = 1:2, remove_numbers=T, remove_punct=T)

> head(nicest.4skip3, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_say_specialized_is"       "Some_say_specialized_not"     
[3] "Some_say_knowledge_not"        "Some_say_knowledge_important" 
[5] "Some_that_knowledge_not"       "Some_that_knowledge_important"
[7] "Some_that_is_important"        "Some_that_is_for"             
}}
*This makes it possible to create lists of n-gram expressions containing a "gap" in the middle.
*A drawback, however, is that the result retains no information about where the "gap" is.
*For the equivalent calls in newer quanteda releases, see the note at the end of this page.
----
*reference: https://quanteda.io/articles/pkgdown/examples/phrase.html
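----
!Note: n-grams and skip-grams in newer quanteda releases
The transcripts above come from an older quanteda release, in which tokens() itself accepted the ngrams and skip arguments. In current releases these arguments were removed, and n-grams and skip-grams are instead built from an existing tokens object with tokens_ngrams() and tokens_skipgrams(). A minimal sketch, assuming a recent quanteda and the same nicest.corpus object:
{{pre
# Sketch for newer quanteda releases (an assumption about the installed
# version; the tokens(ngrams = ...) calls above no longer run there).
library(quanteda)

# tokenize first, removing numbers and punctuation as above
toks <- tokens(nicest.corpus, remove_numbers = TRUE, remove_punct = TRUE)

# 4-grams, equivalent to tokens(..., ngrams = 4) above
nicest.4gram <- tokens_ngrams(toks, n = 4)

# ordinary 4-grams plus one-word skip-grams, as with skip = 0:1 above
nicest.4skip2 <- tokens_ngrams(toks, n = 4, skip = 0:1)

# tokens_skipgrams() performs the same operation under an explicit name
nicest.4skip3 <- tokens_skipgrams(toks, n = 4, skip = 1:2)
}}
As before, the position of the skipped "gap" is not recorded in the resulting tokens.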