
quanteda3



Processing multi-word expressions with quanteda




 Listing multi-word expressions (n-grams)

List collocations with their frequency and association strength: textstat_collocations(corpus)

> textstat_collocations(nicestJPN.corpus)
            collocation count count_nested length    lambda        z
1                do not    10            0      2  4.151357 8.325134
2          young people     8            0      2  4.283437 8.180578
3                if you    10            0      2  5.196013 7.994369
4          young person     6            0      2  4.946850 7.666668
5               can not     9            0      2  3.478602 7.612492
6               i think     7            0      2  3.939177 7.470671
7           enough time     5            0      2  6.849914 7.351494
8               want to     9            0      2  4.083226 6.925760
9            person can     5            0      2  4.331071 6.718328
10           enjoy life     3            0      2  5.297317 6.307093
11          have enough     4            0      2  4.753478 6.300479
12               of all     5            0      2  4.435229 6.243870
  • Option for the length of the collocation (number of words): size = number
  • Option for the minimum frequency: min_count = count
> textstat_collocations(nicestJPN.corpus, size = 3, min_count = 3)
          collocation count count_nested length     lambda          z
1         need not to     3            0      3  3.5022761  1.5067168
2        first of all     5            0      3  2.1345325  0.8123263
3     not have enough     4            0      3  1.4805931  0.6381457
4        can not have     3            0      3  0.9067169  0.5200749
5       to earn money     3            0      3  0.8133261  0.2775042
6   young people have     3            0      3  0.4902593  0.2650525
7          if you are     3            0      3  0.3020849  0.1372883
8       in the social     3            0      3 -0.4990624 -0.1942642
9           to sum up     5            0      3 -0.7190799 -0.2251418
10 ideas and concepts     4            0      3 -0.7777511 -0.2601743
11   young person can     3            0      3 -1.5590397 -0.8279029
12   have enough time     3            0      3 -2.0558982 -0.9839019
13        do not know     3            0      3 -2.1209809 -1.5745493

> textstat_collocations(nicestJPN.corpus, size = 4, min_count = 3)
           collocation count count_nested length    lambda          z
1  can not have enough     3            0      4  1.922831  0.4567948
2 not have enough time     3            0      4 -3.640576 -0.8486199

> textstat_collocations(nicestJPN.corpus, size = 5, min_count = 2)
                         collocation count count_nested length     lambda          z
1          young person can not have     2            0      5  4.5336736  0.7267201
2           can not have enough time     2            0      5  3.3482331  0.4893972
3          the best way of traveling     2            0      5  2.4255540  0.3364245
4         this is the biggest reason     2            0      5  1.5023694  0.2082819
5            not give enough time to     2            0      5  1.2160942  0.1723139
6  time to helping their communities     2            0      5  0.8492974  0.1155970
7            do not give enough time     2            0      5 -1.4512792 -0.2075141
8         person can not have enough     2            0      5 -1.8019596 -0.2712349
9        give enough time to helping     2            0      5 -2.3165466 -0.3140204
10      enough time to helping their     2            0      5 -2.4219786 -0.3338030
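
In quanteda 3.0 and later, textstat_collocations() is, to my knowledge, provided by the separate quanteda.textstats package. The sketch below repeats the calls above under that assumption. The texts and the object names (toy.texts, toy.tokens, coll2, coll3) are hypothetical and are not the NICEST data.

```r
# Sketch assuming quanteda >= 3.0 with quanteda.textstats installed.
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy texts, used only to make the example self-contained.
toy.texts <- c(d1 = "I do not have enough time. Young people do not have enough time.",
               d2 = "If you do not know, young people can not have enough time.")
toy.tokens <- tokens(corpus(toy.texts), remove_numbers = TRUE, remove_punct = TRUE)

# Two-word collocations that occur at least twice.
coll2 <- textstat_collocations(toy.tokens, size = 2, min_count = 2)

# Three-word collocations that occur at least twice.
coll3 <- textstat_collocations(toy.tokens, size = 3, min_count = 2)

head(coll2)
head(coll3)
```

The result is a data frame, so the count, lambda and z columns can be filtered or sorted with ordinary data frame operations.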

Creating token lists in which each n-gram is one token

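  • This is done with an option of tokens(), the same function that is used for word tokenization.
tokens(corpus)
tokens(corpus, ngrams = n-gram size)
  • In practice, the options that remove numbers and punctuation are used as well.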
remove_numbers=T, remove_punct=T
nicest.tokens <- tokens(nicest.corpus, remove_numbers=T, remove_punct=T)

nicest.2gram <- tokens(nicest.corpus, ngrams = 2, remove_numbers=T, remove_punct=T)

nicest.3gram <- tokens(nicest.corpus, ngrams = 3, remove_numbers=T, remove_punct=T)

nicest.4gram <- tokens(nicest.corpus, ngrams = 4, remove_numbers=T, remove_punct=T)

> head(nicest.4gram, 1)
tokens from 1 document.
JAN0001_P1B.txt :
  [1] "Some_people_say_that"              "people_say_that_specialized"      
  [3] "say_that_specialized_knowledge"    "that_specialized_knowledge_is"    
  [5] "specialized_knowledge_is_not"      "knowledge_is_not_important"       
  [7] "is_not_important_for"              "not_important_for_human" 

  • A range of n-gram sizes can also be specified: ngrams = 2:4
nicestJAN2to4gram <- tokens(nicestJAN.corpus, ngrams= 2:4, remove_punct=T)
> str(nicestJAN2to4gram)
List of 10
 $ JAN0001_P1B.txt: chr [1:570] "Some_people" "people_say" "say_that" "that_specialized" ...
 $ JAN0001_P2B.txt: chr [1:705] "You_may" "may_think" "think_that" "that_young" ...
 $ JAN0001_P3B.txt: chr [1:438] "Compared_with" "with_past" "past_young" "young_people" ...
 $ JAN0001_P4B.txt: chr [1:246] "You_may" "may_have" "have_experiences" "experiences_like" ...
 $ JAN0001_P5B.txt: chr [1:681] "Elderly_person" "person_often" "often_says" "says_that" ...
 $ JAN0001_P6B.txt: chr [1:594] "Group_tourisms" "tourisms_are" "are_easy" "easy_to" ...
 $ JAN0001_P7B.txt: chr [1:690] "I_sutudents" "sutudents_must" "must_understand" "understand_ideas" ...
 $ JAN0001_P8B.txt: chr [1:267] "Most_of" "of_people" "people_think" "think_that" ...
 $ JAN0002_P1A.txt: chr [1:441] "I_agree" "agree_this" "this_opinion" "opinion_I" ...
 $ JAN0002_P2A.txt: chr [1:570] "Generary_speaking" "speaking_young" "young_people" "people_enjoy" ...
 - attr(*, "types")= chr [1:4736] "Some_people" "people_say" "say_that" "that_specialized" ...
 - attr(*, "padding")= logi FALSE
 - attr(*, "class")= chr "tokens"
 - attr(*, "what")= chr "word"
 - attr(*, "ngrams")= int [1:3] 2 3 4
 - attr(*, "skip")= int 0
 - attr(*, "concatenator")= chr "_"
 - attr(*, "docvars")='data.frame':	10 obs. of  0 variables

> summary(nicestJAN2to4gram)
                Length Class  Mode     
JAN0001_P1B.txt 570    -none- character
JAN0001_P2B.txt 705    -none- character
JAN0001_P3B.txt 438    -none- character
JAN0001_P4B.txt 246    -none- character
JAN0001_P5B.txt 681    -none- character
JAN0001_P6B.txt 594    -none- character
JAN0001_P7B.txt 690    -none- character
JAN0001_P8B.txt 267    -none- character
JAN0002_P1A.txt 441    -none- character
JAN0002_P2A.txt 570    -none- character

> head(nicestJAN2to4gram[[1]],50)
 [1] "Some_people"           "people_say"            "say_that"              "that_specialized"     
 [5] "specialized_knowledge" "knowledge_is"          "is_not"                "not_important"        
 [9] "important_for"         "for_human"             "human_however"         "however_who"          
[13] "who_make"              "make_todays"           "todays_life"           "life_such"            
[17] "such_a"                "a_convenience"         "convenience_are"       "are_always"           
[21] "always_a"              "a_few"                 "few_number"            "number_of"            
[25] "of_genius"             "genius_with"           "with_very"             "very_specific"        
[29] "specific_knowledges"   "knowledges_To"         "To_consider"           "consider_this"        
[33] "this_it"               "it_can"                "can_be"                "be_said"              
[37] "said_that"             "that_to"               "to_specialized"        "specialized_in"       
[41] "in_one"                "one_specific"          "specific_subject"      "subject_is"           
[45] "is_better"             "better_than"           "than_to"               "to_get"               
[49] "get_bload"             "bload_knowledge" 

> tail(nicestJAN2to4gram[[1]],50)
 [1] "not_know_about_such"               "know_about_such_a"                 "about_such_a_common"              
 [4] "such_a_common_thing"               "a_common_thing_you"                "common_thing_you_only"            
 [7] "thing_you_only_have"               "you_only_have_to"                  "only_have_to_use"                 
[10] "have_to_use_dictionaries"          "to_use_dictionaries_to"            "use_dictionaries_to_understand"   
[13] "dictionaries_to_understand_You"    "to_understand_You_need"            "understand_You_need_not"          
[16] "You_need_not_to"                   "need_not_to_become"                "not_to_become_walking"            
[19] "to_become_walking_dictionary"      "become_walking_dictionary_because" "walking_dictionary_because_you"   
[22] "dictionary_because_you_can"        "because_you_can_use"               "you_can_use_dictionaries"         
[25] "can_use_dictionaries_To"           "use_dictionaries_To_sum"           "dictionaries_To_sum_up"           
[28] "To_sum_up_for"                     "sum_up_for_its"                    "up_for_its_usefulness"            
[31] "for_its_usefulness_you"            "its_usefulness_you_ought"          "usefulness_you_ought_to"          
[34] "you_ought_to_become"               "ought_to_become_specialized"       "to_become_specialized_in"         
[37] "become_specialized_in_one"         "specialized_in_one_subject"        "in_one_subject_It"                
[40] "one_subject_It_must"               "subject_It_must_help"              "It_must_help_your"                
[43] "must_help_your_life"               "help_your_life_much"               "your_life_much_more"              
[46] "life_much_more_than"               "much_more_than_bload"              "more_than_bload_knowledge"        
[49] "than_bload_knowledge_will"         "bload_knowledge_will_do" 
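
The transcripts above pass ngrams = to tokens(). As far as I know, quanteda 2.0 and later no longer accept that argument and provide tokens_ngrams() instead. The following is a minimal sketch under that assumption. The sentence and the object names (toy.tokens, toy.2gram, toy.2to4gram) are made up for illustration and are not part of the NICEST data.

```r
# Sketch assuming quanteda >= 2.0, where n-grams are built with tokens_ngrams().
library(quanteda)

# Hypothetical one-sentence example, tokenized without numbers and punctuation.
toy.tokens <- tokens("Some people say that specialized knowledge is not important.",
                     remove_numbers = TRUE, remove_punct = TRUE)

# 2-grams only.
toy.2gram <- tokens_ngrams(toy.tokens, n = 2)

# 2-grams, 3-grams and 4-grams in one tokens object.
toy.2to4gram <- tokens_ngrams(toy.tokens, n = 2:4)

# Inspect the first few bigrams and the number of n-grams per document.
head(as.character(toy.2gram), 3)
ntoken(toy.2to4gram)
```

The words are joined with "_" by default. The concatenator argument of tokens_ngrams() changes that.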

Creating lists of n-grams that skip intervening words (skip-grams)

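  • The option skip = number specifies how many words to skip.
    • A range of values can also be given:
      • skip = 0:1 (no skip, i.e. ordinary n-grams, plus skip-grams that skip one word)
      • skip = 1:2 (skip-grams that skip one word and skip-grams that skip two words)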

nicest.4skip <- tokens(nicest.corpus, ngrams = 4, skip = 1, remove_numbers=T, remove_punct=T)
> head(nicest.4skip, 1)
tokens from 1 document.
JAN0001_P1B.txt :
  [1] "Some_say_specialized_is"           "people_that_knowledge_not"        
  [3] "say_specialized_is_important"      "that_knowledge_not_for"           
  [5] "specialized_is_important_human"    "knowledge_not_for_however"        
  [7] "is_important_human_who"            "not_for_however_make"   

nicest.4skip2 <- tokens(nicest.corpus, ngrams = 4, skip = 0:1, remove_numbers=T, remove_punct=T)
> head(nicest.4skip2, 1)
tokens from 1 document.
JAN0001_P1B.txt :
   [1] "Some_people_say_that"                "Some_people_say_specialized"        
   [3] "Some_people_that_specialized"        "Some_people_that_knowledge"         
   [5] "Some_say_that_specialized"           "Some_say_that_knowledge"            
   [7] "Some_say_specialized_knowledge"      "Some_say_specialized_is"  

nicest.4skip3 <- tokens(nicest.corpus, ngrams = 4, skip = 1:2, remove_numbers=T, remove_punct=T)
> head(nicest.4skip3, 1)
tokens from 1 document.
JAN0001_P1B.txt :
   [1] "Some_say_specialized_is"            "Some_say_specialized_not"          
   [3] "Some_say_knowledge_not"             "Some_say_knowledge_important"      
   [5] "Some_that_knowledge_not"            "Some_that_knowledge_important"     
   [7] "Some_that_is_important"             "Some_that_is_for"  
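  • This makes it possible to build lists of n-gram expressions that have "gaps" between the words.
  • One drawback, however, is that the result does not record where the "gaps" are.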
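
For newer quanteda versions, where tokens() no longer takes skip = (an assumption based on the 2.0 and later API), the same skip-grams can be sketched with tokens_ngrams() or tokens_skipgrams(). The sentence and the object names (toy.tokens, toy.skip1, toy.skip01) are hypothetical.

```r
# Sketch assuming quanteda >= 2.0.
library(quanteda)

# Hypothetical one-sentence example.
toy.tokens <- tokens("Some people say that specialized knowledge is not important")

# 4-grams that skip exactly one word between neighbouring members.
toy.skip1 <- tokens_ngrams(toy.tokens, n = 4, skip = 1)

# 4-grams with zero or one skipped word at each position
# (ordinary 4-grams plus one-word skip-grams).
toy.skip01 <- tokens_skipgrams(toy.tokens, n = 4, skip = 0:1)

as.character(toy.skip1)
head(as.character(toy.skip01))
```

As noted above, the joined strings do not show where a word was skipped: "Some_say_specialized_is" looks the same as an ordinary 4-gram.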

 Searching for multi-word expressions

  • Collect the specific multi-word expressions in a vector
multiword <- c("in addition", "on the other hand", "as a result")

  • Mark them so that each one is treated as a "phrase", i.e. as a sequence of tokens
phrase(multiword)

  • Example: KWIC search
> multiword <- c("in addition", "on the other hand", "as a result")

> kwic(nicestJPN.corpus, pattern = phrase(multiword))
                                                                                                        
   [JAN0001_P2B.txt, 12:15]      are active and free, | on the other hand | olders are less active and  
 [JAN0001_P5B.txt, 117:118]    is the biggest reason. |    In addition    | above reason, people will   
 [JAN0001_P7B.txt, 196:199] answer ten questions. But | on the other hand | , if you understanding ideas
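
The same search can be sketched on a tokens object, which I understand to be the preferred input for kwic() in quanteda 3.0 and later (an assumption about the newer API). The texts and the names toy.texts, toy.tokens, hits and hits.df are hypothetical.

```r
# Sketch assuming quanteda >= 3.0; the texts are made up for illustration.
library(quanteda)

toy.texts <- c(d1 = "They are active. On the other hand, older people are less active.",
               d2 = "In addition, people will travel more as a result.")
toy.tokens <- tokens(toy.texts)

multiword <- c("in addition", "on the other hand", "as a result")

# phrase() makes each string match a sequence of tokens rather than one token.
# Matching is case-insensitive by default, so "In addition" is found as well.
hits <- kwic(toy.tokens, pattern = phrase(multiword), window = 3)

# Convert to a plain data frame to work with the pre, keyword and post columns.
hits.df <- as.data.frame(hits)
hits.df[, c("docname", "pre", "keyword", "post")]
```

Without phrase(), each string would be compared against single tokens, so a pattern that contains spaces would not match anything.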