R R.package quanteda

!!!Handling multi-word expressions with quanteda

{{outline}}

----

!!Listing collocations (n-gram expressions)

!Frequency and association strength of collocations
*Use textstat_collocations() on the corpus data.
{{pre
> textstat_collocations(nicestJPN.corpus)
    collocation count count_nested length   lambda        z
1        do not    10            0      2 4.151357 8.325134
2  young people     8            0      2 4.283437 8.180578
3        if you    10            0      2 5.196013 7.994369
4  young person     6            0      2 4.946850 7.666668
5       can not     9            0      2 3.478602 7.612492
6       i think     7            0      2 3.939177 7.470671
7   enough time     5            0      2 6.849914 7.351494
8       want to     9            0      2 4.083226 6.925760
9    person can     5            0      2 4.331071 6.718328
10   enjoy life     3            0      2 5.297317 6.307093
11  have enough     4            0      2 4.753478 6.300479
12       of all     5            0      2 4.435229 6.243870
}}
*Option to set the collocation length (number of grams): size = number
*Option to set the minimum frequency: min_count = count
{{pre
> textstat_collocations(nicestJPN.corpus, size = 3, min_count = 3)
          collocation count count_nested length     lambda          z
1         need not to     3            0      3  3.5022761  1.5067168
2        first of all     5            0      3  2.1345325  0.8123263
3     not have enough     4            0      3  1.4805931  0.6381457
4        can not have     3            0      3  0.9067169  0.5200749
5       to earn money     3            0      3  0.8133261  0.2775042
6   young people have     3            0      3  0.4902593  0.2650525
7          if you are     3            0      3  0.3020849  0.1372883
8       in the social     3            0      3 -0.4990624 -0.1942642
9           to sum up     5            0      3 -0.7190799 -0.2251418
10 ideas and concepts     4            0      3 -0.7777511 -0.2601743
11   young person can     3            0      3 -1.5590397 -0.8279029
12   have enough time     3            0      3 -2.0558982 -0.9839019
13        do not know     3            0      3 -2.1209809 -1.5745493

> textstat_collocations(nicestJPN.corpus, size = 4, min_count = 3)
           collocation count count_nested length    lambda          z
1  can not have enough     3            0      4  1.922831  0.4567948
2 not have enough time     3            0      4 -3.640576 -0.8486199

> textstat_collocations(nicestJPN.corpus, size = 5, min_count = 2)
                          collocation count count_nested length     lambda          z
1           young person can not have     2            0      5  4.5336736  0.7267201
2            can not have enough time     2            0      5  3.3482331  0.4893972
3           the best way of traveling     2            0      5  2.4255540  0.3364245
4          this is the biggest reason     2            0      5  1.5023694  0.2082819
5            not give enough time to     2            0      5  1.2160942  0.1723139
6   time to helping their communities     2            0      5  0.8492974  0.1155970
7            do not give enough time     2            0      5 -1.4512792 -0.2075141
8          person can not have enough     2            0      5 -1.8019596 -0.2712349
9         give enough time to helping     2            0      5 -2.3165466 -0.3140204
10       enough time to helping their     2            0      5 -2.4219786 -0.3338030
}}

!Creating token lists of n-gram expressions
*This uses the same function as word tokenization, tokens(), with the ngrams option.
** tokens(corpus data)
** tokens(corpus data, ngrams = number of grams)
*In practice, also add the options that remove numbers and punctuation.
** remove_numbers=T, remove_punct=T
{{pre
nicest.tokens <- tokens(nicest.corpus, remove_numbers=T, remove_punct=T)
nicest.2gram <- tokens(nicest.corpus, ngrams = 2, remove_numbers=T, remove_punct=T)
nicest.3gram <- tokens(nicest.corpus, ngrams = 3, remove_numbers=T, remove_punct=T)
nicest.4gram <- tokens(nicest.corpus, ngrams = 4, remove_numbers=T, remove_punct=T)

> head(nicest.4gram, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_people_say_that"            "people_say_that_specialized"
[3] "say_that_specialized_knowledge"  "that_specialized_knowledge_is"
[5] "specialized_knowledge_is_not"    "knowledge_is_not_important"
[7] "is_not_important_for"            "not_important_for_human"
}}
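*Note for newer quanteda releases (version 2 and later): there, tokens() no longer accepts the ngrams and skip arguments; the same lists can be built by tokenizing first and then calling tokens_ngrams(). A minimal sketch, assuming nicest.corpus has been created as above:
{{pre
# Sketch for quanteda >= 2.0: tokenize first, then build n-grams from the tokens object.
nicest.tokens <- tokens(nicest.corpus, remove_numbers = TRUE, remove_punct = TRUE)
nicest.2gram  <- tokens_ngrams(nicest.tokens, n = 2)
nicest.3gram  <- tokens_ngrams(nicest.tokens, n = 3)
nicest.4gram  <- tokens_ngrams(nicest.tokens, n = 4)
head(nicest.4gram, 1)
# tokens_ngrams() also accepts a vector (e.g. n = 2:4) and a skip argument,
# so the range and skip-gram examples below can be written in the same way.
}}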
*A range of gram lengths can also be specified.
** ngrams = 2:4
** nicestJAN2to4gram <- tokens(nicestJAN.corpus, ngrams= 2:4, remove_punct=T)
{{pre
> str(nicestJAN2to4gram)
List of 10
 $ JAN0001_P1B.txt: chr [1:570] "Some_people" "people_say" "say_that" "that_specialized" ...
 $ JAN0001_P2B.txt: chr [1:705] "You_may" "may_think" "think_that" "that_young" ...
 $ JAN0001_P3B.txt: chr [1:438] "Compared_with" "with_past" "past_young" "young_people" ...
 $ JAN0001_P4B.txt: chr [1:246] "You_may" "may_have" "have_experiences" "experiences_like" ...
 $ JAN0001_P5B.txt: chr [1:681] "Elderly_person" "person_often" "often_says" "says_that" ...
 $ JAN0001_P6B.txt: chr [1:594] "Group_tourisms" "tourisms_are" "are_easy" "easy_to" ...
 $ JAN0001_P7B.txt: chr [1:690] "I_sutudents" "sutudents_must" "must_understand" "understand_ideas" ...
 $ JAN0001_P8B.txt: chr [1:267] "Most_of" "of_people" "people_think" "think_that" ...
 $ JAN0002_P1A.txt: chr [1:441] "I_agree" "agree_this" "this_opinion" "opinion_I" ...
 $ JAN0002_P2A.txt: chr [1:570] "Generary_speaking" "speaking_young" "young_people" "people_enjoy" ...
 - attr(*, "types")= chr [1:4736] "Some_people" "people_say" "say_that" "that_specialized" ...
 - attr(*, "padding")= logi FALSE
 - attr(*, "class")= chr "tokens"
 - attr(*, "what")= chr "word"
 - attr(*, "ngrams")= int [1:3] 2 3 4
 - attr(*, "skip")= int 0
 - attr(*, "concatenator")= chr "_"
 - attr(*, "docvars")='data.frame':   10 obs. of  0 variables

> summary(nicestJAN2to4gram)
                Length Class  Mode
JAN0001_P1B.txt 570    -none- character
JAN0001_P2B.txt 705    -none- character
JAN0001_P3B.txt 438    -none- character
JAN0001_P4B.txt 246    -none- character
JAN0001_P5B.txt 681    -none- character
JAN0001_P6B.txt 594    -none- character
JAN0001_P7B.txt 690    -none- character
JAN0001_P8B.txt 267    -none- character
JAN0002_P1A.txt 441    -none- character
JAN0002_P2A.txt 570    -none- character

> head(nicestJAN2to4gram[[1]],50)
 [1] "Some_people"           "people_say"            "say_that"              "that_specialized"
 [5] "specialized_knowledge" "knowledge_is"          "is_not"                "not_important"
 [9] "important_for"         "for_human"             "human_however"         "however_who"
[13] "who_make"              "make_todays"           "todays_life"           "life_such"
[17] "such_a"                "a_convenience"         "convenience_are"       "are_always"
[21] "always_a"              "a_few"                 "few_number"            "number_of"
[25] "of_genius"             "genius_with"           "with_very"             "very_specific"
[29] "specific_knowledges"   "knowledges_To"         "To_consider"           "consider_this"
[33] "this_it"               "it_can"                "can_be"                "be_said"
[37] "said_that"             "that_to"               "to_specialized"        "specialized_in"
[41] "in_one"                "one_specific"          "specific_subject"      "subject_is"
[45] "is_better"             "better_than"           "than_to"               "to_get"
[49] "get_bload"             "bload_knowledge"

> tail(nicestJAN2to4gram[[1]],50)
 [1] "not_know_about_such"               "know_about_such_a"                 "about_such_a_common"
 [4] "such_a_common_thing"               "a_common_thing_you"                "common_thing_you_only"
 [7] "thing_you_only_have"               "you_only_have_to"                  "only_have_to_use"
[10] "have_to_use_dictionaries"          "to_use_dictionaries_to"            "use_dictionaries_to_understand"
[13] "dictionaries_to_understand_You"    "to_understand_You_need"            "understand_You_need_not"
[16] "You_need_not_to"                   "need_not_to_become"                "not_to_become_walking"
[19] "to_become_walking_dictionary"      "become_walking_dictionary_because" "walking_dictionary_because_you"
[22] "dictionary_because_you_can"        "because_you_can_use"               "you_can_use_dictionaries"
[25] "can_use_dictionaries_To"           "use_dictionaries_To_sum"           "dictionaries_To_sum_up"
[28] "To_sum_up_for"                     "sum_up_for_its"                    "up_for_its_usefulness"
[31] "for_its_usefulness_you"            "its_usefulness_you_ought"          "usefulness_you_ought_to"
[34] "you_ought_to_become"               "ought_to_become_specialized"       "to_become_specialized_in"
[37] "become_specialized_in_one"         "specialized_in_one_subject"        "in_one_subject_It"
[40] "one_subject_It_must"               "subject_It_must_help"              "It_must_help_your"
[43] "must_help_your_life"               "help_your_life_much"               "your_life_much_more"
[46] "life_much_more_than"               "much_more_than_bload"              "more_than_bload_knowledge"
[49] "than_bload_knowledge_will"         "bload_knowledge_will_do"
}}
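*A common next step with these n-gram token lists is to count them. The sketch below (assuming nicest.2gram from above) builds a document-feature matrix from the bigram tokens and lists the most frequent bigrams with topfeatures().
{{pre
# Sketch: count n-grams by building a dfm from the n-gram tokens
# (assumes nicest.2gram was created as above).
nicest.2gram.dfm <- dfm(nicest.2gram)
topfeatures(nicest.2gram.dfm, 10)   # the ten most frequent bigrams in the corpus
}}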
!Creating skip-gram lists (n-grams that skip intervening words)
*The number of words to skip is given with the option skip = number.
**A range of skip values can also be specified.
*** skip = 0:1 (no skip, i.e. ordinary n-grams, plus skip-grams that skip one word)
*** skip = 1:2 (skip-grams that skip one word and two words)
{{pre
nicest.4skip <- tokens(nicest.corpus, ngrams = 4, skip = 1, remove_numbers=T, remove_punct=T)

> head(nicest.4skip, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_say_specialized_is"        "people_that_knowledge_not"
[3] "say_specialized_is_important"   "that_knowledge_not_for"
[5] "specialized_is_important_human" "knowledge_not_for_however"
[7] "is_important_human_who"         "not_for_however_make"

nicest.4skip2 <- tokens(nicest.corpus, ngrams = 4, skip = 0:1, remove_numbers=T, remove_punct=T)

> head(nicest.4skip2, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_people_say_that"           "Some_people_say_specialized"
[3] "Some_people_that_specialized"   "Some_people_that_knowledge"
[5] "Some_say_that_specialized"      "Some_say_that_knowledge"
[7] "Some_say_specialized_knowledge" "Some_say_specialized_is"

nicest.4skip3 <- tokens(nicest.corpus, ngrams = 4, skip = 1:2, remove_numbers=T, remove_punct=T)

> head(nicest.4skip3, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_say_specialized_is"        "Some_say_specialized_not"
[3] "Some_say_knowledge_not"         "Some_say_knowledge_important"
[5] "Some_that_knowledge_not"        "Some_that_knowledge_important"
[7] "Some_that_is_important"         "Some_that_is_for"
}}
*This makes it possible to build lists of n-gram expressions that contain "gaps".
*One drawback: the result does not record where the "gaps" are.

!!Searching for specific collocations
*Collect the collocations to search for in a character vector.
** multiword <- c("in addition", "on the other hand", "as a result")
*Wrap the vector in phrase() so that each element is treated as a phrase.
** phrase(multiword)
*Example: a kwic search
{{pre
> multiword <- c("in addition", "on the other hand", "as a result")
> kwic(nicestJPN.corpus, pattern = phrase(multiword))
 [JAN0001_P2B.txt, 12:15]      are active and free, | on the other hand | olders are less active and
 [JAN0001_P5B.txt, 117:118]   is the biggest reason. |       In addition | above reason, people will
 [JAN0001_P7B.txt, 196:199] answer ten questions. But | on the other hand | , if you understanding ideas
}}
----
* reference: https://quanteda.io/articles/pkgdown/examples/phrase.html
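*A related follow-up to the kwic search above: in newer quanteda versions kwic() is applied to a tokens object rather than directly to a corpus, and the same phrase() patterns can be passed to tokens_compound() to merge each multi-word expression into a single token. A minimal sketch, assuming nicestJPN.corpus and multiword as defined above:
{{pre
# Sketch for newer quanteda versions (assumes nicestJPN.corpus and multiword above).
nicestJPN.tokens <- tokens(nicestJPN.corpus)
kwic(nicestJPN.tokens, pattern = phrase(multiword))
# Merge each expression into one token such as "on_the_other_hand",
# so that it is counted as a single unit in later analyses.
nicestJPN.comp <- tokens_compound(nicestJPN.tokens, pattern = phrase(multiword))
}}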