!!!Processing multi-word expressions with quanteda
{{outline}}
----
!Searching for multi-word expressions
*Collect the specific multi-word expressions in a character vector
 multiword <- c("in addition", "on the other hand", "as a result")
*Use phrase() to specify that each expression is handled as a single "phrase" unit
 phrase(multiword)
*Example: a kwic search
{{pre
> multiword <- c("in addition", "on the other hand", "as a result")
> kwic(nicestJPN.corpus, pattern = phrase(multiword))
 [JAN0001_P2B.txt, 12:15]       are active and free, | on the other hand | olders are less active and
 [JAN0001_P5B.txt, 117:118]   is the biggest reason. | In addition       | above reason, people will
 [JAN0001_P7B.txt, 196:199] answer ten questions. But | on the other hand | , if you understanding ideas
}}
!Listing collocations by frequency and strength: textstat_collocations()
{{pre
> textstat_collocations(nicestJPN.corpus)
    collocation count count_nested length   lambda        z
1        do not    10            0      2 4.151357 8.325134
2  young people     8            0      2 4.283437 8.180578
3        if you    10            0      2 5.196013 7.994369
4  young person     6            0      2 4.946850 7.666668
5       can not     9            0      2 3.478602 7.612492
6       i think     7            0      2 3.939177 7.470671
7   enough time     5            0      2 6.849914 7.351494
8       want to     9            0      2 4.083226 6.925760
9    person can     5            0      2 4.331071 6.718328
10   enjoy life     3            0      2 5.297317 6.307093
11  have enough     4            0      2 4.753478 6.300479
12       of all     5            0      2 4.435229 6.243870
}}
*In the output, lambda is the estimated strength of association for each collocation and z is the corresponding z-statistic; the results are sorted by z, with larger values indicating stronger collocations.
*Option to specify the collocation length (number of grams): size = number
*Option to specify the minimum frequency: min_count = count
*Note: with quanteda v3 and later the call differs slightly; see the sketch after the output below.
{{pre
> textstat_collocations(nicestJPN.corpus, size = 3, min_count = 3)
          collocation count count_nested length     lambda          z
1         need not to     3            0      3  3.5022761  1.5067168
2        first of all     5            0      3  2.1345325  0.8123263
3     not have enough     4            0      3  1.4805931  0.6381457
4        can not have     3            0      3  0.9067169  0.5200749
5       to earn money     3            0      3  0.8133261  0.2775042
6   young people have     3            0      3  0.4902593  0.2650525
7          if you are     3            0      3  0.3020849  0.1372883
8       in the social     3            0      3 -0.4990624 -0.1942642
9           to sum up     5            0      3 -0.7190799 -0.2251418
10 ideas and concepts     4            0      3 -0.7777511 -0.2601743
11   young person can     3            0      3 -1.5590397 -0.8279029
12   have enough time     3            0      3 -2.0558982 -0.9839019
13        do not know     3            0      3 -2.1209809 -1.5745493

> textstat_collocations(nicestJPN.corpus, size = 4, min_count = 3)
           collocation count count_nested length    lambda          z
1  can not have enough     3            0      4  1.922831  0.4567948
2 not have enough time     3            0      4 -3.640576 -0.8486199

> textstat_collocations(nicestJPN.corpus, size = 5, min_count = 2)
                         collocation count count_nested length     lambda          z
1          young person can not have     2            0      5  4.5336736  0.7267201
2           can not have enough time     2            0      5  3.3482331  0.4893972
3          the best way of traveling     2            0      5  2.4255540  0.3364245
4         this is the biggest reason     2            0      5  1.5023694  0.2082819
5            not give enough time to     2            0      5  1.2160942  0.1723139
6  time to helping their communities     2            0      5  0.8492974  0.1155970
7            do not give enough time     2            0      5 -1.4512792 -0.2075141
8         person can not have enough     2            0      5 -1.8019596 -0.2712349
9        give enough time to helping     2            0      5 -2.3165466 -0.3140204
10      enough time to helping their     2            0      5 -2.4219786 -0.3338030
}}
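The transcripts above were produced with an older quanteda release, in which kwic() and textstat_collocations() could be applied directly to a corpus object. A minimal sketch of the equivalent calls for newer installations, assuming quanteda v3 or later with the companion quanteda.textstats package installed, and the same nicestJPN.corpus object:
{{pre
# Sketch for quanteda v3+ (assumptions: quanteda >= 3 and the
# quanteda.textstats package are installed; scores may differ slightly
# from the older-version transcripts above).
library(quanteda)
library(quanteda.textstats)

# v3 works on tokens objects rather than directly on a corpus
toks <- tokens(nicestJPN.corpus, remove_numbers = TRUE, remove_punct = TRUE)

# kwic() now expects tokens as input
multiword <- c("in addition", "on the other hand", "as a result")
kwic(toks, pattern = phrase(multiword))

# textstat_collocations() has moved to the quanteda.textstats package
textstat_collocations(toks, size = 3, min_count = 3)
}}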
!Creating a list of multi-word units as tokens (n-grams)
*This can be done with the same function used for word tokenization (tokens()), by specifying an option:
 tokens(corpus_data)
 tokens(corpus_data, ngrams = number_of_grams)
*In practice, also use the options that remove numbers and punctuation:
 remove_numbers=T, remove_punct=T
{{pre
nicest.tokens <- tokens(nicest.corpus, remove_numbers=T, remove_punct=T)
nicest.2gram <- tokens(nicest.corpus, ngrams = 2, remove_numbers=T, remove_punct=T)
nicest.3gram <- tokens(nicest.corpus, ngrams = 3, remove_numbers=T, remove_punct=T)
nicest.4gram <- tokens(nicest.corpus, ngrams = 4, remove_numbers=T, remove_punct=T)

> head(nicest.4gram, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_people_say_that"           "people_say_that_specialized"   
[3] "say_that_specialized_knowledge" "that_specialized_knowledge_is" 
[5] "specialized_knowledge_is_not"   "knowledge_is_not_important"    
[7] "is_not_important_for"           "not_important_for_human"       
}}
!Creating a collocation list that skips intervening words (skip-grams)
*Specify the number of words to skip with the option skip = number.
**A range of values can also be given:
*** skip = 0:1 (no skip, i.e. ordinary n-grams, plus skip-grams that skip one word)
*** skip = 1:2 (skip-grams that skip one word and that skip two words)
{{pre
nicest.4skip <- tokens(nicest.corpus, ngrams = 4, skip = 1, remove_numbers=T, remove_punct=T)

> head(nicest.4skip, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_say_specialized_is"        "people_that_knowledge_not"     
[3] "say_specialized_is_important"   "that_knowledge_not_for"        
[5] "specialized_is_important_human" "knowledge_not_for_however"     
[7] "is_important_human_who"         "not_for_however_make"          

nicest.4skip2 <- tokens(nicest.corpus, ngrams = 4, skip = 0:1, remove_numbers=T, remove_punct=T)

> head(nicest.4skip2, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_people_say_that"            "Some_people_say_specialized"   
[3] "Some_people_that_specialized"    "Some_people_that_knowledge"    
[5] "Some_say_that_specialized"       "Some_say_that_knowledge"       
[7] "Some_say_specialized_knowledge"  "Some_say_specialized_is"       

nicest.4skip3 <- tokens(nicest.corpus, ngrams = 4, skip = 1:2, remove_numbers=T, remove_punct=T)

> head(nicest.4skip3, 1)
tokens from 1 document.
JAN0001_P1B.txt :
[1] "Some_say_specialized_is"       "Some_say_specialized_not"     
[3] "Some_say_knowledge_not"        "Some_say_knowledge_important" 
[5] "Some_that_knowledge_not"       "Some_that_knowledge_important"
[7] "Some_that_is_important"        "Some_that_is_for"             
}}
*This makes it possible to create lists of n-gram expressions containing a "gap" in the middle.
*A drawback, however, is that the result retains no information about where the "gap" is.
*For the equivalent calls in newer quanteda releases, see the note at the end of this page.
----
*reference: https://quanteda.io/articles/pkgdown/examples/phrase.html
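----
!Note: n-grams and skip-grams in newer quanteda releases
The transcripts above come from an older quanteda release, in which tokens() itself accepted the ngrams and skip arguments. In current releases these arguments were removed, and n-grams and skip-grams are instead built from an existing tokens object with tokens_ngrams() and tokens_skipgrams(). A minimal sketch, assuming a recent quanteda and the same nicest.corpus object:
{{pre
# Sketch for newer quanteda releases (an assumption about the installed
# version; the tokens(ngrams = ...) calls above no longer run there).
library(quanteda)

# tokenize first, removing numbers and punctuation as above
toks <- tokens(nicest.corpus, remove_numbers = TRUE, remove_punct = TRUE)

# 4-grams, equivalent to tokens(..., ngrams = 4) above
nicest.4gram <- tokens_ngrams(toks, n = 4)

# ordinary 4-grams plus one-word skip-grams, as with skip = 0:1 above
nicest.4skip2 <- tokens_ngrams(toks, n = 4, skip = 0:1)

# tokens_skipgrams() performs the same operation under an explicit name
nicest.4skip3 <- tokens_skipgrams(toks, n = 4, skip = 1:2)
}}
As before, the position of the skipped "gap" is not recorded in the resulting tokens.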