R
!!!keyness
{{outline}}
----
*Given a corpus of several documents, keyness analysis finds out whether a particular document uses words in a markedly different way from the rest.
*Applied to two sets of documents, it lets you observe the differences between the two sets.
**target and reference group
*A signed association score computed from a 2x2 contingency table: quanteda::textstat_keyness()
*The target argument specifies the document(s) of interest.
**A number selects that document (element) of the document-feature matrix.
**Alternatively, attach attribute information to the document-feature matrix and specify a group by that attribute.
***For example, label the documents JAN or NTV, take JAN as the target, and compare it against the remaining NTV documents as the reference.
*The measure = option selects the statistic ("signed" because a plus/minus sign is attached).
**chi2: chi-squared test
**exact: Fisher's exact test
**lr: likelihood ratio (G2)
{{pre
nicestJAN.1st <- textstat_keyness(nicestJAN.dfm, 1)

nicestJAN.1st
        feature         chi2            p n_target n_reference
1   specialized 3.295107e+01 9.450749e-09        5           0
2     knowledge 1.741756e+01 3.000417e-05        5           3
3         bload 1.656914e+01 4.690807e-05        3           0
4    knowledges 8.652160e+00 3.266739e-03        2           0
5       subject 8.652160e+00 3.266739e-03        2           0
6       walking 8.652160e+00 3.266739e-03        2           0
7    dictionary 8.652160e+00 3.266739e-03        2           0
8  dictionaries 8.652160e+00 3.266739e-03        2           0
9        become 8.652160e+00 3.266739e-03        2           0
10          get 8.090752e+00 4.449170e-03        3           2
11          you 5.415006e+00 1.996438e-02       11          43
12         such 4.849584e+00 2.765279e-02        2           1
13     specific 4.849584e+00 2.765279e-02        2           1
14         does 4.849584e+00 2.765279e-02        2           1
15    something 4.849584e+00 2.765279e-02        2           1
16       better 3.007704e+00 8.286961e-02        2           2
17           to 2.199027e+00 1.380979e-01       11          57
18          one 1.950172e+00 1.625683e-01        2           3

textplot_keyness(nicestJAN.1st)
}}
{{ref_image keynes1st.png}}

!!A keyness analysis example: using sample data from the NICEST learner corpus
!Required packages
{{pre
install.packages("quanteda", dependencies=T)
library(quanteda)

install.packages("readtext", dependencies=T)
library(readtext)
}}
!Layout of the data
*The directory "NICEST sample files" contains two subdirectories:
**NTV sample 10
**JPN sample 10
*Each holds ten sample essays, body text only.
!Reading the data: readtext()
*Set the working directory to "NICEST sample files".
**From there, read the files in the two subdirectories.
*The native-speaker data first.
{{pre
setwd("C:/Users/sugiura/.../NICEST sample files")

# Read the text files; first the L1 native-speaker data
nicestNTV.tmp <- readtext("NTV sample 10/*.txt")

head(nicestNTV.tmp)
## readtext object consisting of 6 documents and 0 docvars.
## # Description: df[,2] [6 x 2]
##   doc_id           text               
## * <chr>            <chr>              
## 1 ENG0002_1P1A.txt "\"There is a\"..."
## 2 ENG0002_2P5A.txt "\"By the yea\"..."
## 3 ENG0002_3P2A.txt "\"The questi\"..."
## 4 ENG0002_4P7A.txt "\"The questi\"..."
## 5 ENG0002_5P6A.txt "\"In the las\"..."
## 6 ENG0002_6P3A.txt "\"The questi\"..."
}}
!Converting to a corpus object: corpus()
{{pre
nicestNTV.corpus <- corpus(nicestNTV.tmp)

str(nicestNTV.corpus)
## List of 4
##  $ documents:'data.frame':  10 obs. of  1 variable:
##   ..$ texts: chr [1:10] "There is a well known phrase, frequently employed in denigration rather than compliment, which describes a pers"| __truncated__ "By the year 2026, will there be fewer cars in use than there are in 2016?\nThis is a difficult question to answ"| __truncated__ "The question of whether young people enjoy life more than older people do is one fraught with potentially insur"| __truncated__ "The question of whether it is concepts or facts that are more important foci for students is a hot topic for ma"| __truncated__ ...
##  $ metadata :List of 2
##   ..$ source : chr "C:/Users/sugiura/Documents/* on x86-64 by sugiura"
##   ..$ created: chr "Sun Dec 01 12:04:54 2019"
##  $ settings :List of 12
##   ..$ stopwords          : NULL
##   ..$ collocations       : NULL
##   ..$ dictionary         : NULL
##   ..$ valuetype          : chr "glob"
##   ..$ stem               : logi FALSE
##   ..$ delimiter_word     : chr " "
##   ..$ delimiter_sentence : chr ".!?"
##   ..$ delimiter_paragraph: chr "\n\n"
##   ..$ clean_tolower      : logi TRUE
##   ..$ clean_remove_digits: logi TRUE
##   ..$ clean_remove_punct : logi TRUE
##   ..$ units              : chr "documents"
##   ..- attr(*, "class")= chr [1:2] "settings" "list"
##  $ tokens   : NULL
}}
*Check the summary of the corpus: summary()
{{pre
summary(nicestNTV.corpus)
## Corpus consisting of 10 documents:
##
##              Text Types Tokens Sentences
##  ENG0002_1P1A.txt   290    717        30
##  ENG0002_2P5A.txt   361    899        28
##  ENG0002_3P2A.txt   388    987        33
##  ENG0002_4P7A.txt   438   1047        38
##  ENG0002_5P6A.txt   415    969        35
##  ENG0002_6P3A.txt   257    697        23
##  ENG0002_7P4A.txt   320    851        55
##  ENG0002_8P8A.txt   260    612        30
##  ENG0003_1P8B.txt   185    456        24
##  ENG0003_2P4B.txt   223    531        22
##
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:
}}
!Adding language information as a document variable: docvars()
docvars(corpus_data, "attribute") <- value
Check the result with summary(corpus_data).
*The native-speaker data is labelled L1, the learner data L2.
{{pre
docvars(nicestNTV.corpus, "lang") <- "L1"

summary(nicestNTV.corpus)
## Corpus consisting of 10 documents:
##
##              Text Types Tokens Sentences lang
##  ENG0002_1P1A.txt   290    717        30   L1
##  ENG0002_2P5A.txt   361    899        28   L1
##  ENG0002_3P2A.txt   388    987        33   L1
##  ENG0002_4P7A.txt   438   1047        38   L1
##  ENG0002_5P6A.txt   415    969        35   L1
##  ENG0002_6P3A.txt   257    697        23   L1
##  ENG0002_7P4A.txt   320    851        55   L1
##  ENG0002_8P8A.txt   260    612        30   L1
##  ENG0003_1P8B.txt   185    456        24   L1
##  ENG0003_2P4B.txt   223    531        22   L1
##
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:
}}
!Read the L2 learner text files in the same way.
{{pre
setwd("C:/Users/sugiura/.../NICEST sample files")

nicestJPN.tmp <- readtext("JPN sample 10/*.txt")

head(nicestJPN.tmp)
## readtext object consisting of 6 documents and 0 docvars.
## # Description: df[,2] [6 x 2]
##   doc_id          text               
## * <chr>           <chr>              
## 1 JAN0001_P1B.txt "\"Some peopl\"..."
## 2 JAN0001_P2B.txt "\"You may th\"..."
## 3 JAN0001_P3B.txt "\"Compared w\"..."
## 4 JAN0001_P4B.txt "\"You may ha\"..."
## 5 JAN0001_P5B.txt "\"Elderly pe\"..."
## 6 JAN0001_P6B.txt "\"Group tour\"..."
}}
!Converting to a corpus object
{{pre
nicestJPN.corpus <- corpus(nicestJPN.tmp)

str(nicestJPN.corpus)
## List of 4
##  $ documents:'data.frame':  10 obs. of  1 variable:
##   ..$ texts: chr [1:10] "Some people say that specialized knowledge is not important for human, however, who make todays life such a con"| __truncated__ "You may think that young people are active and free, on the other hand olders are less active and they have muc"| __truncated__ "Compared with past, young people nowadays do not give enough time to helping their communities.\nI guess there "| __truncated__ "You may have experiences like this, feel nice at products in some advertisement but you buy and see it, you dis"| __truncated__ ...
##  $ metadata :List of 2
##   ..$ source : chr "C:/Users/sugiura/Documents/* on x86-64 by sugiura"
##   ..$ created: chr "Sun Dec 01 12:04:54 2019"
##  $ settings :List of 12
##   ..$ stopwords          : NULL
##   ..$ collocations       : NULL
##   ..$ dictionary         : NULL
##   ..$ valuetype          : chr "glob"
##   ..$ stem               : logi FALSE
##   ..$ delimiter_word     : chr " "
##   ..$ delimiter_sentence : chr ".!?"
##   ..$ delimiter_paragraph: chr "\n\n"
##   ..$ clean_tolower      : logi TRUE
##   ..$ clean_remove_digits: logi TRUE
##   ..$ clean_remove_punct : logi TRUE
##   ..$ units              : chr "documents"
##   ..- attr(*, "class")= chr [1:2] "settings" "list"
##  $ tokens   : NULL

summary(nicestJPN.corpus)
## Corpus consisting of 10 documents:
##
##             Text Types Tokens Sentences
##  JAN0001_P1B.txt   116    214        12
##  JAN0001_P2B.txt   138    268        17
##  JAN0001_P3B.txt    97    169        11
##  JAN0001_P4B.txt    68     99         8
##  JAN0001_P5B.txt   120    262        16
##  JAN0001_P6B.txt   114    224        13
##  JAN0001_P7B.txt   121    268        18
##  JAN0001_P8B.txt    71    108         8
##  JAN0002_P1A.txt    98    170        15
##  JAN0002_P2A.txt   117    216        19
##
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:
}}
!Adding language information as a document variable
{{pre
docvars(nicestJPN.corpus, "lang") <- "L2"

summary(nicestJPN.corpus)
## Corpus consisting of 10 documents:
##
##             Text Types Tokens Sentences lang
##  JAN0001_P1B.txt   116    214        12   L2
##  JAN0001_P2B.txt   138    268        17   L2
##  JAN0001_P3B.txt    97    169        11   L2
##  JAN0001_P4B.txt    68     99         8   L2
##  JAN0001_P5B.txt   120    262        16   L2
##  JAN0001_P6B.txt   114    224        13   L2
##  JAN0001_P7B.txt   121    268        18   L2
##  JAN0001_P8B.txt    71    108         8   L2
##  JAN0002_P1A.txt    98    170        15   L2
##  JAN0002_P2A.txt   117    216        19   L2
##
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:
}}
!Combining the two corpora
combined_corpus_AB <- corpus_A <<+>> corpus_B
{{pre
nicest20.corpus <- nicestJPN.corpus + nicestNTV.corpus

summary(nicest20.corpus)
## Corpus consisting of 20 documents:
##
##              Text Types Tokens Sentences lang
##   JAN0001_P1B.txt   116    214        12   L2
##   JAN0001_P2B.txt   138    268        17   L2
##   JAN0001_P3B.txt    97    169        11   L2
##   JAN0001_P4B.txt    68     99         8   L2
##   JAN0001_P5B.txt   120    262        16   L2
##   JAN0001_P6B.txt   114    224        13   L2
##   JAN0001_P7B.txt   121    268        18   L2
##   JAN0001_P8B.txt    71    108         8   L2
##   JAN0002_P1A.txt    98    170        15   L2
##   JAN0002_P2A.txt   117    216        19   L2
##  ENG0002_1P1A.txt   290    717        30   L1
##  ENG0002_2P5A.txt   361    899        28   L1
##  ENG0002_3P2A.txt   388    987        33   L1
##  ENG0002_4P7A.txt   438   1047        38   L1
##  ENG0002_5P6A.txt   415    969        35   L1
##  ENG0002_6P3A.txt   257    697        23   L1
##  ENG0002_7P4A.txt   320    851        55   L1
##  ENG0002_8P8A.txt   260    612        30   L1
##  ENG0003_1P8B.txt   185    456        24   L1
##  ENG0003_2P4B.txt   223    531        22   L1
##
## Source: Combination of corpuses nicestJPN.corpus and nicestNTV.corpus
## Created: Sun Dec 01 12:04:55 2019
## Notes:
}}
!Creating the document-feature matrix with dfm(), removing punctuation (lowercasing is automatic)
{{pre
nicest20.dfm <- dfm(nicest20.corpus, remove_punct=T)

str(nicest20.dfm)
## Formal class 'dfm' [package "quanteda"] with 15 slots
##   ..@ settings    : list()
##   ..@ weightTf    :List of 3
##   .. ..$ scheme: chr "count"
##   .. ..$ base  : NULL
##   .. ..$ K     : NULL
##   ..@ weightDf    :List of 5
##   .. ..$ scheme   : chr "unary"
##   .. ..$ base     : NULL
##   .. ..$ c        : NULL
##   .. ..$ smoothing: NULL
##   .. ..$ threshold: NULL
##   ..@ smooth      : num 0
##   ..@ ngrams      : int 1
##   ..@ skip        : int 0
##   ..@ concatenator: chr "_"
##   ..@ version     : int [1:3] 1 5 2
##   ..@ docvars     :'data.frame': 20 obs. of  1 variable:
##   .. ..$ lang: chr [1:20] "L2" "L2" "L2" "L2" ...
##   ..@ i           : int [1:3859] 0 2 3 4 5 6 7 9 12 13 ...
##   ..@ p           : int [1:1870] 0 13 28 35 53 54 60 79 97 101 ...
##   ..@ Dim         : int [1:2] 20 1869
##   ..@ Dimnames    :List of 2
##   .. ..$ docs    : chr [1:20] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" ...
##   .. ..$ features: chr [1:1869] "some" "people" "say" "that" ...
##   ..@ x           : num [1:3859] 2 1 1 2 1 3 1 1 1 1 ...
##   ..@ factors     : list()
}}
!Grouping by the language attribute
docvars(corpus_data, "attribute") == value_of_interest
*In keyness terms, the documents whose lang attribute is L2 are the "target"; the rest, i.e. L1, are the "reference".
{{pre
jpn.group <- (docvars(nicest20.corpus, "lang") == "L2")

jpn.group
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
}}
!Computing keyness: textstat_keyness()
textstat_keyness(document_feature_matrix, target)
The output columns are: feature (key word), chi-squared value, p-value, frequency in the target, frequency in the reference.
*Results are sorted by the signed (plus/minus) chi-squared value, highest first.
**head(result, 20) shows the "top" 20 words on the target side.
**tail(result, 20) shows the "top" 20 words on the reference side.
{{pre
nicest20.keys <- textstat_keyness(nicest20.dfm, jpn.group)

head(nicest20.keys, 20)
##        feature     chi2            p n_target n_reference
## 1            i 46.66002 8.443468e-12       27          16
## 2        think 46.65550 8.462897e-12       17           3
## 3          not 26.67282 2.409881e-07       39          53
## 4      reasons 25.22333 5.106078e-07        9           1
## 5       reason 25.22333 5.106078e-07        9           1
## 6        young 23.12722 1.516293e-06       18          15
## 7       person 22.25553 2.386705e-06       11           4
## 8          old 21.48057 3.574323e-06        9           2
## 9         know 21.34540 3.835402e-06       13           7
## 10     ginious 18.42598 1.766336e-05        6           0
## 11         you 17.37500 3.068359e-05       54         105
## 12        some 16.54467 4.751730e-05       12           8
## 13      people 16.47472 4.930307e-05       30          46
## 14        want 15.91245 6.634097e-05       13          10
## 15 specialized 14.64053 1.300867e-04        5           0
## 16         sum 14.64053 1.300867e-04        5           0
## 17          do 13.11234 2.933565e-04       21          30
## 18          so 12.37194 4.358338e-04       14          16
## 19      future 11.97323 5.397032e-04        7           3
## 20         too 11.97323 5.397032e-04        7           3

tail(nicest20.keys, 20)
##        feature       chi2            p n_target n_reference
## 1850      take  -2.000469 1.572505e-01        0          12
## 1851        be  -2.050220 1.521842e-01       10          62
## 1852      also  -2.127000 1.447238e-01        1          19
## 1853      fact  -2.252301 1.334159e-01        0          13
## 1854 education  -2.252301 1.334159e-01        0          13
## 1855        we  -2.347080 1.255179e-01        5          39
## 1856     would  -2.505589 1.134431e-01        0          14
## 1857    within  -2.505589 1.134431e-01        0          14
## 1858       how  -2.505589 1.134431e-01        0          14
## 1859        of  -2.921923 8.738367e-02       33         174
## 1860       may  -3.418485 6.447015e-02        4          39
## 1861  question  -3.528659 6.031656e-02        0          18
## 1862    better  -3.629432 5.676618e-02        4          40
## 1863        on  -3.708875 5.412387e-02        5          46
## 1864     based  -3.786172 5.167772e-02        0          19
## 1865   whether  -5.080234 2.420011e-02        0          24
## 1866        an  -6.806180 9.084289e-03        0          26
## 1867         a  -8.911901 2.833182e-03       14         121
## 1868       and -15.267883 9.328931e-05       29         233
## 1869       the -23.815514 1.060245e-06       32         291
}}
!Plotting
{{pre
textplot_keyness(nicest20.keys)
}}
{{ref_image keynessNICEST20.png}}
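The measure = options listed at the top of the page (chi2, exact, lr) can be tried on the same objects. Below is a minimal sketch, assuming nicest20.dfm and jpn.group from the steps above are still in the workspace; note that in recent quanteda releases textstat_keyness() and textplot_keyness() have moved to the companion packages quanteda.textstats and quanteda.textplots, so you may need to load those as well.

```r
library(quanteda)  # older versions; newer ones also need quanteda.textstats

# Likelihood-ratio (G2) keyness on the same dfm and target grouping
nicest20.lr <- textstat_keyness(nicest20.dfm, target = jpn.group, measure = "lr")
head(nicest20.lr, 10)     # top "target" (L2) words ranked by signed G2

# Fisher's exact test version of the same comparison
nicest20.exact <- textstat_keyness(nicest20.dfm, target = jpn.group, measure = "exact")
head(nicest20.exact, 10)
```

Because all three measures are computed from the same 2x2 table of target/reference frequencies, the highest-ranked words are usually similar across measures; lr and exact are often preferred when many expected cell counts are small, as with these short essays.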