R
!!!Example of Discriminant Analysis

{{outline}}
----
!Computing linguistic features from text files
*Processing the learner data
**Set the working directory to the directory that contains the data
**Confirm with list.files() that the files are there
{{pre
> list.files()
 [1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt"
 [6] "JAN0001_P6B.txt" "JAN0001_P7B.txt" "JAN0001_P8B.txt" "JAN0002_P1A.txt" "JAN0002_P2A.txt"
}}
*Run the program that computes the basic linguistic features (myTextIndex.R, loaded beforehand with source())
 myTextIndex()
*The results file is created one directory above the working directory
*Read in the saved results (choose.files() is available only on Windows; file.choose() works on other platforms)
{{pre
> myTextIndex()
Read 12 items
Read 17 items
Read 11 items
Read 8 items
Read 16 items
Read 13 items
Read 18 items
Read 8 items
Read 15 items
Read 19 items
> JPindex <- read.table(choose.files())
}}
*Assign column names (Token and Type counts, NoS = number of sentences, TTR = type-token ratio, GI = Guiraud index, MATTR = moving-average TTR, AWL = average word length, ASL = average sentence length)
{{pre
> JPindex
                V1  V2  V3 V4        V5       V6        V7       V8        V9
1  JAN0001_P1B.txt 192 108 12 0.5625000 7.794229 0.6929687 4.562500 16.000000
2  JAN0001_P2B.txt 237 127 17 0.5358650 8.249536 0.6880591 4.329114 13.941180
3  JAN0001_P3B.txt 148  92 11 0.6216216 7.562353 0.7097297 4.391892 13.454550
4  JAN0001_P4B.txt  84  62  8 0.7380952 6.764755 0.6200000 4.547619 10.500000
5  JAN0001_P5B.txt 229 113 16 0.4934498 7.467250 0.6471179 4.161572 14.312500
6  JAN0001_P6B.txt 200 105 13 0.5250000 7.424621 0.6608500 4.170000 15.384620
7  JAN0001_P7B.txt 232 110 18 0.4741379 7.221854 0.6408190 4.353448 12.888890
8  JAN0001_P8B.txt  91  67  8 0.7362637 7.023508 0.6700000 4.318681 11.375000
9  JAN0002_P1A.txt 149  92 15 0.6174497 7.536934 0.6869799 4.630872  9.933333
10 JAN0002_P2A.txt 192 109 19 0.5677083 7.866397 0.6815625 4.578125 10.105260
> names(JPindex) <- c("filename", "Token", "Type", "NoS", "TTR", "GI", "MATTR", "AWL", "ASL")
> JPindex
          filename Token Type NoS       TTR       GI     MATTR      AWL       ASL
1  JAN0001_P1B.txt   192  108  12 0.5625000 7.794229 0.6929687 4.562500 16.000000
2  JAN0001_P2B.txt   237  127  17 0.5358650 8.249536 0.6880591 4.329114 13.941180
3  JAN0001_P3B.txt   148   92  11 0.6216216 7.562353 0.7097297 4.391892 13.454550
4  JAN0001_P4B.txt    84   62   8 0.7380952 6.764755 0.6200000 4.547619 10.500000
5  JAN0001_P5B.txt   229  113  16 0.4934498 7.467250 0.6471179 4.161572 14.312500
6  JAN0001_P6B.txt   200  105  13 0.5250000 7.424621 0.6608500 4.170000 15.384620
7  JAN0001_P7B.txt   232  110  18 0.4741379 7.221854 0.6408190 4.353448 12.888890
8  JAN0001_P8B.txt    91   67   8 0.7362637 7.023508 0.6700000 4.318681 11.375000
9  JAN0002_P1A.txt   149   92  15 0.6174497 7.536934 0.6869799 4.630872  9.933333
10 JAN0002_P2A.txt   192  109  19 0.5677083 7.866397 0.6815625 4.578125 10.105260
}}
*Process the native-speaker data in the same way
{{pre
> list.files()
> myTextIndex()
> NSindex <- read.table(choose.files())
> names(NSindex) <- c("filename", "Token", "Type", "NoS", "TTR", "GI", "MATTR", "AWL", "ASL")
> NSindex
           filename Token Type NoS       TTR        GI     MATTR      AWL      ASL
1  ENG0002_1P1A.txt   608  262  30 0.4309211 10.625500 0.6851645 4.817434 20.26667
2  ENG0002_2P5A.txt   796  337  28 0.4233668 11.944650 0.6943090 5.026382 28.42857
3  ENG0002_3P2A.txt   857  359  34 0.4189032 12.263210 0.7148191 4.868145 25.20588
4  ENG0002_4P7A.txt   924  406  38 0.4393939 13.356420 0.6975649 4.961039 24.31579
5  ENG0002_5P6A.txt   847  392  36 0.4628099 13.469280 0.7370248 4.472255 23.52778
6  ENG0002_6P3A.txt   610  239  23 0.3918033  9.676827 0.6592951 4.645902 26.52174
7  ENG0002_7P4A.txt   727  276  55 0.3796424 10.236270 0.6694360 4.093535 13.21818
8  ENG0002_8P8A.txt   538  241  30 0.4479554 10.390250 0.6657435 4.486989 17.93333
9  ENG0003_1P8B.txt   412  169  24 0.4101942  8.326032 0.6609223 4.296117 17.16667
10 ENG0003_2P4B.txt   482  207  22 0.4294606  9.428592 0.6858921 4.315353 21.90909
}}
!Replace the file names with a category label (working on a backup copy)
{{pre
> JPindex2 <- JPindex
> JPindex2$filename <- "JP"
> JPindex2
   filename Token Type NoS       TTR       GI     MATTR      AWL       ASL
1        JP   192  108  12 0.5625000 7.794229 0.6929687 4.562500 16.000000
2        JP   237  127  17 0.5358650 8.249536 0.6880591 4.329114 13.941180
3        JP   148   92  11 0.6216216 7.562353 0.7097297 4.391892 13.454550
4        JP    84   62   8 0.7380952 6.764755 0.6200000 4.547619 10.500000
5        JP   229  113  16 0.4934498 7.467250 0.6471179 4.161572 14.312500
6        JP   200  105  13 0.5250000 7.424621 0.6608500 4.170000 15.384620
7        JP   232  110  18 0.4741379 7.221854 0.6408190 4.353448 12.888890
8        JP    91   67   8 0.7362637 7.023508 0.6700000 4.318681 11.375000
9        JP   149   92  15 0.6174497 7.536934 0.6869799 4.630872  9.933333
10       JP   192  109  19 0.5677083 7.866397 0.6815625 4.578125 10.105260
> NSindex2 <- NSindex
> NSindex2$filename <- "NS"
> NSindex2
   filename Token Type NoS       TTR        GI     MATTR      AWL      ASL
1        NS   608  262  30 0.4309211 10.625500 0.6851645 4.817434 20.26667
2        NS   796  337  28 0.4233668 11.944650 0.6943090 5.026382 28.42857
3        NS   857  359  34 0.4189032 12.263210 0.7148191 4.868145 25.20588
4        NS   924  406  38 0.4393939 13.356420 0.6975649 4.961039 24.31579
5        NS   847  392  36 0.4628099 13.469280 0.7370248 4.472255 23.52778
6        NS   610  239  23 0.3918033  9.676827 0.6592951 4.645902 26.52174
7        NS   727  276  55 0.3796424 10.236270 0.6694360 4.093535 13.21818
8        NS   538  241  30 0.4479554 10.390250 0.6657435 4.486989 17.93333
9        NS   412  169  24 0.4101942  8.326032 0.6609223 4.296117 17.16667
10       NS   482  207  22 0.4294606  9.428592 0.6858921 4.315353 21.90909
}}
!Combine the two data sets
{{pre
> JPNSindex <- rbind(JPindex2, NSindex2)
> JPNSindex
   filename Token Type NoS       TTR        GI     MATTR      AWL       ASL
1        JP   192  108  12 0.5625000  7.794229 0.6929687 4.562500 16.000000
2        JP   237  127  17 0.5358650  8.249536 0.6880591 4.329114 13.941180
3        JP   148   92  11 0.6216216  7.562353 0.7097297 4.391892 13.454550
4        JP    84   62   8 0.7380952  6.764755 0.6200000 4.547619 10.500000
5        JP   229  113  16 0.4934498  7.467250 0.6471179 4.161572 14.312500
6        JP   200  105  13 0.5250000  7.424621 0.6608500 4.170000 15.384620
7        JP   232  110  18 0.4741379  7.221854 0.6408190 4.353448 12.888890
8        JP    91   67   8 0.7362637  7.023508 0.6700000 4.318681 11.375000
9        JP   149   92  15 0.6174497  7.536934 0.6869799 4.630872  9.933333
10       JP   192  109  19 0.5677083  7.866397 0.6815625 4.578125 10.105260
11       NS   608  262  30 0.4309211 10.625500 0.6851645 4.817434 20.266670
12       NS   796  337  28 0.4233668 11.944650 0.6943090 5.026382 28.428570
13       NS   857  359  34 0.4189032 12.263210 0.7148191 4.868145 25.205880
14       NS   924  406  38 0.4393939 13.356420 0.6975649 4.961039 24.315790
15       NS   847  392  36 0.4628099 13.469280 0.7370248 4.472255 23.527780
16       NS   610  239  23 0.3918033  9.676827 0.6592951 4.645902 26.521740
17       NS   727  276  55 0.3796424 10.236270 0.6694360 4.093535 13.218180
18       NS   538  241  30 0.4479554 10.390250 0.6657435 4.486989 17.933330
19       NS   412  169  24 0.4101942  8.326032 0.6609223 4.296117 17.166670
20       NS   482  207  22 0.4294606  9.428592 0.6858921 4.315353 21.909090
}}
!Computing VIF
*Here the VIF is computed pairwise from the correlation matrix as 1/(1 - r^2), so the diagonal (r = 1) is Inf
{{pre
> JPNS_LDA.model.cor <- cor(JPNSindex[2:length(JPNSindex)])
> JPNS_LDA.model.cor
           Token       Type        NoS        TTR         GI      MATTR        AWL        ASL
Token  1.0000000  0.9913503  0.8650134 -0.7713369  0.9669768  0.4936503  0.5040331  0.8377107
Type   0.9913503  1.0000000  0.8307555 -0.7155131  0.9910430  0.5588513  0.5403353  0.8340614
NoS    0.8650134  0.8307555  1.0000000 -0.7555247  0.7890872  0.3418692  0.1846240  0.4845738
TTR   -0.7713369 -0.7155131 -0.7555247  1.0000000 -0.6463215 -0.2186894 -0.1535317 -0.6873747
GI     0.9669768  0.9910430  0.7890872 -0.6463215  1.0000000  0.6281433  0.5746082  0.8165488
MATTR  0.4936503  0.5588513  0.3418692 -0.2186894  0.6281433  1.0000000  0.3823200  0.4490040
AWL    0.5040331  0.5403353  0.1846240 -0.1535317  0.5746082  0.3823200  1.0000000  0.5874034
ASL    0.8377107  0.8340614  0.4845738 -0.6873747  0.8165488  0.4490040  0.5874034  1.0000000
> JPNS_LDA.model.vif <- 1/(1-JPNS_LDA.model.cor^2)
> JPNS_LDA.model.vif
          Token      Type      NoS      TTR        GI    MATTR      AWL      ASL
Token       Inf 58.056236 3.972167 2.468896 15.395085 1.322210 1.340571 3.352996
Type  58.056236       Inf 3.227416 2.049008 56.073371 1.454154 1.412354 3.285783
NoS    3.972167  3.227416      Inf 2.330012  2.650120 1.132342 1.035289 1.306868
TTR    2.468896  2.049008 2.330012      Inf  1.717421 1.050227 1.024141 1.895677
GI    15.395085 56.073371 2.650120 1.717421       Inf 1.651702 1.492926 3.000767
MATTR  1.322210  1.454154 1.132342 1.050227  1.651702      Inf 1.171191 1.252512
AWL    1.340571  1.412354 1.035289 1.024141  1.492926 1.171191      Inf 1.526817
ASL    3.352996  3.285783 1.306868 1.895677  3.000767 1.252512 1.526817      Inf
}}
*The values among Token, Type and GI are 10 or higher (Token-Type 58.1, Type-GI 56.1, Token-GI 15.4)
*Keep Token and drop Type and GI
{{pre
> JPNSindex2 <- JPNSindex[, c(1,2,4,5,7,8,9)]
> JPNS_LDA.cor2 <- cor(JPNSindex2[2:length(JPNSindex2)])
> JPNS_LDA.cor2
           Token        NoS        TTR      MATTR        AWL        ASL
Token  1.0000000  0.8650134 -0.7713369  0.4936503  0.5040331  0.8377107
NoS    0.8650134  1.0000000 -0.7555247  0.3418692  0.1846240  0.4845738
TTR   -0.7713369 -0.7555247  1.0000000 -0.2186894 -0.1535317 -0.6873747
MATTR  0.4936503  0.3418692 -0.2186894  1.0000000  0.3823200  0.4490040
AWL    0.5040331  0.1846240 -0.1535317  0.3823200  1.0000000  0.5874034
ASL    0.8377107  0.4845738 -0.6873747  0.4490040  0.5874034  1.0000000
> JPNS_LDA.vif2 <- 1/(1-JPNS_LDA.cor2^2)
> JPNS_LDA.vif2
         Token      NoS      TTR    MATTR      AWL      ASL
Token      Inf 3.972167 2.468896 1.322210 1.340571 3.352996
NoS   3.972167      Inf 2.330012 1.132342 1.035289 1.306868
TTR   2.468896 2.330012      Inf 1.050227 1.024141 1.895677
MATTR 1.322210 1.132342 1.050227      Inf 1.171191 1.252512
AWL   1.340571 1.035289 1.024141 1.171191      Inf 1.526817
ASL   3.352996 1.306868 1.895677 1.252512 1.526817      Inf
}}
!!Discriminant analysis
!lda()
*lda() is in the MASS package, so load it first with library(MASS)
*General form (add CV = TRUE to get Leave-One-Out Cross Validation predictions; the run below is fitted without it)
 model <- lda(category ~ ., data = data)
{{pre
> library(MASS)  # provides lda()
> JPNS_LDA.model <- lda(filename ~ ., data=JPNSindex)
> JPNS_LDA.model
Call:
lda(filename ~ ., data = JPNSindex)

Prior probabilities of groups:
 JP  NS
0.5 0.5

Group means:
   Token  Type  NoS       TTR        GI     MATTR      AWL      ASL
JP 175.4  98.5 13.7 0.5872091  7.491144 0.6698087 4.404382 12.78953
NS 680.1 288.8 32.0 0.4234451 10.971703 0.6870171 4.598315 21.84937

Coefficients of linear discriminants:
              LD1
Token   0.1264180
Type   -0.4157560
NoS    -0.5039002
TTR   -13.5498322
GI     10.3108076
MATTR -25.6348057
AWL    -3.8158570
ASL    -0.6479296
}}
!Prof. Aoki's disc()
*disc() is not part of base R; load Prof. Aoki's script with source() beforehand, as is done for sdis() below
{{pre
> JPNS_LDA.model.aoki <- disc(JPNSindex[2:length(JPNSindex)], JPNSindex[1])
> JPNS_LDA.model.aoki

Discriminant function
             JP:NS Partial F p-value
Token     -0.82405   6.83105 0.02410
Type       2.71010   8.72961 0.01310
NoS        3.28466   2.21374 0.16488
TTR       88.32434   2.05427 0.17959
GI       -67.21082   8.15125 0.01565
MATTR    167.10003   1.77124 0.21016
AWL       24.87360   3.09877 0.10609
ASL        4.22352   1.43948 0.25543
constant  29.96605

Classification functions
                  JP          NS
Token        1.76864     0.12053
Type         2.65903     8.07922
NoS        -38.88947   -32.32014
TTR       -880.73144  -704.08275
GI         -62.69086  -197.11249
MATTR    -2288.39994 -1954.19988
AWL       -298.49685  -248.74966
ASL        -58.85016   -50.40312
constant  2273.80143  2333.73353

Classification results
      prediction
group  JP NS
   JP  10  0
   NS   0 10

Correct classification rate = 100.0 %
}}
!!Discriminant analysis taking VIF into account
!lda()
{{pre
> JPNS_LDA.vif.model <- lda(filename ~ ., data=JPNSindex2)
> JPNS_LDA.vif.model
Call:
lda(filename ~ ., data = JPNSindex2)

Prior probabilities of groups:
 JP  NS
0.5 0.5

Group means:
   Token  NoS       TTR     MATTR      AWL      ASL
JP 175.4 13.7 0.5872091 0.6698087 4.404382 12.78953
NS 680.1 32.0 0.4234451 0.6870171 4.598315 21.84937

Coefficients of linear discriminants:
                LD1
Token  -0.003894369
NoS     0.198857467
TTR     2.385307428
MATTR -12.141584109
AWL    -0.166544319
ASL     0.403659096
}}
!disc()
{{pre
> JPNS_LDA.vif.aoki <- disc(JPNSindex2[2:length(JPNSindex2)], JPNSindex2[1])
> JPNS_LDA.vif.aoki

Discriminant function
             JP:NS Partial F p-value
Token      0.01830   0.13474 0.71948
NoS       -0.93440   1.19669 0.29384
TTR      -11.20814   0.10929 0.74622
MATTR     57.05118   0.86904 0.36821
AWL        0.78256   0.00953 0.92372
ASL       -1.89672   1.70798 0.21389
constant   9.81073

Classification functions
                  JP          NS
Token        2.63641     2.67301
NoS        -43.33272   -45.20151
TTR       -976.01015  -998.42644
MATTR    -2356.08836 -2241.98599
AWL       -322.25610  -320.69098
ASL        -65.33740   -69.13085
constant  2268.72790  2288.34935

Classification results
      prediction
group  JP NS
   JP  10  0
   NS   0 10

Correct classification rate = 100.0 %
}}
*Plot the result
{{pre
> plot(JPNS_LDA.vif.aoki, which="scatter", xpos="topright")
}}
{{ref_image JPNS_LDA.vif.aoki.png}}
!!Stepwise discriminant analysis
!Prof. Aoki's sdis()
 source("http://aoki2.si.gunma-u.ac.jp/R/src/sdis.R", encoding="euc-jp")
{{pre
> sdis(JPNSindex[2:length(JPNSindex)], JPNSindex[1], predict=T)
Number of valid cases: 20
Grouping variable:     filename

***** Means *****
               JP          NS       Total
Token 175.4000000 680.1000000 427.7500000
Type   98.5000000 288.8000000 193.6500000
NoS    13.7000000  32.0000000  22.8500000
TTR     0.5872091   0.4234451   0.5053271
GI      7.4911437  10.9717031   9.2314234
MATTR   0.6698087   0.6870171   0.6784129
AWL     4.4043823   4.5983151   4.5013487
ASL    12.7895333  21.8493700  17.3194517

***** Pooled within-group correlation matrix *****
           Token       Type        NoS         TTR         GI     MATTR        AWL        ASL
Token  1.0000000  0.9739040  0.5720733 -0.23363842 0.90883399 0.4842178  0.4106696  0.4696170
Type   0.9739040  1.0000000  0.4773368 -0.11393202 0.97695625 0.5811482  0.4625755  0.4876764
NoS    0.5720733  0.4773368  1.0000000 -0.34924830 0.39194068 0.1451774 -0.1987344 -0.3842340
TTR   -0.2336384 -0.1139320 -0.3492483  1.00000000 0.01325558 0.0654989  0.2489666 -0.1674558
GI     0.9088340  0.9769563  0.3919407  0.01325558 1.00000000 0.6735375  0.5067630  0.4714371
MATTR  0.4842178  0.5811482  0.1451774  0.06549890 0.67353754 1.0000000  0.2968052  0.3308693
AWL    0.4106696  0.4625755 -0.1987344  0.24896656 0.50676299 0.2968052  1.0000000  0.5132310
ASL    0.4696170  0.4876764 -0.3842340 -0.16745583 0.47143709 0.3308693  0.5132310  1.0000000

Variable entry criterion   Pin:  0.05
Variable removal criterion Pout: 0.05
Candidate for entry: Token   P : <0.001  ***** entered

***** Step 1 *****   Variable entered: Token

***** Classification functions *****
                JP        NS Partial F P value
Token    -0.020795 -0.080632    75.499  <0.001
Constant  1.823737 27.418812
Wilks' Lambda:      0.19252
Equivalent F:       75.499
Degrees of freedom: (1, 18.00)
P value:            <0.001
Candidate for removal: Token   P : <0.001  ***** not removed
Candidate for entry: Type   P : 0.035  ***** entered

***** Step 2 *****   Variable entered: Type

***** Classification functions *****
                JP       NS Partial F P value
Token     0.084392 -0.13424   12.5280 0.00252
Type     -0.238763  0.12169    5.2506 0.03498
Constant  4.357888 28.07709
Wilks' Lambda:      0.14709
Equivalent F:       49.289
Degrees of freedom: (2, 17.00)
P value:            <0.001
Candidate for removal: Type   P : 0.035  ***** not removed
Candidate for entry: GI   P : 0.0986  ***** not entered

===================== Results =====================

***** Classification functions *****
                JP       NS Partial F P value
Token     0.084392 -0.13424   12.5280 0.00252
Type     -0.238763  0.12169    5.2506 0.03498
Constant  4.357888 28.07709

***** Discriminant function *****
Discrimination between JP and NS
Mahalanobis generalized distance:    4.56895
Theoretical misclassification rate:  0.0112
         Discriminant coefficient Standardized discriminant coefficient
Token                    -0.10932                               -30.699
Type                      0.18023                                19.874
Constant                 11.85960

***** Classification result for each case *****
   Actual group Predicted group Correct? Sq. distance 1 Sq. distance 2 P value 1 P value 2 Discriminant score
1            JP              JP                0.043231       20.71354   1.00000   0.00795            10.3352
2            JP              JP                0.235372       17.91573   0.99999   0.02187             8.8402
3            JP              JP                0.219057       24.74204   0.99999   0.00172            12.2615
4            JP              JP                0.574997       28.27700   0.99977   < 0.001            13.8510
5            JP              JP                0.637415       15.02050   0.99967   0.05875             7.1915
6            JP              JP                0.141685       17.98155   1.00000   0.02137             8.9199
7            JP              JP                1.205056       13.85088   0.99659   0.08574             6.3229
8            JP              JP                0.603872       28.57770   0.99973   < 0.001            13.9869
9            JP              JP                0.189353       24.49370   1.00000   0.00189            12.1522
10           JP              JP                0.073452       21.10421   1.00000   0.00688            10.5154
11           NS              NS               15.218648        0.44673   0.05503   0.99991            -7.3860
12           NS              NS               29.683533        0.84229   < 0.001   0.99906           -14.4206
13           NS              NS               36.439409        2.19142   < 0.001   0.97457           -17.1240
14           NS              NS               36.016144        4.06094   < 0.001   0.85158           -15.9776
15           NS              NS               26.769427        6.60275   < 0.001   0.58003           -10.0833
16           NS              NS               25.803552        2.30393   0.00113   0.97025           -11.7498
17           NS              NS               42.170636        6.42752   < 0.001   0.59946           -17.8716
18           NS              NS                9.466264        2.42925   0.30450   0.96495            -3.5185
19           NS              NS                9.718567        4.27688   0.28534   0.83132            -2.7208
20           NS              NS                9.543689        2.49481   0.29852   0.96197            -3.5244
Note: "Sq. distance" is the squared Mahalanobis generalized distance to the centroid of each group.
      The P value is the probability of belonging to each group.

***** Classification summary table ****
             Predicted group
Actual group JP NS
          JP 10  0
          NS  0 10
}}
!Drawing the graphs
*The plots assume that the sdis() result above was saved first, e.g. sdis.model <- sdis(JPNSindex[2:length(JPNSindex)], JPNSindex[1], predict=T)
{{pre
> plot(sdis.model)
> plot(sdis.model, which="scatterplot", xpos="topright")
}}
{{ref_image sdis.model.png}}
{{ref_image sdis.scatterplot.png}}
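!!Sketch: Leave-One-Out Cross Validation with lda()
The lda() section mentions Leave-One-Out Cross Validation, but the calls shown there fit the model on all 20 texts and classify the same texts. The following is a minimal sketch of how the cross-validated version could look, assuming the MASS package is installed. The data frame name JPNS, the column name group, and the choice of only three predictors (Token, NoS, ASL, copied from the tables on this page) are illustrative assumptions, not part of the original analysis.

```r
library(MASS)  # provides lda()

# Values copied from the JP and NS tables on this page.
# Only three predictors are used to keep the sketch short (an assumption).
JPNS <- data.frame(
  group = factor(rep(c("JP", "NS"), each = 10)),
  Token = c(192, 237, 148, 84, 229, 200, 232, 91, 149, 192,
            608, 796, 857, 924, 847, 610, 727, 538, 412, 482),
  NoS   = c(12, 17, 11, 8, 16, 13, 18, 8, 15, 19,
            30, 28, 34, 38, 36, 23, 55, 30, 24, 22),
  ASL   = c(16.000000, 13.941180, 13.454550, 10.500000, 14.312500,
            15.384620, 12.888890, 11.375000, 9.933333, 10.105260,
            20.26667, 28.42857, 25.20588, 24.31579, 23.52778,
            26.52174, 13.21818, 17.93333, 17.16667, 21.90909)
)

# With CV = TRUE, lda() returns leave-one-out predictions ($class)
# and posterior probabilities ($posterior) instead of a fitted model object.
loo <- lda(group ~ ., data = JPNS, CV = TRUE)

# Cross-tabulate actual vs. leave-one-out predicted groups
confusion <- table(actual = JPNS$group, predicted = loo$class)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

Because each text is classified by a model fitted without it, the resulting rate is usually a more cautious estimate than the resubstitution rate reported by disc() and sdis().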
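!!Sketch: regression-based VIF
The VIF section computes 1/(1 - r^2) for each pair of variables. The more common definition uses, for each variable, the R^2 from regressing it on all the other predictors, which equals the diagonal of the inverse correlation matrix. The following is a minimal sketch of that calculation under this assumption. The data frame name X and the restriction to four predictors (values copied from the combined table on this page) are illustrative choices only.

```r
# Token, NoS, TTR and ASL copied from the combined table
# (rows 1-10 are JP, rows 11-20 are NS); the subset is an assumption.
X <- data.frame(
  Token = c(192, 237, 148, 84, 229, 200, 232, 91, 149, 192,
            608, 796, 857, 924, 847, 610, 727, 538, 412, 482),
  NoS   = c(12, 17, 11, 8, 16, 13, 18, 8, 15, 19,
            30, 28, 34, 38, 36, 23, 55, 30, 24, 22),
  TTR   = c(0.5625000, 0.5358650, 0.6216216, 0.7380952, 0.4934498,
            0.5250000, 0.4741379, 0.7362637, 0.6174497, 0.5677083,
            0.4309211, 0.4233668, 0.4189032, 0.4393939, 0.4628099,
            0.3918033, 0.3796424, 0.4479554, 0.4101942, 0.4294606),
  ASL   = c(16.000000, 13.941180, 13.454550, 10.500000, 14.312500,
            15.384620, 12.888890, 11.375000, 9.933333, 10.105260,
            20.26667, 28.42857, 25.20588, 24.31579, 23.52778,
            26.52174, 13.21818, 17.93333, 17.16667, 21.90909)
)

# Regression-based VIF: diagonal of the inverse of the correlation matrix
vif_multi <- diag(solve(cor(X)))

# Cross-check for one variable: regress Token on the other predictors
r2_token  <- summary(lm(Token ~ ., data = X))$r.squared
vif_token <- 1 / (1 - r2_token)

print(round(vif_multi, 2))
print(round(vif_token, 2))
```

The first element of vif_multi and vif_token should agree, since both express how well Token is predicted by the remaining variables taken together rather than one at a time.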