R

!!!Discriminant Analysis Example

{{outline}}

!Computing linguistic features from text files

*Process the learner data
**Set the working directory to the directory that contains the data
**Use list.files() to confirm that the files are there
{{pre
> list.files()
 [1] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" "JAN0001_P5B.txt"
 [6] "JAN0001_P6B.txt" "JAN0001_P7B.txt" "JAN0001_P8B.txt" "JAN0002_P1A.txt" "JAN0002_P2A.txt"
}}
*Run the program that computes the basic linguistic features (myTextIndex.R)
 myTextIndex()
*The results file is written to the parent directory (one level up)
*Read the saved results back in
**choose.files() is available only on Windows; file.choose() is the cross-platform equivalent for picking a single file
{{pre
> myTextIndex()
Read 12 items
Read 17 items
Read 11 items
Read 8 items
Read 16 items
Read 13 items
Read 18 items
Read 8 items
Read 15 items
Read 19 items
> JPindex <- read.table(choose.files())
}}
*Assign column names
{{pre
> JPindex
                V1  V2  V3 V4        V5       V6        V7       V8        V9
1  JAN0001_P1B.txt 192 108 12 0.5625000 7.794229 0.6929687 4.562500 16.000000
2  JAN0001_P2B.txt 237 127 17 0.5358650 8.249536 0.6880591 4.329114 13.941180
3  JAN0001_P3B.txt 148  92 11 0.6216216 7.562353 0.7097297 4.391892 13.454550
4  JAN0001_P4B.txt  84  62  8 0.7380952 6.764755 0.6200000 4.547619 10.500000
5  JAN0001_P5B.txt 229 113 16 0.4934498 7.467250 0.6471179 4.161572 14.312500
6  JAN0001_P6B.txt 200 105 13 0.5250000 7.424621 0.6608500 4.170000 15.384620
7  JAN0001_P7B.txt 232 110 18 0.4741379 7.221854 0.6408190 4.353448 12.888890
8  JAN0001_P8B.txt  91  67  8 0.7362637 7.023508 0.6700000 4.318681 11.375000
9  JAN0002_P1A.txt 149  92 15 0.6174497 7.536934 0.6869799 4.630872  9.933333
10 JAN0002_P2A.txt 192 109 19 0.5677083 7.866397 0.6815625 4.578125 10.105260
> names(JPindex) <- c("filename", "Token", "Type", "NoS", "TTR", "GI", "MATTR", "AWL", "ASL")
> JPindex
          filename Token Type NoS       TTR       GI     MATTR      AWL       ASL
1  JAN0001_P1B.txt   192  108  12 0.5625000 7.794229 0.6929687 4.562500 16.000000
2  JAN0001_P2B.txt   237  127  17 0.5358650 8.249536 0.6880591 4.329114 13.941180
3  JAN0001_P3B.txt   148   92  11 0.6216216 7.562353 0.7097297 4.391892 13.454550
4  JAN0001_P4B.txt    84   62   8 0.7380952 6.764755 0.6200000 4.547619 10.500000
5  JAN0001_P5B.txt   229  113  16 0.4934498 7.467250 0.6471179 4.161572 14.312500
6  JAN0001_P6B.txt   200  105  13 0.5250000 7.424621 0.6608500 4.170000 15.384620
7  JAN0001_P7B.txt   232  110  18 0.4741379 7.221854 0.6408190 4.353448 12.888890
8  JAN0001_P8B.txt    91   67   8 0.7362637 7.023508 0.6700000 4.318681 11.375000
9  JAN0002_P1A.txt   149   92  15 0.6174497 7.536934 0.6869799 4.630872  9.933333
10 JAN0002_P2A.txt   192  109  19 0.5677083 7.866397 0.6815625 4.578125 10.105260
}}
*Process the native-speaker data in the same way
{{pre
> list.files()
> myTextIndex()
> NSindex <- read.table(choose.files())
> names(NSindex) <- c("filename", "Token", "Type", "NoS", "TTR", "GI", "MATTR", "AWL", "ASL")
> NSindex
           filename Token Type NoS       TTR        GI     MATTR      AWL      ASL
1  ENG0002_1P1A.txt   608  262  30 0.4309211 10.625500 0.6851645 4.817434 20.26667
2  ENG0002_2P5A.txt   796  337  28 0.4233668 11.944650 0.6943090 5.026382 28.42857
3  ENG0002_3P2A.txt   857  359  34 0.4189032 12.263210 0.7148191 4.868145 25.20588
4  ENG0002_4P7A.txt   924  406  38 0.4393939 13.356420 0.6975649 4.961039 24.31579
5  ENG0002_5P6A.txt   847  392  36 0.4628099 13.469280 0.7370248 4.472255 23.52778
6  ENG0002_6P3A.txt   610  239  23 0.3918033  9.676827 0.6592951 4.645902 26.52174
7  ENG0002_7P4A.txt   727  276  55 0.3796424 10.236270 0.6694360 4.093535 13.21818
8  ENG0002_8P8A.txt   538  241  30 0.4479554 10.390250 0.6657435 4.486989 17.93333
9  ENG0003_1P8B.txt   412  169  24 0.4101942  8.326032 0.6609223 4.296117 17.16667
10 ENG0003_2P4B.txt   482  207  22 0.4294606  9.428592 0.6858921 4.315353 21.90909
}}

!Replace the file names with category labels (work on a copy)

{{pre
> JPindex2 <- JPindex
> JPindex2$filename <- "JP"
> JPindex2
   filename Token Type NoS       TTR       GI     MATTR      AWL       ASL
1        JP   192  108  12 0.5625000 7.794229 0.6929687 4.562500 16.000000
2        JP   237  127  17 0.5358650 8.249536 0.6880591 4.329114 13.941180
3        JP   148   92  11 0.6216216 7.562353 0.7097297 4.391892 13.454550
4        JP    84   62   8 0.7380952 6.764755 0.6200000 4.547619 10.500000
5        JP   229  113  16 0.4934498 7.467250 0.6471179 4.161572 14.312500
6        JP   200  105  13 0.5250000 7.424621 0.6608500 4.170000 15.384620
7        JP   232  110  18 0.4741379 7.221854 0.6408190 4.353448 12.888890
8        JP    91   67   8 0.7362637 7.023508 0.6700000 4.318681 11.375000
9        JP   149   92  15 0.6174497 7.536934 0.6869799 4.630872  9.933333
10       JP   192  109  19 0.5677083 7.866397 0.6815625 4.578125 10.105260
> NSindex2 <- NSindex
> NSindex2$filename <- "NS"
> NSindex2
   filename Token Type NoS       TTR        GI     MATTR      AWL      ASL
1        NS   608  262  30 0.4309211 10.625500 0.6851645 4.817434 20.26667
2        NS   796  337  28 0.4233668 11.944650 0.6943090 5.026382 28.42857
3        NS   857  359  34 0.4189032 12.263210 0.7148191 4.868145 25.20588
4        NS   924  406  38 0.4393939 13.356420 0.6975649 4.961039 24.31579
5        NS   847  392  36 0.4628099 13.469280 0.7370248 4.472255 23.52778
6        NS   610  239  23 0.3918033  9.676827 0.6592951 4.645902 26.52174
7        NS   727  276  55 0.3796424 10.236270 0.6694360 4.093535 13.21818
8        NS   538  241  30 0.4479554 10.390250 0.6657435 4.486989 17.93333
9        NS   412  169  24 0.4101942  8.326032 0.6609223 4.296117 17.16667
10       NS   482  207  22 0.4294606  9.428592 0.6858921 4.315353 21.90909
}}

!Combine the two data sets

{{pre
> JPNSindex <- rbind(JPindex2, NSindex2)
> JPNSindex
   filename Token Type NoS       TTR        GI     MATTR      AWL       ASL
1        JP   192  108  12 0.5625000  7.794229 0.6929687 4.562500 16.000000
2        JP   237  127  17 0.5358650  8.249536 0.6880591 4.329114 13.941180
3        JP   148   92  11 0.6216216  7.562353 0.7097297 4.391892 13.454550
4        JP    84   62   8 0.7380952  6.764755 0.6200000 4.547619 10.500000
5        JP   229  113  16 0.4934498  7.467250 0.6471179 4.161572 14.312500
6        JP   200  105  13 0.5250000  7.424621 0.6608500 4.170000 15.384620
7        JP   232  110  18 0.4741379  7.221854 0.6408190 4.353448 12.888890
8        JP    91   67   8 0.7362637  7.023508 0.6700000 4.318681 11.375000
9        JP   149   92  15 0.6174497  7.536934 0.6869799 4.630872  9.933333
10       JP   192  109  19 0.5677083  7.866397 0.6815625 4.578125 10.105260
11       NS   608  262  30 0.4309211 10.625500 0.6851645 4.817434 20.266670
12       NS   796  337  28 0.4233668 11.944650 0.6943090 5.026382 28.428570
13       NS   857  359  34 0.4189032 12.263210 0.7148191 4.868145 25.205880
14       NS   924  406  38 0.4393939 13.356420 0.6975649 4.961039 24.315790
15       NS   847  392  36 0.4628099 13.469280 0.7370248 4.472255 23.527780
16       NS   610  239  23 0.3918033  9.676827 0.6592951 4.645902 26.521740
17       NS   727  276  55 0.3796424 10.236270 0.6694360 4.093535 13.218180
18       NS   538  241  30 0.4479554 10.390250 0.6657435 4.486989 17.933330
19       NS   412  169  24 0.4101942  8.326032 0.6609223 4.296117 17.166670
20       NS   482  207  22 0.4294606  9.428592 0.6858921 4.315353 21.909090
}}

!Try computing the VIF

*The values below are 1/(1 - r^2) computed from each pairwise correlation r, so they describe collinearity between two features at a time; the diagonal is Inf because every feature has r = 1 with itself
{{pre
> JPNS_LDA.model.cor <- cor(JPNSindex[2:length(JPNSindex)])
> JPNS_LDA.model.cor
           Token       Type        NoS        TTR         GI      MATTR        AWL        ASL
Token  1.0000000  0.9913503  0.8650134 -0.7713369  0.9669768  0.4936503  0.5040331  0.8377107
Type   0.9913503  1.0000000  0.8307555 -0.7155131  0.9910430  0.5588513  0.5403353  0.8340614
NoS    0.8650134  0.8307555  1.0000000 -0.7555247  0.7890872  0.3418692  0.1846240  0.4845738
TTR   -0.7713369 -0.7155131 -0.7555247  1.0000000 -0.6463215 -0.2186894 -0.1535317 -0.6873747
GI     0.9669768  0.9910430  0.7890872 -0.6463215  1.0000000  0.6281433  0.5746082  0.8165488
MATTR  0.4936503  0.5588513  0.3418692 -0.2186894  0.6281433  1.0000000  0.3823200  0.4490040
AWL    0.5040331  0.5403353  0.1846240 -0.1535317  0.5746082  0.3823200  1.0000000  0.5874034
ASL    0.8377107  0.8340614  0.4845738 -0.6873747  0.8165488  0.4490040  0.5874034  1.0000000
> JPNS_LDA.model.vif <- 1/(1-JPNS_LDA.model.cor^2)
> JPNS_LDA.model.vif
          Token      Type      NoS      TTR        GI    MATTR      AWL      ASL
Token       Inf 58.056236 3.972167 2.468896 15.395085 1.322210 1.340571 3.352996
Type  58.056236       Inf 3.227416 2.049008 56.073371 1.454154 1.412354 3.285783
NoS    3.972167  3.227416      Inf 2.330012  2.650120 1.132342 1.035289 1.306868
TTR    2.468896  2.049008 2.330012      Inf  1.717421 1.050227 1.024141 1.895677
GI    15.395085 56.073371 2.650120 1.717421       Inf 1.651702 1.492926 3.000767
MATTR  1.322210  1.454154 1.132342 1.050227  1.651702      Inf 1.171191 1.252512
AWL    1.340571  1.412354 1.035289 1.024141  1.492926 1.171191      Inf 1.526817
ASL    3.352996  3.285783 1.306868 1.895677  3.000767 1.252512 1.526817      Inf
}}
*Three pairs reach 10 or more: Token and Type (58.06), Type and GI (56.07), Token and GI (15.40)
*Keep Token and drop Type and GI
{{pre
> JPNSindex2 <- JPNSindex[, c(1,2,4,5,7,8,9)]
> JPNS_LDA.cor2 <- cor(JPNSindex2[2:length(JPNSindex2)])
> JPNS_LDA.cor2
           Token        NoS        TTR      MATTR        AWL        ASL
Token  1.0000000  0.8650134 -0.7713369  0.4936503  0.5040331  0.8377107
NoS    0.8650134  1.0000000 -0.7555247  0.3418692  0.1846240  0.4845738
TTR   -0.7713369 -0.7555247  1.0000000 -0.2186894 -0.1535317 -0.6873747
MATTR  0.4936503  0.3418692 -0.2186894  1.0000000  0.3823200  0.4490040
AWL    0.5040331  0.1846240 -0.1535317  0.3823200  1.0000000  0.5874034
ASL    0.8377107  0.4845738 -0.6873747  0.4490040  0.5874034  1.0000000
> JPNS_LDA.vif2 <- 1/(1-JPNS_LDA.cor2^2)
> JPNS_LDA.vif2
         Token      NoS      TTR    MATTR      AWL      ASL
Token      Inf 3.972167 2.468896 1.322210 1.340571 3.352996
NoS   3.972167      Inf 2.330012 1.132342 1.035289 1.306868
TTR   2.468896 2.330012      Inf 1.050227 1.024141 1.895677
MATTR 1.322210 1.132342 1.050227      Inf 1.171191 1.252512
AWL   1.340571 1.035289 1.024141 1.171191      Inf 1.526817
ASL   3.352996 1.306868 1.895677 1.252512 1.526817      Inf
}}

!!Discriminant analysis

!lda()

*Run the discriminant analysis with lda() from the MASS package (load it first with library(MASS)); the general form is
 model <- lda(category ~ ., data = data)
*For leave-one-out cross-validation, add CV = TRUE to the call; the run below omits it, so the model is fitted to all 20 texts
*The run below uses JPNSindex, so all eight features enter the model, including Type and GI
{{pre
> JPNS_LDA.model <- lda(filename ~ ., data=JPNSindex)
> JPNS_LDA.model
Call:
lda(filename ~ ., data = JPNSindex)

Prior probabilities of groups:
 JP  NS 
0.5 0.5 

Group means:
   Token  Type  NoS       TTR        GI     MATTR      AWL      ASL
JP 175.4  98.5 13.7 0.5872091  7.491144 0.6698087 4.404382 12.78953
NS 680.1 288.8 32.0 0.4234451 10.971703 0.6870171 4.598315 21.84937

Coefficients of linear discriminants:
              LD1
Token   0.1264180
Type   -0.4157560
NoS    -0.5039002
TTR   -13.5498322
GI     10.3108076
MATTR -25.6348057
AWL    -3.8158570
ASL    -0.6479296
}}

!Prof. Aoki's disc()

{{pre
> JPNS_LDA.model.aoki <- disc(JPNSindex[2:length(JPNSindex)], JPNSindex[1])
> JPNS_LDA.model.aoki

Discriminant function
             JP:NS Partial F p-value
Token     -0.82405   6.83105 0.02410
Type       2.71010   8.72961 0.01310
NoS        3.28466   2.21374 0.16488
TTR       88.32434   2.05427 0.17959
GI       -67.21082   8.15125 0.01565
MATTR    167.10003   1.77124 0.21016
AWL       24.87360   3.09877 0.10609
ASL        4.22352   1.43948 0.25543
constant  29.96605

Classification functions
                  JP          NS
Token        1.76864     0.12053
Type         2.65903     8.07922
NoS        -38.88947   -32.32014
TTR       -880.73144  -704.08275
GI         -62.69086  -197.11249
MATTR    -2288.39994 -1954.19988
AWL       -298.49685  -248.74966
ASL        -58.85016   -50.40312
constant  2273.80143  2333.73353

Classification results
     prediction
group JP NS
   JP 10  0
   NS  0 10

Correct classification rate = 100.0 %
}}
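The two checks discussed on this page, multicollinearity and cross-validated classification, can be sketched as a single script. This is a minimal sketch and not the procedure used above. It assumes the MASS package is installed, retypes the 20 rows of the JPNSindex listing, and introduces vif_of() as a hypothetical helper. It computes the regression-based VIF (each feature regressed on all the others) instead of the pairwise values shown earlier, and it fits the leave-one-out model on the reduced feature set (Token, NoS, TTR, MATTR, AWL, ASL), which is an assumption, since the lda() run on this page used all eight features.

```r
# Sketch only: base R plus MASS. The data are retyped from the JPNSindex listing.
library(MASS)  # provides lda()

JPNSindex <- read.table(header = TRUE, text = "
filename Token Type NoS TTR GI MATTR AWL ASL
JP 192 108 12 0.5625000  7.794229 0.6929687 4.562500 16.000000
JP 237 127 17 0.5358650  8.249536 0.6880591 4.329114 13.941180
JP 148  92 11 0.6216216  7.562353 0.7097297 4.391892 13.454550
JP  84  62  8 0.7380952  6.764755 0.6200000 4.547619 10.500000
JP 229 113 16 0.4934498  7.467250 0.6471179 4.161572 14.312500
JP 200 105 13 0.5250000  7.424621 0.6608500 4.170000 15.384620
JP 232 110 18 0.4741379  7.221854 0.6408190 4.353448 12.888890
JP  91  67  8 0.7362637  7.023508 0.6700000 4.318681 11.375000
JP 149  92 15 0.6174497  7.536934 0.6869799 4.630872  9.933333
JP 192 109 19 0.5677083  7.866397 0.6815625 4.578125 10.105260
NS 608 262 30 0.4309211 10.625500 0.6851645 4.817434 20.266670
NS 796 337 28 0.4233668 11.944650 0.6943090 5.026382 28.428570
NS 857 359 34 0.4189032 12.263210 0.7148191 4.868145 25.205880
NS 924 406 38 0.4393939 13.356420 0.6975649 4.961039 24.315790
NS 847 392 36 0.4628099 13.469280 0.7370248 4.472255 23.527780
NS 610 239 23 0.3918033  9.676827 0.6592951 4.645902 26.521740
NS 727 276 55 0.3796424 10.236270 0.6694360 4.093535 13.218180
NS 538 241 30 0.4479554 10.390250 0.6657435 4.486989 17.933330
NS 412 169 24 0.4101942  8.326032 0.6609223 4.296117 17.166670
NS 482 207 22 0.4294606  9.428592 0.6858921 4.315353 21.909090
")
JPNSindex$filename <- factor(JPNSindex$filename)  # group labels: JP, NS

# Hypothetical helper: regression-based VIF for every column of df.
# The diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2),
# where R_j^2 comes from regressing feature j on all the other features.
vif_of <- function(df) diag(solve(cor(df)))

features <- JPNSindex[, -1]
reduced  <- c("Token", "NoS", "TTR", "MATTR", "AWL", "ASL")

vif_all     <- vif_of(features)           # all eight features
vif_reduced <- vif_of(features[, reduced])  # after dropping Type and GI
print(round(vif_all, 2))
print(round(vif_reduced, 2))

# LDA with leave-one-out cross-validation on the reduced feature set.
# With CV = TRUE, lda() returns a list with $class and $posterior
# (one held-out prediction per text) instead of a fitted model object.
cv <- lda(filename ~ Token + NoS + TTR + MATTR + AWL + ASL,
          data = JPNSindex, CV = TRUE)

confusion <- table(actual = JPNSindex$filename, predicted = cv$class)
print(confusion)

loocv_accuracy <- sum(diag(confusion)) / sum(confusion)
print(loocv_accuracy)
```

Each pairwise value 1/(1 - r^2) is a lower bound on the regression-based VIF of the same feature, so a feature that looks acceptable in the pairwise table can still be flagged here. The cross-validated accuracy is the more cautious figure to report, because every text is classified by a model that was fitted without it.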