Frequency tests

x <- c(75, 68, 45, 40, 39, 39, 38, 33, 24, 24)
names(x) <- c("the", "to", "of", "you", "is", "a", "and", "in", "that", "data")
x
chisq.test(x)

 the   to   of  you   is    a  and   in that data 
  75   68   45   40   39   39   38   33   24   24 

	Chi-squared test for given probabilities

data:  x
X-squared = 59.729, df = 9, p-value = 1.512e-09

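The frequencies cannot be said to be equal: the hypothesis that all ten words occur equally often is rejected. (Why is "data" in a list that otherwise consists of function words?)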
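
As a rough check of what chisq.test(x) does when no probabilities are supplied, the statistic can be reproduced by hand, assuming the null hypothesis is that all ten words are equally frequent. The names expected, stat and p_value are illustrative and do not come from the original notes.

```r
# Word frequencies from the example above.
x <- c(75, 68, 45, 40, 39, 39, 38, 33, 24, 24)
names(x) <- c("the", "to", "of", "you", "is", "a", "and", "in", "that", "data")

# Under the null hypothesis every word has the same expected frequency.
expected <- rep(sum(x) / length(x), length(x))

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
stat <- sum((x - expected)^2 / expected)
df <- length(x) - 1

# Upper-tail probability of the chi-squared distribution with df degrees of freedom.
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

stat      # should agree with the X-squared value printed by chisq.test(x)
p_value
```

Looking at x - expected also shows which words drive the result: "the" and "to" are well above the mean, "that" and "data" well below it.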

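Example: do British English and American English differ in where "therefore" occurs in the sentence?

Position	Sentence-initial	Sentence-medial
British	15	96
American	38	53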

therefore.data <- matrix(c(15,38,96,53), nrow=2, ncol=2)

     [,1] [,2]
[1,]   15   96
[2,]   38   53

chisq.test(therefore.data)

	Pearson's Chi-squared test with Yates' continuity correction

data:  therefore.data
X-squared = 19.179, df = 1, p-value = 1.19e-05

Significant: the position of "therefore" differs between the two varieties.
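
The following sketch shows where the X-squared value comes from, assuming the usual Pearson formula with Yates' continuity correction for a 2x2 table. The dimnames and the names expected, stat and p_value are added for illustration only.

```r
therefore.data <- matrix(c(15, 38, 96, 53), nrow = 2, ncol = 2)

# Labels added only for readability (not in the original code).
dimnames(therefore.data) <- list(variety  = c("British", "American"),
                                 position = c("initial", "medial"))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(therefore.data), colSums(therefore.data)) /
  sum(therefore.data)

# Yates' correction subtracts 0.5 from each |observed - expected|.
# (chisq.test() caps the correction at |observed - expected|; here every
# difference is far larger than 0.5, so the plain 0.5 gives the same value.)
stat <- sum((abs(therefore.data - expected) - 0.5)^2 / expected)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

expected
stat      # should agree with the X-squared value printed by chisq.test()
p_value
```

The same expected counts are also available as chisq.test(therefore.data)$expected.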

Goodness-of-fit test

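Tests whether observed frequencies match (fit) the theoretically expected frequencies.
Typical use: comparing frequencies between corpora with different total word counts
- Example: is there a difference in frequency between 36 occurrences in a 1-million-word corpus and 20 occurrences in a 500,000-word corpus?
Set the expected probabilities from the ratio of the corpus sizes
- The sizes are 1 million vs. 500,000, so the ratio is 2:1.
  - The sizes can also be passed as they are: p=c(1000000,500000).
- The option rescale=T (a partial match for the argument rescale.p) rescales p so that the values sum to 1.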

sample.data <- c(36, 20)

chisq.test(sample.data, p=c(2,1), rescale=T)

	Chi-squared test for given probabilities

data:  sample.data
X-squared = 0.14286, df = 1, p-value = 0.7055

Not significant: the two frequencies are consistent with the 2:1 ratio of corpus sizes.
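
A small sketch of the point about the ratio: with rescaling turned on, the simplified ratio and the raw corpus sizes should give the same test. The names corpus.sizes, by.ratio and by.size are illustrative. The full argument name rescale.p is used here; rescale=T in the call above reaches the same argument through R's partial matching.

```r
sample.data  <- c(36, 20)
corpus.sizes <- c(1000000, 500000)   # illustrative name for the two corpus sizes

# Expected proportions given as the simplified ratio 2:1 ...
by.ratio <- chisq.test(sample.data, p = c(2, 1), rescale.p = TRUE)

# ... or as the raw corpus sizes; rescale.p divides p by its sum either way.
by.size <- chisq.test(sample.data, p = corpus.sizes, rescale.p = TRUE)

by.ratio$expected    # expected counts: the 56 occurrences split 2:1
by.ratio$statistic
by.size$statistic    # same statistic as by.ratio

# Frequencies normalized per million words, for orientation.
sample.data / corpus.sizes * 1e6
```

Without rescale.p = TRUE, chisq.test() stops with an error unless p already sums to 1.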

G-test (log-likelihood ratio test)

install.packages("Deducer")
library(Deducer)

likelihood.test(therefore.data)

	Log likelihood ratio (G-test) test of independence without correction

data:  therefore.data
Log likelihood ratio statistic (G) = 20.925, X-squared df = 1, p-value = 4.776e-06
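
If installing Deducer is inconvenient, the uncorrected G statistic can be sketched in base R, assuming the standard definition G = 2 * sum(observed * log(observed / expected)) and a table without zero cells. The names expected, G and p_value are illustrative.

```r
therefore.data <- matrix(c(15, 38, 96, 53), nrow = 2, ncol = 2)

# Expected counts under independence, as in the chi-squared test.
expected <- outer(rowSums(therefore.data), colSums(therefore.data)) /
  sum(therefore.data)

# G statistic (no continuity correction).
# Assumes every observed count is > 0; a zero cell would give NaN here.
G <- 2 * sum(therefore.data * log(therefore.data / expected))

# G is referred to a chi-squared distribution with (rows - 1) * (cols - 1) df.
df <- (nrow(therefore.data) - 1) * (ncol(therefore.data) - 1)
p_value <- pchisq(G, df = df, lower.tail = FALSE)

G
p_value
```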

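Fisher's exact probability test (Fisher's exact test)

Back when the calculation was done by hand, a 2x2 table was the practical limit, but that is a thing of the past: fisher.test() also accepts larger tables.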

therefore.data
fisher.test(therefore.data)

     [,1] [,2]
[1,]   15   96
[2,]   38   53

	Fisher's Exact Test for Count Data

data:  therefore.data
p-value = 9.958e-06
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.1021974 0.4526589
sample estimates:
odds ratio 
 0.2196899
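
To see what "exact" means here, the two-sided p-value for a 2x2 table can be sketched from the hypergeometric distribution, assuming the usual definition: with the margins fixed, sum the probabilities of all tables that are no more likely than the observed one. The variable names are illustrative.

```r
therefore.data <- matrix(c(15, 38, 96, 53), nrow = 2, ncol = 2)

m <- sum(therefore.data[, 1])    # first-column total
n <- sum(therefore.data[, 2])    # second-column total
k <- sum(therefore.data[1, ])    # first-row total
observed <- therefore.data[1, 1]

# Every value the top-left cell can take while the margins stay fixed.
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)

# Two-sided p-value: total probability of tables no more likely than the
# observed one. The small tolerance guards against floating-point ties.
p_obs <- dhyper(observed, m, n, k)
p_value <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_value   # should agree with the p-value printed by fisher.test()
```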

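Odds ratio

A value that compares the odds of occurrence between two groups.
It indicates how many times more often something occurs (for rare events, the odds ratio is close to the ratio of the probabilities).

An easy-to-follow example from Shimada and Abe (嶋田・阿部, 2017)
- "In a traffic accident, how much higher is the probability of dying depending on whether or not a seat belt was worn?"

Seat belt	Died	Injured
Not worn	54	10325
Worn	25	51790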

d <- matrix(c(54, 10325, 25, 51790), ncol=2, nrow=2, byrow=T)
d
fisher.test(d)

     [,1]  [,2]
[1,]   54 10325
[2,]   25 51790

	Fisher's Exact Test for Count Data

data:  d
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  6.623513 18.173941
sample estimates:
odds ratio 
  10.83069

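The odds ratio is 10.83069, so dying was more than 10 times as likely without a seat belt.
The 95% confidence interval is 6.623513 to 18.173941, i.e. from about 6.6 times to about 18 times (the book, for some reason, says 17).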
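
For comparison, the plain sample odds ratio can be computed directly from the table. fisher.test() reports a conditional maximum likelihood estimate rather than this cross-product ratio, so the two values differ slightly. The dimnames and the names odds, sample.or, risk and risk.ratio are illustrative.

```r
d <- matrix(c(54, 10325, 25, 51790), ncol = 2, nrow = 2, byrow = TRUE)

# Labels added only for readability (not in the original code).
dimnames(d) <- list(seatbelt = c("no", "yes"),
                    outcome  = c("died", "injured"))

# Odds of dying in each group: deaths / injuries.
odds <- d[, "died"] / d[, "injured"]

# Sample odds ratio: odds without a seat belt / odds with one (about 10.83).
sample.or <- odds[["no"]] / odds[["yes"]]

# Ratio of the probabilities of dying, for comparison with the odds ratio.
risk <- d[, "died"] / rowSums(d)
risk.ratio <- risk[["no"]] / risk[["yes"]]

sample.or
risk.ratio
```

Because deaths are rare in both groups, the risk ratio comes out close to the odds ratio, which is why reading the odds ratio as "how many times more likely" is reasonable in this example.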

xtabs() is handy for preparing contingency tables
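
A minimal sketch of how xtabs() could build the "therefore" table, assuming the data arrive either as one row per token or as pre-aggregated counts. The data frames tokens and agg and their column names are hypothetical, constructed here to match the counts used earlier.

```r
# Hypothetical token-level data: one row per occurrence of "therefore".
tokens <- data.frame(
  variety  = factor(rep(c("British", "American"), times = c(111, 91)),
                    levels = c("British", "American")),
  position = factor(c(rep(c("initial", "medial"), times = c(15, 96)),
                      rep(c("initial", "medial"), times = c(38, 53))),
                    levels = c("initial", "medial"))
)

# With nothing on the left of ~, xtabs() counts rows per combination.
tab <- xtabs(~ variety + position, data = tokens)
tab

# Hypothetical pre-aggregated data: put the count column on the left of ~.
agg <- data.frame(
  variety  = factor(c("British", "British", "American", "American"),
                    levels = c("British", "American")),
  position = factor(c("initial", "medial", "initial", "medial"),
                    levels = c("initial", "medial")),
  count    = c(15, 96, 38, 53)
)
tab2 <- xtabs(count ~ variety + position, data = agg)

# The resulting tables can be passed straight to the tests above.
chisq.test(tab)
fisher.test(tab2)
```

Setting the factor levels explicitly keeps the rows and columns in the intended order; otherwise they are sorted alphabetically.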

