R !!!table(): クロス集計 {{outline}} ---- !Reference *http://eau.uijin.com/stats/frequencies.html *https://www1.doshisha.ac.jp/~mjin/R/46/46.html !!例 !サンプル・データ *https://www.openintro.org/data/index.php?data=acs12 **Results from the US Census American Community Survey, 2012 **2012年のアメリカ人口統計からの2,000のサンプル !データの取り込み {{pre acs.dat <- read.csv("https://www.openintro.org/data/csv/acs12.csv") head(acs.dat) income employment hrs_work race age gender citizen time_to_work lang married edu disability birth_qrtr 1 60000 not in labor force 40 white 68 female yes NA english no college no jul thru sep 2 0 not in labor force NA white 88 male yes NA english no hs or lower yes jan thru mar 3 NA NA white 12 female yes NA english no hs or lower no oct thru dec 4 0 not in labor force NA white 17 male yes NA other no hs or lower no oct thru dec 5 0 not in labor force NA white 77 female yes NA other no hs or lower yes jul thru sep 6 1700 employed 40 other 35 female yes 15 other yes hs or lower yes jul thru sep }} {{pre str(acs.dat) 'data.frame': 2000 obs. of 13 variables: $ income : int 60000 0 NA 0 0 1700 NA NA NA 45000 ... $ employment : chr "not in labor force" "not in labor force" NA "not in labor force" ... $ hrs_work : int 40 NA NA NA NA 40 NA NA NA 84 ... $ race : chr "white" "white" "white" "white" ... $ age : int 68 88 12 17 77 35 11 7 6 27 ... $ gender : chr "female" "male" "female" "male" ... $ citizen : chr "yes" "yes" "yes" "yes" ... $ time_to_work: int NA NA NA NA NA 15 NA NA NA 40 ... $ lang : chr "english" "english" "english" "other" ... $ married : chr "no" "no" "no" "no" ... $ edu : chr "college" "hs or lower" "hs or lower" "hs or lower" ... $ disability : chr "no" "yes" "no" "no" ... $ birth_qrtr : chr "jul thru sep" "jan thru mar" "oct thru dec" "oct thru dec" ... }} !クロス集計:table(データフレームのカラム, データフレームのカラム) *人種と教育 {{pre table(acs.dat$race, acs.dat$edu) college grad hs or lower asian 22 11 53 black 26 9 162 other 12 9 125 white 299 115 1099 }} !クロス集計:table(dataframe[,c(カラム番号を並べる)]) !NAも集計に入れる: useNA = "ifany" {{pre table(acs.dat$race, acs.dat$edu, useNA="ifany") college grad hs or lower asian 22 11 53 1 black 26 9 162 9 other 12 9 125 6 white 299 115 1099 42 }} *useNA = "always"とすると、縦横ですべて表示 {{pre table(acs.dat$race, acs.dat$edu, useNA="always") college grad hs or lower asian 22 11 53 1 black 26 9 162 9 other 12 9 125 6 white 299 115 1099 42 0 0 0 0 }} !縦横の集計も加える: addmargins() {{pre addmargins(table(acs.dat$race, acs.dat$edu)) college grad hs or lower Sum asian 22 11 53 86 black 26 9 162 197 other 12 9 125 146 white 299 115 1099 1513 Sum 359 144 1439 1942 addmargins(table(acs.dat$race, acs.dat$edu, useNA="ifany")) college grad hs or lower Sum asian 22 11 53 1 87 black 26 9 162 9 206 other 12 9 125 6 152 white 299 115 1099 42 1555 Sum 359 144 1439 58 2000 }} *NAの数を確認: is.na() でTRUEがNAの数 {{pre table(is.na(acs.dat$race)) table(is.na(acs.dat$edu)) FALSE 2000 FALSE TRUE 1942 58 }} !比率で表示: prop.table(table()) *全体の内訳の比率 {{pre prop.table(table(acs.dat$race, acs.dat$edu)) college grad hs or lower asian 0.011328527 0.005664264 0.027291452 black 0.013388260 0.004634398 0.083419156 other 0.006179197 0.004634398 0.064366632 white 0.153964985 0.059217302 0.565911432 }} !比率で表示(四捨五入): round(prop.table(table()),2) *小数点以下2桁 {{pre round(prop.table(table(acs.dat$race, acs.dat$edu)),2) college grad hs or lower asian 0.01 0.01 0.03 black 0.01 0.00 0.08 other 0.01 0.00 0.06 white 0.15 0.06 0.57 }} !全体の比率ではなく、横で集計した比率:margin = 1 *この場合は、人種ごとに学歴の比率を出す {{pre round(prop.table(table(acs.dat$race, acs.dat$edu), margin=1),2) college grad hs or lower asian 0.26 0.13 0.62 black 0.13 0.05 0.82 other 0.08 0.06 0.86 white 0.20 0.08 0.73 }} *白人の73%は高卒 *大卒の比率はアジア系が一番高い *大学院修了者の比率もアジア系が一番高い **アジア系の13%の人は大学院修了 !全体の比率ではなく、縦で集計した比率:margin = 2 *学歴ごとに、それを構成する人種の割合 {{pre round(prop.table(table(acs.dat$race, acs.dat$edu), margin=2),2) college grad hs or lower asian 0.06 0.08 0.04 black 0.07 0.06 0.11 other 0.03 0.06 0.09 white 0.83 0.80 0.76 }} *大卒の83%は白人 !!モザイク・グラフ vcd: Visualizing Categorical Data install.packages("vcd") https://cran.r-project.org/web/packages/vcd/index.html https://stats.stackexchange.com/questions/147863/how-to-interpret-the-residual-colors-on-a-mosaic-plot {{pre library(vcd) mosaic(xtabs(~acs.dat$race+acs.dat$edu)) }} {{ref_image mosaic.png}} * オプションで shade=T つけると、色がつく。 ** 残差分析 *** 青:期待値より観察値が多い *** 赤:期待値より観察値が少ない !!結果をデータフレームに変換: as.data.frame() !変換前 table(nns.bigram.df) %>% head() ngram.all 0 years 05 euro 1 2019 1 euro 1 grandfather 1 hour 327 209 23 418 295 159 !変換後 table(nns.bigram.df) %>% as.data.frame() %>% head() ngram.all Freq 1 0 years 327 2 05 euro 209 3 1 2019 23 4 1 euro 418 5 1 grandfather 295 6 1 hour 159