
Type and Token


Using the R package corpus as a reference, look at how types and tokens behave.

http://corpustext.com/articles/corpus.html

Fetch the text of The Wizard of Oz and keep only the body.

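# `text` is assumed to already hold the body of the novel as a character string
oz.text <- gsub("\\n", " ", text)                 # newlines -> spaces
oz.text.nopunct <- gsub("\\W+", " ", oz.text)     # runs of non-word characters -> one space
oz.words <- strsplit(oz.text.nopunct, "\\W")      # split on the remaining spaces
oz.words <- unlist(oz.words)                      # list -> character vector
write(oz.words, file="ozWords.txt")               # save the word list
length(oz.words)                                  # number of tokens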

39,456 words

> head(sort(table(oz.words), decreasing=TRUE), 10)
oz.words
 the  and   to   of    a    I  was   in  you   he 
2731 1593 1096  811  795  647  501  463  448  410
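The tokenization steps above can be tried on a small string before running them on the whole novel. The following is a minimal sketch; `sample.text` is a made-up toy sentence, not the original data, and the extra step that drops empty strings is an addition that the code above does not perform.

```r
# Toy input standing in for the novel text (hypothetical, not the original data)
sample.text <- "Dorothy lived in the midst of the great Kansas prairies,\nwith Uncle Henry and Aunt Em."

# Same three steps as above: newlines -> spaces, non-word runs -> one space, split
flat    <- gsub("\\n", " ", sample.text)
nopunct <- gsub("\\W+", " ", flat)
words   <- unlist(strsplit(nopunct, "\\W"))

# Extra step (an assumption, not in the original): drop empty strings,
# which can appear when the text starts with punctuation or whitespace
words <- words[words != ""]

length(words)          # number of tokens
length(unique(words))  # number of types (case-sensitive)
head(sort(table(words), decreasing = TRUE), 3)
```

Note that splitting on `\W` also breaks words at apostrophes, so a form such as "don't" would be counted as two tokens.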

Look at the distribution of types and tokens

Create a matrix with 394 rows and 2 columns, initialized to 0.

> oztt <- matrix(0, nrow=394, ncol=2)

Accumulate the text 100 words at a time and count the tokens and types at each step, up to 39,400 words.

# Row i holds the counts for the first i * 100 words
for (i in 1:394) {
	tmp <- oz.words[1:(i * 100)]
	oztt[i,1] <- length(tmp)          # tokens
	oztt[i,2] <- length(unique(tmp))  # types
}

Convert the matrix to a data frame and name the columns.

> oztt <- as.data.frame(oztt)
> colnames(oztt) <- c("token","type")
> head(oztt)
  token type
1   100   63
2   200  122
3   300  167
4   400  209
5   500  248
6   600  280

> plot(oztt$token, oztt$type)
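The loop above recomputes `unique()` on an ever longer prefix at each step. As a possible alternative, the cumulative number of types can be obtained in one pass with `duplicated()`. This is a sketch on a toy vector; `words` here is a hypothetical stand-in for `oz.words`.

```r
# Toy token vector (hypothetical stand-in for oz.words)
words <- c("the", "cat", "saw", "the", "dog", "and", "the", "dog", "saw", "it")

# !duplicated(words) is TRUE at the first occurrence of each word,
# so its cumulative sum is the number of types among the first i tokens
cum.types <- cumsum(!duplicated(words))

tt <- data.frame(token = seq_along(words), type = cum.types)

# Keep every 2nd row here; with the novel, rows seq(100, 39400, by = 100)
# would correspond to the 100-word steps used in the loop
tt.sampled <- tt[seq(2, length(words), by = 2), ]
print(tt.sampled)

# Cross-check against the explicit prefix-by-prefix count
loop.types <- sapply(seq_along(words), function(i) length(unique(words[1:i])))
stopifnot(all(cum.types == loop.types))
```

Both approaches give the same counts; the one-pass version mainly helps when every single token position is tabulated, as in the later sections.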

Look at lexical diversity

Add TTR

> oztt$ttr <- oztt$type / oztt$token
> head(oztt)
  token type       ttr
1   100   63 0.6300000
2   200  122 0.6100000
3   300  167 0.5566667
4   400  209 0.5225000
5   500  248 0.4960000
6   600  280 0.4666667

Plot the growth in types and the decline in TTR as the number of tokens increases.

The y-axis scales are ignored so that the curves overlap in a single plot.

> plot(oztt$token, oztt$type)
> par(new=TRUE)
> plot(oztt$token, oztt$ttr, col = "red")
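Overlaying with `par(new=...)` redraws the axes and labels of the second plot on top of the first. One way to keep the two scales readable is to suppress the second set of axes and draw a separate axis on the right. The sketch below uses a made-up word stream rather than the novel, and `plot_type_ttr` is a hypothetical helper name.

```r
# Toy data (hypothetical): a made-up word stream, not the novel
set.seed(1)
words <- sample(paste0("w", 1:300), 2000, replace = TRUE, prob = 1 / (1:300))
tt <- data.frame(token = seq_along(words), type = cumsum(!duplicated(words)))
tt$ttr <- tt$type / tt$token

plot_type_ttr <- function(df) {
  old.par <- par(mar = c(5, 4, 4, 4) + 0.1)  # leave room for a right-hand axis
  on.exit(par(old.par))
  plot(df$token, df$type, type = "l", xlab = "token", ylab = "type")
  par(new = TRUE)
  # Second curve on its own scale, without redrawing axes or labels
  plot(df$token, df$ttr, type = "l", col = "red",
       axes = FALSE, xlab = "", ylab = "")
  axis(side = 4, col.axis = "red")           # TTR scale on the right
  mtext("TTR", side = 4, line = 3, col = "red")
  invisible(NULL)
}

plot_type_ttr(tt)
```

The left axis then reads in types and the right axis in TTR, so the two curves still share one panel without their tick labels colliding.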

Add the Guiraud Index

> oztt$gi <- oztt$type / sqrt(oztt$token)
> par(new=TRUE)
> plot(oztt$token, oztt$gi, col = "blue")

Beyond roughly 4,000 words the curve is nearly flat and surprisingly stable.
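Written out, with V the number of types and N the number of tokens, the two measures computed above are as follows. The power-law line is only one possible reading of the plateau: it assumes type growth roughly follows Heaps' law with constants K and β, which is not estimated for this text here.

```latex
\mathrm{TTR} = \frac{V}{N}, \qquad G = \frac{V}{\sqrt{N}}

% Assumption (Heaps' law): V \approx K N^{\beta}, with 0 < \beta < 1
\mathrm{TTR} \approx K N^{\beta - 1}, \qquad G \approx K N^{\beta - 1/2}
```

Under that assumption TTR always falls as N grows, while G changes slowly whenever β is close to 1/2, which would be consistent with the flat stretch seen in the plot.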

Look at the first 1,000 words

ozttt <- matrix(0, nrow=1000, ncol=2)
# Row i holds the counts for the first i words
for (i in 1:1000) {
	tmp <- oz.words[1:i]
	ozttt[i,1] <- length(tmp)          # tokens
	ozttt[i,2] <- length(unique(tmp))  # types
}

ozttt <- as.data.frame(ozttt)
colnames(ozttt) <- c("token","type")

cor.test(ozttt$type, ozttt$token)

       Pearson's product-moment correlation

data:  ozttt$type and ozttt$token
t = 234.34, df = 998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9898566 0.9920780
sample estimates:
      cor 
0.9910356

lm(ozttt$type ~ ozttt$token)
Call:
lm(formula = ozttt$type ~ ozttt$token)

Coefficients:
(Intercept)  ozttt$token  
    43.8034       0.3761

#plot(ozttt$token, ozttt$type)
#abline(lm(ozttt$type ~ ozttt$token))

pred1000 <- predict(lm(ozttt$type ~ ozttt$token), interval = "prediction")

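    Convert the results stored in pred1000 to a data frame  
pred1000 <- as.data.frame(pred1000)

    Plot the data  
plot(ozttt$token, ozttt$type)

    Draw the fit (regression line) in black  
lines(ozttt$token, pred1000$fit, col = "black")

    Draw the upper bound of the prediction interval in red  
lines(ozttt$token, pred1000$upr, col = "red")

    Draw the lower bound of the prediction interval in blue  
lines(ozttt$token, pred1000$lwr, col = "blue")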
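A model written as `lm(ozttt$type ~ ozttt$token)` can only return predictions for the rows it was fitted on. If predictions at other token counts are wanted, one option is to fit with bare column names and `data =`, then pass `newdata` to `predict()`. The sketch below uses a made-up word stream, not the novel, so the numbers it prints are not comparable to the results above.

```r
# Toy data (hypothetical): made-up word stream standing in for the first 1,000 tokens
set.seed(1)
words <- sample(paste0("w", 1:500), 1000, replace = TRUE, prob = 1 / (1:500))
tt <- data.frame(token = seq_along(words), type = cumsum(!duplicated(words)))

# Bare column names plus data = let predict() accept new token counts
fit <- lm(type ~ token, data = tt)

# Prediction intervals at token counts outside the fitted range
new.tokens <- data.frame(token = c(1500, 2000))
pred <- predict(fit, newdata = new.tokens, interval = "prediction")
print(cbind(new.tokens, pred))
```

Because type growth slows as the text gets longer, a straight line fitted to the first 1,000 tokens will probably overestimate the number of types further out, so such extrapolations are best treated as a rough check rather than a forecast.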

Look at the first 2,000 words

Look at the first 10,000 words

oztt10t <- matrix(0, nrow=10000, ncol=2)
# Row i holds the counts for the first i words
for (i in 1:10000) {
	tmp <- oz.words[1:i]
	oztt10t[i,1] <- length(tmp)          # tokens
	oztt10t[i,2] <- length(unique(tmp))  # types
}

oztt10t <- as.data.frame(oztt10t)
colnames(oztt10t) <- c("token","type")

plot(oztt10t$token, oztt10t$type)

oztt10t$ttr <- oztt10t$type / oztt10t$token

par(new=TRUE)
plot(oztt10t$token, oztt10t$ttr, col = "red")

oztt10t$gi <- oztt10t$type / sqrt(oztt10t$token)
par(new=TRUE)
plot(oztt10t$token, oztt10t$gi, col = "blue")

https://sugiura-ken.org/wiki/
