
keyness


  • Given a corpus made up of multiple documents,
  • explore whether a particular document in it uses language that stands out from the rest.
  • Applied to two groups of documents, it reveals how the two groups differ.
    • target and reference group

  • A signed 2x2 association score

quanteda::textstat_keyness(document-feature matrix, target)

  • target specifies the document(s) of interest.
    • A number selects that element of the document-feature matrix.
    • Alternatively, attach attribute information to the documents in the matrix and specify a group by attribute.
      • For example, label documents JAN or NTV, specify JAN as the target, and compare it against the remaining NTV documents as the reference.
  • The measure = option selects the statistic ("signed" means a plus/minus sign is attached).
    • chi2: chi-squared
    • exact: Fisher's exact test
    • lr: likelihood ratio (G2)
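Under the hood, each of these statistics is computed from a 2x2 contingency table crossing (this word vs. all other words) with (target vs. reference). A minimal base-R sketch of the signed chi-squared and log-likelihood (G2) scores; the counts a, b, nT, nR are made up purely for illustration:

```r
# 2x2 table for one word:
#            target    reference
# word         a           b
# other      nT - a      nR - b
a  <- 5;  nT <- 500    # word count / total tokens in target (hypothetical)
b  <- 3;  nR <- 2000   # word count / total tokens in reference (hypothetical)

m <- matrix(c(a, nT - a, b, nR - b), nrow = 2)

# expected target count under independence; its sign gives over-/underuse
exp_a <- nT * (a + b) / (nT + nR)

chi2        <- unname(chisq.test(m, correct = FALSE)$statistic)
signed_chi2 <- chi2 * sign(a - exp_a)   # positive = overused in the target

# log-likelihood ratio (G2), with the same sign convention
obs       <- c(m)
expected  <- c(chisq.test(m, correct = FALSE)$expected)
g2        <- 2 * sum(obs * log(obs / expected), na.rm = TRUE)
signed_g2 <- g2 * sign(a - exp_a)
```

Here the word is relatively more frequent in the target (5/500 vs. 3/2000), so both signed scores come out positive.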

nicestJAN.1st <- textstat_keyness(nicestJAN.dfm, 1)
nicestJAN.1st
          feature          chi2            p n_target n_reference
1      specialized  3.295107e+01 9.450749e-09        5           0
2        knowledge  1.741756e+01 3.000417e-05        5           3
3            bload  1.656914e+01 4.690807e-05        3           0
4       knowledges  8.652160e+00 3.266739e-03        2           0
5          subject  8.652160e+00 3.266739e-03        2           0
6          walking  8.652160e+00 3.266739e-03        2           0
7       dictionary  8.652160e+00 3.266739e-03        2           0
8     dictionaries  8.652160e+00 3.266739e-03        2           0
9           become  8.652160e+00 3.266739e-03        2           0
10             get  8.090752e+00 4.449170e-03        3           2
11             you  5.415006e+00 1.996438e-02       11          43
12            such  4.849584e+00 2.765279e-02        2           1
13        specific  4.849584e+00 2.765279e-02        2           1
14            does  4.849584e+00 2.765279e-02        2           1
15       something  4.849584e+00 2.765279e-02        2           1
16          better  3.007704e+00 8.286961e-02        2           2
17              to  2.199027e+00 1.380979e-01       11          57
18             one  1.950172e+00 1.625683e-01        2           3


textplot_keyness(nicestJAN.1st)

 Example keyness analysis: using sample data from the NICEST learner corpus

Required packages

install.packages("quanteda", dependencies=T)
library(quanteda)
install.packages("readtext", dependencies=T)
library(readtext)

Data layout

  • Two subdirectories inside a directory named "NICEST sample files":
    • NTV sample 10
    • JPN sample 10
  • Each contains 10 sample essays (body text only).

Reading the data with readtext()

  • Set the working directory to "NICEST sample files" with setwd().
    • From there, read the files in the two subdirectories with readtext().
  • First, the native-speaker data:
setwd("C:/Users/sugiura/.../NICEST sample files")
# Read the text files. First, the L1 native-speaker data
nicestNTV.tmp <- readtext("NTV sample 10/*.txt")
head(nicestNTV.tmp)

## readtext object consisting of 6 documents and 0 docvars.
## # Description: df[,2] [6 x 2]
##   doc_id           text               
## * <chr>            <chr>              
## 1 ENG0002_1P1A.txt "\"There is a\"..."
## 2 ENG0002_2P5A.txt "\"By the yea\"..."
## 3 ENG0002_3P2A.txt "\"The questi\"..."
## 4 ENG0002_4P7A.txt "\"The questi\"..."
## 5 ENG0002_5P6A.txt "\"In the las\"..."
## 6 ENG0002_6P3A.txt "\"The questi\"..."

Build a corpus object with corpus()

nicestNTV.corpus <- corpus(nicestNTV.tmp)
str(nicestNTV.corpus)

## List of 4
##  $ documents:'data.frame':   10 obs. of  1 variable:
##   ..$ texts: chr [1:10] "There is a well known phrase, frequently employed in denigration rather than compliment, which describes a pers"| __truncated__ "By the year 2026, will there be fewer cars in use than there are in 2016?\nThis is a difficult question to answ"| __truncated__ "The question of whether young people enjoy life more than older people do is one fraught with potentially insur"| __truncated__ "The question of whether it is concepts or facts that are more important foci for students is a hot topic for ma"| __truncated__ ...
##  $ metadata :List of 2
##   ..$ source : chr "C:/Users/sugiura/Documents/* on x86-64 by sugiura"
##   ..$ created: chr "Sun Dec 01 12:04:54 2019"
##  $ settings :List of 12
##   ..$ stopwords          : NULL
##   ..$ collocations       : NULL
##   ..$ dictionary         : NULL
##   ..$ valuetype          : chr "glob"
##   ..$ stem               : logi FALSE
##   ..$ delimiter_word     : chr " "
##   ..$ delimiter_sentence : chr ".!?"
##   ..$ delimiter_paragraph: chr "\n\n"
##   ..$ clean_tolower      : logi TRUE
##   ..$ clean_remove_digits: logi TRUE
##   ..$ clean_remove_punct : logi TRUE
##   ..$ units              : chr "documents"
##   ..- attr(*, "class")= chr [1:2] "settings" "list"
##  $ tokens   : NULL

  • Check an overview of the corpus with summary()
summary(nicestNTV.corpus)

## Corpus consisting of 10 documents:
## 
##              Text Types Tokens Sentences
##  ENG0002_1P1A.txt   290    717        30
##  ENG0002_2P5A.txt   361    899        28
##  ENG0002_3P2A.txt   388    987        33
##  ENG0002_4P7A.txt   438   1047        38
##  ENG0002_5P6A.txt   415    969        35
##  ENG0002_6P3A.txt   257    697        23
##  ENG0002_7P4A.txt   320    851        55
##  ENG0002_8P8A.txt   260    612        30
##  ENG0003_1P8B.txt   185    456        24
##  ENG0003_2P4B.txt   223    531        22
## 
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:

Add language information as a document attribute with docvars()

docvars(corpus, attribute) <- value
Check the contents with summary(corpus)

  • L1 for the native-speaker data, L2 for the learner data
docvars(nicestNTV.corpus, "lang") <- "L1"

summary(nicestNTV.corpus)

## Corpus consisting of 10 documents:
## 
##              Text Types Tokens Sentences lang
##  ENG0002_1P1A.txt   290    717        30   L1
##  ENG0002_2P5A.txt   361    899        28   L1
##  ENG0002_3P2A.txt   388    987        33   L1
##  ENG0002_4P7A.txt   438   1047        38   L1
##  ENG0002_5P6A.txt   415    969        35   L1
##  ENG0002_6P3A.txt   257    697        23   L1
##  ENG0002_7P4A.txt   320    851        55   L1
##  ENG0002_8P8A.txt   260    612        30   L1
##  ENG0003_1P8B.txt   185    456        24   L1
##  ENG0003_2P4B.txt   223    531        22   L1
## 
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:

Similarly, read in the text files of the L2 learner data.

setwd("C:/Users/sugiura/.../NICEST sample files")
nicestJPN.tmp <- readtext("JPN sample 10/*.txt")
head(nicestJPN.tmp)

## readtext object consisting of 6 documents and 0 docvars.
## # Description: df[,2] [6 x 2]
##   doc_id          text               
## * <chr>           <chr>              
## 1 JAN0001_P1B.txt "\"Some peopl\"..."
## 2 JAN0001_P2B.txt "\"You may th\"..."
## 3 JAN0001_P3B.txt "\"Compared w\"..."
## 4 JAN0001_P4B.txt "\"You may ha\"..."
## 5 JAN0001_P5B.txt "\"Elderly pe\"..."
## 6 JAN0001_P6B.txt "\"Group tour\"..."

Build a corpus object

nicestJPN.corpus <- corpus(nicestJPN.tmp)
str(nicestJPN.corpus)

## List of 4
##  $ documents:'data.frame':   10 obs. of  1 variable:
##   ..$ texts: chr [1:10] "Some people say that specialized knowledge is not important for human, however, who make todays life such a con"| __truncated__ "You may think that young people are active and free, on the other hand olders are less active and they have muc"| __truncated__ "Compared with past, young people nowadays do not give enough time to helping their communities.\nI guess there "| __truncated__ "You may have experiences like this, feel nice at products in some advertisement but you buy and see it, you dis"| __truncated__ ...
##  $ metadata :List of 2
##   ..$ source : chr "C:/Users/sugiura/Documents/* on x86-64 by sugiura"
##   ..$ created: chr "Sun Dec 01 12:04:54 2019"
##  $ settings :List of 12
##   ..$ stopwords          : NULL
##   ..$ collocations       : NULL
##   ..$ dictionary         : NULL
##   ..$ valuetype          : chr "glob"
##   ..$ stem               : logi FALSE
##   ..$ delimiter_word     : chr " "
##   ..$ delimiter_sentence : chr ".!?"
##   ..$ delimiter_paragraph: chr "\n\n"
##   ..$ clean_tolower      : logi TRUE
##   ..$ clean_remove_digits: logi TRUE
##   ..$ clean_remove_punct : logi TRUE
##   ..$ units              : chr "documents"
##   ..- attr(*, "class")= chr [1:2] "settings" "list"
##  $ tokens   : NULL

summary(nicestJPN.corpus)

## Corpus consisting of 10 documents:
## 
##             Text Types Tokens Sentences
##  JAN0001_P1B.txt   116    214        12
##  JAN0001_P2B.txt   138    268        17
##  JAN0001_P3B.txt    97    169        11
##  JAN0001_P4B.txt    68     99         8
##  JAN0001_P5B.txt   120    262        16
##  JAN0001_P6B.txt   114    224        13
##  JAN0001_P7B.txt   121    268        18
##  JAN0001_P8B.txt    71    108         8
##  JAN0002_P1A.txt    98    170        15
##  JAN0002_P2A.txt   117    216        19
## 
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:

Add language information as a document attribute

docvars(nicestJPN.corpus, "lang") <- "L2"
summary(nicestJPN.corpus)

## Corpus consisting of 10 documents:
## 
##             Text Types Tokens Sentences lang
##  JAN0001_P1B.txt   116    214        12   L2
##  JAN0001_P2B.txt   138    268        17   L2
##  JAN0001_P3B.txt    97    169        11   L2
##  JAN0001_P4B.txt    68     99         8   L2
##  JAN0001_P5B.txt   120    262        16   L2
##  JAN0001_P6B.txt   114    224        13   L2
##  JAN0001_P7B.txt   121    268        18   L2
##  JAN0001_P8B.txt    71    108         8   L2
##  JAN0002_P1A.txt    98    170        15   L2
##  JAN0002_P2A.txt   117    216        19   L2
## 
## Source: C:/Users/sugiura/Documents/* on x86-64 by sugiura
## Created: Sun Dec 01 12:04:54 2019
## Notes:

Combining the two corpora: combined corpus AB <- corpus A + corpus B

nicest20.corpus <- nicestJPN.corpus + nicestNTV.corpus
summary(nicest20.corpus)

## Corpus consisting of 20 documents:
## 
##              Text Types Tokens Sentences lang
##   JAN0001_P1B.txt   116    214        12   L2
##   JAN0001_P2B.txt   138    268        17   L2
##   JAN0001_P3B.txt    97    169        11   L2
##   JAN0001_P4B.txt    68     99         8   L2
##   JAN0001_P5B.txt   120    262        16   L2
##   JAN0001_P6B.txt   114    224        13   L2
##   JAN0001_P7B.txt   121    268        18   L2
##   JAN0001_P8B.txt    71    108         8   L2
##   JAN0002_P1A.txt    98    170        15   L2
##   JAN0002_P2A.txt   117    216        19   L2
##  ENG0002_1P1A.txt   290    717        30   L1
##  ENG0002_2P5A.txt   361    899        28   L1
##  ENG0002_3P2A.txt   388    987        33   L1
##  ENG0002_4P7A.txt   438   1047        38   L1
##  ENG0002_5P6A.txt   415    969        35   L1
##  ENG0002_6P3A.txt   257    697        23   L1
##  ENG0002_7P4A.txt   320    851        55   L1
##  ENG0002_8P8A.txt   260    612        30   L1
##  ENG0003_1P8B.txt   185    456        24   L1
##  ENG0003_2P4B.txt   223    531        22   L1
## 
## Source: Combination of corpuses nicestJPN.corpus and nicestNTV.corpus
## Created: Sun Dec 01 12:04:55 2019
## Notes:

Create a document-feature matrix and remove punctuation (lower-casing is automatic): dfm(corpus, options)

nicest20.dfm <- dfm(nicest20.corpus, remove_punct=T)
str(nicest20.dfm)

## Formal class 'dfm' [package "quanteda"] with 15 slots
##   ..@ settings    : list()
##   ..@ weightTf    :List of 3
##   .. ..$ scheme: chr "count"
##   .. ..$ base  : NULL
##   .. ..$ K     : NULL
##   ..@ weightDf    :List of 5
##   .. ..$ scheme   : chr "unary"
##   .. ..$ base     : NULL
##   .. ..$ c        : NULL
##   .. ..$ smoothing: NULL
##   .. ..$ threshold: NULL
##   ..@ smooth      : num 0
##   ..@ ngrams      : int 1
##   ..@ skip        : int 0
##   ..@ concatenator: chr "_"
##   ..@ version     : int [1:3] 1 5 2
##   ..@ docvars     :'data.frame': 20 obs. of  1 variable:
##   .. ..$ lang: chr [1:20] "L2" "L2" "L2" "L2" ...
##   ..@ i           : int [1:3859] 0 2 3 4 5 6 7 9 12 13 ...
##   ..@ p           : int [1:1870] 0 13 28 35 53 54 60 79 97 101 ...
##   ..@ Dim         : int [1:2] 20 1869
##   ..@ Dimnames    :List of 2
##   .. ..$ docs    : chr [1:20] "JAN0001_P1B.txt" "JAN0001_P2B.txt" "JAN0001_P3B.txt" "JAN0001_P4B.txt" ...
##   .. ..$ features: chr [1:1869] "some" "people" "say" "that" ...
##   ..@ x           : num [1:3859] 2 1 1 2 1 3 1 1 1 1 ...
##   ..@ factors     : list()
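A version note, as an assumption about the reader's setup: in quanteda v3 and later, calling dfm() directly on a corpus is deprecated, and textstat_keyness()/textplot_keyness() have moved to the companion packages quanteda.textstats and quanteda.textplots. Under those versions, an equivalent pipeline would look roughly like:

```r
library(quanteda)               # tokens(), dfm()
# library(quanteda.textstats)   # textstat_keyness() in quanteda >= 3
# library(quanteda.textplots)   # textplot_keyness() in quanteda >= 3

# tokenize first (removing punctuation), then build the dfm
nicest20.dfm <- dfm(tokens(nicest20.corpus, remove_punct = TRUE))
```

The output printed above was produced with an older quanteda (note the S4 "dfm" structure), so the original one-step dfm() call works there as shown.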

Grouping documents by the language attribute with docvars()

docvars(corpus, attribute) == value

  • In keyness terms, documents whose lang attribute is L2 are the "target", and the rest, i.e. L1, are the "reference".
jpn.group <- (docvars(nicest20.corpus, "lang") == "L2")
jpn.group

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Computing keyness (key features) with textstat_keyness()

textstat_keyness(document-feature matrix, target information)
Output columns: feature, chi-squared value, p-value, frequency in the target, frequency in the reference
  • Results are sorted by the signed (plus/minus) chi-squared value, highest first.
    • head(keyness result, 20) shows the top 20 words on the target side.
    • tail(keyness result, 20) shows the "top" 20 words on the reference side.
nicest20.keys <- textstat_keyness(nicest20.dfm, jpn.group)
head(nicest20.keys, 20)

##        feature     chi2            p n_target n_reference
## 1            i 46.66002 8.443468e-12       27          16
## 2        think 46.65550 8.462897e-12       17           3
## 3          not 26.67282 2.409881e-07       39          53
## 4      reasons 25.22333 5.106078e-07        9           1
## 5       reason 25.22333 5.106078e-07        9           1
## 6        young 23.12722 1.516293e-06       18          15
## 7       person 22.25553 2.386705e-06       11           4
## 8          old 21.48057 3.574323e-06        9           2
## 9         know 21.34540 3.835402e-06       13           7
## 10     ginious 18.42598 1.766336e-05        6           0
## 11         you 17.37500 3.068359e-05       54         105
## 12        some 16.54467 4.751730e-05       12           8
## 13      people 16.47472 4.930307e-05       30          46
## 14        want 15.91245 6.634097e-05       13          10
## 15 specialized 14.64053 1.300867e-04        5           0
## 16         sum 14.64053 1.300867e-04        5           0
## 17          do 13.11234 2.933565e-04       21          30
## 18          so 12.37194 4.358338e-04       14          16
## 19      future 11.97323 5.397032e-04        7           3
## 20         too 11.97323 5.397032e-04        7           3

tail(nicest20.keys, 20)

##        feature       chi2            p n_target n_reference
## 1850      take  -2.000469 1.572505e-01        0          12
## 1851        be  -2.050220 1.521842e-01       10          62
## 1852      also  -2.127000 1.447238e-01        1          19
## 1853      fact  -2.252301 1.334159e-01        0          13
## 1854 education  -2.252301 1.334159e-01        0          13
## 1855        we  -2.347080 1.255179e-01        5          39
## 1856     would  -2.505589 1.134431e-01        0          14
## 1857    within  -2.505589 1.134431e-01        0          14
## 1858       how  -2.505589 1.134431e-01        0          14
## 1859        of  -2.921923 8.738367e-02       33         174
## 1860       may  -3.418485 6.447015e-02        4          39
## 1861  question  -3.528659 6.031656e-02        0          18
## 1862    better  -3.629432 5.676618e-02        4          40
## 1863        on  -3.708875 5.412387e-02        5          46
## 1864     based  -3.786172 5.167772e-02        0          19
## 1865   whether  -5.080234 2.420011e-02        0          24
## 1866        an  -6.806180 9.084289e-03        0          26
## 1867         a  -8.911901 2.833182e-03       14         121
## 1868       and -15.267883 9.328931e-05       29         233
## 1869       the -23.815514 1.060245e-06       32         291

Plot the results

textplot_keyness(nicest20.keys)