Summary: Building a Word List

Required steps

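  1. Read the file
  2. Extract only the body text
  3. Delete the speaker labels
  4. Replace punctuation with spaces
  5. Remove extra spaces
  6. Convert everything to lowercase
  7. Split on spaces
  8. Convert the resulting list to a plain vector (unlist)
  9. Sort alphabetically
  10. Remove duplicates
  11. Count the items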

Example: NS501.txt

#getwd()                          # uncomment to check the current working directory
list.files()
setwd("../NICER1_3_2/NICER_NS")   # move to the folder that holds the NS files
list.files()                      # NS501.txt should now be listed
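tmp1 <- scan("NS501.txt", what="char", sep="\n")   # 1. read the file, one element per line
tmp2 <- grep("^\\*NS501:\t", tmp1, value=TRUE)     # 2. keep only the body lines
tmp3 <- gsub("^\\*NS501:\t", "", tmp2)             # 3. delete the speaker label
tmp4 <- gsub("[[:punct:]]", " ", tmp3)             # 4. replace punctuation with spaces
tmp5 <- gsub("  +", " ", tmp4)                     # 5. collapse runs of spaces
tmp6 <- tolower(tmp5)                              # 6. lowercase
tmp7 <- strsplit(tmp6, " ")                        # 7. split on spaces (returns a list)
tmp8 <- unlist(tmp7)                               # 8. flatten the list into a vector
token <- sort(tmp8)                                # 9. sort alphabetically
type <- unique(token)                              # 10. remove duplicates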
length(token)
## [1] 742
length(type)
## [1] 359
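
To try the same steps without the corpus file, here is a minimal sketch on a made-up three-line input. The speaker ID NS000 and the sentences are hypothetical, chosen only to mimic the "*ID:<tab>text" layout that the grep pattern above expects.

```r
# Hypothetical sample lines (not taken from NICER)
sample.lines <- c("@Begin",
                  "*NS000:\tThe cat sat.",
                  "*NS000:\tThe dog, too!")
s2 <- grep("^\\*NS000:\t", sample.lines, value = TRUE)  # body lines only
s3 <- gsub("^\\*NS000:\t", "", s2)                      # delete the speaker label
s4 <- gsub("[[:punct:]]", " ", s3)                      # punctuation -> space
s5 <- gsub("  +", " ", s4)                              # collapse runs of spaces
s6 <- tolower(s5)                                       # lowercase
s8 <- unlist(strsplit(s6, " "))                         # split and flatten
sample.token <- sort(s8)
sample.type <- unique(sample.token)
length(sample.token)   # 6 words in total
length(sample.type)    # 5 distinct words ("the" occurs twice)
```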

Letting the user pick the file: choose.files()

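choose.files() is available only on Windows; on macOS, use file.choose() instead. As a check, select NS501.txt in the window that opens and confirm that the results are the same as before.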
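
If the same script has to run on both Windows and macOS, one possible sketch is a small wrapper that picks the right dialog. The name pickFile() is hypothetical (it is not part of R); file.choose() alone would also work on every platform.

```r
# pickFile() is a hypothetical helper name, not a built-in R function
pickFile <- function(){
  if (.Platform$OS.type == "windows") {
    utils::choose.files(multi = FALSE)   # Windows-only dialog, one file at a time
  } else {
    file.choose()                        # available on every platform
  }
}
# Interactive usage:
# tmp1 <- scan(pickFile(), what = "char", sep = "\n")
```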

setwd("../NICER1_3_2/NICER_NS")
tmp1 <- scan(choose.files(), what="char", sep="\n")
tmp2 <- grep("^\\*\\w+:\t", tmp1, value=TRUE)   # \\w+ matches any speaker ID, not just NS501
tmp3 <- gsub("^\\*\\w+:\t", "", tmp2)           # delete whatever speaker label was matched
tmp4 <- gsub("[[:punct:]]", " ", tmp3)
tmp5 <- gsub("  +", " ", tmp4)
tmp6 <- tolower(tmp5)
tmp7 <- strsplit(tmp6, " ")
tmp8 <- unlist(tmp7)
token <- sort(tmp8)
type <- unique(token)
length(token)
## [1] 742
length(type)
## [1] 359
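
The only change from the first version, apart from the file dialog, is the pattern: \\w+ stands for one or more letters, digits, or underscores, so any speaker ID is accepted. A small sketch on hypothetical lines (the IDs and the "%com" tier are made up for illustration):

```r
# Hypothetical lines, not taken from NICER
x <- c("@Begin",
       "*NS501:\tFirst sentence.",
       "*JPN001:\tSecond sentence.",
       "%com:\tA comment line.")
is.body <- grepl("^\\*\\w+:\t", x)      # TRUE only for lines that start with "*ID:<tab>"
is.body
body.text <- gsub("^\\*\\w+:\t", "", x[is.body])
body.text
```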

Saving (exporting) the results: write(variable, "filename")

write(token, "token.txt")
write(type, "type.txt")
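
To check what write() produces, the following sketch writes a small made-up vector to a temporary file and reads it back. The file name and the vector demo.token are assumptions for the demo only; for character vectors, write() puts one element on each line by default.

```r
demo.token <- c("cat", "dog", "sat", "the", "the", "too")   # made-up tokens
out <- file.path(tempdir(), "token_demo.txt")               # temporary demo file
write(demo.token, out)        # one word per line
saved <- readLines(out)       # read the file back to inspect it
saved
```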

Building a word frequency table: table()

table(token)
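
table() lists the words in alphabetical order. As a sketch on made-up tokens (demo.token is an assumption, not corpus data), sort() can reorder the table so the most frequent word comes first:

```r
demo.token <- c("cat", "dog", "sat", "the", "the", "too")   # made-up tokens
freq <- table(demo.token)
freq[["the"]]                             # frequency of a single word
sorted <- sort(freq, decreasing = TRUE)   # most frequent word first
names(sorted)[1]
```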

Saving the word frequency table: write.table(variable, "filename")

write.table(table(token), "token_table.txt")
  • The resulting file can be opened in Excel (the delimiter is a space).

Saving with a tab ("\t") as the delimiter (separator)

  • Set it with the option sep="\t"
write.table(table(token), "token_table_tab.txt", sep="\t")
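
A variation that may be easier to open in a spreadsheet is sketched below: the table is first turned into a data frame with explicit column names, then written without quotes or row numbers. The column names word and freq, the demo vector, and the file name are all assumptions for illustration.

```r
demo.token <- c("cat", "dog", "sat", "the", "the", "too")   # made-up tokens
freq.df <- as.data.frame(table(demo.token), stringsAsFactors = FALSE)
names(freq.df) <- c("word", "freq")                         # hypothetical column names
out <- file.path(tempdir(), "token_table_demo.txt")         # temporary demo file
write.table(freq.df, out, sep = "\t", quote = FALSE, row.names = FALSE)
back <- read.delim(out)                                     # read the tab-separated file back
back$freq[back$word == "the"]
```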

Bundling a series of commands into a single command (function): function()

myNewCommand <- function(){
    # first command you want to run
    # second command you want to run
    # third command you want to run
    # ... and so on, in order
    }
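
As a minimal sketch of the template filled in, the hypothetical function below (the name and the numbers are made up) runs two commands in order and returns the value of its last line:

```r
# myFirstCommand() is a made-up example, not part of the word-list workflow
myFirstCommand <- function(){
  x <- c(3, 1, 2)   # first command: create some data
  x <- sort(x)      # second command: sort it
  x                 # the last expression is what the function returns
}
myFirstCommand()
```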

Reading a file and extracting only the body text

myData <- function(){
  lines.tmp <- scan(choose.files(), what="char", sep="\n")
  body.tmp <- grep("^\\*\\w+:\t", lines.tmp, value=TRUE)
  data.tmp <- gsub("^\\*\\w+:\t", "", body.tmp)
  # idiom: keep only the non-empty elements
  data.tmp <- data.tmp[data.tmp != ""]
  data.tmp   # return the result explicitly
}
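  • Running this definition creates a new command (function) called myData().

  • The whole sequence of steps written inside function() can now be run with just myData().

  • This command reads a file and extracts only the body text. Save its result in a variable.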
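
Because myData() opens a dialog, it cannot be run unattended. A possible variant, sketched here under the assumption that the file path is passed in as an argument, performs the same steps without the dialog. The name myDataFrom() and the demo file contents are hypothetical.

```r
# myDataFrom() is a hypothetical variant of myData() that takes a path
myDataFrom <- function(path){
  lines.tmp <- scan(path, what = "char", sep = "\n", quiet = TRUE)
  body.tmp <- grep("^\\*\\w+:\t", lines.tmp, value = TRUE)   # body lines only
  data.tmp <- gsub("^\\*\\w+:\t", "", body.tmp)              # delete speaker labels
  data.tmp[data.tmp != ""]                                   # keep non-empty elements
}

# Demo on a made-up file written to a temporary folder
demo.file <- file.path(tempdir(), "demo_chat.txt")
writeLines(c("@Begin", "*NS000:\tHello there.", "%com:\tnote"), demo.file)
demo.data <- myDataFrom(demo.file)
demo.data
```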

Example: read NS502.txt and save its body text in a variable named ns502.data

ns502.data <- myData()
  • When this runs, choose.files() opens a window in which you select the file. The result is stored in R as ns502.data.
  • To see the contents, type ns502.data and press Enter.
ns502.data
##  [1] "An Assumed Role"                                                                                                                                                                                                                                                                                                              
##  [2] "Considering the heightened role maintained by many in education, it's strangely rare to fin many willing to question this status quo."                                                                                                                                                                                        
##  [3] "Of course, many aware of the dynamic contrast between student and teacher are perfectly willing to perpetuate and even strengthen this relationship."                                                                                                                                                                         
##  [4] "It may seem entirely silly, even, to consider that anything needs to be changed."                                                                                                                                                                                                                                             
##  [5] "However, with growing competition in workplaces and with newer jobs being developed on a regular basis, it may be necessary to reexamine this two-dimensional hierarchy in order to better prepare students for the changing world."                                                                                          
##  [6] "For years, American schools have maintained a strict adherence to an invisible, ghostly code, one that encourages respect and honor in a manner reminiscent of both the power structure found in factories and the military."                                                                                                 
##  [7] "Of course, American schools don't echo this tendency nearly as intensely as those found in Japan, but the elementary through high school experience is one defined by an old-fashioned methodology."                                                                                                                          
##  [8] "Teachers are called by their last name, friendliness is discouraged, and contact, above all else, is highly feared."                                                                                                                                                                                                          
##  [9] "And yet, all of this changes once Americans enter universities, where the professor-student roles often evaporate, save for the minority of more conservative circumstances."                                                                                                                                                 
## [10] "Many professors are willing to acknowledge that their students have long since passed the line into adulthood and are willing to share their normal array of social tendencies with them."                                                                                                                                    
## [11] "It's not uncommon for professors to invite students out for drinks to discuss course material, or even to engage in openly accepted romantic relationships."                                                                                                                                                                  
## [12] "To many foreigners studying in America, this may seem wildly anarchistic and highly vulnerable to abuse."                                                                                                                                                                                                                     
## [13] "Whenever I tell stories of professor-student relationships to French students, especially, their first reaction is to assume that this means many professors must abuse this open attitude to gain \"special favors\" in return for favoritism."                                                                              
## [14] "While I can't, of course, claim that the chances of this occurring are nested at a safe zero percent, I can say that this attitude stems more from professors' willingness to see their students as intellectual equals."                                                                                                     
## [15] "So why, then, are we leaving our students so unprepared for such an atmosphere?"                                                                                                                                                                                                                                              
## [16] "Perhaps it's because grade school teachers are so far removed from the college experience that they've forgotten, or perhaps proceeded, the loose roles found in the college setting."                                                                                                                                        
## [17] "Speaking from experience, I can absolutely say that I'm far more eager to engage with material when the voice educating me treats me like a sentient, thinking individual, and not just a sieve or receptor ready to accept input but offer no active response."                                                              
## [18] "Students are not merely plants to be grown but people to be challenged and to challenge, and the first step to instituting this change in the educational environment is to stop the formalities and create teaching atmospheres in which students are welcome to interact with the teachers from a common, shared viewpoint."
## [19] "Now this is not to say that teachers need to abandon authority."                                                                                                                                                                                                                                                              
## [20] "Psychologists have speculated for years that the proper parenting method involves both active discussion and steady rules."                                                                                                                                                                                                   
## [21] "But rules must serve a distinct purpose, beyond merely establishing dominance or perpetuating the fallacy that rules, by their own virtue, are the end-all source of compliance."                                                                                                                                             
## [22] "Allow rules to be challenged, and even changed on occasion."                                                                                                                                                                                                                                                                  
## [23] "Let the wealth of perspectives a teacher has access to help establish the right course of action, not tradition or assumed patronage."                                                                                                                                                                                        
## [24] "Education should be defined by the fluidity of the human experience, not built to oppose it."                                                                                                                                                                                                                                 
## [25] "Supreme authority without equal exchange is as unnatural as it is futile, and both educators and the educated would benefit from avoiding it."                                                                                                                                                                                
## [26] "Instead, we should yearn to construct a bridge between student and teacher, create closer, more equal power structures to create wholehearted inspiration and higher rates of success."

♪ Exercises

1: Write a function that builds a word frequency table.

2: Create 10 word frequency tables each for the learner data and the native-speaker data.


Computing language indices from Type and Token

  1. Text length: the total number of words in the text (the number of word occurrences = Token)
  2. Lexical diversity index: TTR (Type/Token Ratio)
length(type)/length(token)
## [1] 0.4838275

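type and token are both produced along the way when building the word frequency table.

  • Building on that, computing length(type)/length(token) gives the TTR.

  • Suppose a variable holds only the body text (for example, ns502.data); the goal is a function that returns the TTR when run on that variable.

  • A function of this kind takes an argument (its "object").

  • Whatever is written inside the () is what gets processed. For example, name the argument d and have the function body process d.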

myTTR <- function(d){
  tmp4 <- gsub("[[:punct:]]", " ", d)
  tmp5 <- gsub("  +", " ", tmp4)
  tmp6 <- tolower(tmp5)
  tmp7 <- strsplit(tmp6, " ")
  tmp8 <- unlist(tmp7)
  token <- sort(tmp8)
  type <- unique(token)
  length(type)/length(token)
}
myTTR(ns502.data)
## [1] 0.529321
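
One caveat worth checking on your own data: when a line begins with a punctuation mark, the replacement leaves a leading space, and strsplit() then yields an empty string "" that is counted as a token. The sketch below shows the effect on a made-up sentence and one possible way to drop such empty items. The helper name myTokens() is hypothetical, and using it would change the counts reported above.

```r
line <- "\"Hello,\" she said."   # made-up line that starts with a quotation mark

# Same steps as in myTTR(): the opening quote turns into a leading space
old <- unlist(strsplit(gsub("  +", " ", gsub("[[:punct:]]", " ", tolower(line))), " "))
length(old)      # the first element is the empty string ""

# Hypothetical alternative: split on runs of spaces and drop empty strings
myTokens <- function(d){
  x <- tolower(gsub("[[:punct:]]", " ", d))
  x <- unlist(strsplit(x, " +"))
  x[x != ""]
}
myTokens(line)
```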

Printing Type, Token, and TTR side by side: cat()

myTTR2 <- function(d){
  tmp4 <- gsub("[[:punct:]]", " ", d)
  tmp5 <- gsub("  +", " ", tmp4)
  tmp6 <- tolower(tmp5)
  tmp7 <- strsplit(tmp6, " ")
  tmp8 <- unlist(tmp7)
  token <- sort(tmp8)
  type <- unique(token)
  ttr <- length(type)/length(token)
  cat(length(type), length(token), ttr)
}
myTTR2(ns502.data)
## 343 648 0.529321
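
cat() only prints; it returns NULL, so the three numbers cannot be stored in a variable. When results from several files need to be collected, one possible sketch is to return a named vector instead. The name myTTR3() and the two toy texts are assumptions for illustration.

```r
# myTTR3() is a hypothetical variant that returns the values instead of printing them
myTTR3 <- function(d){
  x <- gsub("[[:punct:]]", " ", d)
  x <- gsub("  +", " ", x)
  x <- unlist(strsplit(tolower(x), " "))
  token <- sort(x)
  type <- unique(token)
  c(type = length(type), token = length(token), ttr = length(type)/length(token))
}

# Two made-up "texts"; real use would pass the output of myData()
texts <- list(a = c("The cat sat.", "The dog, too!"), b = "A a a b.")
res <- sapply(texts, myTTR3)   # one column per text, rows type / token / ttr
res
```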

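♪ Exercises

1. For 10 learner files and 10 native-speaker files, compute the total word count and the TTR.

2. Examine the relationship between total word count and TTR.

3. Write your own function that builds a word frequency table for English text in general.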