1 Review

1.1 Reading files

1.1.1 Preparation: where are you now?

  • Menu: Session > Set Working Directory > Choose Directory, then open the target folder
  • Knit ▼ > Knit Directory > Current Working Directory

1.1.2 Checking the location: getwd()

getwd()

[1] "C:/(omitted)/NICER1_3_2/2020-11-24NICER1_3_2/NICER_NNS"

1.1.3 Listing files: list.files()

  • Tip: a chunk option that keeps the output out of the rendered R Markdown document
{r, results = 'hide'}
list.files()

1.1.4 Reading a file: readLines()

  • A file name is a string, so enclose it in double quotes
  • Store what you read in a variable
    • Example: read the file JPN501.txt and store it in a variable named jpn501
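
The two points above can be tried without the corpus at hand. The following is a minimal sketch, assuming a small stand-in file written to a temporary folder; its contents are invented for illustration and are not the real JPN501.txt.

```r
# Hypothetical stand-in for JPN501.txt, created in a temporary folder
path <- file.path(tempdir(), "JPN501.txt")
writeLines(c("@Begin",
             "*JPN501:\tWhat kind of sports do you like?",
             "@End"), path)

# The file name is a string, so it goes inside double quotes (here it is
# already held in the string variable path); the result is stored in jpn501
jpn501 <- readLines(path)
length(jpn501)   # one element per line of the file
jpn501[2]        # the second line
```

With the real data, the call would simply be readLines("JPN501.txt") once the working directory is set to the folder that holds the file.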

1.2 Exercise

  • Read in about 10 files each of learner data and native-speaker data, then search for an expression and compare the results.
jpn511 <- readLines("JPN511.txt")
## Warning in readLines("JPN511.txt"): incomplete final line found on 'JPN511.txt'
  • To suppress this warning, add the option warn=F (short for warn=FALSE).
jpn511 <- readLines("JPN511.txt", warn=F)
  • Alternatively, add warning=F to the code chunk header.
{r, warning=F}
jpn501 <- readLines("JPN501.txt")
jpn502 <- readLines("JPN502.txt")
jpn503 <- readLines("JPN503.txt")
jpn504 <- readLines("JPN504.txt")
jpn505 <- readLines("JPN505.txt")
jpn506 <- readLines("JPN506.txt")
jpn507 <- readLines("JPN507.txt")
jpn508 <- readLines("JPN508.txt")
jpn509 <- readLines("JPN509.txt")
jpn510 <- readLines("JPN510.txt")
setwd("../NICER_NS")

ns501 <- readLines("NS501.txt")
ns502 <- readLines("NS502.txt")
ns503 <- readLines("NS503.txt")
ns504 <- readLines("NS504.txt")
ns505 <- readLines("NS505.txt")
ns506 <- readLines("NS506.txt")
ns507 <- readLines("NS507.txt")
ns508 <- readLines("NS508.txt")
ns509 <- readLines("NS509.txt")
ns510 <- readLines("NS510.txt")

1.2.1 Checking that it worked: ls()

ls()
##  [1] "jpn501" "jpn502" "jpn503" "jpn504" "jpn505" "jpn506" "jpn507" "jpn508"
##  [9] "jpn509" "jpn510" "jpn511" "ns501"  "ns502"  "ns503"  "ns504"  "ns505" 
## [17] "ns506"  "ns507"  "ns508"  "ns509"  "ns510"

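1.3 Searching for a string: grep("search string", variable)

1.4 Using regular expressions with grep

  • Note that in R, a regular-expression metacharacter such as "*" is escaped with a doubled backslash, as in "\\*"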

    https://stats.biopapyrus.jp/r/devel/regex.html

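  • [ ] matches any one of the characters inside the brackets.

    • [Hh]owever matches However or however
  • value=T: return the matching elements themselves instead of their element numbers

  • ignore.case=T: do not distinguish upper and lower case

    • Note that this also matches forms such as HOWever.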
grep("[hH]owever", ns501, value=T)
## [1] "*NS501:\tHowever in the French educational system instead of a head or a body there is a thesis and an anti-thesis or point and counter point in which the writer must oppose his or her original statements."
## [2] "*NS501:\tThis makes the facts easy to access, however, it does not force the writer to challenge his or her own logic in the process, leaving the ideas themselves rigid."                                    
## [3] "*NS501:\tHowever what the French lose in logical flow they gain in critical thinking."                                                                                                                        
## [4] "*NS501:\tHowever, sadly with the continuous failings of the American educational system, these lofty dreams yet remain dreams for a generation of potential Newtons and Einsteins."
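
The escaping rule and the two options can be checked on a small vector. This is a sketch only; the three strings in x are invented for illustration and do not come from the corpus.

```r
# Hypothetical sample lines (not corpus data)
x <- c("*NS501:\tHowever, it works.",
       "%COM:\tYou can change \"but\" to \"however\".",
       "*NS502:\tHOWever is an odd spelling.")

grep("\\*NS", x)                                      # element numbers of lines containing a literal *NS
grep("[hH]owever", x, value = TRUE)                   # the matching lines themselves
grep("however", x, ignore.case = TRUE, value = TRUE)  # also picks up HOWever
```

Without the doubled backslash, "*" would be read as the repetition operator rather than as a literal asterisk.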

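1.5 Remaining problems

  1. Searching the data sets one at a time is tedious
  2. We want to narrow the results down to only the lines we need
  3. Reading the files in one at a time is tedious
  4. Counting the hits by eye, one at a time, is laborious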

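2 Searching a large number of files for only the lines we need and making a list

2.1 Method 1: Combine several data sets into one and search

2.1.1 Combining the data that was read in

  • Use c() to join the individual data sets into a single vector whose elements are their lines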
jpn.10 <- c(jpn501, jpn502, jpn503, jpn504, jpn505, jpn506, jpn507, jpn508, jpn509, jpn510)
length(jpn.10)
## [1] 993
ns.10 <- c(ns501, ns502, ns503, ns504, ns505, ns506, ns507, ns508, ns509, ns510)
length(ns.10)
## [1] 945

2.1.2 Searching the combined data

grep("[hH]owever", ns.10, value=T)
##  [1] "*NS501:\tHowever in the French educational system instead of a head or a body there is a thesis and an anti-thesis or point and counter point in which the writer must oppose his or her original statements."                                                                                                                
##  [2] "*NS501:\tThis makes the facts easy to access, however, it does not force the writer to challenge his or her own logic in the process, leaving the ideas themselves rigid."                                                                                                                                                    
##  [3] "*NS501:\tHowever what the French lose in logical flow they gain in critical thinking."                                                                                                                                                                                                                                        
##  [4] "*NS501:\tHowever, sadly with the continuous failings of the American educational system, these lofty dreams yet remain dreams for a generation of potential Newtons and Einsteins."                                                                                                                                           
##  [5] "*NS502:\tHowever, with growing competition in workplaces and with newer jobs being developed on a regular basis, it may be necessary to reexamine this two-dimensional hierarchy in order to better prepare students for the changing world."                                                                                 
##  [6] "*NS503:\tHowever, I worry that in today's increasingly global society, in which scientific developments are often explicitly prioritized over humanities-based education and research around the world, our global society is perhaps sacrificing crucial analysis of the potential consequences of such scientific research."
##  [7] "*NS503:\tHumanities-based education and analysis, however, has the potential to challenge such ideology and, thereby, transform contemporary global society for the better."                                                                                                                                                  
##  [8] "*NS504:\tHowever, both systems are not completely different, as they both take into account the importance of academic achievement and also the base of the curriculum, albeit having its differences, remains based on a language, social science, natural science and mathematics core."                                    
##  [9] "*NS505:\tHowever Australians have sought to distinguish themselves from the Brits by assuming the role of the scrapper, the underdog."                                                                                                                                                                                        
## [10] "*NS505:\tHowever, there are also some negative aspects to Australia's sporting identity."                                                                                                                                                                                                                                     
## [11] "*NS505:\tAustralia presents itself to the world as a sporting nation, however I challenge the validity of this representation."                                                                                                                                                                                               
## [12] "*NS507:\tHowever, the situation in the United States is much different with most children beginning their first foreign language classes only in high school, if at all."                                                                                                                                                     
## [13] "*NS507:\tHowever, a similar attitude is displayed when an American finds themselves abroad: Why can't they just speak English?"                                                                                                                                                                                               
## [14] "*NS508:\tHowever, as English began to increase in popularity worldwide its influence also took hold of Scotland."                                                                                                                                                                                                             
## [15] "*NS508:\tHowever, even thought the education method is quite successful the lack of interest and importance on Gaelic means the number of students attending these schools are limited and Gaelic is not being used in the world outside the classroom."                                                                      
## [16] "*NS508:\tHowever, that is most certainly easier said than done."                                                                                                                                                                                                                                                              
## [17] "*NS509:\tHowever, today paper money is made of the same materials and the only thing that distinguishes one bill from another is the digit printed on each one."                                                                                                                                                              
## [18] "*NS509:\tHowever, I think that people need to reassess what is important to them at what is valuable."                                                                                                                                                                                                                        
## [19] "*NS509:\tHowever, when it comes down to it, it is all just paper."                                                                                                                                                                                                                                                            
## [20] "*NS509:\tHowever, the value of material objects is completely up to us as individuals."                                                                                                                                                                                                                                       
## [21] "*NS510:\tHowever, there are those that claim that any opposition towards these actions by the Australian Federal Government are in fact based on an underlying racial issue rather than an issue of economical practicality or fairness"
  • 21 examples were found in the 10 files.
grep("[hH]owever", jpn.10, value=T)
##  [1] "%NTV:\tHowever, many budo have no teammate; instead, you must play against yourself."                        
##  [2] "%COM:\tAvoid starting sentences with coordinating conjunctions. You can change \"but\" to \"however\"."      
##  [3] "%NTV:\tHowever, because of rei, budo provides many additional good points, such as mental strength."         
##  [4] "*JPN502:\tHowever, we cannot study in advance because of the less time."                                     
##  [5] "%NTV:\tHowever, we could not study enough because we had less time. "                                        
##  [6] "%NTV:\tHowever, I think we should not view it as an entirely useless and incorrect policy."                  
##  [7] "*JPN506:\tHowever, I have heard one family story ever before."                                               
##  [8] "*JPN507:\tHowever, L make think to educational systems."                                                     
##  [9] "*JPN507:\tHowever, we want more high quality working in one area, it need longer time to enhance that skill."
## [10] "%NTV:\tHowever, there are sports where the player's genetics is non-relative."                               
## [11] "%COM:\tDon't start sentences with coordinating conjunctions. \"But\" can often be replaced with \"however\"."
## [12] "%NTV:\tHowever, when they enter university or get a job, the situation changes."
  • 12 examples were found in the 10 files, but they include irrelevant lines (%NTV, %COM).

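2.2 Narrowing down to only the lines we need: specifying a pattern with a regular expression

  • The only lines we need are those that begin with *JPN. => Specify the pattern with a regular expression.
  • Work out the string pattern we need.
    • The part we need is the pattern that the target lines have in common
    • Lines that begin with "*JPN" and contain "However" or "however"
      • "Contains However or however" has already been covered: "[hH]owever"
      • How do we express "begins with *JPN"?
      • What do we do about the rest of the line?
Symbol  Meaning                                             Example
^       Start of the line                                   ^\\*JPN
.       Any single character                                .
*       Zero or more repetitions of the preceding element   .*
+       One or more repetitions of the preceding element    .+
\b      Word boundary                                       \\bhow\\b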
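
The symbols in the table can be tried one by one with grepl(), which returns TRUE or FALSE for each element. This is a minimal sketch; the three strings are invented for illustration.

```r
# Hypothetical sample lines (not corpus data)
x <- c("*JPN502:\tHowever, we cannot.",
       "%NTV:\tThis is, however, fine.",
       "*JPN503:\tI know how to show it.")

grepl("^\\*JPN", x)              # TRUE FALSE TRUE: the line starts with *JPN
grepl("^\\*JPN.+[hH]owever", x)  # TRUE FALSE FALSE: starts with *JPN, then However/however later on
grepl("how", x)                  # FALSE TRUE TRUE: "how" anywhere, including inside "however" and "show"
grepl("\\bhow\\b", x)            # FALSE FALSE TRUE: "how" only as a separate word
```

Here .+ stands for "one or more of any character", which is how the rest of the line between the speaker label and the target word is covered.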

2.2.1 Model answer

grep("^\\*JPN.+[hH]owever", jpn.10, value=T)
## [1] "*JPN502:\tHowever, we cannot study in advance because of the less time."                                     
## [2] "*JPN506:\tHowever, I have heard one family story ever before."                                               
## [3] "*JPN507:\tHowever, L make think to educational systems."                                                     
## [4] "*JPN507:\tHowever, we want more high quality working in one area, it need longer time to enhance that skill."
  • Learners: 4 examples
grep("^\\*NS.+[hH]owever", ns.10, value=T)
##  [1] "*NS501:\tHowever in the French educational system instead of a head or a body there is a thesis and an anti-thesis or point and counter point in which the writer must oppose his or her original statements."                                                                                                                
##  [2] "*NS501:\tThis makes the facts easy to access, however, it does not force the writer to challenge his or her own logic in the process, leaving the ideas themselves rigid."                                                                                                                                                    
##  [3] "*NS501:\tHowever what the French lose in logical flow they gain in critical thinking."                                                                                                                                                                                                                                        
##  [4] "*NS501:\tHowever, sadly with the continuous failings of the American educational system, these lofty dreams yet remain dreams for a generation of potential Newtons and Einsteins."                                                                                                                                           
##  [5] "*NS502:\tHowever, with growing competition in workplaces and with newer jobs being developed on a regular basis, it may be necessary to reexamine this two-dimensional hierarchy in order to better prepare students for the changing world."                                                                                 
##  [6] "*NS503:\tHowever, I worry that in today's increasingly global society, in which scientific developments are often explicitly prioritized over humanities-based education and research around the world, our global society is perhaps sacrificing crucial analysis of the potential consequences of such scientific research."
##  [7] "*NS503:\tHumanities-based education and analysis, however, has the potential to challenge such ideology and, thereby, transform contemporary global society for the better."                                                                                                                                                  
##  [8] "*NS504:\tHowever, both systems are not completely different, as they both take into account the importance of academic achievement and also the base of the curriculum, albeit having its differences, remains based on a language, social science, natural science and mathematics core."                                    
##  [9] "*NS505:\tHowever Australians have sought to distinguish themselves from the Brits by assuming the role of the scrapper, the underdog."                                                                                                                                                                                        
## [10] "*NS505:\tHowever, there are also some negative aspects to Australia's sporting identity."                                                                                                                                                                                                                                     
## [11] "*NS505:\tAustralia presents itself to the world as a sporting nation, however I challenge the validity of this representation."                                                                                                                                                                                               
## [12] "*NS507:\tHowever, the situation in the United States is much different with most children beginning their first foreign language classes only in high school, if at all."                                                                                                                                                     
## [13] "*NS507:\tHowever, a similar attitude is displayed when an American finds themselves abroad: Why can't they just speak English?"                                                                                                                                                                                               
## [14] "*NS508:\tHowever, as English began to increase in popularity worldwide its influence also took hold of Scotland."                                                                                                                                                                                                             
## [15] "*NS508:\tHowever, even thought the education method is quite successful the lack of interest and importance on Gaelic means the number of students attending these schools are limited and Gaelic is not being used in the world outside the classroom."                                                                      
## [16] "*NS508:\tHowever, that is most certainly easier said than done."                                                                                                                                                                                                                                                              
## [17] "*NS509:\tHowever, today paper money is made of the same materials and the only thing that distinguishes one bill from another is the digit printed on each one."                                                                                                                                                              
## [18] "*NS509:\tHowever, I think that people need to reassess what is important to them at what is valuable."                                                                                                                                                                                                                        
## [19] "*NS509:\tHowever, when it comes down to it, it is all just paper."                                                                                                                                                                                                                                                            
## [20] "*NS509:\tHowever, the value of material objects is completely up to us as individuals."                                                                                                                                                                                                                                       
## [21] "*NS510:\tHowever, there are those that claim that any opposition towards these actions by the Australian Federal Government are in fact based on an underlying racial issue rather than an issue of economical practicality or fairness"
  • Native speakers: 21 examples
    • Some of these are sentence-medial uses.
    • Let's look at the sentence-medial uses separately
grep("^\\*NS.+however", ns.10, value=T)
## [1] "*NS501:\tThis makes the facts easy to access, however, it does not force the writer to challenge his or her own logic in the process, leaving the ideas themselves rigid."  
## [2] "*NS503:\tHumanities-based education and analysis, however, has the potential to challenge such ideology and, thereby, transform contemporary global society for the better."
## [3] "*NS505:\tAustralia presents itself to the world as a sporting nation, however I challenge the validity of this representation."
  • Sentence-medial: 3 examples
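
Rather than counting the hits by eye, the number of matches can be obtained with length(). The sketch below uses an invented three-line vector, ns.sample, in place of ns.10.

```r
# Hypothetical sample lines standing in for ns.10
ns.sample <- c("*NS501:\tHowever, the first point stands.",
               "*NS501:\tThis is easy, however, it is rigid.",
               "%COM:\tUse however sparingly.")

all.hits    <- grep("^\\*NS.+[hH]owever", ns.sample, value = TRUE)
medial.hits <- grep("^\\*NS.+however", ns.sample, value = TRUE)

length(all.hits)      # 2: both *NS lines
length(medial.hits)   # 1: only the lowercase, sentence-medial use
```

Note that the lowercase pattern is only a rough proxy for sentence-medial position; the matched lines still need to be read to confirm how each example is used.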

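2.3 Method 2: Read multiple files into a single object

2.3.1 Preparation

  • List the files: list.files()
  • Gather the files you need into one folder (manually, on your computer)
  • Set that folder as the working directory

2.3.2 Reading the files in a folder one after another: looping with for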

for (variable in values){
  do something
  do something else
}
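
As a concrete instance of the template above, here is a minimal sketch that adds up three numbers. The names total and n are chosen only for this illustration.

```r
total <- 0                 # container for the running result
for (n in c(3, 5, 7)){     # take the values one at a time and put each in n
  total <- total + n       # add the current value to the running total
}
total                      # 15
```

The pattern is the same as in the file-reading loop: prepare a container first, then update it once per pass.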
  • Check before running
    • Working directory
    • File list: list.files()
getwd()
list.files()
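  • All 381 learner files are in one folder, so let's process all of them.
file.zenbu <- list.files()           # make a list of all files in the directory (zenbu = "all")

ruiseki <- ""                        # prepare a container for the results (ruiseki = "cumulative"); this "" stays as the first element

for (i in file.zenbu){               # take the file names from the list one at a time and put each in i
  yomikomi <- readLines(i, warn=F)   # read the file named i and store its lines in yomikomi (= "read in")
  ruiseki <- c(ruiseki, yomikomi)    # append the lines just read to ruiseki with c()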
}
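
The loop can be tried without the corpus. The sketch below makes several assumptions that are not in the original: it builds two tiny stand-in files in a temporary folder, restricts list.files() to .txt files with the pattern argument, and starts from character(0) so that no empty string is left at the front of the result. All names ending in .demo are hypothetical.

```r
# Create a temporary folder with two hypothetical stand-in files
folder.demo <- tempfile("nicer_demo")
dir.create(folder.demo)
writeLines(c("@Begin", "*JPN001:\tHowever, it is hard.", "@End"),
           file.path(folder.demo, "JPN001.txt"))
writeLines(c("@Begin", "*JPN002:\tI was shocked.", "@End"),
           file.path(folder.demo, "JPN002.txt"))

# List only the .txt files, with their full paths
files.demo <- list.files(folder.demo, pattern = "\\.txt$", full.names = TRUE)

ruiseki.demo <- character(0)            # empty character vector as the container
for (f in files.demo){
  ruiseki.demo <- c(ruiseki.demo, readLines(f, warn = FALSE))
}

length(ruiseki.demo)                    # 6 lines in total
grep("^\\*JPN.+[hH]owever", ruiseki.demo, value = TRUE)
```

Using full.names = TRUE means the loop does not depend on the current working directory.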

2.3.3 Checking the contents

ruiseki

2.3.4 Checking the number of elements

length(ruiseki)
## [1] 37949

2.3.5 Checking the beginning of the contents: head()

  • Shows 6 elements by default.
  • Give a number to show that many elements: head(variable, number)
head(ruiseki)
## [1] ""                       "@Begin"                 "@Participants:\tJPN501"
## [4] "@PID:\tPIDJP501"        "@Age:\t21"              "@Sex:\tF"
head(ruiseki, 20)
##  [1] ""                                            
##  [2] "@Begin"                                      
##  [3] "@Participants:\tJPN501"                      
##  [4] "@PID:\tPIDJP501"                             
##  [5] "@Age:\t21"                                   
##  [6] "@Sex:\tF"                                    
##  [7] "@YearInSchool:\tU2"                          
##  [8] "@Major:\tagriculture"                        
##  [9] "@StudyHistory:\t8"                           
## [10] "@OtherLanguage:\tChinese=1.0;none="          
## [11] "@Qualification:\tTOEIC=590(2013);none=;none="
## [12] "@Abroad:\tnone=;none="                       
## [13] "@Reading:\t3"                                
## [14] "@Writing:\t2"                                
## [15] "@Listening:\t2"                              
## [16] "@Speaking:\t1"                               
## [17] "@JapaneseEssay:\t4"                          
## [18] "@EnglishEssayEx:\t3"                         
## [19] "@EnglishEssay:\t2"                           
## [20] "@Difficulty:\t"

2.3.6 Checking the end of the contents: tail()

tail(ruiseki)
## [1] "%NTV:\tAs a result, classes will take longer and students' abilities will decrease."             
## [2] "%COM:\t"                                                                                         
## [3] "*JPN881:\tI think it should be abolished Internet added in the elementary school's classes."     
## [4] "%NTV:\tI think the idea of adding the Internet to elementary school classes should be abolished."
## [5] "%COM:\t"                                                                                         
## [6] "@End"

2.3.7 Checking by element number: variable[first number : last number]

ruiseki[201:220]
##  [1] "%COM:\t"                                                                                                                                            
##  [2] "*JPN502:\tSo, our world ranking of study level    was fell down ."                                                                                  
##  [3] "%NTV:\tTherefore, our world ranking in terms of study level slipped down. "                                                                         
##  [4] "%COM:\tAvoid starting sentences with coordinating conjunctions. You can change \"so\" to \"therefore\". "                                           
##  [5] "%par:"                                                                                                                                              
##  [6] "*JPN502:\tSecond,I'll tell what happen to us now."                                                                                                  
##  [7] "%NTV:\tSecond, I'll explain what happens now."                                                                                                      
##  [8] "%COM:\t"                                                                                                                                            
##  [9] "*JPN502:\tMany people who are older than us sometimes call us \"YUTORI SEDAI\" that include that negative image."                                   
## [10] "%NTV:\tMany people who are older than us sometimes call us \"YUTORI sedai\", which has a negative connotation."                                     
## [11] "%COM:\t"                                                                                                                                            
## [12] "*JPN502:\tThere image is student who grew in \"YUTORI\" is slow to decide and have less effort and so on."                                          
## [13] "%NTV:\tTheir image of a student who grew up under the YUTORI system is of someone who is slow to decide, doesn't make much effort, and so on."      
## [14] "%COM:\t"                                                                                                                                            
## [15] "*JPN502:\tEven if I was played tennis in lesson of  university, the teacher said that \"You are YUTORI, so you should make your self more better.\""
## [16] "%NTV:\tEven if I had a tennis lesson in university, the teacher said, \"You are YUTORI, so you should make yourself better.\""                      
## [17] "%COM:\tUse quote marks to indicate quoted speech."                                                                                                  
## [18] "*JPN502:\tI was shocked."                                                                                                                           
## [19] "%NTV:\tI was shocked."                                                                                                                              
## [20] "%COM:\t"

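2.3.8 Sample data: learners' uses of however

file.zenbu <- list.files()           # make a list of all files in the directory (zenbu = "all")

ruiseki <- ""                        # prepare a container for the results (ruiseki = "cumulative"); this "" stays as the first element

for (i in file.zenbu){               # take the file names from the list one at a time and put each in i
  yomikomi <- readLines(i, warn=F)   # read the file named i and store its lines in yomikomi (= "read in")
  ruiseki <- c(ruiseki, yomikomi)    # append the lines just read to ruiseki with c()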
}

jpn.however <- grep("^\\*JPN.+[hH]owever", ruiseki, value=T)
  • Look at the contents
head(jpn.however, 20)
##  [1] "*JPN502:\tHowever, we cannot study in advance because of the less time."                                                                                         
##  [2] "*JPN506:\tHowever, I have heard one family story ever before."                                                                                                   
##  [3] "*JPN507:\tHowever, L make think to educational systems."                                                                                                         
##  [4] "*JPN507:\tHowever, we want more high quality working in one area, it need longer time to enhance that skill."                                                    
##  [5] "*JPN511:\tHowever, there are limited teacher who have teaching license."                                                                                         
##  [6] "*JPN512:\tHowever, most healthy people don't recognize what kind of disadvantages are there and how to deal with the cases which they meet disadvantaged people."
##  [7] "*JPN512:\tThe answer is, however, not about education."                                                                                                          
##  [8] "*JPN516:\tHowever, since bubble economy was collapsed, Japan had to adapt its crisis."                                                                           
##  [9] "*JPN516:\tHowever, Japanese students' educational levels were declined under this system."                                                                       
## [10] "*JPN516:\tHowever, I think the way the Japanese government chosen was not necessarily true."                                                                     
## [11] "*JPN517:\tHowever, I am too busy to work these days because I attend many lecture and I have to study for the test."                                             
## [12] "*JPN517:\tHowever, it is very hard."                                                                                                                             
## [13] "*JPN520:\tHowever, we have to admit that not all of the teachers can teach \"speaking English\" well enough."                                                    
## [14] "*JPN526:\tHowever, the other day my father said to me, \"It's the time you should pay some money  for family per month.\""                                       
## [15] "*JPN528:\tHowever,  Finnish education system also have some bad points."                                                                                         
## [16] "*JPN529:\tHowever, my life has changed by my belonging a sports club in this university."                                                                        
## [17] "*JPN529:\tHowever, I continue to play lacrosse thank for my teammates."                                                                                          
## [18] "*JPN532:\tHowever, high school students don't have to select them."                                                                                              
## [19] "*JPN536:\tHowever, there are various and stimulus sports in the world."                                                                                          
## [20] "*JPN536:\tHowever, it is also important to do sports."

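2.4 Saving the results to a file and opening it in Excel

2.4.1 Saving data to a file: write.table(variable, "filename.txt")

  • Pay attention to the directory where the file is saved
    • It is better to save one level above the data files (otherwise the results file ends up among the files that list.files() returns)
    • Put ../ in front of the file name
      • Example: "../result.txt"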
write.table(jpn.however, "../jpn.however.txt")
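  • The results are saved as a text file.

  • Open the text file in Excel

    • File type when opening: "Text Files"
    • Original data type
      • Delimited (commas or tabs) > Next
        • ☑ Tab
        • Other [ " ]
  • For research, look through the data in the Excel table and enter analysis codes that suit the research purpose, for example an analysis of the position in which however occurs.

  • Tally the results of the analysis

  • Reference (position of however): https://nuss.nagoya-u.ac.jp/s/42p2ZX6eRdpxq7q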
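
To see what the write.table() step in 2.4.1 puts in the file, it can help to write a small vector and read the file back. This is a sketch under assumptions: the two strings are shortened, invented examples, the files go to a temporary folder rather than "../", and the second call uses the quote, row.names and col.names arguments, which the original does not use.

```r
# Hypothetical stand-in for jpn.however
however.demo <- c("*JPN502:\tHowever, we cannot study.",
                  "*JPN506:\tHowever, I have heard one story.")

# Default settings, as in the text: a header line, row numbers, and quoted strings
out1 <- file.path(tempdir(), "however_default.txt")
write.table(however.demo, out1)
readLines(out1)

# Assumed variant: only the lines themselves, with no header, row numbers, or quotes
out2 <- file.path(tempdir(), "however_plain.txt")
write.table(however.demo, out2, quote = FALSE, row.names = FALSE, col.names = FALSE)
readLines(out2)
```

In both cases the tab after the speaker label is written as a real tab character, so ticking Tab as the delimiter in Excel puts the label and the sentence in separate columns.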

3 Creating a vocabulary list

3.1 Checking the data

jpn501

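3.1.1 The "structure" of the data (the format in which it is stored)

  • Unstructured text: no format, plain running text

    • Attribute information about the data is kept separately (?)
  • Structured text: laid out according to a format (e.g., the CHAT format of CHILDES) https://www.sugiura-ken.org/wiki/wiki.cgi/exp?page=CHAT

    • Structured text is analyzed with a program designed to process it according to its structure.
      • For CHILDES, that program is CLAN

3.1.1.1 The CHAT format

  • Consists of a header with attribute information and a body
  • Attribute lines begin with @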
@Begin
@Participants:  CHI
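*CHI:   Utterance data goes here, one utterance per line. The line begins with *
%COD:   Analysis codes and the like for the utterance. The line begins with %
  (This utterance block is repeated once for each utterance.)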
@End
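
Because each kind of line is marked by its first character, the three kinds can be separated with grep(). The following is a minimal sketch; the vector chat.demo and its contents are invented for illustration.

```r
# Hypothetical miniature CHAT file held as a character vector
chat.demo <- c("@Begin",
               "@Participants:\tCHI",
               "*CHI:\tmore cookie .",
               "%COD:\tsample code",
               "@End")

header     <- grep("^@", chat.demo, value = TRUE)    # attribute lines
utterances <- grep("^\\*", chat.demo, value = TRUE)  # utterance lines
codes      <- grep("^%", chat.demo, value = TRUE)    # analysis-code lines

length(header)       # 3
utterances
```

Only the asterisk needs the doubled backslash here, since @ and % have no special meaning in a regular expression.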

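3.2 Data processing

  1. Extract only the necessary part
  2. Delete the unnecessary parts
  3. Main processing (e.g., creating a vocabulary list)

3.2.1 Extracting only the necessary part

  • Extract the lines of the body, which begin with *JPN
tmp1 <- grep("^\\*JPN", jpn501, value=T)   # ^ anchors the match to the start of the line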

head(tmp1)
## [1] "*JPN501:\tWhat kind of sports do you like?"                      
## [2] "*JPN501:\tDo you like soccer, base ball or swimming?"            
## [3] "*JPN501:\tThere are many and variety sports around the world."   
## [4] "*JPN501:\tA country has some traditional sports."                
## [5] "*JPN501:\tOf course, there are some traditional sports in Japan."
## [6] "*JPN501:\tThey are called \"BUDO\"."

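3.2.2 Deleting unnecessary parts: replacing with gsub()

  • The speaker label "*JPN501:" at the start of each line is not needed

  • In text processing, "deleting" means "replacing" with nothing

    • Strings are enclosed in double quotes
    • "Nothing" is written as two double quotes in a row (with nothing between them)
  • The command for replacing is gsub()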

tmp2 <- gsub("\\*JPN501:\t", "", tmp1)

head(tmp2)
## [1] "What kind of sports do you like?"                      
## [2] "Do you like soccer, base ball or swimming?"            
## [3] "There are many and variety sports around the world."   
## [4] "A country has some traditional sports."                
## [5] "Of course, there are some traditional sports in Japan."
## [6] "They are called \"BUDO\"."
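
The pattern above names one speaker, JPN501. When lines from several learners are mixed, a more general pattern is needed. The sketch below assumes that every label has the form *JPN followed by digits, a colon, and a tab; the sample lines are copied from the outputs shown earlier.

```r
# Lines from two different learners
mixed <- c("*JPN501:\tWhat kind of sports do you like?",
           "*JPN502:\tI was shocked.")

# ^ start of line, [0-9]+ one or more digits, \t the tab after the colon
cleaned <- gsub("^\\*JPN[0-9]+:\t", "", mixed)
cleaned
```

Anchoring the pattern with ^ ensures that only a label at the very start of the line is removed.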

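3.3 Creating the vocabulary list

  1. Check the current state of the data: one sentence per line
  2. Check the result we want: a list of words
  3. The "processing" goes between the current state and the result.
  • Think through the processing steps concretely, one at a time, and "connect" them

3.3.1 Splitting sentences into words: strsplit()

  • Command for splitting strings: strsplit(variable, "separator")
  • Setting the separator to a single space (" ") splits each sentence into words.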
tmp3 <- strsplit(tmp2, " ")

head(tmp3)
## [[1]]
## [1] "What"   "kind"   "of"     "sports" "do"     "you"    "like?" 
## 
## [[2]]
## [1] "Do"        "you"       "like"      "soccer,"   "base"      "ball"     
## [7] "or"        "swimming?"
## 
## [[3]]
## [1] "There"   "are"     "many"    "and"     "variety" "sports"  "around" 
## [8] "the"     "world." 
## 
## [[4]]
## [1] "A"           "country"     "has"         "some"        "traditional"
## [6] "sports."    
## 
## [[5]]
## [1] "Of"          "course,"     "there"       "are"         "some"       
## [6] "traditional" "sports"      "in"          "Japan."     
## 
## [[6]]
## [1] "They"      "are"       "called"    "\"BUDO\"."
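  • The result is a "list" with one element per sentence
    • A "list" here can be pictured roughly like a spreadsheet, although it is not a true two-dimensional array, since each row can have a different length
      • each row holds as many elements as the sentence has words
      • and there are as many rows as there are sentences, all held together as one object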
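
The structure of such a list can be inspected directly. This is a minimal sketch on two toy sentences (the variable names sentences and words are introduced here for illustration only).

```r
sentences <- c("What kind of sports do you like?",
               "They are called \"BUDO\".")
words <- strsplit(sentences, " ")

class(words)      # "list"
length(words)     # 2: one element per sentence
lengths(words)    # 7 4: number of words in each sentence
words[[2]]        # all words of the second sentence
words[[2]][3]     # "called": third word of the second sentence
```

Double square brackets pick out one sentence, and single square brackets then pick out a word within it.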

3.3.2 Removing the list "frame" and laying all the words out flat: unlist()

tmp4 <- unlist(tmp3)

head(tmp4, 20)
##  [1] "What"      "kind"      "of"        "sports"    "do"        "you"      
##  [7] "like?"     "Do"        "you"       "like"      "soccer,"   "base"     
## [13] "ball"      "or"        "swimming?" "There"     "are"       "many"     
## [19] "and"       "variety"

3.3.3 Sorting into alphabetical order: sort()

tmp5 <- sort(tmp4)

head(tmp5, 20)
##  [1] ""          ""          ""          "\"BUDO\"." "\"BUDO\"." "\"REI\""  
##  [7] "\"REI\""   "\"REI\""   "\"REI\"."  "\"REI\"."  "(the"      "a"        
## [13] "A"         "about"     "about"     "also"      "an"        "an"       
## [19] "and"       "and"
  • A few entries look odd, but they will be dealt with later as exceptions.

3.3.4 Counting the elements of the word list: the number of tokens (running words)

length(tmp5)
## [1] 322

3.3.5 Building a frequency table: table()

  • table() counts how many times each distinct element occurs in a vector
tmp6 <- table(tmp5)

tmp6
## tmp5
##                 "BUDO".       "REI"      "REI".        (the           a 
##           3           2           3           2           1           1 
##           A       about        also          an         and         And 
##           1           2           1           2           7           1 
##         are      around      awful,        ball        base          be 
##           5           3           1           1           1           1 
##        beat     because        body         bow        BUDO       BUDO, 
##           2           1           1           3           7           2 
##     BUDOJYO         but         But      called         can      cannot 
##           3           1           2           1           3           1 
##       clean       could     country     course,     deeply.          do 
##           1           1           1           1           1           1 
##          Do        each    efforts.      ended,       enemy      enemy. 
##           1           3           1           1           1           1 
##       enter    example,  expression        feel     feeling       fight 
##           1           3           1           2           3           1 
##    Finally,      First,         for         For        from        game 
##           1           1           4           2           1           3 
##       game.        give       give.        good       grow.         has 
##           4           1           1           2           1           1 
##        have        him,         how          If   important          in 
##           2           1           1           2           1           4 
##          In    involved          is          It      Japan.    Japanese 
##           1           1           7           1           1           1 
##      JYUDO,      KENDO,        kind       KYUDO       leave        like 
##           1           1           1           1           1           1 
##       like?       loves        make        many       mate,      mates. 
##           1           1           1           6           1           1 
##      mental        more        much        must          no         not 
##           1           1           1           7           1           1 
##          of          Of          on         on.        only          or 
##           2           1           1           1           1           1 
##       other      other,      other.        pain      people       place 
##           3           1           1           1           3           1 
##      place)        play     players     Players     playing       plays 
##           1           4           5           1           3           1 
##       point     points,       proud   remember.        sad,        same 
##           1           1           1           1           1           1 
##   Secondly,      should          so          So         So,     soccer, 
##           1           1           2           1           1           2 
##        some      sports     sports.      start,     strong,     strong. 
##           2           5           1           1           1           1 
##   swimming?      taught       teach        team       thank        that 
##           1           1           1           2           6           5 
##         the       Then,       there       There      there.        they 
##          19           1           2           1           2           2 
##        They       thing       think        this        This      tired, 
##           1           1           1           1           2           1 
##      today,        too. traditional     variety          We        weak 
##           1           1           3           1           1           1 
##        What        when       where       which         who        will 
##           1           4           1           1           2           1 
##        with     without       world      world.       would         you 
##           1           1           1           2           1          18 
##         You        your    yourself   yourself. 
##           1           2           3           3
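
The table above is ordered alphabetically by word. To see the most frequent words first, the table itself can be passed to sort(). A minimal sketch on a toy vector (words, freq and freq_sorted are names made up for this example):

```r
words <- c("the", "you", "the", "sports", "you", "the")
freq  <- table(words)

# Sort by count, largest first
freq_sorted <- sort(freq, decreasing = TRUE)
freq_sorted

names(freq_sorted)   # "the" "you" "sports"
```

With the real data, the same call would be sort(tmp6, decreasing = TRUE).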

3.3.6 The number of entries in the table is the number of types (distinct words)

length(tmp6)
## [1] 166
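
The steps from 3.3.1 to 3.3.6 can be chained into a few lines. The sketch below uses two toy sentences so that the counts can be checked by hand. It adds no cleaning, so punctuation and capitalisation would still affect the counts on real data.

```r
sentences <- c("you like sports", "you like the sports you play")

tokens  <- unlist(strsplit(sentences, " "))   # split, then flatten
n_token <- length(tokens)                     # number of tokens
n_type  <- length(table(tokens))              # number of types

n_token   # 9
n_type    # 5
```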

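3.3.7 Saving the table to a file: write.table()

  • write.table(table, "filename.txt")

  • The file is saved in the current working directory.

  • Here, that is inside NICER_NNS

    • Saving it there mixes data with processing results, which invites trouble.
    • It is better to save it one level up (outside the folder).
      • "../filename.txt"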
write.table(tmp6, "../tmp6.txt")
  • Try opening it in Excel
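
With the default settings, write.table() separates columns with spaces and puts quotation marks around the words. For use in a spreadsheet, a tab-separated file without quotes may be easier to handle. The following sketch shows one such set of options and reads the file back to check it. The temporary file and the column names word and freq are choices made for this example, not part of the original procedure.

```r
freq <- table(c("the", "you", "the"))

out <- tempfile(fileext = ".txt")   # temporary file, used only for this sketch
write.table(freq, out,
            sep = "\t",                      # tab-separated columns
            quote = FALSE,                   # no quotation marks around words
            row.names = FALSE,               # no row numbers
            col.names = c("word", "freq"))   # column headings

# Read the file back to confirm what was written
back <- read.table(out, header = TRUE, sep = "\t", quote = "",
                   stringsAsFactors = FALSE)
back
```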

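4 Today's assignments

4.1 Build vocabulary lists for other files and compare them.

  • What you compare with what is up to you.

4.2 Is the vocabulary list incomplete if you use only the processing steps explained in today's class?

  • Point out the problems.
  • Think about how to solve them.

4.3 Build a single vocabulary list from all the files in the folder.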