
tagger


[R]


https://github.com/trinker/tagger

  • Part-of-speech tagging with openNLP
tagger wraps the NLP and openNLP packages for easier part of 
speech tagging. tagger uses the openNLP annotator to compute 
"Penn Treebank parse annotations using the Apache OpenNLP 
chunking parser for English."

 Install the required packages

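# openNLP relies on rJava (which needs a working Java installation),
# so install and load it before tagger
install.packages("rJava")
library(rJava)

# pacman installs (if needed) and loads the GitHub packages
install.packages("pacman")
pacman::p_load_gh(c(
    "trinker/termco",
    "trinker/coreNLPsetup",
    "trinker/tagger"
))

# dplyr must already be installed: install.packages("dplyr")
library(dplyr)
library(tagger)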
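
Before going further it can help to check that the packages actually load, because rJava fails when no usable Java is found. The following is a minimal sketch in base R. The package list is an assumption based on the names mentioned on this page, and needed, ok and missing_pkgs are illustrative names.

```r
# Packages named on this page; adjust the list to your setup (assumption).
needed <- c("rJava", "NLP", "openNLP", "tagger")

# requireNamespace() returns FALSE instead of stopping when a package
# is missing or cannot be loaded (e.g. rJava without a usable Java).
ok <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need attention (empty if all load).
missing_pkgs <- needed[!ok]
print(missing_pkgs)
```

If missing_pkgs is not empty, reinstall those packages (and check the Java setup for rJava) before calling tag_pos().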

 Tag list

penn_tags()

   Tag  Description                                 
1  $    dollar                                      
2  ``   opening quotation mark                      
3  ''   closing quotation mark                      
4  (    opening parenthesis                         
5  )    closing parenthesis                         
6  ,    comma                                       
7  -    dash                                        
8  .    sentence terminator                         
9  :    colon or ellipsis                           
10 CC   conjunction, coordinating                   
11 CD   numeral, cardinal                           
12 DT   determiner                                  
13 EX   existential there                           
14 FW   foreign word                                
15 IN   preposition or conjunction, subordinating   
16 JJ   adjective or numeral, ordinal               
17 JJR  adjective, comparative                      
18 JJS  adjective, superlative                      
19 LS   list item marker                            
20 MD   modal auxiliary                             
21 NN   noun, common, singular or mass              
22 NNP  noun, proper, singular                      
23 NNPS noun, proper, plural                        
24 NNS  noun, common, plural                        
25 PDT  pre-determiner                              
26 POS  genitive marker                             
27 PRP  pronoun, personal                           
28 PRP$ pronoun, possessive                         
29 RB   adverb                                      
30 RBR  adverb, comparative                         
31 RBS  adverb, superlative                         
32 RP   particle                                    
33 SYM  symbol                                      
34 TO   "to" as preposition or infinitive marker    
35 UH   interjection                                
36 VB   verb, base form                             
37 VBD  verb, past tense                            
38 VBG  verb, present participle or gerund          
39 VBN  verb, past participle                       
40 VBP  verb, present tense, not 3rd person singular
41 VBZ  verb, present tense, 3rd person singular    
42 WDT  WH-determiner                               
43 WP   WH-pronoun                                  
44 WP$  WH-pronoun, possessive                      
45 WRB  Wh-adverb                                   

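Instead of Penn Treebank tags, the tags can also be collapsed into a general (universal) set of part-of-speech symbols: as_universal(). Here ns502.pos is the tagged object created with tag_pos() in the Commands section.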


> plot(as_universal(ns502.pos))

The part of speech can also be stated explicitly: as_basic()

> plot(as_basic(ns502.pos))

 Commands


  • Example data: only the body text of the NS502 file from NICER
> str(ns502)
 chr [1:26] "An Assumed Role" ...
> head(ns502) 
[1] "An Assumed Role"                                                                                                                                                                                                                    
[2] "Considering the heightened role maintained by many in education, it's strangely rare to fin many willing to question this status quo."                                                                                              
[3] "Of course, many aware of the dynamic contrast between student and teacher are perfectly willing to perpetuate and even strengthen this relationship."                                                                               
[4] "It may seem entirely silly, even, to consider that anything needs to be changed."                                                                                                                                                   
[5] "However, with growing competition in workplaces and with newer jobs being developed on a regular basis, it may be necessary to reexamine this two-dimensional hierarchy in order to better prepare students for the changing world."
[6] "For years, American schools have maintained a strict adherence to an invisible, ghostly code, one that encourages respect and honor in a manner reminiscent of both the power structure found in factories and the military." 

Tagging: tag_pos()


> tag_pos(ns502)   # POS tagging
1.  An/DT Assumed/NNP Role/NNP
2.  Considering/VBG the/DT heightened/VBN role/NN maintained/VBN by/IN many/JJ in/IN ...
3.  Of/IN course/NN ,/, many/JJ aware/JJ of/IN the/DT dynamic/JJ contrast/NN between/IN ...
4.  It/PRP may/MD seem/VB entirely/RB silly/JJ ,/, even/RB ,/, to/TO consider/VB ...
5.  However/RB ,/, with/IN growing/VBG competition/NN in/IN workplaces/NNS and/CC ...
.
.
.
22. Allow/NN rules/NNS to/TO be/VB challenged/VBN ,/, and/CC even/RB changed/VBD on/IN ...
23. Let/VB the/DT wealth/NN of/IN perspectives/NNS a/DT teacher/NN has/VBZ access/NN ...
24. Education/NNP should/MD be/VB defined/VBN by/IN the/DT fluidity/NN of/IN the/DT ...
25. Supreme/NNP authority/NN without/IN equal/JJ exchange/NN is/VBZ as/RB unnatural/JJ ...
26. Instead/RB ,/, we/PRP should/MD yearn/VB to/TO construct/VB a/DT bridge/NN ... 


Tag frequencies: count_tags()


> ns502.pos <- tag_pos(ns502) 
> count_tags(ns502.pos)
   n.tokens      ''        ,       .      ``      CC      CD       DT       IN       JJ     JJR      MD       NN      NNP
1         3       0        0       0       0       0       0 1(33.3%)        0        0       0       0        0 2(66.7%)
2        24       0  1(4.2%) 1(4.2%)       0       0       0  2(8.3%)  2(8.3%) 4(16.7%)       0       0 4(16.7%)        0
3        24       0  1(4.2%) 1(4.2%)       0 2(8.3%)       0  2(8.3%) 3(12.5%) 4(16.7%)       0       0 5(20.8%)        0
4        17       0 2(11.8%) 1(5.9%)       0       0       0        0  1(5.9%)  1(5.9%)       0 1(5.9%)  1(5.9%)        0
5        38       0  2(5.3%) 1(2.6%)       0 1(2.6%)       0  3(7.9%) 6(15.8%)  3(7.9%) 1(2.6%) 1(2.6%) 5(13.2%)        0
6        39       0  3(7.7%) 1(2.6%)       0 2(5.1%) 1(2.6%) 6(15.4%) 4(10.3%) 6(15.4%)       0       0 7(17.9%)        0
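
Each cell in this table is a count followed by its share of the tokens in that line. As a rough illustration of where a value such as 2(66.7%) comes from, the base R sketch below recomputes the first row from the tags of line 1 (An/DT Assumed/NNP Role/NNP). It is not how count_tags() is implemented; tags1, counts and cells are names made up for the example.

```r
# Tags of the first line as shown by tag_pos(): An/DT Assumed/NNP Role/NNP
tags1 <- c("DT", "NNP", "NNP")

counts   <- table(tags1)                      # frequency of each tag
n_tokens <- length(tags1)                     # number of tokens in the line
pct      <- round(100 * counts / n_tokens, 1) # share of each tag in percent

# Format as "count(percent%)", similar to the count_tags() display
cells <- sprintf("%d(%.1f%%)", as.integer(counts), as.numeric(pct))
names(cells) <- names(counts)
print(cells)
```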

Plotting tag frequencies: plot()

> plot(ns502.pos) 

Outputting the tagged text: as_word_tag()

> ns502.pos.tagged <- as_word_tag(ns502.pos)
> 
> head(ns502.pos.tagged)
[1] "An/DT Assumed/NNP Role/NNP"                                                                                                                                                                                                                                                                                                                                    
[2] "Considering/VBG the/DT heightened/VBN role/NN maintained/VBN by/IN many/JJ in/IN education/NN ,/, it/PRP 's/VBZ strangely/RB rare/JJ to/TO fin/VBG many/JJ willing/JJ to/TO question/VB this/DT status/NN quo/NN ./."                                                                                                                                          
[3] "Of/IN course/NN ,/, many/JJ aware/JJ of/IN the/DT dynamic/JJ contrast/NN between/IN student/NN and/CC teacher/NN are/VBP perfectly/RB willing/JJ to/TO perpetuate/VB and/CC even/RB strengthen/VB this/DT relationship/NN ./."                                                                                                                                 
[4] "It/PRP may/MD seem/VB entirely/RB silly/JJ ,/, even/RB ,/, to/TO consider/VB that/IN anything/NN needs/VBZ to/TO be/VB changed/VBN ./."                                                                                                                                                                                                                        
[5] "However/RB ,/, with/IN growing/VBG competition/NN in/IN workplaces/NNS and/CC with/IN newer/JJR jobs/NNS being/VBG developed/VBN on/IN a/DT regular/JJ basis/NN ,/, it/PRP may/MD be/VB necessary/JJ to/TO reexamine/VB this/DT two-dimensional/JJ hierarchy/NN in/IN order/NN to/TO better/RB prepare/VB students/NNS for/IN the/DT changing/VBG world/NN ./."
[6] "For/IN years/NNS ,/, American/JJ schools/NNS have/VBP maintained/VBN a/DT strict/JJ adherence/NN to/TO an/DT invisible/JJ ,/, ghostly/JJ code/NN ,/, one/CD that/WDT encourages/VBZ respect/NN and/CC honor/NN in/IN a/DT manner/NN reminiscent/JJ of/IN both/DT the/DT power/NN structure/NN found/VBD in/IN factories/NNS and/CC the/DT military/JJ ./." 
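
Since every token has the form word/TAG, a tagged line can be split back into separate word and tag columns with base R alone. The sketch below uses line 4 from the output above. tagged, tokens and tok_df are illustrative names, and the split at the last "/" is an assumption about the format that holds for the lines shown here.

```r
# Line 4 in the word/TAG format produced by as_word_tag()
tagged <- "It/PRP may/MD seem/VB entirely/RB silly/JJ ,/, even/RB ,/, to/TO consider/VB that/IN anything/NN needs/VBZ to/TO be/VB changed/VBN ./."

# Split on spaces to get the individual word/TAG tokens
tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]

# Everything before the last "/" is the word, everything after it the tag
tok_df <- data.frame(
  word = sub("/[^/]*$", "", tokens),
  tag  = sub("^.*/", "", tokens),
  stringsAsFactors = FALSE
)
print(tok_df)
```

The number of rows equals the n.tokens value that count_tags() reports for the same line (17).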

Searching the tagged data by part of speech

  • Example: adjective + noun
> grep("\\w+/JJ \\w+/NN", ns502.pos.tagged, value=TRUE)
 [1] "Of/IN course/NN ,/, many/JJ aware/JJ of/IN the/DT dynamic/JJ contrast/NN between/IN student/NN and/CC teacher/NN are/VBP perfectly/RB willing/JJ to/TO perpetuate/VB and/CC even/RB strengthen/VB this/DT relationship/NN ./."                                                                                                                                                                                                                                                                                      
 [2] "However/RB ,/, with/IN growing/VBG competition/NN in/IN workplaces/NNS and/CC with/IN newer/JJR jobs/NNS being/VBG developed/VBN on/IN a/DT regular/JJ basis/NN ,/, it/PRP may/MD be/VB necessary/JJ to/TO reexamine/VB this/DT two-dimensional/JJ hierarchy/NN in/IN order/NN to/TO better/RB prepare/VB students/NNS for/IN the/DT changing/VBG world/NN ./."                                                                                                                                                     
 [3] "For/IN years/NNS ,/, American/JJ schools/NNS have/VBP maintained/VBN a/DT strict/JJ adherence/NN to/TO an/DT invisible/JJ ,/, ghostly/JJ code/NN ,/, one/CD that/WDT encourages/VBZ respect/NN and/CC honor/NN in/IN a/DT manner/NN reminiscent/JJ of/IN both/DT the/DT power/NN structure/NN found/VBD in/IN factories/NNS and/CC the/DT military/JJ ./."        
  • With grep, the whole sentence containing the matching string is returned
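
Besides returning whole lines, grep() can report which lines match, and grepl() gives a logical vector that is easy to count. A small sketch on three lines copied from the tagged output above; x and pat are illustrative names.

```r
# Lines 1, 3 and 4 of the tagged text
x <- c(
  "An/DT Assumed/NNP Role/NNP",
  "Of/IN course/NN ,/, many/JJ aware/JJ of/IN the/DT dynamic/JJ contrast/NN between/IN student/NN and/CC teacher/NN are/VBP perfectly/RB willing/JJ to/TO perpetuate/VB and/CC even/RB strengthen/VB this/DT relationship/NN ./.",
  "It/PRP may/MD seem/VB entirely/RB silly/JJ ,/, even/RB ,/, to/TO consider/VB that/IN anything/NN needs/VBZ to/TO be/VB changed/VBN ./."
)
pat <- "\\w+/JJ \\w+/NN"

idx   <- grep(pat, x)    # positions of the matching lines
hits  <- grepl(pat, x)   # TRUE/FALSE for every line
n_hit <- sum(hits)       # number of lines containing the pattern

print(idx)
print(hits)
print(n_hit)
```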

 Using the stringr package


Installing stringr

install.packages("stringr", dependencies=TRUE)
library(stringr)

Extracting only the matching strings with stringr

https://sugiura-ken.org/wiki/wiki.cgi/exp?page=grepExtract

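  • A modified version of the function at the link above
grepExtract2 <- function(a, b){
  # Uses the stringr package (load it with library(stringr) first)
  # copyleft 2020-12-20 sugiura@nagoya-u.jp
  # Usage: grepExtract2("search pattern", data)
  # Returns the first match of pattern a in each element of b,
  # or NA for elements without a match

  hit.all <- str_extract(b, a)

  hit.all
}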

> grepExtract2("\\w+/JJ \\w+/NN", ns502.pos.tagged)
 [1] NA                             NA                             "dynamic/JJ contrast/NN"      
 [4] NA                             "regular/JJ basis/NN"          "American/JJ schools/NN"      
 [7] "American/JJ schools/NN"       "last/JJ name/NN"              "student/JJ roles/NN"         
[10] "Many/JJ professors/NN"        "romantic/JJ relationships/NN" "many/JJ foreigners/NN"       
[13] "student/JJ relationships/NN"  "intellectual/JJ equals/NN"    NA                            
[16] "loose/JJ roles/NN"            "active/JJ response/NN"        "first/JJ step/NN"            
[19] NA                             "proper/JJ parenting/NN"       "distinct/JJ purpose/NN"      
[22] NA                             "right/JJ course/NN"           "human/JJ experience/NN"      
[25] "equal/JJ exchange/NN"         "equal/JJ power/NN"
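
Two things are visible in this output: only the first match per line is returned, and because the pattern stops at /NN, longer tags such as NNS are cut off ("American/JJ schools/NN"). The sketch below is one possible variation in base R, not the function from the linked page. It returns every match per line and keeps the full noun tag. The pattern with \\S+ and NN\\w* and the names x, pat and all_hits are assumptions made for this example. With stringr loaded, str_extract_all(b, a) likewise returns all matches as a list.

```r
# Line 1 and the beginning of line 6 of the tagged text
x <- c(
  "An/DT Assumed/NNP Role/NNP",
  "For/IN years/NNS ,/, American/JJ schools/NNS have/VBP maintained/VBN a/DT strict/JJ adherence/NN to/TO an/DT invisible/JJ ,/, ghostly/JJ code/NN ,/,"
)

# \\S+ also covers hyphenated words; NN\\w* keeps NN, NNS, NNP and NNPS whole
pat <- "\\S+/JJ \\S+/NN\\w*"

# gregexpr() finds every match, regmatches() extracts the matched text
all_hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))
print(all_hits)
```

The result is a list with one element per line. Lines without a match give character(0) rather than NA.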