*disclaimer
603622
ngram
- 文字列をn-gramに切り分ける
注
- 一文一行になってないと、文をまたいでn-gramを生成してしまう。
データの読み込み
- テキストファイルの読み込み
readLines()
- 特定のフォルダー内の、特定の拡張子のファイルを一度に読み込む
multiread() multiread(パス, extention="拡張子")
大文字小文字・句読点の前処理 preprocess()
preprocess(データ, case="lower", remove.punct=T)
tmパッケージとの関連
concatenate(lapply(データ , "[", 1) )
n-gramの作成 ngram(データ, n=グラム数)
作成されたn-gramオブジェクト全体の出力 print(オブジェクト, output="full")
- output="truncated" とすると、全部は出ない。
一覧表形式で出力 get.phrasetable(オブジェクト)
- 入れ子式にすると便利
get.phrasetable(ngram(テキストデータ, n=グラム数))
n-gram表現を個別にすべて出力 get.ngrams(オブジェクト)
- 入れ子式にすると便利
get.ngrams(ngram(テキストデータ, n=グラム数))
疑似的文字列の生成 rcorpus(単語数, alphabet=letters[何文字から:何文字まで使って], maxwordlen=最大単語長)
> rcorpus(20, alphabet=letters[1:3], maxwordle=4) [1] "cbac ab a cbca aaa bc ba abaa ab a a caa bba abc cb a abbc a cc abbc" > rcorpus(100, alphabet=letters[1:26], maxwordle=6) [1] "zlmxc ohe fy djxz qe jmkfzy uk ovaqw ouc lg hc rdecm ouefue whu vr spwgh pdv vysz it f qo votwt dkhud rraq vc jehrrj yyjlrv on vdffwi gbp uozt o zdxej vxaxm mkir tqhsw iehg mtq tu dgxtr kq p oz xyq ca jxunw zs cmqeqo mg r vkbawi wnza qj phout dnu fcm a qow g zhz dttrvz fi v a ito wah i reh x f jxc mhdme tr uus w iumoy hzi kz qabux zlppn genmyw r iqkw pd majp dtk hf tvfxs kym dgrq ytolxc fycxd ea vkpgg uaxj nb ckcn jnwz xspu oci"
以下古いメモ
install.packages("ngram") library(ngram) > sample <- "I often hear that death penalty is reasonable for people whose family member or friend is killed by someone." > get.ngrams(ngram(sample)) [1] "death penalty" "is reasonable" "hear that" "friend is" "is killed" "for people" "whose family" [8] "by someone." "often hear" "people whose" "that death" "killed by" "I often" "member or" [15] "or friend" "family member" "reasonable for" "penalty is" > > get.ngrams(ngram(preprocess(sample, remove.punct=TRUE))) [1] "death penalty" "is reasonable" "hear that" "friend is" "is killed" "i often" "for people" [8] "whose family" "often hear" "people whose" "by someone" "that death" "killed by" "member or" [15] "or friend" "family member" "reasonable for" "penalty is" > print(ngram(preprocess(sample, remove.punct=TRUE)), output="full") death penalty | 1 is {1} | is reasonable | 1 for {1} | hear that | 1 death {1} | friend is | 1 killed {1} | is killed | 1 by {1} | i often | 1 hear {1} | for people | 1 whose {1} | whose family | 1 member {1} | often hear | 1 that {1} | people whose | 1 family {1} | by someone | 1 NULL {1} | that death | 1 penalty {1} | killed by | 1 someone {1} | member or | 1 friend {1} | or friend | 1 is {1} | family member | 1 or {1} | reasonable for | 1 people {1} | penalty is | 1 reasonable {1} |
https://sugiura-ken.org/wiki/