Extracting per-genre word distributions from the Brown Corpus. In addition to the modal verbs, I also tried possessive pronouns.
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = brown.categories()
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will', 'should']
>>> cfd.tabulate(conditions=genres, samples=modals)
                   can  could    may  might   must   will should
      adventure     46    151      5     58     27     50     15
 belles_lettres    246    213    207    113    170    236    102
      editorial    121     56     74     39     53    233     88
        fiction     37    166      8     44     55     52     35
     government    117     38    153     13    102    244    112
        hobbies    268     58    131     22     83    264     73
          humor     16     30      8      8      9     13      7
        learned    365    159    324    128    202    340    171
           lore    170    141    165     49     96    175     76
        mystery     42    141     13     57     30     20     29
           news     93     86     66     38     50    389     59
       religion     82     59     78     12     54     71     45
        reviews     45     40     45     26     19     58     18
        romance     74    193     11     51     45     43     32
science_fiction     16     49      4     12      8     16      3
>>> pronouns = ['my', 'your', 'his', 'her', 'our', 'their']
>>> cfd.tabulate(conditions=genres, samples=pronouns)
                    my   your    his    her    our  their
      adventure    168     59    776    444     39    156
 belles_lettres    209     51   1342    281    281    490
      editorial     47     32    244     37    120    124
        fiction    119     39    735    397     42    162
     government     27     50    141      3    144    174
        hobbies     39    271    238     16     77    152
          humor     68     38    137     62     30     49
        learned     33     18    437    129    133    359
           lore     47     57    496    302     87    300
        mystery    104     67    529    296     14     48
           news     34     13    399    103     55    219
       religion     71     40    132      8     77    113
        reviews      9     10    208     85     12     62
        romance    156    106    559    651     25    114
science_fiction     30     17     93     71      6     40
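To make clearer what the conditional frequency distribution is counting, here is a minimal standard-library sketch. It is not how NLTK implements `ConditionalFreqDist`; the helper names `count_by_condition` and `format_table` and the toy pairs are invented for illustration only.

```python
from collections import Counter, defaultdict


def count_by_condition(pairs):
    """Count samples separately for each condition, given (condition, sample) pairs."""
    counts = defaultdict(Counter)
    for condition, sample in pairs:
        counts[condition][sample] += 1
    return counts


def format_table(counts, conditions, samples):
    """Render the counts as plain text, one row per condition."""
    label_width = max(len(c) for c in conditions)
    widths = [max(len(s), 4) for s in samples]
    header = " " * label_width + "".join(
        f" {s:>{w}}" for s, w in zip(samples, widths))
    rows = [header]
    for c in conditions:
        # A Counter returns 0 for samples it has never seen.
        cells = "".join(
            f" {counts[c][s]:>{w}}" for s, w in zip(samples, widths))
        rows.append(f"{c:>{label_width}}" + cells)
    return "\n".join(rows)


# Toy data standing in for (genre, word) pairs from the Brown Corpus.
toy_pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]
toy_counts = count_by_condition(toy_pairs)
print(format_table(toy_counts, ["news", "romance"], ["can", "could", "will"]))
```

Each genre gets its own counter, so a table row is simply that counter looked up for the chosen samples.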
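The CMU Pronouncing Dictionary is stored as (word, pronunciation) pairs, so first extract only the words. Then pick out the words that occur more than once, that is, words with more than one pronunciation, and count them.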
>>> entries = nltk.corpus.cmudict.entries()
>>> words = [w for w, pron in entries]
>>> wordfreq = nltk.FreqDist(words)
>>> multi_pron_words = [w for w in words if wordfreq[w] > 1]
>>> len(set(multi_pron_words))
9241
List every pronunciation of a word that has more than one, here 'february'.
>>> [pron for w, pron in entries if w == 'february']
[['F', 'EH1', 'B', 'Y', 'AH0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'AH0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'R', 'UW0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'UW0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'Y', 'UW0', 'W', 'EH2', 'R', 'IY0']]
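The counting and the lookup above can also be done in one pass by grouping pronunciations under each word. This is only a sketch: the helper names `group_pronunciations` and `words_with_variants` are made up here, and the toy entries merely imitate the (word, phoneme list) shape returned by `cmudict.entries()`.

```python
from collections import defaultdict


def group_pronunciations(entries):
    """Map each word to the list of its pronunciations."""
    by_word = defaultdict(list)
    for word, pron in entries:
        by_word[word].append(pron)
    return by_word


def words_with_variants(by_word):
    """Return, in sorted order, the words that have more than one pronunciation."""
    return sorted(w for w, prons in by_word.items() if len(prons) > 1)


# Toy entries in the same shape as cmudict.entries(); not the real corpus.
toy_entries = [
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
    ("cat", ["K", "AE1", "T"]),
]
toy_index = group_pronunciations(toy_entries)
print(words_with_variants(toy_index))
print(toy_index["fire"])
```

With the real corpus, `nltk.corpus.cmudict.entries()` can be passed to `group_pronunciations` instead of the toy list. NLTK also provides `cmudict.dict()`, which returns a similar mapping from each word to its pronunciations.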
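Use len(synset.lemmas()) to check whether a synset has more than one synonym (lemma). (I could not work out how to test whether a synset has hyponyms.)
>>> from nltk.corpus import wordnet as wn
>>> # In NLTK 3, lemmas() is a method; older releases exposed it as the attribute synset.lemmas.
>>> len([synset for synset in wn.all_synsets('n') if len(synset.lemmas()) > 1])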
40061
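On the open question about hyponyms: a synset's `hyponyms()` method returns the list of its direct hyponyms, so an empty list means it has none. Below is a minimal sketch, assuming NLTK 3, where `hyponyms()` is a method. The helper names are invented for this note, and the WordNet import sits inside a function so that the counting logic can be read and tried without the corpus installed.

```python
def count_without_hyponyms(synsets):
    """Count the synsets whose hyponyms() list is empty."""
    return sum(1 for s in synsets if not s.hyponyms())


def fraction_without_hyponyms(synsets):
    """Return the share of synsets that have no hyponyms (0.0 for no input)."""
    synsets = list(synsets)
    if not synsets:
        return 0.0
    return count_without_hyponyms(synsets) / len(synsets)


def noun_fraction_without_hyponyms():
    """Apply the check to all WordNet noun synsets (needs the WordNet data)."""
    # Imported here so the helpers above do not require NLTK.
    from nltk.corpus import wordnet as wn
    return fraction_without_hyponyms(wn.all_synsets('n'))
```

Calling `noun_fraction_without_hyponyms()` in a session with WordNet downloaded gives the proportion of noun synsets that have no hyponyms. Instance hyponyms are reported separately by `instance_hyponyms()` and are not counted here.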