Extracting per-genre frequency distributions from the Brown Corpus. Besides the modal verbs, I also tried pronouns.
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...             (genre, word)
...             for genre in brown.categories()
...             for word in brown.words(categories=genre))
>>> genres = brown.categories()
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will', 'should']
>>> cfd.tabulate(conditions=genres, samples=modals)
                 can could  may might must will should
      adventure   46  151    5   58   27   50   15
 belles_lettres  246  213  207  113  170  236  102
      editorial  121   56   74   39   53  233   88
        fiction   37  166    8   44   55   52   35
     government  117   38  153   13  102  244  112
        hobbies  268   58  131   22   83  264   73
          humor   16   30    8    8    9   13    7
        learned  365  159  324  128  202  340  171
           lore  170  141  165   49   96  175   76
        mystery   42  141   13   57   30   20   29
           news   93   86   66   38   50  389   59
       religion   82   59   78   12   54   71   45
        reviews   45   40   45   26   19   58   18
        romance   74  193   11   51   45   43   32
science_fiction   16   49    4   12    8   16    3
>>> pronouns = ['my', 'your', 'his', 'her', 'our', 'their']
>>> cfd.tabulate(conditions=genres, samples=pronouns)
                  my your  his  her  our their
      adventure  168   59  776  444   39  156
 belles_lettres  209   51 1342  281  281  490
      editorial   47   32  244   37  120  124
        fiction  119   39  735  397   42  162
     government   27   50  141    3  144  174
        hobbies   39  271  238   16   77  152
          humor   68   38  137   62   30   49
        learned   33   18  437  129  133  359
           lore   47   57  496  302   87  300
        mystery  104   67  529  296   14   48
           news   34   13  399  103   55  219
       religion   71   40  132    8   77  113
        reviews    9   10  208   85   12   62
        romance  156  106  559  651   25  114
science_fiction   30   17   93   71    6   40
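To make explicit what ConditionalFreqDist does with the (genre, word) pairs, here is a minimal standard-library sketch on a toy list of pairs. The sample pairs and the helper name `conditional_counts` are made up for illustration; this is not NLTK's implementation, only the same counting idea.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Tally samples separately for each condition, e.g. words per genre."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Toy (genre, word) pairs, invented for illustration; not Brown Corpus data.
pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "her"), ("romance", "could"),
]

counts = conditional_counts(pairs)
for genre in sorted(counts):
    # A Counter returns 0 for unseen samples, so missing words need no special case.
    row = [counts[genre][w] for w in ("can", "could", "will")]
    print(genre, row)
```

On this toy input the loop prints `news [1, 0, 2]` and `romance [0, 2, 0]`, which is the same row-per-condition layout that tabulate() produces.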
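The CMU Pronouncing Dictionary corpus is a list of (word, pronunciation) pairs, so first extract just the words. Then pick out the words that occur more than once, i.e. that have several pronunciations, and count the distinct ones.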
>>> entries = nltk.corpus.cmudict.entries()
>>> words = [w for w, pron in entries]  # a word is repeated once per pronunciation
>>> wordfreq = nltk.FreqDist(words)
>>> multi_pron_words = [w for w in words if wordfreq[w] > 1]
>>> len(set(multi_pron_words))
9241
List the pronunciations of a word that has more than one, for example 'february'.
>>> [pron for w, pron in entries if w == 'february']
[['F', 'EH1', 'B', 'Y', 'AH0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'AH0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'R', 'UW0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'UW0', 'W', 'EH2', 'R', 'IY0'], ['F', 'EH1', 'B', 'Y', 'UW0', 'W', 'EH2', 'R', 'IY0']]
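The two steps above (counting repeated words, then listing one word's pronunciations) can also be done with a single word-to-pronunciations mapping. A minimal sketch, assuming `entries` is a list of (word, pronunciation) pairs shaped like cmudict.entries(); the short list here is a hand-picked stand-in so the example runs without the corpus, and `group_pronunciations` is a hypothetical helper name.

```python
from collections import defaultdict

def group_pronunciations(entries):
    """Map each word to the list of all pronunciations recorded for it."""
    by_word = defaultdict(list)
    for word, pron in entries:
        by_word[word].append(pron)
    return by_word

# Small stand-in for cmudict.entries(): two of the 'february' variants shown
# above plus one single-pronunciation word.
entries = [
    ("cat", ["K", "AE1", "T"]),
    ("february", ["F", "EH1", "B", "Y", "AH0", "W", "EH2", "R", "IY0"]),
    ("february", ["F", "EH1", "B", "AH0", "W", "EH2", "R", "IY0"]),
]

by_word = group_pronunciations(entries)
# Keep only the words that have more than one pronunciation.
multi = {w: prons for w, prons in by_word.items() if len(prons) > 1}
print(len(multi))
print(multi["february"])
```

With the real corpus, passing nltk.corpus.cmudict.entries() to the helper should select the same words as the FreqDist approach. NLTK also provides cmudict.dict(), which returns a comparable word-to-pronunciations mapping.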
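Use len(synset.lemmas()) to check whether a noun synset has more than one synonym (lemma). In NLTK 3, lemmas is a method; older NLTK 2 releases exposed it as an attribute. (I could not work out how to test whether a synset has hyponyms.)
>>> from nltk.corpus import wordnet as wn
>>> len([synset for synset in wn.all_synsets('n') if len(synset.lemmas()) > 1])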
40061
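On the open question of hyponyms: NLTK's Synset objects have a hyponyms() method that returns a list of direct hyponyms, so an empty list means there are none. A minimal sketch of that check follows. The names `has_hyponyms`, `count_with_hyponyms` and `StubSynset` are hypothetical, and the stub stands in for real synsets only so that the example runs without the WordNet data.

```python
def has_hyponyms(synset):
    """True if the synset has at least one direct hyponym."""
    return len(synset.hyponyms()) > 0

def count_with_hyponyms(synsets):
    """Count how many of the given synsets have direct hyponyms."""
    return sum(1 for s in synsets if has_hyponyms(s))

class StubSynset:
    """Stand-in exposing only hyponyms(), so the demo needs no corpus."""
    def __init__(self, name, hyponyms=()):
        self.name = name
        self._hyponyms = list(hyponyms)

    def hyponyms(self):
        return self._hyponyms

poodle = StubSynset("poodle")
dog = StubSynset("dog", hyponyms=[poodle])
print(has_hyponyms(dog), has_hyponyms(poodle))
print(count_with_hyponyms([dog, poodle]))

# With NLTK 3 and the WordNet data installed, the same helpers should apply:
#   from nltk.corpus import wordnet as wn
#   count_with_hyponyms(wn.all_synsets('n'))
```

The helpers only rely on a hyponyms() method, so they should work unchanged on real synsets; I have not included a count for the full noun hierarchy because it depends on the installed WordNet version.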