The Natural Language Toolkit (NLTK) for Python is a great framework for simple, non-probabilistic natural language processing. Here are some example snippets, along with some troubleshooting notes.


We can search for “dog” in Chesterton’s The Man Who Was Thursday:

>>> from nltk.book import *
>>> text9.concordance("dog", width=40)
Displaying 4 of 4 matches:
ead of a cat or a dog , it could not ha
d you ever hear a dog bark like that ?"
aid , " is that a dog -- anybody ' s do
og -- anybody ' s dog ?" There broke up
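To make clear what a concordance is doing, here is a minimal keyword-in-context sketch that uses only the standard library. It is not NLTK's implementation: the function name `concordance_lines`, the regex tokenizer, and the way the width is split between left and right context are all assumptions made for illustration.

```python
import re


def concordance_lines(text, word, width=40):
    """Return one keyword-in-context line per occurrence of `word`.

    Hypothetical helper for illustration; NLTK's own concordance
    may tokenize and trim the context differently.
    """
    # Crude tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Characters of context shown on each side of the keyword.
    half = max((width - len(word) - 2) // 2, 1)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != word.lower():
            continue
        # `half` tokens always cover at least `half` characters once joined.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        # Pad both sides so the keyword lines up in a single column.
        lines.append(f"{left:>{half}} {token} {right:<{half}}")
    return lines


sample = "Did you ever hear a dog bark like that? Is that a dog, anybody's dog?"
for line in concordance_lines(sample, "dog"):
    print(line)
```

Padding the left context to a fixed width is what puts every match in the same column, which is the property that makes concordance output easy to scan.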

Synonyms and categories

We can use WordNet to look up synonyms:

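from nltk.corpus import wordnet

# Look up the first noun sense of "dog".
dog = wordnet.synset('dog.n.01')
# In NLTK 3, lemma_names is a method (older releases exposed it as an attribute).
print(dog.lemma_names())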

This prints:

['dog', 'domestic_dog', 'Canis_familiaris']

We can also look up the “hypernyms”, or larger categories that include the word “dog”:

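# Each path is a list of synsets running from the root ('entity') down to 'dog'.
paths = dog.hypernym_paths()

def simple_path(path):
    # Keep only the first lemma name of each synset.
    # In NLTK 3, lemmas and name are methods (older releases exposed attributes).
    return [s.lemmas()[0].name() for s in path]

for path in paths:
    print(simple_path(path))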

This prints:

['entity', 'physical_entity', 'object',
 'whole', 'living_thing', 'organism',
 'animal', 'domestic_animal', 'dog']
['entity', 'physical_entity', 'object',
 'whole', 'living_thing', 'organism',
 'animal', 'chordate', 'vertebrate',
 'mammal', 'placental', 'carnivore',
 'canine', 'dog']
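There are two paths because “dog” has two direct hypernyms, and each one leads back to the root by a different route. The sketch below reproduces that with a toy graph copied by hand from the output above. The `HYPERNYMS` dictionary and the `toy_hypernym_paths` function are made up for illustration; they are not WordNet data structures or NLTK's implementation.

```python
# Toy hypernym graph: each word maps to its direct hypernyms (parents).
# Hand-copied from the two paths printed above; this is not WordNet itself.
HYPERNYMS = {
    "dog": ["domestic_animal", "canine"],
    "domestic_animal": ["animal"],
    "canine": ["carnivore"],
    "carnivore": ["placental"],
    "placental": ["mammal"],
    "mammal": ["vertebrate"],
    "vertebrate": ["chordate"],
    "chordate": ["animal"],
    "animal": ["organism"],
    "organism": ["living_thing"],
    "living_thing": ["whole"],
    "whole": ["object"],
    "object": ["physical_entity"],
    "physical_entity": ["entity"],
    "entity": [],
}


def toy_hypernym_paths(word, graph=HYPERNYMS):
    """Return every path from a root (a word with no parents) down to `word`."""
    parents = graph[word]
    if not parents:
        # A root is a one-element path on its own.
        return [[word]]
    # Extend every path that reaches a parent with the current word.
    return [
        path + [word]
        for parent in parents
        for path in toy_hypernym_paths(parent, graph)
    ]


for path in toy_hypernym_paths("dog"):
    print(path)
```

The recursion branches only where a word has more than one parent, so the number of paths is determined by how often the hierarchy forks on the way up.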

For more neat examples, take a look at the NLTK book.

Installation notes

While setting up NLTK, I bumped into a few problems.

Problem: The dispersion_plot function returns immediately without displaying anything.

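Fix: Configure matplotlib to use an interactive (GUI) back-end, for example through the backend setting in your matplotlibrc file. A non-interactive back-end can render to files but will not open a window.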

Problem: The GUI fails with the error:

out of stack space (infinite loop?)

Fix: Recompile Tcl with thread support. On a Mac with MacPorts:

sudo port install tcl +threads