Posted by Eric Kidd
Tue, 29 Dec 2009 20:38:00 GMT
The WordNet database records many kinds of relationships between words: it can categorize words into hierarchies, list the parts of an object, and answer many similar questions.
The code below relies on the NLTK and NetworkX libraries for Python.
Categorizing words
What, exactly, is a dog? It’s a domestic animal and a carnivore, not to mention a physical entity (as opposed to an abstract entity, such as an idea). WordNet knows all these facts:

How do we generate this image? First, we look up the first entry for “dog” in WordNet. This returns a “synset”, or a set of words with equivalent meanings.
dog = wn.synset('dog.n.01')
Next, we compute the transitive closure of the hypernym relationship, or (in English) we look for all the categories to which “dog” belongs, and all the categories to which those categories belong, recursively:
graph = closure_graph(dog,
                      lambda s: s.hypernyms())
After that, we just pass the resulting graph to NetworkX for display.
The implementation
The closure_graph function repeatedly calls fn on the supplied synset, and uses the results to build a NetworkX graph. This code goes at the top of the file, so you can use wn and nx in your own code.
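from nltk.corpus import wordnet as wn
import networkx as nx

def closure_graph(synset, fn):
    # Build a directed graph of everything reachable from synset via fn.
    # Note: in NLTK 3 and later, name is a method, so write s.name() instead.
    seen = set()
    graph = nx.DiGraph()

    def recurse(s):
        # Visit each synset only once, even if several paths lead to it.
        if s not in seen:
            seen.add(s)
            graph.add_node(s.name)
            for s1 in fn(s):
                graph.add_node(s1.name)
                graph.add_edge(s.name, s1.name)
                recurse(s1)

    recurse(synset)
    return graph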
By using a high-quality graph library, we make it much easier to merge, analyze and display our graphs.
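To see what the closure computes without installing NLTK, NetworkX, or the WordNet data, here is a minimal sketch that uses only the standard library. The TOY_HYPERNYMS table and the closure_edges helper are hypothetical names invented for this illustration, and the table is a hand-written, heavily simplified stand-in for WordNet's real hierarchy. It is not the code used to produce the graphs in this post.

```python
# Toy hypernym table: hand-written for illustration, NOT read from WordNet.
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def closure_edges(start, fn):
    """Return (nodes, edges) reachable from start by repeatedly applying fn."""
    seen = set()
    edges = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already expanded via another path
        seen.add(node)
        for parent in fn(node):
            edges.add((node, parent))
            stack.append(parent)
    return seen, edges


nodes, edges = closure_edges("dog", lambda w: TOY_HYPERNYMS.get(w, []))
for child, parent in sorted(edges):
    print(child, "->", parent)
```

This version uses an explicit stack instead of recursion, which is one way to avoid hitting Python's recursion limit on deep hierarchies. The seen set plays the same role as in closure_graph: a node reached by two paths (here, "animal") is expanded only once.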
More graphs
Parts of the finger, generated with synset('finger.n.01') and part_meronyms:

Types of running, generated with synset('run.v.01') and hyponyms:

Tags NLP, Python
Posted by Eric Kidd
Mon, 28 Dec 2009 21:31:00 GMT
The Natural Language Toolkit for Python is a great framework for simple, non-probabilistic natural language processing. Here are some example snippets (and some troubleshooting notes).
Concordances
We can search for “dog” in Chesterton’s The Man Who Was Thursday:
>>> from nltk.book import *
>>> text9.concordance("dog", width=40)
Displaying 4 of 4 matches:
ead of a cat or a dog , it could not ha
d you ever hear a dog bark like that ?"
aid , " is that a dog -- anybody ' s do
og -- anybody ' s dog ?" There broke up
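For readers curious how a keyword-in-context display like this can be produced, here is a rough sketch in plain Python. The concordance function below is a hypothetical stand-in written for this example, not NLTK's implementation, and the token list is a made-up sample rather than the Chesterton text.

```python
def concordance(tokens, word, width=40):
    """Return one fixed-width context line per occurrence of word in tokens."""
    # Characters of context on each side of the keyword (at least 1).
    half = max((width - len(word) - 2) // 2, 1)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != word.lower():
            continue
        # Every token has at least one character, so `half` tokens on each
        # side are always enough to fill `half` characters of context.
        left = " ".join(tokens[max(0, i - half):i])[-half:]
        right = " ".join(tokens[i + 1:i + 1 + half])[:half]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{half}} {token} {right:<{half}}")
    return lines


tokens = "did you ever hear a dog bark like that ? is that a dog".split()
lines = concordance(tokens, "dog", width=40)
for line in lines:
    print(line)
```

Matching is case-insensitive here, which is an assumption of this sketch. The left context is cut from the right and the right context from the left, so partial words can appear at the edges, much like the "ead of a cat" fragment in the output above.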
Synonyms and categories
We can use WordNet to look up synonyms:
from nltk.corpus import wordnet
dog = wordnet.synset('dog.n.01')
print dog.lemma_names
This prints:
['dog', 'domestic_dog', 'Canis_familiaris']
We can also look up the “hypernyms”, or larger categories that include the word “dog”:
paths = dog.hypernym_paths()

def simple_path(path):
    return [s.lemmas[0].name for s in path]

for path in paths:
    print simple_path(path)
This prints:
['entity', 'physical_entity', 'object',
'whole', 'living_thing', 'organism',
'animal', 'domestic_animal', 'dog']
['entity', 'physical_entity', 'object',
'whole', 'living_thing', 'organism',
'animal', 'chordate', 'vertebrate',
'mammal', 'placental', 'carnivore',
'canine', 'dog']
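The two paths above exist because "dog" has more than one hypernym. As a rough illustration of how such paths can be enumerated, here is a sketch over a hand-written toy table. TOY_HYPERNYMS and this standalone hypernym_paths function are hypothetical, simplified stand-ins for illustration only; they do not reproduce WordNet's data or NLTK's implementation.

```python
# Toy hypernym table: hand-written for illustration, NOT read from WordNet.
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def hypernym_paths(word, table):
    """Return every path from a root (a word with no hypernyms) down to word."""
    parents = table.get(word, [])
    if not parents:
        return [[word]]  # a root is a path of length one
    paths = []
    for parent in parents:
        # Extend each path that reaches the parent with the current word.
        for path in hypernym_paths(parent, table):
            paths.append(path + [word])
    return paths


for path in hypernym_paths("dog", TOY_HYPERNYMS):
    print(path)
```

Each extra hypernym along the way multiplies the number of paths, which is why a word can have several routes up to "entity". The sketch assumes the table has no cycles.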
For more neat examples, take a look at the NLTK book.
Installation notes
While setting up NLTK, I bumped into a few problems.
Problem: The dispersion_plot function returns immediately without displaying anything.
Fix: Configure your matplotlib back-end correctly.
Problem: The nltk.app.concordance() GUI fails with the error:
out of stack space (infinite loop?)
Fix: Recompile Tcl with threads. On the Mac:
sudo port install tcl +threads
Tags NLP, Python
Posted by Eric Kidd
Mon, 28 Dec 2009 15:56:00 GMT
I’ve been looking at various libraries for natural language processing, and I’m pleasantly surprised by the tools created by the Python community. Some examples:
- The Python NLTK library provides parsers for many popular corpora, visualization tools, and a wide variety of simple natural language algorithms (though few of these are probabilistic).
- ConceptNet provides a simple semantic model of the world.
- NumPy (and SciPy) provide extensive support for linear algebra and data visualization.
- PyCUDA provides access to Nvidia GPUs for high-performance scientific computation, and it integrates with NumPy.
If you need to build a web crawler, there’s Twisted, which makes it easy to write fast, asynchronous networking code.
All in all, I usually prefer Ruby to Python, because I love Ruby’s metaprogramming support. But the Python community has built an impressive variety of scientific and linguistic tools. Many thanks to everybody who contributed to these projects!
Tags NLP, Python
Posted by Eric
Sun, 29 Sep 2002 00:00:00 GMT
The biggest challenge with spam filtering is reducing false positives--that is, finding the good mail among the spam. Even the best spam filters occasionally mistake legitimate e-mail for spam. For example, in some recent tests, bogofilter processed 18,000 e-mails with only 34 false positives. Unfortunately, several of these false positives were urgent e-mails from former clients. This unpleasant mistake wasn't necessary--the most important of these false positives could have been avoided with an automatic whitelisting system.
Tags Hacks, Probability, Python, Recommended, Spam