Visualizing WordNet relationships as graphs

Posted by Eric Kidd Tue, 29 Dec 2009 20:38:00 GMT

The WordNet database contains all sorts of interesting relationships between words: it can categorize words into hierarchies, find the parts of an object, and answer many other questions.

The code below relies on the NLTK and NetworkX libraries for Python.

Categorizing words

What, exactly, is a dog? It’s a domestic animal and a carnivore, not to mention a physical entity (as opposed to an abstract entity, such as an idea). WordNet knows all these facts:

[Figure: the hypernym graph for “dog”, as drawn by NetworkX]

How do we generate this image? First, we look up the first entry for “dog” in WordNet. This returns a “synset”, or a set of words with equivalent meanings.

dog = wn.synset('dog.n.01')
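
To double-check that we have the right sense, we can print the synset’s gloss (a quick aside: in the NLTK of this era, definition is a plain attribute, not a method):

print dog.definition   # prints the WordNet gloss for this sense of "dog"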

Next, we compute the transitive closure of the hypernym relationship, or (in English) we look for all the categories to which “dog” belongs, and all the categories to which those categories belong, recursively:

graph = closure_graph(dog,
                      lambda s: s.hypernyms())

After that, we just pass the resulting graph to NetworkX for display:

nx.draw_graphviz(graph)
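
Note that draw_graphviz lays the graph out with Graphviz (via pygraphviz) and renders it through matplotlib, so in a stand-alone script you may also need an explicit show call. A minimal sketch:

import matplotlib.pyplot as plt

nx.draw_graphviz(graph)   # Graphviz layout, rendered with matplotlib
plt.show()                # required outside interactive sessions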

The implementation

The closure_graph function repeatedly calls fn on the supplied synset (and on each synset it returns, recursively), and uses the results to build a NetworkX graph. This code goes at the top of the file, so you can use wn and nx in your own code.

from nltk.corpus import wordnet as wn
import networkx as nx

def closure_graph(synset, fn):
    seen = set()               # synsets we've already expanded
    graph = nx.DiGraph()

    def recurse(s):
        if s not in seen:
            seen.add(s)
            graph.add_node(s.name)
            for s1 in fn(s):   # fn maps a synset to its related synsets
                graph.add_node(s1.name)
                graph.add_edge(s.name, s1.name)
                recurse(s1)

    recurse(synset)
    return graph

By using a high-quality graph library, we make it much easier to merge, analyze and display our graphs.
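
For example, here’s a hedged sketch (only standard NetworkX calls, plus the closure_graph function above) that merges the hypernym closures for “dog” and “cat”, then finds the chain of categories leading from “dog” up to “entity”:

dog_graph = closure_graph(wn.synset('dog.n.01'),
                          lambda s: s.hypernyms())
cat_graph = closure_graph(wn.synset('cat.n.01'),
                          lambda s: s.hypernyms())

merged = nx.compose(dog_graph, cat_graph)   # union of nodes and edges
print nx.shortest_path(merged, 'dog.n.01', 'entity.n.01')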

More graphs

Parts of the finger, generated with synset('finger.n.01') and part_meronyms:

Types of running, generated with synset('run.v.01') and hyponyms:
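
For reference, here’s a sketch of how those two graphs can be built with the same closure_graph helper (a sketch only, using relationships NLTK exposes on synsets):

finger_graph = closure_graph(wn.synset('finger.n.01'),
                             lambda s: s.part_meronyms())
run_graph = closure_graph(wn.synset('run.v.01'),
                          lambda s: s.hyponyms())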


Experimenting with NLTK

Posted by Eric Kidd Mon, 28 Dec 2009 21:31:00 GMT

The Natural Language Toolkit for Python is a great framework for simple, non-probabilistic natural language processing. Here are some example snippets (and some troubleshooting notes).

Concordances

We can search for “dog” in Chesterton’s The Man Who Was Thursday:

>>> from nltk.book import *
>>> text9.concordance("dog", width=40)
Displaying 4 of 4 matches:
ead of a cat or a dog , it could not ha
d you ever hear a dog bark like that ?"
aid , " is that a dog -- anybody ' s do
og -- anybody ' s dog ?" There broke up

Synonyms and categories

We can use WordNet to look up synonyms:

from nltk.corpus import wordnet

dog = wordnet.synset('dog.n.01')
print dog.lemma_names

This prints:

['dog', 'domestic_dog', 'Canis_familiaris']

We can also look up the “hypernyms”, or larger categories that include the word “dog”:

paths = dog.hypernym_paths()

def simple_path(path):
    # Reduce each synset along the path to its first lemma's name.
    return [s.lemmas[0].name for s in path]

for path in paths:
    print simple_path(path)

This prints:

['entity', 'physical_entity', 'object',
 'whole', 'living_thing', 'organism',
 'animal', 'domestic_animal', 'dog']
['entity', 'physical_entity', 'object',
 'whole', 'living_thing', 'organism',
 'animal', 'chordate', 'vertebrate',
 'mammal', 'placental', 'carnivore',
 'canine', 'dog']

For more neat examples, take a look at the NLTK book.

Installation notes

While setting up NLTK, I bumped into a few problems.

Problem: The dispersion_plot function returns immediately without displaying anything.

Fix: Configure matplotlib to use an interactive, GUI-capable back-end.
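
A minimal sketch (assuming the Tk back-end is available on your system; substitute whichever back-end you actually have):

import matplotlib
matplotlib.use('TkAgg')   # must run before pyplot/pylab is imported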

Problem: The nltk.app.concordance() GUI fails with the error:

out of stack space (infinite loop?)

Fix: Recompile Tcl with threads. On the Mac:

sudo port install tcl +threads


Interesting Python libraries for natural language processing

Posted by Eric Kidd Mon, 28 Dec 2009 15:56:00 GMT

I’ve been looking at various libraries for natural language processing, and I’m pleasantly surprised by the tools created by the Python community. Some examples:

  • The Python NLTK library provides parsers for many popular corpora, visualization tools, and a wide variety of simple natural language algorithms (though few of these are probabilistic).
  • ConceptNet provides a simple semantic model of the world.
  • NumPy (and SciPy) provide extensive support for linear algebra and data visualization.
  • PyCUDA provides access to Nvidia GPUs for high-performance scientific computation, and it integrates with NumPy.

If you need to build a web crawler, there’s Twisted, which makes it easy to write fast, asynchronous networking code.
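
As a taste, here’s a hedged sketch of an asynchronous page fetch using Twisted’s getPage (example.com is a placeholder; a real crawler would add error handling and politeness delays):

from twisted.internet import reactor
from twisted.web.client import getPage

def handle_page(data):
    print len(data)                  # runs when the page body arrives

d = getPage('http://example.com/')   # returns a Deferred immediately
d.addCallback(handle_page)
d.addBoth(lambda _: reactor.stop())
reactor.run()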

All in all, I usually prefer Ruby to Python, because I love Ruby’s metaprogramming support. But the Python community has built an impressive variety of scientific and linguistic tools. Many thanks to everybody who contributed to these projects!


Bayesian Whitelisting: Finding the Good Mail Among the Spam

Posted by Eric Sun, 29 Sep 2002 00:00:00 GMT

The biggest challenge with spam filtering is reducing false positives--that is, finding the good mail among the spam. Even the best spam filters occasionally mistake legitimate e-mail for spam. For example, in some recent tests, bogofilter processed 18,000 e-mails with only 34 false positives. Unfortunately, several of these false positives were urgent e-mails from former clients. This unpleasant mistake wasn't necessary--the most important of these false positives could have been avoided with an automatic whitelisting system.

