((NLTK) (is (fun)))

15 Jul 2009

Fun? Really?
Yup.

Isn’t it for those fancy computational linguistics guys?
NLTK is pretty fancy, but it is also very accessible and useful for curious newcomers. Think: low floor, high ceiling, just like Python itself.

Are you an NLP/NLTK expert?
Nope. This is my public service announcement to let you know that non-experts can use it.

Introduction

The Natural Language Toolkit [NLTK], a suite of open source Python modules for natural language processing [NLP] tasks, is at a minimum fun and easy to play with as a standalone language exploration tool. Beyond that, the possibilities of integrating it with your research and software are vast as far as this NLP dilettante can tell.

I attended a “Python in the Classroom” workshop in the Open Source Labs program of NECC 2009 several weeks ago and had the pleasure of speaking to one of the two educators presenting (Jeff Elkner) as well as two graduating high school seniors who had participated in the presentation. I have always thought of Python as a great learning language, and was excited to hear about how it is being presented in a high school environment. This post originated as an email to them.

Resources

The NLTK book is available for free online, and serves as an excellent introduction to the toolkit. If you are new to Python, working through the chapters and exercises in the NLTK book (which provides introductory Python instructions inline) would be a nice way to get your feet wet while learning about interesting applications of the language.

Another great resource is the NLTK API documentation. Since the modules are named in a meaningful way, it is fairly easy to browse the documentation, discovering and experimenting with things along the way.

Setup

We’re not going to write programs during this tutorial; we’ll just take a peek at some nice features of NLTK and learn to experiment with it in a Python interactive shell. In order to do that you’ll need Python, NLTK, and some NLTK data.

  1. Python: OS X and Linux users, you probably already have an acceptable version of Python (use 2.5 or 2.6, but not 3.0). Windows users, refer to the link in the next step, as the download page includes Python installation recommendations for you.
  2. NLTK: Follow instructions to install NLTK for your platform. NOTE: for this introduction, you only need to download NLTK; you do not need the optional packages listed on the download site (Numpy, Matplotlib, Prover9).
    Once you can enter ‘>>> import nltk’ at the Python prompt with no errors we are in business.
  3. NLTK data: Your NLTK installation came with a downloader tool for NLTK data. Download data for the WordNet corpus and the Treebank Part of Speech Tagger (Maximum entropy).
    # @ Unix command prompt
    python -m nltk.downloader wordnet
    python -m nltk.downloader maxent_treebank_pos_tagger
    
    OR 
    
    # @ Python prompt to open Tcl/Tk GUI
    >>> import nltk
    >>> nltk.download()
    (Navigate to Corpora -> wordnet and Models -> maxent_treebank_pos_tagger and press Download)
    

If you have any trouble with these instructions please refer to NLTK’s data installation page for further help on the downloader tool.

Simple example

Let’s take a look at some basic word tokenization functionality provided out of the box with NLTK.

>>> import nltk
>>> tokens = nltk.word_tokenize("OMG, NLTK is like so cool.")

>>> tokens
['OMG', ',', 'NLTK', 'is', 'like', 'so', 'cool', '.']

(NOTE: I’m going to assume that from now on you ‘import nltk’ if you restart your interactive session, and will leave it out of further examples.)

The word_tokenize() function uses NLTK’s currently recommended word tokenizer, TreebankWordTokenizer, and should be fed one sentence at a time. Notice that the resulting list includes the punctuation as tokens and not as part of the adjacent words. Not very fancy, but I’d take that over string.split() plus special handling for punctuation if I’m in the market for simple sentence handling.
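
To see what the punctuation handling buys you, here is a small comparison that uses only the standard library. The regular expression is a crude stand-in picked for illustration, not what TreebankWordTokenizer does internally: it keeps runs of word characters together and emits every other non-space character as its own token.

```python
import re

sentence = "OMG, NLTK is like so cool."

# Plain whitespace splitting leaves punctuation glued to the words.
split_tokens = sentence.split()

# Crude illustrative pattern (an assumption, not NLTK's algorithm):
# a run of word characters, or one non-space, non-word character.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
print(regex_tokens)
```

For this sentence the second list happens to match the word_tokenize() output shown earlier. The two part ways on trickier input such as contractions, where the regex chops "don't" into three pieces and a Treebank-style tokenizer does something more linguistically sensible.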

What I find most significant here is that the word_tokenize() function is there to simplify a common task for most users. Since many users (myself included) will want to use whatever the recommended tokenizer is, it’s nice of the NLTK team to give us some tools that we don’t have to become researchers to figure out. On the flip side, a linguist who needs a different method of tokenization is also in luck: the alternatives are similarly accessible, and the framework lets him/her configure the tool and access whatever data is needed.

Corpus linguistics

One of the cool features of NLTK is its corpus reader, which allows us to tap into interesting collections of texts that have been compiled and maintained by linguists. A corpus is organized to suit its domain – for instance see the filenames of the text files that make up the following three corpora:

>>> nltk.corpus.shakespeare.fileids()
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', 'macbeth.xml', 'merchant.xml', 'othello.xml', 'r_and_j.xml']

>>> nltk.corpus.stopwords.fileids()
['danish', 'dutch', 'english', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish']

>>> nltk.corpus.names.fileids()
['female.txt', 'male.txt']

Within the corpus and its text files, the text itself is also organized in various ways to suit the corpus’ intent and structure. The Brown Corpus, which contains texts representative of present-day American English, contains tagged words:

[from nltk_data/corpora/brown/cm02]
Hal/np ,/, the/at linguist/nn ,/, saw/vbd the/at glittering/vbg discs/nns and/cc necklaces/nns in/in terms/nns of/in the/at languages/nns spoken/vbn therein/rb ./.
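
The word/tag layout is simple enough to pick apart by hand, which gives a feel for what the corpus reader saves you from doing. The sketch below uses only the standard library on the excerpt above; with the Brown data installed, a call like nltk.corpus.brown.tagged_words() hands you the pairs directly.

```python
# One tagged sentence, copied from the Brown Corpus excerpt above.
line = ("Hal/np ,/, the/at linguist/nn ,/, saw/vbd the/at glittering/vbg "
        "discs/nns and/cc necklaces/nns in/in terms/nns of/in the/at "
        "languages/nns spoken/vbn therein/rb ./.")

# Split each token on its LAST slash so that ",/," becomes (",", ",").
tagged = [tuple(token.rsplit("/", 1)) for token in line.split()]

# Pull out the common nouns (tags "nn" and "nns" in this excerpt).
nouns = [word for word, tag in tagged if tag.startswith("nn")]

print(tagged[:4])
print(nouns)
```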

A more tool-like corpus, on the other hand, can be organized in a very simple way, as you can see with some Stopwords and Names content:

[from nltk_data/corpora/stopwords/english]
a
a's
able
about
above
...

[from nltk_data/corpora/stopwords/swedish]
alla
allt
att
av
blev
...

[from nltk_data/corpora/names/female.txt]
...
Zsa Zsa
Zsazsa
Zulema
Zuzana

[from nltk_data/corpora/names/male.txt]
Zippy
Zollie
Zolly
Zorro

Since we will be looking at the WordNet corpus in the next section, take a look at the text representation of a word in this more esoterically designed corpus:

[from nltk_data/corpora/wordnet/data.noun]
03452741 06 n 02 grand_piano 0 grand 0 004 @ 03928116 n 0000 ~ 02766792 n 0000 ~ 03086457 n 0000 %p 03654826 n 0000 | a piano with the strings on a horizontal harp-shaped frame; usually supported by three legs
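
For the curious, that line is less cryptic than it looks. The sketch below decodes it with plain string handling, following my reading of the WordNet database file layout (synset offset, lexicographer file number, part of speech, a hexadecimal word count, the words, a pointer count, the pointers, then the gloss after the bar). The POINTER_NAMES table is a hypothetical helper that covers only the three symbols appearing in this line.

```python
line = ("03452741 06 n 02 grand_piano 0 grand 0 004 @ 03928116 n 0000 "
        "~ 02766792 n 0000 ~ 03086457 n 0000 %p 03654826 n 0000 "
        "| a piano with the strings on a horizontal harp-shaped frame; "
        "usually supported by three legs")

# Hypothetical lookup table: only the pointer symbols used in this line.
POINTER_NAMES = {"@": "hypernym", "~": "hyponym", "%p": "part meronym"}

data, gloss = line.split(" | ", 1)
fields = data.split()

offset, lex_filenum, pos = fields[0], fields[1], fields[2]

# The word count is a two-digit hexadecimal number; each word is
# followed by a one-character lexical id, so words sit 2 fields apart.
word_count = int(fields[3], 16)
words = [fields[4 + 2 * i] for i in range(word_count)]

# The pointer count is decimal; each pointer takes 4 fields:
# symbol, target synset offset, target part of speech, source/target.
start = 4 + 2 * word_count
pointer_count = int(fields[start])
pointers = []
for i in range(pointer_count):
    symbol, target = fields[start + 1 + 4 * i: start + 3 + 4 * i]
    pointers.append((POINTER_NAMES[symbol], target))

print(words)
print(pointers)
print(gloss)
```

Reading the pointers back, a grand piano has one hypernym, two hyponyms, and one part, which is exactly the kind of relation the WordNet section goes on to explore through NLTK.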

You should be able to see that NLTK’s corpus reader gives you meaningful access to all sorts of data and allows you to start rather far along on the exploration path without worrying about how it’s all working behind the scenes. As far as our curiosity about NLP goes, this is great news because we can focus on the linguistics lessons straight away.

So, what about your own research? What if you have texts that you want to tag and analyze? Perhaps you are building a corpus that represents conversations amongst valley girls so that your valley girl detector tool will be all the more powerful. You might have various files collected over time as you recorded communication in different locations: “starbucks.txt, beach.txt, office.txt…” Or perhaps your style of collecting data relied on dates: “May2009.txt, June2009.txt…” Check out NLTK’s PlaintextCorpusReader, which will allow you to start building your corpus in no time. (I’m starting to feel like the iPhone ad when I talk about NLTK: “there’s a corpus for that!”, “there’s a module for that!”)
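
Here is a minimal sketch of that idea, written against a current Python 3 and NLTK 3 install rather than the versions listed in Setup. The file contents are made up for illustration, and a temporary directory stands in for wherever your recordings would really live. Reading words this way relies on the reader's default word tokenizer and should not need any downloaded NLTK data.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Invented sample "recordings"; real research data would go here.
samples = {
    "beach.txt": "OMG, the waves are like so gnarly.",
    "starbucks.txt": "Like, whatever. I totally need a latte.",
}

# Write the samples into a throwaway directory that acts as the corpus root.
corpus_dir = tempfile.mkdtemp()
for filename, text in samples.items():
    with open(os.path.join(corpus_dir, filename), "w") as handle:
        handle.write(text)

# Treat every .txt file under the root as part of the corpus.
corpus = PlaintextCorpusReader(corpus_dir, r".*\.txt")

file_ids = corpus.fileids()
beach_words = list(corpus.words("beach.txt"))

print(file_ids)
print(beach_words)
```

From here the corpus behaves like the built-in ones: you can ask for the words of a single file or of the whole collection.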

WordNet

WordNet is a large lexical database of English maintained at Princeton University (http://wordnet.princeton.edu/).

Let’s say that I want to go down a discovery path about the term piano (the instrument). Perhaps my application seeks to tie a community of musicians together and it would be useful to find paths amongst instruments as well as roots (hypernyms) and related items.

Let’s grab wordnet from the nltk.corpus namespace first:

>>> from nltk.corpus import wordnet as wn

Next, I will find the correct synset, which is a collection of synonymous words:

>>> wn.synsets("piano")
[Synset('piano.n.01'), Synset('piano.n.02'), Synset('piano.a.01'), Synset('piano.r.01')]

Uh oh, there are four synsets for our word. How do we know which one is correct? Although this won’t help programmatically, all of my human readers will find the definitions that WordNet includes to be helpful in narrowing it down:

>>> wn.synset('piano.n.01').definition
'a keyboard instrument that is played by depressing keys that cause hammers to strike tuned strings and produce sounds'
>>> wn.synset('piano.n.02').definition
'(music) low loudness'
>>> wn.synset('piano.a.01').definition
'used chiefly as a direction or description in music'
>>> wn.synset('piano.r.01').definition
'used as a direction in music; to be played relatively softly'

It is clear from the definitions that ‘piano.n.01’ is what I’m looking for. In order to discover computational techniques for analyzing these sets of words and their relations, you will want to learn more about synsets, lemmas, and more linguistics terminology, which Chapter 2 of the NLTK Book will give you a great start on. Since I am learning about these myself, I will defer to the better resource.

Before I continue to the next section (which will also send you off to the NLTK Book before my conclusion), I will just sneak in a few more cool show-and-tell aspects of WordNet:

[lexical "is a type of" relations: hyponyms]
>>> wn.synset('piano.n.01').hyponyms()
[Synset('upright.n.02'), Synset('grand_piano.n.01'), Synset('mechanical_piano.n.01')]

[lexical "is a" relations: hypernyms]
>>> wn.synset('piano.n.01').hypernyms()
[Synset('keyboard_instrument.n.01'), Synset('stringed_instrument.n.01'), Synset('percussion_instrument.n.01')]

[component "has a" relations: meronyms]
>>> wn.synset('piano.n.01').part_meronyms()
[Synset('piano_action.n.01'), Synset('keyboard.n.01'), Synset('fallboard.n.01'), Synset('sustaining_pedal.n.01'), Synset('soft_pedal.n.01'), Synset('piano_keyboard.n.01'), Synset('sounding_board.n.02')]
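
To show how relations like these can be chained into paths, here is a toy sketch that needs no WordNet data at all. It hard-codes only the hyponym and hypernym links printed above, so the "roots" it finds are merely the top of this tiny graph, not the real top of WordNet. The paths_to_roots helper is a hypothetical name made up for this example.

```python
# Parent -> children links, copied from the output shown above.
hyponyms = {
    "keyboard_instrument.n.01": ["piano.n.01"],
    "stringed_instrument.n.01": ["piano.n.01"],
    "percussion_instrument.n.01": ["piano.n.01"],
    "piano.n.01": ["upright.n.02", "grand_piano.n.01",
                   "mechanical_piano.n.01"],
}

# Invert the mapping to get child -> parents (hypernym) links.
hypernyms = {}
for parent, children in hyponyms.items():
    for child in children:
        hypernyms.setdefault(child, []).append(parent)


def paths_to_roots(name):
    """Return every chain from name up to a node with no parent."""
    parents = hypernyms.get(name, [])
    if not parents:
        return [[name]]
    return [[name] + path
            for parent in parents
            for path in paths_to_roots(parent)]


paths = paths_to_roots("grand_piano.n.01")
for path in paths:
    print(" -> ".join(path))
```

Because piano has three hypernyms, a grand piano reaches three different roots in this toy graph, which is a small taste of why paths between instruments are not always unique.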

One example that the book provides of how intricate these relationships can get uses the word “mint”. ‘mint.n.04’ is both part of ‘mint.n.02’ and one of the substances that ‘mint.n.05’ is made out of:


>>> wn.synset('mint.n.02').definition
'any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers'
>>> wn.synset('mint.n.04').definition
'the leaves of a mint plant used fresh or candied'
>>> wn.synset('mint.n.05').definition
'a candy that is flavored with a mint oil'

>>> wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
>>> wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]
>>>

I hope that I have whetted your appetite for your own corpus discovery activities with some of these examples. There are so many paths to go down with all of the available data and tools, and you’ve seen how easy it is to explore without the benefit of expertise.

Tagging, Classifying, Chunking, Parsing

Although I have personally been able to leverage the tagging, classifying, chunking, and parsing features of NLTK, I will again defer to the corresponding chapters of the NLTK book for lessons. The learning curve in these areas becomes steeper, but you can certainly start small and build upon your experience if you want to travel down those paths. I’ll finish with some high level introductions/examples and links to the relevant chapters:

Tagging (Chapter 5):
If you would like to easily identify some parts of speech, you can use the Treebank part-of-speech tagger that you downloaded during the setup to add POS tags (from the Penn Treebank tag set) to your tokens:

>>> tokens = nltk.word_tokenize("OMG, NLTK is like so cool.")
>>> nltk.pos_tag(tokens)
[('OMG', 'NNP'), (',', ','), ('NLTK', 'NNP'), ('is', 'VBZ'), ('like', 'IN'), ('so', 'RB'), ('cool', 'JJ'), ('.', '.')]

Classifying (Chapter 6):
This is where you can start building your valley girl detector tool.
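
As a very rough starting point, here is a toy sketch assuming a current Python 3 and NLTK 3 install. The marker words, the four training sentences, and the labels are all invented for illustration; a real detector would need far more data and better features. Training a Naive Bayes classifier on hand-built feature dictionaries should not require any downloaded NLTK data.

```python
import re

import nltk

# Hypothetical marker words; pick your own after studying real data.
MARKERS = ("like", "totally", "omg")


def features(sentence):
    """Map a sentence to yes/no features, one per marker word."""
    words = set(re.findall(r"\w+", sentence.lower()))
    return {"contains(%s)" % marker: (marker in words) for marker in MARKERS}


# Tiny invented training set: (sentence, label).
training = [
    ("OMG, that is like totally awesome.", "valley"),
    ("Like, whatever.", "valley"),
    ("The meeting is at noon.", "plain"),
    ("Please review the report.", "plain"),
]

train_set = [(features(text), label) for text, label in training]
classifier = nltk.NaiveBayesClassifier.train(train_set)

guess = classifier.classify(features("OMG, like, no way."))
other = classifier.classify(features("The report is due at noon."))

print(guess)
print(other)
```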

Chunking (Chapter 7):
[When I use] [NLTK] [I learn] [cool things] [about] [languages]
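
For a concrete (if tiny) taste, the sketch below chunks the tagged sentence from the Tagging example using a one-rule grammar. It assumes NLTK 3, where tree nodes expose label(), and the rule itself (one or more proper nouns make a noun phrase) is just an illustration, not a serious grammar. The tags are typed in by hand, so no tagger data is needed.

```python
import nltk

# Tagged tokens copied from the pos_tag() output in the Tagging example.
tagged = [('OMG', 'NNP'), (',', ','), ('NLTK', 'NNP'), ('is', 'VBZ'),
          ('like', 'IN'), ('so', 'RB'), ('cool', 'JJ'), ('.', '.')]

# Illustrative rule: a run of one or more proper nouns forms an NP chunk.
chunker = nltk.RegexpParser("NP: {<NNP>+}")
tree = chunker.parse(tagged)

# Collect the text of every NP chunk found in the sentence.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]

print(noun_phrases)
```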

Parsing (Chapter 8):
I cheated with the title of this blog post. Here is the real parse tree:

(S
    (NP (NNP NLTK))
    (VP (VBZ is)
      (NP (NN fun))))
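
If you want to poke at a tree like this yourself, NLTK can read the bracketed notation directly. This sketch assumes NLTK 3 or later, where the method is called Tree.fromstring, and it needs no downloaded data. Note that the reader is picky about balanced parentheses.

```python
from nltk import Tree

# The parse of "NLTK is fun" in bracketed form, parentheses balanced.
tree = Tree.fromstring("(S (NP (NNP NLTK)) (VP (VBZ is) (NP (NN fun))))")

root_label = tree.label()   # category of the whole sentence
words = tree.leaves()       # the words, left to right
tagged = tree.pos()         # (word, part-of-speech) pairs

print(root_label)
print(words)
print(tagged)
```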

Conclusion

I have seen many articles about NLTK, yet I speak to many developers and students who are not aware of it. Having previously assumed that the toolkit wouldn’t be very useful for me unless I studied computational linguistics, I was pleasantly surprised to find out that I was wrong. As for the students who are learning about Computer Science through Python, I’m sure you can think of some ways to add language analysis to all the fun – or at the very least to impress your English teachers with your knowledge of hyponyms.

>>> from nltk.chat import rude
>>> rude.demo()
Talk to the program by typing in plain English, using normal upper-
and lower-case letters and punctuation.  Enter "quit" when done.
========================================================================
I suppose I should say hello.
>I just wanted to show the readers how one of the chatbot demos works.
Either become more thrilling or get lost, buddy.
>Ok, bye.
Change the subject before I die of fatal boredom.

That wasn’t very nice. Maybe I’ll let the zen chatbot help me wrap this up:

>>> from nltk.chat import zen

>>> zen.demo()
***************************************************************************
                                Zen Chatbot!
***************************************************************************
         "Look beyond mere words and letters - look into your mind"
* Talk your way to truth with Zen Chatbot.
* Type 'quit' when you have had enough.
***************************************************************************
Welcome, my child.
>Can you help me end my blog post?
I probably can, but I may not.
>Ok, whatever.  I'll just say bye then!
I pour out a cup of water. Is the cup empty?