I was so proud of my newly-efficient navigation skills in Paris, only to be foiled in my attempt to get home from the library for the 3rd time in a row :(
I love my NY grid.
One thing I love about Paris, where I’m living for a few months, is my ability to get lost just about anywhere even though I know the city well. I lose my sense of direction, do not recognize street names, and walk in all sorts of round-about paths (apart from some places in Central Park and the West Village, that is very difficult to do in NYC!)… On the first day I arrived, I couldn’t find my apartment after an afternoon stroll and had no idea what the address was or how to contact my landlady since I’d suspended my phone/data plan service. I eventually found it through clues like the location of the Monoprix on the corner and the distinct “Gardien” sign that my landlady had pointed out to me on our way in from the airport, as well as my floor number and where my door was in relation to the elevator. This experience made me oddly pleased, because I felt like I had successfully solved my scatter-brained problem in an old-fashioned way. I will happily pick up my Blackberry upon returning to New York in February, but the knowledge that the world doesn’t end if I can’t quickly look up a map or contact someone for help is comforting.
The traffic volume on our available communication tools is overwhelming for me. Many are terrific, and I love them, including Twitter, but regardless of their benefits I find that my attention span suffers tremendously when I’m using them. By last year, I could juggle many thoughts, tasks, and activities at once but could no longer focus on one at a time, and I had stopped reading books and articles because the highlights were all I had time for. Even worse, my own output was also fragmented. I have found that taking a pause and evaluating tools more objectively has been a rewarding and fun exercise so far.
Current top tools in my evil plan:
Looking around at the gadgets that are most crucial to my livelihood and happiness, including my Thinkpad X301, magical (I don’t want to understand how the e-ink works) Kindle, Blackberry Curve, and second-generation (i.e., old) iPod Nano, I find myself stunned to realize that only the last product was made by Apple… And what about software? I’m using standard Unix tools for development, including vim as my primary text editor (I do miss TextMate, but apparently not enough), Firefox and Opera browsers, an open-source music management tool that comes with Ubuntu, and my emusic subscription for online music shopping.
What happened to the OS X fangirl who attended WWDC ‘05 as a student? I would never have predicted that I’d stray so far from Apple products, and yet one by one I’ve drifted:
Here’s hoping for real keyboards on smaller Apple devices in the future and smaller laptops. Or (gulp) maybe we’re just no longer a match.
Fun? Really?
Yup.
Isn’t it for those fancy computational linguistics guys?
NLTK is pretty fancy, but it is also very accessible and useful for curious newcomers. Think: low floor, high ceiling, just like Python itself.
Are you an NLP/NLTK expert?
Nope. This is my public service announcement to let you know that non-experts can use it.
The Natural Language Toolkit [NLTK], a suite of open source Python modules for natural language processing [NLP] tasks, is at a minimum fun and easy to play with as a standalone language exploration tool. Beyond that, the possibilities of integrating it with your research and software are vast as far as this NLP dilettante can tell.
I attended a “Python in the Classroom” workshop in the Open Source Labs program of NECC 2009 several weeks ago and had the pleasure of speaking to one of the two educators presenting (Jeff Elkner) as well as two graduating high school seniors who had participated in the presentation. I have always thought of Python as a great learning language, and was excited to hear about how it is being presented in a high school environment. This post originated as an email to them.
The NLTK book is available for free online, and serves as an excellent introduction to the toolkit. If you are new to Python, working through the chapters and exercises in the NLTK book (which provides introductory Python instructions inline) would be a nice way to get your feet wet while learning about interesting applications of the language.
Another great resource is the NLTK API documentation. Since the modules are named in a meaningful way it is fairly easy to browse the documentation and discover and experiment with stuff along the way.
We’re not going to write programs during this tutorial… just taking a peek at some nice features of NLTK and learning to experiment with it in a Python interactive shell. In order to do that you’ll need Python, NLTK, and some NLTK data.
# @ Unix command prompt
python -m nltk.downloader wordnet
python -m nltk.downloader maxent_treebank_pos_tagger

OR

# @ Python prompt to open Tcl/Tk GUI
>>> import nltk
>>> nltk.download()
(Navigate to Corpora -> wordnet and Models -> maxent_treebank_pos_tagger and press Download)
If you have any trouble with these instructions please refer to NLTK’s data installation page for further help on the downloader tool.
Let’s take a look at some basic word tokenization functionality provided out of the box with NLTK.
>>> import nltk
>>> tokens = nltk.word_tokenize("OMG, NLTK is like so cool.")
>>> tokens
['OMG', ',', 'NLTK', 'is', 'like', 'so', 'cool', '.']
(NOTE: I’m going to assume that from now on you ‘import nltk’ if you restart your interactive session, and will leave it out of further examples.)
The word_tokenize() function uses NLTK’s currently recommended word tokenizer, TreebankWordTokenizer, and should be fed one sentence at a time. Notice that the resulting list includes the punctuation as tokens and not as part of the adjacent words. Not very fancy, but I’d take that over string.split() plus special handling for punctuation if I’m in the market for simple sentence handling.
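To make the comparison with string.split() concrete, here is a rough sketch using only the standard library. The regular expression is my own stand-in for illustration, not what TreebankWordTokenizer does internally, and it will treat contractions and other edge cases differently.

```python
import re

sentence = "OMG, NLTK is like so cool."

# Plain whitespace splitting leaves punctuation glued to the words.
naive_tokens = sentence.split()

# Hypothetical stand-in for a real tokenizer: runs of word characters,
# or any single character that is neither a word character nor whitespace.
rough_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(naive_tokens)  # ['OMG,', 'NLTK', 'is', 'like', 'so', 'cool.']
print(rough_tokens)  # ['OMG', ',', 'NLTK', 'is', 'like', 'so', 'cool', '.']
```

The second list happens to match the word_tokenize() output for this sentence, but that is a coincidence of a simple input rather than a guarantee.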
What I find most significant here is that the word_tokenize() function is there to simplify a common task for most users. Since many users (myself included) will want to use whatever the recommended tokenizer is, it’s nice of the NLTK team to give us some tools that we don’t have to become researchers to figure out. On the flip side, a linguist who needs to use a different method of tokenization with similarly accessible NLTK tools is also in luck with his/her ability to configure the tool and access whatever data is needed via the framework.
One of the cool features of NLTK is its corpus reader, which allows us to tap into interesting collections of texts that have been compiled and maintained by computational linguists. A corpus is organized to suit its domain – for instance see the filenames of the text files that make up the following three corpora:
>>> nltk.corpus.shakespeare.fileids()
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', 'macbeth.xml', 'merchant.xml', 'othello.xml', 'r_and_j.xml']
>>> nltk.corpus.stopwords.fileids()
['danish', 'dutch', 'english', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish']
>>> nltk.corpus.names.fileids()
['female.txt', 'male.txt']
Within the corpus and its text files, the text itself is also organized in various ways to suit the corpus’ intent and structure. The Brown Corpus, which contains texts representative of present-day American English, contains tagged words:
[from nltk_data/grammars/brown/cm02]
Hal/np ,/, the/at linguist/nn ,/, saw/vbd the/at glittering/vbg discs/nns and/cc necklaces/nns in/in terms/nns of/in the/at languages/nns spoken/vbn therein/rb ./.
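If you are curious what reading that word/tag format involves, here is a minimal sketch that splits the start of the sentence above into (word, tag) pairs by hand. The helper name split_tagged is made up for this example. The corpus reader does this work for you, so treat it as a peek under the hood rather than something you need to write.

```python
# The first few tokens of the Brown Corpus sentence shown above.
line = "Hal/np ,/, the/at linguist/nn ,/, saw/vbd the/at glittering/vbg discs/nns"

def split_tagged(token):
    """Split 'word/tag' on the last slash so a word containing '/' survives."""
    word, _, tag = token.rpartition("/")
    return word, tag

tagged = [split_tagged(token) for token in line.split()]

print(tagged[:4])
# [('Hal', 'np'), (',', ','), ('the', 'at'), ('linguist', 'nn')]
```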
A more tool-like corpus, on the other hand, can be organized in a very simple way, as you can see with some Stopwords and Names content:
[from nltk_data/grammars/stopwords/english]
a a's able about above ...

[from nltk_data/grammars/stopwords/swedish]
alla allt att av blev ...

[from nltk_data/grammars/names/female.txt]
... Zsa Zsa Zsazsa Zulema Zuzana

[from nltk_data/grammars/names/male.txt]
Zippy Zollie Zolly Zorro
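A word list like this is typically used as a filter. Here is a small sketch of the idea, assuming a tiny hand-picked set of stopwords in place of the real file (NLTK's English list is much longer, and my sample is not copied from it).

```python
# Hand-picked sample for illustration only, not NLTK's actual stopword list.
sample_stopwords = {"a", "about", "above", "is", "so", "like"}

# Tokens from the tokenization example earlier in the post.
tokens = ['OMG', ',', 'NLTK', 'is', 'like', 'so', 'cool', '.']

# Keep alphabetic tokens that are not in the stopword set.
content_words = [
    token for token in tokens
    if token.isalpha() and token.lower() not in sample_stopwords
]

print(content_words)  # ['OMG', 'NLTK', 'cool']
```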
Since we will be looking at the WordNet corpus in the next section, take a look at the text representation of a word in this more esoterically designed corpus:
[from nltk_data/grammars/wordnet/data.noun]
03452741 06 n 02 grand_piano 0 grand 0 004 @ 03928116 n 0000 ~ 02766792 n 0000 ~ 03086457 n 0000 %p 03654826 n 0000 | a piano with the strings on a horizontal harp-shaped frame; usually supported by three legs
You should be able to see that NLTK’s corpus reader gives you meaningful access to all sorts of data and allows you to start rather far along on the exploration path without worrying about how it’s all working behind the scenes. As far as our curiosity about NLP goes, this is great news because we can focus on the linguistics lessons straight away.
So, what about your own research? What if you have texts that you want to tag and analyze, etc.? Perhaps you are building a corpus that represents conversations amongst valley girls so that your valley girl detector tool will be all the more powerful. You might have various files collected over time as you recorded communication in different locations: “starbucks.txt, beach.txt, office.txt…” Or perhaps your style of collecting data relied on dates.. “May2009.txt, June2009.txt…” Check out NLTK’s PlaintextCorpusReader which will allow you to start building your corpus in no time. (I’m starting to feel like the iPhone ad when I talk about NLTK –> “there’s a corpus for that!”, “there’s a module for that!”)
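Here is a minimal sketch of what that could look like, assuming NLTK is installed. The file names come from the paragraph above, the sentences inside them are invented, and I write them to a temporary directory only so that the example cleans up after itself. In real life you would point the reader at your own folder.

```python
import tempfile
from pathlib import Path

from nltk.corpus import PlaintextCorpusReader

# Made-up transcripts, written to a temporary directory just for this demo.
samples = {
    "starbucks.txt": "Like, this latte is totally amazing.",
    "beach.txt": "OMG, the waves are so gnarly today.",
}

with tempfile.TemporaryDirectory() as corpus_root:
    for name, text in samples.items():
        Path(corpus_root, name).write_text(text, encoding="utf-8")

    # The second argument is a regular expression matching the files to include.
    valley_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")

    file_ids = valley_corpus.fileids()
    # words() is lazy, so read it into a list before the directory disappears.
    beach_words = list(valley_corpus.words("beach.txt"))

print(file_ids)
print(beach_words)
```

As far as I can tell, words() relies on a simple regular-expression tokenizer by default, so this part needs no extra data downloads. Sentence-level access may need additional NLTK data.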
WordNet is a large lexical database of English maintained at Princeton University (http://wordnet.princeton.edu/).
Let’s say that I want to go down a discovery path about the term piano (the instrument). Perhaps my application seeks to tie a community of musicians together and it would be useful to find paths amongst instruments as well as roots (hypernyms) and related items.
Let’s grab wordnet from the nltk.corpus namespace first:
>>> from nltk.corpus import wordnet as wn
Next, I will find the correct synset, which is a collection of synonymous words:
>>> wn.synsets("piano")
[Synset('piano.n.01'), Synset('piano.n.02'), Synset('piano.a.01'), Synset('piano.r.01')]
Uh oh, there are four synsets for our word. How do we know which one is correct? Although this won’t help programmatically, all of my human readers will find the definitions that WordNet includes to be helpful in narrowing it down:
>>> wn.synset('piano.n.01').definition
'a keyboard instrument that is played by depressing keys that cause hammers to strike tuned strings and produce sounds'
>>> wn.synset('piano.n.02').definition
'(music) low loudness'
>>> wn.synset('piano.a.01').definition
'used chiefly as a direction or description in music'
>>> wn.synset('piano.r.01').definition
'used as a direction in music; to be played relatively softly'
It is clear from the definitions that 'piano.n.01' is what I’m looking for. In order to discover computational techniques for analyzing these sets of words and their relations, you will want to learn more about synsets, lemmas, and more linguistics terminology, which Chapter 2 of the NLTK Book will give you a great start on. Since I am learning about these myself, I will defer to the better resource.
Before I continue to the next section (which will also send you off to the NLTK Book before my conclusion), I will just sneak in a few more cool show-and-tell aspects of WordNet:
[lexical "is a type of" relations: hyponyms]
>>> wn.synset('piano.n.01').hyponyms()
[Synset('upright.n.02'), Synset('grand_piano.n.01'), Synset('mechanical_piano.n.01')]
[lexical "is a" relations: hypernyms]
>>> wn.synset('piano.n.01').hypernyms()
[Synset('keyboard_instrument.n.01'), Synset('stringed_instrument.n.01'), Synset('percussion_instrument.n.01')]
[component "has a" relations: meronyms]
>>> wn.synset('piano.n.01').part_meronyms()
[Synset('piano_action.n.01'), Synset('keyboard.n.01'), Synset('fallboard.n.01'), Synset('sustaining_pedal.n.01'), Synset('soft_pedal.n.01'), Synset('piano_keyboard.n.01'), Synset('sounding_board.n.02')]
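For the musician-community idea, one thing you might do with these relations is check whether two instruments share a more general term. The sketch below is a toy version that uses only the synset names printed above, stored in a plain dictionary. It never queries WordNet, and shared_parents is a name I made up for the example.

```python
# Hyponyms copied from the interactive session above; nothing here calls WordNet.
hyponyms = {
    "piano.n.01": ["upright.n.02", "grand_piano.n.01", "mechanical_piano.n.01"],
}

# Invert the mapping so each specific term points back to its general term.
hypernyms = {}
for general, specifics in hyponyms.items():
    for specific in specifics:
        hypernyms.setdefault(specific, []).append(general)

def shared_parents(first, second):
    """Return the general terms that both synset names have in common."""
    return sorted(set(hypernyms.get(first, [])) & set(hypernyms.get(second, [])))

print(shared_parents("upright.n.02", "grand_piano.n.01"))  # ['piano.n.01']
```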
One example that the book provides of how these relationships can get pretty intricate uses the word “mint”. 'mint.n.04' is both part of 'mint.n.02' and one of the substances that 'mint.n.05' is made out of:
>>> wn.synset('mint.n.02').definition
'any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers'
>>> wn.synset('mint.n.04').definition
'the leaves of a mint plant used fresh or candied'
>>> wn.synset('mint.n.05').definition
'a candy that is flavored with a mint oil'
>>> wn.synset('mint.n.04').part_holonyms()
[Synset('mint.n.02')]
>>> wn.synset('mint.n.04').substance_holonyms()
[Synset('mint.n.05')]
>>>
I hope that I whetted your appetite for your own corpus discovery activities with some of these examples. There are so many paths to go down with all of the available data and tools, and you’ve seen how easy it is to explore without the benefit of expertise.
Although I have personally been able to leverage the tagging, classifying, chunking, and parsing features of NLTK, I will again defer to the corresponding chapters of the NLTK book for lessons. The learning curve in these areas becomes steeper, but you can certainly start small and build upon your experience if you want to travel down those paths. I’ll finish with some high level introductions/examples and links to the relevant chapters:
Tagging (Chapter 5):
If you would like to easily identify some parts of speech, the maxent_treebank_pos_tagger model that you downloaded during setup can add Penn Treebank POS tags to your tokens:
>>> tokens = nltk.word_tokenize("OMG, NLTK is like so cool.")
>>> nltk.pos_tag(tokens)
[('OMG', 'NNP'), (',', ','), ('NLTK', 'NNP'), ('is', 'VBZ'), ('like', 'IN'), ('so', 'RB'), ('cool', 'JJ'), ('.', '.')]
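Once you have (word, tag) pairs, ordinary Python takes over. As a small sketch, here is one way to group the words by tag, using the tagger output above pasted in as a literal so that no NLTK data is needed to run it.

```python
from collections import defaultdict

# Output of nltk.pos_tag() copied from the session above.
tagged = [('OMG', 'NNP'), (',', ','), ('NLTK', 'NNP'), ('is', 'VBZ'),
          ('like', 'IN'), ('so', 'RB'), ('cool', 'JJ'), ('.', '.')]

# Collect the words under each tag, preserving their original order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(words_by_tag['NNP'])  # ['OMG', 'NLTK']
print(words_by_tag['JJ'])   # ['cool']
```

In the Penn Treebank tag set, NNP marks a singular proper noun and JJ an adjective.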
Classifying (Chapter 6):
This is where you can start building your valley girl detector tool.
Chunking (Chapter 7):
[When I use] [NLTK] [I learn] [cool things] [about] [languages]
Parsing (Chapter 8):
I cheated with the title of this blog post. Here is the real parse tree:
(S
  (NP (NNP NLTK))
  (VP (VBZ is)
    (NP (NN fun))))
I have seen many articles about NLTK yet speak to many developers and students who are not aware of it. Having previously assumed that the toolkit wouldn’t be very useful for me unless I study in the computational linguistics field, I was pleasantly surprised to find out that I was wrong. As for the students who are learning about computer science through Python, I’m sure you can think of some ways to add language analysis to all the fun – or at the very least to impress your English teachers with your knowledge of hyponyms.
>>> from nltk.chat import rude
>>> rude.demo()
Talk to the program by typing in plain English, using normal upper- and lower-case letters and punctuation. Enter "quit" when done.
========================================================================
I suppose I should say hello.
>I just wanted to show the readers how one of the chatbot demos works.
Either become more thrilling or get lost, buddy.
>Ok, bye.
Change the subject before I die of fatal boredom.
That wasn’t very nice, maybe I’ll let the zen chatbot help me wrap this up:
>>> from nltk.chat import zen
>>> zen.demo()
***************************************************************************
Zen Chatbot!
***************************************************************************
"Look beyond mere words and letters - look into your mind"
* Talk your way to truth with Zen Chatbot.
* Type 'quit' when you have had enough.
***************************************************************************
Welcome, my child.
>Can you help me end my blog post?
I probably can, but I may not.
>Ok, whatever. I'll just say bye then!
I pour out a cup of water. Is the cup empty?
“How does a computer understand languages if it’s not a human being?”
I wish I had a recording of my 5-year-old nephew asking me that question, after I had quickly explained to him that there are languages that people use to communicate with computers. I had visited Patrick that day to show him Scratch, the “programming language for everyone” from the MIT Media Lab. He’s younger than the target demographic, which is 8 and above, but I was curious to see what he’d say as we looked at it together.
It was clear from the outset that I had underestimated his agility with user interfaces and ability to learn new concepts. He proceeded to set up a football stadium background with several football player “sprites”. I asked him if he wanted to get rid of the default Scratch cartoon cat sprite which looked out of place on the football field, but he gave me one stern look and made sure that I no longer had access to the mouse: “That’s the coach, Megan.” I told him in my best how-to-talk-to-a-kindergartner voice that he could program the players to move around, and before I could finish the sentence he jumped in to help: “You mean animate them?”
When I was Patrick’s age, my older brother and I had a Commodore VIC-20. I can’t recall where it came from — as far as I know it was just there one day and we loved it. Watching Patrick fiddle with Scratch, figuring out by himself how to draw a football with the mouse after a brief trial-and-error period with the buttons on the main screen, I was reminded of my approach to the VIC-20 back then. We had a manual with Commodore BASIC commands, and some games, and would spend hours tinkering away; no goals, no schedules, just playing. It’s exciting to think about the learning environment that open-minded, technically nimble children and their teachers can create together with a tool like Scratch, especially since the teachers will have to continuously learn from their students in order to keep up.
I encourage anyone who is interested in Scratch to visit MIT’s site (http://scratch.mit.edu/) to view tutorials and example projects, and of course to download and start exploring it. After my own cursory look at Scratch, I plan to check out the community of developers and educators via the active forums, write a program or two myself, and basically arrive much better prepared for my next lesson with Patrick — or more accurately his next lesson with me.
At Twitterers Anonymous, they encourage us to maintain a backlog of tweets so that we don’t feel tempted to sign up again. My sponsor, who once had so many followers that it went to his head, told me to never, never publish my backlog. I hope he doesn’t read this.
last updated 3/2/2010