MIT Department of Electrical Engineering & Computer Science


Similarity-based Approaches to Natural Language Processing

Lillian Lee
Harvard University

Thursday, April 24, 1997
4:00 PM (3:45 refreshments)
Building NE43, 8th Floor AI Playroom
EECS Special Seminar

Abstract

Statistical methods for automatically extracting information about associations between words or documents from large collections of text would have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. Unfortunately, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events. Traditional approaches to this sparse-data problem use crude approximations. However, suppose we are able to organize the data into classes of "similar" events. Then, if information about an event is lacking, we can estimate its behavior from information about similar events. We present two such similarity-based approaches.

Our first approach is top-down. We describe an algorithm for building soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. We used this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia. Our algorithm also extends with no modification to the problem of document clustering.
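As a rough illustration of what "soft" membership and centroid splitting could look like, here is a minimal sketch. It is not the speaker's algorithm: the abstract does not say how membership probabilities are computed, so the sketch assumes each event is a probability distribution (given as a list), that membership falls off exponentially with the KL divergence to a centroid, and that a split is a small perturbation of a centroid. The names `soft_memberships`, `update_centroids`, `split_centroid`, and the parameters `beta` and `eps` are all hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def soft_memberships(dist, centroids, beta):
    """Probability that an event with distribution `dist` belongs to each cluster.

    Assumed form: weight proportional to exp(-beta * D(dist || centroid)).
    """
    weights = [math.exp(-beta * kl_divergence(dist, c)) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

def update_centroids(dists, centroids, beta):
    """Recompute each centroid as the membership-weighted average of all events."""
    memberships = [soft_memberships(d, centroids, beta) for d in dists]
    size = len(dists[0])
    new_centroids = []
    for c in range(len(centroids)):
        mass = sum(m[c] for m in memberships)
        new_centroids.append([
            sum(m[c] * d[k] for m, d in zip(memberships, dists)) / mass
            for k in range(size)
        ])
    return new_centroids

def split_centroid(centroid, eps=0.01):
    """Split one centroid into two slightly perturbed, renormalized copies."""
    up = [v * (1 + eps * (-1) ** k) for k, v in enumerate(centroid)]
    down = [v * (1 - eps * (-1) ** k) for k, v in enumerate(centroid)]
    return ([v / sum(up) for v in up], [v / sum(down) for v in down])

# Toy data: four "words", each a distribution over three contexts.
dists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
centroids = [[0.6, 0.2, 0.2], [0.2, 0.2, 0.6]]
memberships = soft_memberships(dists[0], centroids, beta=5.0)
centroids = update_centroids(dists, centroids, beta=5.0)
```

In this sketch every event contributes to every cluster, only with different weight, which is the sense in which the clusters are soft. Repeating the update and then splitting the resulting centroids would give progressively finer clusters, under the assumptions stated above.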

Our second approach is bottom-up: instead of calculating a centroid for each class, we in essence build a cluster around each word. Using estimation techniques based on this model, we are able to achieve improvements of more than 20 percent over standard techniques in the prediction of low-frequency events.
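The idea of building a cluster around each word can be sketched as a similarity-weighted average over other words' distributions. This is only an illustration under assumptions not stated in the abstract: similarity is taken to be an exponential of the negative Jensen-Shannon divergence, and the function name `similarity_estimate`, the parameter `beta`, and the toy data are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence; finite even when p or q contains zeros."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

def similarity_estimate(word, cond_probs, beta=1.0):
    """Estimate the distribution for `word` from the distributions of other words.

    Each other word is weighted by exp(-beta * JS(word, other)), an assumed
    similarity function; weights are normalized to sum to one.
    """
    target = cond_probs[word]
    weights = {
        other: math.exp(-beta * js_divergence(target, dist))
        for other, dist in cond_probs.items()
        if other != word
    }
    total = sum(weights.values())
    return [
        sum(w * cond_probs[other][k] for other, w in weights.items()) / total
        for k in range(len(target))
    ]

# Toy data: each word's observed distribution over three contexts.
cond_probs = {
    "apple": [0.5, 0.5, 0.0],   # third context never observed with "apple"
    "pear":  [0.4, 0.5, 0.1],
    "car":   [0.0, 0.1, 0.9],
}
estimate = similarity_estimate("apple", cond_probs, beta=5.0)
```

Here the third context was never observed with "apple", yet the estimate assigns it some probability, mostly borrowed from the more similar word "pear". No centroid is computed; each word's neighborhood is defined directly by its similarity to other words.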


URL of this page: http://www-eecs.mit.edu/AY96-97/events/43.html
Created: Apr 14, 1997  | Modified: Jun 24, 1997
This announcement is from the MIT EECS 1996-97 archive.