MIT Department of Electrical Engineering & Computer Science
Similarity-based Approaches to Natural Language Processing
Lillian Lee
Harvard University
Thursday, April 24, 1997
4:00 PM (3:45 refreshments)
Building NE43, 8th Floor AI Playroom
EECS Special Seminar
Abstract
Statistical methods for automatically extracting information about
associations between words or documents from large collections of text
would have considerable impact in a number of areas, such as
information retrieval and natural-language-based user interfaces.
Unfortunately, even huge bodies of text yield highly unreliable
estimates of the probability of relatively common events. Traditional
approaches to this sparse data problem use crude approximations.
However, suppose we are able to organize the data into classes of
"similar" events. Then, if information about an event is lacking,
we can estimate its behavior from information about similar events.
We present two such similarity-based approaches.
Our first approach is top-down. We describe an algorithm for building
soft, hierarchical clusters: soft, because each event belongs to each
cluster with some probability; hierarchical, because cluster centroids
are iteratively split to model finer distinctions. We used this
method to cluster words drawn from 44 million words of Associated
Press Newswire and 10 million words from Grolier's encyclopedia. Our
algorithm also extends with no modification to the problem of document
clustering.
Our second approach is bottom-up: instead of calculating a centroid
for each class, we in essence build a cluster around each word. Using
estimation techniques based on this model, we are able to achieve
improvements of more than 20 percent over standard techniques in the
prediction of low-frequency events.
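To make the bottom-up idea concrete, here is a minimal sketch of a similarity-weighted estimate. It is an illustration only, not the estimation technique presented in the talk. It assumes that events are word pairs (w1, w2), that similarity between two words is measured by the Jensen-Shannon divergence of their observed co-occurrence distributions, and that weights decay exponentially at a rate `beta`. The function names, the toy data, and the value of `beta` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def conditional_mle(pairs):
    """Return {w1: {w2: P_mle(w2 | w1)}} from a list of (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in pairs:
        counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
        for w1, ctr in counts.items()
    }


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two sparse distributions."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log(qk / mk)
    return total


def similarity_estimate(w1, w2, mle, beta=5.0):
    """Estimate P(w2 | w1) as a similarity-weighted average over other words.

    Each other word v contributes its own P_mle(w2 | v), weighted by
    exp(-beta * JS(w1, v)); beta is an assumed tuning parameter.
    """
    weights = {
        v: math.exp(-beta * js_divergence(mle[w1], mle[v]))
        for v in mle
        if v != w1
    }
    norm = sum(weights.values())
    if norm == 0.0:  # no other words to borrow from: fall back to the MLE
        return mle[w1].get(w2, 0.0)
    return sum(wt * mle[v].get(w2, 0.0) for v, wt in weights.items()) / norm


# Toy data (hypothetical): "devour" is never seen with "pear".
pairs = [
    ("eat", "apple"), ("eat", "pear"),
    ("devour", "apple"),
    ("drive", "car"), ("drive", "truck"),
]
mle = conditional_mle(pairs)
print(mle["devour"].get("pear", 0.0))            # unsmoothed estimate
print(similarity_estimate("devour", "pear", mle))  # borrowed from similar words
```

In this toy corpus the pair ("devour", "pear") never occurs, so its unsmoothed estimate is zero. The similarity-weighted estimate is positive because "eat", whose distribution is close to that of "devour", does occur with "pear".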
URL of this page:
http://www-eecs.mit.edu/AY96-97/events/43.html
Created: Apr 14, 1997 | Modified: Jun 24, 1997
This announcement is from the MIT EECS 1996-97 archive.