UT Educational Information Retrieval Package
This package contains miniature pedagogical Java
implementations of information retrieval, spidering, and other
text processing software. Features include:
- Classes for tokenization, stopword removal, and
stemming;
- An efficient implementation of an inverted index for text
or HTML document retrieval using cosine similarity with
TF-IDF weighting;
- Classes for running evaluation experiments for information
retrieval, generating recall-precision curves for a given test
corpus of query/relevant-document pairs;
- A simple web crawler that supports robot exclusion;
- An implementation of Naive Bayes for text categorization,
along with software that generates cross-validated learning
curves.
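To give a feel for the retrieval feature listed above, here is a minimal sketch of an inverted index ranked by cosine similarity with TF-IDF weighting. It is not taken from the package: the class name `TinyIndex`, its methods, and the tiny stopword list are all hypothetical, and the weighting (raw term frequency times log(N/df)) is only one common variant. Stemming is left out to keep the sketch short.

```java
import java.util.*;

// Illustrative only: names and weighting scheme are assumptions, not the package's API.
class TinyIndex {
    // Hypothetical stopword list, far smaller than any real one.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "is", "in", "on", "to", "with"));

    // Inverted index: term -> (document id -> raw term frequency).
    final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    int numDocs = 0;

    // Lowercase, split on non-alphanumeric characters, drop stopwords.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // Adds a document and returns its id (ids are assigned 0, 1, 2, ...).
    int add(String text) {
        int id = numDocs++;
        for (String t : tokenize(text)) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
        return id;
    }

    // Inverse document frequency: log(N / df), or 0 for unseen terms.
    double idf(String term) {
        Map<Integer, Integer> p = postings.get(term);
        return p == null ? 0.0 : Math.log((double) numDocs / p.size());
    }

    // Returns ids of matching documents, best match first.
    List<Integer> search(String query) {
        // Squared length of every document's TF-IDF vector.
        final double[] norm = new double[numDocs];
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            double idf = idf(e.getKey());
            for (Map.Entry<Integer, Integer> d : e.getValue().entrySet()) {
                double w = d.getValue() * idf;
                norm[d.getKey()] += w * w;
            }
        }
        // Term frequencies of the query.
        Map<String, Integer> queryTf = new HashMap<>();
        for (String t : tokenize(query)) queryTf.merge(t, 1, Integer::sum);

        // Dot product of query and document vectors, touching only the
        // postings lists of the query terms.
        final Map<Integer, Double> dot = new HashMap<>();
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            Map<Integer, Integer> p = postings.get(q.getKey());
            if (p == null) continue;
            double idf = idf(q.getKey());
            double queryWeight = q.getValue() * idf;
            for (Map.Entry<Integer, Integer> d : p.entrySet()) {
                dot.merge(d.getKey(), queryWeight * d.getValue() * idf, Double::sum);
            }
        }
        // The query's own length is the same for every document, so dividing
        // by the document length alone gives the same ranking as full cosine.
        List<Integer> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            if (e.getValue() > 0) ranked.add(e.getKey());
        }
        ranked.sort((a, b) -> Double.compare(
            dot.get(b) / Math.sqrt(norm[b]), dot.get(a) / Math.sqrt(norm[a])));
        return ranked;
    }
}
```

In this sketch, documents are added with `add(text)` and queried with `search(query)`, which returns document ids in descending order of similarity. Recomputing the document lengths on every query keeps the code short; a real index would presumably cache them once indexing is complete.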
Click here to
download!
The package was originally developed by Prof. Raymond
Mooney et al. for an introductory course on Intelligent
Information Retrieval and Web Search at the University of Texas at
Austin. It is released for educational and research
purposes only, under the GNU General Public
License.
Yuk-Wah Wong
Last modified: Thu Nov 20 22:04:40 CST 2003