UT Educational Information Retrieval Package

This package contains miniature pedagogical Java implementations of information retrieval, spidering, and other text processing software. Features include:

Classes for tokenization, stopword removal, and stemming;
An efficient implementation of an inverted index for text or HTML document retrieval using cosine similarity with TF-IDF weighting;
Classes for running evaluation experiments for information retrieval, generating recall-precision curves for a given test corpus of query/relevant-document pairs;
A simple web crawler that supports robot exclusion;
An implementation of Naive Bayes for text categorization, along with software that generates cross-validated learning curves.
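
To give a feel for the retrieval feature listed above, here is a minimal sketch of an inverted index ranked by cosine similarity with TF-IDF weights. It is not taken from the package: the class name TinyIndex and its methods are hypothetical, the tokenizer simply lowercases and splits on non-letters (no stopword removal or stemming), and IDF is assumed to be log(N / df).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; these names are NOT the package's API.
class TinyIndex {
    // Inverted index: term -> (document id -> raw term frequency).
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Lowercases the text and splits on runs of non-letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Adds a document and records its term frequencies in the postings.
    void add(String name, String text) {
        int id = names.size();
        names.add(name);
        for (String t : tokenize(text)) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
    }

    // Assumed IDF definition: log(N / df); 0 for terms not in the index.
    private double idf(String term) {
        Map<Integer, Integer> p = postings.get(term);
        return p == null ? 0.0 : Math.log((double) names.size() / p.size());
    }

    // Returns the name of the document with the highest cosine similarity
    // to the query, or null if no document shares a weighted term with it.
    String best(String query) {
        int n = names.size();
        double[] dot = new double[n];
        double[] docNormSq = new double[n];
        // Squared vector length of each document (recomputed per query
        // for brevity; a real index would cache these).
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            double termIdf = idf(e.getKey());
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                double w = p.getValue() * termIdf;
                docNormSq[p.getKey()] += w * w;
            }
        }
        Map<String, Integer> queryTf = new HashMap<>();
        for (String t : tokenize(query)) queryTf.merge(t, 1, Integer::sum);
        double queryNormSq = 0.0;
        // Only documents on the postings lists of query terms are touched.
        for (Map.Entry<String, Integer> q : queryTf.entrySet()) {
            double termIdf = idf(q.getKey());
            double queryWeight = q.getValue() * termIdf;
            queryNormSq += queryWeight * queryWeight;
            Map<Integer, Integer> p = postings.get(q.getKey());
            if (p == null) continue;
            for (Map.Entry<Integer, Integer> d : p.entrySet()) {
                dot[d.getKey()] += queryWeight * d.getValue() * termIdf;
            }
        }
        String bestName = null;
        double bestScore = 0.0;
        for (int i = 0; i < n; i++) {
            if (dot[i] <= 0.0) continue;
            double score = dot[i] / Math.sqrt(docNormSq[i] * queryNormSq);
            if (score > bestScore) {
                bestScore = score;
                bestName = names.get(i);
            }
        }
        return bestName;
    }
}
```

For instance, after adding three short documents with add, a call such as best("dogs") would return the name of the only document containing that word. The point of the inverted index is that scoring walks only the postings lists of the query terms rather than every document.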

Click here to download!

The package was originally developed by Prof. Raymond Mooney et al. for an introductory course on Intelligent Information Retrieval and Web Search at the University of Texas at Austin. It is released for educational and research purposes only under the GNU General Public License.


Yuk-Wah Wong

Last modified: Thu Nov 20 22:04:40 CST 2003