PU1 and PU123A

PU1

This directory contains the PU1 corpus, as described in the paper:

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos, "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages". In Belkin, N.J., Ingwersen, P. and Leong, M.-K. (Eds.), Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, pp. 160-167, 2000.

There are 4 subdirectories, corresponding to the four "encrypted" versions of the corpus mentioned in the paper:

bare: Lemmatiser disabled, stop-list disabled.
lemm: Lemmatiser enabled, stop-list disabled.
lemm_stop: Lemmatiser enabled, stop-list enabled.
stop: Lemmatiser disabled, stop-list enabled.

Each one of these 4 directories contains 10 subdirectories (part1, ..., part10). These correspond to the 10 partitions of the corpus that were used in the 10-fold cross-validation experiments. In each repetition, one part was reserved for testing and the other 9 parts were used for training.

Each one of the 10 subdirectories contains both spam and legitimate messages, one message in each file. Files whose names have the form *spmsg*.txt are spam messages. Files whose names have the form *legit*.txt are legitimate messages.
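
The naming convention above can be turned into labels with a few lines of Python. This is only a minimal sketch: the helper names (label_for, load_part) and the file encoding are assumptions made for illustration and are not part of the corpus distribution.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def label_for(filename: str) -> str:
    """Map a file name to 'spam' or 'legit' using the patterns in this readme."""
    if fnmatchcase(filename, "*spmsg*.txt"):
        return "spam"
    if fnmatchcase(filename, "*legit*.txt"):
        return "legit"
    raise ValueError(f"unrecognised message file name: {filename}")


def load_part(part_dir):
    """Return a list of (text, label) pairs for one part directory.

    The encoding is an assumption; latin-1 is used only because it never
    fails to decode a byte sequence.
    """
    pairs = []
    for path in sorted(Path(part_dir).glob("*.txt")):
        pairs.append((path.read_text(encoding="latin-1"), label_for(path.name)))
    return pairs
```

For example, load_part("bare/part1") would return the labelled messages of the first part of the "bare" version, assuming the corpus has been unpacked in the current directory.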

You are free to use this corpus for non-commercial purposes, provided that you acknowledge the use and origin of the corpus in any published work of yours that makes use of the corpus, and that you notify the person below about this work. To use this corpus for commercial applications, you must obtain written permission from the person below.

Ion Androutsopoulos http://www.aueb.gr/users/ion/
PU1 corpus last updated: July 17, 2000.
This file (readme.txt) last updated: July 30, 2003.

PU123A

This directory contains the PU1, PU2, PU3, and PUA corpora, as described in the paper:

I. Androutsopoulos, G. Paliouras, E. Michelakis, "Learning to Filter Unsolicited Commercial E-Mail", submitted for journal publication, 2003.

There are 4 directories (pu1, pu2, pu3, pua), each containing one of the four corpora.

Each one of the 4 directories in turn contains 11 subdirectories (part1, ..., part10, unused). These correspond to the 10 partitions of each corpus that were used in the 10-fold cross-validation experiments. In each repetition, one part was reserved for testing, and the other 9 parts were used for training.
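
The train/test rotation described above can be sketched as follows. This is an illustrative assumption about how one might enumerate the repetitions, not the code used in the paper; the function name cross_validation_folds is hypothetical.

```python
def cross_validation_folds(n_parts: int = 10):
    """Yield (train_parts, test_part) directory names, one pair per repetition.

    Each part is used for testing exactly once; the remaining parts are
    used for training. The "unused" directory is never included.
    """
    parts = [f"part{i}" for i in range(1, n_parts + 1)]
    for test_part in parts:
        train_parts = [p for p in parts if p != test_part]
        yield train_parts, test_part
```

The yielded names are relative to a corpus directory, so a caller would prefix them with pu1, pu2, pu3, or pua as appropriate.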

Each one of the 10 part subdirectories (part1, ..., part10) contains both spam and legitimate messages, one message in each file. Files whose names have the form *spmsg*.txt are spam messages. Files whose names have the form *legit*.txt are legitimate messages. The first number in each filename is random; it was used to shuffle the messages. The second number was the initial identifier of the message; it does not reflect the order in which the messages were received. To bypass privacy issues, the messages are "encoded", as explained in the paper above.

To maintain the same number of messages and the same spam-to-legitimate ratio across the parts of each corpus, we had to discard some messages. These can be found in the "unused" directory of each corpus.
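
One way to check this property on a local copy is to count spam and legitimate file names per part and compare the counts. The sketch below is an assumption about how such a check could be written; the function names are hypothetical and the input is simply a mapping from part name to the file names found in that part.

```python
from fnmatch import fnmatchcase


def part_counts(filenames):
    """Return (n_spam, n_legit) for the given file names.

    Names that match neither pattern are ignored.
    """
    n_spam = sum(1 for name in filenames if fnmatchcase(name, "*spmsg*.txt"))
    n_legit = sum(1 for name in filenames if fnmatchcase(name, "*legit*.txt"))
    return n_spam, n_legit


def parts_are_balanced(parts):
    """Return True if every part has the same spam and legitimate counts.

    `parts` maps a part name (e.g. "part1") to a list of file names.
    """
    distinct = {part_counts(names) for names in parts.values()}
    return len(distinct) <= 1
```

Equal counts in every part imply both the same number of messages and the same spam-to-legitimate ratio.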

Unlike the earlier Ling-Spam corpus and the form of PU1 that was released in 2000, the corpora in this directory are available only in "bare" form: tokens are separated by whitespace, and neither a stop-list nor a lemmatiser has been applied. Otherwise, the PU1 corpus in this directory is the same as the PU1 corpus that was released in 2000, except that the messages are distributed differently across the 10 parts, to match the distribution used in the experiments of the paper mentioned above. Ling-Spam and the 2000 version of PU1 are still available.

The PU1, PU2, PU3, and PUA corpora (as well as Ling-Spam) can be obtained from the site of i-config ( http://www.iit.demokritos.gr/skel/i-config/ ) or from the publications section of Ion Androutsopoulos' web pages ( http://www.aueb.gr/users/ion/publications.html ).

You are free to use PU1, PU2, PU3, and PUA for non-commercial purposes, provided that you acknowledge the use and origin of the corpora in any published work of yours that makes use of them, and that you notify one of the persons below about this work. To use the four corpora for commercial applications, you must obtain written permission from the persons below.

Ion Androutsopoulos ( http://www.aueb.gr/users/ion/ )
George Paliouras ( http://www.iit.demokritos.gr/~paliourg/ )
Eirinaios Michelakis ( http://www.iit.demokritos.gr/~ernani/ )

This file last updated: December 16, 2003.