Datasets

Enron-Spam datasets

The datasets from the Enron corpus are provided in two forms, preprocessed and raw, and contain 88792 messages in total. The "preprocessed" directory contains the messages in preprocessed format: 20170 spam messages and 16545 ham messages. The "raw" directory contains the messages in their original form: 19089 spam messages and 32988 ham messages. The total size of the datasets is 72.64 MB.
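As a quick consistency check on the counts above, the following minimal sketch tallies the per-form totals and spam fractions. The dictionary layout and the function name `summarize_counts` are illustrative assumptions, not part of the dataset distribution; only the four message counts come from the description.

```python
# Message counts as stated in the description above.
# The dictionary layout is an illustrative assumption.
enron_counts = {
    "preprocessed": {"spam": 20170, "ham": 16545},
    "raw": {"spam": 19089, "ham": 32988},
}


def summarize_counts(counts):
    """Return (per-form summary, overall message total)."""
    per_form = {}
    for form, by_class in counts.items():
        total = by_class["spam"] + by_class["ham"]
        per_form[form] = {
            "total": total,
            "spam_fraction": by_class["spam"] / total,
        }
    overall_total = sum(entry["total"] for entry in per_form.values())
    return per_form, overall_total


per_form, overall_total = summarize_counts(enron_counts)
for form, entry in per_form.items():
    print(f"{form}: {entry['total']} messages, "
          f"{entry['spam_fraction']:.1%} spam")
print(f"overall: {overall_total} messages")
```

The two per-form totals add up to the 88792 messages quoted for the whole collection.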

Ling-Spam datasets

The datasets consist of 2412 legitimate Linguist messages and 481 spam messages, all from the Linguist list. The total size of the datasets is 34.8 MB.

PU1 and PU123A datasets

The PU1 datasets consist of 481 spam messages and 618 legitimate messages. The PU123A datasets consist of XXX spam messages and XXX legitimate messages. The total size of the datasets is 10.3 MB.

Spam-Assassin datasets

The datasets from the Spam-Assassin public mail corpus consist of 1897 spam messages and 4150 legitimate messages. The total size of the datasets is 12.5 MB.

20 Newsgroups

The datasets consist of 18821 newsgroup documents across 20 different newsgroups. The total size of the datasets is 82.44 MB.

R52 and R8 of Reuters 21578

The datasets consist of two subsets, R8 and R52. R8 contains 5485 documents for training and 2189 documents for testing. R52 contains 6532 documents for training and 2568 documents for testing. The total size of the datasets is 32.84 MB.

WebKB

The datasets consist of 8,282 pages manually classified into 7 classes, including student, faculty, staff, department and course. The total size of the datasets is 3.6 MB.

e-News datasets *

The datasets consist of 668 documents from newspapers of different countries (e.g., New Zealand and Australia) on the topics of business, education, entertainment, sport and travel. The total size of the datasets is 3.6 MB.

Spam email datasets *

The datasets consist of two parts, a training set and a testing set. The training set contains 2949 ham messages and 1378 spam messages. The testing set contains 4292 messages without known class labels. The total size of the datasets is 23.4 MB.

Malicious software datasets *

The datasets consist of two parts, a training set and a testing set. The training set contains 320 malware traces and 68 benign software traces. The testing set contains 378 traces without known class labels. The total size of the datasets is 935.4 KB.
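For the two datasets above that ship a labeled training set and an unlabeled testing set, the class shares in the training data can be worked out directly from the stated counts. The sketch below is only an illustration: the dictionary keys and the helper `class_shares` are hypothetical names, and the numbers are copied from the descriptions.

```python
# Labeled training counts and unlabeled test sizes, copied from the
# descriptions above. Keys and structure are illustrative assumptions.
labeled_training_sets = {
    "spam_email": {"ham": 2949, "spam": 1378},
    "malicious_software": {"malware": 320, "benign": 68},
}
unlabeled_test_sizes = {"spam_email": 4292, "malicious_software": 378}


def class_shares(class_counts):
    """Return each class's fraction of the labeled training set."""
    total = sum(class_counts.values())
    return {label: count / total for label, count in class_counts.items()}


for name, counts in labeled_training_sets.items():
    n_train = sum(counts.values())
    n_test = unlabeled_test_sizes[name]
    shares = ", ".join(
        f"{label} {share:.1%}" for label, share in class_shares(counts).items()
    )
    print(f"{name}: {n_train} labeled, {n_test} unlabeled; {shares}")
```

Both training sets are unevenly split between classes, which is worth keeping in mind when choosing an evaluation measure.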

Note: Datasets marked with * are provided by the CSMINING GROUP.