R52 and R8 of Reuters 21578

Reuters-21578

Currently the most widely used test collection for text categorization research, though likely to be superseded over the next few years by RCV1.  The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.  Further details, including discussion of previous versions of the collection (e.g. Reuters-22173), are available in the README file.

The collection is available here as a gzipped tar archive (8.2 MB; 28.0 MB uncompressed).  The UCI KDD archive also has an entry for the collection, including a copy.  The version at UCI is identical, and I encourage you to get the UCI copy if available to save bandwidth at this site.  Previous locations of the collection (now no longer active) were http://www.research.att.com/~lewis/reuters21578.html and ftp://canberra.cs.umass.edu/pub/reuters.

Various researchers have prepared data files useful for work with Reuters-21578.  Contact me if you would like me to host such resources here; I am happy to if their disk space requirements are modest.  Currently the only such resource available here is a PROLOG fact base about countries contributed by Ronen Feldman.

Reuters-21578
# Topics  # train docs  # test docs  # other  Total # docs
0 1828 280 8103 10211
1 6552 2581 361 9494
2 890 309 135 1334
3 191 64 55 310
4 62 32 10 104
5 39 14 8 61
6 21 6 3 30
7 7 4 0 11
8 4 2 0 6
9 4 2 0 6
10 3 1 0 4
11 0 1 1 2
12 1 1 0 2
13 0 0 0 0
14 0 2 0 2
15 0 0 0 0
16 1 0 0 1

Considering only the documents with a single topic and the classes which still have at least one train and one test example, we are left with 8 of the 10 most frequent classes and 52 of the original 90 classes.

Following Sebastiani's convention, we will call these sets R8 and R52. Note that in going from R10 (the 10 most frequent classes) to R8, the classes corn and wheat, which are closely related to the class grain, disappeared, and grain itself lost many of its documents.
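The selection rule described above (single-topic documents only, then only classes with at least one train and one test document) can be sketched as follows. This is a minimal illustration, not the script used to build R8 and R52: the `(split, topics, text)` record layout, the function name `select_single_label`, and the toy documents are all hypothetical.

```python
from collections import defaultdict

def select_single_label(docs):
    """Apply the two-step selection to (split, topics, text) records.

    The record layout is an assumption made for this sketch; it is not
    the format of the Reuters-21578 distribution.
    """
    # Step 1: keep train/test documents that carry exactly one topic.
    single = [
        (split, topics[0], text)
        for split, topics, text in docs
        if split in ("train", "test") and len(topics) == 1
    ]
    # Step 2: record in which splits each remaining class occurs.
    splits_seen = defaultdict(set)
    for split, label, _ in single:
        splits_seen[label].add(split)
    # Step 3: keep only classes present in both train and test.
    kept = {label for label, seen in splits_seen.items()
            if seen == {"train", "test"}}
    return [doc for doc in single if doc[1] in kept]

# Hypothetical toy records, not taken from the collection.
toy = [
    ("train", ["earn"], "a"),
    ("test", ["earn"], "b"),
    ("train", ["grain", "wheat"], "c"),  # two topics: dropped in step 1
    ("train", ["tea"], "d"),             # class has no test document: dropped
    ("other", ["earn"], "e"),            # neither train nor test: dropped
]
result = select_single_label(toy)
print(result)
```

The third toy record hints at why grain shrinks so much: a document tagged with both grain and wheat has more than one topic, so it is removed in the first step, presumably along with most other wheat and corn documents.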

The distribution of documents per class is the following for R8 and R52:

R8
Class  # train docs  # test docs  Total # docs
acq 1596 696 2292
crude 253 121 374
earn 2840 1083 3923
grain 41 10 51
interest 190 81 271
money-fx 206 87 293
ship 108 36 144
trade 251 75 326
Total 5485 2189 7674
R52
Class  # train docs  # test docs  Total # docs
acq 1596 696 2292
alum 31 19 50
bop 22 9 31
carcass 6 5 11
cocoa 46 15 61
coffee 90 22 112
copper 31 13 44
cotton 15 9 24
cpi 54 17 71
cpu 3 1 4
crude 253 121 374
dlr 3 3 6
earn 2840 1083 3923
fuel 4 7 11
gas 10 8 18
gnp 58 15 73
gold 70 20 90
grain 41 10 51
heat 6 4 10
housing 15 2 17
income 7 4 11
instal-debt 5 1 6
interest 190 81 271
ipi 33 11 44
iron-steel 26 12 38
jet 2 1 3
jobs 37 12 49
lead 4 4 8
lei 11 3 14
livestock 13 5 18
lumber 7 4 11
meal-feed 6 1 7
money-fx 206 87 293
money-supply 123 28 151
nat-gas 24 12 36
nickel 3 1 4
orange 13 9 22
pet-chem 13 6 19
platinum 1 2 3
potato 2 3 5
reserves 37 12 49
retail 19 1 20
rubber 31 9 40
ship 108 36 144
strategic-metal 9 6 15
sugar 97 25 122
tea 2 3 5
tin 17 10 27
trade 251 75 326
veg-oil 19 11 30
wpi 14 9 23
zinc 8 5 13
Total 6532 2568 9100
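As a quick consistency check on the figures above, the short script below sums the R8 rows, compares the single-topic row of the first table with the R52 totals, and measures how much of the R8 training set the two largest classes account for. All numbers are copied from the tables; the variable names are only illustrative, and the reading of the difference as "single-topic documents whose class lacks a train or a test example" is our interpretation of the selection rule.

```python
# Per-class (train, test) counts for R8, copied from the R8 table.
r8 = {
    "acq": (1596, 696), "crude": (253, 121), "earn": (2840, 1083),
    "grain": (41, 10), "interest": (190, 81), "money-fx": (206, 87),
    "ship": (108, 36), "trade": (251, 75),
}
r8_train = sum(train for train, _ in r8.values())
r8_test = sum(test for _, test in r8.values())

# Documents with exactly one topic (train, test), from the first table,
# and the R52 totals (train, test).
single_topic = (6552, 2581)
r52_total = (6532, 2568)
# Single-topic documents that did not make it into R52.
dropped = tuple(s - r for s, r in zip(single_topic, r52_total))

# Share of the R8 training set taken by the two largest classes.
top_two_share = (r8["earn"][0] + r8["acq"][0]) / r8_train

print(r8_train, r8_test)        # 5485 2189
print(dropped)                  # (20, 13)
print(round(top_two_share, 3))
```

The column sums match the stated R8 totals, and only 20 train and 13 test single-topic documents fall outside R52. The two largest classes, earn and acq, together cover about 81% of the R8 training documents, so the class distribution is heavily skewed.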

Download

             R8 train                    R8 test                    R52 train                    R52 test
# documents  5485 docs                   2189 docs                  6532 docs                    2568 docs
all-terms    r8-train-all-terms 3.20Mb   r8-test-all-terms 1.14Mb   r52-train-all-terms 4.08Mb   r52-test-all-terms 1.45Mb
no-short     r8-train-no-short 2.90Mb    r8-test-no-short 1.03Mb    r52-train-no-short 3.71Mb    r52-test-no-short 1.32Mb
no-stop      r8-train-no-stop 2.42Mb     r8-test-no-stop 0.86Mb     r52-train-no-stop 3.08Mb     r52-test-no-stop 1.09Mb
stemmed      r8-train-stemmed 2.13Mb     r8-test-stemmed 0.76Mb     r52-train-stemmed 2.71Mb     r52-test-stemmed 0.96Mb
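Once a file has been downloaded, counting documents per class is a quick way to confirm that it matches the tables above. The sketch below assumes, without confirmation from this page, that each line of a file holds one document as a class label, a tab character, and the document text; adjust the parsing if the actual layout differs. The function name `read_labeled_lines` and the sample lines are hypothetical, and an in-memory sample stands in for a real file so the example runs on its own.

```python
import io
from collections import Counter

def read_labeled_lines(handle):
    """Yield (label, text) pairs from lines of the form 'label<TAB>text'.

    The tab-separated layout is an assumption made for this sketch.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        yield label, text

# Hypothetical sample standing in for a downloaded file.
sample = io.StringIO(
    "earn\tfirst sample document\n"
    "acq\tsecond sample document\n"
    "earn\tthird sample document\n"
)
counts = Counter(label for label, _ in read_labeled_lines(sample))
print(counts.most_common())
```

With a real file, replace `sample` with something like `open(path, encoding="utf-8")`, where `path` points at the downloaded file, and compare `counts` with the per-class figures in the R8 or R52 table.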