The documents in the WebKB dataset are webpages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997 and manually classified into seven different classes: student, faculty, staff, department, course, project, and other.

The class other is a collection of pages that were not deemed the "main page" representing an instance of the previous six classes. For example, a particular faculty member may be represented by a home page, a publications list, a vitae and several research interests pages. Only the faculty member's home page was placed in the faculty class. The publications list, vitae and research interests pages were all placed in the other category.

For each class, the collection contains pages from four universities (Cornell, Texas, Washington, and Wisconsin), plus miscellaneous pages collected from other universities.

I discarded the classes department and staff because there were only a few pages from each university. I also discarded the class other because the pages in this class were very different from one another.
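The class filtering described above can be sketched as follows. This is only an illustration: the representation of pages as (label, text) pairs, the function name keep_selected_classes, and the toy input are assumptions, not the actual preprocessing code used for this dataset.

```python
# Classes kept in this version of the dataset (taken from the text above).
KEPT_CLASSES = {"project", "course", "faculty", "student"}

def keep_selected_classes(labeled_pages):
    """Return only the (label, text) pairs whose label is a kept class."""
    return [(label, text) for label, text in labeled_pages
            if label.lower() in KEPT_CLASSES]

# Hypothetical toy input; real pages would be loaded from the collection.
pages = [
    ("faculty", "home page of a professor"),
    ("other", "publications list"),
    ("Department", "cs department page"),
    ("student", "home page of a student"),
]
filtered = keep_selected_classes(pages)
print([label for label, _ in filtered])
```

Labels are lowercased before the comparison so that "Department" and "department" are treated alike.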

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous datasets, I randomly chose two thirds of the documents for training and the remaining third for testing.
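A random two-thirds/one-third split of this kind could look like the sketch below. The function name split_two_thirds, the fixed seed, and the rounding rule are assumptions made for illustration. The original split was not necessarily produced this way, so the resulting counts on the real data may differ slightly from the ones reported here.

```python
import random

def split_two_thirds(docs, seed=0):
    """Shuffle a copy of docs and return (train, test), with 2/3 in train."""
    shuffled = list(docs)
    # A seeded generator makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3  # rounds down; an assumed convention
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy collection of nine document identifiers.
docs = [f"doc{i}" for i in range(9)]
train, test = split_two_thirds(docs)
print(len(train), len(test))
```

Every document ends up in exactly one of the two parts, so the train and test sets never overlap.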

For this particular split, the distribution of documents per class is the following:

Class      # train docs   # test docs   Total # docs
project             336           168            504
course              620           310            930
faculty             750           374           1124
student            1097           544           1641
Total              2803          1396           4199



               Train                           Test
# documents    2803 docs                       1396 docs
File           webkb-train-stemmed (2.40 Mb)   webkb-test-stemmed (1.20 Mb)