20 Newsgroups data set

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though the paper does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

20 Newsgroups
Class  # train docs  # test docs  Total # docs
alt.atheism 480 319 799
comp.graphics 584 389 973
comp.os.ms-windows.misc 572 394 966
comp.sys.ibm.pc.hardware 590 392 982
comp.sys.mac.hardware 578 385 963
comp.windows.x 593 392 985
misc.forsale 585 390 975
rec.autos 594 395 989
rec.motorcycles 598 398 996
rec.sport.baseball 597 397 994
rec.sport.hockey 600 399 999
sci.crypt 595 396 991
sci.electronics 591 393 984
sci.med 594 396 990
sci.space 593 394 987
soc.religion.christian 598 398 996
talk.politics.guns 545 364 909
talk.politics.mideast 564 376 940
talk.politics.misc 465 310 775
talk.religion.misc 377 251 628
Total 11293 7528 18821
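
The per-class counts in the table above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the numbers are copied by hand from the table, and the names COUNTS and split_summary are made up for this illustration. It recomputes the column totals, the share of documents that fall in the training split, and the smallest and largest classes, which shows what "(nearly) evenly" means in practice.

```python
# Per-class (train, test) document counts, copied from the table above.
COUNTS = {
    "alt.atheism": (480, 319),
    "comp.graphics": (584, 389),
    "comp.os.ms-windows.misc": (572, 394),
    "comp.sys.ibm.pc.hardware": (590, 392),
    "comp.sys.mac.hardware": (578, 385),
    "comp.windows.x": (593, 392),
    "misc.forsale": (585, 390),
    "rec.autos": (594, 395),
    "rec.motorcycles": (598, 398),
    "rec.sport.baseball": (597, 397),
    "rec.sport.hockey": (600, 399),
    "sci.crypt": (595, 396),
    "sci.electronics": (591, 393),
    "sci.med": (594, 396),
    "sci.space": (593, 394),
    "soc.religion.christian": (598, 398),
    "talk.politics.guns": (545, 364),
    "talk.politics.mideast": (564, 376),
    "talk.politics.misc": (465, 310),
    "talk.religion.misc": (377, 251),
}


def split_summary(counts):
    """Return (total train docs, total test docs, train fraction)."""
    train = sum(tr for tr, _ in counts.values())
    test = sum(te for _, te in counts.values())
    return train, test, train / (train + test)


train_total, test_total, train_fraction = split_summary(COUNTS)

# Total documents per class, used to find the smallest and largest classes.
sizes = {name: tr + te for name, (tr, te) in COUNTS.items()}
smallest = min(sizes, key=sizes.get)
largest = max(sizes, key=sizes.get)

print(train_total, test_total, round(train_fraction, 3))  # 11293 7528 0.6
print(smallest, sizes[smallest])  # talk.religion.misc 628
print(largest, sizes[largest])    # rec.sport.hockey 999
```

The totals agree with the last row of the table, and roughly 60% of the documents are in the training split. Most classes hold between 900 and 1000 documents; talk.religion.misc and the two smaller talk.politics and alt.atheism classes are the exceptions.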

Download

20 Newsgroups

Version      Train                             Test
# documents  11293 docs                        7528 docs
all-terms    20ng-train-all-terms (15.91 Mb)   20ng-test-all-terms (10.31 Mb)
no-short     20ng-train-no-short (14.06 Mb)    20ng-test-no-short (9.12 Mb)
no-stop      20ng-train-no-stop (10.59 Mb)     20ng-test-no-stop (6.86 Mb)
stemmed      20ng-train-stemmed (9.46 Mb)      20ng-test-stemmed (6.13 Mb)
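
Once one of these files has been downloaded, a quick sanity check is to count the documents per class and compare the result with the table of class sizes. The sketch below rests on an assumption that this page does not state: that each line of a file holds one document, with the class name first, then a tab, then the document text. The helper names parse_documents and class_counts and the two-line sample are hypothetical and only serve the illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_documents(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each non-empty line into (class_label, text) at the first tab.

    ASSUMPTION: one document per line, formatted as "<class>\t<text>".
    """
    docs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        docs.append((label, text))
    return docs


def class_counts(docs: List[Tuple[str, str]]) -> Counter:
    """Count how many documents carry each class label."""
    return Counter(label for label, _ in docs)


# Hypothetical sample standing in for the contents of a downloaded file.
sample = [
    "alt.atheism\tsome words from a first document\n",
    "sci.space\tsome words from a second document\n",
    "sci.space\tsome words from a third document\n",
]

docs = parse_documents(sample)
print(class_counts(docs))
```

To run it on a real file, pass an open file object instead of the sample list, for example open(path, encoding="utf-8"), where path is wherever the download was saved. If the assumed line format is right, the train files should yield 11293 documents and the test files 7528, with per-class counts matching the table.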