KEEL-dataset - data set description

This section describes the main characteristics of the spambase data set and its attributes:

General information

Spambase data set
Type: Classification
Origin: Real world
Features: 57 (Real / Integer / Nominal: 57 / 0 / 0)
Instances: 4597
Classes: 2
Missing values?: No

Attribute description

Attribute            Domain        Attribute             Domain        Attribute                   Domain
Word_freq_make       [0.0, 4.54]   Word_freq_credit      [0.0, 18.18]  Word_freq_pm                [0.0, 11.11]
Word_freq_address    [0.0, 14.28]  Word_freq_your        [0.0, 11.11]  Word_freq_direct            [0.0, 4.76]
Word_freq_all        [0.0, 5.1]    Word_freq_font        [0.0, 17.1]   Word_freq_cs                [0.0, 7.14]
Word_freq_3d         [0.0, 42.81]  Word_freq_000         [0.0, 5.45]   Word_freq_meeting           [0.0, 14.28]
Word_freq_our        [0.0, 10.0]   Word_freq_money       [0.0, 12.5]   Word_freq_original          [0.0, 3.57]
Word_freq_over       [0.0, 5.88]   Word_freq_hp          [0.0, 20.83]  Word_freq_project           [0.0, 20.0]
Word_freq_remove     [0.0, 7.27]   Word_freq_hpl         [0.0, 16.66]  Word_freq_re                [0.0, 21.42]
Word_freq_internet   [0.0, 11.11]  Word_freq_george      [0.0, 33.33]  Word_freq_edu               [0.0, 22.05]
Word_freq_order      [0.0, 5.26]   Word_freq_650         [0.0, 9.09]   Word_freq_table             [0.0, 2.17]
Word_freq_mail       [0.0, 18.18]  Word_freq_lab         [0.0, 14.28]  Word_freq_conference        [0.0, 10.0]
Word_freq_receive    [0.0, 2.61]   Word_freq_labs        [0.0, 5.88]   Char_freq1                  [0.0, 4.385]
Word_freq_will       [0.0, 9.67]   Word_freq_telnet      [0.0, 12.5]   Char_freq2                  [0.0, 9.752]
Word_freq_people     [0.0, 5.55]   Word_freq_857         [0.0, 4.76]   Char_freq3                  [0.0, 4.081]
Word_freq_report     [0.0, 10.0]   Word_freq_data        [0.0, 18.18]  Char_freq4                  [0.0, 32.478]
Word_freq_addresses  [0.0, 4.41]   Word_freq_415         [0.0, 4.76]   Char_freq5                  [0.0, 6.003]
Word_freq_free       [0.0, 20.0]   Word_freq_85          [0.0, 20.0]   Char_freq6                  [0.0, 19.829]
Word_freq_business   [0.0, 7.14]   Word_freq_technology  [0.0, 7.69]   Capital_run_length_average  [1.0, 1102.5]
Word_freq_email      [0.0, 9.09]   Word_freq_1999        [0.0, 6.89]   Capital_run_length_longest  [1.0, 9989.0]
Word_freq_you        [0.0, 18.75]  Word_freq_parts       [0.0, 8.33]   Capital_run_length_total    [1.0, 15841.0]
Spam                 {1, 0}

Additional information

This database contains information about 4597 e-mail messages. The task is to determine whether a given e-mail is spam (class 1) or not (class 0), depending on its contents. Four duplicated instances have been removed from the original data set.

Most of the attributes indicate how frequently a particular word or character occurs in the e-mail. Here are the definitions of the attributes:

- 48 continuous real attributes of type word_freq_"WORD" = percentage of words in the e-mail that match "WORD". A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
- 6 continuous real attributes of type char_freq_"CHAR" = percentage of characters in the e-mail that match "CHAR".
- 1 continuous real attribute of type Capital_run_length_average = average length of uninterrupted sequences of capital letters.
- 1 continuous integer attribute of type Capital_run_length_longest = length of longest uninterrupted sequence of capital letters.
- 1 continuous integer attribute of type Capital_run_length_total = total number of capital letters in the e-mail.
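
The attribute definitions above can be illustrated with a minimal Python sketch. It is not the code used to build the data set. The function name `spambase_style_features`, the default word and character lists, the case-insensitive word matching, the restriction to ASCII capitals, and the value 0 returned when a message has no capital letters are all assumptions made for illustration.

```python
import re

def spambase_style_features(text, words=("free", "money"), chars=("!", "$")):
    """Compute a few spambase-style attributes for one e-mail body (illustrative)."""
    # A "word" is a maximal run of alphanumeric characters.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [t.lower() for t in tokens]  # assumption: case-insensitive matching

    features = {}
    for w in words:
        hits = lowered.count(w.lower())
        # Percentage of words in the e-mail that match the given word.
        features["word_freq_" + w] = 100.0 * hits / len(tokens) if tokens else 0.0
    for c in chars:
        # Percentage of characters in the e-mail that match the given character.
        features["char_freq_" + c] = 100.0 * text.count(c) / len(text) if text else 0.0

    # Uninterrupted sequences of capital letters (ASCII A-Z only, an assumption).
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    features["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    features["capital_run_length_longest"] = max(runs) if runs else 0
    features["capital_run_length_total"] = sum(runs)
    return features

example = spambase_style_features("FREE money!! Get FREE cash NOW")
```

In this example message the capital runs are "FREE", "G", "FREE" and "NOW", so the total number of capital letters is the sum of the run lengths.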

In this section you can download some files related to the spambase data set:

  • The complete data set, already formatted in KEEL format, is available for download as a zip file.
  • A copy of the data set already partitioned by means of a 10-fold cross-validation procedure is available for download as a zip file.
  • A copy of the data set already partitioned by means of a 5-fold cross-validation procedure is available for download as a zip file.
  • The header file associated with this data set is available for download as a text file.
  • This is not a native data set from the KEEL project. It has been obtained from the UCI Machine Learning Repository, where the original page for the data set can be found.
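
For readers who want to build their own partitions instead of using the pre-partitioned copies, a k-fold cross-validation split can be sketched in Python as follows. This is a plain shuffled split, not necessarily the procedure KEEL used to generate its files (which may, for example, preserve class proportions). The function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random

def k_fold_indices(n_instances, k, seed=0):
    """Yield (train, test) index lists for a shuffled, unstratified k-fold split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 4597 instances, as in the spambase data set described above.
splits = list(k_fold_indices(4597, 10))
```

Each instance appears in exactly one test fold, and every training set contains the remaining instances.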

