KEEL: A software tool to assess evolutionary algorithms for Data Mining problems (regression, classification, clustering, pattern mining and so on)

This section describes the main characteristics of the spambase data set and its attributes:

General information

Spambase data set
Type	Classification	Origin	Real world
Features	57	(Real / Integer / Nominal)	(57 / 0 / 0)
Instances	4597	Classes	2
Missing values?			No

Attribute description

Attribute	Domain	Attribute	Domain	Attribute	Domain
Word_freq_make	[0.0, 4.54]	Word_freq_credit	[0.0, 18.18]	Word_freq_pm	[0.0, 11.11]
Word_freq_address	[0.0, 14.28]	Word_freq_your	[0.0, 11.11]	Word_freq_direct	[0.0, 4.76]
Word_freq_all	[0.0, 5.1]	Word_freq_font	[0.0, 17.1]	Word_freq_cs	[0.0, 7.14]
Word_freq_3d	[0.0, 42.81]	Word_freq_000	[0.0, 5.45]	Word_freq_meeting	[0.0, 14.28]
Word_freq_our	[0.0, 10.0]	Word_freq_money	[0.0, 12.5]	Word_freq_original	[0.0, 3.57]
Word_freq_over	[0.0, 5.88]	Word_freq_hp	[0.0, 20.83]	Word_freq_project	[0.0, 20.0]
Word_freq_remove	[0.0, 7.27]	Word_freq_hpl	[0.0, 16.66]	Word_freq_re	[0.0, 21.42]
Word_freq_internet	[0.0, 11.11]	Word_freq_george	[0.0, 33.33]	Word_freq_edu	[0.0, 22.05]
Word_freq_order	[0.0, 5.26]	Word_freq_650	[0.0, 9.09]	Word_freq_table	[0.0, 2.17]
Word_freq_mail	[0.0, 18.18]	Word_freq_lab	[0.0, 14.28]	Word_freq_conference	[0.0, 10.0]
Word_freq_receive	[0.0, 2.61]	Word_freq_labs	[0.0, 5.88]	Char_freq1	[0.0, 4.385]
Word_freq_will	[0.0, 9.67]	Word_freq_telnet	[0.0, 12.5]	Char_freq2	[0.0, 9.752]
Word_freq_people	[0.0, 5.55]	Word_freq_857	[0.0, 4.76]	Char_freq3	[0.0, 4.081]
Word_freq_report	[0.0, 10.0]	Word_freq_data	[0.0, 18.18]	Char_freq4	[0.0, 32.478]
Word_freq_addresses	[0.0, 4.41]	Word_freq_415	[0.0, 4.76]	Char_freq5	[0.0, 6.003]
Word_freq_free	[0.0, 20.0]	Word_freq_85	[0.0, 20.0]	Char_freq6	[0.0, 19.829]
Word_freq_business	[0.0, 7.14]	Word_freq_technology	[0.0, 7.69]	Capital_run_length_average	[1.0, 1102.5]
Word_freq_email	[0.0, 9.09]	Word_freq_1999	[0.0, 6.89]	Capital_run_length_longest	[1.0, 9989.0]
Word_freq_you	[0.0, 18.75]	Word_freq_parts	[0.0, 8.33]	Capital_run_length_total	[1.0, 15841.0]
Spam	{1, 0}

Additional information

This database contains information about 4597 e-mail messages (4 duplicated instances have been removed from the original data set). The task is to determine whether a given e-mail is spam (class 1) or not (class 0), depending on its contents.

Most of the attributes indicate how frequently a particular word or character occurs in the e-mail. Here are the definitions of the attributes:

- 48 continuous real attributes of type word_freq_"WORD" = percentage of words in the e-mail that match "WORD". A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
- 6 continuous real attributes of type char_freq_"CHAR" = percentage of characters in the e-mail that match "CHAR" (listed in the attribute table as Char_freq1 to Char_freq6).
- 1 continuous real attribute of type Capital_run_length_average = average length of uninterrupted sequences of capital letters.
- 1 continuous integer attribute of type Capital_run_length_longest = length of longest uninterrupted sequence of capital letters.
- 1 continuous integer attribute of type Capital_run_length_total = total number of capital letters in the e-mail.
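
The definitions above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the code used to build the data set: the function names (word_freq, char_freq, capital_run_stats) are hypothetical, words are matched case-insensitively, only ASCII letters and digits count as alphanumeric, and an e-mail with no words or no capital letters is given zeros.

```python
import re


def word_freq(text, word):
    """Percentage of words in `text` that match `word`.

    A word is any run of alphanumeric characters bounded by
    non-alphanumeric characters or the end of the string.
    Case-insensitive matching is an assumption of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)


def char_freq(text, char):
    """Percentage of characters in `text` that match `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_stats(text):
    """Return (average, longest, total) over uninterrupted runs of capitals.

    Returning zeros when there are no capital letters is an assumption
    of this sketch.
    """
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


if __name__ == "__main__":
    sample = "FREE money, free Money"
    print(word_freq(sample, "free"))
    print(char_freq(sample, ","))
    print(capital_run_stats(sample))
```

In this sketch the total is the sum of the run lengths, which equals the total number of capital letters in the text, in line with the definition of Capital_run_length_total.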