KEEL: A software tool to assess evolutionary algorithms for Data Mining problems (regression, classification, clustering, pattern mining and so on)

This section describes the main characteristics of the splice data set and its attributes:

General information

Molecular Biology (Splice-junction Gene Sequences) data set
Type	Classification	Origin	Real world
Features	60	(Real / Integer / Nominal)	(0 / 0 / 60)
Instances	3190	Classes	3
Missing values?			No

Attribute description

Attribute	Domain	Attribute	Domain	Attribute	Domain
POS1	{G, C, A, T, D}	POS21	{C, T, G, A, N}	POS41	{C, T, G, A, N}
POS2	{G, C, A, T, D}	POS22	{C, T, G, A, N}	POS42	{C, T, G, A, N}
POS3	{C, G, T, A}	POS23	{C, T, G, A, N}	POS43	{C, T, G, A, N}
POS4	{C, G, T, A}	POS24	{C, T, G, A, N}	POS44	{C, T, G, A, N}
POS5	{C, G, T, A}	POS25	{C, T, G, A, N}	POS45	{C, T, G, A, N}
POS6	{C, G, T, A}	POS26	{C, T, G, A, N}	POS46	{C, T, G, A, N}
POS7	{C, G, T, A}	POS27	{C, T, G, A, N}	POS47	{C, T, G, A, N}
POS8	{C, G, T, A}	POS28	{C, T, G, A, N}	POS48	{C, T, G, A, N}
POS9	{C, G, T, A}	POS29	{C, T, G, A, N}	POS49	{C, T, G, A, N}
POS10	{C, G, T, A}	POS30	{C, T, G, A, N}	POS50	{C, T, G, A, N}
POS11	{C, G, T, A}	POS31	{C, T, G, A, N}	POS51	{C, T, G, A, N}
POS12	{C, G, T, A}	POS32	{C, T, G, A, N}	POS52	{C, T, G, A, N}
POS13	{C, G, T, A}	POS33	{C, T, G, A, N}	POS53	{C, T, G, A, N}
POS14	{C, A, T, G, N}	POS34	{C, T, G, A, N}	POS54	{C, T, G, A, N}
POS15	{C, G, T, A}	POS35	{G, C, T, A, N, R}	POS55	{C, T, G, A, N}
POS16	{C, G, T, A}	POS36	{T, C, G, A, N, S}	POS56	{C, T, G, A, N}
POS17	{C, G, T, A}	POS37	{C, T, G, A, N}	POS57	{C, T, G, A, N}
POS18	{C, G, T, A}	POS38	{C, T, G, A, N}	POS58	{C, T, G, A, N}
POS19	{C, T, G, A, N}	POS39	{C, T, G, A, N}	POS59	{C, T, G, A, N}
POS20	{C, T, G, A, N}	POS40	{C, T, G, A, N}	POS60	{C, T, G, A, N}
Class	{EI, IE, N}
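
As a small illustration of how the domains above might be used, the following sketch checks a record against its declared value sets. Only a few attributes are copied from the table; the name `invalid_fields` and the dictionary-based record layout are assumptions made for this example and are not part of KEEL.

```python
# Domains copied from the attribute table above; only a few positions are
# shown to keep the sketch short. The dict layout is an illustrative choice.
DOMAINS = {
    "POS1": set("GCATD"),
    "POS3": set("CGTA"),
    "POS35": set("GCTANR"),
    "POS36": set("TCGANS"),
    "Class": {"EI", "IE", "N"},
}


def invalid_fields(record, domains=DOMAINS):
    """Return the names of fields whose value lies outside the declared domain.

    Fields without an entry in `domains` are skipped rather than rejected.
    """
    return [
        name
        for name, value in record.items()
        if name in domains and value not in domains[name]
    ]


if __name__ == "__main__":
    # Hypothetical record: "N" is not allowed at POS3 according to the table.
    sample = {"POS1": "D", "POS3": "N", "POS35": "R", "Class": "EI"}
    print(invalid_fields(sample))
```

A check of this kind can catch symbols such as N or R that appear at a position whose domain does not list them.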

Additional information

Splice junctions are points on a DNA sequence at which 'superfluous' DNA is removed during the process of protein creation in higher organisms. The problem posed in this data set is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites) and recognizing intron/exon boundaries (referred to as IE sites). Sequences that contain neither kind of boundary belong to the third class, N.
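
The two subtasks can be framed as two binary targets derived from the Class attribute (EI, IE, N), with each 60-position sequence turned into numeric features. The sketch below is one possible framing, not a method prescribed by KEEL: the function names are hypothetical, and mapping every symbol other than A, C, G and T to an all-zero vector is an assumption made for simplicity.

```python
ALPHABET = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a flat list of 0/1 values, four per position.

    Symbols outside A, C, G, T (for example N, D, R, S) become four zeros;
    this handling is an illustrative assumption.
    """
    encoded = []
    for base in sequence:
        encoded.extend(1 if base == letter else 0 for letter in ALPHABET)
    return encoded


def subtask_targets(label):
    """Split the three-class label into (is_EI_site, is_IE_site) booleans."""
    if label not in {"EI", "IE", "N"}:
        raise ValueError(f"unknown class label: {label!r}")
    return label == "EI", label == "IE"


if __name__ == "__main__":
    # Hypothetical 60-nucleotide instance, not taken from the data set.
    sequence = "ACGT" * 15
    features = one_hot(sequence)
    print(len(features), subtask_targets("IE"))
```

With this encoding each instance yields 240 binary features, and an N instance is a negative example for both subtasks.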