This section describes the main characteristics of the splice data set and its attributes:
General information
Molecular Biology (Splice-junction Gene Sequences) data set |
Type | Classification | Origin | Real world |
Features | 60 | (Real / Integer / Nominal) | (0 / 0 / 60) |
Instances | 3190 |
Classes | 3 |
Missing values? | No |
Attribute description
Attribute | Domain | Attribute | Domain | Attribute | Domain |
POS1 | {G, C, A, T, D} | POS21 | {C, T, G, A, N} | POS41 | {C, T, G, A, N} |
POS2 | {G, C, A, T, D} | POS22 | {C, T, G, A, N} | POS42 | {C, T, G, A, N} |
POS3 | {C, G, T, A} | POS23 | {C, T, G, A, N} | POS43 | {C, T, G, A, N} |
POS4 | {C, G, T, A} | POS24 | {C, T, G, A, N} | POS44 | {C, T, G, A, N} |
POS5 | {C, G, T, A} | POS25 | {C, T, G, A, N} | POS45 | {C, T, G, A, N} |
POS6 | {C, G, T, A} | POS26 | {C, T, G, A, N} | POS46 | {C, T, G, A, N} |
POS7 | {C, G, T, A} | POS27 | {C, T, G, A, N} | POS47 | {C, T, G, A, N} |
POS8 | {C, G, T, A} | POS28 | {C, T, G, A, N} | POS48 | {C, T, G, A, N} |
POS9 | {C, G, T, A} | POS29 | {C, T, G, A, N} | POS49 | {C, T, G, A, N} |
POS10 | {C, G, T, A} | POS30 | {C, T, G, A, N} | POS50 | {C, T, G, A, N} |
POS11 | {C, G, T, A} | POS31 | {C, T, G, A, N} | POS51 | {C, T, G, A, N} |
POS12 | {C, G, T, A} | POS32 | {C, T, G, A, N} | POS52 | {C, T, G, A, N} |
POS13 | {C, G, T, A} | POS33 | {C, T, G, A, N} | POS53 | {C, T, G, A, N} |
POS14 | {C, A, T, G, N} | POS34 | {C, T, G, A, N} | POS54 | {C, T, G, A, N} |
POS15 | {C, G, T, A} | POS35 | {G, C, T, A, N, R} | POS55 | {C, T, G, A, N} |
POS16 | {C, G, T, A} | POS36 | {T, C, G, A, N, S} | POS56 | {C, T, G, A, N} |
POS17 | {C, G, T, A} | POS37 | {C, T, G, A, N} | POS57 | {C, T, G, A, N} |
POS18 | {C, G, T, A} | POS38 | {C, T, G, A, N} | POS58 | {C, T, G, A, N} |
POS19 | {C, T, G, A, N} | POS39 | {C, T, G, A, N} | POS59 | {C, T, G, A, N} |
POS20 | {C, T, G, A, N} | POS40 | {C, T, G, A, N} | POS60 | {C, T, G, A, N} |
Class | {EI, IE, N} |
Additional information
Splice junctions are points on a DNA sequence at which 'superfluous' DNA is removed during the process of protein creation in higher organisms. The problem posed in this data set is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). Sequences that contain neither kind of boundary form the third class (N).
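Most classifiers need the 60 nominal positions converted to numbers. The following is a minimal sketch of one way to do that, not part of the data set distribution. It assumes the extra symbols in the attribute domains (D, N, R, S) are the standard IUPAC ambiguity codes, and it spreads an ambiguous symbol's weight equally over the bases it can stand for, which is a design choice rather than anything prescribed by the data set. The names `IUPAC`, `BASES` and `encode_sequence` are hypothetical.

```python
# Assumption: D, N, R, S follow the standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # A or G
    "S": "CG",    # C or G
    "D": "AGT",   # A, G or T (anything but C)
    "N": "ACGT",  # any base
}
BASES = "ACGT"


def encode_sequence(seq):
    """Return a flat list with 4 entries (A, C, G, T) per position.

    An unambiguous base becomes a one-hot vector; an ambiguous symbol
    splits a total weight of 1.0 equally across the bases it may represent.
    """
    encoded = []
    for symbol in seq.upper():
        allowed = IUPAC[symbol]  # raises KeyError on an unknown symbol
        weight = 1.0 / len(allowed)
        encoded.extend(weight if base in allowed else 0.0 for base in BASES)
    return encoded


# A full instance has 60 positions, so it encodes to 240 numbers.
vector = encode_sequence("G" * 60)
```

Treating each symbol as its own category (plain one-hot over the observed domain) is an equally reasonable alternative; the version above just keeps the encoded width fixed at four columns per position.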
In this section you can download some files related to the splice data set:
- The complete data set, already formatted in KEEL format, can be downloaded from here.
- A copy of the data set already partitioned by means of a 10-fold cross-validation procedure can be downloaded from here.
- A copy of the data set already partitioned by means of a 5-fold cross-validation procedure can be downloaded from here.
- The header file associated with this data set can be downloaded from here.
- This is not a native data set from the KEEL project. It has been obtained from the UCI Machine Learning repository. The original page where the data set can be found is: http://archive.ics.uci.edu/ml/datasets/Molecular+Biology+%28Splice-junction+Gene+Sequences%29.
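Once downloaded, the file can be read with a few lines of code. The sketch below assumes the usual KEEL text layout: a header of `@relation`, `@attribute`, `@inputs` and `@outputs` lines followed by `@data` and comma-separated rows. The function name `parse_keel` is hypothetical, and the `sample` string is a shortened, made-up stand-in for the real file, not actual records from the data set.

```python
import re


def parse_keel(text):
    """Split KEEL-style text into nominal attribute domains and data rows."""
    attributes = {}
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            # Only nominal attributes of the form "name {v1, v2, ...}" are handled.
            match = re.match(r"@attribute\s+(\S+)\s*\{(.*)\}", line, re.IGNORECASE)
            if match:
                domain = [value.strip() for value in match.group(2).split(",")]
                attributes[match.group(1)] = domain
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


# Made-up two-attribute excerpt for illustration only.
sample = """@relation splice
@attribute POS1 {G, C, A, T, D}
@attribute POS2 {G, C, A, T, D}
@attribute Class {EI, IE, N}
@inputs POS1, POS2
@outputs Class
@data
G, C, EI
D, T, N
"""

attributes, rows = parse_keel(sample)

# Check that every value in every row belongs to its attribute's domain.
names = list(attributes)
all_valid = all(
    value in attributes[name] for row in rows for name, value in zip(names, row)
)
```

For the real file, `attributes` should contain the 60 `POS` entries plus `Class`, and `rows` should have 3190 entries, which gives a quick check against the figures in the general information table.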