KEEL-dataset - data set description



This section describes the main characteristics of the genbase data set and its attributes:

General information

Proteins data set
Type: Multi label
Origin: Real world
Features: 1186
(Real / Integer / Nominal): (0 / 0 / 1186)
Instances: 662
Classes: 27
Missing values?: No

Additional information

The protein classes considered are the 27 most important protein families. For clarity of presentation, the Prosite documentation ID (the PDOCxxxxx number) is used to represent each class. Similarly, the Prosite access number (the PSxxxxx number) is used to represent each motif pattern or profile. During preprocessing, a training set was exported, consisting of 662 proteins that belong to 27 classes. Some proteins belong to more than one class, so the problem can be defined as a multi-label classification problem.
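To make the multi-label setting concrete, the following minimal sketch turns per-protein class lists into a binary indicator matrix, a common way to represent such targets. The protein names and PDOC identifiers are placeholders chosen for illustration, not actual genbase annotations, and the helper `to_indicator_matrix` is hypothetical.

```python
# Toy data only: names and PDOC IDs are placeholders, not real genbase records.
protein_classes = {
    "protein_a": ["PDOC00001"],
    "protein_b": ["PDOC00001", "PDOC00002"],  # belongs to two classes
    "protein_c": ["PDOC00003"],
}


def to_indicator_matrix(samples):
    """Return (sorted label list, {sample name: 0/1 row over those labels})."""
    labels = sorted({c for classes in samples.values() for c in classes})
    index = {label: i for i, label in enumerate(labels)}
    matrix = {}
    for name, classes in samples.items():
        row = [0] * len(labels)
        for c in classes:
            row[index[c]] = 1  # mark every class the sample belongs to
        matrix[name] = row
    return labels, matrix


labels, matrix = to_indicator_matrix(protein_classes)
for name, row in matrix.items():
    print(name, row)
```

A row with more than one `1` corresponds to a protein that belongs to several classes, which is what distinguishes this problem from ordinary single-label classification.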




In this section you can download some files related to the genbase data set:

  • The complete data set already formatted in KEEL format can be downloaded from here.
  • A copy of the data set already partitioned by means of a 10-fold cross-validation procedure can be downloaded from here.
  • A copy of the data set already partitioned by means of a 5-fold cross-validation procedure can be downloaded from here.
  • The header file associated with this data set can be downloaded from here.
  • This is not a native data set from the KEEL project. It has been obtained from the Mulan repository. The original page where the data set can be found is: http://mulan.sourceforge.net/datasets.html.
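As a rough illustration of how the KEEL-format files listed above might be read, the sketch below splits a file into header lines and data rows. It assumes the usual KEEL layout of `@`-prefixed header lines (`@relation`, `@attribute`, `@inputs`, `@outputs`) followed by a `@data` section of comma-separated rows. The sample text and the `parse_keel` helper are hypothetical, and the real genbase file should be checked against this assumption before relying on it.

```python
# Tiny made-up sample in the assumed KEEL layout; not taken from genbase.
SAMPLE = """\
@relation toy
@attribute f1 {YES,NO}
@attribute f2 {YES,NO}
@attribute c1 {0,1}
@inputs f1, f2
@outputs c1
@data
YES,NO,1
NO,NO,0
"""


def parse_keel(text):
    """Split KEEL-style text into (header lines, list of data rows)."""
    header, rows = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@data"):
            in_data = True  # everything after this marker is a data row
        else:
            header.append(line)
    return header, rows


header, rows = parse_keel(SAMPLE)
print(len(header), "header lines,", len(rows), "data rows")
```

For a multi-label set such as genbase, the `@outputs` line would be expected to name several class attributes, so the trailing columns of each row would hold one value per class.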


 
 Copyright 2004-2018, KEEL (Knowledge Extraction based on Evolutionary Learning)