This section describes the main characteristics of the genbase data set and its attributes:
The protein classes considered are the 27 most important protein families. For clarity of presentation, the Prosite documentation ID (the PDOCxxxxx number) was used to represent each class. Similarly, the Prosite accession number (the PSxxxxx number) was used to represent each motif pattern or profile. During preprocessing, a training set of 662 proteins belonging to 27 classes was exported. Some proteins belonged to more than one class, so the problem can be defined as a multi-label classification problem.
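
The multi-label setting described above can be illustrated with a minimal sketch that turns per-protein class lists into a binary indicator matrix, with one column per class. The protein names and class IDs below are hypothetical placeholders, not values taken from the genbase data set, and the helper `build_label_matrix` is an illustrative name rather than part of any tool used to prepare the data.

```python
# Hypothetical toy data: protein names and class IDs are placeholders,
# NOT actual entries from the genbase data set.
protein_classes = {
    "protein_a": ["PDOC_A"],
    "protein_b": ["PDOC_A", "PDOC_B"],  # belongs to two classes
    "protein_c": ["PDOC_C"],
}


def build_label_matrix(protein_classes):
    """Return (sorted class list, {protein: 0/1 indicator row})."""
    # Collect every distinct class and fix a column order.
    classes = sorted({c for labels in protein_classes.values() for c in labels})
    column = {c: i for i, c in enumerate(classes)}

    matrix = {}
    for protein, labels in protein_classes.items():
        row = [0] * len(classes)
        for c in labels:
            row[column[c]] = 1  # mark membership in class c
        matrix[protein] = row
    return classes, matrix


classes, matrix = build_label_matrix(protein_classes)

# Proteins with more than one positive label make the task multi-label.
multi_label = [p for p, row in matrix.items() if sum(row) > 1]
```

In the real data set the matrix would have 662 rows and 27 columns; any row with more than one nonzero entry corresponds to a protein that belongs to several classes.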