CONCEPTUAL CLUSTERING AND MACHINE LEARNING
Cluster analysis -or simply clustering- is a data mining technique often used to identify various groupings or taxonomies in databases. Most existing methods for clustering are designed for linear feature-value data. However, sometimes we need to represent and learn structural data that not only contains descriptions of individual observations in databases, but also relationships among these observations. Therefore, mining into structural databases entails addressing both the uncertainty of which observations should be placed together, and also which distinct relationships among features best characterize different sets of observations. Typical clustering techniques (Everitt and Der 1996; Kohavi and John 1997; Yeung and Ruzzo 2001) are not designed to do this, even when combined with global filter feature selection methods such as principal component analysis or stepwise descendent methods (Everitt and Der 1996; Kohavi and John 1997; Yeung and Ruzzo 2001). In contrast, conceptual clustering techniques have been successfully applied to structural databases to uncover concepts that are embedded in subsets of structural data or substructures (Cheeseman and Oldford 1994; Cook et al. 2001; Cooper and Herskovits 1992). Consequently, conceptual learning can be formulated as the problem of searching through a predefined space of potential hypotheses (i.e., substructures or associations of features and observations) for those that best fit the training examples.
While most machine learning techniques applied directly or indirectly to structural databases exhibit methodological differences, they do share the same framework even though they employ distinct metrics, heuristics or probability interpretations (Cheeseman and Oldford 1994; Cook et al. 2001; Cooper and Herskovits 1992) as follows:
Structure representation. Structural data can be viewed as a graph containing nodes representing objects, which have features linked to other nodes by edges corresponding to their relations. A substructure consists of a sub-graph of structural data (Cook, et al., 2001). These data can be efficiently organized by taking advantage of a naturally occurring structure over feature space, which consists of a general to specific ordering of possible substructures (i.e., a lattice).
Structure learning. This process consist of searching through the lattice space for potential substructures, and either returning the best one found or an optimal sample of them. If the number of substructures is super-exponential in the number of nodes, different heuristic methods can be used (e.g., greedy (Cooper and Herskovits 1992); hill climbing (Chickering 2003); genetic algorithms (Larranaga 1996)).
Subtructure evaluation. The formulation of the clustering problem in a lattice or graph-based structure would result in the generation of many substructures with small extent, as it is easier to explain or substructure-match smaller data subsets than those that constitute a significant portion of the dataset. For this reason, any successful methodology should also consider additional criteria to extract broader or more comprehensive substructures based on their size, the number of retrieved substructures, and their diversity and extent of overlap (Cook et al. 2001; Ruspini 2001). These are conflicting criteria that can formulated as a multiobjective optimization problem, analogous to minimum description-length methods (Rissanen 1989), based on the combination of the individual criteria or objectives into a global measure of cluster quality. The basic challenge with this approach, however, is its potential bias and inflexibility caused by weighting of the objectives (Ruspini, 2001).
Inference. New observations can be predicted from previously learned substructures by using classifiers that optimize their matching based on distance (Bezdek 1998) or probabilistic metrics (Cooper and Herskovits, 1992(Mitchell 1997)). When designed for labeled data, the approach is referred to as supervised learning (as opposed to unsupervised learning.)
Bezdek, J.C. 1998. Pattern Analysis. In Handbook of Fuzzy Computation (eds. W. Pedrycz P.P. Bonissone, and E.H. Ruspini), pp. F6.1.1-F6.6.20. Institute of Physics, Bristol.
Cheeseman, P. and R.W. Oldford. 1994. Selecting models from data : artificial intelligence and statistics IV. Springer-Verlag, New York.
Chickering, D.M. 2003. Optimal Structure Identification With Greedy Search. Journal of Machine Learning Research 3: 507-554.
Cook, D.J., L.B. Holder, S. Su, R. Maglothin, and I. Jonyer. 2001. Structural mining of molecular biology data. IEEE Eng Med Biol Mag 20: 67-74.
Cooper, G.F. and E. Herskovits. 1992. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 9: 309-347.
Everitt, B. and G. Der. 1996. A handbook of statistical analysis using SAS. Chapman & Hall, London.
Kohavi, R. and G.H. John. 1997. Wrappers for feature subset selection. Artificial Intelligence 97: 273-324.
Larranaga, P.a.P., M. 1996. Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance
Analysis of Control Parameters. IEEE Journal on Pattern Analysis and Machine Intelligence 18: 912-926.
Mitchell, T.M. 1997. Machine learning. McGraw-Hill, New York.
Rissanen, J. 1989. Stochastic complexity in statistical inquiry. World Scientific, Singapore.
Ruspini, E.H.a.Z., I. 2001. Automated Generation of Qualitative Representations of Complex Objects by Hybrid Soft-Computing Methods. In Pattern recognition : from classical to modern approaches (ed. A. Pal), pp. xxii, 612. World Scientific, New Jersey.
Yeung, K.Y. and W.L. Ruzzo. 2001. Principal component analysis for clustering gene expression data. Bioinformatics 17: 763-774.