M4MLab - Mining, Modeling, Annotating & Predicting for M4M

A Laboratory of the Research Group
"Soft Computing and Intelligent Information Systems"

Research Summary

MOTIVATION: In recent decades, large amounts of data from different areas, especially Biology and Medicine, have been collected and stored in digital repositories. The growing volume of collected information and the structural complexity of these repositories have not been matched by a parallel development of appropriate tools for mining and extracting information. This makes it almost impossible both to recover valuable knowledge from these repositories and to generate hypotheses based on that knowledge. Rather, the structures typically provided to organize and index these collections reflect the convenience of database implementers and their tendency to rely on approaches developed to store large numbers of data objects.

The recent renewed interest in knowledge-discovery techniques for systems biology has led to the development of a large number of data analysis methods, intended to facilitate the extraction of knowledge and the understanding of the objects represented in these repositories and their related systems. However, most of these existing methods, such as clustering or machine learning techniques, are customized only for linear feature-value data. New biological databases also contain other data types, such as structural data (e.g., the Gene Ontology database, metabolic pathways), time series, graphic circuits, or hierarchically organized datasets, which contain not only descriptions of the individual instances in the database, but also the relationships among these instances [1, 2]. Additionally, the retrieved structures and concepts have to deal with the uncertainty of the stored information, because many databases contain predictions with different degrees of confidence rather than experimental data, and do not include certified experimental results even when these are available.

APPROACH: To address the problem of identifying information networks in biology, we propose a generalized conceptual clustering methodology that retrieves system descriptions, combines information from different sources (e.g., genomic, proteomic, and transcriptomic data), performs annotation, and predicts new cases based on that knowledge. This methodology is based on conceptual clustering and on multivariate, multiobjective, multimodal, and multi-decision-maker optimization techniques, and it combines metaheuristics based on evolutionary computation with fuzzy clustering and control techniques. The computational approach is combined with molecular biology techniques, including microarrays, chromatin immunoprecipitation, and real-time PCR, to validate and update the computationally discovered information networks.
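As an illustration of the fuzzy clustering component mentioned above, the following is a minimal sketch of the standard fuzzy c-means procedure in plain Python. It is not the laboratory's implementation: the function names, the fuzzifier m = 2, the fixed iteration count, and the toy two-dimensional profiles are all assumptions made for this example. Each object receives a graded membership in every cluster rather than a single hard label, which is one way to carry uncertainty through the analysis.

```python
import math


def memberships(points, centers, m=2.0):
    """Return u[i][k], the degree to which point i belongs to cluster k."""
    u = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if any(x == 0.0 for x in d):
            # The point coincides with a center: share full membership
            # among the coinciding centers and give zero to the rest.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            u.append([h / sum(hits) for h in hits])
            continue
        # Standard fuzzy c-means rule: weight by distance^(-2 / (m - 1)),
        # then normalize so that each row sums to 1.
        inv = [x ** (-2.0 / (m - 1.0)) for x in d]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def update_centers(points, u, m=2.0):
    """Recompute each center as the mean of the points weighted by u^m."""
    dim = len(points[0])
    centers = []
    for k in range(len(u[0])):
        w = [row[k] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * p[j] for wi, p in zip(w, points)) / total
            for j in range(dim)
        ))
    return centers


def fuzzy_c_means(points, init_centers, m=2.0, iters=50):
    """Alternate membership and center updates for a fixed number of steps."""
    centers = list(init_centers)
    for _ in range(iters):
        u = memberships(points, centers, m)
        centers = update_centers(points, u, m)
    return centers, memberships(points, centers, m)


# Hypothetical two-dimensional profiles forming two well-separated groups.
profiles = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
            (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
centers, u = fuzzy_c_means(profiles, init_centers=[(0.0, 0.0), (1.0, 1.0)])
```

In this sketch, `u[i]` holds the membership degrees of profile `i` in the two clusters, and each row sums to 1. A real analysis would add a convergence test and a principled choice of the number of clusters.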

We propose a novel multidisciplinary approach to identify features that describe the objects contained in biological databases, to discover and represent interesting relationships among these features, to generate structured indices and textual annotations of these relationships, and to identify concepts by analyzing and fusing knowledge from different information sources. The final goal is to generate hypotheses that can be biologically validated with experimental molecular biology and genetics techniques. Our proposal also extends the initial approach of identifying information networks in biology by considering the evolution of these networks as well as their dynamics, the integration of different sources of information, and the experimental validation of the generated hypotheses.

APPLICATION: We apply our approach to identify information networks in three different biological environments: cis-regulatory networks, where we will identify differential regulatory patterns of co-regulated promoters to describe small networks, termed two-component systems, in prokaryotic genomes; gene expression networks, where we will learn kinetic classes of the two-component networks and their dynamics in prokaryotic organisms, and extend the developed methods to identify differential gene expression profiles in eukaryotic organisms; and, last but not least, hierarchical networks, where we will describe gene products in terms of their ontologies in a species-independent manner.