# Models for Big Data Preprocessing

### A1: Training Set Selection for Monotonic Ordinal Classification

In recent years, monotonic ordinal classification has attracted growing attention from the machine learning community. Real-life problems frequently have monotonicity constraints. Many monotonic classifiers require that the input data sets satisfy the monotonicity relationships between their samples. To address this, a conventional strategy consists of relabeling the input data to achieve complete monotonicity. As an alternative, we explore the use of preprocessing algorithms without modifying the class label of the input data. In this paper we propose the use of training set selection to choose the most effective instances, which lead the monotonic classifiers to obtain more accurate and efficient models while fulfilling the monotonic constraints. To show the benefits of our proposed training set selection algorithm, called MonTSS, we carry out experiments on 30 data sets related to ordinal classification problems.

This work has been published in the following reference:

J.R. Cano, S. García. Training Set Selection for Monotonic Ordinal Classification. Data and Knowledge Engineering (2017) 112 94-105.

### A2: An insight into imbalanced Big Data classification: outcomes and challenges

Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the significant advantages related to knowledge extraction from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper is built on three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current state of research in this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we carry out a discussion on the challenges and future directions for the topic.

This work has been published in the following reference:

A. Fernández, S. del Río, N. V. Chawla, F. Herrera. An insight into imbalanced Big Data classification: outcomes and challenges. Complex Intell. Syst. (2017) 3 105–120.

### A3: A survey on data preprocessing for data stream mining: Current status and future directions

In this survey, an enumeration, classification and analysis of existing data stream preprocessing contributions is performed, finally outlining the future challenges that need to be addressed to develop novel methods. The experiments evaluated the usefulness and performance of many algorithms for preprocessing data streams according to their effectiveness, time and memory performance and reduction rate. The four main approaches to tackling concept drift in data streams are reviewed as follows:

- Concept drift detectors: external tools used alongside the classification module that measure various properties of the data stream, such as the instance distribution or the standard deviation. Changes in these properties are attributed to the possible presence of drift, so a warning signal is emitted when the changes start occurring, and a drift signal is emitted when the degree of change is high enough that a new classifier should be trained on the most recent instances.
- Sliding windows: a buffer of fixed size holding the most recent examples is kept and used for classification. Examples are discarded when new instances become available, either by cutting off the oldest instances or by weighting them according to their relevance. The window size is crucial for performance.
- Online learners: updated instance by instance, accommodating changes in the stream as soon as they occur. Some standard classification models, such as Naive Bayes and Neural Networks, can work in online mode.
- Ensemble learners: their structure combines a component that trains a new classifier on recently arrived data and adds it to the ensemble with a pruning component based on a weighting scheme.
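
The sliding-window idea can be illustrated with a minimal Python sketch. It is not taken from the survey: the class name `SlidingWindowMajority`, the majority-vote predictor and the toy stream are assumptions chosen only to show how a fixed-size buffer forgets old examples after an abrupt drift.

```python
from collections import Counter, deque


class SlidingWindowMajority:
    """Hypothetical learner: predicts the majority label in a fixed-size window."""

    def __init__(self, window_size):
        # deque(maxlen=...) drops the oldest example automatically when full.
        self.window = deque(maxlen=window_size)

    def learn_one(self, label):
        self.window.append(label)

    def predict(self):
        if not self.window:
            return None  # nothing seen yet
        return Counter(self.window).most_common(1)[0][0]


# Toy stream with an abrupt concept drift after six examples (assumed data).
stream = ["a"] * 6 + ["b"] * 6
model = SlidingWindowMajority(window_size=4)
predictions = []
for label in stream:
    predictions.append(model.predict())  # test-then-train evaluation
    model.learn_one(label)
```

With a window of four examples the predictor switches to the new concept a few instances after the drift. A larger window would react more slowly but be less sensitive to noise, which is why the window size matters so much.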

Since streaming scenarios demand reduction techniques that process elements online or in batch mode as quickly as possible, without making any assumptions about the data distribution, data reduction methods for data streams are reviewed grouped by family: Dimensionality Reduction, Instance Reduction and Feature Space Simplification.

The results show that, for Feature Selection, NB yields better accuracy when all features are available during prediction; however, SU generated results very close to those of NB with much simpler solutions, and information-based methods are more accurate than OFS. For Instance Selection, the best method on average is the updated KNN. For Discretization, the most accurate method is IDA, with results very close to those obtained with OC and NB.

This work has been published in the following reference:

S. Ramírez-Gallego, B. Krawczyk, S. García, M. Wozniak, F. Herrera. A survey on data preprocessing for data stream mining: Current status and future directions. Neurocomputing 239 (2017) 39-57.

### A4: An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark

The framework contains a generic implementation of several information theory-based feature selection (FS) methods, such as mRMR (minimum redundancy maximum relevance), conditional MI (mutual information) maximization and JMI (joint mutual information). It is based on the information theory-based framework proposed by Brown, adapted to the Big Data environment and integrated with Apache Spark MLlib.

Brown proposed a generic expression that allows multiple information theory criteria to be combined in a single FS framework, based on a greedy optimization process that evaluates features using a simple scoring criterion. The generic formula proposed by Brown is:

$$J=I(X_{i};Y) - \beta\sum_{X_{j}\in S} I(X_{j};X_{i})+ \gamma \sum_{X_{j}\in S}I(X_{j};X_{i}|Y)$$

where the first term represents the relevance of a feature $X_{i}$, the second term represents the redundancy between two features ($X_{i}$, $X_{j}$), and the third term represents the conditional redundancy between two features ($X_{i}$, $X_{j}$) given the class ($Y$). $S$ is the set of already selected features, and $\beta$ and $\gamma$ are weight factors.
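
To make the criterion concrete, the following Python sketch evaluates $J$ for discrete features and runs the greedy selection on a toy data set. It is an illustration under assumptions, not the Spark implementation: the function names (`entropy`, `mi`, `cmi`, `brown_score`, `greedy_select`), the toy columns and the fixed weights $\beta=1$, $\gamma=0$ are all hypothetical choices.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * log2(c / n) for c in counts.values())


def mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cmi(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def brown_score(xi, y, chosen, beta, gamma):
    """J = relevance - beta * redundancy + gamma * conditional redundancy."""
    redundancy = sum(mi(xj, xi) for xj in chosen)
    cond_redundancy = sum(cmi(xj, xi, y) for xj in chosen)
    return mi(xi, y) - beta * redundancy + gamma * cond_redundancy


def greedy_select(features, y, k, beta, gamma):
    """Pick k feature names, one per iteration, maximizing J each time."""
    selected = []
    while len(selected) < min(k, len(features)):
        chosen = [features[name] for name in selected]
        candidates = [name for name in features if name not in selected]
        best = max(
            candidates,
            key=lambda name: brown_score(features[name], y, chosen, beta, gamma),
        )
        selected.append(best)
    return selected


# Toy data (assumed): "a" is relevant, "a_copy" duplicates it, "c" is unrelated.
y = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
chosen_features = greedy_select(features, y, k=2, beta=1.0, gamma=0.0)
```

In this toy case the redundancy term penalizes the exact duplicate so heavily that the second pick is `c` rather than `a_copy`, even though `c` carries no information about the class on its own.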

For the Spark implementation, Brown's framework was redesigned by making some improvements:

1. A columnar transformation is applied to make the data more manageable by transposing the local data matrix provided by each partition. Afterwards, the transformed data is cached and reused in subsequent computations. The result is a new matrix with one row per feature, generating tuples of the form `<k, (part.index, matrix(k))>`, where the key is the feature index and the value is the index of the partition together with the local matrix for this feature block.
2. Broadcasting is used to avoid superfluous network and CPU usage, minimizing data movement by replicating the output feature and the last selected feature in each iteration. The MI computation is performed locally in each partition, so the algorithm runs efficiently.
3. The MI between each input feature and the output is computed once when the algorithm starts and then cached for reuse in subsequent evaluations. The same is done for marginal and joint proportions, avoiding redundant computation per feature, as this information is replicated on all nodes.
4. The complexity of the problem is reduced by using a greedy search process that selects only one feature per iteration.
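
The columnar transformation in step 1 can be sketched in plain Python, with lists standing in for Spark partitions. The function name `columnar_transform` and the toy partitions are assumptions. In Spark, a per-partition operation such as `mapPartitionsWithIndex` would play the role of the outer loop, but the code below does not depend on Spark.

```python
def columnar_transform(partitions):
    """Turn row-wise partitions into (feature_index, (partition_index, column)) tuples."""
    tuples = []
    for part_index, rows in enumerate(partitions):
        # zip(*rows) transposes the local matrix: one sequence per feature.
        for k, column in enumerate(zip(*rows)):
            tuples.append((k, (part_index, list(column))))
    return tuples


# Two toy partitions of a data set with two features (assumed data).
partitions = [
    [[1, 10], [2, 20]],  # partition 0: two instances
    [[3, 30]],           # partition 1: one instance
]

# Group the tuples by feature index, as a keyed shuffle would.
by_feature = {}
for k, (part_index, column) in columnar_transform(partitions):
    by_feature.setdefault(k, []).append((part_index, column))
```

After the transformation, all values of a given feature can be reached through its key, so per-feature computations such as MI no longer need to scan complete rows.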

This work has been published in the following reference:

S. Ramírez-Gallego, H. Mouriño-Talín, D. Martínez-Rego, V. Bolón-Canedo, J.M. Benítez, A. Alonso-Betanzos, F. Herrera. An Information Theory-Based Feature Selection Framework for Big Data under Apache Spark. IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017), In Press.

### A5: A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines

Many of the existing machine learning algorithms, both supervised and unsupervised, depend on the quality of the input characteristics to generate a good model. The number of these variables is also important, since performance tends to decline as the input dimensionality increases, hence the interest in using feature fusion techniques, able to produce feature sets that are more compact and higher level. A plethora of procedures to fuse original variables for producing new ones has been developed in the past decades. The most basic ones use linear combinations of the original variables, such as PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis), while others find manifold embeddings of lower dimensionality based on non-linear combinations, such as Isomap or LLE (Locally Linear Embedding) techniques. More recently, autoencoders (AEs) have emerged as an alternative to manifold learning for conducting nonlinear feature fusion. Dozens of AE models have been proposed lately, each with its own specific traits. Although many of them can be used to generate reduced feature sets through the fusion of the original ones, there are also AEs designed with other applications in mind.
The goal of this paper is to provide the reader with a broad view of what an AE is, how they are used for feature fusion, a taxonomy gathering a broad range of models, and how they relate to other classical techniques. In addition, a set of didactic guidelines on how to choose the proper AE for a given task is supplied, together with a discussion of the software tools available. Finally, two case studies illustrate the usage of AEs with datasets of handwritten digits and breast cancer.

This work has been published in the following reference:

D. Charte, F. Charte, S. García, M. J. del Jesus, F. Herrera. A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines. Information Fusion (2018) 44 78-96

### A6: Tips, guidelines and tools for managing multi-label datasets: the mldr.datasets R package and the Cometa data repository

New proposals in the field of multi-label learning algorithms have been growing in number steadily over the last few years. The experimentation associated with each of them always goes through the same phases: selection of datasets, partitioning, training, analysis of results and, finally, comparison with existing methods. This last step is often hampered since it involves using exactly the same datasets, partitioned in the same way and using the same validation strategy. In this paper we present a set of tools whose objective is to facilitate the management of multi-label datasets, aiming to standardize the experimentation procedure. The two main tools are an R package, mldr.datasets, and a web repository with datasets, Cometa. Together, these tools will simplify the collection of datasets, their partitioning, documentation and export to multiple formats, among other functions. Some tips, recommendations and guidelines for a good experimental analysis of multi-label methods are also presented.

This work has been published in the following reference:

F. Charte, A. J. Rivera, D. Charte, M. J. Del Jesús, F. Herrera. Tips, guidelines and tools for managing multi-label datasets: the mldr.datasets R package and the Cometa data repository. Neurocomputing. In Press.

### A7: SMOTE for Learning from Imbalanced Data: Progress and Challenges

This work has been published in the following reference:

A. Fernandez, S. Garcia, N.V. Chawla, F. Herrera. SMOTE for Learning from Imbalanced Data: Progress and Challenges. Marking the 15-year Anniversary. Journal of Artificial Intelligence Research 61 (2018) 863-905

### A8: Principal Components Analysis Random Discretization Ensemble for Scalable Big Data

Humongous amounts of data have created a lot of challenges in terms of data computation and analysis. Classic data mining techniques are not prepared for the new space and time requirements. Discretization and dimensionality reduction are two of the data reduction tasks in knowledge discovery. Random Projection Random Discretization is a novel and recently proposed ensemble method by Ahmad and Brown in 2014 that performs discretization and dimensionality reduction to create more informative data. Despite the good efficiency of random projections in dimensionality reduction, more robust methods like Principal Components Analysis (PCA) can improve the performance.

We propose a new ensemble method to overcome this drawback using the Apache Spark platform and PCA for dimension reduction, named Principal Components Analysis Random Discretization Ensemble. Experimental results on five large-scale datasets show that our solution outperforms both the original algorithm and Random Forest in terms of prediction performance. Results also show that high dimensionality data can affect the runtime of the algorithm.

This work has been published in the following reference:

D. García-Gil, S. Ramírez-Gallego, S. García, F. Herrera, Principal Components Analysis Random Discretization Ensemble for Scalable Big Data. Knowledge-Based Systems 150 (2018) 166-174

### A10: Online Entropy-Based Discretization for Data Streaming Classification

Data quality is deemed a determining factor in the knowledge extraction process. Low-quality data normally imply low-quality models and decisions. Discretization, as part of data preprocessing, is considered one of the most relevant techniques for improving data quality.

In static discretization, output intervals are generated at once and maintained throughout the whole process. However, many contemporary problems demand rapid approaches capable of self-adapting their discretization schemes to an ever-changing environment. Other major issues for stream-based discretization, such as interval definition, labeling, or how the interaction between the learning and discretization components is implemented, are also discussed in this paper.

In order to address all the aforementioned problems, we propose a novel, online and self-adaptive discretization solution for streaming classification which aims at reducing the negative impact of fluctuations in evolving intervals. Experiments with a long list of standard streaming datasets and discretizers have demonstrated that our proposal performs significantly more accurately than the other alternatives. In addition, our scheme is able to leverage class information without incurring excessive cost, being ranked as one of the fastest supervised options.

This work has been published in the following reference:

S. Ramírez-Gallego, S. García, F. Herrera. Online Entropy-Based Discretization for Data Streaming Classification. Future Generation Computer Systems 86 (2018) 59-70

### A11: Big Data: Tutorial and Guidelines on Information and Process Fusion for Analytics Algorithms with MapReduce

We live in a world where data are generated from a myriad of sources, and it is really cheap to collect and store such data. However, the real benefit is not related to the data itself, but to the algorithms that are capable of processing such data in a tolerable elapsed time and of extracting valuable knowledge from it. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. The MapReduce programming framework stands out as the main paradigm related to such tools. It is mainly characterized by carrying out a distributed execution for the sake of providing a high degree of scalability, together with a fault-tolerant scheme.
In every MapReduce algorithm, local models are first learned with a subset of the original data within the so-called Map tasks. Then, the Reduce task is devoted to fusing the partial outputs generated by each Map. The way such fusion of information/models is designed may have a strong impact on the quality of the final system. In this work, we enumerate and analyze two alternative methodologies that may be found both in the specialized literature and in standard Machine Learning libraries for Big Data. Our main objective is to provide an introduction to the characteristics of these methodologies, as well as to give some guidelines for the design of novel algorithms in this field of research. Finally, a short experimental study allows us to contrast the scalability issues for each type of process fusion in MapReduce for Big Data Analytics.
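
The Map/Reduce split described above can be sketched in plain Python. This is only an illustration under assumptions: the local "model" is a per-class sum and count of a single numeric feature, and the names `map_task`, `reduce_task` and the toy partitions are hypothetical. It does not reproduce either of the two methodologies analyzed in the paper.

```python
from functools import reduce


def map_task(partition):
    """Local model: per-class (sum, count) of one numeric feature."""
    stats = {}
    for value, label in partition:
        total, count = stats.get(label, (0.0, 0))
        stats[label] = (total + value, count + 1)
    return stats


def reduce_task(left, right):
    """Fuse two partial models by adding their sufficient statistics."""
    merged = dict(left)
    for label, (total, count) in right.items():
        m_total, m_count = merged.get(label, (0.0, 0))
        merged[label] = (m_total + total, m_count + count)
    return merged


# Toy data split into two partitions (assumed data).
partitions = [
    [(1.0, "neg"), (5.0, "pos")],
    [(3.0, "neg"), (7.0, "pos"), (9.0, "pos")],
]
fused = reduce(reduce_task, map(map_task, partitions))
class_means = {label: total / count for label, (total, count) in fused.items()}
```

Because sums and counts are additive, this particular fusion gives the same class means as a single pass over all the data. Fusing richer local models, such as classifiers trained per partition, generally does not have that property, which is one reason the design of the Reduce step matters.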

This work has been published in the following reference:

S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, F. Herrera. Big Data: Tutorial and Guidelines on Information and Process Fusion for Analytics Algorithms with MapReduce. Information Fusion (2018) 42 51-61.

### A12: A distributed evolutionary multivariate discretizer for Big Data processing on Apache Spark

In this paper, a distributed multivariate discretizer for Apache Spark is proposed, based on an evolutionary points selection scheme and called Distributed Evolutionary Multivariate Discretizer (DEMD). It is inspired by the EMD evolutionary discretizer; however, many improvements have been introduced in DEMD to suit a distributed environment. For example, partial solutions are generated locally and eventually fused to produce the final discretization scheme.

The first step in the main procedure is computing the boundary points in a distributed way, obtaining tuples formed by a feature ID and a list of points. These are then used to compute the Feature Information (FI) and the boundary points per feature, which are used later to create chromosome chunks. The procedure divides the evaluation of cut points into chunks: all features are sorted in ascending order by the number of boundary points they contain, and then the number of chunks into which the list of boundary points will be divided is computed according to the following formula:

$$np/\max(uf,ms/ds)\cdot ds$$

where $np$ is the total number of boundary points, $ds$ is the current proportion of points by data partition, $uf$ is the split factor and $ms$ is the maximum between the largest feature size and $ds$.

The evaluation process starts by distributing points among the chunks: in each iteration a group of features is collected and randomly distributed among the chunks, ending when there are no more features to collect. This allows points of the same feature to stay together in the same chunk. After the distribution process, a stratified sampling process is applied and the resulting sample is used to evaluate the boundary points in a distributed manner, so each partition is responsible for evaluating the points in its associated chunks.

Each selection process returns the best chromosome per chunk and saves the tuples. All these partial results are then summarized using a voting scheme with an established threshold. Finally, the binary vectors are processed to obtain the final matrix of cut points, by fetching the features in each chunk and their corresponding points and checking whether they have been selected. If selected, they are added to the matrix; otherwise they are omitted.
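
The final voting and matrix-building steps can be sketched as follows. This is a simplified illustration, not the DEMD code: it assumes that several binary vectors over the same list of boundary points are available, that the threshold is a fraction of votes, and that the names `vote`, `cut_points_matrix` and the toy data are hypothetical.

```python
def vote(binary_vectors, threshold):
    """Keep position i when the fraction of vectors selecting it reaches the threshold."""
    n = len(binary_vectors)
    return [1 if sum(column) / n >= threshold else 0 for column in zip(*binary_vectors)]


def cut_points_matrix(points_by_feature, selection):
    """Map the flat binary selection back to the selected cut points of each feature."""
    matrix, position = {}, 0
    for feature_id, points in points_by_feature:
        flags = selection[position:position + len(points)]
        matrix[feature_id] = [p for p, keep in zip(points, flags) if keep]
        position += len(points)
    return matrix


# Three partial solutions over four boundary points (assumed data).
partial_solutions = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]
# Feature 0 owns the first two points, feature 1 the last two.
points_by_feature = [(0, [0.5, 1.5]), (1, [2.0, 4.0])]

selection = vote(partial_solutions, threshold=0.5)
matrix = cut_points_matrix(points_by_feature, selection)
```

With a threshold of 0.5, only the points chosen by at least half of the partial solutions survive, and each feature ends up with its own list of selected cut points.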

This work has been published in the following reference:

S. Ramírez-Gallego, S. García, J.M. Benítez, F. Herrera. A distributed evolutionary multivariate discretizer for Big Data processing on Apache Spark. Swarm and Evolutionary Computation (2017), In Press.

### A13: Chain based Sampling for Monotonic Imbalanced Classification

Classification with monotonic constraints arises from some ordinal real-life problems. In these real-life problems, it is common to find a big difference in the number of instances representing middle-ranked classes and the top classes, because the former usually represent the average or the normality, while the latter are exceptional and uncommon. This is known as the class imbalance problem, and it deteriorates the learning of the under-represented classes. However, the traditional solutions cannot be applied to applications that require monotonic restrictions to be satisfied. Since these solutions were not designed to consider monotonic constraints, they compromise the monotonicity of the data-sets and the performance of the monotonic classifiers. In this paper, we propose a set of new sampling techniques to mitigate the imbalanced class distribution and, at the same time, maintain the monotonicity of the data-sets. These methods perform the sampling inside monotonic chains, sets of comparable instances, in order to preserve them and, as a result, the monotonicity. Five different approaches are redesigned based on well-known under- and over-sampling techniques, and their standard and ordinal versions are compared, with outstanding results.

This work has been published in the following reference:

S. González, S. García, ST. Li, F. Herrera. Chain based Sampling for Monotonic Imbalanced Classification. Information Sciences 474 (2019) 187–204

### A14: Label Noise Filtering Techniques to Improve Monotonic Classification

Monotonic ordinal classification has attracted growing interest from researchers and practitioners within the machine learning community in recent years. In real applications, problems with monotonicity constraints are very frequent. To construct predictive monotone models from those problems, many classifiers require as input a data set satisfying the monotonicity relationships among all samples. Changing the class labels of the data set (relabelling) is useful for this. Relabelling is assumed to be an important building block for the construction of monotone classifiers, and it has been shown to improve predictive performance.

In this paper, we address the construction of monotone datasets by treating as noise the cases that do not meet the monotonicity restrictions. For the first time in the specialized literature, we propose the use of noise filtering algorithms in a preprocessing stage with a double goal: to increase both the monotonicity index of the models and the accuracy of the predictions for different monotonic classifiers. The experiments are performed over 12 datasets coming from classification and regression problems and show that our scheme improves the prediction capabilities of the monotonic classifiers compared with applying them to the original and relabeled datasets. In addition, we have included an analysis of the noise filtering process in the particular case of wine quality classification to understand its effect on the predictive models generated.
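
As an illustration of what "not meeting the monotonicity restrictions" means, the sketch below lists the pairs of instances that violate monotonicity: one instance is less than or equal to another on every feature, yet has a strictly higher class label. It is not one of the filters proposed in the paper. The function names `dominates` and `violating_pairs` and the toy data are assumptions, and all features are assumed to be increasing with the class.

```python
def dominates(a, b):
    """True when instance a is less than or equal to instance b on every feature."""
    return all(x <= y for x, y in zip(a, b))


def violating_pairs(X, y):
    """Pairs (i, j) with X[i] <= X[j] feature-wise but y[i] > y[j]."""
    pairs = []
    for i in range(len(X)):
        for j in range(len(X)):
            if i != j and dominates(X[i], X[j]) and y[i] > y[j]:
                pairs.append((i, j))
    return pairs


# Toy ordinal data set (assumed data).
X = [[1, 1], [2, 2], [3, 1], [3, 3]]
y = [0, 2, 1, 1]

pairs = violating_pairs(X, y)
# Instances involved in at least one violation are candidates for filtering.
noise_candidates = sorted({index for pair in pairs for index in pair})
```

Here instance 1 has smaller feature values than instance 3 but a higher label, so both are flagged. A noise filter then has to decide which of the instances involved to remove.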

J.R. Cano, J. Luengo, S. García. Label Noise Filtering Techniques to Improve Monotonic Classification. Accepted in: Neurocomputing

### A15: Transforming Big Data into Smart Data: An insight on the use of k-Nearest Neighbours algorithm to obtain quality data

The k-nearest neighbours algorithm is characterised as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data - likely to contain noise and imperfections - are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these issues have turned into strengths and the k-nearest neighbours rule has become a core algorithm to identify and correct imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data - which is data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context will be investigated. This will include a brief overview of Smart Data, current and future trends for the k-nearest neighbour algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis in a series of big datasets that provide guidelines as to how to use the k-nearest neighbour algorithm to obtain Smart/Quality Data for a high quality data mining process. Moreover, multiple Spark Packages have been developed including all the Smart Data algorithms analysed.

I. Triguero, D. García-Gil, J. Maíllo, J. Luengo, S. García, F. Herrera. Transforming Big Data into Smart Data: An insight on the use of k-Nearest Neighbours algorithm to obtain quality data. Accepted in: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

### A16: Enabling Smart Data: Noise filtering in Big Data classification

In any knowledge discovery process the value of extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by massive growth in the scale of data observed in recent years, also follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances, and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to traditional proposals created to tackle noise, as they have difficulties coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high quality and clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: a homogeneous ensemble filter and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem.

Submitted.

https://arxiv.org/abs/1704.01770

D. García-Gil, J. Luengo, S. García, F. Herrera. Enabling Smart Data: Noise filtering in Big Data classification. (2017) CoRR abs/1704.01770

### A17: BELIEF: A distance-based redundancy-proof feature selection method for Big Data

With the advent of the Big Data era, data reduction methods are in high demand given their ability to simplify huge data and ease complex learning processes. In particular, algorithms that are able to filter relevant dimensions from a set of millions are of huge importance. Although effective, these techniques suffer from the "scalability" curse as well. In this work, we propose a distributed feature weighting algorithm, which is able to rank millions of features in parallel using large samples. This method, inspired by the well-known RELIEF algorithm, introduces a novel redundancy elimination measure that provides similar schemes to those based on entropy at a much lower cost. It also allows a smooth scale-up when more instances are demanded in feature estimations. Empirical tests performed on our method show its estimation ability in manifold huge sets --both in number of features and instances--, as well as its simplified runtime cost (especially at the redundancy detection step).

Submitted.

https://arxiv.org/abs/1804.05774

S. Ramírez-Gallego, S. García, N. Xiong, F. Herrera. BELIEF: A distance-based redundancy-proof feature selection method for Big Data.

# Software Libraries and Packages

### A21: Oversampling algorithms for imbalanced classification in R

Addressing imbalanced datasets in classification tasks is a relevant topic in research studies. The main reason is that for standard classification algorithms, the success rate when identifying minority class instances may be adversely affected. Among different solutions to cope with this problem, data level techniques have shown a robust behavior. In this paper, the novel imbalance package is introduced. Written in R and C++, and available at CRAN repository, this library includes recent relevant oversampling algorithms to improve the quality of data in imbalanced datasets, prior to performing a learning task. The main features of the package, as well as some illustrative examples of its use are detailed throughout this manuscript.

I. Cordón, S. García, A. Fernández, F. Herrera. Imbalance: Oversampling algorithms for imbalanced classification in R. Knowledge-Based Systems (2018) In Press. https://doi.org/10.1016/j.knosys.2018.07.035

### A22: Ruta: implementations of neural autoencoders in R

D. Charte, F. Herrera, F. Charte.

Submitted.

### A24: Smartdata: data preprocessing to achieve smart data in R

I. Cordón, J. Luengo, S. García, F. Herrera, F. Charte.

Submitted.

### A31: DPASF: A Flink Library for Streaming Data preprocessing

Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widespread data preprocessing techniques. Although we can find many proposals for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is a recent and novel Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. We have implemented six of the most popular data preprocessing algorithms, three for discretization and the rest for feature selection. The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short time. DPASF contains useful algorithms when dealing with Big Data data streams. The preprocessing algorithms included in the library are able to tackle Big Datasets efficiently and to correct imperfections in the data.

Submitted.

https://arxiv.org/abs/1810.06021

A. Alcalde-Barros, D. García-Gil, S. García, F. Herrera. DPASF: A Flink Library for Streaming Data preprocessing.

### A32: OCAPIS: R package for Ordinal Classification And Preprocessing In Scala

Ordinal data are those where a natural order exists between the labels. The classification and pre-processing of this type of data is attracting more and more interest in the area of machine learning, due to its presence in many common problems. Traditionally, ordinal classification problems have been approached as nominal problems. However, that implies not taking into account their natural order constraints. In this paper, an innovative R package named ocapis (Ordinal Classification and Preprocessing In Scala) is introduced. Implemented mainly in Scala and available through Github, this library includes four learners and two pre-processing algorithms for ordinal and monotonic data. The main features of the package and examples of installation and use are explained throughout this manuscript.

Submitted.

https://arxiv.org/abs/1810.09733

M. Cristina Heredia-Gómez, Salvador García, Pedro Antonio Gutiérrez, Francisco Herrera. OCAPIS: R package for Ordinal Classification And Preprocessing In Scala.
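
To make concrete why treating an ordinal problem as nominal loses information, the short Python sketch below compares a nominal error measure (zero-one error) with an order-aware one (mean absolute error over class ranks). It does not use the ocapis package; the function names and the toy labels are assumptions made for this example.

```python
def zero_one_error(y_true, y_pred):
    # Nominal view: every mistake costs the same, whatever the distance.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    # Ordinal view: a mistake costs more the further it is in rank.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 1, 2, 3]  # ordered class ranks, e.g. low < medium < high < very high
near = [1, 1, 2, 2]    # two errors, each off by one rank
far = [3, 1, 2, 0]     # two errors, each off by three ranks

print(zero_one_error(y_true, near), zero_one_error(y_true, far))
print(mean_absolute_error(y_true, near), mean_absolute_error(y_true, far))
```

Both prediction vectors have the same zero-one error, yet the second is clearly worse once the order of the labels is taken into account.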

# Case Studies

### A27: On the use of convolutional neural networks for robust classification of multiple fingerprint captures

Fingerprint classification is one of the most common approaches to accelerate identification in large fingerprint databases. Fingerprints are grouped into disjoint classes, so that an input fingerprint is compared only with those belonging to the predicted class, reducing the penetration rate of the search. The classification procedure usually starts with the extraction of features from the fingerprint image, frequently based on visual characteristics. In this work, we propose an approach to fingerprint classification using convolutional neural networks, which avoids the need for an explicit feature extraction process by incorporating the image processing within the training of the classifier. Furthermore, such an approach is able to predict a class even for low-quality fingerprints that are rejected by commonly used algorithms, such as FingerCode. The study gives special importance to the robustness of the classification for different impressions of the same fingerprint, aiming to minimize the penetration in the database. In our experiments, convolutional neural networks yielded better accuracy and penetration rate than state-of-the-art classifiers based on explicit feature extraction. The tested networks also improved on the runtime, as a result of the joint optimization of both feature extraction and classification.

This work has been published in the following reference:

D. Peralta, I. Triguero, S. García, Y. Saeys, J.M. Benitez, F. Herrera, On the use of convolutional neural networks for robust classification of multiple fingerprint captures, International Journal of Intelligent Systems 33 (1) (2018) 213-230
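
The penetration rate mentioned in the abstract can be illustrated with a small Python sketch. It assumes a simplified definition, namely the average fraction of the database that must be compared against each query when only the predicted class is searched; the paper may define it differently. The function name `penetration_rate` and the toy class labels are hypothetical.

```python
from collections import Counter


def penetration_rate(database_classes, predicted_classes):
    """Average fraction of the database searched per query (simplified definition)."""
    class_sizes = Counter(database_classes)
    total = len(database_classes)
    # Each query is compared only with the stored fingerprints of its predicted class.
    searched = [class_sizes[c] / total for c in predicted_classes]
    return sum(searched) / len(searched)


# Toy database of ten fingerprints, labeled with illustrative class names.
database = ["arch"] * 2 + ["left_loop"] * 3 + ["right_loop"] * 1 + ["whorl"] * 4
queries = ["arch", "whorl"]  # predicted classes of two input fingerprints

rate = penetration_rate(database, queries)
print(round(rate, 3))
```

Under this definition, a lower value means fewer comparisons per identification, which is why a classifier that assigns consistent classes to different impressions of the same finger helps keep the search small.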

### A28: Adaptive fuzzy partitions for evolving association rules in big data stream

The amount of data being generated in industrial and scientific applications is constantly increasing. These data are often generated as a chronologically ordered, unlabeled flow which exceeds usual storage and processing capacities. Association stream mining is an appealing field which models complex environments online by finding relationships among the attributes without presupposing any a priori structure. The discovered relationships are continuously adapted to the dynamics of the problem in a pure online way, and the approach is able to deal with both categorical and continuous attributes. This paper presents a new advanced version, Fuzzy-CSar-AFP, of an online genetic fuzzy system designed to obtain interesting fuzzy association rules from data streams. It is capable of managing partitions of different granularity for the variables, which allows the algorithm to adapt to the precision requirements of each variable in the rule. It can also work with data streams without needing to know the domains of the attributes, as it includes a mechanism which updates them in real time. The performance of Fuzzy-CSar-AFP is validated on an original real-world psychophysiology problem in which associations between different electroencephalogram signals are analyzed in subjects who are exposed to different stimuli.

This work has been published in the following reference:

E. Ruiz, J. Casillas, Adaptive fuzzy partitions for evolving association rules in big data stream, International Journal of Approximate Reasoning 93 (2018) 463-486

### A29: Alzheimer's disease computer-aided diagnosis: histogram-based analysis of regional MRI volumes for feature selection and classification

This paper proposes a novel fully automatic computer-aided diagnosis (CAD) system for the early detection of Alzheimer’s disease (AD) based on supervised machine learning methods. The novelty of the approach, which is based on histogram analysis, is twofold: 1) a feature extraction process that aims to detect differences in brain regions of interest (ROIs) relevant for the recognition of subjects with AD and 2) an original greedy algorithm that predicts the severity of the effects of AD on these regions. This algorithm takes into account the progressive nature of AD, which affects the brain structure with different levels of severity, i.e., the loss of gray matter in AD is found first in memory-related areas of the brain such as the hippocampus. Moreover, the proposed feature extraction process generates a reduced set of attributes which allows the use of general-purpose classification machine learning algorithms. In particular, the proposed feature extraction approach assesses the ROI image separability between classes in order to identify the ones with greater discriminant power. These regions will have the highest influence on the classification decision at the final stage. Several experiments were carried out on segmented magnetic resonance images from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) in order to show the benefits of the overall method. The proposed CAD system achieved competitive classification results in a highly efficient and straightforward way.

This work has been published in the following reference:

E. Ruiz, J. Ramírez, J.M. Górriz, J. Casillas, Alzheimer's disease computer-aided diagnosis: histogram-based analysis of regional MRI volumes for feature selection and classification, Journal of Alzheimer's Disease 65:3 (2018) 819-842

### A30: Automatic whale counting in satellite images with deep learning

E. Guirado, S. Tabik, M. L. Rivas, D. Alcaraz-Segura, F. Herrera.

Submitted.