Data Preprocessing in Data Mining
Salvador García, Julián Luengo, Francisco Herrera
You may also be interested in the webpage of our latest journal article on the most influential data preprocessing algorithms, which gathers all the related material and information:
S. García, J. Luengo, F. Herrera, Tutorial on Practical Tips of the Most Influential Data Preprocessing Algorithms in Data Mining. Knowledge-Based Systems 98 (2016) 1-29.
Data preprocessing is an often neglected but major step in the Data Mining process. Data collection is usually a loosely controlled process, resulting in out-of-range values, impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data come first and foremost before running an analysis. If much irrelevant and redundant information is present, or the data is noisy and unreliable, then knowledge discovery is more difficult to conduct. Data preparation can take a considerable amount of processing time.
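As an illustration of this kind of screening, the following is a minimal Python sketch, not taken from the book. The records, the field names and the valid age range are hypothetical assumptions chosen only to show the three problems mentioned above: missing values, out-of-range values and impossible combinations.

```python
from typing import Dict, List

# Hypothetical records, for illustration only; None marks a missing value.
records = [
    {"age": 34, "gender": "Female", "pregnant": "Yes"},
    {"age": 41, "gender": "Male", "pregnant": "Yes"},   # impossible combination
    {"age": -5, "gender": "Female", "pregnant": "No"},  # out-of-range value
    {"age": None, "gender": "Male", "pregnant": "No"},  # missing value
]

AGE_RANGE = (0, 120)  # assumed valid range, not a value from the book


def screen(record: Dict) -> List[str]:
    """Return the list of problems found in one record (empty if clean)."""
    problems = []
    # Missing values: any field whose value is None.
    for field, value in record.items():
        if value is None:
            problems.append(f"missing value: {field}")
    # Out-of-range values: age outside the assumed valid interval.
    age = record.get("age")
    if age is not None and not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
        problems.append("out-of-range value: age")
    # Impossible combinations: fields that contradict each other.
    if record.get("gender") == "Male" and record.get("pregnant") == "Yes":
        problems.append("impossible combination: gender/pregnant")
    return problems


report = {i: screen(r) for i, r in enumerate(records)}
for i, problems in report.items():
    print(i, problems or "ok")
```

A real data set would need domain-specific rules. The point is only that each record is checked before any analysis is run.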
Data preprocessing includes data preparation, which comprises integration, cleaning, normalization and transformation of data, and data reduction tasks, such as feature selection, instance selection, discretization, etc. The result expected after a reliable chaining of data preprocessing tasks is a final data set that can be considered correct and useful for further data mining algorithms.
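The chaining of tasks can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the book's method and not the KEEL API: mean imputation stands in for cleaning, min-max scaling for normalization, and equal-width binning for discretization. The function names and the sample values are hypothetical.

```python
def impute_mean(values):
    """Cleaning: replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Normalization: rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_discretize(values, bins=3):
    """Discretization: map values in [0, 1] to bin indices 0..bins-1."""
    return [min(int(v * bins), bins - 1) for v in values]


def preprocess(values, bins=3):
    """Chain the tasks: cleaning, then normalization, then discretization."""
    for step in (impute_mean, min_max_normalize):
        values = step(values)
    return equal_width_discretize(values, bins)


raw = [10.0, None, 18.0, 50.0, 42.0]  # hypothetical attribute with one missing value
result = preprocess(raw)
print(result)  # -> [0, 1, 0, 2, 2]
```

The order matters: the missing value has to be filled in before the minimum and maximum are computed, and the discretizer assumes its input has already been scaled to [0, 1].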
This book covers the set of techniques under the umbrella of data preprocessing. It is a comprehensive book within the field of Data Mining devoted completely to this topic, including all the important details and aspects of the techniques that belong to these families.
The following review was written by Xiannong Meng for Computing Reviews, December 2014:
“This book is a comprehensive collection of data preprocessing techniques used in data mining.
Any readers who practice data mining will find it beneficial … . This book is an excellent
guideline in the topic of data preprocessing for data mining. It is suitable for both practitioners
and researchers who would like to use datasets in their data mining projects.”
Table of Contents
Introduction - Lecture slides
Data Sets and Proper Statistical Analysis of Data Mining Techniques - Lecture slides
Data Preparation Basic Models - Lecture slides
Dealing with Missing Values - Lecture slides
Dealing with Noisy Data - Lecture slides
Data Reduction - Lecture slides
Feature Selection - Lecture slides
Instance Selection - Lecture slides
Discretization - Lecture slides
A Data Mining Software Package Including Data Preparation and Reduction: KEEL - Lecture slides
We have used the open source tool Knowledge Extraction based on Evolutionary Learning (KEEL).
Here you may find the errata list of the book.
Should you find any typo or mistake, please contact the authors.
Below we list a number of thematic websites related to different chapters of the book, where the reader can find additional material and references.
Statistical Inference in Computational Intelligence and Data Mining, where statistical tests and free software are available so that practitioners can properly analyze their results, as described in Chapter 2.
Missing Values in Data Mining, devoted to missing values, as shown in Chapter 4.
Noisy Data, dedicated to attribute and class noise, with data sets in which noise has already been induced and comparative experiments such as those shown in Chapter 5.
Prototype Reduction in Nearest Neighbor Classification: Prototype Selection and Prototype Generation, devoted to instance selection and instance generation, partially treated in Chapter 8.
The thematic website Big Data: Algorithms for Data Preprocessing, Computational Intelligence, and Imbalanced Classes offers material, code, references and a starting point on the scalability of data preprocessing for big data (big data preprocessing approaches).