Data Preprocessing in Data Mining
Salvador García, Julián Luengo, Francisco Herrera




You may also be interested in the webpage of our latest journal article on the most influential data preprocessing algorithms, with all the material and information:

S. García, J. Luengo, F. Herrera, Tutorial and Practical Tips on the Most Influential Data Preprocessing Algorithms in Data Mining. Knowledge-based Systems 98 (2016) 1-29.


Data preprocessing is an often neglected but major step in the Data Mining process. Data collection is usually a loosely controlled process, resulting in out-of-range values, impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data come first and foremost before running an analysis. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery is more difficult to conduct. Data preparation can take a considerable amount of processing time.
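The kind of screening described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the book: the field names (age, gender, pregnant), the plausible age interval, and the function name screen_records are all assumptions made for the example.

```python
def screen_records(records, age_range=(0, 120)):
    """Return (index, problem) pairs for records that look suspicious.

    The field names and the age interval are illustrative assumptions.
    """
    problems = []
    low, high = age_range
    for i, rec in enumerate(records):
        # Missing values: any field stored as None.
        for field, value in rec.items():
            if value is None:
                problems.append((i, f"missing value in '{field}'"))
        # Out-of-range value: age outside the assumed plausible interval.
        age = rec.get("age")
        if age is not None and not (low <= age <= high):
            problems.append((i, f"out-of-range age: {age}"))
        # Impossible combination, as in the text: Gender: Male, Pregnant: Yes.
        if rec.get("gender") == "Male" and rec.get("pregnant") == "Yes":
            problems.append((i, "impossible combination: male and pregnant"))
    return problems


records = [
    {"age": 34, "gender": "Female", "pregnant": "Yes"},
    {"age": -5, "gender": "Male", "pregnant": "No"},
    {"age": 41, "gender": "Male", "pregnant": "Yes"},
    {"age": None, "gender": "Female", "pregnant": "No"},
]

for index, problem in screen_records(records):
    print(index, problem)
```

The function only reports problems and leaves the decision about how to handle each one (discard, correct, impute) to a later step.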

Data preprocessing includes data preparation, composed of integration, cleaning, normalization and transformation of data, and data reduction tasks such as feature selection, instance selection, discretization, etc. The result expected after a reliable chaining of data preprocessing tasks is a final data set that can be considered correct and useful for further data mining algorithms.
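As a rough illustration of what chaining preprocessing tasks can look like, the sketch below applies three simple steps to a single numeric attribute: mean imputation (cleaning), min-max scaling (normalization), and equal-width binning (discretization). These particular methods and all function names are assumptions chosen for brevity; the book covers many alternatives for each step.

```python
def impute_mean(values):
    """Cleaning: replace missing entries (None) with the mean of the known ones.

    Assumes at least one value is known.
    """
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Normalization: rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_discretize(values, bins=3):
    """Discretization: map values in [0, 1] to bin indices 0..bins-1."""
    return [min(int(v * bins), bins - 1) for v in values]


def preprocess(values, bins=3):
    """Chain the three steps in order: clean, normalize, discretize."""
    return equal_width_discretize(min_max_normalize(impute_mean(values)), bins)


raw = [2.0, None, 4.0, 10.0, 6.0]
print(preprocess(raw))
```

The order matters here: imputation runs first so that the minimum and maximum are computed on complete data, and discretization runs last because it assumes values already scaled to [0, 1].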

This book covers the set of techniques under the umbrella of data preprocessing. It is a comprehensive book devoted completely to this area of Data Mining, including all the important details and aspects of the techniques that belong to these families.

Order the Book

Vol. 72 of Intelligent Systems Reference Library.

Springer International Publishing AG, 2015, 320 p., Hardcover
ISBN online: 978-3-319-10247-4
ISBN printed: 978-3-319-10246-7



A review by Xiannong Meng for Computing Reviews, December 2014:

“This book is a comprehensive collection of data preprocessing techniques used in data mining.
Any readers who practice data mining will find it beneficial … . This book is an excellent
guideline in the topic of data preprocessing for data mining. It is suitable for both practitioners
and researchers who would like to use datasets in their data mining projects.”


Table of Contents

  1. Introduction - Lecture slides

  2. Data Sets and Proper Statistical Analysis of Data Mining Techniques - Lecture slides

  3. Data Preparation Basic Models - Lecture slides

  4. Dealing with Missing Values - Lecture slides

  5. Dealing with Noisy Data - Lecture slides

  6. Data Reduction - Lecture slides

  7. Feature Selection - Lecture slides

  8. Instance Selection - Lecture slides

  9. Discretization - Lecture slides

  10. A Data Mining Software Package Including Data Preparation and Reduction: KEEL - Lecture slides

We have used the open source tool Knowledge Extraction based on Evolutionary Learning (KEEL).

Full version of the Table of Contents

Errata List

Here you may find the errata list of the book.

Should you find any typo or mistake, please contact the authors.

Complementary material

In the following we present a number of thematic websites related to different chapters of the book, where the reader can find material and more references.

The thematic website Big Data: Algorithms for Data Preprocessing, Computational Intelligence, and Imbalanced Classes provides material, code, references and a starting point on the scalability of data preprocessing for big data (big data preprocessing approaches).

Keynote slides

  F. Herrera, Data Preprocessing, Summer School BDML2015, Wroclaw, Poland, May 22, 2015.