GPU-SME-k NN: Scalable and Memory Efficient k NN and Lazy Learning using GPUs

This Website contains complementary material to the paper:

P.D. Gutiérrez, M. Lastra, Jaume Bacardit, J.M. Benítez and F.Herrera, GPU-SME-k NN: Scalable and Memory Efficient k NN and Lazy Learning using GPUs. Information Sciences, submitted

Summary:

Abstract
Datasets
1. Poker Dataset
2. KDDCup 1999 Dataset
Source code
Results obtained

P.D. Gutiérrez, M. Lastra, Jaume Bacardit, J.M. Benítez and F.Herrera, GPU-SME-k NN: Scalable and Memory Efficient k NN and Lazy Learning using GPUs.

Abstract

The k nearest neighbor (kNN) rule is one of the most used techniques in data mining and pattern recognition due to its simplicity and low identification error. However, the computational effort it requires is directly related to the dataset sizes, hence delivering a poor performance on large datasets. Graphics processing units (GPU) have proven to improve the performance of the kNN rule but the computational requirements of current proposals limit the performance as the dataset size increases. In this paper we propose a new scalable and memory efficient design for a GPU-based kNN rule, called GPU-SME-kNN, that eliminates the relation between dataset size and memory footprint while delivering high performance. An experimental study of GPU-SME-kNN is presented showing a high performance, even in cases that other proposals cannot address, and keeping computational requirements suitable for most commercial GPU devices. Our design has also been applied to kNN-based lazy learning algorithms reducing run-times in a significant way.

Datasets

The proposed GPU-based kNN rule has been tested with different datasets from the UCI Machine Learning Repository. Here we provide the 5-fold cross validation partitions used in the experiments in all sizes.

Poker dataset

The poker dataset has 1,025,009 instances, 10 attributes and 10 classes. The dataset has been subsampled at sizes ranging from 50,000 to 1,000,000 instances in steps of 50,000 instances.

KDDCup 199 dataset

The KDDCup 1999 dataset has 4,898,431 instances, 41 attributes and 5 classes. This dataset has been subsampled in steps of 250,000 instances from 250,000 to 1,500,000 instances and in steps of 500,000 instances for larger sizes.

Source code

The source code is available as open source. It can be downloaded here: source.zip

Results obtained

In this section, we report the results obtained in the experimental study. We present te results for kNN rule with the Poker and KDDCup 1999 datasets and also the results of the lazy learning algorithms with the Poker dataset

We present these data in xls and ods files available for download.

Table 1: GPU-SME-kNN results
	xls file	ods file
kNN rule Poker results
kNN rule KDDCup 1999 results
Center kNN rule Poker results
kNN Adaptive rule Poker results
kNN Symmetric rule Poker results

You are here