API Dataset

One of the main components of KEEL is the API Dataset. It manages the entire process of acquiring, processing and validating the data files, and offers the data sets to the developer in a suitable form, so that developers do not have to acquire the data needed for an experiment themselves.

This section describes three key concepts of the API Dataset:

Data files grammar: The grammar used to define the data files. Any file generated by this grammar is a valid data file, according to the rules shown in the data files section.
Semantic restrictions of the data files: Apart from the syntax restrictions, the API Dataset performs some semantic checks on the data files.
Description of the classes: To close this section, the main public classes of the API Dataset are described.

Data files grammar

This subsection presents the grammar that describes the format of the KEEL data files. The terminal tokens of the grammar are:

{} = Denotes the empty production, also known as λ or ε.
IDENT = Denotes an identifier: ('A'-'Z', 'a'-'z', '0'-'9')*
INTEGER = Denotes an integer value: ('0'-'9')⁺
REAL = Denotes a real value: ('0'-'9')⁺[.('0'-'9')*]
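
The token definitions above can be approximated with regular expressions. The following is a minimal sketch, not part of the KEEL API: the class name KeelTokenSketch, the classify method and the order in which the patterns are tried are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a KEEL class): classifies a lexeme using the
// terminal-token definitions listed above.
final class KeelTokenSketch {
    // IDENT: zero or more letters or digits, exactly as the definition is written.
    static final Pattern IDENT = Pattern.compile("[A-Za-z0-9]*");
    // INTEGER: one or more digits.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // REAL: one or more digits, optionally followed by '.' and further digits.
    static final Pattern REAL = Pattern.compile("[0-9]+(\\.[0-9]*)?");

    // Returns the first matching token name; trying INTEGER before REAL
    // before IDENT is an assumption, since "42" matches all three patterns.
    static String classify(String lexeme) {
        if (INTEGER.matcher(lexeme).matches()) return "INTEGER";
        if (REAL.matcher(lexeme).matches()) return "REAL";
        if (IDENT.matcher(lexeme).matches()) return "IDENT";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "sepalLength", "a-b"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Because a string of digits satisfies all three definitions, a real lexer has to fix a priority among them; the sketch simply tries the most restrictive pattern first.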

Semantic restrictions of the data files

Attributes

An attribute can be defined as integer, real or nominal, as specified by the grammar of the data files. Defining the minimum and maximum values, or the list of values, for an attribute is optional (if they are not defined, the correct values will be extracted while processing the training file). If they are defined for integer or real attributes, the minimum value must be lower than the maximum.

In this way, the limits of the values for every attribute are established while processing the training file. However, the test file may contain values that exceed the limits of a given attribute (e.g., in some cross-validation schemes). Depending on the type of the attribute, the API Dataset performs the following actions:

Integer or Real attributes: The new value is replaced by its nearest valid value (i.e., if the value is greater than the maximum, it is replaced by the maximum; if it is lower than the minimum, it is replaced by the minimum).
Nominal attributes: The new value is accepted, and the domain of the attribute is enlarged by adding the new value. In addition, the flag newValueInTest is set.

Finally, if either of these cases occurs, the API Dataset throws a TestDataBoundsExcedeedException to report the changes performed. However, the files will still be parsed correctly.
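
The two corrections described above can be sketched as follows. This is an illustration under stated assumptions, not KEEL code: the class TestBoundsSketch and its methods are hypothetical, and the exception that reports the change is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a KEEL class) illustrating how out-of-range
// test values are handled for numeric and nominal attributes.
final class TestBoundsSketch {
    // Integer or real attribute: replace the value by the nearest bound
    // established from the training file.
    static double clampToTrainingRange(double value, double min, double max) {
        return Math.max(min, Math.min(max, value));
    }

    // Nominal attribute: accept an unseen value and enlarge the domain.
    // Returns true when the domain grew, which plays the role of the
    // newValueInTest flag in this sketch.
    static boolean acceptNominal(List<String> domain, String value) {
        if (domain.contains(value)) {
            return false;
        }
        domain.add(value);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(clampToTrainingRange(12.5, 0.0, 10.0));
        List<String> domain = new ArrayList<>(Arrays.asList("red", "green"));
        boolean newValueInTest = acceptNominal(domain, "blue");
        System.out.println(domain + " newValueInTest=" + newValueInTest);
    }
}
```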

Inputs and outputs definition

The definition of inputs and outputs in the data files is optional. The API Dataset will automatically extract the missing definitions, following these rules:

If no outputs are defined:
- If no inputs are defined, the last attribute is taken as output. The remaining ones will be taken as inputs.
- If there are some inputs defined, the attributes not marked as inputs will be taken as outputs.
If no inputs are defined, the attributes not marked as outputs will be taken as inputs.
If both inputs and outputs are defined, the attributes that are not included in either category are discarded.

Also, it is important to note that the input and output attributes are defined in the same order as they appear in the header of the data file.
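
The inference rules above can be sketched as two small functions. The class DirectionSketch and the methods resolveInputs and resolveOutputs are hypothetical names chosen for this illustration and do not belong to the KEEL API; attributes are represented only by their names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a KEEL class): resolves the input and output
// attributes from the header and the (possibly empty) declared lists.
final class DirectionSketch {
    static List<String> resolveOutputs(List<String> header, List<String> in, List<String> out) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            boolean isOutput;
            if (!out.isEmpty()) {
                isOutput = out.contains(name);          // outputs were declared
            } else if (in.isEmpty()) {
                isOutput = (i == header.size() - 1);    // nothing declared: last attribute
            } else {
                isOutput = !in.contains(name);          // only inputs declared
            }
            if (isOutput) result.add(name);
        }
        return result; // iterating over the header keeps header order
    }

    static List<String> resolveInputs(List<String> header, List<String> in, List<String> out) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            boolean isInput;
            if (!in.isEmpty()) {
                isInput = in.contains(name);            // inputs were declared
            } else if (out.isEmpty()) {
                isInput = (i < header.size() - 1);      // nothing declared: all but the last
            } else {
                isInput = !out.contains(name);          // only outputs declared
            }
            if (isInput) result.add(name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("a", "b", "c");
        List<String> none = new ArrayList<>();
        System.out.println(resolveInputs(header, none, none));
        System.out.println(resolveOutputs(header, none, none));
    }
}
```

When both lists are declared, an attribute that appears in neither of them is returned by neither function, which corresponds to the attribute being discarded.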

Missing values

The API Dataset allows the presence of missing values in the data files, defined with the <null> or <?> tokens. However, only input attributes can present missing values. If a missing value is detected in an output attribute, an OutputValueNotKnownException will be thrown, aborting the processing of the data file.
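
A minimal sketch of this rule is shown below. It is not KEEL code: the class MissingValueSketch and its methods are hypothetical, the nested exception is only a stand-in for the KEEL exception of the same name (declared unchecked here to keep the example short), and the missing-value tokens are taken literally from the text above.

```java
// Hypothetical helper (not a KEEL class): flags missing input values and
// rejects instances whose output value is missing.
final class MissingValueSketch {
    // Stand-in for KEEL's exception; its real superclass and constructors
    // are not described in this document.
    static class OutputValueNotKnownException extends RuntimeException {
        OutputValueNotKnownException(String message) {
            super(message);
        }
    }

    // A token is missing when it equals one of the markers named in the text.
    static boolean isMissing(String token) {
        return token.equals("<null>") || token.equals("<?>");
    }

    // Returns one flag per input value; throws if any output value is missing.
    static boolean[] checkInstance(String[] inputTokens, String[] outputTokens) {
        for (String token : outputTokens) {
            if (isMissing(token)) {
                throw new OutputValueNotKnownException("missing value in an output attribute");
            }
        }
        boolean[] missing = new boolean[inputTokens.length];
        for (int i = 0; i < inputTokens.length; i++) {
            missing[i] = isMissing(inputTokens[i]);
        }
        return missing;
    }

    public static void main(String[] args) {
        boolean[] flags = checkInstance(new String[] {"5.1", "<null>"}, new String[] {"setosa"});
        System.out.println(java.util.Arrays.toString(flags));
    }
}
```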

Train and test files

The semantic checks performed by the API Dataset depend on which data file is being processed. Specifically, the actions performed are:

The definition of the attributes is taken from the training file.
While reading the test file, the definitions of the attributes are checked. If they are not consistent with the ones read from the training file, the processing of the test file is aborted. Moreover, the inputs and outputs defined by the test file must be the same as those defined by the training file; otherwise, the processing of the test file is also aborted.

Description of the classes

The API Dataset is composed of four main classes:

InstanceSet: This class contains a complete set of instances defining a data base.
Instance: This class represents a single instance.
Attributes: This static class contains the definitions of every attribute of the data contained in the Instance Set.
Attribute: This class contains relevant information about a single attribute.

The next subsections will describe their main characteristics.

InstanceSet

This class contains a complete set of instances. Its public methods are:

numInstances: Returns the number of instances of the Instance Set.
getInstance: Returns a concrete instance contained in the Instance Set.
getInstances: Returns an array with all the instances of the Instance Set.

Instance

The objects of this class represent instances of the data sets. Its public methods are:

getInputRealValues: Returns an array containing all the input values of the instance (only positions holding INTEGER or REAL attributes produce a value).
getInputNominalValues: Returns an array containing all the input values of the instance (only positions holding NOMINAL attributes produce a value).
getInputMissingValues: Returns a boolean array indicating which input values are missing.
getInputRealValue: Returns the value of a given input attribute (only positions holding INTEGER or REAL attributes produce a value).
getInputNominalValue: Returns the value of a given input attribute (only positions holding NOMINAL attributes produce a value).
getInputMissingValue: Returns a boolean value indicating whether the input value is missing.
getOutputRealValues: Returns an array containing all the output values of the instance (only positions holding INTEGER or REAL attributes produce a value).
getOutputNominalValues: Returns an array containing all the output values of the instance (only positions holding NOMINAL attributes produce a value).
getOutputMissingValues: Returns a boolean array indicating which output values are missing.
getOutputRealValue: Returns the value of a given output attribute (only positions holding INTEGER or REAL attributes produce a value).
getOutputNominalValue: Returns the value of a given output attribute (only positions holding NOMINAL attributes produce a value).
getOutputMissingValue: Returns a boolean value indicating whether the output value is missing.
getAllInputValues: Returns an array containing all the input values. REAL values are returned as double values. INTEGER values are cast to double. NOMINAL values are transformed to INTEGER and cast to double.
getAllOutputValues: Returns an array containing all the output values. REAL values are returned as double values. INTEGER values are cast to double. NOMINAL values are transformed to INTEGER and cast to double.
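
The conversion described for getAllInputValues and getAllOutputValues can be sketched as follows. This is not the KEEL implementation: the class AllValuesSketch, its Type enum and the toDoubles method are hypothetical, and mapping a nominal value to its position in the list of declared values is an assumption consistent with the 0 to N-1 range described for convertNominalValue.

```java
import java.util.Arrays;

// Hypothetical helper (not a KEEL class): turns the raw values of an
// instance into a double array, as described for getAllInputValues.
final class AllValuesSketch {
    // Stand-in for the attribute types named in this document.
    enum Type { NOMINAL, INTEGER, REAL }

    // raw[i] is the textual value of attribute i, types[i] its type, and
    // domains[i] its list of nominal values (null for numeric attributes).
    static double[] toDoubles(String[] raw, Type[] types, String[][] domains) {
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            switch (types[i]) {
                case REAL:
                    out[i] = Double.parseDouble(raw[i]);
                    break;
                case INTEGER:
                    out[i] = Integer.parseInt(raw[i]); // int widened to double
                    break;
                case NOMINAL:
                    // Assumed mapping: index of the value in its domain
                    // (-1 if the value is not in the domain).
                    out[i] = Arrays.asList(domains[i]).indexOf(raw[i]);
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] values = toDoubles(
            new String[] {"5.1", "3", "blue"},
            new Type[] {Type.REAL, Type.INTEGER, Type.NOMINAL},
            new String[][] {null, null, {"red", "green", "blue"}});
        System.out.println(Arrays.toString(values));
    }
}
```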

Attributes

Attributes is a static class that stores the definitions of the attributes represented in the data set. It contains an array of Attribute objects, and two additional arrays storing references to the input and output attributes. The attributes are stored in the same order in which they were found in the input data file.

Its public methods are:

getInputAttributes: Returns an array containing all the input Attributes.
getOutputAttributes: Returns an array containing all the output Attributes.
getInputAttribute: Returns a single input attribute.
getOutputAttribute: Returns a single output attribute.
getAttribute: Returns a single attribute, defined neither as input nor as output attribute.
getNumInputAttributes: Returns the number of input attributes.
getNumOutputAttributes: Returns the number of output attributes.
getNumAttributes: Returns the number of attributes, including input, output and undefined ones.

Attribute

The Attribute class contains the definition of an attribute of the data set. Its public methods are:

getType: Returns an integer value defining the type of the attribute (the type is defined as NOMINAL, INTEGER or REAL).
getName: Returns the name of the attribute.
getMinAttribute: Returns the minimum value of the attribute (only available in INTEGER or REAL attributes).
getMaxAttribute: Returns the maximum value of the attribute (only available in INTEGER or REAL attributes).
getNominalValuesList: Returns an array with all the values defined for the attribute (only available in NOMINAL attributes).
convertNominalValue: Converts a nominal value to its integer representation (an integer in [0 ... N-1], where N is the number of values defined for the attribute).
getDirectionAttribute: Returns an integer showing whether the attribute is defined as an input attribute (INPUT), an output attribute (OUTPUT), or undefined (DIR_NOT_DEF).
getNewValuesInTest: Returns an array with the new values of the attribute observed in the test data.
