
Table of Contents

KEEL Reference Manual
  1. - Basic KEEL development guidelines
  2. - Method Description files
  3. - Method Configuration files
  4. - Data files
  5. - Output files
  6. - Use Case files
  7. - API Dataset

API Dataset

One of the main components of KEEL is the API Dataset. It manages the entire process of acquiring, processing and validating the data files, and offers the data sets to the developer in a suitable form, so that developers do not have to acquire the data themselves for each experiment.

This section describes three key concepts of the API Dataset:

Data files grammar
The grammar used to define the data files. Any file generated by this grammar is a valid data file, according to the rules given in the Data files section.
Semantic restrictions of the data files
Apart from the syntax restrictions, some semantic verifications are performed by the API Dataset over the data files.
Description of the classes
To close this section, the main public classes of the API Dataset are described.

Data files grammar

This subsection shows the grammar that describes the format of the KEEL data files. The terminal tokens of the grammar are:

  • {} = Denotes the empty production, also written λ or ε.
  • IDENT = Denotes an identifier ('A'-'Z', 'a'-'z', '0'-'9')*.
  • INTEGER = Is an integer value ('0'-'9')+.
  • REAL = Is a real value ('0'-'9')+[.('0'-'9')*]
[Figure: Grammar1]
[Figure: Grammar2]
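As an illustration of the token definitions listed above, the following sketch expresses them as Java regular expressions. The class `KeelTokens` and its `classify` method are hypothetical names invented for this example; they are not part of the API Dataset, and the real parser may tokenize differently.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of KEEL: classifies a lexeme according to
// the token definitions given in this subsection.
final class KeelTokens {
    // IDENT: zero or more letters or digits (the definition uses '*',
    // so the empty string also matches).
    static final Pattern IDENT = Pattern.compile("[A-Za-z0-9]*");
    // INTEGER: one or more digits.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // REAL: one or more digits, optionally followed by '.' and more digits.
    static final Pattern REAL = Pattern.compile("[0-9]+(\\.[0-9]*)?");

    // The most specific token is tried first, because a digit string
    // such as "42" matches all three patterns.
    static String classify(String lexeme) {
        if (INTEGER.matcher(lexeme).matches()) return "INTEGER";
        if (REAL.matcher(lexeme).matches()) return "REAL";
        if (IDENT.matcher(lexeme).matches()) return "IDENT";
        return "UNKNOWN";
    }
}
```

The order of the checks is a design choice of this sketch: without it, every INTEGER would also be reported as a REAL or an IDENT.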

Semantic restrictions of the data files

Attributes

An attribute can be defined as integer, real or nominal, as the grammar of the data files specifies. Defining the minimum and maximum values, or the list of values, is optional for any attribute (if they are not defined, the correct values are extracted while the training file is processed). If they are defined for integer or real attributes, the minimum value must be lower than the maximum.

This way, the limits of the values of every attribute are established while the training file is processed. However, the test file may contain values that exceed the limits of a given attribute (e.g., in some cross-validation schemes). Depending on the type of the attribute, the API Dataset performs the following actions:

Integer or Real attributes
The new value is replaced by its nearest valid value (i.e., if the value is greater than the maximum, it is replaced by the maximum; if it is lower than the minimum, it is replaced by the minimum).
Nominal attributes
The new value is accepted, and the domain of the attribute is enlarged by adding the new value. In addition, the flag newValueInTest is set.

Finally, if either of these cases occurs, the API Dataset throws a TestDataBoundsExcedeedException to report the changes performed. The files are nevertheless parsed correctly.
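The two repair actions described above can be sketched as follows. This is a minimal illustration under stated assumptions: `TestValueRepair`, `clampNumeric` and `acceptNominal` are hypothetical names, not KEEL classes or methods, and the exception and the newValueInTest flag are only indicated in comments.

```java
import java.util.List;

// Hypothetical stand-in, not the KEEL implementation.
final class TestValueRepair {
    // Integer or real attributes: replace an out-of-range test value by the
    // nearest bound established from the training file.
    static double clampNumeric(double value, double min, double max) {
        return Math.max(min, Math.min(max, value));
    }

    // Nominal attributes: accept an unseen value and enlarge the domain.
    // Returns true when the domain was enlarged, which is the situation in
    // which the text says the newValueInTest flag is set.
    static boolean acceptNominal(List<String> domain, String value) {
        if (domain.contains(value)) {
            return false;
        }
        domain.add(value);
        return true;
    }
}
```

For example, with training bounds [0, 10], `clampNumeric(12.0, 0.0, 10.0)` yields the maximum, and a previously unseen nominal value is appended to the end of the domain list.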

Inputs and outputs definition

The definition of inputs and outputs in the data files is optional. The API Dataset will automatically extract the missing definitions, following these rules:

  • If no outputs are defined:
    • If no inputs are defined, the last attribute is taken as the output. The remaining ones are taken as inputs.
    • If there are some inputs defined, the attributes not marked as inputs will be taken as outputs.
  • If no inputs are defined, the attributes not marked as outputs will be taken as inputs.
  • If both inputs and outputs are defined, the attributes that belong to neither category are discarded.

Note also that the input and output attributes are defined in the same order as they appear in the header of the data file.
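The rules above can be sketched as a single pass over the header. The class `DirectionInference`, its nested `Roles` holder and the `infer` method are hypothetical names chosen for this example, and attributes are represented by their names only; the actual API Dataset code may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the input/output inference rules, not KEEL code.
final class DirectionInference {
    // Result holder: both lists keep the order of the file header.
    static final class Roles {
        final List<String> inputs = new ArrayList<>();
        final List<String> outputs = new ArrayList<>();
    }

    static Roles infer(List<String> header,
                       List<String> declaredInputs,
                       List<String> declaredOutputs) {
        Roles roles = new Roles();
        boolean noInputs = declaredInputs.isEmpty();
        boolean noOutputs = declaredOutputs.isEmpty();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            boolean isInput;
            boolean isOutput;
            if (noInputs && noOutputs) {
                // Nothing declared: the last attribute is the output.
                isOutput = (i == header.size() - 1);
                isInput = !isOutput;
            } else if (noOutputs) {
                // Only inputs declared: everything else is an output.
                isInput = declaredInputs.contains(name);
                isOutput = !isInput;
            } else if (noInputs) {
                // Only outputs declared: everything else is an input.
                isOutput = declaredOutputs.contains(name);
                isInput = !isOutput;
            } else {
                // Both declared: attributes in neither list are discarded.
                isInput = declaredInputs.contains(name);
                isOutput = declaredOutputs.contains(name);
            }
            if (isInput) {
                roles.inputs.add(name);
            } else if (isOutput) {
                roles.outputs.add(name);
            }
        }
        return roles;
    }
}
```

Iterating over the header, rather than over the declared lists, is what keeps inputs and outputs in header order.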

Missing values

The API Dataset allows missing values in the data files, written with the <null> or <?> tokens. However, only input attributes may have missing values. If a missing value is detected in an output attribute, an OutputValueNotKnownException is thrown and the processing of the data file is aborted.
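A minimal sketch of this check is shown below. It assumes the missing-value tokens are exactly the two strings given in the text; `MissingValueCheck` and its methods are hypothetical names, and `IllegalStateException` is used only as a stand-in for OutputValueNotKnownException, whose constructor is not described here.

```java
// Hypothetical sketch, not the KEEL implementation.
final class MissingValueCheck {
    // True when the token is one of the missing-value markers named in the text.
    static boolean isMissing(String token) {
        String t = token.trim();
        return t.equals("<null>") || t.equals("<?>");
    }

    // Output attributes may not be missing: abort by throwing.
    // IllegalStateException stands in for OutputValueNotKnownException.
    static void requireKnownOutput(String token) {
        if (isMissing(token)) {
            throw new IllegalStateException("missing value in an output attribute");
        }
    }
}
```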

Train and test files

The semantic verifications performed by the API Dataset depend on which data file is being processed. Specifically, the actions performed are:

  • The definition of the attributes is taken from the training file.
  • While the test file is read, the definitions of the attributes are checked. If they are not consistent with those read from the training file, the processing of the test file is aborted. Moreover, the inputs and outputs defined by the test file must be the same as those defined by the training file. Otherwise, the processing of the test file is aborted.

Description of the classes

The API Dataset is composed of four main classes:

  • InstanceSet: This class contains a complete set of instances defining a data base.
  • Instance: This class represents a single instance.
  • Attributes: This static class contains the definitions of every attribute of the data contained in the InstanceSet.
  • Attribute: This class contains relevant information about a single attribute.

The next subsections will describe their main characteristics.

InstanceSet

This class contains a complete set of instances. Its public methods are:

numInstances
Returns the number of instances of the Instance Set.
getInstance
Returns a specific instance contained in the Instance Set.
getInstances
Returns an array with all the instances of the Instance Set.

Instance

The objects of this class represent instances of the data sets. Its public methods are:

getInputRealValues
Returns an array containing all the input values of the instance (only the positions holding INTEGER or REAL attribute values produce a value).
getInputNominalValues
Returns an array containing all the input values of the instance (only the positions holding NOMINAL attribute values produce a value).
getInputMissingValues
Returns a boolean array indicating which input values are missing.
getInputRealValue
Returns the value of a specific input attribute (only the positions holding INTEGER or REAL attribute values produce a value).
getInputNominalValue
Returns the value of a specific input attribute (only the positions holding NOMINAL attribute values produce a value).
getInputMissingValue
Returns a boolean value indicating whether the input value is missing.
getOutputRealValues
Returns an array containing all the output values of the instance (only the positions holding INTEGER or REAL attribute values produce a value).
getOutputNominalValues
Returns an array containing all the output values of the instance (only the positions holding NOMINAL attribute values produce a value).
getOutputMissingValues
Returns a boolean array indicating which output values are missing.
getOutputRealValue
Returns the value of a specific output attribute (only the positions holding INTEGER or REAL attribute values produce a value).
getOutputNominalValue
Returns the value of a specific output attribute (only the positions holding NOMINAL attribute values produce a value).
getOutputMissingValue
Returns a boolean value indicating whether the output value is missing.
getAllInputValues
Returns an array containing all the input values. REAL values are returned as double values. INTEGER values are cast to double. NOMINAL values are converted to INTEGER and cast to double.
getAllOutputValues
Returns an array containing all the output values. REAL values are returned as double values. INTEGER values are cast to double. NOMINAL values are converted to INTEGER and cast to double.
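The conversion described for getAllInputValues and getAllOutputValues can be illustrated with the sketch below. It is not the KEEL implementation: `AllValuesSketch`, its `Type` enum and `toDoubles` are hypothetical names, raw values are given as strings, and a nominal value is assumed to map to its position in the attribute's list of values.

```java
import java.util.List;

// Hypothetical sketch of the "all values as double" conversion, not KEEL code.
final class AllValuesSketch {
    enum Type { NOMINAL, INTEGER, REAL }

    // domains.get(i) holds the value list of attribute i when it is NOMINAL;
    // the entry is ignored (and may be null) for numeric attributes.
    static double[] toDoubles(Type[] types, String[] raw, List<List<String>> domains) {
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            switch (types[i]) {
                case REAL:
                    // REAL values are already doubles.
                    out[i] = Double.parseDouble(raw[i]);
                    break;
                case INTEGER:
                    // INTEGER values are cast to double.
                    out[i] = (double) Integer.parseInt(raw[i]);
                    break;
                default:
                    // NOMINAL values: assumed index in the value list, then cast.
                    out[i] = (double) domains.get(i).indexOf(raw[i]);
                    break;
            }
        }
        return out;
    }
}
```

With this layout, algorithms that only handle numeric vectors can consume any instance, at the cost of imposing an arbitrary order on nominal values.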

Attributes

Attributes is a static class which stores the definitions of the attributes represented in the data set. It contains an array of Attribute objects, and two additional arrays storing references to the input and output attributes. The attributes are stored in the same order in which they were found in the input data file.

Its public methods are:

getInputAttributes
Returns an array containing all the input Attributes.
getOutputAttributes
Returns an array containing all the output Attributes.
getInputAttribute
Returns a single input attribute.
getOutputAttribute
Returns a single output attribute.
getAttribute
Returns a single attribute, defined neither as input nor as output attribute.
getNumInputAttributes
Returns the number of input attributes.
getNumOutputAttributes
Returns the number of output attributes.
getNumAttributes
Returns the number of attributes, including input, output and undefined ones.

Attribute

The Attribute class contains the definition of an attribute of the dataset. Its public methods are:

getType
Returns an integer value defining the type of the attribute (the type is defined as NOMINAL, INTEGER or REAL).
getName
Returns the name of the attribute.
getMinAttribute
Returns the minimum value of the attribute (only available in INTEGER or REAL attributes).
getMaxAttribute
Returns the maximum value of the attribute (only available in INTEGER or REAL attributes).
getNominalValuesList
Returns an array with all the values defined for the attribute (only available in NOMINAL attributes).
convertNominalValue
Converts a nominal value to its integer representation (an integer in [0 ... N-1], where N is the number of values defined for the attribute).
getDirectionAttribute
Returns an integer indicating whether the attribute is defined as an input attribute (INPUT), an output attribute (OUTPUT), or undefined (DIR_NOT_DEF).
getNewValuesInTest
Returns an array with the new values of the attribute observed in the test data.


 
 Copyright 2004-2015, KEEL (Knowledge Extraction based on Evolutionary Learning)