KEEL Reference Manual
API Dataset
One of the main components of KEEL is the API Dataset. It manages the entire
process of acquiring, processing and validating the data files, and offers
the data sets to developers in a suitable form, so that they do not have to
load the data needed for an experiment themselves.
This section describes three key concepts of the API Dataset:
- Data files grammar: The grammar employed to define the data files. Any file generated by this grammar is a valid data file, according to the rules given in the data files section.
- Semantic restrictions of the data files: Apart from the syntactic restrictions, the API Dataset performs some semantic checks on the data files.
- Description of the classes: To close this section, the main public classes of the API Dataset are described.
Data files grammar
This subsection presents the grammar that describes the format of the KEEL
data files. The terminal tokens of the grammar are:
- {} = Denotes the empty production, also known as λ or ε.
- IDENT = An identifier: ('A'-'Z', 'a'-'z', '0'-'9')*
- INTEGER = An integer value: ('0'-'9')+
- REAL = A real value: ('0'-'9')+[.('0'-'9')*]
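The three lexical tokens above can be transcribed as regular expressions. The following is only a sketch: the class KeelTokenSketch and its method names are hypothetical and are not part of the API Dataset. It transcribes the definitions literally, so no sign or exponent is accepted and IDENT also matches the empty string.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the KEEL API Dataset. */
final class KeelTokenSketch {
    // IDENT = ('A'-'Z', 'a'-'z', '0'-'9')*   (as written, also matches "")
    static final Pattern IDENT = Pattern.compile("[A-Za-z0-9]*");
    // INTEGER = ('0'-'9')+
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // REAL = ('0'-'9')+[.('0'-'9')*]   (the fractional part is optional)
    static final Pattern REAL = Pattern.compile("[0-9]+(\\.[0-9]*)?");

    static boolean isIdent(String s) { return IDENT.matcher(s).matches(); }
    static boolean isInteger(String s) { return INTEGER.matcher(s).matches(); }
    static boolean isReal(String s) { return REAL.matcher(s).matches(); }
}
```

Under these definitions every INTEGER token is also a valid REAL token, so a parser has to decide between the two from the declared attribute type rather than from the token alone.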
Semantic restrictions of the data files
Attributes
An attribute can be defined as integer, real or nominal, as specified by the
grammar of the data files. Defining the minimum and maximum values, or the
list of values, of an attribute is optional (if they are not defined, the
correct values are extracted while the training file is processed). If they
are defined for an integer or real attribute, the minimum value must be
lower than the maximum.
In this way, the limits of the values of every attribute are established while
the training file is processed. However, the test file may contain values that
exceed the limits of a given attribute (e.g., in some cross-validation
schemes). Depending on the type of the attribute, the API Dataset acts as follows:
- Integer or Real attributes: The new value is replaced by its nearest valid value (i.e., if the value is greater than the maximum, it is replaced by the maximum; if it is lower than the minimum, it is replaced by the minimum).
- Nominal attributes: The new value is accepted and added to the domain of the attribute. In addition, the flag newValueInTest is set.
Finally, if either of these cases occurs, the API Dataset throws a TestDataBoundsExcedeedException
to report the changes performed. The files are nevertheless parsed correctly.
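The handling of out-of-range test values can be illustrated with a small sketch. The class TestBoundsSketch and its methods are hypothetical stand-ins written for this example; they only model the two rules above, and they neither use the real API Dataset classes nor throw the TestDataBoundsExcedeedException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; the real API Dataset classes may differ. */
final class TestBoundsSketch {

    /** Integer/real rule: an out-of-range test value is replaced by the nearest bound. */
    static double clampToTrainingRange(double value, double min, double max) {
        if (value > max) {
            return max;
        }
        if (value < min) {
            return min;
        }
        return value;
    }

    /**
     * Nominal rule: an unseen test value is accepted and appended to the domain.
     * Returns true when the domain was enlarged, mirroring the newValueInTest flag.
     */
    static boolean acceptNominal(List<String> domain, String value) {
        if (domain.contains(value)) {
            return false;
        }
        domain.add(value);
        return true;
    }

    public static void main(String[] args) {
        // Training range [0, 10]: the test value 12 is replaced by the maximum.
        System.out.println(clampToTrainingRange(12.0, 0.0, 10.0));

        // Training domain {red, green}: "blue" is accepted and added.
        List<String> colours = new ArrayList<>(List.of("red", "green"));
        System.out.println(acceptNominal(colours, "blue") + " " + colours);
    }
}
```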
Inputs and outputs definition
The definition of inputs and outputs in the data files is optional. The API Dataset
automatically infers the missing definitions, following these rules:
- If no outputs are defined:
  - If no inputs are defined, the last attribute is taken as the output and the remaining ones as inputs.
  - If some inputs are defined, the attributes not marked as inputs are taken as outputs.
- If no inputs are defined, the attributes not marked as outputs are taken as inputs.
- If both inputs and outputs are defined, the attributes that do not belong to either category are discarded.
Note also that the input and output attributes are defined in the same order
as they appear in the header of the data file.
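The inference rules above can be sketched as a single pass over the header. DirectionInferenceSketch, Roles and infer are hypothetical names introduced for this example; the sketch assumes attributes are identified by name and is not the API Dataset's own implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the input/output inference rules; not KEEL code. */
final class DirectionInferenceSketch {

    /** Inferred roles, each list kept in header order. */
    static final class Roles {
        final List<String> inputs = new ArrayList<>();
        final List<String> outputs = new ArrayList<>();
    }

    static Roles infer(List<String> header,
                       List<String> declaredInputs,
                       List<String> declaredOutputs) {
        Roles roles = new Roles();
        boolean noInputs = declaredInputs.isEmpty();
        boolean noOutputs = declaredOutputs.isEmpty();

        // Walking the header guarantees that both lists follow header order.
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            boolean isInput;
            boolean isOutput;
            if (noInputs && noOutputs) {
                // Nothing declared: the last attribute is the output.
                isOutput = (i == header.size() - 1);
                isInput = !isOutput;
            } else if (noOutputs) {
                // Only inputs declared: everything else is an output.
                isInput = declaredInputs.contains(name);
                isOutput = !isInput;
            } else if (noInputs) {
                // Only outputs declared: everything else is an input.
                isOutput = declaredOutputs.contains(name);
                isInput = !isOutput;
            } else {
                // Both declared: attributes in neither list are discarded.
                isInput = declaredInputs.contains(name);
                isOutput = declaredOutputs.contains(name);
            }
            if (isInput) {
                roles.inputs.add(name);
            } else if (isOutput) {
                roles.outputs.add(name);
            }
        }
        return roles;
    }
}
```

For a header a, b, c, d with no declarations, this sketch yields inputs a, b, c and output d. Declaring the inputs in a different order (for example b, a) still yields a, b, because the result follows the header.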
Missing values
The API Dataset allows missing values in the data files, written with the
<null> or <?> tokens. However, only input attributes may have missing
values. If a missing value is detected in an output attribute, an
OutputValueNotKnownException is thrown, aborting the processing of the data file.
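As a rough illustration of this rule, the sketch below marks missing input values and rejects missing output values. MissingValueSketch is a hypothetical class, and the nested exception is a local stand-in that only borrows the name used in the text; it is declared unchecked here for brevity, which is an assumption and may not match the real class.

```java
/** Hypothetical sketch of the missing-value rule; not KEEL code. */
final class MissingValueSketch {

    /** Local stand-in for the exception named in the text (assumed unchecked here). */
    static final class OutputValueNotKnownException extends RuntimeException {
        OutputValueNotKnownException(String message) {
            super(message);
        }
    }

    /** A value is missing when it is written as <null> or <?>. */
    static boolean isMissingToken(String token) {
        return token.equals("<null>") || token.equals("<?>");
    }

    /**
     * Returns a mask flagging the missing input values of one row.
     * A missing output value aborts processing with an exception.
     */
    static boolean[] inputMissingMask(String[] inputs, String[] outputs) {
        for (int i = 0; i < outputs.length; i++) {
            if (isMissingToken(outputs[i])) {
                throw new OutputValueNotKnownException(
                        "Missing value in output attribute at position " + i);
            }
        }
        boolean[] mask = new boolean[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            mask[i] = isMissingToken(inputs[i]);
        }
        return mask;
    }
}
```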
Train and test files
The semantic checks performed by the API Dataset depend on the kind of data
file being processed. Specifically:
- The definition of the attributes is taken from the training file.
- While the test file is read, its attribute definitions are checked. If they are not consistent with those read from the training file, the processing of the test file is aborted. Likewise, the inputs and outputs defined by the test file must be the same as those defined by the training file; otherwise, the processing of the test file is aborted.
Description of the classes
The API Dataset is composed of four main classes:
- InstanceSet: This class contains a complete set of instances defining a data set.
- Instance: This class represents a single instance.
- Attributes: This static class contains the definitions of every attribute of the data contained in the instance set.
- Attribute: This class contains relevant information about a single attribute.
The next subsections describe their main characteristics.
InstanceSet
This class contains a complete set of instances. Its public methods are:
- numInstances: Returns the number of instances in the instance set.
- getInstance: Returns a specific instance contained in the instance set.
- getInstances: Returns an array with all the instances of the instance set.
Instance
The objects of this class represent instances of the data sets. Its public methods are:
- getInputRealValues: Returns an array containing all the input values of the instance (only the positions holding INTEGER or REAL attributes produce a value).
- getInputNominalValues: Returns an array containing all the input values of the instance (only the positions holding NOMINAL attributes produce a value).
- getInputMissingValues: Returns a boolean array indicating which input values are missing.
- getInputRealValue: Returns the value of a specific input attribute (only the positions holding INTEGER or REAL attributes produce a value).
- getInputNominalValue: Returns the value of a specific input attribute (only the positions holding NOMINAL attributes produce a value).
- getInputMissingValue: Returns a boolean value indicating whether the input value is missing.
- getOutputRealValues: Returns an array containing all the output values of the instance (only the positions holding INTEGER or REAL attributes produce a value).
- getOutputNominalValues: Returns an array containing all the output values of the instance (only the positions holding NOMINAL attributes produce a value).
- getOutputMissingValues: Returns a boolean array indicating which output values are missing.
- getOutputRealValue: Returns the value of a specific output attribute (only the positions holding INTEGER or REAL attributes produce a value).
- getOutputNominalValue: Returns the value of a specific output attribute (only the positions holding NOMINAL attributes produce a value).
- getOutputMissingValue: Returns a boolean value indicating whether the output value is missing.
- getAllInputValues: Returns an array containing all the input values. REAL values are returned as double values. INTEGER values are cast to double. NOMINAL values are converted to INTEGER and cast to double.
- getAllOutputValues: Returns an array containing all the output values. REAL values are returned as double values. INTEGER values are cast to double. NOMINAL values are converted to INTEGER and cast to double.
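The conversion described for getAllInputValues and getAllOutputValues can be sketched as follows. AllValuesSketch, its Type enum and toDoubles are hypothetical names for this example only. The sketch assumes that a nominal value maps to its position in the attribute's list of values, which matches the [0, N-1] range given for convertNominalValue in the Attribute class but is otherwise an assumption.

```java
import java.util.List;

/** Hypothetical sketch of the "all values as double" conversion; not KEEL code. */
final class AllValuesSketch {

    enum Type { NOMINAL, INTEGER, REAL }

    /** Assumed mapping: a nominal value becomes its index in the domain, in [0, N-1]. */
    static int convertNominalValue(List<String> domain, String value) {
        int index = domain.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown nominal value: " + value);
        }
        return index;
    }

    /**
     * Converts one row of raw values to doubles.
     * domains.get(i) is only read when types[i] is NOMINAL.
     */
    static double[] toDoubles(Type[] types, String[] raw, List<List<String>> domains) {
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            switch (types[i]) {
                case REAL:
                    out[i] = Double.parseDouble(raw[i]);
                    break;
                case INTEGER:
                    out[i] = (double) Integer.parseInt(raw[i]);
                    break;
                default: // NOMINAL
                    out[i] = (double) convertNominalValue(domains.get(i), raw[i]);
                    break;
            }
        }
        return out;
    }
}
```

A uniform double array is convenient for numeric algorithms, but the encoded nominal values carry no ordering or distance, so a caller still needs the attribute types to interpret each position.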
Attributes
Attributes is a static class which stores the definitions of the attributes represented
in the data set. It contains an array of Attribute objects, and two additional arrays
storing references to the input and output attributes. The attributes are stored
in the same order in which they were found in the input data file.
Its public methods are:
- getInputAttributes: Returns an array containing all the input attributes.
- getOutputAttributes: Returns an array containing all the output attributes.
- getInputAttribute: Returns a single input attribute.
- getOutputAttribute: Returns a single output attribute.
- getAttribute: Returns a single attribute, defined neither as input nor as output attribute.
- getNumInputAttributes: Returns the number of input attributes.
- getNumOutputAttributes: Returns the number of output attributes.
- getNumAttributes: Returns the number of attributes, including input, output and undefined ones.
Attribute
The Attribute class contains the definition of an attribute of the data set. Its public methods are:
- getType: Returns an integer value defining the type of the attribute (NOMINAL, INTEGER or REAL).
- getName: Returns the name of the attribute.
- getMinAttribute: Returns the minimum value of the attribute (only available for INTEGER or REAL attributes).
- getMaxAttribute: Returns the maximum value of the attribute (only available for INTEGER or REAL attributes).
- getNominalValuesList: Returns an array with all the values defined for the attribute (only available for NOMINAL attributes).
- convertNominalValue: Converts a nominal value to its integer representation (an integer in [0, N-1], where N is the number of values defined for the attribute).
- getDirectionAttribute: Returns an integer indicating whether the attribute is defined as an input attribute (INPUT), an output attribute (OUTPUT), or undefined (DIR_NOT_DEF).
- getNewValuesInTest: Returns an array with the new values of the attribute observed in the test data.