DATA EXPORT
This feature allows you to transform datasets in KEEL format into the desired format (txt, excel, xml, html table, etc.).
This conversion is done in a semi-automatic way.
First of all, you must select the source file of the dataset and the destination directory for the transformed dataset. The only exception is a conversion involving a database, for which you only need to specify the source file of the dataset in KEEL format.
Depending on the format you chose, you must provide some additional information.
Once all the requested information is provided, push the 'Convert' button and your dataset will be transformed into the chosen format.
KEEL DATA FILE FORMAT
The KEEL data files must have the following format:
@relation <name>
where <name> is a string. The string must be quoted if the name includes spaces.
The format for the @attribute statement is:
- @attribute <name> integer [{min,max}]
- @attribute <name> real [{min,max}]
- @attribute <name> {<value 1>, <value 2>, ..., <value N>}
- @inputs <name1>, <name2>, ..., <nameN>
- @outputs <name1>, <name2>, ..., <nameM>
where each <name> is a string that must match an attribute defined in a preceding @attribute statement.
@data
x11, x12, ..., x1N
x21, x22, ..., x2N
..., ..., ..., ...
xM1, xM2, ..., xMN
The files will be saved, by default, with the '.dat' extension.
One example of a valid KEEL file is:
@relation paint
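The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `keel_lines` and the sample attribute names are hypothetical, and the attribute specification (type, range or value set) is passed through as plain text rather than validated.

```python
def keel_lines(relation, attributes, inputs, outputs, rows):
    """Build the lines of a KEEL .dat file (illustrative sketch).

    attributes: list of (name, spec) pairs, where spec is the text that
    follows the name in the @attribute statement, e.g. "integer" or
    "{red, green}".
    """
    # The relation name must be quoted if it includes spaces.
    name = f'"{relation}"' if " " in relation else relation
    lines = [f"@relation {name}"]
    lines += [f"@attribute {attr} {spec}" for attr, spec in attributes]
    lines.append("@inputs " + ", ".join(inputs))
    lines.append("@outputs " + ", ".join(outputs))
    lines.append("@data")
    # One comma-separated line per instance, in attribute order.
    lines += [", ".join(str(value) for value in row) for row in rows]
    return lines


# Hypothetical two-attribute dataset, for illustration only.
example = keel_lines(
    "paint test",
    [("colour", "{red, green}"), ("coats", "integer")],
    inputs=["coats"],
    outputs=["colour"],
    rows=[("red", 2), ("green", 3)],
)
print("\n".join(example))
```

Writing the returned lines to a file with the '.dat' extension would give a file in the layout described in this section.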
CSV (comma-separated values) is one implementation of a delimited text file, which uses a comma to separate values. The CSV file format is very simple and is supported by almost all spreadsheets and database management systems.
The characteristics of these files are the following:
The first record in a CSV file may be a header record containing the names of the columns.
Each record in a file can have fewer fields than the number of header columns. In this case, empty values are considered missing values.
Each row must have the same number of fields separated by commas.
Two adjacent commas, or a comma at the beginning or end of the line (ignoring space characters), indicate null values.
The separator between the integer and fractional parts of real numbers is a point instead of a comma.
Leading and trailing space characters adjacent to comma field separators are ignored.
Each record is one line terminated by a newline character or a carriage return.
Blank lines are ignored.
Fields that contain double-quote characters must be surrounded by double quotes, and each embedded double quote must be represented by a pair of consecutive double quotes.
Fields with leading or trailing spaces, or containing commas, must be delimited with double-quote characters.
The value delimiter can be a character other than the comma. Many implementations of CSV allow an alternate separator, such as the tab character, in which case the resulting format is TSV (Tab Separated Values).
The last record in a file may or may not end with an end-of-line character.
These files are stored, by default, with the extension ".csv".
The CSV (Comma-Separated Values) data files must have the following format:
attribute1, attribute2, ..., attributeN
value11, value12, ..., value1N
...
valueM1, valueM2, ..., valueMN
One example of a valid CSV file is:
FirstName, LastName, Company, EmailAddress
Johnathan,Doe,"ABC Company","johndoe@abccompany.com"
Harrie,Wong,"Company Inc.","hwong@myprovider.com"
Mary,"Jo Smith","Any Corp.","mjsmith@myprovider.com"
The CSV examples in this section show several of the rules explained above: a null value expressed as two consecutive commas, the decimal point as the separator for real numbers, and double quotes, which allow a comma to be treated as part of the data and not as a separator.
Another example of a valid CSV file is:
OBS,CAREXPEND,DISPOSINC,DOLLARVALUE,WAGES
"1960:1",14.2,362,,270.7
"1960:2",14.1,365.9,,273.4
"1960:3",14.6,367.6,,273.9
"1960:4",13.2,369.2,,273.3
"1961:1",10.8,72.9,,273.7
"1961:2",11.7,378.4,,277.6
"1961:3",12.2,385.1,,282.2
"1961:4",13.7,393.2,,288.4
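To see how the rules above play out when such a file is read, here is a minimal sketch using Python's standard `csv` module. The helper name `read_csv_records` is hypothetical, and mapping empty fields to `None` is an assumption made here to represent missing values; it is not a description of how the KEEL tool itself reads CSV.

```python
import csv
import io


def read_csv_records(text):
    """Parse CSV text into (header, rows); empty fields become None."""
    # skipinitialspace drops spaces that directly follow a comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    records = [row for row in reader if row]  # blank lines are ignored
    header, data = records[0], records[1:]
    rows = [[field if field != "" else None for field in row] for row in data]
    return header, rows


sample = (
    "OBS,CAREXPEND,DISPOSINC,DOLLARVALUE,WAGES\n"
    '"1960:1",14.2,362,,270.7\n'
    "\n"
    '"1960:2",14.1,365.9,,273.4\n'
)
header, rows = read_csv_records(sample)
print(header)
print(rows[0])
```

The two consecutive commas in each data line are read as a missing DOLLARVALUE, and the quoted first field keeps its colon and digits as plain text.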
TXT (Text Separated by Tabs) or TSV (Tab Separated Values) is a simple text format that allows tabular data to be exchanged between applications with different internal formats. Tab-separated values have been officially registered as a MIME (Multipurpose Internet Mail Extensions) type under the name text/tab-separated-values.
The characteristics of these files are the following:
A file in TXT format consists of lines. Each line contains fields separated from one another by the tab character (horizontal tab, HT, control code 9 in ASCII).
Fields can be any string of characters, excluding tabs. Tabs rarely appear in data items that you wish to tabulate, so this is seldom a restriction. Other formats are very similar to TSV but use a different separator, such as Comma Separated Values (CSV), which uses the comma. Commas, spaces, and other characters used as separators in such formats appear rather often in the data to be tabulated, at least in header fields.
Each line must contain the same number of fields.
The first line contains the names of the fields or attributes, i.e. the column headers.
An empty value is represented as an empty field between tabs.
Such files can be read and edited with any text editor.
Although TSV is a text format, it should not be expected to appear as a neatly aligned table when it is printed or displayed in an editor.
The extension for this type of file is ".txt" or ".tsv".
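As an illustration of the characteristics listed above, the following sketch writes tab-separated text with Python's standard `csv` module. The function name `to_tsv` is hypothetical, and representing a missing value as `None` on the Python side is an assumption of this example.

```python
import csv
import io


def to_tsv(header, rows):
    """Return tab-separated text with one header line and one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # A missing value (None) becomes an empty field between tabs.
        writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()


text = to_tsv(
    ["FirstName", "LastName", "Company"],
    [["Mary", "Jo Smith", "Any Corp."], ["Harrie", "Wong", None]],
)
print(text)
```

Because the separator is the tab character, values containing spaces or commas such as "Jo Smith" need no quoting.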
The TXT (Text Separated by Tabs) or TSV (Tab Separated Values) data files must have the following format:
attribute1<TAB>attribute2<TAB>...<TAB>attributeN
value11<TAB>value12<TAB>...<TAB>value1N
...
valueM1<TAB>valueM2<TAB>...<TAB>valueMN
One example of a valid TXT or TSV file is the following:
FirstName<TAB>LastName<TAB>Company<TAB>EmailAddress
Johnathan<TAB>Doe<TAB>ABC Company<TAB>johndoe@abccompany.com
Harrie<TAB>Wong<TAB>Company Inc.<TAB>hwong@myprovider.com
Mary<TAB>Jo Smith<TAB>Any Corp.<TAB>mjsmith@myprovider.com
The PRN format has the same features and restrictions as the CSV format; the difference is that the separator between fields in PRN format is the space. However, spaces in PRN format play a different role than in CSV files.
The characteristics of these files are the following:
The first record in a PRN file may be a header record containing the names of the columns.
Each record in a file with column headers can have fewer fields than the number of headers. In this case, empty values are considered missing values.
Each row must have the same number of fields separated by spaces.
Several consecutive spaces are treated as a single space.
Spaces at the beginning or end of the line indicate null values.
The decimal separator for numbers is a point instead of a comma.
Each record is one line terminated by a newline character or a carriage return.
Blank lines are ignored.
Fields can contain double quotes, carriage returns (or any other character).
Fields that contain the space character as part of their value must be surrounded by double quotes.
A record consisting of a single field without any value must be written as a text-type field so that it is not ignored.
The last record in a file may or may not end with an end-of-line character.
These files are stored, by default, with the extension ".prn".
PRN files have the data separated by blank spaces, so these data files must have the following format:
attribute1 attribute2 ... attributeN
value11 value12 ... value1N
...
valueM1 valueM2 ... valueMN
One example of a valid PRN file is the following:
OBS DELL GE YAHOO
1 26.99 48.5 22.92
2 26 49.93 20.83
3 26.24 49.96 20.13
4 25.76 49.48 19.98
5 26.73 49.43 19.74
6 24.93 49.83 18.86
7 25.84 49.01 18.23
8 25.91 49.73 17.79
9 24.6 50.15 17.1
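A rough sketch of how space-separated records of this kind could be split in Python is shown below. It relies on `shlex.split` from the standard library, which collapses runs of spaces and keeps double-quoted fields whole. The helper name `parse_prn` is hypothetical, and the approach is a simplification: `shlex` also gives special meaning to apostrophes and backslashes, which a real PRN reader would need to handle differently.

```python
import shlex


def parse_prn(text):
    """Split PRN text into (header, rows) using shell-like tokenisation."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored
        # Several spaces count as one separator; "quoted fields" stay whole.
        records.append(shlex.split(line))
    return records[0], records[1:]


sample = 'OBS DELL GE YAHOO\n1  26.99 48.5 22.92\n\n2 "not traded" 49.93 20.83\n'
header, rows = parse_prn(sample)
print(header)
print(rows)
```

The quoted value "not traded" is invented for the example to show a field that contains a space.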
Files are encoded according to the C4.5 format. This format consists of two files: a names file with the extension ".names" and a data file with the extension ".data".
The characteristics of names files are the following:
The .names file contains a series of entries that describe the classes, attributes and values of the dataset. Each entry is terminated with a point (the point can be omitted if it would have been the last character on a line). Each name consists of a string of characters without commas, quotes or colons (unless escaped by a vertical bar, |).
A name can contain a point, provided this point is not followed by white space (a point followed by white space terminates the entry).
Embedded white spaces are permitted, but multiple white spaces are replaced by a single space.
The first record in the file lists the names of the classes, separated by commas (and terminated by a point). Each successive line then defines an attribute, in the order in which the attributes will appear in the .data file, with the following format:
<attribute-name>: <attribute-type>
The attribute-name is an identifier followed by a colon. The attribute-type must be one of:
continuous: the attribute takes continuous values.
discrete <n>: the word 'discrete' followed by an integer which indicates how many values the attribute can take.
a list of names separated by commas: the values that the attribute can take, as in "bereavement: yes, no."
ignore: indicates that this attribute should be ignored.
A | (vertical bar) means that the remainder of the line should be considered as a comment.
These files are stored, by default, with the extension ".names".
The format of the '.names' file is the following:
class-1, class-2, ..., class-N.
characteristic-1: domain.
characteristic-2: domain.
...
characteristic-M: domain.
The characteristics of data files are the following:
The file contains one line per object. Each line contains the values of the attributes, in the order given in the .names file, followed by the class of the object, with all entries separated by commas.
The format is the same as the CSV (comma-separated values) format, explained in the CSV Data File Format section.
Missing values are indicated by '?'.
These files are stored, by default, with the extension ".data".
The format of the '.data' file is the following:
value11, value12, ..., value1N
An example of C4.5 files is the following:
Content of the '.names' file:
| Firstly the name of classes
good, bad.
|Then the attributes
bereavement: yes, no.
Content of the '.data' file:
2,5.0,4.0,?,none,37,?,?,5,no,11,below average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,bad
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below average,yes,half,yes,full,bad
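The structure of a .names file can be made concrete with a small reader. The sketch below is a simplified illustration, not the C4.5 parser: it assumes one entry per line, treats everything after a vertical bar as a comment, and does not handle the escaping of special characters. The function name `parse_names` is hypothetical.

```python
def parse_names(text):
    """Return (classes, attributes) from the text of a C4.5 .names file.

    Simplified sketch: one entry per line, '|' starts a comment,
    escaped characters are not supported.
    """
    entries = []
    for line in text.splitlines():
        line = line.split("|", 1)[0].strip()  # drop the comment part
        if line.endswith("."):
            line = line[:-1]  # remove the terminating point
        if line:
            entries.append(line)
    # The first entry lists the class names, separated by commas.
    classes = [name.strip() for name in entries[0].split(",")]
    # Each following entry is "attribute-name: attribute-type".
    attributes = []
    for entry in entries[1:]:
        name, _, kind = entry.partition(":")
        attributes.append((name.strip(), kind.strip()))
    return classes, attributes


names_text = (
    "| Firstly the name of classes\n"
    "good, bad.\n"
    "|Then the attributes\n"
    "bereavement: yes, no.\n"
)
print(parse_names(names_text))
```

The attribute order returned here is the order in which the values are expected on each line of the .data file, with the class as the last entry.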
The Weka data files must have the following format:
Header. The relation name is defined on the first line of the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
Declaration of attributes. Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement, which uniquely defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects all of that attribute's values to be found in the third comma-delimited column.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name then the entire name must be quoted.
The <datatype> can be any of the following types supported by Weka (version 3.2.1):
1) NUMERIC or REAL. Numeric attributes can take real values.
2) INTEGER. Integer attributes can take integer values.
3) DATE. A date attribute may be followed by an optional format string specifying how date values should be parsed and printed. The default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss".
4) STRING. String attributes allow us to create attributes containing arbitrary textual values.
5) ENUMERATE. An enumerated attribute consists of a set of possible values (characters or strings), separated by commas, that the attribute can take. For example, an attribute that indicates the weather could be expressed as:
@attribute time {sunny, rainy, cloudy}
Data section. The data section of the file contains the data declaration line and the actual instance lines. The @data declaration is a single line denoting the start of the data segment in the file. The format is:
@data
x11, x12, ..., x1N
x21, x22, ..., x2N
Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance are delimited by commas. They must appear in the order in which they were declared in the header section (i.e. the data corresponding to the nth @attribute declaration is always the nth field of the instance).
Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Some of the specifications of this format are:
o The names of the relation and of the attributes are of string type. This string type is the same as the string type used in Java.
o If any name contains spaces, it must be enclosed in double quotes.
o Missing values are indicated with the symbol '?'.
o The decimal separator for numbers is a point instead of a comma.
o The separator for data in the @data section is the comma.
o A % symbol means that the remainder of the line should be considered a comment.
o These files are stored, by default, with the extension ".arff".
The WEKA data files must have the following format:
@relation <relation-name>
@attribute <attribute-name-1> <datatype>
...
@attribute <attribute-name-N> <datatype>
@data
value11,value12,...,value1N
...
valueM1,valueM2,...,valueMN
One example of a valid WEKA file is:
% Comment
@relation weather
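The ARFF layout described in this section can be sketched as a short Python function. It is an illustration only: the name `arff_lines` and the sample attributes are hypothetical, the datatype is passed through as text, and a missing value is represented by `None` on the Python side (an assumption of this example).

```python
def arff_lines(relation, attributes, rows):
    """Build the lines of an ARFF file (illustrative sketch).

    attributes: list of (name, datatype) pairs, e.g. ("temperature", "REAL")
    or ("time", "{sunny, rainy, cloudy}").
    """

    def quote(name):
        # Names that include spaces must be quoted.
        return f'"{name}"' if " " in name else name

    lines = [f"@relation {quote(relation)}"]
    lines += [f"@attribute {quote(name)} {datatype}" for name, datatype in attributes]
    lines.append("@data")
    for row in rows:
        # Missing values are written as a single question mark.
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return lines


print("\n".join(arff_lines(
    "weather",
    [("time", "{sunny, rainy, cloudy}"), ("temperature", "REAL")],
    [("sunny", 21.5), (None, 18.0)],
)))
```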
Microsoft Excel is a spreadsheet program written and distributed by Microsoft. It is currently the most widely used spreadsheet for the Microsoft Windows and Apple Macintosh operating systems. It is integrated as part of Microsoft Office.
A spreadsheet is a program that allows you to manipulate numerical and alphanumeric data. Spreadsheets are arranged in rows and columns. The intersection of a row and a column is called a cell. Each cell can contain data or a formula that can refer to the contents of other cells. A spreadsheet contains 256 columns, which are labelled with letters (from A to IV), and 65,536 rows, which are labelled with numbers (from 1 to 65,536), making a total of 16,777,216 cells per spreadsheet.
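The column labelling and the cell count quoted above can be checked with a short Python sketch. The helper name `column_label` is hypothetical; it converts a 1-based column index into the letter label used by the spreadsheet.

```python
def column_label(index):
    """Convert a 1-based column index to its letter label (1 -> A, 27 -> AA)."""
    label = ""
    while index > 0:
        # Work in base 26 with digits A..Z and no zero digit.
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


print(column_label(1), column_label(26), column_label(27), column_label(256))
print(256 * 65536)  # total number of cells in one sheet
```

The 256th column is labelled IV, and 256 columns times 65,536 rows gives the 16,777,216 cells mentioned in the text.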
Because of the versatility of modern spreadsheets, they are sometimes used to build small databases, reports, and other applications.
The Microsoft Excel format has the extension ".xls".
One example of a valid EXCEL file is:
DIF (Data Interchange Format) is a text file format used to import and export data between different spreadsheet programs such as Excel, StarCalc, dBase, and so on.
Files in this format are stored with the extension ".dif".
The characteristics of these files are the following:
The format consists of a header followed by a data block, stored as ASCII text. The header uses the following values:
o string is any string; it is often the file name or other information.
o columns is the number of columns of the spreadsheet.
o rows is the number of rows of the spreadsheet.
The header ends with the following:
This header is followed by the cells and records of the spreadsheet with the information.
The structure of the data record has the following format:
where data-type admits various types: SPECIAL, NUMERIC and STRING, represented by -1, 0 and 1 respectively.
o SPECIAL type
where BOT and EOD are strings without quotation marks. BOT represents the start of the table and EOD the end of data section.
o NUMERIC type
where value-indicator indicates the data type stored in data:
- TRUE: 1.
- FALSE: 0.
- V: any numerical value.
- NA: missing value.
- ERROR: 0.
o STRING type
where string is any sequence of text characters.
One example of a valid DIF file is the following:
Month | Week | Vehicle | Quantity |
January | 1 | Auto | 105.000 |
January | 1 | Truck | 1.050 |
January | 1 | Bus | 1.575 |
January | 1 | Truck | 2.100 |
January | 1 | Motorbike | 583 |
The internal format of the generated DIF file is the following (shown here with one row of the table per line):
TABLE
VECTORS
-1,0 BOT 1,0 "Month" 1,0 "Week" 1,0 "Vehicle" 1,0 "Quantity"
-1,0 BOT 1,0 "January" 0,1 V 1,0 "Auto" 0,105.000 V
-1,0 BOT 1,0 "January" 0,1 V 1,0 "Truck" 0,1.050 V
-1,0 BOT 1,0 "January" 0,1 V 1,0 "Bus" 0,1.575 V
-1,0 BOT 1,0 "January" 0,1 V 1,0 "Truck" 0,2.100 V
-1,0 BOT 1,0 "January" 0,1 V 1,0 "Motorbike" 0,583 V
-1,0 EOD
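The data records described above (SPECIAL, NUMERIC and STRING) can be generated with a few lines of Python. This sketch covers only the data block, not the header; the function name `dif_data_lines` is hypothetical, missing values are represented by `None` on the Python side, and strings containing double quotes are not handled.

```python
def dif_data_lines(rows):
    """Return the data-block lines of a DIF file for the given rows."""
    lines = []
    for row in rows:
        lines += ["-1,0", "BOT"]  # SPECIAL record opening a row
        for value in row:
            if value is None:
                lines += ["0,0", "NA"]  # NUMERIC record flagged as missing
            elif isinstance(value, bool):
                # bool is tested before int because bool is a subclass of int
                lines += [f"0,{int(value)}", "TRUE" if value else "FALSE"]
            elif isinstance(value, (int, float)):
                lines += [f"0,{value}", "V"]  # NUMERIC record
            else:
                lines += ["1,0", f'"{value}"']  # STRING record
    lines += ["-1,0", "EOD"]  # SPECIAL record closing the data section
    return lines


print("\n".join(dif_data_lines([["Month", "Week"], ["January", 1]])))
```

Each value produces two lines: the data-type pair first, then the value indicator or the quoted string.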
XML (Extensible Markup Language) is a set of rules for defining semantic tags that organize a document into different parts. XML is a meta-language that defines the syntax used to define other structured markup languages.
We explain below the XML format that must be followed for the data file to be converted correctly:
The first line must have the following structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The declaration can have several attributes, some mandatory and others optional:
version: indicates the XML version used in the document. This field is compulsory.
encoding: indicates how the document has been encoded. The default option is UTF-8, but others are possible, such as UTF-16, US-ASCII, ISO-8859-1, etc. This field is optional.
standalone: specifies whether further documents, such as a DTD, are required to process the document. The default value is "no".
XML documents must follow a hierarchical structure by means of tags. XML elements can contain other elements. Elements may also have attributes, which are always expressed as name-value pairs in the element's start tag.
A well-formed document must conform to the following rules:
Element names are case sensitive; that is, the following is a well-formed matching pair: <step>…</step>, whereas this is not: <Step>…</step>.
Non-empty elements are delimited by both a start-tag and an end-tag.
Attribute values must always be quoted, using single or double quotes, and each attribute name should appear only once in any element.
All spaces and carriage returns are taken into account in the elements.
The element names must not begin with the letters “xml”.
Element names should not use the character ":".
Although it is permissible to use the characters "." and "-" in element names, it is not recommended because the application processing the XML file may interpret these signs as operators. Therefore these characters will be replaced in our tool by the character "_". The character "\" should not be used in element names.
Names may contain any alphanumeric character, but they cannot start with a numeric or punctuation character.
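The naming rules above can be illustrated with a small Python helper. Only the replacement of "." and "-" by "_" is stated in the text; how ":" and "\" are treated, and the "_" prefix for names starting with a digit or with "xml", are assumptions made for this sketch and are marked as such in the code. The function name `sanitize_element_name` is hypothetical.

```python
import re


def sanitize_element_name(name):
    """Return a version of name that follows the element naming rules."""
    # Stated in the text: "." and "-" are replaced by "_".
    cleaned = name.replace(".", "_").replace("-", "_")
    # Assumption: ":" and "\" are also mapped to "_", since they should not be used.
    cleaned = cleaned.replace(":", "_").replace("\\", "_")
    # Assumption: names starting with a digit or with "xml" get a "_" prefix.
    if re.match(r"\d|xml", cleaned, flags=re.IGNORECASE):
        cleaned = "_" + cleaned
    return cleaned


for raw in ["attribute-name.1", "2nd", "XmlData", "ns:value"]:
    print(raw, "->", sanitize_element_name(raw))
```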
Special characters can be represented either using entity references or by means of numeric character references. An example of a numeric character reference is "&#x20AC;", which refers to the Euro symbol by means of its Unicode code point in hexadecimal.
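Both kinds of reference discussed in this part of the section can be produced with Python's standard library. The sketch below builds a hexadecimal numeric character reference and escapes the five characters that have predeclared entities; the helper names `numeric_reference` and `escape_text` are hypothetical.

```python
from xml.sax.saxutils import escape


def numeric_reference(char):
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(char):X};"


def escape_text(text):
    """Replace &, <, >, double and single quotes by their entity references."""
    # escape() handles &, < and > itself; the quotes are passed as extras.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(numeric_reference("\u20ac"))   # the Euro symbol
print(escape_text('Tom & "Jerry" <b>'))
```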
An entity reference is a placeholder that represents that entity. It consists of the entity's name preceded by an ampersand ("&") and followed by a semicolon (";"). XML has five predeclared entities:
& (ampersand) is &amp;
< (less than) is &lt;
> (greater than) is &gt;
' (apostrophe) is &apos;
" (quotation mark) is &quot;
Comments can be placed anywhere in the tree, including in the text if the content of the element is text. XML comments start with <!-- and end with -->.
<!-- This is a comment -->
XML requires that elements be properly nested, that is, elements may never overlap. For example, the code below is not well-formed XML, because the <em> and <strong> elements overlap:
<!-- WRONG! NOT WELL-FORMED XML! -->
<p>Normal<em>emphasized<strong>strong emphasized</em>strong</strong></p>
All XML documents must contain a single tag pair to define the root element. All other elements must be nested within the root element. All elements can have sub (children) elements. Sub elements must be in pairs and correctly nested within their parent element.
The <root> tag indicates the beginning of the data. This tag can have any name. If the children of <root> do not all have the same tag name, the user must enter the name of the tag used for rows (<row>); otherwise it is assumed that all children have the same name.
Each <row> tag is the parent of as many tags as there are attributes. The name of each of these child tags is the attribute name, and its content is the value of that attribute.
There are as many <row> tags as there are rows of data.
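The root, row and attribute structure described above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is an illustration rather than the tool's own reader: the function name `read_rows` is hypothetical, and it accepts both a layout where the child tag is the attribute name and a layout where a generic child tag carries the attribute name in a `name` attribute.

```python
import xml.etree.ElementTree as ET


def read_rows(xml_text):
    """Return one dict per row, mapping attribute names to text values."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root:  # every child of the root element is one row of data
        record = {}
        for child in row:
            # Use the "name" attribute if present, otherwise the tag itself.
            key = child.get("name", child.tag)
            record[key] = (child.text or "").strip()
        rows.append(record)
    return rows


by_tag = "<root><customer><id>5</id><name>My book</name></customer></root>"
by_field = (
    '<root><row><field name="id">5</field>'
    '<field name="name">My book</field></row></root>'
)
print(read_rows(by_tag))
print(read_rows(by_field))
```

Both inputs yield the same list of records, which shows that the two layouts carry the same data.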
One XML format valid for KEEL is the following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<row1>
<attribute-name-1> attribute-value-11 </attribute-name-1>
<attribute-name-2> attribute-value-12 </attribute-name-2>
...
<attribute-name-N> attribute-value-1N </attribute-name-N>
</row1>
...
<rowM>
<attribute-name-1> attribute-value-M1 </attribute-name-1>
<attribute-name-2> attribute-value-M2 </attribute-name-2>
...
<attribute-name-N> attribute-value-MN </attribute-name-N>
</rowM>
</root>
Another XML format valid for KEEL is the following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<row1>
<field name="attribute-name-1">attribute-value-11 </field>
<field name="attribute-name-2">attribute-value-12 </field>
...
<field name="attribute-name-N">attribute-value-1N </field>
</row1>
...
<rowM>
<field name="attribute-name-1">attribute-value-M1 </field>
<field name="attribute-name-2">attribute-value-M2 </field>
...
<field name="attribute-name-N">attribute-value-MN </field>
</rowM>
</root>
One example of a valid XML file is the following. In this example there are 9 attributes and 2 instances:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<customer>
<id>5</id>
<course>66</course>
<name>My book</name>
<summary>Book summary</summary>
<numbering>2</numbering>
<disableprinting>0</disableprinting>
<customtitles>1</customtitles>
<timecreated>1114095924</timecreated>
<timemodified>1114097355</timemodified>
</customer>
<customer>
<id>6</id>
<course>207</course>
<name>My book</name>
<summary>A test summary</summary>
<numbering>1</numbering>
<disableprinting>0</disableprinting>
<customtitles>0</customtitles>
<timecreated>1114095966</timecreated>
<timemodified>1114095966</timemodified>
</customer>
</root>
The following example has another XML structure but the same data as the previous example, again with 9 attributes and 2 instances.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<field name="id">5</field>
<field name="course">66</field>
<field name="name">My book</field>
<field name="summary">Book summary</field>
<field name="numbering">2</field>
<field name="disableprinting">0</field>
<field name="customtitles">1</field>
<field name="timecreated">1114095924</field>
<field name="timemodified">1114097355</field>
</row>
<row>
<field name="id">6</field>
<field name="course">207</field>
<field name="name">My book</field>
<field name="summary">A test summary</field>
<field name="numbering">1</field>
<field name="disableprinting">0</field>
<field name="customtitles">0</field>
<field name="timecreated">1114095966</field>
<field name="timemodified">1114095966</field>
</row>
</root>
HTML (Hypertext Markup Language) is the predominant markup language for web pages. It provides a means to describe the structure of text-based information in a document (denoting certain text as headings, paragraphs, lists, and so on) and to supplement that text with interactive forms, embedded images, and other objects. HTML is written in the form of tags, surrounded by angle brackets.
HTML is an application of SGML according to the international standard ISO 8879. XHTML is a reformulation of HTML 4 as an XML 1.0 application, and it remains compatible with user agents that already support HTML 4 provided a set of rules is followed.
The basic HTML tags are:
<HTML>: the tag that defines the beginning of the document.
<HEAD>: defines the header of the document. This header normally contains information about the page, such as the TITLE, META tags for proper search engine indexing, STYLE tags, which determine the page layout, and JavaScript code for special effects. Within the header <HEAD> we find:
<TITLE>: defines the title of the page. This will be visible in the title bar of the viewer's browser.
<LINK>: defines some advanced features, for example the style sheets used for the design of the page.
<BODY>: contains the main content or body of the document; this is where you write your document and place your HTML code. It defines properties common to the entire page, such as background colour and margins. Within <BODY> you can use a great variety of tags. The tag used by our tool is <TABLE>, which defines the beginning of a table (<TR> represents rows and <TD> represents cells).
The format explained above corresponds to an HTML page with the following structure:
<HTML>
<HEAD> ... </HEAD>
<BODY>
...
<TABLE> ... </TABLE>
...
</BODY>
</HTML>
Tag <TABLE>
The HTML table model allows authors to arrange data -- text, preformatted text, images, links, forms, form fields, other tables, etc. -- into rows and columns of cells.
Tables are defined with the <table> tag. A table is divided into rows (with the <tr> tag), and each row is divided into data cells (with the <td> tag). The letters td stand for "table data", which is the content of a data cell. A data cell can contain text, images, lists, paragraphs, forms, horizontal rules, tables, etc.
The tags that define the structure of the table in order to obtain a valid data file are:
TR: The <TR> tag allows us to insert rows in the table.
TH: The <TH> tag allows us to define the header cells of the table.
TD: The <TD> tag allows us to insert cells in each row. We can insert any element: pictures, lists, formatted text and even other tables.
The HTML format valid for KEEL is the following:
<table>
</table>
One example of a valid HTML file is the following:
<html>
</html>
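As a rough illustration of the table structure described in this section (TR for rows, TH for header cells, TD for data cells), the following Python sketch builds an HTML table from a header and a list of rows. The function name `html_table` is hypothetical, and the exact markup produced by the KEEL tool may differ.

```python
from html import escape


def html_table(header, rows):
    """Return an HTML table with one header row and one row per record."""
    lines = ["<table>"]
    # Header cells use <th>; special characters are escaped.
    lines.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for row in rows:
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(html_table(["Month", "Quantity"], [["January", 583], ["February", 610]]))
```

The sample values are invented for the example; wrapping the result in the HTML, HEAD and BODY tags gives a complete page.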