Data preparation for Exploratory Clustering
Instances are defined by a set of attribute values.
Each instance is presented in one row of the data file.
Attribute values may be nominal or numerical. Examples of valid nominal values are: A, val,
and ha_12. Examples of valid numerical values are: 7, -3.1, and -3333.22.
Attribute values may be unknown, in which case they must be stated explicitly by a
string whose first character is '?'. Both numerical and nominal attributes may
include unknown values.
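The three kinds of values described above can be told apart with a small check. The following is only a sketch: the function name classify_value is hypothetical, and it assumes that a numerical value is written in plain decimal notation, as in the examples above. How the server treats other notations (for example exponents) is not stated here.

```python
import re

# Assumption: a numerical value is a plain decimal such as 7, -3.1 or -3333.22.
_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")

def classify_value(token):
    """Return 'unknown', 'numerical' or 'nominal' for one attribute value."""
    if token.startswith("?"):
        # Any string whose first character is '?' marks an unknown value.
        return "unknown"
    if _NUMERIC.fullmatch(token):
        return "numerical"
    return "nominal"

for token in ["7", "-3.1", "ha_12", "?", "?missing"]:
    print(token, "->", classify_value(token))
```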
All instances in one layer must have the same number of attributes, as defined by the number of
values in the first row of the data file. A formally correct
data file therefore contains N rows with A values each, where N is the number of instances
and A is the number of attributes. For this server
the maximum value of N is 1000 and the maximum value of A is 1000.
The second layer, if present, must contain the same number of instances in the same order as the
first layer. The number of attributes in the second layer may be and typically is different
from the number of attributes in the first layer.
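Before uploading, the shape rules above can be checked locally. The sketch below assumes that each layer has already been read into a list of rows, each row being a list of values; the function names check_layer and check_two_layers are hypothetical. It cannot verify that the instances in both layers are in the same order, which remains the responsibility of the user.

```python
MAX_INSTANCES = 1000   # limit on N stated for this server
MAX_ATTRIBUTES = 1000  # limit on A stated for this server

def check_layer(rows):
    """Return (N, A) for a list of already-split rows, or raise ValueError."""
    if not rows:
        raise ValueError("layer is empty")
    n_attributes = len(rows[0])  # the first row defines A
    for number, row in enumerate(rows, start=1):
        if len(row) != n_attributes:
            raise ValueError(
                f"row {number}: {len(row)} values, expected {n_attributes}")
    if len(rows) > MAX_INSTANCES or n_attributes > MAX_ATTRIBUTES:
        raise ValueError("layer exceeds the limit of 1000 instances or 1000 attributes")
    return len(rows), n_attributes

def check_two_layers(first, second):
    """Check both layers and that they hold the same number of instances."""
    n1, a1 = check_layer(first)
    n2, a2 = check_layer(second)
    if n1 != n2:
        raise ValueError(f"first layer has {n1} instances, second has {n2}")
    return n1, a1, a2

layer1 = [["5.1", "3.5"], ["4.9", "?"]]
layer2 = [["a", "1", "x"], ["b", "2", "y"]]
print(check_two_layers(layer1, layer2))
```

The second layer is allowed to have a different number of attributes, so only the instance counts are compared.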
Attribute values must be separated by delimiters. Valid delimiters are
a comma, a semicolon, and one or more spaces. These characters (',', ';', space, and TAB) may not be
used within attribute names or values. If, for example, an input value consists
of two strings separated by a space, the server interprets it as two nominal
values and the row has more values than expected. In this situation the server
immediately stops data processing and reports an error.
This is the most frequent cause of problems with this server.
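The effect of a space inside a value can be reproduced with a short splitting routine. This is a sketch, not the server's parser: the function name split_row is hypothetical, and it assumes that a comma or semicolon (with optional surrounding blanks) or a run of spaces or TABs counts as a single delimiter.

```python
import re

# Assumption: a comma or semicolon (with optional surrounding blanks) or a
# run of spaces/TABs counts as one delimiter.
_DELIMITER = re.compile(r"[ \t]*[,;][ \t]*|[ \t]+")

def split_row(line):
    """Split one data row into its attribute values."""
    return _DELIMITER.split(line.strip())

good = split_row("5.1, 3.5;1.4   0.2 setosa")
bad = split_row("5.1, 3.5;1.4   0.2 Iris setosa")  # space inside the last value

print(len(good), good)
print(len(bad), bad)
```

The second row yields one value more than the first, which is the kind of mismatch that makes the server stop with an error. Replacing the space with an underscore (Iris_setosa) avoids the problem.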
Besides data files, the user may prepare some other files that facilitate the understanding of the results.
The user may prepare:
Attribute names: One file for each layer. Either all names are in one row, delimited in the same way as the data, or
each name is in a separate row. Each attribute name may have up to 20 characters.
Example names: Each name is in its own row. There must be the same number of names as there are examples in the
first and the second layer. Each example name may have up to 20 characters.
Classification of instances: Classifications must be integers in the range 0-20. Zero denotes unknown classification.
Each classification must be in its own row. There must be as many classifications as there are instances in the data layers.
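The rules for these optional files can also be checked before upload. The sketch below uses hypothetical function names (check_names, check_classifications) and assumes that each file has already been read into a list of strings, one per name or classification.

```python
def check_names(names, expected_count, max_length=20):
    """Check a list of attribute or example names against the stated limits."""
    if len(names) != expected_count:
        raise ValueError(f"{len(names)} names, expected {expected_count}")
    for name in names:
        if len(name) > max_length:
            raise ValueError(f"name {name!r} is longer than {max_length} characters")
        if any(ch in ",; \t" for ch in name):
            raise ValueError(f"name {name!r} contains a delimiter character")

def check_classifications(lines, n_instances):
    """Parse one classification per row; return them as a list of ints."""
    if len(lines) != n_instances:
        raise ValueError(f"{len(lines)} classifications, expected {n_instances}")
    labels = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Only non-negative integers up to 20 are valid; 0 means unknown.
        if not (text.isascii() and text.isdigit()) or int(text) > 20:
            raise ValueError(f"row {number}: {text!r} is not an integer in 0-20")
        labels.append(int(text))
    return labels

check_names(["sepal_length", "sepal_width"], expected_count=2)
print(check_classifications(["1", "0", "20"], n_instances=3))
```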
First-time users are strongly advised to read the basic TUTORIAL, which uses the iris dataset with 150 instances described by 4 attributes.
A more advanced TUTORIAL describes a two-layer application with country data.
© 2016 LIS - Rudjer Boskovic Institute
Last modified: January 09 2017 21:21:50.