DMS Project - Level A description

Detailed description of Level A

(Documentation on this page and related pages describes how we intended to realize the system. Most of the text was written before January 2001. An up-to-date description of the realized system is at DMS Home.)

The learning problem size restrictions at Level A are: 150 training examples, 30 attributes, and 30 characters in every input string.

File conventions

At level A, only the data file can be uploaded. The data file does not include explicit information about the number of examples or the number of attributes. It must be a plain ASCII file with M+1 rows (M is the number of learning examples) and N+1 strings in each row (N is the number of attributes). Each row has N+1 elements because, besides the N attributes, there is a string representing the example class. The order of the attribute and class strings must be the same in every row. The first row of the data file defines the names of the attributes and the class variable, which is why a data file with M examples actually has M+1 rows.
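
The file layout above, together with the Level A size restrictions, can be sketched as a small reader. This is only an illustration, not the code of the realized system: the function name parse_data_file is hypothetical, and it assumes that blank lines are ignored, that any run of whitespace separates strings, and that the 30-character limit also applies to the names in the first row.

```python
def parse_data_file(text, max_examples=150, max_attributes=30, max_chars=30):
    """Split a Level A data file into (header, rows) and check its shape.

    Assumptions (not stated in the description): blank lines are skipped,
    any whitespace run is a delimiter, and header names obey max_chars too.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("need a header row and at least one example")

    header, rows = lines[0], lines[1:]
    n_attributes = len(header) - 1  # one of the N+1 columns holds the class

    if len(rows) > max_examples:
        raise ValueError(f"{len(rows)} examples, limit is {max_examples}")
    if n_attributes > max_attributes:
        raise ValueError(f"{n_attributes} attributes, limit is {max_attributes}")

    for number, row in enumerate(lines, start=1):
        # Every row, including the header, must have exactly N+1 strings.
        if len(row) != len(header):
            raise ValueError(f"row {number} has {len(row)} strings, "
                             f"expected {len(header)}")
        if any(len(token) > max_chars for token in row):
            raise ValueError(f"row {number} has a string longer than "
                             f"{max_chars} characters")
    return header, rows
```

For a file with two examples and two attributes, the call returns the three names from the first row and two rows of three strings each.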

At level A, to keep communication simple, the class column is identified as the one whose name begins with $. Its position in the data file may be arbitrary. Target class examples are those whose class value equals "1"; all other string values correspond to non-target class examples.
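
A minimal sketch of this convention follows. The function name split_class_column is hypothetical, and requiring exactly one column whose name begins with $ is an assumption; the description only says that the class column is recognized by that prefix.

```python
def split_class_column(header, rows):
    """Separate the class column (name starts with '$') from the attributes.

    Returns (attribute_names, examples, is_target). The "exactly one '$'
    column" check is an assumption made for this sketch.
    """
    candidates = [i for i, name in enumerate(header) if name.startswith("$")]
    if len(candidates) != 1:
        raise ValueError("expected exactly one column whose name begins with '$'")
    c = candidates[0]

    attribute_names = header[:c] + header[c + 1:]
    examples = [row[:c] + row[c + 1:] for row in rows]
    # Only the exact string "1" marks a target class example.
    is_target = [row[c] == "1" for row in rows]
    return attribute_names, examples, is_target
```

Because the comparison is made on strings, a class value such as "01" or "yes" counts as non-target in this sketch.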

Attributes can be of type 'description', 'discrete', or 'continuous'. The actual type of every attribute will be determined automatically during the execution of the induction process.

Basic level A scenario

At level A the potential user will go almost directly to data file submission. On the same page there will be a link that describes the required data file format and a link to security information. The latter will include downloadable program(s) for encoding the user's data file and instructions on how to use them. There will also be example data files that can be downloaded and used for test purposes.

The name of the data file containing learning examples will not be restricted. During execution on the server the data will be saved in a file with an internally generated name. Immediately after rule generation, all files connected with this data will be removed. Rule induction will be based on confirmation rules generated by heuristic search implemented in the gerules program. The user will have options to select a) the number of generated rules and b) the required generalization level. The final induction result will be one or more (up to three) rules. Each rule will be presented as a conjunction of conditions, with every condition in its own row.

At level A, besides the rules generated by the gerules program, which will be the main output, there will be two other preprocessing data mining tools. The first is a simple statistical analysis of all input attributes for both target and non-target class examples. The results will be the mean value, median, and standard deviation of every numerical attribute for each class. For categorical attributes, the results will be lists of the few most frequent strings for each class. The other data mining tool will be the noise detection result obtained by the standard ILLM approach. The result will be a list of potentially noisy examples. The user will not be able to select the parameters of noise detection. Because both options could require substantial computation time, it will be possible to exclude either or both of them.
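
The per-class statistics described above can be illustrated with Python's standard library. This is a sketch under stated assumptions, not the tool of the realized system: the function names are hypothetical, unknown values (None for numbers, strings starting with ? for categorical attributes) are skipped, the sample standard deviation is used, and "a few" is taken to be three strings by default.

```python
import statistics
from collections import Counter


def numeric_summary(values, is_target):
    """Mean, median, and standard deviation of one numeric column per class.

    values: floats, with None for unknown entries (skipped, by assumption).
    """
    summary = {}
    for label, wanted in (("target", True), ("non-target", False)):
        selected = [v for v, t in zip(values, is_target)
                    if t is wanted and v is not None]
        if not selected:
            summary[label] = None
            continue
        summary[label] = {
            "mean": statistics.mean(selected),
            "median": statistics.median(selected),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(selected) if len(selected) > 1 else 0.0,
        }
    return summary


def frequent_strings(values, is_target, k=3):
    """The k most frequent strings of one categorical column per class."""
    result = {}
    for label, wanted in (("target", True), ("non-target", False)):
        counts = Counter(v for v, t in zip(values, is_target)
                         if t is wanted and not v.startswith("?"))
        result[label] = counts.most_common(k)
    return result
```

Both functions take one column at a time plus a list of booleans marking the target class examples, so they can be applied to every attribute in turn.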

Other important assumptions valid for all levels are:

Each attribute in the input data file must be of one and only one type. There are three attribute types in total: categorical, discrete, and continuous. Values of categorical attributes (or s attributes) must be strings starting with a letter a-z or A-Z. Besides letters, they can include digits 0-9 and '_' as the only special character. Spaces are not allowed because the space is the delimiter between input strings. As an exception, a first character of ? denotes an unknown string value.

Values of discrete (or i attributes) and continuous (or f attributes) attributes must start with a digit 0-9 or with a sign (+ or -). As an exception, continuous values may start with a point (.). The exponential input form is not allowed for continuous attributes. Spaces may not be included in the values. A value whose first character is ? means an unknown value. Discrete attributes must be in the range 0 to 1000, while continuous attributes must be in the range -1 000 000 to +1 000 000. A numerical attribute is automatically recognized as discrete if all of its values (apart from unknown values) are integers in the range 0-1000. Otherwise it is a continuous attribute.

ILLM handles discrete and continuous attributes differently. For discrete attributes, literals of the form 'equal' and 'not equal' are constructed in addition to the standard 'less than' and 'greater than' literals. If the literals 'equal' and 'not equal' make sense for a numerical attribute, the user must transform it, before data upload, to integers in the range 0-1000. If these literals do not make sense and the input values are integers in the range 0-1000, the user should write at least one input value in the column corresponding to this attribute with a decimal point.
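
The recognition rules for s, i, and f attributes can be sketched as follows. This is a hedged illustration, not the procedure used by the realized system: the function name infer_column_type is hypothetical, and several choices are assumptions, namely that a column mixing strings and numbers is rejected, that a column with only unknown values is rejected, and that a point is accepted only as the very first character or after at least one digit.

```python
import re

# Letter first, then letters, digits, or underscore.
STRING_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# Optional sign and digits only: an integer written without a decimal point.
INTEGER_TOKEN = re.compile(r"[+-]?\d+")
# Plain decimal notation; no exponent form is accepted.
NUMBER_TOKEN = re.compile(r"[+-]?\d+(\.\d*)?|\.\d+")


def infer_column_type(tokens):
    """Return 's', 'i', or 'f' for one column of input strings."""
    known = [t for t in tokens if not t.startswith("?")]  # '?' marks unknown
    if not known:
        raise ValueError("column contains only unknown values")

    if all(STRING_TOKEN.fullmatch(t) for t in known):
        return "s"
    if not all(NUMBER_TOKEN.fullmatch(t) for t in known):
        raise ValueError("column mixes types or contains a malformed value")

    # Integers in 0-1000 with no decimal point anywhere in the column: 'i'.
    if all(INTEGER_TOKEN.fullmatch(t) and 0 <= int(t) <= 1000 for t in known):
        return "i"
    if all(-1_000_000 <= float(t) <= 1_000_000 for t in known):
        return "f"
    raise ValueError("value outside the range -1 000 000 to +1 000 000")
```

Because the integer check is made on the written form, a column such as 0, 17, 1000.0 comes out as an f attribute, which matches the advice to write at least one value with a decimal point when 'equal' and 'not equal' literals are not wanted.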
