
TITARL Documentation

By Mathieu GUILLAME-BERT
Post-doctoral Fellow at Carnegie Mellon University
TITARL (or TITAR Learner) is a Data Mining algorithm able to extract temporal patterns from Symbolic Time Sequences. To learn about the TITAR algorithm, refer to the interactive tutorial. This document shows the basic usage of the TITAR library, which can be downloaded from the software page.

A. Introduction

The TITARL binary can be used in three different ways:
  • With the command line e.g. "Titarl --learn config.xml"
  • Opening a file with the Titarl binary (Windows), or equivalently running "Titarl config.xml". In this case, a prompt will ask what you want to do with this file.
  • Writing the command line in a text file and running "Titarl --filearg command.txt" (this option is especially useful for long command lines; see the example below).
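For example, assuming the file simply contains the arguments as they would appear on the command line, a hypothetical command.txt for a learning run could contain:

--learn config.xml

Running "Titarl --filearg command.txt" would then be equivalent to running "Titarl --learn config.xml".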
The TITARL binary can be used to perform various operations:
1. Learning
Extract a set of Tita rules from a (temporal) dataset (symbolic time series).
2. Greedy Learning
Extract a set of Tita rules from a (temporal) dataset (symbolic time series). Rules are extracted one at a time: once a rule is learned, all the target events it matches are removed, and the learning is repeated. Greedy learning is slower than regular learning.
3. Evaluate rules on another dataset
Evaluate a set of rules on a different dataset.
4. Filter rules
Filter rules (e.g. minimum confidence, number of conditions, etc.). Metrics can be re-evaluated on a given dataset.
5. Selection algorithm
Given a (large) set of rules, select a subset of non-redundant interesting rules.
6. Extract rules info
Extract rule statistics into a CSV file (e.g. confidence, support).
7. Apply rules
Compute the predictions of a set of rules (several output formats are available).
8. Fusion algorithm
Combine a set of rules to create a stronger forecasting model.
9. Display rules
Generate an HTML representation of a set of rules.
10. Display rules interactively
Create an HTTP server and interactively display the rules (the server also presents simple machine learning testing tools).
11. Merge datasets
Merge datasets together.
12. Split dataset
Split a dataset into several parts.
13. Convert dataset
Load a dataset in one format and export it to another (see dataset formats).
14. Filter events
Apply operations on temporal datasets.
15. Importance : Banana
Evaluate the "importance" of rules based on randomization (the Banana idea).
16. Importance : Mannila
Evaluate the "importance" of rules based on randomization (the Mannila idea).

B. Temporal dataset format

Titarl can use three dataset formats: the text format (easy to read and write, but it produces large files that are slow to load into memory), the binary format (very fast to load, relatively small), and the meta format (a text file containing paths to other datasets). You can convert datasets from one format to another with the TITARL binary (see Convert dataset).

Text format

A dataset in the "text format" should have a ".evt" extension. Each line represents an event according to the following schema:

[time]\t[symbol]\t[value]\t[source]

where "time" is a floating point number representing a time-stamp, "symbol" is a string representing the name of the event, "value" is a floating point number representing the value of the event, "source" is an integer that you can use to identify where this event comes from, and "\t" represents a tab character. "value" and "source" are optional. If not specified, the value is set to 1 and the source is set to -1.

Here is an example of a text dataset. For instance, the last line defines an event at time 1, with symbol "azerty", value 1, and source 5:

10 a
12 a
15 b 5
20 a 2
1 azerty 1 5

Meta format

A dataset in the "meta format" is basically a list of paths to other datasets. Several options are available to define how these datasets are loaded and merged together. By default, the time-stamps of the merged datasets remain unchanged. However, several options are available to shift the time-stamps and ensure that the datasets' time-stamps do not overlap. A dataset in the "meta format" should have a ".sevt" extension.

Each line is a command. The possible commands are:

dataset [path]
Load the dataset indicated by the path. There is no restriction on the format of this dataset (it can also be a meta dataset).
timeshift [num]
A time offset to apply to all the datasets loaded after this line. If num is "Y2000" (without quotes), the timeshift is set to -946702800 (the number of seconds between 01/01/1970 and 01/01/2000).
flush
All the following datasets loaded with "dataset [path]" will be time-shifted so that they do not overlap with the already loaded datasets.
offset [number]
A time offset to apply between datasets separated with "flush".
sequence [num] [name]
Number and name of the sequence of the dataset to load. Using this command inserts markers that let you see how the merging was done.

Here is an example of a meta dataset. It loads three groups of datasets: each "flush" time-shifts the following group so that it does not overlap with the already loaded datasets, "offset 100" adds a gap of 100 time units between the groups, and "sequence" marks each group:

offset 100
sequence 1 fold
dataset d1_1.evt
dataset d1_2.evt
dataset d1_3.evt
flush
sequence 2 fold
dataset d2_1.evt
dataset d2_2.evt
flush
sequence 5 fold
dataset d5_1.evt
dataset d5_2.evt
dataset d5_3.evt
flush

Binary format

A dataset in the "binary format" should have a ".bin" extension.

1. Learning

In order to use the Titarl algorithm, you need to create an XML file that defines all the parameters of the learning. Once this file is created, you can start the learning with "Titarl --learn [file path]", or by directly opening the XML file with TITARL.

Here is a simple example of such an XML file (explanations are given in the inline comments).

<!--
The file that contains the configuration parameters for the learning.
-->

<config>

<option name="saveRules_new" value="example_hard_learning_rules.xml" /> <!-- output file of the rules to extract -->
<option name="threads" value="-1" /> <!-- Number of threads available for main loop (uses -1 for a correct debug printing) -->
<option name="sub_threads" value="6" /> <!-- Number of threads available for sub computation loops () -->
<option name="debugLevel" value="1" /> <!-- Level of verbose of the algorithm -->
<option name="stop_at_every_loop" value="0" /> <!-- Stop the algorithm at every loop and wait for user confirmation -->

<!-- Input dataset files -->

<data path="training.evt" head="" limit="-1" /> <!-- training dataset -->
<data path="validation.evt" head="" limit="-1" type="validate" /> <!-- validation dataset (optionnal -- rules are less likely to be over trained with a validationd dataset) -->

<!-- Body/head rule configuration -->

<outputEvent> <!-- target events -->
<predicate name="event\.target" />
</outputEvent>

<inputEvent> <!-- input events -->
<predicate name="event\.\S+" />
</inputEvent>

<inputState> <!-- input states -->
<predicate name="state\.\S+" />
</inputState>

<inputScalar> <!-- input scalars -->
<predicate name="scalar\.\S+" />
</inputScalar>

<!-- rule generation parameters -->

<option name="numCaseHead" value="10" /> <!-- Number of cases of the histogram for the head of the rule -->
<option name="maxPastHead" value="-20" /> <!-- Bounds of this histogram -->
<option name="maxFutureHead" value="-1" /> <!-- Bounds of this histogram -->
<!-- Info: maxPastHead=-20 and maxFutureHead=-1 means that we are looking for rules that make predictions from 1 to 20 time units in the future -->

<option name="numCaseCond" value="10" /> <!-- Number of cases of the histogram for the body of the rul -->
<option name="maxPastCond" value="-10" /> <!-- Bounds of the histogram -->
<option name="maxFutureCond" value="0" /> <!-- Bounds of the histogram -->
<!-- Info: maxPastCond=-10 and maxFutureCond=0 means that conditions are looking from 0 to 10 time units in the past -->

<option name="histogram" value="Unif" /> <!-- Histogram bins distribution. Can be Unif,Log,InvLog,CenterLog. Look at http://mathieu.guillame-bert.com/fig/grid.png for examples of histogram bins distribution.-->
<option name="histoLogFactor" value="70" /> <!-- Parameters for histogram Log, InvLog and CenterLog -->
<option name="negation" value="0" /> <!-- Allow negative conditions i.e. "there is not event of type A between t_1 and t_2" -->
<option name="allowTrees" value="1" /> <!-- Allow trees of conditions (by opposition to paths of conditions) -->
<option name="maxConditions" value="5" /> <!-- Maximum number of condition for a rule -->

<!-- rules restrictions -->
<option name="minConfidence" value="0.04" /> <!-- Minimum confidence for a rule -->
<option name="minCoverage" value="0.04" /> <!-- Minimum coverage/support for a rule. TITARL relies on the apriori trick (http://en.wikipedia.org/wiki/Apriori_algorithm) , therefore this parameter is very important. -->
<option name="minNumberOfUse" value="10" /> <!-- Minimum number of use of a rule -->

<!-- research parameters -->
<option name="maxCreatedRules" value="-1" /> <!-- Maximum number of rules to create. If this number is reached, the algorithm stop. Since several rules can be created simultaniously, the final number of rules can be slightly higher than this parameter (-1: no limit) -->
<option name="minimumInformationGain" value="0.002" /> <!-- Minimum information gain when building conditions -->

<option name="intermediatesmoothing" value="0" /> <!-- Experimental: Smooth rules in the main loop i.e. Generate duplicates of rules with small modification. -->
<option name="finalsmoothing" value="0" /> <!-- Smooth rules after extraction -->
<option name="division" value="1" /> <!--Allow division of rules -->
<option name="division_method" value="graph_coloration" /> <!-- Method to determine the rule division. Can be graph_coloration,matrix_clustering,exhaustive_connex -->
<option name="allowNonConnextCondition" value="0" /> <!-- Allow non connection conditions (more powerful grammar, but increase risk of overtraining -->

<option name="allow_classifier_id3" value="0" /> <!-- Allow to create conditions based on id3 decision tree on scalar events -->
<option name="classifier_id3_maxdeph" value="4" />
<option name="classifier_id3_minuse" value="8" />

<option name="allow_classifier_randomforest" value="0" /> <!-- Allow to create conditions based on Random forest on scalar events -->
<option name="allow_state" value="0" />
<option name="allow_time" value="1" />
<option name="allow_scalar" value="0" />

<!-- Algo stopping criteria -->
<option name="maxLoop" value="-1" /> <!-- Maximum number of loop (-1: no limit) -->
<option name="maxTime" value="-1" /> <!-- Maximum number of second of computation (-1: no limit) -->
<option name="maxTimeAfterInit" value="40" /> <!-- Maximum number of second of computation after the end of the iteration (-1: no limit) -->

<!-- Optimization -->

<option name="maxWaitingList" value="100000" /> <!-- Maximum size of the waiting list (Security to avoid overload of memory) -->

<option name="maxEvaluations" value="-1" /> <!-- Number of random sampling to evaluate in order to compute confidence and support (-1 to use all the dataset). Can greatly speed up the learning. If you use it, set it to at least 5000 -->
<option name="maxTests" value="-1" /> <!-- Number of random sampling to evaluate in order to compute entropy gain (-1 to use all the dataset). Can greatly speed up the learning. If you use it, set it to at least 5000. -->

</config>

Once executed, this configuration will generate the file "example_hard_learning_rules.xml" (the value of the "saveRules_new" option), which contains the extracted rules.
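For illustration, here is a minimal hypothetical training.evt matching the predicates of this configuration (the symbols "event.a" and "event.target" are made up for the example; fields are separated by tabs, as in the text format schema above). Each "event.a" is followed by an "event.target" 2 time units later, which falls inside the 1-to-20 time-unit prediction window defined by maxPastHead and maxFutureHead:

1	event.a
3	event.target
10	event.a
12	event.target
25	event.a
27	event.target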

10. Display rules interactively

Create an HTTP server and interactively display the rules (the server also presents simple machine learning testing tools). To start the server, use the command line "Titarl --display_interactive [rules.xml]", or run "Titarl [rules.xml]" and select the "display interactive" option. The server will start listening on port 2002 of your machine.

Options are:

--database [file]
Load an alternative dataset and evaluate the rules on it. Loading a dataset is required for some of the operations (e.g. greedy selection).
--port [number]
Change the port to listen on.
--maxRules [number]
Maximum number of rules to load.
--maxevaluation [number]
Maximum number of samplings used to evaluate the rules' metrics.
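For example, assuming these options are simply appended to the command line, the following call (with hypothetical file names rules.xml and validation.evt) starts the server on port 8080 and evaluates the rules on an alternative dataset:

Titarl --display_interactive rules.xml --database validation.evt --port 8080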
