Mathieu Guillame-Bert
Switzerland

# TITARL Documentation

By Mathieu GUILLAME-BERT
Post-doctoral Fellow at Carnegie Mellon University
TITARL (or TITAR Learner) is a Data Mining algorithm able to extract temporal patterns from Symbolic Time Sequences. To learn about the TITAR algorithm, refer to the interactive tutorial. This document shows the basic usage of the TITAR library, which can be downloaded from the software page.

## A. Introduction

The TITARL binary can be used in three different ways:
• With the command line e.g. "Titarl --learn config.xml"
• Opening a file with the Titarl binary (Windows) (or equivalently using "Titarl config.xml"). If you do that, the prompt will ask you what you want to do with this file.
• Write the command line in a text file and run "Titarl --filearg command.txt" (this option is especially useful for long command lines).
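
The third way can be prepared from a script. The following is a minimal sketch, assuming the layout stated in the command line syntax section (one argument per line). The helper name `write_filearg` and the temporary file location are illustrative choices, not part of TITARL.

```python
from pathlib import Path
import tempfile


def write_filearg(path, args):
    """Write command-line arguments to a text file, one argument per line."""
    Path(path).write_text("\n".join(args) + "\n", encoding="utf-8")


# Illustrative example: the learning command from the first bullet.
args = ["--learn", "config.xml"]
command_file = Path(tempfile.gettempdir()) / "command.txt"
write_filearg(command_file, args)
print(command_file.read_text(encoding="utf-8"))
# The file can then be passed to the binary: Titarl --filearg command.txt
```
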
The TITARL binary can be used to perform various operations:
1. Learning
Extract a set of Tita rules from a (temporal) dataset (symbolic time series).
2. Greedy Learning
Extract a set of Tita rules from a (temporal) dataset (symbolic time series). Rules are extracted one by one. When a rule is learned, all the matched target events are removed, and the learning is repeated. Greedy learning is slower than regular learning.
3. Evaluate rules on another dataset
Evaluate a set of rules on a different dataset.
4. Filter rules
Filter rules (e.g. minimum confidence, number of conditions, etc.). Metrics can be reevaluated on a given dataset.
5. Selection algorithm
Given a (large) set of rules, select a subset of non-redundant interesting rules.
6. Extract rules info
Extract rules statistics into a csv file (e.g. confidence, support).
7. Apply rules
Compute the predictions of a set of rules (several output formats are available).
8. Fusion algorithm
Combine a set of rules to create a super forecasting model.
9. Display rules
Generate an HTML representation of a set of rules.
10. Display rules interactively
Create an HTTP server and interactively display the rules (the server also presents a simple machine learning testing tool).
11. Merge datasets
Merge datasets together.
12. Split dataset
Split a dataset into several parts.
13. Convert dataset
Load and export a dataset from one format to another (see dataset formats).
14. Filter event
Apply operations on temporal datasets.
15. Importance : Banana
Evaluate the "importance" of rules based on randomization (the Banana idea).
16. Importance : Mannila
Evaluate the "importance" of rules based on randomization (the Mannila idea).

## B. Temporal dataset format

Titarl can use three dataset formats: the text format (easy to read and write, but it produces large files that are slow to load in memory), the binary format (very fast to load, relatively small), and the meta format (a text file containing paths to other datasets). You can convert datasets from one format to another with the TITARL binary (see Convert dataset).

### Text format

A dataset in the "text format" should have a ".evt" extension. Each line represents an event according to the following schema:

[time]\t[symbol]\t[value]\t[source]

where "time" is a floating point number representing a time-stamp, "symbol" is a string representing the name of the event, "value" is a floating point number representing the value of the event, "source" is an integer that you can use to identify where the event comes from, and "\t" represents a tabulation. "value" and "source" are optional. If not specified, the value is set to 1 and the source is set to -1.

Here is an example of text dataset:

10 a
12 a
15 b 5
20 a 2
1 azerty 1 5
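
As an illustration of the schema and its defaults, here is a minimal Python sketch of a reader for this format. It is not the TITARL loader: the `Event` type and `parse_evt_line` function are hypothetical names, and the sketch assumes that symbols contain no whitespace, so it splits on any whitespace rather than strictly on tabulations.

```python
from typing import NamedTuple


class Event(NamedTuple):
    time: float
    symbol: str
    value: float = 1.0   # default when the column is missing
    source: int = -1     # default when the column is missing


def parse_evt_line(line):
    """Parse one line of the text format: time, symbol, [value], [source]."""
    fields = line.split()  # assumption: symbols contain no whitespace
    if len(fields) < 2 or len(fields) > 4:
        raise ValueError(f"expected 2 to 4 fields, got {len(fields)}: {line!r}")
    time, symbol = float(fields[0]), fields[1]
    value = float(fields[2]) if len(fields) > 2 else 1.0
    source = int(fields[3]) if len(fields) > 3 else -1
    return Event(time, symbol, value, source)


sample = """10 a
12 a
15 b 5
20 a 2
1 azerty 1 5"""

events = [parse_evt_line(line) for line in sample.splitlines() if line.strip()]
for event in events:
    print(event)
```

The first line of the example dataset becomes an event "a" at time 10 with value 1 and source -1, while the last line carries an explicit value (1) and source (5).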

### Meta format

A dataset in the "meta format" is basically a list of paths to other datasets. Several options are available to define how these datasets are loaded and merged together. By default, the time-stamps of the datasets to merge remain the same. However, several options are available to shift the time-stamps to ensure that the time-stamps of the datasets do not overlap. A dataset in the "meta format" should have a ".sevt" extension.

Each line is a command. The possible commands are:

dataset [path]
Load the dataset indicated by the path. There is no restriction on the format of this dataset (it can also be a meta dataset).
timeshift [num]
A time offset to apply to all the datasets loaded after this line. If num is "Y2000" (without quotes), the timeshift is set to -946702800 (number of seconds between 01/01/1970 and 01/01/2000).
flush
All the following datasets loaded with "dataset [path]" will be time-shifted so that they do not overlap with the already loaded datasets.
offset [number]
A time offset to apply between datasets separated with "flush".
sequence [num] [name]
Number and name of the sequence of the dataset to load. Using this command will put markers for you to see how the merging was done.

Here is an example of meta dataset:

offset 100
sequence 1 fold
dataset d1_1.evt
dataset d1_2.evt
dataset d1_3.evt
flush
sequence 2 fold
dataset d2_1.evt
dataset d2_2.evt
flush
sequence 5 fold
dataset d5_1.evt
dataset d5_2.evt
dataset d5_3.evt
flush
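
Since each line is a command followed by its arguments, a meta dataset is easy to inspect from a script. The sketch below only splits the file into commands and checks them against the list above; it does not reproduce how TITARL loads, shifts or merges the datasets. The function name `parse_meta` is hypothetical, and the sketch assumes that paths contain no spaces.

```python
def parse_meta(text):
    """Split a meta dataset into (command, arguments) pairs, one per line."""
    known = {"dataset", "timeshift", "flush", "offset", "sequence"}
    commands = []
    for raw in text.splitlines():
        parts = raw.split()  # assumption: paths contain no spaces
        if not parts:
            continue  # skip blank lines
        name, args = parts[0], parts[1:]
        if name not in known:
            raise ValueError(f"unknown command: {name!r}")
        commands.append((name, args))
    return commands


meta = """offset 100
sequence 1 fold
dataset d1_1.evt
dataset d1_2.evt
flush
"""

commands = parse_meta(meta)
datasets = [args[0] for name, args in commands if name == "dataset"]
print(datasets)
```
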

### Binary format

A dataset in the "binary format" should have a ".bin" extension.

## 1. Learning

To use the Titarl algorithm, you need to create an xml file that defines all the parameters of the learning. Once this file is created, you can start the learning with "Titarl --learn [file path]", or by directly opening the xml file with TITARL.

Here is a simple example (with the explanation below) of such an xml file.

<!--
The file that contains the configuration parameters for the learning.
-->

<config>

<option name="saveRules_new" value="example_hard_learning_rules.xml" /> <!-- output file of the rules to extract -->
<option name="threads" value="-1" /> <!-- Number of threads available for the main loop (use -1 for correct debug printing) -->
<option name="sub_threads" value="6" /> <!-- Number of threads available for sub computation loops -->
<option name="debugLevel" value="1" /> <!-- Verbosity level of the algorithm -->
<option name="stop_at_every_loop" value="0" /> <!-- Stop the algorithm at every loop and wait for user confirmation -->

<!-- Input dataset files -->

<data path="training.evt" head="" limit="-1" /> <!-- training dataset -->
<data path="validation.evt" head="" limit="-1" type="validate" /> <!-- validation dataset (optional -- rules are less likely to be overtrained with a validation dataset) -->

<outputEvent> <!-- target events -->
<predicate name="event\.target" />
</outputEvent>

<inputEvent> <!-- input events -->
<predicate name="event\.\S+" />
</inputEvent>

<inputState> <!-- input states -->
<predicate name="state\.\S+" />
</inputState>

<inputScalar> <!-- input scalars -->
<predicate name="scalar\.\S+" />
</inputScalar>

<!-- rule generation parameters -->

<option name="numCaseHead" value="10" /> <!-- Number of cases of the histogram for the head of the rule -->
<option name="maxPastHead" value="-20" /> <!-- Bounds of this histogram -->
<option name="maxFutureHead" value="-1" /> <!-- Bounds of this histogram -->
<!-- Info: maxPastHead=-20 and maxFutureHead=-1 means that we are looking for rules that make predictions from 1 to 20 time units in the future -->

<option name="numCaseCond" value="10" /> <!-- Number of cases of the histogram for the body of the rule -->
<option name="maxPastCond" value="-10" /> <!-- Bounds of the histogram -->
<option name="maxFutureCond" value="0" /> <!-- Bounds of the histogram -->
<!-- Info: maxPastCond=-10 and maxFutureCond=0 means that conditions are looking from 0 to 10 time units in the past -->

<option name="histogram" value="Unif" /> <!-- Histogram bins distribution. Can be Unif,Log,InvLog,CenterLog. Look at http://mathieu.guillame-bert.com/fig/grid.png for examples of histogram bins distribution.-->
<option name="histoLogFactor" value="70" /> <!-- Parameters for histogram Log, InvLog and CenterLog -->
<option name="negation" value="0" /> <!-- Allow negative conditions i.e. "there is no event of type A between t_1 and t_2" -->
<option name="allowTrees" value="1" /> <!-- Allow trees of conditions (as opposed to paths of conditions) -->
<option name="maxConditions" value="5" /> <!-- Maximum number of conditions for a rule -->

<!-- rules restrictions -->
<option name="minConfidence" value="0.04" /> <!-- Minimum confidence for a rule -->
<option name="minCoverage" value="0.04" /> <!-- Minimum coverage/support for a rule. TITARL relies on the apriori trick (http://en.wikipedia.org/wiki/Apriori_algorithm) , therefore this parameter is very important. -->
<option name="minNumberOfUse" value="10" /> <!-- Minimum number of uses of a rule -->

<!-- research parameters -->
<option name="maxCreatedRules" value="-1" /> <!-- Maximum number of rules to create. If this number is reached, the algorithm stops. Since several rules can be created simultaneously, the final number of rules can be slightly higher than this parameter (-1: no limit) -->
<option name="minimumInformationGain" value="0.002" /> <!-- Minimum information gain when building conditions -->

<option name="intermediatesmoothing" value="0" /> <!-- Experimental: Smooth rules in the main loop i.e. Generate duplicates of rules with small modification. -->
<option name="finalsmoothing" value="0" /> <!-- Smooth rules after extraction -->
<option name="division" value="1" /> <!--Allow division of rules -->
<option name="division_method" value="graph_coloration" /> <!-- Method to determine the rule division. Can be graph_coloration,matrix_clustering,exhaustive_connex -->
<option name="allowNonConnextCondition" value="0" /> <!-- Allow non-connected conditions (more powerful grammar, but increases the risk of overtraining) -->

<option name="allow_classifier_id3" value="0" /> <!-- Allow to create conditions based on id3 decision tree on scalar events -->
<option name="classifier_id3_maxdeph" value="4" />
<option name="classifier_id3_minuse" value="8" />

<option name="allow_classifier_randomforest" value="0" /> <!-- Allow to create conditions based on Random forest on scalar events -->
<option name="allow_state" value="0" />
<option name="allow_time" value="1" />
<option name="allow_scalar" value="0" />

<!-- Algo stopping criteria -->
<option name="maxLoop" value="-1" /> <!-- Maximum number of loops (-1: no limit) -->
<option name="maxTime" value="-1" /> <!-- Maximum number of seconds of computation (-1: no limit) -->
<option name="maxTimeAfterInit" value="40" /> <!-- Maximum number of seconds of computation after the end of the iteration (-1: no limit) -->

<!-- Optimization -->

<option name="maxWaitingList" value="100000" /> <!-- Maximum size of the waiting list (Security to avoid overload of memory) -->

<option name="maxEvaluations" value="-1" /> <!-- Number of random samples to evaluate in order to compute confidence and support (-1 to use all the dataset). Can greatly speed up the learning. If you use it, set it to at least 5000. -->
<option name="maxTests" value="-1" /> <!-- Number of random samples to evaluate in order to compute entropy gain (-1 to use all the dataset). Can greatly speed up the learning. If you use it, set it to at least 5000. -->

</config>

Once executed, this configuration generates the rules file named by the "saveRules_new" option (here "example_hard_learning_rules.xml"), which contains the extracted rules.
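
To double check a configuration before launching a long learning, the options can be read back with any XML parser. The following Python sketch uses the standard `xml.etree.ElementTree` module on a shortened copy of the configuration above. It is only an illustration: the variable names are arbitrary, and the conversion of the head bounds into a prediction interval follows the "Info" comment of the example (maxPastHead=-20 and maxFutureHead=-1 mean predictions from 1 to 20 time units in the future).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example configuration (raw string keeps the regex backslash).
config_xml = r"""
<config>
  <option name="saveRules_new" value="example_hard_learning_rules.xml" />
  <option name="maxPastHead" value="-20" />
  <option name="maxFutureHead" value="-1" />
  <outputEvent>
    <predicate name="event\.target" />
  </outputEvent>
</config>
"""

root = ET.fromstring(config_xml)
options = {o.get("name"): o.get("value") for o in root.iter("option")}

# Head bounds are negative offsets, so a prediction made at time t covers
# the interval [t + nearest, t + farthest].
nearest = -float(options["maxFutureHead"])
farthest = -float(options["maxPastHead"])
targets = [p.get("name") for p in root.find("outputEvent").iter("predicate")]

print(options["saveRules_new"], nearest, farthest, targets)
```
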

## 10. Display rules interactively

Create an HTTP server and interactively display the rules (the server also presents a simple machine learning testing tool). To start the server, use the command line "Titarl --display_interactive [rules.xml]", or "Titarl [rules.xml]" and select the "display interactive" option. The server will start listening on port 2002 of your machine.

Options are:

--database [file]
Load an alternative dataset and evaluate the rules on it. Loading a dataset is required for some of the operations (e.g. greedy selection).
--port [number]
Change the port to listen on.
--maxRules [number]
Maximum number of rules to load.
--maxevaluation [number]
Maximum number of sampling used to evaluate rules' metrics.
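
When the server is started from a script, the command line can be assembled from the options listed above. This is a minimal sketch: the function name `display_interactive_command`, the example file names and the port value are illustrative, and the sketch only builds the argument list (it assumes a binary named "Titarl" is on the PATH if you choose to run it).

```python
def display_interactive_command(rules, database=None, port=None,
                                max_rules=None, max_evaluation=None):
    """Build the argument list for 'Titarl --display_interactive'."""
    cmd = ["Titarl", "--display_interactive", rules]
    optional = [("--database", database), ("--port", port),
                ("--maxRules", max_rules), ("--maxevaluation", max_evaluation)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = display_interactive_command("rules.xml", database="validation.evt", port=8080)
print(" ".join(cmd))
# To actually start the server (requires the Titarl binary on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```
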

## Command line syntax

```
Usage 1 : TITARL [file]
A prompt appears and asks you what to do with this file.

Usage 2 : TITARL [--operation] [file] {--[options] [value]}*
General command line interface.

Global options:

--bequiet 1	Don't print text in the console.

--finalpause 1	Ask to press enter at the end of the execution of the program.

Operations:

--filearg [file.txt]	Load the command line options from a file (one argument per line; useful if you have a lot of parameters).

--learn [learning.xml]	Learn rules as specified in the .xml file.
Syntax of the xml file:

All the configuration is contained in one <config></config> anchor.

The input training dataset is specified in the <data> anchor.
"head" is optional and is useful to specify a header for all the signals in the dataset.

The input events, input states, input scalars and output events (of the rules to learn) are defined respectively by the <inputEvent>, <inputState>, <inputScalar> and <outputEvent> anchors.
Example: To learn all the events with their name beginning with "alert"
<outputEvent>
<predicate name="alert\S*" />
</outputEvent>
Note: the "predicate" anchor takes a regular expression as parameter.

All the options of the algorithm are specified with the <option> anchor.
Example: <option name="parameter name" value="parameter value" />.

Look at the configuration example for more details.

--compactdatabase [database.evt]	Compact the database to database.bin with a binary (efficient) format.

--extractinfo [rules.xml]	Extract the information about rules and store it in the csv file.
--output [file.csv]	Output csv file.
--complexitystats [r_stats.csv] Use the "computeDifficulyOfData" result to compute Envelope distances.
--envelopeType {V1|V2|V3}	Type of Envelope for the.

--learnwithgreedyselect [learning.xml]	Iteratively learn some rules (as specified in the .xml file), select the best one, remove the events predicted by this rule.
--maxgreedyselectloop [number]	Maximum number of rules to select (for each type of head symbol) i.e. maximum number of loops.
--globalSupport [number]	Select the smallest subset of rules with at least the given global support (heuristic).
--maxevaluation [number]	Number of sampling when evaluating rules.
--database_validation [event.evt]	Database for validation (required).
--database_test [event.evt]	Database for final testing.

--apply [rules.xml]	Apply rules and generate predictions.
--maxRules [number]	Number of rules to load.
--predFormat [interv|histo|center|envelope]	Predictions' format.
--output [output.pred]	Output file for predictions.
(if the option is not specified, the output is written to the standard output).
--limit [X]	Maximum number of events to load.
--separateoutput	Separate the predictions of each rule in different files.

--display [rules.xml]	Display rules in a html file.
--output [output.html]	Output file.
--maxRules [number]	Max number of rules to load.
--maxRulesPrint [number]	Max number of rules to print.
(if less than maxRules, the best rules are selected).
--database [database.event]	Database to load and evaluate the rules on.
(if not specified, the metrics of the rules are taken in the input file).
--limit [number]	Maximum number of events to load.
--rename [rename.txt]	File of renaming.

--display_interactive [rules.xml]	Create a http server and interactively display the rules, databases, test plotting and machine learning libraries
--database [database.event]	Database to load and evaluate the rules on.
(if not specified, the metrics of the rules are taken in the input file).
--maxevaluation [number]	Number of sampling when evaluating rules.
--maxRules [N]	Maximum number of rules to load (default:-1)
--port [port]	Port to start the server (default:2002)

--filter [rules.xml]	Filter rules.
--output [output.xml]	Output rules file.
--maxRules [number]	Max number of rules to save.
--database [database.event]	Database to evaluate the rules on.
--minConfidence [number]	Minimum confidence.
--minSupport [number]	Minimum support.
--maxCondition [number]	Maximum number of conditions.
--minNumberOfUse [number]	Minimum number of uses of the rules.
--maxevaluation [number]	Number of sampling when evaluating rules.
--id [ids]	Id of rule.
--idfile [file.txt]	Text file containing ids of rules (one id by line)
--limit [X]	Maximum number of events to load.
--globalSupport [number]	Select the smallest subset of rules with at least the given global support (heuristic).
--nogreedyselection 1	Do not sort the rules during the global support smallest subset of rules selection.
--evalDestination [BASE|REAL|OLD|TEST]	Where to store the rule evaluation. If "BASE", the old value of "BASE" is saved in "OLD".

--filterevents 	Filter events (apply complex operators on events)
--database [database.evt]	Database of input events. (you can always specify the dataset in the filter definition script)
--output [output.evt]	Textual event output file.
--filterdefinition [filter.txt]	Filter definition script.
--limit [X]	Maximum number of events to load from the dataset.

Syntax for filter definition:
Start line with # for comments.
For each line : [command] [options 1] [options 2] {optional-option-key:optional-option-value} ...
or : \$[variable] = [command] [options 1] [options 2] {optional-option-key:optional-option-value} ...
or : \$[variable] += [command] [options 1] [options 2] {optional-option-key:optional-option-value} ...
Variables:
%ALL_DATASET% or %ALL% : All the symbols
%[name]% : Command line options. Example %tmp% is equal to 1 if there is "--tmp 1" on the command line.
\$[name] : User variable.
#[reg] : Regular expression on the signal names.
Commands:
printnames : Print all signal names.
printallvars : Print all variables.
saveall : Save all signals in the --output file.
clear [variable] : Remove signal data.
save [variable] : Save signals to --output file or any other specified file.
sma [variable] [period] : Simple moving average.
tma [variable] [period] : Triangular moving average.
ema [variable] [period] : Exponential moving average.
triggercsv signals:_ path:_ trigger:_ period:_ tmax:_ tmin:_ : Sample signal and save it to one csv file.
triggercsv2 signals:_ path:_ trigger:_ {savetime:_} {separator:_} : Sample signal and save it to several csv files.
plot_crosscomparison x:_ y:_ z:_ path:_ trigger:_ {analyse} : Scatter plot of the signals (+ analysis).
crosscorrelation [variables 1] [variables 2] : Cross-correlation (or auto-correlation).
skip [variable] [period] {key:value}* : Downsample the signal. Optional keys : type:{last|mean|sum|count}.
shift [variable] : Shift signal.
calendar {days} {hours} : Generate calendar events (days, hours, months, etc.) from unix time.
tick [interval] : Generate tick event at constant frequency
active [variable] [max interval] {before} {after} : Detect activity of signal (if the signal is there or not).
sub [variable1] [variable2] : Subtract two signals.
layer [variable1] {key:value}* : Layer analysis. Optional keys : min, max, num, step.
sd [variable] [period] : Standard deviation.
cusum {k:0} {th:7} w:_ : Cumulative Sum.
samplinginterval [variables] : Estimate sampling interval
timesincelast [variables] [max time] : Time since last update
timeuntilnext [variables] [max time] : Time until the next update
amoc : Compute AMOC curve and AMOC analysis
Options
sequence:[symbol]	<-- symbol to cut the dataset into segments.
signal:[signal]		<-- signal to use as input for the AMOC
onlyonebysequence:true or onlyonebysequence:false <-- only consider the first alert of each segment.
timedirection:before <-- prediction (instead of detection)
maxtime:[value] <-- size of the window prior to an alert such that an input in this window will be considered a true positive
output:[file prefix]	<-- where to save the results
lockout:600 <-- minimum time between false alerts
false-positive-ratio:true <-- show false positive ratio instead of false positive
the false positive ratio is the number of false positives divided by the number of real alerts.
ymax:10		<-- maximum value of the Y axis
signal-default-direction:down or signal-default-direction:up		<-- select the direction of the signal (instead of both)
signal-direction-details:\S*random\S*=up <-- we only consider the "cross the threshold up" for the random signals
toplot:mean or toplot:median <-- what to plot
numsensitivity:[number]	<-- number of points in the AMOC (default:400 - reduce this number to speed up the AMOC computation)
log:true <-- log scale for the Y axis
legend:false <-- disable legend
trace:true <-- output some information about the location of the false/missed/true alerts.
Output
amoc_result.svg <-- the AMOCs and Temporal ROC
amoc_result_comparisons.txt <-- comparison of the AMOCs for the different signals
amoc_result_config.txt <-- configuration file of the AMOCs (reminder)
amoc_result_details.txt <-- details of the AMOC
*.csv <-- csv with the values for the 3 plots (tab separated)
randomalert num:_ {guid:_} {guidtimeout:_} : Generate a random signal
derivative [variables] : Compute derivatives
trueforatleast [variables] [interval] : Compute when the signal is true for at least N time units.
echo ... : Directly output text
filter : Filter variables with regular expression
shouldbeemptyvar [var] : Crash if a variable is not empty
populate [variables] [interval] : Populate a signal.
copy [sources] [destinations] : Copy of a signal.
invsts [variables] : Apply 1-x to a signal
disjunctionsts [variables] : Apply max( x , y ) to two signals.
removeif [variables] [after|before|lower|higher] [value] : Remove value with given condition.
multevtsts [events] [symbols] : Multiply an event and a state
self_removezeros : Remove zero values.
self_removedoubles [variables] : Remove consecutive similar values.
threshold [variables] [value] : Apply a threshold.
windowfeatures : Compute basic window features (mean, median, sd, range, range90)
pause : Pause the script.
exit : Stop the evaluation.

--merge 	Merge events
-d [database.event]	database of input events.
-n [name]	write the loaded tracks and clear the buffers.
-s [X]	current sequence number.
--timelimit [X]	maximum time of the events to load.
--minoffset [X]	minimum offset between beginning of tracks.
--offset [X]	offset between tracks.
--output [output.event]	output rules file.
--savebinary {0,1}	save output as binary.
--savetime [X]	save the time in the variable "time" every X time units.

--evalrulestogether [rules.xml]	Eval a set of rules all together as a predictive system [EXPERIMENTAL].
--database [database.event]	database to evaluate the rules on.
--output [output.csv]	output report file.
--falsePositive_fusionWindowSize [time range]	window to merge false positive.
--thruth_fusionWindowSize [time range]	window to merge true positive.

--evalrulesbysource [rules.xml]	Eval a set of rules source by source.
--output [output.csv]	output csv file.
--database [database.event]	database to evaluate the rules on.
--limit [X]	maximum number of rules to load.

[Only available in full package]
--computeDifficulyOfData [learning.xml]	Compute the envelope distance stats (TODO:ref. of paper)
--output [base name]	Output base name.
--numberOfSampling [number] Number of randomized learning (default:3)
--numberOfRealSampling [number]	Number of real learning (default:1)
--timeOut [number]	Time distance around "timeOutSymbol"
--timeOutSymbol [symbol]	Symbol to use to define possible random locations
--pictureFormat {svg|tga}	Output debug plot format (default:svg)
--database_test	[database file] Optional dataset to evaluate the rules.

[Only available in full package]
--estimateMeaningfulness [learning.xml] Compute the Mannila (and Mannila-inspired) meaningfulness measures. (TODO:ref. of paper)
--output [meaningfulness.csv]	Output file
--numberOfSampling [number]	Number of sampling (e.g. 1000)
--timeOutSymbol [symbol]	Symbol to use to define possible random locations
--timeOut [number]	Time distance around "timeOutSymbol"
--database [dataset.evt/.bin/.sevt/.csv]	Database to use

--splitdataset [dataset .evt/.bin/.sevt/.csv]	Split a dataset
--splits [number of split]
--constraint [SAME_DURATION|SAME_NUMBER_OF_EVENT|SPLIT_SAME_NUMBER_OF_SEGMENT]	Splitting constraint.
--constraint_symbol [symbol]	Splitting symbol (for SAME_NUMBER_OF_EVENT and SPLIT_SAME_NUMBER_OF_SEGMENT)

[Only available in full package]
--computeFusionStats [learning.xml]	Merge the rules into a super forecasting model. (TODO:ref. of paper)
--database [dataset.evt/.bin/.sevt/.csv]	Database to use
--output [fusion output]	Output fusion data(without extension).
--request_symbols [symbol]	Symbol to predict.
--request_horizon [number]	Forecast horizon for the prediction.
--request_length [number]	Forecast length.
--uniformGrid {0|1}	Use a uniform grid or not.
--binary {0|1}	Export the fusion data in a binary format or not (the binary format is faster to read and produce smaller files).

[Only available in full package]
--applyFusionStats [learning.xml]	Apply a super forecasting model. (TODO:ref. of paper)
--database [dataset.evt/.bin/.sevt/.csv]	Database to use
--output [output]	Output (without extension).
--model [model]	Type of machine learning model to use. Model can be (options with default parameters are given)
max_depth=2
Random Forest
maxDepth=16
minSampleCount=5
maxNumberOfTree=200
numberOfSelectedVariables=0
SVM
KNN
k=5
Decision tree
ANN
Normal Bayes
Random Forest (Auton)
maxDepth=10
minSampleCount=5
maxNumberOfTree=30
numberOfSelectedVariables=-1
Convex Bayesian Network
smoothM=0
Stupid Rule Learner
maxCondition=2
Random Forest Attr (Auton)
maxDepth=10
minSampleCount=5
maxNumberOfTree=30
Distribution_Eq=EQ_1|EQ_X|EQ_X2|EQ_1_1pX(default)|EQ_1_1pX2
Attribute_Metric=DP_COUNT(default)|WEIGHTED_COUNT|RAW_COUNT|CORRECT
Output_Measure=Mean_Entropy|Dotproduct_Sum|Inbounds_score|Consistency|Anomalous|Expected_Accuracy|Confidence(default)
Ripr
--ml:[option] [value]	Options for the machine learning model
--fusionRecord [fusion]	Input fusion data(without extension).
--threshold [number](,[number])*
--uniformGrid {0|1}

--fun 	Some fun

--test [test name] 	Test a given part of the framework.
[test name] can be: