Temporal Data Set Readme V1.0
Mathieu Guillame-Bert
PhD Student - INRIA Rhône Alpes

Mail : mathieu.guillame-bert@inrialpes.fr
Web site : http://www-prima.imag.fr/guillame-bert

Citation:

If the data is used in a publication, please cite this web site:

Mathieu Guillame-Bert, http://www-prima.imag.fr/guillame-bert/?page=database

Download:

Download the database (.zip format, 70.71 MB)
Download the database (.tar.gz format, 70.32 MB)

Presentation:

This page presents our computer-generated symbolic time sequence dataset for the evaluation of temporal data mining and machine learning algorithms. The dataset is divided into 100 parts (experiments). Each part contains different temporal patterns of varying complexity. A description of the temporal patterns of each part is available in the dataset documentation. Several parts containing the same temporal patterns are provided so that cross-validation can be performed.

Additionally, the dataset contains, for every part, a list of reference predictions of events of type "a". These predictions are (with some error level) the best predictions that can be expected. Predictions differ from time series data because they are expressed with time ranges instead of time points. For example, suppose a dataset contains two types of events (A and B). Suppose the only temporal pattern is that every time there is an event A at time t, there is an event B between t+10 and t+15 (uniform distribution).

Figure 1 shows this pattern graphically. Figure 2 is an example of such a dataset. Figure 3 shows the predictions of events B according to this pattern.

 


fig1 : simple temporal pattern

 


fig2 : example of time series

 


fig3 : example of prediction

 

Informally, it makes no sense to predict B events with a time range under 5 time units. Formally, any prediction with a time range under 5 time units will have lower confidence and lower support than the kind of prediction represented in figure 3.
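The effect of the time range can be illustrated with a small simulation. The sketch below is not part of the dataset tools: the function names are hypothetical, and it assumes that confidence is the share of predictions containing at least one B event and that support is the share of B events covered by at least one prediction. A events are spaced far apart so that patterns do not overlap.

```python
import random

def simulate(n_events=1000, spacing=100.0, seed=0):
    """Generate A events and, for each A at time t, one B event in [t+10, t+15]."""
    rng = random.Random(seed)
    a_times = [i * spacing for i in range(n_events)]
    b_times = [t + rng.uniform(10.0, 15.0) for t in a_times]
    return a_times, b_times

def score(a_times, b_times, lo, hi):
    """Predict a B event in [t+lo, t+hi] after every A; return (confidence, support)."""
    ranges = [(t + lo, t + hi) for t in a_times]
    # Assumed definition of confidence: share of predictions matched by a B event.
    matched = sum(any(s <= b <= e for b in b_times) for s, e in ranges)
    # Assumed definition of support: share of B events covered by a prediction.
    covered = sum(any(s <= b <= e for s, e in ranges) for b in b_times)
    return matched / len(ranges), covered / len(b_times)

a_times, b_times = simulate()
print("range [t+10, t+15]:", score(a_times, b_times, 10.0, 15.0))
print("range [t+10, t+12]:", score(a_times, b_times, 10.0, 12.0))
```

Under these assumptions the full 5-unit range reaches a confidence and support of 1.0, while the narrower 2-unit range misses every B event that falls after t+12 and scores lower on both measures.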

A Python script is provided to automatically compare user predictions (the results of machine learning techniques) with the reference predictions. The result of the comparison is presented in an HTML report. Figure 4 shows an example of such a report. The report contains the confidence, support and temporal precision of the user and reference predictions. It also contains various analyses of the predictions, such as the relation between prediction confidence and several measures of dataset complexity, e.g. the number of conditions of the generative pattern, noise, density of events, etc.

 


fig4 : example of html report

 

Dataset organization:

The data set is organized in the following way:

In order to obtain significant results, you should train and evaluate your learning algorithm on different subsets of the data set. For example, you can train on "part_[X]_01.event" files and evaluate on "part_[X]_02.event" files.
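The train/evaluate split can be automated by pairing files that belong to the same part. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that file names follow the "part_[X]_[NN].event" layout quoted above, with "01" used for training and "02" for evaluation.

```python
import re
from collections import defaultdict

# Assumed file-name layout: part_<X>_<NN>.event
NAME_RE = re.compile(r"^part_(.+)_(\d+)\.event$")

def group_by_part(filenames):
    """Map each part identifier X to a {NN: filename} dictionary."""
    groups = defaultdict(dict)
    for name in filenames:
        m = NAME_RE.match(name)
        if m:  # ignore files that do not follow the assumed layout
            groups[m.group(1)][m.group(2)] = name
    return groups

def train_eval_pairs(filenames, train="01", evaluate="02"):
    """Return (training file, evaluation file) pairs, one per complete part."""
    pairs = []
    for _part, files in sorted(group_by_part(filenames).items()):
        if train in files and evaluate in files:
            pairs.append((files[train], files[evaluate]))
    return pairs
```

For real use, the list of file names could come from `os.listdir` on the extracted dataset directory.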


fig5 : screen capture of the script

File format:

Event files (*.event) contain one event per line. Each line follows this schema:

[time] event.[symbol]

The corresponding regular expression is:

[+-]?([0-9]*\.[0-9]+|[0-9]+)\s+\S+

Example:

3867.4 a
3867.4 a
3874.8 b
3874.8 b
3901.4 a
3901.4 a
3908.9 b
3908.9 b
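As an illustration, event lines can be read with a few lines of Python. This is a minimal sketch, not the script shipped with the dataset: `parse_events` is a hypothetical name, the pattern is the regular expression above with capture groups added, and the optional "event." prefix is stripped on the assumption that some files may include it as the schema suggests.

```python
import re

# The regular expression given above, with capture groups for time and symbol.
EVENT_RE = re.compile(r"^\s*([+-]?(?:[0-9]*\.[0-9]+|[0-9]+))\s+(\S+)\s*$")

def parse_events(lines):
    """Return a list of (time, symbol) tuples from the lines of an event file."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # skip blank or malformed lines
        symbol = m.group(2)
        # Assumption: drop an "event." prefix if the file uses one.
        if symbol.startswith("event."):
            symbol = symbol[len("event."):]
        events.append((float(m.group(1)), symbol))
    return events

print(parse_events(["3867.4 a", "3874.8 b"]))
```

With a real file, the lines can be supplied by iterating over an open file object.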

Prediction files (*.pred) contain one prediction per line. Each line follows this schema:

[begin time] [end time] event.[symbol] [probability]

The corresponding regular expression is:

[-]?([0-9]*\.[0-9]+|[0-9]+)\s+[-]?([0-9]*\.[0-9]+|[0-9]+)\s+event\.\S+\s+?([0-9]*\.[0-9]+|[0-9]+)

Example:

5 10 b 0.5
13 15 b 0.7
25 28 b 1.0
61 67 b 0.8
62 66 b 0.8
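Prediction lines can be read in the same way. The sketch below is hedged: `Prediction` and `parse_predictions` are hypothetical names, and the "event." prefix is treated as optional (an assumption) so that both the schema and the example lines above are accepted.

```python
import re
from typing import NamedTuple

NUM = r"[+-]?(?:[0-9]*\.[0-9]+|[0-9]+)"
# Assumption: the "event." prefix may or may not be present before the symbol.
PRED_RE = re.compile(rf"^\s*({NUM})\s+({NUM})\s+(?:event\.)?(\S+)\s+({NUM})\s*$")

class Prediction(NamedTuple):
    begin: float        # start of the predicted time range
    end: float          # end of the predicted time range
    symbol: str         # predicted event symbol
    probability: float  # probability attached to the prediction

def parse_predictions(lines):
    """Return a list of Prediction tuples from the lines of a prediction file."""
    predictions = []
    for line in lines:
        m = PRED_RE.match(line)
        if m is None:
            continue  # skip blank or malformed lines
        begin, end, symbol, probability = m.groups()
        predictions.append(
            Prediction(float(begin), float(end), symbol, float(probability))
        )
    return predictions

print(parse_predictions(["5 10 b 0.5", "13 15 b 0.7"]))
```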

Dataset description files (*.txt) contain lists of complexity measures. Each line follows this schema:

"[name of the measure]" [value of the measure]

Example:

"Density of events to predict" 0.009180
" Maximum future of prediction" 10.000000
" Temporal precision" 3.000000
" Maximum condition window distance" 20.000000
" Maximum condition window size" 10.000000
" Maximum noise ratio" 0.000000
" Number of conditions" 3
" Probability of type 0 noise" 0.500000
" Probability of type 1 noise" 0.000000
" Probability of type 2 noise" 0.500000
" Duration of experiment" 100000
" Number of events to perdict" 918
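A description file can be loaded into a dictionary as sketched below. `parse_description` is a hypothetical helper, not part of the dataset tools. It strips whitespace inside the quotes because several names in the example start with a space, and it keeps the names otherwise verbatim, so keys must be spelled exactly as they appear in the file.

```python
import re

# A quoted measure name followed by a numeric value.
MEASURE_RE = re.compile(r'^\s*"([^"]*)"\s+(\S+)\s*$')

def parse_description(lines):
    """Return a {measure name: value} dictionary from a description file."""
    measures = {}
    for line in lines:
        m = MEASURE_RE.match(line)
        if m is None:
            continue  # skip blank or malformed lines
        name = m.group(1).strip()  # some names carry a leading space
        measures[name] = float(m.group(2))
    return measures

example = [
    '"Density of events to predict" 0.009180',
    '" Number of conditions" 3',
    '" Duration of experiment" 100000',
]
print(parse_description(example))
```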

Dataset information:

The following information is given for each part of the dataset:

Name : The name of the part of the dataset.
Number of events to predict : Number of events of type "a".
Time range of dataset : All the events are temporally located between these boundaries.
Density of events to predict : The ratio "Number of events to predict"/"Time range of dataset".
Number of different symbols : The number of distinct symbols in the dataset.
Maximum future of prediction : How far into the future a prediction can be made.
Temporal precision : The time accuracy of the reference predictions.
Maximum condition window distance : Measure of how far a pattern looks into the past to predict the future.
Maximum condition window size : Measure of how far a pattern looks into the past to predict the future.
Number of conditions : Number of conditions in the reference pattern.
Maximum ratio "number of noise events"/"number of events to predict" : Self-explanatory.
Symbol noise ratio : Number of occurrences of this symbol / number of occurrences of the symbol to predict.
Probability of type {0,1,2} noise : Probabilities of the different types of noise.
Type 1 and 2 noise probability parameter : Parameters of the type 1 and type 2 noise.
Rule random signature : Random signature used for the random generation of events.
Total number of events : Self-explanatory.
Symbols (+number of occurrences) : The number of occurrences of each symbol.
Patterns : A graphical representation of the reference pattern. See the "Example of patterns" section for more details.
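The density measure can be checked against the example description file shown earlier. The short sketch below assumes that the "Duration of experiment" value (100000) is the length of the time range. The function name is hypothetical.

```python
def event_density(n_events_to_predict, time_range):
    """Ratio "Number of events to predict" / "Time range of dataset"."""
    if time_range <= 0:
        raise ValueError("time_range must be positive")
    return n_events_to_predict / time_range

# Values taken from the example description file: 918 events over 100000 time units.
density = event_density(918, 100000)
print(f"{density:.6f}")  # prints 0.009180
```

The result matches the "Density of events to predict" value listed in that example.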

Example of patterns :

Here are several examples of temporal patterns:

If an event P happens at time t, then an event A will happen between t+0.8 and t+2.9 with a 62.28% chance.

If an event I happens at time t, and an event P happened between t-15.5 and t-8.3, then an event A will happen between t+2.8 and t+4.5 with a 62.92% chance.

If an event F happens at time t, and an event K happened at time t' with t' ∈ [t-7.9,t-2], and an event N happened at time t'' with t'' ∈ [t'-14.1,t'-5.6], then an event A will happen between t+1.8 and t+4.6 with a 63.26% chance.

If an event K happens at time t, and no event D happened between t-13.6 and t-4.9, then an event A will happen between t+4.9 and t+7.9 with a 62.27% chance.

If an event I happens at time t, and an event J happened at time t' with t' ∈ [t-7.3,t], and no event E happened at time t'' with t'' ∈ [t'-10.9,t'-5.4], then an event A will happen between t and t+1.8 with a 65.15% chance.

If an event F happens at time t, and an event K happened at time t' with t' ∈ [t-13.1,t-3.2], and an event N happened at time t'' with t'' ∈ [t-7.5,t'-1.7], then an event A will happen between t+7.3 and t+9.4 with a 64.29% chance.
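To make the reading of such patterns concrete, the sketch below applies the second pattern (an event I preceded by an event P between t-15.5 and t-8.3) to a list of events and emits predictions in the (begin, end, symbol, probability) form used by prediction files. It is an illustration only: the function names are hypothetical, the symbols are written in upper case as in the pattern text, and this is not the generator used to build the dataset.

```python
from bisect import bisect_left, bisect_right

def occurs_between(sorted_times, lo, hi):
    """True if at least one value of sorted_times lies in [lo, hi]."""
    return bisect_left(sorted_times, lo) < bisect_right(sorted_times, hi)

def predict_a(events):
    """Apply the pattern to (time, symbol) events; return a list of predictions."""
    p_times = sorted(t for t, s in events if s == "P")
    predictions = []
    for t, s in sorted(events):
        # Condition: an I at time t, with a P somewhere in [t-15.5, t-8.3].
        if s == "I" and occurs_between(p_times, t - 15.5, t - 8.3):
            # Conclusion: an A in [t+2.8, t+4.5] with a 62.92% chance.
            predictions.append((t + 2.8, t + 4.5, "A", 0.6292))
    return predictions

events = [(0.0, "P"), (10.0, "I"), (50.0, "I")]
print(predict_a(events))
```

Only the I event at time 10 produces a prediction here, since the I event at time 50 has no P event in its condition window.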