This page presents several Symbolic Time Sequence data-sets that can be used with the TITARL algorithm. The format of the data-sets is given here. You can use the Event Viewer software to get an interactive preview of the data.
- The computer generated Database
If the data is used in a publication, please cite this web site:
Mathieu Guillame-Bert, https://mathieu.guillame-bert.com
This page presents our synthetic Symbolic Time Sequence data-set. This data-set can be used to test and evaluate Data-Mining algorithms for Symbolic Time Sequences. The data-set is divided into 100 parts (called experiments). Each part contains unique temporal patterns with various complexities, noise levels, correlations, etc. A description of the temporal patterns of each part is available in the data-set documentation. Each part is divided into three sub-parts of equal size. This sub-division can be used to perform cross-validation and/or training/validation/testing.
Additionally, the data-set contains (for every part) a list of reference predictions. These predictions are (within some error margin) the best predictions that can be expected. Predictions are not Symbolic Time Sequences because they need to express temporal accuracy, i.e. time ranges instead of time points. For example, suppose a data-set contains two types of events (A and B), and the only temporal pattern is that every time there is an event A at time t, there is an event B between t+10 and t+15 (uniform distribution).
Figure 1 graphically represents this pattern. Figure 2 is an example of such a dataset. Figure 3 represents the predictions of events B according to this pattern.
fig1 : simple temporal pattern
In this case, there is no point in predicting B events with a time range narrower than 5 time units: any such prediction will have lower confidence and lower support than the kind of predictions represented in figure 3. In addition, the confidence and the support of such a prediction will be linearly proportional to those of the 5 time unit reference prediction.
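As a minimal illustration of this example (not part of the dataset's actual generator; all names and values are assumptions), the following Python sketch builds such a toy A/B dataset and derives the corresponding reference predictions:

import random

def generate_toy_dataset(num_a=100, horizon=1000.0, seed=0):
    """Place A events uniformly in time; each A triggers a B in [t+10, t+15]."""
    rng = random.Random(seed)
    events = []
    for _ in range(num_a):
        t = rng.uniform(0.0, horizon)
        events.append((t, "event.A"))
        events.append((t + rng.uniform(10.0, 15.0), "event.B"))
    return sorted(events)

def reference_predictions(events):
    """For each A at time t, the best possible prediction of B is the range [t+10, t+15].
    In this toy pattern every A is followed by a B, so the probability is 1.0."""
    return [(t + 10.0, t + 15.0, "event.B", 1.0) for (t, s) in events if s == "event.A"]

events = generate_toy_dataset()
predictions = reference_predictions(events)
print(events[:3])
print(predictions[:3])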
A Python script is provided to automatically compare user predictions with the reference predictions. The result of the comparison is presented in an HTML report. Figure 4 presents an example of such a report. The report contains the confidence, support and temporal precision of the user and reference predictions. It also contains various analyses of the predictions, such as the relations between prediction confidence and several measures of the dataset complexity, e.g. number of conditions of the generative pattern, noise, density of events, etc.
fig4 : example of HTML report
(click to enlarge)
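The provided script is not reproduced here. As a rough sketch of the underlying measures, assuming that confidence is the fraction of predictions whose time range contains an event of the predicted type, and that support is the fraction of target events covered by at least one prediction (illustrative definitions, not necessarily the exact ones used by the script), they could be computed as follows:

def evaluate(predictions, events, target_symbol="event.B"):
    """predictions: list of (begin, end, symbol, probability) tuples.
    events: list of (time, symbol) tuples.
    Returns (confidence, support) under the assumed definitions above."""
    target_times = [t for (t, s) in events if s == target_symbol]
    correct = sum(
        1 for (begin, end, symbol, _) in predictions
        if symbol == target_symbol and any(begin <= t <= end for t in target_times))
    covered = sum(
        1 for t in target_times
        if any(begin <= t <= end
               for (begin, end, symbol, _) in predictions if symbol == target_symbol))
    confidence = correct / len(predictions) if predictions else 0.0
    support = covered / len(target_times) if target_times else 0.0
    return confidence, support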
The data set is organized in the following way:
In order to obtain significant results, you should train and evaluate your learning algorithm on different subsets of the dataset. For example, you can train on the "part_[X]_01.event" files and evaluate on the "part_[X]_02.event" files (see the sketch below).
fig5 : screen capture of the script
(click to enlarge)
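For illustration, training and evaluation files could be paired in Python as follows (the directory name "dataset" and the flat layout are assumptions; only the file naming scheme comes from the description above):

import glob
import os

# Assumed layout: all *.event files sit in a single "dataset" directory.
train_files = sorted(glob.glob(os.path.join("dataset", "part_*_01.event")))
eval_files = sorted(glob.glob(os.path.join("dataset", "part_*_02.event")))

for train_path, eval_path in zip(train_files, eval_files):
    print("train on", train_path, "/ evaluate on", eval_path)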
Event files (*.event) contain one event per line. Lines follow this schema:
[time] event.[symbol]
The regular expression is:
[+-]?([0-9]*\.[0-9]+|[0-9]+)\s+\S+
Example:
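The original example is not reproduced here. For illustration, the made-up lines below follow the "[time] event.[symbol]" schema and can be parsed with the regular expression above:

import re

EVENT_RE = re.compile(r"[+-]?([0-9]*\.[0-9]+|[0-9]+)\s+\S+")

# Made-up sample lines following the "[time] event.[symbol]" schema.
sample = """\
0.5 event.A
12.25 event.B
13.0 event.A
"""

events = []
for line in sample.splitlines():
    if EVENT_RE.match(line):
        time_str, symbol = line.split(maxsplit=1)
        events.append((float(time_str), symbol))

print(events)  # [(0.5, 'event.A'), (12.25, 'event.B'), (13.0, 'event.A')]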
Prediction files (*.pred) contain one prediction per line. Lines follow this schema:
[begin time] [end time] event.[symbol] [probability]
The regular expression is:
[-]?([0-9]*\.[0-9]+|[0-9]+)\s+[-]?([0-9]*\.[0-9]+|[0-9]+)\s+event\.\S+\s+?([0-9]*\.[0-9]+|[0-9]+)
Example:
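As above, the original example is not reproduced here; the lines below are made up and follow the "[begin time] [end time] event.[symbol] [probability]" schema, parsed with the regular expression given for prediction files:

import re

PRED_RE = re.compile(
    r"[-]?([0-9]*\.[0-9]+|[0-9]+)\s+[-]?([0-9]*\.[0-9]+|[0-9]+)"
    r"\s+event\.\S+\s+?([0-9]*\.[0-9]+|[0-9]+)")

# Made-up sample lines: predict an event B in a time range with a probability.
sample = """\
10.5 15.5 event.B 0.95
23.0 28.0 event.B 0.62
"""

predictions = []
for line in sample.splitlines():
    if PRED_RE.match(line):
        begin, end, symbol, probability = line.split()
        predictions.append((float(begin), float(end), symbol, float(probability)))

print(predictions)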
Dataset description files (*.txt) contain lists of complexity measures. Lines follow this schema:
"[name of the measure]" [value of the measure]
Example:
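The original example is not reproduced here. For illustration, a description file could look like the made-up lines below (the measure names are taken from the list that follows; the values are invented) and can be read into a dictionary:

# Made-up sample following the '"[name of the measure]" [value of the measure]' schema.
sample = """\
"Number of events to predict" 250
"Temporal precision" 5.0
"Number of conditions" 3
"""

measures = {}
for line in sample.splitlines():
    name, _, value = line.rpartition(" ")
    measures[name.strip('"')] = float(value)

print(measures)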
For each part of the dataset, several pieces of information are given:
Name: The name of the part of the dataset.
Number of events to predict: The number of events of type "a".
Time range of dataset: All the events are temporally located between these boundaries.
Density of events to predict: The ratio "Number of events to predict" / "Time range of dataset".
Number of different symbols: The number of different symbols in the dataset.
Maximum future of prediction: How far in the future a prediction can be made.
Temporal precision: The time accuracy of the reference predictions.
Maximum condition window distance: Measure of how far a pattern looks into the past to predict the future.
Maximum condition window size: Measure of how far a pattern looks into the past to predict the future.
Number of conditions: The number of conditions in the reference pattern.
Maximum ratio "number of noise events"/"number of events to predict": Straightforward.
Symbols noise ratio: Number of occurrences of the symbol / number of occurrences of the symbol to predict.
Probability of type {0,1,2} noise: The probabilities of the different types of noise.
Type 1 and 2 noise probability parameter: The parameters of the type 1 and type 2 noise.
Rule random signature: Random signature used for the random generation of events.
Total number of events: Straightforward.
Symbols (+ number of occurrences): The number of occurrences of each symbol.
Patterns: A graphical representation of the reference pattern. See the "Examples of patterns" section for more details.
Examples of patterns:
Here are several examples of temporal patterns:
If an event P occurs at time t, then an event A will occur between t+0.8 and t+2.9 with a 62.28% chance.
If an event I occurs at time t, and an event P occurred between t-15.5 and t-8.3, then an event A will occur between t+2.8 and t+4.5 with a 62.92% chance.
If an event F occurs at time t, and an event K occurred at time t' with t' ∈ [t-7.9, t-2], and an event N occurred at time t'' with t'' ∈ [t'-14.1, t'-5.6], then an event A will occur between t+1.8 and t+4.6 with a 63.26% chance.
If an event K occurs at time t, and no event D occurred between t-13.6 and t-4.9, then an event A will occur between t+4.9 and t+7.9 with a 62.27% chance.
If an event I occurs at time t, and an event J occurred at time t' with t' ∈ [t-7.3, t], and no event E occurred at a time t'' with t'' ∈ [t'-10.9, t'-5.4], then an event A will occur between t and t+1.8 with a 65.15% chance.
If an event F occurs at time t, and an event K occurred at time t' with t' ∈ [t-13.1, t-3.2], and an event N occurred at time t'' with t'' ∈ [t-7.5, t'-1.7], then an event A will occur between t+7.3 and t+9.4 with a 64.29% chance.
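As an illustrative sketch (the event naming and the Python representation are assumptions), the first pattern above can be applied to a toy event list to produce predictions in the *.pred schema described earlier:

def apply_first_pattern(events):
    """For every event P at time t, predict an event A between t+0.8 and t+2.9
    with a probability of 0.6228, as in the first example pattern above."""
    return [(t + 0.8, t + 2.9, "event.A", 0.6228)
            for (t, symbol) in events if symbol == "event.P"]

# Made-up input events.
events = [(1.0, "event.P"), (4.2, "event.Q"), (9.7, "event.P")]
for begin, end, symbol, probability in apply_first_pattern(events):
    print(begin, end, symbol, probability)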
This dataset is based on the "Home activities dataset" by Tim van Kasteren, Athanasios Noulas, Gwenn Englebienne and Ben Krose presented in "Accurate Activity Recognition in a Home Setting".
It contains 28 days of sensor data and activity annotations about one person performing activities at home. The flat is equipped with sensors on doors, cupboards, the fridge, the freezer, etc. The activities of the person are annotated (preparing breakfast, preparing dinner, having a drink, toileting, sleeping, leaving the house, etc.). The dataset is divided into two categories: sensor states (sensor_fridge, sensor_frontdoor, etc.) and activity states (action_get_drink, action_prepare_dinner, etc.). In addition, twenty-four states describe the time of the day (it_is_1am, it_is_2am, it_is_3am, etc.). The dataset contains 42 types of states and 2904 occurrences of state changes. The dataset also contains the .xml layout file to be visualised with the Event Viewer software.
fig6 : Small extract of the dataset
(click to enlarge)
Download:
This dataset is a record of three years of the EURUSD exchange rate sampled approximately every minute (from June 5 2008 to June 5 2011). The dataset contains various indicators (computed from the exchange rate), as well as discretized events/states derived from these indicators.
A python script "evaluate_predictions.py" is provided with the dataset. This script takes as input a set of buying/selling order (see below for the file format of order), it evaluates the profits (or lost) of these orders, and it generates an html report. The report contains data such as:
The dataset also contains a short introduction to Forex trading for Automated trading systems. The introduction can also be read here: Read the introduction.
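The order file format and the exact content of the report are described in the dataset itself and are not reproduced here. The sketch below only illustrates the general idea of evaluating the profit or loss of a list of buy/sell orders against a price series; the order representation and the omission of spread and fees are simplifying assumptions:

def evaluate_orders(orders, prices):
    """orders: list of (open_time, close_time, direction, volume) tuples, with
    direction +1 for a buy (long) and -1 for a sell (short) order.
    prices: mapping from time to EURUSD rate.
    Returns the total profit (or loss), ignoring spread and fees."""
    total = 0.0
    for open_time, close_time, direction, volume in orders:
        total += direction * volume * (prices[close_time] - prices[open_time])
    return total

# Made-up example: one long position followed by one short position.
prices = {0: 1.4300, 60: 1.4325, 120: 1.4310}
orders = [(0, 60, +1, 1000.0), (60, 120, -1, 1000.0)]
print(evaluate_orders(orders, prices))  # approximately 4.0 (2.5 + 1.5)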
Download: