# Fault detection in home systems

## Part I

March 16, 2020 — 7 Min Read

## Time Series Forecasting

A time series is an ordered series of data, where observations follow one another along the time dimension. Time series forecasting is about making predictions of what comes next in the series. Thus, forecasting involves training a model on historical data and using it to predict future observations. Time series forecasting can be considered a *supervised learning problem*, because the training data contains the desired solution, also called labels. Before we continue, let’s take a moment to better understand the form of time series and supervised learning data.

A normal machine learning dataset is a collection of observations. For example:

```
observation # 1
observation # 2
observation # 3
```

A time series dataset is different. Time series adds an explicit order dependence between observations: a time dimension. This additional dimension is both a constraint and a structure that provides a source of additional information.

*A time series is a sequence of observations taken sequentially in time.*

For example:

```
Time # 1, observation
Time # 2, observation
Time # 3, observation
```

Time series forecasting is an important area of artificial intelligence, because so many prediction problems involve a time component. Since we can treat them as supervised learning problems, we can apply the entire arsenal of ML methods — *Regression, Neural Networks, Support Vector Machines, Random Forests, XGBoost*, etc. But at the same time, time series forecasting problems have several unique quirks and peculiarities that set them apart from typical supervised learning problems and require rethinking the known approaches to building and evaluating models. Here are the most important differences:

- Every time we want to generate a new prediction for the time series, we need to retrain a model.
- Due to the temporal dependencies in time series data, we cannot rely on usual validation techniques. To avoid biased evaluations we must ensure that training datasets contain observations that occurred before the ones in test datasets.
- Most of the problems have multiple input variables and we need to perform the same type of prediction for multiple devices.
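
As a concrete illustration of the second point, a chronological split can be sketched with scikit-learn's `TimeSeriesSplit`, which never lets training observations come after the test fold (a minimal sketch on toy data, not the split used later in this article):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 8 observations; the index stands in for time order.
X = np.arange(8).reshape(-1, 1)

# Each split trains only on observations that occurred before the test fold.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index.
    assert train_idx.max() < test_idx.min()
    print("train:", train_idx, "test:", test_idx)
```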

In this article we will consider a real-world problem and apply known ML methods, as well as deep learning networks.

## Problem Description

Our dataset contains sensor data. In total there are 97,730 devices, and measurements are made periodically, from the start date until the end date, which differs for each device. Thus, the data consists of multiple time series. The aim is to predict a target variable that is part of the measurements; in particular, we want to predict whether the value will stay lower than or equal to a given **threshold**. The event where the value goes over the threshold we will refer to as the *error* event for short.

The two biggest issues with the dataset are:

**Size**

For each device we have periodic measurements. Even the minimum requirement we make of 4 months yields 11,520 measurements for one device.

**Extreme Rare Event Problem**

The error event happens very rarely: it occurs at only 14,240 devices, which accounts for 14.57% of all devices. Not only that, but even for devices where the error occurs, it will, for most devices, happen extremely rarely compared to the total number of measurements.

We assume that the time series are independent and identically distributed random processes. Hence we consider each time series, that is, the measurements for one device, independently of the others and train models only on devices where the error event occurs. This ensures that a given model will see an error event in the training set and has to predict one in the test set.

The dataset was being dumped to storage buckets partitioned by time, in particular, year, month and day. The first thing we did was to re-partition the data by deviceId in the storage bucket. This enables us to load each time series, prepare it for a model, run the model and collect the result.

We encode the categorical features of the dataset and normalize the input variables. We train on all measurements from the start date until the end of October and use the months of November and December for testing. To ensure that we have enough data to train on, we require a minimum of four months of measurements.
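
A minimal sketch of this preprocessing step, assuming a pandas DataFrame per device; the column names (`sensor_type`, `value`) are illustrative stand-ins, not the real schema:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in frame for one device's measurements; columns are hypothetical.
df = pd.DataFrame({
    "sensor_type": ["a", "b", "a", "b"],   # a categorical feature
    "value": [10.0, 12.0, 11.0, 15.0],     # a numeric input variable
})

# One-hot encode the categorical feature.
df = pd.get_dummies(df, columns=["sensor_type"])

# Normalize the numeric input to the [0, 1] range.
scaler = MinMaxScaler()
df[["value"]] = scaler.fit_transform(df[["value"]])
```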

As mentioned before, time series forecasting can be classified as a supervised learning problem which can be further classified into:

- *Regression:* A regression problem is when the output variable is a real value, such as “dollars” or “weight.”
- *Classification:* A classification problem is when the output variable is a category, such as “red” and “blue” or “disease” and “no disease.”

In our case we have a classification problem; we want to predict if the error event occurs or not. We can add a binary feature classifying the target variable; 1 in case of an error event and 0 otherwise. We will refer to the class marked as 1 as the positive labeled class, and the class marked as 0 as the negative labeled class. The aim is to predict the positive labeled class.

Unfortunately, we have an unbalanced dataset, meaning that we have far fewer positively labeled samples, if any, than negative ones. In a typical rare-event problem, the positively labeled data are around 5–10% of the total. In an extreme rare event problem, we have less than 3% positively labeled data.

We follow two approaches to attempt to solve the problem. First, we consider the problem as a regression problem and aim to predict the actual values of the target variable. After predicting, we label both the actual and predicted values with the corresponding positive and negative labels. We then count how many times we predicted an error event **correctly**. The number of correct predictions is referred to as the *true positive* count.
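
The labeling-and-counting step of the first approach can be sketched as follows; the threshold and the sample values are purely illustrative:

```python
import numpy as np

# Illustrative threshold; the real one comes from the problem definition.
THRESHOLD = 0.8

def count_true_positives(y_true, y_pred, threshold):
    """Label values above the threshold as error events (positive class)
    and count the predictions that coincide with an actual error event."""
    actual_pos = np.asarray(y_true) > threshold
    predicted_pos = np.asarray(y_pred) > threshold
    return int(np.sum(actual_pos & predicted_pos))

y_true = [0.2, 0.9, 0.95, 0.3]   # two actual error events
y_pred = [0.1, 0.85, 0.5, 0.4]   # the model catches one of them
print(count_true_positives(y_true, y_pred, THRESHOLD))  # → 1
```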

The second approach is anomaly detection, where a model learns the pattern of a ‘normal process’, that is, where the error event does not occur. Anything that does not follow this pattern is classified as an **anomaly**.

## Approach 1 - Supervised Regression Problem

Before we can apply a model, machine or deep learning, our time series forecasting problem must be re-framed as a supervised learning problem: from a single sequence to pairs of input and output sequences. A supervised learning problem consists of input patterns (X) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns. For example:

```
X y
1 2
2 3
3 4
```

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem by using previous time steps as input variables and, for example, the next time step (or the next 2 steps, etc.) as the output variable. For example, if we have the time series **[1, 2, 3, 4, 5, 6]** and the goal is to predict the series 2 steps ahead, using the last two measurements, also referred to as *lags* or *lookbacks*, we obtain the following representation:

```
t-1  t    t+2
1    2    4
2    3    5
3    4    6
```
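
The restructuring above can be sketched as a small sliding-window helper (a sketch of the technique, not the exact code used in this project):

```python
def series_to_supervised(series, n_lags, horizon):
    """Frame a univariate series as pairs of (last n_lags values,
    value `horizon` steps after the last lag)."""
    X, y = [], []
    for i in range(len(series) - n_lags - horizon + 1):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags + horizon - 1])
    return X, y

X, y = series_to_supervised([1, 2, 3, 4, 5, 6], n_lags=2, horizon=2)
print(X)  # → [[1, 2], [2, 3], [3, 4]]
print(y)  # → [4, 5, 6]
```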

In our case we have multiple different features; this type of forecasting problem is referred to as *Multivariate Forecasting*: we use measurements of several features and predict one or more of them. For example, we may have two features of time series measurements, var1 and var2, and we wish to forecast var1 two steps ahead, using the last lookback:

```
var1(t-1)  var2(t-1)  var1(t+2)
0.0        50.0       3
1.0        51.0       4
2.0        52.0       5
3.0        53.0       6
4.0        54.0       7
5.0        55.0       8
6.0        56.0       9
7.0        57.0       10
8.0        58.0       11
```
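
The same framing can be obtained with pandas `shift`, reproducing the table above (a sketch; the column naming is only for illustration):

```python
import pandas as pd

# Rebuild the var1/var2 example with shifted columns: one lookback
# as input, the value two steps ahead as target.
df = pd.DataFrame({"var1": range(12), "var2": range(50, 62)}, dtype=float)

framed = pd.DataFrame({
    "var1(t-1)": df["var1"].shift(1),    # previous observation
    "var2(t-1)": df["var2"].shift(1),
    "var1(t+2)": df["var1"].shift(-2),   # value two steps ahead
}).dropna()  # drop rows that lack a valid lag or target

# First row matches the table: var1(t-1)=0.0, var2(t-1)=50.0, var1(t+2)=3.0
print(framed.head())
```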

To obtain a supervised representation of the time series, we need to choose a number of lags to use as input as well as a forecast time. We use the last 5 measurements of each feature to predict the target value one week ahead. We note that there is no right answer for choosing the number of lags or lookbacks; instead, it is a good idea to test different numbers and see what works.

As mentioned before, we use devices where the error event occurs. To ensure that a model can train on measurements with error events and to be able to see if a model can predict error events, both the train and test datasets need to have error events occurring.

We trained three linear models, *Linear*, *Huber* and *SGD Regression*, and two ensemble models, *Gradient Boosting* and *AdaBoost*, on 230 devices where the error event occurred a total of 4,890 times and obtained the following result:

Model | True Positive |
---|---|
Linear Regression | 76 |
Huber Regression | 161 |
SGD Regression | 9 |
Gradient Boosting | 42 |
AdaBoost | 0 |

The best model, Huber Regression, predicts only 3% of the error events: an **unsatisfactory performance**. Of course, 230 devices do not reflect how the models will behave for the remaining devices, and just because the true positive counts are unsatisfactory does not mean that the models make bad predictions. For example, Huber Regression obtained an overall mean absolute error of 0.25. Running the model on 75 more devices where the error event never occurs yields a mean absolute error of 0.19. This might suggest that linear models perform better when the target variable is more ‘stable’, that is, with less fluctuation, which is the case for devices where the error event never occurs.

Unfortunately, the true positive counts are exactly the values we care about and need to optimize. Moreover, this approach can only be used on devices where the error event occurs, while most devices are eventless, so **we need another approach, which will be the topic of our next blog post**.