Fault detection in home systems
April 29, 2020 — 7 Min Read
In the first part we talked about time series and time series forecasting, introduced our problem, and applied five common ML methods. Due to the nature of the dataset, those methods were not able to solve our problem, which can be classified as a rare event problem. Here we discuss another approach, which has the advantage that it can be applied to devices with or without error events in the past.
Classifying rare events is quite challenging. Recently, Deep Learning has been quite extensively used for classification. However, the small number of positively labeled samples prohibits Deep Learning application. No matter how large the data, the use of Deep Learning gets limited by the amount of positively labeled samples.
Why should we still bother to use Deep Learning?
This is a legitimate question. Should we think of using some other Machine Learning approach?
The answer is subjective. We can always go with a Machine Learning approach. To make it work, we can undersample the negatively labeled data to obtain a nearly balanced dataset. If, for example, about 0.6% of the data is positively labeled, undersampling yields a dataset that is roughly 1% of the size of the original data. A Machine Learning approach, e.g. SVM or Random Forest, will still work on a dataset of this size. However, its accuracy will be limited, and we would not use the information present in the remaining ~99% of the data.
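The size estimate above can be reproduced with a minimal sketch. The helper name `undersample_negatives` and the synthetic label vector are assumptions made for illustration; they are not part of the original pipeline.

```python
import numpy as np

def undersample_negatives(y, seed=0):
    """Return sorted indices of a balanced subset: all positives plus an
    equally sized random sample of negatives (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_keep = min(len(pos_idx), len(neg_idx))
    neg_keep = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos_idx, neg_keep]))

# Synthetic labels: 100,000 samples, 0.6% of them positive (600 samples).
y = np.zeros(100_000, dtype=int)
y[:600] = 1

idx = undersample_negatives(y)
fraction_kept = len(idx) / len(y)  # 1,200 of 100,000 samples remain
```

With 600 positives and 600 sampled negatives, the balanced set holds 1.2% of the original samples, which is the "roughly 1%" mentioned above.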
If the data is sufficient, Deep Learning methods are potentially more capable. They also allow flexibility for model improvement through different architectures. We will, therefore, attempt to use Deep Learning methods.
The idea of the second approach is to use an LSTM network and an LSTM Autoencoder network to build a rare event classifier.
Long Short-Term Memory networks, or LSTMs for short, are a type of Recurrent Neural Network (RNN). RNNs in general, and LSTMs specifically, are used on sequential or time series data. The architecture is built so that it remembers data and information for a long period of time. This is an important aspect because in time series forecasting the output for a given time period depends on the outputs of previous periods. Thus, we combine the Autoencoder with an LSTM network.
The Autoencoder model was designed for prediction problems where both input and output sequences are given, so-called sequence-to-sequence problems, such as translating text from one language to another. This model can be used for multi-step time series forecasting and, when applied to classification problems, works similarly to anomaly detection. As its name suggests, the model consists of two sub-models: the encoder and the decoder.
The encoder is a model responsible for reading and interpreting the input sequence, that is, it learns the underlying features of a process. The output of the encoder is a fixed-length vector that represents the model’s interpretation of the sequence.
The decoder uses the output of the encoder as an input and can recreate the original data from these underlying features.
We label the data in a binary way: positively labeled samples (error event) and negatively labeled samples (no error event). The negatively labeled data is treated as the normal state of the process, that is, the state in which no event occurs. We ignore the positively labeled data and train a network only on negatively labeled data. The main difference here is that the objective is to reconstruct a given input rather than to predict a target variable:
| 1. Approach (supervised learning problem) | 2. Approach (anomaly detection) |
|---|---|
| given input X -> predict output y | given input X -> reconstruct input X |
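Training only on the normal state amounts to filtering the samples by label before fitting. The following is a minimal sketch under the assumption that the samples are stored as a NumPy array `X` with a label vector `y`; the helper name `select_normal` and the synthetic data are illustrative only.

```python
import numpy as np

def select_normal(X, y):
    """Return only the samples labeled 0 (no error event)."""
    X = np.asarray(X)
    y = np.asarray(y)
    return X[y == 0]

# Synthetic stand-in: 6 samples, each with 5 timesteps and 29 features.
X = np.random.default_rng(0).normal(size=(6, 5, 29))
y = np.array([0, 0, 1, 0, 1, 0])

X_normal = select_normal(X, y)  # shape (4, 5, 29)

# For reconstruction, the input also serves as the target, e.g.
# model.fit(X_normal, X_normal, ...) for a compiled Keras model.
```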
The network is evaluated by computing the absolute values of the differences between the true and the predicted values, which we refer to as the reconstruction errors. If the reconstruction error is high, we classify the sample as an error event, i.e. mark it with a positive label.
Now the question remains: how high does the reconstruction error need to be, i.e. how do we choose the threshold?
There is of course no right answer.
We choose the median of the reconstruction errors as the threshold: a sample gets a positive label if its reconstruction error is greater than the median, and a negative label otherwise.
The median is considered better than the average if the data set is known to have some extreme values (high or low). In such cases, the median gives a more realistic estimate of the central value. The average is considered a better estimate of the central value if the data is symmetrically distributed. Since we assume that the reconstruction error will be high in case of an ‘unusual event’, that is, an error event, the reconstruction errors will have extreme values, hence the median is more appropriate.
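The thresholding rule can be sketched as follows. The text does not say how the absolute differences are aggregated per sample, so averaging over timesteps and features is an assumption here, as are the function names and the synthetic data.

```python
import numpy as np

def reconstruction_errors(x_true, x_pred):
    """Mean absolute error per sample over the timestep and feature axes
    (the aggregation is an assumption, not taken from the article)."""
    return np.mean(np.abs(x_true - x_pred), axis=(1, 2))

def label_by_median(errors):
    """Label 1 (error event) where the error exceeds the median, else 0."""
    threshold = np.median(errors)
    return (errors > threshold).astype(int), threshold

# Tiny synthetic example: 5 samples, 2 timesteps, 3 features.
x_true = np.zeros((5, 2, 3))
x_pred = np.zeros((5, 2, 3))
for i, offset in enumerate([0.1, 0.2, 0.3, 0.4, 5.0]):
    x_pred[i] += offset  # the last sample is reconstructed badly

errors = reconstruction_errors(x_true, x_pred)
labels, threshold = label_by_median(errors)
```

The extreme error of 5.0 does not move the median, whereas it would pull the mean up to 1.2. Note that if the median is computed on the same errors it is applied to, about half of the samples end up above it by construction.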
Let's first examine the LSTM Autoencoder we used:

    # Imports assume the Keras API bundled with TensorFlow.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

    lstm_autoencoder = Sequential()
    # Encoder
    lstm_autoencoder.add(LSTM(32, activation='relu', input_shape=(timesteps, n_features), return_sequences=True))
    lstm_autoencoder.add(LSTM(16, activation='relu', return_sequences=False))
    lstm_autoencoder.add(RepeatVector(timesteps))
    # Decoder
    lstm_autoencoder.add(LSTM(16, activation='relu', return_sequences=True))
    lstm_autoencoder.add(LSTM(32, activation='relu', return_sequences=True))
    lstm_autoencoder.add(TimeDistributed(Dense(n_features)))
The network takes a 2D array (timesteps x features) as input for each sample. One LSTM layer has as many cells as there are timesteps or lags. Setting return_sequences=True makes the cell at each timestep emit a signal.
Our input data has 5 timesteps and 29 features.
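How a table of observations might be turned into such samples can be sketched with a sliding window. The helper `make_windows` and the synthetic data are hypothetical; the article does not show its own preprocessing.

```python
import numpy as np

def make_windows(data, timesteps):
    """Stack sliding windows of `timesteps` consecutive rows.

    data: array of shape (n_observations, n_features)
    returns: array of shape (n_observations - timesteps + 1, timesteps, n_features)
    """
    data = np.asarray(data)
    n_windows = data.shape[0] - timesteps + 1
    return np.stack([data[i:i + timesteps] for i in range(n_windows)])

# Synthetic stand-in for one device: 100 observations of 29 features.
data = np.arange(100 * 29, dtype=float).reshape(100, 29)
X = make_windows(data, timesteps=5)  # shape (96, 5, 29)
```

Each entry of `X` is one 5 x 29 sample, and the full array has the (samples, timesteps, features) layout that Keras LSTM layers expect.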
Layer 1, LSTM(32), reads the input data and outputs 32 features with 5 timesteps each, because return_sequences=True.
Layer 2, LSTM(16), takes the timesteps x 32 input from Layer 1 and reduces the feature size to 16. Since return_sequences=False, it outputs a feature vector of size 1 x 16. The output of this layer is the encoded feature vector of the input data.
Layer 3, RepeatVector(timesteps), replicates the feature vector timesteps times. The RepeatVector layer acts as a bridge between the encoder and decoder modules. It prepares the 2D array input for the first LSTM layer in the Decoder. The Decoder is designed to unfold the encoding. Therefore, the Decoder layers are stacked in the reverse order of the Encoder.
Layer 4, LSTM(16), and Layer 5, LSTM(32), are the mirror images of Layer 2 and Layer 1, respectively.
Layer 6, TimeDistributed(Dense(n_features)), is added at the end to produce the output. The Dense layer inside TimeDistributed uses weight vectors whose length equals the number of features output by the previous layer. In this network, Layer 5 outputs 32 features. Therefore, the layer holds one weight vector of length 32 for each of the n_features outputs. The output of Layer 5 is a timesteps x 32 array, which we denote by U, and the weights of Layer 6 form a 32 x n_features array, denoted by V. A matrix multiplication between U and V yields a timesteps x n_features output.
The objective of fitting the network is to make this output close to the input. Note that the network itself ensures that the input and output dimensions match.
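The matrix product in Layer 6 can be checked with plain NumPy. This is a sketch with random stand-in values for U and V, not the trained weights; the bias vector `b` is included because a Keras Dense layer adds one by default.

```python
import numpy as np

timesteps, n_units, n_features = 5, 32, 29
rng = np.random.default_rng(0)

U = rng.normal(size=(timesteps, n_units))   # stand-in for the Layer 5 output of one sample
V = rng.normal(size=(n_units, n_features))  # stand-in for the Dense weights
b = np.zeros(n_features)                    # stand-in for the Dense bias

# TimeDistributed(Dense) applies the same V and b to every timestep.
output = U @ V + b  # shape (timesteps, n_features)

# Equivalent computation, one timestep at a time.
per_step = np.stack([U[t] @ V + b for t in range(timesteps)])
```

Both computations give a 5 x 29 array, the same shape as one input sample.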
Next we compare the LSTM Autoencoder with a regular LSTM Network:
    lstm = Sequential()
    # Same layer sizes as the autoencoder, but every LSTM layer returns the full sequence
    lstm.add(LSTM(32, activation='relu', input_shape=(timesteps, n_features), return_sequences=True))
    lstm.add(LSTM(16, activation='relu', return_sequences=True))
    lstm.add(LSTM(16, activation='relu', return_sequences=True))
    lstm.add(LSTM(32, activation='relu', return_sequences=True))
    lstm.add(TimeDistributed(Dense(n_features)))
Differences between Regular LSTM network and LSTM Autoencoder:
We use return_sequences=True in all the LSTM layers. That means each layer outputs a 2D array containing every timestep. Thus, no intermediate layer outputs a one-dimensional encoded feature vector, and a sample is never encoded into a feature vector. The absence of this encoding vector distinguishes the regular LSTM network for reconstruction from an LSTM Autoencoder. However, the number of parameters is the same in both the Autoencoder and the regular network, because the extra RepeatVector layer in the Autoencoder does not have any parameters. Most importantly, the reconstruction accuracies of both networks are similar.
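The parameter claim can be verified by hand. The sketch below uses the standard parameter count of an LSTM layer with bias, 4 * (units * (input_dim + units) + units), and takes n_features = 29 from the text; the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with bias:
    four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Trainable parameters of a Dense layer with bias."""
    return input_dim * units + units

n_features = 29

# (input_dim, units) of each LSTM layer. RepeatVector only copies the
# 16-dimensional encoding, so the layer after it still sees 16 input features.
autoencoder_lstms = [(n_features, 32), (32, 16), (16, 16), (16, 32)]
regular_lstms = [(n_features, 32), (32, 16), (16, 16), (16, 32)]

autoencoder_total = sum(lstm_params(i, u) for i, u in autoencoder_lstms) + dense_params(32, n_features)
regular_total = sum(lstm_params(i, u) for i, u in regular_lstms) + dense_params(32, n_features)
```

Both totals come to 20,413 parameters. The per-layer input sizes are identical in the two networks because return_sequences changes only how many timesteps a layer emits, not the size of its weights.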
Since we can also build a regular LSTM network to reconstruct time series data, will that improve the results?
The hypothesis is that, without an encoding layer, the reconstruction accuracy can be better in some cases, because the time dimension is not reduced. Unless the encoded vector is required for some other analysis, a regular LSTM network is worth trying for rare event classification.
We applied the LSTM and LSTM Autoencoder models to 2,302 devices on which the error event occurred. In November and December, a total of 89,613 error events occurred. The LSTM network predicted 69,280 of them, i.e. 77.31%, while the LSTM Autoencoder network predicted 64,086, i.e. 71.51%.
Remark: The 89,613 error events account for only 2.7% of all observations in the two-month test dataset.
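The two percentages are the share of actual error events that each network flagged. A short check using the counts reported above:

```python
total_events = 89_613
predicted = {"LSTM": 69_280, "LSTM Autoencoder": 64_086}

# Share of the actual error events that each network flagged, in percent.
shares = {name: round(100 * n / total_events, 2) for name, n in predicted.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2f}% of error events predicted")
```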
Here are some ideas which might be further exploited to perhaps optimize the above result:
- the number of units for the LSTM layers in the network
- the epochs and batch_size values used while fitting the network
- the number of lags (timesteps) used to train the network
- regularization methods for the LSTM layers
- other possibilities for the choice of a threshold
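As an illustration of the last idea, the sketch below compares how many samples are flagged when the threshold is set at different quantiles of the reconstruction errors. The errors are synthetic and the candidate quantiles are arbitrary; neither comes from the experiment described here.

```python
import numpy as np

def flag_rate(errors, threshold):
    """Fraction of samples whose error exceeds the threshold."""
    return float(np.mean(errors > threshold))

# Synthetic, right-skewed reconstruction errors (illustration only).
rng = np.random.default_rng(0)
errors = rng.exponential(scale=1.0, size=10_000)

# Hypothetical candidate thresholds: the median and two higher quantiles.
thresholds = {q: np.quantile(errors, q) for q in (0.50, 0.90, 0.99)}
rates = {q: flag_rate(errors, t) for q, t in thresholds.items()}
```

A quantile q used as threshold flags about a fraction 1 - q of the samples it was computed on, so the choice directly trades missed events against false alarms.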
Trying out ideas such as the ones above needs to be done on devices where the error event occurs, so that we can see how well the model performs. Once we have a model with a satisfactory prediction result, we should be able to apply it to devices where the error event has not occurred.
This is because the LSTM and LSTM Autoencoder networks are trained only on the negative class, i.e. the error events are removed before training the model. Thus, a model applied to devices with only negative labels should be able to predict an error event as soon as ‘something unusual happens’.