the Viterbi algorithm cannot evaluate all the possible paths and therefore cannot return a sequence with certainty. To prevent this, we applied a smoothing technique that assigns a small probability to every transition and emission, even if it never occurred in the whole dataset. In this experiment we used the Pseudoemissions and Pseudotransitions parameters of MATLAB's HMM functions.
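As a minimal sketch of this smoothing, the following MATLAB fragment shows how the Pseudotransitions and Pseudoemissions parameters of hmmestimate add a count to every transition and emission, so that hmmviterbi can later score any path; the state and symbol counts, the toy sequences and the unit pseudocounts are illustrative assumptions, not the settings used in our experiment.

    % Assumed sizes and toy labelled data standing in for the real annotations.
    numStates  = 8;                           % illustrative number of activities
    numSymbols = 14;                          % illustrative number of sensor symbols
    states = randi(numStates,  1, 5000);      % known state labels
    seq    = randi(numSymbols, 1, 5000);      % observed symbols

    % One pseudocount per transition/emission so no probability is exactly zero.
    pseudoTR = ones(numStates, numStates);
    pseudoE  = ones(numStates, numSymbols);
    [TRANS, EMIS] = hmmestimate(seq, states, ...
        'Pseudotransitions', pseudoTR, ...
        'Pseudoemissions',   pseudoE);

    % With no zero entries in TRANS/EMIS, Viterbi can evaluate every path.
    testSeq      = randi(numSymbols, 1, 100);
    likelyStates = hmmviterbi(testSeq, TRANS, EMIS);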
3.3 Data Pre-processing

Let $\lambda$ be our model and let the training data extracted from our datasets be $\{(x_n, y_n)\}_{n=1}^{N}$, where each $y_n$ is the training label and each $x_n$ is the sensor reading sample. If we consider that $x_1$ occurred at time $t = 1$ and $x_N$ at time $t = N$, when the last annotation in the dataset was made, we need to specify the time between each pair of time steps $t = 1, 2, 3, \dots$ This is what we call timeslices. As the dataset has a resolution of milliseconds, we could assign a timeslice as small as that value. However, if we opted to do that we would be creating $1000 \times 60 \times 60 \times 24 = 86.4 \times 10^{6}$ samples per day (one per millisecond), which is computationally unmanageable. Timeslices of one second might be considered, but previous works as well as our own experiments suggest that this length is too costly in terms of computation time and does not yield a real improvement in model accuracy.
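As an illustration of this discretisation, the MATLAB sketch below bins a stream of millisecond-stamped sensor events into 60-second timeslices, producing one binary feature row per slice; the toy event stream, the sensor count and the binary 'fired' encoding are assumptions for the example, not a description of our exact feature construction.

    sliceLen   = 60 * 1000;                   % 60-second timeslice, in ms
    numSensors = 14;                          % illustrative sensor count

    % Toy event stream: (timestamp in ms since midnight, sensor id) pairs.
    eventTimes = sort(randi(24*3600*1000, 2000, 1));
    eventIds   = randi(numSensors, 2000, 1);

    numSlices = (24*3600*1000) / sliceLen;    % 1440 slices per day
    X = zeros(numSlices, numSensors);         % one feature row per timeslice

    sliceIdx = floor((eventTimes - 1) / sliceLen) + 1;  % event -> slice index
    for k = 1:numel(eventTimes)
        X(sliceIdx(k), eventIds(k)) = 1;      % mark 'sensor fired in this slice'
    end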
As discussed in previous sections, dataset D1 is divided into three different scenarios, named House A, House B and House C, comprising 25, 14 and 19 days of data respectively. We evaluated the three models using different sample lengths, with timeslices ranging from 30 seconds to 10 minutes per sample. We evaluated each scenario independently, performing a cross-validation that divides the data into days, testing on one of them while leaving the rest for training (leave-one-out), and then averaging the results.
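A minimal sketch of this leave-one-day-out protocol is shown below; the synthetic per-day data and the k-nearest-neighbour stand-in classifier (fitcknn) are illustrative assumptions and do not correspond to the three models evaluated here.

    numDays = 25;                             % e.g. House A in D1
    days = cell(numDays, 1);
    for d = 1:numDays                         % toy (features, labels) per day
        days{d} = struct('X', rand(1440, 14), 'y', randi(7, 1440, 1));
    end

    acc = zeros(numDays, 1);
    for testDay = 1:numDays
        trainDays = setdiff(1:numDays, testDay);          % leave one day out
        Xtr = cell2mat(cellfun(@(s) s.X, days(trainDays), 'UniformOutput', false));
        ytr = cell2mat(cellfun(@(s) s.y, days(trainDays), 'UniformOutput', false));
        mdl  = fitcknn(Xtr, ytr);             % stand-in for any of the models
        yhat = predict(mdl, days{testDay}.X);
        acc(testDay) = mean(yhat == days{testDay}.y);     % per-day accuracy
    end
    meanAccuracy = mean(acc);                 % average over the left-out days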
In the case of Dataset 2, two different setups were considered: the Timeslice Approach (TA) and the Chunk Data Approach (CDA).
3.3.1 Timeslice Approach (TA): As with D1, timeslices of 60 seconds were also the length of choice for slicing the data in the second dataset. Therefore, for 56 days with 1440 samples per day, a total of N = 80,640 samples were initially generated. However, the datasets we are using for our experiments are not fully labelled; that is, not every $x_n$ has an associated label $y_n$. This issue can be addressed in two different ways, as sketched below. The first solution is to create an 'idle' activity and treat it as another class, assigning every empty $y_n$ to that label. The other option is simply to remove those samples from the training data.
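The sketch below illustrates both options on synthetic data, assuming the labels are held in a numeric vector where 0 marks an empty $y_n$; this encoding and the id chosen for the 'idle' class are assumptions made for the example.

    X = rand(80640, 14);                      % one feature row per timeslice
    y = randi([0 7], 80640, 1);               % 0 = no annotation for that slice

    % Option 1: treat the gaps as an explicit 'idle' class.
    idleClass = 8;                            % assumed id for the new class
    y1 = y;
    y1(y1 == 0) = idleClass;

    % Option 2: drop the unlabelled samples from the training data entirely
    % (the choice ultimately taken for D2).
    keep = (y ~= 0);
    X2 = X(keep, :);
    y2 = y(keep);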
In a previous experiment using D1, the absence of activity associated with sensor firings was treated as an 'idle' activity, and the preliminary results suggested similar performance whether this data was included or not. We decided to keep these samples in order to maximize the use of the readings in D1. However, for D2 this approach was not advisable, based on the first attempts to include 'idle' in this dataset. The amount of unlabelled data in D1 was just 12%, 7% and 19% for Houses A, B and C respectively. For D2, however, the amount of sensor information unassociated with any activity (the frequency of empty $y_n$'s) accounted for more than 80% of the total samples. As a result, the classifiers trained with D2 data predicted only the class 'idle' for all the test points, due to the massive imbalance introduced by this new class and to the fact that any sensor firing combination could carry an 'idle' label, since we do not know which activities were actually occurring during those blank timesteps. To solve this, all sensor-event information that was not related to any label was simply removed from the feature array, so the label 'idle' was not considered. Ultimately, the TA approach