
Forecasting Energy Consumption using Machine Learning

Managing electrical energy consumption is crucial for one simple reason: electricity cannot be stored unless converted to other forms. Produced electricity is best consumed instantly; otherwise, additional resources and costs are incurred to convert and store the excess energy. Energy-efficient buildings provide both economic and environmental benefits, maximising profits and social welfare. Conversely, underestimating energy consumption can be fatal, with excess demand overloading the supply line and even causing blackouts, leading to operational downtime. Clearly, there are tangible benefits in closely monitoring the energy consumption of buildings, be they office, commercial or household.

With the advent of machine learning and data science, accurately predicting future energy consumption becomes increasingly possible. This provides two-fold benefits: firstly, managers gain key insights into factors affecting their building’s energy demand, providing opportunities to address them and improve energy efficiency. Not only that, forecasts provide a benchmark to single out anomalously high/low energy consumption and alert managers to faults within the building. A key assumption behind time-series forecasting is that energy consumption follows recurring trends: an office building might have similar daily energy demand patterns across working days. By exploiting these cyclical trends or ‘seasonality’, educated predictions can be made about future energy consumption on a multitude of scales, from 1 hour ahead to 1 day ahead.

However, the difficulty lies in the nonlinearity and volatility of real-time energy usage, which is highly susceptible to changes in external factors. For instance, ambient temperature is known to significantly influence a building’s energy demand via heating and air-conditioning [1]. Furthermore, there can be unexpected surges and drops in energy consumption due to equipment failure, supply failure, or simply random fluctuations.

Our task was to predict a building’s energy consumption 1 day ahead of time based on 2-year historical energy demand data provided in 15-minute intervals, from July 2014 to May 2016. In addition, we were given temperature data from 4 locations of varying (undisclosed) distances from the building, in the order wx1 (nearest), wx2, wx3 and wx4 (farthest). We used a conventional Artificial Neural Network as it is capable of capturing complex, non-linear relationships between diverse numerical data, and relatively fast to build and train, compared to more sophisticated architectures like Long Short-Term Memory networks (LSTMs).

We used two metrics to evaluate our model: Mean Squared Error (MSE) (noting that reducing MSE also reduces Root Mean Squared Error, RMSE) and lag. Mean Squared Error measures the average of the squared differences between actual and predicted values:
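In symbols, using the standard definition, for n actual values y_i and predictions ŷ_i:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$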

As for our model’s loss function, we used Euclidean loss, mathematically analogous to MSE. The strength of MSE is that it punishes the model for larger errors due to its squared nature, reducing our model’s likelihood of making extreme predictions, which would be costly or even dangerous. Minimally, our model must achieve a lower MSE than persistence, a trivial benchmark forecast where the predicted value 1 day ahead equals the observed present value. Persistence is a good starting benchmark because of the highly periodic nature of energy consumption [2].
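Since our data comes in 15-minute intervals, a 1-day horizon is 96 timesteps, and persistence reduces to:

$$\hat{y}_{t+96} = y_{t}$$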

As for lag, our goal is a peak lag of 0 between our predictions and the actual energy consumption values, meaning our model, on average, is not delayed in its predictions and can capture rising/falling energy consumption on time.

Our workflow for this problem involved: exploratory data analysis, pre-processing (gap filling and normalisation), feature engineering, iterative model training and feature selection on AutoCaffe, and evaluation against the persistence benchmark.

As the dataset given is anonymised with minimal context, we first scrutinised it to build a comprehensive intuition for effective feature engineering. Arguably, this is the most important step of any machine learning project, and we spent close to an entire week (out of ~2 weeks) on it, as firm believers in ‘garbage in, garbage out’. We discovered that both the energy and temperature data contain a non-trivial number of missing values, necessitating an effective filling method. Further, wx4 has very sparse data (only from 2016), so we were unlikely to make much use of it.

Firstly, we plotted the energy data for 2015, the year with the most complete data (unlike 2014 and 2016). Mean monthly values were superimposed to offer a clearer overview of trends across months.

As seen in the graph, temperature around the building ranges from below 0 to 30 °C; given that the cold months run from December to February and the warm months from June to August, the building should be in the Northern Hemisphere, at latitude >30°. Interestingly, two local maxima of energy consumption exist, at the two tail ends of temperature: once during the coldest months, and again during the hottest month (July), suggesting that air-conditioning and heating are significant drivers of energy demand. Across the year, we identified 3 different energy-temperature regimes:

Winter, December to February: Frequent and large fluctuations in energy consumption, with relatively large mean energy consumption. Temperature is generally below 10 °C.

Summer, June to August: Frequent but smaller variations in energy consumption compared to winter. Energy consumption steadily increases with temperature. Temperature is generally above 20 °C.

Transition: March to May & September to November: Relatively constant and stable energy consumption pattern, with small fluctuations. Temperature ranges from 10 to 20 °C.

This analysis inspired two dummy variables (values either 1 for True or 0 for False): 1) is_season_winter and 2) is_season_transition, to facilitate better learning by the neural network. Note that an is_season_summer column would have value 1 (True) exactly when both is_season_winter and is_season_transition are 0 (False); thus, we dropped the summer column to avoid the Dummy Variable Trap [3], where one variable can be straightforwardly inferred from one or more other variables, leading to multicollinearity issues.
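As a minimal sketch of this encoding (using a hypothetical season label column; pandas drops the alphabetically first category, which conveniently is summer here):

```python
import pandas as pd

# Hypothetical 'season' labels derived from the month of each timestamp
df = pd.DataFrame({"season": ["winter", "summer", "transition", "winter"]})

# drop_first=True drops one category ('summer', alphabetically first), since
# it is fully implied by the remaining dummies; this avoids the Dummy
# Variable Trap and the resulting multicollinearity
dummies = pd.get_dummies(df["season"], prefix="is_season", drop_first=True)
print(dummies)  # columns: is_season_transition, is_season_winter
```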

Moving on, we plotted the time series of energy consumption over the entire time frame available.

We realised that energy consumption for July-Oct 2014 was anomalously low. There could be a variety of reasons: the building could have been newly built and slowly ramping up operations (hence not at full load), or undergoing maintenance. While discarding data is normally discouraged, we decided to do so here, as such anomalous data would hurt more than help a model that relies on historical data to make predictions. Thus, our energy data begins from 29 October 2014.

We then visualised energy consumption across different days of the week, calculating the mean, max and min consumption for each day of the week across the entire year, first excluding public holidays.

We observed that, on the whole, energy consumption was significantly lower during the weekends, implying the building is likely an office building — busy on weekdays, empty on weekends, rather than a shopping mall or a library. To exploit this pattern, we created a dummy variable on whether the day being predicted for was a weekend, called is_weekend.

Next, we plotted the distribution of energy consumption for each month, categorised into weekdays, weekends and public holidays.

We were able to make two observations from the figure above. Firstly, energy consumption on weekdays was clearly higher than on weekends and public holidays in general. Secondly, while there are significant counts of anomalously high energy demand on weekends, the overall distributions of energy consumption for weekends and public holidays are very similar. This implies that weekdays that are also public holidays should not be expected to follow normal weekday consumption patterns; interpolating such a day from non-holiday weekday values would most likely overestimate consumption. Instead, a value from the nearest previous weekend or public holiday offers a more reliable proxy to fill the data.

Generally, on weekdays, energy consumption picks up sharply at 7 am and drops off sharply after 6 pm, most likely the standard working hours of the building. Note that some of the plots look strangely shaped or have strange axis labels because of missing values, which further illustrates the need to fill these gaps. Zooming into a plot of average energy demand by hour, computed only on the training set of our 70/30 split (to avoid data leakage, so that our observations would not apply only to the test data), we found that on average there is a noticeable drop in energy consumption around 12 pm, which we attribute to office lunchtime hours.

As such, we decided to introduce the dummy variables is_lunchtime (hour = 12 on weekdays that are not public holidays) and is_working_hours (between 7 am and 6 pm on weekdays that are not public holidays), to further assist the neural network in identifying recurring trends.

Moving on, we plotted an autocorrelation plot of energy consumption to identify cyclical patterns backed by statistical analysis rather than ‘eye-balling’.

As the data is given in 15-minute intervals, 24 hours corresponds to 96 timesteps, 12 hours to 48 timesteps, and so on. Energy consumption at a particular hour was most strongly correlated with the same hour of the day before. This relationship weakens as the number of days increases but peaks again at 672 timesteps (1 week apart), which in fact shows stronger correlation than 1 day apart. Autocorrelation was weakest at 12 hours apart. This hinted that strong predictive features may include T:-576 (6 days before the current time, i.e. 1 week before the time being predicted), T:0 (1 day before the time being predicted) and T:-96 (2 days before the time being predicted).
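For reference, such an autocorrelation plot can be produced with statsmodels (assuming the series is loaded from a hypothetical energy.csv):

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Hypothetical file; at 15-minute resolution, 96 lags = 1 day, 672 = 1 week
energy = pd.read_csv("energy.csv", parse_dates=["timestamp"],
                     index_col="timestamp")["energy"]

plot_acf(energy.dropna(), lags=700)  # covers just past the 1-week peak
plt.show()
```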

Now, let us delve into the relationship between energy consumption and temperature. We plotted energy consumption against temperature and attempted to fit a polynomial trend, inspired by Valor et al 2001 [1].

While the best fit lines are clearly not ideal, the scatterplots still reveal useful insights. At the tail ends of temperature (too hot or too cold), energy consumption tends to rise, most likely due to increased air conditioning or heating respectively. Moreover, the relationship is unlikely to be purely linear, showing hints of a quadratic one with a ‘most comfortable’ temperature at about 19 °C. To explore this further, we crafted a correlation heatmap using the Python Seaborn library, dividing the data into winter, summer and transition months.

Firstly, we observed that wx3 has noticeably higher absolute correlation values across all periods (0.29 vs 0.24), in winter (0.051 vs 0.0079 & 0.016) and in the transition months (0.17 vs 0.11); only in summer is it slightly lower (0.43 vs 0.46) than wx1 and wx2. This was also confirmed by preliminary investigations with feature importance values in XGBoost, which consistently ranked wx3 at T+96 higher than wx1 or wx2. Thus, we focused on creating windowed temperature features mostly from wx3.

Next, a quadratic relationship seems to slightly outperform the linear one, with higher absolute correlation values in summer (for all 3 temperature sensors) and in winter (for wx3). Therefore, on top of the raw energy values, the squared value of wx3 at T+96 (the time being predicted) might be a useful feature to consider.

Lastly, in winter, both raw and squared temperature have very poor correlation with energy. This might be due to greater and more extreme fluctuations in temperature, and perhaps because the building’s heating systems run continuously in winter, making demand less ‘sensitive’ to temperature.

We conducted data pre-processing in Python instead of AutoCaffe, mainly because our team is more proficient in Python libraries than in Smojo, and it gave us greater control in creating more specific features like dummy variables. The general pre-processing pipeline involved aligning temperature data to 15-minute intervals, interpolation and normalisation.
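As a rough sketch of the alignment step (file names and column layouts here are assumptions):

```python
import pandas as pd

# Hypothetical raw files; the actual paths and column names differ
energy = pd.read_csv("energy.csv", parse_dates=["timestamp"], index_col="timestamp")
wx3 = pd.read_csv("wx3.csv", parse_dates=["timestamp"], index_col="timestamp")

# Resample temperature onto the 15-minute grid, then align it to the
# energy timestamps so every row carries both readings
wx3_aligned = wx3.resample("15min").mean().reindex(energy.index)
```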

There are 9238 missing energy values in our selected period (2014-10-29 00:00:00 to 2016-05-26 20:15:00), approximately 17% of the entire timeline. We first implemented linear interpolation, replacing missing values with the mean of the values just before and just after the gap. However, we realised two critical mistakes. First, and most importantly, this leaks data: computing that mean uses values from after the missing timestamp, which would not be available in the real world. Secondly, we observed that missing values usually span periods of up to two days, leaving complete blanks of 96, 192 or more consecutive timesteps (24 h = 96 × 15 min). Linear interpolation therefore fails to capture the inherent seasonality of energy consumption driven by factors like temperature and working hours. We then tried simple forward filling, replacing each missing value with the value exactly 24 hours before. However, this did not reflect the weekly trend well, because of the difference between weekdays and weekends/public holidays.

Therefore, we implemented a customised filling method that considers the type of day (public holiday/weekend/weekday). If the missing-value day is not a public holiday, the missing value is replaced with the value exactly one week before (provided that value is not itself missing, which luckily never happens in this dataset), as our data analysis had suggested a strong correlation at that lag. Otherwise, if the missing-value day is a public holiday, it is replaced with the nearest previous weekend value, as supported by our data analysis. For example, if the energy consumption value on Monday 11:00 am is missing and that Monday happens to be a public holiday, it is replaced with the energy value of Sunday 11:00 am. All in all, we ensured that the type of day (weekend/public holiday/weekday) being brought forward matches the missing-value day and that no NaN values are forwarded.
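A minimal sketch of this filling logic, assuming a 15-minute-indexed energy series and a date-keyed boolean holiday lookup (the helper names are ours):

```python
import pandas as pd

def fill_energy(series: pd.Series, is_holiday: pd.Series) -> pd.Series:
    """Fill gaps: non-holidays take the value exactly 1 week earlier;
    public holidays take the nearest previous weekend/holiday value
    at the same time of day."""
    filled = series.copy()
    for ts in filled[filled.isna()].index:
        if not is_holiday.get(ts.normalize(), False):
            # Same time of day, exactly one week before (assumed present)
            filled[ts] = filled[ts - pd.Timedelta(weeks=1)]
        else:
            # Walk back whole days until we hit a weekend or holiday
            prev = ts - pd.Timedelta(days=1)
            while prev.weekday() < 5 and not is_holiday.get(prev.normalize(), False):
                prev -= pd.Timedelta(days=1)
            filled[ts] = filled[prev]  # checked so that no NaN is forwarded
    return filled
```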

We believe that this method more accurately reflects the seasonality and is relatively easy to implement as opposed to more sophisticated methods. In fact, we did consider two improvements:

As for temperature, where we have access to the next 24 hours of future data from weather forecasts, we chose linear interpolation for two reasons. Firstly, unlike energy consumption values, missing temperature values do not occur over long stretches, so linearly interpolated data can still capture most of the trend. Secondly, as we are given future data, data leakage is not an issue (as confirmed by Arnold). Interpolation was not done for wx4, due to the sparsity of that dataset.

After interpolating all the missing values, we normalised all values in the energy and temperature datasets using MinMax scaling, which scales every value into the range [0, 1]. This standardisation of all feature inputs is critical for neural networks, ensuring that any differences in feature importance are due to the feature itself and not its numerical magnitude. We also took care to compute the min/max values from the training data only, to prevent data leakage from the test set.
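In scikit-learn terms, the leakage-safe pattern looks like this (placeholder arrays stand in for our real feature matrices):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for the actual feature matrices
X_train = np.random.rand(100, 5)
X_test = np.random.rand(30, 5)

scaler = MinMaxScaler()                          # scales each column into [0, 1]
X_train_scaled = scaler.fit_transform(X_train)   # min/max taken from training only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics
```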

Our approach to generating the best possible model involved training in AutoCaffe and adding or dropping one feature at a time based on the test score and lag achieved, a methodology akin to one-at-a-time sensitivity analysis. While AutoCaffe and the compute resources provided allowed for fast training, one limitation was that we could not automate the permutation of features and had to do it manually. To minimise time spent tediously permuting features, we relied on our data analysis, domain knowledge, extensive feature engineering and XGBoost SHAP feature importance values to cut down the feature combinations to experiment with.

One big assumption of the one-at-a-time approach is that features are independent of one another, with minimal feature interaction. Since this assumption is unlikely to always hold, we also ran certain combinations of features together, based on intuition and domain knowledge gained from the scientific literature. We were also careful to conduct sufficient repeats and to consider the variance in final test losses due to the random Xavier initialisation of weights.

During our literature review, we discovered a creative manipulation of time data: Moon et al 2019 [2] transformed calendar time into a 2D continuous format. While calendar data like month, day, hour and minute have periodic properties, they are usually represented as sequential values, which lose some of that periodicity. For example, 0000 hrs follows right after 2359 hrs, but this wrap-around is not captured at all by a 1D sequential representation. To reflect these periodic properties, Moon et al [2] utilised the following equations (EoM = end of month, i.e. the number of days in that month; February’s EoM in a non-leap year is 28):
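The equations themselves appeared as an image in the original post; a sin/cos encoding consistent with the description above (our reconstruction, not necessarily Moon et al’s exact form) is:

```python
import numpy as np
import pandas as pd

def encode_time_2d(ts: pd.Timestamp) -> dict:
    """Map calendar time onto the unit circle so that 23:59 sits next to
    00:00 and December next to January. Feature names are ours."""
    eom = ts.days_in_month                 # EoM: number of days in the month
    minutes = ts.hour * 60 + ts.minute     # minutes elapsed in the day
    return {
        "month_x": np.sin(2 * np.pi * (ts.month - 1) / 12),
        "month_y": np.cos(2 * np.pi * (ts.month - 1) / 12),
        "day_x": np.sin(2 * np.pi * (ts.day - 1) / eom),
        "day_y": np.cos(2 * np.pi * (ts.day - 1) / eom),
        "time_x": np.sin(2 * np.pi * minutes / 1440),
        "time_y": np.cos(2 * np.pi * minutes / 1440),
    }
```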

Using the above transformation, we observed that our test loss improved by roughly 2.5 to 4% across multiple feature combinations when we used the 2D representation of time.

Another interesting feature we wanted to explore was the concept of “public holiday/weekend inertia”, proposed by Valor et al 2001 [1]. Their data analysis suggested that energy consumption in office buildings is systematically lower on working days after weekends (i.e. Mondays) or after public holidays, because of inertia from the reduced economic activity of the non-working day. Features to exploit this phenomenon could be ‘days since last public holiday’ and ‘days since last weekend’. However, careful analysis suggested that this inertia effect was not present in our dataset, and we did not pursue it further.

For preliminary investigations, we prepared a Pandas dataframe containing the raw values of energy, temperature (wx1 to wx3), datetime features (like month and day) and windowed features (like min, max, mean, range, first-order differences, mean of first-order differences, second-order differences and so on). Our choice of windows was guided by the cyclical patterns of energy consumption, chiefly the 1-day (96-timestep) and 1-week (672-timestep) cycles identified in the autocorrelation analysis.

We considered using windows larger than 1 week but ultimately did not, because they would shorten our already limited data, and, per the autocorrelation plot, energy values more than 1 week old are less strongly correlated with energy at T:0, so we felt they might only introduce more noise. A sketch of how such windowed features can be built is shown below.
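A minimal sketch using pandas rolling windows (the file name and feature names are ours):

```python
import pandas as pd

DAY, WEEK = 96, 672  # timesteps per day/week at 15-minute resolution

# Hypothetical file holding the gap-filled energy series
energy = pd.read_csv("energy.csv", parse_dates=["timestamp"],
                     index_col="timestamp")["energy"]

feats = pd.DataFrame(index=energy.index)
feats["Emean_1d"] = energy.rolling(DAY).mean()   # 1-day moving average
feats["Emax_1w"] = energy.rolling(WEEK).max()    # weekly maximum
feats["Emin_1w"] = energy.rolling(WEEK).min()    # weekly minimum
feats["Ediff1"] = energy.diff()                  # first-order difference
feats["Ediff2"] = energy.diff().diff()           # second-order difference
```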

With the windowed features of energy and wx3 temperature, dummy variables and 2d time features, we conducted preliminary experiments on AutoCaffe to eliminate unhelpful features, such as ‘range’, ‘skew’ and ‘kurtosis’. We had hoped that ‘skew’ and ‘kurtosis’ could signal to our model the recent presence of extreme values (e.g. a short, sudden heatwave with higher than normal temperatures against a background of normal temperatures) that might increase its robustness in anticipating unexpected events. Unfortunately, these features did not improve our test loss and lag despite repeated experiments. Regarding wx4, we did try our best to utilise it, such as by having a ‘previous month’s average temperature’ calculated across all 4 sensors, but these did not improve our results.

After about 150 experiments, we generated a refined list of ~130 features.

At this stage, to provide rigorous justification to our feature selection process, we tapped on the Python XGBoost library, a fast and user-friendly implementation of the gradient-boosting decision trees algorithm. We chose decision trees as they are better at handling high-dimensional datasets (>100 columns of features) than deep neural networks, which are more prone to drawing poor decision boundaries due to the curse of dimensionality and unimportant inputs.

We fed the ~130 features into an XGBoost regressor model to predict the difference between T:0 and T+96 energy values (mimicking the ‘difference’ neural network in AutoCaffe). Interestingly, with fairly standard hyperparameters, this model achieved a test MSE of 0.010853 (after a factor of 0.5), which already beats the persistence score of 0.019377 by 44%, although we did not visualise its lag correlation or scatterplot.
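For illustration, a difference-style XGBoost setup might look like the sketch below; the placeholder data and hyperparameter values are ours, not the exact settings we used:

```python
import numpy as np
import xgboost as xgb

# Placeholder data standing in for our ~130 engineered features and the
# T+96 minus T:0 energy difference target
X_train, y_train_diff = np.random.rand(1000, 130), np.random.rand(1000)
X_test = np.random.rand(300, 130)

# Illustrative, fairly standard hyperparameters
model = xgb.XGBRegressor(n_estimators=500, max_depth=6,
                         learning_rate=0.05, subsample=0.8)
model.fit(X_train, y_train_diff)
pred_diff = model.predict(X_test)  # predicted energy *difference* at T+96
```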

Using the Python SHAP library [4, 5], we could easily visualise the contribution of various features to the XGBoost model’s outputs. SHAP was chosen for its consistency and accuracy across models, properties many feature importance schemes lack [6], including XGBoost’s/scikit-learn’s built-in versions. We ranked the top features out of the ~130 by SHAP value (higher = more important) and focused on permuting these top features during an additional round of experimentation on AutoCaffe. These top features are very likely to facilitate better clustering of the data, which should transfer to neural networks. Of course, we understood that inherent differences exist between the algorithms of gradient-boosted trees and neural networks; the SHAP feature importance values are not the be-all and end-all, and we did include other features occasionally.
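With a fitted tree model like the regressor sketched above, the SHAP workflow is roughly:

```python
import shap

# TreeExplainer provides consistent, exact SHAP values for tree ensembles
explainer = shap.TreeExplainer(model)       # 'model' from the sketch above
shap_values = explainer.shap_values(X_test)

# Beeswarm summary: features ranked by mean |SHAP value|, coloured by value
shap.summary_plot(shap_values, X_test)
```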

Our best neural network achieved a test loss of 0.00907207 (with 50 repeats) and test lag of 0, beating our persistence of 0.01937720 by 53%.

The 36 features we used for our difference network (many of them from the SHAP ranking in figure 14) included, among others, the day of the week (0 = Monday, 6 = Sunday, scaled by MinMax to [0,1]).

For most experiments, the number of layers was kept at 3 or 4, screening first-layer perceptron counts of 32, 64, 128 and occasionally 256, with ReLU activation for fast training. We quickly found that perceptron counts of 32 to 64 gave the best results, and that tanh activation gave superior test loss, albeit at the cost of slower training. The intuition for why tanh outperforms ReLU here might be its smooth non-linearity, which could be crucial for mapping complex relationships in the energy/temperature data. tanh also avoids some issues faced by ReLU, like the ‘dying ReLU’ problem, where a neuron whose pre-activation stays negative outputs zero, receives no gradient, and is unlikely to ever recover.

2 layers were insufficient for optimal learning, while 5 or 6 layers quickly overfitted. Control experiments with the SGD optimiser gave dismal results. Trials with square perceptrons, autoencoders/scaling and force/momentum losses did not improve our results.
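Our actual networks were built in AutoCaffe (Smojo); a rough Keras analogue of the best-performing configuration, with illustrative layer sizes, might look like:

```python
from tensorflow import keras

# Rough Keras analogue of our AutoCaffe setup: 3-4 dense tanh layers of
# 32-64 units on the 36 inputs; Keras's default glorot_uniform initialiser
# is the Xavier initialisation mentioned above
model = keras.Sequential([
    keras.layers.Input(shape=(36,)),
    keras.layers.Dense(64, activation="tanh"),
    keras.layers.Dense(32, activation="tanh"),
    keras.layers.Dense(32, activation="tanh"),
    keras.layers.Dense(1),  # the T+96 energy difference
])
model.compile(optimizer="adam", loss="mse")  # Euclidean loss ≈ MSE up to a constant
```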

Below are zoomed in graphs of predictions vs actuals, which again highlight the strength of our model in closely predicting the seasonal patterns in early winter and the spring months (green boxes), and its weakness in failing to predict some extreme values during February to March 2016 (red boxes).

The SHAP feature values are an extremely rich source of information about the many relationships in our dataset. In contrast, it is more difficult to delve into the ‘black box’ of neural networks to understand how weights and biases at each layer tell a story about different features. After all, it would be more beneficial for the energy forecasting community if we can gain insights into how certain features shape the model’s output, rather than blindly hunting for the lowest test loss, where the resultant model may not transfer to different contexts. Therefore, we made it a point to visualise the SHAP feature importance graphs again on a ‘difference’ XGBoost regressor fitted on just our best 36 features.

For working hours, it is no surprise that a clear separation exists: a high value (i.e. 1) at the time being predicted for tends to increase predicted energy consumption, while a low value (i.e. 0) decreases it. Still, the spread of model outputs hints that is_working_hours interacts with other features.

More interestingly, the graph suggests that high energy consumption 1 day and 15 minutes ago (ET:-1) is more likely to result in decreased predicted energy consumption now, with the converse also true (although at much lower magnitude). Additionally, for high values of ET:-1, we see hints of feature interaction from the variance in model output, ranging from -0.25 to ~0. Lastly, we observe that a high maximum energy consumption over the past week from T+96 (Emax_0to576) tends to slightly increase the model’s output, while low values have minimal effect. As for temperature, low values of the moving weekly average temperature (wx3mean_-0to-672) can both increase and decrease the model’s output, while high values seem to have no effect. The reasons for these are not immediately clear, and follow-up studies could examine them further.

In conclusion, we have built a relatively accurate neural network to predict energy consumption for a building 1 day ahead, trained on about a year of historical energy data as well as temperature forecasts up to a day ahead.

The strengths of ANNs include the sheer performance bump over other machine learning methods, especially on conventionally difficult problems. The trade-off is that large amounts of data are required (the curse of dimensionality), and computational costs can be heavy in both time and money. The inner workings of how an ANN learns also remain a “black box”, creating a significant need for manual, evidence-based and statistical feature selection to explain the resulting model; hence our time was most heavily invested in data analysis, feature engineering and selection.

It must be noted that the results of this project are effectively a “validation” loss, since we used the test loss values to change our feature combinations and improve our model. Given the limited data, we did not have the privilege of a separate validation set and relied on the test loss as a proxy. Having conducted a few hundred experiments, we may have subtly overfitted to this set. Thus, it would be ideal to evaluate our model again on unseen energy data for the same building, for a more unbiased estimate of its predictive power.

While we hope that our findings are applicable to other contexts, the type of building and its climate should always be considered. The same trends may not apply for a residential or commercial building, or an office building located in an equatorial climate.

Overall, we have truly learnt a tonne from this end-to-end experience, from exercising our object-oriented Python programming skills in building the pre-processing, feature engineering and windowing pipelines, sharpening our data and statistical intuition with extensive visualisations and literature reviews, to understanding the caveats behind different machine learning approaches and making cautious decisions based on algorithms’ results. We brainstormed so many ideas during the competition (including having a model for winter & transition season and a separate model for summer) but had only so much time (and data) to try them all. To end off, we thank ai4impact and NTU CAO for organising such a fun and valuable opportunity!

1. Valor, E., V. Meneu, and V. Caselles, 2001: Daily Air Temperature and Electricity Load in Spain. J. Appl. Meteor., 40, 1413–1421, https://doi.org/10.1175/1520-0450(2001)040<1413:DATAEL>2.0.CO;2

2. Moon, J., Park, S., Rho, S., & Hwang, E. (2019). A comparative analysis of artificial neural network architectures for building energy consumption forecasting. International Journal of Distributed Sensor Networks. https://doi.org/10.1177/1550147719877616

4. Lundberg, S.M., Erion, G., Chen, H. et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2, 56–67 (2020). https://doi.org/10.1038/s42256-019-0138-9

5. Lundberg, S.M., Nair, B., Vavilala, M.S. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat Biomed Eng 2, 749–760 (2018). https://doi.org/10.1038/s41551-018-0304-0
