UNIVERSITY OF ZAGREB
FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING

Marin Matijaš

ELECTRIC LOAD FORECASTING USING MULTIVARIATE META-LEARNING

DOCTORAL THESIS

Zagreb, 2013

SVEUČILIŠTE U ZAGREBU
FAKULTET ELEKTROTEHNIKE I RAČUNARSTVA

Marin Matijaš

PREDVIĐANJE POTROŠNJE ELEKTRIČNE ENERGIJE MULTIVARIJATNIM METAUČENJEM

DOKTORSKI RAD

Zagreb, 2013.

UNIVERSITY OF ZAGREB
FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING

Marin Matijaš

ELECTRIC LOAD FORECASTING USING MULTIVARIATE META-LEARNING

DOCTORAL THESIS

Supervisor: Professor Slavko Krajcar, PhD

Zagreb, 2013

SVEUČILIŠTE U ZAGREBU
FAKULTET ELEKTROTEHNIKE I RAČUNARSTVA

Marin Matijaš

PREDVIĐANJE POTROŠNJE ELEKTRIČNE ENERGIJE MULTIVARIJATNIM METAUČENJEM

DOKTORSKI RAD

Mentor: Prof. dr. sc. Slavko Krajcar

Zagreb, 2013.

Doctoral thesis submitted to the Department of Power Systems, Faculty of Electrical Engineering and Computing of the University of Zagreb.

Supervisor: Professor Slavko Krajcar, PhD

Doctoral thesis has 143 pages.

Doctoral thesis nr. ________ .

About the supervisor

Slavko Krajcar was born in Krajcar brijeg in 1951. He received his B.Sc., M.Sc. and Ph.D. degrees from the University of Zagreb, Faculty of Electrical Engineering and Computing (FER), Croatia. He has been working at the Power System Department at FER since January 1974. He became an associate professor in 1997 and a full professor in 2002. His research interests include large-scale energy systems, with a narrower focus on the development of methodology for modelling complex electricity market structures and of optimization methodology for the planning and development of distribution networks. His work has resulted in more than one hundred scientific and technical papers and studies, most of them published in the proceedings of international conferences, in international journals or as chapters in books. He has been the project leader for numerous R&D projects funded by the Ministry, international agencies and the business sector.
He was the dean of the Faculty for two terms and a member of the University Senate for four terms. He has been the president or a member of various governing and steering committees (HEP, CARNet, SRCE, Digitron, ECS, IRB and others) and a member of the National Competitiveness Council. Professor Krajcar is a senior member of IEEE and a member of CIRED, CIGRE and HDR (Croatian Lighting Society). He is a full member of the Croatian Academy of Engineering. He has served on the programme committees of several international conferences, is a member of several journal editorial boards, and serves as a technical reviewer for various international journals. In 2002 he received the "Josip Lončar" award (Golden Plaque) for distinctive success in education, research and the development of the Faculty.

O mentoru

Slavko Krajcar, rođen je 1951. godine u Krajcar brijegu, općina Žminj. Godine 1969. je upisao, a 1973. diplomirao na tadašnjem Elektrotehničkom fakultetu Sveučilišta u Zagrebu (danas Fakultet elektrotehnike i računarstva), doktorirao 1988. godine, 1990. izabran je u znanstveno nastavno zvanje docenta, 1997. godine u zvanje izvanrednog profesora, a 2002. godine u zvanje redovitog profesora, a 2007. u redovitog profesora u trajnom zvanju. Slavko Krajcar radi na Zavodu za visoki napon i energetiku FER-a od siječnja 1974. Šire mu je znanstveno područje energetika, a uže optimizacijske metode u planiranju i razvoju razdjelnih mreža te modeliranje odnosa na tržištu električne energije. Vodio je mnoge domaće i inozemne znanstvene projekte te na desetine projekata suradnje s gospodarstvom. Objavio je više od 100 znanstvenih radova publiciranih u časopisima, zbornicima radova i poglavljima knjiga. Bio je dekan u dva mandata (1998. do 2002.). Bio je predsjednik ili član mnogih nadzornih i upravnih odbora poduzeća i ustanova (Hrvatska elektroprivreda, Digitron, ECS, SRCE, CARNet, Ruđer Bošković i drugi). Bio je član Nacionalnog vijeća za konkurentnost. Prof.
Krajcar član je stručnih udruga IEEE, CIGRE, CIRED, HDR i Akademije tehničkih znanosti Hrvatske (HATZ). Sudjeluje u više programskih odbora znanstvenih konferencija, član je uredničkih odbora tuzemnih i inozemnih znanstvenih časopisa te sudjeluje kao recenzent u većem broju inozemnih časopisa. Godine 2002. primio je zlatnu plaketu "Josip Lončar" FER-a za nastavni, znanstveni i svekoliki razvoj Fakulteta.

Mojoj Teni

Acknowledgments

I am very grateful to…
my families, for their love, support and contribution to this thesis.
my supervisor Slavko, for his unbounded patience and encouragement.
my host Johan, for his encouragement, wisdom and confidence in me.
Carlos, for being awesome, funny and a well of knowledge.
Tomislav, for his help and many inspiring ideas.
Siamak, Xiaolin, Sarah, Philippe, Marin, Kris, Milan, Amin, Bojana, Matko, Mauro, Boško, Diego, Rocco, Ms. Mirjana, Dries, Jurica, Griet, Ivan, Vedran, Laure, Raf, Marco, and all other great people at ESAT, KU Leuven and FER, University of Zagreb.
Roko, Vanja, Darko, Vlado and my other co-workers, my conference buddies, the Rapid-I team, the PowerTAC team, Tao, Ben, and all others who helped and supported me on this journey.
Thank you!

Abstract

Nowadays, modern economies depend on electricity. Its production and consumption (load) have to be in equilibrium at all times, since storing electricity in substantial quantities results in high costs. Load forecasting is the task of predicting future electricity consumption or power load. The main goal of load forecasting is to reduce the load forecasting error. A reduction of the load forecasting error leads to lower costs and higher welfare. Besides keeping power systems in balance and controlling them, load forecasting is important for power system planning, power market operation, power market design and security of supply.
Traditionally, load forecasting is solved at the level of a particular task, with periodic attempts to create a more general framework for doing so. A general framework for load forecasting would benefit most market participants in dealing with their heterogeneous load forecasting tasks. The deregulation of power systems created new markets, and the advent of the Internet and new technology provided the opportunity for new functionalities and services with more added value for the end customers. This holds mainly at the retail level, where profit margins are not high and a small participant faces, in everyday operations, the problem of modelling under considerable uncertainty and across different types of forecasting tasks. A general framework for load forecasting would help in modelling heterogeneous load forecasting tasks within a single forecasting system and would help non-experts in algorithm design to choose the most suitable algorithm for their load forecasting task.

In this doctoral thesis, I propose the use of meta-learning as an approach to electric load forecasting; this includes a framework for electric load forecasting model selection. I chose meta-learning because it is rooted in the idea of learning how to solve problems of different characteristics better. It is well suited to load forecasting because it learns and evolves over time, just as power systems live and evolve over time. So far, it has been applied neither to power systems engineering nor to multivariate time-series forecasting. The proposed approach is based on a new method of using meta-learning for forecasting multivariate time-series. It is modular and learns on two levels: the meta-level and the load forecasting task level. The proposed solution uses an ensemble of seven algorithms for classification at the meta-level and seven other algorithms for forecasting at the load forecasting task level.
On three load forecasting tasks made from publicly available data, I show that, for a recurrent real-life simulation of one year of forecasting, a meta-learning system built on 65 load forecasting tasks returned the lowest load forecasting error when compared to 10 well-known algorithms. On a fourth load forecasting task, similar results were obtained, although for a shorter forecasting period and using an ex-ante approach. I also propose an approach for the definition and selection of meta-features for time-series; this resulted in new meta-features applicable to the meta-learning of load forecasting. The approach resulted in the introduction of the new meta-features of traversity, fickleness, granularity and highest ACF. The proposed meta-learning framework is intended for use in practice; it is therefore parallelized, component-based and easily extendable.

Keywords: Artificial Neural Networks; Demand Forecasting; Electric Load; Electricity Consumption; Retail Electricity; Estimation; Gaussian Processes; General Load Forecaster; Least Squares Support Vector Machines; Meta-Learning; Meta-features; Power; Prediction; Short-Term Load Forecasting; STLF; Support Vector Regression; SVM; Time-Series

Sažetak

Predviđanje potrošnje električne energije multivarijatnim metaučenjem

Moderna gospodarstva danas ovise o električnoj energiji. Njena proizvodnja i potrošnja moraju u svakom trenutku biti u ravnoteži jer je trošak za pohranu značajne količine električne energije velik. Kako bi se troškovi smanjili, koristi se predviđanje potrošnje električne energije. Stogodišnji problem rješava se na brojne načine čitavo vrijeme. Osnovni zadatak predviđanja potrošnje je smanjiti pogrešku predviđanja. Smanjenje pogreške predviđanja dovodi do nižih troškova i višeg probitka.
Osim za održavanje elektroenergetskih sustava u ravnoteži i upravljanje elektroenergetskim sustavima, predviđanje potrošnje je važno za: planiranje elektroenergetskih sustava, aktivnosti na tržištu električne energije, dizajn tržišta električne energije i sigurnost opskrbe. Kako bi se moglo dobiti brzu ocjenu procjene financijskog troška opskrbljivača, jer je trošak uzrokovan pogreškom predviđanja važan opskrbljivačima, predlažem metodu zasnovanu na uprosječavanju. Predloženu metodu primjenjujem na dva primjera simulacije troška predviđanja za čitavo tržište Finske odnosno Francuske, s relativnom pogreškom simulacije u odnosu na izračunati trošak uravnoteženja od -8,3 odnosno -3,1 posto. Tradicionalno se predviđanje potrošnje rješava na razini pojedinog zadatka s povremenim stremljenjem da se izradi općenitiji okvir. Općeniti okvir za predviđanje potrošnje bio bi koristan većini tržišnih sudionika za njihove heterogene zadatke predviđanja potrošnje. Bio bi posebno važan opskrbljivačima. Deregulacija elektroenergetskih sustava stvorila je nova tržišta, dok je pojava Interneta i novih tehnologija omogućila priliku za pružanje funkcionalnosti i usluga s više dodane vrijednosti krajnjim kupcima. Uglavnom u opskrbi, gdje profitne marže nisu visoke, mali sudionik suočava se u svakodnevnom poslovanju s problemom modeliranja mnogo nesigurnosti u različitim tipovima zadataka predviđanja potrošnje. Općeniti okvir za predviđanje potrošnje koristio bi modeliranju heterogenih zadataka predviđanja potrošnje u jednom sustavu za predviđanje i pomogao onima koji nisu eksperti u dizajnu algoritama, pri izboru najprikladnijeg algoritma za njihov zadatak predviđanja. viii U ovom doktorskom radu predlažem pristup za predviđanje potrošnje korištenjem meta-učenja koji uključuje okvir za izbor modela predviđanja potrošnje električne energije. Odabrao sam meta-učenje jer je ukorijenjeno u ideji učenja kako bi se bolje riješili problemi različitih karakteristika. 
Prikladno je za predviđanje potrošnje jer uči i evoluira s vremenom kao što elektroenergetski sustavi žive i evoluiraju s vremenom. Do sada nije primijenjeno u elektroenergetici, niti je primijenjeno za predviđanje multivarijatnih vremenskih serija. Predloženi pristup zasniva se na novoj metodi za predviđanje multivarijatnih vremenskih serija korištenjem meta-učenja. Modularan je i uči na dvije razine, na meta razini i na razini zadatka predviđanja. Predloženo rješenje na meta razini koristi jednako uravnoteženu kombinaciju sljedećih sedam metoda za klasifikaciju: euklidsku udaljenost, CART stabla odlučivanja, LVQ mrežu, višerazinski perceptron, AutoMLP, metodu potpornih vektora i Gaussove procese. Na razini zadatka, predloženo rješenje koristi drugih sedam algoritama za predviđanje, i to: nasumičnu šetnju, ARMA-u, slične dane, višerazinski perceptron, po razinama ponavljajuću neuronsku mrežu, 𝜈 regresiju potpornim vektorima i robusnu regresiju potpornim vektorima zasnovanu na metodi najmanjih kvadrata. Na 3 zadatka napravljena od javno dostupnih podataka pokazujem kako, za ponavljajuću simulaciju jedne godine predviđanja iz stvarnog svijeta, sustav za meta-učenje napravljen na 65 zadataka predviđanja potrošnje, vraća manju pogrešku predviđanja nego 10 poznatih algoritama. Na četvrtom zadatku predviđanja potrošnje dobiveni su slični rezultati, no za kraći period predviđanja i korištenjem ex-ante pristupa. Vrijeme izvođenja predloženog sustava je 29% vremena potrebnog za izvođenje svih kombinacija što opravdava uvođenje meta-učenja. Također predlažem pristup za definiranje i izbor metaosobina vremenskih serija korištenjem ReliefF algoritma za klasifikaciju, koji kao rezultat daje nove metaosobine primjenjive na meta-učenje predviđanja potrošnje. Pristup rezultira uvođenjem: prelaznosti, nemirnosti, granularnosti i najvećeg ACF-a. 
Postupak za odabir metaosobina korištenjem ReliefF algoritma za klasifikaciju potvrdio je važnost predloženih metaosobina dodjeljujući im visoke težine, dok je CART stablo odlučivanja potvrdilo tako dobiveni rezultat. Predloženi okvir za meta-učenje namijenjen je kako bi se koristio u praksi zbog čega je paraleliziran, zasnovan na komponentama i lako proširiv.

Ključne riječi: električna energija, estimacija, Gaussovi procesi, kratkoročno predviđanje potrošnje električne energije, meta-učenje, metaosobine, metoda potpornih vektora, metoda potpornih vektora najmanjim kvadratima, neuronske mreže, opskrba, potrošnja, procjena, regresija potpornim vektorima, snaga, vremenske serije

Contents

1 Introduction
  1.1 Thesis structure
  1.2 Forecasting
  1.3 Electric Load Forecasting
    What is electric load forecasting?
    1.3.1 Definition
    1.3.2 History
    1.3.3 Recent development (State-of-the-art)
    1.3.4 Motivation
    1.3.5 Important factors in load forecasting
    1.3.6 Division of Load Forecasting
    1.3.7 General Framework for Load Forecasting
    1.3.8 Load Forecasting Cost Estimate
  1.4 Learning
    1.4.1 Motivation
    1.4.2 Definition
    1.4.3 History
  1.5 Meta-Learning
    1.5.1 Definition
    1.5.2 Motivation
    1.5.3 History
    1.5.4 Meta-Learning for Time-Series Forecasting (State-of-the-Art)
2 Problem Statement
3 Methodology
  3.1 Organization of a Meta-learning System
  3.2 Programming Language
4 Multivariate Meta-Learning: Meta-Learning System Modules
  4.1 Load data
  4.2 Normalization
  4.3 Learn Meta-features
    4.3.1 The Ensemble
    4.3.2 Euclidean Distance
    4.3.3 CART Decision Tree
    4.3.4 LVQ Network
    4.3.5 MLP
    4.3.6 AutoMLP
    4.3.7 ε-SVM
    4.3.8 Gaussian Process
  4.4 Feature selection
  4.5 Error Calculation and Ranking
5 Forecasting Algorithms
  5.1 Random Walk
  5.2 ARMA
    5.2.1 ARMA in Load Forecasting
    5.2.2 Implemented Algorithm
  5.3 Similar Days
    5.3.1 Application of Similar Days
  5.4 Neural Network
    5.4.1 Definition
    5.4.2 History of Neural Network Development
    5.4.3 Types of Neural Networks
    5.4.4 Activation Function
    5.4.5 Back-propagation
    5.4.6 Layer Recurrent Neural Network
    5.4.7 Levenberg-Marquardt algorithm
    5.4.8 Multilayer Perceptron
  5.5 Support Vector Regression
    5.5.1 Statistical Learning Theory
    5.5.2 Vapnik-Chervonenkis Dimension
    5.5.3 Structural Risk Minimization
    5.5.4 Support Vectors
    5.5.5 Regression in Linear Form
    5.5.6 Lagrange Multipliers
    5.5.7 Optimization problem
    5.5.8 Karush-Kuhn-Tucker Conditions
    5.5.9 Kernels
    5.5.10 ε-SVR
    5.5.11 ν-SVR
  5.6 Least Squares Support Vector Machines
    5.6.1 Robust LS-SVM
    5.6.2 Cross-validation
  5.7 Common for all algorithms
6 Collecting the Data
7 Meta-features
  7.1 Types of Meta-features
  7.2 Definition of meta-features
    7.2.1 Minimum
    7.2.2 Maximum
    7.2.3 Mean
    7.2.4 Standard deviation
    7.2.5 Skewness
    7.2.6 Kurtosis
    7.2.7 Length
    7.2.8 Granularity
    7.2.9 Exogenous
    7.2.10 Trend
    7.2.11 Periodicity
    7.2.12 Highest ACF
    7.2.13 Traversity
    7.2.14 Fickleness
  7.3 Selection of meta-features
8 Experimental results
9 Conclusions
10 References
List of Tables

Table 1: Simulation of the Load Forecasting Cost Estimate for Finland and France
Table 2: Accuracy on the Meta-level
Table 3: Forecasting Error Comparison
Table 4: Error Statistics of Task A

List of Figures

Figure 1: The Correlation between Load and Temperature for One of the Load Forecasting Tasks
Figure 2: Algorithm selection seen through Rice's Framework
Figure 3: Illustration of the Proposed Meta-learning System
Figure 4: The Flowchart shows the Organization of the Modules in the Meta-learning System
Figure 5: Activation of a Neuron in the Output Layer selecting RobLSSVM as the Label for a Random Example given to the LVQ Network
Figure 6: A Snapshot of the AutoMLP Operator where Ensemble Parameters are being set
Figure 7: ε-loss Function is Zero inside the ε-tube and grows linearly outside of it. It is also named the hinge-loss function because it resembles a hinge
Figure 8: Three Random Functions drawn from the Posterior of a Randomly Selected Distribution. Dashed Grey Line represents a confidence interval of Two Standard Deviations for Each Input Value
Figure 9: Graphical Overview of the Loop which runs through the Four Ensemble Components
Figure 10: An Example of Three Different Random Walks
Figure 11: MAPE for RW using Various Skips from 1 to 168 on the example of Task 38. per is 24 for this example, visible as a low MAPE
Figure 12: Comparison shows a typical case: ARMA(r, r−1) has higher accuracy than AR(r) and MA(r), which indicates it also returns lower error
Figure 13: Load forecasting using ARMA for the first 10 days of Task 38
Figure 14: Three Best Candidates (Similar Days: SD1, SD2 and SD3), forecast as a Median SD and the Actual Load for a Random Day of Task 38
Figure 15: Comparison of the Number of Load Forecasting Scientific Papers which mentioned different forecasting approaches
Figure 16: Weights are Connections between Neurons
Figure 17: Main Parts of a Human Neuron [110]
Figure 18: Neuron as an Activation Unit
Figure 19: A Simple Single-layer Feed-forward Neural Network
Figure 20: Multilayer Feed-forward Network with 8-4-6 Topology
Figure 21: Recurrent Neural Network with One Hidden Layer
Figure 22: Tan-sig Function Plot
Figure 23: Simple LRNN Topology
Figure 24: Shattering of Points in a Two-Dimensional Space
Figure 25: Nested Function Subsets with Associated Vapnik-Chervonenkis Dimensions
Figure 26: Linear Support Vector Machines with ε-loss Function
Figure 27: Non-linear SVM after the Introduction of an Appropriate Kernel
Figure 28: Load Forecasting using LS-SVM and Dependence of MAPE on the Number of CV Folds
Figure 29: Hourly Load in Duration of One Year with a Stable and Frequent Periodic and Seasonal Pattern found often in Loads above 500 MW
Figure 30: Illustration of the Difference between Types of Meta-features based on the Data and the Models
Figure 31: The ReliefF Weights show that Highest ACF, Fickleness and Granularity are the Important Meta-features
Figure 32: MDS of Tasks and the Meta-features
Figure 33: Difference of Load and Temperature is shown and Winter and Summer Periods can be distinguished. Summer Period lasts from 15 April to 15 October
Figure 34: Overview of the Five Layers of the Forecasting Procedure Used in [153]
Figure 35: The Forecasting Error Increases with the Step Size
Figure 36: Sample Autocorrelation Function of a Periodic Toy Set Time-series used for Testing
Figure 37: Dependence of the Forecasting Error of a Periodic Time-series in terms of Number of Lags of the Autocorrelation Function used for Forecasting
Figure 38: Decrease of Forecasting Error in terms of MAPE with the Increase of Border
Figure 39: MDS of Tasks Forecasting with MASE of all Algorithms to 2 Dimensions
Figure 40: The CART Decision Tree shows that the proposed Meta-features are used more than those frequently encountered, which might indicate them as good candidates in the application of Meta-learning to Load Forecasting
Figure 41: Typical Difference between Actual and Forecasted Load for Seven Cycles of a Randomly Selected Task using the Meta-learning System
Figure 42: MASE Difference for Task A between Meta-learning System and Best Solution Averaged per Cycle
Abbreviations

ACF       Autocorrelation Function
BP        Back-propagation
CART      Classification And Regression Tree
CCWA      Closed Classification World Assumption
CRISP-DM  Cross-Industry Standard Process for Data Mining
CV        Cross-validation
ED        Euclidean Distance
FF        Feed-forward
FPE       Final Prediction Error
GI        Gini Impurity Index
GP        Gaussian Process
KKT       Karush-Kuhn-Tucker
kNN       k Nearest Neighbours
LRNN      Layer Recurrent Neural Network
LS-SVM    Least Squares Support Vector Machines
LVQ       Learning Vector Quantization
MAD       Median Absolute Deviation
MAE       Mean Absolute Error
MAPE      Mean Absolute Percentage Error
MASE      Mean Absolute Scaled Error
MDS       Multidimensional Scaling
MLP       Multilayer Perceptron
MSE       Mean Square Error
NFL       No Free Lunch
NN        Neural Network
NRMSE     Normalized Root Mean Square Error
PACF      Partial Autocorrelation Function
RMSE      Root Mean Square Error
RobLSSVM  Robust Least Squares Support Vector Machines
RW        Random Walk
RBF       Radial Basis Function
SOM       Self-Organizing Map
SVM       Support Vector Machines
SVR       Support Vector Regression
VBMS      Variable-Bias Management System

"The secret of getting ahead is getting started"
Mark Twain¹

1 Introduction

This doctoral thesis is about solving an old problem, load forecasting, by using meta-learning as a new approach, and about arriving at a new general framework and new insightful results. These include the meta-features of fickleness, granularity, highest ACF and traversity, and the resulting forecasting performance. It is about stepping in with a solution to an old problem and helping the new participants in the game of electricity markets.

1.1 Thesis structure

This thesis is organized as follows:
1. Introduction
   • Forecasting;
   • Load Forecasting;
   • Learning (motivation; definition; history); and
   • Meta-Learning (motivation; definition; history).
2. Problem Statement
3. Methodology
4. Multivariate Meta-Learning: Meta-Learning System Modules
5. Forecasting Algorithms
6. Collecting the Data
7. Meta-features
8. Experimental Results
9. Conclusions
10. References
¹ Samuel Langhorne Clemens (30 November 1835 – 21 April 1910), better known by his pen name Mark Twain, was an American author and humourist.

"Prediction is very difficult, especially about the future"
Niels Bohr²

1.2 Forecasting

It is beneficial to know the future outlook. It brings the power to anticipate events, to adapt better, and to evade outcomes or change them in one's favour. Because of the advantages which forecasting gives, people started to use it a long time ago. There is historical evidence that, around 650 B.C., the Babylonians tried to predict short-term weather changes based on the appearance of clouds and optical phenomena such as haloes [1]. The ability to predict things better brings wealth to societies, since it increases the chances of successful outcomes. The advantage of good forecasting is most obvious in the long run, when a good principle is reused many times to arrive at good future predictions.

Having a long history and sparking people's interest, forecasting has been defined in different ways. I give the following three definitions of forecasting:
• "To calculate or estimate something in advance; predict the future" [2];
• "The process of analysing current and historical data to determine future trends." [3];
• Attempting to accurately determine something in advance.

Prognosis, projection, estimation and prediction are synonymous with forecasting. Nowadays, forecasting frequently corresponds to the estimation of qualitative or quantitative data by means of calculation. Although applied in many areas, it is mainly human-driven due to its complexity. It could be misused because of uncertainty about the future. Sometimes, forecasting is associated with fortune-telling. The difference is in the transparency of how things are predicted in advance.
While it is possible to repeat the forecasting process by using the same data and modelling to obtain the same or similar results, this is not the case with fortune-telling, where one has to rely on the crystal ball and the fortune-teller’s words. Unlike fortune-tellers, who claim to unravel the future, forecasters aim merely to create predictions which are as accurate as possible.

Niels Henrik David Bohr (7 October 1885 – 18 November 1962) was a Danish physicist and philosopher. He received the Nobel Prize in Physics in 1922 for his foundational contributions to understanding atomic structure and quantum mechanics.

Forecasting advanced into wider usage with the introduction of technology, especially computers, which exploit it efficiently. Weather forecasting is probably the most famous kind, since it is used for many different purposes. Financial and technical forecasts are widespread too. Forecasting in these areas is vital because small errors can cost a lot of resources; better local weather forecasts, for example, can avoid crop losses and increase agricultural production output. Many processes, such as the operation of power systems, transport networks and logistics management, rely on forecasting. In time-series forecasting, there are similarities between the different types of application in some pre-processing steps and approaches. Forecasting algorithms can be designed or applied only for a certain type of forecasting. In time-series forecasting, there is a general tendency to create approaches which, with lower forecasting errors, would be able to solve more and more problems. The solution proposed in this thesis is a step in that direction because it handles different types of electric load forecasting. Electricity price forecasting is the type of forecasting which is most similar to load forecasting. There are a lot of similarities between them. The main difference is the existence of price spikes, which result from electricity’s inelasticity as a commodity.
The approach to electricity price forecasting in [4] is similar to the approaches to short-term load forecasting which use neural networks. It combines statistical data pre-processing techniques with a multilayer perceptron (MLP) to forecast the price of electricity and, also, to detect price spikes. The detection of spikes is the main difference, since it is encountered less often in load forecasting. The next section presents an overview of load forecasting. Subsection 1.3.6 provides a detailed description of the load forecasting types.

“It's not the load that breaks you down - it’s the way you carry it” Lou Holtz

1.3 Electric Load Forecasting

What is electric load forecasting?

1.3.1 Definition

Electric load forecasting, or load forecasting for short, comes under a variety of synonyms: electricity load forecasting; electricity demand forecasting; consumption forecasting; electricity load prediction; load demand; power demand; load demand prediction; load estimation; etc. By definition, electric load forecasting is a “realistic estimate of future demand of power” [5].

1.3.2 History

Load forecasting came to life with the introduction of power systems. Samuel Insull, a pioneer of the electric utility industry, was one of the first people to become involved with load forecasting. Insull figured out that certain consumers have patterns of energy consumption, e.g. commercial buildings and households use more electricity in the daytime, while street lighting and some industry use more of it in the night-time. With a good mix of different customer types, he figured out that more electricity could be produced from the same generating unit, thereby maximizing the return on investment [6]. To my knowledge, Reyneau’s 1918 paper [7] is the oldest scientific paper on load forecasting. Throughout its centennial history, load forecasting matured together with the power systems.
Early statistical methods and, later, the Box-Jenkins methodology [8] in the 1970s paved the way for new time-series approaches. In the 1980s and 1990s artificial neural networks (NNs) gained recognition and their successful applications began. With the development of computer science in the ensuing years, machine learning approaches grew in number. With the introduction of support vector machines (SVM) [9], and SVM taking first place in the EUNITE competition on load forecasting [10], support vector based approaches and kernel methods were established as a new branch in the development of load forecasting. A good overview of the development of load forecasting is present in the surveys [11–19].

Louis Leo Holtz (born 6 January 1937) is a retired American football coach, author and motivational speaker. He is the only college football coach in the USA who led six different programs to bowl games.

1.3.3 Recent development (State-of-the-art)

Recently, Wang et al. [20] proposed a hybrid two-stage model for short-term load forecasting (STLF). Based on the influence of relative factors on the load forecasting error, their model selected the second-stage forecasting algorithm from among linear regression, dynamic programming and SVM. Espinoza et al. [21] proposed fixed-size least squares support vector machines (LS-SVM) using an autoregressive exogenous - nonlinear autoregressive exogenous structure, and showed that, in the case of STLF using large time-series, it outperformed the linear model and LS-SVM. Hong [22] proposed a new load forecasting model based on seasonal recurrent support vector regression (SVR) which used a chaotic artificial bee colony for optimization. It performed better than ARIMA (autoregressive integrated moving average) and trend fixed seasonally adjusted ε-SVR. Recently, Taylor proposed several new univariate exponentially weighted methods, of which one, using singular value decomposition, showed some potential for STLF [23].
1.3.4 Motivation

The power system has tremendous importance for modern society. It is the backbone from which almost all other infrastructures and electrical appliances operate today. It is difficult to imagine modern society without the use of power systems. The most important parts of a power system are the power plants which produce the electricity; the transmission and distribution infrastructure used to transport it through the electric network; and the end customers who take electricity from that network and use it for their own needs. As known nowadays, it is impossible to store significant quantities of electricity. Therefore, it is necessary to maintain equilibrium between production and consumption at all times. Electricity spreads through a power system as electromagnetic waves travelling at the speed of light. Keeping the equilibrium between production and consumption is a huge and important challenge because it maintains the stability of the power system. If this equilibrium is disrupted, a possible blackout could cause damage and cost billions of euros. End customers and power system operators are more exposed to the technical risk of load forecasting, whilst producers and suppliers are more exposed to its economic risk. From the operational perspective, load forecasting is important in maintaining the aforementioned equilibrium between production and consumption. It is important, also, for: power system planning; power market operation; power market design; power system control; and security of supply. It is very important in power system planning, since infrastructure and ventures are capital intensive and good forecasts can save a substantial part of investments [24]. Nowadays, as a prerequisite for loans, banks frequently demand approved load forecasts in order to limit exposure to their debtors.
In power market operation, market participants use load forecasting to manage their costs and strategies. Market participants who are also balance responsible parties use load forecasting as a basis for their obligatory daily schedules, on which the balancing cost is calculated. Due to low profit margins in the industry, a low balancing cost is important. A conservative estimate by Hobbs et al. [25] showed that a reduction in the load forecasting error by 1 % lowered the costs by $1.6 million annually. For the planning and quick estimation of the financial costs of load forecasting, Matijaš et al. [26] proposed a mean-based method which approximated the real financial costs of load forecasting. For market design, security of supply and power system control, a lower load forecasting error leads to lower operational costs. Nowadays, many market participants have dozens of considerably different load forecasting tasks. As these tasks appeared over time, they are often solved with different software, algorithms and approaches in order to retain the particular knowledge and achieve the lowest possible load forecasting error. Market participants have access to many different electricity markets at the same time, and to different load forecasting tasks in those electricity markets. It would be beneficial to all, and especially to small market participants, if they could use the same load forecasting approach to obtain the lowest load forecasting errors for all heterogeneous load forecasting tasks.

1.3.5 Important factors in load forecasting

Throughout the history of load forecasting, the influence of different factors on the pattern of electricity load was identified, researched and utilized. It was discovered very early that the type of consumption influences the shape of the overall load.
Since the generating investment was better utilized over time [6], matching the appropriate customers with the production flattened the load curve, made forecasting easier and led to a higher profit margin. With long-term planning and forecasting, the economic factors introduced in subsection 1.3.5.2 were important since they determined the development of the electricity network. Social factors were important as well, because the migration of people and social structures influence the nature of the load and its growth rates. Both influencing factors remain vivid in China, as can be seen from the load growth factor in North-eastern China in [27] and in the basis for the load modelling in [20], whereby, due to different populations, the growth in the load differs in both size and shape. There are, also, other factors such as outages; riots; special events; and random occurrences. There was some research on which factors relate to the load forecasting error. For example, Matijaš et al. investigated the relationship between the forecasting error at the country level and various factors for European countries [28]. It is an interesting finding which shows that, although, amongst European countries, there is a non-linear relationship between the load forecasting error in terms of mean absolute percentage error (MAPE) and the average load, the Netherlands, Belgium, Germany and Slovenia, with very developed markets, were outliers with respect to this rule. In terms of the influence of a single factor on the load forecasting error, the average load was the only one which, at the significance level of 95 % (p<0.05), showed a statistically significant relationship with the mean absolute percentage error. The higher the load, the easier it is to forecast and, therefore, the lower the error. Among the other features, only the installed hydro capacity, at the significance level of 90 % (p<0.10), showed a statistically significant relationship with the mean absolute percentage error.
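Since MAPE is the error measure referred to throughout, a minimal sketch of its computation may be helpful; the load values and forecasts below are purely illustrative, not data from the study:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent.

    Assumes all actual values are non-zero, which holds for
    aggregated electric load time-series.
    """
    n = len(actual)
    return 100.0 / n * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))

# Hypothetical hourly loads [MW] and their forecasts:
actual = [950.0, 1000.0, 1050.0, 980.0]
forecast = [931.0, 1020.0, 1029.0, 999.6]

print(round(mape(actual, forecast), 2))  # each point is off by 2 %, so MAPE is 2.0
```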
The multiple regression showed that the average load and the largest producer’s market share had a statistically significant influence on MAPE (p<0.05). These conclusions were rather specific; in the following subsections I explain the important factors which influence load forecasting in general.

1.3.5.1 Weather factors

Weather factors are the most used type of factors for load forecasting. The following are used frequently: temperature; precipitation; wind speed and wind direction; cloud cover; humidity; and weather conditions (e.g. fog; breeze; haze; and tornado) expressed numerically or, more often, nominally. Temperature is the most important of all of them. Temperature measurements have a long history due to the effect of temperature on human beings. When the temperature drops enough it becomes cold and people need more energy to maintain the same temperature. When it increases beyond a certain threshold, people use energy to lower the temperature. Nowadays, most of our activities involve consumption of electricity. Load and temperature are linked to some extent because, driven by the change in temperature, we change activities and behaviours. The strong correlation between load and temperature has been known for a long time. It is utilized in load forecasting approaches. Figure 1 is an example of the aforementioned correlation. You can see that the load increases when the temperature declines below a certain low threshold and the load reduces when the temperature increases over a certain high threshold; this is a common behaviour.

Figure 1: The Correlation between Load and Temperature for One of the Load Forecasting Tasks

Nowadays, almost all newly proposed short-term load forecasting approaches include temperature, in some form, as a basis for the forecast. Precipitation is important for outdoor activities, and some industry processes rely on it. It is used for forecasts up to a horizon of one year, because it influences the load significantly enough.
Due to its local character and without information about the spatial load distribution, it can be unreliable for lower levels of load aggregation. It reduces loads relying on outdoor activity and increases loads involving indoor activity (e.g. fruit picking and fruit processing). Wind speed and wind direction form a wind vector. They are used to tune forecasting models because a strong wind over a certain high threshold reduces the load and a strong wind below a certain low threshold increases the load. Cloud cover is important for short-term load forecasting in respect of commercial buildings. Although cloud cover can be very local, it is used because it increases the usage of lighting and it has a measurable influence on the load. Weather conditions are nominal (sometimes, also, ordinal) features since they are condensed descriptions of the weather. It is possible to quantify weather conditions and use them as load forecasting input because, during more severe conditions such as blizzards, thunderstorms and heavy rain, load is higher compared to the load when (with all other things being equal) weather conditions are clear, calm and normal. Humidity is measured often and, therefore, it is a frequently available weather factor. Together with other weather factors such as temperature, it can describe some of the load behaviour in some climatic conditions. It is used in short-term load forecasting, but not often.

1.3.5.2 Economic factors

Economic factors are influencing factors which are described in economic terms, mostly in monetary units. The most important are: gross domestic product; oil price; heating fuel price; and standard tariff change. The gross domestic product represents the market value of all a country’s final goods and services produced in a given period. For some time, gross domestic product per capita has been used as a measure of a country’s standard of living.
It is used as an economic factor for medium-term and longer-term load forecasts. Load is correlated positively with the gross domestic product per capita. The oil price is important for medium-term load forecasts due to its influence on industrial activity. Load has a negative correlation with the price of oil, with a lag of months. The heating fuel price is important for many industries’ consumption and for residential consumption which relies on heating fuels during the colder part of the year. In this period, the load is correlated negatively with the price of heating fuel. A change in standard tariffs has an important influence on consumers’ habits and directly on the load. There is not much historical data which can be used to analyse it and make assumptions for future forecasts. Consequently, it is a difficult task to forecast how exactly, in a certain area, the loads would move in relation to a new change in tariff schemes.

1.3.5.3 Social factors

Social factors are different social effects which influence load and can be used in load forecasting to reduce the load forecasting error further. Social factors are: calendar information; holidays; migration of people; social structure; union strikes and riots; and other special days (elections; start and end of the school year; etc.). Calendar information is the most important of the social factors. Human behaviour is often periodic in terms of days, weeks, months, seasons and years. Most load patterns have some of that periodicity. Even when that is not the case, it is possible to find a good data representation based on different calendar information. Holidays are a special type of calendar information. Holidays differ between locations and, even more so, the same holidays can have different effects on the load in different locations. Although there is some empirical evidence that including holidays does not decrease the load forecasting error, they are used in the majority of load forecasting applications.
In general, compared to the load during working days, the load is lower on holidays. Migration of people influences the load since people consume electricity. When people move from one place to another, consumption falls in the place from which people move and increases in the place to which they move. The pattern changes as well, e.g. people in rural areas use electricity mainly for heating and cooking, and almost all of it in the daytime, whilst people in urban areas use a lot of electricity for entertainment and lighting in the night-time; this produces a different load pattern. Social structure is important for long-term load forecasting since different social structures are known to result in different amounts of consumption and different patterns. Special days, such as union strikes, riots, election days, and the first and last days of the school year, are important social factors in short-term load forecasting. Special days are modelled in real-world applications in order to reduce the load forecasting error further.

1.3.5.4 Other influencing factors

Other influencing factors are not modelled frequently. These consist of: duration of the day; outages; special events; and random occurrences. The duration of the day is tied to the weather factors and the calendar information. Since lighting is influenced by the duration of the day, it is especially important in the forecasting of public lighting load. When the night-time is longer, the electric load at its boundaries is higher due to lighting. Outages or faults of components in a power system are important because they can influence the load severely. In some situations, it is possible to quantify the risk of outage and include it in a load forecasting model. Special events are used in both very short-term load forecasting and short-term load forecasting.
Special events, a common term in short-term load forecasting, usually refer to important events of a cultural, sporting, political or other nature. For some power systems, there is already enough information about the load pattern during football matches to use it for load forecasting. Some approaches in pattern recognition, such as anomaly detection, can be used to automatically find and classify previous special events. Random occurrences are patterns whose sources are unknown. These are difficult to model, but some approaches try to include them. A small earthquake could be an example of a random occurrence. In [29], techniques which were developed to detect anomalies in load time-series are mentioned. Those techniques can detect new factors which can then be used in load forecasting to reduce the load forecasting error further. All these factors are used but, over time, it was shown that some are more suited to load forecasting of a certain type and time horizon. The division of load forecasting is very important, and the most common division is based on the forecasting horizon. In the following subsection, you can find which types of factors are used as an input for each type of load forecasting and how these factors influence the actual load.

1.3.6 Division of Load Forecasting

Load forecasting does not include forecasting of consumption on a very small scale, such as forecasting of electricity consumption on a chip inside an electronic device. Due to the different nature of balance responsible parties and different forecasting requirements, the approaches and models used for forecasting differ tremendously from one another. Based on the voltage level, forecasting can be at: high voltage or above; middle voltage; low voltage; or appliance level. Forecasts differ, also, based on the forecasting horizon.
A common division of load forecasting, used in [28], is based on the following forecasting horizons:
• VSTLF: very short-term load forecasting, which has a forecasting horizon of under 1 hour;
• STLF: short-term load forecasting, which has a forecasting horizon of under 30 days;
• MTLF: medium-term load forecasting, which has a forecasting horizon of under 1 year; and
• LTLF: long-term load forecasting, which is used for a forecasting horizon of over 1 year.
In the following subsections I describe each of the load forecasting types.

1.3.6.1 Very Short-Term Load Forecasting – VSTLF

Very short-term load forecasting is used for load or generation optimization in the very short term [30–34]. It is important for load-frequency control and short-term generation scheduling [30]. In VSTLF, the granularity of the time-series goes down to the level of seconds. Unlike wind power forecasting, where measurements of exogenous data are available as an input, VSTLF is often characterized as a univariate time-series task.

1.3.6.2 Short-Term Load Forecasting – STLF

Short-term load forecasting is the most used type of load forecasting because it is the basis for mandatory schedules for many market participants in many power systems. It is done usually on a daily basis for the day ahead, with hourly or lower granularity down to 5 minutes. STLF is usually a multivariate load forecasting task which uses, at least, temperature and calendar information to obtain the forecast of the future load. Many approaches for short-term load forecasting are available, with a lot of different input data and pre-processing and feature extraction methods for that data. The goal of each approach is to show that it achieves better results than competing methods. STLF is researched widely, with hundreds of papers published each year [35].

1.3.6.3 Medium-Term Load Forecasting – MTLF

Medium-term load forecasting is a task which is the basis for many market participants’ strategies.
In some power systems, large end customers are obliged to report, one year in advance, their planned load to their power system operators. Sometimes, MTLF is a basis for credits or subsidies. Medium-term load forecasting is used, also, on a higher level for power system control and power system management. In that case, economic factors like the price of heating oil, the seasonal weather forecast and the expected change in gross domestic product are used to achieve a very low load forecasting error.

1.3.6.4 Long-Term Load Forecasting – LTLF

Long-term load forecasting is the basis for power system planning. It is important for the welfare of societies because investments in power systems are long-term and capital intensive. Important for long-term load forecasting are economic factors such as gross domestic product; then social factors such as the expected development of urban areas and migration of the population; and then weather factors such as the expected change of weather conditions for a certain area.

1.3.7 General Framework for Load Forecasting

There are different load forecasting approaches based on the type of load, the available historic data and the nature of the load. One of the goals of load forecasting research is to find a general framework for load forecasting which could be used to find a satisfying solution for every type of load. Although the annual number of scientific papers on load forecasting increased from around one hundred in 1995 to more than a thousand in recent years [35], the majority of proposed approaches are suited to specific, regional data [36]. For load forecasting, Osowski and Siwek [37] proposed a combination of a self-organizing network and the multilayer perceptron, which could be applied universally to any power system. Their mean absolute percentage error of 1.84 % for Poland’s power system is higher than the real-life forecasting error of that power system, which can be obtained from freely available data at [38]. Marin et al.
[39] proposed Kohonen map and Elman network based load forecasting and compared it with other types of neural networks on the central Spanish regulation area load. The authors claimed that their proposal’s robustness and adaptability to other regulation areas is based on the capability of Kohonen maps to extract non-evident exogenous factors. From time to time, a scientific work is published which does not claim universality or wide applicability of its approach, but demonstrates that it performs well on different loads. Approaches in hierarchical forecasting, or forecasting model creation based on load clustering such as [40], are an example of it. One of the goals of the proposed approach is to contribute in that direction and to bring to the table a general framework for load forecasting selection. As noted in the Introduction, there are various reasons to aim for a further reduction of the load forecasting error in many load forecasting tasks. And how do we determine the effect of that reduction? The next subsection gives a simple and straightforward way to calculate the financial cost of load forecasting.

1.3.8 Load Forecasting Cost Estimate

The financial cost of load forecasting is important mainly for suppliers. In addition, however, the growing number of wind power producers without a feed-in tariff has to manage it as well. For a balance responsible party, the financial cost of load forecasting consists of the following costs: energy data management system; forecasting system; data exchange; and balancing energy. The energy data management system and data exchange are inherent parts of balance responsible parties’ business solutions because business intelligence is imperative for their successful day-to-day operation. To some degree, regulatory requirements on data exchange are in place in all markets. Due to those two facts, only a proportion of these systems’ costs is tied to load forecasting costs.
Such a proportion is so small that those costs are considered to be a part of the forecasting system cost. The forecasting system cost is a fixed cost to balance responsible parties because every balance responsible party needs a person to model the data and algorithms, and hardware to operate them. The cost of average salary expenditure along with fixed running costs is estimated to be €100,000/year. If specialized software for forecasting were chosen, the cost could increase significantly depending on the choice of the forecasting solution [26]. The majority of the financial cost of load forecasting is represented by the balancing energy cost. In order to calculate the financial cost, the proposed mean-based calculation, which relies on the balancing cost calculation, is as follows:

𝐿𝐹𝐹𝐶 = 𝑀𝐴𝑃𝐸_𝑙𝑓 ∙ 𝐸_𝑎𝑣 ∙ 𝑝_𝑎𝑣 ∙ 𝑇_𝑡𝑜𝑡 + 𝐶_𝑓𝑠

Where:
𝐿𝐹𝐹𝐶 is the calculated cost in monetary units;
𝑀𝐴𝑃𝐸_𝑙𝑓 is the mean absolute percentage error of load forecasting;
𝐸_𝑎𝑣 is the average load, e.g. [MWh/h];
𝑝_𝑎𝑣 is the average price of the imbalance (balancing energy cost), e.g. [€/MWh];
𝑇_𝑡𝑜𝑡 is the number of time intervals, e.g. hours; and
𝐶_𝑓𝑠 is the total cost of a forecasting system, e.g. [€].

Due to the correlation between price and energy at different periods, a different price 𝑝_𝑎𝑣 returns a different quality of results. For the average price calculation, the differences between the mean price of positive and negative imbalance and the imbalance price are used. Data of real market participants with their procurement prices would be a more applicable solution, since the cost of load forecasting would then be reflected in the losses incurred through the load forecasting error. The validity of the proposed equation is tested on a balancing cost simulation for balance responsible parties in the role of a supplier in Finland and France for the period 1st January to 31st December 2010. In each case, the whole country’s consumption is used as a balance responsible party. Data for France can be obtained from [38].
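As a sketch (not the thesis software itself), the cost estimate above translates directly into code; the figures reproduce the illustrative example from [26] of a balance responsible party with a 1,000 MW average load, 5.0 % MAPE, a 10 €/MWh average imbalance price and a €100,000/year forecasting system over one year of hourly intervals:

```python
def lffc(mape_lf, e_av, p_av, t_tot, c_fs):
    """Load forecasting financial cost (LFFC) in monetary units.

    mape_lf: mean absolute percentage error as a fraction (0.05 = 5 %)
    e_av:    average load [MWh/h]
    p_av:    average imbalance price [EUR/MWh]
    t_tot:   number of time intervals (e.g. 8760 hours in a year)
    c_fs:    total forecasting system cost [EUR]
    """
    return mape_lf * e_av * p_av * t_tot + c_fs

cost = lffc(0.05, 1000.0, 10.0, 8760, 100_000)
print(round(cost))  # -> 4480000

# With the imbalance price unchanged, a 1 % increase in MAPE adds 876,000 EUR:
print(round(lffc(0.06, 1000.0, 10.0, 8760, 100_000) - cost))  # -> 876000
```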
Balancing prices for Finland were taken from [41], and the other data from [38]. Table 1 presents both data sets’ key values, shown as average values and their standard deviations.

Table 1: Simulation of the Load Forecasting Cost Estimate for Finland and France

Country                        Finland         France
Av. load [MW]                  9,713±1,800     58,142±13,921
MAPE [%]                       2.37            1.35
Balancing price [€/MWh]        13.46±17.76     11.82±7.28
Av. period cost [€]            3,487±3,281     9,875±11,898
Balancing energy cost [€]      29,524,245      83,659,135
LFFC [€]                       27,073,006      81,034,761
Relative error [%]             -8.3            -3.1

The proposed equation presents a quick way to calculate the financial cost of load forecasting. For example, a balance responsible party with an average load of 1,000 MW, an average forecasting error of 5.0 %, an average price of the imbalance of 10 €/MWh and a forecasting system cost of €100,000/year would have a load forecasting financial cost of €4,480,000. Whilst the average price of the imbalance remains the same, an increase in MAPE of 1 % would result in an additional cost of €876,000 to that balance responsible party [26].

“Learning is a treasure that will follow its owner everywhere” Chinese Proverb

1.4 Learning

1.4.1 Motivation

Learning is an important aspect because, by using some information, it improves the chances of achieving a certain goal.

1.4.2 Definition

Learning is defined as a process of acquiring knowledge. The most common definitions of learning are:
• “Learning is acquiring new or modifying existing knowledge, behaviours, skills, values, or preferences and may involve synthesizing different types of information. The ability to learn is possessed by humans, animals and machines. It does not happen all at once, but builds upon itself, and it is shaped by what we already know.” [42];
• “The act or experience of one that learns.” [43] and [44];
• “Knowledge of skill acquired by instruction or study.” [43] and [44];
• “Modification of a behavioural tendency by experience.” [43] and [44].
1.4.3 History

Learning has an important place in human culture. Throughout history, those who learned a lot and acquired great knowledge had an important role in society. With our technological advancement, the role of learning increased and, nowadays, learning presents a basis for Europe’s economic development. The idea of modelling learning processes in machines was developed in artificial intelligence and in machine learning; these are fields which emerged in the second half of the 20th century with the development of computer science. Artificial intelligence was recognised quickly and became popular with the reinvention of neural networks. Nowadays, due to the improvements and new innovations which it enabled, there is extensive research in machine learning, a subfield of artificial intelligence dedicated to learning algorithms. A well-known definition of machine learning is: “a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” [45]. Through the development of artificial intelligence and machine learning, the application of learning spread everywhere very quickly. Nowadays, artificial intelligence and machine learning techniques are applied frequently and everywhere around us. Sophisticated algorithms arrange data; look for the most probable options; route traffic; and, most of the time, select choices when we use electrical and electronic devices.

“Learning how to learn is life's most important skill” Tony Buzan

1.5 Meta-Learning

1.5.1 Definition

Meta-learning is learning about learning. Aha [46] was the first to use the term.

1.5.2 Motivation

The No Free Lunch (NFL) theorem for supervised learning states that, if one is interested in the off-training-set error in a noise-free scenario where the loss function is the misclassification rate, then there are no a priori distinctions between learning algorithms [47].
Consequently, it follows that no algorithm is best for all load forecasting tasks. The NFL theorem is the motivation for meta-learning because it proved that no single algorithm will solve all problems with the best performance. However, meta-learning itself also abides by the NFL theorem: just as the theorem applies to a particular learning task, it applies to a particular meta-learning system. When a new task, such as a new forecasting task, appears, the NFL theorem leaves two possible approaches [48]:
• Closed Classification World Assumption (CCWA): we assume that all tasks likely to occur in real applications form some well-defined subset of the universe. As designers, we know that our novel algorithm performs better than others on that set. As practitioners, we select any of the algorithms which perform well on that set;
• Open Classification World Assumption (OCWA): we assume that no structure on the set of tasks is likely to occur in real applications. As designers, we characterize, as precisely as possible, the class of tasks on which our novel algorithm outperforms others. As practitioners, we have some way of determining which algorithm(s) will perform well on our specific task(s).
Attempts to characterize "real-life" tasks such as classification are rare, which suggests that it is not a trivial task. In time-series forecasting, as in algorithm design generally, most practitioners share the following pattern:

4 Anthony "Tony" Peter Buzan (born 2 June 1942) is an English author and educational consultant who started mind mapping and coined the term mental literacy.

• We propose new algorithms which overcome known limitations.
Unless we accept the CCWA, this simply shifts the original question of how to overcome the targeted limitations to the equally difficult question of determining on which applications the proposed approach works well;
• We propose novel algorithms on the basis of limited empirical results, leaving the burden of proof to the users. It is not trivial to know how well any new approach will generalize beyond the problems against which it has been tested so far [48].
Because more than a thousand papers each year are dedicated to the topic of load forecasting alone [35], the majority of them following the above-mentioned pattern, it is very difficult to find the best-performing algorithm for a given load forecasting task. Moreover, in load forecasting it is important to find the algorithm which yields the best results in terms of load forecasting error, due to the gains which this provides (see subsection 1.3.4 on Load Forecasting Motivation). Given the growing number of candidate algorithms and the effectively unlimited search space, it would be advantageous to have a system which returns the best possible approach in a reasonable runtime. Based on the similarity of the new task to previous tasks, meta-learning uses the knowledge from previous tasks to rank the possible algorithms for solving the new task. The idea of meta-learning is based on Rice's framework.

Rice's Framework
In 1976 Rice proposed a formalized version of the algorithm selection problem as follows: "for a given problem instance 𝑧 ∈ 𝑃, with features 𝑠(𝑧) ∈ 𝐹, find the selection mapping 𝑆(𝑠(𝑧)) into algorithm space 𝐴, such that the selected algorithm 𝑎 ∈ 𝐴 maximizes the performance mapping 𝑢(𝑎(𝑧)) ∈ 𝑈 in terms of a performance measure 𝜋" [49]. In a meta-learning system, the features 𝐹 from Rice's formulation are called meta-features and they represent inherent characteristics of a given task 𝑧.
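Rice's formulation above can be sketched as a small program. The following Python fragment is only an illustrative sketch (the thesis implementation is in MATLAB): the meta-features, the toy knowledge base `kb` and the candidate algorithm names are hypothetical placeholders, and the selection mapping 𝑆 is reduced to picking the algorithm whose stored task profile lies nearest in meta-feature space.

```python
import statistics

def extract_features(z):
    """s : P -> F -- map a problem instance (here, a load series) to meta-features."""
    m = statistics.mean(z)
    s = statistics.pstdev(z)
    # Skewness as the third standardized moment (an illustrative choice).
    skew = sum(((x - m) / s) ** 3 for x in z) / len(z) if s else 0.0
    return {"mean": m, "std": s, "skew": skew, "length": len(z)}

def select_algorithm(features, knowledge_base):
    """S : F -> A -- pick the algorithm whose stored task profile is nearest
    to the new task's meta-features (a stand-in for the performance mapping u)."""
    def dist(f, g):
        return sum((f[k] - g[k]) ** 2 for k in f) ** 0.5
    return min(knowledge_base, key=lambda a: dist(features, knowledge_base[a]))

# Hypothetical knowledge base: for each candidate algorithm, the meta-feature
# profile of a past task on which that algorithm ranked best.
kb = {
    "ARMA": {"mean": 100.0, "std": 5.0,  "skew": 0.0, "length": 240},
    "MLP":  {"mean": 100.0, "std": 30.0, "skew": 1.2, "length": 240},
}

z = [100 + (i % 24) for i in range(240)]  # a toy periodic "load" series
print(select_algorithm(extract_features(z), kb))
```

For the toy series above, the stored ARMA profile is the nearer one, so the selector returns it; a real meta-learning system replaces this nearest-profile rule with a learned classifier at the meta-level.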
Figure 2 presents an illustration of Rice's framework. In meta-learning, 𝐴 is a space of base algorithms and 𝑆 is the algorithm which we are looking for. Obtaining 𝑆, i.e. doing meta-learning, has the following practical aspects:
• Creation of the meta-features and other metadata by building the characterization function 𝑠 based on problems in 𝑃 fed into 𝐹;
• Creation of 𝐴 by finding a good selection of base algorithms; and
• The computational cost of 𝑆 and 𝑠.

Figure 2: Algorithm selection seen through Rice's Framework

If 𝑧 is a load forecasting task, 𝐹 can be composed of the skewness of the load, the kurtosis of the load and the number of exogenous features available for the load forecasting. If we gather enough knowledge about different load forecasting tasks and the load forecasting errors of distinct algorithms on them, it is possible, for each of those load forecasting tasks, to rank the algorithms by load forecasting error. For a new load forecasting task, it would then be possible to rank the algorithms based only on its characteristics, under the assumption that, for load forecasting tasks with similar characteristics, the same algorithm will return a similar load forecasting error. Under this assumption, it is unnecessary to test all algorithms on every new load forecasting task. The choice between selection and model combination, such as hybrids or ensembles, is a difficult one, since it opens the question of when to stop combining. Which algorithm combinations improve results in a robust meta-learning system remains an open problem. Throughout the last few decades, meta-learning has been addressed in different communities such as operations research, neural networks, artificial intelligence and machine learning. A tutorial on meta-learning covers the whole idea and its development in more detail [48].

1.5.3 History
Meta-learning does not have a long history but, short as it is, it is interesting in its own way.
This subsection covers the major contributions and, in chronological order, highlights the key points in meta-learning history. Firstly, there was a program used to "Shift To A Better Bias", known as STABB. It showed that a learner's bias can be adjusted online, whilst the learning happens, and, in that sense, it was a precursor of meta-learning [50]. Next came the "Variable-Bias Management System", known as VBMS, a robust learning system based on integration, optimization and meta-learning. It selects among three symbolic learning algorithms based on the number of data points and the number of features in the dataset [51]. Then came the "Machine Learning Toolbox" [52], a more complex project aimed at building a system which could help users in machine learning by leading them to the best-performing solutions for their problems. It yielded many practical aspects of meta-learning, which were used to build Consultant and, later, Consultant-2, a guidance system for users. The "Machine Learning Toolbox" was followed by the well-funded StatLog project [53]. StatLog is named after statistical and logical learning algorithms (StatLog is short for "Comparative Testing of Statistical and Logical Learning"). It was a research project, running from 1990 to 1993, which extended the "Variable-Bias Management System" to more algorithms, features and datasets for classification. It aimed to find in which part of the feature space which algorithms have good generalization performance. It was the first large project devoted to meta-learning. It introduced the notion that it is sometimes more efficient to use a simpler algorithm than to calculate the features for another algorithm. In that project, an important question was posed: could we learn to estimate performance based exclusively on simple algorithms? Certainly, we could. The resultant approach was later named landmarking.
Landmarking is nowadays one of the branches of meta-learning development [54]. Next, something a bit different emerged.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) project [55] was oriented towards finding the optimal modelling process from the beginning to the end of an underlying data mining task. This project was not directly a development of meta-learning, but it is interesting because it helped to formalize data mining processes, and meta-learning can be used as a tool for data mining. The "A Meta-Learning Assistant for Providing User Support in Data Mining and Machine Learning" project, well known in the meta-learning community and briefly called the METAL project [56], had two goals: firstly, to find a classification algorithm ranking for the task at hand; and, secondly, to characterize tasks by their features and to identify important tasks and features. The binary focus on classification, with many binary and categorical features, which StatLog introduced, was abandoned in favour of algorithm ranking, which became the standard used today. The METAL project provided a web-based meta-learning system which could be used for classification. It extended the idea of giving the solution with the best performance by providing new criteria which included the runtime of each classification algorithm. After this project, there were many new attempts, of which the Intelligent Discovery Assistant is particularly interesting. The Intelligent Discovery Assistant [57] is a relatively new idea in meta-learning which, by using both ontology and case-based reasoning, introduced an assistant to guide the end user in the selection and solution of the presented problem. It is an expert system built from human experts' ontology and machine learning rules, organized together in features.
At some point, meta-learning systems became more sophisticated and moved beyond algorithm selection by also combining different algorithm parameterizations, such as common parameters of neural networks and support vector machines. The work by Ali and Smith-Miles [58], in which they generalized kernel selection for SVM classification, is a good example. A particular area of meta-learning development, interesting for this thesis, is the application of meta-learning to time-series forecasting.

1.5.4 Meta-Learning for Time-Series Forecasting (State of the Art)
Meta-learning was applied to time-series forecasting after its application to classification. Arinze [59] provided one of the bold examples of this application, proposing a machine learning approach for time-series forecasting as an extension of a rule-based expert system. Like the earlier StatLog project, his approach followed Rice's paradigm presented above. Arinze built a meta-learning system around adaptive filtering, Holt's exponential smoothing and Winters' method, which are well-known algorithms for time-series forecasting. Prudencio and Ludermir [60] built a meta-learning system in order to select, by rule induction with a decision tree, a forecasting algorithm between a time-delay neural network and simple exponential smoothing. The decision tree was built using J4.8, a Java implementation of the C4.5 decision tree algorithm 5. Afterwards, they extended the data to that of the well-known M3 competition 6 and changed the forecasting algorithms. More recently, Wang et al. [61] proposed meta-learning on univariate time-series, using four forecasting methods and a representative database of univariate time-series with different and distinct characteristics. Their results show that, depending on the characteristics of the time-series, ARIMA and neural networks interchangeably had the best performance among the candidates.
Exponential smoothing models and random walk (RW) lagged in forecasting performance. Lemke and Gabrys built an extensive pool of meta-features on the datasets of the NN3 and NN5 competitions 7. They showed that a meta-learning system outperformed the approaches representing competition entries in any category. On the NN5 competition dataset, their Pooling meta-learning achieved a symmetric mean absolute percentage error of 25.7, which is lower than the 26.5 obtained by the Structural model, the best performing of 15 single algorithms [62]. If an approach with performance close to or better than the meta-learning system is found, many meta-learning systems can easily assimilate such candidates and thereby become even stronger.

5 C4.5 is a version of ID3, a well-known decision tree algorithm by Ross Quinlan. C4.5 can be used for classification.
6 M3 was the third time-series forecasting competition organized by Spyros Makridakis. The competition was held in 1998 and consisted of 3003 time-series of various types.
7 The NN3 (2006-2007) and NN5 (2008) competitions, abbreviated from Neural Network Forecasting, were time-series forecasting competitions supported by SAS and the International Institute of Forecasters.

"The difficult problems in life always start off being simple. Great affairs always start off being small"
Lao Tzu 8

2 Problem Statement
The goal of time-series forecasting is simply to determine the values of the time-series at certain times in the future, and the same holds for the load. However, obtaining good forecasts of load or other time-series can be very difficult. It depends on many factors, mainly on the time-series or the load itself. A class of time-series forecasting tasks which have additional information about the task, information that can be written in the form of one or more time-series, is known as multivariate time-series. Multivariate time-series forecasting is important because many real-world applications rely on it.
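To make the notion of a multivariate task concrete, the sketch below (a hypothetical Python illustration with toy numbers, not the thesis implementation) represents such a task as a target load series plus one exogenous temperature series, and turns the pair into the lagged feature/target rows that a regression-based forecaster would consume.

```python
def build_design_matrix(load, temperature, n_lags):
    """Turn a multivariate series (load + exogenous temperature) into
    (features, target) pairs: the past n_lags loads plus the current temperature."""
    X, y = [], []
    for t in range(n_lags, len(load)):
        X.append(load[t - n_lags:t] + [temperature[t]])
        y.append(load[t])
    return X, y

# Toy hourly data (hypothetical values, not from the thesis datasets).
load = [50, 52, 60, 75, 80, 78, 70, 65]
temperature = [10, 10, 11, 13, 15, 16, 14, 12]

X, y = build_design_matrix(load, temperature, n_lags=3)
print(X[0], y[0])  # three lagged loads plus the current temperature -> next load
```

A univariate formulation would drop the temperature column; the multivariate one keeps the exogenous information that, as argued above, often improves performance.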
Often, using a univariate time-series when a multivariate time-series is available brings lower performance. Meta-learning has been demonstrated to distinguish well between different tasks and their characteristics and, therefore, to return low errors. Consequently, it is a good candidate for time-series forecasting and for forecasting model selection on multivariate time-series. When non-experts face a time-series forecasting problem, they use the forecasting algorithm that they have available or know how to use. Chances are, this is not a good solution for some, or even the majority, of their problems. In the previous section, I showed some available approaches where time-series forecasting is used. In order to cover different, heterogeneous problems for non-expert users, I will design and put into practice a system based on meta-learning. By capturing knowledge not only from previous instances of the same task but also from different tasks, I will try to show that the forecasting error can be decreased and that, in this way, non-experts can get help with their forecasting tasks. Based on this, I set the first hypothesis:

H-1. Using the meta-learning system for time-series forecasting obtains, in most cases, a forecasting error less than or equal to that of the following specific algorithms: random walk, ARMA, similar days, multilayer perceptron, layer-recurrent neural network, Elman network, 𝜈-support vector regression and robust least squares support vector machines;

8 Laozi, also known as Lao Tzu, Lao-Tsu, Lao Tse, Lao Tu, Laotze, Laosi and Laocius (6th century BC), was a Chinese philosopher, best known as the author of the Tao Te Ching. His association with the Tao Te Ching led him to be traditionally considered the founder of philosophical Taoism. He is revered as a deity in most religious forms of Taoist philosophy.

In order to put the proposed meta-learning system into practice, a way to define and select meta-features for time-series would be useful.
The alternative, a set of pre-specified meta-features as used in the majority of applications, is not the optimal solution. In order to arrive at a good set of meta-features, I will propose a way to define and select meta-features for time-series problems, particularly for problems involving load. The hypothesis in this regard is the following:

H-2. Establishing a procedure for defining and selecting time-series meta-features for electric load problems will provide a structured and systematic process for selecting meta-features that stand out as favourable;

The meta-learning system that I will propose can be a step towards finding the load forecaster which returns the lowest load forecasting error on each load forecasting task. It can be utilized for model selection and used with an ensemble for classification at the meta-level; this provides more stable results than if only one classification algorithm is used. The third hypothesis is the following:

H-3. Using the proposed model-selection meta-learning, based on bi-level learning with an ensemble for classification at the higher level, will ease the process of selecting the optimal load forecasting model and return results more stable than a single classification algorithm at the meta-level, such as 𝜀-SVM;

While practicing load forecasting, it is always useful to find new rules which help, in general, to reduce the load forecasting error. In this regard, I want to test how the introduction of special features for holidays and seasons improves the load forecasting error. The last hypothesis is:

H-4. Introducing calendar features for seasons or calendar features for holidays decreases the load forecasting error.

Now let us take a look at other problems that might arise. Firstly, I will propose a model which can be used for multivariate time-series forecasting and which will help forecasting practitioners who use multivariate time-series.
I will apply the proposed model to load forecasting, since this is one of the most common time-series problems. To get to that point, I will present the methodology and an outline of the meta-learning system. One of the early open problems is choosing the optimal programming environment. Once the programming environment is selected and the system is sketched, the next problem is how to manipulate the data. Since the answer is never straightforward and depends on the application, I have to consider various approaches. These approaches are mainly tied to pre-processing (normalization and feature selection), but are also important for other parts, such as parameter selection (kernel types and their parameters, tuning parameters for neural networks) and evaluation metrics. Some of these will be available through modules, in order to see which one works best. The organization of learning at the meta-level is an open problem in its own right. The state of the art in meta-learning shows that each approach is unique and can be quite different from previous research. How should the learning be organized and the knowledge extracted from the data it acquires? Next is the forecasting itself. There are numerous ways in which each algorithm can be implemented. Here approaches differ: in many, but not all, cases established versions of the algorithms are available. Each algorithm has its own logic and standard rules which have to be taken into account. With that in mind, for some algorithms the questions are which type of optimization to use and which parameters to optimize. When all is set, some issues arise from the data. Do I use it raw or do something with it? If the latter, additions to the expected pre-processing may be needed for the data to enter the forecasting system.
Which type of pre-processing the data will need depends on the format of the data and the type of the problem. Once the data is ready, the knowledge extraction from it, at the meta-level, has to be set up. For that purpose, I have to see which meta-features I can make from the data and then choose the most appropriate ones from those created. I then have to experiment and find information that improves the results; one of the interesting goals is to find optimal parameters common to several algorithms, or the behaviour of important parameters in terms of parameter settings and the performance measure used. The results to be improved might differ based on the type of the load forecasting task. Nowadays, load forecasting is separated into four types; subsection 1.3.6 showed the differences between those tasks. Correspondingly, a different approach is used to forecast load depending on the type of load forecasting. I will apply the proposed model as a meta-learning system which can be used to forecast load of all types whilst maintaining a comparably low forecasting error. It learns from past knowledge and, therefore, the more tasks of a certain kind it has, the higher the chances that it will perform well on such a task. Because the majority of load forecasting tasks, in both research and practice, are short-term load forecasting, more emphasis will be placed on short-term load forecasting tasks. The majority of the examples will be STLF tasks and one of the algorithms will be applicable only to STLF tasks. The goal is to achieve a general framework for load forecasting which can be used for load forecasting model selection. So, off we go.

"Organizing is what you do before you do something, so that when you do it, it is not all mixed up"
Winnie-the-Pooh 9

3 Methodology
This section is an overview of the proposed meta-learning system.
The fourth and fifth sections are dedicated to particular parts of the meta-learning system. The fourth section, about meta-learning system modules, covers all but the most important module. The fifth section relates to forecasting algorithms and, in its subsections, gives a detailed overview of the forecasting algorithms module, which is the core of the meta-learning system. A substantial part of this thesis is given in [63].

The proposed meta-learning system can be applied to both continuous and interval time-series. It does not make a distinction between them, because it can create meta-features from both types. Also, the algorithms can run concurrently on both types of time-series. It is common in machine learning that algorithms for time-series regression can be applied to both continuous and interval time-series. The proposed system could be extended to work with streams of data (infinite time-series). However, the creation of online versions of the used algorithms, and their application for the purpose of meta-learning, is outside the scope of this thesis.

Whilst the majority of load forecasting and meta-learning approaches learn on a single level, the proposed meta-learning system learns on two levels: the load forecasting task (Task) level and the meta-level. Three examples of Tasks are STLF of a small industrial facility, MTLF of a whole supply area and LTLF of a whole country. Meta-learning is introduced to select the best candidate algorithm for a new Task, based on the similarity between Tasks and the performance of algorithms on observed Tasks. Figure 3, a more detailed version of Figure 2, depicts the working of the proposed meta-learning system. Learning at the forecasting level is represented by the lower right cloud, in which the names of the composing forecasting algorithms are written. At the forecasting level, feature space and feature selection ought to be smaller clouds within that cloud.
However, for simplicity, these are not illustrated in Figure 3, in which the meta-level is represented as the arrows between all five clouds.

9 Winnie-the-Pooh, also called Pooh Bear, is a fictional bear that first appeared in the book Winnie-the-Pooh (1926) by A. A. Milne. Later, Disney created the cartoon character for which Winnie-the-Pooh is well known.

Learning at the meta-level takes place in the ensemble, illustrated by the central cloud with seven words representing the classification algorithms of which the ensemble consists. Meta-features, created for each Task, form the basis for learning at this level. Learning about learning is represented in the similarity of the meta-features. For all Tasks in the meta-learning system (except the new ones), the previously calculated performance of the forecasting algorithms is available. Using the notion that algorithms would have similar rankings for similar load forecasting tasks, the proposed meta-learning system associates the algorithm ranking with a new Task.

Figure 3: Illustration of the Proposed Meta-learning System

3.1 Organization of a Meta-learning System
I selected a component-based, modular approach because it makes it possible to extend the meta-learning system easily and emphasizes meta-learning's strength of becoming stronger over time. The proposed meta-learning system consists of the following six modules:
• Load data;
• Normalization;
• Learn meta-features;
• Feature selection;
• Forecasting algorithms; and
• Error calculation and ranking.
These modules can be separated into three distinctive parts: the first two modules are pre-processing; the next three modules are meta-learning; and the last module represents the results. The pre-processing part's role is to prepare the Tasks for forecasting. The meta-learning part's role depends on whether the meta-learning system is in learning or in working mode.
In learning mode, it learns on different combinations of feature selection and forecasting algorithms and, in working mode, it performs forecasting based on meta-feature similarity. In both modes, it creates meta-features for the Task at hand. In the results module, the forecasting error is calculated; the algorithm ranking is given; the knowledge at the meta-level is recalculated; and the results are stored and displayed. Figure 4 illustrates the proposed meta-learning system's flowchart.

Figure 4: The Flowchart shows the Organization of the Modules in the Meta-learning System

The outer loop of the meta-learning system iterates through combinations of forecasting algorithms and feature selection and, in a recurrent simulation for a particular combination, the inner loop carries out the forecasting. The third loop envelopes the meta-learning part and covers all new Tasks. The proposed meta-learning system creates the meta-features for each Task. In order to implement the meta-learning system, we look for the environment which would most increase the solution's usability. Therefore, in the next subsection, we consider possible programming languages and their properties.

3.2 Programming Language
The decision about which programming language to choose was mainly between two classes: older, more widely used programming languages like C++, MATLAB and R; and newer Java-based environments such as WEKA and RapidMiner. Although MATLAB lacks the speed of C++ and the open-source nature of the other candidates, it was the best choice. The most important reason is that it is the most used tool in this research field and provides many possibilities and readily available code. Its second advantage is the best balance between language complexity and ease of use, since the user can see the details at a low level yet does not spend too much time developing frequently used code and data structures.
Two important advantages, which not all of the other programming languages have, are parallelization and the compactness of the code in terms of vectorization. Both of these decrease the runtime. As I will note in the next section, there are some classification algorithms at the meta-level which, due to their availability, I did not implement in MATLAB; I used RapidMiner/Java instead.

"The only reason for time is so that everything doesn't happen at once."
Albert Einstein 10

4 Multivariate Meta-Learning: Meta-Learning System Modules
This section overviews all the modules with the exception of the forecasting algorithms. Together, by using meta-learning [63], they create the workflow for multivariate forecasting.

4.1 Load data
The first module is simple: it loads the data, sets the parameters and chooses the new Task. Firstly, the program is initialized by adding the required additional library, and the timer which measures the runtime is started:

%% Initialization
clear all; close all; clc
addpath('E:\PhD\Softver\MATLAB\LSSVMlabv1_8_R2009b_R2011a');
%Start timing
tic

Then the data of all the datasets, which reside in a large textual file, are loaded, followed by the data about the meta-features:

%Read .txt
data = dlmread('LoadSet.txt', '\t', 0, 0);
%Load metafeatures table
load('Dataset_meta.mat');

In order to be able to manipulate the calendar information in a more compact form, timestamps are created:

%Create timestamps

10 Albert Einstein (14 March 1879 – 18 April 1955) was a German physicist who developed the general theory of relativity, and received the 1921 Nobel Prize in Physics.
He is well known for his mass-energy equivalence formula E = mc².

starter1 = [2006 1 1 0 59 0; 2006 1 1 1 59 0];
timestamp = datenum(starter1);
diff1 = timestamp(2) - timestamp(1);
for i=2:51264, timestamp(i) = timestamp(i-1)+diff1; end

The information for MTLF and LTLF Tasks is created easily, based on the information about the STLF Tasks from the initial data:

%Daily values dataset
data2 = reshape(sum(reshape(data, 24, 2136*78)), 2136, 78);
data2(:,[2:11 36 37 65:76]) = data2(:,[2:11 36 37 65:76])/24;
%Monthly values dataset
datatemp3 = data2(1:2130,:);
data3 = reshape(sum(reshape(datatemp3, 30, 71*78)), 71, 78);

Finally, a single dataset is selected:

%Select the dataset
Y = data(Dataset(k).from:Dataset(k).to,Dataset(k).Yc);
X = data(Dataset(k).from:Dataset(k).to,Dataset(k).Xc);
TSused = timestamp(Dataset(k).from:Dataset(k).to,:);
description = 'This is for my thesis. Hi Mum!';

The next subsection covers the initial pre-processing of the data.

4.2 Normalization
The datasets are normalized using one of the following:
• Standardization;
• [0, 1] scaling;
• [-1, 1] scaling; and
• Optimal combination.
We apply normalization to the data in order to put it on the same scale. Neural networks and support vector machines use the data normalized in this module as input data for forecasting. Standardization, or z-score, subtracts the mean from each value and divides the difference by the standard deviation. The resultant code is:

%Normalize load and temperature using standardization
s(1) = Dataset(k).stdY;
s(2) = std(X(:,1),1);
m(1) = Dataset(k).meanY;
m(2) = mean(X(:,1));
Y = (Y - m(1))./s(1);
X(:,1) = (X(:,1) - m(2))./s(2);

The normalization parameters are stored for later. After the forecasts are obtained, those parameters are used to return the forecasts to the non-normalized scale.
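The inverse transform mentioned above is not shown in the MATLAB listing. A minimal Python sketch of the round trip (standardize with the population standard deviation, as MATLAB's std(X,1) does, then de-normalize the forecasts with the stored parameters) is given below; the data values are hypothetical.

```python
def standardize(values):
    """z-score normalization; returns the normalized values together with
    the (mean, std) parameters needed to invert the transform later."""
    m = sum(values) / len(values)
    s = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5  # population std
    return [(v - m) / s for v in values], m, s

def destandardize(normalized, m, s):
    """Inverse of standardize: map normalized forecasts back to the original scale."""
    return [z * s + m for z in normalized]

load = [90.0, 100.0, 110.0, 120.0, 80.0]  # toy load values
z, m, s = standardize(load)
restored = destandardize(z, m, s)
print(all(abs(a - b) < 1e-9 for a, b in zip(load, restored)))  # round trip holds
```

In the thesis workflow, destandardize corresponds to the step that returns the forecasts from the normalized space to real load values using the stored m and s.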
Normalization, using the minimum and maximum to scale linearly to the interval [0, 1], is straightforward:

%Normalize load and temperature using [0, 1]
m(3)=min(Y); m(4)=max(Y); m(5)=m(4)-m(3);
m(6)=min(X(:,1)); m(7)=max(X(:,1)); m(8)=m(7)-m(6);
Y=(Y-m(3))/m(5);
X(:,1)=(X(:,1)-m(6))/m(8);

Normalization to the interval [-1, 1] works in an analogous way:

%Normalize load and temperature using [-1, 1]
m(3)=min(Y); m(4)=max(Y); m(5)=m(4)-m(3);
m(6)=min(X(:,1)); m(7)=max(X(:,1)); m(8)=m(7)-m(6);
Y=2*(Y-m(3))/m(5)-1;
X(:,1)=2*(X(:,1)-m(6))/m(8)-1;

The final type of normalization, the optimal combination, selects a different type of normalization for each time-series.

4.3 Learn Meta-features
The learn meta-features module incorporates the learning at the meta-level. It consists of creating the meta-features for new Tasks. New meta-features can be added easily and the forecasting error ranking recalculated without the need to repeat any computationally expensive parts such as forecasting. If there is no model on which to learn, the learn meta-features module runs in learning mode and builds the knowledge at the meta-level using the ensemble of classification algorithms. Otherwise, the module is in working mode. In working mode, after creating the meta-features for a new Task, it first uses the ensemble to determine the ranking list of the recommended best candidates. Later, based on further forecasting results for the Task, the ensemble-based model can be updated to include the information about the forecasting results of the new Task. Section 7 explains the meta-features and the approach to creating them. This subsection explains the ensemble and all the components of which it is built.

4.3.1 The Ensemble
Ensembles are voting machines built from two or more models of the same algorithm or models of different algorithms. The notion of building ensembles came from the idea that different algorithms could enhance overall performance in certain areas, e.g.
algorithms 1 and 2 are used for three different problems. On the first problem, algorithm 1 has superior performance and its performance is average on the other problems, whilst algorithm 2 has superb performance on the second problem and average performance on the rest. Using an equally weighted ensemble, it is possible to obtain better results than with any of the constituent algorithms alone. Popular types of ensembles include bootstrap aggregating (bagging), boosting, and the Bayes optimal classifier.

The proposed ensemble is an equally weighted ensemble built from the following classification algorithms:
• Euclidean distance;
• CART decision tree;
• LVQ network;
• MLP;
• AutoMLP;
• ε-SVM; and
• Gaussian Process (GP).

The following subsections provide more information about each of the classification algorithms which build the ensemble.

4.3.2 Euclidean Distance
Euclidean distance (ED) is the simplest algorithm included in the ensemble. It is the distance in Euclidean space between each two different Tasks:

d(T_i, T_j) = \sqrt{\sum_{k=1}^{n_{mf}} \left( mf_{T_i,k} - mf_{T_j,k} \right)^2}, \quad \forall i \neq j

Where:
T_i and T_j are Tasks;
mf_{T_i,k} is the value of the k-th meta-feature for the Task T_i; and
n_{mf} is the number of meta-features.
The following code is used for the calculation of the Euclidean distance:

%Calculate the Euclidean distance between the new Task and the database
temp = TargetDT;
for i=1:size(DatasetRun,2),
    d(i) = 0;
    d(i) = d(i)+(DatasetRun(i).minY-temp.minY)^2;
    d(i) = d(i)+(DatasetRun(i).meanY-temp.meanY)^2;
    d(i) = d(i)+(DatasetRun(i).stdY-temp.stdY)^2;
    d(i) = d(i)+(DatasetRun(i).skewY-temp.skewY)^2;
    d(i) = d(i)+(DatasetRun(i).kurtY-temp.kurtY)^2;
    d(i) = d(i)+(DatasetRun(i).periodicityY-temp.periodicityY)^2;
    d(i) = d(i)+(DatasetRun(i).lengthY-temp.lengthY)^2;
    d(i) = d(i)+(DatasetRun(i).ACFY-temp.ACFY)^2;
    d(i) = d(i)+(DatasetRun(i).vol-temp.vol)^2;
    d(i) = d(i)+(DatasetRun(i).trend-temp.trend)^2;
    d(i) = d(i)+(DatasetRun(i).mr-temp.mr)^2;
    d(i) = d(i)+(DatasetRun(i).gran-temp.gran)^2;
    d(i) = d(i)+(DatasetRun(i).exog-temp.exog)^2;
    d(i) = sqrt(d(i));
end
[C,I] = min(d);

The code's last line selects the Task at the smallest Euclidean distance; a ranking based on ascending Euclidean distance can be used as a quick ranking preview.

4.3.3 CART Decision Tree
The CART decision tree [64] is a powerful and widely used type of binary decision tree for classification and regression. It works with both ordered and unordered data and can also work if some of the data are missing. The tree is easy to prune by optimizing the sub-tree sizes and specifying the minimum number of parent and leaf nodes. It is possible to assign weights and prior probabilities to each data point which, in our case, is represented by mf_{T_i,k}, the value of the k-th meta-feature for a certain Task T_i, with the target class being the class of the best ranked forecasting algorithm for T_i. Figure 40, in a later subsection, illustrates the classification results with a CART decision tree. Although it is possible to employ different splitting criteria in CART, the version I employed uses the Gini impurity index (GI).
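The Gini impurity computation can be sketched in a few lines; the following Python sketch is purely illustrative (the node labels are made up, not taken from the thesis data):

```python
def gini_impurity(labels):
    """GI = 1 - sum_c r_c^2, where r_c is the fraction of records in class c."""
    n = len(labels)
    classes = set(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in classes)

# A node holding the best-ranked forecasting algorithm label for four Tasks
node = ["RobLSSVM", "RobLSSVM", "MLP", "ARMA"]
gi = gini_impurity(node)  # 1 - (0.5^2 + 0.25^2 + 0.25^2) = 0.625
```

A pure node (all records in one class) gives GI = 0, which is what the splitting criterion drives towards.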
The Gini impurity index is a measure of misclassified cases over a distribution of labels in a given set. For each node, the Gini impurity index GI is equal to:

GI = 1 - \sum_{c=1}^{n_{FA}} r_c^2

Where:
n_{FA} is the number of forecasting algorithms in the meta-learning system;
r_c is the percentage of records belonging to a class c; and
c is a class index which corresponds to the index of the best ranked forecasting algorithm.

The following code is used for the CART decision tree:

t = classregtree(XRun(1:size(DatasetRun,2),:),Y(1:size(DatasetRun,2)),...
    'names', {'min' 'mean' 'std' 'skew' 'kurt' 'periodicity' 'length'...
    'ACF' 'vol' 'trend' 'mr' 'gran' 'exog'});
yfit = eval(t,TargetX);

4.3.4 LVQ Network
The Learning Vector Quantization network, or LVQ network for short, is a special type of neural network. The LVQ network is based on Hebbian, "winner takes all" learning. LVQ networks are very similar to Kohonen's Self-Organizing Maps (SOMs) and preceded them. The main difference is that SOMs affect neuron topology: they can organize neurons and project the results as a two-dimensional map, whereas an LVQ network ends up with one neuron carrying the decision. From the weights of the other neurons in the competitive layer, it is possible to see how a particular LVQ network learns. Like the SOM, the LVQ network was invented by Teuvo Kohonen [65]. The LVQ network can be used for classification and is organized into a competitive and a linear layer. In the competitive layer, there are always more neurons than in the output layer. Those neurons represent subclasses, and the output neurons represent classification classes. If there are three times more neurons in the competitive layer, then three separate neurons compete to deduce the decision for a single output class. Figure 5 illustrates an example of the LVQ network with neuron firing. Learning is based on the pairing of the input and output values in training cycles.
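The "winner takes all" competition can be sketched as selecting the competitive-layer neuron whose prototype (weight vector) lies closest to the input; the prototypes and input below are made up for illustration:

```python
def winner_takes_all(x, prototypes):
    """Return the index of the prototype (competitive-layer neuron) closest to input x."""
    def dist2(p):
        # Squared Euclidean distance between the input and a prototype
        return sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    return min(range(len(prototypes)), key=lambda i: dist2(prototypes[i]))

prototypes = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
winner = winner_takes_all([0.9, 0.8], prototypes)  # neuron 1 fires
```

During training, the winning prototype is then moved towards the input if the class is correct and away from it otherwise, which is the essence of LVQ1.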
An important part of LVQ networks is the distance measure used for training, since it influences in which direction the LVQ network chooses the best candidate [66]. LVQ1 is the original algorithm which Kohonen invented, and it is the training algorithm used in this implementation. The advantages of LVQ are that it can easily be used to illustrate how learning happens through the network, similar to other types of neural networks, and that experts with domain knowledge can set up the weights easily [67].

Figure 5: Activation of a Neuron in the Output Layer selecting RobLSSVM as the Label for a Random Example given to the LVQ Network

In the end, after the model is trained, only one output neuron is activated, which represents the class to which a new example belongs. Based on good empirical results, the topology 13-14-7-(1) is used with a learning rate of 0.01 and 50 training epochs. Local changes of the learning rate and the number of training epochs affect neither the results nor the runtime much. The following code is used for the LVQ network:

%LVQ Network
T = ind2vec(Y2Run');
targets = full(T);
net = newlvq(XRun',35,prototypes);
Rez = sim(net,XRun');
Yc = vec2ind(Rez);
net.trainParam.epochs = 50;
net = train(net,XRun',T);
Rez = sim(net,XRun');
Yc = vec2ind(Rez);
Rez = sim(net,TargetX');
Yc1 = vec2ind(Rez);
Ylvq = Ime{Yc1};

4.3.5 MLP
The MLP is a well-known type of neural network and probably, nowadays, the most used learning algorithm. Sometimes it is referred to as a feedforward (FF) neural network with back-propagation (BP). It is based on Rosenblatt's Perceptron and the back-propagation algorithm. The MLP can distinguish data which are not linearly separable [68]. Subsection 5.4.8 describes it in more detail. The main difference from the MLP implemented for time-series forecasting, described later, is that at the meta-level the MLP version is used for classification and not for regression.
This MLP for multiclass classification is a special case of the MLP for regression described later. It can be seen as rounding the numerical result to an affiliation with a certain class. The best configuration for the MLP was found empirically by a grid search over the number of neurons in the hidden layer; the best result was obtained with 6 hidden neurons. In the end, I am using the following parameters:
• Sigmoid activation function;
• Gradient descent using Levenberg-Marquardt;
• Momentum 0.2; and
• Learning rate 0.3.
Subsection 5.4 explains these settings. The following code is used for the MLP:

<process expanded="true" height="640" width="165">
<operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation (3)" width="90" x="45" y="30">
<parameter key="number_of_validations" value="5"/>
<parameter key="parallelize_training" value="true"/>
<parameter key="parallelize_testing" value="true"/>
<process expanded="true">
<operator activated="true" class="neural_net" compatibility="5.2.000" expanded="true" name="Neural Net">
<list key="hidden_layers"/>
<parameter key="training_cycles" value="300"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" name="Apply Model (4)">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" name="Performance (4)">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
<connect from_op="Performance (4)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation (3)" to_port="training"/>
<connect from_op="Validation (3)" from_port="model" to_port="output 3"/>
<connect from_op="Validation (3)" from_port="training" to_port="output 2"/>
<connect from_op="Validation (3)" from_port="averagable 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="source_input 3" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>

4.3.6 AutoMLP
AutoMLP is a modern version of the MLP, organized as an ensemble of MLPs which uses a genetic algorithm (GA) and stochastic optimization to search for the best combination of MLPs and their parameters [69]. It is used to determine the size of the learning rate and the optimal number of hidden neurons for the MLP combination. Based on empirical testing, I am using its RapidMiner implementation with 4 ensembles and 10 generations. The results of the classification using AutoMLP are fed to Matlab. Figure 6 shows that part graphically.
Figure 6: A Snapshot of the AutoMLP Operator where the Ensemble Parameters are set

AutoMLP is nested in the following code:

<process expanded="true" height="640" width="165">
<operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation (2)" width="90" x="45" y="30">
<parameter key="number_of_validations" value="5"/>
<parameter key="parallelize_training" value="true"/>
<parameter key="parallelize_testing" value="true"/>
<process expanded="true" height="640" width="432">
<operator activated="true" class="auto_mlp" compatibility="5.2.000" expanded="true" height="76" name="AutoMLP" width="90" x="171" y="30"/>
<connect from_port="training" to_op="AutoMLP" to_port="training set"/>
<connect from_op="AutoMLP" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="640" width="432">
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (3)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance (3)" width="90" x="238" y="30">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
<connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
<connect from_op="Performance (3)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable
2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation (2)" to_port="training"/>
<connect from_op="Validation (2)" from_port="model" to_port="output 3"/>
<connect from_op="Validation (2)" from_port="training" to_port="output 2"/>
<connect from_op="Validation (2)" from_port="averagable 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="source_input 3" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>

4.3.7 ε-SVM
Support Vector Machines (SVMs) are an important class of algorithms developed by Vladimir Vapnik. Similarly to the MLP of the two previous subsections, the support vector machine for classification is a special case of the support vector machine for regression, or SVR. ε-SVM for classification is the classical example of an SVM; sometimes it is also called Vapnik's SVM. ε-SVM preceded ν-SVM, the algorithm which is implemented for time-series forecasting and described later. Unlike some other classes of algorithms using support vectors, it is based on the hinge-loss function. Figure 7 shows an example of the loss function in ε-SVM. Subsection 5.5 provides a detailed explanation of support vector machines.

Figure 7: The ε-loss Function is Zero inside the ε-tube and grows linearly outside of it. It is also named the hinge-loss function because it resembles a hinge

The implementation uses grid search as the means of obtaining the optimum values of C and σ² with ε = 10⁻³. It uses the Radial Basis Function (RBF) kernel for the same fundamental reasons as those stated later for the case of support vector regression.
The code of the implementation is as follows:

<process expanded="true" height="640" width="165">
<operator activated="true" class="optimize_parameters_grid" compatibility="5.2.000" expanded="true" height="112" name="Optimize Parameters (2)" width="90" x="45" y="30">
<list key="parameters">
<parameter key="SVM (2).gamma" value="[0.001;2000;10;logarithmic]"/>
</list>
<process expanded="true" height="640" width="914">
<operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation (4)" width="90" x="380" y="30">
<parameter key="number_of_validations" value="5"/>
<process expanded="true" height="640" width="432">
<operator activated="true" class="support_vector_machine_libsvm" compatibility="5.2.000" expanded="true" height="76" name="SVM (2)" width="90" x="171" y="30">
<parameter key="svm_type" value="nu-SVC"/>
<parameter key="gamma" value="2000.0"/>
<list key="class_weights"/>
</operator>
<connect from_port="training" to_op="SVM (2)" to_port="training set"/>
<connect from_op="SVM (2)" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="640" width="432">
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (5)" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance (5)" width="90" x="238" y="30">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (5)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (5)" to_port="unlabelled data"/>
<connect from_op="Apply Model (5)" from_port="labelled data" to_op="Performance (5)" to_port="labelled data"/>
<connect from_op="Performance (5)"
from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation (4)" to_port="training"/>
<connect from_op="Validation (4)" from_port="model" to_port="result 1"/>
<connect from_op="Validation (4)" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Optimize Parameters (2)" to_port="input 1"/>
<connect from_op="Optimize Parameters (2)" from_port="performance" to_port="output 1"/>
<connect from_op="Optimize Parameters (2)" from_port="parameter" to_port="output 2"/>
<connect from_op="Optimize Parameters (2)" from_port="result 1" to_port="output 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="source_input 3" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>

4.3.8 Gaussian Process
The Gaussian Process (GP) is a relatively new class of algorithms in machine learning and represents one of the most important Bayesian machine learning approaches. It can be shown that the GP is related to support vector machines [70]. Similarly to neural networks and support vector machines, the Gaussian process for classification is a special case of the GP for regression.
The output probability for the membership of a data point in a certain class is a number in the interval [0, 1], but the raw Gaussian process output is not limited to that interval. A GP for classification can be obtained by putting the logistic function (4.1) on the output, thereby limiting the solution to the interval [0, 1]. The logistic function is:

f(z) = (1 + \exp(-z))^{-1}    (4.1)

Binary classification can easily be extended to the multiclass case by performing multiple binary classifications and checking whether a data point belongs to a single class. The Gaussian process can be defined using the mean function m_f(X) and the covariance function k(X, X') of a real process f(X), which is a function of the input data:

m_f(X) = \mathbb{E}[f(X)]

k(X, X') = \mathbb{E}\left[ \left( f(X) - m_f(X) \right) \left( f(X') - m_f(X') \right) \right]

Also, we write the Gaussian process as:

f(x) \sim \mathcal{GP}\left( m_f(X), k(X, X') \right)

Where:
\mathbb{E} is the expectation.

Notice that k(x, x') is the same as the kernel in support vector machines. For regression, Gaussian processes are analogous to ridge regression. Usually in GPs, m(x) is set equal to zero and is unused: it is always possible to optimize the solution by optimizing only k(x, x'), because by optimizing the covariance function we optimize the mean function, too. For the regression case, the GP's marginal likelihood is:

p(Y|X, w) \propto \prod_{GP} \exp\left( -\frac{\left( Y_{GP} - f_w(X_{GP}) \right)^2}{2\sigma_{noise}^2} \right)

With the parameter prior p(w), the posterior parameter distribution follows from Bayes' rule,

p(i|j) = \frac{p(j|i)\,p(i)}{p(j)}

which returns:

p(w|X, Y) = \frac{p(w)\,p(Y|X, w)}{p(Y|X)}

Where:
f_w is the underlying function of a model;
w are weights;
X_{GP} \in x is input data;
Y_{GP} \in y is a label, the output data;
\sigma_{noise}^2 is the squared bandwidth of the noise of a Gaussian process; and
GP is the index of a GP.
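The Bayes-rule update above can be illustrated with a discrete toy example; the prior and likelihood values below are made up purely for illustration:

```python
def posterior(prior, likelihood):
    """p(w|data) = p(w) p(data|w) / p(data), with p(data) as the normalizing sum."""
    unnorm = {w: prior[w] * likelihood[w] for w in prior}
    evidence = sum(unnorm.values())  # p(data), the marginal likelihood
    return {w: v / evidence for w, v in unnorm.items()}

prior = {"w1": 0.5, "w2": 0.5}       # p(w): two candidate parameter settings
likelihood = {"w1": 0.8, "w2": 0.2}  # p(Y|X, w): how well each explains the data
post = posterior(prior, likelihood)  # p(w|X, Y)
```

The normalizing constant here plays the role of the marginal likelihood p(Y|X), the quantity that GP model selection optimizes.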
In a continuous form, the marginal likelihood can be expressed as:

p(Y|X) = \int p(w)\,p(Y|X, w)\,dw

and the forecast for a new data point can be modelled as:

p(Y^*|X^*, X, Y) = \int p(Y^*|w, X^*)\,p(w|X, Y)\,dw

In his lecture, Rasmussen [71] gave, through the log marginal likelihood, the following simple explanation of the marginal likelihood:

\log p(Y|X, M_i) = -\frac{1}{2} Y^T K^{-1} Y - \frac{1}{2} \log|K| - \frac{n}{2} \log 2\pi

Where:
K is the covariance matrix, which is the same as the kernel matrix, also known as the Gram matrix; and
M_i is the model.

Whilst the third term is constant, the first term represents the fit to the data and the second term represents the model's complexity. In this regard, optimization of the marginal likelihood works as a trade-off between model complexity and fit to the data (bias vs. variance, similar to many other learning-based algorithms). In Gaussian processes, the more information we have, the more confident we are about the locations at which the function will take its remaining values. Figure 8 demonstrates that the addition of a point increases the confidence that a new point will lie within an interval, because the confidence-interval area circled in indigo has shrunk.

Figure 8: Three Random Functions drawn from the Posterior of a Randomly Selected Distribution. The Dashed Grey Line represents a Confidence Interval of Two Standard Deviations for Each Input Value

For GPs, classification is more complex than regression because it is no longer possible to use the assumption that the likelihood function is Gaussian. The likelihood function is not Gaussian because class membership is not a continuous function but takes discrete values. Because exact inference is infeasible in the classification case, an approximate inference is calculated. The probabilistic approach to classification can be either generative, modelled as p(Y)p(X|Y), or discriminative, modelled as p(X)p(Y|X).
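Returning to the regression case, Rasmussen's three-term decomposition of the log marginal likelihood can be evaluated numerically; the sketch below uses a toy 2x2 covariance matrix with made-up values:

```python
import math

def log_marginal_likelihood(Y, K):
    """log p(Y|X) = -0.5 Y' K^{-1} Y - 0.5 log|K| - (n/2) log(2 pi), for a 2x2 K."""
    (a, b), (c, d) = K
    det = a * d - b * c  # |K|
    Kinv = [[d / det, -b / det], [-c / det, a / det]]
    quad = sum(Y[i] * Kinv[i][j] * Y[j] for i in range(2) for j in range(2))
    n = len(Y)
    return -0.5 * quad - 0.5 * math.log(det) - 0.5 * n * math.log(2 * math.pi)

K = [[1.0, 0.5], [0.5, 1.0]]  # toy covariance (kernel / Gram) matrix
Y = [0.2, -0.1]
ll = log_marginal_likelihood(Y, K)
```

The first term rewards fit, the second penalizes complexity (a larger |K| means a broader prior over functions), and the third is constant in the hyperparameters.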
Both approaches are valid. The implementation used here is based on the discriminative approach. I compared three different versions, combining the RBF, the Epanechnikov kernel, and a combination of three Gaussians, all optimized using grid search. I use the RBF GP because it returned the best results. The ensemble's last four components are implemented in RapidMiner. Figure 9 shows a graphical overview of the loop of the macro which runs through the four ensemble components and stores the results for Matlab.

Figure 9: Graphical Overview of the Loop which runs through the Four Ensemble Components

The code for the Gaussian process is:

<process expanded="true" height="640" width="165">
<operator activated="true" class="optimize_parameters_grid" compatibility="5.2.000" expanded="true" height="112" name="Optimize Parameters (Grid)" width="90" x="45" y="30">
<list key="parameters">
<parameter key="Gaussian Process.kernel_sigma1" value="[1;100;4;logarithmic]"/>
<parameter key="Gaussian Process.kernel_sigma2" value="[50;500;3;logarithmic]"/>
<parameter key="Gaussian Process.kernel_sigma3" value="[90;1000;3;logarithmic]"/>
</list>
<process expanded="true" height="640" width="914">
<operator activated="true" class="x_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="380" y="30">
<parameter key="number_of_validations" value="5"/>
<parameter key="parallelize_training" value="true"/>
<parameter key="parallelize_testing" value="true"/>
<process expanded="true" height="640" width="432">
<operator activated="true" class="classification_by_regression" compatibility="5.2.000" expanded="true" height="76" name="Classification by Regression" width="90" x="171" y="30">
<process expanded="true" height="640" width="914">
<operator activated="true" class="gaussian_process" compatibility="5.2.000" expanded="true" height="76" name="Gaussian Process" width="90" x="412" y="30">
<parameter key="kernel_type" value="gaussian
combination"/>
<parameter key="kernel_lengthscale" value="10"/>
<parameter key="kernel_degree" value="3"/>
<parameter key="kernel_sigma1" value="100.0"/>
<parameter key="kernel_sigma2" value="500.0"/>
<parameter key="kernel_sigma3" value="1000.0"/>
</operator>
<connect from_port="training set" to_op="Gaussian Process" to_port="training set"/>
<connect from_op="Gaussian Process" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
</process>
</operator>
<connect from_port="training" to_op="Classification by Regression" to_port="training set"/>
<connect from_op="Classification by Regression" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="640" width="432">
<operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance (2)" width="90" x="238" y="30">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validation"
to_port="training"/>
<connect from_op="Validation" from_port="model" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="source_input 3" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
<connect from_port="input 2" to_op="Optimize Parameters (Grid)" to_port="input 2"/>
<connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="output 1"/>
<connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="output 2"/>
<connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="output 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="source_input 3" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
<portSpacing port="sink_output 4" spacing="0"/>
</process>

4.4 Feature Selection
Feature selection is an important part of modelling because selecting the optimal data and representing it in the right order
can affect the end results greatly. The possibilities for feature selection are practically infinite, and it can take a lot of time to find the best selection of features for a certain problem. Feature selection is especially important in domains with many features (above 500). By dismissing some, or a substantial part, of the features which are not found to be relevant to the process, feature selection can help in:
• Reducing the effect of the curse of dimensionality;
• Reducing the runtime of the whole process;
• Increasing the generalization of the model; and
• Making the model easier to interpret and visualize.
There are many possible approaches to feature selection. In general, they are divided into simple approaches and methodological approaches. The feature selection module has the following options: All, Optimize lags, and Default. All is the option in which the whole feature set is used in the forecasting. This approach does not lead to optimal results because, frequently in practice, unimportant features increase the load forecasting error substantially. Optimize lags iterates through different selections of lags up to the periodicity of the load time-series and, together with the other features, forwards them to the forecasting part. In this way, the best autoregressive exogenous or nonlinear autoregressive exogenous feature selection is found for a given time-series. The Default option is a product of long empirical testing.

4.5 Error Calculation and Ranking
The last module is used for calculating the errors, ranking the results, and storing and displaying the results. Error calculation is an important part of time-series forecasting because it determines the way in which results are evaluated. There are numerous metrics for the error calculation of time-series. The most important metrics are: Mean Absolute Error; Mean Square Error; Root Mean Square Error; Mean Absolute Percentage Error; and Mean Absolute Scaled Error.
They are defined as follows:

Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{Y}_i - Y_i \right|$$

Mean Square Error (MSE):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{Y}_i - Y_i \right)^2$$

Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{Y}_i - Y_i \right)^2}$$

Mean Absolute Percentage Error (MAPE):

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{Y}_i - Y_i}{Y_i} \right|$$

Mean Absolute Scaled Error (MASE):

$$\mathrm{MASE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{Y}_i - Y_i}{\frac{1}{n-1} \sum_{i=2}^{n} \left| Y_i - Y_{i-1} \right|} \right|$$

RMSE and MAPE are the most widely used performance measures in time-series forecasting [18] and in load forecasting [19]. However, for meta-learning of the system and, later, for performance comparison, MASE [72] and the Normalized Root Mean Square Error (NRMSE) are used instead. This is because RMSE requires series on the same scale, and MAPE suffers from division by zero. Amongst the different versions of NRMSE, I selected RMSE over the standard deviation since it best depicts different scales in one format. It is defined as:

$$\mathrm{NRMSE} = \sqrt{ \frac{ \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 }{ \sum_{j=1}^{n} \left( Y_j - \bar{Y} \right)^2 } }$$

The code for the MAPE calculation is as follows:

mape = Dataset(k).stdY*(Yrez(border+d1:border+d2,:) ...
    - repmat(Y(border+d1:border+d2),1,:)) ...
    ./ (repmat(Y(border+d1:border+d2),1,:)*Dataset(k).stdY ...
    + Dataset(k).meanY*ones(size(Yrez(border+d1:border+d2,:))));
AvMAPE = mean(abs(mape));

The code for the RMSE and NRMSE calculation is as follows:

rmse = sqrt(mean((Yrez(border+d1:border+d2,:) ...
    - repmat(Y(border+d1:border+d2),1,:)).^2));
rmse2 = rmse*Dataset(k).stdY;
nrmse1 = rmse/mean(Y);
nrmse2 = rmse/(max(Y)-min(Y));
nrmse3 = rmse/std(Y);

MASE is calculated using the following part:

MASgo = abs(Yrez(border+d1:border+d2,:) ...
    - repmat(Y(border+d1:border+d2),1,:));
MASlen = mean(abs(Y(2:end)-Y(1:end-1)));
mase = MASgo/MASlen;
AvMASE = mean(abs(mase));

Afterwards, the results are added to the structure containing the results for a certain Task 𝑘. In order to rank the results, the ranking part uses the results which are obtained from the ensemble’s components.
The obtained ranking is:

$$R_{i,j} = \begin{cases} \max \operatorname{count}\left( R_{1,j,k} \right), & i = 1, \forall j, \forall k \\ R_{i,j,7}, & i > 1, \forall j \end{cases}$$

where:
𝑅𝑖,𝑗 is the ensemble ranking;
𝑅𝑖,𝑗,𝑘 are the forecasting algorithm rankings for the 𝑖-th place;
𝑗 is the Task index; and
𝑘 is the classification algorithm index, such that the GP index is 7.
Those rankings are on the Task level and are based on MASE aggregated over all the cycles of each Task. The measures, stored in a structure for each Task, are:

Testrez(k).Yrez = Yrez;
Testrez(k).MAPE = mape;
Testrez(k).AvMAPE = AvMAPE;
Testrez(k).MASE = mase;
Testrez(k).AvMASE = AvMASE;
Testrez(k).rmse1 = rmse;
Testrez(k).rmse2 = rmse2;
Testrez(k).nrmse1 = nrmse1;
Testrez(k).nrmse2 = nrmse2;
Testrez(k).nrmse3 = nrmse3;

The following is the part which returns the label based on the performance of the forecasting algorithms:

%Rank the algorithms
for i=1:k,
    A(i,:) = Testrez(i).AvMASE;
end
[B,RankedIndexes] = sort(A,2);

In the end, the results are stored and various plots are made in order to provide a visual perspective of the meta-learning system performance.

“Opportunities multiply as they are seized”
Sun Tzu 11

5 Forecasting Algorithms

In this section, I give an overview of the forecasting algorithms used in the meta-learning system. The space of time-series forecasting algorithms is wide and constantly becomes richer with new approaches. For the meta-learning system, I selected well-known algorithms of different classes. In the end, the proposed selection of forecasting algorithms for the meta-learning system consists of the following:
1. Random Walk;
2. ARMA;
3. Similar Days;
4. Layer Recurrent Neural Network (LRNN);
5. Multilayer Perceptron;
6. v-Support Vector Regression (v-SVR); and
7. Robust LS-SVM (RobLSSVM).

5.1 Random Walk

Random Walk is a synonym for a trajectory of successive random steps. Pearson used the term first [73]. Nowadays, different versions of random walk exist, depending on the field of application. Figure 10 gives an example of random walk.
Random Walk serves as a fundamental model for a stochastic activity in many fields. Lévy flight and the drunkard's walk are RW special cases [74]. Random walk is used as a benchmark in time-series problems like time-series forecasting. In time-series forecasting, random walk is simple and it is used to estimate the upper error bound. For a load time-series 𝑌, the prediction using RW is defined as:

$$\hat{Y}_i = Y_{i-per} + e_t \qquad (5.1)$$

where, as will be introduced in section 7, 𝑝𝑒𝑟 denotes the periodicity, and 𝑒𝑡 is the white noise, which is uncorrelated from one time to another. If 𝑝𝑒𝑟 = 0, the median 𝑝𝑒𝑟 for the same granularity is used.

11 Sun Wu, better known as Sun Tzu or Sunzi, was an ancient Chinese general, strategist and philosopher, well known as the author of the book “The Art of War”.

Figure 10: An Example of Three Different Random Walks

This RW differs slightly from a common RW approach and uses 𝑝𝑒𝑟 data points in the past, instead of 1, because, in a load forecasting application, each forecast is made without the knowledge of at least one real value which precedes the forecasted period. Selecting 𝑝𝑒𝑟 instead of 2 is more realistic and, due to the highest autocorrelation, returns the best results for RW. Figure 11 compares the load forecasting error, using random walk, with a different number of skipped steps from 1 to 𝑝𝑒𝑟. This comparison shows that 𝑝𝑒𝑟 gives the lowest error. In the typical case of daily or lower granularity, RW relies on the values of the same hour one week previously. On monthly granularity, 𝑝𝑒𝑟 is frequently 12, and, for some other granularities, such as yearly, there is no such yardstick. The additional advantage of random walk calculated in this way is that it is insensitive to outliers in the few most recent data points, because it uses older data.
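As a minimal sketch (not the thesis implementation), the seasonal random-walk forecast of (5.1) can be written as follows; the series and periodicity are made-up toy values:

```python
import numpy as np

def random_walk_forecast(y, per, horizon):
    """Seasonal random walk: each forecasted point repeats the value
    per steps earlier, so no value newer than t - per is required,
    mirroring the skip-per variant described above."""
    y = np.asarray(y, dtype=float)
    return np.array([y[len(y) - per + (h % per)] for h in range(horizon)])

# Hourly toy series with a daily cycle (per = 24), one week long:
hours = np.arange(24 * 7)
load = 10 + np.sin(2 * np.pi * hours / 24)
forecast = random_walk_forecast(load, per=24, horizon=24)
```

Because the toy series repeats every 24 points, the forecast simply reproduces the last observed day, which is exactly the behaviour the benchmark is meant to have.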
Figure 11: MAPE for RW using Various Skips from 1 to 168 on the Example of Task 38. 𝑝𝑒𝑟 is 24 for this example, which can be seen as a low MAPE

5.2 ARMA

The autoregressive moving average model, or ARMA [75], is a classical time-series approach dating back to the Box-Jenkins methodology, which is often used to estimate ARMA models.

5.2.1 ARMA in Load Forecasting

In numerous early studies [12], [17] and [76], ARMA was applied successfully to load forecasting. Fan and McDonald [77] implemented online ARMA modelling, using the weighted recursive least squares algorithm, for a short-term load forecasting application in a distribution utility. The model had a 2.5 % error for forecasts up to one week ahead. Chen et al. [78] used an adaptive ARMA model for load forecasting. That model used the available forecast errors to update itself, using minimum mean square error to derive the error learning coefficients. As reported, the adaptive scheme outperformed conventional ARMA models. Paarmann and Najar's [79] adaptive on-line load forecasting approach automatically adjusted model parameters according to changing conditions, based on time-series analysis. This approach has two unique features: autocorrelation optimization is used for handling cyclic patterns; and, in addition to updating model parameters, the structure and order of the time-series model are adaptable to new conditions. Nowicka-Zagrajek and Weron [80] fitted standard and adaptive ARMA models, with hyperbolic noise, to a deseasonalized time-series of electric loads in the California power market. The proposed method significantly outperformed the one used by the California Independent System Operator.

5.2.2 Implemented Algorithm

ARMA or 𝐴𝑅𝑀𝐴(𝑝, 𝑞) is a model of a stationary time-series which combines two parts: autoregressive 𝐴𝑅(𝑝); and moving-average 𝑀𝐴(𝑞).
Stationarity is a property of a stochastic process or time-series whose joint probability distribution does not change when shifted in time or space [81]. Also, the mean and standard deviation of a stationary time-series do not change over time or position. Because of the stationarity requirement, time-series are often pre-processed for ARMA using normalization. Autoregression is a property of a time-series method which aims to use the regression of one part of a time-series to explain some other part of the same time-series. It relies on the autocorrelation in a time-series. For a time-series prediction Ŷ, with an autoregression process of order 𝑝, an 𝐴𝑅(𝑝) model is defined as:

$$\hat{Y}_t = \varphi_1 Y_{t-1} + \varphi_2 Y_{t-2} + \dots + \varphi_p Y_{t-p} = \sum_{i=1}^{p} \varphi_i Y_{t-i} + e_t \qquad (5.2)$$

where 𝜑𝑖 are the parameters of the model and 𝑒𝑡 is the white noise [82]. 𝜑𝑖 can be calculated using the least squares methods; the maximum likelihood method; the Yule-Walker equations; or Burg's method [82] and [83]. For a time-series, the moving average model (MA model) [75] is a weighted linear combination of previous values of a white noise process, or a regression of one value of the stochastic time-series against previous white noise random shocks. A moving average model of order 𝑞, 𝑀𝐴(𝑞), is defined as:

$$\hat{Y}_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \dots + \theta_q e_{t-q} = e_t + \sum_{i=1}^{q} \theta_i e_{t-i} \qquad (5.3)$$

where:
𝜃𝑖 are the model's parameters; and
𝑒𝑡 is the white noise.
By using a backshift operator 𝐵, it can be written as:

$$\hat{Y}_t = B(\check{Z})\, e_t, \quad \text{where } B(\check{Z}) = 1 + \theta_1 \check{Z}^{-1} + \theta_2 \check{Z}^{-2} + \dots + \theta_q \check{Z}^{-q}, \quad \check{Z}^{-k} e_t = e_{t-k}$$

As done in practice, this moving average model definition assumes prior normalization to a zero mean. Otherwise, a term representing the mean ought to be added to the model definition. If the roots of 𝐵(Ž) are inside the unit circle, |𝜃𝑖| < 1, the MA model is invertible. In cases in which a moving average model is invertible, it can be rewritten as an autoregressive model [76].
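To make the definitions in (5.2) and (5.3) concrete, a small illustrative sketch (not part of the thesis code, with hand-picked, assumed parameters) simulates an AR(2) process and evaluates the one-step prediction of (5.2) with the known coefficients, so the prediction residuals recover the white noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: Y_t = 0.5*Y_{t-1} + 0.3*Y_{t-2} + e_t,
# with assumed parameters phi = (0.5, 0.3) and unit-variance noise.
phi = np.array([0.5, 0.3])
n = 500
e = rng.normal(0.0, 1.0, n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = phi[0] * y[t - 1] + phi[1] * y[t - 2] + e[t]

# One-step AR(2) predictions per (5.2), using the known parameters:
y_hat = phi[0] * y[1:-1] + phi[1] * y[:-2]
residuals = y[2:] - y_hat  # should behave like the white noise e_t
```

With the true parameters, the residual series equals the generating noise exactly, which is the defining property of the AR decomposition; estimated parameters would only approximate this.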
Additionally, the parameters of one model can be used to determine the parameters of the other [8] and [36]. By combining 𝐴𝑅(𝑝) and 𝑀𝐴(𝑞), 𝐴𝑅𝑀𝐴(𝑝, 𝑞) is defined as:

$$\hat{Y}_t - \sum_{i=1}^{p} \varphi_i Y_{t-i} = e_t + \sum_{i=1}^{q} \theta_i e_{t-i}$$

where all parameters are the same as in (5.2) and (5.3). The best fitting ARMA model is found by model identification. The process of ARMA model identification consists of searching for the best fitting model with different 𝐴𝑅(𝑝), 𝑀𝐴(𝑞) and 𝐴𝑅𝑀𝐴(𝑝, 𝑞) orders. Historically, models were identified manually, by trying low order models such as 𝐴𝑅𝑀𝐴(1,1), 𝐴𝑅𝑀𝐴(2,1) and 𝐴𝑅𝑀𝐴(1,2), or by considering the sample autocorrelation function (ACF) or partial autocorrelation function (PACF) plots. One possible approach to automatic model selection is to compare the accuracy results and obtain reduced statistics of the ARMA model on the data itself and on a long AR model as first-stage input. In this work, I applied the ARMAsel algorithm for ARMA model identification and selection, as developed by Broersen [84]. ARMAsel computes the 𝐴𝑅(𝑝) models of order 𝑝 = 0, 1, 2, … , 𝑛/2, where 𝑛 is the number of data points in the time-series being modelled. The AR model is used for estimation because its parameters are easier to obtain than those of the MA autocorrelation function. Often, the best model order is greater than 0.1𝑛, where 𝑛 is the number of data points in a time-series. In order to estimate AR models, ARMAsel uses Burg's method [85] and [84]. Burg's method is applied because the estimated model is always stationary and, compared to the least squares methods, the maximum likelihood method and the Yule-Walker equations, it gives a lower expected mean square error for the parameters of the estimated AR model. During the iterations of the calculation, using all available information, Burg's method estimates a single coefficient of an order. For the estimation of 𝜍1, 𝑛 − 1 residuals are used; for the estimation of 𝜍2, 𝑛 − 2; and so on.
At the 𝜍-th iteration, the first 𝜍 − 1 coefficients are kept constant. Model parameters are calculated using the Levinson-Durbin recursion. The currently estimated model is filtered out of the data and a single coefficient is estimated. Forward (𝜌𝑓,𝑖) and backward (𝜌𝑏,𝑖) residuals of order zero up to an intermediate order Γ are defined as:

$$\rho_{f,0}(l) = Y_l$$
$$\rho_{f,1}(l) = Y_l + \hat{\varsigma}_1 Y_{l-1}$$
$$\vdots$$
$$\rho_{f,\Gamma}(l) = Y_l + \hat{\varphi}_1^{\Gamma} Y_{l-1} + \hat{\varphi}_2^{\Gamma} Y_{l-2} + \dots + \hat{\varphi}_{\Gamma}^{\Gamma} Y_{l-\Gamma}, \quad l = \Gamma + 1, \dots, n$$

Where the forward residuals 𝜌𝑓,𝑖 are filtered out, the backward residuals 𝜌𝑏,𝑖 are found by filtering the present model:

$$\rho_{b,0}(l) = Y_l$$
$$\rho_{b,1}(l) = Y_{l-1} + \hat{\varsigma}_1 Y_l$$
$$\vdots$$
$$\rho_{b,\Gamma}(l) = Y_{l-\Gamma} + \hat{\varphi}_1^{\Gamma} Y_{l-\Gamma+1} + \hat{\varphi}_2^{\Gamma} Y_{l-\Gamma+2} + \dots + \hat{\varphi}_{\Gamma}^{\Gamma} Y_l, \quad l = \Gamma + 1, \dots, n$$

By using a reverse vector operator, which reverses the order of the values in a vector symmetrically ($\overleftarrow{X}_i = X_{n-i+1}, \forall i \leq n$), the forward and backward residuals can be written in a more compact form as:

$$\rho_{f,\Gamma}(l) = \begin{bmatrix} Y_l & Y_{l-1} & \cdots & Y_{l-\Gamma} \end{bmatrix} \begin{bmatrix} 1 \\ \phi^{[\Gamma]} \end{bmatrix}, \qquad \rho_{b,\Gamma}(l) = \begin{bmatrix} Y_l & Y_{l-1} & \cdots & Y_{l-\Gamma} \end{bmatrix} \begin{bmatrix} \overleftarrow{\phi}^{[\Gamma]} \\ 1 \end{bmatrix}$$

The Levinson-Durbin recursion goes as follows:

$$\iota_0^2 = r(0)$$
$$\varsigma_1 = -\frac{r(1)}{\iota_0^2}$$
$$\iota_1^2 = \iota_0^2 \left( 1 - \varsigma_1^2 \right)$$

for $\Gamma = 1, 2, \dots, p - 1$:

$$\varsigma_{\Gamma+1} = -\frac{r(\Gamma + 1) + \overleftarrow{r}_{\Gamma}^{T} \phi^{[\Gamma]}}{\iota_{\Gamma}^2}$$
$$\iota_{\Gamma+1}^2 = \iota_{\Gamma}^2 \left( 1 - \varsigma_{\Gamma+1}^2 \right)$$
$$\phi^{[\Gamma+1]} = \begin{bmatrix} \phi^{[\Gamma]} + \varsigma_{\Gamma+1} \overleftarrow{\phi}^{[\Gamma]} \\ \varsigma_{\Gamma+1} \end{bmatrix}$$

Where $\phi^{[\Gamma]}$ is the parameter vector for the Γ-th iteration, $\phi^{[\Gamma]} = [\varphi_1^{\Gamma} \; \varphi_2^{\Gamma} \; \cdots \; \varphi_{\Gamma}^{\Gamma}]^{T}$, and the reversed version of that vector is $\overleftarrow{\phi}^{[\Gamma]}$. The reversed version of the autocovariance vector $r = [r(p) \; r(p-1) \; \cdots \; r(1)]^{T}$ is $\overleftarrow{r}^{T}$. $\iota_{\Gamma}^2$ is the residual variance of the AR model.
In the Levinson-Durbin recursion, the forward and backward residuals become:

$$\rho_{f,\Gamma}(l) = \rho_{f,\Gamma-1}(l) + \hat{\varsigma}_{\Gamma}\, \rho_{b,\Gamma-1}(l-1), \qquad l = \Gamma + 1, \dots, n$$
$$\rho_{b,\Gamma}(l) = \rho_{b,\Gamma-1}(l-1) + \hat{\varsigma}_{\Gamma}\, \rho_{f,\Gamma-1}(l)$$

From those equations, the single unknown $\hat{\varsigma}_{\Gamma}$ is estimated by minimizing the sum of squares of the forward and backward residuals:

$$RSS(\Gamma) = \sum_{l=\Gamma+1}^{n} \left( \rho_{f,\Gamma}^2(l) + \rho_{b,\Gamma}^2(l) \right) \qquad (5.4)$$

By setting the derivative with respect to the coefficient to zero, $\hat{\varsigma}_{\Gamma}$ follows as:

$$\hat{\varsigma}_{\Gamma} = \frac{ -2 \sum_{l=\Gamma+1}^{n} \rho_{f,\Gamma-1}(l)\, \rho_{b,\Gamma-1}(l-1) }{ \sum_{l=\Gamma+1}^{n} \left( \rho_{f,\Gamma-1}^2(l) + \rho_{b,\Gamma-1}^2(l-1) \right) }$$

$|\hat{\varsigma}_{\Gamma}| < 1$ because:

$$\sum_{l=\Gamma+1}^{n} \left( \rho_{f,\Gamma-1}(l) \pm \rho_{b,\Gamma-1}(l-1) \right)^2 \geq 0$$
$$\sum_{l=\Gamma+1}^{n} \left( \rho_{f,\Gamma-1}^2(l) \pm 2\, \rho_{f,\Gamma-1}(l)\, \rho_{b,\Gamma-1}(l-1) + \rho_{b,\Gamma-1}^2(l-1) \right) \geq 0$$
$$\mp 2 \sum_{l=\Gamma+1}^{n} \rho_{f,\Gamma-1}(l)\, \rho_{b,\Gamma-1}(l-1) \leq \sum_{l=\Gamma+1}^{n} \left( \rho_{f,\Gamma-1}^2(l) + \rho_{b,\Gamma-1}^2(l-1) \right)$$

All estimated coefficients are less than or equal to 1 in magnitude, since all the poles of the AR parameters are within the unit circle. From (5.4), it follows that the residual variance of Burg's method is related to the residual sum of squares. The fit of an 𝐴𝑅(Γ) model is defined as:

$$\iota_{\Gamma}^2 = \iota_{\Gamma-1}^2 \left( 1 - \hat{\varsigma}_{\Gamma}^2 \right) \qquad (5.5)$$

Broersen and Wensink [86] showed that (5.5) approximates the average of the squared forward and backward residuals of (5.4) as:

$$\iota_{\Gamma}^2 \approx \frac{RSS(\Gamma)}{2(n - \Gamma)} = \frac{ \sum_{l=\Gamma+1}^{n} \left( \rho_{f,\Gamma}^2(l) + \rho_{b,\Gamma}^2(l) \right) }{2(n - \Gamma)}$$

In order to optimize performance, the AR order 𝑝 is selected using the CIC criterion, defined as:

$$CIC(p) = \ln\left( \iota_p^2 \right) + \max\left( \prod_{i=0}^{p} \frac{n + 2 - i}{n - i} - 1,\; 3 \sum_{i=0}^{p} \frac{1}{n + 1 - i} \right)$$

Although the order 𝑝 can be less than or equal to 𝑛 − 1, it is limited to 𝑝 = 𝑛/2, which is the upper boundary for accurate estimations. In order to boost the created model's performance, a stopping criterion on 𝑛 is set to 2000, but, if the performance is better, the ARMA2cor function extends the number of taken data points. 𝑀𝐴(𝑞) is estimated by using the model error calculation. Because 𝐴𝑅(𝑝) is used as an intermediate model for 𝑀𝐴(𝑞), the maximum candidate order for MA is always chosen lower than that of the AR model.
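The core of Burg's recursion, estimating one reflection coefficient per order from the forward and backward residuals and then updating the parameter vector Levinson-Durbin style, can be sketched as follows. This is an illustrative sketch, not ARMAsel itself, and the simulated coefficients are assumed toy values:

```python
import numpy as np

def burg_ar(y, order):
    """Burg's method: at each order, one reflection coefficient is chosen
    to minimize the summed squares of forward and backward residuals (5.4);
    the parameter vector phi^[Gamma] is then updated recursively."""
    y = np.asarray(y, dtype=float)
    f = y.copy()      # forward residuals rho_f
    b = y.copy()      # backward residuals rho_b
    a = np.zeros(0)   # AR parameter vector (convention: Y_l + a1*Y_{l-1} + ...)
    for _ in range(order):
        fp = f[1:]    # rho_f,Gamma-1(l)
        bp = b[:-1]   # rho_b,Gamma-1(l-1)
        k = -2.0 * np.dot(fp, bp) / (np.dot(fp, fp) + np.dot(bp, bp))
        a = np.concatenate([a + k * a[::-1], [k]])  # Levinson-Durbin update
        f, b = fp + k * bp, bp + k * fp             # residual update
    return a

# Recover assumed AR(2) coefficients from simulated data:
rng = np.random.default_rng(1)
n = 5000
e = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + e[t]
a = burg_ar(y, 2)  # expect roughly (-0.6, 0.2) in this sign convention
```

Note the sign convention: with residuals written as $Y_l + \varphi_1 Y_{l-1} + \dots$, the recovered parameters are the negated prediction coefficients, so a stable estimate here is close to (−0.6, 0.2).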
For the estimated AR model, the best MA model order 𝑄 is chosen by using the general information criterion (GIC), with 𝑄 = min(𝑞, 3), defined as:

$$GIC(\Gamma, \alpha) = \ln \iota_{\Gamma}^2 + \alpha \frac{\Gamma}{n}$$

where 𝛼 is the penalty factor. For the best MA model order 𝑄, the parameters are estimated by using the previously estimated AR model. ARMA2cor determines the maximum length of the 𝐴𝑅(𝑝) and 𝑀𝐴(𝑞) parts. It uses the correlation results of the function GAIN, which is used to determine a power gain. The power gain is defined as the ratio of the variance of a time-series 𝑌 to the variance of a white noise 𝑒. ARMAcor determines the number of lags for the MA autocorrelation function. If it exceeds 2000, the number is extrapolated for the AR contributions [82] and [87]. ARMAsel [82] and [88] returns the candidates for the best model. The best one is selected from those candidate models by minimizing the 𝐺𝐼𝐶 criterion. Similar to the 𝑀𝐴 residuals, the residuals are obtained by backwards extrapolation with a long 𝐴𝑅 model. For ARMA, the final prediction error (FPE) uses the relationship between unbiased expectations of the residual variance 𝜄Γ² and the prediction error for Γ. It is defined as:

$$FPE(\Gamma) = \iota_{\Gamma}^2\, \frac{n + \Gamma}{n - \Gamma}$$

By selecting the best 𝐴𝑅(𝑝) model, the best 𝑀𝐴(𝑞) model and the best 𝐴𝑅𝑀𝐴(𝑖, 𝑗) model, FPE is computed as an estimate of the prediction error. For MA and ARMA, the prediction error is calculated using asymptotic relations as:

$$PE(m_{par}) = \frac{n + m_{par}}{n - m_{par}} \cdot \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i}{B(\check{Z})} \right)^2$$

where 𝑚𝑝𝑎𝑟 is the number of a model's estimated parameters. The asymptotic relations are applied to all moving average and ARMA models with 𝑚𝑝𝑎𝑟 parameters. For 𝐴𝑅(𝑝) models, 𝑃𝐸(𝑝) is derived from the statistical properties of standard deviation and variance for the finite-sample expectations as:

$$PE(p) = \iota_p^2 \prod_{m_{par}=1}^{p} \frac{1 + \frac{1}{n + 1 - m_{par}}}{1 - \frac{1}{n + 1 - m_{par}}}$$

This can be substituted by the finite-sample variance of the coefficients 𝜍𝑖:

$$var(\varsigma_i) = \frac{1}{n + 1 - i}$$

The model type which has the smallest estimate of the prediction error is selected.
That single model is the result of ARMAsel. In general, as exemplified by the accuracy comparison in Figure 12, the estimated model accuracy is better for 𝐴𝑅𝑀𝐴(𝑝, 𝑞) than for 𝐴𝑅(𝑝) and 𝑀𝐴(𝑞). Predictions Ŷ𝑡+1 … Ŷ𝑡+𝐻 are calculated in one batch by applying the model to the original time-series 𝑌 up to the data point 𝑌𝑡. 𝐻 represents the calculated forecasting horizon and it depends on the data granularity and the forecasting horizon 𝑠𝑡𝑒𝑝. The 𝑠𝑡𝑒𝑝 is 24 for the hourly granularity; 30 for the daily granularity; 12 for the monthly granularity; and 3 for the yearly granularity. The calculated forecasting horizon is 36 for the hourly granularity; 31 for the daily granularity; 13 for the monthly granularity; and 3 for the yearly granularity. After the calculation, load forecasting using ARMA is done iteratively, similarly to the other algorithms and as described in the previous section.

Figure 12: Comparison shows a typical case: 𝐴𝑅𝑀𝐴(𝑟, 𝑟 − 1) has higher accuracy compared to 𝐴𝑅(𝑟) and 𝑀𝐴(𝑟), which indicates that it also returns a lower error

Figure 13 presents a comparison of the best selected 𝐴𝑅𝑀𝐴(𝑝, 𝑞) model with the actual load.

Figure 13: Load Forecasting using ARMA for the First 10 Days of Task 38

5.3 Similar Days

Historically, short-term load forecasting is the most used type of forecasting since, traditionally in power systems, loads are planned and scheduled on a daily basis. Similar days was one of the first methods applied to load forecasting and continues to be used today, mainly as part of hybrid solutions with neural networks or support vector machines [89], [90] and [76]. In short-term load forecasting, the Similar days technique emerged from the high similarity of different daily patterns.
It is motivated by the fact that, for a particular load, on the same day of the week and with similar values of weather and other factors, the load will behave similarly. Also, on days with high similarity of feature sets, loads will be more similar. Feature similarity can be calculated in different ways; however, the two approaches for the same feature are:
• Equality of features; and
• Euclidean distance between features.
In practice, equality is used for fixed feature values, such as calendar information, and is a precondition in comparing the similarity of features using the Euclidean distance. That two-step approach can be expressed as follows. Firstly, the equality of all corresponding features and periods (days) is checked, using the indicator function:

$$I_{\left\{ \cap_i \left( X_{i,D} = X_{i,j} \right) \right\}}, \quad \forall j = 1, \dots, D - 1 \qquad (5.6)$$

where 𝑖 is the index of all features which have to be checked for the equality constraint; 𝐷 is the index of the forecasted period; and 𝑗 is the index of all periods which precede that forecasted period. 𝑋𝑖,𝐷 and 𝑋𝑖,𝑗 are descriptors of each feature belonging to a certain period. Amongst all periods (days) which satisfy the precondition, the algorithm searches for those which have the most similar features to the period (day) being forecasted:

$$\min_{i=1,\dots,D-1} \sqrt{ \sum_{j=1}^{n_f} \sum_{k=1}^{h} \left( X_{D,j,k} - X_{i,j,k} \right)^2 } \qquad (5.7)$$

where
𝑖 is the index of the period for which the distance to the forecasted period, indexed with 𝐷, is being minimized;
𝑛𝑓 is the number of all features used for the Euclidean distance calculation;
𝑗 is the index over those features;
ℎ is the number of data points in each period (day);
𝑘 is the index over those data points; and
𝑋 is the data point in each feature.
This process is repeated iteratively 5 times for each value of 𝐷, every time removing the data corresponding to 𝑥 from the learning dataset and remembering it as 𝑥𝑖, where 𝑖 is incremented by 1. In the end, it leaves, with indexes 𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5, the 5 most similar days of the original learning set.
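The two-step candidate selection above, an equality precondition per (5.6) followed by a Euclidean distance per (5.7), can be sketched as follows. This is illustrative only; the feature names and toy data are assumptions, not the thesis's actual feature set:

```python
import numpy as np

def similar_days(cal, num, d, n_best=5):
    """cal: (days, n_cal) fixed features (e.g. day-of-week) checked for
    equality per (5.6); num: (days, h) numeric features (e.g. hourly
    temperatures) compared by Euclidean distance per (5.7).
    Returns the indexes of the most similar previous days."""
    # Equality precondition over all fixed features:
    candidates = [j for j in range(d) if np.array_equal(cal[j], cal[d])]
    # Euclidean distance over the numeric features:
    dists = {j: float(np.sqrt(np.sum((num[d] - num[j]) ** 2)))
             for j in candidates}
    return sorted(dists, key=dists.get)[:n_best]

# Toy example: 15 past days, match on day-of-week, compare 24 temperatures.
rng = np.random.default_rng(2)
cal = np.arange(15)[:, None] % 7          # day-of-week as the fixed feature
num = rng.normal(20, 3, size=(15, 24))    # assumed hourly temperature profiles
best = similar_days(cal, num, d=14)       # only days 0 and 7 share the weekday
```

The sketch picks the best candidates in one pass; the thesis's procedure instead removes each selected day from the learning set and repeats 5 times, which yields the same set of indexes here.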
The forecast is obtained by taking, for each data point, the median value of the loads as follows:

$$\hat{Y}_D = \tilde{Y}_{x_1, \dots, x_5} \qquad (5.8)$$

with the average value over the best 5 days defined as:

$$\bar{Y}_{x,i} = \frac{1}{5} \sum_{k=1}^{5} Y_{x_k,i}$$

The median calculation, written formally per data point, is:

$$\hat{Y}_{D,i} = \bar{Y}_{x,i} + \min \mathbb{E}\left[ Y_{x_1,i} - \bar{Y}_{x,i},\; Y_{x_2,i} - \bar{Y}_{x,i},\; Y_{x_3,i} - \bar{Y}_{x,i},\; Y_{x_4,i} - \bar{Y}_{x,i},\; Y_{x_5,i} - \bar{Y}_{x,i} \right], \quad \forall i = 1, \dots, h \qquad (5.9)$$

where 𝔼 is the expectation. Figure 14, which includes all the important time-series, is an example of a load forecasting procedure using similar days. It shows the best candidate days for similarity, and the forecasted and actual loads, for a randomly chosen similar day forecast. In general, the Similar days approach can be extended to Similar periods. However, this is not applied in practice.

Figure 14: Three Best Candidates (Similar Days: SD1, SD2 and SD3), the Forecast as a Median SD and the Actual Load for a Random Day of Task 38

5.3.1 Application of Similar Days

In recent years, Similar days has often been applied in a hybrid with one or more machine learning methods. It is used for data extraction in order to reduce the number of data points, which has a performance impact on machine learning algorithm classes such as neural networks, SVMs and LS-SVMs. Mandal et al. [91] proposed electricity price forecasting using a neural network and the Similar days method. They extended Similar days to Similar price days and introduced weighting of the input data in the Euclidean distance calculation, which is the measure of similarity. Due to the high correlation of load and price, the model's input data is based on load. In another paper, Mandal et al. [89] proposed a hybrid of Similar days and a neural network to forecast both electricity price and load up to six hours ahead. Together with load, they used temperature as input data.
They compared only the month of February to the results of [92] and obtained slightly better results. Li et al. [90] used support vector machines to forecast the next day's load and then used the Similar days method to correct the results. There, Similar days looks to identify the day, amongst the previous days, with the lowest load variance relative to the forecasted load.

5.4 Neural Network

In recent years, the artificial neural network, or neural network, is the most widely used class of load forecasting algorithms. In [35], a query in Life and Physical Science returned more papers, in a single year, on "neural network" together with "load forecasting" than on "support vector"; ARIMA; "similar days"; or ARMA together with "load forecasting". For example, in 2011, there were 170 papers which included both terms "load forecasting" and "neural network", whilst there were only 58 papers which included both terms "load forecasting" and "support vector", which positioned "support vector" based algorithms in second place. Figure 15 illustrates the detailed results of my SCOPUS query.

Figure 15: Comparison of the Number of Load Forecasting Scientific Papers which Mentioned Different Forecasting Approaches

Neural networks are popular probably because of their simple and straightforward usage, in which the user does not need to go to complex depths in order to achieve a relatively good solution to the problem. High usage may not necessarily mean that neural networks are the best algorithm in overall performance. Therefore, what is a neural network?

5.4.1 Definition

The following is a definition of a neural network, viewed as an adaptive machine, from [93] and [94]: A neural network is a massively parallel, distributed processor which is made up of simple processing units.
It has a natural propensity to store experiential knowledge and make it available for use. It resembles the brain in two respects:
• The network acquires knowledge from its environment through a learning process; and
• Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.
The neural network's building unit is a neuron. It is a node in the network which houses either a function for calculation or values. The edges connecting the nodes are called weights. Figure 16 shows the difference between neurons and weights on an example of a neural network.

Figure 16: Weights are Connections between Neurons

5.4.2 History of Neural Network Development

The modern development of neural networks begins with McCulloch and Pitts' paper [95]. They developed a mathematical model of a neuron which followed the "all-or-none" rule. They showed that a large, connected network of such neurons can approximate any function. It is agreed that this paper was the start of the neural network and artificial intelligence fields. The development influenced von Neumann in building the Electronic Discrete Variable Automatic Computer or EDVAC, which succeeded the Electronic Numerical Integrator and Computer or ENIAC, the first electronic computer for general usage [96]. In his doctoral thesis on neural networks, Minsky showed interesting results. It was probably the first time that reinforcement learning, an important approach in multi-agent systems, was considered. In 1956, Rochester and a team of scientists ran a computer simulation based on Hebb's postulate of learning [97]. A year before he died, von Neumann solved the problem of designing a complex system with neurons as potentially unreliable components. He did it by introducing redundancy. After his death, his unfinished manuscripts on the differences between brains and computers were published in [98].
In 1958, Rosenblatt introduced the Perceptron [99] which, at the time, was a novel supervised learning algorithm. Later, he proved the perceptron convergence theorem [100], which is considered to be his greatest contribution to the field of neural networks [97]. In his Ph.D. thesis [101], Cowan gave a mathematical description of a neuron and introduced the sigmoid function as a function of a neuron state. Minsky and Papert's book [102] demonstrated the fundamental limits of single-layer perceptrons and stated that there was no reason to assume that multilayer perceptrons would overcome such limitations. Under that notion, the interest in the field of neural networks died out during the 1970s. Subsequently, although it was rediscovered a few times, Werbos, in his Ph.D. thesis [103], described a very important discovery. This was the backpropagation algorithm, a backward gradient procedure applied to neural networks. In 1982, Hopfield used the energy function to redesign recurrent networks with symmetric connections. He linked the model with the Ising model in statistical physics; for physicists, this paved a way into neural modelling. Those networks demonstrate that information can be stored in dynamically stable networks. Later, that type of neural network became known as Hopfield networks. In the same year, Kohonen proposed his self-organizing maps [104] and [105], an idea of a neural network approach which can be used for 2-dimensional visualization. In 1983, simulated annealing was proposed [106] and, based on it, a stochastic machine, known as the Boltzmann machine, was proposed [107]. The Boltzmann machine was the first multilayer network. It showed that Minsky and Papert's assumption on MLPs [102] was wrong and, consequently, boosted the confidence in and the usage of neural networks. In 1988, Broomhead and Lowe [108] proposed layered feed-forward networks using RBFs.
Even today, RBFs stand as a field separate from MLPs, which are the mainstream in neural network research and development. According to [97], Rumelhart and McClelland's book [109] had a strong influence on the resurgence and popularity of neural networks. The name, neural networks, was coined from the analogy of the mathematical model with the neurons in the human brain. As shown in Figure 17, a typical human neuron has three distinct parts. The nucleus is the core of each neuron.

Figure 17: Main Parts of a Human Neuron [110]

Dendrites are short synaptic links close to the nucleus. Electric signals from other neurons arrive at the dendrites. The axon is the long neck of the neuron, through which electric signals are sent to other neurons. A synapse is the connection between two neurons, e.g. between the axon or receptor of one neuron and a dendrite of the other. Figure 18 shows the mathematical model of a neuron as a logistic unit.

Figure 18: Neuron as an Activation Unit

5.4.3 Types of Neural Networks

Although dozens of neural network types exist, they can be classified into three fundamentally distinct groups: single-layer feed-forward networks; multilayer feed-forward networks; and recurrent networks [97]. Single-layer feed-forward networks are the simplest and were the first to emerge. They consist of an input layer and an output layer. The input layer is built out of source nodes into which the input data, e.g. a feature set of past loads, temperature and gross domestic product, are fed. The input layer is linked to the output layer, which consists of neurons. The output is calculated in the neurons via the logic which is set in the neural network. The link for the calculation between the source nodes and the output neurons is based on an activation function. Single-layer networks process information only in the direction from the input to the output layer, which is why they are referred to as acyclic.
Each connection between nodes and neurons, or between neurons themselves, is called a synapse. They are strictly feed-forward networks because the information is processed from the input to the output. Figure 19 depicts an example of a single-layer feed-forward network.

Figure 19: A Simple Single-layer Feed-forward Neural Network

Multilayer feed-forward networks are more complex and have one or more hidden layers between the input and output layers. In research and in practice, 99 % of multilayer feed-forward networks have one hidden layer, which is why, if it is not stated otherwise, it can be assumed that a multilayer feed-forward network is a three-layer network. Hidden layers consist of hidden neurons, which are also referred to as hidden units. Usually, in calculating the values in all layers, with the exception of the input layer, one more node, representing the previous layer, is added. That node plays the role of the bias weight and, therefore, it is called the bias unit [94]. Hidden neurons add complexity which, in return, enables more generalized estimation of functions and convergence. As in the case of single-layer feed-forward networks, neurons are connected with synapses which are represented by the activation function. A general description of a neural network topology can be expressed as a sequence of numbers, e.g. 32-9-24. That network topology is typical for an hourly, day-ahead load forecasting neural network. It consists of 32 features supplied to 9 hidden neurons, which help to shape the resultant function which returns 24 output values. Figure 20 shows a simpler neural network topology of 8-4-6.

Figure 20: Multilayer Feed-forward Network with 8-4-6 Topology

A fully connected neural network has all neurons connected to the neurons or nodes in the previous layer, and it is symmetric. A partially connected neural network is missing some synaptic connections.
Often, partially connected neural networks have specialized neurons with different purposes or activation functions. The back-propagation algorithm is very important for multilayer feed-forward networks. It is based on back and forth updating of the neuron weights in the network, which leads to convergence of the solution. Just as multilayer feed-forward networks are a generalization of single-layer networks, Recurrent Neural Networks (RNNs) are a generalization of multilayer feed-forward networks. RNNs are characterized by at least one feedback loop. A feedback loop can go to units of any layer. If it goes to units of the same layer, it is called a self-feedback loop. Natural nervous systems offer an analogy: feedback occurs in almost every part of every animal's nervous system [111], [94]. Recurrent neural networks do not have to have a hidden layer. Some feedback loops can have unit delay elements; recurrent neural networks with unit delay elements are known as Time Delay Neural Networks (TDNNs). Figure 21 depicts an example of a recurrent neural network.

Figure 21: Recurrent Neural Network with One Hidden Layer

5.4.4 Activation Function

A node's activation function is a transfer function which transforms the values coming from the input edges. It is synonymous with a functional mapping, so the same notation is used. Usually, the activation function is non-linear and differentiable everywhere (smooth). Also, it is uniform for the whole neural network and transfers values to the interval $[-1, 1]$. Most frequently, the transfer function is a sigmoid function, a group of functions which map an input signal to the interval $[0, 1]$ or $[-1, 1]$. For an incoming value $in$, the value after the activation, $\varphi = \varphi(in)$, is defined by the logistic function (4.1) as:

$$f(in) = \frac{1}{1 + \exp(-in)}$$

In the experiments, I use a variation which is commonly used for neural networks.
This is the hyperbolic tangent sigmoid transfer function (tan-sig), defined as [112]:

$$tansig(in) = \frac{2}{1 + \exp(-2\,in)} - 1$$

Its plot is shown in Figure 22.

Figure 22: Tan-sig Function Plot

Its main difference is that, instead of mapping results to the interval $[0, 1]$, which is better suited for classification, results are mapped to the interval $[-1, 1]$, for which algorithms for neural networks are usually tuned. An important characteristic of sigmoid functions is that their derivatives, which are related to the synaptic weights (as will be shown in the next subsection), change the most in the signal's midrange, which, consequently, stabilizes the neural network.

5.4.5 Back-propagation

Back-propagation is an important algorithm for neural networks, especially for MLPs, since it enables relatively fast iterative learning by converging to a locally optimal solution. In this subsection, I derive the back-propagation algorithm using the following notation: $i$, $j$ and $k$ are indexes of the ascending layers of neurons, e.g. $i$ is the first (input) layer, $j$ is the second (hidden) layer and $k$ is the third (output) layer; $in$ is the $in$-th data point arriving at the neuron; $v_j$ is the induced local field produced at the input of the activation function; $w_{ji}$ is the synaptic weight which connects the output of $i$ to the input of $j$ (we also use $w$ for the weights in other learning algorithms later on); $y_j$ is the output signal of neuron $j$. For the neural network in this example, although it is possible to make neural networks work in batches, e.g. $(\hat{Y}_{in-t_i}, \ldots, \hat{Y}_{in}) = (y_k(in - t_i), \ldots, y_k(in))$ with $t_i$ being the batch size minus 1, the outputs are stored directly as forecasted load values at the end of the algorithm procedure, $\hat{Y}_{in} = y_k$; $\mathfrak{S}$ is the sum of error squares of the neural network; $er_j$ is the error response of the $j$-th neuron; $d_j$ is the desired output of neuron $j$; and $\zeta$ is the learning rate.
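As a minimal sketch, both transfer functions can be written directly from their definitions; the equality of tan-sig and the built-in hyperbolic tangent is checked numerically:

```python
import math

def logistic(x):
    """Logistic sigmoid: maps x to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tansig(x):
    """Hyperbolic tangent sigmoid: maps x to the interval (-1, 1);
    algebraically identical to tanh(x)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

print(logistic(0.0))  # 0.5
print(tansig(0.0))    # 0.0
print(abs(tansig(1.3) - math.tanh(1.3)) < 1e-12)  # True
```

The check confirms that tan-sig is only a rescaled logistic function, shifted from $[0, 1]$ to $[-1, 1]$.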
The error response is defined as:

$$er_j = d_j - y_j \qquad (5.10)$$

From it, the sum of error squares is:

$$\mathfrak{S} = \frac{1}{2}\sum_j er_j^2 \qquad (5.11)$$

The induced local field at the input of the $j$-th neuron is:

$$v_j = \sum_{i=1}^{n} w_{ji}\, y_i \qquad (5.12)$$

The signal at the neuron output, $y_j$, is the (activation) function of the induced local field:

$$y_j = \varphi_j(v_j) \qquad (5.13)$$

According to the chain rule of calculus, the gradient of the sum of errors over the network's weights can be expressed as:

$$\frac{\partial \mathfrak{S}}{\partial w_{ji}} = \frac{\partial \mathfrak{S}}{\partial er_j}\frac{\partial er_j}{\partial y_j}\frac{\partial y_j}{\partial v_j}\frac{\partial v_j}{\partial w_{ji}} \qquad (5.14)$$

That derivative is also known as the sensitivity factor and it determines the direction of search in weight space for the synaptic weight $w_{ji}$. By differentiating both sides of (5.11) with respect to $er_j$, I obtain:

$$\frac{\partial \mathfrak{S}}{\partial er_j} = er_j \qquad (5.15)$$

By differentiating both sides of (5.10) with respect to $y_j$, I obtain:

$$\frac{\partial er_j}{\partial y_j} = -1 \qquad (5.16)$$

The differential of (5.13) with respect to $v_j$ is:

$$\frac{\partial y_j}{\partial v_j} = \varphi_j'(v_j) \qquad (5.17)$$

The differential of (5.12) with respect to $w_{ji}$ is:

$$\frac{\partial v_j}{\partial w_{ji}} = y_i \qquad (5.18)$$

Substituting equations (5.15) to (5.18) into (5.14) leads to:

$$\frac{\partial \mathfrak{S}}{\partial w_{ji}} = -er_j\, \varphi_j'(v_j)\, y_i \qquad (5.19)$$

By adjusting the weights by a small correction in each iteration, back-propagation is used to move closer to the solution:

$$\Delta w_{ji} = -\zeta \frac{\partial \mathfrak{S}}{\partial w_{ji}} \qquad (5.20)$$

The learning rate $\zeta$ is used to improve the speed of convergence, and the minus sign is used in order to minimize the sum of square errors $\mathfrak{S}$. Because the step is taken against the gradient, this technique is called gradient descent. Later, we will take a look at Levenberg-Marquardt, the well-known gradient-based technique which I applied in this work to train neural networks. From (5.19) and (5.20), it follows:

$$\Delta w_{ji} = \zeta\, \delta_j\, y_i \qquad (5.21)$$

where the local gradient $\delta_j$ is equal to:

$$\delta_j = er_j\, \varphi_j'(v_j) \qquad (5.22)$$

In order to update the weights, we have to calculate all $\Delta w_{ji}$ and, to do so, we need to calculate $er_j$ for every neuron.
In the last layer, the calculation of $er_j$ is straightforward because we know the output for which we are tuning the weights. For the previous layers, we calculate each neuron's share of the weight change with respect to the change in the output layer by propagating the error back to the beginning of the network (back-propagation). According to (5.14) and (5.22), we can write the local gradient of a hidden neuron $j$ as:

$$\delta_j = -\frac{\partial \mathfrak{S}}{\partial y_j}\frac{\partial y_j}{\partial v_j} = -\frac{\partial \mathfrak{S}}{\partial y_j}\varphi_j'(v_j) \qquad (5.23)$$

Taking from (5.11) the $k$-th layer, we obtain $\mathfrak{S} = \frac{1}{2}\sum_k er_k^2$. With respect to the functional signal $y_j$, the differentiation result is:

$$\frac{\partial \mathfrak{S}}{\partial y_j} = \sum_k er_k \frac{\partial er_k}{\partial y_j} \qquad (5.24)$$

By using the chain rule for the partial derivative of the error over the functional signal, I get:

$$\frac{\partial \mathfrak{S}}{\partial y_j} = \sum_k er_k \frac{\partial er_k}{\partial v_k}\frac{\partial v_k}{\partial y_j} \qquad (5.25)$$

Because $k$ is the index of the output layer:

$$\frac{\partial er_k}{\partial v_k} = -\varphi_k'(v_k) \qquad (5.26)$$

And the local field for $k$ is:

$$v_k = \sum_{j=1}^{n} w_{kj}\, y_j \qquad (5.27)$$

Differentiating equation (5.27) yields:

$$\frac{\partial v_k}{\partial y_j} = w_{kj} \qquad (5.28)$$

In order to obtain the desired partial derivative, we use (5.26) and (5.28) in (5.25):

$$\frac{\partial \mathfrak{S}}{\partial y_j} = -\sum_k er_k\, \varphi_k'(v_k)\, w_{kj} = -\sum_k \delta_k w_{kj} \qquad (5.29)$$

Using (5.29) in (5.23) gives the back-propagation equation for the local gradient $\delta_j$:

$$\delta_j = \varphi_j'(v_j) \sum_k \delta_k w_{kj} \qquad (5.30)$$

The back-propagation algorithm consists of two distinct parts, a forward and a backward pass. In the forward pass, the weights remain intact and only the neuron activations and signal values are computed in the neural network. After $y_k$ is computed, an error $er_k$ is computed and the reweighting of the network is done by going backwards, from the last to the first layer, and computing the local gradients $\delta$. For the last layer, we use (5.22) and, using (5.30), we calculate the preceding layers. Both forward and backward passes are executed on the same data point and, then, the neural network takes the next data point.

5.4.6 Layer Recurrent Neural Network

The layer recurrent neural network [66] is an RNN and, more specifically, a generalized version of the Elman network.
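The two passes can be sketched for a one-hidden-layer network with tanh activations (for which $\varphi'(v) = 1 - y^2$). The tiny training example, weights and learning rate below are illustrative choices, not values used in the thesis:

```python
import math

def backprop_step(x, d, W1, W2, zeta=0.1):
    """One forward/backward pass following (5.12)-(5.13), (5.22), (5.30)
    and the weight update delta_w = zeta * delta_j * y_i."""
    # forward pass: induced local fields and activations
    v1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    y1 = [math.tanh(v) for v in v1]
    v2 = [sum(w * yi for w, yi in zip(row, y1)) for row in W2]
    y2 = [math.tanh(v) for v in v2]
    # output-layer local gradients: delta_k = er_k * phi'(v_k)
    delta2 = [(dk - yk) * (1 - yk * yk) for dk, yk in zip(d, y2)]
    # hidden-layer local gradients: delta_j = phi'(v_j) * sum_k delta_k * w_kj
    delta1 = [(1 - yj * yj) * sum(dk * W2[k][j] for k, dk in enumerate(delta2))
              for j, yj in enumerate(y1)]
    # backward pass: weight updates
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += zeta * delta2[k] * y1[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += zeta * delta1[j] * x[i]
    return sum((dk - yk) ** 2 for dk, yk in zip(d, y2)) / 2

W1 = [[0.5, -0.2], [0.3, 0.8]]     # illustrative initial weights
W2 = [[0.1, -0.4]]
errors = [backprop_step([1.0, 0.5], [0.7], W1, W2) for _ in range(200)]
print(errors[0] > errors[-1])  # the error decreases over the iterations
```

Repeated passes over the same data point drive the sum of error squares toward a local minimum, exactly as the derivation predicts.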
The Elman network, named after Jeff Elman, is a recurrent neural network which has a special copy layer in which important feedback from the hidden layer is stored and used later. Saving the state enables the Elman network to predict sequences. The Jordan network is similar to the Elman network but, in that network, the copy (also named state) layer is fed from the output layer. Sometimes, these two types are referred to as simple recurrent networks. With the exception of the network's last layer, the LRNN's feedback loop has a single delay around each layer. A layer recurrent neural network can have an arbitrary number of layers and transfer functions in each layer. Figure 23 is an example of a layer recurrent neural network topology.

Figure 23: Simple LRNN Topology

There are various possibilities for the selection of the gradient-based algorithm in learning an LRNN; I selected Levenberg-Marquardt (damped least-squares) with the mean square error. Because I also chose Levenberg-Marquardt for the MLP implementations of the neural network forecasting algorithms, it is explained separately.

5.4.7 Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm, also known as damped least-squares, is an optimization algorithm used to minimize non-linear functions. It interpolates between gradient descent and the Gauss-Newton algorithm. It is somewhat slower than the Gauss-Newton algorithm but more robust, which is important in problems where initial conditions are unknown. The Levenberg-Marquardt algorithm's main disadvantage is that it finds only a local and not a global minimum. The algorithm works on the problem of least-squares curve fitting. In the case of neural networks, it is used to minimize the sum of least squares:

$$\min_\beta S(\beta) = \sum_{i=1}^{m} \left[\hat{Y}_i - f(\hat{X}_i, \beta)\right]^2 \qquad (5.31)$$

where $m$ is the number of data points in $\hat{Y}_i$ and $\hat{X}_i$. $\beta$ is initialized to $\beta^T = (1\ 1\ 1 \ldots 1)$ and, in each iteration of the algorithm, $\beta$ is replaced by $(\beta + \delta)$.
$\delta$ is determined from the linearized approximation:

$$f(\hat{X}_i, \beta + \delta) \approx f(\hat{X}_i, \beta) + \frac{\partial f(\hat{X}_i, \beta)}{\partial \beta}\delta \qquad (5.32)$$

where the term with $\delta$ involves the gradient, a row of the Jacobian matrix. At the minimum, the sum of squares gives the following approximation:

$$S(\beta + \delta) \approx \sum_{i=1}^{m}\left[\hat{Y}_i - f(\hat{X}_i, \beta) - \frac{\partial f(\hat{X}_i, \beta)}{\partial \beta}\delta\right]^2 \qquad (5.33)$$

This can be written more compactly in vector notation as:

$$S(\beta + \delta) \approx \left\| \hat{Y} - f(\hat{X}, \beta) - J\delta \right\|^2 \qquad (5.34)$$

where $J$ is the Jacobian matrix. Taking the derivative with respect to $\delta$ and setting the result to zero gives $J^T J \delta = J^T\left[\hat{Y} - f(\hat{X}, \beta)\right]$, which can be solved for $\delta$. Levenberg added a damping factor $\kappa$:

$$\left(J^T J + \kappa I\right)\delta = J^T\left[\hat{Y} - f(\hat{X}, \beta)\right] \qquad (5.35)$$

By tuning $\kappa$, the step can be moved closer to gradient descent or to the Gauss-Newton algorithm. The disadvantage of Levenberg's algorithm is that, for large values of $\kappa$, the curvature information in $J^T J$ is not used at all. Marquardt extended the algorithm by scaling the gradient change to the curvature, replacing the identity matrix $I$ with $\mathrm{diag}(J^T J)$ and thereby producing larger movement along the small-gradient components, which improved convergence. Finally, the algorithm works iteratively as:

$$\left[J^T J + \kappa\, \mathrm{diag}(J^T J)\right]\delta = J^T\left[\hat{Y} - f(\hat{X}, \beta)\right] \qquad (5.36)$$

5.4.8 Multilayer Perceptron

The multilayer perceptron is a well-known type of neural network and, nowadays, probably the most used learning algorithm. Sometimes, it is referred to as an NN with FF BP. It is based on Rosenblatt's perceptron and the back-propagation algorithm. An MLP can distinguish data which is not linearly separable [68]. As described previously, the implemented multilayer perceptron uses the back-propagation algorithm; tan-sig as the activation function; and Levenberg-Marquardt for the gradient-based optimization. Because the other parameters of the MLP are shared with the layer recurrent neural network, they are detailed together in subsection 5.7. Because of its importance, I give here some general characteristics and remarks on the back-propagation algorithm.
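The iteration (5.36) can be sketched as follows. The damping schedule (multiply or divide $\kappa$ by 10 depending on whether a step reduces the cost) and the toy exponential model are common illustrative choices, not the exact settings used in the thesis:

```python
import numpy as np

def levenberg_marquardt(f, jac, x, y, beta, kappa=1e-3, iters=50):
    """Minimal Levenberg-Marquardt loop following (5.36):
    [J^T J + kappa * diag(J^T J)] delta = J^T (y - f(x, beta))."""
    for _ in range(iters):
        r = y - f(x, beta)
        J = jac(x, beta)
        A = J.T @ J
        delta = np.linalg.solve(A + kappa * np.diag(np.diag(A)), J.T @ r)
        new_beta = beta + delta
        if np.sum((y - f(x, new_beta)) ** 2) < np.sum(r ** 2):
            beta, kappa = new_beta, kappa / 10  # accept: toward Gauss-Newton
        else:
            kappa *= 10                         # reject: toward gradient descent
    return beta

# toy curve fit: y = b0 * exp(b1 * x), true parameters (2.0, -0.5)
model = lambda x, b: b[0] * np.exp(b[1] * x)
jacobian = lambda x, b: np.column_stack([np.exp(b[1] * x),
                                         b[0] * x * np.exp(b[1] * x)])
xs = np.linspace(0, 4, 30)
ys = model(xs, np.array([2.0, -0.5]))
fit = levenberg_marquardt(model, jacobian, xs, ys, beta=np.array([1.0, 0.0]))
print(np.round(fit, 3))  # close to [2.0, -0.5]
```

The adaptive $\kappa$ is what makes the method robust to poor initial conditions: failed steps fall back toward small, safe gradient-descent moves.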
Validation is used on the learning set in order not to overfit the data and to obtain good generalization as a trade-off between bias and variance. An important feature of the multilayer perceptron is that it is a universal approximator. If, by $f(w, X)$, we denote a multilayer perceptron's function as a function of the neuron weights $w$ and input data $X$, we can write:

$$f(w, X) = \varphi\left(\sum_j w_{kj}\, \varphi\left(\ldots \varphi\left(\sum_i w_{li} X_i\right)\ldots\right)\right)$$

This nested set of non-linear functions is what makes the MLP a universal approximator. The multilayer perceptron's computational complexity is linear in $w$, which means that its complexity is $O(w)$. Because of that linearity, we can consider the MLP to be computationally efficient. The MLP's back-propagation can be considered stochastic because, in the weight space, it uses an instant estimate of the gradient of the error surface [94]. That stochasticity shows up as a zig-zag convergence which, for back-propagation, is slow. Theoretically, multilayer perceptrons could be used as universal computing machines. In order to do so, a way to measure how they behave at a higher scale is proposed in [102]. The higher scale measures the neural network's behaviour in terms of increasing size and complexity. Tesauro and Janssen showed that the runtime for neural networks to learn the parity function scales exponentially with the number of input data points. This suggests that hopes for the practical usage of MLPs with back-propagation as universal computing machines were too optimistic [94]. A well-known and important drawback of back-propagation and multilayer perceptrons is the problem of local minima, which arises because gradient descent can find only a locally optimal solution. As an answer to neural networks getting stuck in local minima and overfitting to the training set, a new algorithm emerged. That algorithm is explained in the next subsection.
5.5 Support Vector Regression

Support vector regression is the variant of support vector machines [113] for regression [9], [114], [115], [116]. SVMs date back to the 1960s and Vladimir Vapnik's statistical learning theory. Together with Alexey Chervonenkis, he developed the Vapnik-Chervonenkis theory, which is a branch of statistical learning theory for data with an unknown probability distribution. From the Vapnik-Chervonenkis theory, Vapnik, together with other scientists, developed support vector machines: first for classification, next for regression and, some time later, for clustering. Although Vapnik introduced linear classifiers and optimal separating hyperplanes in the 1960s, the theory was not developed substantially until the 1990s. In 1992, Vapnik and his colleagues Isabelle Guyon and Bernhard Boser from AT&T Bell Laboratories proposed non-linear classification by applying a kernel trick to the hyperplanes with the widest margin [117]. The kernel trick, introduced by Aizerman [118], is used to transform a linear classifier into a non-linear one by applying, via non-linear functions, a mapping to a higher-dimensional space. In the real world, the non-linear nature of many problems has opened the way to the successful application of support vector machines in both practice and research. Since Vapnik et al. [9] proposed support vector regression in 1996, it has found a place in numerous areas and replaced previously used methods.

5.5.1 Statistical Learning Theory

The main focus of statistical learning theory is the learning machine's capacity and performance. The relationship between different learning machines and their errors is incorporated in the following risk estimation. Each of the $n$ data points consists of an input variable $X_i$ and an output variable (label) $Y_i$, where $i = 1, 2, \ldots, n$. In the process of deterministic learning, it is possible to set up a machine which functionally maps $X_i \mapsto f(X_i, \alpha_{SLT})$ by tuning $\alpha_{SLT}$.
$\alpha_{SLT}$ is analogous to the fuzzy variables and rules in fuzzy logic and to the bias weights in neural networks. Following the assumption of statistical learning theory that there exists an unknown probability distribution $P(X, Y)$ with probability density function $p(X, Y)$, from which the data are independent and identically distributed, the expected test error or risk ($R_{gen}$) can be expressed as a function of $\alpha_{SLT}$:

$$R_{gen}(\alpha_{SLT}) = \frac{1}{2}\int_{X,Y} \left|Y - f(X, \alpha_{SLT})\right|\, p(X, Y)\, dX\, dY$$

In practice, it is impossible to determine the expected risk without knowing the probability density function and the probability distribution. Because of that, the methods calculate the empirical risk ($R_{emp}$) instead:

$$R_{emp}(\alpha_{SLT}) = \frac{1}{2n}\sum_{i=1}^{n} \left|Y_i - f(X_i, \alpha_{SLT})\right| \qquad (5.37)$$

The previously presented methods, such as ARMA, similar days and neural networks, are based on empirical risk minimization. $R_{emp}$ does not depend on the probability distribution but instead on $\alpha_{SLT}$, $X$ and $Y$. The term $|Y_i - f(X_i, \alpha_{SLT})|$ is the loss and it depends on the scale of $Y$. By introducing an arbitrary parameter $\mu$ such that $0 \le \mu \le 1$, Vapnik showed that, with probability $1 - \mu$, the following holds:

$$R_{gen}(\alpha_{SLT}) \le R_{emp}(\alpha_{SLT}) + \sqrt{\frac{h\left(\log\left(\frac{2n}{h}\right) + 1\right) - \log\left(\frac{\mu}{4}\right)}{n}} \qquad (5.38)$$

In (5.38), the integer $h$ is a quantitative measure of the learning machine named the Vapnik-Chervonenkis dimension. The possibility to calculate $h$ enables the user to calculate the whole term under the root, which enables the inequality's right-hand side to be calculated. The inequality's right-hand side is an upper risk bound. The goal of building a learning machine is to minimize the risk. Because it is impossible to calculate the risk and minimize it directly, it turned out that minimizing the upper bound of the risk returned better results than minimizing only the empirical risk itself. The process of risk minimization without knowledge of the probability distribution is the principle of structural risk minimization.
5.5.2 Vapnik-Chervonenkis Dimension

If, for a set of $n$ points and each of the $2^n$ ways of categorizing them, we can find a function which separates the points accordingly, that set of points is said to be shattered. Figure 24 depicts three points in a two-dimensional space $\mathbb{R}^2$, together with all combinations of their separation into two sets with one line. The Vapnik-Chervonenkis dimension $h$ of a set of functions $f(X, \alpha_{SLT})$ is defined by the highest number of learning points which can be shattered by members of the set of functions [119]. For the example in Figure 24, $h = 2 + 1 = 3$.

Figure 24: Shattering of Points in a Two-Dimensional Space

The separating functions in $\mathbb{R}^2$ form a set of directed lines. In an $n$-dimensional space $\mathbb{R}^n$, the separating functions would make a set of hyperplanes, and the Vapnik-Chervonenkis dimension of hyperplanes in $\mathbb{R}^n$ is equal to $n + 1$. The Vapnik-Chervonenkis dimension is the measure of the shattering capacity of a set of functions. Although learning machines with more parameters usually have a higher Vapnik-Chervonenkis dimension, there are examples of univariate functions with one parameter and infinite Vapnik-Chervonenkis dimension, such as the sine-based classifier $f(x, \omega) = \operatorname{sign}(\sin(\omega x))$. By setting the frequency $\omega$ high enough, the sine function can shatter an arbitrarily high number of points. The Vapnik-Chervonenkis bound is the term on the right side of (5.38). For every positive empirical risk, the Vapnik-Chervonenkis bound increases monotonically with the increase of $h$. The minimization of the Vapnik-Chervonenkis bound leads to the minimization of the right side of (5.38), which leads to the minimization of the real error. The Vapnik-Chervonenkis bound can be seen from (5.38) as an estimated difference between the real and the empirical risk.

5.5.3 Structural Risk Minimization

The goal of structural risk minimization is to determine the optimal model complexity for a given learning task.
In order to do so, following (5.38), a particular function from a set of functions has to be chosen. The goal is to find a subset of the chosen set of functions such that the risk bound for that subset is minimal. Because $h \in \mathbb{N}$, the risk bound changes discretely. In structural risk minimization [119], the set $Š$ of loss functions is structured of nested subsets in the following way: $Š_1 \subset Š_2 \subset \cdots \subset Š_i \subset \cdots$, where $h_1 < h_2 < \cdots < h_i < \cdots$, with $Š_i$ being the subsets and $h_i$ the Vapnik-Chervonenkis dimension of each subset. The goal of the method is to find, through the nested subsets, the model with the best ratio of performance and complexity and, therefore, avoid either underfitting or overfitting. Figure 25 shows a nested structural risk minimization subset structure and how it is used to obtain the optimal model in balancing the error and the learning machine's capacity to generalize. One of the possible ways to implement the structural risk minimization principle is to build a series of learning machines, one for each subset, where the goal in each subset is to minimize the empirical risk. From the series of learning machines, the one with the lowest sum of the Vapnik-Chervonenkis bound and the empirical risk is optimal and, therefore, is the solution of the learning task.

Figure 25: Nested Function Subsets with Associated Vapnik-Chervonenkis Dimensions

5.5.4 Support Vectors

Sometimes, data is not separable and, in general, the time needed for separation increases as the space between separable data decreases. One of the reasons for the introduction of optimally separating hyperplanes is to separate the data in $\mathbb{R}^n$. If the data is not separable in the original, primal space, it is sometimes possible to project the data to a higher-dimensional, dual space in which the same data is separable or, even more, linearly separable.
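The selection over nested subsets can be sketched numerically. The empirical risks below are made-up illustrative values for increasingly complex models, and $h$ plays the role of the Vapnik-Chervonenkis dimension in the confidence term of (5.38):

```python
import math

def vc_bound(h, n, mu=0.05):
    """Confidence term of (5.38): added to the empirical risk, it bounds
    the expected risk with probability 1 - mu."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(mu / 4)) / n)

# nested subsets S1 c S2 c ... with growing VC dimension h and
# (illustrative, made-up) shrinking empirical risks
n = 1000
subsets = [(5, 0.20), (20, 0.10), (80, 0.08), (320, 0.07)]
guaranteed = [(h, r_emp + vc_bound(h, n)) for h, r_emp in subsets]
best_h, best_risk = min(guaranteed, key=lambda t: t[1])
print(best_h, round(best_risk, 3))
```

With these numbers, the small drop in empirical risk of the more complex subsets does not pay for their sharply growing confidence term, so the simplest subset minimizes the guaranteed risk, which is exactly the trade-off structural risk minimization formalizes.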
Consider an example of the input data for our load forecasting task which has two features, $X_i = \{X_1, X_2\}$. In that case, our input space is two-dimensional. On a mission to create a rule-based forecasting system, for example, we prefer the data to be separable in order to distinguish the rules. If $X_i = \{X_1, X_2\}$ is not separable, it is possible to map the input space into a higher-dimensional space, e.g. the three-dimensional $X_i' = \{X_1, X_2, X_1 X_2\}$, where the data would be separable. The narrow space between the separating hyperplane and the data closest to it is called a margin. An optimal separating hyperplane is the one with the widest margin. The optimal separating hyperplane is the result of the maximization of the margin distance, $\max_c 2c$. Because the margin distance is equal to the reciprocal of the norm of the weight vector orthogonal to the hyperplane, $c = \frac{1}{\|w\|}$, $\|w\|$ is minimized instead of maximizing the margin. The $m$ vectors closest to the optimal separating hyperplane with the widest margin are called support vectors and are defined by:

$$Y_i\left(w^T X_i + b\right) = 1, \qquad i = 1, 2, \ldots, m$$

where, for the input data, output data and weight vector, $X, Y, w \in \mathbb{R}^n$ and $b \in \mathbb{R}$.

5.5.5 Regression in Linear Form

Support vector regression was proposed a few years after support vector machines for classification. Pontil et al. [120] have shown that, if a small enough $\varepsilon$ is chosen, the SVM for classification is a special case of SVR. In the case of regression, the margin is represented by the distance between the real data and its estimation as an error term. Usually, that error is represented using one of the following norms:

L1 norm: $|X - Y|$

L2 norm: $|X - Y|^2$

Vapnik's linear $\varepsilon$-insensitive loss function:

$$V(\xi) = \begin{cases} 0, & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon, & \text{else} \end{cases}$$

Huber's loss function:

$$V(X - Y) = \begin{cases} \dfrac{(X - Y)^2}{2}, & \text{if } |X - Y| < \xi \\ \xi|X - Y| - \dfrac{\xi^2}{2}, & \text{else} \end{cases}$$

Huber's loss function is similar to the most often used error norm, the L2 norm, also known as the square error. Huber's loss function is used in robust regression.
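The two loss functions central to robust and support vector regression can be written directly from their definitions above (the default $\varepsilon$ and $\xi$ values are illustrative):

```python
def eps_insensitive(residual, eps=0.25):
    """Vapnik's linear eps-insensitive loss: zero inside the eps-tube,
    linear outside it."""
    return max(0.0, abs(residual) - eps)

def huber(residual, xi=1.0):
    """Huber's loss: quadratic for small residuals, linear beyond xi."""
    if abs(residual) < xi:
        return residual ** 2 / 2
    return xi * abs(residual) - xi ** 2 / 2

print(eps_insensitive(0.1))   # 0.0 (inside the tube)
print(eps_insensitive(0.75))  # 0.5
print(huber(0.5))             # 0.125
print(huber(3.0))             # 2.5
```

Both losses grow only linearly for large residuals, which is what makes them less sensitive to outliers than the plain square error.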
It is a generalization of the L2 norm which becomes linear beyond $\xi$, where $\xi$ represents the slack variables introduced in [121], and $\varepsilon$ is the error sensitivity measure. For SVR, the linear model is similar to the definition of support vectors:

$$Y = f(X) = w^T X + b \qquad (5.39)$$

First, we search for $w$ by minimizing, with respect to $w$, the objective function:

$$\min_w \sum_{i=1}^{n} V\left(Y_i - f(X_i)\right) + \frac{\lambda}{2}\|w\|^2 \qquad (5.40)$$

where:

$$V(š) = \begin{cases} 0, & \text{if } |š| \le \varepsilon \\ |š| - \varepsilon, & \text{else} \end{cases} \qquad (5.41)$$

$\lambda$ in (5.40) is a regularization constant which can be obtained by cross-validation. Usually, in the tuning of support vector regression models, the cost $C$ is optimized. The cost is the reciprocal of the regularization constant, $C = \frac{1}{\lambda}$. $\varepsilon$ in (5.41) is the error sensitivity measure; I use $\varepsilon = 10^{-3}$ in this model setting. The assumption behind the constraint (5.41) is that a function $f(X)$ which approximates the pairs $(X_i, Y_i)$ with the precision of $\varepsilon$ exists. If $f(X)$ exists, the convex optimization problem is feasible. At the same time, support vector regression minimizes the empirical risk $R_{emp}$ and $\|w\|^2$. By introducing the slack variables $(\xi_i, \xi_i^*)$, it is possible to manage infeasible constraints of the optimization problem. The formulation including $(\xi_i, \xi_i^*)$ is the one stated in [119]:

$$\min \left[ C\sum_{i=1}^{n}(\xi_i + \xi_i^*) + \frac{1}{2}\|w\|^2 \right] \qquad (5.42)$$

subject to:

$$\begin{cases} Y_i - w^T X_i - b \le \varepsilon + \xi_i \\ w^T X_i + b - Y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases} \qquad (5.43)$$

In Figure 26, support vectors are used in the regression of a function estimation. The data is represented by dots; outliers are represented by squared dots, whilst support vectors are represented by circled dots. The $\varepsilon$-insensitive linear loss function is represented by the $\varepsilon$-tube, which, in Figure 26, is the area within a distance of $\varepsilon$ from both sides of the margin. If the estimated value is inside the $\varepsilon$-tube, the error is zero and, for all values outside the $\varepsilon$-tube, the loss grows with the distance from the $\varepsilon$-tube. The cost $C$ has a regularization role.
With higher values of $C$, the support vector regression moves closer to its hard margin, theoretically reaching it at $C = \infty$. For higher $C$, $m$ increases; computation time grows; and model complexity increases. Typically, the input data can be pre-processed so that the optimal value of $C$ for a given task is not high. $C$ is a positive number, usually between 1 and $10^4$. It is part of the cost function that is optimized, because lower values of $C$ give better generalization and runtime with a potential increase of error.

Figure 26: Linear Support Vector Machines with ε-loss Function

5.5.6 Lagrange Multipliers

In support vector regression, the optimization problem is based on (5.40) and (5.41). We introduce Lagrange multipliers, in the form of the dual variables $(\alpha_i, \alpha_i^*)$, as a means of transforming the equation system into an unconstrained form in order to optimize the calculation procedure. Lagrange multipliers, or dual variables, are scalar variables used to calculate function extremes subject to constraints in mathematical optimization [122]. Lagrange multipliers are employed widely in different areas of mathematical modelling because they enable us to express a constrained system in the form of an unconstrained system and solve it using, e.g., a gradient descent method. In the case of support vector regression, the Lagrange multipliers $(\alpha_i, \alpha_i^*)$ correspond to pairs of dots outside the $\varepsilon$-tube, where the two dots of a pair lie on different sides of the $\varepsilon$-tube. They are introduced for each parameter and they build a system of multipliers with parameters. Consider the optimization problem:

$$\max_{g_2(X, Y) = n_k} g_1(X, Y) \qquad (5.44)$$

where $X, Y \in \mathbb{R}^n$; the functions $g_1, g_2 \in \mathcal{F}$; and $n_k \in \mathbb{R}$. In order to solve (5.44), a Lagrange multiplier $\alpha$ is introduced. The Lagrange function is defined by:

$$\Lambda(X, Y, \alpha) = g_1(X, Y) + \alpha\left[g_2(X, Y) - n_k\right]$$

where the $\alpha$ term can be added or subtracted.
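As an illustrative substitute for solving the quadratic program (5.42)-(5.43), the primal objective with the $\varepsilon$-insensitive loss can be minimized by sub-gradient descent; the sketch below handles a one-feature linear model with illustrative parameter values, and it is not the solver used in the thesis:

```python
def svr_subgradient(data, eps=0.1, C=10.0, lr=0.01, epochs=500):
    """Minimize C * sum of eps-insensitive losses + ||w||^2 / 2 for the
    one-feature linear model f(x) = w*x + b by sub-gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        gw, gb = w, 0.0                  # gradient of the ||w||^2 / 2 term
        for x, y in data:
            r = y - (w * x + b)
            if r > eps:                  # point below the tube: raise f(x)
                gw -= C * x
                gb -= C
            elif r < -eps:               # point above the tube: lower f(x)
                gw += C * x
                gb += C
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# noiseless line y = 2x + 1; the fit should end up with every
# residual inside or close to the eps-tube
data = [(x / 10.0, 2 * x / 10.0 + 1.0) for x in range(20)]
w, b = svr_subgradient(data)
worst = max(abs(y - (w * x + b)) for x, y in data)
print(worst < 0.3)
```

Points strictly inside the tube contribute no loss gradient, which mirrors the defining property of the $\varepsilon$-insensitive formulation: only points on or outside the tube shape the solution.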
If $g_1(X, Y)$ has a maximum for the constrained problem, there exists an $\alpha$ such that the partial derivatives of the Lagrange function $\Lambda(X, Y, \alpha)$, with respect to $X$, $Y$ and $\alpha$, at the point $(X, Y, \alpha)$, are equal to zero:

$$\max g_1(X, Y) \Rightarrow \exists \alpha \ \text{s.t.}\ \frac{\partial \Lambda(X, Y, \alpha)}{\partial X} = 0, \quad \frac{\partial \Lambda(X, Y, \alpha)}{\partial Y} = 0, \quad \frac{\partial \Lambda(X, Y, \alpha)}{\partial \alpha} = 0$$

Not all points $(X, Y, \alpha)$ in which the partial derivatives of $\Lambda(X, Y, \alpha)$ vanish are solutions of the original constrained problem. Because of that, Lagrange multipliers are a necessary condition for optimality in constrained problems [122], [123]. In addition, the Lagrange function has a saddle point with respect to the primal and dual variables at the point of the solution [124].

5.5.7 Optimization problem

If we introduce the Lagrange multipliers $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$, we can write (5.42) and (5.43) as a Lagrange function:

$$\Lambda(X, Y, \eta, \alpha) = C\sum_{i=1}^{n}(\xi_i + \xi_i^*) - \sum_{i=1}^{n}(\eta_i \xi_i + \eta_i^* \xi_i^*) + \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n}\alpha_i\left(\varepsilon + \xi_i - Y_i + w^T X_i + b\right) - \sum_{i=1}^{n}\alpha_i^*\left(\varepsilon + \xi_i^* + Y_i - w^T X_i - b\right) \qquad (5.45)$$

given:

$$\eta_i, \eta_i^*, \alpha_i, \alpha_i^* \ge 0 \qquad (5.46)$$

The partial derivatives of $\Lambda$ with respect to the primal variables $(w, b, \xi_i, \xi_i^*)$ vanish at optimality:

$$\frac{\partial \Lambda}{\partial b} = \sum_{i=1}^{n}(\alpha_i^* - \alpha_i) = 0 \qquad (5.47)$$

$$\frac{\partial \Lambda}{\partial w} = w - \sum_{i=1}^{n} X_i(\alpha_i - \alpha_i^*) = 0 \qquad (5.48)$$

$$\frac{\partial \Lambda}{\partial \xi_i^{(*)}} = C - \alpha_i^{(*)} - \eta_i^{(*)} = 0 \qquad (5.49)$$

Substituting (5.47), (5.48) and (5.49) into (5.45), the optimization problem defined by (5.40) and (5.41), written in the form of Lagrange multipliers, turns into:

$$\min_{(\alpha_i, \alpha_i^*)}\ \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^*) - \sum_{i=1}^{n} Y_i(\alpha_i - \alpha_i^*) + \frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i - \alpha_i^*)\left(\alpha_j - \alpha_j^*\right) X_i^T X_j \qquad (5.50)$$

given the constraints:

$$0 \le \alpha_i, \alpha_i^* \le C, \qquad i = 1, \ldots, n \qquad (5.51)$$

$$\sum_{i=1}^{n}\alpha_i = \sum_{i=1}^{n}\alpha_i^* \qquad (5.52)$$

The dual variables $\eta_i, \eta_i^*$ are eliminated from (5.50) through (5.49). The Lagrange multipliers $\alpha_i, \alpha_i^*$ are obtained by solving (5.50) given the constraints (5.51) and (5.52).
The linear model of the regression function is given in (5.39) and its main components can be expressed as:

$$w = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*) X_i \qquad (5.53)$$

$$b = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - X_i^T w\right) \qquad (5.54)$$

Support vectors are obtained by finding the solution which satisfies the Karush-Kuhn-Tucker (KKT) conditions. Based on the KKT conditions, $b$ from (5.39) can also be computed later.

5.5.8 Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker conditions are an approach to nonlinear programming which is a generalization of Lagrange multipliers to inequality constraints. The result was obtained independently by W. Karush in 1939, by F. John in 1948, and by H.W. Kuhn and J.W. Tucker in 1951 [125]. The KKT conditions are stationarity, primal feasibility, dual feasibility and the complementary slackness condition. The Karush-Kuhn-Tucker conditions state that the product between the dual variables and the constraints vanishes at the point of the solution, which, for our case, can be expressed as:

$$\alpha_i\left(\varepsilon + \xi_i - Y_i + w^T X_i + b\right) = 0$$

$$\alpha_i^*\left(\varepsilon + \xi_i^* + Y_i - w^T X_i - b\right) = 0$$

$$(C - \alpha_i)\xi_i = 0$$

$$(C - \alpha_i^*)\xi_i^* = 0$$

Taking these into consideration, it follows that only the samples $(X_i, Y_i)$ for which $\alpha_i = C$ or $\alpha_i^* = C$ lie outside the $\varepsilon$-insensitive tube. Because $\alpha_i \alpha_i^* = 0$, at least one of the dual variables in each pair is always equal to zero. Based on this, the dual problems become easier to solve, since it is enough to find only half of the dual variables, knowing that the other half is equal to zero. Support vectors are the examples $X_i$ which come with non-vanishing coefficients. After the Lagrange multipliers are found, the optimal weight vector $w$ and the optimal bias $b$ can be found by using (5.53) and (5.54). The support vector machines' true power was not unleashed until their non-linear form, based on so-called kernels, was introduced. So, let us take a look at what kernels are.
5.5.9 Kernels

Because, in the real world, most data is non-linear, there is a strong motivation to find algorithms which exhibit non-linear behaviour and have high performance on non-linear data. Often, linearity is only a special case of non-linearity. The idea of adjusting the algorithm so that it can work with non-linear data is the motivation for the introduction of kernels. The easy way of introducing non-linearity is to simply map the data by a function $\Phi: \mathcal{X} \mapsto \mathcal{F}$, $X \in \mathcal{X}$, which maps $X$ to some other feature space $\mathcal{F}$ [118]. The mapping can be introduced in the form of a dot product between input data instead of defining $\Phi$ explicitly. This is because, for an algorithm based on support vectors, it is sufficient to know the dot products between the input data $X_i$. The dot product between input data can be defined by means of a kernel function, or kernel, as:

$$k\left(X_i, X_j\right) = K\left(X_i, X_j\right) = \left\langle \phi(X_i), \phi\left(X_j\right) \right\rangle \qquad (5.55)$$

In the form of non-linear regression, kernels can be used to map a problem into a richer space with more dimensions, where problematic non-linear features would be highlighted and would, eventually, become linearly separable. From the optimization perspective, it is a continued search for the optimal space for linear separation. Another advantage of kernels arises in cases when even a simple mapping leads to a space with an infinite or very large number of dimensions and the inner products $\psi(X_i)^T \psi(X_j)$, $i, j = 1, \ldots, n$, have to be calculated to solve the optimization problem. Because it is a space of large dimensionality, it takes a lot of resources to compute the solution, and the calculation can lead to a dimensionality explosion. By expressing the products in that space through a kernel, the computation is reduced and the problem can be solved easily [126]. By introducing kernels, some infinite-dimensional problems can be solved effectively. Nowadays, this makes kernels an important tool in machine learning.
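The kernel trick can be verified on a small example: for the homogeneous second-degree polynomial kernel on $\mathbb{R}^2$, the kernel value equals the dot product under the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. This particular kernel and map are a textbook illustration, not one used in the thesis:

```python
import math

def phi(x):
    """Explicit feature map for the homogeneous 2nd-degree polynomial
    kernel on R^2: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever forming phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # dot product in feature space
rhs = poly_kernel(x, z)                           # kernel in input space
print(abs(lhs - rhs) < 1e-9)  # True: the two computations agree
```

The kernel evaluates the feature-space dot product in two dimensions instead of three; for higher degrees and dimensions the saving grows rapidly, which is the whole point of (5.55).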
The pairwise kernel function evaluations can be arranged into a dot product matrix known as the kernel matrix or Gram matrix [126]:

$$K = \begin{pmatrix} K(1,1) & \cdots & K(1,n) \\ \vdots & \ddots & \vdots \\ K(n,1) & \cdots & K(n,n) \end{pmatrix} \tag{5.56}$$

The kernel matrix is the fundamental structure of kernel machines because it contains all the information necessary for the learning algorithm. Two important properties of the kernel matrix are:

• The kernel (Gram) matrix is symmetric and positive semi-definite;
• For every symmetric positive semi-definite matrix there exists a feature space for which it is a dot product (kernel) matrix.

These two properties are known as Mercer's theorem [127]. Mercer's theorem: suppose $K \in L_\infty(\mathcal{X}^2)$ is such that the integral operator $T_K: L_2(\mathcal{X}) \rightarrow L_2(\mathcal{X})$,

$$T_K f(\cdot) := \int_{\mathcal{X}} K(\cdot, X) f(X)\, d\mu(X) \tag{5.57}$$

is positive, with $\mu$ a positive measure on $\mathcal{X}$ and $\mu(\mathcal{X})$ finite. Let $\Psi_J \in L_2(\mathcal{X})$ be the eigenfunction of $T_K$ associated with the eigenvalue $\lambda_J \neq 0$, normalized such that $\|\Psi_J\|_{L_2} = 1$, and let $\overline{\Psi}_J$ denote its complex conjugate. Then:

• $(\lambda_J(T))_J \in \ell_1$;
• $\Psi_J \in L_\infty(\mathcal{X})$ and $\sup_J \|\Psi_J\|_{L_\infty} < \infty$;
• $K(X_i, X_j) = \sum_{J \in \mathbb{N}} \lambda_J \Psi_J(X_i)\overline{\Psi}_J(X_j)$ holds for almost all $(X_i, X_j)$, where the series converges absolutely and uniformly for almost all $(X_i, X_j)$.

This can be written briefly as: if

$$\int_{\mathcal{X} \times \mathcal{X}} K(X_i, X_j) f(X_i) f(X_j)\, dX_i\, dX_j \geq 0, \quad \forall f \in L_2(\mathcal{X}) \tag{5.58}$$

holds, then $K(X_i, X_j)$ can be written as a dot product in some feature space.

The properties of kernels follow from Mercer's theorem. The most important properties, for kernels $K_i$ and $K_j$, are:

• $K_i + K_j$ is a kernel;
• $c K_i$, $c > 0$, is a kernel;
• $c_1 K_i + c_2 K_j$, $c_1, c_2 > 0$, is a kernel.

Due to the linearity of integrals, this follows directly from (5.31). Based on these properties, we can build new kernels from existing ones. A kernel's validity can be checked by inspecting the Gram matrix: if it is sparse outside the main diagonal, many points in the feature space are mutually orthogonal. Nowadays, there exists a large variety of kernels.
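The closure properties above can be verified numerically. The sketch below (illustrative data, not from the thesis) builds linear and RBF Gram matrices and checks that each, as well as a positive combination of them, is symmetric with non-negative eigenvalues, i.e. a valid Gram matrix in the sense of Mercer's theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))

K_lin = X @ X.T                                      # linear kernel Gram matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared pairwise distances
K_rbf = np.exp(-0.5 * sq)                            # RBF kernel, gamma = 0.5

for K in (K_lin, K_rbf, 2.0 * K_lin + 3.0 * K_rbf):
    assert np.allclose(K, K.T)                       # symmetric
    # All eigenvalues non-negative (up to numerical noise) => PSD.
    assert np.linalg.eigvalsh(K).min() > -1e-8
```

Checking the eigenvalue spectrum of the Gram matrix in this way is a practical validity test for any candidate kernel on a given dataset.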
These include: linear; polynomial; RBF (which is similar to a Gaussian kernel); Fourier; heat; pair hidden Markov model; $\chi^2$; Epanechnikov; neural; and sequence kernels. Many kernel types are described further in [128]. The fields of kernel design and multiple kernel learning work on improving kernels and on finding new kernels for a wide variety of applications. I primarily use the RBF kernel since, nowadays, it is the standard in machine learning and because, empirically, it returned better results. It is a generalization of the linear kernel in the sense that a circle becomes a line as its radius approaches infinity. With the exception of marginal cases with a very high cost parameter $C$, the RBF kernel returns results which are equal to or better than those obtained with the linear kernel. The RBF kernel is preferable to the neural kernel since it reaches the same result more efficiently. My previous experiments on kernel performance, in which I compared the RBF kernel with the linear and polynomial kernels, confirmed findings that, for time-series, RBF has the best performance. Additionally, due to its lower number of tuning parameters, it is simpler than the polynomial kernel. There exist other kernels, such as the Epanechnikov kernel, which could be applied to tasks such as load forecasting in SVR. However, because they are not positive semi-definite, there have so far been no reports that they work stably in classical support vector regression. The RBF kernel is defined as:

$$K(X_i, X_j) = \exp\left(-\frac{\|X_i - X_j\|^2}{2\sigma^2}\right) = \exp\left(-\gamma\|X_i - X_j\|^2\right), \quad \gamma, \sigma^2 > 0 \tag{5.59}$$

where $\sigma^2$ (equivalently $\gamma$) is called the parameter of the RBF kernel.
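A minimal implementation of (5.59) follows, with the parametrization $\gamma = 1/(2\sigma^2)$; the inputs are illustrative:

```python
import numpy as np

def rbf_kernel(Xi, Xj, gamma):
    # RBF kernel (5.59): exp(-gamma * ||Xi - Xj||^2), gamma > 0.
    return np.exp(-gamma * np.sum((Xi - Xj) ** 2))

xi, xj = np.array([0.0, 0.0]), np.array([3.0, 4.0])   # Euclidean distance 5
assert np.isclose(rbf_kernel(xi, xi, gamma=0.1), 1.0)  # k(x, x) = 1 always
assert np.isclose(rbf_kernel(xi, xj, gamma=0.1), np.exp(-2.5))
```

Note that $K(X_i, X_i) = 1$ for any $\gamma$, so the Gram matrix of an RBF kernel always has a unit diagonal, and $\gamma$ controls how quickly similarity decays with distance.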
5.5.10 ε-SVR

By introducing the kernel function into the definition of the support vector optimization problem for regression, that problem becomes:

$$\min_{\alpha_i,\,\alpha_i^*} \ \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^*) - \sum_{i=1}^{n} Y_i(\alpha_i - \alpha_i^*) + \frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(\mathbf{X}_i, \mathbf{X}_j) \tag{5.60}$$

with respect to $(\alpha_i, \alpha_i^*)$ and constrained by (5.51) and (5.52). By expanding the Lagrange function by means of the conditions and solving the resulting system of equations in the same way, the regression function for the non-linear case becomes:

$$f(\mathbf{X}) = \mathbf{w}^T\phi(\mathbf{X}) + b = \sum_{i \in SV}(\alpha_i - \alpha_i^*)K(\mathbf{X}_i, \mathbf{X}) + b \tag{5.61}$$

Figure 27 shows a diagram of a non-linear SVM.

[Figure 27: Non-linear SVM after the Introduction of an Appropriate Kernel — the regression curve $Y = f(X)$ with its $\varepsilon$-tube, support vectors on the tube boundary, and outliers outside it]

Then, we can obtain the optimal bias from it. A drawback of $\varepsilon$-SVR is that $\varepsilon$ is either left uncalibrated or determined beforehand. Because $\varepsilon$ depends on the problem, a new version of support vector regression was proposed which tunes $\varepsilon$ within the optimization process.

5.5.11 ν-SVR

$\nu$-SVR is the type of support vector regression in which the tuning of $\varepsilon$ is controlled by a parameter $\nu \in [0,1]$, where $\nu$ bounds the fraction of points marginal to the $\varepsilon$-tube [129]. Schölkopf et al. [130] estimated the function in (5.39) allowing an error of $\varepsilon$ at each $X_i$. Slack variables $(\xi_i, \xi_i^*)$ capture everything with an error greater than $\varepsilon$. Via a constant $\nu$, the tube size $\varepsilon$ is chosen as a trade-off between model complexity and slack variables, as an expansion of (5.42):

$$\min \left\{ C\left[\nu\varepsilon + \frac{1}{n}\sum_{i=1}^{n}(\xi_i + \xi_i^*)\right] + \frac{1}{2}\|\mathbf{w}\|^2 \right\} \tag{5.62}$$

subject to (5.43) and $\varepsilon \geq 0$. By introducing Lagrange multipliers $\eta_i, \eta_i^*, \alpha_i, \alpha_i^* \geq 0$, a Wolfe dual problem is obtained which leads to the $\nu$-SVR optimization problem. For $\nu \geq 0$, $C \geq 0$:

$$\max \left[ \sum_{i=1}^{n} Y_i(\alpha_i - \alpha_i^*) - \frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(\mathbf{X}_i, \mathbf{X}_j) \right] \tag{5.63}$$

subject to:

$$\sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \leq \alpha_i, \alpha_i^* \leq \frac{C}{n}, \qquad \sum_{i=1}^{n}(\alpha_i + \alpha_i^*) \leq C\nu.$$

The regression estimate takes the form of (5.61).
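The practical difference between the two formulations can be sketched with scikit-learn (illustrative data and parameters, not the thesis experiments): in `SVR` the tube width $\varepsilon$ is fixed a priori, while in `NuSVR` it is found during optimization, with $\nu$ acting as a lower bound on the fraction of support vectors:

```python
import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.default_rng(2)
X = np.linspace(0, 6, 150)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.1, 150)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)  # epsilon chosen beforehand
nusvr = NuSVR(kernel="rbf", C=10.0, nu=0.3).fit(X, y)   # epsilon tuned via nu

# nu lower-bounds the fraction of support vectors (up to finite-sample slack).
frac_sv = len(nusvr.support_) / len(X)
assert frac_sv >= 0.25

# Both variants fit this smooth series well on the training data.
assert np.mean((svr.predict(X) - y) ** 2) < 0.05
```

When a sensible $\varepsilon$ is unknown for a new load series, the $\nu$ parametrization is the more convenient handle, since a fraction of tolerated errors is easier to reason about than an absolute tube width.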
A good property of support vector regression is that it returns a sparse solution; its drawback is that it requires solving a quadratic programming problem. Parameter optimization is performed using the well-known simplex algorithm.

5.6 Least Squares Support Vector Machines

J.A.K. Suykens proposed least squares support vector machines in [131] and later, together with other scientists, summarized them alongside related results in [132]. LS-SVM can be seen as an improved version of SVM. The main idea is to simplify SVM and make it more efficient by replacing the inequality constraints with equality constraints and by using the quadratic loss function instead of the $\varepsilon$-loss function. The introduction of equality constraints results in a set of linear equations instead of a quadratic programming problem, which can be solved more efficiently. LS-SVM also gives us a choice: sometimes it is possible and more efficient to solve the problem in the dual and sometimes it is more convenient to solve it in the primal space. Compared to SVM, LS-SVM's disadvantage is that all points are support vectors, so sparsity cannot be exploited. In LS-SVM, our goal is to optimize an objective function similar to the one in (5.40). If $\|\mathbf{w}\|$ is finite, the optimization problem of finding $\mathbf{w}$ and $b$ can be solved as:

$$\min_{\mathbf{w}, b, er} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \gamma\frac{1}{2}\sum_{k=1}^{n} er_k^2 \tag{5.64}$$

such that $Y = \mathbf{w}^T\varphi(\mathbf{X}) + b + er$, where $er$ is a vector of i.i.d. random errors with zero mean and finite variance [133]. Similar to the case of SVM, the dual formulation of LS-SVM can be obtained by differentiating the Lagrangian, in a manner analogous to (5.45):

$$\Lambda(\mathbf{w}, b, er; \alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \gamma\frac{1}{2}\sum_{k=1}^{n} er_k^2 - \sum_{k=1}^{n}\alpha_k\left(\mathbf{w}^T\varphi(\mathbf{X}_k) + b + er_k - Y_k\right) \tag{5.65}$$

where $\alpha$ are the Lagrange multipliers, in this case called support values.
The conditions for optimality are given by:

$$\frac{\partial\Lambda}{\partial\mathbf{w}} = 0 \ \rightarrow\ \mathbf{w} = \sum_{k=1}^{n}\alpha_k\varphi(\mathbf{X}_k)$$
$$\frac{\partial\Lambda}{\partial b} = 0 \ \rightarrow\ \sum_{k=1}^{n}\alpha_k = 0$$
$$\frac{\partial\Lambda}{\partial er_k} = 0 \ \rightarrow\ \alpha_k = \gamma\, er_k$$
$$\frac{\partial\Lambda}{\partial\alpha_k} = 0 \ \rightarrow\ \mathbf{w}^T\varphi(\mathbf{X}_k) + b + er_k = Y_k$$

After the elimination of $\mathbf{w}$ and $er$, the result is given by the following linear system in the dual variables $\alpha$ [132], [133]:

$$\begin{pmatrix} 0 & \mathbf{1}_n^T \\ \mathbf{1}_n & \Omega + \frac{I_n}{\gamma} \end{pmatrix}\begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ Y \end{pmatrix} \tag{5.66}$$

with $\mathbf{1}_n = (1, 1, \ldots, 1)^T$ and $\Omega = \varphi(\mathbf{X})^T\varphi(\mathbf{X})$ according to Mercer's theorem [127]. LS-SVM for regression is then:

$$f(\mathbf{X}) = \sum_{k=1}^{n}\hat{\alpha}_k K(\mathbf{X}, \mathbf{X}_k) + \hat{b}, \tag{5.67}$$

where $\hat{\alpha}$ and $\hat{b}$ are the solution to (5.66):

$$\hat{b} = \frac{\mathbf{1}_n^T\left(\Omega + \frac{I_n}{\gamma}\right)^{-1}Y}{\mathbf{1}_n^T\left(\Omega + \frac{I_n}{\gamma}\right)^{-1}\mathbf{1}_n}, \qquad \hat{\alpha} = \left(\Omega + \frac{I_n}{\gamma}\right)^{-1}\left(Y - \mathbf{1}_n\hat{b}\right)$$

For regression, and similar to other learning algorithms, LS-SVM uses the learning and testing parts of the datasets. For this work, learning is done as 10-fold cross-validation (CV), in which the learning set consists of training and validation parts, both chosen by stratified sampling. In the case of the LS-SVM algorithms, parameter tuning is done using the recent coupled simulated annealing optimization technique.

5.6.1 Robust LS-SVM

In a statistical sense, Robust LS-SVM is robust to outliers, which makes it useful for problems in which outliers are present. Its drawback is that the advantage of outlier handling comes with higher model complexity and longer runtime. It is a type of weighted LS-SVM, which Suykens et al. proposed in [134]; in this work, I use the version from [135]. I use the Myriad reweighting scheme because it was shown in [135] that, among four candidate reweighting schemes, it returns the best results. If we denote the weights defined in (5.69) by $v_k$, the model can be written as:

$$\min_{\mathbf{w}, b, \tilde{er}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \gamma\frac{1}{2}\sum_{k=1}^{n} v_k\,\tilde{er}_k^2 \tag{5.68}$$

such that $Y = \mathbf{w}^T\varphi(\mathbf{X}) + b + \tilde{er}$. As in the previous cases, the solution (5.66) can be obtained by using Lagrange multipliers and the KKT conditions; the exception here is that $\frac{I_n}{\gamma}$ becomes $\mathrm{diag}\left(\frac{1}{\gamma v_1}, \ldots, \frac{1}{\gamma v_n}\right)$.
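The linear system (5.66) can be solved directly, which is the computational appeal of LS-SVM over quadratic programming. The following sketch (synthetic data; the regularization constant, RBF width, and the variable names `gamma_reg`, `gamma_rbf` are illustrative assumptions) builds the bordered system with an RBF Gram matrix and solves it with numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.linspace(0, 5, 60)[:, None]
Y = np.sin(X).ravel() + rng.normal(0, 0.05, 60)

gamma_reg, gamma_rbf, n = 100.0, 1.0, len(X)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-gamma_rbf * sq)            # kernel (Gram) matrix

# Assemble [[0, 1^T], [1, Omega + I/gamma]] [b; alpha] = [0; Y]  -- eq. (5.66)
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = Omega + np.eye(n) / gamma_reg
rhs = np.concatenate(([0.0], Y))
sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]

Y_hat = Omega @ alpha + b                  # LS-SVM regression estimate (5.67)
assert np.mean((Y_hat - Y) ** 2) < 1e-2    # fits the smooth series closely
assert abs(alpha.sum()) < 1e-6             # optimality condition: sum(alpha) = 0
```

Note that every $\alpha_k$ is generally non-zero here, illustrating the loss of sparsity mentioned above.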
The weights are defined as:

$$v_k = \begin{cases} 1, & |\tilde{er}_k/\hat{s}| \leq c_1 \\[4pt] \dfrac{c_2 - |\tilde{er}_k/\hat{s}|}{c_2 - c_1}, & c_1 \leq |\tilde{er}_k/\hat{s}| \leq c_2 \\[4pt] 10^{-8}, & \text{otherwise} \end{cases} \tag{5.69}$$

where $c_1 = 2.5$, $c_2 = 3.0$, and $\hat{s} = 1.483\,\mathrm{MAD}(\tilde{er}_k)$ is, in statistical terms, a robust estimate of the standard deviation. The resulting model is:

$$z_{Rob}(\mathbf{X}) = \sum_{k=1}^{n}\hat{\alpha}_k K(\mathbf{X}, \mathbf{X}_k) + \hat{b}. \tag{5.70}$$

The model is reweighted iteratively until the distance between the Lagrange multipliers in two consecutive runs, $\alpha^{[i-1]}$ and $\alpha^{[i]}$, drops below $10^{-4}$, that is, until $\max_k\left|\alpha_k^{[i-1]} - \alpha_k^{[i]}\right| \leq 10^{-4}$.

5.6.2 Cross-validation

As mentioned earlier in this thesis, cross-validation is based on averaging the error on the learning set by dividing the data into training and validation parts. It is an important part of many learning-based algorithms as a means to avoid overfitting; in this thesis, it is employed in all the learning-based algorithms. 10-fold CV can be defined as:

$$CV_{10\text{-}fold}(\theta) = \sum_{v=1}^{10}\frac{b_v}{n}\cdot\frac{1}{b_v}\sum_{k=1}^{b_v}\left(Y_k - \hat{Y}_k(\mathbf{X}_k, \theta)\right)^2,$$

where the number of data points in a fold is $b_v \in [\lfloor n/10 \rfloor, \lfloor n/10 \rfloor + 1]$ for every $v$. In practice, selecting 3 or 5 folds did not give much worse results. However, to be on the safe side, I use 10 folds, as this number is used most frequently. Figure 28 shows the relationship between the load forecasting error and the number of CV folds on one of the Tasks.

[Figure 28: Load Forecasting using LS-SVM and Dependence of MAPE on the Number of CV Folds — MAPE between roughly 3.4 % and 4.6 % for 3 to 20 folds]

5.7 Common for all algorithms

All neural network forecasting algorithms use the stopping criterion $\varepsilon_{st} = 10^{-5}$, met up to 6 consecutive times, or a maximum of 500 epochs, whichever comes first. Here, $\varepsilon_{st}$ is calculated as $\mathfrak{S}$ averaged over the epochs to check convergence. For a neural network algorithm, an epoch is one presentation of all the data points in the training set. Learning data is shuffled randomly and chosen as 80 % of the learning set.
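A short sketch of the reweighting scheme (5.69) with the robust scale estimate $\hat{s} = 1.483\,\mathrm{MAD}$ follows (toy residuals, not thesis data): weights stay at 1 for well-behaved residuals, shrink linearly between $c_1$ and $c_2$ robust standard deviations, and nearly vanish beyond $c_2$:

```python
import numpy as np

def weights(er, c1=2.5, c2=3.0):
    # Robust scale: 1.483 * median absolute deviation of the residuals.
    s_hat = 1.483 * np.median(np.abs(er - np.median(er)))
    r = np.abs(er / s_hat)
    # Piecewise weights of (5.69): 1, linear taper, then near-zero.
    return np.where(r <= c1, 1.0,
                    np.where(r <= c2, (c2 - r) / (c2 - c1), 1e-8))

er = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 8.0])  # last residual is an outlier
v = weights(er)
assert np.isclose(v[-1], 1e-8)   # the outlier is almost fully down-weighted
assert np.all(v[:-1] == 1.0)     # well-behaved residuals keep full weight
```

Because the scale is estimated with the MAD rather than the sample standard deviation, a single gross outlier barely inflates $\hat{s}$, so it cannot mask itself.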
The learning set consists of the training and validation sets. The validation set is used to validate the performance of the neural network model, which is then used to forecast the values of the data points in the test set. For all neural networks used for forecasting, I empirically tested the number of neurons in a hidden layer, in a range between 4 and 16, and obtained 9 as the best trade-off between forecasting error and runtime. In order to check for errors in the data, I checked the algorithms' predictions against a random walk by using $|\hat{Y}_i - \hat{Y}_{i,RW}| > 6\sigma_d$, where $\sigma_d$ is the standard deviation of the difference between the point forecasts $\hat{Y}$ of a particular algorithm and those of the RW, $\hat{Y}_{RW}$. Since a deviation greater than six standard deviations is very unlikely, such data is double-checked manually for errors.

"Energy and persistence conquer all things" — Benjamin Franklin 12

6 Collecting the Data

Good data is an important part of successful time-series forecasting. In order to investigate the performance of the proposed meta-learning approach, a reasonable amount of quality data was collected. Unlike 10 years ago, there is nowadays an abundance of publicly available data for load forecasting. According to a survey [18], load forecasting is one of the most frequently encountered types of time-series forecasting in scientific publications. Load forecasting's main characteristic is that the load is periodic. On lower aggregation levels of load, e.g. at the level of a single point of delivery (similar to a metering point), load can generally be seen as a time-series which may be non-periodic; chaotic; linear; or trended. In order to build the meta-learning system, heterogeneous Tasks are created based on available data from [38], [136] and [137]. 24 time-series of different hourly loads in Europe are taken, averaging between 1 and 10,000 MW. Figure 29 gives an example of one such load.
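The six-sigma sanity check can be sketched as follows (entirely synthetic forecasts, for illustration only): an injected data error makes one forecast deviate from the random-walk forecast by far more than $6\sigma_d$ and is flagged for manual inspection:

```python
import numpy as np

rng = np.random.default_rng(4)
y_hat = rng.normal(100, 1, 500)          # an algorithm's point forecasts (synthetic)
y_rw = y_hat + rng.normal(0, 0.5, 500)   # random-walk forecasts (synthetic)
y_hat[42] += 50.0                        # inject a gross data error

diff = y_hat - y_rw
sigma_d = diff.std()                     # std of the forecast differences
flagged = np.flatnonzero(np.abs(diff) > 6 * sigma_d)
assert 42 in flagged                     # the injected error is caught
assert len(flagged) == 1                 # and nothing else is flagged
```

The check is deliberately loose: at six standard deviations almost no legitimate forecast is flagged, so every hit is worth a manual look.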
For all of these loads, missing values are estimated and, for 21 of them, outliers are removed. No parts of the time-series, such as anomalous days; special events; or data which might be flawed, were cut out. Besides calendar information, time-series of exogenous data are created for the majority of these loads. For the days of daylight-saving switching, where applicable, the data is transformed so that all days have the same hourly length; this is done by removing the 3rd hour or by adding the average of the 2nd and 3rd hours. The procedure used is more rigorous than those typically found in scientific publications, but it is similar to the procedures typically used in load forecasting practice. Whilst a univariate time-series is a set of values of a single quantity over time, a multivariate time-series refers to values of several quantities which change over time. In the load forecasting context, a Task which consists of a load; calendar information; and temperature is a multivariate Task. Where available, temperature is used as an exogenous feature for loads, due to its high correlation with the load and good empirical results in industry and research. Additionally, for one load, the following exogenous weather time-series are used: wind chill; dew point; humidity; pressure; visibility; wind direction; wind speed; and the weather condition factor. Past values of the weather time-series are used for forecasting. Based on the availability of exogenous information for each Task, the 24 load time-series are separated into univariate and multivariate Tasks with hourly granularity.

12 Benjamin Franklin (17 January 1706 – 17 April 1790) was one of the Founding Fathers of the United States of America. He was a leading author; politician; scientist; and musician. As a scientist, he is well known for his experiments and theories about electricity, which made him a major figure in the American Enlightenment and the history of physics.
For the multivariate Tasks, new Tasks are created by aggregating the time-series to daily and monthly granularity.

[Figure 29: Hourly Load in Duration of One Year with a Stable and Frequent Periodic and Seasonal Pattern found often in Loads above 500 MW]

The size of the learning set is: 1,000 data points for hourly granularity; 365 for daily; and 24 for monthly. For each granularity, the test set size and forecasting horizon are: 36 for hourly; 32 for daily; and 13 for monthly. Because forecasts are simulated in advance, part of the load forecast is discarded. For hourly loads, which are assumed to be used for the creation of daily schedules, the forecasts are made at 12:00 of the load time-series and the first 12 hours are discarded. For data of daily and monthly granularity, the first point forecast is discarded since the load value belonging to the moment in which the forecast is made is unknown. For data of hourly granularity, 36 values are forecast in each iteration and the first 12 are discarded, leading to 24-hour forecasts. Similarly, for daily granularity, 32 values are forecast and the first one is discarded; for monthly granularity, 13 are forecast and the first value is discarded. Feature sets, for all of the Tasks, are created based on the calendar information. These feature sets consist of different dummy variables for each granularity and feature selection. In the end, a total of 39 calendar features is encoded for hourly granularity; 13 for daily granularity; and 14 for monthly granularity. Because holidays differ globally, all the holiday features are coded uniquely for each Task. Badly performing feature combinations, such as the four-season year and working-day holidays, are not implemented in the meta-learning system. Up to 25 combinations of lags of the load are added to the feature set in order to improve load forecasting performance.
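The forecast-discarding convention for hourly Tasks can be sketched in a few lines (the hour labels are hypothetical, for illustration): a 36-hour-ahead forecast issued at 12:00 is trimmed by its first 12 hours so that exactly one clean next-day forecast remains:

```python
import numpy as np

horizon, discard = 36, 12
# Hypothetical hour-of-day labels for a forecast issued at 12:00: hours 13..48.
forecast_hours = np.arange(13, 13 + horizon) % 24
next_day = forecast_hours[discard:]

assert len(next_day) == 24      # exactly one day of forecasts remains
assert next_day[0] == 1         # the kept window starts at 01:00 of the next day
```

The same trimming logic, with a discard of one point, covers the daily and monthly Tasks.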
In order to obtain a default feature set combination based on calendar information and lags of the load time-series, extensive empirical testing was conducted. With this approach, a total of 69 Tasks is created, of which 14 are univariate and 55 are multivariate. The average length of the load time-series is 10,478 data points, and the feature sets of the Tasks contain between 4 and 45 features. 65 Tasks are used to build the meta-learning system and 4 Tasks are left for the experimental part. Section 8 provides more details about the experiments.

"Life is and will ever remain an equation incapable of solution, but it contains certain known factors." — Nikola Tesla 13

7 Meta-features

This section sets out an approach for the definition and selection of time-series meta-features for electric load problems, using the ReliefF algorithm for classification. Meta-features are the representation of metadata in features: they are the data at the meta-level. Meta-features are as important in meta-learning as features are in machine learning or data mining generally. They characterize the underlying data and can make a substantial difference in terms of performance. In order to develop meta-features, the following issues have to be taken into account [56]:

• Discriminative power;
• Computational complexity; and
• Dimensionality.

Meta-features have to be created and selected so as to distinguish between the performances of the forecasting algorithms. The computational complexity of meta-features has to be low enough to keep the runtime of their calculation at least an order of magnitude below the runtime of the forecasting algorithms; otherwise, it is questionable whether it is reasonable to use meta-learning at all. According to [54], the computational complexity of meta-features ought to be at most $O(n \log n)$.
In terms of the number of meta-features, the dimensionality ought not to be much higher than the number of Tasks; otherwise it may lead to overfitting. In the next subsection, meta-features are separated based on their types.

13 Nikola Tesla (10 July 1856 – 7 January 1943) was a Croatian scientist and inventor whose ideas and patents made an important contribution to the field of electrical engineering. He is one of the pioneers of power systems engineering through his contributions to alternating current (AC) power systems. He invented the radio, the AC induction motor and the transformer.

7.1 Types of Meta-features

Meta-features can be separated into the following three types, based on the way in which they are created [56]:

• Statistical and information-theoretic;
• Model-based; and
• Landmarkers.

Figure 30 illustrates the three types of meta-features in terms of how they use the data and the forecasting algorithm.

[Figure 30: Illustration of the Difference between Types of Meta-features based on the Data and the Models]

Statistical and information-theoretic meta-features are by far the broadest and also the simplest group. This group of meta-features is estimated from the datasets; examples are: the number of features; the average values of a feature; the relationship between the sample autocorrelation functions of the features; or the median skewness of the features [46], [61], [62], [138–142]. In the previously described projects Statlog and METAL, this type of meta-features was shown to produce positive and tangible results [56]. Model-based meta-features are created using the underlying forecasting algorithms. They are based on the properties of an induced hypothesis [143], [144]: the model's parameters are used as meta-features. Usually, model-based meta-features are applied to one type of forecasting algorithm in order to find the optimal parameter values for a given task.
For a certain domain, or for a group of datasets, they are more specialized, serving to find the optimal model or the best combination of parameters. Landmarking is a specialization in terms of meta-features, much as boosting is a specialization in terms of ensembles. In order to quickly estimate the performance of an algorithm class on a certain dataset, it uses simple and fast learners which are specialized in different ways. Instead of representing a sample of the data, landmarkers represent the performance of an algorithm on the data and, in that respect, they go one step further. For example, by using the k Nearest Neighbour algorithm (kNN), a ranking of algorithms based on their performance was generated for each dataset [145]; the recommended ranking can then be obtained by aggregating those rankings. Some works using landmarking have been published [146–148], but more research is still needed to see how it compares to the other two types of meta-features [56]. Sometimes, it is more important to design meta-features so that they can be used in a certain domain, since each domain is, in its own regard, distinct. This holds even at the higher level of algorithm types: some meta-features are used frequently for classification, yet the majority of them are not usable for regression. Meta-features differ between classification; regression; clustering; and anomaly detection. In the next subsection, we go through the definitions of the meta-features, some of which are made specifically for problems in which the load is represented as a time-series. Although these meta-features can be used in each of the aforementioned four areas of application, here they are used for regression.

7.2 Definition of meta-features

In order to obtain an optimal subset of meta-features, the results of the previous meta-learning research mentioned in the previous subsection were used as guidance.
Based on this, we decided to use statistical and information-theoretic characterization as the approach for defining meta-features for load problems. This is because load is a time-series and, for time-series, there have not been many attempts to define meta-features. The advantage of this type of meta-features is that it is not tied to a certain algorithm class, as model-based meta-features are, and it has a lower runtime and more possibilities than regression-based landmarking. The initial list of meta-features was created from two segments:

• Simple and most frequently used; and
• Meaningful from the perspective of domain knowledge about electric load problems (all four) and about time-series (the last three).

We finish with the following meta-features. First segment: Minimum; Maximum; Mean; Standard deviation; Skewness; Kurtosis; Length; Exogenous; Periodicity; and Trend. Second segment: Granularity; Highest ACF; Traversity; and Fickleness. Whilst the first segment is the same as, or similar to, some of the research cited in the previous subsection on the creation of meta-features, the meta-features in the second segment are our own creation. Before going on to the selection process, each meta-feature is defined and described.

7.2.1 Minimum

Minimum ($min$) is simply the minimum value of the load before normalization. It is never normalized because the normalized minimum, at the same scale of normalization, would always be the same and, when standardization is used, it would be of similar value. The minimum can reveal important differences because loads with values equal to zero and loads with values less than zero (electricity production; errors in the data) behave differently and can, therefore, be forecast and estimated better if the algorithm which is optimal for frequently encountered errors is used.

7.2.2 Maximum

Maximum ($max$) can be used normalized or without normalization.
It is used with standardization because Tasks have loads saved on different scales (e.g. kWh and MW). Maximum is the maximum value of the actual load which is being forecast.

7.2.3 Mean

Mean is the arithmetic average of the load. For a load time-series $Y$, the mean $\bar{Y}$ is calculated as $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.

7.2.4 Standard deviation

Standard deviation $\sigma_d$ is calculated as $\sigma_d = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$.

7.2.5 Skewness

Skewness $skew$ is the measure of the lack of symmetry, calculated as:

$$skew = \frac{1}{n\sigma^3}\sum_{i=1}^{n}(Y_i - \bar{Y})^3$$

7.2.6 Kurtosis

Kurtosis is the measure of flatness relative to a normal distribution:

$$kurt = \frac{1}{n\sigma^4}\sum_{i=1}^{n}(Y_i - \bar{Y})^4$$

7.2.7 Length

Length is the number of data points in the time-series, $n$.

7.2.8 Granularity

Putting granularity into the subset of candidate meta-features was straightforward because, as mentioned in section 1, it is known that different forecasting algorithms perform better for different types of load forecasting and, as noted earlier, the type of load forecasting is tied to a certain granularity. Granularity $gran$ is the distance in time between two data points in a time-series $Y$.

7.2.9 Exogenous

Exogenous $n_f$ is the number of exogenous features used to explain the load. In this regard, the load itself and calendar information are not counted as features. If exogenous equals zero, the load is seen as a univariate time-series; if it is greater than zero, it is a multivariate time-series.

7.2.10 Trend

Trend, $tr$, is the slope coefficient of the linear regression of $Y$.

7.2.11 Periodicity

Periodicity of a time-series, $per$, is the smallest number of data points which repeats in a time-series. It is the index of the highest autocorrelation function (ACF) lag after at least one local minimum of the autocorrelation function. If its difference to the ACF's global minimum is not greater than 0.2, or if it is not found, $per$ is 0.
For load time-series of hourly granularity, $per$ is frequently 168 and, for the majority of load time-series of monthly granularity, $per$ is 12.

7.2.12 Highest ACF

Highest ACF, $hACF$, is the value of the autocorrelation function at the periodicity lag.

7.2.13 Traversity

Traversity ($trav$) is the standard deviation of the difference between the time-series $Y$ and $Y_{per}$, where $Y_{per}$ is $Y$ lagged by $per$ data points.

7.2.14 Fickleness

Fickleness ($fic$) is the ratio of the number of times a time-series reverts across its mean to its length $n$:

$$fic = \frac{1}{n}\sum_{i=2}^{n-1} I_{\{\mathrm{sgn}(Y_i - \bar{Y}) \neq \mathrm{sgn}(Y_{i-1} - \bar{Y})\}},$$

where $I_{\{\cdot\}}$ denotes the indicator function.

Of the defined meta-features, some are more useful than others. In order to determine which meta-features contribute to the metadata and can be used to improve the solution of the given load problem, selection of meta-features is used.

7.3 Selection of meta-features

Selection of meta-features is simply a type of feature selection, so we can reuse the feature selection component described in an earlier subsection. There are numerous approaches to feature selection. In order to determine which meta-features are most appropriate for problems involving electric load, the ReliefF algorithm for classification was chosen. The ReliefF algorithm [149] is Kononenko's extension of the Relief algorithm for feature selection [150]. The Relief algorithm evaluates features based on how well their values distinguish between instances which are near each other; it does so by searching for the two closest instances, called the nearest hit and the nearest miss. The idea is that good features agree between nearby instances of the same class and differ between nearby instances of different classes. In a few versions, Kononenko extended the Relief algorithm to work easily with multiclass; noisy; and incomplete datasets. ReliefF is the version of the Relief algorithm which is capable of doing this.
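Two of the bespoke meta-features above can be sketched directly from their definitions (toy periodic load series; the helper names are my own, not the thesis code):

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(24 * 60)
# Synthetic "hourly load": daily sine pattern plus noise, per = 24.
Y = 100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)

def fickleness(Y):
    # Fraction of consecutive points where the series crosses its mean.
    s = np.sign(Y - Y.mean())
    return np.count_nonzero(s[1:] != s[:-1]) / len(Y)

def traversity(Y, per):
    # Std of the difference between Y and Y lagged by per points.
    return np.std(Y[per:] - Y[:-per])

fic = fickleness(Y)
trav = traversity(Y, per=24)
assert 0.0 < fic < 1.0
assert trav < np.std(Y)  # the 24-point lag explains most of the variation
```

For this strongly periodic series, traversity is far below the raw standard deviation, exactly the kind of contrast these meta-features are meant to expose between Tasks.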
The main difference from the original Relief is that it uses kNN to do so. After empirical testing, I found four to be the best number of nearest neighbours for the algorithm and, therefore, selected it as the parameter value. Based on the ranking of the algorithms and meta-features for the 65 Tasks, I applied ReliefF for classification, with 4 nearest neighbours, to the 14 candidate meta-features described earlier. Maximum was by far the worst in all combinations of $k$ and was the only meta-feature with a negative weight. It was discarded from the final set of meta-features because the threshold was set such that the ReliefF weight has to be greater than zero. Figure 31 shows, by the colour and size of the bars, the relationship between the meta-features' weights. Based on ReliefF, Highest ACF; fickleness; and granularity are the most important meta-features. Highest ACF relates to the autocorrelation of the time-series, and it is known that some algorithms are more suitable for auto-correlated data. Some algorithms tend to work better with data which is more chaotic and exhibits more fickleness. In load forecasting practice, it is established that the data's granularity affects model selection. In order to visualize the meta-features, multidimensional scaling (MDS) in two dimensions is used. Kruskal optimized the stress function with gradient descent, which enabled the shrinkage of the data into a lower-dimensional space [151]. His normalized stress1 criterion, often used for multidimensional scaling, is the stress normalized by the sum of squares of the distances between the data points of the given dataset. Figure 32 shows non-metric MDS, in a 2-dimensional space using Kruskal's normalized stress1 criterion, of the 13 meta-features for the 65 Tasks used to build the meta-learning system. Some Tasks are outliers while the majority is concentrated densely, which is characteristic of the data points of real-world problems.
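The visualization step can be sketched with scikit-learn's non-metric MDS, which minimizes a Kruskal-type stress (synthetic 65 × 13 meta-feature matrix standing in for the Tasks; not the thesis data):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(6)
meta = rng.normal(size=(65, 13))   # 65 Tasks x 13 meta-features (synthetic)

# metric=False selects non-metric MDS, which optimizes a Kruskal-style stress.
mds = MDS(n_components=2, metric=False, random_state=0, n_init=1)
coords = mds.fit_transform(meta)
assert coords.shape == (65, 2)     # one 2-D point per Task, ready to plot
```

Plotting `coords` would give a picture analogous to Figure 32: dense clusters of typical Tasks with a few outlying Tasks at the fringes.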
[Figure 31: The ReliefF Weights show that Highest ACF; Fickleness; and Granularity are the Important Meta-features — weights up to about 0.16 for the meta-features min, mean, std, skew, kurt, per, len, ACF, trav, trend, fic, gran and exog]

Here, the difference is that Tasks which are outliers cannot be omitted like single data points, because the goal is the best performance on each Task.

[Figure 32: MDS of Tasks and the Meta-features]

The MDS of the Tasks used to build the meta-learning system and their meta-features shows that some Tasks are outliers. Finally, in the next subsection, the meta-learning system is employed [63].

"However beautiful the strategy, you should occasionally look at the results." — Winston Churchill 14

8 Experimental results

In this section, we go through the main findings of this doctoral research. Firstly, I discuss the findings of earlier research, which concentrated mainly on short-term load forecasting. One of the first goals was to examine an inconsistency regarding the seasonal factor in load forecasting: does seasonality improve the results or not? To do so, two different approaches were tested. The first approach defines seasons as the calendar seasons, e.g. summer is from 21 June to 23 September. The second approach defines seasons based on the change in temperature, e.g. summer is when the correlation of hourly temperatures, as a daily moving average over 7 days, increases relative to the leading time-series (for spring and autumn, the correlation decreases and, for winter, it increases in absolute value; however, it is negative). In order to make this straightforward, I used, as a feature, the difference between the load and the temperature; Figure 33 illustrates this difference on an example. The seasonality features reduced the forecasting performance and, therefore, they were not used. Based on the empirical findings, we created the first models for short-term load forecasting using SVR and windowing.
One of those models was compared to Similar days, and its results in terms of MAPE were better [152]. The research in STLF continued, and SVR was applied with different codings of days in the features. A method for creating synthetic datasets was proposed [153]. On those datasets, it was investigated how exogenous features affect the STLF of a simulated supplier load.

14 Sir Winston Leonard Spencer-Churchill (30 November 1874 – 24 January 1965) was a British politician, best known for his leadership of the United Kingdom during the Second World War.

Figure 33: Difference of Load and Temperature is shown and Winter and Summer Periods can be distinguished. Summer Period lasts from 15 April to 15 October

Using the proposed five-layered forecasting procedure, shown in Figure 34, it was demonstrated that the number of metering points reduces the forecasting error while the number of customers increases it [153].

Figure 34: Overview of the Five Layers of the Forecasting Procedure Used in [153] (input data: load forecasting data and exogenous inputs; grid parameter optimization: optimize γ and C, update learning and training data; 10-fold cross-validation: shuffle sampling; support vector regression; apply model: calculate performance)

Segmentation of data by clustering, followed by classification on the characteristics of the load, has been used to some extent in load forecasting, mostly as preprocessing. Related to it is [154], an approach to customer classification using decision trees from the WhiBo15 toolbox [155]. It showed better performance than well-known classification algorithms. Together with that, internal and external cluster model evaluation was implemented. Another important branch of my research was segmentation, but this time segmentation of time-series. This segmentation may be particularly interesting for handling large amounts of time-series, such as online measurements in the smart grid.
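The middle layers of the five-layer procedure (grid optimization of γ and C wrapped around 10-fold cross-validation with shuffle sampling) can be sketched generically. The SVR trainer is left as a pluggable function, and all names here are illustrative, not taken from the thesis code.

```python
import random
from itertools import product
from statistics import mean

def shuffled_kfold(n, k=10, seed=0):
    """Shuffle-sample indices 0..n-1 into k (train, validation) splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [set(idx[i::k]) for i in range(k)]
    return [(sorted(set(idx) - f), sorted(f)) for f in folds]

def grid_search(evaluate, gammas, cs, splits):
    """Return the (gamma, C) pair with the lowest mean validation error.

    `evaluate(train_idx, val_idx, gamma, C)` should train the regressor
    (e.g. SVR) on the training indices and return its validation error.
    """
    best_pair, best_err = None, float("inf")
    for g, c in product(gammas, cs):
        err = mean(evaluate(tr, va, g, c) for tr, va in splits)
        if err < best_err:
            best_pair, best_err = (g, c), err
    return best_pair, best_err

# Demo on a hypothetical error surface whose minimum is at gamma=0.5, C=8:
splits = shuffled_kfold(20, k=10)
best, err = grid_search(lambda tr, va, g, c: (g - 0.5) ** 2 + (c - 8) ** 2,
                        [0.1, 0.5, 1.0], [2, 8, 32], splits)
```

In the real procedure, `evaluate` would fit an SVR with the given γ and C on the training indices and score it on the held-out fold.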
An approach for segmenting time-series irrespective of their noise level, based on a compact representation of piecewise-linear functions using hinging hyperplanes, was proposed in [156]. The proposed expression enables regression, owing to which an approach using LS-SVM with lasso and a hinging feature map is proposed, together with an online version. The experimental results show better performance and advantages of the proposed approach over the competing approaches to time-series segmentation.

It is noteworthy that a part of this research was dedicated to electricity market design and trading agent design in the smart grid environment. Forecasting is helpful to the agent in a situation in which the agent obtains only partial information from the market and has to manage, on its own, wholesale operation, tariff offering to the end customers and small power plants and, at the same time, its balancing energy. For that purpose, different forecasting algorithms were tested and Holt-Winters exponential smoothing was implemented inside the agent logic, since it had a good balance between forecasting performance and runtime. Low runtime is an advantage because the agent competes with agents from the whole world in real time, and a lot of calculations have to be made in a matter of seconds. More on the design of the agent, the agent logic and the contribution to the competitions’ test settings can be found in [157–159].

Now we will look at the findings related to the meta-learning system. Besides the research in short-term load forecasting, I started to conduct research in medium-term and long-term load forecasting. I experimented with different configurations of coding and started to de-trend the time-series and include autocorrelated lags since, in practice, these are known to improve the forecasting performance.

15 WhiBo (from “white box”) is an open-source component-based toolbox (as opposed to the black-box approach) for data mining and algorithm design.
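As a rough illustration of the Holt-Winters exponential smoothing used inside the trading agent above, here is a minimal additive implementation. The smoothing constants, initialization and test series are illustrative assumptions; the thesis does not specify the agent's actual configuration.

```python
def holt_winters_additive(y, m, alpha=0.5, beta=0.1, gamma=0.3, horizon=1):
    """Additive Holt-Winters: level + trend + season of period m.

    Initializes from the first two seasons, smooths through the rest of
    the series, and returns `horizon` step-ahead forecasts.
    """
    level = sum(y[:m]) / m
    trend = (sum(y[m:2 * m]) / m - level) / m
    season = [y[i] - level for i in range(m)]
    for t in range(m, len(y)):
        s = season[t % m]
        prev_level = level
        level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s
    n = len(y)
    return [level + (h + 1) * trend + season[(n + h) % m]
            for h in range(horizon)]

# Illustrative check on a synthetic trend-plus-season series (assumption):
pattern = [0.0, 2.0, -1.0, -1.0]
series = [10 + 0.5 * t + pattern[t % 4] for t in range(40)]
forecast = holt_winters_additive(series, m=4, horizon=4)
```

On a clean trend-plus-season series the forecasts track the next period closely, which is the property (good accuracy at low runtime) that motivated its use inside the agent.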
In a series of tests on a toy set, I found that coding holidays together with Sundays is best (MAPE 1.13 %), compared to separate coding (MAPE 1.19 %) and no special coding for those days (MAPE 1.18 %). On the same toy set, the MAPE with no seasons was 1.12 %, compared to 1.13 % with two seasons and 1.13 % with four seasons, which confirmed the earlier-mentioned result that seasons do not increase the performance (on the contrary, they increase the runtime).

One of the interesting sub-results is the MAPE’s dependence on the step. The step, or step size, is the number of data points which are evaluated together, using recursive simulation, in one cycle of the forecasting. Figure 35 illustrates the results, which were produced on the STLF toy set using ex-post forecasting and determining the lags ex-post, too.

Figure 35: The Forecasting Error Increases with the Step Size (MAPE [%] plotted against the step as number of hours)

A lot of loads are periodic time-series on higher aggregation levels. Figure 36 presents a sample autocorrelation function of a toy set time-series which was used in determining the MAPE’s relationship to certain parameters of the forecasting procedure.

Figure 36: Sample Autocorrelation Function of a Periodic Toy Set Time-series used for Testing

Regarding consecutive lags of the autocorrelation function, the dependence of the forecasting error has a knee at around 6 lags, after which the error’s autocorrelation increases the overall error, as shown in Figure 37.

Figure 37: Dependence of the Forecasting Error of a Periodic Time-series in terms of Number of Lags of the Autocorrelation Function used for Forecasting

The border is defined as the learning set size.
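The step-size mechanism described above can be sketched as follows: a one-step model is iterated `step` times, feeding its own predictions back as inputs, which is why the error compounds as the step grows. The persistence model in the comment is purely illustrative.

```python
from statistics import mean

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * mean(abs((a - f) / a) for a, f in zip(actual, forecast))

def recursive_forecast(one_step_model, history, step):
    """Forecast `step` points ahead by feeding predictions back as inputs."""
    h = list(history)
    out = []
    for _ in range(step):
        yhat = one_step_model(h)
        out.append(yhat)
        h.append(yhat)  # the forecast becomes an input for the next point
    return out

# A persistence model iterated over the step just carries the last value:
# recursive_forecast(lambda h: h[-1], [1, 2, 3], 4) -> [3, 3, 3, 3]
```

Errors made early in the step are consumed as inputs for later points, which matches the rising MAPE curve in Figure 35.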
Just as expected, with an increased number of data points, or learning set size (border), the forecasting error declines for different learning-based algorithms, as shown in Figure 38.

Figure 38: Decrease of Forecasting Error in terms of MAPE with the Increase of Border (LS-SVM, ε-SVR and ν-SVR; MAPE [%] plotted against the border divided by 500)

The 4 Tasks which were not used to learn the meta-learning system were used for testing. I named them A, B, C and D. Task C is LTLF and Tasks A, B and D are STLF. As explained earlier, and following real-life load forecasting practice, the forecasting is conducted as a simulation on the data in an iterative fashion. For Tasks A and B, a full year (365 cycles) is forecasted; for Task C, 1 year (1 cycle) is forecasted; and for Task D, which has forecasts of exogenous variables, 10 days (cycles) are forecasted ex-ante. Default settings are used as the option of the feature selection module.

Although RMSE and MAPE are the most widely used performance measures in time-series forecasting [18] and in load forecasting [19], for learning the meta-learning system and, later, for the performance comparison, I use MASE [72]. This is due to problems with scale dependence (RMSE) and division by zero (MAPE).

The forecasting results are ranked on 65 Tasks to find the best-performing algorithm on each Task. Figure 39 shows non-metric multidimensional scaling of the MASE ranking amongst the 65 Tasks to 2 dimensions.

Figure 39: MDS of Tasks Forecasting with MASE of all Algorithms to 2 Dimensions

The algorithm ranking is used for learning from the meta-features using the ensemble, after which the CART decision tree works on the meta-features of the 65 Tasks for the meta-learning system. Figure 40 presents the results of the CART decision tree. The results suggest the importance of highest ACF, periodicity and fickleness at the meta-level.
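MASE, as defined in [72] and used above for the rankings, scales the out-of-sample MAE by the in-sample MAE of the naive (lag-m) forecast, which avoids both the scale dependence of RMSE and MAPE's division by zero. A minimal sketch:

```python
from statistics import mean

def mase(actual, forecast, train, m=1):
    """Mean absolute scaled error: out-of-sample MAE divided by the
    in-sample MAE of the naive lag-m forecast on the training series."""
    scale = mean(abs(train[t] - train[t - m]) for t in range(m, len(train)))
    mae = mean(abs(a - f) for a, f in zip(actual, forecast))
    return mae / scale
```

A MASE below 1 means the forecast beats the naive method on average, which is how the rankings over the 65 Tasks should be read.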
Before applying the ensemble to Tasks A to D, the performance is tested on the meta-level alone by using leave-one-out cross-validation on the training data. The ensemble’s results are compared against all candidates for it, as shown in Table 2.

Figure 40: The CART Decision Tree shows that the proposed Meta-features are used more than those frequently encountered, which might indicate them as good candidates in the application of Meta-learning to Load Forecasting

We use Pearson’s 𝜒² 2 × 2 contingency table test between the pairs of the approaches [160]. Pearson’s 𝜒² 2 × 2 contingency table counts the categorical responses between two independent groups.

Table 2: Accuracy on the Meta-level

Approach      ED    CART  LVQ   MLP   AutoMLP  SVM   GP    Ensemble
Accuracy [%]  64.6  76.9  73.9  72.3  70.8     74.6  72.3  80.0

The ensemble had the best accuracy, and the difference is statistically significant on the boundary compared to the Euclidean distance and 𝜀-SVM (p = 0.05). Between the other pairs there is no statistically significant difference (p > 0.05). The ensemble is used to find the optimal forecasting algorithm for Tasks A to D.

Finally, I conducted a comparison with 10 other algorithms. In addition to the algorithms used for the creation of the meta-learning system, for the comparison we used simpler approaches like the Elman network, 𝜀-SVR and LS-SVM. The optimizations used for those are the same as those used for the related algorithms in the meta-learning system. Table 3 presents the results of the comparison on Tasks A to D. It shows that the meta-learning approach is superior to the other approaches, demonstrated on Tasks of both univariate and multivariate time-series forecasting. The ensemble returned the best result for all four Tasks. The result of the approach which had the best performance on the corresponding Task is shown in bold.
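The pairwise significance test mentioned above has a closed form for a 2 × 2 table (1 degree of freedom; the p = 0.05 critical value is 3.841). The counts in the comment are a made-up example, not the thesis data.

```python
def chi2_2x2(a, b, c, d):
    """Pearson's chi-squared statistic for the 2x2 contingency table
    [[a, b], [c, d]]: n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical example: two approaches each classify 40 held-out Tasks as
# correct/incorrect; a statistic above 3.841 rejects equality at p = 0.05.
```

Applied to the correct/incorrect counts of two approaches from Table 2, a statistic below the critical value corresponds to the "no significant difference" outcomes reported for most pairs.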
Table 3: Forecasting Error Comparison

                      A                      B                      C                      D
Approach     MASE  NRMSE  MAPE     MASE  NRMSE  MAPE     MASE  NRMSE  MAPE     MASE  NRMSE  MAPE
RW           0.94  0.341  5.30     0.86  0.270  4.79     1.90  0.963  7.83     3.39  1.086  21.20
ARMA         0.82  0.291  4.83     1.07  0.315  6.14     1.72  1.113  7.72     2.05  0.669  12.43
SD           0.89  0.340  5.00     1.74  0.523  8.56     -     -      -        4.95  1.455  31.21
MLP          0.28  0.125  1.48     0.38  0.136  1.88     0.37  0.341  0.57     0.52  0.183  3.08
Elman NN     0.45  0.129  2.42     0.38  0.120  2.07     0.78  0.538  3.66     0.73  0.259  4.46
LRNN         0.47  0.222  2.59     0.33  0.106  1.81     1.01  0.711  4.45     0.76  0.279  4.79
ε-SVR        0.30  0.110  1.78     0.35  0.101  1.96     1.60  1.040  7.19     0.49  0.150  2.88
ν-SVR        0.24  0.096  1.41     0.27  0.086  1.54     1.60  1.039  7.19     0.45  0.139  2.61
LS-SVM       0.16  0.072  0.98     0.20  0.065  1.15     0.43  0.311  2.08     0.43  0.143  2.49
RobLSSVM     0.15  0.065  0.91     0.20  0.065  1.15     0.44  0.340  2.11     0.40  0.139  2.18

Figure 41 presents an example of the forecasting for seven random iterations of Task A.

Figure 41: Typical Difference between Actual and Forecasted Load for Seven Cycles of a Randomly Selected Task using the Meta-learning System

Figure 42 presents the comparison of the error of the meta-learning system against the best algorithm for each cycle of Task A.

Figure 42: MASE Difference for Task A between Meta-learning System and Best Solution Averaged per Cycle

Table 4 presents a summary and error statistics for that example.

Table 4: Error Statistics of Task A

                   Standard dev.    Skewness        Time [s]
Selection          MASE   MAPE      MASE   MAPE     per cycle  Total
Meta-Learning      0.097  0.62      4.84   4.90     106        38,843
All combinations   0.082  0.52      3.69   3.92     370        135,004
ratio [%]          118    119       131    125      29         29

In 82 % of the cycles, the meta-learning would have performed best for Task A. The relative MASE is 1.059 and the relative MAPE is 1.057, where these are defined as the ratio of the “Meta-Learning” and “All combinations” MASE and MAPE, respectively.

The meta-learning system was implemented in MATLAB R2011b. By using the MATLAB Parallel Computing Toolbox, it was parallelized for increased performance and scalability.
The number of CPU cores used is selected in the first module. If it is set to 1, the system does not work in parallel mode and uses only 1 core. The meta-learning system was tested on 32- and 64-bit versions of the following operating systems: Linux, Windows, Unix and Macintosh. It was tested on different configurations running from 1 to 16 cores at the same time. The maximum memory usage was 2 GB of RAM. The runtime of the system was tuned according to industry needs. Table 4 shows that one usual cycle of forecasting (36 data points) took, on average, 106 seconds.

“So long and thanks for all the fish” Douglas Adams16

9 Conclusions

In this doctoral thesis, I proposed a method for time-series forecasting which uses meta-learning. The proposed method is organized as a meta-learning system which works with continuous and interval time-series. Those time-series can be univariate or multivariate, depending on the availability of the data. The proposed method can also handle multivariate time-series as if they were univariate; however, the performance is lower in that case. The proposed meta-learning method includes well-known algorithms, which are used for classification at the meta-level and for regression at the Task level. At the meta-level it uses an ensemble to optimize the learning procedure, and it does classical learning at the Task level. In order to find the optimal set of meta-features, I selected some well-known statistical and information-theoretic meta-features and then created new meta-features which are also statistical and information-theoretic. The new meta-features are: highest ACF, granularity, fickleness and traversity. I selected the meta-features with the ReliefF algorithm for classification. The ReliefF algorithm for classification returned a good set of meta-features for time-series problems involving electric load.
Empirical tests showed that meta-features selected in the proposed way can indicate the more challenging loads, in terms of forecasting, which I was looking for. CART decision trees and the ReliefF test justified the introduction of the new meta-features and favoured highest ACF and fickleness. I proposed a univariate and multivariate time-series forecasting meta-learning approach as a general framework for load forecasting. The proposed approach works as model selection. The meta-learning approach is based on bi-level learning, which uses the ensemble on the higher, meta-level. In a detailed comparison with other approaches to load forecasting, as demonstrated on Tasks A, B, C and D, the proposed approach returned a lower load forecasting error.

16 Douglas Noel Adams (11 March 1952 – 11 May 2001) was an English author, best known for his radio comedy and the books “The Hitchhiker's Guide to the Galaxy”.

This meta-learning system was designed to be parallelized, modular, component-based and easily extendable.

The status of the hypotheses is the following:

Hypothesis H-1 is accepted because the proposed meta-learning system for time-series forecasting obtained a forecasting error less than or equal to that of the following algorithms: random walk, ARMA, Similar days, multilayer perceptron, layer-recurrent neural network, Elman network, 𝜈-support vector regression and robust least squares support vector machines.

Hypothesis H-2 is accepted on the grounds that the results of the ReliefF weights and the CART decision tree show that the newly proposed meta-features have higher weights in determining which forecasting algorithm will be selected.

Hypothesis H-3 is accepted because the forecasting process is now easier, since the user can obtain the optimal algorithm for forecasting, independent of the type of the load forecasting problem, in the same system.
On the problems of short-term, medium-term and long-term load forecasting, both ex-ante and ex-post, the forecasting error of the proposed approach is lower than or equal to the forecasting error obtained by using any of the tested algorithms. The proposed ensemble returns a lower classification error than its single components, such as 𝜀-SVM, which shows that it returns more stable results than any of its components. The runtime of the proposed system is longer than in the case of running each algorithm separately.

Hypothesis H-4 is rejected on the grounds that the introduction of calendar features for seasons did not decrease the load forecasting error in the tested cases; on the contrary, it increased the load forecasting error. Also, the introduction of holidays did not decrease the forecasting error.

The first contribution of this doctoral thesis is time-series forecasting using multivariate meta-learning.

The second contribution is the result of the choice to use meta-learning on time-series problems in the domain of power systems. Although there were meta-features readily available for problems involving time-series, there was previously no work on determining the most appropriate ones. The problem is two-fold: first, to determine the set of candidate meta-features and, second, to determine the suitable ones.

The third contribution stems from the application of multivariate meta-learning to the domain of load forecasting as a framework for load forecasting model selection. Explicitly, the three contributions are stated as follows:

• The method for forecasting of continuous and interval time-series using multivariate meta-learning;
• The approach for definition and selection of time-series meta-features for electric load problems, which uses the ReliefF algorithm for classification;
• The approach for electric load forecasting model selection using meta-learning based on bi-level learning with the ensemble on the higher level.
The meta-learning approach is a new and promising field of research in the area of load forecasting. New optimization methods in feature selection and new forecasting algorithms might lead to higher efficiency and lower forecasting errors. In turn, this would enable research and applications to tackle more complex problems, such as forecasting in smart grids for dynamic dispatch. Forecasting approaches such as hybrids, ensembles and other combinations appear to be a fertile area for further research. Besides the research opportunities, the proposed approach can be used for everyday operation in industry, to help non-experts in forecasting with model selection, and to lower operating costs, thus saving money for society. It can be propagated to new end-user services based on large-scale forecasting and contribute to the development of the electricity market. It is not limited to load forecasting; on the contrary, the load meta-features can be used on other problems which have electric load as an input. The proposed multivariate meta-learning can also be implemented in a wide area of time-series applications, especially those which involve many time-series, such as weather forecasting systems, finance, traffic, security, logistics and retail management.

10 References

[1] NASA, Weather Forecasting Through the Ages [Online Accessed 12.06.2012]. Available: http://earthobservatory.nasa.gov/Features/WxForecasting/wx2.php/
[2] The Free Dictionary, Forecast [Online Accessed 12.06.2012]. Available: http://www.thefreedictionary.com/forecast
[3] Investopedia, Forecasting [Online Accessed 12.06.2012]. Available: http://www.investopedia.com/terms/f/forecasting.asp/
[4] M. Cerjan, M. Matijaš, and M. Delimar, “A hybrid model of dynamic electricity price forecasting with emphasis on price volatility,” in 9th International conference on the European Energy Market – EEM12, 2012, pp. 1–7.
[5] A. J. Mehta, H. A. Mehta, T. C. Manjunath, and C.
Ardil, “A Multi-layer Artificial Neural Network Architecture Design for Load Forecasting in Power Systems,” Journal of Applied Mathematics, vol. 4, no. 4, pp. 227–240.
[6] S. Butler, “UK Electricity Networks,” 2001.
[7] P. O. Reyneau, “Predicting load on residence circuits,” Electrical World, vol. LXXI, no. 19, pp. 969–971, 1918.
[8] G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control. San Francisco: Prentice Hall, 1970.
[9] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.
[10] B.-J. Chen, M.-W. Chang, and C.-J. Lin, “Load Forecasting Using Support Vector Machines: A Study on EUNITE Competition 2001,” IEEE Transactions on Power Systems, vol. 19, no. 4, pp. 1821–1830, 2004.
[11] M. S. Sachdev, R. Billinton, and C. A. Peterson, “Representative bibliography on load forecasting,” IEEE Transactions on Power Apparatus and Systems, vol. 96, no. 2, pp. 697–700, Mar. 1977.
[12] G. Gross and F. D. Galiana, “Short-term load forecasting,” Proceedings of the IEEE, vol. 75, no. 12, pp. 1558–1573, 1987.
[13] G. E. Huck, A. A. Mahmoud, R. B. Comerford, J. Adams, and E. Dawson, “Load Forecasting Bibliography: Phase I,” IEEE Transactions on Power Apparatus and Systems, vol. PAS-99, no. 1, pp. 53–58, 1980.
[14] A. A. Mahmoud, T. H. Ortmeyer, and R. E. Reardon, “Load Forecasting Bibliography - Phase II,” IEEE Transactions on Power Apparatus and Systems, vol. PAS-100, no. 7, pp. 3217–3220, Oct. 1981.
[15] I. Moghram and S. Rahman, “Analysis and evaluation of five short-term load forecasting techniques,” IEEE Transactions on Power Systems, vol. 4, no. 4, pp. 1484–1491, 1989.
[16] S. Tzafestas and E. Tzafestas, “Computational Intelligence Techniques for Short-Term Electric Load Forecasting,” Journal of Intelligent and Robotic Systems, vol. 31, no. 1–3, pp. 7–68, 2001.
[17] H. K. Alfares and M.
Nazeeruddin, “Electric load forecasting: literature survey and classification of methods,” International Journal of Systems Science, vol. 33, no. 1, pp. 23–34, Jan. 2002.
[18] R. Sankar and N. I. Sapankevych, “Time Series Prediction Using Support Vector Machines: A Survey,” IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 24–38, 2009.
[19] H. Hahn, S. Meyer-Nieberg, and S. Pickl, “Electric load forecasting methods: Tools for decision making,” European Journal of Operational Research, vol. 199, no. 3, pp. 902–907, 2009.
[20] Y. Wang, Q. Xia, and C. Kang, “Secondary Forecasting Based on Deviation Analysis for Short-Term Load Forecasting,” IEEE Transactions on Power Systems, vol. 26, no. 2, pp. 500–507, 2011.
[21] M. Espinoza, J. A. K. Suykens, R. Belmans, and B. De Moor, “Electric Load Forecasting using kernel-based modeling for nonlinear system identification,” IEEE Control Systems Magazine, vol. 27, no. 5, pp. 43–57, 2007.
[22] W.-C. Hong, “Electric load forecasting by seasonal recurrent SVR (support vector regression) with chaotic artificial bee colony algorithm,” Energy, vol. 36, no. 9, pp. 5568–5578, Sep. 2011.
[23] J. W. Taylor, “Short-Term Load Forecasting With Exponentially Weighted Methods,” IEEE Transactions on Power Systems, vol. 27, no. 1, pp. 458–464, 2012.
[24] J. H. Landon, “Theories of vertical integration and their application to the electric utility industry,” Antitrust Bulletin, vol. 28, no. 101, 1983.
[25] B. F. Hobbs, S. Jitprapaikulsarn, S. Konda, V. Chankong, K. A. Loparo, and D. J. Maratukulam, “Analysis of the Value for Unit Commitment of Improved Load Forecasts,” IEEE Transactions on Power Systems, vol. 14, no. 4, pp. 1342–1348, 1999.
[26] M. Matijaš, M. Cerjan, and S. Krajcar, “Mean-based Method of Estimating Financial Cost of Load Forecasting,” in 8th International conference on the European Energy Market – EEM11, 2011, pp. 192–197.
[27] W.-C. Hong, Y. Dong, C.-Y. Lai, L.-Y. Chen, and S.-Y.
Wei, “SVR with Hybrid Chaotic Immune Algorithm for Seasonal Load Demand Forecasting,” Energies, vol. 4, no. 6, pp. 960–977, Jun. 2011.
[28] M. Matijaš, M. Cerjan, and S. Krajcar, “Features Affecting the Load Forecasting Error on Country Level,” in 3rd International Youth Conference on Energetics 2011 – IYCE’11, 2011, pp. 1–7.
[29] Z. Guo, W. Li, A. Lau, T. Inga-Rojas, and K. Wang, “Detecting X-Outliers in Load Curve Data in Power Systems,” IEEE Transactions on Power Systems, vol. 27, no. 2, pp. 875–884, 2012.
[30] J. W. Taylor, “An evaluation of methods for very short-term load forecasting using minute-by-minute British data,” International Journal of Forecasting, vol. 24, no. 4, pp. 645–658, Oct. 2008.
[31] Q. Chen, J. Milligan, E. H. Germain, R. Raub, P. Sharnsollahi, and K. W. Cheung, “Implementation and Performance Analysis of Very Short Term Load Forecaster - Based on the Electronic Dispatch Project in ISO New England,” in Large Engineering Systems Conference on Power Engineering – LESCOPE ’01, 2001.
[32] P. Shamsollahi, K. W. Cheung, and E. H. Germain, “A neural network based very short term load forecaster for the interim ISO New England electricity market system,” in 22nd IEEE Power Engineering Society International Conference on Power Industry Computer Applications – PICA 2001, 2001, pp. 217–222.
[33] C. Guan, P. B. Luh, L. D. Michel, Y. Wang, and P. B. Friedland, “Very Short-Term Load Forecasting: Wavelet Neural Networks With Data Pre-Filtering,” IEEE Transactions on Power Systems, pp. 1–12, 2012.
[34] M. Alamaniotis, A. Ikonomopoulos, and L. H. Tsoukalas, “Evolutionary Multiobjective Optimization of Kernel-Based Very-Short-Term Load Forecasting,” IEEE Transactions on Power Systems, vol. 27, no. 3, pp. 1477–1484, 2012.
[35] SCOPUS [Online Accessed 13.06.2012]. Available: http://www.scopus.com/
[36] L. Dannecker, M. Boehm, U. Fischer, F. Rosenthal, G. Hackenbroich, and W.
Lehner, “FP7 Project MIRABEL D 4.1: State-of-the-Art Report on Forecasting,” Dresden, 2010.
[37] S. Osowski and K. Siwek, “The Selforganizing Neural Network Approach to Load Forecasting in the Power System,” in IJCNN’99 International Joint Conference on Neural Networks, vol. 5, pp. 3401–3404.
[38] ENTSO-E [Online Accessed 12.06.2012]. Available: http://www.entsoe.net/
[39] F. J. Marin, F. Garcia-Lagos, G. Joya, and F. Sandoval, “Global model for short-term load forecasting using artificial neural networks,” IEE Proceedings - Generation, Transmission and Distribution, vol. 149, no. 2, p. 121, 2002.
[40] M. Misiti, Y. Misiti, and G. Oppenheim, “Optimized clusters for disaggregated electricity load forecasting,” REVSTAT - Statistical Journal, vol. 8, no. 2, pp. 105–124, 2010.
[41] Regulating Power webpage on Nord Pool Spot [Online Accessed 01.03.2011]. Available: http://www.nordpoolspot.com/reports/regulating_power3/
[42] Wikipedia, Learning [Online Accessed 12.06.2012]. Available: http://en.wikipedia.org/wiki/Learning/
[43] Merriam Webster Dictionary: Learning [Online Accessed 12.06.2012]. Available: http://www.merriam-webster.com/dictionary/learning/
[44] Oxford Dictionaries: Learning [Online Accessed 12.06.2012]. Available: http://oxforddictionaries.com/definition/learning/
[45] T. M. Mitchell, Machine learning. McGraw-Hill, 1997, p. 2.
[46] D. W. Aha, “Generalizing from Case Studies: A Case Study,” in Proceedings of the Ninth International Conference on Machine Learning, 1992, pp. 1–10.
[47] D. H. Wolpert, “The Lack of A Priori Distinctions Between Learning Algorithms,” Neural Computation, vol. 8, no. 7, pp. 1341–1390, 1996.
[48] C. Giraud-Carrier, Metalearning - A Tutorial. 2008, pp. 1–38.
[49] J. R. Rice, “The algorithm selection problem,” Advances in Computers, no. 15, pp. 65–118, 1976.
[50] P. E.
Utgoff, “Shift of Bias for Inductive Concept Learning,” in Machine Learning: An Artificial Intelligence Approach, Vol. 2, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds. Morgan Kaufmann Publishers, 1986, pp. 107–148.
[51] L. Rendell, R. Seshu, and D. Tcheng, “Layered Concept-learning and Dynamical Variable Bias Management,” in 10th International Joint Conference on Artificial Intelligence, 1987, pp. 308–314.
[52] Y. Kodratoff, D. Sleeman, M. Uszynski, K. Causse, and S. Craw, Building a Machine Learning Toolbox. Elsevier Science Publishers, 1992, pp. 81–108.
[53] P. Brazdil and R. J. Henery, “Analysis of Results,” in Machine Learning, Neural and Statistical Classification, D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds. 1994, pp. 175–196.
[54] B. Pfahringer, H. Bensusan, and C. Giraud-Carrier, “Meta-Learning by Landmarking Various Learning Algorithms,” in Proceedings of the Seventeenth International Conference on Machine Learning – ICML 2000, 2000, pp. 743–750.
[55] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, CRISP-DM 1.0 Step-by-Step Data Mining Guide. SPSS Inc., 2000, pp. 1–78.
[56] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining, 1st ed. Berlin: Springer-Verlag, 2009.
[57] A. Bernstein, F. Provost, and S. Hill, “Toward Intelligent Assistance for a Data Mining Process: An Ontology-based Approach for Cost-sensitive Classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 503–518, 2005.
[58] S. Ali and K. A. Smith-Miles, “A meta-learning approach to automatic kernel selection for support vector machines,” Neurocomputing, vol. 70, no. 1–3, pp. 173–186, Dec. 2006.
[59] B. Arinze, “Selecting appropriate forecasting models using rule induction,” Omega - The International Journal of Management Science, vol. 22, no. 6, pp. 647–658, 1994.
[60] R. B. C. Prudêncio and T. B.
Ludermir, “Meta-learning approaches to selecting time series models,” Neurocomputing, vol. 61, pp. 121–137, Oct. 2004.
[61] X. Wang, K. Smith-Miles, and R. Hyndman, “Rule induction for forecasting method selection: Meta-learning the characteristics of univariate time series,” Neurocomputing, vol. 72, no. 10–12, pp. 2581–2594, Jun. 2009.
[62] C. Lemke and B. Gabrys, “Meta-learning for time series forecasting and forecast combination,” Neurocomputing, vol. 73, no. 10–12, pp. 2006–2016, Jun. 2010.
[63] M. Matijaš, J. A. K. Suykens, and S. Krajcar, “Load forecasting using a multivariate meta-learning system,” Expert Systems with Applications, vol. 40, no. 11, pp. 4427–4437, 2013.
[64] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, “CART: Classification and Regression Trees,” Wadsworth, Belmont, CA, vol. 30, no. 1, pp. 69–71, 2009.
[65] T. Kohonen, Self-Organizing Maps. Berlin: Springer, 1997.
[66] M. H. Beale, M. T. Hagan, and H. B. Demuth, Neural Network Toolbox User’s Guide R2012a. MathWorks Inc., 2012, p. 113.
[67] T. Kohonen, “Learning vector quantization,” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, USA: MIT Press, 1995, pp. 537–540.
[68] G. Cybenko, “Approximation by Superpositions of a Sigmoidal Function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[69] T. M. Breuel and F. Shafait, “AutoMLP: Simple, Effective, Fully Automated Learning Rate and Size Adjustment,” in The Learning Workshop, Snowbird, USA, 2010.
[70] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, USA: MIT Press, 2006, pp. 1–219.
[71] C. E. Rasmussen, Machine Learning Summer School (MLSS): Solving Challenging Non-linear Regression Problems by Manipulating a Gaussian Distribution. 2009.
[72] R. J. Hyndman and A. B. Koehler, “Another look at measures of forecast accuracy,” International Journal of Forecasting, vol. 22, no. 4, pp. 679–688, Oct. 2006.
[73] K.
Pearson, “The Problem of the Random Walk,” Nature, vol. 72, no. 1865, p. 294, 1905.
[74] Wikipedia, Random Walk [Online Accessed 06.12.2012]. Available: http://en.wikipedia.org/wiki/Random_walk/
[75] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting, 2nd ed. New York, Berlin, Heidelberg: Springer Verlag Inc., 2002, pp. 1–428.
[76] R. Weron, Modeling and Forecasting Electricity Loads and Prices - A Statistical Approach. Southern Gate, Chichester: John Wiley & Sons, Ltd, 2006, pp. 1–170.
[77] J.-Y. Fan and J. D. McDonald, “A real-time implementation of short-term load forecasting for distribution power systems,” IEEE Transactions on Power Systems, vol. 9, no. 2, pp. 988–993, 1994.
[78] J.-F. Chen, W.-M. Wang, and C.-M. Huang, “Analysis of an adaptive time-series autoregressive moving-average (ARMA) model for short-term load forecasting,” Electric Power Systems Research, vol. 34, no. 3, pp. 187–196, 1995.
[79] L. D. Paarmann and M. D. Najar, “Adaptive online load forecasting via time series modeling,” Electric Power Systems Research, vol. 32, no. 3, pp. 219–225, 1995.
[80] J. Nowicka-Zagrajek and R. Weron, “Modeling electricity loads in California: ARMA models with hyperbolic noise,” Signal Processing, vol. 82, pp. 1903–1915, 2002.
[81] H. Cramér and M. R. Leadbetter, Stationary and Related Stochastic Processes. Sample Function Properties and Their Applications. New York, London, Sydney: John Wiley & Sons, Inc., 1967, pp. 1–348.
[82] P. M. T. Broersen, Automatic Autocorrelation and Spectral Analysis. London: Springer-Verlag, 2006, pp. 1–298.
[83] G. Eshel, “The Yule Walker Equations for the AR Coefficients,” 2009.
[84] P. M. T. Broersen, “ARMASA Toolbox with Applications,” in Automatic Autocorrelation and Spectral Analysis, London: Springer-Verlag, 2006, pp. 223–250.
[85] J. P.
Burg, "Maximum Entropy Spectral Analysis," in Proceedings of the 37th Annual International Society of Exploration Geophysicists Meeting, 1967, vol. 6, pp. 1–6.
[86] P. M. T. Broersen and H. E. Wensink, "On Finite Sample Theory for Autoregressive Model Order Selection," IEEE Transactions on Signal Processing, vol. 41, no. 1, pp. 194–204, 1993.
[87] P. M. T. Broersen, "The Quality of Models for ARMA Processes," IEEE Transactions on Signal Processing, vol. 46, no. 6, pp. 1749–1752, 1998.
[88] P. M. T. Broersen, "Facts and Fiction in Spectral Analysis," IEEE Transactions on Instrumentation and Measurement, vol. 49, no. 4, pp. 766–772, 2000.
[89] P. Mandal, T. Senjyu, and T. Funabashi, "Neural networks approach to forecast several hour ahead electricity prices and loads in deregulated market," Energy Conversion and Management, vol. 47, no. 15–16, pp. 2128–2142, Sep. 2006.
[90] X. Li, C. Sun, and D. Gong, "Application of Support Vector Machine and Similar Day Method for Load Forecasting," in ICNC'05: Proceedings of the First International Conference on Advances in Natural Computation - Volume Part II, 2005, pp. 602–609.
[91] P. Mandal, T. Senjyu, N. Urasaki, T. Funabashi, and A. K. Srivastava, "A Novel Approach to Forecast Electricity Price for PJM Using Neural Network and Similar Days Method," IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 2058–2065, 2007.
[92] S. Rahman and R. Bhatnagar, "An Expert System Based Algorithm for Short Term Load Forecast," IEEE Transactions on Power Systems, vol. 3, no. 2, pp. 392–399, 1988.
[93] I. Aleksander and H. Morton, An Introduction to Neural Computing, 1st ed. London: Chapman & Hall, 1990, pp. 1–250.
[94] S. O. Haykin, Neural Networks and Learning Machines, 3rd ed. Prentice Hall, 2008, pp. 1–906.
[95] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[96] J.
von Neumann, Papers of John von Neumann on Computing and Computing Theory. Cambridge, USA: MIT Press, 1986.
[97] S. O. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Pearson Education Ltd., 2005, pp. 1–842.
[98] J. von Neumann, The Computer and the Brain. New Haven, London: Yale University Press, 1958.
[99] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[100] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington: Spartan Books, 1962.
[101] J. D. Cowan, "A Mathematical Theory of Central Nervous Activity," University of London, 1967.
[102] M. L. Minsky and S. A. Papert, Perceptrons. MIT Press, 1969.
[103] P. J. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," Harvard University, 1974.
[104] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, pp. 59–69, 1982.
[105] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[106] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[107] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, pp. 147–169, 1985.
[108] D. S. Broomhead and D. Lowe, "Multi-variable functional interpolation and adaptive networks," Complex Systems, vol. 2, pp. 321–355, 1988.
[109] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, vol. 1. MIT Press, 1986, pp. 135–153.
[110] NSF QSTORM Project website [Online; accessed 13.06.2012]. Available: http://www.qstorm.org/wp-content/uploads/2012/03/axon-dendrite-etc.jpg/
[111] W. J. Freeman, Mass Action in the Nervous System. New York, San Francisco, London: Academic Press, 1975, pp. 2–452.
[112] T. P.
Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon, "Accelerating the convergence of the back-propagation method," Biological Cybernetics, vol. 59, no. 4, pp. 257–263, 1988.
[113] V. N. Vapnik, Statistical Learning Theory, 1st ed. Wiley-Interscience, 1998, pp. 1–736.
[114] A. J. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression," 1998.
[115] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, Aug. 2004.
[116] B. Schölkopf, A. J. Smola, R. Williamson, and P. Bartlett, "New support vector algorithms," Neural Computation, vol. 12, no. 5, pp. 1207–1245, May 2000.
[117] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), 1992, pp. 144–152.
[118] M. A. Aizerman, E. M. Braverman, and L. I. Rozoner, "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, vol. 25, pp. 821–837, 1964.
[119] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer, 1995, p. 188.
[120] M. Pontil, R. Rifkin, and T. Evgeniou, "From Regression to Classification in Support Vector Machines," 1998.
[121] C. Cortes and V. N. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[122] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: Cambridge University Press, 2009, pp. 1–683.
[123] I. B. Vapnyarskii, "Lagrange multipliers," Encyclopedia of Mathematics. Springer, 2001.
[124] O. L. Mangasarian, Nonlinear Programming. New York: McGraw-Hill, 1969.
[125] A. V. Fiacco and G. P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques. SIAM, 1968, pp. 1–210.
[126] B. Schölkopf and A. J. Smola, Learning with Kernels. MIT Press, 2002.
[127] J.
Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society, vol. A 209, pp. 415–446, 1909.
[128] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble, "A statistical framework for genomic data fusion," Bioinformatics, vol. 20, no. 16, pp. 2626–2635, Nov. 2004.
[129] D. Basak, S. Pal, and D. C. Patranabis, "Support Vector Regression," Neural Information Processing - Letters and Reviews, vol. 11, no. 10, pp. 203–224, Nov. 2007.
[130] B. Schölkopf, P. Bartlett, A. Smola, and R. Williamson, "Shrinking the Tube: A New Support Vector Regression Algorithm," in Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, 1999, pp. 330–336.
[131] J. A. K. Suykens and J. Vandewalle, "Least Squares Support Vector Machine Classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[132] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines, 1st ed. Singapore: World Scientific, 2002.
[133] J. De Brabanter, "LS-SVM Regression Modelling and its Applications," Katholieke Universiteit Leuven, 2004.
[134] J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, "Weighted least squares support vector machines: robustness and sparse approximation," Neurocomputing, vol. 48, no. 1–4, pp. 85–105, 2002.
[135] K. De Brabanter, K. Pelckmans, J. De Brabanter, M. Debruyne, J. A. K. Suykens, M. Hubert, and B. De Moor, "Robustness of Kernel Based Regression: a Comparison of Iterative Weighting Schemes," in Lecture Notes in Computer Science: Artificial Neural Networks - ICANN 2009, 2009, pp. 100–110.
[136] Weather Underground [Online; accessed 27.11.2012]. Available: http://www.wunderground.com/
[137] European Commission, "Demography report 2010," 2012.
[138] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., Machine Learning, Neural and Statistical Classification. Ellis Horwood,
1994, pp. 1–290.
[139] R. Engels and C. Theusinger, "Using a Data Metric for Preprocessing Advice for Data Mining Applications," in Proceedings of the European Conference on Artificial Intelligence (ECAI 98), 1998, pp. 430–434.
[140] M. Hilario and A. Kalousis, "Fusion of Meta-Knowledge and Meta-Data for Case-Based Model Selection," in Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 01), 2001, vol. 2168, pp. 180–191.
[141] J. Gama and P. Brazdil, "Characterization of Classification Algorithms," in Progress in Artificial Intelligence, Proceedings of the Seventh Portuguese Conference on Artificial Intelligence, 1995, vol. 990, pp. 189–200.
[142] S. Y. Sohn, "Meta analysis of classification algorithms for pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1137–1144, 1999.
[143] Y. Peng, P. A. Flach, C. Soares, and P. Brazdil, "Improved Dataset Characterisation for Meta-learning," in Proceedings of the 5th International Conference on Discovery Science, 2002, vol. 2534, pp. 141–152.
[144] H. Bensusan, "God doesn't always shave with Occam's razor - learning when and how to prune," in Proceedings of the 10th European Conference on Machine Learning (ECML '98), 1998, pp. 119–124.
[145] P. B. Brazdil, C. Soares, and J. P. Da Costa, "Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results," Machine Learning, vol. 50, no. 3, pp. 251–277, 2003.
[146] C. Köpf and I. Iglezakis, "Combination of Task Description Strategies and Case Base Properties for Meta-Learning," in Proceedings of the Second International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning (IDDM-2002), 2002, pp. 65–76.
[147] H. Bensusan and A. Kalousis, "Estimating the Predictive Accuracy of a Classifier," in Proceedings of the 12th European Conference on Machine Learning (ECML '01), 2001, pp. 25–36.
[148] L. Todorovski, H.
Blockeel, and S. Džeroski, "Ranking with predictive clustering trees," in Proceedings of the 13th European Conference on Machine Learning, 2002, vol. 2430, pp. 444–455.
[149] I. Kononenko, "Estimating Attributes: Analysis and Extensions of RELIEF," Lecture Notes in Computer Science: Machine Learning - ECML 1994, vol. 784, pp. 171–182, 1994.
[150] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in Proceedings of the National Conference on Artificial Intelligence, 1992, pp. 129–134.
[151] J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, vol. 29, no. 1, pp. 1–27, 1964.
[152] M. Matijaš and T. Lipić, "Application of Short Term Load Forecasting using Support Vector Machines in RapidMiner 5.0," 2010.
[153] M. Matijaš, M. Vukićević, and S. Krajcar, "Supplier Short Term Load Forecasting Using Support Vector Regression and Exogenous Input," Journal of Electrical Engineering, vol. 62, no. 5, pp. 280–285, 2011.
[154] M. Vukićević and M. Matijaš, "Electricity Customer Classification using WhiBo Decision Trees," in 2nd RapidMiner Community Meeting and Conference (RCOMM 2011), 2011.
[155] M. Vukićević, M. Jovanović, B. Delibašić, and M. Suknović, "WhiBo RapidMiner plug-in for component based data mining algorithm design," in 1st RapidMiner Community Meeting and Conference, 2010.
[156] X. Huang, M. Matijaš, and J. A. K. Suykens, "Hinging Hyperplanes for Time-Series Segmentation," IEEE Transactions on Neural Networks and Learning Systems, 2013.
[157] J. Babić, S. Matetić, M. Matijaš, I. Buljević, I. Pranjić, M. Mijić, and M. Augustinović, "The CrocodileAgent 2012: Research for Efficient Agent-based Electricity Trading Mechanisms," in Proceedings of the Special Session on Trading Agent Competition, KES-AMSTA 2012, 2012, pp. 1–13.
[158] J. Babić, S. Matetić, M. Matijaš, I. Buljević, I. Pranjić, M. Mijić, M. Augustinović, A. Petrić, and V.
Podobnik, "Towards a Steeper Learning Curve with the CrocodileAgent 2012," in 1st Erasmus Energy Forum, 2012.
[159] S. Matetić, J. Babić, M. Matijaš, A. Petrić, and V. Podobnik, "The CrocodileAgent 2012: Negotiating Agreements in Smart Grid Tariff Market," in Proceedings of the 1st International Conference on Agreement Technologies (AT 2012), 2012, pp. 203–204.
[160] Mathbeans Project: Chi Square Statistic [Online; accessed 19.01.2013]. Available: http://math.hws.edu/javamath/ryan/ChiSquare.html

Biography

He was born in Split, Croatia, where he finished the Ruđer Bošković elementary school (Spinut) and the III Mathematical Gymnasium (MIOC). He graduated from the Faculty of Electrical Engineering and Computing, University of Zagreb, in 2008 with the master thesis "Risk Management on the Model of a Small Electric Energy Trading Company". Since then he has been working as a power analyst at HEP Opskrba, the electricity retail subsidiary of the HEP Group. In the academic year 2011/12 he spent a total of six months at KU Leuven under the supervision of Professor Johan A. K. Suykens. He is married.

Scientific publications in the scope of this thesis:

*1* X. Huang, M. Matijaš, and J. A. K. Suykens, "Hinging Hyperplanes for Time-Series Segmentation," IEEE Transactions on Neural Networks and Learning Systems, 2013, doi: http://dx.doi.org/10.1109/TNNLS.2013.2254720
*2* M. Matijaš, J. A. K. Suykens, and S. Krajcar, "Load Forecasting using a Multivariate Meta-Learning System," Expert Systems with Applications, vol. 40, no. 11, pp. 4427–4437, 2013, doi: http://dx.doi.org/10.1016/j.eswa.2013.01.047
*3* S. Matetić, J. Babić, M. Matijaš, A. Petrić, and V. Podobnik, "The CrocodileAgent 2012: Negotiating Agreements in Smart Grid Tariff Market," in Proceedings of the 1st International Conference on Agreement Technologies (AT 2012), Dubrovnik, Croatia, 2012, pp. 203–204.
*4* J. Babić, S. Matetić, M. Matijaš, I. Buljević, I. Pranjić, M. Mijić, and M. Augustinović, "The CrocodileAgent 2012: Research for Efficient Agent-based Electricity Trading Mechanisms," in Proceedings of the Special Session on Trading Agent Competition, KES-AMSTA 2012, Dubrovnik, Croatia, 2012, pp. 1–13.
*5* M. Cerjan, M. Matijaš, and M. Delimar, "A hybrid model of dynamic electricity price forecasting with emphasis on price volatility," in 9th International Conference on the European Energy Market (EEM12), Florence, Italy, 10–12 May 2012.
*6* M. Matijaš, M. Vukićević, and S. Krajcar, "Supplier Short Term Load Forecasting Using Support Vector Regression and Exogenous Input," Journal of Electrical Engineering, vol. 62, no. 5, pp. 280–285, 2011, doi: http://dx.doi.org/10.2478/v10187-011-0044-9
*7* M. Matijaš, M. Cerjan, and S. Krajcar, "Features Affecting Load Forecasting Error on Country Level," in 3rd International Youth Conference on Energetics (IYCE'11), Leiria, Portugal, 7–9 July 2011.
*8* M. Vukićević and M. Matijaš, "Electricity Customer Classification using WhiBo Decision Trees," in 2nd RapidMiner Community Meeting and Conference (RCOMM 2011), Dublin, Ireland, 7–10 June 2011.
*9* M. Matijaš, M. Cerjan, and S. Krajcar, "Mean-based Method of Estimating Financial Cost of Load Forecasting," in 8th International Conference on the European Energy Market (EEM11), Zagreb, Croatia, 25–27 May 2011.
*10* M. Matijaš and T. Lipić, "Application of Short Term Load Forecasting using Support Vector Machines in RapidMiner 5.0," in RapidMiner Community Meeting and Conference (RCOMM 2010), Dortmund, Germany, 13–16 September 2010.
