IMPROVED HYPER-TEMPORAL FEATURE EXTRACTION METHODS FOR LAND
COVER CHANGE DETECTION IN SATELLITE TIME SERIES
By
Brian Paxton Salmon
Submitted in partial fulfilment of the requirements for the degree
Philosophiae Doctor (Electronic)
in the
Faculty of Engineering, Built Environment and Information Technology
Department of Electrical, Electronic and Computer Engineering
UNIVERSITY OF PRETORIA
August 2012
© University of Pretoria
SUMMARY
IMPROVED HYPER-TEMPORAL FEATURE EXTRACTION METHODS FOR LAND
COVER CHANGE DETECTION IN SATELLITE TIME SERIES
by
Brian Paxton Salmon
Promoter: Prof J.C. Olivier
Department: Electrical, Electronic and Computer Engineering
University: University of Pretoria
Degree: Philosophiae Doctor (Electronic)
Keywords: classification, clustering, change detection, extended Kalman filter,
Fourier transform, satellite, time series
The growth in global population inevitably increases the consumption of natural resources. The need
to provide basic services to these growing communities leads to an increase in anthropogenic changes
to the natural environment. The resulting transformation of vegetation cover (e.g. deforestation,
agricultural expansion, urbanisation) has significant impacts on hydrology, biodiversity, ecosystems
and climate. Human settlement expansion is the most common driver of land cover change in South
Africa, and is currently mapped on an irregular, ad hoc basis using visual interpretation of aerial
photographs or satellite images. This thesis proposes several methods of detecting newly formed
human settlements using hyper-temporal, multi-spectral, medium spatial resolution MODIS land
surface reflectance satellite imagery. The hyper-temporal images are used to extract time series, which
are analysed in an automated fashion using machine learning methods. A post-classification change
detection framework was developed to analyse the time series using several feature extraction methods
and classifiers. Two novel hyper-temporal feature extraction methods are proposed to characterise
the seasonal pattern in the time series. The first feature extraction method extracts Seasonal Fourier
features that exploit the difference in temporal spectra inherent to land cover classes. The second
feature extraction method extracts state-space vectors derived using an extended Kalman filter. The
extended Kalman filter is optimised using a novel criterion which exploits the information inherent
in the spatio-temporal domain. The post-classification change detection framework was evaluated on
different classifiers; both supervised and unsupervised methods were explored. A change detection
accuracy above 85% with a false alarm rate below 10% was attained. The best performing methods
were then applied at a provincial scale in the Gauteng and Limpopo provinces to produce regional
change maps, indicating settlement expansion.
OPSOMMING
IMPROVED HIGH TEMPORAL RESOLUTION FEATURE EXTRACTION METHODS FOR THE
DETECTION OF LAND COVER CHANGE USING A SATELLITE TIME SERIES
by
Brian Paxton Salmon
Promoter: Prof J.C. Olivier
Department: Electrical, Electronic and Computer Engineering
University: University of Pretoria
Degree: Philosophiae Doctor (Electronic)
Keywords: classification, clustering, change detection, extended Kalman filter,
Fourier transform, satellite, time series
The growth in the global population causes increased consumption of natural resources. The
need to deliver basic services to these growing communities leads to an increase
in anthropogenic changes to the natural environment. The resulting transformation of
vegetation cover (e.g. deforestation, agricultural expansion, urbanisation) has a significant impact
on hydrology, ecosystems and the climate. Settlement expansion is the most common
cause of land cover change in South Africa, and information on where and when new
settlements arise is currently obtained on an irregular basis through the visual interpretation
of aerial photographs or satellite images. This thesis proposes several methods for the detection of
newly established settlements using hyper-temporal, multi-spectral, medium spatial
resolution MODIS land surface reflectance satellite images. The hyper-temporal images are
used to extract time series, which are then analysed automatically using machine learning
methods. A post-classification change detection framework was developed to analyse time series
using several feature extraction methods and classifiers. Two new
hyper-temporal feature extraction methods are proposed to characterise the seasonal pattern in the
series. The first feature extraction method extracts Seasonal Fourier features
from the time series, which emphasise the temporal spectrum properties of different land cover
classes. The second feature extraction method extracts state-space vectors from the
time series, which are obtained using an extended Kalman filter. The extended Kalman filter
is optimised using a new criterion that is based on the information
in the spatio-temporal domain. The post-classification change detection framework was
evaluated with different classifiers; both supervised and unsupervised methods were investigated. A
change detection accuracy above 85% with a false alarm rate below 10% was achieved. The best
methods were applied at a provincial scale in the Gauteng and Limpopo provinces to produce local
change maps.
This thesis is dedicated to:
God Almighty, for all the countless opportunities that He has given me;
My loving family and friends, thank you for all your love, support, and sacrifice throughout my life.
We all grow up with the weight of history on us. Our ancestors
dwell in the attics of our brains as they do in the spiraling chains
of knowledge hidden in every cell of our bodies. - Shirley Abbott
ACKNOWLEDGEMENT
The author would like to thank the following people and institutions, without whose help this thesis
would not have been possible:
• The Council for Scientific and Industrial Research for supporting me on their PhD studentship
programme.
• My study leader, Prof J.C. Olivier, for all the advice and guidance he has given me throughout
the course of my studies.
• My co-promoters, Dr. Frans van den Bergh and Dr. Konrad Wessels, for all their insight, advice
and help.
• My fellow student, Waldo Kleynhans, for all his useful suggestions and advice.
• The University of Pretoria’s computer clusters maintained by Hans Grobler, which greatly aided
in my simulations.
• Karen Steenkamp for providing me with the necessary data used for training and validation
purposes.
• Willem Marais for providing me with custom developed image processing software.
• The financial assistance of the National Research Foundation (NRF) towards this research is
hereby acknowledged. Opinions expressed and conclusions arrived at, are those of the author
and are not necessarily to be attributed to the NRF.
LIST OF ABBREVIATIONS
ACF     Autocorrelation Function
AIC     Akaike Information Criterion
AIRS    Atmospheric Infrared Sounder
ALS     Autocovariance Least Squares
AM      Ante Meridiem
AMSR    Advanced Microwave Scanning Radiometer
AMSU-A  Advanced Microwave Sounding Unit
ANN     Artificial Neural Network
ASTER   Advanced Spaceborne Thermal Emission and Reflection radiometer
ATBD    Algorithm Theoretical Basis Document
AVHRR   Advanced Very High Resolution Radiometer
BFAST   Break For Additive Seasonal and Trend
BFGS    Broyden-Fletcher-Goldfarb-Shanno
BMU     Best Matching Unit
BRDF    Bidirectional Reflectance Distribution Function
BVEP    Bias-Variance Equilibrium Point
BVS     Bias-Variance Score
BVSA    Bias-Variance Search Algorithm
CERES   Clouds and the Earth’s Radiant Energy System
CVA     Change Vector Analysis
CXC     Chandra X-ray Center
CZCS    Coastal Zone Color Scanner
DFT     Discrete Fourier Transform
EKF     Extended Kalman Filter
EM      Expectation Maximization
EOS     Earth Observation System
ERTS    Earth Resource Technology Satellite
ETM+    Enhanced Thematic Mapper Plus
EVI     Enhanced Vegetation Index
FAS     Foreign Agricultural Services
FFT     Fast Fourier Transform
FSA     Farm Service Agency
Gb      Gigabit
GDP     Gross Domestic Product
GEO     Group on Earth Observations
GEOSS   Global Earth Observation System of Systems
GIS     Geographical Information System
GPS     Global Positioning System
HDF     Hierarchical Data Format
HIRS    High Resolution Infrared Spectrometer
HSB     Humidity Sounder for Brazil
IDFT    Inverse Discrete Fourier Transform
IFFT    Inverse Fast Fourier Transform
IFOV    Instantaneous Field of View
LS      Least Squares
LSF     Line Spread Function
LSMA    Linear Spectral Mixture Analysis
LULC    Land Use Land Change
MISR    Multi-angle Imaging SpectroRadiometer
MLP     Multilayer Perceptron
MODIS   MODerate-resolution Imaging Spectroradiometer
MOPITT  Measurements of Pollution in the Troposphere
MSS     Multi-Spectral Scanner
NASA    National Aeronautics and Space Administration
NASS    National Agricultural Statistics Services
NDVI    Normalized Difference Vegetation Index
NIR     Near InfraRed
NLC     National Land Cover
OLS     Ordinary Least Squares
PCA     Principal Component Analysis
PM      Post Meridiem
PSF     Point Spread Function
RBF     Radial Basis Function
RGB     Red Green Blue
RPROP   Resilient backpropagation
SAO     Smithsonian Astrophysical Observatory
SFF     Seasonal Fourier Features
SNR     Signal-to-Noise Ratio
SOM     Self Organizing Map
SPOT    Satellite Pour l’Observation de la Terre
SQNR    Signal-to-Quantization Noise Ratio
SSE     Sum of Squares Error
SVM     Support Vector Machine
TM      Thematic Mapper
UN      United Nations
USDA    United States Department of Agriculture
VCC     Vegetative Cover Conversion
VI      Vegetation Index
CONTENTS
CHAPTER 1 - INTRODUCTION
1.1 Problem statement
1.2 Objective of this thesis and proposed solution
1.3 Outline of Thesis
CHAPTER 2 - REMOTE SENSING USED FOR LAND COVER CHANGE DETECTION
2.1 Overview
2.2 Spontaneous Settlements
    2.2.1 Limpopo province
    2.2.2 Gauteng province
2.3 Overview of Remote Sensing
2.4 Electromagnetic radiation
2.5 Earth’s Energy Budget
    2.5.1 Interaction with the atmosphere
    2.5.2 Interaction with the Earth’s surface
    2.5.3 Interaction with a satellite-based sensor
2.6 MODerate resolution Imaging Spectroradiometer
2.7 Vegetation Indices
    2.7.1 Normalised Difference Vegetation Index
    2.7.2 Enhanced Vegetation Index
2.8 Land cover change detection methods
    2.8.1 Hyper-temporal change detection methods
    2.8.2 MODIS land cover change detection product
2.9 Summary
CHAPTER 3 - SUPERVISED CLASSIFICATION
3.1 Overview
3.2 Classification
3.3 Supervised Classification
    3.3.1 Mapping of input vectors
    3.3.2 Converting to feature vectors
3.4 Artificial Neural Networks
    3.4.1 Network architecture
    3.4.2 Regression using a multilayer perceptron
    3.4.3 Classification using a multilayer perceptron
    3.4.4 Training of neural networks
    3.4.5 First order training algorithms
    3.4.6 Second order training algorithms
3.5 Other variants of Artificial Neural Networks used for Classification
    3.5.1 Radial basis function network
    3.5.2 Self organising map
    3.5.3 Hopfield networks
    3.5.4 Support vector machine
3.6 Design consideration: Supervised classification
3.7 Summary
CHAPTER 4 - UNSUPERVISED CLASSIFICATION
4.1 Overview
4.2 Clustering
    4.2.1 Mapping of vectors to clusters
    4.2.2 Creating meaningful clusters
    4.2.3 Challenges of clustering
4.3 Similarity metric
4.4 Hierarchical clustering algorithms
    4.4.1 Linkage criteria
    4.4.2 Cophenetic correlation coefficient
4.5 Partitional clustering algorithms
    4.5.1 K-means algorithm
    4.5.2 Expectation-maximisation algorithm
4.6 Determining the number of clusters
4.7 Classification of cluster labels
4.8 Summary
CHAPTER 5 - FEATURE EXTRACTION
5.1 Overview
5.2 Time series representation
5.3 State-space representation
5.4 Kalman filter
5.5 Extended Kalman filter
5.6 Least squares model fitting
5.7 M-estimate model fitting
5.8 Fourier Transform
5.9 Summary
CHAPTER 6 - SEASONAL FOURIER FEATURES
6.1 Overview
6.2 Time series analysis
6.3 Meaningless analysis
6.4 Meaningful clustering
6.5 Change detection method using the seasonal Fourier features
6.6 Summary
CHAPTER 7 - EXTENDED KALMAN FILTER FEATURES
7.1 Overview
7.2 Change detection method: Extended Kalman Filter
    7.2.1 Introduction
    7.2.2 The method
    7.2.3 Importance of the initial parameters
    7.2.4 Bias-Variance Equilibrium Point
    7.2.5 Bias-Variance Search algorithm
7.3 Autocovariance Least Squares method
7.4 Summary
CHAPTER 8 - RESULTS
8.1 Overview
8.2 Ground truth data set
    8.2.1 MODIS time series data set
    8.2.2 Manual inspection of study areas
    8.2.3 Google™ Earth used for visual inspection
    8.2.4 Simulated land cover data set
8.3 System outline
8.4 Experimental Plan
8.5 Parameter Exploration
    8.5.1 Optimising the multilayer perceptron
    8.5.2 Batch mode versus iterative retrained mode
    8.5.3 Optimising least squares
    8.5.4 BVEP versus autocovariance least squares
    8.5.5 Optimisation of Kalman filter parameters
    8.5.6 BVSA parameter evaluation
    8.5.7 Determining the number of clusters
    8.5.8 Results: Cophenetic correlation coefficient
8.6 Classification
    8.6.1 Classification accuracy: Multilayer perceptron
    8.6.2 Clustering experimental setup
    8.6.3 Clustering accuracy: Single, Average and Complete linkage criterion
    8.6.4 Clustering accuracy: Ward clustering method
    8.6.5 Clustering accuracy: K-means clustering
    8.6.6 Clustering accuracy: Expectation-Maximisation
    8.6.7 Summary of classification results
8.7 Change detection
    8.7.1 Simulated land cover change detection
    8.7.2 Real land cover change detection
    8.7.3 Effective change detection delay
    8.7.4 Summary of change detection results
8.8 Change detection algorithm comparison
8.9 Provincial experiments
8.10 Computational complexity
8.11 Summary
CHAPTER 9 - CONCLUSION
9.1 Concluding remarks
9.2 Future Recommendations
REFERENCES
APPENDIX A - PUBLICATIONS EMANATING FROM THIS THESIS AND RELATED WORK
A.1 Papers that appeared in Thomson Institute for Scientific Information journals
A.2 Papers published in Refereed Accredited Conference Proceedings
A.3 Invited conference papers in Refereed Accredited Conference Proceedings
A.4 Papers submitted to Refereed Accredited Conference Proceedings
A.5 Best paper award
CHAPTER ONE
INTRODUCTION
1.1 PROBLEM STATEMENT
Reliable monitoring of land cover and its transformation is an important component of environmental
and natural resources management. Land cover is defined as the physical composition of material on
the surface of the Earth, while land use is a description of how the land is used for socio-economic
reasons [1]. Land cover is distinctly different from land use, but these two terms will be used
interchangeably, as the focus of this thesis is the detection of land cover transformation of natural
vegetation to newly formed human settlements. Several studies have investigated the global effects of
anthropogenic activities on the planet, and it is estimated that more than a third of the Earth’s land
surface has been transformed by human activities [2]. The increase in human population is one of the
major drivers of settlement expansion within geographical areas, which further increases the utilisation
of the remaining natural resources [3]. Geographic information on land use and land cover change is
highly sought after at local and global scales.
Land cover change often indicates land use change with major socio-economic impacts, while
the transformation of vegetation cover (e.g. deforestation, agricultural expansion, urbanisation) has
significant impacts on hydrology, ecosystems and climate [4, 5].
All these changes affect the
environment and have a detrimental impact on the habitat of the human race. This raises the question
of whether humanity’s demand for natural resources is sustainable.
Sustainability is the long-term maintenance plan that will ensure the future of mankind’s
endeavours. The most widely quoted definition of sustainability and sustainable development was
stated by the Brundtland Commission of the United Nations (UN) on March 20, 1987 as [6]:
Sustainable development is development that meets the needs of the present without
compromising the ability of future generations to meet their own needs.
The well-being of the environment is one of the major factors that contribute to sustainability. The
UN General Assembly’s discussion on sustainable human settlements concluded that countries’ local
governments need to plan, implement, develop, and manage human settlements [7]. It was further
stated that the local government needs to manage existing settlements and prevent the establishment of
any new unplanned settlements. The ability to determine where new settlements are formed creates
opportunities for the local government to provide running water supplies and sewage and refuse removal
services, which ties in directly with the UN’s Millennium Development Goals. The UN proposes
a systematic development of sustainable cities for newly formed settlements. The South African
government incorporated this vision into its local policies by focusing on service delivery to these
newly formed settlements. Human settlement expansion is currently the most pervasive form of land
cover change in South Africa [8]. Most of the new settlements are informal, unplanned and are usually
built without the legal consent of the land owner [9, 10]. This thesis focuses on the detection of new
human settlements formed in South Africa.
Satellite-based remote sensing is widely recognised by agencies, such as the United States
Department of Agriculture (USDA)’s Farm Service Agency (FSA), the USDA’s National Agricultural
Statistics Services (NASS), and USDA’s Foreign Agricultural Services (FAS), as a cost-effective
method of acquiring information on the Earth’s land surface [11]. Monitoring environmental dynamics,
and classifying and detecting land cover change, require this type of cost-effective, systematic
observation. Remote sensing science has thus progressed rapidly to meet the need to monitor
global environmental change activities [12, 13]. Visually inspecting large volumes of high spatial
resolution images for monitoring land cover is time-consuming and resource-intensive [14].
Earth observation satellites with wide swath widths provide the means of monitoring large areas
on a frequent basis (high temporal resolution) [15]. These satellites are equipped with multiple coarse
to medium spatial resolution sensors that record land surface information in different spectral bands on
a daily basis. Land cover surveillance of large geographical areas is augmented by the information
inherent in the hyper-temporal satellite images, and therefore the analysis of these long-term data sets
has attracted much attention [16, 17]. Owing to the complexity and non-parametric nature of land
cover classification and change detection, machine learning methods are widely regarded as the most
viable option for classification and change detection [14, 18]. The use of machine learning methods
enables digital change detection, which encompasses the quantification of temporal phenomena from
multi-date imagery that is most commonly acquired by satellite-based multi-spectral sensors [19].
Two types of land cover changes are usually investigated [20]: land cover modification and land
cover transformation. Land cover modification is caused by internal changes within a particular land
cover class. These changes affect the current state of the land cover class, but do not change the land
cover class, e.g. seasonal variation of natural vegetation. Land cover transformation of a particular
geographical area involves change from one land cover class to another. This thesis focuses on land
cover transformation of natural vegetation to newly formed human settlements, although the methods
are applicable to other forms of land cover transformation. In the rest of this thesis the terms land cover
transformation and land cover change are used interchangeably.
Change detection studies usually rely on image differencing, post-classification comparison
methods, and change trajectory analysis [20–26], and the data are mostly treated as hyper-dimensional,
but not necessarily as hyper-temporal. These methods therefore do not fully capitalise on the high
temporal sampling rate which captures the dynamics of different land cover types. Satellites with high
temporal acquisition rates provide information on the seasonal dynamics of a particular land cover
type [15]. Incorporating the temporal information into a change detection algorithm allows the method
to distinguish between land cover conversion and natural seasonal variations.
Main problem statement: To detect land cover conversion of natural vegetation to newly formed
human settlements reliably. The land cover change detection algorithm should incorporate
temporal information to distinguish the change from seasonal variations. The land cover change
detection algorithm should also be able to detect new human settlements that only span a small
geographical area using coarse spatial resolution satellite imagery.
1.2 OBJECTIVE OF THIS THESIS AND PROPOSED SOLUTION
Primary objective: Develop a change detection algorithm that operates on multiple spectral bands
and exploits the richness of information inherent in hyper-temporal images.
Secondary objective: Develop a change detection algorithm that is near-automated,
requiring minimal human interaction.
As stated previously, machine learning methods are the more viable solution when analysing
high-dimensional data sets. A post-classification change detection approach detects change by classifying a
geographical area into different classes over time. Land cover change is defined here as the transition
in class label of a pixel’s time series from one class to another class, after which it remains in the newly
assigned class for the remainder of the time series [20]. A flow diagram for the proposed solution is
shown in figure 1.1.
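The definition of land cover change given above can be illustrated with a minimal Python sketch. It assumes that a classifier has already assigned one class label per time step to a pixel's time series. The function name `find_persistent_change` and the `min_run` parameter (the minimum number of consecutive labels in the new class) are hypothetical and introduced only for illustration; they are not taken from this thesis.

```python
from typing import Optional, Sequence


def find_persistent_change(labels: Sequence[int], min_run: int = 2) -> Optional[int]:
    """Return the index at which a pixel settles into a new class, or None.

    A change is declared when the series starts in one class and ends in a
    different class that persists until the end of the series.
    `min_run` is an illustrative assumption, not a parameter from the thesis.
    """
    if len(labels) < 2:
        return None
    final_class = labels[-1]
    # Walk backwards to find where the final run of `final_class` begins.
    start = len(labels) - 1
    while start > 0 and labels[start - 1] == final_class:
        start -= 1
    run_length = len(labels) - start
    # No change if the series never left its class, if it returned to its
    # starting class, or if the final run is too short to be trusted.
    if start == 0 or labels[0] == final_class or run_length < min_run:
        return None
    return start


if __name__ == "__main__":
    # 0 = natural vegetation, 1 = settlement (labels are illustrative).
    print(find_persistent_change([0, 0, 0, 1, 1, 1]))
    print(find_persistent_change([0, 1, 0, 0, 0]))
```

In this sketch a brief excursion to another class followed by a return to the original class is not reported as change, which matches the requirement that the pixel remains in the newly assigned class for the remainder of the time series.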
A set of images of a particular geographical area is obtained. The interval between two consecutive
images must be short, which implies hyper-temporal acquisitions. The hyper-temporal images in
this thesis were acquired by the MODerate resolution Imaging Spectroradiometer (MODIS) sensor
on board the Terra and Aqua satellites and are freely available. The MCD43A4 product provides
hyper-temporal, multi-spectral (7 spectral bands), medium spatial resolution (500 metre) land surface
reflectance data. The Bidirectional Reflectance Distribution Function (BRDF) correction models all
the pixels in an image to a nadir view, which significantly reduces the anisotropic scattering effects of
surfaces under different illumination and observation conditions [27, 28]. Time series of reflectance
values were extracted for each spectral band over a particular geographical area (500 metre by 500
metre) from the multi-spectral hyper-temporal MODIS data set (February 2000 – January 2008).
[Figure 1.1 blocks: Obtain multiple images of a particular geographical area; Time series extraction;
SPOT image; Visual inspection; Create no-change time series data set; Create simulated change time
series data set; Create real change time series data set; Feature extraction methods; Machine learning
methods; Change detection]
FIGURE 1.1: A flow diagram which depicts the steps followed to realise the proposed solution.
Since the hyper-temporal images have coarse to medium spatial resolution, high spatial resolution
satellite data are required for ground truth. Satellite Pour l’Observation de la Terre (SPOT) images
are high spatial resolution images, which are analysed by operators to identify areas that experienced
land cover change or no land cover change.
Land cover change is a rare event on a regional scale and vital information, such as the date of
change and rate of change, is usually not known. Therefore land cover change was simulated to enable
a detailed assessment of change detection methods, which could not be performed on the real land
cover change data set. A simulated land cover change time series data set is created by blending time
series of two different land cover classes which did not change. The simulated land cover change data
are used to test the functionality of the change detection methods, after which tests are performed on
real examples of land cover change mapped using high spatial resolution images. Several contributions
are made in this thesis that provide solutions to the primary and secondary objectives.
Contribution 1: Development of a novel land cover change detection method. The method is a
post-classification approach that operates on the Seasonal Fourier Features (SFF). SFF are
hyper-temporal features extracted from time series.
The SFF are extracted without experiencing the usual pitfalls encountered
with subsequence clustering [29]. The use of the SFF is then compared to another method proposed by
Kleynhans et al. [30], referred to as the Extended Kalman Filter (EKF) feature extraction method. The
drawback with this method is that it requires an offline optimisation phase, which must be performed
by an operator. This does not satisfy the secondary objective (full automation) of this thesis, but has
shown promising results.
Contribution 2: Extend the EKF feature extraction method to higher dimensions to improve change
detection capabilities.
The secondary objective, concerned with full automation of the EKF feature extraction method, is addressed in the
following contribution.
Contribution 3: Propose a novel criterion that is referred to as the Bias-Variance Equilibrium Point
(BVEP). The BVEP is the point where the tracking of the reflectance values within time series
is improved and the internal stability of the EKF is optimised. Define a Bias-Variance Score
(BVS) that will measure the current system in relation to the BVEP.
The BVEP criterion also provides statistical information on the seasonal vegetation activity cycle,
which could provide vital insight into environmental dynamics [31, 32]. The optimisation of the BVS
requires an unsupervised search method, which adjusts the variables to satisfy the BVEP criterion.
Contribution 4: Design a new search algorithm, referred to as the Bias-Variance Search Algorithm
(BVSA), that can effectively optimise the BVS to the BVEP criterion for optimal EKF
performance.
1.3 OUTLINE OF THESIS
The outline of the thesis is as follows:
• Chapter 2 gives a brief overview of the study area and an introduction to remote sensing
principles. The chapter discusses several trade-offs that should be considered when selecting
a sensor to solve the problem statement. The chapter concludes with an overview of some of the
most common change detection methods found in the literature.
• Chapter 3 gives an introduction to supervised classification and in particular the Multilayer
Perceptron (MLP). The chapter further discusses the pursuit of acceptable performance, and
concludes with an overview of design considerations for a supervised classifier.
• Chapter 4 gives an introduction to unsupervised classification and provides several motivations
for using an unsupervised classifier. The chapter also covers the disadvantages of unsupervised
clustering and methods to mitigate them with proper cluster design.
• Chapter 5 defines four different feature extraction methods and their application to time series.
These features are expected to provide good separation between natural vegetation and human
settlement signals.
• Chapter 6 introduces the novel SFF and provides an in-depth investigation of the limitation of
time series analysis mentioned by Keogh and Lin [29]. The chapter concludes with evidence of
how the SFF provide a solution to this limitation.
• Chapter 7 introduces the BVEP, BVS, and Bias-Variance Search Algorithm (BVSA) used to
optimise the EKF, in order to improve the quality of the extracted features.
• Chapter 8 presents the results of all experiments conducted in the thesis. These experiments
report on classification accuracies and change detection accuracies. They are first
conducted on a labelled data set within a particular province, and then expanded to run on the
complete Gauteng and Limpopo provinces of South Africa.
• Chapter 9 gives concluding remarks, as well as suggesting possible future research that could
expand on the concepts introduced in this thesis.
CHAPTER TWO
REMOTE SENSING USED FOR LAND COVER CHANGE DETECTION
2.1 OVERVIEW
Remote sensing is the acquisition of information about an object without any direct contact with the
object [33, Ch. 1]. Sensors are usually used to measure the electromagnetic radiation reflected from an object,
which is then analysed for specific applications. A satellite-based sensor measures the reflected
electromagnetic radiation of the Earth’s surface and these measurements are then used to infer changes
in surface reflectances caused by either environmental dynamics or anthropogenic activities.
Many international organisations and national governments have identified remote sensing as a
beneficial field of study, and have made major joint investments in building better Earth observation
systems. The objective of this chapter is to give the reader insight on how satellite-based sensors can
be used to detect the formation of new human settlements on the Earth’s surface.
2.2 SPONTANEOUS SETTLEMENTS
The standard of living in a country usually improves when sustainable economic growth is maintained.
The government pursues a variety of projects to control the quality of economic growth [34].
Economic growth in developing countries is usually constrained by the lack of skilled labour,
availability of resources, and necessary equipment. This lack of progress is aggravated by the pressure
of a rapid growth in population and a backlog in housing development projects [9].
This backlog creates a shortage in the supply of affordable houses to the public, which results in
the construction of temporary dwellings. These temporary dwellings are usually built without the legal
consent of the land owner. The construction of temporary dwellings is not region-specific and has
become a global phenomenon, although different characteristics are observed in the development of
these dwellings in each region [35]. A cluster of such temporary dwellings is formally known as a
spontaneous settlement [9], which is a form of informal settlement [36, 37].
Social, economic, and political processes drive the migration of communities to certain regions,
which often results in the development of informal settlements. This motivates the need for the
local government to progressively track settlement expansion and migration [38, 39]. Settlement
expansion is currently mapped on an irregular, ad hoc basis at great financial cost, using expensive
visual interpretation of aerial photographs or satellite images. Regional information on settlement
expansion gives the government the ability to plan the provision of services such as water, sanitation
and electricity to these new or growing communities.
The behaviour of urban settlement migration and expansion has been empirically studied and
predicted in various studies, but for several reasons cannot be applied to spontaneous settlements [9].
In this thesis no prior assumptions are made when attempting to find new or expanding settlements
other than the decrease in seasonal behaviour associated with settlements.
Another motivation for tracking these spontaneous settlements is that their formation is currently
one of the most pervasive forms of land cover change in South Africa [40]. The transformation
of natural vegetation by practices such as deforestation, agricultural expansion and urbanisation has
significant impacts on hydrology, ecosystems and the climate [4, 5, 41]. The areas of interest in this
thesis are the Limpopo and Gauteng provinces of South Africa.
2.2.1 Limpopo province
The Limpopo province is situated in the northern part of South Africa (Figure 2.1). The name of
the province was derived from the river that separates South Africa from its neighbouring countries,
Zimbabwe and Botswana. The province shares its southern borders with the Mpumalanga, Gauteng
and North-West provinces.
The province is largely covered by natural vegetation, which is used for grazing by cattle and
wildlife. It houses the largest hunting industry in South Africa. The province also has numerous
tea and coffee plantations. Cultivated areas support a range of crops, including
sunflowers, cotton, maize, peanuts, bananas, litchis, pineapples and mangoes.
The government departments within the province cannot currently capture and process all the
necessary data on the different land cover types and anthropogenic activities throughout the province.
This constraint is brought about by a limited budget, which motivates the pursuit of a less expensive
alternative. Remote sensing (section 2.3) has been adopted by several governments as a less expensive
option to augment the current processes of gathering information. If the government had access to
more complete information, it could assist in the development of a management system to control and
monitor resources for the people throughout the province.
FIGURE 2.1: The Limpopo province is located in the northern part of South Africa. (Map of the provinces of South Africa: Limpopo, Mpumalanga, North-West, Gauteng, Free State, KwaZulu-Natal, Northern Cape, Eastern Cape and Western Cape.)
2.2.2 Gauteng province
The Gauteng province is situated in the highveld of South Africa (Figure 2.2). The name Gauteng
comes from the Sesotho (indigenous language) word meaning place of gold. This is a common
reference to the gold discovered in the city of Johannesburg in 1886. The province shares its borders
with the Limpopo, Mpumalanga, North-West, and Free State provinces.
Gauteng is a landlocked province in the highveld, which is a high-altitude grassland. The province
is the most urbanised one in the country. The province houses 20% of the country’s population and
only covers 1.4% of the country’s total land area. A total population growth of over 30% was recorded
between the years 2001 and 2010. Even though small in size, the province contributes 33.9% of South
Africa’s gross domestic product (GDP), which equates to 10% of the GDP of the entire African continent.
In May 2008, the South African government identified problems caused by the massive influx of
foreign nationals and provincial migration towards the Gauteng province. These problems range from
social integration of multiple different cultures to proper service delivery. The active migration is
motivated by a high median annual income for working adults and diverse employment opportunities.
The province is rapidly growing to house cities that will be among the largest in the world. A projected
population of 15 million people is expected by the year 2015.
FIGURE 2.2: The Gauteng province is located in the highveld of South Africa. (Map of the provinces of South Africa: Limpopo, Mpumalanga, North-West, Gauteng, Free State, KwaZulu-Natal, Northern Cape, Eastern Cape and Western Cape.)
2.3 OVERVIEW OF REMOTE SENSING
The Earth’s surface is continually undergoing transformation caused by environmental change and
anthropogenic activities. Many environmental problems stem from this continual transformation,
including water shortage, soil degradation, greenhouse gas emissions, deforestation and biodiversity
loss [33, Ch. 1].
Evaluating these environmental dynamics requires periodic observation for analysis.
Remote sensing is formally defined as the analysis of remotely acquired information on a particular
object. This is usually accomplished using a sensor that is not in direct contact with the object [42,
Ch. 1].
Earth observation satellites are non-military reconnaissance satellites that are used by the remote
sensing community to acquire periodic observations of the Earth. These satellites use sensors to
capture electromagnetic radiation which is reflected from or emitted by the Earth. The first Earth
observation satellite that was developed was the Earth Resources Technology Satellite (ERTS-1), which
was later renamed Landsat 1. It was designed to acquire multi-spectral medium resolution imagery on a
systematic and recurring basis [43, Ch. 1].
Numerous additional remote sensing systems were commissioned and deployed through various
agencies around the world after the success of the ERTS-1 mission. The Group on Earth Observations
(GEO) was created in February 2005 to unite 60 national governments and 40 international
organisations to implement the Global Earth Observation System of Systems (GEOSS). The main
objective is to create high-quality, long-term, global observations in a timely fashion at minimal cost.
The GEOSS system will ultimately monitor all aspects of the Earth’s system to study global change.
A host of nations have launched hundreds of satellites into orbit since 1957, which has created a
range of specifications that must be considered when choosing a sensor on a satellite for a specific
application [43, Ch. 2]. These specifications include passive versus active sensing,
the range of the electromagnetic spectrum sensed, the spectral bandwidth of each sensor, the temporal acquisition
rate, the spatial resolution and the radiometric resolution. They are discussed in the following
sections along with the interaction of various components within a remote sensing system.
2.4 ELECTROMAGNETIC RADIATION
Electromagnetic radiation is a disturbance produced by the oscillation or acceleration of an electric
charge. This disturbance consists of electromagnetic waves that comprise electric and magnetic fields
which oscillate perpendicular to one another with a set of temporal and spatial properties.
The electromagnetic wave propagates through a medium in successive cycles, and the distance
covered by one complete cycle is called the wavelength. The energy density of the wave is defined by the
amplitude. All electromagnetic waves obey the same wave theory and travel at the speed of light
in a vacuum.
The electromagnetic wave acts according to its wavelength when it comes into contact with an
object and can either reflect, refract, diffract or interfere. Electromagnetic radiation is classified into
several categories according to wavelength: long waves, radio waves, microwaves, infrared, visible,
ultraviolet, X-rays and Gamma rays. The categorised wavelengths are shown in figure 2.3.
One of the major sensor specifications on board a satellite is the deployment of either an active or
passive sensor. An active sensor illuminates a scene with its own source of electromagnetic radiation.
The source is set to a range of wavelengths of interest, which is typically in the 2.4 cm–107 cm range.
FIGURE 2.3: The electromagnetic spectrum [42, Ch. 1]. (Categories shown from long to short wavelength: long waves, radio waves, microwaves, infrared, visible, ultraviolet, X-rays and gamma rays, with boundaries marked at 1000 m, 10 m, 30 cm, 1 mm, 750 nm, 400 nm, 10 nm, 10 pm and 1 fm.)
A passive sensor relies on the sun’s radiation to illuminate a scene. A passive sensor is also called
an optical sensor, as it operates in the visible and infrared spectrum. The visible spectrum is the most
popular range in the electromagnetic spectrum, as it can be sensed by biological organisms.
The properties of the sun’s radiance are of importance for a passive sensor, as the sun produces a wide
range of wavelengths with a non-uniform energy distribution. Planck’s law states that the spectral
radiance is a function of the object’s temperature and the wavelength of the electromagnetic radiation [44].
The sun’s peak emission is in the 400 nm–750 nm spectrum range, which is referred to as the visible
spectrum. The spectral distribution across the spectrum remains relatively unchanged as the radiation propagates
through space [43, Ch. 2], but its intensity decreases according to the inverse-square law of the
distance between the sun and the Earth [44].
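The role of Planck's law and the inverse-square law can be illustrated with a minimal sketch. It assumes that the sun behaves as an ideal black body with an effective temperature of 5778 K (an assumed, commonly quoted value that is not taken from this thesis); the function names are hypothetical and chosen for illustration only.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light in a vacuum, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def planck_radiance(wavelength_m, temperature_k):
    """Black-body spectral radiance (W sr^-1 m^-3) from Planck's law."""
    a = 2.0 * H * C ** 2 / wavelength_m ** 5
    b = math.expm1(H * C / (wavelength_m * K_B * temperature_k))
    return a / b


def peak_wavelength_nm(temperature_k, lo_nm=100, hi_nm=3000):
    """Wavelength on a 1 nm grid at which the radiance is largest."""
    return max(range(lo_nm, hi_nm + 1),
               key=lambda nm: planck_radiance(nm * 1e-9, temperature_k))


def inverse_square(intensity_at_r0, r0, r):
    """Intensity at distance r, given the intensity at distance r0."""
    return intensity_at_r0 * (r0 / r) ** 2


SUN_TEMPERATURE_K = 5778.0  # assumption: effective black-body temperature
peak = peak_wavelength_nm(SUN_TEMPERATURE_K)
print("peak emission wavelength (nm):", peak)
print("relative intensity at twice the distance:", inverse_square(1.0, 1.0, 2.0))
```

Under these assumptions the peak falls inside the 400 nm–750 nm visible range, and doubling the distance from the source reduces the intensity to a quarter.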
2.5 EARTH’S ENERGY BUDGET
The Earth receives incoming energy from the sun and stars, while losing energy through
absorption, reflectance and transmittance [45, 46]. The law of conservation of energy requires that an
equilibrium between the incoming and outgoing energy be preserved. This equilibrium is a
function of the wavelength λ and is expressed as
\[ E_I(\lambda) = E_R(\lambda) + E_A(\lambda) + E_T(\lambda), \tag{2.1} \]
where E_I(λ) denotes the incoming energy, E_R(λ) denotes the reflected energy, E_A(λ) denotes the
absorbed energy and E_T(λ) denotes the transmitted energy. The total flux of the incoming energy
E_I(λ) is a combination of solar radiation, geothermal energy, tidal energy (moon gravity) and heat
energy (fossil fuel consumption). The outgoing energy is partitioned into reflected, absorbed
or transmitted radiation, and this partitioning varies for different wavelengths, atmospheric conditions and geographical
properties [42, Ch. 1].
A sensor on board a satellite measures only the reflected energy E_R; to put the emphasis on the
reflected energy, equation (2.1) is rewritten as
\[ E_R(\lambda) = E_I(\lambda) - E_A(\lambda) - E_T(\lambda). \tag{2.2} \]
Approximately 30% of all incoming energy is reflected back into space. The contributions made
to the reflected energy by geothermal energy, tidal energy and heat energy are negligibly small when
compared to the reflected solar radiation [42, Ch. 1]. The average reflectance of 30% of the incoming
energy E_I(λ) is further subdivided: atmospheric reflectance of 6%, cloud reflectance of 20% and the
Earth’s surface reflectance of 4% [47–49]. A brief overview is given of all the interacting media within
the energy budget in the following sections.
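The energy balance of equation (2.2) and the quoted subdivision of the reflected energy can be checked with a short sketch. The function name and the example energy values are hypothetical; only the three percentages are taken from the text.

```python
def reflected_energy(incoming, absorbed, transmitted):
    """Equation (2.2): E_R = E_I - E_A - E_T at a single wavelength."""
    reflected = incoming - absorbed - transmitted
    if reflected < 0:
        raise ValueError("absorbed + transmitted exceeds the incoming energy")
    return reflected


# Shares of the incoming energy that are reflected, as quoted in the text (%)
reflectance_shares = {"atmosphere": 6, "clouds": 20, "surface": 4}
total_reflected = sum(reflectance_shares.values())

print("total reflected share (%):", total_reflected)
# Illustrative values only: 100 units in, 50 absorbed, 20 transmitted
print("reflected energy:", reflected_energy(100, 50, 20))
```

The three shares add up to the 30% average reflectance stated above.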
2.5.1 Interaction with the atmosphere
Electromagnetic radiation penetrates the atmosphere, which consists of five layers of gases that are
retained by the planet’s gravitational field [50]. Power and spectral properties of electromagnetic
radiation are altered as they propagate through the atmosphere. The atmosphere can either scatter
or absorb electromagnetic radiation. The five layers of the atmosphere are the exosphere, thermosphere,
mesosphere, stratosphere and troposphere.
The exosphere is the outer layer of the atmosphere. It is a very thin layer where the atoms and
molecules leave the atmosphere and dissipate into outer space.
The thermosphere is the second layer that electromagnetic radiation penetrates and this is where
most of the Earth Observation satellites orbit. The thermosphere extends between 90 km and 1000 km
above sea level. The temperature in the layer is strongly affected by solar activities.
The mesosphere is the middle layer of the atmosphere and extends from 50 km to 90 km above
sea level. The majority of the meteors originating from outer space burn up in this layer. It is difficult
to measure the properties of the mesosphere, as only sounding rockets can be used at these altitudes.
The stratosphere is the second closest layer to the Earth’s surface and is positioned at an altitude
of between 8 km and 50 km. The ozone layer is situated within the stratosphere and absorbs most of
the harmful solar radiation. An aircraft can fly through the stratosphere because of the temperature
stratification within the layer.
The troposphere is the closest layer to the surface of the Earth and rises up to 20 km above sea level.
Most weather activities occur within this layer, which holds nearly all water vapour and dust particles.
Solar electromagnetic radiation heats up the surface of the Earth, and this heat is in turn transferred back to the
troposphere.
The atmosphere alters the intensity and spectral composition of electromagnetic radiation before it
is sensed by a sensor on board a satellite. These effects are mainly categorised into either atmospheric
scattering or absorption [42, 43].
2.5.1.1 Atmospheric scattering
The principal mechanisms affecting electromagnetic radiation as it propagates through the atmosphere
are the scattering and absorption effects. Atmospheric scattering occurs when solar radiation is
randomly diffused within the atmosphere. The behaviour of atmospheric scattering is determined
by analysing the ratio of the particle’s diameter to the wavelength of the electromagnetic wave.
Atmospheric scattering is classified into three general categories [42, 43]:
• Rayleigh scattering is the most common scattering effect in the atmosphere. This scattering
occurs when a particle’s diameter is much smaller than the wavelength of the interacting electromagnetic
wave. The intensity of Rayleigh scattering is inversely proportional to the fourth power of the wavelength.
This means that shorter wavelengths are more prone to scatter in the atmosphere than longer
wavelengths.
• Mie scattering occurs when a particle’s diameter is approximately equal to the electromagnetic wave’s
wavelength. The major causes of Mie scattering are pollen, dust, smoke, water vapour, and
other particles situated in the lower portion of the atmosphere.
• Non-selective scattering occurs when an atmospheric particle’s diameter is much larger than
the wavelength of the radiation. Non-selective scattering mostly affects the visible, near infrared and
mid-infrared spectrums. In this case, all the wavelengths are scattered equally regardless of their
length. Non-selective scattering is found in water droplets, which give clouds and fog a white
appearance.
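The three scattering categories and the fourth-power dependence of Rayleigh scattering can be sketched as follows. The ratio thresholds of 0.1 and 10 that separate the regimes are illustrative assumptions and are not taken from the cited sources; the function names are hypothetical.

```python
def rayleigh_relative_scattering(wavelength_nm, reference_nm):
    """Rayleigh scattering strength relative to a reference wavelength.

    Uses the inverse fourth-power dependence on wavelength.
    """
    return (reference_nm / wavelength_nm) ** 4


def scattering_regime(particle_diameter, wavelength, small=0.1, large=10.0):
    """Classify scattering from the diameter-to-wavelength ratio.

    The thresholds `small` and `large` are illustrative assumptions.
    """
    ratio = particle_diameter / wavelength
    if ratio < small:
        return "Rayleigh"
    if ratio > large:
        return "non-selective"
    return "Mie"


# Blue light (450 nm) compared with red light (700 nm)
blue_vs_red = rayleigh_relative_scattering(450.0, 700.0)
print("blue is scattered about %.1f times more strongly than red" % blue_vs_red)

print(scattering_regime(1e-9, 500e-9))    # gas molecule scale
print(scattering_regime(500e-9, 500e-9))  # smoke or dust scale
print(scattering_regime(20e-6, 500e-9))   # cloud droplet scale
```

The first call shows why shorter wavelengths are more prone to scatter: the ratio of the two wavelengths is raised to the fourth power.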
2.5.1.2 Atmospheric absorption
Atmospheric absorption is caused by gaseous components that retain electromagnetic radiation within
the atmosphere. Atmospheric absorption allows different wavelengths to be absorbed in different parts
of the atmosphere. This absorption rate into different layers is illustrated in figure 2.4. The gases that
absorb most solar radiation are: water vapour, carbon dioxide, and ozone [42, 43].
Earth observation satellites are limited, as they can only acquire images from wavelengths that are
not absorbed into the atmosphere. The range of wavelengths that is not absorbed into the atmosphere
is commonly referred to as the atmospheric window [42, Ch. 1]. A spectral sensor is usually set to
measure a narrow band of spectrum within the atmospheric window.
FIGURE 2.4: Atmospheric absorption allows different wavelengths to be absorbed in different parts
of the atmosphere. This figure shows the different elevations at which electromagnetic radiation is
absorbed into the atmosphere. Image supplied by NASA/CXC/SAO.
2.5.1.3 Atmospheric correction
The electromagnetic radiation recorded at a sensor is not a true reflection of the Earth’s surface owing to
the effects of atmospheric scattering and absorption. A critical preprocessing step for creating oceanic
and land surface products is the correction of these atmospheric disturbances [51, 52].
Two general methods are used in correcting atmospheric disturbances: relative and absolute
correction. Relative atmospheric correction is, as the term implies, a relative histogram match
of an image to a reference image. This method requires an accurate reference image for a specified
geographical area and any adjoining areas.
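A relative histogram match can be sketched with quantile mapping. This is a minimal illustration that treats an image as a flat list of pixel values (an assumption made for brevity); it is not the correction procedure applied to the MODIS products used in this thesis, and the function name is hypothetical.

```python
from bisect import bisect_right


def histogram_match(image, reference):
    """Map each pixel of `image` to the reference value at the same quantile."""
    sorted_image = sorted(image)
    sorted_reference = sorted(reference)
    n, m = len(sorted_image), len(sorted_reference)
    matched = []
    for value in image:
        # Empirical cumulative distribution of the image at this value
        quantile = bisect_right(sorted_image, value) / n
        # Index of the reference sample that sits at the same quantile
        index = min(m - 1, max(0, int(round(quantile * m)) - 1))
        matched.append(sorted_reference[index])
    return matched


# Hypothetical pixel values: the image is darker than the reference
image = [40, 10, 30, 20]
reference = [110, 120, 130, 140]
print(histogram_match(image, reference))
```

The rank order of the pixels is preserved, while their values are drawn from the distribution of the reference image.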
Absolute atmospheric correction is further subdivided into empirical and physical methods. The
absolute empirical method is not popular, as it has a tendency to over-simplify the corrections applied
to an image.
The absolute physical method, on the other hand, uses a mathematical model to extract the effects of
various gaseous components and then to compensate for these effects accordingly. A radiative transfer
model is a form of the absolute physical method which extracts the gaseous concentrations directly
from an image in order to estimate the corrected radiance for the image.
2.5.2 Interaction with the Earth’s surface
The Earth’s surface interacts with incoming electromagnetic radiation and can absorb, reflect
and/or transmit the radiation. The reflected electromagnetic radiation excites the components within
the sensor. The amount of reflected electromagnetic radiation is a function of the wavelength and
the properties of the surface. The surface has several properties that affect the amount of reflectance:
mineral profile, surface contour, surface roughness, etc. Reflected electromagnetic waves are mostly
affected by the surface’s roughness and are divided into two general modes: specular (smooth) and
diffuse (rough or Lambertian) [33, Ch. 4].
The Rayleigh criterion determines the level of roughness for a medium and is calculated as
\[ h \leq \frac{\lambda}{8\cos(\theta)}. \tag{2.3} \]
The variable h denotes the surface irregularity height, λ denotes the wavelength and θ denotes the angle
of incidence measured from the surface normal. If equation (2.3) is satisfied, then the surface is considered to be
specular (smooth), otherwise it is diffuse (rough) [42, 43].
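The inequality in equation (2.3) can be evaluated with a short sketch. The function names and the example wavelength of 800 nm are hypothetical; the code only computes the right-hand side of the criterion and tests whether a given irregularity height satisfies it.

```python
import math


def rayleigh_threshold(wavelength, incidence_angle_deg):
    """Right-hand side of equation (2.3): lambda / (8 cos(theta))."""
    return wavelength / (8.0 * math.cos(math.radians(incidence_angle_deg)))


def satisfies_rayleigh(height, wavelength, incidence_angle_deg):
    """True when h <= lambda / (8 cos(theta))."""
    return height <= rayleigh_threshold(wavelength, incidence_angle_deg)


wavelength = 800e-9  # assumed example: near infrared, in metres
for angle in (0.0, 30.0, 60.0):
    limit = rayleigh_threshold(wavelength, angle)
    print("theta = %4.1f deg: threshold height = %.1f nm" % (angle, limit * 1e9))

print(satisfies_rayleigh(50e-9, wavelength, 0.0))  # small irregularities
print(satisfies_rayleigh(1e-6, wavelength, 0.0))   # large irregularities
```

The threshold height grows as the angle θ increases, so the same surface can satisfy the criterion at one illumination geometry and fail it at another.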
A specular surface reflects electromagnetic radiation according to Snell’s law, which states that the
outgoing energy is reflected at an angle equal to the angle of incidence of the incoming energy.
A diffuse surface reflects the incoming electromagnetic radiation in all directions off the surface. A
Lambertian (perfectly diffuse) surface reflects the incoming energy uniformly in all directions off the
surface.
Most natural surfaces are imperfect diffuse reflectors (specular component present) in the visible
and near infrared spectrum. This makes remote sensing possible, as reflected electromagnetic radiation
can be captured at most viewing angles. This would not be possible if the surface were completely
specular, as it would have a high reflectance value at a single specific viewing angle and relatively low
reflectance values at all other viewing angles [53, 54].
2.5.3 Interaction with a satellite-based sensor
The principal concept of remote sensing is to observe an object remotely.
In a satellite-based
application it is the recording of electromagnetic radiation that has interacted with an object. A sensor,
as defined in this thesis, is a device that measures a physical quantity and converts it into an electrical
signal.
The advantage, when considering the interaction of radiation with the sensor, is that the sensor can be
designed to measure the environment optimally. The satellite sensor specifications that will be
discussed briefly are the spatial, spectral, radiometric and temporal resolutions.
Spatial resolution is the size of the geographical area that is recorded in a single two-dimensional pixel of the image.
The size of the area represented in a pixel is determined by the altitude, viewing angle and sensor
characteristics. All these characteristics are influenced by the instantaneous field of view (IFOV) of the
sensor [33, Ch. 4]. The IFOV of the sensor is time-dependent, as the satellite is not perfectly stable in
its orbit. The distance between the satellite and the Earth varies continually, altering the physical size
of the geographical area that is captured within a single pixel.
Another limiting factor is the point spread function (PSF) of the sensor. The PSF is the system
impulse response between the geographical area and the sensor. This function describes the degree of
illumination spreading from the adjacent area to the geographical area of interest. The PSF results in
a blending or spreading effect on areas with relatively bright or dark objects within the IFOV of the
sensor. This leads to high contrast features becoming discernible on satellite images even though
their widths are less than the sensor’s spatial resolution.
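The blending effect of the PSF can be illustrated in one dimension. The three-tap symmetric PSF and the scene values below are hypothetical and only serve to show how a single bright pixel spreads into its neighbours; a real sensor PSF is two-dimensional and sensor-specific.

```python
def apply_psf(scene, psf):
    """Blur a 1-D scene with a symmetric PSF (edges padded by replication)."""
    half = len(psf) // 2
    blurred = []
    for i in range(len(scene)):
        total = 0.0
        for k, weight in enumerate(psf):
            # Clamp the index so that edge pixels reuse the nearest value
            j = min(len(scene) - 1, max(0, i + k - half))
            total += weight * scene[j]
        blurred.append(total)
    return blurred


# A single bright pixel (0.9) on a dark background (0.1)
scene = [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1]
psf = [0.25, 0.5, 0.25]  # hypothetical PSF, normalised to sum to 1
observed = apply_psf(scene, psf)
print([round(v, 3) for v in observed])
```

The bright pixel loses part of its signal to the adjacent pixels, which in turn appear brighter than the background they actually contain.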
Spectral resolution is the bandwidth of the electromagnetic spectrum recorded by the sensor. A
sensor that senses a narrower range of wavelengths (smaller bandwidth) has an improved ability
to capture the spectral signature of an object within the spectral band when compared to a sensor that
measures a wider range of wavelengths (larger bandwidth).
The disadvantage of increasing the spectral resolution is that the signal-to-noise ratio (SNR)
decreases. Recorded radiance at the sensor is adversely affected by some form of noise. The
physical propagation of electromagnetic radiation to the sensor can be seen as a time-variant multi-path
propagation of the reflected electromagnetic wave of a geographical area with a certain level of additive
noise. The additive noise in the sensor is made up mostly of thermal noise, which remains the same
regardless of the range of spectrum that is being sensed. The instantaneous radiance in the sensor, however, is reduced
for a higher spectral resolution sensor, as it is exposed to a narrower range of spectrum. To summarise: reducing the
reflected power within the sensor (reducing the bandwidth) inevitably reduces the SNR.
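The trade-off between spectral resolution and SNR can be sketched numerically. The sketch assumes a flat spectral power density across the band, so that the signal power is proportional to the bandwidth, and a thermal noise power that is independent of the bandwidth; all values are in arbitrary units and the function name is hypothetical.

```python
import math


def snr_db(spectral_power_density, bandwidth_nm, thermal_noise_power):
    """SNR in dB when signal power scales with bandwidth and noise is fixed."""
    signal_power = spectral_power_density * bandwidth_nm
    return 10.0 * math.log10(signal_power / thermal_noise_power)


# Same scene and same noise, two different spectral bandwidths
wide = snr_db(1.0, 100.0, 1.0)    # 100 nm band
narrow = snr_db(1.0, 10.0, 1.0)   # 10 nm band
print("SNR of the wide band (dB):", wide)
print("SNR of the narrow band (dB):", narrow)
```

Under these assumptions a tenfold reduction in bandwidth costs 10 dB of SNR, because the noise floor does not shrink with the band.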
Optimal spectral resolution is obtained when a sensor mitigates the effect of additive noise and
has a spectral bandwidth that captures the best matched spectral signature for the intended remotely
sensed object. Remote sensing systems usually use multi-spectral or hyper-spectral sensors. These are
arrays of sensors that capture different ranges of the spectrum at the same time. A multi-spectral sensor
has fewer than 100 unique spectral bands, while a hyper-spectral sensor has more than 100.
Radiometric resolution is the accuracy of converting electromagnetic radiation at the satellite sensor
to a digital binary format. A higher radiometric resolution enables the satellite sensor to distinguish
between more levels of intensity.
It is possible to encode electromagnetic radiation as an information source at a rate that is close
to its entropy [55, Ch. 6]. This is unfortunately limited by the storage space available on the satellite,
which induces a certain level of distortion in the sampling of the electromagnetic radiation. The reason
is that electromagnetic radiation is an analogue source and requires an infinite number of binary bits to
store.
A loss in precision is caused by the finite storage space, which induces a distortion that is directly
related to the number of quantisation levels (number of binary bits per radiance sample). It should be
noted that the number associated with each quantisation level is not a direct measure of the captured
electromagnetic radiation, but rather the steps into which a range of physical values is divided.
In an effort to distribute the captured electromagnetic radiation more evenly over the range
of quantisation levels, some sensors apply either non-linear quantisation mapping functions or an
amplifier with an automatic gain control mechanism.
This alters the intensity of the captured
electromagnetic radiation and distributes it over a range of different quantisation levels without creating
a saturated buffer in the remotely sensed image.
The total number of quantisation levels and the method of distributing radiation across the
levels affect the level of distortion in the stored values. This rate of distortion is defined by the
signal-to-quantisation-noise ratio (SQNR), which is expressed as
\[
\mathrm{SQNR} = \frac{P_x}{P_{\tilde{x}}}. \tag{2.4}
\]
The variable Px̃ is the quantisation-noise power and Px is the power of the radiation before quantisation.
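The ratio in equation (2.4) can be illustrated numerically. The following is a minimal sketch, assuming a uniform quantiser over the range [0, 1) and synthetic, uniformly distributed radiance samples; the function name `sqnr`, the sample count and the chosen bit depths are illustrative assumptions and do not describe any particular sensor.

```python
import math
import random

def sqnr(samples, bits):
    """Return P_x / P_x~ for a uniform quantiser covering [0, 1)."""
    levels = 2 ** bits
    step = 1.0 / levels
    # Map every sample to the centre of the quantisation level it falls in.
    quantised = [(min(int(x / step), levels - 1) + 0.5) * step for x in samples]
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = sum((x - q) ** 2 for x, q in zip(samples, quantised)) / len(samples)
    return signal_power / noise_power

random.seed(0)  # fixed seed so that the synthetic radiance samples are repeatable
radiance = [random.random() for _ in range(20000)]

for bits in (8, 11, 12):
    print(bits, "bits:", round(10 * math.log10(sqnr(radiance, bits)), 1), "dB")
```

Under these assumptions each additional bit raises the SQNR by roughly 6 dB, which is the sense in which a higher radiometric resolution reduces the distortion.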
Low-quality sensors have low SQNR, which equates to low radiometric resolution.
The disadvantage of increasing the radiometric resolution is the cost and complexity of adding a higher
resolution analogue-to-digital converter and the increase in storage space required for
the binary values of the digital image. For example, the Quickbird satellite owned by DigitalGlobe has
a radiometric resolution of 11 bits. This enables the sensor to distinguish between 2048 (2^11) levels of
radiance. The satellite has 128 Gb storage capacity, which equates to 57 images stored on board. The
sensor could distinguish between 65536 (2^16) levels of radiance if the radiometric resolution were set to 16
bits. The problem is that only 39 images could then be stored on board, which is a 32% reduction in
the number of images that can be stored.
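The figures in the Quickbird example can be reproduced with a short calculation. This is a sketch under the assumption that the storage needed per image scales linearly with the number of bits per sample and that the number of pixels per image is unchanged; the helper name `images_on_board` is hypothetical.

```python
def images_on_board(images_at_ref, bits_ref, bits_new):
    """Whole images that fit when only the bits per sample change."""
    return int(images_at_ref * bits_ref / bits_new)

levels_11 = 2 ** 11                       # radiance levels at 11 bits
levels_16 = 2 ** 16                       # radiance levels at 16 bits
images_16 = images_on_board(57, 11, 16)   # 57 images are stored at 11 bits
reduction = 100 * (57 - images_16) / 57   # percentage of images lost

print(levels_11, levels_16, images_16, round(reduction))  # 2048 65536 39 32
```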
Temporal resolution is the periodic rate of acquisition of a geographical area by the same satellite
sensor. This is important for investigating any change in land surface and the monitoring of global
environmental processes. The orbit, altitude, swath width, and priority tasking of the sensor on board
the satellite determine the temporal rate at which an area of interest can be imaged [42, Ch. 6]. Sensors
are tasked from a mission control centre to acquire images of geographical areas. Areas of interest
are assigned a priority task, which improves the temporal acquisition rate for these areas. The temporal
resolution varies from less than an hour to more than a few months [43, Ch. 2]. A sensor with a fixed
viewing angle, a repetitive orbital track and a fixed swath width has a fixed temporal resolution.
The swath width governs the trade-off between the temporal resolution and the spatial resolution. The
wider the swath width, the shorter the revisit period for a geographical area, while the narrower
the swath width, the better the spatial resolution (for the same number of pixels).
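The spatial side of this trade-off follows from dividing the swath width by the number of pixels across the track. The sketch below assumes a flat scene viewed at nadir; the detector count and the two swath widths are hypothetical values chosen only to illustrate the relationship.

```python
def pixel_size_m(swath_width_km, pixels_across):
    """Across-track ground size of one pixel for a given swath width."""
    return swath_width_km * 1000.0 / pixels_across

PIXELS_ACROSS = 2330  # hypothetical detector count, held fixed for both cases

wide = pixel_size_m(2330, PIXELS_ACROSS)    # wide swath: coarse pixels
narrow = pixel_size_m(233, PIXELS_ACROSS)   # narrow swath: fine pixels
print(wide, narrow)
```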
TABLE 2.1: Specification of different remote sensing sensors.
Sensor                                                 Temporal resolution   Spatial resolution  Wavelength range      Number of
                                                       (revisit period)                                                spectral bands
Enhanced Thematic Mapper Plus (ETM+)                   16 days               15 m – 60 m         0.45 µm–12.50 µm      8
MODerate-Resolution Imaging Spectroradiometer (MODIS)  1–2 days              250 m – 1000 m      0.405 µm–14.385 µm    36
Advanced Very High Resolution Radiometer (AVHRR)       Daily                 1100 m – 4000 m     0.58 µm–12.50 µm      5
How to choose a sensor: This thesis focuses on expanding settlements. Finding newly developed
housing requires several considerations when selecting the right remote sensing sensor.
High spatial resolution sensors have the ability to detect much smaller objects in an area. The
drawback is that higher spatial resolution means lower temporal resolution. These images are thus not
regularly acquired and are financially expensive.
Detecting new settlements is possible when comparing two high spatial resolution images taken at
two different dates. The problem is that similar land cover types can appear significantly different at
various times of the year. These seasonal changes in the land cover can be mitigated if the temporal
resolution is high enough to capture these trends [15]. This makes the use of high temporal resolution
sensors much more useful for change detection.
A list of specifications for three different satellites used to image the land surface is shown in
table 2.1. The specifications for these three satellites are used to illustrate the range of trade-offs to
consider when selecting a sensor.
The Enhanced Thematic Mapper Plus (ETM+) operates at a comparatively high spatial resolution of 15 m –
60 m, but has a low temporal resolution with a revisit time of 16 days.
The Advanced Very High Resolution Radiometer (AVHRR) has a high temporal resolution of one
day, but captures a geographical area at a spatial resolution of 1100 metres. The large swath width is
necessary to obtain a high temporal resolution at the expense of the spatial resolution.
The MODerate-resolution Imaging Spectroradiometer (MODIS) is a newer instrument, which was
specifically designed for global land surface monitoring and is the chosen sensor for this study, as it
has a high temporal resolution and medium spatial resolution capabilities [16]. MODIS has a temporal
resolution of 1–2 days, which is close to the temporal resolution of the AVHRR sensor. MODIS also
has a medium spatial resolution (250 m – 1000 m) and a wider variety of spectral bands.
2.6 MODERATE RESOLUTION IMAGING SPECTRORADIOMETER
FIGURE 2.5: Multiple MODIS images concatenated to form an image of the Earth.
MODIS is an experimental scientific sensor launched into the Earth’s thermosphere by NASA on
board the Terra EOS-AM-1 satellite on December 18, 1999. A second MODIS sensor was launched
on board the Aqua EOS-PM-1 satellite on May 4, 2002.
The Terra EOS satellite was the first NASA scientific research satellite to carry the MODIS
instrument into orbit. The Terra satellite was launched from the Vandenberg Air Force base into a
sun-synchronous orbit at an altitude of 705 km [56]. Terra is Latin for Earth. The Terra EOS satellite
carries a total of five remote sensing sensors which record measurements of the Earth’s climate system:
Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), Clouds and the Earth’s
Radiant Energy System (CERES), Multi-angle Imaging SpectroRadiometer (MISR), MODIS and
Measurements of Pollution in the Troposphere (MOPITT).
The Aqua EOS satellite was the second NASA scientific research satellite to carry a MODIS
instrument into orbit. The Aqua satellite was launched from the Vandenberg Air Force base into an
afternoon equatorial crossing orbit at an altitude of 705 km [56]. Aqua is Latin for water. The Aqua
EOS satellite carries a total of six remote sensing sensors that collect information about the Earth’s
water cycle. The six sensors are: the Atmospheric Infrared Sounder (AIRS), Advanced Microwave
Sounding Unit (AMSU-A), Humidity Sounder for Brazil (HSB), Advanced Microwave Scanning
Radiometer for EOS (AMSR-E), MODIS, and CERES.
TABLE 2.2: MODIS spectral band properties and characteristics.
Spectral  Wavelengths   Resolution  Spectral range      Property or characteristic
bands     (nanometres)  (metres)
Band 1    620–670       250         Visible (Red)       Absolute Land Cover Transformation, Vegetation Chlorophyll
Band 2    841–876       250         Near Infrared       Cloud Amount, Vegetation Land Cover Transformation
Band 3    459–479       500         Visible (Blue)      Soil/Vegetation Differences
Band 4    545–565       500         Visible (Green)     Green Vegetation
Band 5    1230–1250     500         Short Infrared      Leaf/Canopy Differences
Band 6    1628–1652     500         Short Infrared      Snow/Cloud Differences
Band 7    2105–2155     500         Short Infrared      Cloud Properties, Land Properties
Band 8    405–420       1000        Visible (Blue)      Chlorophyll
Band 9    438–448       1000        Visible (Blue)      Chlorophyll
Band 10   483–493       1000        Visible (Blue)      Chlorophyll
Band 11   526–536       1000        Visible (Green)     Chlorophyll
Band 12   546–556       1000        Visible (Green)     Sediments
Band 13   662–672       1000        Visible (Red)       Atmosphere, Sediments
Band 14   673–683       1000        Visible (Red)       Chlorophyll Fluorescence
Band 15   743–753       1000        Near Infrared       Aerosol Properties
Band 16   862–877       1000        Near Infrared       Aerosol Properties, Atmospheric Properties
Band 17   890–920       1000        Near Infrared       Atmospheric Properties, Cloud Properties
Band 18   931–941       1000        Near Infrared       Atmospheric Properties, Cloud Properties
Band 19   915–965       1000        Near Infrared       Atmospheric Properties, Cloud Properties
Band 20   3660–3840     1000        Mid wave Infrared   Sea Surface Temperature
Band 21   3929–3989     1000        Mid wave Infrared   Forest Fires & Volcanoes
Band 22   3929–3989     1000        Mid wave Infrared   Surface/Cloud Temperature
Band 23   4020–4080     1000        Mid wave Infrared   Surface/Cloud Temperature
Band 24   4433–4498     1000        Mid wave Infrared   Cloud Fraction, Troposphere Temperature
Band 25   4482–4549     1000        Mid wave Infrared   Cloud Fraction, Troposphere Temperature
Band 26   1360–1390     1000        Mid wave Infrared   Cloud Fraction (Thin Cirrus), Troposphere Temperature
Band 27   6535–6895     1000        Mid wave Infrared   Mid Troposphere Humidity
Band 28   7175–7475     1000        Long wave Infrared  Upper Troposphere Humidity
Band 29   8400–8700     1000        Long wave Infrared  Surface Temperature
Band 30   9580–9880     1000        Long wave Infrared  Total Ozone
Band 31   10780–11280   1000        Long wave Infrared  Cloud Temperature, Forest Fires & Volcanoes, Surface Temperature
Band 32   11770–12270   1000        Long wave Infrared  Cloud Height, Forest Fires & Volcanoes, Surface Temperature
Band 33   13185–13485   1000        Long wave Infrared  Cloud Fraction, Cloud Height
Band 34   13485–13785   1000        Long wave Infrared  Cloud Fraction, Cloud Height
Band 35   13785–14085   1000        Long wave Infrared  Cloud Fraction, Cloud Height
Band 36   14085–14385   1000        Long wave Infrared  Cloud Fraction, Cloud Height
NASA’s strategy is to use the MODIS sensors to investigate and acquire hyper-temporal,
multi-spectral and multi-angular observations of the Earth on a daily basis. MODIS was launched
to continue the monitoring of the Earth from older sensors such as: Coastal Zone Colour Scanner
(CZCS), the Advanced Very High Resolution Radiometer (AVHRR), the High Resolution Infrared
Spectrometer (HIRS), and the Thematic Mapper (TM). The MODIS sensors were built by the Santa
Barbara Remote Sensing Institute according to the specifications provided by NASA. NASA has gone
to great lengths to ensure proper sensor calibration to generate an accurate long-term data set for global
studies [57].
MODIS is a passive remote sensing instrument with 490 detectors, which are arranged to form 36
spectral bands that measure the 405 nm–14385 nm spectrum. Each detector in the sensor has a 12-bit
radiometric resolution and can acquire a swath of 2330 km (cross track) by 10 km (along track at nadir). The
wide swath width of MODIS enables it to record the entire Earth’s surface every two days. MODIS
spectral bands are recorded at different spatial resolutions: spectral bands 1–2 are measured at 250 m
spatial resolution, spectral bands 3–7 are measured at 500 m spatial resolution and spectral bands 8–36
are measured at 1 km spatial resolution. The spatial resolution is reported at a nadir viewing angle. It
should be noted that the pixel size increases in the scan direction at off-nadir angles, which causes
pixels to partially overlap. This phenomenon is known as the bowtie effect
and is a source of variability over the revisit cycle.
TABLE 2.3: Description of the available MODIS land cover products.
Product                           Composition time     Spatial resolution  Satellites               Product code              Short description
Snow product                      Daily/8-day          500m/1km            Terra or Aqua            MOD10/MYD10, MOD29/MYD29  Snow cover land and snow albedo
Land surface temperature          Daily/8-day/Monthly  1km/6km             Terra or Aqua            MOD11/MYD11               Land surface temperature and emissivity daily levels
Land cover dynamic product        Yearly               500m/1km            Terra or Aqua            MOD12/MYD12               Decision tree classify 34 classes of land cover
Thermal Anomalies/Fire products   Daily/8-day          1km                 Terra or Aqua            MOD14/MYD14               Fire detection
LAI/FPAR products                 8-day                1km                 Terra, Aqua or combined  MOD15/MYD15/MCD15         Measure surface photosynthesis, evapotranspiration, and net primary production
Gross Primary Production product  8-day                1km                 Terra or Aqua            MOD17/MYD17               Measures growth of terrestrial vegetation
Surface Reflectance               Daily/8-day          250m/500m/1km       Terra or Aqua            MOD09/MYD09               Spectral reflectance and atmospheric scattering
Global Vegetation Indices         16-day/Monthly       250m/500m/1km       Terra or Aqua            MOD13/MYD13               Calculates the NDVI and EVI indices
Vegetation Cover Conversion       Yearly               500m                Terra                    MOD44                     Estimate proportions of life form, leaf type, and leaf longevity
BRDF/Albedo products              8-day/16-day         500m/1km            Terra, Aqua or combined  MOD43/MYD43/MCD43         Mathematical models to describe BRDF and derive Albedo measurements
Burned Area product               Monthly              500m                Combined                 MCD45                     Burning and quality information and survey for rapid changes on surfaces
The spectral bands are designed to provide observations of global environmental processes
occurring in the troposphere: cloud activity, radiation budget, oceanographic occurrences and land
cover monitoring (full listing in table 2.2). The images acquired by MODIS are converted with a set
of preprocessing steps on a daily basis into terrestrial, atmospheric and oceanic products (full product
listing in table 2.3).
The prefixes MOD and MYD in the product code (table 2.3) refer to products derived from the
data acquired from the Terra and Aqua satellites respectively. The prefix MCD in the product code
refers to a product derived using data from both satellites [27, 28, 58–60]. The composition time
(table 2.3) is the temporal resolution at which an acquisition for the product becomes available,
and the spatial resolution is the resolution at which the product is produced.
FIGURE 2.6: Example of a passive orbiting satellite acquiring an image of the Earth.
The MODIS product chosen for this thesis is the MCD43A4 land surface reflectance product.
The product is defined as a nadir viewed land surface reflectance, which is atmospherically corrected
[61, 62]. The adjusted land spectral reflectance product significantly reduces the anisotropic scattering
effects of surfaces under different illumination and observation conditions [27, 28]. This BRDF/Albedo
product is also used as an input to derive land classifications for the Land Cover Dynamic Product. The
MCD43A4 product uses the first 7 spectral bands, which are often referred to as the land surface bands.
These 7 spectral bands are used because they experience minimal absorption by atmospheric gases.
The wide swath width of MODIS enables the surveying of every geographical area at least every
two days. The MODIS instrument has, however, an orbital repeat cycle of 16 days. Combined with the
wide swath width, this means that the viewing angles (at the same ground location) between successive
observations might differ dramatically, and that an image of the same geographical area with similar
viewing angles is only acquired every 16 days.
The disadvantage of acquiring images from a polar orbiting passive satellite is the variation in the
reflected signal that is caused by the change in the surface reflectance during the composition period
(Figure 2.6). This variation in signal is caused by many different environmental and external
sources, such as the solar zenith angle, viewing zenith angle, seasonality and sensor angle.
This disadvantage created the need to consider the distribution of the electromagnetic radiation as
a function of the observation and illumination angles. The BRDF is a mathematical function which
describes the variability in surface reflection based on the illumination and viewing angles [63].
Estimation of the BRDF enables the adjustment of the reflectance values as if they were taken from
a nadir view. The MODIS MCD43A4 product uses a 16-day rolling window of acquisitions from both
Terra and Aqua satellites, together with a semi-empirical kernel-driven bidirectional reflectance model
to determine the global set of parameters describing the BRDF. The hemispherical reflectance and the
bi-hemispherical reflectance at the solar zenith angle are derived from the BRDF parameters to produce
a coarse resolution composite image every 8 or 16 days [28].
A weighted linear sum of kernel functions is used for a BRDF model to correct for illumination and
viewing angles. This BRDF model is a 4-variable function that sums together an isotropic parameter
and two functions of viewing and illumination geometry to determine the reflectance [28]. The BRDF
model is given by
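\[
\begin{aligned}
R(\theta_{\mathrm{sol}}, \theta_{\mathrm{view}}, \theta_{\mathrm{rel}}, \lambda) ={}& f_{\mathrm{iso}}(\lambda) + f_{\mathrm{vol}}(\lambda) K_{\mathrm{vol}}(\theta_{\mathrm{sol}}, \theta_{\mathrm{view}}, \theta_{\mathrm{rel}}, \lambda) \\
& + f_{\mathrm{geo}}(\lambda) K_{\mathrm{geo}}(\theta_{\mathrm{sol}}, \theta_{\mathrm{view}}, \theta_{\mathrm{rel}}, \lambda),
\end{aligned}
\tag{2.5}
\]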
where θsol denotes the solar zenith angle and θview denotes the viewing angle. The variable θrel denotes
the relative azimuth angle and λ denotes the wavelength.
The RossThick kernel function is currently best suited for the volume scattering radiative transfer
model used in the kernel function Kvol (θsol , θview , θrel , λ) for the MODIS MCD43A4 product. The
LiSparse kernel function is at present best suited for the geometric shadow casting theory used in the
kernel function Kgeo (θsol , θview , θrel , λ) [28].
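The weighted sum in equation (2.5) can be sketched for a single spectral band as follows. The sketch assumes the commonly published form of the RossThick kernel, treats the kernel as a function of the viewing geometry only, and takes the geometric kernel value as a placeholder input instead of implementing the geometric kernel. All parameter values and function names are hypothetical.

```python
import math

def ross_thick(theta_sol, theta_view, theta_rel):
    """RossThick volumetric kernel K_vol; all angles in radians."""
    cos_xi = (math.cos(theta_sol) * math.cos(theta_view)
              + math.sin(theta_sol) * math.sin(theta_view) * math.cos(theta_rel))
    xi = math.acos(max(-1.0, min(1.0, cos_xi)))  # phase angle, clamped for safety
    return (((math.pi / 2 - xi) * math.cos(xi) + math.sin(xi))
            / (math.cos(theta_sol) + math.cos(theta_view)) - math.pi / 4)

def brdf_reflectance(f_iso, f_vol, f_geo, k_vol, k_geo):
    """Equation (2.5) for one spectral band: weighted linear sum of kernels."""
    return f_iso + f_vol * k_vol + f_geo * k_geo

# Hypothetical parameters for a single band and a single viewing geometry.
k_vol = ross_thick(math.radians(30), math.radians(10), math.radians(45))
k_geo = -1.0  # placeholder value; a geometric kernel implementation would supply this
reflectance = brdf_reflectance(0.20, 0.05, 0.02, k_vol, k_geo)
print(reflectance)
```

With the sun and the sensor both at nadir, the volumetric kernel in this form evaluates to zero, so that the reflectance reduces to the isotropic parameter plus the geometric contribution.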
The BRDF model’s parameters are derived by the MODIS MOD43B1 product and are used to
compute the albedos using the solar illumination geometry. The approximation of terrestrial albedo at
a particular solar zenith angle requires a weighted sum of the black-sky (directional-hemispherical)
albedo and the white-sky (bi-hemispherical) albedo. The black-sky albedo is defined as albedo in the
absence of a diffuse component and is a function of the solar zenith angle. The white-sky albedo is
defined as albedo in the absence of a direct component when the diffuse component is isotropic [28].
The product uses the black-sky and white-sky model for albedo estimation.
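The black-sky model is given as
\[
\begin{aligned}
\alpha_{\mathrm{BS}} ={}& f_{\mathrm{iso}}(\lambda)\left(g_{0,\mathrm{iso}} + g_{1,\mathrm{iso}}\theta_{\mathrm{sol}}^{2} + g_{2,\mathrm{iso}}\theta_{\mathrm{sol}}^{3}\right) \\
& + f_{\mathrm{vol}}(\lambda)\left(g_{0,\mathrm{vol}} + g_{1,\mathrm{vol}}\theta_{\mathrm{sol}}^{2} + g_{2,\mathrm{vol}}\theta_{\mathrm{sol}}^{3}\right) \\
& + f_{\mathrm{geo}}(\lambda)\left(g_{0,\mathrm{geo}} + g_{1,\mathrm{geo}}\theta_{\mathrm{sol}}^{2} + g_{2,\mathrm{geo}}\theta_{\mathrm{sol}}^{3}\right),
\end{aligned}
\tag{2.6}
\]
where the polynomial is evaluated in the solar zenith angle θsol, as the black-sky albedo is a function of
the solar zenith angle. The coefficients for the black-sky model for the isotropic (iso), the RossThick (vol)
and LiSparse (geo) kernels can be substituted into equation (2.6) to simplify to
\[
\begin{aligned}
\alpha_{\mathrm{BS}} ={}& f_{\mathrm{iso}}(\lambda) + f_{\mathrm{vol}}(\lambda)\left(-0.007574 - 0.070987\,\theta_{\mathrm{sol}}^{2} + 0.307588\,\theta_{\mathrm{sol}}^{3}\right) \\
& + f_{\mathrm{geo}}(\lambda)\left(-1.284909 - 0.166314\,\theta_{\mathrm{sol}}^{2} + 0.04184\,\theta_{\mathrm{sol}}^{3}\right).
\end{aligned}
\tag{2.7}
\]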
The white-sky model is given as
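\[
\alpha_{\mathrm{WS}} = f_{\mathrm{iso}}(\lambda)\,g_{\mathrm{iso}} + f_{\mathrm{vol}}(\lambda)\,g_{\mathrm{vol}} + f_{\mathrm{geo}}(\lambda)\,g_{\mathrm{geo}}. \tag{2.8}
\]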
The coefficients for the white-sky model are also substituted into equation (2.8), which equates to
\[
\alpha_{\mathrm{WS}} = f_{\mathrm{iso}}(\lambda) + 0.189184\,f_{\mathrm{vol}}(\lambda) - 1.377622\,f_{\mathrm{geo}}(\lambda). \tag{2.9}
\]
The BRDF model is then used to adjust the reflectance to a nadir viewing angle at the solar zenith angle of local solar noon.
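The black-sky and white-sky models can be evaluated with the coefficients quoted in equations (2.7) and (2.9). The sketch below assumes that the polynomial variable of the black-sky model is the solar zenith angle expressed in radians, and that the weight of the weighted sum is supplied as a diffuse fraction; the BRDF parameter values, the function names and the `diffuse_fraction` argument are hypothetical.

```python
import math

# Polynomial coefficients (g0, g1, g2) from equation (2.7) and the white-sky
# coefficients from equation (2.9), keyed by kernel.
BLACK_SKY = {
    "iso": (1.0, 0.0, 0.0),
    "vol": (-0.007574, -0.070987, 0.307588),
    "geo": (-1.284909, -0.166314, 0.04184),
}
WHITE_SKY = {"iso": 1.0, "vol": 0.189184, "geo": -1.377622}

def black_sky_albedo(f, theta_sol):
    """f maps 'iso', 'vol', 'geo' to BRDF parameters; theta_sol in radians."""
    return sum(f[k] * (g0 + g1 * theta_sol ** 2 + g2 * theta_sol ** 3)
               for k, (g0, g1, g2) in BLACK_SKY.items())

def white_sky_albedo(f):
    """Equation (2.9): no dependence on the solar zenith angle."""
    return sum(f[k] * WHITE_SKY[k] for k in WHITE_SKY)

def blended_albedo(f, theta_sol, diffuse_fraction):
    """Weighted sum of the two albedos; the weight is an assumed input."""
    return ((1.0 - diffuse_fraction) * black_sky_albedo(f, theta_sol)
            + diffuse_fraction * white_sky_albedo(f))

params = {"iso": 0.20, "vol": 0.05, "geo": 0.02}  # hypothetical band parameters
print(black_sky_albedo(params, math.radians(30)), white_sky_albedo(params))
```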
Cloud obscuration reduces the number of observations that are available for processing even when
both satellites are combined within a product. Fortunately, according to a global analysis conducted,
South Africa has more than an 80% probability of acquiring enough non-cloudy images within 16 days
to produce a reliable 8 day composite land reflectance MODIS product [64].
FIGURE 2.7: Sinusoidal projection of the planet Earth.
The land surface reflectance products are sinusoidally projected and stored in a Hierarchical Data
Format - Earth Observing System (HDF-EOS) format [65]. A sinusoidal projection of the planet Earth
is shown in figure 2.7. The sinusoidal projection is a pseudocylindrical projection, which translates
images to retain relative geographical sizes between areas accurately. These images are then gridded to
form an equal-sized gridded map. The disadvantage is that it distorts the shapes and orientation within
the maps when viewing the images.
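The forward sinusoidal projection on a sphere can be sketched in a few lines. The sphere radius below is an assumption about the grid definition, and the coordinates of the example location are approximate; neither is taken from the product documentation cited in this chapter.

```python
import math

SPHERE_RADIUS_M = 6371007.181  # assumed radius of the sphere used for the grid

def sinusoidal_forward(lon_deg, lat_deg, radius=SPHERE_RADIUS_M):
    """Project geographic coordinates to sinusoidal map coordinates in metres."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = radius * lon * math.cos(lat)  # meridians converge towards the poles
    y = radius * lat                  # parallels stay equally spaced
    return x, y

# Pretoria (approximate coordinates, used only as an illustration).
x, y = sinusoidal_forward(28.19, -25.75)
print(x, y)
```

The factor cos(lat) in the x coordinate is what preserves relative areas, and it is also the source of the shape distortion away from the central meridian.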
The PSF of the MODIS sensor was not measured prelaunch; instead a line spread function (LSF)
was measured in the scan direction to derive the PSF [66]. The MODIS PSF admits radiation from
adjacent areas, which is mostly caused by clouds. A correction for this unwanted radiation entering the
sensor is computed using both the PSF and the approximation of the radiance measured by the saturated
spectral bands. This prior knowledge of the radiance received is discarded in most products, as
it requires long computing times. The largest impact is on the low radiance measured in MODIS oceanic
products in close proximity to highly reflective objects such as clouds, coastlines, or sun
glint. The PSF introduces a small amount of straylight into the MODIS measurements, which does not
have a large impact on land surface products.
2.7 VEGETATION INDICES
Vegetation indices were created to assist in the study of terrestrial vegetation in large-scale global
environmental dynamics. Vegetation indices are spectral transformations of a set of spectral band
combinations. The vegetation indices enhance the vegetation characteristics within an image, which
facilitates the comparison of terrestrial photosynthetic activity variations [67].
2.7.1 Normalised Difference Vegetation Index
The Normalised Difference Vegetation Index (NDVI) is a scalar index that enhances vegetation
characteristics in a multi-spectral image. The NDVI was inspired by phenology, which is the study
of the periodical growth cycle of plants and how this cycle is influenced by seasonal and inter-annual
variability in the ecosystem [68]. A global NDVI coverage map is shown in figure 2.8. NDVI is a
normalised ratio that uses the λRED (Red spectrum band 0.63 µm – 0.69 µm) and λNIR (Near Infrared
spectrum band 0.76 µm – 0.90 µm) spectral bands and is computed as
\[
\mathrm{NDVI} = \frac{\lambda_{\mathrm{NIR}} - \lambda_{\mathrm{RED}}}{\lambda_{\mathrm{NIR}} + \lambda_{\mathrm{RED}}}. \tag{2.10}
\]
The NDVI index capitalises on the differences in absorption rates between the two spectral bands
when interacting with natural vegetation. The RED spectral band’s electromagnetic radiation is
absorbed by the natural vegetation for photosynthesis and the NIR spectral band’s electromagnetic
radiation is reflected by the natural vegetation because of the vegetation’s cellular structure. The NDVI
index exploits the low reflectance values in the RED spectral band and high reflectance values in the
NIR spectral band for natural vegetation [69, 70]. The NDVI ratio shown in equation (2.10) produces
positive values near 1 (NDVI ≈ 1) for areas containing a dense vegetation canopy and small positive
values (NDVI ≈ 0) for bare soils.
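Equation (2.10) can be applied per pixel as in the sketch below. The reflectance values are hypothetical, and returning zero when both bands are zero is an assumed convention and not part of the definition.

```python
def ndvi(nir, red):
    """Equation (2.10) for one pixel; inputs are surface reflectances."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for pixels with no signal in either band
    return (nir - red) / total

dense_canopy = ndvi(nir=0.50, red=0.04)  # hypothetical reflectances
bare_soil = ndvi(nir=0.30, red=0.25)     # hypothetical reflectances
print(dense_canopy, bare_soil)
```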
FIGURE 2.8: Global NDVI index coverage map created using MODIS. Image supplied by NASA.
The general use of the NDVI index is demonstrated in large regional environmental models, which
include: leaf area index, biomass, chlorophyll, net plant productivity, fractional vegetation cover,
accumulated rainfall, etc. Several studies tend to over-use the NDVI index in many applications for
which it was not specifically designed [71]. The normalised difference between these two spectral
bands only illustrates a relationship in the original information, while other important information
is discarded. Whether the discarded information is relevant depends on the process of analysis
and geographical area. The NDVI index is sensitive to numerous environmental factors, including
atmospheric effects, thin cloud coverage (ubiquitous cirrus), moistness of the soil (precipitation or
evaporation), difference in soil colour, anisotropic effects, and spectral effects (different sensors
provide different NDVI).
Several alternatives to NDVI have been proposed to address a variety of limitations in analysing
satellite acquired imagery. These include: the Perpendicular Vegetation Index [72], the Soil-adjusted
Vegetation Index [73], the Atmospherically Resistant Vegetation Index [74], and the Global
Environment Monitoring Index [71].
2.7.2 Enhanced Vegetation Index
The Enhanced Vegetation Index (EVI) is an improved version of the NDVI vegetation index. The EVI
does not tend to saturate as quickly as the NDVI does in areas with high biomass. The EVI decouples
the canopy background reflectance, and is computed as
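\[
\mathrm{EVI} = G\,\frac{\lambda_{\mathrm{NIR}} - \lambda_{\mathrm{RED}}}{\lambda_{\mathrm{NIR}} + C_1 \lambda_{\mathrm{RED}} - C_2 \lambda_{\mathrm{BLUE}} + L}. \tag{2.11}
\]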
The variable λNIR denotes the surface reflectance of the near infrared band and λRED denotes the
surface reflectance of the red spectral band. The variable λBLUE denotes the surface reflectance of the
blue spectral band and L denotes the canopy background adjustment term. The coefficients C1 and C2
denote the aerosol resistance terms and G is the gain coefficient.
The scaling coefficients are used to minimise the effects of aerosols. The blue spectral band
is atmospherically sensitive and is used to adjust the red spectral band for aerosol influences. The
coefficients used by MODIS to calculate EVI are substituted into equation (2.11) as
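\[
\mathrm{EVI} = 2.5\,\frac{\lambda_{\mathrm{NIR}} - \lambda_{\mathrm{RED}}}{\lambda_{\mathrm{NIR}} + 6\lambda_{\mathrm{RED}} - 7.5\lambda_{\mathrm{BLUE}} + 1}. \tag{2.12}
\]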
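Equation (2.11), with the MODIS coefficients of equation (2.12) as default arguments, can be sketched as a single function. The reflectance values in the example are hypothetical.

```python
def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, background=1.0):
    """Equation (2.11); the defaults are the MODIS coefficients of (2.12)."""
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + background)

value = evi(nir=0.50, red=0.05, blue=0.03)  # hypothetical reflectances
print(value)
```

Keeping the coefficients as arguments makes it explicit that they would have to be revisited for a different sensor or geographical area, which is the second question raised about EVI in this section.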
NDVI is the most widely used vegetation index, which could be attributed to its low computational
costs. The use of EVI always raises two questions:
1. Does the sensor measure the blue spectral band independently?
2. Are the scaling coefficients used in computing EVI applicable to the current geographical area?
NDVI is a good vegetation index if properly used and was included in this thesis because of its
popularity and to create a base performance for comparison [75, 76]. It should be noted that all methods
proposed in this thesis could be adapted to operate with other sets of spectral bands and vegetation
indices.
2.8 LAND COVER CHANGE DETECTION METHODS
Change detection can be viewed from a prototype theory mindset [77]. The prototype theory states
that the performance of the results generated from a change detection method is based on the user’s
requirements. This creates a paradigm that there is no single solution for detecting change for all
applications [18, 20]. Change detection methods are designed for a specific application and have their
own merits and limitations.
An example to demonstrate the user’s specific needs is shown in figure 2.9. A change in land
cover type from natural vegetation to human settlement is experienced in the red polygon, while only
seasonal change in the vegetation has occurred in the blue polygon. Applications and issues of change
detection in the remote sensing community are summarised into several categories [24], namely:
1. land cover classification and change detection [78, 79],
2. forest monitoring [80, 81],
3. fire detection [82, 83],
4. urban expansion and change [84, 85],
5. natural environment change [86, 87], and
6. specialised applications [88, 89].
The remote sensing community’s monitoring capabilities keep improving with the development and
deployment of new technologies. Global data sets are becoming more accessible and computational
resources are becoming more affordable [14]. These data sets come from several different sensors. The
more popular are: Landsat Multi-Spectral Scanner (MSS), TM, MISR, SPOT, AVHRR and MODIS.
The type of land cover change of interest also changes with technologies, which requires continuous
pursuit of new change detection methods [18, 20].
There are four major steps involved when constructing a change detection framework [90].
The first step is image preprocessing to ensure the image is corrected by removing any unwanted
artifacts [18, 20]. Preprocessing spatially registers and environmentally corrects each image to a
minimum product’s quality index. The product’s quality index is reached by using topographical
correction, spatial registration, radiometric calibration, atmospheric calibration and normalisation
between multi-temporal imagery.
The purpose of the preprocessing is the assurance that the images acquired over a geographical area
remain consistent through time and any changes in the reflectance values are not caused by processing
artifacts. Incorrect preprocessing has adverse effects on the accuracy of the change detection methods
[91, 92]. For example, if images are not correctly spatially registered, the geographical location of a
pixel in one image will not correspond with the geographical location of the same pixel in another
image.
The second step is proper feature extraction and selection. Suitable meaningful features must be
obtained from the images to give the change detection method the ability to detect change. A renowned
quotation is: If you can measure it, you can improve it - William Thomson. If no measurable feature
exists to detect the change, then no change detection method will be able to detect it.
The third step is to develop a suitable change detection method that uses the features to detect
changes according to the user’s requirements. The method must be reliable and robust in most
operating environments.
(a) Quickbird image taken on 30 July 2004 (courtesy of GoogleTM Earth).
(b) Quickbird image taken on 31 December 2008 (courtesy of GoogleTM Earth).
FIGURE 2.9: A change in land cover type is shown by the red polygon in (a) and (b), while only a
seasonal change has occurred in the blue polygon.
The fourth step is the assessment of the previous three steps. How well did the change detection
method satisfy the requirements set by the user? The overall accuracy assessed in the system is affected
by several factors, including [24]: (1) the quality of the preprocessing, (2) availability of reliable
ground truth, (3) complexity of the environmental case study, (4) useful feature extraction, (5) feature
analysis and processing, (6) change detection algorithms used, (7) the analyst’s skills, (8) knowledge
and information about the study area, (9) critical assessment of the system’s outputs, and (10) time and
cost constraints.
Standard statistical tests are used to measure the performance of the change detection algorithm
quantitatively and are supported by visual assessment of the geographical areas. Change detection
methods are divided into multi-temporal and hyper-temporal change detection methods. Change
detection methods operating on multi-temporal images require only a few images; usually in the order
of 2–5 images of the same geographical area. Change detection methods operating on hyper-temporal
images usually require hundreds of images taken at regular intervals; usually 8–30 days
between acquisitions.
Most change detection methods found in the literature can either provide change information
or a change alarm [93, 94]. A change alarm uses a threshold to provide binary change/no change
information from the images. A change information algorithm uses post-classification to provide a
from-to change.
Multi-temporal change detection methods evaluate local patterns in the reflectance values between
images to detect change. The change detection method should compensate for the difference in
environmental conditions, illumination conditions, and local trends in each of the images [95].
Multi-temporal change detection methods are grouped into several categories [24]: (1) algebra, (2)
transformation, (3) classification, (4) advanced models, (5) Geographical Information System approach
(GIS), (6) visual analysis, and (7) other methods.
The algebraic approach entails methods such as [24]: image differencing, image regression, image
ratioing, index differencing, trajectory vector analysis, and background subtraction [93, 94]. These
methods have low complexity and use manually adjusted thresholds to define change in the local
vicinity.
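The simplest of these algebraic methods, image differencing with a manually chosen threshold, can be sketched as follows. This is a minimal illustration only: the function name, the toy reflectance values and the threshold of 0.1 are all hypothetical, and the two images are assumed to be co-registered single-band arrays.

```python
import numpy as np

def image_difference_alarm(img_t1, img_t2, threshold):
    """Return a boolean change mask from two co-registered single-band images."""
    diff = np.asarray(img_t2, dtype=float) - np.asarray(img_t1, dtype=float)
    # A pixel raises a change alarm when its absolute difference exceeds the threshold.
    return np.abs(diff) > threshold

# Toy 3 x 3 reflectance grids (hypothetical values).
before = np.array([[0.20, 0.21, 0.19],
                   [0.22, 0.20, 0.21],
                   [0.20, 0.19, 0.20]])
after = before.copy()
after[1, 1] = 0.45  # one pixel changes strongly between the two dates

change_mask = image_difference_alarm(before, after, threshold=0.1)
```

The threshold is the only tuning parameter, which illustrates why it has to be adjusted by hand for each new data set.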
The advantage of using an algebraic approach is the ease of interpreting the execution of the
method. Another advantage is that it can operate on data sets which were captured in different
environmental conditions. The disadvantage of these methods is that they have the potential to enhance
the system noise, which effectively reduces the methods’ performances. Another disadvantage is the
setting of the threshold. The threshold has to be manually adjusted for each new data set. The methods
are sensitive to features with little separability or features that are subjected to external events or time
dependence.
The transformation approach uses methods to reduce the number of dimensions in the remote
sensing reflectance data set to create a new manifold [24]. The advantage of this approach is
that it removes redundant dimensions and emphasises the information-carrying components
[96, 97]. This approach includes transformation algorithms such as principal component analysis
(PCA), Gram-Schmidt, Chi-square, and independent component analysis. The disadvantage is the
difficulty of interpreting the new manifold and the change trajectory of the geographical area.
The classification approach is characterised by classification methods such as: spectral combined
analysis, expectation-maximisation (EM) algorithm, hybrid classification, hierarchical classification,
and artificial neural networks (ANN). These methods require initial training on a set of labelled pixels.
Afterwards, the method uses the information gathered to classify a set of unlabelled
pixels. The advantage of using such a classification method is that it provides a change information
matrix. These methods are robust to external environmental conditions [8, 98]. The disadvantage is the
dependency on periodic updating of the training data sets.
The advanced model approach transforms multi-temporal spectral reflectance values
into physical process parameters. The advantage is that the extracted process
parameters are easier to interpret than the spectral reflectance values [99, 100]. Methods commonly
used in this category are: Linear Spectral Mixture Analysis (LSMA), Li-Strahler reflectance model,
spectral mixture models, and biophysical parameter estimation [24]. The disadvantage is finding a
suitable model for the conversion and the intensive procedure of converting the reflectance values.
The GIS-based approach uses a GIS to analyse satellite imagery. The advantage of a GIS
is the ability to incorporate several different layers of metadata and satellite images for analysis
[101]. The disadvantage is that different data sets have different product quality standards and, when
used together, will degrade the overall results [24].
Visual interpretation of images can exploit the full capabilities of a remote sensing analyst’s
experience and knowledge. A skilled analyst can compensate for environmental conditions when
looking for change [102]. The disadvantages of this approach are the processing time and labour cost
required for large geographical areas, and the variability in the skill level of analysts.
There are many different change detection methods that cannot be grouped into the aforementioned
categories. These methods produce new approaches to the field of change detection and have their
associated advantages and disadvantages [103–105].
Land cover change is a function of time and can be abrupt or gradual. The ability to detect the
difference between abrupt and gradual change is based on the temporal acquisition rate, the change
detection method and the number of acquisitions.
Gradual change is defined as the slow change from one type of land cover to another. For example,
settlement expansion is the process of clearing the indigenous vegetation and constructing a new human
settlement, which could take several months. Abrupt change is defined as a fast change in land cover
type, for example, a wildfire that can destroy all the natural vegetation in an area within a few hours
[106].
Multi-temporal change detection methods flag all their land cover changes as abrupt. Previous
studies have shown that multi-temporal change detection methods’ performance is limited by the
differences produced in the seasonal growth of vegetated areas [107]. Variations in surface reflectance
values are observed in vegetated areas when the images are acquired at different times of the
intra-annual growth cycle [19]. These phenological cycles induce variations that could raise the
false change detection rate, as natural seasonal variation is flagged as land cover change.
To overcome this limitation, a high temporal acquisition rate is required to
capture the seasonal variations of a particular land cover [108]. This motivates the investigation into
hyper-temporal change detection methods, as these methods can distinguish between phenological
cycles, gradual and abrupt change [106].
Hyper-temporal change detection methods are used on multiple images acquired from a satellite
with a short periodic revisit cycle and can be used to complement a multi-temporal change detection
method [109]. The hyper-temporal acquisition rate provides continuous monitoring of the Earth, and
is not limited by the availability of costly high-resolution images. This can be used to identify
which areas should be tasked for acquisition of high spatial resolution imagery. For
example, a hyper-temporal change detection method maps the geographical areas with the highest
probability of land cover change at low costs, after which a costly high-resolution image is acquired to
confirm the change.
2.8.1 Hyper-temporal change detection methods
The majority of the change detection methods found in the literature are based on medium to high
spatial resolution multi-temporal image analysis [18, 20]. Certain multi-temporal change detection
methods can be extended to hyper-temporal images by applying the methods sequentially to subsets of
multi-temporal images. The approaches that have been extended for the hyper-temporal case are:
image differencing [110], image regression [111], image ratioing [112], index differencing [113],
Principal Component Analysis (PCA) [75, 76], and Change Vector Analysis (CVA) [114].
These multi-temporal change detection studies rely on bi-temporal and trajectory analysis [20, 21,
24] and the data are mostly treated as hyper-dimensional, but not necessarily as hyper-temporal. These
methods therefore do not fully capitalise on the temporal dimension, which captures the dynamics of
different land cover types.
Hyper-temporal change detection methods attempt to understand the underlying force structuring
the data in the time dimension by identifying patterns and trends, detecting changes, clustering,
modelling and forecasting [8, 40]. Hyper-temporal change detection methods are broadly divided into
three categories: regression analysis, spectrum analysis, and temporal metrics.
2.8.1.1 Regression analysis
Regression analysis is a parametric method used to model the underlying structure of the data. The
parameters of the model are estimated using the data set. For example, Kleynhans et al. assumed
the MODIS NDVI time series could be modelled as a triply modulated cosine function [30]. The
parameters for this model were estimated using an EKF. A labelled data set was used to estimate the
models’ covariance matrices manually to improve separability between different land cover classes.
The estimated parameters were evaluated to detect changes in land cover.
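As a rough illustration of fitting a parametric seasonal model, the sketch below fits a cosine with a mean, an amplitude and a phase to a synthetic NDVI-like series. Several things here are assumptions rather than the method of [30]: the three modulated quantities are taken to be the mean, amplitude and phase, they are held constant over the series, the fit uses ordinary least squares as a deliberately simpler stand-in for the EKF, and the value of 46 samples per year is hypothetical.

```python
import numpy as np

def fit_cosine(series, samples_per_year):
    """Least-squares fit of y_k = mu + alpha * cos(omega * k + phi) with known omega."""
    k = np.arange(len(series))
    omega = 2.0 * np.pi / samples_per_year
    # alpha * cos(omega*k + phi) = a * cos(omega*k) + b * sin(omega*k),
    # with a = alpha * cos(phi) and b = -alpha * sin(phi).
    design = np.column_stack([np.ones_like(k, dtype=float),
                              np.cos(omega * k),
                              np.sin(omega * k)])
    mu, a, b = np.linalg.lstsq(design, series, rcond=None)[0]
    alpha = np.hypot(a, b)
    phi = np.arctan2(-b, a)
    return mu, alpha, phi

# Synthetic, noise-free series covering three years (hypothetical parameters).
samples_per_year = 46
k = np.arange(3 * samples_per_year)
true_mu, true_alpha, true_phi = 0.5, 0.2, 0.7
ndvi = true_mu + true_alpha * np.cos(2.0 * np.pi * k / samples_per_year + true_phi)

mu_hat, alpha_hat, phi_hat = fit_cosine(ndvi, samples_per_year)
```

Unlike this batch fit, an EKF updates the parameter estimates sample by sample, which is what allows the parameters to drift over time and be monitored for change.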
Regression is also used to fit time series to a hypothetical temporal trajectory [109]. A temporal
trajectory is a defined map of a finite sequence of points describing the expected observed values in
a time series. The goodness of fit of a particular time series is computed for a set of hypothetical
temporal trajectories and is measured using least squares. A set of hypothetical temporal trajectories is
derived for forest disturbance dynamics in [109], which is used to describe the type of change.
The advantage of these methods is that there is no need to set a threshold. The disadvantage of both
these methods is the assumption in the form of the model or temporal trajectories. Are all the changes
that could realistically occur encapsulated in the model? Is the model able to adapt by inserting more
parameters or creating a larger set of hypothetical temporal trajectories?
2.8.1.2 Spectrum analysis
Spectrum analysis is the analysis of harmonic frequencies within a time series. Fourier analysis
is a type of spectral analysis which uses a Fourier transform to express a time series as a sum of
a series of cosine and sine waves with varying frequencies, amplitudes and phases [115, Ch. 3].
The frequency of each wave component is related to the number of completed cycles defined in the
time series. In many applications, the Fourier transform of time series is used for classification and
segmentation [116]. Lhermitte et al. proposed a classification method that evaluates only the mean
and seasonal Fourier transform components, because vegetation time series contain a
strong seasonal component that is sampled at a high rate [116]. These components are then clustered using
a post-classification change detection method [40].
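Extracting the mean and the seasonal (annual) Fourier component from a time series can be sketched as below. This is an illustration under stated assumptions, not the implementation of [116]: the series is assumed to span a whole number of years, and the sampling rate of 46 samples per year and the synthetic parameters are hypothetical.

```python
import numpy as np

def mean_and_annual_component(series, samples_per_year):
    """Return the mean and the amplitude and phase of the annual harmonic."""
    n = len(series)
    n_years = n // samples_per_year      # assumes a whole number of years
    spectrum = np.fft.rfft(series)
    mean = spectrum[0].real / n          # zero-frequency bin holds n times the mean
    annual = spectrum[n_years]           # bin that completes one cycle per year
    amplitude = 2.0 * np.abs(annual) / n
    phase = np.angle(annual)
    return mean, amplitude, phase

# Synthetic series covering three years (hypothetical parameters).
samples_per_year = 46
k = np.arange(3 * samples_per_year)
series = 0.5 + 0.2 * np.cos(2.0 * np.pi * k / samples_per_year + 0.7)

mean_hat, amp_hat, phase_hat = mean_and_annual_component(series, samples_per_year)
```

The resulting three numbers form a compact feature vector per pixel that could be passed to a clustering or classification step.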
Verbesselt et al. proposed the BFAST (Breaks For Additive Seasonal and Trend) approach, which
uses trend, seasonal and remainder components to detect changes in the phenological cycles of plants
[106]. The seasonal component is derived using the Fourier transform and has been shown to be more
stable than a piecewise linear seasonal model [117].
The advantage of these methods is that they are not dependent on a predefined model. They extract
the harmonic frequencies from the time series, which means they allow the evaluation of all frequency
components. The disadvantage of these methods is that the time series is assumed to be stationary and
that enough harmonic frequencies are properly sampled within the time series.
2.8.1.3 Temporal metric
A temporal metric is derived from the time series by evaluating inter-annual differences in five temporal
units: annual maximum, annual minimum, annual range, annual mean and temporal vector. Spatial
information can also be included in some of these temporal metrics, such as: spatial mean and spatial
standard deviation. The temporal metric is compared to a predefined threshold to determine whether
change has occurred.
An example of a temporal metric is the evaluation of the standard deviation within a moving window
on a time series. A time series is declared as a changed area when the standard deviations of two
different windows differ significantly from one another [118].
Another temporal metric is known as the disturbance index. The disturbance index is used to
detect large-scale ecosystem disturbance [119]. The disturbance index measures the ratio between annual
maximum land surface temperature and annual maximum EVI to the multiple year mean annual
maximum land surface temperature and multiple year mean annual maximum EVI. If the current
annual maxima are significantly higher than the long-term maxima, a disturbance is flagged.
The difference between the two is evaluated with a predefined threshold to categorise the level of
disturbance.
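One possible reading of the disturbance index is sketched below as a ratio of ratios. The exact formulation in [119] may differ: here the long-term means are assumed to be taken over all years preceding the current one, and the function name and the sample values are hypothetical.

```python
import numpy as np

def disturbance_index(lst_max, evi_max):
    """Current-year LSTmax/EVImax divided by the same ratio of the preceding years' means."""
    lst_max = np.asarray(lst_max, dtype=float)
    evi_max = np.asarray(evi_max, dtype=float)
    current = lst_max[-1] / evi_max[-1]
    # Assumption: the multi-year means exclude the current (last) year.
    long_term = lst_max[:-1].mean() / evi_max[:-1].mean()
    return current / long_term

# Annual maxima for five years (hypothetical values): the last year is hotter and less green.
di_disturbed = disturbance_index([300, 301, 299, 300, 320], [0.5, 0.5, 0.5, 0.5, 0.25])
di_stable = disturbance_index([300, 301, 299, 300, 300], [0.5, 0.5, 0.5, 0.5, 0.5])
```

A value near one indicates a year that resembles the long-term behaviour, while larger values would be compared to the predefined threshold.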
The annual NDVI differencing method is another temporal metric proposed by Lunetta et al. [19],
which calculates the difference between consecutive annual summations of the NDVI time series. The
pixel is flagged as change if a certain predefined threshold is exceeded in this difference. The threshold
is usually determined using standard normal distribution statistics.
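The annual differencing idea can be sketched as follows. This is not the implementation of [19]: it assumes that the differences are standardised across all pixels for each pair of consecutive years, and the threshold of three standard deviations, the sampling rate and the synthetic data are hypothetical.

```python
import numpy as np

def annual_sum_difference_flags(ndvi, samples_per_year, z_threshold=3.0):
    """ndvi has shape (n_pixels, n_samples); returns flags of shape (n_pixels, n_years - 1)."""
    n_pixels, n_samples = ndvi.shape
    n_years = n_samples // samples_per_year
    trimmed = ndvi[:, :n_years * samples_per_year]
    annual = trimmed.reshape(n_pixels, n_years, samples_per_year).sum(axis=2)
    diff = np.diff(annual, axis=1)               # consecutive annual sums
    # Standardise each year-pair across pixels (assumed normal for the threshold).
    z = (diff - diff.mean(axis=0)) / diff.std(axis=0)
    return np.abs(z) > z_threshold

# Synthetic data: 200 pixels, four years, one pixel losing vegetation after year two.
rng = np.random.default_rng(0)
samples_per_year = 46
ndvi = 0.5 + 0.01 * rng.standard_normal((200, 4 * samples_per_year))
ndvi[7, 2 * samples_per_year:] -= 0.3

flags = annual_sum_difference_flags(ndvi, samples_per_year)
```

Only the transition between the second and third year of pixel 7 carries a large drop in the annual sum, so that is where the flag is expected.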
The EKF change detection method is a temporal metric proposed by Kleynhans et al. [120], which
evaluates the Euclidean distance between parameters derived with an EKF within a spatio-temporal
window. The EKF fits a triply modulated cosine function to a time series to model the seasonal
variations. The pixel is flagged as change if the Euclidean distance exceeds a predefined threshold.
The autocorrelation function (ACF) change detection method is a temporal metric proposed by
Kleynhans et al. [121], which evaluates the stationarity of a time series. The ACF of the time series in
question is compared to the ACF of time series in the local geographical vicinity that did not change.
The pixel is flagged as change if the deviation between the two ACFs exceeds a predefined threshold.
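The comparison of ACFs can be illustrated with the sketch below. It is not the method of [121]: the deviation measure (largest absolute difference over the evaluated lags), the averaging of the neighbours' ACFs, the threshold of 0.3 and the synthetic series are all assumptions made for illustration.

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - lag], x[lag:]) / denom
                     for lag in range(max_lag + 1)])

def acf_change_flag(candidate, neighbours, max_lag, threshold):
    """Flag the candidate if its ACF deviates from the mean ACF of unchanged neighbours."""
    reference = np.mean([acf(s, max_lag) for s in neighbours], axis=0)
    deviation = np.max(np.abs(acf(candidate, max_lag) - reference))
    return deviation > threshold, deviation

# Synthetic seasonal series over four years (hypothetical parameters).
k = np.arange(4 * 46)
stable = 0.5 + 0.2 * np.cos(2.0 * np.pi * k / 46)
neighbours = [0.5 + 0.2 * np.cos(2.0 * np.pi * k / 46 + s) for s in (-0.05, 0.05)]
changed = stable.copy()
changed[92:] -= 0.3  # level shift halfway through the series

stable_flag, stable_dev = acf_change_flag(stable, neighbours, max_lag=46, threshold=0.3)
changed_flag, changed_dev = acf_change_flag(changed, neighbours, max_lag=46, threshold=0.3)
```

The level shift makes the changed series non-stationary in the mean, which flattens its ACF relative to the purely seasonal neighbours.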
The advantage of using a temporal metric is that it operates on the raw time series data. This enables
observation of abnormal behaviour that is usually filtered out by regression and spectrum analysis. The
disadvantage of using a temporal metric is the selection of the threshold and the negative impact that
additive noise in the time series has on the performance.
The noise is reduced by creating methods that operate on annual statistics, which reduces the
effective time series measurements significantly. For example, an original MODIS NDVI time series
for 10 years (+450 time samples) can be reduced down to only 10 annual measurements represented
by a temporal metric.
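The reduction from a full time series to annual statistics can be sketched as below. The figure of 46 samples per year (as for an 8-day composite) and the synthetic series are assumptions; the four metrics correspond to the annual maximum, minimum, range and mean listed earlier in this section.

```python
import numpy as np

def annual_metrics(series, samples_per_year):
    """Reduce a time series to one value per year for each annual metric."""
    n_years = len(series) // samples_per_year
    trimmed = np.asarray(series[:n_years * samples_per_year], dtype=float)
    years = trimmed.reshape(n_years, samples_per_year)
    return {
        "max": years.max(axis=1),
        "min": years.min(axis=1),
        "range": years.max(axis=1) - years.min(axis=1),
        "mean": years.mean(axis=1),
    }

# Ten years of synthetic NDVI at 46 samples per year (hypothetical values).
samples_per_year = 46
k = np.arange(10 * samples_per_year)
ndvi = 0.5 + 0.2 * np.cos(2.0 * np.pi * k / samples_per_year)

metrics = annual_metrics(ndvi, samples_per_year)
```

Each metric holds only ten values for the 460 input samples, which shows how strongly the annual statistics compress the series.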
2.8.2 MODIS land cover change detection product
Since the launch of MODIS, several different products have been developed (see table 2.3 on page 22
for a listing). Only a few specific change detection products have been developed for a small range
of applications. Thus there is currently no operational MODIS product to detect any changes in land
cover. There have been two previous attempts to create an operational land cover change detection
product [122–124].
The first attempt was the MODIS land use and land cover (LULC) algorithm, which detects land
cover changes at a 1 km resolution using a CVA approach [114, 124]. The direction of the change
vector is compared to a predefined threshold value and when exceeded, a change is flagged. It was
suggested that neural network classifiers be used on a pixel-by-pixel basis to track the probability that
a specific pixel changed over time [124]. The neural network is a supervised classifier and is used to
derive a parameter for land cover classification. This parameter is used to determine if the new data
of a geographical area are mapped to an existing category or to create a new category for the area.
The monitoring of current and previous observations is used with the land cover parameter to declare
change.
The second attempt at a MODIS LULC product was the MODIS Vegetative Cover Conversion
(VCC) product. The VCC product uses the first two spectral bands of MODIS at a spatial resolution of
250 m to detect any changes caused by anthropogenic activities or extreme natural events [123]. Five
different change detection methods were proposed in the VCC product:
1. RED-NIR space partitioning method: A two-dimensional map is created of the brightness and
greenness at two separate time intervals and is used to detect change. The brightness is computed
as the mean between the first two spectral bands. The greenness is computed as the difference
between spectral bands 2 and 1.
2. RED-NIR space change vector: A change vector is mapped onto a spectral space (spectral band
1 and 2) between two different dates for the same pixel. The magnitude and trajectory of the
change vector between the two dates are used to determine if change occurred.
3. Modified ∆-space threshold: Uses a polar notation to define the differences in the RED and NIR
values for a pixel at two different dates. The type of change is defined by the resulting vector in
the polar plane.
4. Texture thresholding: Measures a coefficient of variation within a 3x3 spatial kernel at two
different times. The coefficient of variation is calculated as the ratio between the standard
deviation and mean within the kernel. Change is declared when the coefficient of variation
exceeds a predefined threshold.
5. Linear feature thresholding: The method computes the mean and absolute difference of a pixel
value for each neighbouring pixel in a 3x3 spatial kernel. A threshold determines whether a
linear feature is present.
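The quantities used by methods 1, 2 and 4 above can be sketched as follows. This is an illustration rather than the VCC implementation: band 1 is taken as RED and band 2 as NIR, the function names and reflectance values are hypothetical, and no thresholds from [123] are reproduced.

```python
import numpy as np

def brightness_greenness(red, nir):
    """Brightness is the mean of bands 1 and 2; greenness is band 2 minus band 1."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (red + nir) / 2.0, nir - red

def change_vector(red_t1, nir_t1, red_t2, nir_t2):
    """Magnitude and direction (radians) of the change vector in RED-NIR space."""
    d_red = np.asarray(red_t2, dtype=float) - np.asarray(red_t1, dtype=float)
    d_nir = np.asarray(nir_t2, dtype=float) - np.asarray(nir_t1, dtype=float)
    return np.hypot(d_red, d_nir), np.arctan2(d_nir, d_red)

def coefficient_of_variation(kernel):
    """Ratio of the standard deviation to the mean within a spatial kernel."""
    kernel = np.asarray(kernel, dtype=float)
    return kernel.std() / kernel.mean()

# Hypothetical reflectances: a vegetated pixel becomes bright and less green.
brightness, greenness = brightness_greenness(0.1, 0.5)
magnitude, direction = change_vector(0.1, 0.5, 0.4, 0.1)
texture = coefficient_of_variation(np.full((3, 3), 0.2))
```

Each of these quantities would still have to be compared to a predefined threshold before a change is declared.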
Neither the MODIS LULC [114] nor the MODIS VCC [123] product fully capitalises on the
temporal dimension, as only two dates are compared. A multi-temporal change detection method
was attempted, while disregarding the potential of a hyper-temporal change detection method, which
has been used successfully in other fields [125, 126]: telecommunications, voice recognition, control
systems, etc. Even though one of the primary objectives before the launch of the MODIS sensors was
an operational land cover change detection product, to date no operational product has been developed.
2.9 SUMMARY
In this chapter, the use of remote sensing for monitoring geographical areas was discussed. The joint
investment of many international organisations and national governments has led to the creation of
numerous Earth observation satellites for various different applications. The chapter focused on the
importance of using satellite remote sensing to detect new human settlement development in certain
regions of South Africa.
The method of choosing a satellite-based sensor was discussed by considering the spatial, spectral,
radiometric, and temporal resolutions. After considering multiple factors, the MODIS sensor was
chosen, followed by a detailed description of its properties, with emphasis on the benefits of the BRDF
corrected data products. The chapter concluded with a review of some of the popular multi-temporal
change detection methods, and expanded to the use case of hyper-temporal change detection methods.
CHAPTER THREE
SUPERVISED CLASSIFICATION
3.1 OVERVIEW
Using machine learning methods to classify data sets is a recognised solution in many remote sensing
applications. In this chapter several design considerations are introduced that should be heeded when
implementing a supervised classifier. This is important, since less than 30% of new designs are
correctly assessed [127]. In the previous chapter it was found that machine learning methods are more
readily used in modern research because of the large volumes of data sets becoming readily available to
the research community, and the great benefit of analysing these data sets in higher dimensional feature
space. This chapter focuses on discussing strong, feasible approaches when a supervised classifier is
used to solve real world problems.
3.2 CLASSIFICATION
Classification is the process of finding important similarities between objects and then grouping these
objects into several subjective classes (categories).
Conceptual clustering is a modern process of classification by which conceptual descriptions
are derived from objects, which is followed by the classification of the object according to these
descriptions. Conceptual clustering was promoted from a machine learning background. There are
two general methods of categorisation that apply to conceptual clustering, namely supervised and
unsupervised learning [98, 128]. Supervised learning is the process of supplying category labels
to objects in the machine learning algorithm, while an unsupervised learning algorithm attempts to
extract the categories without any labels. The way in which the two learning methods operate is
completely different. A supervised learning method uses the labels of multiple objects to extract the
information from the descriptions that will accurately predict the correct category. An unsupervised
learning method examines the inherent structure between all objects, to create categories using the
most similar descriptions.
FIGURE 3.1: An aerial photo taken in the Limpopo province, South Africa, of two different land cover
types, which are indicated by a natural vegetation segment and a settlement segment. A segment is defined as
a collection of pixels within a bounding box of predefined size.
3.3 SUPERVISED CLASSIFICATION
Supervised classification is a form of conceptual clustering and is the process of allocating a predefined
class label to a certain input pattern. Several concepts will be introduced throughout this thesis in
considering a hypothetical problem of separating different land cover types in an image. In figure 3.1,
an aerial photo is used to illustrate two different land cover types: natural vegetation and human
settlement. Input patterns to the supervised classifier will be labelled as either natural vegetation or
human settlement. The supervised classifier is given a set of descriptors to infer a function that assigns
a predefined label to each segment of the image. This function produces output values, denoted by $y$,
that are either discrete, continuous or probabilistic in nature. The supervised classifier assigns a class label
to the output value $y$ that best matches the given input pattern and is denoted by $C_k$, $k = 1, 2, \ldots, K$,
where $K$ is equal to the number of output classes.
Land cover example: In the case of the land cover example shown in figure 3.1, K is equal to two
and the output value that the supervised classifier produces will be assigned accordingly to either
the natural vegetation class or the human settlement class. □
Observations from different data sources are often grouped together to form an input vector $\vec{\tilde{x}}$, also
referred to as an input pattern. These input data sources are usually in descriptive forms that can be
interpreted by humans.
Land cover example: In the case of the land cover example, the input data sources provide a colour
metric that is either ordinal or real. The input data source in this instance is a set of real number
values derived from the red, green and blue colours extracted from the RGB (Red Green Blue)
colour buffer of all the pixels within a segment. This input data source is used to form a single
input vector with three dimensions, which is defined as
$$ \vec{\tilde{x}} = [(\text{Red value}) \;\; (\text{Green value}) \;\; (\text{Blue value})], \tag{3.1} $$
where $\vec{\tilde{x}}$ denotes the input vector. □
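One way to build such a three-dimensional input vector from a segment is sketched below. How the three values are derived from the pixels is not specified above, so the use of the per-channel mean is an assumption, as are the function name and the pixel values.

```python
import numpy as np

def segment_input_vector(segment_rgb):
    """Collapse a (rows, cols, 3) RGB segment into [red, green, blue] channel means."""
    pixels = np.asarray(segment_rgb, dtype=float).reshape(-1, 3)
    return pixels.mean(axis=0)

# A hypothetical 4 x 4 segment with constant colour.
segment = np.zeros((4, 4, 3))
segment[..., 0] = 200.0  # red
segment[..., 1] = 120.0  # green
segment[..., 2] = 90.0   # blue

x_tilde = segment_input_vector(segment)
```

Other summaries of the segment, such as a median or a histogram, would lead to a different input vector of possibly higher dimension.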
3.3.1 Mapping of input vectors
The ability of the supervised classifier to map the input vector $\vec{\tilde{x}}$ to the desired output value $y$ is based on
the performance of the learning algorithm. Given a set of input vectors $\{\vec{\tilde{x}}\}$ and the set of corresponding
desired output values $\{y\}$, the learning algorithm seeks to infer a function that will satisfy
$$ y \approx F(\vec{\tilde{x}}). \tag{3.2} $$
This implies that the input space is approximately mapped to the desired output space by using
a mapping function denoted by F. The mapping function F is optimised by introducing a scoring
function that evaluates the current mapping function’s performance.
The learning algorithm tries to find a solution to the mapping function that will maximise the
scoring function. There are two general approaches to solving equation (3.2) when a scoring function
is used: empirical risk minimisation and structural risk minimisation. Empirical risk minimisation
attempts to find the optimal inferred function that will minimise the error in the mapping of the input
space to the output space. Structural risk minimisation includes a penalty term that provides control
between the bias and variance trade-off within the learning algorithm [129]. Both approaches try to
minimise the mapping error between the input and output space.
In regression analysis, the learning algorithm attempts to model the conditional distribution of the
desired output values, given a set of input vectors. The desired output values will also be termed target
values. Mapping typically uses an error function to determine the goodness of fit between the input and
output space, and is based on the principle of maximum likelihood [130, Ch. 6 p. 195]. The likelihood
L is computed as
$$ L = \prod_{p=1}^{P} p(T_C^p \mid \vec{\tilde{x}}^p) \, P(\vec{\tilde{x}}^p), \tag{3.3} $$
where $P(\vec{\tilde{x}}^p)$ denotes the probability of observing the $p$th input vector and $p(T_C^p \mid \vec{\tilde{x}}^p)$ denotes the
conditional probability density of observing the target value $T_C^p$, given that the input vector $\vec{\tilde{x}}^p$ is
present. The error function $E$ is derived by converting equation (3.3) into the negative log-likelihood,
which is defined as
$$ E = -\ln L = -\sum_{p=1}^{P} \ln p(T_C^p \mid \vec{\tilde{x}}^p) - \sum_{p=1}^{P} \ln P(\vec{\tilde{x}}^p). \tag{3.4} $$
The minimisation of the error in the mapping requires the minimisation of error function E. The
minimisation of the error function E in equation (3.4) will result in the maximisation of the likelihood
in equation (3.3). A popular method of defining the error in mapping is the Sum of Squares Error
(SSE). The minimisation of the SSE is equivalent to minimising the error function E in equation (3.4).
The SSE equation over P patterns is given as
$$ E = 0.5 \sum_{p=1}^{P} \left( F(\vec{\tilde{x}}^p) - T_C^p \right)^2. \tag{3.5} $$
The vector $\vec{\tilde{x}}^p$ denotes the $p$th input vector and $T_C^p$ denotes the corresponding target value of the
supervised classifier.
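The link between the SSE and the negative log-likelihood can be checked numerically with the sketch below. It assumes zero-mean Gaussian noise with a fixed standard deviation on the target values and leaves out the term in $P(\vec{\tilde{x}}^p)$, which does not depend on the mapping; the function names and the sample values are hypothetical.

```python
import numpy as np

def sum_of_squares_error(outputs, targets):
    """SSE over P patterns with the factor 0.5, as in equation (3.5)."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return 0.5 * np.sum((outputs - targets) ** 2)

def gaussian_negative_log_likelihood(outputs, targets, sigma=1.0):
    """Negative log-likelihood of the targets under Gaussian noise around the outputs."""
    n_patterns = len(targets)
    return (sum_of_squares_error(outputs, targets) / sigma ** 2
            + n_patterns * np.log(sigma * np.sqrt(2.0 * np.pi)))

targets = [1.0, 0.0, 1.0]
outputs_a = [0.9, 0.2, 0.6]   # outputs of one candidate mapping
outputs_b = [0.5, 0.5, 0.5]   # outputs of another candidate mapping

sse_a = sum_of_squares_error(outputs_a, targets)
sse_b = sum_of_squares_error(outputs_b, targets)
nll_a = gaussian_negative_log_likelihood(outputs_a, targets)
nll_b = gaussian_negative_log_likelihood(outputs_b, targets)
```

With a fixed noise level, the two error measures differ only by a scale factor and an additive constant, so they rank candidate mappings identically.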
In regression analysis, the mapping derived by using equation (3.5) is regarded as optimal as long
as the following three conditions are met [130, Ch. 6 p. 203]:
1. The input vector set $\{\vec{\tilde{x}}\}$ is sufficiently large to capture the underlying data structure.
2. The mapping between the input space and the output space is flexible enough.
3. The optimisation of the mapping is done with a good learning algorithm to minimise
equation (3.5) effectively.
In classification analysis, the learning algorithm tries to model the posterior probability of the
class label. The SSE function was not specifically designed for classification problems, as it assumes
that the target values are generated from a smooth deterministic function with additive zero-mean
Gaussian distributed noise. The decision to use error functions within classification requires discrete
class labels with optional corresponding class membership probabilities [130, Ch. 6 p. 222]. Many
different approaches have been used to rescale the output values in regression problems to match the
class membership probabilities [130, Ch. 6 p. 223]. The error function shown in equation (3.4) is
reformulated for a classification problem as
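$$ E = -\sum_{p=1}^{P} \ln p(T_C^p \mid \vec{\tilde{x}}^p) - \sum_{p=1}^{P} \ln P(\vec{\tilde{x}}^p) = -\sum_{p=1}^{P} \sum_{k=1}^{K} \delta_{T_C^p} \ln p(C_k \mid \vec{\tilde{x}}^p) - \sum_{p=1}^{P} \ln P(\vec{\tilde{x}}^p). \tag{3.6} $$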
If the $p$th input vector $\vec{\tilde{x}}^p$ is from class $C_k$ then $\delta_{T_C^p} = 1$, where $\delta$ denotes the Kronecker delta symbol.
The symbol $k$ denotes the class label of interest and $K$ denotes the number of output classes.
The output values of the supervised classifier correspond to the Bayesian posterior probabilities if
the SSE function is minimised as shown in equation (3.6) [131, 132]. In a regression application it is
acceptable to assume Gaussian residuals when using the SSE function, but for classification problems
the target values are discrete and the additive zero-mean Gaussian distributed noise is not a good
description. A more intuitive approach is to use a binomial distribution which leads to the definition of
the cross-entropy error function [133].
Cross-entropy starts by observing the probability that the set of target values is $T_{C_k}^p = \delta_{T_C^p}$ when
the $p$th input pattern $\vec{\tilde{x}}^p$ is from class $C_k$. This results in the output of a supervised classifier denoting a
class membership probability $p(C_k \mid \vec{\tilde{x}}^p)$ [130, Ch. 6 p. 237]. The value of the conditional distribution is
then expressed as
$$ L = \prod_{p=1}^{P} p(T_C^p \mid \vec{\tilde{x}}^p) \, P(\vec{\tilde{x}}^p) = \prod_{p=1}^{P} \left[ \prod_{k=1}^{K} (y^p)^{T_{C_k}^p} \right] P(\vec{\tilde{x}}^p), \tag{3.7} $$
which equates to the cross-entropy error function defined as
$$ E = -\sum_{p=1}^{P} \sum_{k=1}^{K} T_{C_k}^p \ln \frac{y^p}{T_{C_k}^p}. \tag{3.8} $$
To ensure that the output values of the supervised classifier equate to the posterior probabilities, the
following condition must hold [130, 134]:
$$ \frac{l'(1 - y)}{l'(y)} = \frac{1 - y}{y}, \tag{3.9} $$
where a class of functions $l$ which satisfies this condition is given by
$$ l(y) = \int y^r (1 - y)^{r-1} \, dy. \tag{3.10} $$
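The condition in equation (3.9) can be checked numerically for members of the class in equation (3.10), as sketched below. The derivative of $l$ is simply the integrand $y^r (1 - y)^{r-1}$; the two values of $r$ are chosen because $r = 1$ integrates to the quadratic form $y^2/2$ and $r = 0$ integrates to the logarithmic form $-\ln(1 - y)$. The function names and the grid of test values are hypothetical.

```python
import numpy as np

def l_prime(y, r):
    """Derivative of l(y) in equation (3.10): the integrand y^r (1 - y)^(r - 1)."""
    return y ** r * (1.0 - y) ** (r - 1)

def condition_gap(y, r):
    """Left-hand side minus right-hand side of equation (3.9)."""
    return l_prime(1.0 - y, r) / l_prime(y, r) - (1.0 - y) / y

# Evaluate the condition on a grid strictly inside (0, 1).
y = np.linspace(0.05, 0.95, 19)
gap_quadratic = np.max(np.abs(condition_gap(y, r=1)))
gap_logarithmic = np.max(np.abs(condition_gap(y, r=0)))
```

Both gaps vanish up to rounding error, which agrees with the algebraic cancellation of the powers of $y$ and $1 - y$ in the ratio for any $r$.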
Both the cross-entropy error function and SSE function comply with the condition set in equation
(3.9). Either of these two error functions can be used in minimising the error in the mapping between
the input space and output space for a given classification application. The SSE function is more
attractive owing to the ease of implementation.
FIGURE 3.2: The same aerial photo over the Limpopo province as shown in figure 3.1, with an RGB
histogram overlay showing the attributes of the two segments.
Land cover example: In the case of the land cover example, a mapping of the input space to the
output space is planned. The output space has two categories and the class labels are defined
as $C_k \in \{C_1, C_2\} = \{\text{natural vegetation}, \text{human settlement}\}$. The input vectors are grouped as
shown in equation (3.1). The learning algorithm infers a function that will map the input vector
to the corresponding output value. These output values are grouped according to their respective
class label for analysis of the supervised classifier. The learning algorithm will attempt to map
the correct intensities of the RGB buffer values that will prove to be the most probable match
between the input vector and the correct class membership. The learning algorithm uses a scoring
system, like the SSE, to minimise the number of incorrect class memberships that are present in
the current mapping. To demonstrate the results of the mapping, a histogram of each segment
is shown in figure 3.2 with all participating pixels. The supervised classifier assigns segments
with dominant red intensity to human settlement and segments with dominant green intensity to
natural vegetation. □
The external evaluation of the mapping of the input space to the output space requires sound
empirical validation. It was shown that fewer than 30% of new classifiers and learning algorithms are
assessed with proper empirical validation [127]. To ensure proper analysis, the results can be
assessed by running the supervised classifier on actual (non-synthetic) data sets. This approach will
ensure strong support in using the supervised classifier to solve real problems. A second approach to
proper external evaluation is the subdivision of the data set into several partitions. These partitioned
data sets allow proper tuning of the supervised classifier and are used to perform cross-validation [127].
A good method of tuning a supervised classifier is to subdivide the labelled data set (input vectors with
known class labels) into three different subsets:
1. A training data set, which is used to train the learning algorithm to derive a mapping function
that will minimise the errors on the entire set of input vectors $\{\vec{\tilde{x}}\}$.
2. A validation data set, which is used to test the performance periodically and to mitigate any
negative design effects of the supervised classifier [135]. The performance is bounded by the
intrinsic noise within the training data [130, Ch. 9 p. 372].
3. A test data set, which is used to verify the performance of the supervised classifier on unseen
data. The test data set is used to approximate the generalisation error; this data set is not included
in the training phase or optimisation phase of the classifier.
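The three-way subdivision described above can be sketched as follows. This is a minimal illustration only: the function name `split_labelled_set`, the 60/20/20 fractions and the synthetic labelled set are assumptions made for the example and are not taken from the thesis.

```python
import random

def split_labelled_set(patterns, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle (input vector, class label) pairs and cut them into three subsets."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)       # fixed seed keeps the split repeatable
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]           # held out from training and tuning
    return training, validation, test

# Hypothetical labelled set: ten (R, G, B) input vectors with alternating class labels.
labelled = [((i, 2 * i, 3 * i), i % 2) for i in range(10)]
training, validation, test = split_labelled_set(labelled)
```

The test subset is only consulted once the classifier has been trained and tuned, so that it approximates the generalisation error on unseen data.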
3.3.2 Converting to feature vectors
Preprocessing of the input vector $\vec{\tilde{x}}$ before the learning algorithm and postprocessing of the output
vector $\vec{y}$ after the learning algorithm are optional procedures used to improve an algorithm’s
performance. The improvement is observed even when the learning
algorithm operates on a noisy and inconsistent data set [136]. Let $\vec{x}$ denote the preprocessed version
of the input vector $\vec{\tilde{x}}$, and $\vec{\hat{y}}$ denote the postprocessed version of the output vector $\vec{y}$. This processing
chain is illustrated in figure 3.3.
FIGURE 3.3: Flow diagram illustrating the processing steps, which include preprocessing and
postprocessing.
The input data set $\{\vec{\tilde{x}}\}$ contains information from several input data sources, and the information
from each individual source can be real numbers, ordinal numbers, nominal numbers or a 1-of-c
coding. An ordinal number describes the numerical ranking of an object’s position in a set.
A nominal number is used for labelling purposes alone and does
not provide an indication of any other type of measurement. A 1-of-c coding is a vector representation
of the input which is an all-zero vector except in one location. The input data sets must have the same
cardinality regardless of the form of the input source.
Preprocessing is the processing of raw data supplied from the input data set $\{\vec{\tilde{x}}\}$ to another space
that can be more effectively analysed. Most machine learning algorithms learn faster and provide
better performance if the input data set $\{\vec{\tilde{x}}\}$ is preprocessed. Numerous different methods are used for
preprocessing, including: sampling, transformation, denoising, standardisation and feature extraction.
1. Sampling selects representative subsets from a large population of input patterns to perform a
range of functions such as generalisation, cross-validation, etc.
2. Transformation translates the raw data set to another mathematical domain.
3. Denoising includes several techniques used to reduce the noise on samples in the input data set.
4. Standardisation refers to the scaling of the variables within the input pattern from multiple input
data sources to a common scale. This common scale allows the underlying properties of the
input data sources to be compared fairly within a machine learning algorithm.
5. Feature extraction extracts specific characteristics from the input patterns.
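Of the preprocessing methods listed above, standardisation is the simplest to illustrate. The sketch below assumes zero-mean, unit-variance scaling, which is one common choice and is not necessarily the scaling used in the thesis; the function names and sample values are hypothetical.

```python
import math

def fit_standardiser(training_vectors):
    """Return per-variable (means, standard deviations) estimated from the training set."""
    n = len(training_vectors)
    columns = list(zip(*training_vectors))
    means = [sum(col) / n for col in columns]
    # A constant variable has zero spread; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds

def standardise(vector, means, stds):
    """Scale one input pattern to the common scale defined by the training set."""
    return [(v - m) / s for v, m, s in zip(vector, means, stds)]

# Two hypothetical input sources on very different scales.
training_vectors = [[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]]
means, stds = fit_standardiser(training_vectors)
scaled = [standardise(v, means, stds) for v in training_vectors]
```

The scaling parameters are estimated on the training data only and then reused for the validation and test data, so that no information from the held-out sets leaks into the preprocessing step.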
FIGURE 3.4: An alternative selection of natural vegetation and human settlement segments of the aerial
photo taken in the Limpopo province using the same input vector.
Land cover example: Revisiting the aerial photo, the advantage of feature extraction as a
preprocessing step can be shown when new segments are selected as shown in figure 3.4.
High correlation is observed in the histogram of the three RGB buffer values when the new
segments are captured with the original input vector defined in equation (3.1). This results in
poor separability within the input space and significant deterioration in the performance of the
machine learning algorithm. Both segments appear highly similar in figure 3.4, and a complex
classifier would be required to separate the segments into the two predefined classes.
A feature extraction method is proposed in the example to extract both the moisture and
reflectivity of each segment. Once extracted, these features can be placed into a feature vector $\vec{x}$
of two dimensions, which is defined as
$$ \vec{x} = [(\text{Moisture}) \;\; (\text{Reflectivity})]. \tag{3.11} $$
In this feature space, the human settlement segment in the example has high reflectivity
and low moisture retention due to the bare soil. The natural vegetation segment has high moisture
retention and low reflectivity, as shown in figure 3.5. This creates an improved feature space for
the classifier to separate the two classes, regardless of the geographical positions of the segments.
□
Postprocessing is an important component in the analysis phase of the design [137]. Postprocessing is
the procedure of converting the output set $\{\vec{y}\}$, produced by the supervised classifier, back into either
the space of the original data set or to a more user-friendly format. This extracts information from the
results produced by the learning algorithm and is used to improve the overall system performance.
FIGURE 3.5: A new histogram created by extracting the feature vectors of the new segments selected
in figure 3.4.
Numerous methods are used for postprocessing, which are categorised as: knowledge filtering,
interpretation, evaluation and knowledge integration [137].
1. Knowledge filtering is the filtering of the outputs produced by the supervised classifier. This
filtering improves the results when the mapping function in the supervised classifier is sensitive
to the noise within the training data set.
2. Interpretation is a form of knowledge discovery where input vectors are processed by the
supervised classifier and converted to a user-friendly format for human analysis. These
postprocessed outputs are analysed to interpret the effect the input vectors have on the
supervised classifier. This creates a new knowledge base for further improving the results of
the supervised classifier for the given application.
3. Evaluation is an approach that transforms the output values into a performance metric that is
used to evaluate the performance of the current supervised classifier. Typical performance
metrics include: classification accuracy, comprehensibility, computational complexity, visual
interpretation, etc.
4. Knowledge integration is the process of including additional selected information sources to
improve the performance of the supervised classifier.
Land cover example: In the case of the land cover example, the evaluation approach is used as a
postprocessing step. The classification accuracy is used as the performance metric to evaluate
the segment classification within the aerial photo. The supervised classifier produces an output
vector $\vec{y}$ that is either discrete, continuous or probabilistic in nature.
Let the output vector $\vec{y}$ in this example denote the vector containing all the posterior class
probability values. The mapping of this vector to a class is expressed as
$$ C_k = \begin{cases} C_1 \text{ (natural vegetation)} & \text{if } y_1 > y_2 \\ C_2 \text{ (human settlement)} & \text{if } y_2 \geq y_1. \end{cases} \tag{3.12} $$
The output vector $\vec{y}$ is classed as natural vegetation when the largest value in the vector is in
the first position and human settlement when in the second position. The classification accuracy
is maximised by selection of the most appropriate supervised classifier and feature extraction
method. □
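The evaluation step of this example, the class mapping of equation (3.12) followed by the classification accuracy, can be sketched as follows. The posterior values and reference labels are invented for illustration, and the function names are hypothetical.

```python
CLASS_LABELS = ("natural vegetation", "human settlement")

def map_to_class(y):
    """Equation (3.12): C1 if y1 > y2, otherwise C2 (ties go to human settlement)."""
    y1, y2 = y
    return CLASS_LABELS[0] if y1 > y2 else CLASS_LABELS[1]

def classification_accuracy(outputs, reference_labels):
    """Fraction of output vectors whose mapped class equals the reference label."""
    predicted = [map_to_class(y) for y in outputs]
    correct = sum(p == r for p, r in zip(predicted, reference_labels))
    return correct / len(reference_labels)

# Hypothetical posterior class probabilities for four segments.
outputs = [(0.9, 0.1), (0.2, 0.8), (0.6, 0.4), (0.3, 0.7)]
reference = ["natural vegetation", "human settlement",
             "human settlement", "human settlement"]
accuracy = classification_accuracy(outputs, reference)
```

In this invented case three of the four segments are mapped to their reference class, so the accuracy metric evaluates to 0.75.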
The preprocessing of the input vector $\vec{\tilde{x}}$ will produce a new input vector $\vec{x}$ that is commonly referred
to as the feature vector. Feature vectors will be used throughout the thesis as it is assumed that with
proper feature extraction the overall system performance will improve.
3.4 ARTIFICIAL NEURAL NETWORKS
An artificial neural network (ANN) is a computational learning method that was inspired by the neural
activities within the human brain [138]. ANNs have a range of capabilities to operate on non-linear
and non-parametric data sets. The advantage of the ANN is that it can model a non-linear relationship
between the input and output variables. The ANN is trained on a partial set of known data to perform
either classification, estimation, simulation or prediction of underlying structures within the data.
3.4.1 Network architecture
3.4.1.1 Perceptron
The first design consideration that will be evaluated is the network architecture, as several different
ANN architectures are proposed in the literature.
The simplest architecture is the single-layer
perceptron, which is a linear feedforward neural network that was first proposed by Frank Rosenblatt
at the Cornell Aeronautical Laboratory in 1957 [139]. The perceptron is discussed because several other
concepts expand on it, and because it has an important limitation in terms of the range of
functions it can represent. The perceptron is classified as a feedforward network, as the activation of the
neuron is propagated in one direction from the feature vector $\vec{x}$ to the output value $y$. The relationship
between the feature vectors and the output is stored within the ANN’s weight vector (also referred to
as the synaptic strengths within the ANN), and is defined within the network as
$$ y = F(\vec{x}, \vec{\omega}). \tag{3.13} $$
The variable $y$ denotes the corresponding ANN’s output value and $\vec{\omega}$ denotes the weight vector. The
feature vector presented to the network is denoted by $\vec{x}$ and $F$ denotes the function inferred by the
ANN. The weight vector $\vec{\omega}$ and the feature vector $\vec{x}$ are multiplied such that equation (3.13) expands
in the case of the perceptron to
$$ y = F\left(\omega_0 + \sum_{i=1}^{N} x_i \omega_i\right) = F\left(\omega_0 + \vec{x} \cdot \vec{\omega}\right). \tag{3.14} $$
The symbol $F$ denotes the activation function and the network inputs are denoted by the feature vector
$\vec{x} = \{x_1, x_2, \ldots, x_N\}$. The weight vector for the network is denoted by $\vec{\omega} = \{\omega_1, \omega_2, \ldots, \omega_N\}$ and the
neuron bias by $\omega_0$.
The perceptron is trained with the perceptron learning rule, which minimises the error function
by evaluating the output value produced for a given feature vector. The perceptron learning rule
processes individual feature vectors $\vec{x}$ by presenting them to the network and adjusting the weight
vector $\vec{\omega}$ iteratively to improve the classification accuracy. The perceptron learning rule attempts
to fit a linear hyperplane through the feature space. The perceptron learning rule is limited by the
network architecture and will only converge if the classes are linearly separable within the feature
space [140, 141]. Other applications involving multiple separation regions are catered for by using
multiple perceptrons in parallel, with each output value corresponding to a specific region.
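The perceptron learning rule can be sketched as follows. The thesis does not state the exact update used, so this example assumes the classical error-correction form with a hard-threshold activation; the learning rate, the epoch limit and the logical AND training set are illustrative assumptions.

```python
def perceptron_output(x, weights, bias):
    """Equation (3.14) with a hard-threshold activation (assumed for this sketch)."""
    activation = bias + sum(xi * wi for xi, wi in zip(x, weights))
    return 1 if activation > 0 else 0

def train_perceptron(patterns, targets, rate=1.0, max_epochs=100):
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(patterns, targets):
            error = t - perceptron_output(x, weights, bias)
            if error != 0:
                # Move the hyperplane towards the misclassified feature vector.
                weights = [wi + rate * error * xi for wi, xi in zip(weights, x)]
                bias += rate * error
                mistakes += 1
        if mistakes == 0:      # every training pattern is classified correctly
            break
    return weights, bias

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]         # logical AND: linearly separable
weights, bias = train_perceptron(patterns, targets)
```

For a training set that is not linearly separable, such as the exclusive OR, the loop above would only stop at `max_epochs`, which reflects the convergence limitation noted in the text.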
3.4.1.2 Multilayer perceptron
A more popular network architecture is the multilayer perceptron (MLP). An MLP is a feedforward
ANN model that contains multiple layers of neurons. The multilayer architecture allows the MLP to
distinguish feature vectors within a feature space that are not linearly separable. A two-layer network
architecture of an MLP, which has one hidden node layer, is illustrated in figure 3.6.
FIGURE 3.6: The topology of a feedforward multilayer perceptron with a single hidden layer.
This fully connected two-layer network’s links are mathematically expressed as
$$ y_k = F_2\left(\omega_{k0} + \sum_{j=1}^{M} \omega_{kj}\, F_1\left(\omega_{j0} + \sum_{i=1}^{N} x_i \omega_{ji}\right)\right), \tag{3.15} $$
which is more compactly expressed in vector notation as a linear multiplication between vectors as
$$ y_k = F_2\left(\omega_{k0} + \vec{\omega}_k \cdot F_1\left(\omega_{j0} + \vec{x} \cdot \vec{\omega}_j\right)\right). \tag{3.16} $$
The network consists of $N$ input nodes denoted by the vector $\vec{x} = \{x_1, x_2, \ldots, x_N\}$. The weight vector
that connects the input nodes to the $j$th hidden node is denoted by the vector $\vec{\omega}_j = \{\omega_{j1}, \omega_{j2}, \ldots, \omega_{jN}\}$,
with a corresponding neuron bias denoted by $\omega_{j0}$. Similarly, the weight vector that connects the hidden
nodes to the $k$th output node is denoted by the vector $\vec{\omega}_k = \{\omega_{k1}, \omega_{k2}, \ldots, \omega_{kM}\}$, with a corresponding
neuron bias denoted by $\omega_{k0}$. The MLP allows the use of multiple output nodes to produce an output
vector that expands equation (3.16) to
$$ \vec{y}_k = F_2\left(\omega_{k0} + \vec{\omega}_k \cdot F_1\left(\omega_{j0} + \vec{x} \cdot \vec{\omega}_j\right)\right), \tag{3.17} $$
with an output vector $\vec{y}_k$ that uses a 1-of-c coding.
Introducing a unity input on each neuron, $x_0 = 1$, the weight vector is expanded to include the
neuron bias as $\vec{\omega}_j = \{\omega_{j0}, \omega_{j1}, \ldots, \omega_{jN}\}$ for the hidden nodes and $\vec{\omega}_k = \{\omega_{k0}, \omega_{k1}, \ldots, \omega_{kM}\}$ for the
output nodes. This simplifies equation (3.17) to
$$ \vec{y}_k = F_2\left(\vec{\omega}_k \cdot F_1\left(\vec{x} \cdot \vec{\omega}_j\right)\right). \tag{3.18} $$
Monotonic functions are usually used as activation functions. Neural networks typically use a
sigmoid activation function in the hidden layers of equation (3.18), given as
$$ F(a) = \frac{1}{1 + e^{-a}}. \tag{3.19} $$
The sigmoid activation function is non-linear and allows the outputs of the neural network to be
interpreted as a posterior class probability [130, Ch. 6 p. 234]. If all the activation functions within the
network are converted to linear functions, then an equivalent single layer linear network without any
hidden layers can be derived. This follows from the observation that the composition of successive
linear transformations is itself a linear transformation [130, Ch. 4 p. 121].
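The forward computation of equation (3.18) with the sigmoid of equation (3.19) can be sketched as follows. Plain Python lists are used to keep the example dependency-free. The weight values, the layer sizes and the choice of a sigmoid in the output layer are assumptions made for illustration.

```python
import math

def sigmoid(a):
    """Equation (3.19)."""
    return 1.0 / (1.0 + math.exp(-a))

def layer(inputs, weight_rows, activation):
    """Each row holds [bias, w1, ..., wn]; a unity input x0 = 1 is prepended to match."""
    extended = [1.0] + list(inputs)
    return [activation(sum(w * v for w, v in zip(row, extended)))
            for row in weight_rows]

def mlp_forward(x, hidden_weights, output_weights):
    hidden = layer(x, hidden_weights, sigmoid)      # F1 applied at every hidden node
    return layer(hidden, output_weights, sigmoid)   # F2 applied at every output node

# Hypothetical network: N = 2 inputs, M = 2 hidden nodes, 2 output nodes.
hidden_weights = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
output_weights = [[-1.0, 2.0, 0.0], [-1.0, 0.0, 2.0]]
y_balanced = mlp_forward([0.5, 0.5], hidden_weights, output_weights)
y_skewed = mlp_forward([1.0, 0.0], hidden_weights, output_weights)
```

With these invented weights, equal inputs produce equal outputs of 0.5, while an input dominated by the first feature raises the first output above the second.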
By applying a linear transformation to equation (3.19), a hyperbolic tangent activation function is derived as
$$ F(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}. \tag{3.20} $$
The tangent activation function is of interest because empirical simulations have shown that it often
provides faster training of the network (section 3.4.4) [130, Ch. 4 p. 127].
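The linear transformation that relates the two activation functions can be written out explicitly. Writing $\sigma(a)$ for the sigmoid of equation (3.19), a symbol introduced here only for this derivation, scaling the input by two and then scaling and shifting the output gives

$$ 2\sigma(2a) - 1 = \frac{2}{1 + e^{-2a}} - 1 = \frac{1 - e^{-2a}}{1 + e^{-2a}} = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = \tanh(a), $$

where the last fraction follows from multiplying the numerator and denominator by $e^{a}$. A network with tangent hidden nodes is therefore equivalent to one with sigmoid hidden nodes whose weights and biases have been rescaled accordingly.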
The number of layers and hidden nodes within each layer are flexible design parameters. The
general rule is that the layers and nodes are chosen to best model the feature space. It is known
from the Kolmogorov theorem that a two-layer network with finitely many discontinuities can closely
approximate any decision boundary to arbitrary precision using a sufficient number of hidden nodes
with sigmoidal activation functions [142].
Several different network architectures exist and are constructed on similar concepts. The focus of
this chapter will be on the MLP, but different ANNs will be briefly discussed in this chapter.
3.4.2 Regression using a multilayer perceptron
Regression analysis is a method for modelling and analysing a set of variables that focuses on the
mapping relationship between a dependent variable and multiple independent variables. This extends
to the understanding of inherent changes in the dependent variable when any one of the independent
variables is altered. An ANN is seen as a flexible non-linear regression method, which is readily
deduced from equation (3.18), where the network uses a training algorithm to find a weight $\vec{\omega}$ to map
a relationship between the feature vectors and the output vectors.
The training algorithm trains the network by presenting the patterns of the training set to the
network, and adjusting the weights (synapse strengths) to minimise the error function. The training
algorithm derives the optimal weight by using the error function given in equation (3.4) as
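$$ \vec{\omega}_{\mathrm{opt}} = \underset{\vec{\omega} \in \Omega}{\operatorname{argmin}} \{E\} = \underset{\vec{\omega} \in \Omega}{\operatorname{argmin}} \left\{ -\sum_{p=1}^{P} p(T_C^p \mid \vec{x}^p) - \sum_{p=1}^{P} P(\vec{x}^p) \right\}. \tag{3.21} $$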
The vector $\vec{\omega}_{\mathrm{opt}}$ denotes the optimised weight that provides the optimal fit for the mapping that is
found within the weight space $\Omega$. $P(\vec{x}^p)$ denotes the probability of observing the $p$th feature vector
and $p(T_C^p \mid \vec{x}^p)$ denotes the conditional probability density of the target value $T_C^p$ given that the feature
vector $\vec{x}^p$ is present. The probability of observing the $p$th feature vector, denoted by $P(\vec{x}^p)$, is an
additive constant in equation (3.21), and cannot be improved through the network architecture or
learning algorithm procedures [130, Ch. 6 p. 195]. This term is dropped to simplify equation (3.21) to
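$$ \vec{\omega}_{\mathrm{opt}} = \underset{\vec{\omega} \in \Omega}{\operatorname{argmin}} \left\{ -\sum_{p=1}^{P} p(T_C^p \mid \vec{x}^p) \right\}. \tag{3.22} $$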
The SSE function given in equation (3.5) is usually used as the error function in the MLP and is
substituted into equation (3.22) to compute the optimised weight as
$$ \vec{\omega}_{\mathrm{opt}} = \underset{\vec{\omega} \in \Omega}{\operatorname{argmin}} \left\{ 0.5 \sum_{p=1}^{P} \left( F(\vec{x}^p, \vec{\omega}) - T_C^p \right)^2 \right\}. \tag{3.23} $$
The symbol $F$ denotes the MLP’s inferred map and $\vec{x}^p$ denotes the $p$th feature vector with the
corresponding target value denoted by $T_C^p$. The training algorithm attempts to find the optimal weight
$\vec{\omega}_{\mathrm{opt}}$ that provides the smallest error function value $E$.
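The role of the SSE in equation (3.23) can be illustrated with a deliberately small case. The one-weight map `model` and the grid of candidate weights below are hypothetical stand-ins for the MLP and its weight space, chosen so that the minimiser can be checked by hand.

```python
def sum_squared_error(outputs, targets):
    """SSE as used in equation (3.23): half the summed squared differences."""
    return 0.5 * sum((y - t) ** 2 for y, t in zip(outputs, targets))

def model(x, weight):
    """Hypothetical one-weight map F(x, w) = w * x standing in for the MLP."""
    return weight * x

features = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]
candidates = [0.5 * k for k in range(9)]       # weight space: 0.0, 0.5, ..., 4.0

def error_for(weight):
    return sum_squared_error([model(x, weight) for x in features], targets)

best_weight = min(candidates, key=error_for)   # argmin over the candidate weights
```

An exhaustive search of this kind is only feasible for a toy weight space; the training algorithms discussed later in the chapter replace it with iterative updates.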
3.4.3 Classification using a multilayer perceptron
Section 3.4.2 showed that an ANN can be interpreted as a non-linear regression model.
A classifier is constructed from a regression model by interpreting the dependent variable
as a posterior class membership probability. These posterior probabilities yield the most likely class
for each feature vector.
The reconstruction of the regression model to behave like a classifier starts by using a 1-of-c coding
output vector as shown in equation (3.18). The output layer responds like a logistic regression model
when sigmoid activation functions are used in each output node [130, Ch. 6 p. 232].
By setting the target value for each training pattern to the desired posterior class probability, with
a 1-of-c coding, the MLP is trained in the same manner as a regression model to obtain the optimal
weight $\vec{\omega}_{\mathrm{opt}}$. Using the optimal weight $\vec{\omega}_{\mathrm{opt}}$, the ANN maps the feature vectors to their corresponding
desired posterior class probabilities.
Since each MLP output node represents the posterior class probability for each class, a mapping
function is used to select the class that has the largest posterior probability. The mapping function Z
is expressed as
$$ C_k = Z(\vec{y}), \tag{3.24} $$
where $C_k$ denotes the class membership and $\vec{y}$ denotes the MLP output vector.
Deriving the optimal weight $\vec{\omega}_{\mathrm{opt}}$ will assign the highest posterior class probability to the correct
class membership $C_k$ for the corresponding feature vector $\vec{x}$ and is expressed as
$$ P(C_k = C_f \mid \vec{x}) > P(C_k = C_g \mid \vec{x}) \quad \forall (f \neq g), \tag{3.25} $$
where $P(C_k = C_f \mid \vec{x})$ denotes the probability of class membership of $C_k$ being equal to $C_f$, given the
feature vector $\vec{x}$ was presented to the MLP.
The probability of error is equal to the probability of falling within the incorrect decision region
[143]. The probability of error for the class membership (Ck = Cc ) of the MLP is computed as
$$ P_e = 1 - \int_{R_c} p(\vec{x} \mid C_k = C_c)\, P(C_k = C_c)\, d\vec{x}. \tag{3.26} $$
The procedure of minimising the probability of error $P_e$ on the global population of feature
patterns requires that the complete population’s class memberships be known. This is not possible
for most actual data sets (non-synthetic), as acquiring the class membership on all feature vectors is
infeasible. The objective of the training algorithm is to minimise the probability of error Pe on the
global population by only using a subset of feature vectors with known class membership.
An external evaluation process, as discussed in section 3.3.1, is used to minimise the probability of error
and to improve overall system performance. The subdivision of the labelled data
set (feature vectors with known class memberships) for the MLP is briefly discussed:
1. A training data set is used to train the ANN to minimise the mapping errors on the data set by
means of adaptation of the weights. A popular method of calculating the error in the mapping is
the SSE shown in equation (3.5). The minimisation of the error is accomplished by initialising
the weights with random values, followed by presenting the training data set to the network to
adjust the weights accordingly. Several different training algorithms exist in the literature that
attempt to minimise the error on the training data set.
2. A validation data set is periodically used to test the network performance to mitigate the effects
of overfitting [135]. A neural network with more hidden nodes has the ability to learn a more
complex mapping [144]. A complex mapping in the feature space has the ability to isolate
complex regions [145]. If proper design of the MLP is not adhered to, the network not only
extracts the characteristics of the feature space, but also memorises the noise within the training
data set.
3. A test data set is used to validate the performance of the MLP. The test data set is used to estimate
the generalisation error, and this data set is not included in the training phase or optimisation
phase.
3.4.4 Training of neural networks
As stated previously, the MLP network relies on the weights to assign the feature vector to the class
membership that has the largest posterior probability. This is under the assumption that the optimal
weight $\vec{\omega}_{\mathrm{opt}}$ is used to provide the decision regions. The design of a proper MLP requires the estimation
of a weight $\vec{\omega}$ that will minimise the error function and generalisation error for an application.
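A training algorithm reduces the error function $E(\vec{\omega})$ by searching through the weight
space $\Omega$. The error function uses the SSE metric given in equation (3.5), which is continuous and twice differentiable
in $\mathbb{R}^{|\vec{\omega}|}$, where $|\vec{\omega}|$ denotes the total number of weights in the network.
A local minimum of $E(\vec{\omega})$ is defined as a vector $\vec{\omega}_{\mathrm{local}}$, such that $E(\vec{\omega}_{\mathrm{local}}) \leq E(\vec{\omega})$ for all
$|\vec{\omega}_{\mathrm{local}} - \vec{\omega}| < D_{\vec{\omega}}$ in $\mathbb{R}^{|\vec{\omega}|}$, where $D_{\vec{\omega}}$ is a predefined constant.
It is possible that $E(\vec{\omega})$ may contain multiple local minima. Let $S_{\mathrm{local}}$ denote the set of all such
local minima of $E(\vec{\omega})$ on $\mathbb{R}^{|\vec{\omega}|}$. The global minimiser of $E(\vec{\omega})$ is then defined as
$$ \vec{\omega}^{*} = \underset{\vec{\omega} \in S_{\mathrm{local}}}{\operatorname{argmin}}\; E(\vec{\omega}). \tag{3.27} $$
Note that $E(\vec{\omega}^{*}) \leq E(\vec{\omega}), \forall\, \vec{\omega} \in \mathbb{R}^{|\vec{\omega}|}$. In addition, the gradient of the error function, $\nabla E(\vec{\omega})$, is zero
for all $\vec{\omega} \in S_{\mathrm{local}}$.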
Owing to the non-linear nature of the error function $E(\vec{\omega})$, no closed form solution can be obtained.
Many iterative algorithms can be applied to minimise the error function $E(\vec{\omega})$, most of which iteratively
adjust the current weight $\vec{\omega}_i$ such that
$$ \vec{\omega}_{(i+1)} = \vec{\omega}_i + \Delta\vec{\omega}_i, \tag{3.28} $$
where $\Delta\vec{\omega}_i$ is typically chosen such that $E(\vec{\omega}_{i+1}) < E(\vec{\omega}_i)$. The manner in which $\Delta\vec{\omega}_i$ is determined at
each epoch $i$ determines whether the algorithm converges to a local minimum or a global minimum of
the error function $E(\vec{\omega})$.
Owing to the inherent difficulty of reliably locating the global minimum $\vec{\omega}^{*}$ of the error function
$E(\vec{\omega})$, most algorithms instead attempt to find the best local minimum, given a finite number of
iterations, which may be called an acceptable local minimum for a given training data set.
Another important aspect that should be considered is that the global minimum of the error function
$E(\vec{\omega})$ on a given training data set may not necessarily result in the best generalisation performance for
the application, hence it is typically sufficient to find an acceptable local minimum [130, Ch. 6 p. 194].
Several different approaches to calculating the weight update $\Delta\vec{\omega}_i$ in equation (3.28) will now be
discussed.
3.4.5 First order training algorithms
3.4.5.1 Gradient descent
The negative gradient of the error function $E(\vec{\omega})$ points in the direction in which $E(\vec{\omega})$ decreases
most rapidly in its local vicinity. Algorithms that exploit the gradient information can typically locate a
minimum in fewer iterations than algorithms that do not use gradients. The gradient descent algorithm
propagates along the negative slope of the error function [146]. The weight update $\Delta\vec{\omega}_i$ given in
equation (3.28) is iteratively computed in the gradient descent approach at each epoch $i$ as
$$ \Delta\vec{\omega}_i = -L_i \nabla E|_{\vec{\omega}_i} + M \Delta\vec{\omega}_{(i-1)}. \tag{3.29} $$
The variable $L_i$ denotes the learning rate and $M$ denotes the momentum parameter. The gradient of
the error surface evaluated at weight $\vec{\omega}_i$ is denoted by $\nabla E|_{\vec{\omega}_i}$. The algorithm incorporates a learning
rate parameter $L_i$ that scales the rate of propagation of the weight down the negative slope. The correct
adjustment of the learning rate improves the convergence onto a local minimum of $E(\vec{\omega})$. If the
learning rate is set too high, the algorithm has difficulty in stabilising the weight and might cause $\vec{\omega}_i$
to oscillate around the minimum, preventing convergence. When the learning rate is set too low, the
algorithm takes a long time to converge. Common practice is to decrease the learning
rate $L_i$ gradually during training, which reduces the chance of oscillations within the training process.
Additional information for the training algorithm is acquired from the eigenvalues of the Hessian
matrix of the error. The learning rate can be set to Li = (2/λmax ) to improve the performance further,
where λmax denotes the largest eigenvalue in the Hessian matrix [147]. The disadvantage is that the
Hessian matrix varies as the weight is updated at each iteration with ∆ωi and calculating the Hessian
matrix is computationally expensive.
If the Hessian matrix is calculated, a metric is defined for characterising the expected rate of
convergence of steepest descent. This metric is the ratio of the smallest eigenvalue $\lambda_{\min}$ and the
largest eigenvalue $\lambda_{\max}$ and is expressed as
$$ R(\lambda) = \frac{\lambda_{\min}}{\lambda_{\max}}. \tag{3.30} $$
A very small value of $R(\lambda)$ usually means that the contours of the error surface are highly elongated ellipses
and that progress to the minimum will be extremely slow when using steepest gradient
descent. The momentum parameter $M$ is used for compensating when the ratio $R(\lambda)$ is small [148].
The momentum term leads to faster convergence towards the minimum without causing divergent
oscillations, which may appear when the learning rate is too large. The momentum parameter acts as a
lowpass filter to incorporate recent trends in movement along the error surface. Inclusion of momentum
generally leads to a significant improvement in the performance of gradient descent.
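The effect of the momentum term in equation (3.29) can be illustrated on a small, poorly conditioned error surface. The quadratic surface, the learning rate of 0.01 and the momentum of 0.9 below are assumptions chosen for the illustration and are not values from the thesis.

```python
def gradient_descent(gradient, weights, rate, momentum=0.0, epochs=200):
    step = [0.0] * len(weights)
    for _ in range(epochs):
        g = gradient(weights)
        step = [-rate * gi + momentum * si for gi, si in zip(g, step)]  # equation (3.29)
        weights = [wi + si for wi, si in zip(weights, step)]            # equation (3.28)
    return weights

# Hypothetical surface E(w) = 0.5 * (w1^2 + 100 * w2^2), minimum at the origin.
# Its Hessian is diagonal with eigenvalues 1 and 100.
eigenvalues = (1.0, 100.0)

def gradient(w):
    return [eigenvalues[0] * w[0], eigenvalues[1] * w[1]]

def distance_to_minimum(w):
    return sum(wi ** 2 for wi in w) ** 0.5

ratio = min(eigenvalues) / max(eigenvalues)      # equation (3.30)

plain = gradient_descent(gradient, [1.0, 1.0], rate=0.01)
with_momentum = gradient_descent(gradient, [1.0, 1.0], rate=0.01, momentum=0.9)
```

On this surface the ratio of equation (3.30) is 0.01. After the same number of epochs the run with momentum ends far closer to the minimum than the run without it, because the plain update makes slow progress along the shallow direction of the valley.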
3.4.5.2 Resilient backpropagation
Resilient backpropagation (RPROP) is a first-order heuristic algorithm that is used for training a
feedforward neural network [149]. The RPROP algorithm is based on the notion that the optimal
step size, at a given iteration, will differ for each dimension of $\vec{\omega}_i$. RPROP thus maintains a separate
weight update step $\Delta\vec{\omega}_{i,j}$ for each dimension $j$. A heuristic is employed to adjust each $\Delta\vec{\omega}_{i,j}$ at every
epoch as follows: if the sign of the gradient in dimension $j$ has changed from that of the previous epoch,
reduce the step size $\Delta\vec{\omega}_{i,j}$ and reverse its sign, otherwise increase the step size $\Delta\vec{\omega}_{i,j}$.
The reasoning is that the gradient sign in dimension j will change if the algorithm has moved over
a local minimum, thus the algorithm must take smaller steps in the following iterations to approach the
minimum. This is analogous to implementing standard steepest descent, but with a separate adaptive
learning rate for each dimension.
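The sign-based heuristic can be sketched as follows. This is a simplified variant without weight backtracking, and it is not claimed to match the exact formulation of [149]. The growth and shrink factors (1.2 and 0.5), the step bounds and the test surface are assumptions made for the example.

```python
def sign(v):
    return (v > 0) - (v < 0)

def rprop(gradient, weights, epochs=100, step0=0.1, grow=1.2, shrink=0.5,
          step_min=1e-6, step_max=50.0):
    weights = list(weights)
    steps = [step0] * len(weights)          # one step size per weight dimension
    previous = [0] * len(weights)
    for _ in range(epochs):
        current = [sign(g) for g in gradient(weights)]
        for j, (now, before) in enumerate(zip(current, previous)):
            if now * before < 0:            # sign flipped: a minimum was stepped over
                steps[j] = max(steps[j] * shrink, step_min)
            elif now * before > 0:          # same sign: take a larger step
                steps[j] = min(steps[j] * grow, step_max)
            weights[j] -= now * steps[j]    # move against the sign of the gradient
        previous = current
    return weights

# Hypothetical surface E(w) = 0.5 * (w1^2 + 100 * w2^2), minimum at the origin.
def gradient(w):
    return [w[0], 100.0 * w[1]]

result = rprop(gradient, [1.0, 1.0])
```

Only the sign of each gradient component is used, so the two dimensions of this surface are handled identically despite their very different curvatures.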
3.4.5.3 Quickprop
The last heuristic first order training algorithm that will be discussed in this section is the Quickprop
algorithm [150]. Quickprop treats each weight within the network as quasi-independent. The idea is to
approximate the error surface with a quadratic polynomial function. The gradient information derived
with backpropagation is used to determine the coefficients of the polynomial. The step sizes are fixed
within the weight to ensure that the algorithm will converge to a minimum. The Quickprop algorithm
uses a local quadratic surface and cannot distinguish between propagating upwards or downwards on
the error surface. This drawback is easily overcome by determining the propagation direction by using
an algorithm such as the gradient descent algorithm in the first epoch.
3.4.5.4 Line search
The line search is a one dimensional minimisation problem, which finds the minimum of the error
function along a particular search direction [151]. It is used in several different algorithms to reduce
computational complexity and will be discussed briefly. Suppose that a certain algorithm is considering
a particular search direction $\vec{d}_i$ through the weight space for a potential future weight update (equation
(3.28)). The minimum along that particular search direction is calculated as
$$ \vec{\omega}_{(i+1)} = \vec{\omega}_i + \Delta_d \vec{d}_i, \tag{3.31} $$
where the step size parameter $\Delta_d$ is calculated as
$$ \Delta_d = \underset{\Delta_d \in \mathbb{R}}{\operatorname{argmin}}\; E(\vec{\omega}_i + \Delta_d \vec{d}_i). \tag{3.32} $$
In summary, the line search finds the optimal step size for a selected search direction. The line search
algorithm itself has several constraints, as every line minimisation involves several internal error
function evaluations, which could be computationally expensive. Line search introduces additional
parameters whose values will determine the termination criterion for each line search.
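One possible line search is the golden-section search sketched below. The thesis does not prescribe a particular method, so this choice is an assumption, as are the bracket `upper`, the `tolerance` and the test surface. The search presumes that the error has a single minimum along the direction within the bracket.

```python
import math

def line_search(error, weights, direction, upper=1.0, tolerance=1e-6):
    """Golden-section search for the step in [0, upper] minimising the error."""
    def along(step):
        return error([w + step * d for w, d in zip(weights, direction)])

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    low, high = 0.0, upper
    while high - low > tolerance:            # termination criterion of the search
        left = high - ratio * (high - low)
        right = low + ratio * (high - low)
        if along(left) < along(right):
            high = right                     # minimum lies in [low, right]
        else:
            low = left                       # minimum lies in [left, high]
    return 0.5 * (low + high)

# Hypothetical surface E(w) = 0.5 * (w1^2 + 100 * w2^2).
def error(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

start = [1.0, 1.0]
direction = [-1.0, -100.0]                   # negative gradient at the start point
step = line_search(error, start, direction)
```

For this quadratic surface the optimal step can be derived by hand as 10001/1000001, which the search approaches to within the chosen tolerance. Each pass of the loop costs two error function evaluations, which illustrates the computational expense mentioned above.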
3.4.5.5 Conjugate gradient
The concept of choosing improved search directions is the main principle behind the conjugate gradient
algorithm [130, 152]. The conjugate gradient algorithm evaluates the performance of conjugate
directions with line search algorithms. The conjugate gradient algorithm is an iterative approach and
is applied with ease to applications having feature vectors with several dimensions. The conjugate
gradient algorithm operates under the assumption of a quadratic error function with a positive definite
Hessian matrix [130, Ch. 7 p. 276].
Although most data sets have a non-quadratic error surface, there is a high probability
that if the step size is small enough, the evaluation of $E(\vec{\omega}_i + \Delta\vec{\omega}_i)$ will fall on an error surface that is
approximately quadratic in its local vicinity. This may lead to fast convergence to a minimum. Under
similar reasoning, if the local vicinity of the error surface is non-quadratic, the conjugate gradient
algorithm will converge slowly to the minimum.
The performance of the conjugate gradient algorithm is dependent on the type of line search
algorithm used. Line search allows the conjugate gradient algorithm to find the step size without
evaluating the Hessian matrix.
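The interplay between conjugate directions and the line search can be sketched on a quadratic surface. The Fletcher-Reeves formula is used here for the direction update, which is an assumption, since the text does not state which formula is intended. The surface, the search bracket and the tolerance are likewise illustrative.

```python
import math

def error(w):
    """Hypothetical quadratic error surface with its minimum at the origin."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def gradient(w):
    return [w[0], 10.0 * w[1]]

def line_search(start, direction, upper=2.0, tolerance=1e-9):
    """Golden-section search for the step in [0, upper] minimising the error."""
    def along(step):
        return error([s + step * d for s, d in zip(start, direction)])
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    low, high = 0.0, upper
    while high - low > tolerance:
        left = high - ratio * (high - low)
        right = low + ratio * (high - low)
        if along(left) < along(right):
            high = right
        else:
            low = left
    return 0.5 * (low + high)

def conjugate_gradient(weights):
    g = gradient(weights)
    direction = [-gi for gi in g]             # first direction: steepest descent
    for _ in range(len(weights)):             # N steps suffice on a quadratic surface
        step = line_search(weights, direction)
        weights = [w + step * d for w, d in zip(weights, direction)]
        g_new = gradient(weights)
        beta = sum(x * x for x in g_new) / sum(x * x for x in g)   # Fletcher-Reeves
        direction = [-gn + beta * d for gn, d in zip(g_new, direction)]
        g = g_new
    return weights

solution = conjugate_gradient([1.0, 1.0])
```

The step sizes come from error function evaluations alone, so the Hessian matrix is never formed. The accuracy of the line search limits how closely the directions remain conjugate, which is one reason the choice of line search affects performance.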
3.4.6 Second order training algorithms
The successive use of the local gradient vector as the search direction does not always result in the
optimal search trajectory. The local gradient does not necessarily point directly at the minimum, which
may cause oscillating behaviour in a steepest descent algorithm. This slow progression to the minimum
can even be present with a quadratic error surface for poorly conditioned networks. The convergence
speed can be improved by evaluating and choosing superior search directions while propagating down
the error surface.
3.4.6.1
Newton method
The Newton method is an algorithm that calculates the Newton direction by assuming a positive definite
Hessian matrix and a quadratic error surface. The trajectory from the current weight to a nearby
minimum is known as the Newton direction. There are three obstacles when using the Newton method
[130, Ch. 7 p. 286]:
1. The calculation of the Hessian matrix is computationally expensive for a non-linear MLP which
requires O(P |~ω|^2) operations to compute, where P is the number of feature vectors to evaluate
and |~ω | is the dimension of the weights.
2. The calculation of the inverted Hessian matrix is also computationally expensive, as it requires
O(|~ω|^3) operations to compute.
3. If the Hessian matrix is not positive definite, the Newton direction can point to a maximum or
a saddle point rather than a minimum.
The third obstacle can be resolved by using a model trust region approach that adds a positive
definite symmetrical matrix to the Hessian matrix [130, Ch. 7 p. 287], which is expressed as
\[
H_{\mathrm{new}} = H_{\mathrm{old}} + A I. \tag{3.33}
\]
The matrix Hold is the current Hessian matrix and Hnew is the adjusted Hessian matrix. The identity
matrix is denoted by I and A denotes a constant factor. Equation (3.33) yields approximately the Newton
direction if the constant factor A is set to a small value, and approximately the negative gradient
(steepest descent) direction if the constant factor A is set to a large value [130, Ch. 7 p. 287].
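The effect of the constant factor A can be illustrated with a small sketch that solves (H + A I) d = -g for a two-dimensional example. The Hessian H, the gradient g and the function names are hypothetical values chosen for illustration only.

```python
import math

def solve_2x2(M, rhs):
    """Solve M x = rhs for a 2x2 matrix using Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

def regularised_newton_direction(H, g, A):
    """Search direction -(H + A*I)^(-1) g, with H adjusted as in equation (3.33)."""
    H_new = [[H[0][0] + A, H[0][1]], [H[1][0], H[1][1] + A]]
    return solve_2x2(H_new, [-g[0], -g[1]])

def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    return (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))

# Hypothetical, poorly conditioned Hessian and local gradient.
H = [[10.0, 0.0], [0.0, 1.0]]
g = [1.0, 1.0]
neg_g = [-1.0, -1.0]

newton = regularised_newton_direction(H, g, A=0.0)    # pure Newton direction
small = regularised_newton_direction(H, g, A=1e-6)    # close to the Newton direction
large = regularised_newton_direction(H, g, A=1e6)     # close to the -g direction
```

With a large A the direction is almost parallel to the negative gradient, but its length shrinks roughly as 1/A, so A also acts as an implicit step size control.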
The last consideration is the step size along the Newton direction. The step size calculated within
the Newton method is made under the assumption that the error surface is quadratic in shape. Most
real data sets have non-quadratic error surfaces and when the step size is too large, the algorithm may
fail to converge.
3.4.6.2
Quasi-Newton method
A more practical implementation of the Newton method is the Quasi-Newton method. The
Quasi-Newton method is an approximation of the Newton method, as the Hessian matrix is
computationally expensive for complex neural networks [153]. The Quasi-Newton method
approximates the inverted Hessian matrix over several iterations, using only the first derivative of the
error function. After each iteration the estimated inverse Hessian matrix approximates more closely
the real inverse Hessian matrix for a given weight.
A popular Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
The BFGS algorithm updates the estimated Hessian matrix in each epoch to converge to the actual
Hessian matrix. The algorithm starts with the identity matrix to ensure that the minimum is tracked
and not the maximum. The length of the Newton step is calculated using a proper line search to
ensure stability. The accuracy of the line search is not as critical as it was with the conjugate gradient
algorithm [154].
The disadvantages of the Newton and the Quasi-Newton methods are the storage requirements
and the number of iterations to approximate the Hessian matrix [130, Ch. 7 p. 289]. Because of the
non-quadratic error surface of most data sets, the approximate Hessian matrix must be estimated after
each weight update to ensure correct minimisation of the error function. The second disadvantage of
these methods is the introduction of the model trust region constant factor A and the correct scaling of
this constant.
3.4.6.3
Levenberg-Marquardt algorithm
The last second order training algorithm that will be discussed in this section is the
Levenberg-Marquardt algorithm [155, 156]. Like the Quasi-Newton method, the Levenberg-Marquardt
algorithm approximates second-order information without computing the Hessian matrix. The
Levenberg-Marquardt algorithm is specifically designed to minimise the SSE. This is
accomplished by approximating the function in equation (3.5) with linearisation as
\[
F(\vec{x}_i, \vec{\omega}_i + \Delta\omega_i) \approx F(\vec{x}_i, \vec{\omega}_i) + \vec{J}_i \Delta\omega_i. \tag{3.34}
\]
The vector J~i is a gradient row vector of F with respect to ~ωi and is computed as
\[
\vec{J}_i = \frac{\partial F(\vec{x}_i, \vec{\omega}_i)}{\partial \vec{\omega}_i}. \tag{3.35}
\]
Substituting the approximation of equation (3.34) into equation (3.5) gives
\[
E(\vec{\omega} + \Delta\omega_i) = 0.5 \sum_{p=1}^{P} \left( F(\vec{x}^{p}, \vec{\omega}) + \vec{J}_i \Delta\omega_i - T_C^{p} \right)^{2}. \tag{3.36}
\]
By setting the derivative as
\[
\frac{\partial E(\vec{\omega} + \Delta\omega_i)}{\partial \Delta\omega_i} = 0, \tag{3.37}
\]
equation (3.36) can be expressed as
\[
(J^{T} J)\,\Delta\omega_i = J^{T}\, 0.5 \sum_{p=1}^{P} \left( F(\vec{x}^{p}, \vec{\omega}) - T_C^{p} \right)^{2}. \tag{3.38}
\]
The Jacobian matrix is denoted by J, with each row containing J~i . This Jacobian matrix contains the
first derivatives of the neural network’s error. Levenberg added a non-negative damping factor λdamp ,
which is adjusted at each epoch. This is expressed as
\[
(J^{T} J + \lambda_{\mathrm{damp}} I)\,\Delta\omega_i = J^{T}\, 0.5 \sum_{p=1}^{P} \left( F(\vec{x}^{p}, \vec{\omega}) - T_C^{p} \right)^{2}. \tag{3.39}
\]
A smaller damping factor λdamp value allows the algorithm to behave more like the Newton method,
while a larger damping factor λdamp value allows the algorithm to behave like the gradient descent
method.
If the damping factor λdamp value is set too high, the damping term dominates the inversion of
(J^T J + λdamp I) and the curvature information in J^T J contributes almost nothing to the algorithm.
Marquardt therefore introduced a scaling of each component of the gradient according to the
curvature. This results in the Levenberg-Marquardt equation given as
\[
(J^{T} J + \lambda_{\mathrm{damp}}\, \mathrm{diag}(J^{T} J))\,\Delta\omega_i = J^{T}\, 0.5 \sum_{p=1}^{P} \left( F(\vec{x}^{p}, \vec{\omega}) - T_C^{p} \right)^{2}, \tag{3.40}
\]
where the identity matrix I in equation (3.39) is replaced to ensure larger propagation in the desired
direction when the gradient becomes smaller.
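A minimal sketch of a damped update of this kind is given below for a hypothetical two-parameter model F(x, w) = w0 exp(w1 x). Several assumptions are made that do not come from the thesis: the right-hand side is taken in the usual Gauss-Newton residual form J^T (T - F), the damping factor is divided by 10 after a successful epoch and multiplied by 10 otherwise, and the model, data and function names are illustrative.

```python
import math

def model(x, w):
    """Hypothetical model F(x, w) = w0 * exp(w1 * x)."""
    return w[0] * math.exp(w[1] * x)

def jacobian_row(x, w):
    """Gradient of F with respect to (w0, w1) for a single input x."""
    e = math.exp(w[1] * x)
    return [e, w[0] * x * e]

def sse(xs, ts, w):
    """Sum of squared errors with the 0.5 factor used in the text."""
    return 0.5 * sum((model(x, w) - t) ** 2 for x, t in zip(xs, ts))

def lm_step(xs, ts, w, lam):
    """Solve (J^T J + lam * diag(J^T J)) dw = J^T (t - F) for two parameters."""
    a = b = d = r0 = r1 = 0.0
    for x, t in zip(xs, ts):
        j0, j1 = jacobian_row(x, w)
        res = t - model(x, w)
        a += j0 * j0
        b += j0 * j1
        d += j1 * j1
        r0 += j0 * res
        r1 += j1 * res
    a_d, d_d = a * (1.0 + lam), d * (1.0 + lam)      # Marquardt scaling of the diagonal
    det = a_d * d_d - b * b
    return [(r0 * d_d - b * r1) / det, (a_d * r1 - b * r0) / det]

def levenberg_marquardt(xs, ts, w, lam=1.0, epochs=50):
    """Accept a step only if it lowers the SSE; adapt the damping factor each epoch."""
    err = sse(xs, ts, w)
    for _ in range(epochs):
        dw = lm_step(xs, ts, w, lam)
        trial = [w[0] + dw[0], w[1] + dw[1]]
        trial_err = sse(xs, ts, trial)
        if trial_err < err:
            w, err, lam = trial, trial_err, lam / 10.0   # behave more like Newton
        else:
            lam *= 10.0                                  # behave more like gradient descent
    return w, err

# Synthetic, noise-free targets generated with w = (2.0, 0.5).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ts = [2.0 * math.exp(0.5 * x) for x in xs]
w_start = [1.0, 0.0]
initial_error = sse(xs, ts, w_start)
w_fit, final_error = levenberg_marquardt(xs, ts, w_start)
```

Because a step is accepted only when it reduces the SSE, the error never increases between epochs; the damping factor only decides how cautious the next trial step is.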
3.5
OTHER VARIANTS OF ARTIFICIAL NEURAL NETWORKS USED
FOR CLASSIFICATION
3.5.1
Radial basis function network
The radial basis function (RBF) network is another ANN that is discussed in this chapter [130, 157]. In
the case of the MLP, the hidden neurons create multi-dimensional hyperplanes to separate different
classes within the feature space. In the case of the RBF network, the network uses local kernel
functions, which are represented by a prototype vector within each hidden neuron to model different
classes. The activation of the hidden neurons is based on the distance from the prototype vector,
which in effect creates a multi-dimensional hypersphere. The RBF network can be used for
classification; the posterior class probabilities at the output of the network are computed as
\[
p(C_k \mid \vec{x}) = \sum_{d=1}^{D} \vec{\omega}_{kd}\, \varphi_d(\vec{x}). \tag{3.41}
\]
The RBF uses D basis functions that are denoted by ϕd . The ϕd basis function in the network’s hidden
neurons is expressed as a normalised basis function given by
\[
\varphi_d(\vec{x}) = \frac{p(\vec{x} \mid d)\, P(d)}{\sum_{e=1}^{E} p(\vec{x} \mid e)\, P(e)} = p(d \mid \vec{x}). \tag{3.42}
\]
The dth basis function evaluating feature vector ~x is denoted by ϕd (~x) [130, Ch. 5 p. 181]. The
denominator normalises the basis function by summing over all the basis functions within
the network with index e. The outputs of all the radial basis functions are linearly combined with a
weight vector to form an output vector. The weight vector for each output node is given by
\[
\vec{\omega}_{kd} = \frac{p(d \mid C_k)\, P(C_k)}{P(d)} = p(C_k \mid d). \tag{3.43}
\]
The radial basis function network can be designed in a fraction of the time required to train an MLP, but
requires a large sample of input vectors to train reliably [158].
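Equations (3.41) to (3.43) can be sketched for a one-dimensional feature with Gaussian basis functions. The Gaussian form of p(x | d), the means, standard deviations, priors and weights are all hypothetical values chosen for illustration; the weights are simply set so that each column behaves like p(C_k | d) and sums to one.

```python
import math

def gaussian(x, mean, std):
    """One-dimensional Gaussian density, assumed here for p(x | d)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def normalised_basis(x, means, stds, priors):
    """phi_d(x) = p(x|d) P(d) / sum_e p(x|e) P(e), as in equation (3.42)."""
    joint = [gaussian(x, m, s) * p for m, s, p in zip(means, stds, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def class_posteriors(x, means, stds, priors, weights):
    """p(C_k|x) = sum_d w_kd phi_d(x), as in equation (3.41)."""
    phi = normalised_basis(x, means, stds, priors)
    return [sum(w * p for w, p in zip(row, phi)) for row in weights]

# Three hypothetical basis functions and two classes.
means, stds, priors = [0.0, 2.0, 5.0], [1.0, 1.0, 1.5], [0.3, 0.3, 0.4]
weights = [[0.9, 0.6, 0.1],   # p(C_1 | d) for d = 1, 2, 3
           [0.1, 0.4, 0.9]]   # p(C_2 | d) for d = 1, 2, 3
phi = normalised_basis(1.0, means, stds, priors)
posteriors = class_posteriors(1.0, means, stds, priors, weights)
```

Because the basis functions are normalised and each weight column sums to one, the outputs sum to one and can be read as posterior class probabilities.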
3.5.2
Self organising map
FIGURE 3.7: The training of the SOM will map the gridded topological map to the training data set.
Another popular ANN design is the Self Organising Map (SOM) [159, 160]. The SOM is trained
with an unsupervised learning algorithm to convert a high dimensional data set to a lower dimensional
representation of the data, typically two-dimensional. The SOM converts the higher dimensional data
set to a lower dimension using a topological map that comprises prototype neurons. This topological
map is used to illustrate the relationship between feature vectors by placing similar feature vectors in
close vicinity to each other on the map and dissimilar feature vectors further apart. Each prototype
neuron has a prototype vector; these are comparable to weights in other ANNs, and are initialised to
either random samples or uniform subsampling of the feature vector set.
The training algorithm used on the SOM is a competitive learning algorithm which searches for
the part of the network that strongly responds to the given feature vector. The response is evaluated
by presenting a feature vector ~x to the SOM’s prototype neurons to determine the Euclidean distances
to all prototype vectors. The prototype neuron with the most similar prototype vector is termed
the best matching unit (BMU). The prototype vector within the BMU is adjusted towards the feature
vector. The prototype neurons in close vicinity of the BMU in the topological map are known as the
neighbouring neurons and are also updated to a certain degree towards the current feature vector. The
magnitude of the adaptation of the neighbouring neurons decreases with epochs and distance from the
BMU.
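A single competitive learning update can be sketched as follows. The Gaussian neighbourhood function, the learning rate, the neighbourhood radius and the one-dimensional grid are assumptions made for illustration; in practice both the rate and the radius would be reduced over the epochs.

```python
import math

def best_matching_unit(prototypes, x):
    """Index of the prototype vector closest to x in Euclidean distance."""
    dists = [math.dist(p, x) for p in prototypes]
    return dists.index(min(dists))

def som_update(prototypes, grid, x, rate, radius):
    """Move the BMU and its map neighbours towards the feature vector x."""
    bmu = best_matching_unit(prototypes, x)
    updated = []
    for p, pos in zip(prototypes, grid):
        grid_dist = math.dist(pos, grid[bmu])                 # distance on the map
        h = math.exp(-grid_dist ** 2 / (2.0 * radius ** 2))   # assumed Gaussian neighbourhood
        updated.append([pi + rate * h * (xi - pi) for pi, xi in zip(p, x)])
    return bmu, updated

# Three prototype neurons on a hypothetical one-dimensional map.
prototypes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
grid = [(0,), (1,), (2,)]
x = [0.2, 0.1]
bmu, updated = som_update(prototypes, grid, x, rate=0.5, radius=1.0)
```

The BMU receives the full update, while neurons further away on the map are adapted by a smaller fraction of their distance to the feature vector.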
A SOM is trained in batch mode, where all the feature vectors are presented to the network and
only the BMU is trained. A monotonically increasing penalty factor is added to that feature vector to
ensure that a particular feature vector does not dominate the training algorithm. In the beginning of
the training phase, the neighbourhood within the topological map is large, but with each
epoch the neighbourhood size shrinks within the map and the network converges (Figure
3.7). The creation of a topological map, particularly if the data are not intrinsically two-dimensional,
may lead to suboptimal placement of the feature vectors [130, Ch. 5 p. 188].
3.5.3
Hopfield networks
The third ANN briefly discussed is the Hopfield network. A Hopfield network is a recurrent network
with feedback loops between the outputs and the inputs [161–163]. The neurons in the Hopfield
network have binary threshold activation functions and the internal state of the network evolves
to a stable state that is a local minimum of the Lyapunov function. The Lyapunov function is a
monotonically decreasing energy function that puts less emphasis on the previous set of feature vectors
than on the current set of feature vectors. A Hopfield network is an associative memory, which enables
it to train on a set of target vectors. When a new set of feature vectors is presented, the
network settles into the activation pattern corresponding to the most closely resembling target vector
presented in the training phase. The drawback of the Hopfield network is that it can only retrieve all
the fundamental memorised target vectors [164].
3.5.4
Support vector machine
A Support Vector Machine (SVM) is a supervised learning algorithm that was developed at AT&T
Bell Laboratories in 1995. The SVM is based on the principle of structural risk minimisation, which involves
constructing a non-linear hyperplane with kernel functions to separate the feature space into several
output regions [129].
The SVM training algorithm attempts to fit a non-linear hyperplane through the feature space. It
focuses on maximising the distance between the decision boundary and the sets of feature vectors. The
SVM is a maximum margin classifier and does this by identifying the feature vectors within the feature
space that prohibit the training algorithm from increasing the margin between the output regions.
These feature vectors are called the support vectors within the feature space.
The SVM handles non-separable feature vectors by relaxing the constraints
on the hyperplane that maximises the separability. This is accomplished by including a cost function
in the separating marginal regions, which penalises the feature vectors that severely hinder the SVM’s
performance.
The advantage of an SVM is that it uses a weighted sum of kernel functions to separate the feature
vectors in the feature space. The kernel functions reduce the number of dimensions and decouple the
computational complexity of the SVM from the feature vector’s dimensionality. Another advantage is
that it is less prone to overfitting. If the hyperplanes are properly designed, the results of the SVM are
similar to a properly designed MLP classifier [165].
A disadvantage of the SVM is that the choice of kernel used in the algorithm is very important.
Several adjustable dimensions of the parameters are encapsulated within the kernel, which only leaves
the penalty parameter available for adjustment. Proper choice of kernel is still an active research field;
using prior knowledge during kernel selection usually improves performance. Further disadvantages
are potentially slow training and substantial memory usage during training. It is observed that the
speed is significantly reduced when training on larger data sets [129].
The last design consideration is the proper setting of the penalty term used to classify non-separable
feature vectors. This penalty term must be optimised either through brute force searching or another
heuristic search method.
3.6
DESIGN CONSIDERATION: SUPERVISED CLASSIFICATION
In this section a brief overview is given of some considerations when designing a supervised classifier.
The first consideration is the investigation of the input vector set {~x̃} and the desired output vector
set {~y }. The first question is whether a plausible mapping function exists that can successfully map
the input space to the output space with meaningful descriptors. Should the input vector set {~x̃} be
preprocessed into a feature vector set {~x} and should the output vector set {~y } be postprocessed to
improve overall performance? This analysis provides insight into all further design decisions.
On completing the analysis, the next step is finding a suitable supervised classifier. The choice
of ANN and the corresponding training algorithm is critical in finding acceptable performance in the
mapping. The reason why only acceptable performance is pursued, rather than optimal, is that finding
the best feature vector set and the optimal supervised classifier requires an exhaustive search, which is
not feasible in terms of computational costs.
Using a supervised classifier optimally entails the use of a proper training
algorithm. Training algorithms typically focus on monotonically decreasing the value of the error
function. Unfortunately, this type of training algorithm is more prone to becoming trapped in a local
minimum when small incremental steps are used. If the incremental step size is too large, the training
algorithm will overshoot the minimum. The convergence rate of the training algorithm is hindered even
more when the direction of the propagation in the error surface does not point to the minimum. Several
different training algorithms try to find the direction to the minimum since the local gradient does not
always point straight at the minimum.
The training algorithm utilises training patterns in two general methods: iterative and batch
learning. Batch learning is an offline learning method that evaluates all the available training patterns
before adapting the network. Iterative learning can either be online or offline, as it evaluates only
a single training pattern at a time before adapting the network [166]. An offline system stores all
its patterns in a data set, while an online system processes and then discards each pattern.
Another important consideration is that most ANNs are prone to overfitting. This can be controlled by
proper implementation of an early stopping criterion. The most common methods of stopping a training
algorithm are:
1. The preset number of epochs is reached.
2. The predetermined computational time has expired in the execution of the training algorithm.
3. The training algorithm is stopped when a predefined lower threshold of the error function is
reached.
4. The training algorithm is stopped when the first derivative of error function falls below a
predefined lower threshold.
5. The error on the validation data set (section 3.4.3) is minimised.
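The five stopping criteria listed above can be combined in a single check, sketched below. The function name, the threshold values and the use of a patience window to decide that the validation error has passed its minimum are assumptions made for illustration.

```python
def should_stop(epoch, elapsed, error, grad_norm, val_errors,
                max_epochs=1000, max_seconds=60.0,
                error_tol=1e-4, grad_tol=1e-6, patience=5):
    """Return the name of the first stopping criterion that is met, or None."""
    if epoch >= max_epochs:                 # 1. preset number of epochs
        return "max_epochs"
    if elapsed >= max_seconds:              # 2. computational time budget
        return "time_limit"
    if error <= error_tol:                  # 3. lower threshold on the error function
        return "error_threshold"
    if grad_norm <= grad_tol:               # 4. lower threshold on the first derivative
        return "gradient_threshold"
    # 5. validation error has not improved during the last `patience` epochs
    if (len(val_errors) > patience
            and min(val_errors[-patience:]) >= min(val_errors[:-patience])):
        return "validation_minimum"
    return None

# Validation error reached its minimum (0.30) more than five epochs ago.
history = [0.50, 0.40, 0.30, 0.35, 0.36, 0.37, 0.38, 0.39]
reason = should_stop(epoch=8, elapsed=1.0, error=0.5, grad_norm=0.1, val_errors=history)
```

When the validation criterion triggers, the weights stored at the epoch with the lowest validation error would normally be restored.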
It is commonly believed that an MLP with many hidden neurons has a high generalisation error, as
the network is more prone to overfit [130, Ch. 1 p. 14]. This excess capacity (large number of hidden
neurons) offers the MLP the ability to learn more complex models. If too much training is applied to an
MLP with excess capacity, it starts to learn the intrinsic noise within the data set. This is an undesirable
property in most applications of a supervised classifier and much emphasis is placed on limiting the
capacity of the network to prevent overfitting (Occam's razor). It is also commonly believed
that an MLP network with a large number of hidden neurons requires a large number of training vectors
(section 3.4.3) to find a suitable mapping function between the feature and output space [167].
This common belief was questioned when a contradiction was shown by Caruana et al. [168].
They showed that an MLP with excess capacity has a lower generalisation error than an MLP with sufficient
capacity. An MLP can be trained to map highly non-linear regions with a large number of hidden
neurons, but still have the ability to retain a proper mapping of the linear regions [168] with a limited
number of training patterns.
The concept is based on a slowly converging training algorithm that will first train the linear regions
and then progress to the non-linear regions. If a good stopping criterion is adhered to, the training
algorithm will terminate properly before it overfits. Some second-order methods, e.g. the conjugate
gradient algorithm, do not exhibit this property, as they converge fast, and will indeed
overfit if the network has excess capacity.
This behaviour is intrinsically built into the slower training algorithms, as the set of weights {~ω }
is usually initialised with small non-zero values and only after many epochs do certain values within
the weights tend to large values. This implies that the MLP first considers simple mapping functions
before exploring more complex functions [168, 169].
Small initial values are used within the weights to ensure that there is no saturation of the sigmoidal
activation function. This initialisation ensures that contours are created on the error surface when
backpropagation is applied in the training phase, otherwise the saturation of the sigmoidal activation
functions will create a very flat error surface.
The last design consideration is the choice of initial weights, which is very important in achieving
good results. A suitable initial choice has the potential of allowing the training algorithm to train
fast and efficiently. Even stochastic algorithms, such as gradient descent, which have the possibility
of escaping from local minima, can be sensitive to the initial weights used. This results in the
rule of thumb to run several training phases with different initial weights in parallel to evaluate the
performance of different minima [130, Ch. 7 p. 260].
The ANN used in this thesis is the MLP trained with stochastic gradient descent, as used by Caruana et
al. [168]. The gradient descent uses a learning rate and a momentum parameter in the training process to
speed up convergence and a validation data set to apply proper early stopping.
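The role of the two parameters can be sketched with a generic momentum update. The update form, the parameter values and the one-dimensional quadratic error used as a test case are assumptions for illustration and are not the configuration used in this thesis.

```python
def momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    """One gradient descent update with a momentum term."""
    # The velocity accumulates a decaying sum of past gradients.
    velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grad)]
    w = [wi + vi for wi, vi in zip(w, velocity)]
    return w, velocity

# Illustrative error E(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    grad = [w[0]]
    w, v = momentum_step(w, v, grad)
```

In stochastic training the gradient passed to the update would be computed from a single training pattern rather than from the full error function.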
3.7
SUMMARY
This chapter presented a methodology for designing a supervised classifier for real world applications.
Emphasis was placed on the design of a proper mapping function between the input and output space.
The mapping function’s fit was then measured using a suitable error function. The performance of the
classifier improves when a training method is used which adapts the network to minimise the error
function.
This can be seen as a regression approach to determine the relationship between the dependent
and independent variables within the network. The output values produced by the network can be
interpreted as a set of posterior class probabilities under certain assumptions. The chapter concludes
with a range of good practice notes on how to design and develop a good supervised classifier.
CHAPTER FOUR
UNSUPERVISED CLASSIFICATION
4.1
OVERVIEW
In this chapter a brief overview is given of the notion of grouping objects into different categories
without any supervision. The previous chapter described a supervised approach to grouping objects and
how the relationship between the desired class membership and input vectors was derived using labels.
The possibility is now explored of grouping objects based on their perceived intrinsic similarities.
A formal definition is provided on an unsupervised method known as clustering, followed by the
advantages of exploring an unsupervised approach. The design considerations behind producing good
clustering results are then explored, followed by the challenges inherent when using clustering methods
to solve real world problems.
Clustering algorithms are broadly divided into hierarchical and partitional clustering approaches
[40, 170]. Four popular hierarchical clustering methods and two partitional clustering methods are
discussed with their corresponding properties. The chapter concludes with a discussion on how clusters
can be converted to classes to obtain a supervised classifier.
4.2
CLUSTERING
Clustering is a form of conceptual clustering, which is an unsupervised method used for grouping
unlabelled input vectors into a set of categories. Clustering groups a set of input vectors through
perceived intrinsically similar or dissimilar characteristics.
Let {y k }, y k ∈ N, 1 ≤ y k ≤ K, denote the set of cluster labels. Let FC : Rn → {y k } denote the
function that maps the input vector ~x̃ p , ~x̃ p ∈ Rn , to a cluster label. The variable p denotes the index of
the vector within the input vector set. The function FC is said to cluster the input vector set {~x̃ p } into
K clusters.
Several motivations exist to justify the use of clustering algorithms for many non-synthetic data
sets:
1. Significant costs are involved when gathering information about the data set to create reliable
class labels for supervised classification.
2. The underlying data structure of a large unlabelled data set can be captured to provide reliable
clustering on a smaller labelled data set.
3. A dynamic input space can be accommodated. This is when the input space changes over time or in
response to a triggered event.
4. Clustering assists in creating a well-conditioned input vector from the input space to gain insight into
what improves the cluster label allocation.
4.2.1
Mapping of vectors to clusters
A cluster label is derived by evaluating several different input data sources from the input space. These
data sources are grouped together to form an input vector ~x̃. These input vectors are the same as with
the supervised classifier and have descriptive forms that can be interpreted. The preprocessing and
postprocessing of the input and output vectors is an optional procedure used to improve the clustering
algorithm’s performance [136]. Using feature vectors ~x and postprocessed output value y is assumed
to improve the performance significantly and is used throughout this chapter.
The clustering algorithm constructs a function FC to determine the cluster label and is based on the
set of feature vectors {~x p }. The mapping function is expressed as
y k = FC (~x p ).
(4.1)
The clusters typically encapsulate properties of the non-synthetic data set; each cluster should have a
homogeneous set of feature vectors.
4.2.2
Creating meaningful clusters
No theoretical guideline exists on how to extract the optimal feature vector set from the input vector
set for a specific clustering application. Owing to the limited intrinsic information within the feature
vector set, it is difficult to design a clustering algorithm that will find clusters to match the desired
cluster labels.
This constraint arises because a clustering algorithm tends to find clusters in the feature space
irrespective of whether any real clusters exist. This constraint motivates the notion that any two
arbitrary patterns can be made to appear equally similar when evaluating a large number of dimensions
of information in the feature space. This will result in defining a meaningless clustering function
FC . This makes clustering a subjective task in nature, which can be modified to fit any particular
application.
FIGURE 4.1: An aerial photo taken in the Limpopo province, South Africa, of two different land cover
types, which are indicated by a natural vegetation segment and a settlement segment. A segment is
defined as a collection of pixels within a bounding box of predefined size.
The advantage in this versatility is that the clustering algorithm can be used as either an exploratory
or a confirmatory analysis tool [170]. Clustering used as an exploratory analysis tool is there to explore
the underlying structures of the data. No predefined models or hypotheses are needed when exploring
the data set. Clustering used as a confirmatory analysis tool is to confirm any set of hypotheses or
assumptions. In certain applications, clustering is used as both; first to explore the underlying structures
to form new hypotheses. Second, to test these hypotheses by clustering the feature vector set. This
makes clustering a data-driven learning algorithm and any domain knowledge that is available can
improve the forming of clusters [170].
Domain knowledge is used to reduce complexity by aiding in processes such as feature selection
and feature extraction. Proper domain knowledge leads to good feature vector representation that will
yield exceptional performance with the most common clustering algorithms, while incomplete domain
knowledge leads to poor feature vector representation that will only yield acceptable performance with
a complex clustering algorithm.
An aerial photo is used to illustrate the clustering of different land cover types in figure 4.1. In this
image two land cover types are of interest: natural vegetation and human settlement.
FIGURE 4.2: A two-dimensional illustration of feature vectors within the feature space. The green
feature vectors represent the natural vegetation class and the red feature vectors represent the human
settlement class.
Land cover example: In the case of the land cover example shown in figure 4.1, domain knowledge is
used for feature extraction and selection. Let it be assumed that the domain knowledge provided
information that the feature vector given in equation (4.2) will provide better separability
between the two categories.
~x = [(Moisture) (Reflectivity)].
(4.2)
The natural vegetation segments have feature vectors with low reflectivity and high moisture
levels, while the human settlement segments have feature vectors with high reflectivity and low
moisture levels. This is illustrated in a two-dimensional plot shown in figure 4.2. When natural
clusters exist in the feature space and the number of clusters is set to K=2, a well-designed
clustering algorithm will produce two perfect clusters, as shown in figure 4.2. □
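The land cover example can be sketched with a plain k-means procedure and K=2. The choice of k-means, the (moisture, reflectivity) values and the initial centroids are hypothetical and serve only to show two well-separated natural clusters being recovered.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:                                   # keep old centroid if cluster is empty
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical (moisture, reflectivity) feature vectors.
vegetation = [(0.8, 0.2), (0.9, 0.1), (0.7, 0.3)]   # high moisture, low reflectivity
settlement = [(0.2, 0.8), (0.1, 0.9), (0.3, 0.7)]   # low moisture, high reflectivity
points = vegetation + settlement

# Deterministic initialisation from the first and the last feature vector.
initial = [list(points[0]), list(points[-1])]
labels, centroids = kmeans(points, initial)
```

With these separable values the three vegetation vectors receive one cluster label and the three settlement vectors the other.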
Domain knowledge in many fields is incomplete or unavailable. Verifying the domain knowledge
from actual (non-synthetic) data sets is extremely resource-expensive and is difficult to relate to the
feature space. The most practical approach for designing an unsupervised learning algorithm is to
learn from example [171]. The learning from example approach requires that the clustering algorithm
be subjected to an external evaluation process. The external evaluation is hampered by the fact that
thousands of different clustering algorithms have been developed and evidence suggests that none of
them is superior to any other [172]. This is addressed in the impossibility theorem, which states three
criteria that no clustering algorithm can satisfy simultaneously [172]. The three criteria in the
impossibility theorem are:
1. Scale invariance; the scaling of the feature vectors should not change the assigned cluster labels.
2. Richness; the clustering algorithm must be able to achieve all possible partitions in the feature
space.
3. Consistency; the change in distance within all clusters will not change the assigned cluster labels.
Based on the impossibility theorem, each clustering application is different and requires a unique
design to obtain good clustering results. This emphasises the importance of obtaining acceptable
performance in the search for a clustering algorithm, as it is infeasible to search through all the
permutations of clustering designs.
The admissibility criterion is a more practical approach to
consider when applying external evaluation to a clustering algorithm [170]. The admissibility criterion
comprises three important design considerations:
1. The manner in which the clusters are formed.
2. The intrinsic structure of the feature vectors.
3. The sensitivity of the clusters created.
4.2.3
Challenges of clustering
Humans cluster with ease in two and three dimensions, while a machine learning method is required to
cluster in higher dimensions. Several design implications arise when clustering in higher dimensions
[171]:
- Determining the number of clusters K (section 4.6).
- Determining whether the feature vectors carry representative information to produce clusters that
will hold a relation to the desired classes for the application (section 4.2.2).
- Deciding which pairwise similarity metric should be used to evaluate the feature space
(section 4.3).
- Determining how the feature vectors should be evaluated to form clusters. Clustering algorithms
are broadly divided into hierarchical and partitional clustering approaches [40, 170]. The first
approach is hierarchical clustering, which produces a nested hierarchy of clusters of discrete
groups (section 4.4). The second approach is partitional clustering, which creates an unnested
partitioning of the data points with K clusters [173] (section 4.5).
4.3
SIMILARITY METRIC
A clustering algorithm defines clusters with feature vectors that are similar to one another, and separates
them from feature vectors that are dissimilar. This similarity between feature vectors is usually
measured using a distance function.
Let {~x}, ~x ∈ RN denote a set of N -dimensional feature vectors. Let D : RN × RN → R+ denote the
distance function that calculates the distance between the vectors ~x p and ~x q . The function D is said to
return the distance (similarity metric) between the two feature vectors.
The properties of the distance function D are:
- Non-negative, D(~x p , ~x q ) ≥ 0.
- Identity axiom, D(~x p , ~x q ) = 0, iff p = q.
- Triangle inequality, D(~x o , ~x p ) + D(~x p , ~x q ) ≥ D(~x o , ~x q ).
- Symmetry axiom, D(~x p , ~x q ) = D(~x q , ~x p ).
The non-negative and identity axioms produce a positive definite function. The distance metric is
as important in the design as the clustering algorithm itself. Proper selection of a distance metric
will result in the distance between feature vectors of the same cluster being smaller than the distance
between the feature vectors of other clusters.
Choosing a distance function opens a broad class of distance metrics. The first to consider is the
general Minkowski distance, which is used to derive some of the most common distance functions used
in clustering applications. The Minkowski distance Dmink is expressed as
$$D_{\mathrm{mink}}(\vec{x}^p, \vec{x}^q) = \left( \sum_{n=1}^{N} |x_n^p - x_n^q|^m \right)^{\frac{1}{m}}. \tag{4.3}$$
The variable $m$, $m \in \mathbb{N}$, is the Minkowski parameter that is used to adjust the nature of the
distance metric. The Minkowski distance simplifies to the popular Euclidean distance $D_{\mathrm{ed}}$ if the
Minkowski parameter $m$ is set to 2 in equation (4.3). The Euclidean distance is computed as
$$D_{\mathrm{ed}}(\vec{x}^p, \vec{x}^q) = \sqrt{\sum_{n=1}^{N} |x_n^p - x_n^q|^2}. \tag{4.4}$$
The advantage of the Euclidean distance is that it is invariant to a translation or rotation applied to
all the feature vectors in the feature space. The Euclidean distance does, however, vary under an
arbitrary linear transformation.
The squared Euclidean distance is an alteration to the Euclidean distance, as it places a greater
weight on a set of vectors that are considered to be outliers in the vector space. The squared Euclidean
distance is expressed as
$$D_{\mathrm{sq}}(\vec{x}^p, \vec{x}^q) = \sum_{n=1}^{N} |x_n^p - x_n^q|^2. \tag{4.5}$$
If the Minkowski parameter is set to $m = 1$, equation (4.3) simplifies to the Manhattan distance. The
Manhattan distance is the sum of the absolute differences between the elements of the two vectors. The
Manhattan distance is expressed as
$$D_{\mathrm{man}}(\vec{x}^p, \vec{x}^q) = \sum_{n=1}^{N} |x_n^p - x_n^q|. \tag{4.6}$$
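The Mahalanobis distance metric is used in statistics to measure the distance between multivariate
vectors while taking the correlations between their elements into account. The Mahalanobis distance
metric $D_{\mathrm{mahal}}$ is expressed as
$$D_{\mathrm{mahal}}(\vec{x}^p, \vec{x}^q) = \sqrt{(\vec{x}^p - \vec{x}^q)\, G_{\mathrm{mahal}}^{-1}\, (\vec{x}^p - \vec{x}^q)^{\mathsf{T}}}, \tag{4.7}$$
where $G_{\mathrm{mahal}}$ denotes the covariance matrix and the feature vectors are treated as row vectors.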
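The distance functions in equations (4.3) to (4.7) can be illustrated with a minimal Python sketch. The function names are chosen here for illustration only, and the Mahalanobis variant assumes that the inverse covariance matrix has already been computed and is passed in as a nested list.

```python
import math


def minkowski(xp, xq, m):
    """Minkowski distance of order m between two equal-length vectors (eq. 4.3)."""
    return sum(abs(a - b) ** m for a, b in zip(xp, xq)) ** (1.0 / m)


def euclidean(xp, xq):
    """Euclidean distance: Minkowski distance with m = 2 (eq. 4.4)."""
    return minkowski(xp, xq, 2)


def squared_euclidean(xp, xq):
    """Squared Euclidean distance (eq. 4.5)."""
    return sum((a - b) ** 2 for a, b in zip(xp, xq))


def manhattan(xp, xq):
    """Manhattan distance: Minkowski distance with m = 1 (eq. 4.6)."""
    return minkowski(xp, xq, 1)


def mahalanobis(xp, xq, g_inv):
    """Mahalanobis distance (eq. 4.7); g_inv is an assumed, precomputed inverse covariance matrix."""
    d = [a - b for a, b in zip(xp, xq)]
    n = len(d)
    # Quadratic form d * G^{-1} * d^T, written out element by element.
    quad = sum(d[i] * g_inv[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


xp, xq = [0.0, 0.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean(xp, xq), manhattan(xp, xq), squared_euclidean(xp, xq))
print(mahalanobis(xp, xq, identity))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a quick sanity check of the implementation.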
4.4 HIERARCHICAL CLUSTERING ALGORITHMS
A clustering algorithm uses a set of feature vectors $\{\vec{x}^p\}$, cluster parameters and a similarity
metric to construct a mapping function $F_C$. Let $\vartheta = \left(\bigcup_{q=1}^{Q} \vartheta^q\right)$
denote the set of cluster parameters that the clustering algorithm needs to determine when constructing
$F_C$.
As stated previously, clustering algorithms are broadly divided into either a hierarchical or
partitional clustering approach [40, 170]. The hierarchical clustering approach produces a nested
hierarchy of clusters of discrete groups according to a certain linkage criterion. The nested clusters are
recursively linked in either an agglomerative mode or divisive mode. The second approach to clustering
is partitional clustering, which creates an unnested partitioning of the vectors into K clusters [173]. In
hierarchical clustering using an agglomerative mode, the clustering parameter set {ϑ} is determined
iteratively in four steps:
Step 1: The clustering algorithm starts by allocating each feature vector to its own cluster. The
initialisation phase is defined as
$$\vartheta_I^p = \vec{x}^p, \quad \forall p \text{ and } I = 0. \tag{4.8}$$
The variable $\vartheta_I^p$ denotes the $p$th set of cluster parameters at epoch $I$, with $I$ set to
zero for the initialisation phase. The vector $\vec{x}^p$ denotes the $p$th feature vector.
Step 2: The similarity between two clusters is defined by a linkage criterion. The linkage criterion
evaluates two clusters using a similarity metric (section 4.3) to compute the dendrogrammatic
distance $T(\vartheta_I^l, \vartheta_I^k)$. The dendrogrammatic distance is computed as
$$T(\vartheta_I^l, \vartheta_I^k) = \beta(\vartheta_I^l, \vartheta_I^k), \tag{4.9}$$
where the linkage criterion is denoted by the function $\beta$,
$\beta \in \{T_{\mathrm{sing}}, T_{\mathrm{com}}, T_{\mathrm{ave}}, T_{\mathrm{ward}}\}$.
This expression states that all the feature vectors in cluster $y^l$ must be compared to all the feature
vectors in cluster $y^k$ using a predefined argument. The linkage criterion's function $\beta$ returns a
dendrogrammatic distance between the two clusters.
Step 3: Select the shortest dendrogrammatic distance $T(\vartheta_I^l, \vartheta_I^k)$ between all pairs of
clusters. Let $\vartheta_I^{l^*}$ and $\vartheta_I^{k^*}$ be selected such that
$$[\vartheta_I^{l^*}, \vartheta_I^{k^*}] = \operatorname*{argmin}_{l,k \in [1,K];\; l \neq k} T(\vartheta_I^l, \vartheta_I^k). \tag{4.10}$$
Step 4: Merge the two clusters with index $l^*$ and $k^*$ as
$$\vartheta_{(I+1)}^{l^*} = \vartheta_I^{l^*} \cup \vartheta_I^{k^*}, \tag{4.11}$$
$$\vartheta_{(I+1)}^{k^*} = \emptyset. \tag{4.12}$$
Steps 2–4 are repeated until all the clusters are merged into a single cluster. The sequence of
merging clusters can be graphically presented by a tree diagram, called a dendrogram. The dendrogram
is a multi-level hierarchy with two clusters merging at each level.
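The four steps of the agglomerative mode can be sketched as follows. This is a minimal illustration rather than an optimised implementation: it assumes the Euclidean distance and the single linkage criterion as defaults, and the function names and the example points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def single_linkage(cl, ck):
    """Shortest distance between any vector in cl and any vector in ck."""
    return min(euclidean(a, b) for a in cl for b in ck)


def agglomerative(vectors, linkage=single_linkage):
    """Return the merge history as (members_l, members_k, distance) tuples."""
    # Step 1: every feature vector starts in its own cluster (eq. 4.8).
    clusters = {p: [p] for p in range(len(vectors))}
    merges = []
    while len(clusters) > 1:
        # Steps 2 and 3: evaluate the linkage criterion for every pair of
        # clusters and keep the pair with the shortest distance (eq. 4.10).
        best = None
        keys = sorted(clusters)
        for i, l in enumerate(keys):
            for k in keys[i + 1:]:
                t = linkage([vectors[p] for p in clusters[l]],
                            [vectors[q] for q in clusters[k]])
                if best is None or t < best[0]:
                    best = (t, l, k)
        t, l, k = best
        merges.append((sorted(clusters[l]), sorted(clusters[k]), t))
        # Step 4: merge cluster k into cluster l and remove cluster k
        # (eqs. 4.11 and 4.12).
        clusters[l] = clusters[l] + clusters.pop(k)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 0.5), (10.0, 10.0)]
for members_l, members_k, dist in agglomerative(points):
    print(members_l, members_k, round(dist, 3))
```

The returned merge history contains the information needed to draw a dendrogram: each entry records which two clusters were joined and at what dendrogrammatic distance.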
FIGURE 4.3: An alternative selection of five new segments of the aerial photo taken in the Limpopo
province, which indicates different land cover types.
Land cover example: Five new segments are defined in figure 4.3. A hierarchical clustering
algorithm operating in agglomerative mode creates the dendrogram shown in figure 4.4 when
applied to the five segments. In the first iteration the similarity between segment 4 and segment
5 is the highest (shortest dendrogrammatic distance). These segments are merged to form a
new cluster. The dendrogrammatic distances between the merging clusters are indicated on the
vertical axis. The shorter the distance on the vertical axis, the more similar the two joining
clusters. In the second iteration, segment 1 and segment 3 are joined as being the next most
similar clusters. These two newly formed clusters are then joined together, as they are more similar
to each other than to segment 2. Finally, segment 2 is joined to form a single cluster containing all
segments, which completes the dendrogram.
In the divisive mode, the clustering algorithm starts by placing the entire feature vector set in a
single cluster. In this mode, a comparison is made between all the feature vectors within the cluster
to determine which feature vectors are the most dissimilar, and the cluster is split into two separate
clusters. This process is repeated until every cluster contains a single feature vector. The sequence
of separating the clusters is also represented on a dendrogram. Only the agglomerative mode was
considered, as it is a bottom-up approach and the concept can easily be derived for a divisive mode
by applying the same methodology in a top-down approach.
FIGURE 4.4: An illustration of a hierarchical clustering approach operating in agglomerative mode.
(Dendrogram; vertical axis: dendrogrammatic distance; horizontal axis: cluster index, in the order 4, 5, 1, 3, 2.)
4.4.1 Linkage criteria
4.4.1.1 Single linkage criterion
The merging of clusters is based on the dendrogrammatic distance between clusters. The
dendrogrammatic distance is computed using a linkage criterion. The single linkage criterion is the
first linkage criterion that is considered, as it searches for the shortest distance between two feature
vectors, each residing in a different cluster. The single linkage criterion
$T_{\mathrm{sing}}(\vartheta_I^l, \vartheta_I^k)$ is expressed as
$$T_{\mathrm{sing}}(\vartheta_I^l, \vartheta_I^k) = \min\{D(\vec{x}^p, \vec{x}^q)\} \quad \forall \vec{x}^p \in \vartheta_I^l,\ \vec{x}^q \in \vartheta_I^k \text{ and } l \neq k. \tag{4.13}$$
The variable $\vec{x}^p$ denotes the $p$th feature vector and $\vec{x}^q$ denotes the $q$th feature vector.
The similarity metrics shown in section 4.3 (equations (4.3) to (4.7)) or any other distance metric found
in the literature can be used as the distance metric $D(\vec{x}^p, \vec{x}^q)$. The single linkage criterion
has a chaining effect as a characteristic trait when forming clusters. This results in clusters that are
straggly and elongated in shape [174]. The advantage of elongated clusters is that they can extract
non-spherical (for example, concentric) clusters from the feature space.
4.4.1.2 Complete linkage criterion
The complete linkage criterion computes a dendrogrammatic distance by finding the maximum
possible distance between two feature vectors that reside in different clusters. The complete linkage
criterion $T_{\mathrm{com}}(\vartheta_I^l, \vartheta_I^k)$ is expressed as
$$T_{\mathrm{com}}(\vartheta_I^l, \vartheta_I^k) = \max\{D(\vec{x}^p, \vec{x}^q)\} \quad \forall \vec{x}^p \in \vartheta_I^l,\ \vec{x}^q \in \vartheta_I^k \text{ and } l \neq k. \tag{4.14}$$
The variable $\vec{x}^p$ denotes the $p$th feature vector and $\vec{x}^q$ denotes the $q$th feature vector.
The complete linkage criterion has the characteristic trait of forming tightly bounded compact clusters.
The complete linkage criterion creates more useful clusters in many actual (non-synthetic) data sets than
the single linkage criterion [170, 175].
4.4.1.3 Average linkage criterion
The average linkage criterion is the most intuitive linkage criterion, as it calculates a dendrogrammatic
distance between two clusters by finding the average distance among all pairs of feature vectors residing
in different clusters. The average linkage criterion $T_{\mathrm{ave}}(\vartheta_I^l, \vartheta_I^k)$ is
expressed as
$$T_{\mathrm{ave}}(\vartheta_I^l, \vartheta_I^k) = \frac{1}{|\vartheta_I^l|\,|\vartheta_I^k|} \sum_{\vec{x}^p \in \vartheta_I^l} \sum_{\vec{x}^q \in \vartheta_I^k} D(\vec{x}^p, \vec{x}^q), \quad l \neq k. \tag{4.15}$$
$|\vartheta_I^l|$ denotes the number of feature vectors in cluster $\vartheta_I^l$ and $|\vartheta_I^k|$
denotes the number of feature vectors in cluster $\vartheta_I^k$. The average linkage criterion is a
compromise between the complete linkage criterion's sensitivity to outliers and the chaining effect
produced by the single linkage criterion.
4.4.1.4 Ward criterion
The Ward criterion computes a dendrogrammatic distance between clusters by finding the clusters that
will maximise the coefficient of determination $R^2$ [176]. The Ward criterion
$T_{\mathrm{ward}}(\vartheta_I^l, \vartheta_I^k)$ is expressed as
$$T_{\mathrm{ward}}(\vartheta_I^l, \vartheta_I^k) = \sum_{p \in \vartheta_I^l \cup \vartheta_I^k} \left\| \vec{x}^p - E[\vartheta_I^l \cup \vartheta_I^k] \right\|^2 - \sum_{p \in \vartheta_I^l} \left\| \vec{x}^p - E[\vartheta_I^l] \right\|^2 - \sum_{p \in \vartheta_I^k} \left\| \vec{x}^p - E[\vartheta_I^k] \right\|^2. \tag{4.16}$$
The expected value (mean) of the feature vectors in a cluster $\vartheta$ is denoted by $E[\vartheta]$. The
Ward criterion attempts to minimise the increase in the within-cluster variance caused by each merge and
only uses the Euclidean distance. Most linkage criteria in the literature are variants of the single
linkage, complete linkage, average linkage or Ward criterion.
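The four linkage criteria in equations (4.13) to (4.16) can be compared on a pair of small clusters with the sketch below. The Euclidean distance is assumed for all four criteria, and the function names and the two example clusters are illustrative only.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def pairwise(cl, ck):
    """All distances between a vector in cl and a vector in ck."""
    return [euclidean(a, b) for a in cl for b in ck]


def t_single(cl, ck):
    return min(pairwise(cl, ck))                       # eq. 4.13


def t_complete(cl, ck):
    return max(pairwise(cl, ck))                       # eq. 4.14


def t_average(cl, ck):
    return sum(pairwise(cl, ck)) / (len(cl) * len(ck))  # eq. 4.15


def centroid(cluster):
    n = len(cluster)
    return [sum(v[d] for v in cluster) / n for d in range(len(cluster[0]))]


def sum_squared_error(cluster):
    mu = centroid(cluster)
    return sum(euclidean(v, mu) ** 2 for v in cluster)


def t_ward(cl, ck):
    # Increase in the sum of squared errors caused by merging (eq. 4.16).
    return sum_squared_error(cl + ck) - sum_squared_error(cl) - sum_squared_error(ck)


cl = [(0.0, 0.0), (0.0, 2.0)]
ck = [(4.0, 0.0), (4.0, 2.0)]
print(t_single(cl, ck), t_complete(cl, ck), t_average(cl, ck), t_ward(cl, ck))
```

For this example the single linkage returns the gap between the nearest pair, the complete linkage the diagonal between the farthest pair, and the average linkage a value in between, which mirrors the compromise described for the average linkage criterion.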
4.4.2 Cophenetic correlation coefficient
A dendrogram is created iteratively as the function FC is derived with a hierarchical clustering
algorithm. The dendrogram illustrates the dendrogrammatic distances obtained with the linkage
criterion (section 4.4.1). The cophenetic correlation coefficient is a statistical measure of correlation
between the dendrogrammatic distances and the similarity distances for all pairs of feature vectors
[177]. The cophenetic correlation coefficient is computed as
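$$D_{cc} = \frac{\sum_{q=2}^{P} \sum_{p=1}^{q-1} \left(D(\vec{x}^p, \vec{x}^q) - E[D(\vec{x}^p, \vec{x}^q)]\right)\left(T(\vartheta_0^l, \vartheta_0^k) - E[T(\vartheta_0^l, \vartheta_0^k)]\right)}{\sqrt{\left[\sum_{q=2}^{P} \sum_{p=1}^{q-1} \left(D(\vec{x}^p, \vec{x}^q) - E[D(\vec{x}^p, \vec{x}^q)]\right)^2\right] \left[\sum_{q=2}^{P} \sum_{p=1}^{q-1} \left(T(\vartheta_0^l, \vartheta_0^k) - E[T(\vartheta_0^l, \vartheta_0^k)]\right)^2\right]}}, \tag{4.17}$$
with $\vec{x}^p \in \vartheta_0^l$ and $\vec{x}^q \in \vartheta_0^k$. The function $D(\vec{x}^p, \vec{x}^q)$
denotes the distance between the feature vectors $\vec{x}^p$ and $\vec{x}^q$ as shown in section 4.3. The
function $T(\vartheta_0^l, \vartheta_0^k)$, $\vec{x}^p \in \vartheta_0^l$, $\vec{x}^q \in \vartheta_0^k$,
denotes the dendrogrammatic distance between the feature vectors $\vec{x}^p$ and $\vec{x}^q$ as shown in
equation (4.9), which is the distance at which the two feature vectors are first merged into the same
cluster. The higher the correlation, the better the dendrogram preserves the information of the feature
space when using a particular linkage criterion. The cophenetic correlation coefficient is used to
evaluate several different distance metrics and linkage criteria to determine which combination best
retains the original distances of the feature space in the dendrogram [177].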
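A minimal sketch of the cophenetic correlation coefficient is given below. It assumes the Euclidean distance and the single linkage criterion, builds the dendrogram with a naive agglomerative loop, and records for each pair of feature vectors the dendrogrammatic distance at which they are first merged. All function names and the example points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def single_linkage(cl, ck):
    return min(euclidean(a, b) for a in cl for b in ck)


def cophenetic_distances(vectors, linkage):
    """Merge distance at which each pair of vectors first shares a cluster."""
    clusters = [[p] for p in range(len(vectors))]
    coph = {}
    while len(clusters) > 1:
        # Find the pair of clusters with the shortest dendrogrammatic distance.
        t, a, b = min(
            (linkage([vectors[p] for p in ca], [vectors[q] for q in cb]), i, j)
            for (i, ca), (j, cb) in combinations(enumerate(clusters), 2)
        )
        for p in clusters[a]:
            for q in clusters[b]:
                coph[(min(p, q), max(p, q))] = t
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return coph


def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def cophenetic_correlation(vectors, linkage=single_linkage):
    coph = cophenetic_distances(vectors, linkage)
    pairs = sorted(coph)
    d = [euclidean(vectors[p], vectors[q]) for p, q in pairs]
    t = [coph[pair] for pair in pairs]
    return pearson(d, t)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 0.5), (10.0, 10.0)]
print(round(cophenetic_correlation(points), 3))
```

Passing a different linkage function to `cophenetic_correlation` allows several linkage criteria to be compared on the same set of feature vectors.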
4.5 PARTITIONAL CLUSTERING ALGORITHMS
A partitional clustering algorithm operates on the actual feature vectors, which significantly reduces
the storage space and computation required. This makes it more suitable for larger data sets than
hierarchical clustering [173].
Let $\{y^k\}$, $k \in \mathbb{N}$, $1 \leq k \leq K$, denote the set of cluster labels. Let
$F_C : \mathbb{R}^N \to \{y^k\}$ denote the function that maps feature vectors $\{\vec{x}\}$,
$\{\vec{x}\} \in \mathbb{R}^N$, onto the clusters. Then $F_C$ is said to cluster $\vec{x}$ into
$K$ clusters.
In a general case of partitional clustering, a set of clustering parameters is determined when
constructing the mapping function $F_C$. Let $\{\vartheta_I^k\}$, $\{\vartheta_I^k\} \in \Omega_\vartheta$,
denote the set of clustering parameters. The variable $k$, $1 \leq k \leq K$, denotes the index in the set
$\{\vartheta_I^k\}$ which refers to the cluster label $y^k$. The variable $I$ denotes the current epoch. The
partitional clustering algorithm uses a distance metric $D(\vec{x}^p, \vartheta_I^k)$ to measure the
distance between the $p$th feature vector $\vec{x}^p$ and cluster $y^k$. The feature vector $\vec{x}^p$ is
then mapped onto $\{y^k\}$ using the function $F_C$, such that
$$F_C(\vec{x}^p) = \operatorname*{argmin}_{y^k \in \{y^k\}} \left\{ D(\vec{x}^p, \vartheta_I^k) \right\}. \tag{4.18}$$
Intuitively, the function $F_C$ maps a vector $\vec{x}^p$ to the nearest cluster.
The function $F_C$ is constructed by determining the set of cluster parameters $\{\vartheta_I^k\}$ to
minimise the overall distance between a given set of feature vectors $\{\vec{x}\}$ and the $K$
corresponding clusters. One possible definition of this process is
$$\vartheta_I^{k^*} = \operatorname*{argmin}_{\vartheta_I^k \in \Omega_\vartheta} \left\{ \sum_{p=1}^{P} D\left(\vec{x}^p, \vartheta_I^{F_C(\vec{x}^p)}\right) \right\}. \tag{4.19}$$
The clustering algorithm simultaneously determines the parameters $\vartheta_I^k$ of each cluster, as well
as the cluster assignment of each feature vector $\vec{x}^p$.
4.5.1 K-means algorithm
The first partitional clustering algorithm explored is the popular K-means algorithm [178]. The
K-means algorithm attempts to find the center points of the natural clusters. The K-means clustering
algorithm accomplishes this by partitioning the feature vectors into K mutually exclusive clusters.
K-means is a heuristic, hill-climbing algorithm that attempts to converge to the center of mass
of the natural clusters. It can be viewed as a gradient descent approach which attempts to minimise the
sum of squared errors (SSE) between each feature vector and its nearest cluster centroid [179]. The
clusters created with the K-means algorithm are compact and isolated in nature.
Minimising the SSE has been shown to be an NP-hard problem, even for a two-cluster problem [180].
This gives rise to a variety of heuristic approaches to solving the problem for practical applications.
The most common method of implementing the K-means algorithm is Lloyd's approach. Lloyd's
approach is an iterative method which comprises three steps:
Step 1: Initialise a set of $K$ centroids $\{\vartheta_I^k\}$.
Step 2: Assign each feature vector to its closest centroid. This is accomplished by creating $K$ empty sets
$\vec{s}^k = \emptyset$, $k = 1, 2, \ldots, K$, for each of the corresponding centroids
$\{\vartheta_I^k\}$. The assignment step is expressed as
$$\vec{s}^k = \left\{ \vec{x}^p : D(\vec{x}^p, \vartheta_I^k) < D(\vec{x}^p, \vartheta_I^l),\ \forall l \neq k \right\}. \tag{4.20}$$
The vector $\vec{x}^p$ denotes the $p$th feature vector and $D$ denotes the distance function.
Step 3: The update step adjusts the centroids' position to minimise the sum of distances given in
equation (4.19). The adjustment is made for each centroid as
$$\vartheta_{(I+1)}^k = \frac{1}{|\vec{s}^k|} \sum_{\vec{x}^p \in \vec{s}^k} \vec{x}^p, \quad \forall k. \tag{4.21}$$
$|\vec{s}^k|$ denotes the number of elements in the set.
Steps 2–3 are repeated until all the feature vectors within each cluster remain unchanged or a predefined
stopping criterion is reached.
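The three steps of Lloyd's approach can be sketched as follows. The sketch assumes the Euclidean distance, initialises the centroids by sampling K distinct feature vectors at random, and breaks ties in the assignment step in favour of the lower cluster index; these choices, the function names and the example data are illustrative assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def kmeans(vectors, k, max_iter=100, seed=0):
    """Lloyd's approach: returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Step 1: initialise K centroids (here: K randomly sampled feature vectors).
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each feature vector to its closest centroid (eq. 4.20).
        new_assignment = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # no feature vector changed cluster
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members (eq. 4.21).
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(data, k=2)
print(centroids, assignment)
```

Running the function several times with different values of `seed` corresponds to the replications with different initial seeds that are commonly used to reduce the risk of ending in a poor local minimum.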
The performance of the K-means algorithm is dependent on the density distribution of the feature
vectors in the feature space. K-means will minimise the SSE with high probability to the global
minimum if the feature vectors are well separated [181]. The ability of the K-means algorithm to
handle a large number of feature vectors enables the parallel execution of multiple replications with
different initial seeds to avoid local minima. The K-means clustering algorithm is usually used as a
benchmark against other algorithms, and has been used successfully in many other fields [171].
4.5.2 Expectation-maximisation algorithm
The Expectation-Maximisation (EM) algorithm is another partitional clustering algorithm, which
attempts to fit a mixture of probability distributions on the set of feature vectors [182]. The EM
algorithm was designed on the assumption that the feature vectors are extracted from a feature space
with a multi-modal distribution.
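Given a set of observable vectors $\{\vec{x}\}$ and unknown variables $\{y^k\}$, the EM algorithm finds the
maximum likelihood or maximum a posteriori estimates for the parameters $\vec{\omega}$,
$\vec{\omega} \in \Omega$. The maximum likelihood estimation of the parameters $\vec{\omega}_{\mathrm{ML}}$
is expressed as
$$\vec{\omega}_{\mathrm{ML}} = \operatorname*{argmax}_{\vec{\omega} \in \Omega} \left\{ \log p(\vec{x}|\vec{\omega}) \right\} = \operatorname*{argmax}_{\vec{\omega} \in \Omega} \left\{ J(\vec{\omega}) \right\}. \tag{4.22}$$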
The log-likelihood of the conditional probability in equation (4.22) is expanded to incorporate the
unknown variables $y^k$ as
$$J(\vec{\omega}) = \log p(\vec{x}|\vec{\omega}) = \log \sum_{k} p(\vec{x}, y^k|\vec{\omega}) = \log \sum_{k} q(y^k|\vec{x}, \vec{\omega}) \frac{p(\vec{x}, y^k|\vec{\omega})}{q(y^k|\vec{x}, \vec{\omega})}. \tag{4.23}$$
The function $q(y^k|\vec{x}, \vec{\omega})$ is an arbitrary density over $y^k$. Applying Jensen's
inequality to equation (4.23) gives the lower bound
$$\log \sum_{k} q(y^k|\vec{x}, \vec{\omega}) \frac{p(\vec{x}, y^k|\vec{\omega})}{q(y^k|\vec{x}, \vec{\omega})} \geq \sum_{k} q(y^k|\vec{x}, \vec{\omega}) \log \frac{p(\vec{x}, y^k|\vec{\omega})}{q(y^k|\vec{x}, \vec{\omega})}, \tag{4.24}$$
which for convenience is rewritten as
$$J(\vec{\omega}) \geq \sum_{k} q(y^k|\vec{x}, \vec{\omega}) \log \frac{p(\vec{x}, y^k|\vec{\omega})}{q(y^k|\vec{x}, \vec{\omega})}. \tag{4.25}$$
It is easier for the EM algorithm to maximise the lower bound shown in equation (4.25) than to maximise
the log-likelihood directly. The EM algorithm iteratively adjusts the parameters of the distributions in
two steps. The first step is the expectation step (E-step), which calculates the expected log-likelihood
with respect to the conditional distribution of $y^k$ given $\vec{x}$ under the current estimate of the
parameter $\vec{\omega}$ as
$$q(y^k|\vec{x}, \vec{\omega})^{\mathrm{new}} = \operatorname*{argmax}_{q(y^k|\vec{x}, \vec{\omega})} \sum_{k} q(y^k|\vec{x}, \vec{\omega}) \log \frac{p(\vec{x}, y^k|\vec{\omega})}{q(y^k|\vec{x}, \vec{\omega})}. \tag{4.26}$$
Calculating the E-step requires the vector $\vec{\omega}$ to be fixed while attempting to optimise over
the space of distributions. The second step is the maximisation step (M-step), which tries to maximise
the vector $\vec{\omega}$ using the result from equation (4.26). The M-step is computed as
$$\vec{\omega}^{\mathrm{new}} = \operatorname*{argmax}_{\vec{\omega}} \sum_{k} q(y^k|\vec{x}, \vec{\omega})^{\mathrm{new}} \log \frac{p(\vec{x}, y^k|\vec{\omega})}{q(y^k|\vec{x}, \vec{\omega})^{\mathrm{new}}}. \tag{4.27}$$
The EM algorithm iterates through both steps until it converges to a local maximum. Each feature vector
is then assigned to the cluster that maximises its a posteriori probability.
The disadvantage of the EM algorithm is that, even though the likelihood of the feature vectors
does not decrease from one iteration to the next, the algorithm is not guaranteed to converge to the
global maximum for a multi-modal distribution. This implies that the EM algorithm can converge to a
local maximum. This risk can be reduced with multiple replications of the algorithm executed with
different initial seeds. The EM algorithm is well suited to operate on data sets that contain missing
vectors and data sets with low feature space dimensionality.
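The alternation between the E-step and the M-step can be illustrated with a minimal sketch for a mixture of one-dimensional Gaussian distributions. The choice of a Gaussian mixture, the variance floor, the function names and the example data are assumptions made here for illustration; the E-step uses the standard result that the bound in equation (4.25) is maximised by the posterior probability of each component.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns the fitted parameters,
    the cluster label of each sample and the log-likelihood per iteration."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    log_likelihoods = []
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample,
        # with the parameters held fixed.
        resp = []
        ll = 0.0
        for x in xs:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        log_likelihoods.append(ll)
        # M-step: re-estimate the parameters with the posteriors held fixed.
        for c in range(k):
            nk = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[c] = max(var, 1e-6)  # floor avoids a collapsing component
            weights[c] = nk / len(xs)
    # Assign each sample to the component with the highest posterior probability.
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return means, variances, weights, labels, log_likelihoods


samples = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights, labels, lls = em_gmm_1d(
    samples, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, labels)
```

The recorded log-likelihood values make it possible to verify that the likelihood does not decrease between iterations, and repeating the call with different initial means corresponds to the replications with different initial seeds.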
4.6 DETERMINING THE NUMBER OF CLUSTERS
The most difficult design consideration is to determine the correct number of clusters that should be
extracted from the data set. Hundreds of methods have been developed to determine the number of
clusters within a data set. The choice in determining the number of clusters K is always ambiguous
and is a distinct issue from the process of actually solving the unsupervised clustering problem.
If the number of clusters K is increased without penalty in the design phase, the number of incorrect
cluster assignments steadily decreases to zero, which defeats the purpose of clustering. In the extreme
case, each feature vector is assigned to its own cluster, which results in zero incorrect clustering
allocations. Intuitively, this makes the choice of the number of clusters a balance between the maximum
compression of the feature vectors into a single cluster and complete accuracy by assigning each feature
vector to its own cluster.
The silhouette value is used as a measure of how close each feature vector is to its own cluster
when compared to feature vectors in neighbouring clusters [183]. The silhouette value
$S(\vec{x}^p, K)$ for the feature vector $\vec{x}^p$ is computed as
$$S(\vec{x}^p, K) = \frac{\min\{S_{BD}(\vec{x}^p, l) - S_{WD}(\vec{x}^p)\}}{\max\{S_{WD}(\vec{x}^p), \min\{S_{BD}(\vec{x}^p, k)\}\}}, \quad \forall k, l. \tag{4.28}$$
The function $S_{WD}(\vec{x}^p)$ denotes the average distance from the feature vector $\vec{x}^p$ to the
other feature vectors in the same cluster. The cluster index is denoted by $k$, $k \in \mathbb{N}$,
$1 \leq k \leq K$, and $S_{BD}(\vec{x}^p, k)$ denotes the average distance from the feature vector
$\vec{x}^p$ to the feature vectors in the $k$th cluster. The average distance within the same cluster
$S_{WD}(\vec{x}^p)$ for the feature vector $\vec{x}^p$ is computed as
$$S_{WD}(\vec{x}^p) = \sum_{q=1}^{|\vartheta^{F_C(\vec{x}^p)}|} \frac{D(\vec{x}^p, \vec{x}^q)}{|\vartheta^{F_C(\vec{x}^p)}| - 1} \;:\; \forall \vec{x}^q \in \vartheta^{F_C(\vec{x}^p)} \setminus \vec{x}^p. \tag{4.29}$$
The variable $|\vartheta^{F_C(\vec{x}^p)}|$ denotes the number of feature vectors in the cluster where
$\vec{x}^p$ resides. The average distance between the feature vector $\vec{x}^p$ and the $k$th cluster is
computed as
$$S_{BD}(\vec{x}^p, k) = \sum_{q=1}^{|\vartheta^{F_C(\vec{x}^q)}|} \frac{D(\vec{x}^p, \vec{x}^q)}{|\vartheta^{F_C(\vec{x}^q)}|} \;:\; \forall \vec{x}^q \in \vartheta^{F_C(\vec{x}^q)},\ \vec{x}^q \notin \vartheta^{F_C(\vec{x}^p)},\ F_C(\vec{x}^q) = y^k. \tag{4.30}$$
The variable $|\vartheta^{F_C(\vec{x}^q)}|$ denotes the number of feature vectors within the $k$th cluster.
The silhouette value $S(\vec{x}^p, K)$ ranges from $-1$ to $1$. A silhouette value
$S(\vec{x}^p, K) \to 1$ indicates that the feature vector $\vec{x}^p$ is very distant from the
neighbouring clusters. A silhouette value $S(\vec{x}^p, K) \to 0$ indicates that the feature vector
$\vec{x}^p$ is close to the decision boundary between two clusters. A silhouette value
$S(\vec{x}^p, K) \to -1$ indicates that the feature vector $\vec{x}^p$ is probably in the wrong cluster.
A silhouette graph is a visual representation of the silhouette values and is a visual aid used to
determine the number of clusters. The x-axis denotes the silhouette values and the y-axis denotes the
cluster labels. The silhouette graph shown in figure 4.5 was created from a larger set of segments
defined in the example of land cover classification (figure 4.3). In this silhouette graph, cluster 3 has
high silhouette values, which implies that the feature vectors within cluster 3 are well
separated from the other two clusters. Cluster 1 also has high silhouette values, but with a few feature
vectors considered to be ill-positioned. Cluster 2 has significantly lower silhouette values and most
of its feature vectors are closely positioned at the boundary between clusters. This might suggest that
cluster 2 can be subdivided into two separate clusters.
FIGURE 4.5: A silhouette plot of the three clusters formed from the example given in figure 4.3.
(Horizontal axis: silhouette value; vertical axis: cluster index.)
An analytical method of deciding on the correct number of clusters $K$ is the computation of the
average of the silhouette values. The average silhouette value is calculated as
$$S_{\mathrm{ave}}(\{\vec{x}\}, K) = \frac{1}{P_{\max}} \sum_{p=1}^{P_{\max}} S(\vec{x}^p, K), \tag{4.31}$$
where $P_{\max}$ denotes the total number of feature vectors in the set $\{\vec{x}\}$. A range of $K$ can be
evaluated without any prior knowledge to determine the performance of the clustering algorithm. The
number of clusters $K$ that produces the highest average silhouette value is then selected.
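The silhouette value and its average can be sketched as follows. The sketch assumes the Euclidean distance, requires at least two clusters, and assigns a silhouette value of zero to a feature vector that is alone in its cluster; that convention, the function names and the example data are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def silhouette_values(vectors, labels):
    """Silhouette value of every feature vector for a given cluster labelling."""
    cluster_ids = sorted(set(labels))
    values = []
    for p, xp in enumerate(vectors):
        own = [q for q, lab in enumerate(labels) if lab == labels[p] and q != p]
        if not own:
            values.append(0.0)  # assumed convention for a singleton cluster
            continue
        # Average distance to the other vectors in the same cluster (S_WD).
        s_wd = sum(euclidean(xp, vectors[q]) for q in own) / len(own)
        # Smallest average distance to the vectors of any other cluster (min S_BD).
        s_bd = min(
            sum(euclidean(xp, vectors[q]) for q in members) / len(members)
            for k in cluster_ids if k != labels[p]
            for members in [[q for q, lab in enumerate(labels) if lab == k]]
        )
        values.append((s_bd - s_wd) / max(s_wd, s_bd))
    return values


def average_silhouette(vectors, labels):
    values = silhouette_values(vectors, labels)
    return sum(values) / len(values)


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(average_silhouette(data, [0, 0, 0, 1, 1, 1]))  # well-separated labelling
print(average_silhouette(data, [0, 1, 0, 1, 0, 1]))  # poor labelling
```

Evaluating `average_silhouette` on the labellings produced for a range of K, and keeping the K with the highest value, corresponds to the selection rule described above.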
4.7 CLASSIFICATION OF CLUSTER LABELS
Clusters typically encapsulate properties of the feature vector set and this homogeneous property
motivates the assignment of class labels to the clusters. The class labels are assigned using a supervised
classifier, which assigns a set of class labels $\{C_k\}$ to the $K$ cluster labels [171].
The supervised classifier assigns to each cluster the class label that occurs most frequently among
the feature vectors of the labelled training data set that fall within that cluster. Assigning the class
labels to the cluster labels with a supervised classifier is expressed as
$$C_k = Z(y^k). \tag{4.32}$$
Because a cluster does not necessarily represent a single class, feature vectors of a certain class
might end up in a cluster dominated by another class and therefore be assigned the wrong class label.
Land cover example: The clustering algorithm uses a function $F_C$ to assign a cluster label to each of
the two segments in figure 4.1. The supervised classifier is then used to assign a class label to
each of the clusters. In this example the number of clusters $K$ is set to two and the supervised
classifier will assign either the natural vegetation class or the human settlement class to the
cluster label. This is accomplished by mapping the cluster label $y^k$ as
$$C_k = \begin{cases} C_1 \text{ (natural vegetation)} & \text{if } y^k = 1 \\ C_2 \text{ (human settlement)} & \text{if } y^k = 2. \end{cases} \tag{4.33}$$
The cluster label $y^k$ is classified as natural vegetation when the label is in the first cluster and
human settlement when the label is in the second cluster.
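The assignment of class labels to cluster labels by the most frequently occurring class can be sketched as a majority vote. The function name and the convention that unlabelled feature vectors carry the value `None` are assumptions made for this illustration.

```python
from collections import Counter


def label_clusters(cluster_labels, class_labels):
    """Map each cluster label to the most frequent class label among the
    labelled feature vectors that fall in that cluster."""
    votes = {}
    for y, c in zip(cluster_labels, class_labels):
        if c is not None:  # unlabelled feature vectors do not vote
            votes.setdefault(y, Counter())[c] += 1
    return {y: counts.most_common(1)[0][0] for y, counts in votes.items()}


cluster_labels = [1, 1, 1, 2, 2, 2]
class_labels = ["natural vegetation", "natural vegetation", "human settlement",
                "human settlement", "human settlement", None]
mapping = label_clusters(cluster_labels, class_labels)
print(mapping)
```

In this example the third feature vector belongs to the human settlement class but falls in the cluster that is labelled natural vegetation, which illustrates how a feature vector can receive the wrong class label.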
4.8 SUMMARY
In this chapter a methodology was presented to aid in the design process of an unsupervised classifier.
The way in which a clustering method tends to find clusters in the feature space irrespective of whether
any real clusters exist was discussed. This shows that proper design criteria must be adhered to and the
most practical approach to designing a clustering method is to learn from example [171].
The design of the clustering method requires the simultaneous optimisation of the:
• feature extraction and feature selection,
• clustering algorithm, and
• similarity metric.
Six popular clustering algorithms were explored. These algorithms are based on basic concepts, which
explore the properties of the feature vectors. Thousands of clustering algorithms have been developed
in the last couple of decades and most of them only use different permutations and combinations of the
concepts defined in these six clustering algorithms. These basic concepts will provide insight into the
intrinsic properties of the feature vectors that populate a high-dimensional feature space.
CHAPTER FIVE
FEATURE EXTRACTION
5.1 OVERVIEW
In this chapter, four different feature extraction methods that could be used on time series are
investigated. The chapter starts with a discussion on how a series of images is used to create a
time series of reflectance values for a particular geographical area. The feature extraction
methods are then discussed, which are:
• EKF,
• least squares model fitting,
• M-estimator model fitting, and
• Fourier transform.
The EKF is a regression approach which uses a process model and an internal state space. The
least squares and M-estimator methods are regression approaches that aim to minimise the fitting error
(residuals) of a predefined model on a time series. The Fourier transform is a frequency analysis
approach, which decomposes time series into several harmonic frequencies.
5.2 TIME SERIES REPRESENTATION
A time series is a sequence of data points measured at successive (often uniformly spaced) time
intervals. A time series $\mathbf{x}$ of length $I$ is defined as
$$\mathbf{x} = [\vec{x}_1\ \vec{x}_2\ \ldots\ \vec{x}_I], \tag{5.1}$$
with
$$\vec{x}_i = [x_{i,1}\ x_{i,2}\ \ldots\ x_{i,T}]. \tag{5.2}$$
The variable $T$ denotes the number of elements in vector $\vec{x}_i$.
FIGURE 5.1: Multiple aerial photos are acquired in the Limpopo province at different time intervals
of the same geographical area. Natural vegetation and human settlement segments are mapped out to
form a set of time series.
The analysis of time series comprises methods that attempt to understand the underlying structure
of the data gathered. Analysing the structure allows the identification of patterns and trends, detection
of change, clustering, modelling and forecasting [40]. A time series which is extracted from multiple
images is used in this chapter to illustrate various concepts.
Land cover example: In figure 5.1, multiple aerial photos are acquired of the same geographical area
with segments mapped out over a duration of time. These segments illustrate an example of
two different land cover types which do not change over time. The two land cover types are
natural vegetation and human settlement. These hyper-temporal segments are processed to
provide a single reflectance value for a given geographical segment at each time interval. A
single reflectance value is obtained from a linear mixture of all the intensities within a segment.
The reflectance values for a segment create the time series shown in figure 5.2. It is observed that
the reflectance values in the time series undergo seasonal changes through the course of the year.
FIGURE 5.2: Time series consisting of reflectance values reported through time for a single image
segment shown in figure 5.1. (Vertical axis: reflectance value, approximately 400 to 1200; horizontal
axis: January 2002 to December 2002.)
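The construction of a time series from image segments can be sketched as follows. The sketch assumes that the linear mixture is an equally weighted average of the pixel intensities unless other weights are supplied; the weights, the function names and the pixel values are hypothetical.

```python
def segment_reflectance(pixels, weights=None):
    """Linear mixture of the pixel intensities within one segment.
    Equal weights are assumed when none are given."""
    if weights is None:
        weights = [1.0 / len(pixels)] * len(pixels)
    return sum(w * p for w, p in zip(weights, pixels))


def build_time_series(segment_images):
    """One reflectance value per acquisition date for a single segment."""
    return [segment_reflectance(pixels) for pixels in segment_images]


# Pixel intensities of the same segment at three acquisition dates.
acquisitions = [[400.0, 600.0], [800.0, 1000.0, 1200.0], [700.0, 700.0]]
print(build_time_series(acquisitions))
```

Each entry of the returned list corresponds to one time interval, so the list has the form of the time series defined in equation (5.1) for a single spectral band.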
5.3 STATE-SPACE REPRESENTATION
Numerous real world systems are approximated with an underlying process description. This process
determines the output of a system which is driven by an internal state. The behaviour at time
i of such a system can be predicted based on the information observed from the system at time
(i − 1). This description of a system’s internal operation is known as a state-space model. It was
originally developed by control engineers [184, Ch. 3 p. 41]. A state-space model is a mathematical
representation frequently used to model a system with a set of state-space variables. The state-space
model uses a set of state-space variables to predict the next output of the system.
The state-space variables in most applications are a function of time; as such the use of a time
domain representation is a convenient method for analysing the state-space model of a system [184,
Ch. 3 p. 41]. The current state is thus represented by a first-order differential equation in the time
domain. The assumption thus far has been that the process function used within the state-space model
and the set of state-space variables are known and that all the system's internal operations have been
incorporated. This is usually not the case, as both must be estimated. The resulting prediction of the
output is therefore erroneous, which makes it necessary to assess the accuracy of the model.
Assessing the accuracy of the state-space model requires the comparison of the actual system’s
output to the predicted output. The output is usually observed with the addition of noise [185, Ch. 1].
The noise is contributed by several factors, which include:
1. the limited description of the process function,
2. the state-space variables that are not estimated perfectly, and
3. any unknown internal or external source of noise.
This leads to two models required to express the dynamic model: the process model and observation
model. The process model is used to describe the adaptation of the state-space variables from time
$(i-1)$ to time $i$. The state-space variables are encapsulated at time $i$ in a state-space vector
$\vec{W}_i$ as
$$\vec{W}_i = [W_{i,1}\ W_{i,2}\ \ldots\ W_{i,S}], \tag{5.3}$$
where $S$ denotes the number of elements in the state-space vector. The adaptation of the state-space
vector is known as the prediction step. The state-space vector $\vec{W}_i$ for time $i$ is predicted
using the transition equation, which is given as
$$\vec{W}_i = f(\vec{W}_{i-1}) + \vec{z}_{i-1}. \tag{5.4}$$
The relation between $\vec{W}_i$ and $\vec{W}_{i-1}$ is described by a known transition function f.
A process noise vector $\vec{z}_{i-1}$ is added owing to the incomplete description ability inherent
in function f and/or any previous incorrect estimates of the state-space vector $\vec{W}_{i-1}$. The
noise vector $\vec{z}_{i-1}$ is assumed to be a stochastic vector with zero mean and covariance
matrix $Q_{i-1}$.
The observation model is used to describe the relation between the state-space vector $\vec{W}_i$
and the actual output of the system at time i. The actual output at time i is termed the observation
vector $\vec{x}_i$ and is used in the updating step. The updating step uses a measurement equation
which is given as
$$\vec{x}_i = h(\vec{W}_i) + \vec{v}_i. \tag{5.5}$$
The state-space vector $\vec{W}_i$ is related to the observation vector $\vec{x}_i$ by means of the
known measurement function h. The measurement function h and state-space vector $\vec{W}_i$ might
not be perfectly estimated. This is compensated for by including an observation noise vector
$\vec{v}_i$, where the noise vector $\vec{v}_i$ is a stochastic vector with zero mean and covariance
matrix $R_i$. Equations (5.4) and (5.5) are known as the state-space form of a linear dynamic model.
The time domain approach to state-space model representation provides an iterative model that
recursively processes each observation vector sequentially.
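As an illustration only, the transition and measurement equations can be simulated for a scalar
state. The sketch below is not taken from this thesis: the function name simulate, the example
functions f and h, and the noise levels are assumptions chosen purely for demonstration, and the
noise is assumed to be normally distributed.

```python
import random


def simulate(f, h, w0, q_std, r_std, steps, seed=0):
    """Simulate W_i = f(W_{i-1}) + z_{i-1} and x_i = h(W_i) + v_i for a scalar state."""
    rng = random.Random(seed)
    w = w0
    states, observations = [], []
    for _ in range(steps):
        w = f(w) + rng.gauss(0.0, q_std)  # transition equation (5.4)
        x = h(w) + rng.gauss(0.0, r_std)  # measurement equation (5.5)
        states.append(w)
        observations.append(x)
    return states, observations


# Hypothetical process and measurement functions; the noise is switched off here
# so that the outputs follow the two equations exactly.
states, observations = simulate(
    f=lambda w: 0.9 * w, h=lambda w: 2.0 * w,
    w0=1.0, q_std=0.0, r_std=0.0, steps=3,
)
```

Setting q_std and r_std to non-zero values shows how the process noise accumulates in the state
while the observation noise only perturbs each individual output.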
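It is assumed that both the noise vectors $\vec{z}_{i-1}$, $\vec{z}_{i-1} \sim N_u(0, Q_{i-1})$, and
$\vec{v}_i$, $\vec{v}_i \sim N_u(0, R_i)$, are uncorrelated and distributed by a known distribution
$N_u$ for all time increments. This property is expressed as
$$\begin{bmatrix} \vec{z}_{i-1} \\ \vec{v}_i \end{bmatrix} \sim N_u\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q_{i-1} & 0 \\ 0 & R_i \end{bmatrix} \right), \quad \forall i. \tag{5.6}$$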
It is also assumed that the noise vectors are uncorrelated with the initial state-space vector
$\vec{W}_0$, which is expressed as
$$E[\vec{W}_0 \vec{z}_{i-1}] = E[\vec{W}_0 \vec{v}_i] = 0, \quad \forall i. \tag{5.7}$$
The recursive nature of a linear dynamic model requires that a state-space vector must be adapted
at each time increment i using the newest observation vector $\vec{x}_i$. This requires the
derivation of a posterior probability density function of the state-space vector, given that all
previous observation vectors are available [185, Ch. 1]. This is accomplished by obtaining the
initial state-space vector $P(\vec{W}_i)$, after which the posterior probability density function
$p(\vec{W}_i | \vec{x}_i, \vec{x}_{i-1}, \ldots, \vec{x}_0)$ is recursively estimated using the
predict (equation (5.4)) and update (equation (5.5)) steps. The posterior probability
$p(\vec{W}_i | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0)$ is obtained using the
Chapman-Kolmogoroff equation given as
$$p(\vec{W}_i | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0) = \int p(\vec{W}_i | \vec{W}_{i-1}) \, p(\vec{W}_{i-1} | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0) \, d\vec{W}_{i-1}. \tag{5.8}$$
The conditional probability density function $p(\vec{W}_i | \vec{W}_{i-1})$ is estimated using the
transition equation shown in equation (5.4) and known covariance matrix $Q_{i-1}$. In this
prediction step the transition equation expands the current state-space probability density
function. The measurement equation then uses the newest observation vector $\vec{x}_i$ to tighten
the state-space probability density function [185, Ch. 1]. The state-space probability density
function is updated using the observation vector $\vec{x}_i$ via Bayes' rule as
$$p(\vec{W}_i | \vec{x}_i, \vec{x}_{i-1}, \ldots, \vec{x}_0) = \frac{p(\vec{x}_i | \vec{W}_i) \, p(\vec{W}_i | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0)}{p(\vec{x}_i | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0)}, \tag{5.9}$$
which is expanded to
$$p(\vec{W}_i | \vec{x}_i, \vec{x}_{i-1}, \ldots, \vec{x}_0) = \frac{p(\vec{x}_i | \vec{W}_i) \, p(\vec{W}_i | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0)}{\int p(\vec{x}_i | \vec{W}_i) \, p(\vec{W}_i | \vec{x}_{i-1}, \vec{x}_{i-2}, \ldots, \vec{x}_0) \, d\vec{W}_i}. \tag{5.10}$$
The conditional probability density function $p(\vec{x}_i | \vec{W}_i)$ is calculated using equation
(5.5) and known covariance matrix $R_i$. The accuracy of the state-space vector can be measured if
knowledge of the posterior probability density function
$p(\vec{W}_i | \vec{x}_i, \vec{x}_{i-1}, \ldots, \vec{x}_0)$ is available [185, Ch. 1].
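The predict and update recursion can be made concrete on a small discrete state space, where the
Chapman-Kolmogoroff integral becomes a sum. The following is a minimal sketch under that
discretisation assumption; the three-state transition table and the likelihood values are
hypothetical numbers, not data from this thesis.

```python
def predict(prior, transition):
    """Discrete form of (5.8): sum over the previous state of transition times prior."""
    n = len(prior)
    return [sum(transition[k][j] * prior[k] for k in range(n)) for j in range(n)]


def update(predicted, likelihood):
    """Bayes' rule (5.9)/(5.10): weight by p(x_i | W_i) and normalise by the evidence."""
    unnormalised = [l * p for l, p in zip(likelihood, predicted)]
    evidence = sum(unnormalised)  # denominator of equation (5.10)
    return [u / evidence for u in unnormalised]


# transition[k][j] = p(W_i = j | W_{i-1} = k)  (hypothetical values)
transition = [[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]]
prior = [1.0, 0.0, 0.0]       # p(W_{i-1} | previous observations)
likelihood = [0.1, 0.7, 0.2]  # p(x_i | W_i = j) for the newest observation

predicted = predict(prior, transition)     # prediction step spreads the density
posterior = update(predicted, likelihood)  # update step concentrates it again
```

The prediction step moves probability mass from the first state to its neighbour, and the update
step then favours the state that best explains the observation.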
5.4 KALMAN FILTER
The Kalman filter was originally developed by Rudolf Kalman in 1960 and was published in two
journals [186, 187]. The Kalman filter was designed to recursively solve the state-space form of the
linear dynamic model given in equations (5.4) and (5.5). The Kalman filter assumes that the
transition function f is a known linear matrix F and the process noise vector $\vec{z}_{i-1}$,
$\vec{z}_{i-1} \sim N(0, Q_{i-1})$, is normally distributed. This simplifies the transition equation
given in equation (5.4) to
$$\vec{W}_i = F\vec{W}_{i-1} + \vec{z}_{i-1}. \tag{5.11}$$
The Kalman filter also assumes that the measurement function h is a known linear matrix H and the
observation noise vector $\vec{v}_i$, $\vec{v}_i \sim N(0, R_i)$, is normally distributed. This
simplifies the measurement equation given in equation (5.5) to
$$\vec{x}_i = H\vec{W}_i + \vec{v}_i. \tag{5.12}$$
The distributions $p(\vec{W}_i | \vec{x}_{i-1}, \ldots, \vec{x}_0)$,
$p(\vec{W}_{i-1} | \vec{x}_{i-1}, \ldots, \vec{x}_0)$ and $p(\vec{W}_i | \vec{x}_i, \ldots, \vec{x}_0)$
in equation (5.8) and equation (5.10) are assumed to be normally distributed. The posterior
probability $p(\vec{W}_i | \vec{x}_{i-1}, \ldots, \vec{x}_0)$ is thus expressed as
$$p(\vec{W}_i | \vec{x}_{i-1}, \ldots, \vec{x}_0) = \frac{1}{\sqrt{|2\pi P_{(i|i-1)}|}} \exp(A_1), \tag{5.13}$$
with
$$A_1 = -\frac{1}{2} (\vec{W}_i - \hat{\vec{W}}_{(i|i-1)})^T P_{(i|i-1)}^{-1} (\vec{W}_i - \hat{\vec{W}}_{(i|i-1)}). \tag{5.14}$$
The matrix $P_{(i|i-1)}$ denotes the covariance matrix at time i, given all the previous covariance
matrices up to and including time (i − 1). The vector $\hat{\vec{W}}_{(i|i-1)}$ denotes the estimate
of the state-space vector $\vec{W}$ at time i, given all estimates of state-space vectors up to and
including time (i − 1). The other posterior probability given in equation (5.8) is expressed as
$$p(\vec{W}_{i-1} | \vec{x}_{i-1}, \ldots, \vec{x}_0) = \frac{1}{\sqrt{|2\pi P_{(i-1|i-1)}|}} \exp(A_2), \tag{5.15}$$
with
$$A_2 = -\frac{1}{2} (\vec{W}_{i-1} - \hat{\vec{W}}_{(i-1|i-1)})^T P_{(i-1|i-1)}^{-1} (\vec{W}_{i-1} - \hat{\vec{W}}_{(i-1|i-1)}). \tag{5.16}$$
The matrix $P_{(i-1|i-1)}$ denotes the covariance matrix at time (i − 1), given all the previous
covariance matrices up to and including time (i − 1). The vector $\hat{\vec{W}}_{(i-1|i-1)}$ denotes
the estimate of the state-space vector $\vec{W}$ at time (i − 1), given all the previous estimates
of state-space vectors up to and including time (i − 1). The posterior probability given in equation
(5.10) is expressed as
$$p(\vec{W}_i | \vec{x}_i, \ldots, \vec{x}_0) = \frac{1}{\sqrt{|2\pi P_{(i|i)}|}} \exp\left( -\frac{1}{2} (\vec{W}_i - \hat{\vec{W}}_{(i|i)})^T P_{(i|i)}^{-1} (\vec{W}_i - \hat{\vec{W}}_{(i|i)}) \right), \tag{5.17}$$
where $P_{(i|i)}$ denotes the covariance matrix at time i, given all the previous covariance
matrices up to and including time i. The vector $\hat{\vec{W}}_{(i|i)}$ denotes the estimate of the
state-space vector $\vec{W}$ at time i, given all estimates of state-space vectors up to and
including time i.
The Kalman filter recursively estimates the probability density functions given in equations
(5.13)–(5.17). The prediction parameters used in the prediction step (equation (5.4)) include the
predicted state-space vector $\hat{\vec{W}}_{(i|i-1)}$ and predicted covariance matrix
$P_{(i|i-1)}$. The predicted state-space vector's estimate $\hat{\vec{W}}_{(i|i-1)}$ is computed as
$$\hat{\vec{W}}_{(i|i-1)} = F\hat{\vec{W}}_{(i-1|i-1)}, \tag{5.18}$$
and the predicted estimate of the covariance matrix is computed with
$$P_{(i|i-1)} = Q_{i-1} + F P_{(i-1|i-1)} F^T. \tag{5.19}$$
The parameters used in the updating step (equation (5.5)) include the posterior estimate of the
state-space vector $\hat{\vec{W}}_{(i|i)}$ and posterior estimate of the covariance matrix
$P_{(i|i)}$. These parameters require the computation of the innovation term and optimal Kalman
gain. The innovation term $S_i$ is computed as
$$S_i = H P_{(i|i-1)} H^T + R_i. \tag{5.20}$$
The optimal Kalman gain $K_i$ is computed as
$$K_i = P_{(i|i-1)} H^T S_i^{-1}. \tag{5.21}$$
The posterior estimate of the state-space vector $\hat{\vec{W}}_{(i|i)}$ is computed as
$$\hat{\vec{W}}_{(i|i)} = \hat{\vec{W}}_{(i|i-1)} + K_i (\vec{x}_i - H\hat{\vec{W}}_{(i|i-1)}), \tag{5.22}$$
and the posterior estimate of the covariance matrix $P_{(i|i)}$ is computed as
$$P_{(i|i)} = P_{(i|i-1)} - K_i S_i K_i^T. \tag{5.23}$$
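The prediction and updating equations can be traced by hand in the scalar case, where F, H, Q and R
reduce to single numbers. The sketch below is an illustrative special case rather than the
implementation used in this thesis; the function name kalman_step and all numeric values are
assumptions.

```python
def kalman_step(w_post, p_post, x, F, H, Q, R):
    """One predict/update cycle of a scalar Kalman filter."""
    # Prediction step, equations (5.18) and (5.19)
    w_pred = F * w_post
    p_pred = Q + F * p_post * F
    # Updating step, equations (5.20) to (5.23)
    S = H * p_pred * H + R                 # innovation term
    K = p_pred * H / S                     # optimal Kalman gain
    w_new = w_pred + K * (x - H * w_pred)  # posterior state estimate
    p_new = p_pred - K * S * K             # posterior covariance
    return w_new, p_new


# A constant state observed four times; hypothetical initial estimate 0 with variance 1.
w, p = 0.0, 1.0
for x in [5.0, 5.0, 5.0, 5.0]:
    w, p = kalman_step(w, p, x, F=1.0, H=1.0, Q=0.0, R=1.0)
```

With Q = 0 and R = 1 the posterior variance follows 1/2, 1/3, 1/4, 1/5, so after four identical
observations the estimate has moved four fifths of the way from the initial guess to the observed
value.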
If the process function is precise and the initial estimates of $\hat{\vec{W}}_{(0|0)}$ and
$P_{(0|0)}$ are accurate, then the following five properties will hold. The first two properties,
which are relevant to the state-space vector's estimate, are
$$E[\vec{W}_i - \hat{\vec{W}}_{(i|i)}] = E[\vec{W}_i - \hat{\vec{W}}_{(i|i-1)}] = 0, \tag{5.24}$$
$$E[\vec{x}_i - H\hat{\vec{W}}_{(i|i-1)}] = 0. \tag{5.25}$$
The last three properties hold a relation to the covariance matrices, which accurately reflect the
estimated covariance as
$$P_{(i|i)} = \mathrm{cov}(\vec{W}_i - \hat{\vec{W}}_{(i|i)}), \tag{5.26}$$
$$P_{(i|i-1)} = \mathrm{cov}(\vec{W}_i - \hat{\vec{W}}_{(i|i-1)}), \tag{5.27}$$
$$S_i = \mathrm{cov}(\vec{x}_i - H\hat{\vec{W}}_{(i|i-1)}). \tag{5.28}$$
The performance of the Kalman filter is usually inhibited by the poor estimation of the observation
noise's covariance matrix $R_i$ and the process noise's covariance matrix $Q_{i-1}$. The Kalman
filter is unable to compute the mean and covariance of the Gaussian posterior probability
$p(\vec{W}_i | \vec{x}_i, \vec{x}_{i-1}, \ldots, \vec{x}_0)$ accurately if poor initial estimates
are made of the observation and process noise's covariance matrices.
5.5 EXTENDED KALMAN FILTER
The EKF is the non-linear extension of the standard Kalman filter in estimation theory. The EKF
has been considered to be the de facto standard in non-linear state estimation, navigation
systems and global positioning systems (GPS) [188].
The EKF is similar to the standard Kalman filter as a state-space vector $\vec{W}_i$ is estimated at
each time increment i. The state-space vector $\vec{W}_i$ is estimated at time i recursively by
using the set of observation vectors $\{\vec{x}_i, \vec{x}_{i-1}, \ldots, \vec{x}_0\}$. The
state-space model's equations are reformulated for the EKF in this section. The transition equation
in equation (5.11) is rewritten as
$$\vec{W}_i = f(\vec{W}_{i-1}) + \vec{z}_{i-1}. \tag{5.29}$$
The transition function f is a non-linear function, and the process noise vector $\vec{z}_{i-1}$,
$\vec{z}_{i-1} \sim N(0, Q_{i-1})$, is assumed to be normally distributed. The measurement equation
in equation (5.12) is rewritten as
$$\vec{x}_i = h(\vec{W}_i) + \vec{v}_i. \tag{5.30}$$
The measurement function h is a non-linear function and the observation noise vector $\vec{v}_i$,
$\vec{v}_i \sim N(0, R_i)$, is assumed to be normally distributed. The idea behind the EKF is that
the non-linear transition function f and non-linear measurement function h can be sufficiently
described using local linearisation of the two functions.
The posterior probability density function $p(\vec{W}_i | \vec{x}_i, \ldots, \vec{x}_0)$ is
approximated by means of a Gaussian distribution, which implies that equations (5.13)–(5.17)
described in the Kalman filter section (section 5.4) still hold. Prediction parameters and updating
parameters are reformulated to take into account the non-linear transition and measurement
functions. The predicted state-space vector's estimate $\hat{\vec{W}}_{(i|i-1)}$ is expressed as
$$\hat{\vec{W}}_{(i|i-1)} = f(\hat{\vec{W}}_{(i-1|i-1)}), \tag{5.31}$$
where f denotes the non-linear transition function. The predicted estimate of the covariance matrix
$P_{(i|i-1)}$ is expressed as
$$P_{(i|i-1)} = Q_{i-1} + F_{est} P_{(i-1|i-1)} F_{est}^T. \tag{5.32}$$
The matrix $F_{est}$ is the local linearisation of the non-linear transition function f. The matrix
$F_{est}$ is defined as the Jacobian evaluated at $\hat{\vec{W}}_{(i-1|i-1)}$ as [185, Ch. 2]
$$F_{est} = \left[ \frac{\partial}{\partial W_{i,1}} \; \ldots \; \frac{\partial}{\partial W_{i,S}} \right] f^T(\vec{W}_i) \Bigg|_{\vec{W}_i = \hat{\vec{W}}_{(i-1|i-1)}}. \tag{5.33}$$
In the case of the updating parameters, the posterior estimate of the state-space vector
$\hat{\vec{W}}_{(i|i)}$ is expressed as
$$\hat{\vec{W}}_{(i|i)} = \hat{\vec{W}}_{(i|i-1)} + K_i (\vec{x}_i - h(\hat{\vec{W}}_{(i|i-1)})). \tag{5.34}$$
The function h denotes the non-linear measurement function and $K_i$ denotes the EKF's optimal
Kalman gain given as
$$K_i = P_{(i|i-1)} H_{est}^T S_i^{-1}. \tag{5.35}$$
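The matrix $H_{est}$ is the local linearisation of the non-linear measurement function h. The matrix
$H_{est}$ is defined as the Jacobian evaluated at $\hat{\vec{W}}_{(i|i-1)}$ as [185, Ch. 2]
$$H_{est} = \left[ \frac{\partial}{\partial W_{i,1}} \; \ldots \; \frac{\partial}{\partial W_{i,S}} \right] h^T(\vec{W}_i) \Bigg|_{\vec{W}_i = \hat{\vec{W}}_{(i|i-1)}}. \tag{5.36}$$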
The innovation term for the EKF is defined as
$$S_i = H_{est} P_{(i|i-1)} H_{est}^T + R_i. \tag{5.37}$$
The posterior estimate of the covariance matrix $P_{(i|i)}$ is expressed as
$$P_{(i|i)} = P_{(i|i-1)} - K_i S_i K_i^T. \tag{5.38}$$
Land cover example: The time series example given in figure 5.1 produces a time series which is
shown in figure 5.2. Kleynhans et al. proposed a triply modulated cosine function for the process
function [30]. The triply modulated cosine function is expressed as
$$\vec{x}_i = \mu_i + \alpha_i \cos(2\pi f_{samp} i + \theta_i). \tag{5.39}$$
The variable i denotes the time index and $f_{samp}$ denotes the temporal sampling rate of the image
acquisitions. The cosine function is characterised by three variables: $\mu_i$, $\alpha_i$ and
$\theta_i$. These three variables form the state-space vector, which is defined as
$$\vec{W}_i = [W_{i,1} \; W_{i,2} \; W_{i,3}] = [W_{i,\mu} \; W_{i,\alpha} \; W_{i,\theta}]. \tag{5.40}$$
[Figure 5.3: reflectance value against acquisition date (Jan 02 to Dec 02), showing the original
time series and the fitted process function.]
FIGURE 5.3: The Extended Kalman filter estimates the parameters of the state-space vector
$\vec{W}_i$ to fit the triply modulated cosine function onto the time series shown in figure 5.2.
The estimated state-space vector is used to create a fitted process function to measure the
accuracy of the fit.
The triply modulated cosine function is a non-linear function and the EKF was proposed to solve
the state-space model. It is assumed that the state-space vector remains constant from one time
increment to the next. This reduces the transition equation given in equation (5.29) to
$$\vec{W}_i = \vec{W}_{i-1} + \vec{z}_{i-1}. \tag{5.41}$$
The measurement equation shown in equation (5.30) is defined for this example as
$$\vec{x}_i = h(\vec{W}_i) + \vec{v}_i, \tag{5.42}$$
where the measurement function h is the triply modulated cosine function given in equation
(5.39) as
$$h(\vec{W}_i) = W_{i,\mu} + W_{i,\alpha} \cos(2\pi f_{samp} i + W_{i,\theta}). \tag{5.43}$$
[Figure 5.4: four panels plotted against acquisition date (Jan 02 to Nov 02): (a) mean parameter,
(b) amplitude parameter, (c) phase parameter, (d) residual.]
FIGURE 5.4: The Extended Kalman filter estimates the parameters in the state-space vector
$\vec{W}_i$. Figure (a) shows the mean parameter $\mu_i$ estimates. Figure (b) shows the amplitude
parameter $\alpha_i$ estimates. Figure (c) shows the phase parameter $\theta_i$ estimates. Figure
(d) shows the absolute error in tracking the output of the system.
It should be noted that the measurement function produces a vector with a single dimension.
Thus for this example, equation (5.42) is further reduced to a single output as
$$x_i = h(\vec{W}_i) + v_i. \tag{5.44}$$
The predicted state-space vector's estimate $\hat{\vec{W}}_{(i|i-1)}$ shown in equation (5.31) is
rewritten by substituting the transition function with the identity matrix for the example as
$$\hat{\vec{W}}_{(i|i-1)} = f(\hat{\vec{W}}_{(i-1|i-1)}) = \hat{\vec{W}}_{(i-1|i-1)}. \tag{5.45}$$
The matrix $F_{est}$ is an identity matrix, which simplifies the predicted estimate for the
covariance matrix $P_{(i|i-1)}$ shown in equation (5.32) to
$$P_{(i|i-1)} = Q_{i-1} + F_{est} P_{(i-1|i-1)} F_{est}^T = Q_{i-1} + P_{(i-1|i-1)}. \tag{5.46}$$
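The posterior estimate of the state-space vector $\hat{\vec{W}}_{(i|i)}$ shown in equation (5.34) is
expressed for this example as
$$\begin{aligned}
\hat{\vec{W}}_{(i|i)} &= \hat{\vec{W}}_{(i|i-1)} + K_i (\vec{x}_i - h(\hat{\vec{W}}_{(i|i-1)})) \\
&= \hat{\vec{W}}_{(i|i-1)} + K_i (\vec{x}_i - H_{est}(\hat{\vec{W}}_{(i|i-1)})) \\
&= \hat{\vec{W}}_{(i|i-1)} + K_i \left( \vec{x}_i - \left[ \frac{\partial h^T(\vec{W}_i)}{\partial W_{i,\mu}}, \frac{\partial h^T(\vec{W}_i)}{\partial W_{i,\alpha}}, \frac{\partial h^T(\vec{W}_i)}{\partial W_{i,\theta}} \right]_{\vec{W}_i = \hat{\vec{W}}_{(i|i-1)}} \right),
\end{aligned} \tag{5.47}$$
with
$$\frac{\partial h(\vec{W}_i)}{\partial W_{i,\mu}} = 1, \tag{5.48}$$
$$\frac{\partial h(\vec{W}_i)}{\partial W_{i,\alpha}} = \cos(2\pi f_{samp} i + \hat{W}_{(i|i-1),\theta}), \tag{5.49}$$
$$\frac{\partial h(\vec{W}_i)}{\partial W_{i,\theta}} = -\hat{W}_{(i|i-1),\alpha} \left( \sin(2\pi f_{samp} i) \cos(\hat{W}_{(i|i-1),\theta}) + \cos(2\pi f_{samp} i) \sin(\hat{W}_{(i|i-1),\theta}) \right). \tag{5.50}$$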
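The three partial derivatives can be checked numerically against central finite differences of the
measurement function. The sketch below is only a verification aid; the value of F_SAMP (one seasonal
cycle per 46 samples) and the evaluation point are assumptions made for the illustration.

```python
import math

F_SAMP = 1.0 / 46.0  # assumption: one seasonal cycle spans 46 samples


def h(w, i):
    """Triply modulated cosine measurement function, equation (5.43)."""
    mu, alpha, theta = w
    return mu + alpha * math.cos(2.0 * math.pi * F_SAMP * i + theta)


def h_jacobian(w, i):
    """Analytic partial derivatives of h, equations (5.48) to (5.50)."""
    mu, alpha, theta = w
    phase = 2.0 * math.pi * F_SAMP * i
    return [
        1.0,
        math.cos(phase + theta),
        -alpha * (math.sin(phase) * math.cos(theta) + math.cos(phase) * math.sin(theta)),
    ]


def numerical_jacobian(w, i, eps=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    grads = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += eps
        down[k] -= eps
        grads.append((h(up, i) - h(down, i)) / (2.0 * eps))
    return grads


w = [760.0, 160.0, -0.3]  # hypothetical mean, amplitude and phase
analytic = h_jacobian(w, i=10)
numeric = numerical_jacobian(w, i=10)
```

Agreement between the two lists indicates that the linearisation used in the updating step matches
the measurement function at the chosen point.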
The time series shown in figure 5.2 is fitted with the triply modulated cosine function by
estimating a state-space vector $\vec{W}_i$ for each time increment. The estimated output of the
EKF, using the newest available observation vector at time i, is plotted with the actual observation
vector $\vec{x}_i$ in figure 5.3. It is observed that the EKF requires an initial number of
observations before the state-space vector starts to stabilise. The stabilised state-space vector
corresponds to a more accurate tracking of the actual observations.
The progressive estimation of the state-space vectors is shown in figure 5.4. Figure 5.4(a)
illustrates the estimation of the mean parameter µi (the first element in the state-space vector
denoted by Wi,µ ). Figure 5.4(b) illustrates the estimation of the amplitude parameter αi (the
second element in the state-space vector denoted by Wi,α ). Figure 5.4(c) illustrates the estimation
of the phase parameter θi (the third element in the state-space vector denoted by Wi,θ ). The
absolute error in the tracking of the output is illustrated in figure 5.4(d).
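The complete recursion for this example can be sketched in a few lines, because the transition is
the identity and the observation is a single value, so the innovation term is a scalar. The code
below is a minimal illustration on a synthetic, noise-free series and not the implementation used
to produce the figures; F_SAMP, the initial state, and the values chosen for Q, R and the initial
covariance are all assumptions.

```python
import math

F_SAMP = 1.0 / 46.0  # assumption: one seasonal cycle spans 46 samples


def h(w, i):
    """Triply modulated cosine, equation (5.43)."""
    mu, alpha, theta = w
    return mu + alpha * math.cos(2.0 * math.pi * F_SAMP * i + theta)


def h_jacobian(w, i):
    """Row vector H_est; the last entry equals (5.50) by the angle-sum identity."""
    mu, alpha, theta = w
    phase = 2.0 * math.pi * F_SAMP * i + theta
    return [1.0, math.cos(phase), -alpha * math.sin(phase)]


def diag(values):
    """Square matrix (list of rows) with the given diagonal."""
    return [[v if r == c else 0.0 for c, v in enumerate(values)] for r in range(len(values))]


def ekf_step(w_post, P_post, x, i, Q, R):
    """One EKF cycle for the constant-state model (5.41) and scalar output (5.44)."""
    n = len(w_post)
    # Prediction, equations (5.45) and (5.46): F_est is the identity matrix
    w_pred = list(w_post)
    P_pred = [[P_post[r][c] + Q[r][c] for c in range(n)] for r in range(n)]
    # Updating, equations (5.34), (5.35), (5.37) and (5.38)
    H = h_jacobian(w_pred, i)
    PHt = [sum(P_pred[r][c] * H[c] for c in range(n)) for r in range(n)]
    S = sum(H[r] * PHt[r] for r in range(n)) + R  # scalar innovation term
    K = [PHt[r] / S for r in range(n)]
    innovation = x - h(w_pred, i)
    w_new = [w_pred[r] + K[r] * innovation for r in range(n)]
    P_new = [[P_pred[r][c] - K[r] * S * K[c] for c in range(n)] for r in range(n)]
    return w_new, P_new, abs(innovation)


# Synthetic two-year series generated from hypothetical parameters
true_w = [760.0, 160.0, -0.3]
series = [h(true_w, i) for i in range(92)]

w = [sum(series) / len(series), 100.0, 0.0]  # initial state guess
P = diag([100.0, 100.0, 1.0])                # initial covariance guess
Q = diag([1e-2, 1e-2, 1e-4])                 # process noise covariance (assumed)
R = 100.0                                    # observation noise variance (assumed)
residuals = []
for i, x in enumerate(series):
    w, P, residual = ekf_step(w, P, x, i, Q, R)
    residuals.append(residual)
```

The list residuals plays the role of the absolute tracking error, and the successive values of w
correspond to the mean, amplitude and phase estimates. How quickly they settle depends strongly on
the assumed Q and R.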
[Figure 5.5: three panels plotted against acquisition date (May 02 to Apr 03): (a) reflectance value
of the original time series with the time index of interest marked, (b) the original model, (c)
reflectance value of the original time series with the fitted model and the estimated output.]
FIGURE 5.5: Least squares estimates the parameter vector $\vec{W}_i$ to fit the model onto the time
series.
5.6 LEAST SQUARES MODEL FITTING
Carl Friedrich Gauss is credited with developing the least squares method in 1795, and the method
was first published by the French mathematician Legendre in 1805. Least squares is used here to fit
the triply modulated cosine model with a parameter vector $\vec{W}_i$. It estimates the parameter
vector by evaluating the fit of the model to the actual observation vector. The parameter vector in
this context can be viewed as the state-space vector defined in the state-space model and the model
can be viewed as the process function (section 5.3).
The least squares is a linear regression method, which uses a model h to predict a set of dependent
parameter vectors $\{\vec{W}_i\}$ from a set of independent observation vectors $\{\vec{x}_i\}$. The
least squares' goal is to find a parameter vector $\vec{W}_i$ that will minimise the sum of squares
between the observation vectors $\vec{x}_i$ and the model's estimated output vector
$\hat{\vec{x}}_i$. The sum of squares is computed as a summation of the error residuals to measure
the performance and is expressed as
$$E_{LS} = \sum_{i=1}^{I} (\vec{x}_i - \hat{\vec{x}}_i)^2 = \sum_{i=1}^{I} (\vec{x}_i - h(\vec{x}_i, \vec{W}_i))^2. \tag{5.51}$$
[Figure 5.6: three panels (a) to (c) of reflectance value against acquisition date, each showing the
original time series, the fitted model and the time index of interest; the window moves forward in
time from panel (a) to panel (c).]
FIGURE 5.6: Least squares estimates the parameter vector $\vec{W}_i$ by shifting the model over the
time series.
The variable $E_{LS}$ denotes the sum of squares and h denotes the model. The sum of squares can be
minimised using standard approaches, which evaluate the partial derivatives. The partial derivative
of the sum of squares is solved as
$$\frac{dE_{LS}}{d\vec{W}_i} = 2 \sum_{j=1}^{I} (\vec{x}_j - \hat{\vec{x}}_j) \frac{d(\vec{x}_j - \hat{\vec{x}}_j)}{d\vec{W}_i} = 0, \quad \forall i. \tag{5.52}$$
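As a worked illustration of equation (5.52), consider the simplest possible model, a constant
output $\hat{x}_j = \mu$ with a single parameter. This model is an assumption introduced only for
the illustration; it is the triply modulated cosine with the amplitude fixed at zero.

```latex
E_{LS} = \sum_{j=1}^{I} (x_j - \mu)^2,
\qquad
\frac{dE_{LS}}{d\mu} = 2 \sum_{j=1}^{I} (x_j - \mu) \frac{d(x_j - \mu)}{d\mu}
                     = -2 \sum_{j=1}^{I} (x_j - \mu) = 0
\quad \Longrightarrow \quad
\mu = \frac{1}{I} \sum_{j=1}^{I} x_j .
```

In this linear case the condition has a closed-form solution, namely the sample mean of the
observations.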
Several variations of the least squares exist; the most popular method is the ordinary least squares
(OLS) algorithm. The OLS assumes the observation noise vector ~vi is normally distributed and the
model h is linear.
The least squares is considered optimal when a set of criteria is met in the estimates of the
parameter vector. These criteria are:
1. The observation vectors are randomly sampled from a well defined data set.
2. The underlying structure within the data set is linear.
3. The difference between the observation vector $\vec{x}_i$ and the fitted model has an expected
zero mean.
4. The parameter vector's variables are linearly independent from each other.
5. The difference between the observation vector $\vec{x}_i$ and the fitted model is normally
distributed and uncorrelated to the parameter vector.
[Figure 5.7: reflectance value against acquisition date (Jan 02 to Dec 02), showing the original
time series and the fitted model.]
FIGURE 5.7: Least squares estimates the parameter vector $\vec{W}_i$ to fit triply modulated cosine
model onto a time series.
In addition to the five criteria stated, if the Gauss-Markov condition also holds, then the OLS
estimates are considered to be equivalent to the maximum likelihood estimates of the parameter
vectors. More sophisticated adaptations have been made to the OLS; the most frequently used of these
are the weighted least squares, alternating least squares and partial least squares.
The OLS can be extended to include the field of non-linear models. The drawback is that the
standard approach of evaluating the derivative of a non-linear model in equation (5.52) is not
always possible. This is because the derivatives $d(\vec{x}_j - \hat{\vec{x}}_j)/d\vec{W}_i$ are
functions which are dependent on both the observation vectors $\{\vec{x}_i\}$ and the parameter
vectors $\{\vec{W}_i\}$.
[Figure 5.8: four panels plotted against acquisition date (Jan 02 to Nov 02): (a) mean parameter,
(b) amplitude parameter, (c) phase parameter, (d) residual.]
FIGURE 5.8: Least squares estimates the parameter vector $\vec{W}_i$ to fit triply modulated cosine
model onto a time series.
This changes the least squares from a closed-form solution for the linear case to a non closed-form
solution for the non-linear case. The set of parameter vectors $\{\vec{W}_i\}$ must then be
estimated using an iterative numerical algorithm. The algorithm iterates through the parameter
vector's space using the derivative of the sum of squares $E_{LS}$ at each epoch. The gradient
descent algorithm is a popular iterative method used in this case.
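A gradient descent fit of the triply modulated cosine to a single window can be sketched as follows.
This is an illustration under several assumptions that are not taken from this thesis: the window
covers exactly one period of 46 samples, the data are synthetic and scaled to order one, the step
size is fixed, and the mean of the squared residuals is minimised instead of the sum so that the
step size does not depend on the window length.

```python
import math

F_SAMP = 1.0 / 46.0  # assumption: one seasonal cycle spans 46 samples
WINDOW = 46          # assumption: the window covers one full period


def model(w, i):
    """Triply modulated cosine evaluated at time index i."""
    mu, alpha, theta = w
    return mu + alpha * math.cos(2.0 * math.pi * F_SAMP * i + theta)


def gradient(w, window):
    """Gradient of the mean squared residual with respect to (mu, alpha, theta)."""
    mu, alpha, theta = w
    g = [0.0, 0.0, 0.0]
    for i, x in window:
        phase = 2.0 * math.pi * F_SAMP * i + theta
        r = x - model(w, i)                     # residual x_j minus estimated output
        g[0] += -2.0 * r                        # d r / d mu    = -1
        g[1] += -2.0 * r * math.cos(phase)      # d r / d alpha = -cos(phase)
        g[2] += 2.0 * r * alpha * math.sin(phase)  # d r / d theta = alpha sin(phase)
    return [gk / len(window) for gk in g]


# Synthetic window generated from hypothetical parameters (values scaled to order one)
true_w = (1.0, 0.5, 0.3)
window = [(i, model(true_w, i)) for i in range(WINDOW)]

w = [0.8, 0.3, 0.0]  # initial parameter guess
STEP = 0.5           # fixed step size (assumed)
for _ in range(500):
    g = gradient(w, window)
    w = [wk - STEP * gk for wk, gk in zip(w, g)]
```

With raw reflectance values of several hundred, the curvature in the phase direction is far larger
than in the mean direction, so a much smaller step size or a rescaling of the parameters would be
needed for the same loop to remain stable.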
Land cover example: In this example the least squares predicts the set of parameter vectors for the
time series shown in figure 5.2. The problem lies in the fact that the least squares requires a
set of observation vectors $\{\vec{x}_i\}$ to estimate a single parameter vector $\vec{W}_i$. The
lowest number of observation vectors required to estimate the parameter vector is
$(|\vec{W}_i| + 1)$.
This concept is illustrated in figure 5.5 by using a set of observation vectors the length of
a single year. In figure 5.5(a) the time series in figure 5.2 is shown with a time index of
interest. The parameter vector $\vec{W}_i$ for observation vector $\vec{x}_i$ is estimated using the
set $\{\vec{x}_{i-N}, \vec{x}_{i-N+1}, \ldots, \vec{x}_{i+N-1}, \vec{x}_{i+N}\}$ of observation
vectors. The variable N is chosen to encapsulate the entire period of the model shown in figure
5.5(b). The parameter vector $\vec{W}_i$ is then determined using the least squares to minimise the
sum of squares to produce the fitted model shown in figure 5.5(c).
The next step is to estimate a parameter vector $\vec{W}_i$, $\forall i$. This is accomplished by
moving the model across the time index. The parameter vector $\vec{W}_{i+c}$ for observation vector
$\vec{x}_{i+c}$ is estimated using the set
$\{\vec{x}_{i-N+c}, \vec{x}_{i-N+c+1}, \ldots, \vec{x}_{i+N+c-1}, \vec{x}_{i+N+c}\}$. This iterative
approach to moving the model is shown in three different figures in figure 5.6.
After shifting through the entire time series, the predicted output of the least squares is plotted,
along with the actual observation vectors in figure 5.7.
The progressive estimation of the parameter vectors is shown in figure 5.8. Figure 5.8(a)
illustrates the estimation of the model’s mean parameter µi .
Figure 5.8(b) illustrates the
estimation of the model’s amplitude parameter αi . Figure 5.8(c) illustrates the estimation of
the model’s phase parameter θi . The absolute error in tracking of the output is illustrated in
figure 5.8(d). □
5.7 M-ESTIMATE MODEL FITTING
Various attempts have been made to create robust statistical estimators, which are used to fit
models. M-estimates rely on the maximum likelihood approach to estimate the parameters of a
particular statistical model. An M-estimator is generally defined as a zero of the estimating
function, while the estimating function is usually the derivative of a statistical function of
interest. The advantage of an M-estimator is that it does not assume that the residuals are normally
distributed. M-estimators attempt to minimise the mean absolute deviation in the residuals for a
given distribution using a maximum likelihood approach.
The assessment of different distributions in the M-estimator allows for different weighting
functions to be associated with outliers. Normally distributed residuals usually associate greater
weights to outliers when compared to a Lorentzian distribution of residuals [189, Ch. 15]. This
difference in the relative weighting of points in a model makes it difficult to apply standard
gradient descent. The Nelder-Mead method is thus the chosen optimisation method, as it only requires
function evaluations and not the derivatives [189, Ch. 15].
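The difference in outlier weighting can be illustrated with the relative weight ψ(z)/z that each
distribution assigns to a standardised residual z. The sketch below assumes the textbook forms
ρ(z) = z²/2 for normally distributed residuals and ρ(z) = log(1 + z²/2) for Lorentzian residuals;
the function names and the sample residuals are hypothetical.

```python
def gaussian_weight(z):
    """Relative weight psi(z)/z for rho(z) = z**2 / 2: identical for every residual."""
    return 1.0


def lorentzian_weight(z):
    """Relative weight psi(z)/z for rho(z) = log(1 + z**2 / 2): shrinks for large |z|."""
    return 1.0 / (1.0 + 0.5 * z * z)


residuals = [0.1, 1.0, 5.0]  # standardised residuals; the last one acts as an outlier
gaussian = [gaussian_weight(z) for z in residuals]
lorentzian = [lorentzian_weight(z) for z in residuals]
```

Under the Lorentzian assumption the residual of 5 receives a far smaller weight than under the
normal assumption, which is why the resulting fit is less sensitive to outliers.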
The Nelder-Mead algorithm was first proposed by John Nelder and Roger Mead in 1965 [190].
The Nelder-Mead algorithm is a non-linear method which estimates the parameter vector $\vec{W}_i$
for a particular model. The Nelder-Mead algorithm is a well-defined numerical method that operates
on a twice differentiable, unimodal, multi-dimensional function. The method makes use of a direct
search by evaluating a function at the vertices of a simplex. An N-simplex is an N-dimensional
polytope which is the convex hull of (N + 1) vertices. The algorithm then iteratively moves and
scales the simplex's vertices through the set of dimensions in search of the minimum. It continually
attempts to improve the evaluated function until a predefined bound is reached.
[Figure 5.9: reflectance value against acquisition date (Jan 02 to Dec 02), showing the original
time series and the fitted model.]
FIGURE 5.9: M-estimator estimates the parameter vector $\vec{W}_i$ to fit the triply modulated
cosine model onto a time series.
Each epoch requires the execution of six steps to compute the new position of the simplex. The
algorithm in summary starts with initialising the vertices of the simplex. It then iteratively rejects
and replaces the worst performing vertex point with a new vertex point. This process of setting new
vertex points creates a sequence of new N -simplexes. The initialisation with a small initial N -simplex
converges rapidly to a local minimum, while a large N -simplex becomes trapped in non-stationary
points in the vector space.
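A basic direct search of this kind can be sketched as follows. The code is a simplified
illustration using the commonly quoted reflection, expansion, contraction and shrink coefficients
(1, 2, 0.5 and 0.5); it is not the implementation used in this thesis, and the function name
nelder_mead, the quadratic test function and the starting simplex are assumptions.

```python
def nelder_mead(func, simplex, iterations=300):
    """Minimise func with a basic Nelder-Mead search; simplex holds N + 1 starting vertices."""
    n = len(simplex) - 1
    points = [list(p) for p in simplex]
    for _ in range(iterations):
        points.sort(key=func)  # order the vertices from best to worst
        best, second_worst, worst = points[0], points[-2], points[-1]
        centroid = [sum(p[k] for p in points[:-1]) / n for k in range(n)]

        def along(t):
            # Point on the line through the centroid, pointing away from the worst vertex
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if func(reflected) < func(best):
            expanded = along(2.0)
            points[-1] = expanded if func(expanded) < func(reflected) else reflected
        elif func(reflected) < func(second_worst):
            points[-1] = reflected
        else:
            outside = func(reflected) < func(worst)
            contracted = along(0.5 if outside else -0.5)
            target = reflected if outside else worst
            if func(contracted) < func(target):
                points[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best vertex
                points = [best] + [[b + 0.5 * (v - b) for b, v in zip(best, p)]
                                   for p in points[1:]]
    return min(points, key=func)


def bowl(p):
    """Hypothetical test function with its minimum at (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


minimum = nelder_mead(bowl, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Only function evaluations are used, so the same routine could be applied to a robust cost such as
the Lorentzian sum of log(1 + z²/2) terms without deriving any gradients.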
Land cover example: In this example the M-estimator predicts a set of parameter vectors for the time
series shown in figure 5.2. The same problem exists for the M-estimator, as for the least squares,
when estimating the sequence of parameter vectors. The parameter vector $\vec{W}_i$ for observation
vector $\vec{x}_i$ is estimated using the set
$\{\vec{x}_{i-N}, \vec{x}_{i-N+1}, \ldots, \vec{x}_{i+N-1}, \vec{x}_{i+N}\}$ of observation vectors.
This is rectified by shifting the model through all the time indices. The initial estimate of the
M-estimator is contained in a certain parameter space by using the mean and standard deviation
of the time series as the initial parameter vector for the model. The previous parameter vector
$\vec{W}_{i-1}$ is then used to initialise the M-estimator when determining the current parameter
vector $\vec{W}_i$.
[Figure 5.10: four panels plotted against acquisition date (Jan 02 to Nov 02): (a) mean parameter,
(b) amplitude parameter, (c) phase parameter, (d) residual.]
FIGURE 5.10: M-estimator estimates the parameter vector $\vec{W}_i$ to fit the triply modulated
cosine model onto a time series.
The predicted output of the M-estimator is plotted with the actual observation vectors $\vec{x}_i$
in figure 5.9.
The progressive estimation of the parameter vectors is shown in figure 5.10. Figure 5.10(a)
illustrates the estimation of the model's mean parameter $\mu_i$. Figure 5.10(b) illustrates the
estimation of the model's amplitude parameter $\alpha_i$. Figure 5.10(c) illustrates the estimation
of the model's phase parameter $\theta_i$. The absolute error in the tracking of the output is
illustrated in figure 5.10(d). □
5.8 FOURIER TRANSFORM
The Fourier transform of a discrete time series is a representation of the sequence in terms of the
complex exponential sequence $\{e^{j2\pi f i}\}$, where f is the frequency variable. The Fourier
transform representation of a time series, if it exists, is unique and the original time series can
be recovered by applying an inverse Fourier transform [115, Ch. 3].
Let x, $x = [x_1 \; x_2 \; \ldots \; x_I]$, denote the time series and let $I \to \infty$, then the
Fourier transform $X(e^{j2\pi f})$ is defined as
$$X(e^{j2\pi f}) = \sum_{i=-\infty}^{\infty} x_{(I/2)} e^{j2\pi f i}. \tag{5.53}$$
The Fourier transform $X(e^{j2\pi f})$ is a complex function and is written in rectangular form as
$$X(e^{j2\pi f}) = X_{real}(e^{j2\pi f}) + jX_{imag}(e^{j2\pi f}), \tag{5.54}$$
where $X_{real}(e^{j2\pi f})$ denotes the real part and $X_{imag}(e^{j2\pi f})$ denotes the
imaginary part of $X(e^{j2\pi f})$. The components of the rectangular form are expressed as
$$X_{real}(e^{j2\pi f}) = |X(e^{j2\pi f})| \cos \theta_X, \tag{5.55}$$
$$X_{imag}(e^{j2\pi f}) = |X(e^{j2\pi f})| \sin \theta_X. \tag{5.56}$$
The quantity $|X(e^{j2\pi f})|$ denotes the magnitude function of the Fourier transform. The
quantity $\theta_X$ denotes the phase function, which is given as
$$\theta_X = \arctan\left( \frac{X_{imag}(e^{j2\pi f})}{X_{real}(e^{j2\pi f})} \right). \tag{5.57}$$
In the case of a finite length time series x, $x = [x_1 \; x_2 \; \ldots \; x_I]$,
$I \in \mathbb{N}$, $I < \infty$, there is a simpler relation between the time series and its
corresponding Fourier transform $X(e^{j2\pi f})$ [115, Ch. 3]. For a time series x of length I, only
I values of $X(e^{j2\pi f})$ at I distinct harmonic functions at frequency points, 0 ≤ f ≤ 1, are
sufficient to construct the unique time series x. This leads to the concept of a second transform
domain representation that operates on a finite length time series [115, Ch. 3].
This second transform is known as the discrete Fourier transform (DFT). The relation between a
finite length time series x, $x = [x_1 \; x_2 \; \ldots \; x_I]$, and its corresponding Fourier
transform $X(e^{j2\pi f})$ is obtained by uniformly sampling $X(e^{j2\pi f})$ on the frequency
domain between 0 ≤ f ≤ 1 at increments of f = i/I, 0 ≤ i ≤ (I − 1). The DFT is computed by sampling
equation (5.53) uniformly as
$$X_i = X(e^{j2\pi f}) \Big|_{f = i/I} = \sum_{n=0}^{I-1} x_n e^{j2\pi i n / I}, \quad 0 \le i \le (I - 1). \tag{5.58}$$
The inverse discrete Fourier transform (IDFT) is given by
$$x_n = \frac{1}{I} \sum_{i=0}^{I-1} X_i e^{-j2\pi i n / I}, \quad 0 \le n \le (I - 1). \tag{5.59}$$
[Figure 5.11: reflectance value against acquisition date (Jan 02 to Dec 02), showing the original
time series and the fitted Fourier components.]
FIGURE 5.11: Fast Fourier transform (FFT) estimates the parameters of the vector $\vec{W}_i$ to fit
multiple harmonics onto time series x.
The computation of the DFT and IDFT requires $O(I^2)$ complex multiplications and $O(I^2 - I)$ complex additions. A fast Fourier transform (FFT) refers to an algorithm that has been developed to reduce the computational complexity of computing the DFT to about $O(I \log_2 I)$ operations. As there is no loss in precision in using these fast computing algorithms, they will be used throughout this thesis when referring to the DFT of a time series. Similarly, an inverse fast Fourier transform (IFFT) algorithm has been developed to compute the IDFT efficiently.
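The equivalence between the direct DFT and an FFT can be illustrated with a short sketch. The code below is not the implementation used in this thesis; it is a minimal standard-library example under stated assumptions: the function names are hypothetical, the positive exponent of equation (5.58) is kept for the forward transform, the inverse carries the $1/I$ scaling needed for an exact round trip, and the recursive radix-2 FFT requires the length to be a power of two.

```python
import cmath
import random

def dft_direct(x):
    """Equation (5.58) evaluated literally: O(I^2) complex multiplications."""
    I = len(x)
    return [sum(x[n] * cmath.exp(2j * cmath.pi * i * n / I) for n in range(I))
            for i in range(I)]

def idft_direct(X):
    """Inverse transform; the 1/I factor undoes the unscaled forward sum."""
    I = len(X)
    return [sum(X[i] * cmath.exp(-2j * cmath.pi * i * n / I) for i in range(I)) / I
            for n in range(I)]

def fft_radix2(x):
    """Recursive radix-2 FFT with the same sign convention (power-of-two length)."""
    I = len(x)
    if I == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * I
    for i in range(I // 2):
        twiddle = cmath.exp(2j * cmath.pi * i / I) * odd[i]
        out[i] = even[i] + twiddle
        out[i + I // 2] = even[i] - twiddle
    return out

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]   # synthetic test series
X_slow = dft_direct(x)
X_fast = fft_radix2(x)
x_back = idft_direct(X_slow)

max_diff = max(abs(a - b) for a, b in zip(X_slow, X_fast))
roundtrip = max(abs(a - b) for a, b in zip(x, x_back))
```

Both `max_diff` and `roundtrip` should sit at the level of floating-point rounding, which is the sense in which the fast algorithm loses no precision relative to the direct sum.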
The FFT function is denoted by $\mathcal{F}$ and is mathematically computed as
$$X = \mathcal{F}(x). \tag{5.60}$$
The sequence $X$ is the DFT of the time series $x$. The time series $x$ is a process in the time domain and the value of $x$ is dependent on the corresponding time index $i$. The DFT $X$, on the other hand, is a process in the frequency domain, in which the process is defined by the amplitude $|x_f|$ and phase $\angle x_f$ of harmonic frequency samples $f$, $f \in (-\infty, \infty)$.

FIGURE 5.12: Fast Fourier transform (FFT) estimates the parameters of the vector $\vec{W}_i$ to fit multiple harmonics onto time series $x$. (Four panels plotted against time, Jan 02 to Nov 02: (a) magnitude of the first component, (b) magnitude of the second component, (c) phase of the second component, (d) residual.)

The inverse Fourier transform is denoted by $\mathcal{F}^{-1}$ and is mathematically computed as
$$x = \mathcal{F}^{-1}(X). \tag{5.61}$$
The conversion to the frequency domain allows the analysis of periodic (such as seasonal) effects and
trends within the time series x.
Land cover example: In this example the fast Fourier transform is used to predict a set of Fourier
components for the time series shown in figure 5.2.
The Fourier components are stored in a vector $\vec{W}_i$ for observation vector $\vec{x}_i$ and are estimated using the set $\{\vec{x}_{i-N}, \vec{x}_{i-N+1}, \dots, \vec{x}_{i+N-1}, \vec{x}_{i+N}\}$ of observation vectors. The variable $N$ is chosen to capture enough energy in each harmonic function of interest. In this case, the window covers a complete phenological cycle of one year.
A set of harmonic functions is stored in the state-space model as
$$\vec{W}_i = [W_{i,1}\ \ W_{i,2}\ \ W_{i,3}] = [W_{i,\mu}\ \ W_{i,\alpha}\ \ W_{i,\theta}] = [\,|X_1|\ \ 2|X_2|\ \ \angle(X_2)\,]. \tag{5.62}$$
The next step is to estimate a vector $\vec{W}_i$, $\forall i$. This is accomplished by moving a window across the time index. The vector $\vec{W}_{i+c}$ for observation vector $\vec{x}_{i+c}$ is estimated using the set $\{\vec{x}_{i-N+c}, \vec{x}_{i-N+c+1}, \dots, \vec{x}_{i+N+c-1}, \vec{x}_{i+N+c}\}$. This iterative approach moves the window of the DFT similar to the least squares and M-estimator. The predicted output of the Fourier components is plotted along with the actual observation vectors in figure 5.11.
The progressive estimation of the vectors is shown in figure 5.12. Figure 5.12(a) illustrates the estimation of the magnitude of the first frequency component in $X$. Figure 5.12(b) illustrates the estimation of the magnitude of the second frequency component in $X$. Figure 5.12(c) illustrates the phase of the second frequency component in $X$. The absolute error in tracking of the output is illustrated in figure 5.12(d). □
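The moving-window estimation described in this example can be sketched in a few lines. This is a hedged illustration rather than the thesis implementation: the function names, the synthetic series and its parameters are assumptions, the DFT bins are divided by the window length so that the features read directly as mean and amplitude, the common $e^{-j\dots}$ sign convention is used for the phase, and bins 0 and 1 play the role of the first and second components in equation (5.62).

```python
import cmath
import math

def dft_bin(window, k):
    """k-th DFT bin of one window (exp(-j...) sign convention, no scaling)."""
    L = len(window)
    return sum(v * cmath.exp(-2j * math.pi * k * n / L) for n, v in enumerate(window))

def fourier_features(x, half_width, annual_bin=1):
    """[mean, amplitude, phase] for every index that has a full centred window."""
    L = 2 * half_width + 1
    features = []
    for i in range(half_width, len(x) - half_width):
        window = x[i - half_width:i + half_width + 1]
        X0 = dft_bin(window, 0)
        Xa = dft_bin(window, annual_bin)
        features.append([abs(X0) / L, 2 * abs(Xa) / L, cmath.phase(Xa)])
    return features

# Synthetic series with 45 samples per cycle, so a 45-sample window holds one cycle.
P, mu, alpha, theta = 45, 700.0, 150.0, 0.5
x = [mu + alpha * math.cos(2 * math.pi * n / P + theta) for n in range(4 * P)]
W = fourier_features(x, half_width=22)
```

On this noise-free series every window recovers the mean and amplitude, while the phase advances with the window position; on real reflectance data the estimates would fluctuate, as in figure 5.12.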
5.9 SUMMARY
In this chapter, four different feature extraction methods were investigated. The feature extraction methods are all based on the same principle of fitting a cosine model to the time series. The first three methods (EKF, least squares model fitting and M-estimator model fitting) are regression approaches, which attempt to estimate the mean, amplitude and phase components of the cosine function. All three features are comparable among the three regression methods. The Fourier transform method is similar to the other three methods, except that a complex vector is estimated, which contains the combined power of both a cosine and a sine function. The feature vectors extracted using these methods will be used by machine learning methods to determine the corresponding class labels.
CHAPTER SIX
SEASONAL FOURIER FEATURES
6.1 OVERVIEW
In this chapter, the concept of extracting meaningful features from a time series is investigated. The chapter starts by defining the difference between whole clustering and subsequence clustering. It continues by exploring a fundamental pitfall inherent in using subsequence clustering to analyse time series. This is motivated by means of an experiment presented by Keogh and Lin [29] and a worked-out visual example. A key feature extraction method, which extracts the Seasonal Fourier Features (SFF), is presented in section 6.4 and overcomes the disadvantage of using subsequence clustering. The chapter concludes by defining how the SFF is used in a post-classification change detection algorithm to detect change in time series.
6.2 TIME SERIES ANALYSIS
A time series is a sequence of measurements, typically recorded at successive time intervals [191].
Time series have a distinct natural temporal ordering. This induces a high correlation between
measurements taken at a shorter interval from a system, when compared to measurements taken at
a longer interval from the same system. Time series analysis comprises methods for analysing time
series to extract statistics and underlying characteristics. Several different types of analysis can be
applied to time series and are categorised as: exploration, description, prediction and forecasting.
1. Exploration provides in-depth information on serial dependence and any cyclic behaviour
patterns within time series. The time series can also be graphically examined to observe any
salient characteristics.
2. Description provides information of underlying structures hidden within the time series.
Algorithms were developed to decompose time series into several components to examine any
hidden trends, seasonality, slow and fast variations, cyclic irregularities and anomalies.
3. Prediction provides information on any near future event in the time series and can be used as
feedback to control a system’s behaviour that is providing the data points of the time series.
4. Forecasting uses statistical models to generate variations of the time series to observe alternative
possible events that might occur in the future.
Clustering is the most frequently used exploration tool in data mining algorithms. The vast quantities of important information typically hidden in time series have attracted substantial attention [29]. Clustering is used in many algorithms for rule discovery [192], indexing [193], classification [194], prediction [195] or anomaly detection [196]. Clustering of time series is broadly divided into two categories: whole clustering and subsequence clustering [29].
Whole clustering: Whole clustering is similar to the conventional clustering of discrete objects. Each
time series is viewed as an individual discrete object and is thus clustered into groups with other
time series. □
Subsequence clustering: Subsequence clustering is when multiple individual time series (subsequences) are extracted with a sliding window from a single time series. Let $x$, $x = [\vec{x}_1, \vec{x}_2, \dots, \vec{x}_I]$, denote a time series of length $I$. A subsequence extracted from time series $x$ is given as
$$x_p = \vec{x}_p, \vec{x}_{p+1}, \dots, \vec{x}_{p+Q-1}, \tag{6.1}$$
for $1 \le p \le I-Q+1$, where $Q$ is the length of the subsequence. The sequential extraction of subsequences in equation (6.1) is achieved by using a temporal sliding window that has a length of $Q$ and position $p$, $p \in \mathbb{N}_0$, that is incremented with a natural number $N$ to extract sequential subsequences $x_p$ from $x$. This set of subsequences is clustered into groups, similar to how whole clustering clusters an entire time series. □
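The sliding-window extraction of equation (6.1) can be sketched as follows. This is a minimal illustration, assuming zero-based positions and a hypothetical function name; the `step` argument plays the role of the increment $N$.

```python
def extract_subsequences(x, Q, step=1):
    """Return every length-Q window of x, advancing the start position by `step`."""
    if Q < 1 or step < 1:
        raise ValueError("Q and step must be positive")
    return [x[p:p + Q] for p in range(0, len(x) - Q + 1, step)]

series = list(range(10))                             # stand-in series of length I = 10
windows = extract_subsequences(series, Q=4)          # I - Q + 1 = 7 overlapping windows
sparse = extract_subsequences(series, Q=4, step=3)   # window advanced by 3 each time
```

With a step of one, consecutive windows share $Q-1$ samples, which is the property examined in the next section.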
6.3 MEANINGLESS ANALYSIS
Recently the data mining community’s attention was drawn to a fundamental limitation in the clustering
of subsequences that are extracted with a sliding window from a time series [29]; the sliding window
causes the clustering algorithms to create meaningless results. This is due to the fact that clusters
extracted from the subsequences are forced to obey a certain constraint that is pathologically unlikely
to be satisfied by any data set. The term meaningless originates from the effect of creating random
clusters when applying a clustering algorithm to such subsequences [29].
It should be noted that it is well understood that clustering in a high-dimensional feature space usually produces useless results if proper design considerations are not followed [197, 198]. For example, the K-nearest neighbour algorithm produces fewer useful clusters in higher dimensions. This is because the ratio between the nearest neighbour distance and the average neighbour distance rapidly converges to one in higher dimensions. However, although the analysis of time series usually results in high dimensionality, the data typically have a low intrinsic dimensionality [199]. This is not the limitation that will be discussed in this chapter.
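The distance-concentration effect mentioned above can be observed on synthetic data. The sketch below is only an illustration under stated assumptions: points are drawn uniformly at random in the unit hypercube, the function name and sample sizes are hypothetical, and a single query point is compared with all others.

```python
import math
import random

def nearest_to_mean_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest-neighbour distance to mean distance for one query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return min(dists) / (sum(dists) / len(dists))

ratio_low = nearest_to_mean_ratio(dim=2)      # low-dimensional feature space
ratio_high = nearest_to_mean_ratio(dim=500)   # high-dimensional feature space
```

In the low-dimensional case the nearest neighbour is much closer than the average point, whereas in the high-dimensional case the ratio moves close to one, so the notion of a nearest neighbour carries little information.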
Keogh and Lin [29] made a surprising claim, which called into question dozens of published
results. The problem identified lies in the way the features are extracted from the sliding window
when presented to the clustering algorithm. This claim is supported by the following experiment.
Experiment presented in [29]: The variability in the clusters formed will be tested using the same
clustering design considerations and methodology on different data sets containing time series. It
is shown that any partitional or hierarchical clustering algorithm would suffice in this experiment,
and under this assumption the K-means was used for its robustness in forming reliable clusters.
The K-means clustering algorithm forms clusters, which are used to define a set of functions.
Let ϑ(a) = {ϑ1 (a), ϑ2 (a), . . . , ϑK (a)} denote the cluster centroids derived with the K-means
algorithm from the first data set.
Let ϑ(b) = {ϑ1 (b), ϑ2 (b), . . . , ϑK (b)} denote the cluster centroids derived with the K-means
algorithm from the second data set.
Let $D_{ed}(\vartheta_i, \vartheta_j)$ denote the Euclidean distance between two cluster centroids. The distance metric $D_{ed}(\vartheta_i, \vartheta_j)$ determines the shortest possible distance for a one-to-one mapping of two sets of centroids $\vartheta(a)$ and $\vartheta(b)$.
The difference between the two sets of cluster centroids is defined as
$$D_M(\vartheta(a), \vartheta(b)) = \sum_{i=1}^{K} \min_{j} \left[ D_{ed}(\vartheta_i(a), \vartheta_j(b)) \right]. \tag{6.2}$$
The consistency of a clustering algorithm in forming similar sets of clusters is measured when the first data set, used to find cluster centroids $\vartheta(a)$, and the second data set, used to find cluster centroids $\vartheta(b)$, are the same data set. A more important measurement is to determine the similarity between the centroids when they are not derived from the same data set.
Keogh and Lin [29] proposed a clustering meaningfulness index as
$$C_M(\vartheta(a), \vartheta(b)) = \frac{D_M(\vartheta(a), \vartheta(a))}{D_M(\vartheta(a), \vartheta(b))}. \tag{6.3}$$
The clustering meaningfulness index measures the similarity between two data sets’ clusters
despite the fact that two different data sets are used.
Intuitively, if proper clustering design considerations were applied the numerator in
equation (6.3) should converge to zero. In contrast to this statement, if the data sets are unrelated,
then the denominator should tend to a large number. This in effect naturally makes the clustering
meaningfulness index CM (ϑ(a), ϑ(b)) → 0.
The results produced in this experiment were unexpected. When a random walk data set was
compared to a stock market data set, the clustering meaningfulness index averaged between 0.5
and 1 when subsequence clustering was applied to the time series. This means that if clustering
was performed on the stock market data set, the centroids derived could be re-used for the random
walk data set and the difference in clustering results could not be observed.
The same was not true when whole clustering was used on these two data sets. The clustering
meaningfulness index converged to zero when the stock market data set and random walk data
set were clustered using a whole clustering approach. Several additional experiments were
conducted in [29] to motivate this behaviour as a property of the sliding window. □
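Equations (6.2) and (6.3) can be evaluated directly once the centroids are available. The sketch below is illustrative only: the centroids are hypothetical values chosen by hand rather than K-means output, the function names are assumptions, and the numerator is assumed to compare centroids from two separate clustering runs on the same data set, since comparing a centroid set with itself would give zero by construction.

```python
import math

def cluster_distance(A, B):
    """Equation (6.2): sum, over centroids in A, of the distance to the closest in B."""
    return sum(min(math.dist(a, b) for b in B) for a in A)

def meaningfulness(run1_a, run2_a, run_b):
    """Equation (6.3): within-data-set distance over between-data-set distance."""
    within = cluster_distance(run1_a, run2_a)
    between = cluster_distance(run1_a, run_b)
    return within / between

# Hypothetical centroids (K = 2, two-dimensional) for illustration.
run1_a = [(0.0, 0.0), (10.0, 0.0)]       # first run on data set a
run2_a = [(0.0, 1.0), (10.0, 1.0)]       # second run on data set a
run_b_far = [(0.0, 50.0), (10.0, 50.0)]  # unrelated data set with distinct centroids
run_b_near = [(0.0, 2.0), (10.0, 2.0)]   # data set whose centroids barely differ

index_far = meaningfulness(run1_a, run2_a, run_b_far)
index_near = meaningfulness(run1_a, run2_a, run_b_near)
```

An index near zero, as for `index_far`, indicates that the two data sets produce distinguishable clusters; an index approaching one, as reported for subsequence clustering, indicates that they do not.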
The sliding window causes the clustering algorithm to create meaningless results, as it forms sine
wave cluster centroids regardless of the data set, which clearly makes it impossible to distinguish one
data set’s clusters from another. Furthermore, the sine waves within the cluster centroids are always
out of phase with each other by exactly 1/K period [29]. The inability to produce meaningful cluster
centroids revealed a new question: how do the cluster centroids obtain this special structure [29]? In
this section a visual example is shown to illustrate why the clustering algorithm produces meaningless
results.
Visual example: Assume a triply modulated cosine function, which is given as
$$x_i = \mu_i + \alpha_i \cos(2\pi f i + \theta_i), \tag{6.4}$$
where the mean $\mu_i$, amplitude $\alpha_i$, frequency $f$ and phase $\theta_i$ are fixed for all time increments in this example. A visual plot of this triply modulated cosine function is shown in figure 6.1. A sliding window is placed on the time series with features extracted from the window at multiples of $\frac{\pi}{2}$ of the period.

FIGURE 6.1: The five feature points, separated by a period of $\frac{\pi}{2}$, are extracted from the sliding window, and are denoted by the set $\{f_1(p), f_2(p), f_3(p), f_4(p), f_5(p)\}$. (Plot of reflectance value against phase, 0 to $6\pi$, showing the time series, the subsequence and the extracted sliding window.)

The five features are extracted at interval $\{0, \frac{\pi}{2}, \pi, \frac{3\pi}{2}, 2\pi\}$ from the sliding window and are denoted by $\{f_1(p), f_2(p), f_3(p), f_4(p), f_5(p)\}$. The position of the sliding window is denoted by the variable $p$, $p \in \mathbb{N}_0$. This is mathematically expressed as
$$x_p = f_1(p), f_2(p), f_3(p), f_4(p), f_5(p) = x_{p\pi/2}, x_{(p+1)\pi/2}, x_{(p+2)\pi/2}, x_{(p+3)\pi/2}, x_{(p+4)\pi/2}. \tag{6.5}$$
The initial extracted features, $p = 0$, are extracted from the sliding window and are expressed as
$$x_0 = f_1(0), f_2(0), f_3(0), f_4(0), f_5(0) = x_0, x_{\pi/2}, x_{\pi}, x_{3\pi/2}, x_{2\pi}. \tag{6.6}$$
It should be noted that the length of the sliding window in this example is set at $Q = 5$. The position of the sliding window is incremented by 1 (equivalent shift of $\frac{\pi}{2}$) to evaluate a new range of observations in the time series (figure 6.2), which is expressed as
$$x_1 = f_1(1), f_2(1), f_3(1), f_4(1), f_5(1) = x_{\pi/2}, x_{\pi}, x_{3\pi/2}, x_{2\pi}, x_{5\pi/2}. \tag{6.7}$$
As the position is incremented, the five features extracted from the time series in set $\{f_1(p), f_2(p), f_3(p), f_4(p), f_5(p)\}$ are presented to a clustering method. To understand the claim of Keogh and Lin [29], focus will only be placed on the first feature $f_1(p)$ without loss of generality. The feature extracted at point $f_1(p)$ for the sliding window at position $p$ is expressed as
$$f_1(p) = x_{p\pi/2}. \tag{6.8}$$
Equation (6.8) is used to create a time series $f_1$ for all the values of $f_1(p)$ for all positions $p$ of the sliding window and is expressed as
$$f_1 = x_0, x_{\pi/2}, x_{\pi}, \dots, x_{(I-Q)\pi/2}. \tag{6.9}$$
The values of the triply modulated cosine function are substituted into $f_1$ as
$$f_1 = \alpha_i, \mu_i, -\alpha_i, \mu_i, \alpha_i, \dots, \alpha_i. \tag{6.10}$$
This shows that, inadvertently, all the features are sequentially presented to every dimension of the feature vector. The fundamental problem becomes intuitive, as every feature dimension is sequentially attempting to learn the same thing. This is better illustrated by tabulating the set of features $\{f_1(p), f_2(p), f_3(p), f_4(p), f_5(p)\}$. Table 6.1 shows what each feature point measures as a function of the sliding window increments. □

FIGURE 6.2: Two sets of five feature points $\{f_1(p), f_2(p), f_3(p), f_4(p), f_5(p)\}$, separated by a period of $\frac{\pi}{2}$, are shown to be extracted by two sliding windows. (Plot of reflectance value against phase, 0 to $6\pi$, showing the time series, the subsequence, and sliding windows 1 and 2 with feature points $f_1(0)$ to $f_5(0)$ and $f_1(1)$ to $f_5(1)$.)

Table 6.1: The sequence of features extracted as a function of the sliding window's position from figure 6.2.

Sliding window   Time         Feature points
position         increment    f1     f2     f3     f4     f5
0                0            αi     µi     -αi    µi     αi
1                π/2          µi     -αi    µi     αi     µi
2                π            -αi    µi     αi     µi     -αi
3                3π/2         µi     αi     µi     -αi    µi
4                2π           αi     µi     -αi    µi     αi

The intuition behind understanding this problem is to imagine an arbitrary data point somewhere
in the time series which enters the sliding window and the contribution this data point makes to the
overall mean of the sliding window. As the sliding window passes by, the data point first appears as the
rightmost value in the window and then sequentially appears exactly once in every possible location
within the sliding window. Thus all feature points will present the same information at different times
and different dimensions to the clustering algorithm. This is equivalent to only presenting one data point to a clustering algorithm and sequentially shifting through the time series.

FIGURE 6.3: Two sets of five feature points $\{f_1(p), f_2(p), f_3(p), f_4(p), f_5(p)\}$, separated by a period of $2\pi$, are shown to be extracted by two sliding windows. (Plot of reflectance value against phase, 0 to $6\pi$, showing the time series, the subsequence, and sliding windows 1 and 2 with feature points $f_1(0)$ to $f_5(0)$ and $f_1(4)$ to $f_5(4)$.)

Several ideas were formulated on how to create meaningful clusters [29]. The first idea was to
increment the position of the sliding window by more than the length of the sliding window. This does
not solve the problem, as the subsequence clustering becomes a whole clustering application. The
second idea considered by Keogh and Lin [29] was to set the number of clusters much higher than
the true number of clusters within the data set. Empirically this only worked if the number of clusters
was set impractically high. The authors concluded that there is no simple solution to the problem of
subsequence clustering.
Proposition 6.3.1 A tentative solution was presented by Keogh and Lin [29] to find meaningful
clusters using subsequence clustering. The example is in essence whole clustering, but it does
emphasise an interesting property. The tentative solution proposes a single time series with a repetitive
pattern, as shown in figure 6.3. The sliding window is shifted by exactly one period of the repetitive
pattern within the time series. The new features are extracted and presented to the clustering algorithm.
The solution becomes more intuitive if the features are tabulated in sequence of extraction.

Table 6.2: The sequence of features extracted as a function of the sliding window's position from figure 6.3.

Sliding window   Time         Feature points
position         increment    f1     f2     f3     f4     f5
0                0            αi     µi     -αi    µi     αi
1                2π           αi     µi     -αi    µi     αi
2                4π           αi     µi     -αi    µi     αi
3                6π           αi     µi     -αi    µi     αi
4                8π           αi     µi     -αi    µi     αi

Table 6.2 now shows that each feature point is acquiring a single property of the time series.
Through feature selection it becomes apparent that features f3 –f5 can be discarded. This tentative
solution provides meaningful clusters when the sliding window position p is incremented by the period
of the repetitive pattern.
This, however, becomes a whole clustering solution if the sliding window's position is incremented by more than its length. This results in analysing non-overlapping sliding windows. □
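The contrast between tables 6.1 and 6.2 can be reproduced numerically. The following sketch assumes hypothetical values for the mean and amplitude, four samples per cycle (so that consecutive samples are a quarter period apart) and illustrative function names; the peak, mean and trough values correspond to the entries labelled $\alpha_i$, $\mu_i$ and $-\alpha_i$ in the tables.

```python
import math

def window_matrix(x, Q, step):
    """Rows are the length-Q windows of x taken every `step` samples."""
    return [x[p:p + Q] for p in range(0, len(x) - Q + 1, step)]

def column(rows, j):
    """Values seen by feature point j+1 over all window positions."""
    return [row[j] for row in rows]

mu, alpha = 1.5, 1.0   # hypothetical mean and amplitude
# Rounding removes the tiny floating-point error in cos(pi/2 * n).
x = [round(mu + alpha * math.cos(math.pi / 2 * n), 9) for n in range(24)]

overlapping = window_matrix(x, Q=5, step=1)   # analogue of Table 6.1
per_period = window_matrix(x, Q=5, step=4)    # analogue of Table 6.2

f1, f2 = column(overlapping, 0), column(overlapping, 1)
```

With a step of one, the sequence seen by `f2` is the sequence seen by `f1` advanced by one position, so every dimension observes the same signal. With a step of one full period, every window is identical and each feature point observes a single property of the series.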
Since remote sensing time series data have a strong periodic component due to the seasonal vegetation dynamics, the extracted sequential time series could potentially be processed to yield usable features. A feature extraction method is proposed in the next section that reduces the feature space's dimensionality and removes the restriction of the tentative solution proposed in [29]. The removal of the restriction on the sliding window's position $p$ will enable effective subsequence clustering that does not suffer from the aforementioned limitations.
6.4 MEANINGFUL CLUSTERING
In this section a method is shown that will create usable features from a subsequence $x_p$ extracted from a MODIS MCD43A4 time series data set. The fixed acquisition rate of the MODIS product and the seasonality of the vegetation in the study area make for an annual periodic signal $x$ that has a phase offset that is correlated with rainfall seasonality and vegetation phenology. The FFT [200] of $x_p$ is computed, which decomposes the time sequence's values into components of different frequencies with phase offsets. This is often referred to as the frequency (Fourier) spectrum of the time series. Because the time series $x_p$ is annually periodic, this translates into frequency components in the frequency spectrum that have fixed positions with varying phase offsets. The varying phases limit the shifting of the sliding window's position $p$ to exactly one periodic cycle [29], unless the clustering algorithm can cater for the varying phases.

FIGURE 6.4: The feature components $X_p(f)$ extracted from two sliding windows at random positions using equation (6.11) yield similar features. (Top panel: reflectance value against time, Jan 02 to Jan 05, with sliding windows 1 and 2 marked. Bottom panel: magnitude of components 0 to 24.)

This limitation is addressed by computing the magnitude of all the FFT components, which
removes all the phase offsets. This makes it possible to compensate for both the restrictive position
p of the sliding window and the seasonality. This means that p, which is the position of the sliding
window, does not have to be incremented by only a fixed annual period, but can be incremented by any
natural number. The features for the clustering method are extracted from the sliding window $x_p$ by the methodology discussed above, and are termed the SFF $X_p$. The SFF is computed as
$$X_p = |\, \mathcal{F}(x_p) \,|, \tag{6.11}$$
where F(·) represents the Fourier transform. From the discussion above, a sliding window of any
length can be applied to the MODIS time series and moved along the time axis at any rate as long as
the feature extraction rule in equation (6.11) is applied. Figure 6.4 illustrates how the SFFs that are
extracted using two different sliding window positions in time maintain their position in the feature
space, even though the two sliding windows are arbitrarily positioned in time.
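The position invariance of equation (6.11) can be illustrated on a synthetic seasonal signal. The sketch below is not the thesis implementation: the signal, its 46 samples per year, the window positions and the function name are assumptions, and a direct DFT is used in place of an FFT to keep the example dependency-free. The invariance is exact here because the window spans a whole number of periods of a noise-free signal; on real MODIS data it would hold only approximately.

```python
import cmath
import math

def sff(window):
    """Equation (6.11): magnitudes of the DFT of one sliding window."""
    Q = len(window)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * n / Q)
                    for n, v in enumerate(window)))
            for k in range(Q)]

# Synthetic annual signal: 46 samples per year, three years long.
P = 46
x = [0.4 + 0.2 * math.cos(2 * math.pi * n / P + 1.0) for n in range(3 * P)]

window_a = x[5:5 + P]      # sliding window at position p = 5
window_b = x[30:30 + P]    # sliding window at position p = 30

features_a = sff(window_a)
features_b = sff(window_b)

max_gap = max(abs(a - b) for a, b in zip(features_a, features_b))
raw_gap = max(abs(a - b) for a, b in zip(window_a, window_b))
```

The raw windows differ substantially sample by sample, yet their magnitude spectra coincide, with the energy concentrated in the mean (component 0) and annual (component 1) terms.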
The seasonal attribute typically associated with MODIS time series and the slow temporal variation relative to the acquisition interval [15] make the first few FFT components dominate the frequency spectrum. This reduces the number of features needed to represent the feature space and thus reduces the dimensionality, making clustering an even more feasible option [201].
The mean and annual FFT components from equation (6.11) were considered, as it was shown by Lhermitte et al. [116] that considerable class separation can be achieved from these components. Many FFT-based classification and segmentation methods consequently only consider a few FFT components [116, 202, 203].
6.5 CHANGE DETECTION METHOD USING THE SEASONAL FOURIER FEATURES
In this section the meaningful clustering approach discussed in section 6.4 is incorporated into a land
cover change detection method. The change detection method operates on multiple spectral bands, as
shown in figure 6.5.
FIGURE 6.5: Temporal sliding window used to define a subsequence of the time series for classification and change detection.
The mean $\mu$ and annual $\alpha$ components of the SFF were considered from each of the MODIS spectral bands. These features are expressed using the same methodology discussed above as
$$X_p^b = |\, \mathcal{F}_\mu^b(x_p^b)\ \ \mathcal{F}_\alpha^b(x_p^b) \,|, \tag{6.12}$$
where $\mathcal{F}_\mu^b$ denotes the mean component extracted from the $b$th spectral band's Fourier transform. The function $\mathcal{F}_\alpha^b$ denotes the annual component extracted from the $b$th spectral band's Fourier transform. The subsequence $x_p^b$ is extracted from the $b$th spectral band at position $p$.
This selection of frequency components reduces the number of features to represent the feature
space and thus reduces the dimensionality. A feature vector is defined to encapsulate multiple spectral
bands’ SFF. The feature vector is defined as
$$X_p^N = [\, X_p^1\ \ X_p^2\ \ \dots\ \ X_p^N \,]. \tag{6.13}$$

FIGURE 6.6: Subsequences of the time series extracted from the two spectral MODIS bands are processed for clustering and change detection.

Here $N$ denotes the number of spectral bands, and $p$, $p \in [1, (I-Q)]$, the position of the sliding window. The first feature vector is the NDVI time series ($N = 1$), which is denoted by $X_p^1$. This is where the NDVI is computed for $X_p^b$ in equation (6.12), which uses a combination of the first two spectral bands (RED and NIR spectral bands) of the MODIS instrument. The second feature vector uses the first two spectral bands separately ($N = 2$), and is denoted by $X_p^2$. The last feature vector uses all seven spectral bands separately ($N = 7$), and is denoted by $X_p^7$.
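The construction of the feature vectors in equations (6.12) and (6.13) can be sketched as follows. The reflectance values are synthetic and all function names are hypothetical; the window is assumed to span exactly one year, so that the annual component falls on DFT bin 1, and the NDVI is computed with the standard ratio (NIR - RED)/(NIR + RED).

```python
import cmath
import math

def dft_magnitude(window, k):
    """Magnitude of the k-th DFT bin of one window."""
    Q = len(window)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * n / Q)
                   for n, v in enumerate(window)))

def band_sff(window, annual_bin):
    """Mean and annual magnitudes for one band, as in equation (6.12)."""
    return [dft_magnitude(window, 0), dft_magnitude(window, annual_bin)]

def feature_vector(band_windows, annual_bin):
    """Concatenate the per-band features, as in equation (6.13)."""
    features = []
    for window in band_windows:
        features.extend(band_sff(window, annual_bin))
    return features

def ndvi(red, nir):
    """Normalised difference vegetation index from RED and NIR reflectance."""
    return [(n - r) / (n + r) for r, n in zip(red, nir)]

Q = 46   # assumed number of samples in a one-year window
red = [0.10 + 0.02 * math.cos(2 * math.pi * n / Q) for n in range(Q)]
nir = [0.30 + 0.10 * math.cos(2 * math.pi * n / Q + math.pi) for n in range(Q)]

x_two_bands = feature_vector([red, nir], annual_bin=1)     # N = 2: four features
x_ndvi = feature_vector([ndvi(red, nir)], annual_bin=1)    # N = 1: two features
```

Each additional band adds two features, so the seven-band case would give a 14-dimensional vector under these assumptions.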
These SFFs are processed by a machine learning algorithm to detect change. The processing chain for the two spectral bands feature vector $X_p^2$ is shown as an illustration in figure 6.6. The output is a time series of classifications for a given pixel as a function of the sliding window position $p$. Land cover change is then defined as the transition in class label of a pixel's time series from one class to another class, after which it remains in the newly assigned class for the remainder of the time series.
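The change definition above can be expressed as a simple rule on the sequence of class labels. The sketch below adopts a strict reading as an assumption: change is declared only when the labels contain exactly one transition, so that the new class persists to the end of the series. The function name and the class labels are hypothetical.

```python
def detect_change(labels):
    """Return the window position of a persistent class change, or None."""
    if not labels:
        return None
    # Positions at which the label differs from the previous window's label.
    transitions = [p for p in range(1, len(labels)) if labels[p] != labels[p - 1]]
    # Exactly one transition means the new class holds for the remainder.
    if len(transitions) == 1:
        return transitions[0]
    return None

changed = detect_change(["vegetation"] * 5 + ["settlement"] * 4)
stable = detect_change(["vegetation"] * 6)
flicker = detect_change(["vegetation", "settlement", "vegetation", "vegetation"])
```

A label sequence that switches back to its original class, as in the last case, is not reported as change under this reading; other readings of the definition would need a different rule.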
6.6 SUMMARY
In this chapter a detailed overview was given of the pitfall of creating meaningless clusters. An example was presented to illustrate the real limitation of subsequence clustering, followed by a few tentative solutions proposed by Keogh and Lin [29] to solve this problem. Keogh and Lin admit that these do not amount to a fully worked-out solution to the problem, but that a possible solution could be identified with further investigation. In section 6.4, the SFF was proposed as a solution for a particular type of data set, which in this case was a time series with inherent seasonal variations. The SFF will be one of the extracted features used in chapter 8 to detect land cover change.
CHAPTER SEVEN
EXTENDED KALMAN FILTER FEATURES
7.1 OVERVIEW
In this chapter, the Extended Kalman filter (EKF) is used as a feature extraction method and is studied in depth. The chapter discusses how the state-space variables are used within the EKF, followed by how these are used to separate a set of time series into several classes. The importance of the initial parameters used to set the EKF is discussed in section 7.2.3, illustrating how the behaviour is dependent on these initial parameters.
A novel criterion called the Bias-Variance Equilibrium Point (BVEP) is proposed in section 7.2.4, which defines a desired set of initial parameters that will provide optimal performance. The BVEP criterion is derived using both the temporal and spatial information to design a system with desirable behaviour. A specifically designed search algorithm called the Bias-Variance Search Algorithm (BVSA) is proposed, which adjusts the Bias-Variance Score (BVS) to best satisfy the BVEP criterion and thereby provides good initial parameters for the EKF. The chapter concludes with a brief overview of the Autocovariance Least Squares (ALS) method, which will be used as a benchmark when evaluating the method proposed in section 7.2.4.
7.2 CHANGE DETECTION METHOD: EXTENDED KALMAN FILTER
7.2.1 Introduction
An EKF is discussed as a feature extraction method in this section, which is based on the assumption
that the parameters of the underlying model can be used to separate a set of time series into different
classes. The model is based on the seasonal behaviour of a specific land cover class. It should be
noted that a certain model would better describe a particular land cover class than another and that
proper model selection must be done for each different land cover class. It follows that more separable
parameters derived by the EKF make it easier to detect changes in the assigned classes.
Lhermitte et al. proposed a method that separates different land cover classes using a Fourier
analysis of NDVI time series [116]. It was concluded that good separation is achievable when
evaluating the magnitude of the coefficients of the Fourier transform associated with the NDVI signal’s
mean and amplitude components. Kleynhans et al. proposed a method which jointly estimates the mean
and seasonal component of the Fourier transform using a triply modulated cosine function [30]. The
EKF uses the triply modulated cosine function to model NDVI time series by updating the mean (µ),
amplitude (α), and phase (θ) parameters for each time increment.
The method proposed in this section expands on the method of Kleynhans et al. [30] by modelling the spectral bands separately, and addresses a second constraint, namely the manual estimation of the initial parameters for the EKF, to ensure proper tracking of the observation vectors. The initial parameters include the initial state-space vector, process noise covariance matrix and observation noise covariance matrix. An operator typically uses a training set to supervise the adjustment of the initial parameters until acceptable performance is obtained for a set of time series.
7.2.2 The method
The EKF is a non-linear estimation method, which estimates the unobserved parameters using noisy
observation vectors of a related observation model. The EKF has been used in the remote sensing
community for parameter estimation of values related to physical, biogeochemical processes or
vegetation dynamics models [204, 205].
In figure 7.1, a Fourier transform is used to observe that the majority of the signal energy is
contained in the mean and seasonal component of the first spectral band. This implies that the time
series in spectral band 1 are well represented in the time domain as a single cosine function with a
mean offset, amplitude and phase, as shown in figure 7.2.
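The observation that most of the signal energy sits in the mean and seasonal components can be illustrated with a small numerical sketch. The series below is synthetic and every value in it (46 samples, mean 1000, amplitude 250, noise level 20) is an assumption made for illustration, not MODIS data. The seasonal term is given exactly one cycle over the record so that it falls on a single DFT bin.

```python
import numpy as np

# Hypothetical one-year series of 46 eight-day composites; all values are made up.
num_samples = 46
i = np.arange(num_samples)
mu, alpha, theta = 1000.0, 250.0, 0.6
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 20.0, num_samples)
# One full cycle over the record, so the seasonal term falls exactly on DFT bin 1.
x = mu + alpha * np.cos(2 * np.pi * i / num_samples + theta) + noise

# One-sided DFT: bin 0 carries the mean, bin 1 the annual (seasonal) component.
spectrum = np.fft.rfft(x)
energy = np.abs(spectrum) ** 2
energy[1:-1] *= 2.0  # fold in the conjugate-symmetric bins (even-length record)

mean_and_seasonal_share = (energy[0] + energy[1]) / energy.sum()
seasonal_share_without_mean = energy[1] / energy[1:].sum()
print(f"mean + seasonal share of energy: {mean_and_seasonal_share:.4f}")
print(f"seasonal share excluding the mean: {seasonal_share_without_mean:.4f}")
```

With real data the annual cycle need not align with a DFT bin, so some leakage into neighbouring bins should be expected.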
This single cosine model is, however, not a good representation if the time series is non-stationary,
which is often the case; for example, inter-annual variability or land cover change. The triply
modulated cosine function proposed in [30] is extended here to model a spectral band as
\[ x_{i,k,b} = \mu_{i,k,b} + \alpha_{i,k,b} \cos(2\pi f_{\mathrm{samp}}\, i + \theta_{i,k,b}) + v_{i,k,b}. \tag{7.1} \]
The variable $x_{i,k,b}$ denotes the observed value of the $b$th spectral band’s time series,
$b \in \{1, 7\}$, of the $k$th pixel, $k \in [1, N]$, at time index $i$, $i \in [1, I]$. The noise sample
of the $k$th pixel at time $i$ for each spectral band is denoted by $v_{i,k,b}$. The noise is additive
with an unknown distribution on all the spectral bands. The cosine function model is separately fitted
to each of the spectral bands and is based on several different parameters; the frequency
$f_{\mathrm{samp}}$ can be explicitly calculated based on the annual vegetation growth cycle and the
sampling rate of the MODIS sensor. Given the 8-day composite MCD43A4 MODIS data set,
$f_{\mathrm{samp}}$ is set to $8/365$. The non-zero mean of the $b$th spectral band of the $k$th pixel at
time index $i$ is denoted by $\mu_{i,k,b}$, the amplitude by $\alpha_{i,k,b}$ and the phase by
$\theta_{i,k,b}$. The values of $\mu_{i,k,b}$, $\alpha_{i,k,b}$ and $\theta_{i,k,b}$ are dependent on time
and must be estimated for each pixel $k$, $\forall k$, $k \in [1, N]$, given the spectral band observation
vectors $x_{i,k,b}$ for $i$, $\forall i$, $i \in [1, I]$, and $b$, $b \in \{1, 7\}$.

FIGURE 7.1: The time series recorded by the first spectral band for a geographical area is shown in (a)
with the corresponding discrete Fourier transform shown in (b).
(a) Time series of reflectance values recorded by the MODIS spectral band 1 (reflectance value against
time, Jan 02 to Nov 02).
(b) Discrete Fourier transform of the time series shown in (a) (|Frequency component| against
frequency index).
FIGURE 7.2: The tracking of the first two spectral bands using the triply modulated cosine function.
(a) Extended Kalman filter tracking the observation vectors extracted from spectral band 1.
(b) Extended Kalman filter tracking the observation vectors extracted from spectral band 2.
Each panel plots the original time series and the Kalman tracking as reflectance value against time
(Jan 02 to Nov 02).

The MODIS spectral bands, however, are assumed to be uncorrelated and are treated independently
in this method. The index $b$ is omitted for convenience, with no loss in generality in the description of
this method. A state-space vector is estimated by the EKF at each time increment $i$ for each spectral
band and contains all the parameters. This is expressed as
\[ \vec{W}_{i,k} = [W_{i,k,1} \;\; W_{i,k,2} \;\; W_{i,k,3}] = [W_{i,k,\mu} \;\; W_{i,k,\alpha} \;\; W_{i,k,\theta}]. \tag{7.2} \]
For the present example of land cover classification, it is assumed that the state-space vector
$\vec{W}_{i,k}$ does not change significantly through time; hence, the process model is linear. The
measurement model, however, contains the cosine function and, as such, is evaluated via the standard
Jacobian formulation, through linear approximation of the non-linear measurement function around the
current state-space vector. The state-space vector $\vec{W}_{i,k}$ is related to the observation vector
$x_{i,k}$ via a non-linear measurement function. Both the transition function and measurement function
are assumed to be non-perfect, so the addition of process and observation noise is required.
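The Jacobian formulation mentioned above can be made concrete for the cosine model of equation (7.1). The following is a minimal sketch, assuming the state ordering [mean, amplitude, phase] of equation (7.2); the function names `h` and `h_jacobian` are hypothetical and chosen here for illustration only.

```python
import numpy as np

F_SAMP = 8.0 / 365.0  # annual cycle sampled every 8 days, as stated in the text


def h(w, i):
    """Measurement function of eq. (7.1) without noise; w = [mu, alpha, theta]."""
    mu, alpha, theta = w
    return mu + alpha * np.cos(2.0 * np.pi * F_SAMP * i + theta)


def h_jacobian(w, i):
    """Partial derivatives of h with respect to [mu, alpha, theta] at time index i."""
    _, alpha, theta = w
    phase = 2.0 * np.pi * F_SAMP * i + theta
    return np.array([1.0, np.cos(phase), -alpha * np.sin(phase)])
```

The derivative with respect to the phase scales with the amplitude, so for bands with a large seasonal swing a small phase error produces a large change in the predicted observation.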
Converting state-space vectors to land cover classes
A machine learning algorithm is used to process the estimated state-space vectors to assign class labels.
A class label is assigned to each state-space vector for each pixel at each time increment. This is
expressed as
\[ C_{i,k} = F_C(W_{i,k,1}, \dots, W_{i,k,S}) = F_C(\vec{W}_{i,k}), \tag{7.3} \]
where the function $F_C$ denotes either a supervised or unsupervised classifier. The class label for the
$k$th pixel at time $i$ is denoted by $C_{i,k}$. Change is declared when a pixel $k$ changes in class label
as a function of time $i$. This is expressed as
\[ C_{i,k} \neq C_{j,k}, \qquad 0 \le i \le j, \;\; \forall i, j. \tag{7.4} \]
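The change rule of equation (7.4) can be sketched as follows, assuming the class labels have already been produced by some classifier and are stored as an integer array with one row per time index and one column per pixel. The function name `change_detected` and the array layout are assumptions made for this illustration.

```python
import numpy as np


def change_detected(labels):
    """Return one boolean per pixel: True if its class label differs at any two time indices.

    labels: integer array of shape (I, N), time index by pixel.
    """
    labels = np.asarray(labels)
    # Any pair of differing labels implies at least one label differs from the first one.
    return (labels != labels[0]).any(axis=0)


example_labels = np.array([[0, 1],
                           [0, 1],
                           [1, 1]])
print(change_detected(example_labels))
```

In practice a single label flip may be caused by noise, so an operational rule would likely need some form of persistence check; that is outside what equation (7.4) states.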
The importance of the initial parameters will be discussed in the next section.
7.2.3
Importance of the initial parameters
The EKF recursively solves the state-space form of a linear dynamic model [185, Ch. 1]. In this section
the importance of the initial estimates of the system’s variables is shown.
Let $x_k = \{\vec{x}_{i,k}\}_{i=1}^{i=I}$, $k \in [1, N]$, denote the $k$th time series in the set of time
series consisting of observation vectors, with each observation vector denoted by
$\vec{x}_{i,k} = x_{i,k}$ as the spectral bands are treated independently. Let
$\vec{W}_{i,k} = \{W_{i,k,s}\}_{s=1}^{s=S}$ denote the corresponding state-space vector for $x_{i,k}$.
Then it is said that the EKF solves the state-space form recursively using the transition equation given
as
\[ \vec{W}_{i,k} = f(\vec{W}_{(i-1),k}) + z_{(i-1),k}, \tag{7.5} \]
and the measurement equation given as
\[ \vec{x}_{i,k} = h(\vec{W}_{i,k}) + v_{i,k}. \tag{7.6} \]
The transition function is denoted by $f$ and the measurement function is denoted by $h$. A brief
overview of the operations of the EKF, which is shown in section 5.5, is revisited for convenience. It is
well known from estimation theory that many prediction results simplify when Gaussian distributions
are assumed. The process noise vector and observation noise vector are thus assumed to be Gaussian
distributed. The process noise vector is denoted by $z_{(i-1),k}$,
$z_{(i-1),k} \sim \mathcal{N}(0, \mathbf{Q}_{(i-1),k})$, and the observation noise vector is denoted by
$v_{i,k}$, $v_{i,k} \sim \mathcal{N}(0, \mathbf{R}_{i,k})$.
The EKF recursively adapts the state-space vector for each incoming observation vector by
predicting and updating the vector. In the prediction step the state-space vector
$\hat{\vec{W}}_{(i|i-1),k}$ and covariance matrix $\mathbf{B}_{(i|i-1),k}$ are predicted. The predicted
state-space vector’s estimate $\hat{\vec{W}}_{(i|i-1),k}$ is computed as
\[ \hat{\vec{W}}_{(i|i-1),k} = f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big), \tag{7.7} \]
and the predicted covariance matrix $\mathbf{B}_{(i|i-1),k}$ is computed as
\[ \mathbf{B}_{(i|i-1),k} = \mathbf{Q}_{(i-1),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-1|i-1),k} \mathbf{F}_{\mathrm{est}}^{T}. \tag{7.8} \]
The matrix $\mathbf{F}_{\mathrm{est}}$ is the local linearisation of the non-linear transition function $f$.
In the updating step, the posterior estimate of the state-space vector $\hat{\vec{W}}_{(i|i),k}$ is
computed from the predicted estimate and the innovation as
\[ \hat{\vec{W}}_{(i|i),k} = \hat{\vec{W}}_{(i|i-1),k} + \mathbf{K}_{i,k} \Big( \vec{x}_{i,k} - h\big(\hat{\vec{W}}_{(i|i-1),k}\big) \Big), \tag{7.9} \]
using the optimal Kalman gain denoted by $\mathbf{K}_{i,k}$, which is computed as
\[ \mathbf{K}_{i,k} = \mathbf{B}_{(i|i-1),k} \mathbf{H}_{\mathrm{est}}^{T} \mathbf{S}_{i,k}^{-1}. \tag{7.10} \]
The matrix $\mathbf{H}_{\mathrm{est}}$ is the local linearisation of the non-linear measurement function
$h$. The matrix $\mathbf{S}_{i,k}$ denotes the innovation term, which is computed as
\[ \mathbf{S}_{i,k} = \mathbf{H}_{\mathrm{est}} \mathbf{B}_{(i|i-1),k} \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{i,k}. \tag{7.11} \]
The posterior estimate of the covariance matrix $\mathbf{B}_{(i|i),k}$ is computed as
\[ \mathbf{B}_{(i|i),k} = \mathbf{B}_{(i|i-1),k} - \mathbf{K}_{i,k} \mathbf{S}_{i,k} \mathbf{K}_{i,k}^{T}. \tag{7.12} \]
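Equations (7.7) to (7.12) can be collected into a single predict-and-update step. The sketch below is one possible reading under stated assumptions: the transition function is taken as the identity (so the linearised transition matrix is the identity matrix), the measurement is the scalar cosine model of equation (7.1), and the names `ekf_step` and `h` are hypothetical. The numerical values in the usage lines are arbitrary.

```python
import numpy as np

F_SAMP = 8.0 / 365.0  # annual cycle sampled every 8 days


def h(w, i):
    """Noise-free cosine measurement model; w = [mu, alpha, theta]."""
    return w[0] + w[1] * np.cos(2.0 * np.pi * F_SAMP * i + w[2])


def ekf_step(w, B, x, i, Q, R):
    """One EKF recursion for a single pixel and spectral band.

    w: posterior state from the previous step, shape (3,)
    B: posterior covariance from the previous step, shape (3, 3)
    x: scalar observation at time index i
    Q: process covariance, shape (3, 3); R: scalar observation variance
    """
    # Prediction, eqs (7.7)-(7.8), with f assumed to be the identity.
    w_pred = w
    B_pred = Q + B
    # Local linearisation of h around the predicted state.
    phase = 2.0 * np.pi * F_SAMP * i + w_pred[2]
    H = np.array([[1.0, np.cos(phase), -w_pred[1] * np.sin(phase)]])
    # Innovation term, Kalman gain and update, eqs (7.9)-(7.12).
    S = H @ B_pred @ H.T + R          # shape (1, 1)
    K = B_pred @ H.T / S              # shape (3, 1)
    w_new = w_pred + (K * (x - h(w_pred, i))).ravel()
    B_new = B_pred - K @ S @ K.T
    return w_new, B_new


w0 = np.array([1000.0, 250.0, 0.5])
B0 = np.eye(3)
Q = 0.01 * np.eye(3)
w1, B1 = ekf_step(w0, B0, x=1100.0, i=3, Q=Q, R=100.0)
```

Running the step repeatedly over a time series yields one state-space vector per time index, which is the feature vector passed to the classifier.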
The tracking performance of the EKF is assessed by evaluating the stability of the state-space
vector and the error in estimating the observation vector. The error in estimating the observation
vector is computed as the absolute error between the estimated observation vector
$\hat{\vec{x}}_{i,k}$ and the actual observation vector $\vec{x}_{i,k}$. This is expressed as
\[ E_{\vec{x},i,k} = \big| \vec{x}_{i,k} - \hat{\vec{x}}_{i,k} \big| = \Big| \vec{x}_{i,k} - h\big(\hat{\vec{W}}_{(i|i),k}\big) \Big|. \tag{7.13} \]
In equation (7.13), it is observed that the state-space vector $\hat{\vec{W}}_{(i|i),k}$ determines the
observation error $E_{\vec{x},i,k}$. Thus the state-space vector $\hat{\vec{W}}_{(i|i),k}$ can be selected
to minimise the observation error. The MODIS spectral bands are assumed to be uncorrelated and only
produce a single reflectance value for each pixel. This simplifies equation (7.13) to
\[ E_{\vec{x},i,k} = | x_{i,k} - \hat{x}_{i,k} | = \Big| x_{i,k} - h\big(\hat{\vec{W}}_{(i|i),k}\big) \Big|. \tag{7.14} \]
The observation error is easily minimised by significantly varying $\hat{\vec{W}}_{(i|i),k}$ to
accommodate the fluctuation in observation vectors. This does not bode well if the underlying structure
of the system is also being analysed. A significantly varying state-space vector
$\hat{\vec{W}}_{(i|i),k}$ is indicative of an unstable model. The conclusion is that the state-space
model must be kept stable, while also attempting to minimise equation (7.14).
The initial estimates provided to the EKF will now be discussed to illustrate their importance. A
stable state-space vector requires a small adaptation from $\hat{\vec{W}}_{(i-1|i-1),k}$ to
$\hat{\vec{W}}_{(i|i),k}$. The initial estimated state-space vector $\hat{\vec{W}}_{(0|0),k}$,
$\hat{\vec{W}}_{(0|0),k} \in \mathcal{W}$, at the first observation vector $\vec{x}_{0,k}$ is optimised
using a local search method or domain knowledge which satisfies
\[ \hat{\vec{W}}_{(0|0),k} = \operatorname*{argmin}_{\hat{\vec{W}} \in \mathcal{W}} \Big| \vec{x}_{0,k} - h\big(\hat{\vec{W}}\big) \Big|, \tag{7.15} \]
then
\[ E_{\vec{x},0,k} = \Big| \vec{x}_{0,k} - h\big(\hat{\vec{W}}_{(0|0),k}\big) \Big|. \tag{7.16} \]
The recursive adaptation of the state-space vector’s estimate $\hat{\vec{W}}_{(i|i),k}$ is then
calculated using the prediction step given in equation (7.7) and the updating step in equation (7.9).
Equation (7.7) is substituted into equation (7.9) to yield
\[ \hat{\vec{W}}_{(i|i),k} = f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big) + \mathbf{K}_{i,k} \Big( \vec{x}_{i,k} - h\big(f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big)\big) \Big). \tag{7.17} \]
The Kalman gain $\mathbf{K}_{i,k}$ determines the rate of change in the error between the predicted
and estimated state-space vector. If the observation error is large and the Kalman gain is large, then
large changes will be made to the current state-space vector. If the observation error is large and the
Kalman gain is small, then the state-space vector’s estimate $\hat{\vec{W}}_{(i|i),k}$ will adapt slowly,
which typically leads to a large observation error $E_{\vec{x},i,k}$ (equation (7.13)) until it eventually
converges. If the observation error is small and the Kalman gain is large, then the state-space vector
will struggle to converge, as it will continually overshoot the desired state-space vector that will
minimise equation (7.13). Substituting the optimal Kalman gain given in equation (7.10) into
equation (7.17) expands it to
\[ \hat{\vec{W}}_{(i|i),k} = f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big) + \mathbf{B}_{(i|i-1),k} \mathbf{H}_{\mathrm{est}}^{T} \mathbf{S}_{i,k}^{-1} \Big( \vec{x}_{i,k} - h\big(f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big)\big) \Big). \tag{7.18} \]
The Kalman gain is dependent on the predicted covariance matrix $\mathbf{B}_{(i|i-1),k}$ and innovation
term $\mathbf{S}_{i,k}$. The innovation term controls the trust region within the state-space vector’s
space. This is dependent on the predicted covariance matrix $\mathbf{B}_{(i|i-1),k}$ and observation
noise covariance $\mathbf{R}_{i,k}$. Substituting the innovation term given in equation (7.11) into
equation (7.18) results in
\[ \hat{\vec{W}}_{(i|i),k} = f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big) + \mathbf{B}_{(i|i-1),k} \mathbf{H}_{\mathrm{est}}^{T} \big( \mathbf{H}_{\mathrm{est}} \mathbf{B}_{(i|i-1),k} \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{i,k} \big)^{-1} \Big( \vec{x}_{i,k} - h\big(f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big)\big) \Big). \tag{7.19} \]
The last term to evaluate is the predicted covariance matrix $\mathbf{B}_{(i|i-1),k}$. The predicted
covariance matrix of equation (7.8) is substituted to yield an updated state-space vector as
\[
\begin{aligned}
\hat{\vec{W}}_{(i|i),k} ={}& f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big) + \big( \mathbf{Q}_{(i-1),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-1|i-1),k} \mathbf{F}_{\mathrm{est}}^{T} \big) \mathbf{H}_{\mathrm{est}}^{T} \\
& \Big( \mathbf{H}_{\mathrm{est}} \big( \mathbf{Q}_{(i-1),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-1|i-1),k} \mathbf{F}_{\mathrm{est}}^{T} \big) \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{i,k} \Big)^{-1} \\
& \Big( \vec{x}_{i,k} - h\big(f\big(\hat{\vec{W}}_{(i-1|i-1),k}\big)\big) \Big).
\end{aligned}
\tag{7.20}
\]
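The transition function $f$ and measurement function $h$ are assumed to be known. The observation
vector $\vec{x}_{i,k}$ is supplied by the real system. The only variables left within equation (7.20)
are: (1) the previous state-space vector’s estimate $\hat{\vec{W}}_{(i-1|i-1),k}$, (2) the process
noise’s covariance matrix $\mathbf{Q}_{(i-1),k}$, (3) the previous estimate of the covariance matrix
$\mathbf{B}_{(i-1|i-1),k}$, and (4) the observation noise’s covariance matrix $\mathbf{R}_{i,k}$.
The previous estimate of the covariance matrix $\mathbf{B}_{(i-1|i-1),k}$ will be briefly explored, as
it is part of equation (7.20). The covariance matrix $\mathbf{B}_{(i-1|i-1),k}$ is updated with
\[ \mathbf{B}_{(i-1|i-1),k} = \mathbf{B}_{(i-1|i-2),k} - \mathbf{K}_{(i-1),k} \mathbf{S}_{(i-1),k} \mathbf{K}_{(i-1),k}^{T}. \tag{7.21} \]
Substituting the Kalman gain of equation (7.10) into equation (7.21) yields
\[ \mathbf{B}_{(i-1|i-1),k} = \mathbf{B}_{(i-1|i-2),k} - \big( \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} \mathbf{S}_{(i-1),k}^{-1} \big) \mathbf{S}_{(i-1),k} \big( \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} \mathbf{S}_{(i-1),k}^{-1} \big)^{T}. \tag{7.22} \]
Substituting the innovation term of equation (7.11) into equation (7.22) yields
\[
\begin{aligned}
\mathbf{B}_{(i-1|i-1),k} ={}& \mathbf{B}_{(i-1|i-2),k} - \Big( \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} \big( \mathbf{H}_{\mathrm{est}} \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{(i-1),k} \big)^{-1} \Big) \\
& \big( \mathbf{H}_{\mathrm{est}} \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{(i-1),k} \big) \\
& \Big( \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} \big( \mathbf{H}_{\mathrm{est}} \mathbf{B}_{(i-1|i-2),k} \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{(i-1),k} \big)^{-1} \Big)^{T}.
\end{aligned}
\tag{7.23}
\]
The predicted covariance matrix $\mathbf{B}_{(i-1|i-2),k}$ given in equation (7.8) is substituted into
equation (7.23), which yields
\[
\begin{aligned}
\mathbf{B}_{(i-1|i-1),k} ={}& \big( \mathbf{Q}_{(i-2),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-2|i-2),k} \mathbf{F}_{\mathrm{est}}^{T} \big) \\
& - \Big( \big( \mathbf{Q}_{(i-2),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-2|i-2),k} \mathbf{F}_{\mathrm{est}}^{T} \big) \mathbf{H}_{\mathrm{est}}^{T} \big( \mathbf{H}_{\mathrm{est}} ( \mathbf{Q}_{(i-2),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-2|i-2),k} \mathbf{F}_{\mathrm{est}}^{T} ) \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{(i-1),k} \big)^{-1} \Big) \\
& \big( \mathbf{H}_{\mathrm{est}} ( \mathbf{Q}_{(i-2),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-2|i-2),k} \mathbf{F}_{\mathrm{est}}^{T} ) \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{(i-1),k} \big) \\
& \Big( \big( \mathbf{Q}_{(i-2),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-2|i-2),k} \mathbf{F}_{\mathrm{est}}^{T} \big) \mathbf{H}_{\mathrm{est}}^{T} \big( \mathbf{H}_{\mathrm{est}} ( \mathbf{Q}_{(i-2),k} + \mathbf{F}_{\mathrm{est}} \mathbf{B}_{(i-2|i-2),k} \mathbf{F}_{\mathrm{est}}^{T} ) \mathbf{H}_{\mathrm{est}}^{T} + \mathbf{R}_{(i-1),k} \big)^{-1} \Big)^{T}.
\end{aligned}
\tag{7.24}
\]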
Equation (7.20) is computed for every newly obtained observation vector. The state-space vector’s
estimate $\hat{\vec{W}}_{(i|i),k}$ requires the results from equation (7.24) to compute the current
estimates. Since the linearised transition function $\mathbf{F}_{\mathrm{est}}$ and linearised
measurement function $\mathbf{H}_{\mathrm{est}}$ are known, the only variables left to compute in
equation (7.24) are: (1) the initial covariance matrix $\mathbf{B}_{(0|0),k}$, (2) the process covariance
matrix $\mathbf{Q}_{(i-1),k}$, and (3) the observation noise’s covariance matrix $\mathbf{R}_{i,k}$. The
conclusion from equation (7.20) and equation (7.24) is that the initial parameters of importance are:
1. the initial state-space vector’s estimate $\hat{\vec{W}}_{(0|0),k}$,
2. the initial covariance matrix estimate $\mathbf{B}_{(0|0),k}$,
3. the process covariance matrix $\mathbf{Q}_{(i-1),k}$, and
4. the observation covariance matrix $\mathbf{R}_{i,k}$.
The initial state-space vector’s estimate $\hat{\vec{W}}_{(0|0),k}$ is initialised using equation (7.15).
Even if an incorrect estimate is used, the state-space vector $\hat{\vec{W}}_{(i|i),k}$ should converge
to the correct vector as $i \to \infty$. The same is true of the initial covariance matrix
$\mathbf{B}_{(0|0),k}$. As $i \to \infty$, the covariance matrix $\mathbf{B}_{(i|i),k}$ should tend to
converge to the correct matrix. The usual operation of the EKF sets the initial covariance matrix equal
to an identity matrix.
The initial covariance matrix $\mathbf{B}_{(0|0),k}$ will stabilise, as equation (7.8) is known as a
discrete Riccati equation, and under certain circumstances will converge, which results in
equation (7.24) converging to a stable state [206]. The conditions for convergence of the discrete
Riccati equation are:
1. the process covariance matrix $\mathbf{Q}_{(i-1),k}$ is a positive definite matrix,
2. the observation covariance matrix $\mathbf{R}_{i,k}$ is a positive definite matrix,
3. the pair $(\mathbf{F}_{\mathrm{est}}, z_{(i-1),k})$ is controllable, i.e.,
\[ \operatorname{rank}\big[\, z_{(i-1),k} \,|\, \mathbf{F}_{\mathrm{est}} z_{(i-1),k} \,|\, \mathbf{F}_{\mathrm{est}}^{2} z_{(i-1),k} \,|\, \dots \,|\, \mathbf{F}_{\mathrm{est}}^{N-1} z_{(i-1),k} \,\big] = N, \tag{7.25} \]
4. and the pair $(\mathbf{F}_{\mathrm{est}}, \mathbf{H}_{\mathrm{est}})$ is observable, i.e.,
\[ \operatorname{rank}\big[\, \mathbf{H}_{\mathrm{est}}^{T} \,|\, \mathbf{F}_{\mathrm{est}}^{T} \mathbf{H}_{\mathrm{est}}^{T} \,|\, (\mathbf{F}_{\mathrm{est}}^{T})^{2} \mathbf{H}_{\mathrm{est}}^{T} \,|\, \dots \,|\, (\mathbf{F}_{\mathrm{est}}^{T})^{N-1} \mathbf{H}_{\mathrm{est}}^{T} \,\big] = N, \tag{7.26} \]
with $N \in \mathbb{N}$. Under the above conditions, the predicted covariance matrix
$\mathbf{B}_{(i|i-1),k}$ converges to a constant matrix
\[ \lim_{i \to \infty} \mathbf{B}_{(i|i-1),k} = \mathbf{B}_{\mathrm{const}}, \tag{7.27} \]
where $\mathbf{B}_{\mathrm{const}}$ is a symmetric positive definite matrix. $\mathbf{B}_{\mathrm{const}}$
is a unique positive definite solution of the discrete Riccati equation and is independent of the
initial distribution of the initial state-space vector’s estimate $\hat{\vec{W}}_{(0|0),k}$.
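The convergence statement of equation (7.27) can be checked numerically for a simple case. The sketch below uses a hypothetical two-state linear system with fixed transition and measurement matrices; this is an assumption made purely for illustration, since in the EKF the measurement linearisation changes with time. The recursion is started from two very different initial covariance matrices and the limits are compared.

```python
import numpy as np


def predicted_covariance_limit(B0, F, H, Q, R, steps=1000):
    """Iterate eqs (7.8) and (7.10)-(7.12) with fixed F and H; return the last predicted covariance."""
    B_post = B0
    for _ in range(steps):
        B_pred = Q + F @ B_post @ F.T
        S = H @ B_pred @ H.T + R
        K = B_pred @ H.T @ np.linalg.inv(S)
        B_post = B_pred - K @ S @ K.T
    return B_pred


# Hypothetical system: none of these matrices come from the thesis.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)   # positive definite process covariance
R = np.array([[1.0]])  # positive definite observation covariance

# Observability matrix of eq. (7.26) for a two-state system.
observability = np.hstack([H.T, F.T @ H.T])

limit_a = predicted_covariance_limit(np.eye(2), F, H, Q, R)
limit_b = predicted_covariance_limit(100.0 * np.eye(2), F, H, Q, R)
print(np.linalg.matrix_rank(observability))
print(np.max(np.abs(limit_a - limit_b)))
```

If the observability condition is violated, for example by a measurement matrix that never sees one of the states, the two runs need not agree.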
The system can also estimate $\hat{\vec{W}}_{(0|0),k}$ and $\mathbf{B}_{(0|0),k}$ using an offline
training phase. Offline refers to observation vectors that are stored and are used recursively for
estimation. The process covariance matrix $\mathbf{Q}_{(i-1),k}$ and observation covariance matrix
$\mathbf{R}_{i,k}$ are assumed to be constant throughout the recursive estimation of the observation
vector. These matrices are usually set manually by a system analyst in an offline training phase through
successive adjustments. In this thesis the initial EKF is defined as:
1. The initial state-space vector $\hat{\vec{W}}_{(0|0),k}$ is estimated offline.
2. The initial covariance matrix $\mathbf{B}_{(0|0),k}$ is estimated offline.
3. The process covariance matrix $\mathbf{Q}_{(i-1),k}$ is set to a fixed matrix.
4. The observation covariance matrix $\mathbf{R}_{i,k}$ is set to a fixed matrix.
The EKF will track the observation vectors with minimum residual and have a stable internal
state-space vector if all initial parameters are properly estimated.
7.2.4
Bias-Variance Equilibrium Point
The general approach to estimating and initialising the state-space vectors, as well as the observation
and process noise’s covariance matrices for the EKF, is usually for an analyst to determine these offline
using a training data set. Proper estimation of the initial parameters through various methods leads to
good feature vectors from the EKF, while improper estimation could cause system instability, which
leads to delayed tracking or abnormal system behaviour.
A novel BVEP criterion is proposed in this section that will use temporal and spatial information
to design a parameter space where desirable system behaviour is expected. This is accomplished
by first observing the dependencies between the initial parameters. The proposed criterion uses an
unsupervised BVSA to adjust the BVS iteratively to determine proper initial parameters for the EKF.
The characteristics of the initial parameters are first explored before describing the criterion. The first
parameter is the observation covariance matrix Ri,k . The observation covariance matrix Ri,k is defined
as
\[ \mathbf{R}_{i,k} = E\big[(x_{i,k} - E[x_{i,k}])^{2}\big]. \tag{7.28} \]
This is due to the fact that the spectral bands are assumed to be uncorrelated and that the MODIS
sensor only produces a single reflectance value per pixel per spectral band. The second parameter is
the process covariance matrix $\mathbf{Q}_{i,k}$. The process covariance matrix $\mathbf{Q}_{i,k}$ is
defined as
\[
\mathbf{Q}_{i,k} =
\begin{bmatrix}
E[(W_{i,k,1} - E[W_{i,k,1}])(W_{i,k,1} - E[W_{i,k,1}])] & \dots & E[(W_{i,k,1} - E[W_{i,k,1}])(W_{i,k,S} - E[W_{i,k,S}])] \\
\vdots & \ddots & \vdots \\
E[(W_{i,k,S} - E[W_{i,k,S}])(W_{i,k,1} - E[W_{i,k,1}])] & \dots & E[(W_{i,k,S} - E[W_{i,k,S}])(W_{i,k,S} - E[W_{i,k,S}])]
\end{bmatrix}.
\tag{7.29}
\]
The state-space variables within the state-space vector are assumed to be uncorrelated; the process
covariance matrix simplifies to
\[ \mathbf{Q}_{i,k} = \operatorname{diag}\Big( E\big[(W_{i,k,s} - E[W_{i,k,s}])^{2}\big] \Big), \quad \forall s. \tag{7.30} \]
The setting of the initial parameters has a major effect on the overall system performance. The
initial state-space vector $\hat{\vec{W}}_{(0|0),k}$ for the first observation vector $\vec{x}_{0,k}$ is
optimised using equation (7.15). The initial estimated covariance matrix $\mathbf{B}_{(0|0),k}$ is
usually set to the identity matrix. This only leaves the estimation of the observation covariance
matrix $\mathbf{R}_{i,k}$ and process covariance matrix $\mathbf{Q}_{i,k}$. Let the uncorrelated
observation covariance matrix’s diagonals be placed into a vector called the observation candidate
vector $\Upsilon_{R,i,k}$, where $\Upsilon_{R,k}$ is selected from the space $\upsilon_R$, and it is
expressed as
\[ \Upsilon_{R,i,k} = 10^{\zeta_{i,k}/10}, \tag{7.31} \]
with
\[ \zeta_{i,k} = 10 \log_{10}\Big( E\big[(\vec{x}_{i,k} - E[\vec{x}_{i,k}])^{2}\big] \Big). \tag{7.32} \]
Let the uncorrelated process covariance matrix’s diagonals be placed into a vector called the process
candidate vector $\Upsilon_{Q,i,k}$, where $\Upsilon_{Q,k}$ is selected from the space $\upsilon_Q$,
which is expressed as
\[ \Upsilon_{Q,i,k} = 10^{[\varsigma_{i,k,1} \; \dots \; \varsigma_{i,k,S}]/10} = 10^{\vec{\varsigma}_{i,k}/10}, \tag{7.33} \]
with
\[ \varsigma_{i,k,s} = 10 \log_{10}\Big( E\big[(W_{i,k,s} - E[W_{i,k,s}])^{2}\big] \Big). \tag{7.34} \]
It should be noted that the EKF only recursively updates the state-space vector
$\hat{\vec{W}}_{(i|i),k}$ and covariance matrix $\mathbf{B}_{(i|i),k}$. The time index of the process
covariance matrix $\mathbf{Q}_{i,k}$ has been left inserted to emphasise the time effect in a dynamic
linear system. The EKF, however, does not alter the process covariance matrix at each time increment,
and it is thus constant for all time indices. This is formally stated as
$\mathbf{Q} = \mathbf{Q}_i, \forall i$. The observation covariance matrix is also retained as a constant
for all time indices and this is stated as $\mathbf{R} = \mathbf{R}_i, \forall i$. It follows that the
observation covariance matrix and process covariance matrix are independent of time. This property
allows the observation candidate vector to be rewritten as
\[ \Upsilon_{R,k} = 10^{\zeta_k/10} \quad \forall k, \tag{7.35} \]
and the process candidate vector rewritten as
\[ \Upsilon_{Q,k} = 10^{[\varsigma_{k,1} \; \dots \; \varsigma_{k,S}]/10} = 10^{\vec{\varsigma}_k/10} \quad \forall k. \tag{7.36} \]
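The decibel parameterisation of equations (7.35) and (7.36) amounts to a simple conversion between a level in dB and a variance. A minimal sketch follows; the function names and the example levels are assumptions chosen for illustration.

```python
import numpy as np


def variance_from_db(level_db):
    """Map a level in decibels to a variance, as in eqs (7.35)-(7.36)."""
    return 10.0 ** (np.asarray(level_db, dtype=float) / 10.0)


def db_from_variance(variance):
    """Inverse mapping, as in eqs (7.32) and (7.34)."""
    return 10.0 * np.log10(np.asarray(variance, dtype=float))


# Hypothetical candidate levels for one pixel: one for R, three for the diagonal of Q.
zeta_k = 20.0
varsigma_k = np.array([0.0, -10.0, -30.0])

R = variance_from_db(zeta_k)              # scalar observation variance
Q = np.diag(variance_from_db(varsigma_k)) # diagonal process covariance matrix
```

Searching in decibels rather than in variance means that equal steps correspond to equal multiplicative changes, which suits parameters that span several orders of magnitude.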
It was stated earlier that the performance of the Kalman filter is measured by the residual error in
tracking the observation vectors and the internal stability of the state-space vector. A parameter space
is thus defined to describe the system behaviour.
The first desired behaviour is the tracking of the observation vector with minimal residual. This
desired behaviour is expressed as the minimal achievable sum of absolute residuals σE , which is
computed as
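\[ \sigma_E = \min_{\Upsilon_{R,k} \in \upsilon_R,\, \Upsilon_{Q,k} \in \upsilon_Q} \left\{ \sum_{k=1}^{N} \sum_{i=1}^{I} \big| \hat{x}_{i,k} - x_{i,k} \big| \right\}, \tag{7.37} \]
then
\[ [\mathbf{R}_{\sigma_E}, \mathbf{Q}_{\sigma_E}] = \operatorname*{argmin}_{\Upsilon_{R,k} \in \upsilon_R,\, \Upsilon_{Q,k} \in \upsilon_Q} \left\{ \sum_{k=1}^{N} \sum_{i=1}^{I} \big| \hat{x}_{i,k} - x_{i,k} \big| \right\}. \tag{7.38} \]
Thus $\sigma_E$ is the minimal residual, and $[\mathbf{R}_{\sigma_E}, \mathbf{Q}_{\sigma_E}]$ represents
the parameters required to achieve this value. The minimal residual is computed as
\[ \sigma_E = \left. \sum_{k=1}^{N} \sum_{i=1}^{I} \big| \hat{x}_{i,k} - x_{i,k} \big| \, \right|_{\mathbf{R} = \mathbf{R}_{\sigma_E},\, \mathbf{Q} = \mathbf{Q}_{\sigma_E}}. \tag{7.39} \]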
The second criterion is to have internal stability of the state-space vector. This can be measured as
the variations in each of the state-space variables. The second desired behaviour is expressed as the
minimal achievable absolute deviation in state-space variables, which is computed as
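\[ \sigma_s = \min_{\Upsilon_{R,k} \in \upsilon_R,\, \Upsilon_{Q,k} \in \upsilon_Q} \left\{ \sum_{k=1}^{N} \sum_{i=1}^{I} \big| W_{i,k,s} - E[W_{i,k,s}] \big| \right\}, \quad \forall s, \tag{7.40} \]
then
\[ [\mathbf{R}_{\sigma_s}, \mathbf{Q}_{\sigma_s}] = \operatorname*{argmin}_{\Upsilon_{R,k} \in \upsilon_R,\, \Upsilon_{Q,k} \in \upsilon_Q} \left\{ \sum_{k=1}^{N} \sum_{i=1}^{I} \big| W_{i,k,s} - E[W_{i,k,s}] \big| \right\}, \quad \forall s. \tag{7.41} \]
Thus $\sigma_s$ is the minimal absolute deviation in the state-space variable $s$. The set
$[\mathbf{R}_{\sigma_s}, \mathbf{Q}_{\sigma_s}]$ represents the parameters required to achieve this value.
The minimal absolute deviation is computed as
\[ \sigma_s = \left. \sum_{k=1}^{N} \sum_{i=1}^{I} \big| W_{i,k,s} - E[W_{i,k,s}] \big| \, \right|_{\mathbf{R} = \mathbf{R}_{\sigma_s},\, \mathbf{Q} = \mathbf{Q}_{\sigma_s}}. \tag{7.42} \]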
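For a fixed choice of the two covariance matrices, the two sums inside equations (7.39) and (7.42) can be evaluated directly from the filter output. The sketch below is a minimal illustration; the function name, the array layout, and in particular the use of a per-pixel time average as the estimate of the expected value of each state-space variable are assumptions, since the text does not state how that expectation is estimated.

```python
import numpy as np


def tracking_metrics(x, x_hat, W):
    """Evaluate the residual and deviation sums for one parameter setting.

    x, x_hat: observed and estimated values, arrays of shape (I, N)
    W: estimated state-space variables, array of shape (I, N, S)
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    W = np.asarray(W, dtype=float)
    # Sum of absolute residuals over all time indices and pixels.
    residual_sum = np.abs(x_hat - x).sum()
    # Assumption: E[W_{i,k,s}] is estimated by the time average of each pixel's variable.
    W_mean = W.mean(axis=0, keepdims=True)
    # Sum of absolute deviations, one value per state-space variable s.
    deviation_sum = np.abs(W - W_mean).sum(axis=(0, 1))
    return residual_sum, deviation_sum
```

The minimisations in equations (7.37) and (7.40) would then repeat this evaluation over candidate settings of the covariance matrices and keep the smallest value.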
The spatial information is included through the use of a set of time series all located in a specific
geographical area. The set of $N$ time series for a geographical area is denoted by
$\{\vec{x}_{i,k}\}$. Let $q_{i,E}$ denote the probability density function derived at time index $i$ from
the residuals given over the set of observations $\{x_{i,k}\}_{k=1}^{k=N}$ such that
$P[a \le E \le b] = \int_a^b f(e)\,de = \int_a^b f(e, \mathbf{R}, \mathbf{Q})\,de$, i.e.,
\[ P[a \le E \le b] = \int_a^b q(e, \mathbf{R}, \mathbf{Q})\,de = \int_a^b q_{i,E}\,de. \tag{7.43} \]
Let $q_{i,s}$ denote the probability density function for the state-space variable $s$ derived at time
index $i$ from the deviations given over the set of state-space vectors $\{W_{i,k,s}\}_{k=1}^{k=N}$ such
that $P[a \le s \le b] = \int_a^b f(s')\,ds' = \int_a^b f(s', \mathbf{R}, \mathbf{Q})\,ds'$, i.e.,
\[ P[a \le s \le b] = \int_a^b q(s', \mathbf{R}, \mathbf{Q})\,ds' = \int_a^b q_{i,s}\,ds'. \tag{7.44} \]
A conditioned observation probability density function $q^{*}_{i,E}$ is defined as the probability
density function $q_{i,E}$ in equation (7.43), which uses the set
$[\mathbf{R}_{\sigma_E}, \mathbf{Q}_{\sigma_E}]$ to satisfy the condition given in equation (7.39) as
\[ P[a \le E \le b] = \int_a^b q(e, \mathbf{R}_{\sigma_E}, \mathbf{Q}_{\sigma_E})\,de = \int_a^b q^{*}_{i,E}\,de. \tag{7.45} \]
A conditioned process probability density function $q^{*}_{i,s}$ is defined as the probability density
function $q_{i,s}$ in equation (7.44), which uses the set
$[\mathbf{R}_{\sigma_s}, \mathbf{Q}_{\sigma_s}]$ to satisfy the condition given in equation (7.42) as
\[ P[a \le s \le b] = \int_a^b q(s', \mathbf{R}_{\sigma_s}, \mathbf{Q}_{\sigma_s})\,ds' = \int_a^b q^{*}_{i,s}\,ds'. \tag{7.46} \]
The performance of the current estimates $\Upsilon_{R,k}$ and $\Upsilon_{Q,k}$ is defined by a criterion
that evaluates how well the conditions stated in equation (7.37) and equation (7.40) are satisfied. The
current estimates are recursively updated and are denoted by $\hat{\Upsilon}^{\iota}_{R,k}$ and
$\hat{\Upsilon}^{\iota}_{Q,k}$, where $\iota$ denotes the iteration number. The current estimates
$\hat{\Upsilon}^{\iota}_{R,k}$ and $\hat{\Upsilon}^{\iota}_{Q,k}$ are used to derive the sets of
probability density functions $\{\hat{q}^{\iota}_{i,E}\}, \forall i$, and
$\{\hat{q}^{\iota}_{i,s}\}, \forall i$.
An f-divergence known as the Hellinger distance [207, 208] is used to measure the similarity
between the probability density functions $\hat{q}^{\iota}_{i,E}$ and $q^{*}_{i,E}$. The modified
Hellinger distance $H_{i,E}$, $H_{i,E} \in [0, 1]$, is computed as
\[ H_{i,E} = 1 - \sqrt{1 - \int_{-\infty}^{\infty} \sqrt{\hat{q}^{\iota}_{i,E}\, q^{*}_{i,E}}\; de}, \tag{7.47} \]
where a value of $H_{i,E} \to 1$ means high similarity between $\hat{q}^{\iota}_{i,E}$ and
$q^{*}_{i,E}$, while $H_{i,E} \to 0$ means low similarity. The modified Hellinger distance is also used
to measure the similarity for the state-space variables. The modified Hellinger distance $H_{i,s}$,
$H_{i,s} \in [0, 1]$, is computed as
\[ H_{i,s} = 1 - \sqrt{1 - \int_{-\infty}^{\infty} \sqrt{\hat{q}^{\iota}_{i,s}\, q^{*}_{i,s}}\; ds'}, \tag{7.48} \]
where a value of $H_{i,s} \to 1$ means high similarity between $\hat{q}^{\iota}_{i,s}$ and
$q^{*}_{i,s}$, while $H_{i,s} \to 0$ means low similarity.
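A similarity score of this kind can be approximated from two sets of samples by replacing the densities with normalised histograms. The sketch below is an illustration only: it assumes the standard Hellinger distance based on the Bhattacharyya coefficient, a shared set of histogram bins, and a hypothetical function name and bin count. It is not presented as the estimator used in the thesis.

```python
import numpy as np


def modified_hellinger(samples_a, samples_b, bins=32):
    """Return 1 - Hellinger distance between two sample sets; 1 means identical histograms."""
    samples_a = np.asarray(samples_a, dtype=float)
    samples_b = np.asarray(samples_b, dtype=float)
    lo = min(samples_a.min(), samples_b.min())
    hi = max(samples_a.max(), samples_b.max())
    if hi == lo:
        return 1.0  # both sample sets collapse onto a single value
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples_a, bins=edges)
    q, _ = np.histogram(samples_b, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    # Discrete Bhattacharyya coefficient: sum of sqrt(p * q) over the shared bins.
    bc = np.sum(np.sqrt(p * q))
    return 1.0 - np.sqrt(max(0.0, 1.0 - bc))
```

The result depends on the number of bins and on the sample size, so scores are only comparable when both are held fixed across iterations.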
The BVS is defined to encapsulate all similarity metrics as
\[ \Gamma_i = \min\Big( \{H_{i,s}\}_{s=1}^{s=S} \cup \{H_{i,E}\} \Big). \tag{7.49} \]
Finding optimal estimates for $\hat{\Upsilon}^{\iota}_{R,k}$ and $\hat{\Upsilon}^{\iota}_{Q,k}$ requires a
stable covariance matrix $\mathbf{B}_{(i|i),k}$. Equation (7.27) states that the predicted covariance
matrix $\mathbf{B}_{(i|i-1),k}$ should converge to a constant matrix under certain prerequisite
conditions. Let $I_T$, $I_T \ll I$, denote the number of time steps required for the predicted
covariance matrix $\mathbf{B}_{(I_T|I_T-1),k}$ to converge, which in turn ensures a stable covariance
matrix $\mathbf{B}_{(I_T|I_T),k}$. The BVS is deemed accurate at $I_T$, which is defined as
\[ \Gamma_{I_T} = \min\Big( \{H_{I_T,s}\}_{s=1}^{s=S} \cup \{H_{I_T,E}\} \Big). \tag{7.50} \]
The BVEP criterion is defined as the maximum of the BVS over the parameter space, that is, the
setting at which the worst-satisfied condition is satisfied as well as possible:
\[ \Gamma^{*}_{I_T} = \max_{\Upsilon^{\iota}_{R,k} \in \upsilon_R,\, \Upsilon^{\iota}_{Q,k} \in \upsilon_Q} \{\Gamma_{I_T}\}. \tag{7.51} \]
If the reflectance values of the spectral bands are correlated, then the BVS is expanded to compensate
for this as
\[ \Gamma_{I_T} = \min\Big\{ \big\{ \{H_{I_T,b,s}\}_{s=1}^{s=S} \big\}_{b=1}^{b=B} \cup \{H_{I_T,b,E}\}_{b=1}^{b=B} \Big\}. \tag{7.52} \]
In this thesis, however, it was assumed that the spectral bands were uncorrelated.
7.2.5
Bias-Variance Search algorithm
The BVSA is proposed in this section, which will attempt to estimate $\hat{\Upsilon}^{\iota}_{R,k}$ and
$\hat{\Upsilon}^{\iota}_{Q,k}$ to satisfy the BVEP criterion using the BVS given in equation (7.50). The
BVSA starts by creating ideal operating conditions for each parameter in the EKF, followed by using a
hill-climbing algorithm to search for a set of $\hat{\Upsilon}^{\iota}_{R,k}$ and
$\hat{\Upsilon}^{\iota}_{Q,k}$ that will best satisfy the ideal operating conditions for all the
parameters within the EKF.
The first ideal condition is a system that employs perfect tracking of the observation vectors. This
ideal condition is used to create the probability density function $q^{*}_{i,E}$. This is obtained by
\[ q^{*}_{i,E} = q_{i,E} : \{\zeta_k\} \to -\infty; \; \{\varsigma_{k,s}\} \to \infty, \; \forall s. \tag{7.53} \]
Under perfect conditions the probability density function $q^{*}_{i,E}$ should tend to be an impulse of
unity power situated around the zero position, meaning zero error residual is measured.
The second ideal condition is a system that employs a stable state-space variable. This ideal
condition is used to create the probability density function $q^{*}_{i,s}$. This is obtained by
\[ q^{*}_{i,s} = q_{i,s} : \{\zeta_k\} \to \infty; \; \{\varsigma_{k,\{s\} \setminus s}\} \to \infty; \; \{\varsigma_{k,s}\} \to -\infty. \tag{7.54} \]
This condition creates an environment which attempts to track the state-space variable $s$ with the
smallest variation.
After the ideal conditions’ probability density functions $q^{*}_{i,E}$ and $q^{*}_{i,s}$ have been
computed, a hill-climbing search algorithm is applied to find a set of initial parameters that will best
satisfy all these ideal conditions. The BVSA iteratively searches the parameter space and is described
briefly below.
Step 1: The BVSA starts with the initial parameters set as $\zeta^{0}_{k} = 0$ dB, $\forall k$, and
$\varsigma^{0}_{k,s} = 0$ dB, $\forall k, s$.
Step 2: Compute the state-space vector $\hat{\vec{W}}_{(I_T|I_T),k}$ at time $I_T$ using the same
$\hat{\Upsilon}^{\iota}_{R,k} = \zeta^{\iota}_{k}$ and
$\hat{\Upsilon}^{\iota}_{Q,k} = \{\varsigma^{\iota}_{k,s}\}_{s=1}^{s=S}$ for every time series in the set
$\{x_k\}_{k=1}^{k=N}$.
Step 3: Obtain the probability density function of the residual errors $q^{\iota}_{i,E}$ over the $N$
time series at time index $I_T$.
Step 4: Obtain the probability density function of the deviations $q^{\iota}_{i,s}$ of the state-space
variable $s$ over the $N$ time series at time index $I_T$.
Step 5: Compute the modified Hellinger distance $H_{I_T,E}$ as shown in equation (7.47).
Step 6: Compute the modified Hellinger distance $H_{I_T,s}$ as shown in equation (7.48).
Step 7: Determine the best performing condition $H_{\mathrm{best}}$ as
\[ H_{\mathrm{best}} = \max\big( \{H_{I_T,E}\} \cup \{H_{I_T,s}\} \big). \tag{7.55} \]
Step 8: Determine the worst performing condition $H_{\mathrm{worst}}$ as
\[ H_{\mathrm{worst}} = \min\big( \{H_{I_T,E}\} \cup \{H_{I_T,s}\} \big). \tag{7.56} \]
Step 9: Adjust the new $\zeta^{\iota}_{k}$ according to its relative position to the best and worst
performing parameters using a threshold $\rho_H$, $\rho_H \in [0, 1]$, $\rho_H \in \mathbb{R}$. The
adjustment is made as
\[
\zeta^{\iota+1}_{k} =
\begin{cases}
\zeta^{\iota}_{k} + \gamma^{\iota} & \text{if } \dfrac{H_{I_T,E} - H_{\mathrm{worst}}}{H_{\mathrm{best}} - H_{\mathrm{worst}}} > \rho_H \\[2ex]
\zeta^{\iota}_{k} - \gamma^{\iota} & \text{if } \dfrac{H_{I_T,E} - H_{\mathrm{worst}}}{H_{\mathrm{best}} - H_{\mathrm{worst}}} \le \rho_H
\end{cases}.
\tag{7.57}
\]
The variable $\gamma^{\iota}$ is a decreasing scalar measured in decibels and is a non-negative real
number.
Step 10: Adjust the new $\varsigma^{\iota}_{k,s}$ according to its relative position to the best and
worst performing parameters using a threshold $\rho_H$, $\rho_H \in [0, 1]$, $\rho_H \in \mathbb{R}$. The
adjustment is made as
\[
\varsigma^{\iota+1}_{k,s} =
\begin{cases}
\varsigma^{\iota}_{k,s} + \gamma^{\iota} & \text{if } \dfrac{H_{I_T,s} - H_{\mathrm{worst}}}{H_{\mathrm{best}} - H_{\mathrm{worst}}} > \rho_H \\[2ex]
\varsigma^{\iota}_{k,s} - \gamma^{\iota} & \text{if } \dfrac{H_{I_T,s} - H_{\mathrm{worst}}}{H_{\mathrm{best}} - H_{\mathrm{worst}}} \le \rho_H
\end{cases}.
\tag{7.58}
\]
The variable $\gamma^{\iota}$ is a decreasing scalar measured in decibels and is a non-negative real
number.
Repeat steps 2–10 until one of the parameters $\zeta_k$ or $\varsigma_{k,s}$ stabilises. After the search
algorithm converges, the estimates $\hat{\Upsilon}^{\iota}_{R,k}$ and $\hat{\Upsilon}^{\iota}_{Q,k}$ are
used to initialise the EKF.
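Steps 7 to 10 can be sketched as a single update function for one pixel. This is an illustrative reading of equations (7.55) to (7.58), not the thesis implementation: the function name, the argument layout, and the handling of the case in which all similarity scores are equal (the parameters are left unchanged, because the ratio is then undefined) are assumptions.

```python
import numpy as np


def bvsa_update(zeta, varsigma, h_e, h_s, gamma, rho):
    """Apply one pass of steps 7-10 of the search.

    zeta: current level (dB) for the observation covariance
    varsigma: current levels (dB) for the process covariance diagonal, shape (S,)
    h_e: similarity score for the residuals; h_s: scores per state variable, shape (S,)
    gamma: step size in dB for this iteration; rho: threshold in [0, 1]
    """
    varsigma = np.asarray(varsigma, dtype=float)
    h_s = np.asarray(h_s, dtype=float)
    all_scores = np.append(h_s, h_e)
    h_best, h_worst = all_scores.max(), all_scores.min()   # steps 7 and 8
    span = h_best - h_worst
    if span == 0.0:
        # Assumption: the text does not cover this case; leave the parameters unchanged.
        return zeta, varsigma
    rel_e = (h_e - h_worst) / span
    rel_s = (h_s - h_worst) / span
    zeta_new = zeta + gamma if rel_e > rho else zeta - gamma               # step 9
    varsigma_new = np.where(rel_s > rho, varsigma + gamma, varsigma - gamma)  # step 10
    return zeta_new, varsigma_new


zeta_1, varsigma_1 = bvsa_update(0.0, [0.0, 0.0, 0.0], h_e=0.9,
                                 h_s=[0.2, 0.5, 0.8], gamma=1.0, rho=0.5)
```

In a full search this update would sit inside a loop that reruns the EKF, rebuilds the densities and recomputes the similarity scores before each call, with the step size reduced from one iteration to the next.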
7.3
AUTOCOVARIANCE LEAST SQUARES METHOD
In this section a method known as the ALS is investigated as an alternative for setting the initial
parameters of the EKF. If complete system knowledge about the measurement function h and transition
function f were known, then the EKF only requires knowledge of the observation covariance matrix
R and process covariance matrix Q. Several different approaches have been formulated to solve the
estimation of these covariance matrices [209–211]. All these methods assumed that the noise-shaping
matrix in the transition equation is known. In the absence of information on the noise-shaping matrix
the linear dynamic model is modelled as a Gaussian noise vector. The method that is investigated is
the ALS method, which operates in the absence of information on the noise shaping matrix [212]. The
ALS method assumes that:
1. both the measurement function h and transition function f are known,
2. enough observation vectors are available to ensure internal covariance matrix B(i|i) becomes
stable, and
3. the residuals at different time increments are uncorrelated.
The method estimates the observation covariance matrix R and process covariance matrix Q by
minimising an objective function [212]. The objective function is a function of the measurement
function h, transition function f and the noise-shaping matrix (if present). The motivation for using this
method is that it avoids a complicated non-linear estimation approach used by methods that employ a
maximum likelihood estimation approach [213].
7.4
SUMMARY
In this chapter a novel BVEP criterion was proposed, which computes the process covariance matrix
and observation covariance matrix using spatial and temporal information. This criterion could easily
be extended, as shown in equation (7.52), to include spectral information if the spectral bands are
correlated.
The derived matrices in the BVS were then used to initialise the EKF, which is used as a feature
extraction method. The BVSA provides covariance matrices that could be used for a variety of different
applications. A variety of different search algorithms can be used with the BVEP criterion, such as
interior point, active set, simulated annealing, etc. These methods will be explored in chapter 8.
CHAPTER EIGHT
RESULTS
8.1 OVERVIEW
The first part of the chapter studies the effects of different parameter settings to determine their
influence on the quality of the solutions. The second part of the chapter explores the classification
accuracies of several different methods, while the last part investigates the change detection accuracies
of the best performing methods. The chapter concludes by applying these methods to large
regional scale areas and assessing the outcome.
8.2 GROUND TRUTH DATA SET
A labelled data set, offering ground truth, is required to evaluate the performance of different land
cover change detection algorithms. The performance of the methods is measured with a variety of
tests to assess accuracy and robustness. Two study areas were investigated in this chapter, namely the
Limpopo and Gauteng provinces.
Limpopo province: The Limpopo province is located in the northern parts of South Africa and is
largely covered by natural vegetation. The expansion of human settlements, often informal and
unplanned, is the most pervasive form of land cover change in the province. Areas were identified
where new settlements were known to have been built over the last decade.
Gauteng province: The Gauteng province is located in the highveld of South Africa and is the most
urbanised province in the country. The province contributes 33% of the country’s national
economy. Active migration to the province from other provinces is motivated by the prospect
of higher incomes and more diverse employment opportunities. An average growth of 249 310
persons per year within the province has been estimated over the past decade [214, 215]. It
should be noted that the Gauteng province only covers 1.4% of the country’s total land area,
while housing over 20% of the population.
FIGURE 8.1: Three high resolution images acquired over a residential area called Midstream estates
located in Midrand, Gauteng, South Africa. The area was zoned for residential use in 2003 and new
settlements were erected only after 9 July 2008.
(a) Quickbird image taken on 1 March 2004 (courtesy of Google™ Earth).
(b) Quickbird image taken on 9 July 2008 (courtesy of Google™ Earth).
(c) Quickbird image taken on 11 December 2009 (courtesy of Google™ Earth).
8.2.1 MODIS time series data set
The performance of different land cover change detection methods will be evaluated on a per pixel basis
using a set of different spectral bands’ time series, which are extracted from the MODIS land surface
reflectance product. The MODIS (MCD43A4, Collection V005) 500 metre, Nadir and BRDF adjusted
spectral reflectance bands were used, as these significantly reduce the anisotropic scattering effects of
surfaces under different illumination and observation conditions [27, 28]. The first two spectral bands
(RED and NIR spectral bands) are the only spectral bands available at a spatial resolution of 250 metre,
and are not BRDF adjusted. The 500 metre resolution spectral bands were considered to illustrate the
advantages of using additional spectral bands in the analysis. A time series is extracted for all 7 spectral
bands from the data set (MODIS tile H20V11) for each pixel in each study area (year 2000–2008).
[Map: Limpopo province study area, 29°0'0"E to 30°0'0"E and 23°30'0"S to 25°0'0"S. Legend: natural
vegetation, human settlements. Scale bar in km.]
FIGURE 8.2: The Limpopo province study area has land cover type polygons overlaid using Albers
projection on SPOT5 RGB 321 imagery that was acquired between March 2006 and May 2006. The
SPOT2 images were acquired of the same area in May 2000 [8].
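The NDVI time series used later in this chapter is derived from the RED and NIR bands. This section does not restate the index, so the following minimal sketch assumes the standard definition NDVI = (NIR − RED) / (NIR + RED). The function name and the reflectance values are illustrative only and are not taken from the data set.

```python
def ndvi(red, nir):
    """Standard NDVI for one pixel's RED and NIR reflectance time series."""
    if len(red) != len(nir):
        raise ValueError("RED and NIR series must have the same length")
    out = []
    for r, n in zip(red, nir):
        total = n + r
        # Guard against division by zero (e.g. fill values or very dark pixels).
        out.append((n - r) / total if total != 0 else float("nan"))
    return out


# Illustrative (made-up) reflectance values for three 8-day composites.
red_series = [0.10, 0.08, 0.12]
nir_series = [0.30, 0.32, 0.24]
ndvi_series = ndvi(red_series, nir_series)
```

The same function can be applied pixel by pixel to produce the single-band NDVI input alongside the 2-band and 7-band inputs.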
8.2.2 Manual inspection of study areas
Identification of no change areas: Visual interpretation of SPOT2 (year 2000) and SPOT5 (year
2006 / 2008) high spatial resolution images was used to verify that none of the areas classified
as no change, experienced any form of land cover change during the study period.
Identification of change areas: This data set was captured using the same procedure explained for
the no change areas, except that areas where new human settlements had formed during the
study period were captured.
Even though human settlement expansion is one of the most pervasive forms of land cover change
in South Africa, information on this form of land cover change is poorly documented, and vital details
such as the date of land cover conversion cannot be determined reliably. An example of inaccurate
information is shown in figure 8.1. The local municipality demarcated new roads in a suburban area
for future expansion, but no settlements were built there until quite recently. A good estimate of the
date of land cover conversion can be made if regular acquisitions are obtained for a particular area.
In this example, if only the images in figure 8.1(a) and figure 8.1(c) were available, then the date of
change could only be placed somewhere between March 2004 and December 2009. The real land
cover change only occurred after July 2008, which illustrates the importance of knowing when the
change occurred.
[Two image panels, (a) and (b), covering 28°58'30"E to 28°59'30"E and 24°10'30"S to 24°11'0"S. Scale
bar in metres.]
FIGURE 8.3: A land cover change of natural vegetation to human settlement in Sekuruwe. Sekuruwe
is a human settlement that is located in the Limpopo province, South Africa. The SPOT2 image (RGB
321) was acquired on 2 May 2000 of the natural vegetation area (a) and a SPOT5 (RGB 321) image
was acquired on 1 May 2007 of a newly developed human settlement (b). The SPOT2 and SPOT5
images are projected to a MODIS sinusoidal WGS84 projection and are overlaid with a MODIS 500 metre
coordinate grid [8].
Once the areas have been identified as change or no change, they are mapped with polygons on the
geocoded SPOT imagery, as shown in figure 8.2. The SPOT images are then projected to a MODIS
sinusoidal WGS84 projection and overlaid with a MODIS 500 metre coordinate grid (figure 8.3).
The MODIS grid blocks that contain the mapped polygons are then marked for extraction from the
MODIS MCD43A4 product.
8.2.3 Google™ Earth used for visual inspection
Google™ Earth is increasingly used to display and visually validate geographical areas
[216, 217]. As an additional validation step, the MODIS pixel coordinates of interest were transformed
into a KML (Keyhole Markup Language) file and visually inspected in Google™ Earth. The true
colour, high resolution Quickbird images available in Google™ Earth provided a good platform to
illustrate some of the findings presented in this chapter.
Google™ Earth operates on a free sharing policy of images and does not have a mandate to buy
regular imagery of certain geographical areas. This means that only areas in which suitable images
were acquired before and after the settlement formation could be validated using Google™ Earth.
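The KML export step can be sketched as follows. This is a minimal illustration and not the tool used in the thesis: it assumes the MODIS pixel centres have already been converted from the sinusoidal grid to geographic latitude and longitude, and the function name, pixel names and coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET


def pixels_to_kml(pixels):
    """Build a KML document string with one placemark per (name, lat, lon) tuple."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for name, lat, lon in pixels:
        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML expects longitude first, then latitude.
        ET.SubElement(point, "coordinates").text = f"{lon:.6f},{lat:.6f}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical pixel centres in decimal degrees (south and west are negative).
example_pixels = [("pixel_001", -24.175, 28.980), ("pixel_002", -24.180, 28.985)]
kml_text = pixels_to_kml(example_pixels)
```

Writing `kml_text` to a file with a `.kml` extension should give a file that can be opened for visual inspection.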
8.2.4 Simulated land cover data set
Accurate date-of-change information was not available for the ground truth data set, preventing the
measurement of the delay in detecting change of the proposed methods. Land cover change events
were simulated by combining data from natural vegetation and human settlement time series, with the
advantage of a known date of change and transition duration [8].
Four testing data subsets were created, based on concatenating time series of different combinations
of classes:
• Subset 1: natural vegetation time series (class 1) concatenated to settlement time series (class 2).
• Subset 2: settlement time series (class 2) concatenated to natural vegetation time series (class 1).
• Subset 3: settlement time series (class 2) concatenated to another settlement time series (class
2).
• Subset 4: natural vegetation time series (class 1) concatenated to another natural vegetation time
series (class 1).
These four subsets were used to test if the change detection algorithm can detect change reliably
on subsets 1 and 2, while not falsely detecting change for subsets 3 and 4.
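The construction of the four subsets can be sketched as below. The exact splicing procedure is not given in this section, so the sketch assumes that two co-registered time series covering the same period are joined at a chosen time index; the function names and the toy values are hypothetical.

```python
def concatenate_series(first, second, change_index):
    """Use `first` before change_index and `second` from change_index onwards."""
    if len(first) != len(second):
        raise ValueError("both time series must cover the same period")
    return first[:change_index] + second[change_index:]


def build_subsets(vegetation, settlement, change_index):
    """Return {subset number: [(series, changed), ...]} for the four subsets.

    `vegetation` and `settlement` are lists of equal-length time series.
    """
    pairs = {
        1: (vegetation, settlement, True),    # class 1 -> class 2
        2: (settlement, vegetation, True),    # class 2 -> class 1
        3: (settlement, settlement, False),   # class 2 -> class 2
        4: (vegetation, vegetation, False),   # class 1 -> class 1
    }
    subsets = {}
    for key, (src_a, src_b, changed) in pairs.items():
        subsets[key] = [
            (concatenate_series(a, b, change_index), changed)
            for i, a in enumerate(src_a)
            for j, b in enumerate(src_b)
            if src_a is not src_b or i != j  # never join a series to itself
        ]
    return subsets


# Toy example: two series per class, four samples each, change at index 2.
vegetation = [[0.6] * 4, [0.7] * 4]
settlement = [[0.2] * 4, [0.3] * 4]
subsets = build_subsets(vegetation, settlement, change_index=2)
```

Subsets 1 and 2 carry a `changed` flag of `True` and subsets 3 and 4 a flag of `False`, which matches the way the subsets are used to count true and false detections.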
8.3 SYSTEM OUTLINE
In this section an overall system outline is provided to explain how the different methods
interconnect (figure 8.4) to create a change detection framework. The system
starts with the input of time series extracted from the MODIS MCD43A4 land surface reflectance
product (section 2.6). The time series used as input can be any one of the following spectral band
combinations, listed with the number of dimensions in the feature space:
• NDVI (2-dimensions),
• first two spectral bands (RED and NIR spectral bands, 4-dimensions), and
• all seven spectral bands (land bands, 14-dimensions).
[Flow diagram: MODIS MCD43A4 time series (NDVI, 2 bands, 7 bands) → temporal sliding window →
feature extraction (SFF, EKF, LS, M-estimator) → machine learning method (supervised or
unsupervised classifier, with a labelled training set) → time series of class labels → change detection.]
FIGURE 8.4: A flow diagram which provides the complete system outline used in all the experiments
in this chapter.
A temporal sliding window is used to extract sequential subsequences from the time series for
analysis. The length of the temporal sliding window is varied, depending on the feature extraction
method used. The feature extraction methods applied to these subsequences are listed with their
corresponding temporal sliding window length as:
• SFF (6, 12, and 18 months),
• least squares (12 months, see section 8.5.3),
• M-estimator (12 months, see section 8.5.3), and
• EKF (8 days).
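The sliding window step can be illustrated with a short sketch. It assumes 46 eight-day composites per year and, for the SFF case, assumes the features are the window mean and the amplitude of the annual Fourier component; both assumptions, and all function names, are illustrative rather than a statement of the exact implementation.

```python
import cmath
import math


def sliding_windows(series, window_length, step=1):
    """Yield consecutive subsequences of `window_length` samples."""
    for start in range(0, len(series) - window_length + 1, step):
        yield series[start:start + window_length]


def mean_and_seasonal_amplitude(window, samples_per_year):
    """Return (mean, amplitude of the annual Fourier component) of one window."""
    n = len(window)
    mean = sum(window) / n
    # Single DFT coefficient evaluated at the annual frequency (cycles per sample).
    freq = 1.0 / samples_per_year
    coeff = sum(x * cmath.exp(-2j * math.pi * freq * i) for i, x in enumerate(window))
    return mean, 2.0 * abs(coeff) / n


SAMPLES_PER_YEAR = 46  # assumed number of 8-day composites per year

# Synthetic two-year series with mean 0.3 and seasonal amplitude 0.1.
series = [0.3 + 0.1 * math.cos(2.0 * math.pi * i / SAMPLES_PER_YEAR) for i in range(92)]
features = [
    mean_and_seasonal_amplitude(w, SAMPLES_PER_YEAR)
    for w in sliding_windows(series, window_length=SAMPLES_PER_YEAR)
]
```

Each entry of `features` is one feature vector per window position; for several spectral bands the per-band pairs would be concatenated, which is consistent with the 2, 4 and 14 dimensions listed for the three band combinations.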
The extracted feature vectors are then processed by a machine learning method, which assigns a
class label to each feature vector. The machine learning method can be either a supervised classifier, or
an unsupervised classifier. The class labels produced by the machine learning method form a new time
series, where each time index corresponds to a classification of an extracted temporal subsequence.
An example of such a time series consisting of class labels is given in figure 8.5. The class labels in
the time series start at class label 1 (natural vegetation class) and transition to class label -1
(human settlement class) as the position of the temporal sliding window is incremented. It is clear
from the illustration that a change in the land cover has occurred in the time series.
[Plot: class label (1 to −1) against month (Jan to Nov), with the effective change detection delay ∆τ
marked.]
FIGURE 8.5: An illustrative example of the effective change detection delay ∆τ, which is defined as
the time duration it takes after the first acquisition of change in the MODIS time series for the land
cover change detection algorithm to detect it.
A simulated land cover change data set was created in response to the lack of information about
when the actual land cover changed (section 8.2.4). In the simulated land cover change data set, the
exact position (date) of land cover change in the time series is known. This creates another dimension
of evaluation, which enables the quantification of how quickly the land cover change can be detected
by the land cover change detection algorithm.
This delay in detecting a change in land cover is termed the effective change detection delay ∆τ,
and is defined as the time duration after the date of change during which the change detection
algorithm is unable to detect the simulated land cover change in subset 1 and subset 2. The concatenation
process (section 8.2.4) in the simulated land cover change data set produces an abrupt change in
the time series, which does not necessarily represent the reality of human-induced change such as
settlement expansion, which could take several months to develop. A blending period (linear blend
over 12 and 24 months) from one land cover time series to another was initially considered, but it
turned out that it did not affect the ability to detect the land cover change correctly, as this is a property
that is exploited in the post-classification change detection approach. The blending model does not
faithfully simulate all forms of actual land cover change, but it does delay the date on which the
change is declared (figure 8.6). It was concluded that only abrupt concatenation should be used when
measuring the lower limit of the effective change detection delay ∆τ.
[Three plots of class label (vegetation = 1, settlement = −1) against month, with the start and end of
each blend marked: instantaneous blend; 1 year linear blending period; 2 year linear blending period.]
FIGURE 8.6: Class label time series for simulated land cover change from natural vegetation to human
settlement. The top panel is for instantaneous simulated land cover change, the middle panel is for a
land cover change over a 12 month blending period and the bottom panel is for a land cover change
over a 24 month blending period.
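The linear blend and the delay measurement described above can be sketched as follows. The sketch assumes that the delay is counted from the known change index to the first window labelled with the new class, with one label every 8 days; a detection rule that requires several consecutive labels would give a longer delay. All names are hypothetical.

```python
def linear_blend(first, second, start, duration):
    """Blend linearly from `first` to `second` over `duration` samples from `start`.

    A duration of 0 reproduces abrupt concatenation at `start`.
    """
    blended = []
    for i, (a, b) in enumerate(zip(first, second)):
        if i < start:
            w = 0.0
        elif duration == 0 or i >= start + duration:
            w = 1.0
        else:
            w = (i - start) / duration
        blended.append((1.0 - w) * a + w * b)
    return blended


def effective_delay(labels, change_index, new_label, days_per_step=8):
    """Days from the simulated change to the first `new_label` at or after it.

    Returns None when the change is never detected.
    """
    for i in range(change_index, len(labels)):
        if labels[i] == new_label:
            return (i - change_index) * days_per_step
    return None


# Toy example: a two-sample blend, and labels that switch three steps after the change.
blend = linear_blend([1.0] * 6, [0.0] * 6, start=2, duration=2)
labels = [1] * 13 + [-1] * 7
delay_days = effective_delay(labels, change_index=10, new_label=-1)
```

With a non-zero blending duration the class labels switch later than with abrupt concatenation, which is why only the abrupt case is suitable for measuring the lower limit of ∆τ.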
8.4 EXPERIMENTAL PLAN
In this section an overview is given of the experiments conducted in this chapter. The experiments were
conducted in the Limpopo and Gauteng provinces. The number of pixels per data set in each province
is given in table 8.1.
Table 8.1: Number of pixels per land cover type, per study area used for training, validation and testing
data sets.
Province   Class                          Number of time series
Limpopo    Vegetation - No change                          1497
           Settlement - No change                          1735
           Simulated land cover change                      500
           Real land cover change                           118
           Complete Province                             590212
Gauteng    Vegetation - No change                           591
           Settlement - No change                           371
           Simulated land cover change                      124
           Real land cover change                           180
           Complete Province                              78702
The experiments conducted in this chapter are grouped into four categories:
1. Parameter exploration (section 8.5),
2. Classification (section 8.6),
3. Change detection (section 8.7),
4. Provincial experiments (section 8.9).
A set of general experiments was conducted in section 8.5 to optimise the parameters which
are used in the remaining sections (section 8.6 – section 8.9). The first set of experiments is used
to determine the optimal network architecture for the MLP (section 8.5.1) that will minimise the
generalisation error. The second set of experiments is used to explore two different training methods
for the MLP (section 8.5.2): batch mode and iteratively retrained mode. The third set of experiments
is used to optimise the length of the sliding window for the least squares method (section 8.5.3).
The fourth set of experiments is used to compare the performance of the EKF when using the BVEP
criterion (denoted by EKFBVEP ) and ALS methods (denoted by EKFALS , section 8.5.4). The fifth set
of experiments is used to investigate the setting of the BVEP criterion using the BVSA (section 8.5.5).
The sixth set of experiments is used to investigate the performance of each of the regression methods
(section 8.5.6). The seventh set of experiments is used to determine the number of clusters to use in
the unsupervised classifier (section 8.5.7). The last set of experiments is used to determine the average
silhouette value for different clustering algorithms (section 8.5.8).
In section 8.6, the classification accuracy is computed for each of the two classes in a range of
experiments on the no change data set. In each section the average classification accuracy is reported,
along with the corresponding standard deviation. Different combinations of feature extraction methods
and machine learning methods are investigated in these experiments. The feature extraction methods
that were explored are:
• least squares model fitting,
• M-estimator model fitting,
• SFF, and
• EKFBVEP .
The classification experiments are divided into supervised classification experiments and
unsupervised classification experiments. The machine learning method determines the category of
the classifier. The machine learning methods that were explored are:
1. Supervised classifier:
• Multilayer Perceptron (section 8.6.1).
2. Unsupervised classifier:
• Hierarchical clustering, single linkage criterion (section 8.6.3),
• Hierarchical clustering, average linkage criterion (section 8.6.3),
• Hierarchical clustering, complete linkage criterion (section 8.6.3),
• Hierarchical clustering, Ward clustering method (section 8.6.4),
• Partitional clustering, K-means algorithm (section 8.6.5),
• Partitional clustering, EM algorithm (section 8.6.6).
The objective of the classification experiments is to identify combinations of methods which have high
classification accuracies and minimal corresponding standard deviations.
The change detection algorithms in this thesis are based on a post-classification approach, and are
thus dependent on the classification accuracies reported in section 8.6. The classification accuracies
are used to identify a set of methods that will provide acceptable change detection accuracies (section
8.7).
The first set of experiments is used to determine the change detection accuracies on the simulated
land cover change data set. The number of time series blended to simulate the land cover change in
each province is given in table 8.1. The true positives and false positives are reported on the simulated
land cover data set in section 8.7.1.
The second set of experiments is used to determine the change detection accuracies on the real land
cover change data set. The number of time series that experienced actual land cover change in the
labelled data set of each province is given in table 8.1. In these experiments only the true positives are
reported on the real land cover data set in section 8.7.2.
The third set of experiments is used to determine the effective change detection delay ∆τ on the
simulated land cover change data set. The number of time series blended to simulate land cover change
with the exact time index known of change in each province is given in table 8.1. The effective change
detection delay is reported in days in section 8.7.3.
The change detection algorithms are then applied to the complete province in section 8.9. The total
number of time series in each province is given in table 8.1. The entire province is classified and areas
which experienced land cover change are mapped, followed by the calculation of summary statistics.
8.5 PARAMETER EXPLORATION
8.5.1 Optimising the multilayer perceptron
The MLP comprises an input layer, one hidden layer and an output layer. All hidden and output nodes
used a tangent sigmoid activation function. The input layer accepts feature vectors for classification,
while the output layer represents the likelihood that an input belongs to a specific class. The MLP
output was in the range [-1;1], where 1 represents a 100% certainty of class membership to class
1 (natural vegetation) given the feature vector, while -1 represents a 100% certainty of class 2
(settlement).
The weights of the MLP were determined using a steepest descent gradient optimisation method in
the training phase, with gradients estimated using backpropagation [130, Ch. 4 p. 140]. A validation
set was used for initial MLP architecture optimisation by evaluating the generalisation error to identify
overfitting of the network for each study area. The MLP architecture was optimised for different
lengths of sliding window Q, number of spectral bands and training mode. In table 8.2 the number
of hidden nodes used in each experiment is reported. The learning rate was set to 0.01 and the
momentum parameter was set to 0.9. The maximum number of epochs in each training phase was set
to 10000, and the generalisation error on the validation set was used as an early stopping criterion.
TABLE 8.2: The number of hidden nodes used within the MLP for each experiment.
Province  Algorithm                   Window length  NDVI  2 Bands  7 Bands
Limpopo   SFF, Iteratively retrained  6 months          7        6        6
                                      12 months         8       10        9
                                      18 months         8        9        7
          SFF, Batch mode             12 months         8       10        9
          Least squares               12 months         9        8       11
          M-estimator                 12 months         9       10        7
          EKFBVEP                     n/a               7        5        5
          EKFALS                      n/a              15       13       11
Gauteng   SFF, Iteratively retrained  6 months          8        8        7
                                      12 months         7        7        8
                                      18 months         7        6        5
          SFF, Batch mode             12 months         7        7        8
          Least squares               12 months         8       10        5
          M-estimator                 12 months        11       10        9
          EKFBVEP                     n/a               9        4        2
          EKFALS                      n/a              14        6        5
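The MLP configuration described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions and not the implementation used for the experiments: it uses batch gradients averaged over the training set, a hypothetical `patience` parameter to decide when to stop early, and toy two-dimensional feature vectors. The learning rate, momentum, tangent sigmoid activations and the [-1;1] output range follow the text.

```python
import math
import random


def init_mlp(n_in, n_hidden, rng):
    """Small random weights; every weight row ends with its bias term."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, x):
    """Return the hidden activations and the network output (both use tanh)."""
    hidden, output = net
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in hidden]
    y = math.tanh(sum(w * v for w, v in zip(output, h)) + output[-1])
    return h, y


def mse(net, data):
    """Mean squared error between network outputs and targets."""
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)


def train(net, train_set, val_set, lr=0.01, momentum=0.9, max_epochs=10000, patience=50):
    """Steepest descent with momentum; returns the weights with the lowest validation error."""
    hidden, output = [row[:] for row in net[0]], net[1][:]
    vel_h = [[0.0] * len(row) for row in hidden]
    vel_o = [0.0] * len(output)
    best_net, best_err, stale = ([r[:] for r in hidden], output[:]), mse((hidden, output), val_set), 0
    for _ in range(max_epochs):
        grad_h = [[0.0] * len(row) for row in hidden]
        grad_o = [0.0] * len(output)
        for x, t in train_set:  # backpropagation, accumulated over the batch
            h, y = forward((hidden, output), x)
            delta_o = (y - t) * (1.0 - y * y)
            for j, hj in enumerate(h):
                grad_o[j] += delta_o * hj
                delta_h = delta_o * output[j] * (1.0 - hj * hj)
                for i, xi in enumerate(x):
                    grad_h[j][i] += delta_h * xi
                grad_h[j][-1] += delta_h
            grad_o[-1] += delta_o
        scale = lr / len(train_set)
        for j in range(len(output)):
            vel_o[j] = momentum * vel_o[j] - scale * grad_o[j]
            output[j] += vel_o[j]
        for j, row in enumerate(hidden):
            for i in range(len(row)):
                vel_h[j][i] = momentum * vel_h[j][i] - scale * grad_h[j][i]
                row[i] += vel_h[j][i]
        err = mse((hidden, output), val_set)
        if err < best_err:
            best_net, best_err, stale = ([r[:] for r in hidden], output[:]), err, 0
        else:
            stale += 1
            if stale >= patience:  # early stopping on the validation error
                break
    return best_net, best_err


rng = random.Random(0)


def sample(centre, label, n):
    """Toy feature vectors scattered around `centre` with target `label`."""
    return [([c + rng.gauss(0.0, 0.05) for c in centre], label) for _ in range(n)]


# Target 1 stands for natural vegetation and -1 for settlement, as in the text.
train_set = sample((0.6, 0.3), 1.0, 20) + sample((0.2, 0.1), -1.0, 20)
val_set = sample((0.6, 0.3), 1.0, 10) + sample((0.2, 0.1), -1.0, 10)
net = init_mlp(n_in=2, n_hidden=7, rng=rng)
initial_err = mse(net, val_set)
best_net, best_err = train(net, train_set, val_set, max_epochs=500)
```

The number of hidden nodes passed to `init_mlp` is the quantity that table 8.2 reports for each experiment.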
8.5.2 Batch mode versus iteratively retrained mode
In this section the notion of an iteratively retrained training mode is explored and is compared to a
classical batch training mode. The change detection method extracts feature vectors sequentially from
a time series using a temporal sliding window. These feature vectors must be processed to yield a class
label for each feature vector.
An MLP operating on the SFFs extracted from the temporal sliding window was used to explore the
difference in classification accuracies between the batch mode and iteratively retrained mode. In the
batch mode [130, Ch. 7 p. 263] all the incremental sliding windows between the year 2000 and the
year 2008 were used as initial training inputs to the MLP. The experiments were then conducted for
the 8 years without any retraining.
The iteratively retrained MLP is proposed to compensate for the inter-annual variability caused
by rainfall variability. The iteratively retrained MLP is trained to recognise data from
the training set within the sliding window at position p in the time series, and is then used to classify
the data from the testing set within the sliding window at position p. This retraining at each time
increment caused a small adaptation of the weights, and has low complexity because of the small
incremental MLP weight changes over each 8 day increment of MODIS. These small MLP weight
changes only required 300 epochs at each time increment for network adaptation.
Table 8.3: Classification accuracy of the batch mode and iteratively retrained MLP on the validation
set. Each entry gives the average classification accuracy for each mode, calculated over 10 repeated
independent experiments along with the corresponding standard deviation. The average classification
accuracy is given in percentage for each of the classes over a temporal sliding window length of 12
months and different sets of spectral band combinations (NDVI, 2 spectral bands and all 7 spectral
bands).
Province  Spectral Band  Class       Batch mode   Iteratively retrained
Limpopo   NDVI           Vegetation  67.7 ± 9.5   72.8 ± 5.3
                         Settlement  83.0 ± 4.9   83.2 ± 3.7
          2 Bands        Vegetation  80.5 ± 5.6   83.1 ± 4.1
                         Settlement  87.2 ± 2.0   86.8 ± 2.7
          7 Bands        Vegetation  94.5 ± 2.1   94.4 ± 1.6
                         Settlement  94.8 ± 1.2   95.2 ± 1.1
Gauteng   NDVI           Vegetation  94.6 ± 4.1   96.2 ± 2.0
                         Settlement  82.3 ± 8.9   88.0 ± 6.3
          2 Bands        Vegetation  96.6 ± 1.4   96.7 ± 1.6
                         Settlement  92.2 ± 3.2   95.6 ± 2.3
          7 Bands        Vegetation  97.2 ± 0.4   99.8 ± 0.3
                         Settlement  95.7 ± 0.4   99.3 ± 0.7
The iteratively retrained mode provided slightly higher mean classification accuracies when
compared to the classical batch training mode.
The reason why the iteratively retrained mode
performed better than the batch mode (table 8.3) is that the iteratively retrained mode had the
advantage of learning the most recent spectral properties of the land cover types, as time progressed.
The iteratively retrained mode takes cognisance of what is within the temporal sliding window to
compensate for short-term inter-annual climate variability and adapts to longer term trends in climate
without confusing any of these with a particular land cover type, which has often been a problem
with other regional land cover studies [218, 219]. It should be noted that these benefits of using the
iteratively retrained mode come at the cost of shorter predictive spans, as predicting future
events would require retraining with a training data set that is not yet available. Because of these
benefits, the iteratively retrained mode is used in the remainder of this chapter.
8.5.3 Optimising least squares
[Plot: classification accuracy (percentage, 0 to 100) for vegetation and settlement against sliding
window length (1 to 18 months).]
FIGURE 8.7: Classification accuracy reported by the K-means algorithm using the model fitted with a
least squares model approach. The average classification accuracy is measured in percentage for each
of the classes over a range of temporal sliding window length.
In this section an experiment was conducted to determine the optimal length of the sliding window
when using the least squares approach to fit a model. The model is a triply modulated cosine
model and the estimated parameters are used by a machine learning method for classification and
change detection. The sliding window length was evaluated against classification accuracy, the model
parameters’ standard deviation and residuals of the fitted model. The classification accuracies were
computed using the K-means algorithm operating on the first two spectral bands that were extracted
from the Limpopo province study area. In figure 8.7, the classification accuracies are plotted as a
function of the sliding window length, which is reported in the number of months.
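The least squares fit over one window can be sketched as follows. The sketch assumes the model takes the form y_i = μ + α cos(ω i + φ) with a known annual angular frequency ω, and that 46 eight-day composites make up one year; the model is rewritten as μ + a cos(ω i) + b sin(ω i) so that it becomes linear in (μ, a, b). The function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_cosine(window, samples_per_year):
    """Least squares estimate of (mean, amplitude, phase) for one window."""
    omega = 2.0 * math.pi / samples_per_year
    rows = [(1.0, math.cos(omega * i), math.sin(omega * i)) for i in range(len(window))]
    # Normal equations (X^T X) beta = X^T y for beta = (mu, a, b).
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    xty = [sum(r[p] * y for r, y in zip(rows, window)) for p in range(3)]
    mu, a, b = solve_linear(xtx, xty)
    # a = alpha * cos(phi) and b = -alpha * sin(phi).
    return mu, math.hypot(a, b), math.atan2(-b, a)


SAMPLES_PER_YEAR = 46  # assumed number of 8-day composites per year
omega = 2.0 * math.pi / SAMPLES_PER_YEAR
window = [0.3 + 0.1 * math.cos(omega * i + 0.5) for i in range(SAMPLES_PER_YEAR)]
mean_est, amplitude_est, phase_est = fit_cosine(window, SAMPLES_PER_YEAR)
```

At least three samples are needed for the normal equations to be solvable, and windows much shorter than a year make the system poorly conditioned, which is one reason the window length has to be chosen with care.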
It was observed that the settlement classification accuracy stabilised above 80% when the sliding
window length surpassed the 5 month mark. The vegetation classification accuracy only stabilised
above 80% after the sliding window had a length longer than 9 months. Similar classification
accuracies and corresponding standard deviations were observed for both classes when the sliding
window length increased beyond 11 months.
The model parameters’ standard deviation for both the mean and amplitude parameters are shown
in figure 8.8(a) and figure 8.8(b) respectively. It was observed that the model parameters’ standard
deviation for both the mean and amplitude parameters reduced as the length of the sliding window was
increased. The mean parameter’s standard deviation for both spectral bands started to decrease more
slowly when the sliding window length was longer than 9 months. The amplitude parameter’s standard
deviation for both spectral bands started to decrease more slowly when the sliding window length was
longer than 10 months.
[Three plots against sliding window length (1 to 18 months) for spectral bands 1 and 2: (a) mean
parameter, log(σ); (b) amplitude parameter, log(σ); (c) absolute error, log(error).]
FIGURE 8.8: The standard deviation for the mean and amplitude parameter are illustrated in (a) and
(b) when using a least squares approach to fit a triply modulated cosine model to the first two spectral
bands of MODIS. The absolute error between the fitted model and the actual MODIS time series is
shown in (c).
The opposite was observed with the absolute error, which measures the difference between the fitted
model and the actual MODIS time series. A shorter sliding window produced smaller measured
residuals, unless the window was so short that the fit was severely affected by the additive noise in the
MODIS time series. A sliding window of 2–3 months had the smallest measured residuals (figure
8.8(c)).
The length of the sliding window was determined based on the classification accuracies, owing to
the inverse relationship between the standard deviations of the model’s parameters and the absolute
error. On the basis of this experiment it was decided to set the sliding window length to 12 months for
all experiments using least squares to fit a model. The similarity between the results produced by the
least squares and M-estimator supports the choice of a 12 month window for the M-estimator too. No
significant variations in the parameter vector were found when sliding the window through the time
series and using the least squares or the M-estimator.
8.5.4 BVEP versus autocovariance least squares
Table 8.4: Classification accuracy of the MLP using either the BVEP criterion or the ALS approach to
fine tune the parameters of the Extended Kalman filter. Each entry gives the average classification
accuracy for each mode, calculated over 10 repeated independent experiments along with the
corresponding standard deviation. The average classification accuracy is given as a percentage for
each of the classes over a number of spectral band combinations (NDVI, 2 spectral bands and all 7
spectral bands).
Province  Spectral Band  Class       EKFALS        EKFBVEP
Limpopo   NDVI           Vegetation  66.6 ± 9.1    80.2 ± 4.4
                         Settlement  79.2 ± 6.2    82.7 ± 3.7
          2 Bands        Vegetation  79.3 ± 2.7    87.2 ± 1.6
                         Settlement  85.9 ± 2.1    89.7 ± 1.3
          7 Bands        Vegetation  86.6 ± 3.7    95.3 ± 0.7
                         Settlement  90.6 ± 1.9    96.1 ± 0.6
Gauteng   NDVI           Vegetation  89.3 ± 4.8    91.4 ± 5.7
                         Settlement  72.1 ± 16.9   86.9 ± 9.1
          2 Bands        Vegetation  90.6 ± 2.9    98.6 ± 1.0
                         Settlement  87.6 ± 3.2    96.2 ± 1.5
          7 Bands        Vegetation  95.3 ± 1.8    99.9 ± 0.1
                         Settlement  94.8 ± 2.4    99.9 ± 0.1
In this section two different methods used for setting the parameters of the EKF are investigated.
The first method that is investigated is the ALS method discussed in section 7.3. The second method
investigated is the BVEP criterion approach discussed in section 7.2.4.
In table 8.4, the classification accuracies for both provinces are reported when the EKF is used
to extract the features. The average classification accuracy is calculated with cross-validation using
10 repeated independent experiments [127]. From these results it was concluded that the EKFBVEP
performed better than any experiment conducted using the EKFALS . This could be owing to the fact
that the BVEP criterion utilises spatial information that is inherent in the set of time series.
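The way each table entry is summarised can be illustrated with a short sketch. It assumes the reported spread is the sample standard deviation over the repeated experiments (the text does not say whether the sample or population form was used), and the function names and accuracy values are made up for illustration.

```python
import statistics


def per_class_accuracy(true_labels, predicted_labels, cls):
    """Percentage of the samples of class `cls` that were labelled correctly."""
    hits = [p == cls for t, p in zip(true_labels, predicted_labels) if t == cls]
    return 100.0 * sum(hits) / len(hits)


def summarise(accuracies):
    """Mean and sample standard deviation over repeated experiments."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# One toy experiment: labels are 1 (vegetation) and -1 (settlement).
vegetation_acc = per_class_accuracy([1, 1, -1, -1], [1, -1, -1, -1], cls=1)
settlement_acc = per_class_accuracy([1, 1, -1, -1], [1, -1, -1, -1], cls=-1)

# Made-up per-class accuracies from four repeated experiments.
mean_acc, std_acc = summarise([95.0, 96.0, 94.0, 95.0])
```

Each class would be summarised separately in this way, giving one "mean ± standard deviation" entry per class, band combination and method.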
8.5.5 Optimisation of Kalman filter parameters
In this section the results obtained by using the BVSA are discussed. The BVSA is an iterative
algorithm that moves the BVS through a defined space. In each epoch the algorithm attempts to
minimise the standard deviation of all the state space variables while simultaneously minimising the
residual between the triply modulated cosine function's output and the actual observations.
Figure 8.9: The expected standard deviation of the mean parameter computed for the first MODIS
spectral band on the Limpopo province study area as a function of epoch. (Plot for spectral band 1:
σµ on a logarithmic axis from 10^-3 to 10^1 against epochs 5 to 20.)
Figure 8.10: The expected standard deviation of the amplitude parameter computed for the first
MODIS spectral band on the Limpopo province study area as a function of epoch. (Plot for spectral
band 1: σα on a logarithmic axis from 10^-3 to 10^1 against epochs 5 to 20.)
In figure 8.9, the standard deviation σµ of the mean parameter obtained by fitting the cosine model
to the first MODIS spectral band is illustrated as a function of epoch in the BVSA. The standard
deviation reported here is the average standard deviation found over all the time series extracted from
the Limpopo province study area. It is clear from the graph that the standard deviation decreases as
more epochs are processed, which implies that the mean parameter appears to become more stable with
each iteration.
The standard deviation σα of the amplitude parameter that is used to fit the first MODIS spectral
band is illustrated as a function of epoch of the BVSA in figure 8.10. The standard deviation reported
here is the average standard deviation found over all the time series extracted from the Limpopo
province study area. It is clear from the graph that the standard deviation decreases as more epochs are
processed, implying increasing stability with further iterations.
Figure 8.11: The expected residuals computed for the first MODIS spectral band on the Limpopo
province study area as a function of epoch. (Plot for spectral band 1: σE on a logarithmic axis
around 10^2 against epochs 5 to 20.)
In figure 8.11, the mean residual σE between the actual observations and the EKF output, averaged
over all the time series, is illustrated as a function of epoch in the BVSA. It is observed that
the residual decreases significantly after the 10th epoch. Overfitting appears towards the end of the
optimisation process. This overfit can occur on any metric; in this experiment it is observed
on the σE metric after the 21st epoch. This overfit defines the end of the search and is used as an early
stopping criterion.
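The early stopping rule described above can be illustrated with a short sketch. It assumes that the
metric of interest (here the residual σE) has already been recorded for each epoch; the function name
and the sample values are hypothetical and are not taken from the BVSA implementation.

```python
def early_stop_epoch(metric_per_epoch):
    """Return the 1-based epoch after which the metric first increases.

    If the metric never increases, the last epoch is returned.
    """
    for epoch in range(1, len(metric_per_epoch)):
        if metric_per_epoch[epoch] > metric_per_epoch[epoch - 1]:
            return epoch  # last epoch before the increase (1-based)
    return len(metric_per_epoch)


# Hypothetical residuals: decreasing at first, then rising again (overfit).
residuals = [120.0, 110.0, 95.0, 90.0, 87.1, 88.4]
stop = early_stop_epoch(residuals)
```

In this sketch the parameters of the epoch returned by `early_stop_epoch` would be the ones retained,
mirroring the use of the 21st epoch in the experiment above.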
Table 8.5: Parameter evaluation of two different search methods that were compared in the Limpopo
province study area.

Algorithm            σµ      σα      σE
Simulated annealing  14.5    12.6    94.6
BVSA                 0.04    0.02    87.1

The process covariance matrix Q and observation covariance matrix R used in the 21st epoch are
then used to initialise the EKF for the experiments. The BVSA is applied independently to each of the
seven spectral bands and NDVI time series to obtain a process covariance matrix Q and observation
covariance matrix R for each spectral band.
Table 8.6: Parameter evaluation of all four methods for the Limpopo province study area. The
measurements are made on all seven MODIS spectral bands and NDVI.

Province: Limpopo

Spectral band  Metric  Least squares  M-estimator  EKFALS  EKFBVEP
NDVI           σE      0.04           0.04         0.001   0.03
               σµ      0.02           0.01         0.04    0.02
               σα      0.02           0.02         0.05    0.001
Band 1         σE      118.6          118.7        144.0   87.1
               σµ      28.8           28.1         29.8    0.04
               σα      36.4           36.1         21.8    0.02
Band 2         σE      145.2          144.7        179.9   95.7
               σµ      38.5           37.4         29.6    0.01
               σα      56.4           57.6         25.2    0.36
Band 3         σE      58.1           58.0         62.3    47.9
               σµ      13.6           13.1         20.9    0.06
               σα      18.9           18.3         14.7    0.05
Band 4         σE      65.6           65.6         81.0    58.3
               σµ      14.2           14.1         25.5    0.05
               σα      19.7           20.8         18.0    0.04
Band 5         σE      154.6          154.3        171.1   97.3
               σµ      36.7           36.2         29.6    0.01
               σα      48.6           49.1         24.9    0.01
Band 6         σE      198.5          198.4        242.4   166.9
               σµ      46.6           45.8         33.8    0.01
               σα      67.8           68.1         27.3    0.01
Band 7         σE      232.1          232.0        302.0   201.1
               σµ      79.3           76.5         31.3    0.02
               σα      77.9           76.4         26.1    0.03

It should be noted that other optimisation algorithms were also explored, based on the objective
function defined in the BVEP criterion (equation (7.50)), to evaluate the performance of the BVSA.
The algorithms used to set the BVS are: (1) the interior point method [220], (2) the active set method
[221], and (3) simulated annealing [222]. It is observed from the active set method that larger and
more aggressive step sizes are required, which supports the BVSA described on page 135. Simulated
annealing (500 epochs, 5 function evaluations per epoch) produced better results than either the active
set method or the interior point method. Table 8.5 compares simulated annealing to the BVSA.
By evaluating the propagation direction of the simulated annealing method, it was concluded that
the method would eventually find the same solution identified by the BVSA, and yield the same
performance. The advantage of the BVSA was the speed of convergence, which is attributed to the
fact that it only requires a single function evaluation per epoch and converged in 21 epochs in this
experiment.
8.5.6 BVSA parameter evaluation
Table 8.7: Parameter evaluation of all four methods for the Gauteng province study area. The
measurements are made on all seven MODIS spectral bands and NDVI.

Province: Gauteng

Spectral band  Metric  Least squares  M-estimator  EKFALS  EKFBVEP
NDVI           σE      0.04           0.04         0.001   0.003
               σµ      0.01           0.01         0.07    0.05
               σα      0.009          0.01         0.06    0.01
Band 1         σE      96.6           96.6         90.8    44.8
               σµ      17.7           17.4         21.3    0.01
               σα      22.5           22.2         17.3    15.3
Band 2         σE      156.4          155.9        204.2   123.4
               σµ      49.1           47.2         29.8    0.01
               σα      54.9           55.3         25.5    0.5
Band 3         σE      55.1           55.1         46.7    38.5
               σµ      10.2           9.8          14.9    0.03
               σα      14.0           13.5         12.2    0.02
Band 4         σE      63.3           63.3         57.0    42.7
               σµ      12.6           12.6         19.2    0.04
               σα      14.7           15.4         14.5    0.03
Band 5         σE      153.2          153.0        162.9   105.3
               σµ      47.4           46.2         26.6    0.01
               σα      54.2           53.8         22.6    0.01
Band 6         σE      157.3          157.4        130.5   87.3
               σµ      29.8           30.0         24.9    0.01
               σα      34.8           36.6         22.2    0.01
Band 7         σE      158.0          157.8        151.9   71.9
               σµ      27.8           27.0         23.0    0.02
               σα      35.0           34.3         21.7    20.5

In this section the derived parameters for each regression method are compared along with the
residuals. The comparison is based on the standard deviation σµ of the mean parameter, the standard
deviation σα of the amplitude parameter, and the residuals σE . A mean (amplitude) parameter with a
small standard deviation indicates a stable variable. A small σE indicates a well-estimated output when
compared to the actual observations.
An analysis of the standard deviation of the parameters extracted from the Limpopo province
data is presented in table 8.6. It was observed that the M-estimator generally performs similarly to
least squares, and in some cases performed slightly better. The EKFALS method generally increased
the residuals to improve the parameter stability when compared to the M-estimator. The EKFBVEP
outperformed all the methods in all the experiments, except for the NDVI experiments. The EKFBVEP
however did yield comparable results to the other methods in the NDVI experiments.
In table 8.7, the same comparison was made as in table 8.6 for the Gauteng province study area.
The M-estimator again performed similarly to least squares and in a few experiments performed
slightly better. The relation between the EKFALS method and M-estimator did not hold in the Gauteng
province study area. The EKFALS method increased its residuals in spectral bands 2 and 5 to improve
the parameters’ stability when compared to the M-estimator. In spectral bands 1, 3 and 4 the mean
parameter’s standard deviation σµ was increased to improve the other two metrics. In spectral bands
6 and 7, EKFALS outperformed the M-estimator in all the metrics. In the NDVI case the EKFALS
decreased its residuals at the cost of parameter stability when compared to the M-estimator.
The EKFBVEP outperformed all methods in all the experiments, except for the NDVI experiments.
A peculiar observation was made for the EKFBVEP in spectral bands 1 and 7. For the first spectral
band case overfitting was observed in the amplitude parameter early in the BVSA, which is used as an
early stopping criterion. For the seventh spectral band case the standard deviation σα of the amplitude
parameter slowly monotonically decreased for each epoch of the BVSA until an overfit was reported
on the residuals σE at the 22nd epoch. If the overfit did not occur, the standard deviation σα of the
amplitude parameter would still steadily decrease. In the remainder of the chapter only the optimised
EKF using the BVEP criterion (EKFBVEP ) will be considered and will be referred to as the EKF
method.
8.5.7 Determining the number of clusters
Determining the number of clusters is one of the most difficult design considerations. The number
of clusters K must be determined that provides maximum compression of information in the feature
vectors with minimal error in classification on the data set.
The average silhouette value Save (equation (4.31) on page 82) is the metric used to determine the
number of clusters. The nature of selecting only natural vegetation and human settlement areas in the
labelled time series data set, and the resolution of the MODIS sensor, suggested a strong tendency of
Save to have a high value at lower values of K. This is due to the fact that the labelled data set contains
two distinct classes. At 500 metre resolution, the MODIS pixels are quite large, and are therefore
likely to contain a mixture of different vegetation types. Nevertheless, it is reasonable to assume that
the variability within the broader vegetation class will be large enough to justify splitting the vegetation
class into subclasses. This however was not the case in the labelled data sets in this study.

Figure 8.12: The average silhouette value Save computed over a range of different number of clusters
in the Gauteng province. (Plot: silhouette value, 0.35 to 0.7, against number of clusters, 2 to 20.)

In figure 8.12, an experiment was performed to compute the average silhouette value Save for a
range of K. The experiment was conducted in Gauteng province using the EKF on the first two
spectral bands. The feature vectors were then clustered using the K-means algorithm, followed by the
computing of the silhouette values. The highest average silhouette value of 0.69 was recorded at two
classes and steadily decreased as K increased. The experiment was repeated for all the other clustering
methods, with K=2 producing the highest silhouette value in all the cases. The same experiments were
conducted in the Limpopo province study area and yielded similar results.
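As an illustration of how the average silhouette value separates a natural two-cluster labelling from
an over-segmented one, a minimal sketch is given below. It uses the standard silhouette definition
with Euclidean distances, which is assumed to correspond to equation (4.31); the function name and
the toy feature vectors are hypothetical, and at least two clusters are assumed to be present.

```python
import math


def average_silhouette(points, labels):
    """Average silhouette value over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
s_two = average_silhouette(points, [0, 0, 0, 1, 1, 1])    # K = 2
s_three = average_silhouette(points, [0, 0, 2, 1, 1, 1])  # K = 3, one group split
```

For these toy points the two-cluster labelling scores higher than the three-cluster labelling, which
is the behaviour used above to select K.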
8.5.8 Results: Cophenetic correlation coefficient
In this section the cophenetic correlation coefficient Dcc was computed for a range of hierarchical
clustering methods: single linkage criterion (section 8.6.3), average linkage criterion (section 8.6.3),
complete linkage criterion (section 8.6.3) and Ward clustering (section 8.6.4).
The cophenetic correlation coefficient evaluates how the created dendrogram retains the original
placement of the feature vectors within the feature space. A high cophenetic correlation coefficient,
Dcc → 1, denotes that the distance representation is well preserved in the dendrogram.
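The computation can be sketched in a few lines for a small data set. The sketch below is an
assumption-laden illustration rather than the procedure used in the experiments: it implements a
naive agglomerative clustering with only the single and complete linkage criteria, records the merge
height of every pair of feature vectors, and correlates those heights with the original pairwise
distances. The function name is hypothetical.

```python
import math
from itertools import combinations


def cophenetic_correlation(points, linkage="single"):
    """Cophenetic correlation of a naive agglomerative clustering.

    linkage: "single" (minimum) or "complete" (maximum) inter-cluster distance.
    """
    n = len(points)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    pick = min if linkage == "single" else max

    def cluster_distance(a, b):
        return pick(dist[(min(i, j), max(i, j))] for i in a for j in b)

    clusters = [[i] for i in range(n)]
    coph = {}  # pair -> dendrogram height at which the pair is first merged
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]))
        height = cluster_distance(clusters[a], clusters[b])
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)

    # Pearson correlation between original and cophenetic distances.
    pairs = sorted(dist)
    x = [dist[p] for p in pairs]
    y = [coph[p] for p in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                           sum((yi - my) ** 2 for yi in y))


# Three one-dimensional feature vectors as a toy example.
d_cc = cophenetic_correlation([(0.0,), (1.0,), (3.0,)], linkage="single")
```

The naive search over all cluster pairs is only practical for small data sets; it is used here to keep
the definition visible.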
The cophenetic correlation coefficient was computed in the Limpopo province for a range of experimental
parameters (table 8.8): hierarchical clustering methods, feature extraction methods, and spectral band
combinations.

Table 8.8: The Cophenetic correlation coefficient computed for a range of hierarchical clustering
methods on the Limpopo province's no change data set.

Algorithm                   Feature extraction  Window length  NDVI   2 Bands  7 Bands
Single linkage criterion    SFF                 6 months       0.50   0.31     0.33
                            SFF                 12 months      0.51   0.32     0.33
                            SFF                 18 months      0.52   0.32     0.33
                            Least squares       12 months      0.49   0.32     0.38
                            M-estimator         12 months      0.49   0.32     0.39
                            EKF                 n/a            0.46   0.28     0.29
Average linkage criterion   SFF                 6 months       0.59   0.64     0.61
                            SFF                 12 months      0.59   0.65     0.61
                            SFF                 18 months      0.59   0.65     0.62
                            Least squares       12 months      0.60   0.62     0.61
                            M-estimator         12 months      0.60   0.62     0.60
                            EKF                 n/a            0.59   0.62     0.59
Complete linkage criterion  SFF                 6 months       0.64   0.64     0.62
                            SFF                 12 months      0.64   0.65     0.63
                            SFF                 18 months      0.64   0.66     0.63
                            Least squares       12 months      0.60   0.61     0.62
                            M-estimator         12 months      0.60   0.62     0.62
                            EKF                 n/a            0.62   0.63     0.64
Ward clustering             SFF                 6 months       0.69   0.71     0.68
                            SFF                 12 months      0.69   0.72     0.68
                            SFF                 18 months      0.70   0.72     0.69
                            Least squares       12 months      0.67   0.73     0.69
                            M-estimator         12 months      0.67   0.73     0.69
                            EKF                 n/a            0.68   0.74     0.69

A small improvement in the cophenetic correlation coefficient is observed when the sliding window
length is increased. It is concluded that the cophenetic correlation coefficient is highly dependent on the
clustering method used, as all feature extraction methods performed similarly when using a particular
clustering method.
The single linkage criterion provided the lowest cophenetic correlation coefficients among the
clustering methods.
The average linkage criterion provided much better cophenetic correlation
coefficients than the experiments using the single linkage criterion. A small improvement is observed
in the NDVI experiments when the complete linkage criterion is compared to the average linkage
criterion. Similar results were observed for the average and complete linkage criteria in the two and
seven spectral band experiments. A small improvement was observed in all the experiments when
Ward clustering was used instead of the complete linkage criterion.
The same trend in cophenetic correlation coefficients was observed in the Gauteng province when
all the experiments were compared to the results produced in the Limpopo province. The cophenetic
correlation coefficient confirms the trend, which is observed in classification accuracies through
sections 8.6.3–8.6.4. This is an important experiment, as this result was derived in an unsupervised
manner, meaning the class labels for each time series were not used in the cluster process. It
was concluded from the experiments conducted in this section that creating spherical clusters with
minimum internal variance preserves the inherent distance between feature vectors within the feature
space, which results in a higher cophenetic correlation coefficient.
8.6 CLASSIFICATION
8.6.1 Classification accuracy: Multilayer perceptron
Table 8.9: Classification accuracy of the MLP using SFFs on the no change data set. Each entry gives
the average classification accuracy in percentage along with the corresponding standard deviation.

                                      Sliding window length
Province  Spectral band  Class        6 months      12 months     18 months
Limpopo   NDVI           Vegetation   69.7 ± 7.8    72.8 ± 5.3    73.9 ± 4.8
                         Settlement   81.5 ± 5.0    83.2 ± 3.7    84.8 ± 3.1
          2 Bands        Vegetation   81.4 ± 4.3    83.1 ± 4.1    85.2 ± 3.7
                         Settlement   86.3 ± 3.4    86.8 ± 2.7    88.1 ± 2.2
          7 Bands        Vegetation   93.1 ± 2.1    94.4 ± 1.6    94.7 ± 1.4
                         Settlement   93.8 ± 1.6    95.2 ± 1.1    96.3 ± 0.9
Gauteng   NDVI           Vegetation   94.4 ± 3.7    96.2 ± 2.0    95.8 ± 2.2
                         Settlement   79.5 ± 11.5   88.0 ± 6.3    88.5 ± 7.2
          2 Bands        Vegetation   95.1 ± 2.8    96.7 ± 1.6    97.2 ± 1.9
                         Settlement   90.7 ± 6.7    95.6 ± 2.3    95.8 ± 2.5
          7 Bands        Vegetation   99.3 ± 0.7    99.8 ± 0.3    99.8 ± 0.3
                         Settlement   98.1 ± 1.4    99.3 ± 0.7    99.6 ± 0.6

In this section the classification accuracies are evaluated for an MLP using a range of feature
extraction methods. In table 8.9, the classification accuracies for both provinces are reported using
SFFs. The average classification accuracy and corresponding standard deviation were calculated with
cross-validation using 10 repeated independent experiments. The accuracy is reported for each class
over a range of temporal sliding window lengths (6, 12 and 18 months) and different spectral band
combinations (NDVI, 2 spectral bands and all 7 spectral bands).

Table 8.10: Classification accuracy of the MLP using regression methods to extract features on the
no change data set. Each entry gives the average classification accuracy in percentage along with the
corresponding standard deviation.

                                      Method
Province  Spectral band  Class        Least squares  M-estimator   EKF
Limpopo   NDVI           Vegetation   72.5 ± 5.3     72.8 ± 5.4    80.2 ± 4.4
                         Settlement   83.3 ± 3.4     84.6 ± 3.4    82.7 ± 3.7
          2 Bands        Vegetation   82.2 ± 4.3     83.1 ± 4.3    87.2 ± 1.6
                         Settlement   86.4 ± 2.8     87.7 ± 2.5    89.7 ± 1.3
          7 Bands        Vegetation   92.5 ± 2.3     92.5 ± 1.9    95.3 ± 0.7
                         Settlement   92.6 ± 1.2     92.4 ± 1.4    96.1 ± 0.6
Gauteng   NDVI           Vegetation   92.5 ± 4.9     93.1 ± 4.4    91.4 ± 5.7
                         Settlement   88.6 ± 6.4     88.8 ± 6.0    86.9 ± 9.1
          2 Bands        Vegetation   97.5 ± 1.8     97.3 ± 1.9    98.6 ± 1.0
                         Settlement   95.1 ± 2.6     94.9 ± 2.9    96.2 ± 1.5
          7 Bands        Vegetation   99.8 ± 0.4     99.9 ± 0.4    99.9 ± 0.1
                         Settlement   99.2 ± 0.5     99.3 ± 0.9    99.9 ± 0.1

It is observed that a longer sliding window generally yields a higher classification accuracy, as
well as a reduction in standard deviation. Another trend that was observed was an increase
in overall performance when more spectral bands were used as input to an MLP classifier. This is
supported by a higher classification accuracy for the first two spectral bands when compared to the
NDVI, and the highest classification accuracy was reported for all seven spectral bands.
In table 8.10, the classification accuracies for both provinces are reported using regression methods
to extract the features. The regression methods attempted to fit a triply modulated cosine function to
the MODIS time series. The sliding window length was set to 12 months for both the least squares and
M-estimator approaches. A similar improvement is observed as in table 8.9 when more spectral bands
are used in the experiments.
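To make the regression step concrete, the least squares fit over a single sliding window can be
sketched as follows. The sketch assumes the cosine model takes the form y_k = µ + α cos(ωk + φ) with
an annual period of 46 MODIS samples (consistent with the 12 month window length of 46 samples used
in this chapter); the exact model definition is given elsewhere in the thesis, and the function name
and synthetic data below are hypothetical.

```python
import numpy as np


def fit_cosine(window, samples_per_year=46):
    """Least squares fit of y_k = mu + alpha * cos(w k + phi) over one window."""
    k = np.arange(len(window))
    w = 2.0 * np.pi / samples_per_year
    # cos(wk + phi) = cos(phi) cos(wk) - sin(phi) sin(wk), so the model is linear
    # in (mu, a, b) with a = alpha cos(phi) and b = -alpha sin(phi).
    design = np.column_stack([np.ones(len(window)), np.cos(w * k), np.sin(w * k)])
    (mu, a, b), *_ = np.linalg.lstsq(design, window, rcond=None)
    alpha = np.hypot(a, b)
    phi = np.arctan2(-b, a)
    return mu, alpha, phi


# Synthetic, noise-free 12 month window (46 samples) with known parameters.
k = np.arange(46)
y = 0.4 + 0.2 * np.cos(2.0 * np.pi / 46 * k + 0.5)
mu_hat, alpha_hat, phi_hat = fit_cosine(y)
```

Sliding the window along the time series and refitting at each position yields a sequence of mean and
amplitude estimates, which is one way the regression-based feature vectors could be formed.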
From all the experiments it was concluded that a significant improvement is obtained when using
the first two spectral bands rather than the NDVI. A further improvement was observed when the MLP
operated on all seven spectral bands. The experiments conducted in this section are repeated in the
following sections using different clustering algorithms.
8.6.2 Clustering experimental setup
In the following sections (section 8.6.3–8.6.4), different clustering approaches are analysed in a range
of experiments. The first set of experiments conducted in each section is the measurement of the
classification accuracy of the labelled time series using SFFs. The experiments were conducted for
three different lengths of sliding window: 6 months (23 MODIS samples), 12 months (46 MODIS
samples), and 18 months (69 MODIS samples). The experiments also explore the use of different
spectral bands: NDVI, the first two spectral bands, and all seven spectral bands. In each experiment
the classification accuracy along with the standard deviation is reported for the two classes: natural
vegetation and human settlement.
The class labels in the experiments are assigned to minimise the overall error. This is accomplished
in the Limpopo province by assigning the cluster containing the majority of the feature vectors to the
settlement class, as there are more settlement class time series than vegetation class time series
(table 8.1). In the experiments conducted in the Gauteng province, the cluster containing the majority
of the feature vectors is assigned to the vegetation class, as there are more vegetation class time series
than settlement class time series (table 8.1).
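The label assignment rule for the two-cluster case can be sketched as below. This is an illustrative
reading of the rule, assuming exactly two clusters and that the majority class of the study area is
known in advance; the function names and the toy cluster indices are hypothetical.

```python
from collections import Counter


def assign_cluster_labels(cluster_ids, majority_class, minority_class):
    """Map the largest of two clusters to the majority class."""
    counts = Counter(cluster_ids)
    largest = counts.most_common(1)[0][0]
    return [majority_class if c == largest else minority_class
            for c in cluster_ids]


def per_class_accuracy(predicted, truth):
    """Fraction of correctly labelled samples for each true class."""
    correct, total = Counter(), Counter()
    for p, t in zip(predicted, truth):
        total[t] += 1
        correct[t] += (p == t)
    return {cls: correct[cls] / total[cls] for cls in total}


# Toy example in which settlement is the majority class (as in Limpopo).
cluster_ids = [0, 0, 0, 0, 0, 1]
truth = ["settlement"] * 4 + ["vegetation"] * 2
predicted = assign_cluster_labels(cluster_ids, "settlement", "vegetation")
accuracy = per_class_accuracy(predicted, truth)
```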
The second set of experiments conducted in each section is the measurement of classification
accuracies of the labelled time series using different regression methods to extract features. The
experiment is conducted on three different regression methods: least squares model fitting, M-estimator
model fitting, and EKF. The experiments were also conducted to explore the use of different spectral
bands in the same manner as in the first set of experiments. In each experiment the classification
accuracy along with the standard deviation is reported for the two classes. The class labels are again
assigned to minimise the overall error.
8.6.3 Clustering accuracy: Single, Average and Complete linkage criterion
In this section the viability of using hierarchical clustering based on the single, average and complete
linkage criteria is investigated. Table 8.11 shows the classification accuracy on the experiments
conducted using the SFFs, which were clustered based on the single, average and complete linkage
criteria.
It is clear from the experiments that the first two spectral bands outperform NDVI. The first two
spectral bands also offered a slight improvement over all seven spectral bands. It is important to
note that the seven spectral band feature vector already encapsulates the first two spectral bands. The
decrease in classification accuracy is attributed to the fact that the seven spectral band
feature vector requires more clusters (the number of clusters K must increase) to cater for the increase in
feature dimensionality. It was observed in an independent experiment that the classification accuracy
rapidly improves for the seven spectral band case if K is larger than 10. The number of clusters
was not increased, as the objective of using the unsupervised classifier is to evaluate a completely
unsupervised change detection method. If more clusters are included, a supervised algorithm must
then be applied to the clusters.

Table 8.11: Classification accuracy of a hierarchical clustering algorithm using the single, average
and complete linkage criteria with the SFFs on the no change data set. Each entry gives the average
classification accuracy in percentage along with the corresponding standard deviation for a sliding
window length of 12 months.

Province  Spectral band  Class        Single linkage  Average linkage  Complete linkage
Limpopo   NDVI           Vegetation   45.8 ± 26.7     46.2 ± 25.7      52.1 ± 28.8
                         Settlement   70.3 ± 21.1     71.0 ± 18.9      67.1 ± 21.9
          2 Bands        Vegetation   72.1 ± 16.7     76.4 ± 17.6      78.8 ± 15.9
                         Settlement   80.0 ± 10.1     83.5 ± 9.5       85.7 ± 11.3
          7 Bands        Vegetation   71.4 ± 17.0     76.5 ± 25.2      75.5 ± 19.1
                         Settlement   77.5 ± 9.9      83.0 ± 12.8      80.6 ± 24.0
Gauteng   NDVI           Vegetation   60.9 ± 18.2     65.3 ± 11.2      64.8 ± 9.9
                         Settlement   36.9 ± 25.4     40.8 ± 21.8      42.1 ± 20.0
          2 Bands        Vegetation   80.1 ± 16.1     82.8 ± 14.8      81.6 ± 11.7
                         Settlement   66.4 ± 35.1     67.0 ± 33.8      69.2 ± 29.4
          7 Bands        Vegetation   79.2 ± 16.3     80.2 ± 15.1      80.5 ± 12.2
                         Settlement   64.4 ± 34.2     64.8 ± 34.1      65.9 ± 30.1

The first two spectral band experiments offered acceptable performance in both provinces. It
should be noted that these classification accuracies could only be obtained with these three hierarchical
clustering methods when performing proper outlier removal. The outliers were identified by applying
principal component analysis to the feature vectors and calculating the Hotelling T² distance between
the principal components and each of the transformed feature vectors. Feature vectors with distances
exceeding a predefined threshold were then selected as outliers. The other clustering methods did not
require the removal of outliers and for this reason the single linkage, average linkage and complete
linkage criteria will not be further evaluated in this chapter.
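A minimal sketch of this kind of outlier screening is given below. It assumes the usual form of the
T² statistic, namely the sum of squared principal component scores each divided by the variance of
that component; the number of retained components, the threshold value and the synthetic data are
hypothetical choices made for illustration and are not the settings used in the experiments.

```python
import numpy as np


def hotelling_t2(features, n_components):
    """Hotelling T^2 statistic of each row in the PCA score space."""
    centred = features - features.mean(axis=0)
    # PCA via SVD: the rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:n_components].T
    variances = (s[:n_components] ** 2) / (len(features) - 1)
    return np.sum(scores ** 2 / variances, axis=1)


def remove_outliers(features, n_components, threshold):
    """Return the retained rows and a boolean mask flagging the outliers."""
    t2 = hotelling_t2(features, n_components)
    keep = t2 <= threshold
    return features[keep], ~keep


# Synthetic feature vectors with one gross outlier placed at index 0.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[0] = 25.0
kept, flagged = remove_outliers(data, n_components=2, threshold=50.0)
```

In practice the threshold would typically be derived from a chosen confidence level rather than
fixed by hand as in this sketch.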
8.6.4 Clustering accuracy: Ward clustering method
In this section the viability of using the Ward clustering method is investigated. Table 8.12 and table
8.13 show the results for the experiments that were produced using the Ward clustering method.
The Ward clustering method provided no acceptable classification accuracies when clustering on
the NDVI time series. The Ward clustering method did however provide reasonable classification
accuracies when the first two spectral bands and all seven spectral bands were used in the Limpopo
province. Classification accuracies of above 75% were reported for the first two spectral band
experiments. The EKF features using the first two spectral bands yielded classification accuracies
of 87.9% or higher in the Gauteng province, whereas the other regression methods yielded settlement
classification accuracies below 70%.

Table 8.12: Classification accuracy of the Ward clustering method using the SFFs on the no change data
set. Each entry gives the average classification accuracy in percentage along with the corresponding
standard deviation.

                                      Sliding window length
Province  Spectral band  Class        6 months      12 months     18 months
Limpopo   NDVI           Vegetation   45.3 ± 19.4   45.4 ± 17.5   46.3 ± 17.2
                         Settlement   64.6 ± 12.8   66.3 ± 11.9   66.6 ± 11.7
          2 Bands        Vegetation   79.0 ± 14.2   80.9 ± 13.8   81.7 ± 13.4
                         Settlement   78.2 ± 11.1   77.5 ± 10.2   77.3 ± 10.3
          7 Bands        Vegetation   72.4 ± 16.5   73.8 ± 15.6   73.8 ± 15.8
                         Settlement   73.6 ± 11.9   74.5 ± 11.5   74.7 ± 11.1
Gauteng   NDVI           Vegetation   66.4 ± 10.8   67.4 ± 8.8    67.5 ± 8.7
                         Settlement   35.2 ± 28.9   38.7 ± 28.6   38.9 ± 29.0
          2 Bands        Vegetation   81.3 ± 14.5   86.8 ± 13.1   86.8 ± 12.7
                         Settlement   68.0 ± 31.9   69.8 ± 31.8   69.9 ± 32.0
          7 Bands        Vegetation   77.4 ± 15.6   78.2 ± 17.8   76.3 ± 18.3
                         Settlement   24.5 ± 19.0   26.2 ± 18.7   27.9 ± 23.1

Table 8.13: Classification accuracy of Ward clustering with the regression methods to extract features
on the no change data set. Each entry gives the average classification accuracy in percentage along
with the corresponding standard deviation.

                                      Method
Province  Spectral band  Class        Least squares  M-estimator   EKF
Limpopo   NDVI           Vegetation   68.0 ± 16.4    68.8 ± 15.7   66.3 ± 16.5
                         Settlement   78.8 ± 13.4    78.5 ± 13.4   77.5 ± 13.4
          2 Bands        Vegetation   79.9 ± 15.1    80.0 ± 15.0   85.7 ± 12.3
                         Settlement   76.9 ± 11.1    76.9 ± 11.1   77.7 ± 10.9
          7 Bands        Vegetation   72.8 ± 17.5    72.8 ± 17.6   74.1 ± 14.9
                         Settlement   72.8 ± 14.3    72.8 ± 14.2   75.4 ± 9.3
Gauteng   NDVI           Vegetation   94.6 ± 10.8    94.7 ± 10.9   85.1 ± 12.1
                         Settlement   27.9 ± 12.5    28.1 ± 12.9   36.9 ± 23.3
          2 Bands        Vegetation   84.5 ± 14.5    84.5 ± 14.5   88.7 ± 10.2
                         Settlement   68.6 ± 32.1    68.8 ± 32.0   87.9 ± 14.3
          7 Bands        Vegetation   79.6 ± 17.3    79.6 ± 17.4   78.8 ± 18.0
                         Settlement   27.5 ± 22.7    27.4 ± 22.6   44.0 ± 25.2

In the seven spectral bands experiments an interesting trend was observed in all the hierarchical
clustering experiments. The classification accuracies were lower in higher dimensions (7 spectral
bands) than in lower dimensions (2 spectral bands). The question that was raised was whether the
feature vectors became more separable in higher dimensions. The answer was confirmed with the
MLP in section 8.6.1, where the MLP reported higher classification accuracies in the seven spectral
band experiments when compared to the two spectral band experiments.
This refers back to the statement made in section 4.2.2 on page 70 that clustering in a
high-dimensional feature space usually provides meaningless results if proper design considerations
are not followed [197, 198]. This is usually attributed to the notion that the ratio between the nearest
neighbour and average neighbourhood distance rapidly converges to one in higher dimensions.
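This concentration of distances can be demonstrated numerically with a small sketch. It uses
uniformly distributed random points rather than the MODIS feature vectors, so it only illustrates the
general effect; the function name, the number of points and the dimensions chosen are hypothetical.

```python
import math
import random


def nearest_to_average_ratio(n_points, dim, seed=0):
    """Mean ratio of nearest-neighbour to average distance for uniform random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    ratios = []
    for i, p in enumerate(pts):
        d = [math.dist(p, q) for j, q in enumerate(pts) if j != i]
        ratios.append(min(d) / (sum(d) / len(d)))
    return sum(ratios) / len(ratios)


low = nearest_to_average_ratio(100, 2)     # low-dimensional feature space
high = nearest_to_average_ratio(100, 100)  # high-dimensional feature space
```

The ratio is small in two dimensions and moves towards one in one hundred dimensions, which is the
behaviour that makes distance-based clustering less discriminative as the dimensionality grows.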
The remedy for this reduction in classification accuracy in the seven spectral band experiments
is the implementation of a more complex clustering algorithm or a more in-depth feature selection
criterion. The complex clustering algorithm will create non-linear mappings as with the MLP to
obtain the desired classification accuracies. The shortcoming is the need to over-design the clustering
algorithm for a particular data set. Feature selection is the other approach that can be used to improve
clustering performance, as it is used as a dimensionality reduction procedure, which uses fewer spectral
bands to improve the performance. The problem is that different combinations of spectral bands will
perform better on different data sets.
Based on the impossibility theorem, the emphasis is placed on obtaining acceptable performance
in the clustering algorithm. As stated previously, the Ward clustering method does provide acceptable
classification accuracies when using the first two spectral bands.
8.6.5 Clustering accuracy: K-means clustering
In this section the viability of using the K-means partitional clustering method is investigated.
Table 8.14 and table 8.15 illustrate the classification accuracies for the experiments conducted with
the K-means clustering algorithm.
The clustering of the NDVI time series using K-means provided acceptable classification
accuracies when the regression method was used in the Limpopo province (table 8.15). This however
was not the case in the Gauteng province, from which it can be concluded that the performance of
clustering NDVI time series with K-means was unacceptable as it is only usable in the Limpopo
province.
The first two spectral band experiments provided better classification accuracy performance when
compared to any similar hierarchical clustering method. The EKF approach was deemed the best
Department of Electrical, Electronic and Computer Engineering
University of Pretoria
166
Chapter 8
Results
Table 8.14: Classification accuracy of K-means with the SFFs on the no change data set. Each
entry gives the average classification accuracy in percentage along with the corresponding standard
deviation.
Province
Spectral Band
Limpopo
NDVI
Vegetation
Settlement
Sliding window length
6 months
12 months
18 months
53.2 ± 12.8 54.4 ± 8.3
54.8 ± 9.2
58.7 ± 7.1
59.9 ± 5.3
59.7 ± 7.3
2 Bands
Vegetation
Settlement
81.7 ± 4.7
81.4 ± 2.2
82.9 ± 3.7
82.0 ± 2.4
83.4 ± 3.5
81.8 ± 2.2
7 Bands
Vegetation
Settlement
75.8 ± 5.0
74.9 ± 2.8
76.2 ± 4.6
75.2 ± 2.3
76.3 ± 4.3
75.2 ± 2.1
NDVI
Vegetation
Settlement
61.3 ± 8.0
42.3 ± 28.3
63.1 ± 5.3
39.8 ± 30.2
65.5 ± 6.7
38.9 ± 29.9
2 Bands
Vegetation
Settlement
85.1 ± 9.1
72.6 ± 19.4
90.0 ± 7.3
70.9 ± 21.3
90.4 ± 7.2
71.2 ± 21.7
7 Bands
Vegetation
Settlement
76.5 ± 13.2
38.7 ± 7.6
77.3 ± 13.1
41.2 ± 6.8
77.3 ± 13.4
41.6 ± 6.3
Gauteng
Class
Table 8.15: Classification accuracy of K-means with the regression methods to extract features on the
no change data set. Each entry gives the average classification accuracy in percentage along with the
corresponding standard deviation.
Province
Spectral Band
Limpopo
NDVI
Vegetation
Settlement
2 Bands
Gauteng
Class
Least squares
69.9 ± 5.7
79.3 ± 3.5
Method
M-estimator
71.4 ± 5.7
81.2 ± 3.4
EKF
70.5 ± 6.8
79.1 ± 4.7
Vegetation
Settlement
81.5 ± 3.5
80.7 ± 3.1
81.5 ± 3.6
80.6 ± 3.0
84.4 ± 0.2
82.3 ± 0.2
7 Bands
Vegetation
Settlement
76.7 ± 3.8
74.3 ± 2.8
76.7 ± 3.7
74.5 ± 2.7
76.3 ± 0.2
75.1 ± 0.1
NDVI
Vegetation
Settlement
94.4 ± 5.2
29.2 ± 2.7
94.4 ± 5.2
29.3 ± 2.6
68.3 ± 14.2
39.9 ± 32.2
2 Bands
Vegetation
Settlement
87.2 ± 7.6
73.9 ± 20.1
87.2 ± 7.6
73.9 ± 20.2
92.3 ± 0.4
84.7 ± 2.2
7 Bands
Vegetation
Settlement
75.9 ± 12.5
24.5 ± 6.6
76.0 ± 12.4
24.5 ± 6.6
75.9 ± 1.9
33.2 ± 0.7
performing feature extraction method in view of the small standard deviation in classification accuracy.
A similar observation was made for the partitional clustering as for the hierarchical clustering when
clustering in higher dimensions. A small decrease of 6% was measured in classification accuracy when
the first two spectral band experiments were compared to the all seven spectral band experiments in
the Limpopo province. A large decrease of over 30% was measured in classification accuracy when
comparing the same experiments in the Gauteng province. This suggested that the same approach as
described in section 8.6.4 must be followed.
8.6.6
Clustering accuracy: Expectation-Maximisation
In this section the viability of using the EM clustering algorithm is investigated. Table 8.16 and
table 8.17 illustrate the results for the experiments conducted with the EM clustering algorithm. It was
concluded from the experiments that the K-means clustering algorithm and EM clustering algorithm
perform similarly, as the experimental results were almost exactly the same.
Table 8.16: Classification accuracy of EM algorithm with the SFFs on the no change data set. Each
entry gives the average classification accuracy in percentage along with the corresponding standard
deviation.
Province  Spectral Band  Class        Sliding window length
                                      6 months      12 months     18 months
Limpopo   NDVI           Vegetation   51.3 ± 12.8   52.4 ± 8.5    52.9 ± 11.7
                         Settlement   58.7 ± 7.1    58.8 ± 6.5    57.7 ± 7.3
          2 Bands        Vegetation   80.7 ± 4.6    81.9 ± 3.7    81.4 ± 3.6
                         Settlement   81.4 ± 2.2    81.1 ± 2.2    80.6 ± 2.1
          7 Bands        Vegetation   75.8 ± 5.0    76.3 ± 4.5    76.3 ± 4.3
                         Settlement   75.0 ± 2.9    75.2 ± 2.3    75.2 ± 2.1
Gauteng   NDVI           Vegetation   61.3 ± 8.0    63.1 ± 5.3    65.5 ± 6.7
                         Settlement   42.3 ± 28.3   39.8 ± 30.2   39.0 ± 29.9
          2 Bands        Vegetation   85.1 ± 9.1    90.0 ± 7.4    90.4 ± 7.2
                         Settlement   72.6 ± 19.4   70.9 ± 21.1   71.2 ± 21.7
          7 Bands        Vegetation   76.5 ± 13.2   77.3 ± 13.2   77.3 ± 13.4
                         Settlement   38.7 ± 7.6    41.2 ± 6.8    41.6 ± 6.3
The EM clustering algorithm did, however, have a slightly lower classification accuracy and a
negligibly higher standard deviation in a few of the experiments. As the EM algorithm offered no gain
in accuracy, the K-means clustering algorithm was chosen for its lower computational complexity.
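The way a partitional clustering result can be scored per class may be illustrated with a minimal sketch. It assumes plain Lloyd-style K-means with a deterministic initialisation and a majority vote to name each cluster; the feature values, the function names and the initialisation are hypothetical and are not taken from the experiments in this chapter.

```python
import math
from collections import Counter

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; centroids start at evenly spaced input points (assumption)."""
    step = max(1, (len(points) - 1) // max(1, k - 1))
    centroids = [points[min(i * step, len(points) - 1)] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign

def per_class_accuracy(assign, labels):
    """Name each cluster by majority vote, then score every class separately (in percent)."""
    majority = {}
    for c in set(assign):
        votes = Counter(l for a, l in zip(assign, labels) if a == c)
        majority[c] = votes.most_common(1)[0][0]
    accuracy = {}
    for cls in set(labels):
        idx = [i for i, l in enumerate(labels) if l == cls]
        accuracy[cls] = 100.0 * sum(majority[assign[i]] == cls for i in idx) / len(idx)
    return accuracy

# Hypothetical two-dimensional feature vectors for two well-separated classes.
features = [(0.10, 0.30), (0.12, 0.28), (0.11, 0.33),
            (0.30, 0.10), (0.31, 0.12), (0.29, 0.09)]
labels = ["vegetation"] * 3 + ["settlement"] * 3
accuracy = per_class_accuracy(kmeans(features, k=2), labels)
print(accuracy)
```

Repeating such a run over several random splits and reporting the mean and standard deviation per class would give entries of the same form as those in the tables above.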
8.6.7
Summary of classification results
In this section the results of the classification accuracies for section 8.6 are summarised. The first
classifier that was considered in this section was the supervised MLP, which had the advantage of
modelling a non-linear relationship between the input and output vectors.
Detecting land cover change was shown to be feasible using either the NDVI time series or the
first two spectral bands' time series of the MODIS data, a finding that is supported by
Table 8.17: Classification accuracy of EM algorithm with the regression methods to extract features on
the no change data set. Each entry gives the average classification accuracy in percentage along with
the corresponding standard deviation.
Province  Spectral Band  Class        Method
                                      Least squares M-estimator   EKF
Limpopo   NDVI           Vegetation   69.9 ± 5.9    71.3 ± 5.7    69.5 ± 6.9
                         Settlement   79.3 ± 3.5    81.3 ± 3.4    79.0 ± 4.7
          2 Bands        Vegetation   81.5 ± 3.5    81.5 ± 3.5    84.3 ± 0.2
                         Settlement   80.7 ± 3.1    80.6 ± 3.1    81.3 ± 0.2
          7 Bands        Vegetation   76.7 ± 3.8    76.8 ± 3.8    76.3 ± 0.2
                         Settlement   74.5 ± 2.4    74.4 ± 2.5    75.0 ± 0.1
Gauteng   NDVI           Vegetation   94.4 ± 5.2    94.4 ± 5.2    68.3 ± 14.2
                         Settlement   29.2 ± 2.6    29.3 ± 2.9    40.1 ± 31.2
          2 Bands        Vegetation   87.2 ± 8.4    87.2 ± 8.3    92.2 ± 0.4
                         Settlement   73.1 ± 22.0   73.1 ± 22.0   83.9 ± 2.1
          7 Bands        Vegetation   75.8 ± 12.3   75.9 ± 12.5   75.8 ± 1.9
                         Settlement   24.5 ± 6.8    24.4 ± 6.6    33.2 ± 0.7
the results in [223]. The classification accuracies produced by the MLP were however found to be the
highest when using all seven spectral bands.
The MLP was deemed to be the best classifier in this chapter when the feature vectors were
extracted with the EKF. Classification accuracies of 95.3% with a standard deviation of 0.7% for
the vegetation class, and 96.1% with a standard deviation of 0.6% for the settlement class were
reported in the Limpopo province. In the Gauteng province classification accuracies of 99.9% with
a standard deviation of 0.1% for the vegetation class and 99.9% with a standard deviation of 0.1% for
the settlement class were reported.
It should be noted that the MLP classifier can be replaced with a variety of other classifiers. The
MLP performed the best of all the classifiers in this thesis, but like most other supervised machine
learning methods, the MLP is dependent on a training set and is required to be robust to any errors
occurring within the training set [14]. The drawback in the remote sensing field is that the training
data set has to be created with the aid of high spatial resolution imagery and, because of the temporal
component, must be updated periodically. These periodic updates are a costly endeavour, which justifies
the consideration of unsupervised classification methods.
An unsupervised classifier is usually designed by learning from example. Thus several clustering
methods were evaluated to make deductions about the nature of the feature vectors in the feature space.
Acceptable performance was only obtained with the single, average and complete linkage criteria
when proper outlier removal was applied. The other clustering methods did not require the removal of
outliers, and for this reason these three linkage criteria were not explored further.
Ward’s clustering method produced the best results of all the hierarchical clustering methods. It
was concluded from the experiments conducted that creating spherical clusters with minimum internal
variance preserves the inherent distance between feature vectors in the feature space. The algorithm
provided acceptable performance for all experiments conducted in the Limpopo province. In the
Gauteng province, acceptable performance was only observed for the first two spectral band
experiments.
K-means and EM clustering algorithms were investigated as representative partitional clustering
methods, with both methods performing very similarly. The experiments showed empirically that the
partitional clustering methods outperformed all the hierarchical clustering methods in the Limpopo
province. The partitional clustering methods had the same outcome as the Ward clustering method
in the Gauteng province, with similar poor performances in the NDVI- and seven spectral band
experiments. The partitional clustering methods were deemed to be better than the Ward clustering
method, as they presented classification accuracies with lower standard deviations. The K-means
algorithm was the preferred partitional clustering method for its reduced computational complexity.
In the next section the change detection capabilities of the algorithms are explored. Only a few
methods were explored, since the change detection in this chapter is based on a post-classification
approach. The algorithms that provided acceptable classification performance, which will be explored
in the next section, are:
1. the Multilayer perceptron,
2. the Ward clustering method, and
3. the K-means algorithm.
8.7
CHANGE DETECTION
8.7.1
Simulated land cover change detection
A simulated land cover change data set was created to assess the land cover change detection algorithm
objectively. The time series data set is used to ensure that the change detection algorithm is able to
detect a transition between classes, while analysing the transition.
Table 8.18 shows the first set of change detection experiments, which were conducted in the
Limpopo province. All the classification approaches that yielded acceptable performance in
section 8.6 are included in these experiments. Each entry in table 8.18 gives the average change detection
Table 8.18: The land cover change detection accuracies are given on the simulated land cover change
data set in the Limpopo province. Each entry gives the true positives in percentage (false positives in
parentheses).
Algorithm        Feature        Window      Spectral Band
                 extraction     length      NDVI          2 Bands       7 Bands
MLP              SFF            6 months    69.2 (30.0)   77.6 (22.4)   90.5 (9.6)
                                12 months   70.2 (29.5)   78.2 (21.3)   90.8 (9.4)
                                18 months   71.9 (29.2)   78.7 (20.7)   91.0 (8.9)
                 Least squares  12 months   68.4 (31.8)   77.5 (22.3)   90.0 (10.1)
                 M-estimator    12 months   69.0 (31.1)   77.2 (23.4)   90.2 (10.0)
                 EKF            n/a         70.0 (30.3)   79.8 (20.2)   91.7 (8.7)
Ward clustering  SFF            6 months    51.2 (50.5)   71.1 (25.7)   68.3 (30.5)
                                12 months   52.4 (48.5)   71.6 (25.5)   68.7 (30.3)
                                18 months   52.6 (42.8)   72.2 (24.5)   69.2 (30.1)
                 Least squares  12 months   65.4 (33.7)   69.8 (27.9)   67.6 (32.1)
                 M-estimator    12 months   65.8 (33.7)   70.1 (28.0)   67.7 (32.3)
                 EKF            n/a         59.8 (38.1)   73.0 (22.2)   66.6 (30.8)
K-means          SFF            6 months    50.0 (46.8)   71.3 (26.8)   64.3 (33.7)
                                12 months   52.7 (46.1)   72.6 (26.5)   65.0 (33.0)
                                18 months   53.5 (40.4)   72.9 (24.5)   65.7 (33.7)
                 Least squares  12 months   63.4 (36.1)   70.4 (29.8)   65.4 (35.8)
                 M-estimator    12 months   63.5 (36.3)   70.6 (29.5)   65.4 (35.8)
                 EKF            n/a         57.9 (42.0)   72.8 (22.7)   64.8 (33.8)
accuracies, with the corresponding false alarm rate in parentheses. The change detection accuracies
(true positives) are measured on subset 1 and subset 2, which were discussed in section 8.2.4, and the
false alarm rates (false positives) are measured on subset 3 and subset 4.
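How the two figures in each table entry can be derived from per-sample class labels is sketched below. The decision rule (a series is flagged when its last class label differs from its first) is a simplifying assumption made for illustration, and the label sequences and function names are hypothetical.

```python
def declares_change(labels):
    """Assumed rule: flag change when the last class label differs from the first."""
    return labels[0] != labels[-1]

def detection_rates(change_series, no_change_series):
    """Return (true positive %, false positive %) for two groups of label sequences."""
    tp = sum(declares_change(s) for s in change_series)      # changes correctly flagged
    fp = sum(declares_change(s) for s in no_change_series)   # false alarms
    return 100.0 * tp / len(change_series), 100.0 * fp / len(no_change_series)

V, S = "vegetation", "settlement"
# Hypothetical classifier outputs for series that truly changed ...
changed = [[V, V, S, S], [V, S, S, S], [V, V, V, V], [V, V, V, S]]
# ... and for series that truly did not change.
unchanged = [[V, V, V, V], [S, S, S, S], [V, V, S, S], [S, S, S, S]]

tp_rate, fp_rate = detection_rates(changed, unchanged)
print(f"{tp_rate:.1f} ({fp_rate:.1f})")
```

The printed value follows the same "true positives (false positives)" layout as the table entries.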
The worst performing experiments were those that employed the NDVI time series. The overall
change detection accuracies were well below 70%, with a reported false alarm rate higher than 30%. In
the first two spectral band experiments, acceptable performance was measured across all the methods,
with overall change detection accuracies of above 70%, and a reported false alarm rate usually below
26%.
The seven spectral band experiment yielded similar behaviour when compared to the results
observed in the classification accuracies. The MLP (supervised classifier) performed exceptionally
well, reporting overall change detection accuracies above 90% and a false alarm rate below 10%. The
unsupervised classifiers, Ward clustering and K-means, reported change detection accuracies which
are lower in the higher dimensions (7 spectral bands) than in the lower dimensions (2 spectral bands).
Table 8.19: The land cover change detection accuracies are given on the simulated land cover change
data set in the Gauteng province. Each entry gives the true positives in percentage (false positives in
parentheses).
Algorithm        Feature        Window      Spectral Band
                 extraction     length      NDVI          2 Bands       7 Bands
MLP              SFF            6 months    81.2 (16.3)   89.7 (11.1)   97.3 (2.7)
                                12 months   83.8 (16.3)   91.8 (10.5)   98.5 (1.5)
                                18 months   83.9 (16.4)   92.0 (8.9)    98.5 (1.4)
                 Least squares  12 months   78.1 (20.2)   90.0 (13.4)   97.5 (3.4)
                 M-estimator    12 months   80.1 (18.9)   90.2 (13.0)   97.6 (3.2)
                 EKF            n/a         82.5 (14.0)   93.2 (8.4)    98.4 (1.3)
Ward clustering  SFF            6 months    27.7 (28.8)   77.6 (25.4)   32.6 (31.6)
                                12 months   33.2 (31.5)   80.0 (21.6)   36.9 (35.1)
                                18 months   35.6 (34.6)   81.1 (19.8)   39.3 (35.4)
                 Least squares  12 months   24.5 (17.4)   78.9 (19.7)   33.5 (28.6)
                 M-estimator    12 months   24.5 (17.0)   79.2 (19.4)   33.4 (28.7)
                 EKF            n/a         25.1 (17.2)   86.1 (7.2)    42.7 (26.0)
K-means          SFF            6 months    37.2 (42.9)   77.2 (26.6)   50.4 (41.3)
                                12 months   43.8 (41.6)   80.3 (23.4)   51.2 (46.9)
                                18 months   45.9 (46.7)   80.4 (24.6)   55.8 (38.7)
                 Least squares  12 months   28.6 (21.3)   74.6 (28.5)   50.6 (45.7)
                 M-estimator    12 months   28.6 (21.3)   75.0 (28.3)   51.3 (45.4)
                 EKF            n/a         36.1 (37.8)   83.8 (5.9)    50.7 (40.8)
The reduction in change detection accuracies can be attributed to the reduction in classification
accuracies shown in section 8.6.4 and section 8.6.5. The remedy for this reduction in change detection
accuracy in the seven spectral band experiment is again either a more complex clustering algorithm
or a more detailed selection of features. The more complex clustering algorithm typically requires
a non-linear clustering region to obtain higher change detection accuracies. It is reported in the
literature that this shortcoming can typically be solved by over-designing the clustering algorithm for a
particular data set. The second approach to remedy this reduction is to apply dimensionality reduction,
which implies selecting different combinations of spectral bands. The potential risk is that different
combinations of spectral bands will perform better on different data sets.
The emphasis in this thesis is placed on obtaining acceptable performance with the clustering
algorithm based on the impossibility theorem. Acceptable performance is reported for all methods
employing the first two spectral bands, and exceptional performance is reported for the MLP employing
all seven spectral bands.
Table 8.19 shows the second set of change detection experiments, which were conducted in the
Gauteng province. The same setup is used in these experiments as in the experiments conducted in
the Limpopo province. The best performing algorithms were the methods that employ the MLP. The
overall change detection accuracies were above 80% with a false alarm rate below 17%. A significant
increase in change detection accuracy is observed when the first two spectral bands are used instead
of the NDVI. Both the NDVI and the two spectral band experiments use the same spectral
bands, which implies that using the two spectral bands separately is better.
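One way to see why the two bands can carry more information than the NDVI is that the NDVI is a ratio of the red band (MODIS band 1) and the near-infrared band (MODIS band 2), so different band pairs can collapse onto the same index value. The following minimal sketch uses hypothetical reflectance values chosen only to illustrate this point.

```python
def ndvi(red, nir):
    """Normalised difference vegetation index from red and near-infrared reflectance."""
    return (nir - red) / (nir + red)

# Two hypothetical pixels as (red, nir) reflectance pairs. The second pixel is
# twice as bright in both bands, so the pairs differ but the ratio does not.
pixel_a = (0.04, 0.20)
pixel_b = (0.08, 0.40)

ndvi_a = ndvi(*pixel_a)
ndvi_b = ndvi(*pixel_b)
print(round(ndvi_a, 4), round(ndvi_b, 4))
```

A classifier operating on the NDVI alone cannot distinguish these two pixels, whereas a classifier operating on the two bands separately can.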
The worst performing experiments were the methods that employed either the NDVI or all seven
spectral bands with an unsupervised classifier. It was observed that experiments conducted with the
first two spectral bands along with an unsupervised classifier yielded acceptable performance. The
reported overall change detection accuracies were above 75% with a false alarm rate below 30%.
8.7.2
Real land cover change detection
Table 8.20: The land cover change detection accuracy on the real land cover change data set in the
Limpopo province. Each entry gives the true positives in percentage (false positives in parentheses).
Algorithm        Feature        Window      Spectral Band
                 extraction     length      NDVI          2 Bands       7 Bands
MLP              SFF            6 months    65.4 (32.5)   75.1 (19.5)   84.8 (9.3)
                                12 months   66.1 (28.2)   75.3 (18.9)   85.3 (7.9)
                                18 months   68.0 (28.7)   76.0 (18.8)   85.3 (8.2)
                 Least squares  12 months   64.8 (28.6)   73.8 (23.1)   84.3 (10.1)
                 M-estimator    12 months   64.7 (29.9)   73.4 (22.8)   84.3 (9.9)
                 EKF            n/a         64.2 (24.6)   78.6 (16.7)   86.8 (8.7)
Ward clustering  SFF            6 months    38.8 (44.7)   67.3 (26.7)   58.7 (35.5)
                                12 months   40.3 (52.1)   70.7 (25.9)   63.0 (32.9)
                                18 months   40.5 (50.3)   70.0 (25.2)   63.3 (32.6)
                 Least squares  12 months   57.6 (36.8)   65.4 (29.0)   62.8 (32.8)
                 M-estimator    12 months   57.0 (36.3)   65.4 (28.5)   62.2 (32.8)
                 EKF            n/a         52.8 (41.7)   71.8 (26.4)   63.5 (31.1)
K-means          SFF            6 months    44.8 (41.1)   70.2 (25.8)   59.8 (29.8)
                                12 months   46.0 (42.0)   70.5 (25.4)   60.6 (31.1)
                                18 months   46.9 (42.3)   70.5 (25.4)   61.0 (31.4)
                 Least squares  12 months   59.8 (37.3)   68.4 (31.1)   61.0 (32.0)
                 M-estimator    12 months   59.0 (36.5)   69.0 (30.3)   61.5 (33.4)
                 EKF            n/a         51.7 (40.1)   72.0 (24.4)   63.0 (29.9)
In this section, the real land cover change data set (section 8.2.2) is used to measure the performance
of the land cover change detection algorithms. This data set is used to test the validity of the algorithms
for real world applications [127].
Table 8.20 reports the first set of change detection experiments, which were conducted
in the Limpopo province. In these experiments all the viable classifiers identified in section 8.6.7
are explored. Each entry in table 8.20 gives the change detection accuracy (true positives), with the
corresponding false alarm rate (false positives) in parentheses.
The worst performing methods were those that employed the NDVI spectral band. Overall
change detection accuracies in these experiments were observed to be well below 70%. On the other
hand, acceptable performance was reported across all the methods using the first two spectral bands,
except for the unsupervised classifiers operating on the features extracted with the least squares and
M-estimator methods.
Table 8.21: The land cover change detection accuracy on the real land cover change data set in the
Gauteng province. Each entry gives the true positives in percentage (false positives in parentheses).
Algorithm        Feature        Window      Spectral Band
                 extraction     length      NDVI          2 Bands       7 Bands
MLP              SFF            6 months    82.3 (20.5)   86.5 (9.8)    94.3 (2.2)
                                12 months   82.3 (16.8)   90.0 (8.8)    95.1 (1.1)
                                18 months   83.7 (15.3)   90.4 (8.9)    95.1 (1.0)
                 Least squares  12 months   80.0 (16.7)   87.7 (11.8)   94.3 (2.5)
                 M-estimator    12 months   80.0 (17.5)   87.7 (10.9)   92.9 (2.8)
                 EKF            n/a         83.4 (17.0)   92.1 (9.9)    95.5 (1.6)
Ward clustering  SFF            6 months    15.8 (24.2)   80.1 (21.2)   28.7 (29.8)
                                12 months   20.7 (27.0)   80.3 (21.5)   31.3 (30.1)
                                18 months   21.2 (28.8)   80.3 (21.4)   31.3 (30.3)
                 Least squares  12 months   18.8 (18.0)   78.0 (23.1)   29.7 (29.4)
                 M-estimator    12 months   18.1 (17.7)   75.5 (22.2)   30.5 (29.6)
                 EKF            n/a         17.8 (17.5)   82.3 (11.3)   38.8 (24.8)
K-means          SFF            6 months    32.9 (34.4)   79.2 (24.2)   40.9 (38.9)
                                12 months   38.3 (35.1)   79.2 (24.1)   44.7 (42.0)
                                18 months   36.0 (34.7)   80.8 (22.7)   46.2 (40.4)
                 Least squares  12 months   24.3 (23.9)   75.1 (26.6)   42.3 (40.1)
                 M-estimator    12 months   22.8 (23.1)   75.1 (26.2)   44.7 (42.0)
                 EKF            n/a         33.3 (29.8)   80.6 (9.8)    43.5 (43.2)
The MLP performed better, by reporting overall change detection accuracies above 84% when
using all seven spectral bands. The unsupervised classifiers performed better on the first two spectral
bands than on all seven spectral bands. This was expected, as a similar trend was observed in the
classification accuracies.
Table 8.21 reports the results of the same set of experiments on the real land cover change data set,
conducted in the Gauteng province. The best performing experiments are again those that employ
the MLP. The overall change detection accuracies are above 80% with false alarm rates below 20%. A
significant increase in change detection accuracy is observed when the first two spectral bands are
used instead of the NDVI. Because both the NDVI and the two spectral band experiments
use the same spectral bands, it can be concluded that using the two spectral bands separately is better.
This claim is supported by all the previous experiments in this chapter.
The worst performing methods are those that employ either the NDVI or all seven spectral bands
with an unsupervised classifier. Meanwhile, similar experiments conducted with the first two spectral
bands with an unsupervised classifier yielded acceptable performance. The reported overall change
detection accuracies were above 75%, with a false alarm rate below 25%.
The conclusion from both sets of experiments is that using the first two spectral bands with any
of the change detection methods yields acceptable performance. At the same time, experiments using all
seven spectral bands with a supervised classifier offered the best reported performance.
8.7.3
Effective change detection delay
In this section, the effective change detection delay ∆τ is reported. The results of the experiments
are presented in table 8.22 for the Limpopo province, and table 8.23 for the Gauteng province. The
results are reported as the average number of days (1 MODIS sample = 8 days) over the
ensemble of time series in the simulated land cover change data set.
The MLP was deemed the best performing classifier, as it achieved the shortest effective change
detection delay. The MLP’s effective change detection delay improved as more spectral bands were
included. The best performing feature extraction method was the SFF with a temporal sliding window
length of 6 months. The overall trend was that a shorter temporal sliding window length had a shorter
effective change detection delay. This is intuitive as fewer data points contribute to the current state of
the output class membership. The SFFs outperform the least squares and M-estimator using a similar
temporal sliding window length of 12 months.
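The conversion from samples to days can be illustrated with a minimal sketch. It assumes that the delay is counted from the true change point to the first sample labelled with the new class, which is a simplification; the label sequences, the function names and the handling of undetected changes are hypothetical.

```python
DAYS_PER_SAMPLE = 8  # 1 MODIS sample = 8 days, as stated in the text

def detection_delay_days(labels, change_index, new_class):
    """Days from the true change to the first sample labelled new_class at or after it.

    Returns None when the change is never detected (assumed convention).
    """
    for i in range(change_index, len(labels)):
        if labels[i] == new_class:
            return (i - change_index) * DAYS_PER_SAMPLE
    return None

def mean_delay_days(delays):
    """Average over the detected series only; None if nothing was detected."""
    detected = [d for d in delays if d is not None]
    return sum(detected) / len(detected) if detected else None

V, S = "vegetation", "settlement"
# Hypothetical classifier outputs; the true change occurs at sample 5 in each series.
series = [
    [V] * 14 + [S] * 6,   # first settlement label at sample 14
    [V] * 16 + [S] * 4,   # first settlement label at sample 16
    [V] * 20,             # change never detected
]
delays = [detection_delay_days(s, change_index=5, new_class=S) for s in series]
mean_delay = mean_delay_days(delays)
print(delays, mean_delay)
```

A feature extracted over a longer sliding window keeps pre-change samples in view for longer, which is consistent with the longer delays reported for the 12 and 18 month windows.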
The unsupervised classifiers (Ward clustering method and K-means) reported an overall increase
in effective change detection delay when compared to the MLP classifier. A similar observation
is made here as in the discussion of classification accuracy in section 8.6.7. The first two spectral
bands outperformed the NDVI and all seven spectral band combinations. This is due to the improved
Table 8.22: Effective change detection delay for simulated land cover change conducted in the
Limpopo province. Each entry gives the average number of days for each study area, calculated over
10 repeated independent experiments.
Algorithm        Feature        Window      Spectral Band
                 extraction     length      NDVI          2 Bands       7 Bands
MLP              SFF            6 months    88            76            73
                                12 months   117           101           92
                                18 months   178           120           106
                 Least squares  12 months   130           109           102
                 M-estimator    12 months   146           118           109
                 EKF            n/a         110           96            91
Ward clustering  SFF            6 months    132           92            116
                                12 months   177           113           160
                                18 months   253           176           218
                 Least squares  12 months   185           130           166
                 M-estimator    12 months   189           125           186
                 EKF            n/a         163           104           151
K-means          SFF            6 months    127           94            119
                                12 months   169           107           154
                                18 months   233           164           216
                 Least squares  12 months   186           127           165
                 M-estimator    12 months   186           123           179
                 EKF            n/a         166           105           151
classification accuracies reported in sections 8.6.3–8.6.6 for the first two spectral bands.
Most experiments conducted in the Limpopo province had the K-means algorithm producing
shorter effective change detection delays than the Ward clustering method, while no distinguishing
difference was observed in the Gauteng province. In these experiments a clear improvement in the
effective change detection delay is observed when the SFF is compared to the least squares and
M-estimator with a similar sliding window length.
8.7.4
Summary of change detection results
In this section the results of the change detection experiments are summarised. In section 8.7.1, true
positives and false positives were reported for the experiments conducted on the simulated land cover
change data set. In section 8.7.2, the true positives and false positives were reported for the experiments conducted on the
real land cover change data set. In section 8.7.3, the average effective change detection delays were
reported for the experiments conducted on the simulated land cover change data set.
Table 8.23: Effective change detection delay for simulated land cover change conducted in the Gauteng
province. Each entry gives the average number of days for each study area, calculated over 10 repeated
independent experiments.
Algorithm        Feature        Window      Spectral Band
                 extraction     length      NDVI          2 Bands       7 Bands
MLP              SFF            6 months    84            69            65
                                12 months   111           87            81
                                18 months   153           114           109
                 Least squares  12 months   122           98            94
                 M-estimator    12 months   127           99            97
                 EKF            n/a         108           89            81
Ward clustering  SFF            6 months    117           84            102
                                12 months   146           103           139
                                18 months   168           140           168
                 Least squares  12 months   155           120           146
                 M-estimator    12 months   164           123           154
                 EKF            n/a         151           97            138
K-means          SFF            6 months    118           88            110
                                12 months   139           112           143
                                18 months   172           157           189
                 Least squares  12 months   153           126           149
                 M-estimator    12 months   157           128           153
                 EKF            n/a         137           106           134
The MLP was considered the best classifier used for change detection. The MLP had better change
detection accuracies and effective change detection delays when using more spectral bands. It was
also found that a trade-off existed in the length of the temporal sliding window when comparing the
difference between change detection accuracy and effective change detection delay. A longer temporal
sliding window length improves the classification accuracy at the cost of a longer effective change
detection delay. A shorter temporal sliding window length reacts faster to change in the time series at
the cost of a lower change detection accuracy.
Poor performance with the unsupervised methods used for clustering on the NDVI time series
and all seven spectral bands’ time series indicated that classes could not be well encapsulated in the
clusters. The first two spectral bands, on the other hand, resulted in acceptable performance across all
the change detection experiments and effective change detection delay experiments.
The K-means algorithm and Ward clustering method performed similarly in all the experiments,
except that the Ward clustering method had slightly higher change detection accuracies while the
Table 8.24: A list of different combinations of change detection algorithms that will be tested at a
regional scale.
Feature      Sliding window   Spectral band      Machine learning method
extraction   length
SFF          12 months        2 Bands, 7 Bands   MLP
             12 months        2 Bands            Ward clustering method
             12 months        2 Bands            K-means algorithm
EKF          n/a              2 Bands, 7 Bands   MLP
             n/a              2 Bands            Ward clustering method
             n/a              2 Bands            K-means algorithm
K-means algorithm had a shorter effective change detection delay. This observation could be attributed
to the K-means classification experiments, which yielded a very small standard deviation when
compared to the Ward clustering method. In all the experiments conducted in this section (section 8.7),
it was observed that the SFFs and EKF features outperformed the least squares and M-estimator
features in the performance metrics. It is concluded from these experiments that the combinations
given in table 8.24 yielded the best performance and will be evaluated on a regional scale.
8.8
CHANGE DETECTION ALGORITHM COMPARISON
In this section the change detection accuracies measured in section 8.7 are compared to other change
detection algorithms found in the literature. The change detection methods used for comparison are:
• the annual NDVI differencing method (denoted by NDVICDM) [19],
• the EKF change detection method (denoted by EKFCDM) [120], and
• the ACF change detection method (denoted by ACFCDM) [121].
All three of the methods listed above are supervised in nature, as a training data set is required to
set a threshold, which is used to declare change. These three methods are compared in table 8.25 to a
few methods listed in table 8.24.
The worst performing method was the NDVICDM method, having a change detection accuracy of
69% with a false alarm rate of 13% in the Limpopo province, and a change detection accuracy of
57% with a false alarm rate of 14% in the Gauteng province. A possible explanation for this poor
performance is given in [224], which is that the method assumes that the annual NDVI difference
between years is normally distributed, which could imply that it has difficulty in detecting land cover
Table 8.25: Comparison of the change detection accuracies in percentage (false alarm rate in
parentheses) of the proposed change detection algorithms to other change detection algorithms found
in the literature.
Algorithm                                    Province
                                             Limpopo province   Gauteng province
EKFCDM [120]                                 89% (13%)          75% (13%)
ACFCDM [121]                                 81% (12%)          92% (15%)
NDVICDM [19]                                 69% (13%)          57% (14%)
EKFBVEP, MLP, 7 spectral bands               87% (9%)           96% (2%)
EKFBVEP, MLP, 2 spectral bands               79% (23%)          92% (10%)
EKFBVEP, K-means, 2 spectral bands           72% (24%)          81% (10%)
EKFBVEP, Ward clustering, 2 spectral bands   72% (26%)          82% (11%)
change in heterogeneous areas. The method performed the poorest in the Gauteng province owing to
the land cover diversity [224].
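The effect of a normality assumption on a threshold-based detector can be illustrated with a generic sketch. This is not the method of [19]; it is a plain z-score rule written under the assumption that a pixel is flagged when its year-to-year NDVI difference lies more than a chosen number of standard deviations from the mean of all differences. The difference values, the threshold and the function name are hypothetical.

```python
import statistics

def flag_changes(differences, z_threshold=2.0):
    """Flag differences lying more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(differences)
    std = statistics.pstdev(differences)
    return [abs(d - mean) > z_threshold * std for d in differences]

# Hypothetical annual NDVI differences for ten pixels; only pixel 6 shows a large drop.
diffs = [0.01, -0.02, 0.00, 0.02, -0.01, 0.01, -0.30, 0.00, -0.01, 0.02]
flags = flag_changes(diffs)
print(flags)
```

In such a rule the threshold scales with the spread of the differences, so a study area with diverse land cover, and therefore a wide spread, raises the bar that a genuine change has to clear.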
The EKFCDM had the highest change detection accuracy of 89% in the Limpopo province, with
a false alarm rate of 13%. This was attributed to the fact that most of the province is covered by
natural vegetation, which results in a high correlation between the parameter sequences of
the neighbouring pixels in the spatio-temporal window [224]. The relative difference between the
change and no change parameter streams was high enough to detect change. The EKFCDM method’s
performance was lower in the Gauteng province, which was attributed in [224] to the land cover
diversity.
The ACFCDM exploits the non-stationary property of the change time series when compared to
the no change time series. The method was applied to the 4th spectral band of MODIS, as it offered
the best performance [224]. The method reported a higher change detection accuracy in the Gauteng
province when compared to the Limpopo province.
The performance of the two unsupervised classifiers (K-means and Ward clustering) operating on
the first two spectral bands was similar. Both methods had better change detection accuracies and false
alarm rates when compared to the NDVICDM method. The methods had a 6% higher change detection
accuracy when compared to the EKFCDM in the Gauteng province, but a 17% decrease in the Limpopo
province.
The MLP operating on the EKFBVEP features computed from the first two spectral bands had the
same change detection accuracy as the ACFCDM in the Gauteng province, but had the advantage of
having a 5% lower false alarm rate. The reverse was observed in the Limpopo province, as the MLP
operating on the first two spectral bands had a 2% lower change detection accuracy and 11% higher
false alarm rate when compared to the ACFCDM method.
The MLP operating on the EKFBVEP features computed on all seven spectral bands was deemed the
best change detection method in this section. The method had the highest change detection accuracy
and lowest false alarm rate in the Gauteng province. It had the second highest change detection
accuracy (2% lower than the highest) and the lowest false alarm rate in the Limpopo province.
8.9
PROVINCIAL EXPERIMENTS
The best performing change detection algorithms, listed in table 8.24, are evaluated
on a regional scale. The areas evaluated are the entire Limpopo and Gauteng provinces.
FIGURE 8.13: A classification/change detection map of the entire Limpopo province.
The results obtained from processing the entire Limpopo province are presented in table 8.26. The
table divides the results into three categories: natural vegetation, human settlements, and change. An
illustration of one of these experiments is shown in figure 8.13, which represents the Limpopo province.
The overall trend throughout all the methods was that natural vegetation covered 85%–88% of the
province, and that human settlement covered 9%–12% of the province. This signifies that the majority of
the province is still covered by natural vegetation. The land cover change that is reported here is
the transformation of natural vegetation to human settlement. The land cover change that was reported
ranged from 1%–4% of the total area in the province. This is a significant area that has changed in
the province over the past decade, since the total human settlement class has expanded by 12%–40%
in the study period. This suggests that some of the algorithms might be too sensitive towards change
events or that the labelled data set should be expanded to incorporate a larger variety of classes. On
the other hand, it should be noted that the controlled experiments that were conducted on the labelled
data set involved land cover that transformed from natural vegetation to human settlement. This did
not include any examples of other land cover transformations, which could exist in the province.
This could be rectified, as the algorithms are versatile enough to include other classes to improve
the classification, and in turn change detection accuracies. Future expansion of the work could entail
collecting agricultural land cover information in each of the provinces.
Table 8.26: The classification and change detection results produced for the entire Limpopo province.
The results are presented in percentage cover of total area in the province.
Feature extraction | Algorithm       | Spectral band | Natural vegetation [%] | Human settlement [%] | Land cover change [%]
SFF                | MLP             | 2 Bands       | 86.94                  | 10.31                | 2.75
SFF                | MLP             | 7 Bands       | 87.69                  | 10.61                | 1.70
SFF                | Ward clustering | 2 Bands       | 86.33                  | 9.64                 | 4.03
SFF                | K-means         | 2 Bands       | 86.05                  | 10.02                | 3.93
EKF                | MLP             | 2 Bands       | 85.74                  | 11.57                | 2.69
EKF                | MLP             | 7 Bands       | 86.33                  | 12.11                | 1.56
EKF                | Ward clustering | 2 Bands       | 86.20                  | 10.32                | 3.48
EKF                | K-means         | 2 Bands       | 85.81                  | 10.90                | 3.29
Closer inspection of table 8.26 allows the deduction of some interesting trends. These trends
cannot be confirmed, as no ground truth exists for the current results, which are only based on
observations. The MLP consistently detected more human settlement than the unsupervised classifiers,
while indicating a reduced number of detected land cover changes. This puts emphasis on the
classification at the beginning of the time series, as both the detected land cover change class and
the human settlement class agree that the time series ends in the human settlement class. This could be
attributed to the fact that the province experienced a rainfall shortage in 2001/2002 (beginning of the
study period).
The unsupervised classifiers detected more land cover change when compared to the MLP. In some
experiments the size of changed areas that were reported almost doubled. Another observation among
the unsupervised classifiers is that the Ward clustering method flagged more land cover changes than
the K-means algorithm. This trend was also observed in the controlled experiments and was deduced
from the observation that the Ward clustering method had a wider standard deviation in its classification
accuracies than the K-means.
FIGURE 8.14: A classification/change detection map of the entire Gauteng province.
Table 8.27: The classification and change detection results produced for the entire Gauteng province.
The results are presented in percentage cover of total area in the province.
Feature extraction | Algorithm       | Spectral band | Natural vegetation [%] | Human settlement [%] | Land cover change [%]
SFF                | MLP             | 2 Bands       | 76.65                  | 20.12                | 3.23
SFF                | MLP             | 7 Bands       | 77.33                  | 21.39                | 1.28
SFF                | Ward clustering | 2 Bands       | 75.53                  | 19.90                | 4.57
SFF                | K-means         | 2 Bands       | 75.43                  | 20.46                | 4.11
EKF                | MLP             | 2 Bands       | 76.01                  | 20.92                | 3.07
EKF                | MLP             | 7 Bands       | 76.89                  | 21.46                | 1.17
EKF                | Ward clustering | 2 Bands       | 76.22                  | 19.56                | 4.22
EKF                | K-means         | 2 Bands       | 76.08                  | 19.96                | 3.96
The same experiment was conducted in the Gauteng province and its results are presented in
table 8.27. The results were produced by processing the entire Gauteng province into the three defined
categories. An illustration of one of these experiments is shown in figure 8.14, which represents
the Gauteng province. The overall trend in this province was significantly different from the results
produced in the Limpopo province, as Gauteng is considerably more urbanised. The natural vegetation class
covered 75%–78% of the province, while human settlements covered 19%–22%. This result supports
the concept that Gauteng is a heavily urbanised province.
The land cover change which was flagged ranged from 1%–5% of the total area in the province.
This is a significantly large area that has changed in the study period, as the total human settlement class
has expanded by 5%–23% in the province. The same trends that were observed in the results produced
for the Limpopo province with regard to the nature of the change detection algorithm were observed
in the Gauteng province.
8.10 COMPUTATIONAL COMPLEXITY
In this section a comparison is made of the complexity of extracting the EKF features and the SFFs. A
time series $\mathbf{x}$ of length $I$ is defined as
\[ \mathbf{x} = [\vec{x}_1 \;\; \vec{x}_2 \;\; \dots \;\; \vec{x}_I], \tag{8.1} \]
with
\[ \vec{x}_i = [x_{i,1} \;\; x_{i,2} \;\; \dots \;\; x_{i,T}]. \tag{8.2} \]
The variable $T$ denotes the number of elements in vector $\vec{x}_i$. If the state-space vector $\vec{W}_i$ used in the
EKF has $S$ elements, then the complexity of filtering a single time series is at least $O(IS^2) + O(IT^{2.4})$.
In the case of the EKF features extracted from a triply modulated cosine function on uncorrelated
spectral bands, $S = 3$ and $T = 1$.
The complexity of extracting the SFF is based on the complexity of the FFT algorithm and the
length of the temporal sliding window. If the time series is of length $I$ and the length of the temporal
sliding window is $Q$, then the processing of a single time series is equal to $O((I - Q)\, Q \log_2 Q)$, with
$Q \ll I$.
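To illustrate where the sliding-window cost comes from, the following Python sketch applies one FFT of length Q at every window position of a single time series. It is a minimal sketch under stated assumptions and not the implementation used in this thesis: the function name, the use of NumPy, the choice of the window mean and the annual-component magnitude as the two retained features, and the synthetic signal with 46 samples per year are all assumptions made for illustration. The sketch evaluates I − Q + 1 window positions, which is of the same order as the I − Q factor quoted above.

```python
import numpy as np


def sliding_fft_features(series, window):
    """Return one (mean, annual amplitude) pair per sliding-window position.

    Assumption: `window` spans exactly one year of samples, so that FFT
    bin 1 corresponds to the annual (seasonal) component.
    """
    series = np.asarray(series, dtype=float)
    n_windows = series.size - window + 1
    if window < 2 or n_windows < 1:
        raise ValueError("window must be at least 2 and no longer than the series")
    features = np.empty((n_windows, 2))
    for start in range(n_windows):
        # One FFT of length Q per window position: roughly Q*log2(Q) operations.
        spectrum = np.fft.rfft(series[start:start + window])
        features[start, 0] = spectrum[0].real / window           # window mean
        features[start, 1] = 2.0 * np.abs(spectrum[1]) / window  # annual amplitude
    return features


# Synthetic example (hypothetical values): ten years at 46 samples per year.
samples_per_year = 46
i = np.arange(10 * samples_per_year)
reflectance = 0.3 + 0.1 * np.cos(2.0 * np.pi * i / samples_per_year)
sff = sliding_fft_features(reflectance, window=samples_per_year)
print(sff.shape, sff[0])
```

Because the loop body is a single fixed-length FFT, the total cost grows linearly with the length of the time series for a fixed window length, which is consistent with the complexity expression given above.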
A timing experiment was conducted on a cluster node to calculate the computational time of both
feature extraction methods and the results are reported in table 8.28. The computer’s specifications
used for this experiment are:
• Dell PowerEdge 1955 blade, Intel Xeon 5355 (Quad-Core) 2.66 GHz, 8 GB RAM, 1333 MHz
FSB, Gigabit Ethernet, 4x 2.1 kW redundant power supplies (3+1), 2x Gigabit Switch Modules,
1x Avocent Digital Access KVM switch, Software Debian Testing AMD64 with MATLAB
R2012a.
Table 8.28: The computational time required to extract features from 25000 time series using either
the EKF feature extraction method or SFF extraction method. The results are reported in milliseconds
per time series.
Feature | Milliseconds per time series
SFF     | 0.47
EKF     | 22.81
The experiment was conducted over 25000 time series and it was concluded that the SFF could be
extracted from the time series 48.5 times faster than the EKF features. The next requirement addressed
is the time required to optimise the EKF features using the BVEP criterion. The BVSA is an iterative
search algorithm that sets the EKF parameters within the BVS in an attempt to best satisfy the BVEP
criterion. If the BVSA requires $E_{\mathrm{BVSA}}$ iterations to set the EKF parameters, then the extraction of
$\mathrm{EKF}_{\mathrm{BVEP}}$ features takes at least $48.5\,E_{\mathrm{BVSA}}$ times longer than the SFF. The typical number of iterations
used for $E_{\mathrm{BVSA}}$ in these experiments was between 20 and 30.
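The arithmetic behind these figures can be reproduced with the short Python sketch below, which uses only the per-series times from table 8.28 and the reported iteration range. The total run times are an extrapolation and not a measured result: they assume that the cost scales linearly with the number of time series and that each BVSA iteration costs at least one full EKF pass. The variable and function names are chosen for illustration. Under these assumptions the 25000 time series take about 11.75 s with the SFF and about 570 s with a single EKF pass, while 20 to 30 BVSA iterations amount to roughly 3.2 to 4.8 hours.

```python
# Per-series extraction times reported in table 8.28 (milliseconds).
SFF_MS = 0.47
EKF_MS = 22.81
N_SERIES = 25000        # number of time series in the timing experiment
ITERATIONS = (20, 30)   # typical range of BVSA iterations reported in the text

speedup = EKF_MS / SFF_MS  # how many times faster the SFF extraction is


def total_seconds(ms_per_series, n_series=N_SERIES):
    """Total time in seconds, assuming linear scaling in the number of series."""
    return ms_per_series * n_series / 1000.0


print(f"SFF speed-up over a single EKF pass: {speedup:.1f}x")
print(f"SFF, {N_SERIES} series: {total_seconds(SFF_MS):.2f} s")
print(f"EKF, {N_SERIES} series: {total_seconds(EKF_MS):.2f} s")
for e_bvsa in ITERATIONS:
    # Lower bound: each BVSA iteration costs at least one full EKF pass.
    hours = total_seconds(EKF_MS * e_bvsa) / 3600.0
    print(f"EKF with {e_bvsa} BVSA iterations: at least {hours:.2f} h")
```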
8.11 SUMMARY
In this section a summary is provided of the observations made in this chapter. It was found that the
supervised classifier outperformed the unsupervised methods. The downside was the costs involved
in producing a labelled training data set. The best performance was obtained when the MLP was
optimally set to operate on all seven spectral bands of MODIS. The training method adopted was the
iteratively retrained mode, which compensates for the inter-annual variability. A temporal sliding
window length of 12 months used on either the SFF, least squares, or M-estimator offered the
best trade-off between parameter variability, effective change detection delay and change detection
accuracy. Similar gains were obtained in the trade-off with the EKF features if the parameters were
optimised with the BVEP criterion.
The change detection algorithms yielded better performance in the Gauteng province than in the
Limpopo province. This could be attributed to the denser natural vegetation found in the Gauteng
province. Figure 8.15 illustrates the difference between the informal settlements and natural vegetation
found in both provinces. The Gauteng province houses more compact informal settlements and
denser natural vegetation when compared to the Limpopo province.
FIGURE 8.15: Four high resolution images acquired in the two provinces, Limpopo and Gauteng. (a)
A natural vegetation area located in the Limpopo province. (b) An informal settlement located in the
Limpopo province. (c) A natural vegetation area located in the Gauteng province. (d) An informal
settlement located in the Gauteng province. (Courtesy of Google™ Earth.)
A general trend of performance improvement was observed when the first two spectral bands (Red
and NIR spectral bands) were used instead of the NDVI. The use of the first two spectral bands as input
was deemed superior, as the NDVI is computed from these same two bands and therefore cannot carry
more information than the bands themselves. Further improvement
was observed when using all seven spectral bands with a supervised classifier.
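The relationship between the NDVI and the first two spectral bands can be made concrete with a small Python sketch. The NDVI is the normalised difference (NIR − Red)/(NIR + Red), so two pixels whose reflectances differ by a common scale factor map to the same index value, even though the pair of bands still distinguishes them. The reflectance values below are hypothetical and chosen only for illustration.

```python
def ndvi(red, nir):
    """Normalised difference vegetation index from red and NIR reflectance."""
    return (nir - red) / (nir + red)


# Two hypothetical pixels: the second is twice as bright in both bands.
dark_pixel = (0.10, 0.30)    # (red, NIR)
bright_pixel = (0.20, 0.60)  # (red, NIR)

# Both pixels give (numerically almost) the same NDVI, while the band pairs differ.
print(ndvi(*dark_pixel), ndvi(*bright_pixel))
```

A classifier that receives only the NDVI cannot separate these two pixels, whereas one that receives both bands can, which is in line with the trend observed in the experiments.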
The SFFs and EKF features yield better performance in detecting land cover change when
compared to the features extracted using least squares and M-estimator methods. The EKF features
only provided better separation between classes than the SFFs when the BVEP criterion was used to
set the EKF parameters, which requires the additional computation of the covariance matrices. This
improvement in class separation was not significant, and the SFF was therefore deemed the better
approach owing to its lower computational time (section 8.10).
CHAPTER NINE
CONCLUSION
9.1 CONCLUDING REMARKS
The importance of reliable land cover monitoring and detection of land cover change was discussed in
chapter 1, and has been shown to be of great benefit to the global community [11]. Each country or
region faces its own challenges in monitoring the land; in South Africa the transformation of natural
vegetation to new human settlements is the most pervasive form of land cover change [7].
South Africa’s National Land Cover (NLC) was mapped in 1995–1997 using manual photo
interpretation [225] of Landsat imagery, while the NLC of 2000 was based on digital classification
of Landsat images by regional experts [226]. Both of these took a number of years to complete.
Subsequently land cover has been mapped by provincial governments on an ad hoc basis through
private companies using a variety of methods. Since the methods have not been standardised through
time and space, reliable land cover change data cannot be generated from successive national land
cover data sets. The Landsat-based land cover mapping efforts furthermore relied on single date
imagery, which resulted in neighbouring images being acquired on widely varying dates containing
seasonal effects that hampered multi-spectral land cover classification. The hyper-temporal, time-series
analysis approach described here capitalises on seasonal dynamics to characterise land cover and land
cover change in a repeatable, standardised method that can be applied over large areas.
The satellite images used in this thesis were acquired by the MODIS sensor. The MODIS sensor
is used to produce a hyper-temporal, multi-spectral medium spatial resolution land surface reflectance
data product. This sequence of images is used to construct a time series, which can be analysed
with a change detection algorithm to detect the formation of newly developed human settlements. A
post-classification change detection framework was developed to detect land cover change occurring in
time series. The framework classifies the geographical area for each time index and declares change if
a permanent transition in class label is observed. Two novel hyper-temporal feature extraction methods
were proposed in this thesis, which are used in the post-classification change detection framework. The
two types of features extracted with these novel feature extraction methods are:
• the Seasonal Fourier Features (SFF), and
• the Extended Kalman Filter (EKF) features optimised using the Bias-Variance Equilibrium Point
(BVEP) criterion.
The SFF is a hyper-temporal feature vector that extracts information from multiple spectral bands,
which exploits the seasonal spectral signature in the temporal dimension of a geographical area. SFF
is the first type of novel hyper-temporal feature in this thesis that incorporates temporal information,
allowing the analysis of seasonal surface reflectance variations of different land cover classes. SFF
(extracted from the MODIS time series) allows the post-classification change detection framework to
be sensitive enough to detect new human settlements as small as 0.25 km².
The second novel hyper-temporal feature extraction method is an improvement on the method
proposed by Kleynhans et al. [30]. The first contribution made to this method is the extension to higher
dimensions, which improves the land cover change detection accuracies. This contribution is supported
by all the experiments conducted in chapter 8. The second contribution made to the method proposed
by Kleynhans et al. [30] is the definition of the novel BVEP criterion, which defines the condition that
improves the tracking of time series, while simultaneously improving the internal stability of the EKF.
This criterion allows the evaluation of the EKF performance in an unsupervised fashion. The
drawback with the method proposed by Kleynhans et al. is that it requires an offline optimisation
phase, which must be performed by an operator with a training set. This drawback is overcome by
defining a scoring function such as the Bias-Variance Score (BVS) to evaluate how well a particular set
of parameters satisfies the BVEP criterion. The EKF parameters are adjusted using a search algorithm
such as the Bias-Variance Search Algorithm (BVSA) in an attempt to best satisfy the BVEP criterion.
This led to another contribution, namely the development of the BVSA; the BVSA is an unsupervised
search algorithm that can effectively optimise the BVS using the BVEP criterion for optimal EKF
performance. It was found in chapter 8 that the BVSA performed similarly to other popular search
algorithms, but had the advantage of having a faster convergence time. All these contributions led to
the full automation of the method proposed by Kleynhans et al. [30]. The BVS optimised using the
BVEP criterion provides statistical information on the phenological growth cycle, which could also be
used to provide vital insight into environmental dynamics [31, 32].
The post-classification change detection framework uses a machine learning method to classify
a geographical area at each time index and can be either a supervised or an unsupervised classifier.
In chapter 8 the ability of the hyper-temporal features to separate different land cover classes was
investigated. A classification experiment was used to evaluate class separation; a Multilayer Perceptron
(MLP) was used to represent supervised classifiers. Unsupervised methods were represented by a
selection of clustering methods. The supervised classifier performed significantly better than the
unsupervised methods, but it requires labelled examples derived from commercial high resolution
satellite imagery, making the unsupervised methods more attractive for operational implementation.
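Whichever classifier is used, the framework turns the sequence of class labels produced for a geographical area into one of the three map categories reported in chapter 8. The Python sketch below shows one possible form of such a rule. It is an illustrative assumption and not the decision rule used in this thesis: the label encoding, the `persistence` parameter and the majority vote over the first and last labels are hypothetical choices that stand in for the requirement that a transition be permanent.

```python
VEGETATION = 0   # hypothetical label for natural vegetation
SETTLEMENT = 1   # hypothetical label for human settlement


def majority(labels):
    """Majority class of a list of 0/1 labels (ties resolve to vegetation)."""
    return SETTLEMENT if 2 * sum(labels) > len(labels) else VEGETATION


def categorise(labels, persistence=5):
    """Map a per-time-index label sequence to one of three map categories.

    Assumption: the start and end classes are taken as the majority label
    over the first and last `persistence` time indices, so that a single
    misclassified time index does not trigger a change.
    """
    labels = list(labels)
    if len(labels) < 2 * persistence:
        raise ValueError("label sequence too short for the persistence test")
    start = majority(labels[:persistence])
    end = majority(labels[-persistence:])
    if end == VEGETATION:
        return "natural vegetation"
    # Only vegetation-to-settlement transitions are reported as change.
    return "change" if start == VEGETATION else "human settlement"


print(categorise([0] * 6 + [1] * 6))
```

Under this rule both the change category and the human settlement category end in the settlement class, so the two differ only in how the beginning of the time series is classified.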
A range of experiments were conducted for different combinations of spectral bands: NDVI, first
two MODIS spectral bands, and all seven MODIS spectral bands. It was observed that the experiments
using the first two spectral bands yielded better results than the experiments using NDVI. This is a
well-known property in the machine learning community, that better separation is usually obtained
in higher dimensions [130, Ch. 1 p. 4]. This was supported by classification experiments in chapter
8, where the MLP reported general improvements with an increase in the number of spectral bands.
The performance of the unsupervised methods improved when going from two-dimensional features
(NDVI) to four-dimensional features (first two spectral bands), but the performance deteriorated
when going to 14-dimensional features (all seven spectral bands), suggesting that complex decision
boundaries are required to maximise performance in 14 dimensions.
The goal for this thesis was the development of a novel land cover change detection method. The
method had to be nearly automated, requiring minimal human interaction. A post-classification
change detection framework was used to evaluate two feature extraction methods to improve land
cover separability, which in turn improved the land cover change detection. The SFF is a newly
introduced feature and was compared to the EKF feature presented by Kleynhans et al. [30]. The
EKF features were improved using the novel BVEP criterion, which resulted in an optimised EKF that
gave the best performance. The downside was that the EKF features could only provide better results
if the BVEP criterion was used in the optimisation phase. These improvements over the SFF features
were small when compared to the computational requirement of the optimisation phase. Therefore, it
was concluded that the SFF is more practical for operational applications.
9.2 FUTURE RECOMMENDATIONS
In this section a brief overview is given of potential future research that could stem from the work
presented in this thesis.
• Spatial information analysis: In chapter 2 it was discussed that algorithms are usually designed
to provide acceptable performance for an application in a particular geographical area. This is
caused by the inherent differences between geographical areas. The BVEP criterion can be used
to analyse a particular geographical area by studying the statistical parameters derived, such as
the standard deviation of model parameters. This information can be used in a statistical test to
determine whether a region of the study area can be expanded to cover a larger area. An example
of such a test is the use of the Akaike information criterion (AIC) to determine if the size of the
current study area is acceptable. The AIC is given as
\[ \mathrm{AIC} = 2K - 2\ln(L), \tag{9.1} \]
where K is the number of model parameters and L is the likelihood of the model which
incorporates the standard deviation. The criterion is used to balance the cost of increased
complexity (more small regions) against the loss of performance when using fewer, larger
regions.
• Spectral band selection criterion: In chapter 4 it was discussed that proper domain knowledge
leads to proper definition of feature vectors. Feature selection is always a relevant topic in remote
sensing, as new sensors are continually being developed with more sophisticated capabilities.
In chapter 3, an approach to training a neural network was presented which was proposed by
Caruana et al. [168]. The training algorithm starts by mapping all the linear regions in the
feature space and then progresses to map more complex non-linear regions. In a neural network
architecture context, input nodes that contribute to the output nodes are assigned larger synaptic
weights, while input nodes that contribute little information to the output nodes are assigned
smaller synaptic weights. The distribution of the synaptic weights can be used to infer a spectral
band selection criterion.
• Internal covariance matrix analysis: In the computation of the BVS, it is assumed that the
internal covariance matrix $P_{(i|i)}$ (equation 5.38) is set to the identity matrix. The matrix will
then converge to a stable internal covariance matrix $P_{(I_T|I_T)}$ at time $I_T$ if the Riccati condition
holds and enough observation vectors are supplied. This convergence should be almost constant
and can be expressed as
\[ \left\| \frac{d^2 P_{(i|i)}}{di^2} \right\| \leq \varepsilon, \tag{9.2} \]
where $\|\cdot\|$ is a suitable matrix norm, e.g. induced norm or Frobenius norm. An in-depth
study is proposed on the behaviour of the EKF’s internal covariance matrix $P_{(i|i)}$ with regard
to land cover change. The internal covariance matrix $P_{(i|i)}$ should fluctuate when experiencing
a non-stationary process such as land cover change. These fluctuations can be used to define a
change threshold $T_P$ that flags a change when
\[ \left\| \frac{d^2 P_{(i|i)}}{di^2} \right\| > T_P. \tag{9.3} \]
• Complex model design: In chapter 5 the emphasis was placed on using a triply modulated
cosine model to describe the MODIS time series. The next phase is to explore more complex
models, which could be used to model the time series. For example, the triply modulated cosine
model given in equation (5.44) can be expanded to incorporate multiple models as
\[ x_i = \sum_{m=1}^{M} h_m(\vec{W}_i) + v_i, \tag{9.4} \]
with measurement function defined as
\[ h_m(\vec{W}_i) = W_{i,\mu,m} + W_{i,\alpha,m} \cos(2\pi f_{\mathrm{samp}}\, i + W_{i,\theta,m}). \tag{9.5} \]
Another proposed expansion to the SFF feature is to consider more Fourier components for
analysis. The sinusoidal behaviour is not a true representation of all different land cover classes,
which motivates a further exploration of new models.
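As an illustration of the multiple-model measurement function proposed in the last recommendation, the Python sketch below evaluates the noise-free sum of M triply modulated cosines for a given state and also returns its partial derivatives, which an EKF would need in order to linearise the measurement function. It is a minimal sketch under stated assumptions: the function names, the (M, 3) state layout and the numerical values are hypothetical, and the observation noise term is omitted.

```python
import numpy as np


def measurement(state, i, f_samp):
    """Noise-free sum of M triply modulated cosines at time index i.

    `state` has shape (M, 3); row m holds the (mean, amplitude, phase)
    parameters of component m.
    """
    state = np.asarray(state, dtype=float).reshape(-1, 3)
    mu, alpha, theta = state[:, 0], state[:, 1], state[:, 2]
    return float(np.sum(mu + alpha * np.cos(2.0 * np.pi * f_samp * i + theta)))


def measurement_jacobian(state, i, f_samp):
    """Partial derivatives of the measurement w.r.t. every state element.

    Returned in the same (M, 3) layout as `state`; an EKF would flatten
    this into the row vector used to linearise the measurement function.
    """
    state = np.asarray(state, dtype=float).reshape(-1, 3)
    alpha, theta = state[:, 1], state[:, 2]
    phase = 2.0 * np.pi * f_samp * i + theta
    return np.column_stack(
        (np.ones_like(alpha), np.cos(phase), -alpha * np.sin(phase))
    )


# Hypothetical two-component state, evaluated at time index 7.
example_state = np.array([[0.30, 0.10, 0.5], [0.02, 0.03, -1.0]])
print(measurement(example_state, i=7, f_samp=1.0 / 46.0))
print(measurement_jacobian(example_state, i=7, f_samp=1.0 / 46.0))
```

With M components the state vector grows to 3M elements, so the filtering cost per time series discussed in section 8.10 would grow accordingly.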
REFERENCES
[1] A. Comber, P. Fisher, and R. Wadsworth, “What is land cover?” Environment and planning B:
Planning and design, vol. 32, no. 2, pp. 199–209, 2005.
[2] P. Vitousek, H. Mooney, J. Lubchenco, and J. Melillo, “Human domination of Earth’s
ecosystems,” Science, vol. 277, pp. 494–499, July 1997.
[3] G. Daily and P. Ehrlich, “Population, sustainability, and Earth’s carrying capacity,” Bioscience,
vol. 42, no. 10, pp. 761–771, November 1992.
[4] R. DeFries, L. Bounoua, and G. Collatz, “Human modification of the landscape and surface
climate in the next fifty years,” Global Change Biology, vol. 8, no. 5, pp. 438–458, May 2002.
[5] J. Foley, R. DeFries, G. Asner, C. Barford, G. Bonan, S. Carpenter, F. Chapin, M. Coe, G. Daily,
H. Gibbs, J. Helkowski, T. Holloway, E. Howard, C. Kucharik, C. Monfreda, J. Patz, I. Prentice,
N. Ramankutty, and P. Snyder, “Global consequences of land use,” Science, vol. 309, no. 5734,
pp. 570–574, July 2005.
[6] G. Brundtland, “Report of the World Commission on environment and development: Our
common future,” Brundtland Commission, United Nations General Assembly, Tech. Rep.
A/42/427, 1987.
[7] C. Olver, “South Africa’s review report for the sixteenth session of the United Nations
commission on sustainable development,” Department of Environmental Affairs and Tourism
Pretoria, Tech. Rep. CSD-16, March 2008.
[8] B. Salmon, J. Olivier, W. Kleynhans, K. Wessels, F. van den Bergh, and K. Steenkamp, “The use
of a Multilayer Perceptron for detecting new human settlements from a time series of MODIS
images,” International Journal of Applied Earth Observation and Geoinformation, vol. 13,
no. 6, pp. 873–883, December 2011.
[9] P. van den Berg, “Transformasie van winterveld: Veranderde grondbenutting en
nedersettingsverdigting,” Master’s thesis, Department of Geography, University of Pretoria,
Pretoria, South Africa, October 1994.
[10] H. Eva, A. Brink, and D. Simonetti, “Monitoring land cover dynamics in sub-Saharan Africa,”
Institute for Environment and Sustainability, Tech. Rep. EUR 22498 EN, 2006.
[11] C. Johannsen, P. Carter, D. Morris, B. Erickson, and K. Ross, “Potential applications of remote
sensing,” Site-Specific Management Guidelines SSMG-22, Potash and Phosphate Institute,
Tech. Rep., 1999.
[12] R. Myneni and J. Ross, Photon-vegetation Interactions: Applications in Optical Remote Sensing
and Plant Physiology, 1st ed. New York, USA: Springer, 1991.
[13] S. Liang, Quantitative Remote Sensing of land surfaces, 1st ed. New York, USA: Wiley
Interscience, 2004.
[14] R. DeFries and J. Chan, “Multiple criteria for evaluating machine learning algorithms for land
cover classification from satellite data,” Remote Sensing of Environment, vol. 74, no. 3, pp.
503–515, December 2000.
[15] R. S. Lunetta, D. Johnson, J. Lyon, and J. Crotwell, “Impacts of imagery temporal frequency
on land-cover change detection monitoring,” Remote Sensing of Environment, vol. 89, no. 4, pp.
444–454, February 2004.
[16] J. Townshend and C. Justice, “Selecting the spatial resolution of satellite sensors required for
global monitoring of land transformations,” International Journal of Remote Sensing, vol. 9,
no. 2, pp. 187–236, February 1988.
[17] M. Hansen and R. DeFries, “Detecting long-term global forest change using continuous fields
of tree-cover maps from 8-km Advanced Very High Resolution Radiometer (AVHRR) data for
the years 1982-99,” Ecosystems, vol. 7, no. 7, pp. 695–716, November 2004.
[18] D. Lu and Q. Weng, “A survey of image classification methods and techniques for improving
classification performance,” International Journal of Remote Sensing, vol. 28, no. 5, pp.
823–870, January 2007.
[19] R. Lunetta, J. Knight, J. Ediriwickrema, J. Lyon, and L. Worthy, “Land-cover change detection
using multi-temporal MODIS NDVI data,” Remote Sensing of Environment, vol. 105, no. 2, pp.
142–154, November 2006.
[20] P. Coppin, I. Jonckheere, K. Nackaerts, B. Muys, and E. Lambin, “Digital change detection
methods in ecosystem monitoring: a review,” International Journal of Remote Sensing, vol. 25,
no. 9, pp. 1565–1596, May 2004.
[21] S. Gopal, C. Woodcock, and A. Strahler, “Fuzzy neural network classification of global land
cover from a 1 degree AVHRR data set,” Remote Sensing of Environment, vol. 67, no. 2, pp.
230–243, February 1999.
[22] G. Carpenter, S. Gopal, S. Macomber, S. Martens, C. Woodcock, and J. Franklin, “A neural
network method for efficient vegetation mapping,” Remote Sensing of Environment, vol. 70,
no. 3, pp. 326–338, December 1999.
[23] B. Braswell, S. Hagen, S. Frolking, and W. Salas, “A multivariable approach for mapping
sub-pixel land cover distributions using MISR and MODIS: application in the Brazilian Amazon
region,” Remote Sensing of Environment, vol. 87, no. 2-3, pp. 243–256, October 2003.
[24] D. Lu, P. Mausel, E. Brondizio, and E. Moran, “Change detection techniques,” International
Journal of Remote Sensing, vol. 25, no. 12, pp. 2365–2407, June 2004.
[25] H. Nemmour and Y. Chibani, “Neural network combination by fuzzy integral for robust change
detection in remotely sensed imagery,” EURASIP Journal on Applied Signal Processing, vol.
2005, no. 14, pp. 2187–2195, January 2005.
[26] T. Westra and R. de Wulf, “Monitoring Sahelian floodplains using Fourier analysis of MODIS
time-series data and artificial neural networks,” International Journal of Remote Sensing,
vol. 28, no. 7, pp. 1595–1610, January 2007.
[27] W. Wanner, A. H. Strahler, B. Hu, P. Lewis, J. Muller, X. Li, C. Schaaf, and M. Barnsley, “Global
retrieval of bidirectional reflectance and albedo over land from EOS MODIS and MISR data:
Theory and algorithm,” Journal of Geophysical Research, vol. 102, no. D14, pp. 17 143–17 161,
1997.
[28] C. Schaaf, F. Gao, A. Strahler, W. Lucht, X. Li, T. Tsang, N. Strugnell, X. Zhang, Y. Jin,
J. Muller, P. Lewis, M. Barnsley, P. Hobson, M. Disney, G. Roberts, M. Dunderdale, C. Doll,
R. d’Entremont, B. Hu, S. Liang, J. Privette, and D. Roy, “First Operational BRDF, Albedo and
Nadir Reflectance Products from MODIS,” Remote Sensing of Environment, vol. 83, no. 1, pp.
135–148, November 2002.
[29] E. Keogh and J. Lin, “Clustering of time-series subsequences is meaningless: implications for
previous and future research,” Knowledge and Information systems, vol. 8, no. 2, pp. 154–177,
August 2005.
[30] W. Kleynhans, J. Olivier, K. Wessels, F. van den Bergh, B. Salmon, and K. Steenkamp,
“Improving land-cover class separation using an extended Kalman filter on MODIS NDVI
time-series data,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 2, pp. 381–385,
April 2010.
[31] M. Jakubauskas, D. Legates, and J. Kastens, “Harmonic analysis of time-series AVHRR NDVI
data,” Photogrammetric Engineering and Remote Sensing, vol. 67, no. 4, pp. 461–470, April 2001.
[32] S. Lhermitte, J. Verbesselt, K. Nackaerts, and P. Coppin, “A segmentation of
vegetation-soil-climate complexes for South Africa based on SPOT vegetation time series,” in
2nd International Vegetation User Conference, vol. 1, Antwerp, Belgium, March 24–26, 2004,
pp. 1–7.
[33] S. Liang, Advances in land remote sensing: System, modeling, inversion and application, 1st ed.
New York, USA: Springer, 2008.
[34] W. Derman and S. Whiteford, Social impact analysis and development planning in the third
world, 1st ed. Colorado, USA: Westview Press, 1985.
[35] F. Hudson, A Geography of settlements, 2nd ed. London, UK: Macdonald and Evans Ltd, 1976.
[36] P. Harrison, “The policies and politics of informal settlements in South Africa: A historical
perspective,” Journal of Africa Insights, vol. 22, no. 1, pp. 14–22, 1992.
[37] A. Gilbert and J. Gugler, Cities, poverty and development: Urbanization in the third world,
1st ed. London, UK: Oxford University Press, 1982.
[38] A. Christopher, “Apartheid and urban segregation levels in South Africa,” Journal of Urban
Studies, vol. 27, no. 3, pp. 421–440, June 1990.
[39] C. de Wet, Moving together drifting apart: Betterment planning and villagisation in a South
African homeland, 1st ed. Johannesburg, South Africa: Witwatersrand University Press, 1995.
[40] B. Salmon, J. Olivier, K. Wessels, W. Kleynhans, F. van den Bergh, and K. Steenkamp,
“Unsupervised land cover change detection: Meaningless sequential time series analysis,” IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing,
vol. 4, no. 2, pp. 327–335, June 2011.
[41] G. Gutman, A. Janetos, C. Justice, E. Moran, J. Mustard, R. Rindfuss, D. Skole, B. Turner, and
M. Cochrane, Land Change Science: Observing, Monitoring, and Understanding Trajectories
of Change on the Earth’s Surface, 1st ed. New York, USA: Springer, 2004.
[42] T. Lillesand and R. Kiefer, Remote Sensing and Image Interpretation, 4th ed. New York, NY:
John Wiley and Sons, 2000.
[43] P. Gibson, Introductory Remote Sensing: Principles and Concepts, 1st ed. New York, NY:
Routledge, 2000.
[44] D. Halliday, R. Resnick, and J. Walker, Fundamentals of Physics, 1st ed. John Wiley and Sons,
1997.
[45] H. Pollack, S. Hurter, and J. Johnson, “Heat flow from the Earth’s interior: Analysis of the
global data set,” Reviews of Geophysics, vol. 31, no. 3, pp. 267–280, 1993.
[46] B. Nordell and B. Gervet, “Global energy accumulation and net heat emission,” International
Journal of Global Warming, vol. 1, no. 1–3, pp. 378–391, 2009.
[47] R. Dickinson, “Land surface processes and climate-surface albedos and energy balance,”
Advances in Geophysics, vol. 25, pp. 305–353, 1983.
[48] J. Foley, I. Prentice, N. Ramankutty, S. Levis, D. Pollard, S. Sitch, and A. Haxeltine, “An
integrated biosphere model of land surface processes, terrestrial carbon balance, and vegetation
dynamics,” Global Biogeochemical Cycles, vol. 10, no. 4, pp. 603–628, 1996.
[49] R. Dickinson, “Land processes in climate models,” Remote Sensing of Environment, vol. 51,
no. 1, pp. 27–38, January 1995.
[50] P. Tyson and R. Preston-Whyte, The weather and climate of southern Africa, 2nd ed. Oxford
University Press, 2002.
[51] J. Nagol, E. Vermote, and S. Prince, “Effects of atmospheric variation on AVHRR NDVI data,”
Remote Sensing of Environment, vol. 113, no. 2, pp. 392–397, February 2009.
[52] H. Ouaidrari and E. Vermote, “Operational atmospheric correction of Landsat TM data,” Remote
Sensing of Environment, vol. 70, no. 1, pp. 4–15, October 1999.
[53] R. Avissar and R. Pielke, “A parameterization of heterogeneous land-surface for atmospheric
numerical models and its impact on regional meteorology,” Monthly Weather Review, vol. 117,
no. 10, pp. 2113–2136, October 1989.
[54] R. Pielke, G. Dalu, J. Snook, T. Lee, and T. Kittel, “Nonlinear influence of mesoscale land use on
weather and climate,” Journal of Climate, vol. 4, no. 11, pp. 1053–1069, November 1991.
[55] J. Proakis and M. Salehi, Communication systems engineering, 2nd ed. Upper Saddle River,
New Jersey, USA: Prentice Hall, 2002.
[56] “Draft of the MODIS level 1B Algorithm Theoretical Basis Document Version 2.0,” SAIC/GSC
MODIS Characterization Support Team (MCST), Tech. Rep., February 1997.
[57] C. Justice, E. Vermote, J. Townshend, R. Defries, D. Roy, D. Hall, V. Salomonson, J. Privette,
G. Riggs, A. Strahler, W. Lucht, R. Myneni, Y. Knyazikhin, S. Running, R. Nemani, Z. Wan,
A. Huete, W. van Leeuwen, R. Wolfe, L. Giglio, J. Muller, P. Lewis, and M. Barnsley, “The
Moderate resolution imaging spectroradiometer (MODIS): Land remote sensing for global
change research,” IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 4, pp.
1228–1249, July 1998.
[58] W. Lucht, C. Schaaf, and A. Strahler, “An Algorithm for the retrieval of albedo from space using
semiempirical BRDF models,” IEEE Transactions on Geoscience and Remote Sensing, vol. 38,
no. 2, pp. 977–998, March 2000.
[59] W. Lucht and J. Roujean, “Considerations in the Parametric Modeling of BRDF and Albedo
from Multiangular Satellite Sensor Observations,” Remote Sensing Reviews, vol. 18,
no. 2–4, pp. 343–379, September 2000.
[60] W. Lucht and P. Lewis, “Theoretical noise sensitivity of BRDF and albedo retrieval from the
EOS-MODIS and MISR sensors with respect to angular sampling,” International Journal of
Remote Sensing, vol. 21, no. 1, pp. 81–98, January 2000.
[61] E. Vermote and A. Vermeulen, “Atmospheric correction algorithm: Spectral reflectance
(MOD09) algorithm theoretical basis document (ATBD),” Department of Geography, University
of Maryland, Tech. Rep., 1999.
[62] E. Vermote, N. Saleous, and C. Justice, “Atmospheric correction of MODIS data in the visible
to middle infrared: First results,” Remote Sensing of Environment, vol. 83, no. 1–2, pp. 97–111,
November 2002.
[63] F. Nicodemus, “Directional reflectance and emissivity of an opaque surface,” Journal of Applied
Optics, vol. 4, no. 7, pp. 767–773, May 1965.
[64] D. Roy, Y. Jin, P. Lewis, and C. Justice, “Prototyping a global algorithm for systematic
fire-affected area mapping using MODIS time series data,” Remote Sensing of Environment,
vol. 97, no. 2, pp. 137–162, July 2005.
[65] R. Wolfe, D. Roy, and E. Vermote, “MODIS Land data storage, gridding, and compositing
methodology: Level 2 grid,” IEEE Transactions on Geoscience and Remote Sensing, vol. 36,
no. 4, pp. 1324–1338, July 1998.
[66] W. Barnes, T. Pagano, and V. Salomonson, “Prelaunch characteristics of the Moderate
Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1,” IEEE Transactions on
Geoscience and Remote Sensing, vol. 36, no. 4, pp. 1088–1100, July 1998.
[67] A. Huete, K. Huemmrich, T. Miura, X. Xiao, K. Didan, W. van Leeuwen, F. Hall, and C. Tucker,
“Vegetation Index greenness global data set,” NASA ESDR/CDR, Tech. Rep. 1, April 2006.
[68] J. Rouse, R. Haas, D. Deering, and J. Schell, “Monitoring the vernal advancement and
retrogradation (Green wave effect) of natural vegetation,” Goddard Space Flight Center,
Greenbelt, Maryland 20771, Tech. Rep., October 1973.
[69] P. Sellers, “Canopy reflectance, photosynthesis, and transpiration,” International Journal of
Remote Sensing, vol. 6, no. 8, pp. 1335–1372, August 1985.
[70] R. Myneni, F. Hall, P. Sellers, and A. Marshak, “The interpretation of spectral vegetation
indexes,” IEEE Transactions on Geoscience and Remote Sensing, vol. 33, no. 2, pp. 481–486,
March 1995.
[71] B. Pinty and M. Verstraete, “A non-linear index to monitor global vegetation from satellites,”
Plant Ecology, vol. 101, no. 1, pp. 15–20, July 1992.
[72] A. Richardson and C. Wiegand, “Distinguishing vegetation from soil background information,”
Photogrammetric Engineering and Remote Sensing, vol. 43, no. 12, pp. 1541–1552, December
1977.
[73] A. Huete, “A soil-adjusted vegetation index (SAVI),” Remote Sensing of Environment, vol. 25,
no. 3, pp. 295–309, August 1988.
[74] Y. Kaufman and D. Tanre, “Atmospherically resistant vegetation index (ARVI) for
EOS-MODIS,” IEEE Transactions on Geoscience and Remote Sensing, vol. 30, no. 2, pp.
261–270, March 1992.
[75] F. Garcia-Haro, M. Gilabert, and J. Melia, “Monitoring fire-affected areas using Thematic
Mapper data,” International Journal of Remote Sensing, vol. 22, no. 4, pp. 533–549, March
2001.
[76] T. Fung and W. Siu, “Environmental quality and its changes, an analysis using NDVI,”
International Journal of Remote Sensing, vol. 21, no. 5, pp. 1011–1024, July 2000.
[77] E. Rosch, “Natural categories,” Cognitive Psychology, vol. 4, no. 3, pp. 328–350, May 1973.
[78] T. Fung, “Land use and land cover change detection with Landsat MSS and SPOT HRV data in
Hong Kong,” Geocarto International, vol. 7, no. 3, pp. 33–40, September 1992.
[79] N. Gautam and G. Chennaiah, “Land-use and land-cover mapping and change detection in
Tripura using satellite Landsat data,” International Journal of Remote Sensing, vol. 6, no. 3–4,
pp. 517–528, March 1985.
[80] K. Price, D. Pyke, and L. Mendes, “Shrub dieback in a semiarid ecosystem: the integration of
remote sensing and GIS for detecting vegetation change,” Photogrammetric Engineering and
Remote Sensing, vol. 58, no. 4, pp. 455–463, April 1992.
[81] D. Alves, J. Pereira, C. De Sousa, J. Soares, and F. Yamaguchi, “Characterizing landscape
changes in central Rondonia using Landsat TM imagery,” International Journal of Remote
Sensing, vol. 20, no. 14, pp. 2877–2882, September 1999.
[82] D. Fuller, “Satellite remote sensing of biomass burning with optical and thermal sensors,”
Progress in Physical Geography, vol. 24, no. 4, pp. 543–561, December 2000.
[83] V. Cuomo, R. Lasaponara, and V. Tramutoli, “Evaluation of a new satellite-based method for
forest fire detection,” International Journal of Remote Sensing, vol. 22, no. 9, pp. 1799–1826,
June 2001.
[84] J. Chan, K. Chan, and A. Yeh, “Detecting the nature of change in an urban environment:
a comparison of machine learning algorithms,” Photogrammetric Engineering and Remote
Sensing, vol. 67, no. 2, pp. 213–225, February 2001.
[85] X. Li and A. Yeh, “Principal component analysis of stacked multitemporal images for the
monitoring of rapid urban expansion in the Pearl River Delta,” International Journal of Remote
Sensing, vol. 19, no. 8, pp. 1501–1518, May 1998.
[86] J. Michalek, T. Wager, J. Luczkovich, and R. Stoffle, “Multispectral change vector analysis for
monitoring coastal marine environments,” Photogrammetric Engineering and Remote Sensing,
vol. 59, no. 3, pp. 381–384, March 1993.
[87] G. Zhou, J. Luo, C. Yang, B. Li, and S. Wang, “Flood monitoring using multitemporal AVHRR
and RADARSAT imagery,” Photogrammetric Engineering and Remote Sensing, vol. 66, no. 5,
pp. 633–638, May 2000.
[88] P. Agouris, A. Stefanidis, and S. Gyftakis, “Differential snakes for change detection in road
segments,” Photogrammetric Engineering and Remote Sensing, vol. 67, no. 12, pp. 1391–1399,
December 2001.
[89] R. Dwivedi and T. Sankar, “Monitoring shifting cultivation using space-borne multispectral and
multitemporal data,” International Journal of Remote Sensing, vol. 12, no. 3, pp. 427–433,
March 1991.
[90] W. Kleynhans, B. Salmon, J. Olivier, K. Wessels, and F. van den Bergh, “A comparison of
feature extraction methods within a spatio-temporal land cover change detection framework,”
in IEEE International Geoscience and Remote Sensing Symposium, vol. 1, Vancouver, Canada,
July 24–29, 2011, pp. 688–691.
[91] J. Townshend, C. Justice, C. Gurney, and J. McManus, “The impact of misregistration on
change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 30, no. 5, pp.
1054–1060, September 1992.
[92] X. Dai and S. Khorram, “The effects of image misregistration on the accuracy of remotely sensed
change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 36, no. 5, pp.
1566–1577, September 1998.
[93] R. Nelson, “Detecting forest canopy change due to insect activity using Landsat MSS,”
Photogrammetric Engineering and Remote Sensing, vol. 49, no. 9, pp. 1303–1314, September
1983.
[94] J. Lyon, D. Yuan, R. Lunetta, and C. Elvidge, “A change detection experiment using vegetation
indices,” Photogrammetric Engineering and Remote Sensing, vol. 64, no. 2, pp. 143–150, 1998.
[95] K. Green, D. Kempka, and L. Lackley, “Using remote sensing to detect and monitor land-cover
and land-use change,” Photogrammetric Engineering and Remote Sensing, vol. 60, no. 3, pp.
331–337, 1994.
[96] J. Jensen and D. Toll, “Detecting residential land use development at the urban fringe,”
Photogrammetric Engineering and Remote Sensing, vol. 48, no. 4, pp. 629–643, April 1982.
[97] P. Chavez and D. MacKinnon, “Automatic detection of vegetation changes in the southwestern
United States using remotely sensed images,” Photogrammetric Engineering and Remote
Sensing, vol. 60, no. 5, pp. 571–583, May 1994.
[98] A. Singh, “Digital change detection techniques using remotely sensed data,” International
Journal of Remote Sensing, vol. 10, no. 6, pp. 989–1003, June 1989.
[99] J. Adams, D. Sabol, V. Kapos, R. Filho, D. Roberts, M. Smith, and A. Gillespie, “Classification
of multispectral images based on fractions of endmembers: application to land-cover change
in the Brazilian Amazon,” Remote Sensing of Environment, vol. 52, no. 2, pp. 137–154, May
1995.
[100] S. Macomber and C. Woodcock, “Mapping and monitoring conifer mortality using remote
sensing in the Lake Tahoe Basin,” Remote Sensing of Environment, vol. 50, no. 3, pp. 255–266,
December 1994.
[101] C. Lo and R. Shipman, “A GIS approach to land-use change dynamics detection,”
Photogrammetric Engineering and Remote Sensing, vol. 56, no. 11, pp. 1483–1491, November
1990.
[102] T. Stone and P. Lefebvre, “Using multitemporal satellite data to evaluate selective logging in
Para, Brazil,” International Journal of Remote Sensing, vol. 19, no. 13, pp. 2517–2526, January
1998.
[103] R. Lawrence and W. Ripple, “Calculating change curves for multitemporal satellite imagery:
Mount St. Helens 1980–1995,” Remote Sensing of Environment, vol. 67, no. 3, pp. 309–319,
March 1999.
[104] T. Yue, S. Chen, B. Xu, Q. Liu, H. Li, G. Liu, and Q. Ye, “A curve-theorem based approach for
change detection and its application to Yellow River Delta,” International Journal of Remote
Sensing, vol. 23, no. 11, pp. 2283–2292, June 2002.
[105] G. Henebry, “Detecting change in grasslands using measures of spatial dependence with Landsat
TM data,” Remote Sensing of Environment, vol. 46, no. 2, pp. 223–234, November 1993.
[106] J. Verbesselt, R. Hyndman, G. Newnham, and D. Culvenor, “Detecting trend and seasonal
changes in satellite image time series,” Remote Sensing of Environment, vol. 114, no. 1, pp.
106–115, January 2010.
[107] R. Lunetta, J. Ediriwickrema, D. Johnson, J. Lyon, and A. McKerrow, “Impact of vegetation
dynamics on the identification of land-cover change in a biologically complex community in
North Carolina, USA,” Remote Sensing of Environment, vol. 82, no. 2–3, pp. 258–270, October
2002.
[108] T. Loveland, J. Merchant, J. Brown, D. Ohlen, B. Reed, P. Olson, and J. Hutchinson, “Seasonal
land-cover regions of the United States,” Annals of the Association of American Geographers,
vol. 85, no. 2, pp. 339–355, June 1995.
[109] R. Kennedy, W. Cohen, and T. Schroeder, “Trajectory-based change detection for automated
characterization of forest disturbance dynamics,” Remote Sensing of Environment, vol. 110,
no. 3, pp. 370–386, October 2007.
[110] F. Bovolo and L. Bruzzone, “A Split-based approach to unsupervised change detection
in large-size multitemporal images: Application to Tsunami-damage assessment,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 45, no. 6, pp. 1658–1670, June 2007.
[111] C. Jha and N. Unnia, “Digital change detection of forest conversion of a dry tropical Indian forest
region,” International Journal of Remote Sensing, vol. 15, no. 13, pp. 2543–2552, September
1994.
[112] P. Howarth and G. Wickware, “Procedures for change detection using Landsat digital data,”
International Journal of Remote Sensing, vol. 2, no. 3, pp. 277–291, August 1981.
[113] R. Townshend and C. Justice, “Spatial variability of images and the monitoring of changes in
the normalized difference vegetation index,” International Journal of Remote Sensing, vol. 16,
no. 12, pp. 2187–2195, August 1995.
[114] E. Lambin and A. Strahler, “Indicators of land-cover change for change-vector analysis in
multitemporal space at coarse spatial scales,” International Journal of Remote Sensing, vol. 15,
no. 10, pp. 2099–2119, July 1994.
[115] S. Mitra, Digital signal processing: A computer-based approach, 2nd ed. New York, USA:
McGraw-Hill, 2002.
[116] S. Lhermitte, J. Verbesselt, I. Jonckheere, K. Nackaerts, J. van Aardt, W. Verstraeten, and
P. Coppin, “Hierarchical image segmentation based on similarity of NDVI time series,” Remote
Sensing of Environment, vol. 112, no. 2, pp. 506–521, February 2008.
[117] J. Verbesselt, R. Hyndman, A. Zeileis, and D. Culvenor, “Phenological change detection while
accounting for abrupt and gradual trends in satellite image time series,” Remote Sensing of
Environment, vol. 114, no. 12, pp. 2970–2980, December 2010.
[118] C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese, “Major
disturbance events in terrestrial ecosystems detected using global satellite data sets,” Global
Change Biology, vol. 9, no. 7, pp. 1005–1021, July 2003.
[119] D. Mildrexler, M. Zhao, and S. Running, “Testing a MODIS Global Disturbance Index across
North America,” Remote Sensing of Environment, vol. 113, no. 10, pp. 2103–2117, October
2009.
[120] W. Kleynhans, J. Olivier, K. Wessels, B. Salmon, F. van den Bergh, and K. Steenkamp,
“Detecting land cover change using an Extended Kalman Filter on MODIS NDVI time series
data,” IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, pp. 507–511, May 2011.
[121] W. Kleynhans, B. Salmon, J. Olivier, F. van den Bergh, K. Wessels, T. Grobler, and
K. Steenkamp, “Land cover change detection using autocorrelation analysis on MODIS
time-series data: Detection of new human settlements in the Gauteng province of South Africa,”
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, In press
2011.
[122] M. Hansen, R. DeFries, J. Townshend, M. Carroll, C. Dimiceli, and R. Sohlberg, “Global
Percent Tree Cover at a spatial resolution of 500 meters: First results of the MODIS vegetation
continuous fields algorithm,” Earth Interactions, vol. 7, no. 10, pp. 1–15, October 2003.
[123] X. Zhan, R. Sohlberg, J. Townshend, C. DiMiceli, M. Carroll, J. Eastman, M. Hansen, and
R. DeFries, “Detection of land cover changes using MODIS 250m data,” Remote Sensing of
Environment, vol. 83, no. 1–2, pp. 336–350, November 2002.
[124] A. Strahler, D. Muchoney, J. Borak, M. Friedl, S. Gopal, E. Lambin, and A. Moody, “MODIS
Land Cover Product Algorithm Theoretical Basis Document (ATBD): MODIS Land Cover and
Land-Cover Change,” Boston: Boston University, Tech. Rep., May 1999.
[125] J. Vermaak and E. Botha, “Recurrent neural networks for short-term load forecasting,” IEEE
Transactions on Power Systems, vol. 13, no. 1, pp. 126–132, February 1998.
[126] X. Wang, L. Xiu-Xia, and J. Sun, “A new approach of neural networks to time-varying database
classification,” in IEEE Proceedings Machine Learning and Cybernetics, vol. 4, Guangzhou,
China, August 18–21, 2005, pp. 2050–2054.
[127] S. Salzberg, “On comparing classifiers: Pitfalls to avoid and a recommended approach,” Data
mining and knowledge discovery, vol. 1, no. 3, pp. 317–328, September 1997.
[128] L. Bruzzone and S. Serpico, “An iterative technique for the detection of land-cover transitions in
multitemporal remote-sensing images,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 35, no. 4, pp. 858–867, July 1997.
[129] C. Burges, “A Tutorial on support vector machines for pattern recognition,” Data Mining and
Knowledge Discovery, vol. 2, no. 2, pp. 121–167, June 1998.
[130] C. Bishop, Neural Networks for Pattern Recognition, 2nd ed. New York, USA: Oxford
University Press, 1995.
[131] M. Richard and R. Lippmann, “Neural network classifiers estimate Bayesian a posteriori
probabilities,” Neural Computation, vol. 3, no. 4, pp. 461–483, 1991.
[132] H. White, “Connectionist nonparametric regression: multilayer feedforward networks can learn
arbitrary mappings,” Journal of Neural Networks, vol. 3, no. 5, pp. 535–549, 1990.
[133] J. Hopfield, “Learning algorithms and probability distributions in feed-forward and feed-back
networks,” Proceedings of the National Academy of Sciences, vol. 84, no. 23, pp. 8429–8433,
December 1987.
[134] J. Hampshire and B. Pearlmutter, “Equivalence proofs for multilayer perceptron classifiers and
the Bayesian discriminant function,” in Proceedings of the 1990 Connectionist Models Summer
School, vol. 1, San Mateo, CA, USA, 1990, pp. 159–172.
[135] C. Bishop, “Novelty detection and neural network validation,” IEE Proceedings: Vision, Image
and Signal Processing, vol. 141, no. 4, pp. 217–222, August 1994.
[136] P. Hartono and H. Shuji, “Learning from imperfect data,” Journal of Applied Soft Computing,
vol. 7, no. 1, pp. 353–363, January 2007.
[137] I. Bruha and A. Famili, “Postprocessing in machine learning and data mining,” ACM SIGKDD
Explorations Newsletter - Special issue on Scalable data mining algorithms, vol. 2, no. 2, pp.
110–114, December 2000.
[138] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. New Jersey,
USA: Prentice Hall, 2002.
[139] F. Rosenblatt, “The perceptron – a perceiving and recognizing automaton,” Cornell Aeronautical
Laboratory, Tech. Rep. 85-460-1, 1957.
[140] M. Minsky and S. Papert, Perceptrons, 1st ed. Cambridge, Massachusetts, USA: MIT Press,
1969.
[141] D. MacKay, Information Theory, Inference, and Learning Algorithms, 1st ed. Cambridge,
United Kingdom: Cambridge University Press, 2003.
[142] A. Kolmogorov, “On the representation of continuous functions of several variables by
superposition of continuous functions of one variable and addition,” Doklady Akademii Nauk
USSR, vol. 114, pp. 679–681, 1957.
[143] R. Duda, P. Hart, and D. Stork, Pattern classification, 2nd ed. New York: Wiley-Interscience,
2000.
[144] A. Barron, “Universal approximation bounds for superposition of a sigmoidal function,” IEEE
Transactions on Information Theory, vol. 39, no. 3, pp. 930–945, May 1993.
[145] R. Lippmann, “An introduction to computing with neural nets,” IEEE ASSP Magazine, vol. 4,
no. 2, pp. 4–22, April 1987.
[146] D. Rumelhart and J. McClelland, Parallel Distributed Processing, 1st ed. Cambridge: MIT
Press, 1987.
[147] Y. Le Cun, P. Simard, and B. Pearlmutter, “Automatic learning rate maximization by on-line
estimation of the Hessian eigenvectors,” Advances in Neural Information Processing Systems,
vol. 5, pp. 156–163, 1993.
[148] D. Plaut, S. Nowlan, and G. Hinton, “Experiments on learning by back propagation,”
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep.
CMU-CS-86-126, 1986.
[149] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The
RPROP algorithm,” in Proceedings of the IEEE International Conference on Neural Networks,
vol. 1, San Francisco, CA, USA, 28 March – 1 April, 1993, pp. 586–591.
[150] S. Fahlman, “Faster-learning variation back-propagation: an empirical study,” in Proceedings of
the 1988 Connectionist Models Summer School, vol. 1, San Mateo, CA, USA, 1988, pp. 38–51.
[151] R. Brent, Algorithms for minimization without derivatives, 1st ed. Englewood Cliffs, NJ, USA:
Prentice Hall, 1973.
[152] M. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal
of Research of the National Bureau of Standards, vol. 46, no. 6, pp. 409–436, 1952.
[153] J. Dennis and R. Schnabel, Numerical methods for unconstrained optimization and nonlinear
equations, 1st ed. New Jersey, US: Society for Industrial Mathematics, 1987.
[154] D. Shanno, “Conjugate gradient methods with inexact searches,” Mathematics of Operations
Research, vol. 3, no. 3, pp. 244–256, 1978.
[155] K. Levenberg, “A method for the solution of certain non-linear problems in least squares,”
Quarterly Journal of Applied Mathematics, vol. 2, no. 2, pp. 164–168, 1944.
[156] D. Marquardt, “An algorithm for least-squares estimation of non-linear parameters,” Journal of
the Society of Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
[157] J. Moody and C. Darken, “Fast learning in networks of locally tuned processing units,” Neural
Computation, vol. 1, no. 2, pp. 281–294, 1989.
[158] S. Chen, C. Cowan, and P. Grant, “Orthogonal least squares learning algorithm for Radial Basis
Function networks,” IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 302–309, March
1991.
[159] T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biological
Cybernetics, vol. 43, no. 1, pp. 59–69, January 1982.
[160] T. Kohonen, Self-organization and associative memory, 2nd ed. Berlin: Springer-Verlag, 1987.
[161] J. Hopfield and D. Tank, “Neural computations of decisions in optimization problems,” Biological
Cybernetics, vol. 52, no. 3, pp. 1–25, July 1985.
[162] J. Hopfield, “Neural networks and physical systems with emergent collective computational
abilities,” Proceedings of the National Academy of Sciences of USA, vol. 79, no. 8, pp.
2554–2558, April 1982.
[163] J. Li, A. Michel, and W. Porod, “Analysis and synthesis of a class of neural networks: linear
systems operating on a closed hypercube,” IEEE Transactions on Circuits and Systems, vol. 36,
no. 11, pp. 1405–1422, November 1989.
[164] M. Negnevitsky, Artificial Intelligence: A guide to intelligent systems, 1st ed. Essex, England,
UK: Addison Wesley, 2002.
[165] V. Kecman, Learning and soft computing; Support Vector Machines, Neural Networks and Fuzzy
Logic Models, 1st ed. Cambridge, Massachusetts: MIT Press, 2001.
[166] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, 1st ed. Belmont, MA, USA:
Athena Scientific, 1996.
[167] E. Baum and D. Haussler, “What size net gives valid generalization,” Neural Computation,
vol. 1, no. 1, pp. 151–160, 1989.
[168] R. Caruana, S. Lawrence, and C. Giles, “Overfitting and neural networks: conjugate gradient
and backpropagation,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference
on Neural Networks, vol. 1, Como, Italy, July 24–27, 2000, pp. 114–119.
[169] A. Weigend, “On overfitting and the effective number of hidden units,” in Proceedings of the
1993 Connectionist Models Summer School, vol. 1, San Mateo, CA, USA, 1993, pp. 335–342.
[170] A. Jain, “Data clustering: 50 years beyond K-means,” Pattern recognition letters, vol. 31, no. 8,
pp. 651–666, June 2010.
[171] A. Jain, M. Murty, and P. Flynn, “Data clustering: A review,” ACM Computing Surveys, vol. 31,
no. 3, pp. 264–323, September 1999.
[172] J. Kleinberg, “An impossibility theorem for clustering,” in Advances in Neural Information
Processing Systems 15. Cambridge, MA: MIT Press, 2003, pp. 446–453.
[173] A. Jain and R. Dubes, Algorithms for clustering data, 1st ed. Upper Saddle River, NJ, USA:
Prentice Hall, 1988.
[174] G. Nagy, “State of the art in pattern recognition,” Proceedings of the IEEE, vol. 56, no. 5, pp.
836–863, May 1968.
[175] F. Backer and L. Hubert, “A graph-theoretic approach to goodness-of-fit in complete-link
hierarchical clustering,” Journal of the American Statistical Association, vol. 71, no. 356, pp. 870–878,
December 1976.
[176] J. Ward, “Hierarchical grouping to optimize an objective function,” Journal of the American
Statistical Association, vol. 58, no. 301, pp. 236–244, March 1963.
[177] R. Sokal and F. Rohlf, “The comparison of dendrograms by objective methods,” Taxon, vol. 6,
no. 2, pp. 33–40, February 1962.
[178] H. Steinhaus, “Sur la division des corps materiels en parties,” Bulletin of the Polish Academy of
Science, vol. 4, no. 1, pp. 801–804, 1956.
[179] M. Anderberg, Cluster Analysis for Applications: Monographs and Textbooks on Probability
and Mathematical Statistics, 1st ed. New York, USA: Academic Press, Inc., 1973.
[180] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, “Clustering large graphs via the
singular value decomposition,” Machine learning, vol. 56, no. 1–3, pp. 9–33, July 2004.
[181] M. Meila, “The uniqueness of a good optimum for k-means,” in Proceedings of the 23rd
International Conference on Machine Learning, vol. 1, Pennsylvania, USA, June 25–29, 2006,
pp. 625–632.
[182] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM
algorithm,” Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1–38, 1977.
[183] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis,
9th ed. New Jersey: Wiley-Interscience, 1990.
[184] G. Goodwin, S. Graebe, and M. Salgado, Control system design, 1st ed. Upper Saddle River,
New Jersey, USA: Prentice-Hall, 2001.
[185] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman filter: Particle Filters for
Tracking Applications, 1st ed. London, UK: Artech House, 2004.
[186] R. Kalman, “A new approach to linear filtering and prediction problems,” Transactions ASME
Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
[187] R. Kalman and R. Bucy, “New results in linear filtering and prediction theory,” Transactions
ASME Journal of Basic Engineering, vol. 83, no. Series D, pp. 95–107, 1961.
[188] S. Julier and J. Uhlmann, “Unscented Filtering and Nonlinear Estimation,” in Proceedings of
the IEEE, vol. 92, no. 3, March 2004, pp. 401–422.
[189] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes in C++: The art of
scientific computing, 2nd ed. Cambridge, UK: Cambridge University Press, 2002.
[190] J. Nelder and R. Mead, “A simplex method for function minimization,” Computer Journal,
vol. 7, no. 4, pp. 308–313, 1965.
[191] G. Carlson, Signal and Linear system analysis, 2nd ed. New York, USA: John Wiley and Sons
Inc., 1998.
[192] G. Das, K. Lin, H. Mannila, G. Renganathan, and P. Smyth, “Rule Discovery from time series,”
in Proceedings of the 4th International Conference on Knowledge Discovery and Data mining,
vol. 1, New York, USA, August 27–31, 1998, pp. 16–22.
[193] N. Radhakrishnan, J. Wilson, and P. Loizou, “An alternate partitioning technique to quantify
the regularity of complex time series,” International Journal of Bifurcation and Chaos, vol. 10,
no. 7, pp. 1773–1779, July 2000.
[194] P. Cotofrei, “Statistical temporal rules,” in Proceedings of the 15th Conference on
Computational Statistics, vol. 1, Berlin, Germany, August 24–28, 2002, pp. 24–28.
[195] C. Schittenkopf, P. Tino, and G. Dorffner, “The benefits of information reduction for trading
strategies,” Report series for adaptive information systems and management in economics and
management science, Tech. Rep. 45, 2000.
[196] T. Yairi, Y. Kato, and K. Hori, “Fault detection by mining association rules in house-keeping
data,” in Proceedings of the 6th International Symposium on Artificial Intelligence, Robotics
and Automation in space, vol. 1, Montreal, Canada, June 18–22, 2001, pp. 18–21.
[197] C. Aggarwal, A. Hinneburg, and D. Keim, “On the surprising behaviour of distance metrics
in high dimensional space,” in Proceedings of the 8th International Conference on Database
Theory, vol. 1, London, UK, January 4–6, 2001, pp. 420–434.
[198] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is nearest neighbour
meaningful?” in Proceedings of the 7th International Conference on Database Theory, vol. 1,
Jerusalem, Israel, January 10–12, 1999, pp. 217–235.
[199] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionality reduction for fast
similarity search in large time series databases,” Journal of Knowledge and Information systems,
vol. 3, no. 3, pp. 263–286, August 2001.
[200] A. Oppenheim, R. Schafer, and J. Buck, Discrete-Time Signal Processing, 2nd ed. New Jersey,
USA: Prentice-Hall Signal Processing series, 1999.
[201] R. Bellman, Adaptive control processes: A guided tour. Princeton, New Jersey: Princeton
University Press, 1961.
[202] M. Jakubauskas, D. Legates, and J. Kastens, “Crop identification using harmonic analysis of the
time-series AVHRR NDVI data,” Computers and Electronics in Agriculture, vol. 37, no. 1–3,
pp. 127–139, November 2002.
[203] R. Juarez and W. Liu, “FFT analysis on NDVI annual cycle and climatic regionality in northeast
Brazil,” International Journal of Climatology, vol. 21, no. 14, pp. 1803–1820, December 2001.
[204] M. Chen, S. Liu, L. Tieszen, and D. Hollinger, “An improved state-parameter analysis of
ecosystem models using data assimilation,” Ecological Modelling, vol. 219, no. 3–4, pp.
317–326, December 2008.
[205] O. Samain, J. Roujean, and B. Geiger, “Use of a Kalman filter for the retrieval of surface BRDF
coefficients with a time-evolving model based on the ECOCLIMAP land cover classification,”
Remote Sensing of Environment, vol. 112, no. 4, pp. 1337–1346, April 2008.
[206] J. Mendel, Lessons in digital estimation theory, 1st ed. The University of Michigan:
Prentice-Hall, 1987.
[207] M. Nikulin, D. Commenges, and C. Huber, Probability, Statistics and Modeling in public health,
1st ed. 233 Spring street, New York, USA: Springer, 2005.
[208] M. Nikulin, N. Limnois, N. Balakrishnan, W. Kahle, and C. Huber-Carol, Advances in
degradation modeling: Applications to reliability, survival analysis, and finance, 1st ed. 233
Spring street, New York, USA: Springer, 2010.
[209] R. Mehra, “On the identification of variances and adaptive Kalman filtering,” IEEE Transactions
on Automatic Control, vol. 15, no. 12, pp. 175–184, April 1970.
[210] B. Carew and P. Belanger, “Identification of optimum filter steady-state gain for systems with
unknown noise covariances,” IEEE Transactions on Automatic Control, vol. 18, no. 6, pp.
582–587, December 1973.
[211] G. Noriega and S. Pasupathy, “Adaptive estimation of noise covariance matrices in real-time
preprocessing of geophysical data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35,
no. 5, pp. 1146–1159, September 1997.
[212] M. Rajamani and J. Rawlings, “Estimation of the disturbance structure from data using
semidefinite programming and optimal weighting,” Automatica, vol. 45, no. 1, pp. 142–148,
January 2009.
[213] R. Shumway and D. Stoffer, “An approach to time series smoothing and forecasting using the
EM algorithm,” Journal of Time Series Analysis, vol. 3, no. 4, pp. 253–264, July 1982.
[214] R. Hirschowitz, “Mid-year estimates Statistical release,” Statistics South Africa, Tech. Rep.
P0302, 2000.
[215] P. Lehohla, “Mid-year population estimates,” Statistics South Africa, Tech. Rep. P0302, 2010.
[216] D. Beaudette and A. O’Geen, “Soil-Web: An online soil survey for California, Arizona, and
Nevada,” Computers and Geosciences, vol. 35, no. 10, pp. 2119–2128, October 2009.
[217] M. Clark and T. Aide, “Virtual interpretation of Earth Web-interface tool (VIEW-IT) for
collecting land-use/land-cover reference data,” Remote Sensing, vol. 3, no. 3, pp. 601–620,
March 2011.
[218] L. Olsson, L. Eklundh, and J. Ardo, “A recent greening of the Sahel-trends, patterns and
potential causes,” Journal of Arid Environments, vol. 63, no. 3, pp. 556–566, November 2005.
[219] V. Vanacker, M. Linderman, F. Lupo, S. Flasse, and E. Lambin, “Impact of short-term rainfall
fluctuation on inter-annual land cover change in sub-Saharan Africa,” Global Ecology and
Biogeography, vol. 14, no. 2, pp. 123–135, January 2005.
[220] S. Mehrotra, “On the implementation of a Primal Dual Interior Point method,” SIAM Journal on
Optimization, vol. 2, no. 4, pp. 575–601, 1992.
[221] P. Gill, W. Murray, M. Saunders, and M. Wright, “Procedures for Optimization Problems with
a Mixture of Bounds and General Linear Constraints,” ACM Transactions on Mathematical
Software, vol. 10, no. 3, pp. 282–298, September 1984.
[222] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by Simulated Annealing,” Science, vol.
220, no. 4598, pp. 671–680, May 1983.
[223] M. Friedl, D. Sulla-Menashe, B. Tan, A. Schneider, N. Ramankutty, A. Sibley, and X. Huang,
“MODIS collection 5 global land cover: algorithm refinement and characterization of new
datasets,” Remote Sensing of Environment, vol. 114, no. 1, pp. 168–182, January 2010.
[224] W. Kleynhans, “Detecting land-cover change using MODIS time-series data,” Ph.D.
dissertation, Department of Electrical, Electronic and Computer Engineering, University of
Pretoria, Pretoria, South Africa, September 2011.
[225] M. Thompson, “A standard land-cover classification scheme for remote sensing applications in
South Africa,” South African Journal of Science, vol. 92, no. 1, pp. 34–42, January 1996.
[226] M. Thompson, H. van den Berg, T. Newby, and D. Hoare, “Guideline procedures for the
National Land-Cover mapping and change monitoring,” Council for Scientific and Industrial
Research and Agricultural Research Council, Tech. Rep., March 2001.
APPENDIX A
PUBLICATIONS EMANATING FROM THIS THESIS AND RELATED WORK
A.1 PAPERS THAT APPEARED IN THOMSON INSTITUTE FOR SCIENTIFIC INFORMATION JOURNALS
• Salmon B.P., Olivier J.C., Wessels K.J., Kleynhans W., van den Bergh F., Steenkamp K.C., “The
use of a Multilayer Perceptron for detecting new human settlements from a time series of MODIS
images”, International Journal of Applied Earth Observation and Geoinformation, vol. 13, no.
6, December 2011, pp 873–883
• Salmon B.P., Olivier J.C., Wessels K.J., Kleynhans W., van den Bergh F., Steenkamp
K.C., “Unsupervised land cover change detection: Meaningful Sequential Time Series Analysis”,
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 4, no.
2, June 2011, pp 327–335
• Kleynhans W., Olivier J.C., Wessels K.J., Salmon B.P., van den Bergh F., Steenkamp
K.C., “Improving land cover class separation using an extended Kalman filter on MODIS NDVI
time-series data”, IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 2, April 2010, pp
381–385
• Kleynhans W., Olivier J.C., Wessels K.J., Salmon B.P., van den Bergh F., Steenkamp
K.C., “Detecting Land Cover Change Using an Extended Kalman Filter on MODIS NDVI Time
Series Data”, IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 3, 2011, pp 507–511
• Kleynhans W., Salmon B.P., Olivier J.C., van den Bergh F., Wessels K.J., Grobler T.L., and
Steenkamp K.C., “Land Cover Change Detection Using Autocorrelation Analysis on MODIS
Time-Series Data: Detection of new human settlements in the Gauteng province of South Africa”,
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, In press
• Ackermann E.R., Grobler T.L., Kleynhans W., Olivier J.C., Salmon B.P., and van Zyl A.J.,
“Cavalieri Integration: a Novel Integration Technique”, Quaestiones Mathematicae, In press
• Grobler T.L., Ackermann E.R., van Zyl A.J., Olivier J.C., Kleynhans W., and Salmon B.P.,
“Synthesizing Multispectral MODIS Surface Spectral Reflectance Time Series Data”, IEEE
Geoscience and Remote Sensing Letters, In press
• Grobler T.L., Ackermann E.R., van Zyl A.J., Olivier J.C., Kleynhans W., and Salmon B.P.,
“Using Page’s Cumulative Sum Test on MODIS time series to detect land cover changes”, IEEE
Geoscience and Remote Sensing Letters, In press
A.2 PAPERS PUBLISHED IN REFEREED ACCREDITED CONFERENCE PROCEEDINGS
• Salmon B.P., Kleynhans W., van den Bergh F., Olivier J.C., Marais W.J., Grobler T.L., Wessels
K.J., “A search algorithm to meta-optimize the parameters for an extended Kalman filter to
improve classification on hyper-temporal images”, Accepted for publication, IEEE Geoscience
and Remote Sensing Symposium 2012, Munich, Germany, 22 July - 27 July 2012
• Salmon B.P., Kleynhans W., van den Bergh F., Olivier J.C., Wessels K.J., “Detecting land cover
change by evaluating the internal covariance matrix of the extended Kalman filter”, Accepted
for publication, IEEE Geoscience and Remote Sensing Symposium 2012, Munich, Germany, 22
July - 27 July 2012
• Grobler T.L., Ackermann E.R., van Zyl A.J., Kleynhans W., Salmon B.P., Olivier J.C.,
“Sequential classification of MODIS time series”, Accepted for publication, IEEE Geoscience
and Remote Sensing Symposium 2012, Munich, Germany, 22 July - 27 July 2012
• Kleynhans W., Salmon B.P., Olivier J.C., van den Bergh F., Wessels K.J., Grobler T.L.,
“Detecting land cover change using a sliding window temporal autocorrelation approach”,
Accepted for publication, IEEE Geoscience and Remote Sensing Symposium 2012, Munich,
Germany, 22 July - 27 July 2012
• Kleynhans W., Salmon B.P., Olivier J.C., Wessels K.J., van den Bergh F., “A comparison of
feature extraction methods within a spatio-temporal land cover change detection framework”,
IEEE Geoscience and Remote Sensing Symposium 2011, Vancouver, Canada, 25 July - 29 July
2011
• Salmon B.P., Olivier J.C., Kleynhans W., Wessels K.J., van den Bergh F., “Automated land
cover change detection: The quest for meaningful high temporal time series extraction”, IEEE
Geoscience and Remote Sensing Symposium 2010, Honolulu, Hawaii, United States, 25 July -
30 July 2010
• Kleynhans W., Olivier J.C., Salmon B.P., Wessels K.J., van den Bergh F., “A spatio-temporal
approach to detecting land cover change using an extended Kalman filter on MODIS time series
data”, IEEE Geoscience and Remote Sensing Symposium 2010, Honolulu, Hawaii, United
States, 25 July - 30 July 2010
A.3 INVITED CONFERENCE PAPERS IN REFEREED ACCREDITED CONFERENCE PROCEEDINGS
• Salmon B.P., Olivier J.C., Kleynhans W., Wessels K.J., van den Bergh F., “The quest for
automated land cover change detection using satellite time series data meaningful high temporal
time series extraction”, IEEE Geoscience and Remote Sensing Symposium 2009, Cape Town,
South Africa, 12 July - 17 July 2009
• Kleynhans W., Olivier J.C., Salmon B.P., Wessels K.J., van den Bergh F., “Improving NDVI time
series class separation using an extended Kalman filter temporal time series extraction”, IEEE
Geoscience and Remote Sensing Symposium 2009, Cape Town, South Africa, 12 July - 17 July
2009
• Kleynhans W., Salmon B.P., Olivier J.C., Wessels K.J., van den Bergh F., “An autocorrelation
analysis approach to detecting land cover change using hyper-temporal time-series data”, Joint
invite for publication, IEEE Geoscience and Remote Sensing Symposium 2011, Vancouver,
Canada, 25 July - 29 July 2011
A.4 PAPERS SUBMITTED TO REFEREED ACCREDITED CONFERENCE PROCEEDINGS
• Kleynhans W., Salmon B.P., “Monitoring informal settlements using SAR polarimetry”,
Submitted for review, African Association of Remote Sensing of the Environment (AARSE)
2012
A.5 BEST PAPER AWARD
• Salmon B.P., Kleynhans W., van den Bergh F., Olivier J.C., Marais W.J., Wessels
K.J., “Meta-optimization of the extended Kalman filter’s parameters for improved feature
extraction on hyper-temporal images”, IEEE Geoscience and Remote Sensing Symposium 2011,
Vancouver, Canada, 25 July - 29 July 2011
LIST OF TABLES
2.1 Specification of different remote sensing sensors.
2.2 MODIS spectral bands properties and characteristics.
2.3 MODIS land cover products.
6.1 Sequence of features extracted with sliding window at increments of π/2.
6.2 Sequence of features extracted with sliding window at increments of 2π.
8.1 Number of pixels used for training, validation and testing data sets.
8.2 The number of hidden nodes used within the MLP.
8.3 Classification accuracy of the batch mode and iteratively retrained MLP.
8.4 Classification accuracy of MLP using BVEP and ALS.
8.5 Parameter evaluation of simulated annealing and BVSA.
8.6 Parameter evaluation of MODIS spectral bands and NDVI in Limpopo province.
8.7 Parameter evaluation of MODIS spectral bands and NDVI in Gauteng province.
8.8 The Cophenetic correlation coefficient computed for hierarchical clustering methods.
8.9 Classification accuracy of MLP using SFF.
8.10 Classification accuracy of MLP using regression methods.
8.11 Classification accuracy of single, average and complete linkage criteria using SFF.
8.12 Classification accuracy of Ward clustering method using SFF.
8.13 Classification accuracy of Ward clustering method using regression methods.
8.14 Classification accuracy of K-means using SFF.
8.15 Classification accuracy of K-means using regression methods.
8.16 Classification accuracy of EM algorithm using SFF.
8.17 Classification accuracy of EM algorithm using regression methods.
8.18 Change detection accuracy on simulated land cover change in Limpopo province.
8.19 Change detection accuracy on simulated land cover change in Gauteng province.
8.20 Change detection accuracy on real land cover change in Limpopo province.
8.21 Change detection accuracy on real land cover change in Gauteng province.
8.22 Effective change detection delay in Limpopo province.
8.23 Effective change detection delay in Gauteng province.
8.24 Change detection algorithms tested at regional scale.
8.25 Change detection algorithm comparison.
8.26 Classification of the entire Limpopo province.
8.27 Classification of the entire Gauteng province.
8.28 Computational time of feature extraction methods.
LIST OF FIGURES
1.1 Flow diagram for proposed solution.
2.1 The Limpopo province.
2.2 The Gauteng province.
2.3 The electromagnetic spectrum.
2.4 Atmospheric absorption.
2.5 Global MODIS image.
2.6 Example of passive satellite.
2.7 Sinusoidal projection of the Earth.
2.8 Global NDVI index.
2.9 Seasonal variations versus land cover conversion.
3.1 Aerial photograph in Limpopo province.
3.2 Aerial photograph in Limpopo province (new segments).
3.3 Flow diagram of processing steps.
3.4 Aerial photograph in Limpopo province (alternative segments).
3.5 Aerial photograph in Limpopo province (histogram representation).
3.6 MLP topology.
3.7 Training of the SOM.
4.1 Aerial photograph in Limpopo province.
4.2 Two dimensional illustration of feature vectors.
4.3 Aerial photograph in Limpopo province (alternative segments).
4.4 Illustration of hierarchical clustering operating in agglomerative mode.
4.5 A silhouette plot of 3 clusters formed.
5.1 Multiple aerial photos used to create a time series.
5.2 Time series created of multiple aerial photos.
5.3 EKF fits the process function to a time series.
5.4 EKF estimates the state-space vector $\vec{W}_i$.
5.5 Least squares fitting model to annual time series.
5.6 Least squares applied to time series using sliding window.
5.7 Least squares fits the model to a time series.
5.8 Least squares estimates the parameter vector $\vec{W}_i$.
5.9 M-estimator fits the model to a time series.
5.10 M-estimator estimates the parameter vector $\vec{W}_i$.
5.11 FFT models a time series using harmonics.
5.12 FFT estimates the parameter vector $\vec{W}_i$.
6.1 Illustration of sliding window operating on a time series.
6.2 Two sliding window extracted separated at two π/2 time increments.
6.3 Two sliding window extracted separated at two 2π time increments.
6.4 Example of Seasonal Fourier features extracted with sliding windows.
6.5 Multi-spectral temporal sliding window used to extract subsequences.
6.6 Change detection example operating on the first two spectral bands.
7.1 FFT of the MODIS spectral band 1’s time series.
7.2 Tracking of the first two spectral bands using EKF.
8.1 Example of land cover change in Midstream estates.
8.2 Example of land cover change in Limpopo province.
8.3 Land cover change identified in the Sekuruwe area.
8.4 Flow diagram of complete system outline.
8.5 Illustration of the effective change detection delay ∆τ.
8.6 Illustration of simulated land cover change using different blending periods.
8.7 Classification accuracies of least squares using different lengths of sliding window.
8.8 Parameter comparison for least squares using different lengths of sliding window.
8.9 Standard deviation of mean parameter reported by BVS.
8.10 Standard deviation of amplitude parameter reported by BVS.
8.11 Expected residuals reported by BVS.
8.12 Computing the average silhouette value $S_{ave}$ for different number of classes.
8.13 Change detection map of the entire Limpopo province.
8.14 Change detection map of the entire Gauteng province.
8.15 Examples of natural vegetation and settlements in different provinces.