Learning Bayesian Models with R

Table of Contents
Learning Bayesian Models with R
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introducing the Probability Theory
Probability distributions
Conditional probability
Bayesian theorem
Marginal distribution
Expectations and covariance
Binomial distribution
Beta distribution
Gamma distribution
Dirichlet distribution
Wishart distribution
Exercises
References
Summary
2. The R Environment
Setting up the R environment and packages
Installing R and RStudio
Your first R program
Managing data in R
Data Types in R
Data structures in R
Importing data into R
Slicing and dicing datasets
Vectorized operations
Writing R programs
Control structures
Functions
Scoping rules
Loop functions
lapply
sapply
mapply
apply
tapply
Data visualization
High-level plotting functions
Low-level plotting commands
Interactive graphics functions
Sampling
Random uniform sampling from an interval
Sampling from normal distribution
Exercises
References
Summary
3. Introducing Bayesian Inference
Bayesian view of uncertainty
Choosing the right prior distribution
Non-informative priors
Subjective priors
Conjugate priors
Hierarchical priors
Estimation of posterior distribution
Maximum a posteriori estimation
Laplace approximation
Monte Carlo simulations
The Metropolis-Hasting algorithm
R packages for the Metropolis-Hasting algorithm
Gibbs sampling
R packages for Gibbs sampling
Variational approximation
Prediction of future observations
Exercises
References
Summary
4. Machine Learning Using Bayesian Inference
Why Bayesian inference for machine learning?
Model overfitting and bias-variance tradeoff
Selecting models of optimum complexity
Subset selection
Model regularization
Bayesian averaging
An overview of common machine learning tasks
References
Summary
5. Bayesian Regression Models
Generalized linear regression
The arm package
The Energy efficiency dataset
Regression of energy efficiency with building parameters
Ordinary regression
Bayesian regression
Simulation of the posterior distribution
Exercises
References
Summary
6. Bayesian Classification Models
Performance metrics for classification
The Naïve Bayes classifier
Text processing using the tm package
Model training and prediction
The Bayesian logistic regression model
The BayesLogit R package
The dataset
Preparation of the training and testing datasets
Using the Bayesian logistic model
Exercises
References
Summary
7. Bayesian Models for Unsupervised Learning
Bayesian mixture models
The bgmm package for Bayesian mixture models
Topic modeling using Bayesian inference
Latent Dirichlet allocation
R packages for LDA
The topicmodels package
The lda package
Exercises
References
Summary
8. Bayesian Neural Networks
Two-layer neural networks
Bayesian treatment of neural networks
The brnn R package
Deep belief networks and deep learning
Restricted Boltzmann machines
Deep belief networks
The darch R package
Other deep learning packages in R
Exercises
References
Summary
9. Bayesian Modeling at Big Data Scale
Distributed computing using Hadoop
RHadoop for using Hadoop from R
Spark – in-memory distributed computing
SparkR
Linear regression using SparkR
Computing clusters on the cloud
Amazon Web Services
Creating and running computing instances on AWS
Installing R and RStudio
Running Spark on EC2
Microsoft Azure
IBM Bluemix
Other R packages for large scale machine learning
The parallel R package
The foreach R package
Exercises
References
Summary
Index
Learning Bayesian Models with R
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written permission
of the publisher, except in the case of brief quotations embedded in critical articles or
reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be
caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2015
Production reference: 1231015
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-760-3
www.packtpub.com
Credits
Author
Dr. Hari M. Koduvely
Reviewers
Philip B. Graff
Nishanth Upadhyaya
Commissioning Editor
Kartikey Pandey
Acquisition Editor
Nikhil Karkal
Content Development Editor
Athira Laji
Technical Editor
Taabish Khan
Copy Editor
Trishya Hajare
Project Coordinator
Bijal Patel
Proofreader
Safis Editing
Indexer
Hemangini Bari
Graphics
Abhinash Sahu
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Author
Dr. Hari M. Koduvely is an experienced data scientist working at the Samsung R&D
Institute in Bangalore, India. He has a PhD in statistical physics from the Tata Institute
of Fundamental Research, Mumbai, India, and post-doctoral experience from the
Weizmann Institute, Israel, and Georgia Tech, USA. Prior to joining Samsung, he
worked for Amazon and Infosys Technologies, developing machine learning-based
applications for their products and platforms. He also has several publications on
Bayesian inference and its applications in areas such as recommendation systems and
predictive health monitoring. His current interest is in developing large-scale machine
learning methods, particularly for natural language understanding.
I would like to express my gratitude to all those who have helped me throughout my
career, without whom this book would not have been possible. This includes my
teachers, mentors, friends, colleagues, and all the institutions in which I worked,
especially my current employer, Samsung R&D Institute, Bangalore. A special mention
to my spouse, Prathyusha, and son, Pranav, for their immense moral support during the
writing of the book.
About the Reviewers
Philip B. Graff is a data scientist with the Johns Hopkins University Applied Physics
Laboratory. He works on graph analytics for large-scale automated pattern
discovery.
Philip obtained his PhD in physics from the University of Cambridge on a Gates
Cambridge Scholarship, and a BS in physics and mathematics from the University of
Maryland, Baltimore County. His PhD thesis implemented Bayesian methods for
gravitational wave detection and the training of neural networks for machine learning.
Philip's post-doctoral research at NASA Goddard Space Flight Center and the
University of Maryland, College Park, applied Bayesian inference to the detection and
measurement of gravitational waves by ground and space-based detectors, LIGO and
LISA, respectively. He also implemented machine learning methods for improved
gamma-ray burst data analysis. He has published books in the fields of astrophysical
data analysis and machine learning.
I would like to thank Ala for her support while I reviewed this book.
Nishanth Upadhyaya has close to 10 years of experience in the area of analytics,
Monte Carlo methods, signal processing, machine learning, and building end-to-end
data products. He is active on StackOverflow and GitHub. He has a couple of patents in
the area of item response theory and stochastic optimization. He has also won third
place in the first ever Aadhaar hackathon organized by Khosla labs.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and
as a print book customer, you are entitled to a discount on the eBook copy. Get in touch
with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.
Preface
Bayesian inference provides a unified framework to deal with all sorts of uncertainties
when learning patterns from data using machine learning models and then using them
to predict future observations. However, learning and implementing Bayesian models is
not easy for data science practitioners due to the level of mathematical treatment
involved. Also, applying Bayesian methods to real-world problems requires high
computational resources. With the recent advancements in cloud and high-performance
computing and easy access to computational resources, Bayesian modeling has become
more feasible to use for practical applications today. Therefore, it would be
advantageous for all data scientists and data engineers to understand Bayesian methods
and apply them in their projects to achieve better results.
What this book covers
This book gives comprehensive coverage of the Bayesian machine learning models and
the R packages that implement them. It begins with an introduction to the fundamentals
of probability theory and R programming for those who are new to the subject. Then, the
book covers some of the most important machine learning methods, both supervised
learning and unsupervised learning, implemented using Bayesian inference and R. Every
chapter begins with a theoretical description of the method, explained in a very simple
manner. Then, relevant R packages are discussed and some illustrations using datasets
from the UCI machine learning repository are given. Each chapter ends with some
simple exercises for you to get hands-on experience of the concepts and R packages
discussed in the chapter. The state-of-the-art topics covered in the chapters are
Bayesian regression using linear and generalized linear models, Bayesian classification
using logistic regression, classification of text data using Naïve Bayes models, and
Bayesian mixture models and topic modeling using Latent Dirichlet allocation.
The last two chapters are devoted to the latest developments in the field. One chapter
discusses deep learning, which uses a class of neural network models that are currently
at the frontier of artificial intelligence. The book concludes with the application of
Bayesian methods on Big Data using frameworks such as Hadoop and Spark.
Chapter 1, Introducing the Probability Theory, covers the foundational concepts of
probability theory, particularly those aspects required for learning Bayesian inference,
which are presented to you in a simple and coherent manner.
Chapter 2, The R Environment, introduces you to the R environment. After reading
through this chapter, you will learn how to import data into R, make a selection of
subsets of data for its analysis, and write simple R programs using functions and control
structures. Also, you will get familiar with the graphical capabilities of R and some
advanced capabilities such as loop functions.
Chapter 3, Introducing Bayesian Inference, introduces you to the Bayesian statistic
framework. This chapter includes a description of the Bayesian theorem, concepts such
as prior and posterior probabilities, and different methods to estimate posterior
distribution such as MAP estimates, Monte Carlo simulations, and variational estimates.
Chapter 4, Machine Learning Using Bayesian Inference, gives an overview of what
machine learning is and what some of its high-level tasks are. This chapter also
discusses the importance of Bayesian inference in machine learning, particularly in the
context of how it can help to avoid important issues such as model overfit and how to
select optimum models.
Chapter 5, Bayesian Regression Models, presents one of the most common supervised
machine learning tasks, namely, regression modeling, in the Bayesian framework. It
shows by using an example how you can get tighter confidence intervals of prediction
using Bayesian regression models.
Chapter 6, Bayesian Classification Models, presents how to use the Bayesian
framework for another common machine learning task, classification. The two Bayesian
models of classification, Naïve Bayes and Bayesian logistic regression, are discussed
along with some important metrics for evaluating the performance of classifiers.
Chapter 7, Bayesian Models for Unsupervised Learning, introduces you to the
concepts behind unsupervised and semi-supervised machine learning and their Bayesian
treatment. The two most important Bayesian unsupervised models, the Bayesian mixture
model and LDA, are discussed.
Chapter 8, Bayesian Neural Networks, presents an important class of machine learning
model, namely neural networks, and their Bayesian implementation. Neural network
models are inspired by the architecture of the human brain and they continue to be an
area of active research and development. The chapter also discusses deep learning, one
of the latest advances in neural networks, which is used to solve many problems in
computer vision and natural language processing with remarkable accuracy.
Chapter 9, Bayesian Modeling at Big Data Scale, covers various frameworks for
performing large-scale Bayesian machine learning such as Hadoop, Spark, and
parallelization frameworks that are native to R. The chapter also discusses how to set
up instances on cloud services, such as Amazon Web Services and Microsoft Azure,
and run R programs on them.
What you need for this book
To learn the examples and try the exercises presented in this book, you need to install
the latest version of the R programming environment and the RStudio IDE. Apart from
this, you need to install the specific R packages that are mentioned in each chapter of
this book separately.
Who this book is for
This book is intended for data scientists who analyze large datasets to generate insights
and for data engineers who develop platforms, solutions, or applications based on
machine learning. Although many data science practitioners are quite familiar with
machine learning techniques and R, they may not know about Bayesian inference and its
merits. This book, therefore, would be helpful to even experienced data scientists and
data engineers to learn about Bayesian methods and incorporate them into their projects
to get better results. No prior experience is required in R or probability theory to use
this book.
Conventions
In this book, you will find a number of text styles that distinguish between different
kinds of information. Here are some examples of these styles and an explanation of their
meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The
first function is gibbs_met."
A block of code is set as follows:
myMean <- function(x){
  s <- sum(x)
  l <- length(x)
  mean <- s/l
  mean
}
> x <- c(10, 20, 30, 40, 50)
> myMean(x)
[1] 30
Any command-line input or output is written as follows:
setwd("directory path")
New terms and important words are shown in bold. Words that you see on the screen,
for example, in menus or dialog boxes, appear in the text like this: "You can also set this
from the menu bar of RStudio by clicking on Session | Set Working Directory."
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or disliked. Reader feedback is important for us as it helps us
develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention
the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help
you to get the most from your purchase.
Downloading the example code
You can download the example code files from your account at
http://www.packtpub.com for all the Packt Publishing books you have purchased. If you
purchased this book elsewhere, you can visit http://www.packtpub.com/support and
register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you could report this to us. By doing so, you can save
other readers from frustration and help us improve subsequent versions of this book. If
you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering
the details of your errata. Once your errata are verified, your submission will be
accepted and the errata will be uploaded to our website or added to any list of existing
errata under the Errata section of that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you
come across any illegal copies of our works in any form on the Internet, please provide
us with the location address or website name immediately so that we can pursue a
remedy.
Please contact us at <[email protected]> with a link to the suspected pirated
material.
We appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us at
<[email protected]>, and we will do our best to address the problem.
Chapter 1. Introducing the Probability Theory
Bayesian inference is a method of learning about the relationship between variables
from data, in the presence of uncertainty, in real-world problems. It is one of the
frameworks of probability theory. Any reader interested in Bayesian inference should
have a good knowledge of probability theory to understand and use Bayesian inference.
This chapter covers an overview of probability theory, which will be sufficient to
understand the rest of the chapters in this book.
It was Pierre-Simon Laplace who first proposed a formal definition of probability with
mathematical rigor. This definition is called the Classical Definition and it states the
following:
The theory of chance consists in reducing all the events of the same kind to a certain number of cases
equally possible, that is to say, to such as we may be equally undecided about in regard to their existence,
and in determining the number of cases favorable to the event whose probability is sought. The ratio of this
number to that of all the cases possible is the measure of this probability, which is thus simply a fraction
whose numerator is the number of favorable cases and whose denominator is the number of all the cases
possible.
--Pierre-Simon Laplace, A Philosophical Essay on Probabilities
What this definition means is that, if a random experiment can result in $n$ mutually
exclusive and equally likely outcomes, the probability of an event $A$ is given by:

$$P(A) = \frac{n_A}{n}$$

Here, $n_A$ is the number of occurrences of the event $A$.
To illustrate this concept, let us take a simple example of rolling a dice. If the dice is
fair, then all the faces have an equal chance of showing up when the dice is rolled, and
the probability of each face showing up is 1/6. However, when one rolls the dice 100
times, the faces will not come up in exactly equal proportions of 1/6 due to random
fluctuations. The estimate of the probability of each face is the number of times the
face shows up divided by the number of rolls. As the number of rolls becomes very
large, this ratio gets close to 1/6.
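This frequentist estimate can be checked with a short simulation in base R; the number of rolls (10,000) and the random seed are arbitrary choices for illustration:

```r
# Simulate rolling a fair dice many times and estimate the
# probability of each face as its relative frequency
set.seed(42)            # for reproducibility
n_rolls <- 10000
rolls <- sample(1:6, n_rolls, replace = TRUE)

# Relative frequency of each face; close to 1/6 for large n_rolls
prob_est <- table(rolls) / n_rolls
print(round(prob_est, 3))
```

Increasing `n_rolls` makes the estimates converge toward 1/6, which is exactly the long-run frequency interpretation described above.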
In the long run, this classical definition treats the probability of an uncertain event as the
relative frequency of its occurrence. This is also called a frequentist approach to
probability. Although this approach is suitable for a large class of problems, there are
cases where this type of approach cannot be used. As an example, consider the
following question: Is Pataliputra the name of an ancient city or a king? In such
cases, we have a degree of belief in various plausible answers, but it is not based on
counts in the outcome of an experiment (in the Sanskrit language, Putra means son;
therefore, some people may believe that Pataliputra is the name of an ancient king in
India, but it is in fact a city).
Another example is, What is the chance of the Democratic Party winning the election
in 2016 in America? Some people may believe it is 1/2 and some people may believe it
is 2/3. In this case, probability is defined as the degree of belief of a person in the
outcome of an uncertain event. This is called the subjective definition of probability.
One of the limitations of the classical or frequentist definition of probability is that it
cannot address subjective probabilities. As we will see later in this book, Bayesian
inference is a natural framework for treating both frequentist and subjective
interpretations of probability.
Probability distributions
In both classical and Bayesian approaches, a probability distribution function is the
central quantity, which captures all of the information about the relationship between
variables in the presence of uncertainty. A probability distribution assigns a probability
value to each measurable subset of outcomes of a random experiment. The variable
involved could be discrete or continuous, and univariate or multivariate. Although
people use slightly different terminologies, the commonly used probability distributions
for the different types of random variables are as follows:
Probability mass function (pmf) for discrete numerical random variables
Categorical distribution for categorical random variables
Probability density function (pdf) for continuous random variables
One of the well-known distribution functions is the normal or Gaussian distribution,
which is named after Carl Friedrich Gauss, a famous German mathematician and
physicist. It is also known by the name bell curve because of its shape. The
mathematical form of this distribution is given by:
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Here, $\mu$ is the mean or location parameter and $\sigma$ is the standard deviation or
scale parameter ($\sigma^2$ is called the variance). The following graphs show what the
distribution looks like for different values of the location and scale parameters:
One can see that as the mean changes, the location of the peak of the distribution
changes. Similarly, when the standard deviation changes, the width of the distribution
also changes.
Many natural datasets follow the normal distribution because, according to the central
limit theorem, the mean of a large number of independent random variables will be
approximately normally distributed, irrespective of the form of the original distribution,
as long as the variables are drawn from the same distribution with finite mean and
variance. The normal distribution is also very popular among data scientists because, in
many statistical inference problems, theoretical results can be derived if the underlying
distribution is normal.
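The effect of the location and scale parameters can be explored with the built-in dnorm function; the particular means and standard deviations below are arbitrary illustrative choices:

```r
x <- seq(-6, 6, length.out = 200)

# Density curves for different means (location) and sds (scale)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l",
     ylab = "density", main = "Normal distributions")
lines(x, dnorm(x, mean = 2, sd = 1), lty = 2)   # shifted peak
lines(x, dnorm(x, mean = 0, sd = 2), lty = 3)   # wider curve
```

Changing `mean` moves the peak along the x axis, while changing `sd` widens or narrows the curve, as described above.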
Now, let us look at the multidimensional version of the normal distribution. If the random
variable is an N-dimensional vector denoted by $\mathbf{x} = (x_1, x_2, \ldots, x_N)$,
then the corresponding normal distribution is given by:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

Here, $\boldsymbol{\mu}$ corresponds to the mean (also called location) and $\Sigma$ is an
N x N covariance matrix (also called scale).
To get a better understanding of the multidimensional normal distribution, let us take the
case of two dimensions. In this case, $\mathbf{x} = (x_1, x_2)$ and the covariance matrix
is given by:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

Here, $\sigma_1^2$ and $\sigma_2^2$ are the variances along the $x_1$ and $x_2$
directions, and $\rho$ is the correlation between $x_1$ and $x_2$. A plot of the
two-dimensional normal distribution for particular values of the means, variances, and
correlation is shown in the following image:

If $\rho = 0$, then the two-dimensional normal distribution reduces to the product of
two one-dimensional normal distributions, since $\Sigma$ becomes diagonal in this case.
The following 2D projections of the normal distribution for the same values of the means
and standard deviations, but with a high value of $\rho$ in the first case and $\rho = 0$
in the second, illustrate this:
The high correlation between x and y in the first case forces most of the data points
along the 45 degree line and makes the distribution more anisotropic; whereas, in the
second case, when the correlation is zero, the distribution is more isotropic.
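As a sketch, the two-dimensional density can be coded directly in base R from its formula; the function name dbvnorm and the default parameter values here are our own choices for illustration:

```r
# Density of a bivariate normal, written out from the formula
dbvnorm <- function(x1, x2, mu1 = 0, mu2 = 0,
                    s1 = 1, s2 = 1, rho = 0.9) {
  z <- (x1 - mu1)^2 / s1^2 -
       2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2) +
       (x2 - mu2)^2 / s2^2
  exp(-z / (2 * (1 - rho^2))) /
    (2 * pi * s1 * s2 * sqrt(1 - rho^2))
}

# With rho = 0 the joint density factorizes into two 1D normals
dbvnorm(0.5, -0.3, rho = 0)   # equals dnorm(0.5) * dnorm(-0.3)
```

For production work, packages such as mvtnorm provide vectorized multivariate normal densities, but the direct formula above makes the role of $\rho$ explicit.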
We will briefly review some of the other well-known distributions used in Bayesian
inference here.
Conditional probability
Often, one would be interested in finding the probability of the occurrence of a set of
random variables when other random variables in the problem are held fixed. As an
example from a population health study, one might be interested in finding the
probability of a person in the age range 40-50 developing heart disease, given that they
have high blood pressure and diabetes. Questions such as these can be modeled using
conditional probability, which is defined as the probability of an event, given that
another event has happened. More formally, if we take the variables A and B, this
definition can be rewritten as follows:

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

Similarly:

$$P(B \mid A) = \frac{P(A, B)}{P(A)}$$
The following Venn diagram explains the concept more clearly:
In Bayesian inference, we are interested in conditional probabilities corresponding to
multivariate distributions. If $\{x_1, x_2, \ldots, x_N\}$ denotes the entire random
variable set, then the conditional probability of $\{x_1, \ldots, x_k\}$, given that
$\{x_{k+1}, \ldots, x_N\}$ is fixed at some value, is given by the ratio of the joint
probability of $\{x_1, \ldots, x_N\}$ and the joint probability of
$\{x_{k+1}, \ldots, x_N\}$:

$$P(x_1, \ldots, x_k \mid x_{k+1}, \ldots, x_N) = \frac{P(x_1, x_2, \ldots, x_N)}{P(x_{k+1}, \ldots, x_N)}$$
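For discrete variables, this definition can be verified directly in R on a small, made-up joint probability table (the numbers below are illustrative only):

```r
# Joint probabilities P(A, B) for two binary variables
joint <- matrix(c(0.30, 0.10,    # A = 0: (B = 0, B = 1)
                  0.20, 0.40),   # A = 1: (B = 0, B = 1)
                nrow = 2, byrow = TRUE,
                dimnames = list(A = 0:1, B = 0:1))

# P(B) by summing over A, then P(A | B) = P(A, B) / P(B)
p_b <- colSums(joint)
cond_a_given_b <- sweep(joint, 2, p_b, "/")
print(cond_a_given_b)   # each column sums to 1
</imports>
```

Each column of the result is a valid probability distribution over A, conditioned on the corresponding value of B.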
In the case of the two-dimensional normal distribution, the conditional probability of
interest is as follows:

$$P(x_1 \mid x_2) = \frac{N(x_1, x_2 \mid \boldsymbol{\mu}, \Sigma)}{N(x_2 \mid \mu_2, \sigma_2^2)}$$

It can be shown (exercise 2 in the Exercises section of this chapter) that the RHS can be
simplified, resulting in an expression for $P(x_1 \mid x_2)$ that is again in the form of
a normal distribution, with mean $\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2)$ and
variance $\sigma_1^2(1 - \rho^2)$.
Bayesian theorem
From the definition of the conditional probabilities $P(A \mid B)$ and $P(B \mid A)$, it
is easy to show the following:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Rev. Thomas Bayes (1701-1761) used this rule and formulated his famous Bayes theorem,
which can be interpreted as follows: if $P(A)$ represents the initial degree of belief
(or prior probability) in the value of a random variable A before observing B, then its
posterior probability or degree of belief after accounting for B will get updated
according to the preceding equation. So, Bayesian inference essentially corresponds to
updating beliefs about an uncertain system after having made some observations about it.
In this sense, this is also how we human beings learn about the world. For example,
before we visit a new city, we will have certain prior knowledge about the place after
reading about it in books or on the Web.
However, soon after we reach the place, this belief will get updated based on our initial
experience of the place. We continuously update the belief as we explore the new city
more and more. We will describe Bayesian inference more in detail in Chapter 3,
Introducing Bayesian Inference.
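As a minimal numerical sketch of this updating rule, consider a diagnostic-test example in R; the prevalence and test accuracies below are made-up numbers for illustration:

```r
# P(disease), P(positive | disease), P(positive | no disease)
prior       <- 0.01
sensitivity <- 0.95
false_pos   <- 0.05

# P(positive) by averaging over both disease states
evidence <- sensitivity * prior + false_pos * (1 - prior)

# Posterior P(disease | positive) from Bayes' rule
posterior <- sensitivity * prior / evidence
posterior
```

Even with an accurate test, the posterior stays modest here because the prior probability of disease is small; the prior belief is updated, not replaced, by the observation.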
Marginal distribution
In many situations, we are interested only in the probability distribution of a subset of
random variables. For example, in the heart disease problem mentioned in the previous
section, if we want to infer the probability of people in a population having heart
disease as a function of their age only, we need to integrate out the effect of the other
random variables, such as blood pressure and diabetes. This is called marginalization:

$$P(x_1) = \int P(x_1, x_2)\, dx_2$$

Or, for discrete variables:

$$P(x_1) = \sum_{x_2} P(x_1, x_2)$$
Note that marginal distribution is very different from conditional distribution. In
conditional probability, we are finding the probability of a subset of random variables
with values of other random variables fixed (conditioned) at a given value. In the case
of marginal distribution, we are eliminating the effect of a subset of random variables
by integrating them out (in the sense of averaging over their effect) from the joint distribution.
For example, in the case of the two-dimensional normal distribution, marginalization with
respect to one variable results in a one-dimensional normal distribution of the other
variable, as follows:

$$N(x_1 \mid \mu_1, \sigma_1^2) = \int N(x_1, x_2 \mid \boldsymbol{\mu}, \Sigma)\, dx_2$$
The details of this integration are given as an exercise (exercise 3 in the Exercises
section of this chapter).
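This marginalization can also be verified numerically in R with the integrate function; the correlation value 0.5 and the evaluation point x1 = 0.7 are arbitrary:

```r
# Joint density of a standard bivariate normal with correlation rho
rho <- 0.5
joint <- function(x1, x2) {
  exp(-(x1^2 - 2 * rho * x1 * x2 + x2^2) /
        (2 * (1 - rho^2))) / (2 * pi * sqrt(1 - rho^2))
}

# Integrate out x2; the result matches the 1D standard normal density
x1 <- 0.7
marginal <- integrate(function(x2) joint(x1, x2), -Inf, Inf)$value
c(marginal, dnorm(x1))   # the two values agree
```

Repeating this for a grid of `x1` values traces out the full one-dimensional marginal density.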
Expectations and covariance
Having known the distribution of a set of random variables $\{x_1, x_2, \ldots, x_N\}$,
what one would typically be interested in for real-life applications is to be able to
estimate the average values of these random variables and the correlations between them.
These are computed formally using the following expressions:

$$E[x_i] = \int x_i\, P(x_1, \ldots, x_N)\, dx_1 \cdots dx_N$$

$$\operatorname{Cov}(x_i, x_j) = E\big[(x_i - E[x_i])(x_j - E[x_j])\big]$$

For example, in the case of the two-dimensional normal distribution, if we are interested
in finding the correlation between the variables $x_1$ and $x_2$, it can be formally
computed from the joint distribution using the following formula:

$$\rho = \frac{\operatorname{Cov}(x_1, x_2)}{\sigma_1\sigma_2} = \frac{1}{\sigma_1\sigma_2}\int (x_1 - \mu_1)(x_2 - \mu_2)\, N(x_1, x_2 \mid \boldsymbol{\mu}, \Sigma)\, dx_1\, dx_2$$
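In practice, these expectations are usually estimated from samples; R's mean, cov, and cor functions compute the sample versions directly (the simulated data below is illustrative):

```r
set.seed(1)
x <- rnorm(5000)
y <- 0.8 * x + rnorm(5000, sd = 0.6)  # y constructed to correlate with x

mean(x); mean(y)   # sample estimates of E[x], E[y]
cov(x, y)          # sample covariance
cor(x, y)          # sample correlation, close to the true value 0.8
```

The true correlation here is 0.8 by construction, and the sample estimate converges to it as the sample size grows.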
Binomial distribution
A binomial distribution is a discrete distribution that gives the probability of the
number of heads in n independent trials, where each trial has one of two possible
outcomes, heads or tails, with the probability of heads being p. Each of the trials is
called a Bernoulli trial. The functional form of the binomial distribution is given by:

$$P(k \mid n, p) = \binom{n}{k} p^{k} (1-p)^{n-k}$$

Here, $P(k \mid n, p)$ denotes the probability of having k heads in n trials. The mean of
the binomial distribution is given by $np$ and the variance is given by $np(1-p)$. Have a
look at the following graphs:
The preceding graphs show the binomial distribution for two values of n, 100 and 1000,
with p = 0.7. As you can see, when n becomes large, the binomial distribution becomes
sharply peaked. It can be shown that, in the large n limit, a binomial distribution can
be approximated by a normal distribution with mean np and variance np(1-p). This is a
characteristic shared by many discrete distributions: in the large n limit, they can be
approximated by continuous distributions.
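The exact binomial probabilities and their normal approximation can be compared using dbinom and dnorm, here with n = 100 and p = 0.7 as in the first graph:

```r
n <- 100; p <- 0.7
k <- 0:n

# Exact binomial probabilities and the normal approximation
exact  <- dbinom(k, size = n, prob = p)
approx <- dnorm(k, mean = n * p, sd = sqrt(n * p * (1 - p)))

# Largest discrepancy is already small for n = 100
max(abs(exact - approx))
```

Rerunning with n = 1000 shrinks the maximum discrepancy further, illustrating the large-n limit described above.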
Beta distribution
The Beta distribution, denoted by $\text{Beta}(\alpha, \beta)$, is a function of powers
of $x$ and its reflection $(1-x)$, and is given by:

$$\text{Beta}(x \mid \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$

Here, $\alpha$ and $\beta$ are parameters that determine the shape of the distribution
function, and $B(\alpha, \beta)$ is the Beta function, given by the ratio of Gamma
functions: $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.
The Beta distribution is a very important distribution in Bayesian inference. It is the
conjugate prior probability distribution (which will be defined more precisely in the
next chapter) for binomial, Bernoulli, negative binomial, and geometric distributions. It
is used for modeling the random behavior of percentages and proportions. For example,
the Beta distribution has been used for modeling allele frequencies in population
genetics, time allocation in project management, the proportion of minerals in rocks, and
heterogeneity in the probability of HIV transmission.
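R provides the Beta density as dbeta; the shape parameters below are arbitrary choices for illustration:

```r
x <- seq(0, 1, length.out = 101)

# Beta densities for a few shape parameter choices
plot(x, dbeta(x, 2, 5), type = "l", ylab = "density")
lines(x, dbeta(x, 5, 2), lty = 2)  # the reflected shape
lines(x, dbeta(x, 2, 2), lty = 3)  # symmetric case

# dbeta agrees with the formula x^(a-1) * (1-x)^(b-1) / B(a, b)
dbeta(0.3, 2, 5) - 0.3^(2 - 1) * (1 - 0.3)^(5 - 1) / beta(2, 5)
```

Swapping the two shape parameters mirrors the density about x = 0.5, which is the "reflection" property mentioned above.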
Gamma distribution
The Gamma distribution, denoted by $\text{Gamma}(\alpha, \beta)$, is another common
distribution used in Bayesian inference. It is used for modeling waiting times, such as
survival rates. Special cases of the Gamma distribution are the well-known Exponential
and Chi-Square distributions.
In Bayesian inference, the Gamma distribution is used as a conjugate prior for the
inverse of variance of a one-dimensional normal distribution or parameters such as the
rate ( ) of an exponential or Poisson distribution.
The mathematical form of a Gamma distribution is given by:
Here,
and
are the shape and rate parameters, respectively (both take values
greater than zero). There is also a form in terms of the scale parameter
,
which is common in econometrics. Another related distribution is the Inverse-Gamma
distribution that is the distribution of the reciprocal of a variable that is distributed
according to the Gamma distribution. It's mainly used in Bayesian inference as the
conjugate prior distribution for the variance of a one-dimensional normal distribution.
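Base R provides dgamma() and rgamma() with shape and rate arguments. The following sketch (variable names are ours) verifies the special case mentioned above: a Gamma distribution with shape 1 reduces to an Exponential distribution:

```r
# Gamma(shape = 1, rate = b) coincides with Exponential(rate = b)
b <- 2
x <- seq(0, 5, by = 0.1)

gamma_density <- dgamma(x, shape = 1, rate = b)
exp_density   <- dexp(x, rate = b)

# The two densities agree at every evaluation point
print(all.equal(gamma_density, exp_density))
```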
Dirichlet distribution
The Dirichlet distribution is a multivariate analogue of the Beta distribution. It is
commonly used in Bayesian inference as the conjugate prior distribution for multinomial
distribution and categorical distribution. The main reason for this is that it is easy to
implement inference techniques, such as Gibbs sampling, on the Dirichlet-multinomial
distribution.
The Dirichlet distribution of order $K \geq 2$ is defined over an open $(K - 1)$-dimensional
simplex as follows:

$$\mathrm{Dir}(\mu_1, \ldots, \mu_K \mid \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

Here, $\mu_k > 0$, $\sum_{k=1}^{K} \mu_k = 1$, and
$B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$.
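Base R has no built-in Dirichlet sampler (contributed packages such as gtools provide rdirichlet()), but one can be sketched from the standard construction: draw independent Gamma variates with shapes $\alpha_k$ and normalize them. The function name below is ours:

```r
# Draw one sample from Dir(alpha) by normalizing Gamma draws:
# if g_k ~ Gamma(alpha_k, 1), then g / sum(g) ~ Dir(alpha)
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(7)
mu <- rdirichlet_one(c(2, 3, 5))
print(mu)        # three positive proportions
print(sum(mu))   # the components always sum to 1
```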
Wishart distribution
The Wishart distribution is a multivariate generalization of the Gamma distribution. It is
defined over symmetric, non-negative-definite matrix-valued random variables. In
Bayesian inference, it is used as the conjugate prior to estimate the distribution of the
inverse of the covariance matrix $\Sigma$ (or the precision matrix) of the normal distribution. When we
discussed the Gamma distribution, we said it is used as a conjugate prior for the
inverse of the variance of the one-dimensional normal distribution.
The mathematical definition of the Wishart distribution is as follows:

$$\mathcal{W}(\mathbf{X} \mid \mathbf{V}, n) = \frac{1}{2^{np/2}\, |\mathbf{V}|^{n/2}\, \Gamma_p\!\left(\frac{n}{2}\right)} |\mathbf{X}|^{(n - p - 1)/2}\, e^{-\frac{1}{2}\operatorname{tr}(\mathbf{V}^{-1}\mathbf{X})}$$

Here, $|\mathbf{X}|$ denotes the determinant of the matrix $\mathbf{X}$ of dimension $p \times p$, $\Gamma_p$ is the
multivariate Gamma function, and $n$ is the degrees of freedom.
A special case of the Wishart distribution, when $p = 1$ and $\mathbf{V} = 1$, corresponds to the
well-known Chi-Square distribution function with $n$ degrees of freedom.
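Samples from the Wishart distribution can be drawn in base R using rWishart() from the stats package. A minimal sketch (variable names are ours):

```r
# Draw 3 samples from a Wishart distribution with 5 degrees of
# freedom and identity scale matrix V (each sample is 2 x 2)
set.seed(123)
V <- diag(2)
W <- rWishart(n = 3, df = 5, Sigma = V)

print(dim(W))                 # an array of three 2 x 2 matrices
print(W[, , 1])               # the first sampled matrix
print(isSymmetric(W[, , 1]))  # Wishart draws are symmetric
```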
Wikipedia gives a list of more than 100 useful distributions that are commonly used by
statisticians (reference 1 in the Reference section of this chapter). Interested readers
should refer to this article.
Exercises
1. By using the definition of conditional probability, show that any multivariate joint
distribution of N random variables $X_1, X_2, \ldots, X_N$ has the following trivial
factorization:

$$P(x_1, x_2, \ldots, x_N) = P(x_1 \mid x_2, \ldots, x_N) P(x_2 \mid x_3, \ldots, x_N) \cdots P(x_{N-1} \mid x_N) P(x_N)$$

2. The bivariate normal distribution is given by:

$$P(x_1, x_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left(-\frac{z}{2(1 - \rho^2)}\right)$$

Here:

$$z = \frac{(x_1 - \mu_1)^2}{\sigma_1^2} - \frac{2\rho (x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2}$$

By using the definition of conditional probability, show that the conditional
distribution $P(x_2 \mid x_1)$ can be written as a normal distribution of the form
$N(\mu_{2|1}, \sigma_{2|1}^2)$, where $\mu_{2|1} = \mu_2 + \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1)$
and $\sigma_{2|1}^2 = (1 - \rho^2)\sigma_2^2$.
3. By using explicit integration of the expression in exercise 2, show that the
marginalization of bivariate normal distribution will result in univariate normal
distribution.
4. In the following table, a dataset containing the measurements of petal and sepal
sizes of 15 different Iris flowers is shown (taken from the Iris dataset, UCI
Machine Learning dataset repository). All units are in cm:
Sepal Length  Sepal Width  Petal Length  Petal Width  Class of Flower
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3            1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
4.6           3.1          1.5           0.2          Iris-setosa
5             3.6          1.4           0.2          Iris-setosa
7             3.2          4.7           1.4          Iris-versicolor
6.4           3.2          4.5           1.5          Iris-versicolor
6.9           3.1          4.9           1.5          Iris-versicolor
5.5           2.3          4             1.3          Iris-versicolor
6.5           2.8          4.6           1.5          Iris-versicolor
6.3           3.3          6             2.5          Iris-virginica
5.8           2.7          5.1           1.9          Iris-virginica
7.1           3            5.9           2.1          Iris-virginica
6.3           2.9          5.6           1.8          Iris-virginica
6.5           3            5.8           2.2          Iris-virginica
Answer the following questions:
1. What is the probability of finding flowers with a sepal length more than 5 cm
and a sepal width less than 3 cm?
2. What is the probability of finding flowers with a petal length less than 1.5 cm,
given that the petal width is equal to 0.2 cm?
3. What is the probability of finding flowers with a sepal length less than 6 cm
and a petal width less than 1.5 cm, given that the class of the flower is Iris-
versicolor?
References
1. http://en.wikipedia.org/wiki/List_of_probability_distributions
2. Feller W. An Introduction to Probability Theory and Its Applications. Vol. 1.
Wiley Series in Probability and Mathematical Statistics. 1968. ISBN-10:
0471257087
3. Jaynes E.T. Probability Theory: The Logic of Science. Cambridge University
Press. 2003. ISBN-10: 0521592712
4. Radziwill N.M. Statistics (The Easier Way) with R: an informal text on applied
statistics. Lapis Lucera. 2015. ISBN-10: 0692339426
Summary
To summarize this chapter, we discussed elements of probability theory; particularly
those aspects required for learning Bayesian inference. Due to lack of space, we have
not covered many elementary aspects of this subject. There are some excellent books on
this subject, for example, books by William Feller (reference 2 in the References
section of this chapter), E. T. Jaynes (reference 3 in the References section of this
chapter), and N. M. Radziwill (reference 4 in the References section of this chapter).
Readers are encouraged to read these to get a more in-depth understanding of
probability theory and how it can be applied in real-life situations.
In the next chapter, we will introduce the R programming language, which is the most
popular open source framework for data analysis, and for Bayesian inference in particular.
Chapter 2. The R Environment
R is currently one of the most popular programming environments for statistical
computing. It evolved as an open source language from the S programming language
developed at Bell Labs. The main creators of R are two academics, Robert
Gentleman and Ross Ihaka, from the University of Auckland in New Zealand.
The main reasons for the popularity of R, apart from being free software under the GNU
General Public License, are the following:
R is very easy to use. It is an interpreted language and at the same time can be used
for procedural programming.
R supports both functional and object-oriented paradigms. It has very strong
graphical and data visualization capabilities.
Through its LaTeX-like documentation support, R can be used for making high-quality documentation.
Being open source software, R has a large number of contributed packages that
make almost all types of statistical modeling possible in this environment.
This chapter is intended to give a basic introduction to R so that any reader who is not
familiar with the language can follow the rest of the book by reading through this
chapter. It is not possible to give a detailed description of the R language in one chapter
and the interested reader should consult books specially written on R programming. I
would recommend The Art of R Programming (reference 1 in the References section of
this chapter) and R Cookbook (reference 2 in the References section of this chapter) for
those users who are mainly interested in using R for analyzing and modeling data. For
those who are interested in learning about the advanced features of R, for example, for
writing complex programs or R packages, Advanced R (reference 3 in the References
section of this chapter) is an excellent book.
Setting up the R environment and packages
R is free software under the GNU open source license. R comes with a base package and
also has a large number of user-contributed packages for advanced analysis and
modeling. It also has a nice graphical user interface-based editor called RStudio. In this
section, we will learn how to download R, set up the R environment on your computer,
and write a simple R program.
Installing R and RStudio
The Comprehensive R Archive Network (CRAN) hosts all releases of R and the
contributed packages. R for Windows can be installed by downloading the binary of the
base package from http://cran.r-project.org; a standard installation should be sufficient.
For Linux and Mac OS X, the webpage gives instructions on how to download and
install the software. At the time of writing this book, the latest release was version
3.1.2. Various packages need to be installed separately from the package page. One can
install any package from the R command prompt using the following command:
install.packages("package name")
After installing the package, one needs to load the package before using it with the
following command:
library("package name")
A very useful integrated development environment (IDE) for R is RStudio. It can be
downloaded freely from http://www.rstudio.com/. RStudio works on Windows, Linux,
and Mac platforms. It has both a desktop version and also a server version that can be
used for writing R programs through a browser interface on a remote server. After
installing R and RStudio, it is useful to set the default working directory to the directory
of your choice. RStudio reads and writes files containing R code in the working
directory. To find out what the current directory is, use the R command getwd(). To
change the working directory to a directory of your preference, use the following
command:
setwd("directory path")
You can also set this from the menu bar of RStudio by clicking on Session | Set
Working Directory.
Your first R program
Let us write a simple program to add two integers x and y resulting in their sum z. On
the command prompt in RStudio, type the following commands and press Enter:
> x <- 2
> y <- 3
> z <- x + y
> print(z)
[1] 5
Now, you can assign different values to x and y and print z to see how z changes.
Instead of print(z), you can also simply enter z to print its values.
Managing data in R
Before we start any serious programming in R, we need to learn how to import data into
an R environment and which data types R supports. Often, for a particular analysis, we
will not use the entire dataset. Therefore, we need to also learn how to select a subset
of the data for any analysis. This section will cover these aspects.
Data Types in R
R has five basic data types as follows:
Integer
Numeric (real)
Complex
Character
Logical (True/False)
The default representation of numbers in R is double precision real number (numeric). If
you want an integer representation explicitly, you need to add the suffix L. For example,
simply entering 1 on the command prompt will store 1 as a numeric object. To store 1
as an integer, you need to enter 1L. The command class(x) will give the class (type) of
the object x. Therefore, entering class(1) on the command prompt will give the answer
numeric, whereas entering class(1L) will give the answer integer.
R also has a special number Inf that represents Infinity. The number NaN (not a
number) is used to represent an undefined value such as 0/0. Missing values are
represented by using the symbol NA.
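These types and special values can be checked directly on the R prompt, as the following short sketch shows:

```r
# Numeric versus integer representation
print(class(1))    # numeric: double precision by default
print(class(1L))   # integer: the L suffix forces an integer

# Special values
print(1 / 0)       # Inf
print(0 / 0)       # NaN (not a number)
print(is.na(NA))   # TRUE: NA marks a missing value
```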
Data structures in R
The data structures in R can be classified as either homogeneous (all elements
containing the same data type) or heterogeneous (elements containing different data
types). Furthermore, each of these has a different structure depending upon the number
of dimensions:
Homogeneous:
Atomic vector: one-dimensional
Matrix: two-dimensional
Array: N-dimensional
Heterogeneous:
List: one-dimensional
Data frame: two-dimensional
The most basic object in R is a vector. To create an empty integer vector of size 10,
enter the following command on the R prompt:
> v <- vector("integer", 10)
> v
 [1] 0 0 0 0 0 0 0 0 0 0
You can assign the value m to the nth component of the vector using the following
command:
> v[5] <- 1
> v
 [1] 0 0 0 0 1 0 0 0 0 0
Readers should note that unlike in many programming languages, the array index in R
starts with 1 and not 0.
Whereas a vector can only contain objects of the same type, a list, although similar to
the vector, can contain objects of different types. The following command will create a
list containing integers, real numbers, and characters:
> l <- list(1L, 2L, 3, 4, "a", "b")
> str(l)
List of 6
 $ : int 1
 $ : int 2
 $ : num 3
 $ : num 4
 $ : chr "a"
 $ : chr "b"
Here, we used the str() function in R that shows the structure of any R object.
R has a special function c() to combine multiple pieces of basic data into a vector or
list. For example, c(1,3,6,2,-1) will produce a vector containing the numbers
1, 3, 6, 2, and -1:
> c(1, 3, 6, 2, -1)
[1]  1  3  6  2 -1
A matrix is the generalization of a vector into two dimensions. Consider the following
command:
> m <- matrix(c(1:9), nrow = 3, ncol = 3)
This command will generate a matrix m of size 3 x 3 containing numbers from 1 to 9.
The most common data structure used for storing data in R is a data frame. A data frame,
like the list, can contain data of different types (numeric, integer, Boolean, or character).
It is essentially a list of vectors of equal length. Therefore, it has the same two-dimensional
structure as a matrix. The length (found using length()) of a data frame
is the length of the underlying list, that is, the number of columns in the data frame. There
are simple commands, nrow() and ncol(), for finding the number of rows and
columns of a data frame. The other two attributes of a data frame are rownames() and
colnames(), which can be used to either set or find the names of rows or columns.
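The following short sketch (variable names are ours) illustrates these data frame functions:

```r
# Build a small data frame and inspect its dimensions and names
df <- data.frame(name = c("a", "b", "c"), value = c(10, 20, 30))

print(length(df))     # number of columns (underlying list length)
print(nrow(df))       # number of rows
print(ncol(df))       # number of columns
print(colnames(df))   # current column names

colnames(df) <- c("id", "score")  # rename the columns
print(colnames(df))
```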
Importing data into R
Data that is in the form of a table can be easily loaded into R using the read.table(…)
function. It has several arguments to make the import very flexible. Some of the useful
arguments are the following:
file: The name of a file or a complete URL
header: A logical value indicating whether the file has a header line containing the
names of the variables
sep: A character indicating the column separator field
row.names: A vector of row names
col.names: A vector of names for variables
skip: The number of lines in the data file to be skipped before reading the data
nrows: The number of rows in the dataset
stringsAsFactors: A logical value indicating whether the character variables should be
coded as factors or not
For small datasets, one can use read.table("filename.txt") without specifying
other arguments; R will figure out the rest itself. Another useful function is read.csv(),
specifically for reading CSV files.
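As a small, self-contained sketch of read.csv() (the file name and contents below are ours; we first write a tiny CSV file to a temporary location so the example does not depend on an external dataset):

```r
# Write a tiny CSV file and read it back with read.csv()
path <- tempfile(fileext = ".csv")
writeLines(c("name,mpg", "car1,21.0", "car2,22.8", "car3,18.7"), path)

df <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
print(df)
print(nrow(df))   # 3 rows of data (the header line is not counted)
```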
In addition to loading data from text files, data can be imported into R by connecting to
external databases through various interfaces. One such popular interface is Open
Database Connectivity (ODBC). The RODBC package in R provides access to
different databases through the ODBC interface. This package contains different
functions for connecting with a database and performing various operations. Some of the
important functions in the RODBC package are as follows:
odbcConnect(dsn, uid="user_name", pwd="password"): Used to open a
connection to an ODBC database having the registered data source name dsn.
sqlFetch(channel, sqtable): Used to read a table from an ODBC database to
a data frame.
sqlQuery(channel, query): Used to submit a query to an ODBC database and
return the results.
sqlSave(channel, mydf, tablename = sqtable, append = FALSE): Used
to write or update (append = TRUE) a data frame to a table in the ODBC database.
close(channel): Used to close the connection. Here, channel is the connection
handle as returned by odbcConnect.
Slicing and dicing datasets
Often, in data analysis, one needs to slice and dice the full data frame to select a few
variables or observations. This is called subsetting. R has some powerful and fast
methods for doing this.
To extract subsets of R objects, one can use the following three operators:
Single bracket [ ]: This returns an object of the same class as the original. The
single bracket operator can be used to select more than one element of an object.
Some examples are as follows:
> x <- c(10, 20, 30, 40, 50)
> x[1:3]
[1] 10 20 30
> x[x > 25]
[1] 30 40 50
> f <- x > 30
> x[f]
[1] 40 50
> m <- matrix(c(1:9), nrow = 3, ncol = 3)
> m[1, ]  # select the entire first row
[1] 1 4 7
> m[, 2]  # select the entire second column
[1] 4 5 6
Double bracket [[ ]]: This is used to extract a single element of a list or data
frame. The returned object need not be the same type as the initial object. Some
examples are as follows:
>y <-list("a", "b", "c", "d", "e")
>y[1]
[[1]]
[1] "a"
>class(y[1])
[1] "list"
>y[[1]]
[1] "a"
>class(y[[1]])
[1] "character"
Dollar sign $: This is used to extract elements of a list or data frame by name.
Some examples are as follows:
>z <-list(John = 12 ,Mary = 18,Alice = 24 ,Bob = 17 ,Tom = 21)
>z$Bob
[1] 17
Use of negative index values: This is used to drop a particular index or column;
one subsets with a negative sign for the corresponding index. For example, to drop
Mary and Bob from the preceding list, use the following code:
> y <- z[c(-2, -4)]
> y
Vectorized operations
In R, many operations, such as arithmetic operations involving vectors and matrices,
can be done very efficiently using vectorized operations. For example, if you add
two vectors x and y, their elements are added element-wise in a single vectorized
operation. This also makes the code more concise and easier to understand. For example,
one does not need a for() loop to add two vectors in the code:
> x <- c(1, 2, 3, 4, 5)
> y <- c(10, 20, 30, 40, 50)
> z <- x + y
> z
[1] 11 22 33 44 55
> w <- x * y
> w
[1]  10  40  90 160 250
Another very useful example of vectorized operations is in the case of matrices. If X and
Y are two matrices, the following operations can be carried out in R in a vectorized
form:
>X*Y ## Element-wise multiplication
>X/Y ## Element-wise division
>X %*% Y ## Standard matrix multiplication
Writing R programs
Although much data analysis in R can be carried out in an interactive manner using the
command prompt, for more complex tasks one often needs to write R scripts. As
mentioned in the introduction, R has features of both functional and object-oriented
programming languages. In this section, some of the standard syntax of
programming in R is described.
Control structures
Control structures are meant for controlling the flow of execution of a program. The
standard control structures are as follows:
if and else: To test a condition
for: To loop over a set of statements for a fixed number of times
while: To loop over a set of statements while a condition is true
repeat: To execute an infinite loop
break: To break the execution of a loop
next: To skip an iteration of a loop
return: To exit a function
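As a short sketch, the following code combines if, for, next, and while:

```r
# Sum the even numbers from 1 to 10, skipping odd values with next
total <- 0
for (i in 1:10) {
  if (i %% 2 != 0) {
    next            # skip odd numbers
  }
  total <- total + i
}
print(total)        # 2 + 4 + 6 + 8 + 10 = 30

# while loop: find the first power of 2 greater than 100
p <- 1
while (p <= 100) {
  p <- p * 2
}
print(p)            # 128
```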
Functions
If one wants to use R for more serious programming, it is essential to know how to
write functions. They make the language more powerful and elegant. R has many built-in
functions, such as mean(), sort(), sin(), plot(), and many more, which are written
using R commands.
A function is defined as follows:
> fname <- function(arg1, arg2, ...) {
    R expressions
}
Here, fname is the name of the function; arg1, arg2, and so on, are arguments passed to
the function. Note that unlike in other languages, functions in R do not end with a return
statement. By default, the last statement executed inside the body of the function is
returned by the function.
Once a function is defined, it is executed simply by entering the function name with the
values for the arguments:
> fname(arg1, arg2, ...)
The important properties of functions in R are as follows:
Functions are first-class citizens
Functions can be passed as arguments to other functions
One can define a function inside another function (nesting)
The arguments of the functions can be matched by position or name
Let's consider a simple example of a function which, given an input vector x, calculates
its mean. To write this function, open a new window in RStudio for R script from the
menu bar through File | New File | R Script. In this R script, enter the following lines of
code:
myMean <- function(x) {
  s <- sum(x)
  l <- length(x)
  mean <- s / l
  mean
}
Select the entire code and use the keys Ctrl + Enter to execute the script. This
completes the definition of the myMean function. To use this function on the command
prompt, enter the following:
> x <- c(10, 20, 30, 40, 50)
> myMean(x)
This will generate the following result:
[1] 30
Scoping rules
In programming languages, it is very important to understand the scopes of all variables
to avoid errors during execution. There are two types of scoping of a variable in a
function: lexical scoping and dynamic scoping. In the case of lexical scoping, the value
of a variable in a function is looked up in the environment in which the function was
defined. Generally, this is the global environment. In the case of dynamic scoping, the
value of a variable is looked up in the environment in which the function was called (the
calling environment).
R uses lexical scoping that makes it possible to write functions inside a function. This is
illustrated with the following example:
> x <- 0.1
> f <- function(y) {
    x * y
}
> g <- function(y) {
    x <- 5
    x - f(y)
}
> g(10)
[1] 4
The answer is 4 because while evaluating function f, the value of x is taken from the
global environment, which is 0.1, whereas while evaluating function g, the value of x is
taken from the local environment of g, which is 5.
Lexical scoping has some disadvantages. Since the value of a variable is looked up
from the environment in which the function is defined, all functions must carry a pointer
to their respective defining environments. Also, all objects must be stored in memory
during the execution of the program.
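Lexical scoping is also what makes function factories (closures) work in R: a function returned by another function carries a pointer to its defining environment. A minimal sketch (function names are ours):

```r
# A function factory: make_power returns a function whose exponent n
# is looked up in the environment where the inner function was defined
make_power <- function(n) {
  function(x) {
    x ^ n
  }
}

square <- make_power(2)
cube   <- make_power(3)

print(square(4))   # 16
print(cube(2))     # 8
```

Each returned function remembers its own value of n, even though n is not a global variable.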
Loop functions
Often, we have a list containing some objects and we want to apply a function to every
element of the list. For example, we have a list of results of a survey, containing m
questions from n participants. We would like to find the average response for each
question (assuming that all questions have a response as numeric values). One could use
a for loop over the set of questions and find an average among n users using the mean()
function in R. Loop functions come in handy in such situations and one can do such
computations in a more compact way. These are like iterators in other languages such as
Java.
The following are the standard loop functions in R:
lapply: To loop over a list and evaluate a function on each element
sapply: The same as lapply, but with the output in a simpler form
mapply: A multivariate version of sapply
apply: To apply functions over array margins
tapply: To apply a function to each cell of a ragged array
lapply
The lapply() function is used in the following manner:
> lapply(X, FUN, ...)
Here, X is a list or vector containing data. The FUN is the name of a function that needs
to be applied on each element of the list or vector. The last argument represents optional
arguments. The result of using lapply is always a list, regardless of the type of input.
As an example, consider the quarterly revenue of four companies in billions of dollars
(not real data). We would like to compute the yearly average revenue of all four
companies as follows:
> X <- list(HP = c(12.5, 14.3, 16.1, 15.4), IBM = c(22, 24.5, 23.7, 26.2), Dell = c(8.9, 9.7, 10.8, 11.5), Oracle = c(20.5, 22.7, 21.8, 24.4))
> lapply(X, mean)
$HP
[1] 14.575
$IBM
[1] 24.1
$Dell
[1] 10.225
$Oracle
[1] 22.35
sapply
The sapply() function is similar to lapply() with the additional option of simplifying
the output into a desired form. For example, sapply() can be used in the previous
dataset as follows:
> sapply(X, mean, simplify = "array")
    HP    IBM   Dell Oracle
14.575 24.100 10.225 22.350
mapply
The lapply() and sapply() functions take only one list or vector argument. If you want to
apply a function with multiple variable arguments, then mapply() comes in handy. Here
is how it is used:
> mapply(FUN, L1, L2, ..., Ln, SIMPLIFY = TRUE)
Here, L1, L2, ..., Ln are the lists to which the function FUN needs to be applied. For
example, consider the following list generation command:
>rep(x=10,times=5)
[1] 10 10 10 10 10
Here, the rep function repeats the value of x five times. Suppose we want to create a
list where the number 10 occurs 1 time, the number 20 occurs 2 times, and so on. We
can use mapply() as follows:
> mapply(rep, x = c(10, 20, 30, 40, 50), times = 1:5)
apply
The apply() function is useful for applying a function to the margins of an array or
matrix. The form of the function is as follows:
> apply(X, MARGIN, FUN, ...)
Here, MARGIN is a vector giving the subscripts that the function will be applied over.
For example, in the case of a matrix, 1 indicates rows, 2 indicates columns, and
c(1,2) indicates rows and columns. Consider the following example as an illustration:
> Y <- matrix(1:9, nrow = 3, ncol = 3)
> Y
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> apply(Y, 1, sum)  # sum along the rows
[1] 12 15 18
> apply(Y, 2, sum)  # sum along the columns
[1]  6 15 24
tapply
The tapply() function is used to apply a function over the subsets of a vector. The
function description is as follows:
> tapply(X, INDEX, FUN, ..., simplify = TRUE)
Let us consider the earlier example of the quarterly revenue of four companies:
> X <- list(HP = c(12.5, 14.3, 16.1, 15.4), IBM = c(22, 24.5, 23.7, 26.2), Dell = c(8.9, 9.7, 10.8, 11.5), Oracle = c(20.5, 22.7, 21.8, 24.4))
Using lapply(), we found the average yearly revenue of each company. Suppose we
want to find the revenue per quarter averaged over all four companies; we can use
tapply() as follows. Here, we use the function c() instead of list() to create X:
> X <- c(HP = c(12.5, 14.3, 16.1, 15.4), IBM = c(22, 24.5, 23.7, 26.2), Dell = c(8.9, 9.7, 10.8, 11.5), Oracle = c(20.5, 22.7, 21.8, 24.4))
> f <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = 4))
> f
 [1] Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
Levels: Q1 Q2 Q3 Q4
> tapply(X, f, mean, simplify = TRUE)
    Q1     Q2     Q3     Q4
15.975 17.800 18.100 19.375
By creating the factor list with levels as quarter values, we can apply the mean function
for each quarter using tapply().
Data visualization
One of the powerful features of R is its functions for generating high-quality plots and
visualizing data. The graphics functions in R can be divided into three groups:
High-level plotting functions to create new plots, add axes, labels, and titles.
Low-level plotting functions to add more information to an existing plot. This
includes adding extra points, lines, and labels.
Interactive graphics functions to interactively add information to, or extract
information from, an existing plot.
The R base package itself contains several graphics functions. For more advanced graph
applications, one can use packages such as ggplot2, grid, or lattice. In particular,
ggplot2 is very useful for generating visually appealing, multilayered graphs. It is based
on the concept of grammar of graphics. Due to lack of space, we are not covering these
packages in this book. Interested readers should consult the book by Hadley Wickham
(reference 4 in the References section of this chapter).
High-level plotting functions
Let us start with the most basic plotting functions in R as follows:
plot(): This is the most common plotting function in R. It is a generic function
where the output depends on the type of the first argument.
plot(x, y): This produces a scatter plot of y versus x.
plot(x): If x is a real-valued vector, the output will be a plot of the values of x
versus their index on the X axis. If x is a complex vector, then it will plot the real
part versus the imaginary part.
plot(f, y): Here, f is a factor object and y is a numeric vector. The function
produces box plots of y for each level of f.
plot(y ~ expr): Here, y is any object and expr is a list of object names
separated by + (for example, p + q + r). The function plots y against every object
named in expr.
There are two useful functions in R for visualizing multivariate data:
pairs(X): If X is a data frame containing numeric data, then this function produces
a pair-wise scatter plot matrix of the variables defined by the columns of X.
coplot(y ~ x | z): If y and x are numeric vectors and z is a factor object, then
this function plots y versus x for every level of z.
For plotting distributions of data, one can use the following functions:
hist(x): This produces a histogram of the numeric vector x.
qqplot(x, y): This plots the quantiles of x versus the quantiles of y to compare
their respective distributions.
qqnorm(x): This plots the numeric vector x against the expected normal order
scores.
Low-level plotting commands
To add points and lines to a plot, the following commands can be used:
points(x, y): This adds point (x, y) to the current plot.
lines(x, y): This adds a connecting line to the current plot.
abline(a, b): This adds a line with intercept a and slope b to the current plot.
polygon(x, y, …): This draws a polygon defined by the ordered vertices (x, y,
…).
To add the text to a plot, use the following functions:
text(x, y, labels): This adds text to the current plot at point (x, y).
legend(x, y, legend): This adds a legend to the current plot at point (x, y).
title(main, sub): This adds a title main at the top of the current plot in a large
font and a subtitle sub at the bottom in a smaller font.
axis(side, …): This adds an axis to the current plot on the side given by the first
argument. The side can take values from 1 to 4, counting counterclockwise from
the bottom (1 = bottom, 2 = left, 3 = top, 4 = right).
The following example shows how to plot a scatter plot and add a trend line. For this,
we will use the famous Iris dataset, created by R. A. Fisher, that is available in R itself:
data(iris)
str(iris)
plot(iris$Petal.Width, iris$Petal.Length, col = "blue", xlab = "X",
ylab = "Y")
title(main = "Plot of Iris Data", sub = "Petal Length (Y) Vs Petal
Width (X)")
fitlm <- lm(iris$Petal.Length ~ iris$Petal.Width)
abline(fitlm, col = "red")
Interactive graphics functions
There are functions in R that enable users to add or extract information from a plot using
the mouse in an interactive manner:
locator(n, type): This waits for the user to select n locations on the
current plot using the left mouse button. Here, type is one of n, p, l, or o, to plot
points or lines at these locations. For example, to place a legend Outlier near an
outlier point, use the following code:
> text(locator(1), "Outlier", adj = 0)
identify(x, y, label): This allows the user to highlight any of the points, x
and y, selected using the left-mouse button by placing the label nearby.
Sampling
Often, we would be interested in creating a representative dataset, for some analysis or
design of experiments, by sampling from a population. This is particularly the case for
Bayesian inference, as we will see in the later chapters, where samples are drawn from
posterior distribution for inference. Therefore, it would be useful to learn how to
sample N points from some well-known distributions in this chapter.
Before we use any particular sampling methods, readers should note that R, like any
other computer program, uses pseudo random number generators for sampling. It is
useful to supply a starting seed number to get reproducible results. This can be done
using the set.seed(n) command with an integer n as the seed.
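For example, re-seeding the generator with the same value reproduces exactly the same draws:

```r
# The same seed always yields the same pseudo random sequence
set.seed(42)
a <- runif(3)

set.seed(42)
b <- runif(3)

print(identical(a, b))   # TRUE: the two draws are identical
```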
Random uniform sampling from an interval
To generate n random numbers (numeric) that are uniformly distributed in the interval
[a, b], one can use the runif() function:
> runif(5, 1, 10)  # generates 5 random numbers between 1 and 10
[1] 7.416 9.846 3.093 2.656 1.561
Without any arguments, runif() will generate uniform random numbers between 0 and
1.
If we want to generate random integers uniformly distributed in an interval, the function
to use is sample():
> sample(1:100, 10, replace = T)  # generates 10 random integers between 1 and 100
[1] 24 51 46 87 30 86 50 45 53 62
The option replace=T indicates that repetition is allowed.
Sampling from normal distribution
Often, we may want to generate data that is distributed according to a particular
distribution, say normal distribution. In the case of univariate distributions, R has
several in-built functions for this. For sampling data from a normal distribution, the
function to be used is rnorm(). For example, consider the following code:
> rnorm(5, mean = 0, sd = 1)
[1]  0.759 -1.676  0.569  0.928 -0.609
This generates five random numbers distributed according to a normal distribution with
mean 0 and standard deviation 1.
Similarly, one can use the rbinom() function for sampling from a binomial distribution, rpois() to sample from a Poisson distribution, rbeta() to sample from a Beta distribution, and rgamma() to sample from a Gamma distribution, to mention a few.
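As a quick sketch (with arbitrarily chosen parameter values, purely for illustration), the following lines draw from each of these distributions and compare the sample means against the theoretical ones:

```r
# Draw 1,000 samples from each distribution; the parameter values
# here are chosen arbitrarily for illustration.
set.seed(42)
b  <- rbinom(1000, size = 10, prob = 0.3)  # binomial: 10 trials, p = 0.3
p  <- rpois(1000, lambda = 4)              # Poisson with rate 4
be <- rbeta(1000, shape1 = 2, shape2 = 5)  # Beta(2, 5)
g  <- rgamma(1000, shape = 3, rate = 2)    # Gamma(3, 2)
# The sample means should be close to the theoretical means:
mean(b)   # ~ 10 * 0.3 = 3
mean(p)   # ~ 4
mean(be)  # ~ 2/(2 + 5)
mean(g)   # ~ 3/2
```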
Exercises
For the following exercises in this chapter, we use the Auto MPG dataset from the UCI
Machine Learning repository (references 5 and 6 in the References section of this
chapter). The dataset can be downloaded from
https://archive.ics.uci.edu/ml/datasets.html. The dataset contains the fuel consumption of
cars in the US measured during 1970-1982. Along with consumption values, there are
attribute variables, such as the number of cylinders, displacement, horsepower, weight, acceleration, year, origin, and the name of the car:
1. Load the dataset into R using the read.table() function.
2. Produce a box plot of mpg values for each car name.
3. Write a function that will compute the scaled value (subtract the mean and divide by the standard deviation) of a column whose name is given as an argument of the function.
4. Use the lapply() function to compute scaled values for all variables.
5. Produce a scatter plot of mpg versus acceleration for each car name using coplot(). Use legends to annotate the graph.
References
1. Matloff N. The Art of R Programming – A Tour of Statistical Software Design.
No Starch Press. 2011. ISBN-10: 1593273843
2. Teetor P. R Cookbook. O'Reilly Media. 2011. ISBN-10: 0596809158
3. Wickham H. Advanced R. Chapman & Hall/CRC The R Series. 2015. ISBN-10:
1466586966
4. Wickham H. ggplot2: Elegant Graphics for Data Analysis (Use R!). Springer.
2010. ISBN-10: 0387981403
5. Auto MPG Data Set, UCI Machine Learning repository,
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
6. Quinlan R. "Combining Instance-Based and Model-Based Learning". In: Tenth International Conference on Machine Learning. 236-243. University of Massachusetts, Amherst. Morgan Kaufmann. 1993
Summary
In this chapter, you were introduced to the R environment. After reading through this chapter, you learned how to import data into R, select subsets of data for analysis, and write simple R programs using functions and control structures. Also, you should now be familiar with the graphical capabilities of R and some advanced capabilities, such as loop functions. In the next chapter, we will begin the central theme of this book, Bayesian inference.
Chapter 3. Introducing Bayesian
Inference
In Chapter 1, Introducing the Probability Theory, we learned about the Bayes theorem
as the relation between conditional probabilities of two random variables such as A and
B. This theorem is the basis for updating beliefs or model parameter values in Bayesian
inference, given the observations. In this chapter, a more formal treatment of Bayesian inference will be given. To begin with, let us try to understand how uncertainties in a real-world problem are treated in the Bayesian approach.
Bayesian view of uncertainty
The classical or frequentist statistics typically take the view that any physical process generating data containing noise can be modeled by a stochastic model with fixed
values of parameters. The parameter values are learned from the observed data through
procedures such as maximum likelihood estimate. The essential idea is to search in the
parameter space to find the parameter values that maximize the probability of observing
the data seen so far. Neither the uncertainty in the estimation of model parameters from
data, nor the uncertainty in the model itself that explains the phenomena under study, is
dealt with in a formal way. The Bayesian approach, on the other hand, treats all
sources of uncertainty using probabilities. Therefore, neither the model to explain an
observed dataset nor its parameters are fixed, but they are treated as uncertain
variables. Bayesian inference provides a framework to learn the entire distribution of
model parameters, not just the values, which maximize the probability of observing the
given data. The learning can come from both the evidence provided by observed data
and domain knowledge from experts. There is also a framework to select the best model
among the family of models suited to explain a given dataset.
Once we have the distribution of model parameters, we can average out the effect of parameter-estimation uncertainty when predicting future values of a random variable using the learned model. This is done by averaging over the model parameter values through marginalization of the joint probability distribution, as explained in Chapter 1, Introducing the Probability Theory.
Consider the joint probability distribution of N random variables again, as discussed in Chapter 1, Introducing the Probability Theory:

P(X_1, X_2, \ldots, X_N \mid \theta, m)

This time, we have added one more term, m, to the argument of the probability distribution, in order to indicate explicitly that the parameters \theta are generated by the model m. Then, according to Bayes theorem, the probability distribution of model parameters conditioned on the observed data X_1, X_2, \ldots, X_N and model m is given by:

P(\theta \mid X_1, \ldots, X_N, m) = \frac{P(X_1, \ldots, X_N \mid \theta, m)\, P(\theta \mid m)}{P(X_1, \ldots, X_N \mid m)}
Formally, the term on the LHS of the equation, P(\theta \mid X_1, \ldots, X_N, m), is called the posterior probability distribution. The second term appearing in the numerator of the RHS, P(\theta \mid m), is called the prior probability distribution. It represents the prior belief about the model parameters, before observing any data, say, from the domain knowledge. Prior distributions can also have parameters and they are called hyperparameters. The term P(X_1, \ldots, X_N \mid \theta, m) is the likelihood of model m explaining the observed data.

Since P(X_1, \ldots, X_N \mid m) = \int P(X_1, \ldots, X_N \mid \theta, m)\, P(\theta \mid m)\, d\theta, it can be considered as a normalization constant. The preceding equation can be rewritten in an iterative form as follows:

P(\theta \mid X_1, \ldots, X_n, m) \propto P(X_n \mid \theta, m)\, P(\theta \mid X_1, \ldots, X_{n-1}, m)

Here, X_n represents the values of observations that are obtained at time step n, P(\theta \mid X_1, \ldots, X_{n-1}, m) is the marginal parameter distribution updated until time step n - 1, and P(\theta \mid X_1, \ldots, X_n, m) is the model parameter distribution updated after seeing the observations X_n at time step n.
Casting Bayes theorem in this iterative form is useful for online learning and it suggests
the following:
Model parameters can be learned in an iterative way as more and more data or
evidence is obtained
The posterior distribution estimated using the data seen so far can be treated as a
prior model when the next set of observations is obtained
Even if no data is available, one could make predictions based on prior
distribution created using the domain knowledge alone
To make these points clear, let's take a simple illustrative example. Consider the case
where one is trying to estimate the distribution of the height of males in a given region.
The data used for this example is the height measurement in centimeters obtained from
M volunteers sampled randomly from the population. We assume that the heights are distributed according to a normal distribution with mean \mu and variance \sigma^2:

P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
As mentioned earlier, in classical statistics, one tries to estimate the values of \mu and \sigma^2 from observed data. Apart from the best estimate value for each parameter, one could also determine an error term of the estimate. In the Bayesian approach, on the other hand, \mu and \sigma^2 are also treated as random variables. Let's, for simplicity, assume \sigma^2 is a known constant. Also, let's assume that the prior distribution for \mu is a normal distribution with (hyper)parameters \mu_0 and \sigma_0^2. In this case, the expression for the posterior distribution of \mu is given by:

P(\mu \mid X, \sigma^2) \propto N(\mu \mid \mu_0, \sigma_0^2) \prod_{i=1}^{M} N(x_i \mid \mu, \sigma^2)

Here, for convenience, we have used the notation X for the set of observations x_1, x_2, \ldots, x_M. It is a simple exercise to expand the terms in the product and complete the squares in the exponential. This is given as an exercise at the end of the chapter. The resulting expression for the posterior distribution P(\mu \mid X, \sigma^2) is given by:

P(\mu \mid X, \sigma^2) = N(\mu \mid \mu_M, \sigma_M^2)

Here, \bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i represents the sample mean. Though the preceding expression looks complex, it has a very simple interpretation. The posterior distribution is also a normal distribution with the following mean:

\mu_M = \frac{M \sigma_0^2}{M \sigma_0^2 + \sigma^2}\, \bar{x} + \frac{\sigma^2}{M \sigma_0^2 + \sigma^2}\, \mu_0

The variance is as follows:

\sigma_M^2 = \frac{\sigma_0^2\, \sigma^2}{M \sigma_0^2 + \sigma^2}
The posterior mean is a weighted sum of the prior mean \mu_0 and the sample mean \bar{x}. As the sample size M increases, the weight of the sample mean increases and that of the prior decreases. Similarly, the posterior precision (the inverse of the variance) is the sum of the prior precision 1/\sigma_0^2 and the precision of the sample mean M/\sigma^2:

\frac{1}{\sigma_M^2} = \frac{1}{\sigma_0^2} + \frac{M}{\sigma^2}

As M increases, the contribution of precision from observations (evidence) outweighs that from the prior knowledge.
Let's take a concrete example where we consider an age distribution with the population mean 5.5 and population standard deviation 0.5. We sample 10,000 people from this population by using the following R script:
>set.seed(100)
>age_samples <- rnorm(10000,mean = 5.5,sd=0.5)
We can calculate the posterior distribution using the following R function:
>age_mean <- function(n){
  mu0 <- 5
  sd0 <- 1
  mus <- mean(age_samples[1:n])
  sds <- sd(age_samples[1:n])
  mu_n <- (sd0^2/(sd0^2 + sds^2/n)) * mus + (sds^2/n/(sd0^2 + sds^2/n)) * mu0
  mu_n
}
>samp <- c(25,50,100,200,400,500,1000,2000,5000,10000)
>mu <- sapply(samp,age_mean,simplify = "array")
>plot(samp,mu,type="b",col="blue",ylim=c(5.3,5.7),
     xlab="no of samples",ylab="estimate of mean")
>abline(5.5,0)
One can see that as the number of samples increases, the estimated mean asymptotically
approaches the population mean. The initial low value is due to the influence of the
prior, which is, in this case, 5.0.
This simple and intuitive picture of how the prior knowledge and evidence from
observations contribute to the overall model parameter estimate holds in any Bayesian
inference. The precise mathematical expression for how they combine would be
different. Therefore, one could start using a model for prediction with just prior
information, either from the domain knowledge or the data collected in the past. Also, as
new observations arrive, the model can be updated using the Bayesian scheme.
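The iterative updating scheme can be sketched in a few lines of R. This is a minimal sketch, assuming the known-variance conjugate normal update derived earlier; the function name update_normal and the hyperparameter values are hypothetical:

```r
# Conjugate update for a normal mean with known data sd sigma:
# precisions add, and the posterior mean is precision-weighted.
update_normal <- function(mu0, sd0, x, sigma) {
  n <- length(x)
  prec_n <- 1 / sd0^2 + n / sigma^2                  # posterior precision
  mu_n <- (mu0 / sd0^2 + sum(x) / sigma^2) / prec_n  # posterior mean
  list(mu = mu_n, sd = sqrt(1 / prec_n))
}
set.seed(100)
x <- rnorm(200, mean = 5.5, sd = 0.5)
# All the data at once...
batch <- update_normal(5, 1, x, sigma = 0.5)
# ...or in two batches, with yesterday's posterior as today's prior
p1 <- update_normal(5, 1, x[1:100], sigma = 0.5)
p2 <- update_normal(p1$mu, p1$sd, x[101:200], sigma = 0.5)
c(batch$mu, p2$mu)  # identical up to floating point
```

Both routes give the same posterior, which is exactly the point of the iterative form of the Bayes theorem.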
Choosing the right prior distribution
In the preceding simple example, we saw that if the likelihood function has the form of a
normal distribution, and when the prior distribution is chosen as normal, the posterior
also turns out to be a normal distribution. Also, we could get a closed-form analytical
expression for the posterior mean. Since the posterior is obtained by multiplying the
prior and likelihood functions and normalizing by integration over the parameter
variables, the form of the prior distribution has a significant influence on the posterior.
This section gives some more details about the different types of prior distributions and
guidelines as to which ones to use in a given context.
There are different ways of classifying prior distributions in a formal way. One of the
approaches is based on how much information a prior provides. In this scheme, the
prior distributions are classified as Informative, Weakly Informative, Least
Informative, and Non-informative. A detailed discussion of each of these classes is
beyond the scope of this book, and interested readers should consult relevant books
(references 1 and 2 in the References section of this chapter). Here, we take more of a
practitioner's approach and illustrate some of the important classes of the prior
distributions commonly used in practice.
Non-informative priors
Let's start with the case where we do not have any prior knowledge about the model
parameters. In this case, we want to express complete ignorance about model
parameters through a mathematical expression. This is achieved through what are called
non-informative priors. For example, in the case of a single random variable x that can take any value between -\infty and \infty, the non-informative prior for its mean \mu would be the following:

P(\mu) \propto \text{const}

Here, the complete ignorance of the parameter value is captured through a uniform distribution function in the parameter space. Note that a uniform distribution over an infinite range is not a proper distribution function, since its integral over the domain is not equal to 1; therefore, it is not normalizable. However, one can use an improper distribution function for the prior as long as the posterior, obtained by multiplying it with the likelihood function, can be normalized.
If the parameter of interest is the variance \sigma^2, then by definition it can only take non-negative values. In this case, we transform the variable so that the transformed variable has a uniform probability in the range from -\infty to \infty:

P(\ln \sigma^2) \propto \text{const}

It is easy to show, using simple differential calculus, that the corresponding non-informative distribution function in the original variable \sigma^2 would be as follows:

P(\sigma^2) \propto \frac{1}{\sigma^2}
Another well-known non-informative prior used in practical applications is the Jeffreys prior, which is named after the British statistician Harold Jeffreys. This prior is invariant under reparametrization of \theta and is defined as proportional to the square root of the determinant of the Fisher information matrix:

P(\theta) \propto \sqrt{\det I(\theta)}

Here, it is worth discussing the Fisher information matrix a little bit. If X is a random variable distributed according to P(X \mid \theta), we may like to know how much information observations of X carry about the unknown parameter \theta. This is what the Fisher information matrix provides. It is defined as the second moment of the score (the first derivative of the logarithm of the likelihood function):

I(\theta) = E\left[ \left( \frac{\partial}{\partial \theta} \ln P(X \mid \theta) \right)^2 \right]
Let's take a simple two-dimensional problem to understand the Fisher information matrix and Jeffreys prior. This example is given by Prof. D. Wittman of the University of California (reference 3 in the References section of this chapter). Let's consider two types of food item: buns and hot dogs.

Let's assume that generally they are produced in pairs (a hot dog and bun pair), but occasionally hot dogs are also produced independently in a separate process. There are two observables, the number of hot dogs (n_h) and the number of buns (n_b), and two model parameters, the production rate of pairs (\alpha) and the production rate of hot dogs alone (\beta). The expected bun count is therefore \alpha and the expected hot dog count is \alpha + \beta. We assume that the uncertainty in the measurements of the counts of these two food products is distributed according to the normal distribution, with variance \sigma_b^2 and \sigma_h^2, respectively. In this case, the Fisher information matrix for this problem would be as follows:

I(\alpha, \beta) = \begin{pmatrix} \frac{1}{\sigma_b^2} + \frac{1}{\sigma_h^2} & \frac{1}{\sigma_h^2} \\ \frac{1}{\sigma_h^2} & \frac{1}{\sigma_h^2} \end{pmatrix}

In this case, the inverse of the Fisher information matrix would correspond to the covariance matrix:

I^{-1}(\alpha, \beta) = \begin{pmatrix} \sigma_b^2 & -\sigma_b^2 \\ -\sigma_b^2 & \sigma_b^2 + \sigma_h^2 \end{pmatrix}
We have included one problem in the Exercises section of this chapter to compute the Fisher information matrix and Jeffreys prior. Readers are requested to attempt this in order to get a feel for how to compute the Jeffreys prior from observations.
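Before attempting the exercise, it may help to see the computation in a one-parameter case. The following sketch (not the hot dog problem; a Bernoulli likelihood is assumed for simplicity) computes the Fisher information for Bernoulli(p) and checks that the corresponding Jeffreys prior is the Beta(1/2, 1/2) distribution:

```r
# Fisher information for a Bernoulli(p) likelihood:
# I(p) = E[(d/dp log f(X; p))^2] = 1/(p(1 - p))
fisher_info <- function(p) {
  score_sq <- function(x) (x / p - (1 - x) / (1 - p))^2  # squared score, x in {0,1}
  (1 - p) * score_sq(0) + p * score_sq(1)                # expectation over X
}
p <- 0.3
jeffreys  <- sqrt(fisher_info(p))      # Jeffreys prior, up to normalization
beta_half <- dbeta(p, 0.5, 0.5) * pi   # Beta(1/2, 1/2) density times its constant pi
c(jeffreys, beta_half)                 # the two values agree
```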
Subjective priors
One of the key strengths of Bayesian statistics compared to classical (frequentist)
statistics is that the framework allows one to capture subjective beliefs about any
random variables. Usually, people will have intuitive feelings about minimum,
maximum, mean, and most probable or peak values of a random variable. For example,
if one is interested in the distribution of hourly temperatures in winter in a tropical
country, then the people who are familiar with tropical climates or climatology experts
will have a belief that, in winter, the temperature can go as low as 15°C and as high as
27°C with the most probable temperature value being 23°C. This can be captured as a
prior distribution through the Triangle distribution as shown here.
The Triangle distribution has three parameters corresponding to a minimum value (a), the most probable value (b), and a maximum value (c). The mean and variance of this distribution are given by:

\text{mean} = \frac{a + b + c}{3}, \qquad \text{variance} = \frac{a^2 + b^2 + c^2 - ab - ac - bc}{18}
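These formulas are easy to check by simulation. The sketch below assumes an inverse-CDF sampler for the Triangle distribution (the helper rtriangle is hypothetical, not a base R function) and uses the winter-temperature belief of 15, 23, and 27 degrees as parameters:

```r
# Inverse-CDF sampler for Triangle(a, b, c), where b is the mode
rtriangle <- function(n, a, b, c) {
  u <- runif(n)
  f <- (b - a) / (c - a)  # CDF value at the mode
  ifelse(u < f,
         a + sqrt(u * (c - a) * (b - a)),
         c - sqrt((1 - u) * (c - a) * (c - b)))
}
set.seed(1)
x <- rtriangle(1e5, a = 15, b = 23, c = 27)
c(mean(x), (15 + 23 + 27) / 3)                                 # both ~ 21.67
c(var(x), (15^2 + 23^2 + 27^2 - 15*23 - 15*27 - 23*27) / 18)   # both ~ 6.22
```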
One can also use a PERT distribution to represent a subjective belief about the minimum, maximum, and most probable value of a random variable. The PERT distribution is a reparametrized Beta distribution on the interval [a, c], as follows:

P(x) = \frac{(x - a)^{\alpha_1 - 1} (c - x)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)\, (c - a)^{\alpha_1 + \alpha_2 - 1}}

Here:

\alpha_1 = 1 + 4\,\frac{b - a}{c - a}, \qquad \alpha_2 = 1 + 4\,\frac{c - b}{c - a}
The PERT distribution is commonly used for project completion time analysis, and the
name originates from project evaluation and review techniques. Another area where
Triangle and PERT distributions are commonly used is in risk modeling.
Often, people also have a belief about the relative probabilities of values of a random
variable. For example, when studying the distribution of ages in a population such as
Japan or some European countries, where there are more old people than young, an
expert could give relative weights for the probability of different ages in the
populations. This can be captured through a relative distribution containing the following details:

P(x) \sim \{\min, \max, \{\text{values}\}, \{\text{weights}\}\}

Here, min and max represent the minimum and maximum values, {values} represents the set of possible observed values, and {weights} represents their relative weights. For example, in the population age distribution problem, these could be the observed ages and their expert-assigned relative weights. The weights need not have a sum of 1.
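In R, such a belief can be sketched with sample(), whose prob argument takes exactly this kind of unnormalized weight vector (the age values and weights below are hypothetical, chosen purely for illustration):

```r
# Hypothetical age groups with expert-assigned relative weights;
# sample() normalizes the weights internally, so they need not sum to 1.
ages    <- c(10, 30, 50, 70, 90)
weights <- c(1, 2, 3, 5, 4)
set.seed(7)
draws <- sample(ages, size = 10000, replace = TRUE, prob = weights)
round(table(draws) / 10000, 2)  # empirical proportions ~ weights / sum(weights)
```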
Conjugate priors
If both the prior and posterior distributions are in the same family of distributions, then
they are called conjugate distributions and the corresponding prior is called a
conjugate prior for the likelihood function. Conjugate priors are very helpful for getting analytical closed-form expressions for the posterior distribution. In the
simple example we considered, we saw that when the noise is distributed according to
the normal distribution, choosing a normal prior for the mean resulted in a normal
posterior. The following table gives examples of some well-known conjugate pairs that
we will use in the later chapters of this book:
Likelihood function | Model parameters | Conjugate prior | Hyperparameters
Binomial | p (probability) | Beta | \alpha, \beta
Poisson | \lambda (rate) | Gamma | \alpha, \beta
Categorical | p (probability), k (number of categories) | Dirichlet | \alpha_1, \ldots, \alpha_k
Univariate normal (known variance \sigma^2) | \mu (mean) | Normal | \mu_0, \sigma_0^2
Univariate normal (known mean \mu) | \sigma^2 (variance) | Inverse Gamma | \alpha, \beta
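As a minimal sketch of the first row of the table: with a Beta prior on the binomial success probability, the posterior is again a Beta distribution whose hyperparameters are updated by simple counting (the prior and data values below are made up for illustration):

```r
# Beta prior Beta(a, b) with a Binomial likelihood: observing k successes
# in n trials gives the posterior Beta(a + k, b + n - k).
a <- 2; b <- 2    # prior hyperparameters
n <- 20; k <- 14  # observed data: 14 successes in 20 trials
a_post <- a + k
b_post <- b + n - k
c(prior_mean     = a / (a + b),                  # 0.5
  mle            = k / n,                        # 0.7
  posterior_mean = a_post / (a_post + b_post))   # 16/24, about 0.667
```

The posterior mean lies between the prior mean and the maximum likelihood estimate, mirroring the weighted-average behavior seen earlier for the normal case.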
Hierarchical priors
Sometimes, it is useful to define prior distributions for the hyperparameters themselves. This is consistent with the Bayesian view that all parameters should be treated as uncertain by using probabilities. These distributions are called hyper-prior distributions. In theory, one can continue this into many levels as a hierarchical model. This is one way of eliciting the optimal prior distributions. For example:

P(\theta \mid \eta)

is the prior distribution with a hyperparameter \eta. We could define a prior distribution for \eta through a second set of equations, as follows:

P(\eta \mid \lambda)

Here, P(\eta \mid \lambda) is the hyper-prior distribution for the hyperparameter \eta, parametrized by the hyper-hyper-parameter \lambda. One can define a prior distribution for \lambda in the same way and continue the process forever. The practical reason for formalizing such models is that, at some level of hierarchy, one can define a uniform prior for the hyperparameters, reflecting complete ignorance about the parameter distribution, and effectively truncate the hierarchy. In practical situations, typically, this is done at the second level. In the preceding example, this corresponds to using a uniform distribution for \lambda.
I want to conclude this section by stressing one important point. Though prior
distribution has a significant role in Bayesian inference, one need not worry about it too
much, as long as the prior chosen is reasonable and consistent with the domain
knowledge and evidence seen so far. The reason is that, first of all, as we have more evidence, the significance of the prior gets washed out. Secondly, when we use
Bayesian models for prediction, we will average over the uncertainty in the estimation
of the parameters using the posterior distribution. This averaging is the key ingredient
of Bayesian inference and it removes many of the ambiguities in the selection of the
right prior.
Estimation of posterior distribution
So far, we discussed the essential concept behind Bayesian inference and also how to
choose a prior distribution. Since one needs to compute the posterior distribution of
model parameters before one can use the models for prediction, we discuss this task in
this section. Though the Bayes rule has a very simple-looking form, the computation of the posterior distribution in a practically usable way is often very challenging. This is primarily because the computation of the normalization constant P(X \mid m) involves N-dimensional integrals when there are N parameters. Even when one uses a conjugate prior, this computation can be very difficult to treat analytically or numerically. This was one of the main reasons for not using Bayesian inference for multivariate modeling until recent decades. In this section, we will look at various approximate ways of computing posterior distributions that are used in practice.
Maximum a posteriori estimation
Maximum a posteriori (MAP) estimation is a point estimation that corresponds to
taking the maximum value or mode of the posterior distribution. Though taking a point
estimation does not capture the variability in the parameter estimation, it does take into
account the effect of prior distribution to some extent when compared to maximum
likelihood estimation. MAP estimation is also called poor man's Bayesian inference.
From the Bayes rule, we have:

\theta_{MAP} = \arg\max_{\theta} P(\theta \mid X, m) = \arg\max_{\theta} \frac{P(X \mid \theta, m)\, P(\theta \mid m)}{P(X \mid m)} = \arg\max_{\theta} P(X \mid \theta, m)\, P(\theta \mid m)

Here, for convenience, we have used the notation X for the N-dimensional vector x_1, x_2, \ldots, x_N. The last relation follows because the denominator of the RHS of the Bayes rule is independent of \theta. Compare this with the following maximum likelihood estimate:

\theta_{ML} = \arg\max_{\theta} P(X \mid \theta, m)

The difference between the MAP and ML estimates is that, whereas ML finds the mode of the likelihood function, MAP finds the mode of the product of the likelihood function and prior.
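The difference is easy to see numerically. In this sketch (the prior hyperparameters are chosen arbitrarily, and a deliberately small sample is used so that the prior matters), both estimates are found with the built-in optimize() function:

```r
# MAP versus ML for the mean of normal data with known sd = 0.5
set.seed(100)
x <- rnorm(20, mean = 5.5, sd = 0.5)  # a small sample, so the prior matters
loglik   <- function(mu) sum(dnorm(x, mu, 0.5, log = TRUE))
logprior <- function(mu) dnorm(mu, mean = 5, sd = 0.2, log = TRUE)  # strong prior at 5
ml  <- optimize(loglik, c(0, 10), maximum = TRUE)$maximum
map <- optimize(function(mu) loglik(mu) + logprior(mu), c(0, 10),
                maximum = TRUE)$maximum
c(ml = ml, map = map)  # ML is close to mean(x); MAP is pulled toward the prior mean 5
```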
Laplace approximation
We saw that the MAP estimate just finds the maximum value of the posterior distribution. Laplace approximation goes one step further and also computes the local curvature around the maximum up to quadratic terms. This is equivalent to assuming that the posterior distribution is approximately Gaussian (normal) around the maximum, which would be the case if the amount of data were large compared to the number of parameters, M >> N:

P(\theta \mid X, m) \approx P(\theta_{MAP} \mid X, m) \exp\left( -\tfrac{1}{2} (\theta - \theta_{MAP})^T A\, (\theta - \theta_{MAP}) \right)

Here, A is the N x N Hessian matrix obtained by taking the second derivative of the negative log of the posterior distribution, evaluated at \theta_{MAP}:

A = -\nabla \nabla \ln P(\theta \mid X, m) \big|_{\theta = \theta_{MAP}}

It is straightforward to evaluate the previous expressions at \theta = \theta_{MAP}, using the following definition of conditional probability:

P(X \mid m) = \frac{P(X \mid \theta, m)\, P(\theta \mid m)}{P(\theta \mid X, m)}

We can get an expression for P(X \mid m) from the Laplace approximation that looks like the following:

\ln P(X \mid m) \approx \ln P(X \mid \theta_{MAP}, m) + \ln P(\theta_{MAP} \mid m) + \frac{N}{2} \ln 2\pi - \frac{1}{2} \ln \det A

In the limit of a large number of samples M, one can show that this expression simplifies to the following:

\ln P(X \mid m) \approx \ln P(X \mid \theta_{MAP}, m) - \frac{N}{2} \ln M

The term \ln P(X \mid \theta_{MAP}, m) - \frac{N}{2} \ln M is called the Bayesian information criterion (BIC) and can be used for model selection or model comparison. This is one of the goodness of fit terms for a statistical model. Another similar criterion that is commonly used is the Akaike information criterion (AIC), which is defined by \ln P(X \mid \theta_{ML}, m) - N.
Now we will discuss how BIC can be used to compare different models for model selection. In the Bayesian framework, two models such as m_1 and m_2 are compared using the Bayes factor. The Bayes factor B_{12} is defined as the ratio of posterior odds to prior odds:

B_{12} = \frac{P(m_1 \mid X) / P(m_2 \mid X)}{P(m_1) / P(m_2)} = \frac{P(X \mid m_1)}{P(X \mid m_2)}

Here, the posterior odds is the ratio of the posterior probabilities of the two models given the data, and the prior odds is the ratio of the prior probabilities of the two models, as given in the preceding equation. If B_{12} > 1, model m_1 is preferred by the data, and if B_{12} < 1, model m_2 is preferred by the data.

In reality, it is difficult to compute the Bayes factor because it is difficult to get the precise prior probabilities. It can be shown that, in the limit of a large number of samples, \ln B_{12} can be viewed as a rough approximation to BIC_1 - BIC_2.
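The model-comparison use of BIC can be sketched with R's built-in AIC() and BIC() functions. Note that they follow the -2 log-likelihood convention (smaller is better), that is, the negative of twice the quantities defined above; the simulated data and models here are made up for illustration:

```r
# Compare a correct linear model against a needlessly complex one.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)  # data truly generated by a linear model
m1 <- lm(y ~ x)           # correct model
m2 <- lm(y ~ poly(x, 5))  # over-parameterized model
c(BIC(m1), BIC(m2))       # m1 gets the smaller (better) BIC
c(AIC(m1), AIC(m2))
```

The extra polynomial terms of m2 improve the fit only slightly, so the complexity penalty dominates and BIC favors the simpler model.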
Monte Carlo simulations
The two approximations that we have discussed so far, the MAP and Laplace
approximations, are useful when the posterior is a very sharply peaked function about
the maximum value. Often, in real-life situations, the posterior will have long tails. This
is, for example, the case in e-commerce where the probability of the purchasing of a
product by a user has a long tail in the space of all products. So, in many practical
situations, both MAP and Laplace approximations fail to give good results. Another
approach is to directly sample from the posterior distribution. Monte Carlo simulation
is a technique used for sampling from the posterior distribution and is one of the
workhorses of Bayesian inference in practical applications. In this section, we will
introduce the reader to Markov Chain Monte Carlo (MCMC) simulations and also
discuss two common MCMC methods used in practice.
As discussed earlier, let \theta = \{\theta_1, \theta_2, \ldots, \theta_N\} be the set of parameters that we are interested in estimating from the data through the posterior distribution. Consider the case of the parameters being discrete, where each parameter has K possible values, that is, \theta_i \in \{1, 2, \ldots, K\}. Set up a Markov process with states \theta and transition probability matrix T(\theta \to \theta'). The essential idea behind MCMC simulations is that one can choose the transition probabilities in such a way that the steady state distribution of the Markov chain would correspond to the posterior distribution we are interested in. Once this is done, sampling from the Markov chain output, after it has reached a steady state, will give samples of \theta distributed according to the posterior distribution.
Now, the question is how to set up the Markov process in such a way that its steady
state distribution corresponds to the posterior of interest. There are two well-known
methods for this. One is the Metropolis-Hastings algorithm and the second is Gibbs
sampling. We will discuss both in some detail here.
The Metropolis-Hasting algorithm
The Metropolis-Hasting algorithm was one of the first major algorithms proposed for MCMC (reference 4 in the References section of this chapter). It has a very simple concept, something similar to a hill-climbing algorithm in optimization:

1. Let \theta_t be the state of the system at time step t.
2. To move the system to another state at time step t + 1, generate a candidate state \theta' by sampling from a proposal distribution q(\theta' \mid \theta_t). The proposal distribution is chosen in such a way that it is easy to sample from it.
3. Accept the proposal move with the following probability:

   \alpha(\theta' \mid \theta_t) = \min\left( 1, \frac{P(\theta' \mid X)\, q(\theta_t \mid \theta')}{P(\theta_t \mid X)\, q(\theta' \mid \theta_t)} \right)

4. If it is accepted, \theta_{t+1} = \theta'; if not, \theta_{t+1} = \theta_t.
5. Continue the process until the distribution converges to the steady state.

Here, P(\theta \mid X) is the posterior distribution that we want to simulate. Under certain conditions, the preceding update rule will guarantee that, in the large time limit, the Markov process will approach a steady state distributed according to P(\theta \mid X).
The intuition behind the Metropolis-Hasting algorithm is simple. The proposal distribution q(\theta' \mid \theta) gives the conditional probability of proposing state \theta' in the next time step from the current state \theta. Therefore, P(\theta' \mid X)\, q(\theta_t \mid \theta') is the probability that the system is currently in state \theta' and would make a transition to state \theta_t in the next time step. Similarly, P(\theta_t \mid X)\, q(\theta' \mid \theta_t) is the probability that the system is currently in state \theta_t and would make a transition to state \theta' in the next time step. If the ratio of the first probability to the second is more than 1, accept the move. Alternatively, accept the move only with the probability given by the ratio. Therefore, the Metropolis-Hasting algorithm is like a hill-climbing algorithm where one accepts all the moves that are in the upward direction and accepts moves in the downward direction once in a while with a smaller probability. The downward moves help the system avoid getting stuck in local maxima.
Let's revisit the example of estimating the posterior distribution of the mean and
variance of the height of people in a population discussed in the introductory section.
This time we will estimate the posterior distribution by using the Metropolis-Hasting
algorithm. The following lines of R code do this job:
>set.seed(100)
>mu_t <- 5.5
>sd_t <- 0.5
>age_samples <- rnorm(10000,mean = mu_t,sd = sd_t)
>#function to compute log likelihood
>loglikelihood <- function(x,mu,sigma){
  singlell <- dnorm(x,mean = mu,sd = sigma,log = T)
  sumll <- sum(singlell)
  sumll
}
>#function to compute prior distribution for mean on log scale
>d_prior_mu <- function(mu){
  dnorm(mu,0,10,log=T)
}
>#function to compute prior distribution for std dev on log scale
>d_prior_sigma <- function(sigma){
  dunif(sigma,0,5,log=T)
}
>#function to compute posterior distribution on log scale
>d_posterior <- function(x,mu,sigma){
  loglikelihood(x,mu,sigma) + d_prior_mu(mu) + d_prior_sigma(sigma)
}
>#function to make transition moves
>tran_move <- function(x,dist = .1){
  x + rnorm(1,0,dist)
}
>num_iter <- 10000
>accepted <- array(dim = num_iter - 1)
>theta_posterior <- array(dim = c(2,num_iter))
>values_initial <- list(mu = runif(1,4,8),sigma = runif(1,1,5))
>theta_posterior[1,1] <- values_initial$mu
>theta_posterior[2,1] <- values_initial$sigma
>for (t in 2:num_iter){
  #proposed next values for parameters
  theta_proposed <- c(tran_move(theta_posterior[1,t-1]),
                      tran_move(theta_posterior[2,t-1]))
  p_proposed <- d_posterior(age_samples,mu = theta_proposed[1],
                            sigma = theta_proposed[2])
  p_prev <- d_posterior(age_samples,mu = theta_posterior[1,t-1],
                        sigma = theta_posterior[2,t-1])
  eps <- exp(p_proposed - p_prev)
  #proposal is accepted if posterior density is higher w/ theta_proposed;
  #if not, it is accepted with probability eps
  accept <- rbinom(1,1,prob = min(eps,1))
  accepted[t - 1] <- accept
  if (accept == 1){
    theta_posterior[,t] <- theta_proposed
  } else {
    theta_posterior[,t] <- theta_posterior[,t-1]
  }
}
To plot the resulting posterior distribution, we use the sm package in R:
>library(sm)
>x <- cbind(c(theta_posterior[1,1:num_iter]),c(theta_posterior[2,1:num_iter]))
>xlim <- c(min(x[,1]),max(x[,1]))
>ylim <- c(min(x[,2]),max(x[,2]))
>zlim <- c(0,1)
>sm.density(x,xlab = "mu",ylab = "sigma",zlab = " ",
           zlim = zlim,xlim = xlim,ylim = ylim,col = "white")
>title("Posterior density")
The resulting posterior distribution will look like the following figure:
Though the Metropolis-Hasting algorithm is simple to implement for any Bayesian inference problem, in practice it may not be very efficient in many cases. The main reason for this is that, unless one carefully chooses a proposal distribution q(\theta' \mid \theta), there would be too many rejections and it would take a large number of updates to reach the steady state. This is particularly the case when the number of parameters is high. There are various modifications of the basic Metropolis-Hasting algorithm that try to overcome these difficulties. We will briefly describe these when we discuss various R packages for the Metropolis-Hasting algorithm in the following section.
R packages for the Metropolis-Hasting algorithm
There are several contributed packages in R for MCMC simulation using the
Metropolis-Hasting algorithm, and here we describe some popular ones.
The mcmc package contributed by Charles J. Geyer and Leif T. Johnson is one of the
popular packages in R for MCMC simulations. It has the metrop function for running the
basic Metropolis-Hasting algorithm. The metrop function uses a multivariate normal
distribution as the proposal distribution.
Sometimes, it is useful to make a variable transformation to improve the speed of
convergence in MCMC. The mcmc package has a function named morph for doing this.
Combining these two, the function morph.metrop first transforms the variable, does a
Metropolis on the transformed density, and converts the results back to the original
variable.
Apart from the mcmc package, two other useful packages in R are MHadaptive
contributed by Corey Chivers and the Evolutionary Monte Carlo (EMC) algorithm
package by Gopi Goswami. Due to lack of space, we will not be discussing these two
packages in this book. Interested readers are requested to download these from the CRAN project's site and experiment with them.
Gibbs sampling
As mentioned before, the Metropolis-Hasting algorithm suffers from the drawback of
poor convergence, due to too many rejections, if one does not choose a good proposal
distribution. To avoid this problem, two physicists Stuart Geman and Donald Geman
proposed a new algorithm (reference 5 in the References section of this chapter). This
algorithm is called Gibbs sampling and it is named after the famous physicist J W
Gibbs. Currently, Gibbs sampling is the workhorse of MCMC for Bayesian inference.
Let $\theta = (\theta_1, \theta_2, \ldots, \theta_N)$ be the set of parameters of the model that we wish to estimate:
1. Start with an initial state $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_N^{(0)})$.
2. At each time step t, update the components one by one, by drawing from a distribution conditional on the most recent values of the rest of the components:
$\theta_i^{(t)} \sim P\left(\theta_i \mid \theta_1^{(t)}, \ldots, \theta_{i-1}^{(t)}, \theta_{i+1}^{(t-1)}, \ldots, \theta_N^{(t-1)}\right)$
3. After N steps, all the components of the parameter vector will be updated.
4. Continue with step 2 until the Markov process converges to a steady state.
Gibbs sampling is a very efficient algorithm since there are no rejections. However, to
be able to use Gibbs sampling, the form of the conditional distributions of the posterior
distribution should be known.
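The steps above can be sketched for a bivariate normal target with correlation rho, for which both full conditionals are known univariate normals, as Gibbs sampling requires; the target distribution and the value of rho are illustrative assumptions:

```r
# Gibbs sampler for a bivariate normal with correlation rho:
# x1 | x2 ~ N(rho * x2, 1 - rho^2) and x2 | x1 ~ N(rho * x1, 1 - rho^2)
gibbs_bvn <- function(n_iter, rho, init = c(0, 0)) {
  s <- matrix(NA_real_, nrow = n_iter, ncol = 2)
  x1 <- init[1]; x2 <- init[2]
  sd_c <- sqrt(1 - rho^2)                       # conditional standard deviation
  for (t in seq_len(n_iter)) {
    x1 <- rnorm(1, mean = rho * x2, sd = sd_c)  # draw x1 given x2
    x2 <- rnorm(1, mean = rho * x1, sd = sd_c)  # draw x2 given the new x1
    s[t, ] <- c(x1, x2)
  }
  s
}
set.seed(1)
samples <- gibbs_bvn(5000, rho = 0.8)
cor(samples[, 1], samples[, 2])   # close to 0.8
```

Note that every draw is accepted; the chain moves at each step, unlike in the Metropolis-Hasting algorithm.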
R packages for Gibbs sampling
Unfortunately, there are not many contributed general purpose Gibbs sampling packages
in R. The gibbs.met package provides two generic functions for performing MCMC in a naïve way for user-defined target distributions. The first function is gibbs_met. This
performs Gibbs sampling with each 1-dimensional distribution sampled by using the
Metropolis algorithm, with normal distribution as the proposal distribution. The second
function, met_gaussian, updates the whole state with independent normal distribution
centered around the previous state. The gibbs.met package is useful for general purpose
MCMC on moderate dimensional problems.
In the Exercises section of this chapter, we will discuss one problem that involves sampling from the two-dimensional normal distribution by using both the Metropolis-Hasting algorithm and Gibbs sampling, to make these concepts clearer. Readers can use the aforementioned packages for solving this exercise.
Apart from the general purpose MCMC packages, there are several packages in R
designed to solve a particular type of machine-learning problems. The GibbsACOV
package can be used for one-way mixed-effects ANOVA and ANCOVA models. The
lda package performs collapsed Gibbs sampling methods for topic (LDA) models. The
stocc package fits a spatial occupancy model via Gibbs sampling. The binomlogit
package implements an efficient MCMC for Binomial Logit models. The bmk package provides diagnostics of MCMC output. Bayesian Output Analysis Program (BOA) is
another similar package. RBugs is an interface of the well-known OpenBUGS MCMC
package. The ggmcmc package is a graphical tool for analyzing MCMC simulation.
MCMCglmm is a package for generalized linear mixed models and BoomSpikeSlab is a
package for doing MCMC for Spike and Slab regression. Finally, SamplerCompare is
a package (more of a framework) for comparing the performance of various MCMC
packages.
Variational approximation
In the variational approximation scheme, one assumes that the posterior distribution can be approximated in a factorized form:

$P(\theta \mid X) \approx Q(\theta) = \prod_{i=1}^{M} q_i(\theta_i \mid X)$

Note that the factorized form is also a conditional distribution, so each $q_i(\theta_i \mid X)$ can have dependence on the other $\theta_i$s through the conditioned variable X. In other words, this is not a trivial factorization making each parameter independent. The advantage of this factorization is that one can choose more analytically tractable forms of the distribution functions $q_i$. In fact, one can vary the functions $q_i$ in such a way that $Q(\theta)$ is as close to the true posterior $P(\theta \mid X)$ as possible. This is mathematically formulated as a variational calculus problem, as explained here.
Let's use some measure to compute the distance between the two probability distributions $Q(\theta)$ and $P(\theta \mid X)$, where $\theta = (\theta_1, \ldots, \theta_M)$. One of the standard measures of distance between probability distributions is the Kullback-Leibler divergence, or KL-divergence for short. It is defined as follows:

$KL(Q \,\|\, P) = \sum_{\theta} Q(\theta) \ln \frac{Q(\theta)}{P(\theta \mid X)}$

The reason why it is called a divergence and not a distance is that $KL(Q \,\|\, P)$ is not symmetric with respect to Q and P. One can use the relation $P(\theta \mid X) = P(X, \theta)/P(X)$ and rewrite the preceding expression as an equation for $\ln P(X)$:

$\ln P(X) = \mathcal{L}(Q) + KL(Q \,\|\, P)$

Here:

$\mathcal{L}(Q) = \sum_{\theta} Q(\theta) \ln \frac{P(X, \theta)}{Q(\theta)}$
Note that, in the equation for $\ln P(X)$, there is no dependence on Q on the LHS. Therefore, maximizing $\mathcal{L}(Q)$ with respect to Q will minimize $KL(Q \,\|\, P)$, since their sum is a term independent of Q. By choosing analytically tractable functions for Q, one can do this maximization in practice. It will result in both an approximation for the posterior and a lower bound for $\ln P(X)$, which is the logarithm of the evidence or marginal likelihood, since $KL(Q \,\|\, P) \geq 0$ implies $\ln P(X) \geq \mathcal{L}(Q)$.
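The asymmetry that makes KL a divergence rather than a distance is easy to verify numerically for discrete distributions; a small sketch, in which the two distributions are arbitrary examples:

```r
# KL-divergence between two discrete distributions q and p (same support)
kl_div <- function(q, p) sum(q * log(q / p))

q <- c(0.8, 0.1, 0.1)
p <- c(1/3, 1/3, 1/3)
kl_div(q, p)   # KL(q || p)
kl_div(p, q)   # KL(p || q): a different value, hence not a true distance
```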
Therefore, variational approximation gives us two quantities in one shot. A posterior
distribution can be used to make predictions about future observations (as explained in
the next section) and a lower bound for evidence can be used for model selection.
How does one implement this minimization of KL-divergence in practice? Without
going into mathematical details, here we write a final expression for the solution:
Here,
implies that the expectation of the logarithm of the joint
distribution
is taken over all the parameters
except for . Therefore, the
minimization of KL-divergence leads to a set of coupled equations; one for each
needs to be solved self-consistently to obtain the final solution. Though the
variational approximation looks very complex mathematically, it has a very simple,
intuitive explanation. The posterior distribution of each parameter is obtained by
averaging the log of the joint distribution over all the other variables. This is analogous
to the Mean Field theory in physics where, if there are N interacting charged particles,
the system can be approximated by saying that each particle is in a constant external
field, which is the average of fields produced by all the other particles.
We will end this section by mentioning a few R packages for variational approximation.
The VBmix package can be used for variational approximation in Bayesian mixture
models. A similar package is vbdm used for Bayesian discrete mixture models. The
package vbsr is used for variational inference in Spike Regression Regularized Linear
Models.
Prediction of future observations
Once we have the posterior distribution inferred from data using some of the methods
described already, it can be used to predict future observations. The probability of
observing a value Y, given observed data X, and the posterior distribution of parameters $\theta$, is given by:

$P(Y \mid X) = \int P(Y \mid \theta) \, P(\theta \mid X) \, d\theta$

Note that, in this expression, the likelihood function $P(Y \mid \theta)$ is averaged by using the distribution of the parameter given by the posterior $P(\theta \mid X)$. This is, in fact, the
core strength of the Bayesian inference. This Bayesian averaging eliminates the
uncertainty in estimating the parameter values and makes the prediction more robust.
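This Bayesian averaging can be approximated by Monte Carlo: draw parameter values from the posterior and average the likelihood over them. A minimal sketch for a normal model with unknown mean, where the posterior draws are an assumed stand-in rather than ones estimated from data:

```r
# Posterior predictive by Monte Carlo: average the likelihood over
# posterior draws of the parameter (here, the mean of a normal model)
set.seed(7)
theta_post <- rnorm(10000, mean = 1.2, sd = 0.3)  # assumed posterior draws
y_grid <- seq(-2, 4, length.out = 121)
h <- y_grid[2] - y_grid[1]
pred <- sapply(y_grid, function(y) mean(dnorm(y, mean = theta_post, sd = 1)))
sum(pred) * h   # the predictive density integrates to about 1 over the grid
```

The resulting predictive density is wider than the likelihood alone, because parameter uncertainty is folded in.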
Exercises
1. Derive the equation for the posterior mean by expanding the square in the exponential for each i, collecting all similar power terms, and making a perfect square again. Note that the product of exponentials can be written as the exponential of a sum of terms.
2. For this exercise, we use the dataset corresponding to Smartphone-Based Recognition of Human Activities and Postural Transitions, from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions). It contains values of acceleration taken from an accelerometer on a smartphone. The original dataset contains the x, y, and z components of the acceleration and the corresponding timestamp values. For this exercise, we use only the two horizontal components of the acceleration, x and y. In this exercise, let's assume that the acceleration follows a normal distribution. Let's also assume a normal prior distribution for the mean values of acceleration, with a hyperparameter for the mean that is uniformly distributed in the interval (-0.5, 0.5) and a known variance equal to 1. Find the posterior mean value by using the expression given in the equation.
3. Write an R function to compute the Fisher information matrix. Obtain the Fisher information matrix for this problem by using the dataset mentioned in exercise 2 of this section.
4. Set up an MCMC simulation for this problem by using the mcmc package in R. Plot a histogram of the simulated data.
5. Set up an MCMC simulation using Gibbs sampling. Compare the results with those of the Metropolis algorithm.
References
1. Berger J.O. Statistical Decision Theory and Bayesian Analysis. Springer Series
in Statistics. 1993. ISBN-10: 0387960988
2. Jaynes E.T. Probability Theory: The Logic of Science. Cambridge University
Press. 2003. ISBN-10: 052159271
3. Wittman D. "Fisher Matrix for Beginners". Physics Department, University of
California at Davis (http://www.physics.ucdavis.edu/~dwittman/Fisher-matrix-guide.pdf)
4. Metropolis N, Rosenbluth A.W., Rosenbluth M.N., Teller A.H., Teller E.
"Equations of State Calculations by Fast Computing Machines". Journal of
Chemical Physics 21 (6): 1087–1092. 1953
5. Geman S., Geman D. "Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images". IEEE Transactions on Pattern Analysis and
Machine Intelligence 6 (6): 721-741. 1984
Summary
In this chapter, we covered the basic principles of Bayesian inference. Starting with how uncertainty is treated differently in Bayesian statistics compared to classical statistics, we discussed the various components of Bayes' rule in depth. First, we learned about the different types of prior distributions and how to choose the right one for your problem.
Then we learned the estimation of posterior distribution using techniques such as MAP
estimation, Laplace approximation, and MCMC simulations. Once the readers have
comprehended this chapter, they will be in a position to apply Bayesian principles in
their data analytics problems. Before we start discussing specific Bayesian machine
learning problems, in the next chapter, we will review machine learning in general.
Chapter 4. Machine Learning Using
Bayesian Inference
Now that we have learned about Bayesian inference and R, it is time to use both for
machine learning. In this chapter, we will give an overview of different machine
learning techniques and discuss each of them in detail in subsequent chapters. Machine
learning is a field at the intersection of computer science and statistics, and a sub-branch
of artificial intelligence, or AI. The name essentially comes from the early works in AI
where researchers were trying to develop learning machines that automatically learned
the relationship between input and output variables from data alone. Once a machine is
trained on a dataset for a given problem, it can be used as a black box to predict values
of output variables for new values of input variables.
It is useful to set this learning process of a machine in a mathematical framework. Let X
and Y be two random variables such that we seek a learning machine that learns the
relationship between these two variables from data and predicts the value of Y, given
the value of X. The system is fully characterized by a joint probability distribution P(X, Y); however, the form of this distribution is unknown. The goal of learning is to find a function f(X), which maps from X to Y, such that the predictions $\hat{Y} = f(X)$ contain as small an error as possible. To achieve this, one chooses a loss function L(Y, f(X)) and finds an f(X) that minimizes the expected or average loss over the joint distribution of X and Y, given by:

$E[L] = \int L(Y, f(X)) \, P(X, Y) \, dX \, dY$
In statistical decision theory, this expected loss is called the risk; in practice, one minimizes its sample average over a training set, which is called empirical risk minimization. The typical loss function used is the square loss function, $L(Y, f(X)) = (Y - f(X))^2$, if Y is a continuous variable, and the Hinge loss function, $L(Y, f(X)) = \max(0, 1 - Y f(X))$, if Y is a binary discrete variable with values $\pm 1$. The first case is typically called regression and the second case is called binary classification, as we will see later in this chapter.
The mathematical framework described here is called supervised learning, where the
machine is presented with a training dataset containing ground truth values
corresponding to pairs (Y, X). Let us consider the case of the square loss function again. Here, the learning task is to find an f(X) that minimizes the following:

$E[L] = \int (Y - f(X))^2 \, P(Y \mid X) \, P(X) \, dY \, dX$

Since the objective is to predict values of Y for the given values of X, we have used the conditional distribution P(Y|X) inside the integral using the factorization of P(X, Y). It can be shown that the minimization of the preceding loss function leads to the following solution:

$f(X) = E[Y \mid X] = \int Y \, P(Y \mid X) \, dY$

The meaning of the preceding equation is that the best prediction of Y for any input value of X is the mean, or expectation denoted by E, of the conditional probability distribution P(Y|X) conditioned at X.
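This result is easy to check numerically: among all constant predictions at a given X, the sample mean of draws from P(Y|X) minimizes the average squared loss. A small sketch with simulated data, where the distribution parameters are arbitrary choices:

```r
# For squared loss, the best prediction at a given X is the conditional mean:
# mean(y) minimizes mean((y - c)^2) over all constants c
set.seed(3)
y <- rnorm(10000, mean = 2, sd = 1)        # draws standing in for P(Y | X = x)
sq_loss <- function(c) mean((y - c)^2)
opt <- optimize(sq_loss, interval = c(-10, 10))
opt$minimum                                # close to mean(y), i.e., E[Y | X = x]
```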
In Chapter 3, Introducing Bayesian Inference, we mentioned maximum likelihood
estimation (MLE) as a method for learning the parameters $\theta$ of any distribution P(X). In general, MLE is the same as the minimization of a square loss function if the underlying distribution is a normal distribution.
Note that, in empirical risk minimization, we are learning the parameter $E[Y \mid X]$, the mean of the conditional distribution, for a given value of X. We will use one particular
machine learning task, linear regression, to explain the advantage of Bayesian inference
over the classical method of learning. However, before this, we will briefly explain
some more general aspects of machine learning.
There are two types of supervised machine learning models, namely generative models
and discriminative models. In the case of generative models, the algorithm tries to learn
the joint probability of X and Y, which is P(X,Y), from data and uses it to estimate the mean of P(Y|X). In the case of discriminative models, the algorithm tries to directly learn the desired function, which is the mean of P(Y|X), and no modeling of the X variable is attempted.
Labeling values of the target variable in the training data is done manually. This makes
supervised learning very expensive when one needs to use very large datasets as in the
case of text analytics. However, very often, supervised learning methods produce the
most accurate results.
If there is not enough training data available for learning, one can still use machine
learning through unsupervised learning. Here, the learning is mainly through the
discovery of patterns of associations between variables in the dataset. Clustering data
points that have similar features is a classic example.
Reinforcement learning is the third type of machine learning, where the learning takes
place in a dynamic environment where the machine needs to perform certain actions
based on its current state. Associated with each action is a reward. The machine needs
to learn what action needs to be taken at each state so that the total reward is
maximized. This is typically how a robot learns to perform tasks, such as driving a
vehicle, in a real-life environment.
Why Bayesian inference for machine
learning?
We have already discussed the advantages of Bayesian statistics over classical
statistics in the last chapter. In this chapter, we will see in more detail how some of the
concepts of Bayesian inference that we learned in the last chapter are useful in the
context of machine learning. For this purpose, we take one simple machine learning
task, namely linear regression. Let us consider a learning task where we have a dataset D containing N pairs of points $(x_i, y_i)$, and the goal is to build a machine learning model using linear regression so that it can be used to predict values of $y$, given new values of $x$.
In linear regression, first, we assume that Y is of the following form:

$Y = F(X) + \epsilon$

Here, F(X) is a function that captures the true relationship between X and Y, and $\epsilon$ is an error term that captures the inherent noise in the data. It is assumed that this noise is characterized by a normal distribution with mean 0 and variance $\sigma^2$. What this implies is that, even with an infinite training dataset from which we could learn the form of F(X) exactly, we could still only predict Y up to the additive noise term $\epsilon$. In practice, we will have only a finite training dataset D; hence, we will be able to learn only an approximation for F(X), denoted by $\hat{F}(X)$.
Note that we are discussing two types of errors here. One is the error term $\epsilon$ that is due to the inherent noise in the data, which we cannot do much about. The second error is in learning F(X) approximately, through the function $\hat{F}(X)$, from the dataset D.

In general, $\hat{F}(X)$, which is the approximate mapping between the input variable X and the output variable Y, is a function of X with a set of parameters $\theta$. When $\hat{F}(X)$ is a linear function of the parameters $\theta$, we say the learning model is linear. It is a general misconception that linear regression corresponds only to the case where $\hat{F}(X)$ is a linear function of X. The reason for linearity in the parameters and not in X is that, during the minimization of the loss function, one actually minimizes over the parameter values to find the best $\hat{F}(X)$. Hence, a function that is linear in $\theta$ will lead to an optimization problem that can be tackled analytically and numerically more easily.
Therefore, linear regression corresponds to the following:
$\hat{F}(X) = \sum_{m=1}^{M} \theta_m b_m(X) = \theta^T B(X)$

This is an expansion over a set of M basis functions $b_m(X)$. Here, each basis function $b_m(X)$ is a function of X without any unknown parameters. In machine learning, these $b_m(X)$ are called feature functions or model features. For the linear regression problem, the loss function, therefore, can be written as follows:

$L = \sum_{i=1}^{N} \left(y_i - \theta^T B(x_i)\right)^2$

Here, $\theta^T$ is the transpose of the parameter vector $\theta$, and $B(X)$ is the vector composed of the basis functions $b_m(X)$. Learning $\hat{F}(X)$ from a dataset implies estimating the values of $\theta$ by minimizing the loss function through some optimization scheme such as gradient descent.
It is important to choose enough basis functions to capture the interesting patterns in the data. However, choosing too many basis functions or features will overfit the model, in the sense that it will even start fitting the noise contained in the data. Overfitting will lead to poor predictions on new input data. Therefore, it is important
to choose an optimum number of best features to maximize the predictive accuracy of
any machine learning model. In machine learning based on classical statistics, this is
achieved through what is called bias-variance tradeoff and model regularization.
Whereas, in machine learning through Bayesian inference, accuracy of a predictive
model can be maximized through Bayesian model averaging, and there is no need to
impose model regularization or bias-variance tradeoff. We will learn each of these
concepts in the following sections.
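Overfitting is easy to demonstrate with a small simulation, where the true relationship is linear and a degree-9 polynomial has excess capacity; the data-generating model and the seed are arbitrary choices:

```r
# Overfitting sketch: a degree-9 polynomial fits the training data better
# than a straight line but generalizes worse to fresh data from the same model
set.seed(11)
n <- 20
x <- runif(n); y <- 1 + 2 * x + rnorm(n, sd = 0.3)   # true model is linear
fit1 <- lm(y ~ x)                                     # simple model
fit9 <- lm(y ~ poly(x, 9))                            # overly complex model
x_new <- runif(1000); y_new <- 1 + 2 * x_new + rnorm(1000, sd = 0.3)
mse <- function(fit, x, y) mean((y - predict(fit, data.frame(x = x)))^2)
c(train = mse(fit1, x, y), test = mse(fit1, x_new, y_new))
c(train = mse(fit9, x, y), test = mse(fit9, x_new, y_new))
```

The complex model always achieves lower training error, but its test error on fresh data is inflated because it has fitted the noise.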
Model overfitting and bias-variance
tradeoff
The expected loss mentioned in the previous section can be written as a sum of three terms in the case of linear regression using the squared loss function, as follows:

$E[L] = (\text{Bias})^2 + \text{Variance} + \text{Noise}$

Here, Bias is the difference $F(X) - E_D[\hat{F}(X; D)]$ between the true model F(X) and the average value of $\hat{F}(X; D)$ taken over an ensemble of datasets. Bias is a measure of how much the average prediction over all datasets in the ensemble differs from the true regression function F(X). Variance is given by $E_D\left[\left(\hat{F}(X; D) - E_D[\hat{F}(X; D)]\right)^2\right]$. It is a measure of the extent to which the solution for a given dataset varies around the mean over all datasets. Hence, Variance is a measure of how sensitive the function $\hat{F}(X; D)$ is to the particular choice of dataset D. The third term, Noise, as mentioned earlier, is the expectation of the squared difference $(Y - F(X))^2$ between the observation and the true regression function, over all the values of X and Y. Putting all these together, we can write the following:

$E[L] = \int \left(F(X) - E_D[\hat{F}(X; D)]\right)^2 P(X) \, dX + \int E_D\left[\left(\hat{F}(X; D) - E_D[\hat{F}(X; D)]\right)^2\right] P(X) \, dX + \int \left(Y - F(X)\right)^2 P(X, Y) \, dX \, dY$
The objective of machine learning is to learn, from data, the function $\hat{F}(X; D)$ that minimizes the expected loss E[L]. One can keep reducing the bias by keeping more and more basis functions in the model and thereby increasing the model's complexity. However, since each of the model parameters $\theta_m$ is learned from a given dataset, the more complex the model becomes, the more sensitive its parameter estimation would be to the dataset used. This results in increased variance for more complex models. Hence, in any supervised machine learning task, there is a tradeoff between model bias and model variance. One has to choose a model of optimum complexity to minimize the
error of prediction on an unseen dataset. In the classical or frequentist approach, this is
done by partitioning the labeled data into three sets. One is the training set, the second is
the validation set, and the third is the test set. Models of different complexity that are
trained using the training set are evaluated using the validation dataset to choose the
model with optimum complexity. It is then, finally, evaluated against the test set to
estimate the prediction error.
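The decomposition above can be illustrated by fitting models of two complexities to many datasets drawn from the same true model and comparing the squared bias and variance of their predictions at a fixed test point; the true function, noise level, and degrees are illustrative assumptions:

```r
# Bias-variance sketch: an ensemble of datasets from F(x) = sin(2*pi*x)
set.seed(5)
F_true <- function(x) sin(2 * pi * x)
x0 <- 0.25                                 # fixed test point
fit_once <- function(degree) {
  x <- runif(30); y <- F_true(x) + rnorm(30, sd = 0.3)
  fit <- lm(y ~ poly(x, degree))
  predict(fit, data.frame(x = x0))         # this dataset's prediction at x0
}
preds1 <- replicate(500, fit_once(1))      # simple model: high bias
preds7 <- replicate(500, fit_once(7))      # complex model: high variance
c(bias2 = (mean(preds1) - F_true(x0))^2, var = var(preds1))
c(bias2 = (mean(preds7) - F_true(x0))^2, var = var(preds7))
```

The simple model misses the true curve on average (large squared bias, small variance), while the complex model tracks it on average but fluctuates much more from dataset to dataset.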
Selecting models of optimum complexity
There are different ways of selecting models with the right complexity so that the
prediction error on unseen data is less. Let's discuss each of these approaches in the
context of the linear regression model.
Subset selection
In the subset selection approach, one selects only a subset of the whole set of variables,
which are significant, for the model. This not only increases the prediction accuracy of
the model by decreasing model variance, but it is also useful from the interpretation
point of view. There are different ways of doing subset selection, but the following two
are the most commonly used approaches:
Forward selection: In forward selection, one starts with no variables (intercept
alone), and by using a greedy algorithm, adds other variables one by one. For each
step, the variable that most improves the fit is chosen to add to the model.
Backward selection: In backward selection, one starts with the full model and
sequentially deletes the variable that has the least impact on the fit. At each step,
the variable with the least Z-score is selected for elimination. In statistics, the Z-score of an estimated coefficient is the ratio of the estimate to its standard error; it measures how many standard deviations the estimate lies away from zero. A small value of Z-score (typically < 2) indicates that the effect of the variable is more likely due to chance and is not statistically significant.
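Both strategies can be sketched in R with the built-in step function, which performs stepwise selection by AIC rather than by Z-scores; the data-generating model below is an arbitrary assumption, with one pure noise predictor:

```r
# Forward and backward selection using AIC-based stepwise search (stats::step)
set.seed(21)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)      # x3 is pure noise
y <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)
d <- data.frame(y, x1, x2, x3)

full <- lm(y ~ x1 + x2 + x3, data = d)
null <- lm(y ~ 1, data = d)                          # intercept-only model

fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)
bwd <- step(full, direction = "backward", trace = 0)
formula(fwd)   # typically retains x1 and x2, the truly informative variables
```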
Model regularization
In this approach, one adds a penalty term to the loss function that does not allow the size
of the parameter to become very large during minimization. There are two main ways of
doing this:
Ridge regression: This is a simple type of regularization, where the additional term is proportional to the squared magnitude of the parameter vector, given by $\lambda \|\theta\|^2$. The loss function for linear regression with the regularization term can be written as follows:

$L = \sum_{i=1}^{N} \left(y_i - \theta^T B(x_i)\right)^2 + \lambda \sum_{m=1}^{M} \theta_m^2$

Parameters having a large magnitude will contribute more to the loss. Hence, minimization of the preceding loss function will typically produce parameters having small values and reduce the overfit. The optimum value of $\lambda$ is found from the validation set.
Lasso: In Lasso also, one adds a penalty term similar to ridge regression, but the term is proportional to the sum of the moduli of the parameters and not their squares:

$L = \sum_{i=1}^{N} \left(y_i - \theta^T B(x_i)\right)^2 + \lambda \sum_{m=1}^{M} |\theta_m|$

Though this looks like a simple change, Lasso has some very important differences with respect to ridge regression. First of all, the presence of the $|\theta_m|$ term makes the loss function non-differentiable in the parameters $\theta$. Unlike ridge regression, for which a closed-form solution is available, the corresponding minimization problem has to be solved numerically, for example as a quadratic programming problem. Due to the particular form of the penalty, when the coefficients shrink as a result of minimization, some of them eventually become exactly zero. So, Lasso also performs, in some sense, a subset selection.
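Since the ridge solution can be written in closed form, $\theta = (B^T B + \lambda I)^{-1} B^T y$, it is easy to sketch in a few lines of base R; the simulated data and the value of $\lambda$ are arbitrary choices:

```r
# Closed-form ridge regression: theta = (B'B + lambda I)^{-1} B'y
set.seed(8)
n <- 50; M <- 5
B <- cbind(1, matrix(rnorm(n * (M - 1)), n, M - 1))  # basis matrix with intercept
theta_true <- c(1, 2, -1, 0, 0)
y <- as.vector(B %*% theta_true + rnorm(n, sd = 0.5))

ridge <- function(B, y, lambda) {
  solve(t(B) %*% B + lambda * diag(ncol(B)), t(B) %*% y)
}
theta_ols   <- ridge(B, y, 0)    # lambda = 0 recovers ordinary least squares
theta_ridge <- ridge(B, y, 10)   # shrinks the coefficients toward zero
sum(theta_ridge^2) < sum(theta_ols^2)   # TRUE: ridge shrinks the parameter norm
```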
A detailed discussion of various subset selection and model regularization approaches
can be found in the book by Trevor Hastie et.al (reference 1 in the References section of
this chapter).
Bayesian averaging
So far, we have learned that simply minimizing the loss function (or equivalently
maximizing the log likelihood function in the case of normal distribution) is not enough
to develop a machine learning model for a given problem. One has to worry about
models overfitting the training data, which will result in larger prediction errors on new
datasets. The main advantage of Bayesian methods is that one can, in principle, get
away from this problem, without using explicit regularization and different datasets for
training and validation. This is called Bayesian model averaging and will be discussed
here. This is one of the answers to our main question of the chapter, why Bayesian
inference for machine learning?
For this, let's do a full Bayesian treatment of the linear regression problem. Since we
only want to explain how Bayesian inference avoids the overfitting problem, we will
skip all the mathematical derivations and state only the important results here. For more
details, interested readers can refer to the book by Christopher M. Bishop (reference 2
in the References section of this chapter).
The linear regression equation $Y = f(X) + \epsilon$, with $\epsilon$ having a normal distribution with zero mean and variance $\sigma^2$ (equivalently, precision $\beta = 1/\sigma^2$), can be cast in a probability distribution form, with Y having a normal distribution with mean f(X) and precision $\beta$. Therefore, linear regression is equivalent to estimating the mean of the normal distribution:

$P(Y \mid X, \theta) = N\left(Y \mid f(X), \beta^{-1}\right)$

Since $f(X) = \theta^T B(X)$, where the set of basis functions B(X) is known, and we are assuming here that the noise parameter $\beta$ is also a known constant, only $\theta$ needs to be taken as an uncertain variable for a fully Bayesian treatment.
The first step in Bayesian inference is to compute a posterior distribution of the parameter vector $\theta$. For this, we assume that the prior distribution of $\theta$ is an M-dimensional normal distribution (since there are M components) with mean $m_0$ and covariance matrix $S_0$. As we have seen in Chapter 3, Introducing Bayesian Inference, this corresponds to taking a conjugate distribution for the prior:

$P(\theta) = N(\theta \mid m_0, S_0)$

The corresponding posterior distribution is given by:

$P(\theta \mid X, Y) = N(\theta \mid m_N, S_N)$

Here, $m_N = S_N\left(S_0^{-1} m_0 + \beta \mathbf{B}^T y\right)$ and $S_N^{-1} = S_0^{-1} + \beta \mathbf{B}^T \mathbf{B}$.

Here, $\mathbf{B}$ is an N x M matrix formed by stacking the basis vectors, evaluated at the different values of X, on top of each other, as shown here:

$\mathbf{B} = \begin{pmatrix} B(x_1)^T \\ B(x_2)^T \\ \vdots \\ B(x_N)^T \end{pmatrix}$
Now that we have the posterior distribution for $\theta$ as a closed-form analytical expression, we can use it to predict new values of Y. To get an analytical closed-form expression for the predictive distribution of Y, we make the assumptions that $m_0 = 0$ and $S_0 = \alpha^{-1} I$. This corresponds to a prior with zero mean and an isotropic covariance matrix characterized by a single precision parameter $\alpha$. The predictive distribution, or the probability that the prediction for a new value of X = x is y, is given by:

$P(y \mid x, Y, X) = \int P(y \mid x, \theta) \, P(\theta \mid Y, X) \, d\theta$
This equation is the central theme of this section. In the classical or frequentist approach, one estimates a particular value $\hat{\theta}$ for the parameter from the training dataset and finds the probability of predicting y by simply using $P(y \mid x, \hat{\theta})$. This does not address the overfitting of the model unless regularization is used. In Bayesian inference, we are integrating out the parameter variable $\theta$ by using its posterior probability distribution $P(\theta \mid Y, X)$ learned from the data. This averaging will remove the necessity of using regularization or keeping the parameters at an optimal level through the bias-variance tradeoff. This can be seen from the closed-form expression for P(y|x), after we substitute the expressions for $P(y \mid x, \theta)$ and $P(\theta \mid Y, X)$ for the linear regression problem and do the integration. Since both are normal distributions, the integration can be done analytically, resulting in the following simple expression for P(y|x):

$P(y \mid x, Y, X) = N\left(y \mid m_N^T B(x), \sigma_N^2(x)\right)$

Here, $\sigma_N^2(x) = \frac{1}{\beta} + B(x)^T S_N B(x)$.
This equation implies that the variance of the predictive distribution consists of two terms: one term, $1/\beta$, coming from the inherent noise in the data, and a second term, $B(x)^T S_N B(x)$, coming from the uncertainty associated with the estimation of the model parameter $\theta$ from data. One can show that as the size of the training data N becomes very large, the second term decreases, and in the limit $N \to \infty$ it becomes zero.
The example shown here illustrates the power of Bayesian inference. Since one can take
care of uncertainty in the parameter estimation through Bayesian averaging, one doesn't
need to keep separate validation data and all the data can be used for training. So, a full
Bayesian treatment of a problem avoids the overfitting issue. Another major advantage
of Bayesian inference, which we will not go into in this section, is treating latent
variables in a machine learning model. In the next section, we will give a high-level
overview of the various common machine learning tasks.
An overview of common machine
learning tasks
This section is a prequel to the following chapters, where we will discuss different
machine learning techniques in detail. At a high level, there are only a handful of tasks
that machine learning tries to address. However, for each of such tasks, there are
several approaches and algorithms in place.
The typical tasks in any machine learning are one of the following:
Classification
Regression
Clustering
Association rules
Forecasting
Dimensional reduction
Density estimation
In classification, the objective is to assign a new data point to one of the predetermined
classes. Typically, this is either a supervised or semi-supervised learning problem. The
well-known machine learning algorithms used for classification are logistic regression,
support vector machines (SVM), decision trees, Naïve Bayes, neural networks,
Adaboost, and random forests. Here, Naïve Bayes is a Bayesian inference-based
method. Other algorithms, such as logistic regression and neural networks, have also
been implemented in the Bayesian framework.
Regression is probably the most common machine learning problem. It is used to
determine the relation between a set of input variables (typically, continuous variables)
and an output (dependent) variable that is continuous. We discussed the simplest
example of linear regression in some detail in the previous section. More complex
examples of regression are generalized linear regression, spline regression, nonlinear regression using neural networks, and support vector regression. Bayesian formulations of regression include the Bayesian network and Bayesian linear regression.
Clustering is a classic example of unsupervised learning. Here, the objective is to group
together similar items in a dataset based on certain features of the data. The number of
clusters is not known in advance. Hence, clustering is more of a pattern detection
problem. The well-known clustering algorithms are K-means clustering, hierarchical
clustering, and Latent Dirichlet allocation (LDA). Of these, LDA is formulated as a Bayesian inference problem. Other clustering methods using Bayesian inference include
the Bayesian mixture model.
Association rule mining is an unsupervised method that finds items that are co-occurring
in large transactions of data. The market basket analysis, which finds the items that are
sold together in a supermarket, is based on association rule mining. The Apriori
algorithm and frequent pattern matching algorithm are two main methods used for
association rule mining.
Forecasting is similar to regression, except that the data is a time series where there are
observations with different values of time stamp and the objective is to predict future
values based on the current and past values. For this purpose, one can use methods such
as ARIMA, neural networks, and dynamic Bayesian networks.
One of the fundamental issues in machine learning is called the curse of dimensionality.
Since there can be a large number of features in a machine learning model, the typical
minimization of error that one has to do to estimate model parameters will involve
search and optimization in a high-dimensional space. Most often, data will be very sparse in this high-dimensional space. This can make the search for optimal
parameters very inefficient. To avoid this problem, one tries to project this higher
dimensional space into a lower dimensional space containing a few important
variables. One can then use these lower dimensional variables as features. The two
well-known examples of dimensional reduction are principal component analysis and
self-organized maps.
Often, the probability distribution of a population is directly estimated, without any
parametric models, from a small amount of observed data for making inferences. This is
called density estimation. The simplest form of density estimation is the histogram, though it is not adequate for many practical applications. The more sophisticated density
estimations are kernel density estimation (KDE) and vector quantization.
References
1. Friedman J., Hastie T., and Tibshirani R. The Elements of Statistical Learning –
Data Mining, Inference, and Prediction. Springer Series in Statistics. 2009
2. Bishop C.M. Pattern Recognition and Machine Learning (Information Science
and Statistics). Springer. 2006. ISBN-10: 0387310738
Summary
In this chapter, we got an overview of what machine learning is and what some of its
high-level tasks are. We also discussed the importance of Bayesian inference in
machine learning, particularly how it helps to avoid important issues such as model
overfitting and how to select models of optimum complexity. In the coming chapters,
we will learn some of the Bayesian machine learning methods in detail.
Chapter 5. Bayesian Regression Models
In the previous chapter, we covered the theory of Bayesian linear regression in some
detail. In this chapter, we will take a sample problem and illustrate how it can be
applied to practical situations. For this purpose, we will use the generalized linear
model (GLM) packages in R. First, we will give readers a brief introduction to the
concept of the GLM.
Generalized linear regression
Recall that in linear regression, we assume the following functional form between the
dependent variable Y and independent variable X:

$$Y(X, \beta) = \beta^T \phi(X) + \epsilon$$

Here, $\phi(X) = (\phi_0(X), \phi_1(X), \ldots, \phi_{M-1}(X))$ is a set of basis functions and
$\beta = (\beta_0, \beta_1, \ldots, \beta_{M-1})$ is the parameter vector. Usually, it is assumed that
$\phi_0(X) = 1$, so $\beta_0$ represents an intercept or a bias term. Also, it is assumed that $\epsilon$ is a noise term
distributed according to the normal distribution with mean zero and variance $\sigma^2$. We
also showed that this results in the following equation:

$$E[Y|X] = \beta^T \phi(X)$$

One can generalize the preceding equation to incorporate not only the normal
distribution for noise but any distribution in the exponential family (reference 1 in the
References section of this chapter). This is done by defining the following equation:

$$g(E[Y|X]) = \beta^T \phi(X)$$
Here, g is called a link function. The well-known models, such as logistic regression,
log-linear models, Poisson regression, and so on, are special cases of the GLM. For
example, in the case of ordinary linear regression, the link function would be the identity
function, $g(y) = y$. For logistic regression, it is the logit function, $g(y) = \ln\left(\frac{y}{1-y}\right)$,
which is the inverse of the logistic function, and for Poisson regression, it is the log
function, $g(y) = \ln(y)$.
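In R, the link function is specified through the family argument of the glm() function. The following sketch, on simulated data, shows the three cases just mentioned; the coefficients used to generate the data are arbitrary assumptions:

```r
set.seed(11)
x <- rnorm(100)

# Ordinary linear regression: gaussian family with identity link
y.lin <- 2 * x + rnorm(100)
fit.lin <- glm(y.lin ~ x, family = gaussian(link = "identity"))

# Logistic regression: binomial family with logit link
y.bin <- rbinom(100, size = 1, prob = 1 / (1 + exp(-2 * x)))
fit.log <- glm(y.bin ~ x, family = binomial(link = "logit"))

# Poisson regression: poisson family with log link
y.cnt <- rpois(100, lambda = exp(0.5 * x))
fit.poi <- glm(y.cnt ~ x, family = poisson(link = "log"))

# Each fitted model records its link function
sapply(list(fit.lin, fit.log, fit.poi), function(f) family(f)$link)
```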
In the Bayesian formulation of GLMs, unlike ordinary linear regression, there are no
closed-form analytical solutions. One needs to specify prior probabilities for the
regression coefficients. Then, their posterior probabilities are typically obtained
through Monte Carlo simulations.
The arm package
In this chapter, for the purpose of illustrating Bayesian regression models, we will use
the arm package of R. This package was developed by Andrew Gelman and coworkers,
and it can be downloaded from http://CRAN.R-project.org/package=arm.
The arm package has the bayesglm function, which implements the Bayesian generalized
linear model with independent normal, t, or Cauchy prior distributions for the model
coefficients. We will use this function to build Bayesian regression models.
The Energy efficiency dataset
We will use the Energy efficiency dataset from the UCI Machine Learning repository for
the illustration of Bayesian regression (reference 2 in the References section of this
chapter). The dataset can be downloaded from the website at
http://archive.ics.uci.edu/ml/datasets/Energy+efficiency. The dataset contains the
measurements of energy efficiency of buildings with different building parameters.
There are two energy efficiency parameters measured: heating load (Y1) and cooling
load (Y2).
The building parameters used are: relative compactness (X1), surface area (X2), wall
area (X3), roof area (X4), overall height (X5), orientation (X6), glazing area (X7), and
glazing area distribution (X8). We will try to predict heating load as a function of all the
building parameters, using both ordinary regression (the lm function in base R) and
Bayesian regression (the bayesglm function of the arm package). We will show that, for
the same dataset, Bayesian regression gives significantly smaller prediction intervals.
Regression of energy efficiency with
building parameters
In this section, we will do a linear regression of the building's energy efficiency
measure, heating load (Y1), as a function of the building parameters. It would be useful
to do a preliminary descriptive analysis to find which building variables are
statistically significant. For this, we will first create bivariate plots of Y1 and all the X
variables. We will also compute the Spearman correlation between Y1 and all the X
variables. The R script for performing these tasks is as follows:
>library(ggplot2)
>library(gridExtra)
>df <- read.csv("ENB2012_data.csv",header = T)
>df <- df[,c(1:9)]
>str(df)
>df[,6] <- as.numeric(df[,6])
>df[,8] <- as.numeric(df[,8])
>attach(df)
>bp1 <- ggplot(data = df,aes(x = X1,y = Y1)) + geom_point()
>bp2 <- ggplot(data = df,aes(x = X2,y = Y1)) + geom_point()
>bp3 <- ggplot(data = df,aes(x = X3,y = Y1)) + geom_point()
>bp4 <- ggplot(data = df,aes(x = X4,y = Y1)) + geom_point()
>bp5 <- ggplot(data = df,aes(x = X5,y = Y1)) + geom_point()
>bp6 <- ggplot(data = df,aes(x = X6,y = Y1)) + geom_point()
>bp7 <- ggplot(data = df,aes(x = X7,y = Y1)) + geom_point()
>bp8 <- ggplot(data = df,aes(x = X8,y = Y1)) + geom_point()
>grid.arrange(bp1,bp2,bp3,bp4,bp5,bp6,bp7,bp8,nrow = 2,ncol = 4)
>detach(df)
>cor.val <- cor(df[,1:8],df[,9],method = "spearman")
>cor.val
[,1]
X1 0.622134697
X2 -0.622134697
X3 0.471457650
X4 -0.804027000
X5 0.861282577
X6 -0.004163071
X7 0.322860320
X8 0.068343464
From the bivariate plots and correlation coefficient values, we can conclude that variables X6
and X8 do not have a significant influence on Y1 and, hence, can be dropped from the
model.
Ordinary regression
Before we look at the Bayesian linear regression, let's do an ordinary linear regression.
The following R code fits the linear regression model using the lm function in the base
R on the training data and predicts the values of Y1 on the test dataset:
>#Removing X6 and X8 since they don't have significant correlation with Y1
>df <- df[,c(1,2,3,4,5,7,9)]
>str(df)
>#Splitting dataset into Train and Test sets in the ratio 80:20
>set.seed(123)
>samp <- sample.int(nrow(df),as.integer(nrow(df)*0.2),replace = F)
>dfTest <- df[samp,]
>dfTrain <- df[-samp,]
>xtest <- dfTest[,1:6]
>ytest <- dfTest[,7]
>library(arm)
>attach(dfTrain)
>#Ordinary Multivariate Regression
>fit.ols <- lm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7,data = dfTrain)
>summary(fit.ols)
>fit.coeff <- fit.ols$coefficients
>ypred.ols <- predict.lm(fit.ols,xtest,interval = "prediction",se.fit = T)
>ypred.ols$fit
>yout.ols <- as.data.frame(cbind(ytest,ypred.ols$fit))
>ols.upr <- yout.ols$upr
>ols.lwr <- yout.ols$lwr
Bayesian regression
To perform Bayesian linear regression, we use the bayesglm() function of the arm
package. As we described in the introduction, for the GLM, if we choose family as
gaussian (the same as normal distribution) and link function as identity, then the
GLM is equivalent to ordinary linear regression. Hence, if we use the bayesglm()
function with the gaussian family and the identity link function, then we are
performing a Bayesian linear regression.
For the Bayesian model, we need to specify a prior distribution. For the Gaussian
distribution, the default settings are prior.mean = 0, prior.scale = NULL, and
prior.df = Inf. The following R code can be used for Bayesian linear regression:
>fit.bayes <- bayesglm(Y1 ~ X1 + X2 + X3 + X4 + X5 +
X7,family=gaussian(link=identity),data=dfTrain,prior.df =
Inf,prior.mean = 0,prior.scale = NULL,maxit = 10000)
>ypred.bayes <- predict.glm(fit.bayes,newdata = xtest,se.fit = T)
>ypred.bayes$fit
To compare the results of the ordinary regression and Bayesian regression, we plot the
prediction on test data with prediction errors for both methods on a single graph. For
this purpose, we will use the ggplot2 package:
>library(ggplot2)
>library(gridExtra)
>yout.ols <- as.data.frame(cbind(ytest,ypred.ols$fit))
>ols.upr <- yout.ols$upr
>ols.lwr <- yout.ols$lwr
>p.ols <- ggplot(data = yout.ols,aes(x = yout.ols$ytest,y =
yout.ols$fit)) + geom_point() + ggtitle("Ordinary Regression
Prediction on Test Data") + labs(x = "Y-Test",y = "Y-Pred")
>p.ols + geom_errorbar(ymin = ols.lwr,ymax = ols.upr)
>yout.bayes <- as.data.frame(cbind(ytest,ypred.bayes$fit))
>names(yout.bayes) <- c("ytest","fit")
>critval <- 1.96 #approx for 95% CI
>bayes.upr <- ypred.bayes$fit + critval * ypred.bayes$se.fit
>bayes.lwr <- ypred.bayes$fit - critval * ypred.bayes$se.fit
>p.bayes <- ggplot(data = yout.bayes,aes(x = yout.bayes$ytest,y =
yout.bayes$fit)) + geom_point() + ggtitle("Bayesian Regression
Prediction on Test Data") + labs(x = "Y-Test",y = "Y-Pred")
>p.bayes + geom_errorbar(ymin = bayes.lwr,ymax = bayes.upr)
>p1 <- p.ols + geom_errorbar(ymin = ols.lwr,ymax = ols.upr)
>p2 <- p.bayes + geom_errorbar(ymin = bayes.lwr,ymax = bayes.upr)
>grid.arrange(p1,p2,ncol = 2)
One can see that the Bayesian approach gives much more compact 95% prediction
intervals compared to ordinary regression. This happens because, in the
Bayesian approach, one computes a distribution of parameters. The prediction is made
using a set of values sampled from the posterior distribution and averaged to get the
final prediction and confidence interval.
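The averaging idea can be sketched independently of the arm package: draw coefficient vectors from an assumed posterior distribution, make one prediction per draw, and summarize the draws. The posterior used here is a made-up independent normal, purely for illustration:

```r
set.seed(5)
# Hypothetical posterior for two coefficients (intercept and slope)
beta.draws <- cbind(rnorm(1000, mean = 2.0, sd = 0.1),
                    rnorm(1000, mean = -0.5, sd = 0.1))

xnew <- c(1, 3)                        # new input, with intercept term

pred.draws <- beta.draws %*% xnew      # one prediction per posterior draw
mean(pred.draws)                       # final point prediction
quantile(pred.draws, c(0.025, 0.975))  # 95% interval from the draws
```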
Simulation of the posterior distribution
If one wants to examine the posterior distributions of the model parameters, the sim()
function of the arm package comes in handy. The following R script will simulate the posterior
distribution of parameters and produce a set of histograms:
>posterior.bayes <- as.data.frame(coef(sim(fit.bayes)))
>attach(posterior.bayes)
>h1 <- ggplot(data = posterior.bayes,aes(x = X1)) + geom_histogram() + ggtitle("Histogram X1")
>h2 <- ggplot(data = posterior.bayes,aes(x = X2)) + geom_histogram() + ggtitle("Histogram X2")
>h3 <- ggplot(data = posterior.bayes,aes(x = X3)) + geom_histogram() + ggtitle("Histogram X3")
>h4 <- ggplot(data = posterior.bayes,aes(x = X4)) + geom_histogram() + ggtitle("Histogram X4")
>h5 <- ggplot(data = posterior.bayes,aes(x = X5)) + geom_histogram() + ggtitle("Histogram X5")
>h7 <- ggplot(data = posterior.bayes,aes(x = X7)) + geom_histogram() + ggtitle("Histogram X7")
>grid.arrange(h1,h2,h3,h4,h5,h7,nrow = 2,ncol = 3)
>detach(posterior.bayes)
Exercises
1. Use the multivariate dataset named Auto MPG from the UCI Machine Learning
repository (reference 3 in the References section of this chapter). The dataset can
be downloaded from the website at
https://archive.ics.uci.edu/ml/datasets/Auto+MPG. The dataset describes
automobile fuel consumption in miles per gallon (mpg) for cars running in
American cities. From the folder containing the datasets, download two files:
auto-mpg.data and auto-mpg.names. The auto-mpg.data file contains the data
and it is in space-separated format. The auto-mpg.names file has several details
about the dataset, including variable names for each column. Build a regression
model for the fuel efficiency (mpg) as a function of displacement (disp), horsepower (hp),
weight (wt), and acceleration (accel), using both OLS and Bayesian GLM. Predict
the values for mpg in the test dataset using both the OLS model and Bayesian GLM
model (using the bayesglm function). Find the Root Mean Square Error (RMSE)
values for OLS and Bayesian GLM and compare the accuracy and prediction
intervals for both the methods.
References
1. Friedman J., Hastie T., and Tibshirani R. The Elements of Statistical Learning –
Data Mining, Inference, and Prediction. Springer Series in Statistics. 2009
2. Tsanas A. and Xifara A. "Accurate Quantitative Estimation of Energy Performance
of Residential Buildings Using Statistical Machine Learning Tools". Energy and
Buildings. Vol. 49, pp. 560-567. 2012
3. Quinlan R. "Combining Instance-based and Model-based Learning". In: Tenth
International Conference of Machine Learning. 236-243. University of
Massachusetts, Amherst. Morgan Kaufmann. 1993. Original dataset is from StatLib
library maintained by Carnegie Mellon University.
Summary
In this chapter, we illustrated how Bayesian regression is more useful for prediction,
giving tighter prediction intervals, using the Energy efficiency dataset and the bayesglm
function of the arm package. We also learned how to simulate the posterior distribution
using the sim function in the same R package. In the next chapter, we will learn about
Bayesian classification.
Chapter 6. Bayesian Classification
Models
We introduced the classification machine learning task in Chapter 4, Machine Learning
Using Bayesian Inference, and said that the objective of classification is to assign a
data record into one of the predetermined classes. Classification is one of the most
studied machine learning tasks and there are several well-established state of the art
methods for it. These include logistic regression models, support vector machines,
random forest models, and neural network models. With sufficient labeled training data,
these models can achieve accuracies above 95% in many practical problems.
Then, the obvious question is, why would you need to use Bayesian methods for
classification? There are two answers to this question. One is that often it is difficult to
get a large amount of labeled data for training. When there are hundreds or thousands of
features in a given problem, one often needs a large amount of training data for these
supervised methods to avoid overfitting. Bayesian methods can overcome this problem
through Bayesian averaging and hence require only a small- to medium-sized training dataset.
Secondly, most of these methods, such as SVM or NN, are black box machines. They
will give you very accurate results, but little insight into which variables are important
for the prediction. Often, in many practical problems, for example, in the diagnosis of a
disease, it is important to identify leading causes. Therefore, a black box approach
would not be sufficient. Bayesian methods have an inherent feature called Automatic
Relevance Determination (ARD) by which important variables in a problem can be
identified.
In this chapter, two Bayesian classification models will be discussed. The first one is
the popular Naïve Bayes method for text classification. The second is the Bayesian
logistic regression model. Before we discuss each of these models, let's review some of
the performance metrics that are commonly used in the classification task.
Performance metrics for classification
To understand the concepts easily, let's take the case of binary classification, where the
task is to classify an input feature vector into one of the two states: -1 or 1. Assume that
1 is the positive class and -1 is the negative class. The predicted output contains only -1
or 1, but there can be two types of errors. Some of the -1 in the test set could be
predicted as 1. This is called a false positive or type I error. Similarly, some of the 1
in the test set could be predicted as -1. This is called a false negative or type II error.
These two types of errors can be represented, in the case of binary classification, in a
confusion matrix as shown below:

Confusion Matrix               Predicted Class
                               Positive   Negative
Actual Class     Positive      TP         FN
                 Negative      FP         TN
From the confusion matrix, we can derive the following performance metrics:

Precision: TP/(TP + FP). This gives the percentage of correct answers in the output
predicted as positive.
Recall: TP/(TP + FN). This gives the percentage of positives in the test dataset that
have been correctly predicted.
F-Score: 2 x Precision x Recall/(Precision + Recall). This is the harmonic mean of
precision and recall.
True positive rate (Tpr): TP/(TP + FN). This is the same as recall.
False positive rate (Fpr): FP/(FP + TN). This gives the percentage of negative classes
classified as positive.
Also, Tpr is called sensitivity and 1 - Fpr is called specificity of the classifier. A plot
of Tpr versus Fpr (sensitivity versus 1 - specificity) is called an ROC curve (it stands
for receiver operating characteristic curve). This is used to find the best threshold
(operating point of the classifier) for deciding whether a predicted output (usually a
score or probability) belongs to class 1 or -1.
Usually, the threshold is taken as the inflection point of the ROC curve that gives the best
performance with the least false predictions. The area under the ROC curve, or AUC, is
another measure of classifier performance. For a purely random model, the ROC curve
will be a straight line along the diagonal. The corresponding value of AUC will be 0.5.
Classifiers with AUC above 0.8 will be considered as good, though this very much
depends on the problem to be solved.
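The metrics defined above can be computed directly from the four confusion-matrix counts; the counts below are hypothetical numbers used only to show the arithmetic:

```r
# Hypothetical confusion-matrix counts
TP <- 90; FN <- 10; FP <- 30; TN <- 70

precision <- TP / (TP + FP)   # correct among predicted positives
recall    <- TP / (TP + FN)   # positives correctly recovered
f.score   <- 2 * precision * recall / (precision + recall)
tpr       <- recall           # true positive rate equals recall
fpr       <- FP / (FP + TN)   # negatives wrongly flagged as positive

c(precision = precision, recall = recall, f.score = f.score,
  sensitivity = tpr, specificity = 1 - fpr)
```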
The Naïve Bayes classifier
The name Naïve Bayes comes from the basic assumption in the model that the
probability of a particular feature $x_i$ is independent of any other feature $x_j$, given the
class label $C_k$. This implies the following:

$$P(x_i | x_j, C_k) = P(x_i | C_k)$$

Using this assumption and the Bayes rule, one can show that the probability of class $C_k$,
given features $(x_1, x_2, \ldots, x_N)$, is given by:

$$P(C_k | x_1, x_2, \ldots, x_N) = \frac{P(C_k) \prod_{i=1}^{N} P(x_i | C_k)}{P(x_1, x_2, \ldots, x_N)}$$

Here, $P(x_1, x_2, \ldots, x_N)$ is the normalization term obtained by summing the numerator
over all the values of k. It is also called Bayesian evidence or partition function Z. The
classifier selects a class label as the target class that maximizes the posterior class
probability $P(C_k | x_1, x_2, \ldots, x_N)$:

$$\hat{C} = \underset{k}{\arg\max}\; P(C_k | x_1, x_2, \ldots, x_N)$$
The Naïve Bayes classifier is a baseline classifier for document classification. One
reason for this is that the underlying assumption, that each feature (words or m-grams) is
independent of the others given the class label, typically holds good for text. Another
reason is that the Naïve Bayes classifier scales well when there is a large number of
documents.
There are two implementations of Naïve Bayes. In Bernoulli Naïve Bayes, features are
binary variables that encode whether a feature (m-gram) is present or absent in a
document. In multinomial Naïve Bayes, the features are frequencies of m-grams in a
document. To avoid issues when the frequency is zero, a Laplace smoothing is done on
the feature vectors by adding a 1 to each count. Let's look at multinomial Naïve Bayes in
some detail.
Let $n_{ki}$ be the number of times the feature $x_i$ occurred in the class $C_k$ in the training
data. Then, the likelihood function of observing a feature vector $X = (x_1, x_2, \ldots, x_N)$,
given a class label $C_k$, is given by:

$$P(X | C_k) \propto \prod_{i=1}^{N} p_{ki}^{x_i}$$

Here, $p_{ki}$ is the probability of observing the feature $x_i$ in the class $C_k$.

Using the Bayesian rule, the posterior probability of observing the class $C_k$, given a feature
vector X, is given by:

$$P(C_k | X) = \frac{P(C_k) \prod_{i=1}^{N} p_{ki}^{x_i}}{Z}$$

Taking the logarithm on both sides and ignoring the constant term Z, we get the
following:

$$\ln P(C_k | X) \propto \ln P(C_k) + \sum_{i=1}^{N} x_i \ln p_{ki}$$

So, by taking the logarithm of the posterior distribution, we have converted the problem into a
linear model with $\ln p_{ki}$ as the coefficients to be determined from data.
This can be easily solved. Generally, instead of term frequencies, one uses TF-IDF
(term frequency multiplied by inverse document frequency), with the document length
normalized, to improve the performance of the model.
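The log-posterior scoring described above can be sketched with a toy two-class, three-feature example. The counts, priors, and the test vector below are all invented for illustration, with Laplace smoothing applied to the counts:

```r
# n_ki: how often each feature occurred in each class (rows = classes)
counts <- matrix(c(10, 1, 4,
                    2, 8, 5),
                 nrow = 2, byrow = TRUE)

# Laplace smoothing: add 1 to every count before normalizing to probabilities
p <- (counts + 1) / rowSums(counts + 1)

prior <- c(0.4, 0.6)   # assumed class priors P(C_k)
x <- c(3, 0, 1)        # feature (term-frequency) vector to classify

# log P(C_k | X) = log P(C_k) + sum_i x_i * log p_ki (up to the constant Z)
scores <- log(prior) + as.vector(log(p) %*% x)
which.max(scores)      # index of the predicted class
```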
The R package e1071 (Miscellaneous Functions of the Department of Statistics,
TU Wien) contains an R implementation of Naïve Bayes. For this chapter, we will use
the SMS spam dataset from the UCI Machine Learning repository (reference 1 in the
References section of this chapter). The dataset consists of 425 SMS spam messages
collected from the UK forum Grumbletext, where consumers can submit spam SMS
messages. The dataset also contains 3375 normal (ham) SMS messages from the NUS
SMS corpus maintained by the National University of Singapore.
The dataset can be downloaded from the UCI Machine Learning repository
(https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). Let's say that we have
saved this as the file SMSSpamCollection.txt in the working directory of R (you may
need to open it in Excel and save it as a tab-delimited file for R to read it properly).
Then, the command to read the file into R would be the following:
>spamdata <- read.table("SMSSpamCollection.txt",sep="\t",stringsAsFactors = default.stringsAsFactors())
We will first separate the dependent variable y and independent variables x and split
the dataset into training and testing sets in the ratio 80:20, using the following R
commands:
>samp <- sample.int(nrow(spamdata),as.integer(nrow(spamdata)*0.2),replace=F)
>spamTest <-spamdata[samp,]
>spamTrain <-spamdata[-samp,]
>ytrain<-as.factor(spamTrain[,1])
>ytest<-as.factor(spamTest[,1])
>xtrain<-as.vector(spamTrain[,2])
>xtest<-as.vector(spamTest[,2])
Since we are dealing with text documents, we need to do some standard preprocessing
before we can use the data for any machine learning models. We can use the tm package
in R for this purpose. In the next section, we will describe this in some detail.
Text processing using the tm package
The tm package has methods for data import, corpus handling, preprocessing, metadata
management, and creation of term-document matrices. Data can be imported into the tm
package either from a directory, a vector with each component a document, or a data
frame. The fundamental data structure in tm is an abstract collection of text documents
called Corpus. It has two implementations: VCorpus (volatile corpus), where the data
is stored in memory, and PCorpus (permanent corpus), where the data is stored on the
hard disk.
We can create a corpus of our SMS spam dataset by using the following R commands;
prior to this, you need to install the tm package and SnowballC package by using the
install.packages("packagename") command in R:
>library(tm)
>library(SnowballC)
>xtrain <- VCorpus(VectorSource(xtrain))
First, we need to do some basic text processing, such as removing extra white space,
changing all words to lowercase, removing stop words, and stemming the words. This
can be achieved by using the following functions in the tm package:
>#remove extra white space
>xtrain <- tm_map(xtrain,stripWhitespace)
>#remove punctuation
>xtrain <- tm_map(xtrain,removePunctuation)
>#remove numbers
>xtrain <- tm_map(xtrain,removeNumbers)
>#changing to lower case
>xtrain <- tm_map(xtrain,content_transformer(tolower))
>#removing stop words
>xtrain <- tm_map(xtrain,removeWords,stopwords("english"))
>#stemming the document
>xtrain <- tm_map(xtrain,stemDocument)
Finally, the data is transformed into a form that can be consumed by machine learning
models. This is the so called document-term matrix form where each document (SMS in
this case) is a row, the terms appearing in all documents are the columns, and the entry
in each cell denotes how many times each word occurs in one document:
>#creating Document-Term Matrix
>xtrain <- as.data.frame.matrix(DocumentTermMatrix(xtrain))
The same set of processes is done on the xtest dataset as well. The reason we
converted y to factors and xtrain to a data frame is to match the input format for the
Naïve Bayes classifier in the e1071 package.
Model training and prediction
You need to first install the e1071 package from CRAN. The naiveBayes() function
can be used to train the Naïve Bayes model. The function can be called using two
methods. The following is the first method:
>naiveBayes(formula,data,laplace=0,…,subset,na.action=na.pass)
Here, formula stands for the linear combination of independent variables used to predict
the class:
>class ~ x1+x2+…
Also, data stands for either a data frame or contingency table consisting of categorical
and numerical variables.
If we have the class labels as a vector y and dependent variables as a data frame x, then
we can use the second method of calling the function, as follows:
>naiveBayes(x,y,laplace=0,…)
We will use the second method of calling in our example. Once we have a trained
model, which is an R object of class naiveBayes, we can predict the classes of new
instances as follows:
>predict(object,newdata,type=c("class","raw"),threshold=0.001,eps=0,…)
So, we can train the Naïve Bayes model on our training dataset and score on the test
dataset by using the following commands:
>#Training the Naive Bayes Model
>nbmodel <- naiveBayes(xtrain,ytrain,laplace=3)
>#Prediction using trained model
>ypred.nb <- predict(nbmodel,xtest,type = "class",threshold = 0.075)
>#Converting classes to 0 and 1 for plotting ROC
>fconvert <- function(x){
if(x == "spam"){ y <- 1}
else {y <- 0}
y
}
>ytest1 <- sapply(ytest,fconvert,simplify = "array")
>ypred1 <- sapply(ypred.nb,fconvert,simplify = "array")
>library(pROC)
>roc(ytest1,ypred1,plot = T)
The ROC curve for this model and dataset, generated using the pROC package from
CRAN, is shown in the resulting plot. The confusion matrix can be computed as follows:
>#Confusion matrix
>confmat <- table(ytest,ypred.nb)
>confmat
        ypred.nb
ytest    ham  spam
  ham    143   139
  spam     9    35
From the ROC curve and confusion matrix, one can choose the best threshold for the
classifier and compute the precision and recall metrics. Note that the example shown here
is for illustration purposes only; the model needs to be tuned further to improve accuracy.
We can also print some of the most frequent words (model features) occurring in the
two classes and their posterior probabilities generated by the model. This will give a
more intuitive feel for the model. The following R code does this job:
>tab <- nbmodel$tables
>fham <- function(x){
y <- x[1,1]
y
}
>hamvec <- sapply(tab,fham,simplify = "array")
>hamvec <- sort(hamvec,decreasing = T)
>fspam <- function(x){
y <- x[2,1]
y
}
>spamvec <- sapply(tab,fspam,simplify = "array")
>spamvec <- sort(spamvec,decreasing = T)
>prb <- cbind(spamvec,hamvec)
>print.table(prb)
The output table is as follows:
word    Prob(word|spam)  Prob(word|ham)
call    0.6994           0.4084
free    0.4294           0.3996
now     0.3865           0.3120
repli   0.2761           0.3094
text    0.2638           0.2840
spam    0.2270           0.2726
txt     0.2270           0.2594
get     0.2209           0.2182
stop    0.2086           0.2025
The table shows, for example, that given a document is spam, the probability of the
word call appearing in it is 0.6994, whereas the probability of the same word
appearing in a normal document is only 0.4084.
The Bayesian logistic regression model
The name logistic regression comes from the fact that the dependent variable of the
regression is a logistic function. It is one of the widely used models in problems where
the response is a binary variable (for example, fraud or not-fraud, click or no-click, and
so on).
A logistic function is defined by the following equation:

$$f(y) = \frac{1}{1 + e^{-y}}$$

It has the particular feature that, as y varies from $-\infty$ to $+\infty$, the function value varies
from 0 to 1. Hence, the logistic function is ideal for modeling any binary response as the
input signal is varied.

The inverse of the logistic function is called logit. It is defined as follows:

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
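A quick numerical check of this inverse relationship:

```r
# Logistic function maps (-Inf, Inf) to (0, 1); logit maps back
logistic <- function(y) 1 / (1 + exp(-y))
logit    <- function(p) log(p / (1 - p))

p <- logistic(1.5)
logit(p)   # recovers the original input, 1.5
```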
In logistic regression, y is treated as a linear function of the explanatory variables X.
Therefore, the logistic regression model can be defined as follows:

$$P(C = 1 | X) = \frac{1}{1 + e^{-\beta^T \phi(X)}}$$

Here, $\phi(X)$ is the set of basis functions and $\beta$ is the vector of model parameters, as
explained in the case of linear regression in Chapter 4, Machine Learning Using
Bayesian Inference. From the definition of the GLM in Chapter 5, Bayesian Regression
Models, one can immediately recognize that logistic regression is a special case of the
GLM with the logit function as the link function.
Bayesian treatment of logistic regression is more difficult compared to the case of linear
regression. Here, the likelihood function consists of a product of logistic functions; one
for each data point. To compute the posterior, one has to normalize this function
multiplied by the prior (to get the denominator of the Bayes formula). One approach is
to use Laplace approximation as explained in Chapter 3, Introducing Bayesian
Inference. Readers might recall that in Laplace approximation, the posterior is
approximated as a Gaussian (normal) distribution about the maximum of the posterior.
This is achieved by finding the maximum a posteriori (MAP) solution first and
computing the second derivative of the negative log likelihood around the MAP
solution. Interested readers can find the details of Laplace approximation to logistic
regression in the paper by D.J.C. MacKay (reference 2 in the References section of this
chapter).
Instead of using an analytical approximation, Polson and Scott recently proposed a fully
Bayesian treatment of this problem using a data augmentation strategy (reference 3 in the
References section of this chapter). The authors have implemented their method in the R
package: BayesLogit. We will use this package to illustrate Bayesian logistic regression
in this chapter.
The BayesLogit R package
The package can be downloaded from the CRAN website at
http://cran.r-project.org/web/packages/BayesLogit/index.html. The package contains the logit
function that can be used to perform a Bayesian logistic regression. The syntax for
calling this function is as follows:
>logit(Y,X,n=rep(1,length(Y)),m0=rep(0,ncol(X)),P0=matrix(0,nrow=ncol(X),ncol=ncol(X)),samp=1000,burn=500)
Here, Y is an N-dimensional vector containing the response values; X is an N x P
dimensional matrix containing the values of the independent variables; n is an
N-dimensional vector; m0 is a P-dimensional prior mean; and P0 is a P x P dimensional
prior precision matrix. The other two arguments control the MCMC simulation: the
number of MCMC samples saved is denoted by samp, and the number of MCMC
samples discarded at the beginning of the run (the burn-in) is denoted by burn.
The dataset
To illustrate Bayesian logistic regression, we use the Parkinsons dataset from the UCI
Machine Learning repository (https://archive.ics.uci.edu/ml/datasets/Parkinsons). The
dataset was used by Little et al. to detect Parkinson's disease by analyzing voice
disorders (reference 4 in the References section of this chapter). The dataset consists of
voice measurements from 31 people, of whom 23 have Parkinson's disease. There are
195 rows in total, corresponding to multiple measurements from each individual.
The measurements can be grouped into the following sets:
The vocal fundamental frequency
Jitter
Shimmer
The ratio of noise to tonal components
The nonlinear dynamical complexity measures
The signal fractal scaling exponent
The nonlinear measures of fundamental frequency variation
In total, there are 22 numerical attributes.
Preparation of the training and testing datasets
Before we can train the Bayesian logistic model, we need to do some preprocessing of
the data. The dataset contains multiple measurements from the same individual. Here,
we split at the level of individuals: all observations from a randomly sampled subset of
individuals form the test set, and the remaining observations form the training set. Also,
we need to separate the dependent variable (class label Y) from the independent
variables (X). The following R code does this job:
>#install.packages("BayesLogit") #One time installation of package
>library(BayesLogit)
>PDdata <- read.table("parkinsons.csv",sep=",",header=TRUE,row.names = 1)
>rnames <- row.names(PDdata)
>cnames <- colnames(PDdata,do.NULL = TRUE,prefix = "col")
>colnames(PDdata)[17] <- "y"
>PDdata$y <- as.factor(PDdata$y)
>rnames.strip <- substr(rnames,10,12)
>PDdata1 <- cbind(PDdata,rnames.strip)
>rnames.unique <- unique(rnames.strip)
>set.seed(123)
>samp <- sample(rnames.unique,as.integer(length(rnames.unique)*0.2),replace=F)
>PDtest <- PDdata1[PDdata1$rnames.strip %in% samp,-24]    # -24 removes the last column
>PDtrain <- PDdata1[!(PDdata1$rnames.strip %in% samp),-24] # -24 removes the last column
>xtrain <- PDtrain[,-17]
>ytrain <- PDtrain[,17]
>xtest <- PDtest[,-17]
>ytest<- PDtest[,17]
Using the Bayesian logistic model
We can use xtrain and ytrain to train the Bayesian logistic regression model using the
logit( ) function:
>blmodel <- logit(ytrain,xtrain,n=rep(1,length(ytrain)),m0 =
rep(0,ncol(xtrain)),P0 =
matrix(0,nrow=ncol(xtrain),ncol=ncol(xtrain)),samp = 1000,burn = 500)
The summary() function will give a high-level summary of the fitted model:
>summary(blmodel)
To predict values of Y for a new dataset, we need to write a custom script as follows:
>psi <- blmodel$beta %*% t(xtest)   # samp x n matrix of linear predictors
>p <- exp(psi) / (1 + exp(psi))     # samp x n matrix of predicted probabilities
>ypred.bayes <- colMeans(p)
The error of prediction can be computed by comparing it with the actual values of Y
present in ytest:
>table(ypred.bayes,ytest)
One can plot the ROC curve using the pROC package as follows:
>library(pROC)
>roc(ytest,ypred.bayes,plot = T)
The ROC curve has an AUC of 0.942, suggesting good classification accuracy. Again,
the model is presented here for illustration purposes only and has not been tuned to
obtain maximum performance.
Exercises
1. In this exercise, we will use the DBWorld e-mails dataset from the UCI Machine
Learning repository to compare the relative performance of Naïve Bayes and
BayesLogit methods. The dataset contains 64 e-mails from the DBWorld
newsletter and the task is to classify the e-mails into either announcements of
conferences or everything else. The reference for this dataset is a course by Prof.
Michele Filannino (reference 5 in the References section of this chapter). The
dataset can be downloaded from the UCI website at
https://archive.ics.uci.edu/ml/datasets/DBWorld+e-mails#.
Some preprocessing of the dataset is required before it can be used with both methods. The dataset is in the ARFF format. You need to download the foreign R package (http://cran.r-project.org/web/packages/foreign/index.html) and use the read.arff( ) method in it to read the file into an R data frame.
References
1. Almeida T.A., Gómez Hidalgo J.M., and Yamakami A. "Contributions to the Study of SMS Spam Filtering: New Collection and Results". In: 2011 ACM Symposium on Document Engineering (DOCENG'11). Mountain View, CA, USA. 2011
2. MacKay D.J.C. "The Evidence Framework Applied to Classification Networks". Neural Computation 4(5). 1992
3. Polson N.G., Scott J.G., and Windle J. "Bayesian Inference for Logistic Models Using Pólya-Gamma Latent Variables". Journal of the American Statistical Association. Volume 108, Issue 504, Page 1339. 2013
4. Costello D.A.E., Little M.A., McSharry P.E., Moroz I.M., and Roberts S.J. "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection". BioMedical Engineering OnLine. 2007
5. Filannino M. "DBWorld e-mail Classification Using a Very Small Corpus". Project of Machine Learning Course. University of Manchester. 2011
Summary
In this chapter, we discussed the various merits of using Bayesian inference for the
classification task. We reviewed some of the common performance metrics used for the
classification task. We also learned two basic and popular methods for classification,
Naïve Bayes and logistic regression, both implemented using the Bayesian approach.
Having learned some important Bayesian-supervised machine learning techniques, in
the next chapter, we will discuss some unsupervised Bayesian models.
Chapter 7. Bayesian Models for
Unsupervised Learning
The machine learning models that we have discussed so far in the previous two chapters share one common characteristic: they require training data containing ground truth. This implies a dataset containing true values of the predicted, or dependent, variable, which is often manually labeled. Such machine learning, where the algorithm is trained using labeled data, is called supervised learning. This type of machine learning gives very good performance in terms of accuracy of prediction. It is, in fact, the de facto method used in most industrial systems using machine learning. However, its drawback is that it is difficult to obtain labeled data when one wants to train a model on large datasets. This is particularly relevant in the era of Big Data: a lot of data is available to organizations from various logs, transactions, and interactions with consumers, and organizations want to gain insight from this data and make predictions about their consumers' interests.
In unsupervised methods, no labeled data is required for learning. Learning happens through identifying dominant patterns and correlations present in the dataset. Some common examples of unsupervised learning are clustering, association rule mining, density estimation, and dimensionality reduction. In clustering, naturally occurring groups in data are identified using a suitable algorithm that makes use of some distance measure between data points. In association rule mining, items that frequently occur together in a transaction are identified from a transaction dataset. In dimensionality reduction techniques such as principal component analysis, the original dataset containing a large number of variables (dimensions) is projected down to a lower dimensional space that retains the maximum information in the data. Though unsupervised learning doesn't require labeled training data, one would need a large amount of data to learn all the patterns of interest, and the learning is often more computationally intensive.
In many practical cases, it is feasible to create a small amount of labeled data. The third type of learning, semi-supervised learning, makes use of this small labeled dataset and propagates labels to the rest of the unlabeled training data using suitable algorithms. In this chapter, we will cover Bayesian approaches for unsupervised learning. We will discuss in detail two important models: Gaussian mixture models for clustering and Latent Dirichlet allocation for topic modeling.
Bayesian mixture models
In general, a mixture model corresponds to representing data using a mixture of probability distributions. The most common mixture model is of the following type:

P(X) = \sum_{k=1}^{K} \pi_k P(X|\theta_k)

Here, P(X|\theta_k) is a probability distribution of X with parameters \theta_k, and \pi_k represents the weight for the kth component in the mixture, such that \sum_{k=1}^{K} \pi_k = 1. If the underlying probability distribution is a normal (Gaussian) distribution, then the mixture model is called a Gaussian mixture model (GMM). The mathematical representation of GMM, therefore, is given by:

P(X) = \sum_{k=1}^{K} \pi_k N(X|\mu_k, \Sigma_k)

Here, we have used the same notation as in previous chapters, where X stands for an N-dimensional data vector representing each observation, and there are M such observations in the dataset.
A mixture model such as this is suitable for clustering when the clusters have overlaps.
One of the applications of GMM is in computer vision. If one wants to track moving
objects in a video, it is useful to subtract the background image. This is called
background subtraction or foreground detection. GMMs are used for this purpose where
the intensity of each pixel is modeled using a mixture of Gaussian distributions
(reference 1 in the References section of this chapter).
The task of learning a GMM corresponds to learning the model parameters \mu_k and \Sigma_k and the mixture weights \pi_k for all the components k = 1, ..., K. The standard approach for learning GMMs is the maximum likelihood method. For a dataset consisting of M observations, the logarithm of the likelihood function is given by:

\ln P(X|\pi, \mu, \Sigma) = \sum_{m=1}^{M} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_m|\mu_k, \Sigma_k) \right\}

Unlike a single Gaussian model, maximizing the log-likelihood with respect to the parameters \mu_k, \Sigma_k, and \pi_k cannot be done in a straightforward manner in a GMM. This is because there is no closed-form expression for the derivative in this case, since it is difficult to compute the logarithm of a sum. Therefore, one uses what is called an expectation-maximization (EM) algorithm to maximize the log-likelihood function. The EM algorithm is an iterative algorithm, where each iteration consists of two computations: expectation and maximization. The EM algorithm proceeds as follows:
1. Initialize the parameters \mu_k, \Sigma_k, and \pi_k, and evaluate the initial value of the log-likelihood.
2. In the expectation step, evaluate the responsibilities of the mixture components, \gamma(z_{mk}), from the log-likelihood using the current parameter values \mu_k, \Sigma_k, and \pi_k.
3. In the maximization step, using the values of \gamma(z_{mk}) computed in step 2, estimate new parameter values \mu_k, \Sigma_k, and \pi_k by maximization of the log-likelihood.
4. Compute the new value of the log-likelihood function using the estimated values of \mu_k, \Sigma_k, and \pi_k from steps 2 and 3.
5. Repeat steps 2-4 until the log-likelihood function has converged.
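The five steps above can be sketched in base R for the simplest case: a one-dimensional, two-component GMM. The synthetic data and all starting values below are illustrative; this is a bare-bones sketch, not a production implementation:

```r
set.seed(42)
x <- c(rnorm(100, -2, 1), rnorm(100, 3, 1))  # two well-separated clusters

# Step 1: initialize parameters and evaluate the log-likelihood
pi_k <- c(0.5, 0.5); mu <- c(-1, 1); sd_k <- c(1, 1)
loglik <- function(pi_k, mu, sd_k)
  sum(log(pi_k[1] * dnorm(x, mu[1], sd_k[1]) +
          pi_k[2] * dnorm(x, mu[2], sd_k[2])))
ll <- loglik(pi_k, mu, sd_k); ll0 <- ll

for (iter in 1:200) {
  # Step 2 (E-step): responsibilities gamma(z_mk), one row per observation
  num <- cbind(pi_k[1] * dnorm(x, mu[1], sd_k[1]),
               pi_k[2] * dnorm(x, mu[2], sd_k[2]))
  gam <- num / rowSums(num)
  # Step 3 (M-step): re-estimate parameters from the responsibilities
  Nk   <- colSums(gam)
  mu   <- colSums(gam * x) / Nk
  sd_k <- sqrt(colSums(gam * outer(x, mu, "-")^2) / Nk)
  pi_k <- Nk / length(x)
  # Steps 4-5: recompute the log-likelihood and stop on convergence
  ll_new <- loglik(pi_k, mu, sd_k)
  if (ll_new - ll < 1e-8) break
  ll <- ll_new
}
```

After convergence, mu should lie close to the true component means (-2 and 3), and the log-likelihood never decreases across iterations, which is the defining property of EM.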
In the Bayesian treatment of GMM, the maximization of the log-likelihood is simplified by introducing a latent variable Z. Let Z be a K-dimensional binary random variable in which only one element z_k is 1 and the rest of the K - 1 elements are 0. Using Z, one can write the joint distribution of X and Z as follows:

P(X, Z) = P(Z) P(X|Z)

Here:

P(z_k = 1) = \pi_k

And:

P(X|z_k = 1) = N(X|\mu_k, \Sigma_k)

Therefore:

P(Z) = \prod_{k=1}^{K} \pi_k^{z_k}

And:

P(X|Z) = \prod_{k=1}^{K} N(X|\mu_k, \Sigma_k)^{z_k}

The advantage of introducing a latent variable Z into the problem is that the expression for the log-likelihood is simplified, with the logarithm directly acting on the normal distribution, as in the case of a single Gaussian model. Therefore, it is straightforward to maximize P(X, Z). However, the problem that still remains is that we don't know the value of Z! So, the trick is to use an EM-like iterative algorithm in which, in the E-step, the expectation value of Z is estimated and, in the M-step, using the last estimated value of Z, we find the parameter values of the Gaussian distribution. The Bayesian version of the EM algorithm for GMM proceeds as follows:
1. Initialize the parameters \mu_k, \Sigma_k, and \pi_k, and evaluate the initial value of the log-likelihood.
2. In the expectation step, use these values to compute the expectation value E[z_{mk}].
3. In the maximization step, keeping E[z_{mk}] fixed, estimate \mu_k, \Sigma_k, and \pi_k by maximizing E_Z[\ln P(X, Z|\mu, \Sigma, \pi)].
4. Compute the new likelihood function.
5. Repeat steps 2-4 until convergence.
A more detailed treatment of the Bayesian version of the EM algorithm and GMM can
be found in the book by Christopher M. Bishop (reference 2 in the References section of
this chapter). Here, we leave the theoretical treatment of the Bayesian GMM and
proceed to look at its R implementation in the bgmm package.
The bgmm package for Bayesian mixture models
The bgmm package was developed by Przemyslaw Biecek and Ewa Szczurek for
modeling gene expressions data (reference 3 in the References section of this chapter).
It can be downloaded from the CRAN website at http://cran.rproject.org/web/packages/bgmm/index.html. The package contains not only an
unsupervised version of GMM but fully supervised and semi-supervised
implementations as well. The following are the different models available in the bgmm
package:
Fully supervised GMM: Here, labeled data is available for all records in the training set. This includes the following:
The supervised( ) function
Semi-supervised GMM: Here, labeled data is available for a small subset of all records in the training set. This includes the following:
The semisupervised( ) function
Partially supervised GMM: Here, labeled data is available for a small subset of all records, but these labels are uncertain. The values of labels are given with some probability. There are two functions in the package for partially supervised GMM. This includes the following:
The belief( ) function: The uncertainty of labels is expressed as a probability distribution over the components. For the first m observations, a belief matrix B of dimensions m x k is given as input, where the matrix entry b_{ij} denotes the probability that the ith record has the jth label.
The soft( ) function: In this approach, a plausibility matrix of dimension M x k is defined across all records in the training set of size M. The matrix element p_{ij} is interpreted as the weight of the prior probability that the ith record has the jth label. If there is no particular information about the labels of any records, they can be given equal weights. For the purpose of implementation, a normalization constraint is imposed on the matrix elements.
Unsupervised GMM: Here, labeled data is not available for any records. This includes the following:
The unsupervised( ) function
The typical parameters that are passed to these functions are as follows:
X: This is a data.frame with the unlabeled X data.
knowns: This is a data.frame with the labeled X data.
B: This is a belief matrix that specifies the distribution of beliefs for the labeled records. The number of rows of B should be the same as that of knowns.
P: This is a matrix of weights of prior probabilities (plausibilities).
class: This is a vector of classes or labels for the labeled records.
k: This is the number of components or columns of the B matrix.
init.params: These are the initial values for the estimates of the model parameters.
The difference between the belief( ) and soft( ) functions is that, in the first case, the input is a matrix containing prior probability values for each possible label, whereas in the second case, the input is a matrix containing weights for each of the priors and not the prior probability itself. For more details, readers are requested to read the paper by Przemyslaw Biecek et al. (reference 3 in the References section of this chapter).
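As an illustration of the input format described above, the following base-R sketch constructs a small belief matrix B for m labeled records over k components. The label assignments and belief values are made up; only the shape and the row-normalization requirement matter:

```r
m <- 3   # number of labeled records
k <- 2   # number of components (labels)

# Each row is a probability distribution over the k labels for one record,
# e.g. record 1 is believed to have label 1 with probability 0.9
B <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.5, 0.5), nrow = m, ncol = k, byrow = TRUE)

# Rows of a belief matrix must each sum to 1
row_ok <- all(abs(rowSums(B) - 1) < 1e-9)
```

A matrix like this, together with a knowns data frame holding the corresponding m labeled records, is what belief( ) expects as input.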
Now, let's do a small illustrative example of using bgmm. We will use the ADL dataset from the UCI Machine Learning repository. This dataset contains acceleration data from wrist-worn accelerometers from 16 volunteers. The dataset and metadata details can be found at https://archive.ics.uci.edu/ml/datasets/Dataset+for+ADL+Recognition+with+Wrist-worn+Accelerometer. The research work on ADL monitoring systems, where this dataset was generated, is published in two papers by Bruno B. et al. (references 4 and 5 in the References section of this chapter).
For the example of bgmm, we will only use one folder in the dataset directory, namely Brush_teeth. Firstly, we will do a small amount of preprocessing to combine data from the different volunteers into a single file. The following R script does this job:
>#Set working directory to folder containing files (provide the correct path)
>setwd("C:/…/ADL_Dataset/HMP_Dataset/Brush_teeth")
>flist <- list.files(path = "C:/../ADL_Dataset/HMP_Dataset/Brush_teeth",pattern = "*.txt")
>all.data <- lapply(flist,read.table,sep = " ",header = FALSE)
>combined.data <- as.data.frame(do.call(rbind,all.data))
>combined.data.XZ <- combined.data[,c(1,3)]
The last step is to select the X and Z components of acceleration to create a two-dimensional dataset.
The following R script calls the unsupervised( ) function of bgmm and performs clustering. A simple scatter plot of the data suggests that there could be four clusters in the dataset, so choosing k = 4 would be sufficient:
>modelbgmm <- unsupervised(combined.data.XZ,k=4)
>summary(modelbgmm)
>plot.mModel(modelbgmm)
The clusters generated by bgmm can be seen in the following figure; there are four clusters whose centers are represented by the four colored dots and their respective Gaussian densities are represented by the ellipses:
Topic modeling using Bayesian inference
We have seen the supervised learning (classification) of text documents in Chapter 6, Bayesian Classification Models, using the Naïve Bayes model. Often, a large text document, such as a news article or a short story, contains different topics as subsections. It is useful to model such intra-document statistical correlations for the purposes of classification, summarization, compression, and so on. The Gaussian mixture model learned in the previous section is more applicable to numerical data, such as images, than to documents. This is because words in documents seldom follow a normal distribution. A more appropriate choice would be the multinomial distribution.
A powerful extension of mixture models to documents is the work of T. Hofmann on Probabilistic Latent Semantic Indexing (reference 6 in the References section of this chapter) and that of David Blei et al. on Latent Dirichlet allocation (LDA) (reference 7 in the References section of this chapter). In these works, a document is described as a mixture of topics and each topic is described by a distribution of words. LDA is a generative unsupervised model for text documents. The task of LDA is to learn the parameters of the topic distribution, word distributions, and mixture coefficients from data. A brief overview of LDA is presented in the next section. Readers are strongly advised to read the paper by David Blei et al. to comprehend their approach.
Latent Dirichlet allocation
In LDA, it is assumed that words are the basic units of documents. A word is one element of a set known as the vocabulary, indexed by {1, ..., V}. Here, V denotes the size of the vocabulary. A word can be represented by a unit-basis vector all of whose components are zero except the one corresponding to the word, which has a value of 1. For example, the nth word in the vocabulary is described by a vector w of size V whose nth component w^n = 1 and all other components w^j = 0 for j ≠ n. Similarly, a document is a collection of N words denoted by \mathbf{w} = (w_1, w_2, ..., w_N), and a corpus is a collection of M documents denoted by D = \{\mathbf{w}_1, \mathbf{w}_2, ..., \mathbf{w}_M\} (note that documents are represented here by a boldface w, whereas words are denoted by a plain w).
As mentioned earlier, LDA is a generative probabilistic model of a corpus where documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words. To generate each document \mathbf{w} in a corpus in an LDA model, the following steps are performed:
1. Choose the value of N corresponding to the size of the document, according to a Poisson distribution characterized by parameter \xi:

   N \sim Poisson(\xi)

2. Choose the value of the parameter \theta that characterizes the topic distribution from a Dirichlet distribution characterized by parameter \alpha:

   \theta \sim Dir(\alpha)

3. For each of the N words w_n:
   1. Choose a topic z_n according to the multinomial distribution characterized by the parameter \theta drawn in step 2:

      z_n \sim Multinomial(\theta)

   2. Choose a word w_n from the multinomial probability distribution characterized by \beta and conditioned on z_n:

      w_n \sim P(w_n|z_n, \beta)

Given the values of N, \alpha, and \beta, the joint distribution of a topic mixture \theta, a set of topics z, and a set of words \mathbf{w} is given by:

P(\theta, z, \mathbf{w}|\alpha, \beta) = P(\theta|\alpha) \prod_{n=1}^{N} P(z_n|\theta) P(w_n|z_n, \beta)

Note that, in this case, only \mathbf{w} is observed (the documents), and both \theta and z are treated as latent (hidden) variables.

The Bayesian inference problem in LDA is the estimation of the posterior density of the latent variables \theta and z, given a document:

P(\theta, z|\mathbf{w}, \alpha, \beta) = \frac{P(\theta, z, \mathbf{w}|\alpha, \beta)}{P(\mathbf{w}|\alpha, \beta)}

As usual with many Bayesian models, this is intractable analytically, and one has to use approximate techniques, such as MCMC or variational Bayes, to estimate the posterior.
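The generative steps above can be simulated for a single toy document in base R. The vocabulary size, topic count, hyperparameter values, and topic-word matrix below are all illustrative; a Dirichlet draw is built by normalizing independent gamma draws, a standard construction:

```r
set.seed(7)
V <- 5; K <- 2                    # vocabulary size, number of topics
alpha <- rep(0.5, K)              # Dirichlet hyperparameter
beta  <- rbind(c(0.40, 0.40, 0.10, 0.05, 0.05),  # word dist., topic 1
               c(0.05, 0.05, 0.10, 0.40, 0.40))  # word dist., topic 2

# Dirichlet sample via normalized gamma draws
rdirichlet <- function(a) { g <- rgamma(length(a), a); g / sum(g) }

N     <- rpois(1, 10)             # step 1: document length
theta <- rdirichlet(alpha)        # step 2: topic mixture for the document
words <- integer(N)
for (n in seq_len(N)) {
  z_n      <- sample(K, 1, prob = theta)        # step 3a: draw a topic
  words[n] <- sample(V, 1, prob = beta[z_n, ])  # step 3b: draw a word index
}
```

Running this repeatedly produces documents whose word frequencies reflect the document-specific topic mixture theta, which is exactly the structure LDA inference tries to recover from observed text.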
R packages for LDA
There are mainly two packages in R that can be used for performing LDA on documents.
One is the topicmodels package developed by Bettina Grün and Kurt Hornik and the
second one is lda developed by Jonathan Chang. Here, we describe both these
packages.
The topicmodels package
The topicmodels package is an interface to the C and C++ code developed by the authors of the papers on LDA and Correlated Topic Models (CTM) (references 7, 8, and 9 in the References section of this chapter). The main function, LDA, in this package is used to fit LDA models. It can be called by:
>LDA(X,K,method = "Gibbs",control = NULL,model = NULL,...)
Here, X is a document-term matrix that can be generated using the tm package and K is the number of topics. The method argument specifies the method to be used for fitting; two methods are supported: Gibbs and VEM.
Let's do a small example of building LDA models using this package. The dataset used is the Reuter_50_50 dataset from the UCI Machine Learning repository (references 10 and 11 in the References section of this chapter). The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/Reuter_50_50. For this exercise, we will only use documents from one directory, namely AaronPressman in the C50train directory. The required preprocessing can be done using the following R script; readers should have installed the tm and topicmodels packages before trying this exercise:
>library(topicmodels)
>library(tm)
>#creation of training corpus from reuters dataset
>dirsourcetrain <- DirSource(directory = "C:/…/C50/C50train/AaronPressman")
>xtrain <- VCorpus(dirsourcetrain)
>#remove extra white space
>xtrain <- tm_map(xtrain,stripWhitespace)
>#changing to lower case
>xtrain <- tm_map(xtrain,content_transformer(tolower))
>#removing stop words
>xtrain <- tm_map(xtrain,removeWords,stopwords("english"))
>#stemming the document
>xtrain <- tm_map(xtrain,stemDocument)
>#creating Document-Term Matrix
>xtrain <- as.data.frame.matrix(DocumentTermMatrix(xtrain))
The same set of steps can be used to create the test dataset from the /…/C50/C50test/
directory.
Once we have the document-term matrices xtrain and xtest, the LDA model can be
built and tested using the following R script:
>#training lda model
>ldamodel <- LDA(xtrain,10,method = "VEM")
>#computation of perplexity, on training data (only with VEM method)
>perp <- perplexity(ldamodel)
>perp
[1] 407.3006
A value of perplexity around 100 indicates a good fit. In this case, we need to add more
training data or change the value of K to improve perplexity.
Now let's use the trained LDA model to predict the topics on the test dataset:
>#extracting topics from test data
>postprob <- posterior(ldamodel,xtest)
>postprob$topics
Here, the test set contains only one file, namely 42764newsML.txt. The distribution of
its topic among the 10 topics produced by the LDA model is shown.
The lda package
The lda package was developed by Jonathan Chang, who implemented a collapsed Gibbs sampling method for the estimation of the posterior. The package can be downloaded from the CRAN website at http://cran.r-project.org/web/packages/lda/index.html.
The main function in the package, lda.collapsed.gibbs.sampler, uses a collapsed Gibbs sampler to fit three different models: Latent Dirichlet allocation (LDA), supervised LDA (sLDA), and the mixed-membership stochastic blockmodel (MMSB). These functions take input documents and return point estimates of the latent parameters. These functions can be used in R as follows:
>lda.collapsed.gibbs.sampler(documents,K,vocab,num.iterations,alpha,eta,initial = NULL,burnin = NULL,compute.log.likelihood = FALSE,trace = 0L,freeze.topics = FALSE)
Here, documents represents a list containing documents, with the length of the list equal to D; K is the number of topics; vocab is a character vector specifying the vocabulary of words; and alpha and eta are the values of the hyperparameters.
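The documents argument has a specific shape: as described in the lda package documentation, each document is a 2 x n integer matrix whose first row holds 0-based indices into vocab and whose second row holds the corresponding word counts. The toy two-document corpus below is made up to illustrate that format:

```r
vocab <- c("bayes", "model", "data", "topic")

doc1 <- rbind(as.integer(c(0, 2)),   # word indices: "bayes", "data"
              as.integer(c(1, 2)))   # counts: once, twice
doc2 <- rbind(as.integer(c(1, 3)),   # word indices: "model", "topic"
              as.integer(c(1, 3)))   # counts: once, three times
documents <- list(doc1, doc2)        # list of length D = 2
```

A list built this way, together with vocab, can be passed directly as the first two arguments of lda.collapsed.gibbs.sampler; the dtm2ldaformat( ) function mentioned in the exercise below produces exactly this structure from a document-term matrix.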
Exercises
1. For the Reuter_50_50 dataset, fit the LDA model using the
lda.collapsed.gibbs.sampler function in the lda package and compare
performance with that of the topicmodels package. Note that you need to convert
the document-term matrix to lda format using the dtm2ldaformat( ) function in
the topicmodels package in order to use the lda package.
References
1. Bouwmans T., El Baf F., and Vachon B. "Background Modeling Using Mixture of Gaussians for Foreground Detection - A Survey". Recent Patents on Computer Science 1: 219-237. 2008
2. Bishop C.M. Pattern Recognition and Machine Learning. Springer. 2006
3. Biecek P., Szczurek E., Tiuryn J., and Vingron M. "The R Package bgmm: Mixture
Modeling with Uncertain Knowledge". Journal of Statistical Software. Volume 47,
Issue 3. 2012
4. Bruno B., Mastrogiovanni F., Sgorbissa A., Vernazza T., and Zaccaria R.
"Analysis of human behavior recognition algorithms based on acceleration data".
In: IEEE Int Conf on Robotics and Automation (ICRA), pp. 1602-1607. 2013
5. Bruno B., Mastrogiovanni F., Sgorbissa A., Vernazza T., and Zaccaria R. "Human Motion Modeling and Recognition: A computational approach". In: IEEE International Conference on Automation Science and Engineering (CASE). pp. 156-161. 2012
6. Hofmann T. "Probabilistic Latent Semantic Indexing". In: Twenty-Second Annual
International SIGIR Conference. 1999
7. Blei D.M., Jordan M.I., and Ng A.Y. "Latent Dirichlet Allocation". Journal of
Machine Learning Research 3. 993-1022. 2003
8. Blei D.M., and Lafferty J.D. "A Correlated Topic Model of Science". The Annals
of Applied Statistics. 1(1), 17-35. 2007
9. Phan X.H., Nguyen L.M., and Horiguchi S. "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections". In: 17th International World Wide Web Conference (WWW 2008). pages 91-100. Beijing, China. 2008
Summary
In this chapter, we discussed the concepts behind unsupervised and semi-supervised
machine learning, and their Bayesian treatment. We learned two important Bayesian
unsupervised models: the Bayesian mixture model and LDA. We discussed in detail the
bgmm package for the Bayesian mixture model, and the topicmodels and lda packages
for topic modeling. Since the subject of unsupervised learning is vast, we could only
cover a few Bayesian methods in this chapter, just to give a flavor of the subject. We
have not covered semi-supervised methods using both item labeling and feature
labeling. Interested readers should refer to more specialized books on this subject. In the
next chapter, we will learn another important class of models, namely neural networks.
Chapter 8. Bayesian Neural Networks
As the name suggests, artificial neural networks are statistical models built by taking inspiration from the architecture and cognitive capabilities of biological brains. Neural network models typically have a layered architecture consisting of a large number of neurons in each layer, with neurons between different layers connected. The first layer is called the input layer, the last layer is called the output layer, and the rest of the layers in the middle are called hidden layers. Each neuron has a state that is determined by a nonlinear function of the state of all neurons connected to it. Each connection has a weight that is determined from the training data containing a set of input and output pairs. This kind of layered architecture of neurons and their connections is present in the neocortex region of the human brain and is considered to be responsible for higher functions such as sensory perception and language understanding.
The first computational model for neural network was proposed by Warren McCulloch
and Walter Pitts in 1943. Around the same time, psychologist Donald Hebb created a
hypothesis of learning based on the mechanism of excitation and adaptation of neurons
that is known as Hebb's rule. The hypothesis can be summarized by saying Neurons
that fire together, wire together. Although there were several researchers who tried to
implement computational models of neural networks, it was Frank Rosenblatt in 1958
who first created an algorithm for pattern recognition using a two-layer neural network
called Perceptron.
The research and applications of neural networks went through both periods of stagnation and periods of great progress during 1970-2010. Some of the landmarks in the history of neural networks are the invention of the backpropagation algorithm by Paul Werbos in 1975, a fast learning algorithm for learning multilayer neural networks (also called deep learning networks) by Geoffrey Hinton in 2006, and the use of GPGPUs in the latter half of the last decade to achieve the greater computational power required for processing neural networks.
Today, neural network models and their applications have again taken a central stage in
artificial intelligence with applications in computer vision, speech recognition, and
natural language understanding. This is the reason this book has devoted one chapter
specifically to this subject. The importance of Bayesian inference in neural network
models will become clear when we go into detail in later sections.
Two-layer neural networks
Let us look at the formal definition of a two-layer neural network. We follow the notation and description used by David MacKay (references 1, 2, and 3 in the References section of this chapter). The input to the NN is given by x = (x_1, x_2, ..., x_N). The input values are first multiplied by a set of weights to produce a weighted linear combination and then transformed using a nonlinear function to produce the values of the states of neurons in the hidden layer:

a_j^{(1)} = \sum_{i} w_{ji}^{(1)} x_i + \theta_j^{(1)}; \quad h_j = f^{(1)}(a_j^{(1)})

A similar operation is done at the second layer to produce the final output values y_k:

a_k^{(2)} = \sum_{j} w_{kj}^{(2)} h_j + \theta_k^{(2)}; \quad y_k = f^{(2)}(a_k^{(2)})

The function f is usually taken as either a sigmoid function f(a) = 1/(1 + e^{-a}) or f(a) = \tanh(a). Another common function, used for multiclass classification, is softmax, defined as follows:

f(a_i) = \frac{e^{a_i}}{\sum_{j} e^{a_j}}

This is a normalized exponential function.
All these are highly nonlinear functions, exhibiting the property that the output value has a sharp increase as a function of the input. This nonlinear property gives neural networks more computational flexibility than standard linear or generalized linear models. Here, \theta is called a bias parameter. The weights w_{ji}, together with the biases, form the weight vector w.
The schematic structure of the two-layer neural network is shown here:
Learning in neural networks corresponds to finding the value of the weight vector w such that, for a given dataset consisting of ground truth input and target (output) pairs \{x_m, t_m\}, the error of prediction of the target values by the network is minimum. For regression problems, this is achieved by minimizing the error function:

E_D(w) = \frac{1}{2} \sum_{m=1}^{M} \left( y(x_m; w) - t_m \right)^2
For the classification task, in neural network training, instead of the squared error one uses a cross-entropy, defined as follows:

G(w) = -\sum_{m=1}^{M} \left\{ t_m \ln y(x_m; w) + (1 - t_m) \ln\left(1 - y(x_m; w)\right) \right\}

To avoid overfitting, a regularization term is usually also included in the objective function. The form of the regularization function is usually E_W(w) = \frac{1}{2} \sum_{i} w_i^2, which penalizes large values of w, reducing the chances of overfitting. The resulting objective function is as follows:

M(w) = \beta E_D(w) + \alpha E_W(w)

Here, \alpha and \beta are free parameters for which the optimum values can be found from cross-validation experiments.
To minimize M(w) with respect to w, one uses the backpropagation algorithm, as described in the classic paper by Rumelhart, Hinton, and Williams (reference 3 in the References section of this chapter). In backpropagation, for each input/output pair, the value of the predicted output is computed using a forward pass from the input layer. The error, or the difference between the predicted output and the actual output, is propagated back, and at each node the weights are readjusted so that the error is minimized.
Bayesian treatment of neural networks
To set neural network learning in a Bayesian context, consider the error function E_D(w) for the regression case. It can be treated as arising from a Gaussian noise term for observing the given dataset, conditioned on the weights w. This is precisely the likelihood function, which can be written as follows:

P(D|w, \beta) = \frac{1}{Z_D(\beta)} \exp\left(-\beta E_D(w)\right)

Here, \beta = 1/\sigma^2 is the inverse variance of the noise term, Z_D(\beta) is a normalization constant, and P(D|w, \beta) represents a probabilistic model. The regularization term can be considered as the log of the prior probability distribution over the parameters:

P(w|\alpha) = \frac{1}{Z_W(\alpha)} \exp\left(-\alpha E_W(w)\right)

Here, 1/\alpha is the variance of the prior distribution of the weights. It can be easily shown using Bayes' theorem that the objective function M(w) then corresponds to the posterior distribution of the parameters w:

P(w|D, \alpha, \beta) = \frac{1}{Z_M} \exp\left(-M(w)\right)

In the neural network case, we are interested in the local maxima of P(w|D, \alpha, \beta). The posterior is then approximated as a Gaussian around each maximum w_{MP}, as follows:

P(w|D, \alpha, \beta) \approx \frac{1}{Z} \exp\left(-M(w_{MP}) - \frac{1}{2}(w - w_{MP})^T A (w - w_{MP})\right)

Here, A is the matrix of second derivatives of M(w) with respect to w and represents the inverse of the covariance matrix. It is also known by the name Hessian matrix.

The values of the hyperparameters \alpha and \beta are found using the evidence framework. In this, the probability P(D|\alpha, \beta) is used as evidence to find the best values of \alpha and \beta from data D. This is done through the following Bayesian rule:

P(\alpha, \beta|D) = \frac{P(D|\alpha, \beta) P(\alpha, \beta)}{P(D)}

By using the evidence framework and the Gaussian approximation of the posterior (references 2 and 5 in the References section of this chapter), one can show that the best value of \alpha satisfies the following:

2 \alpha E_W(w_{MP}) = \gamma

Also, the best value of \beta satisfies the following:

2 \beta E_D(w_{MP}) = M - \gamma

In these equations, \gamma is the number of well-determined parameters, given by \gamma = k - \alpha \, \mathrm{Trace}(A^{-1}), where k is the length of w.
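The quantity \gamma = k - \alpha Trace(A^{-1}) is a simple matrix computation. The following base-R sketch evaluates it for a made-up positive-definite Hessian A and an illustrative value of \alpha:

```r
alpha <- 0.1
A <- diag(c(4, 2, 0.5))  # toy Hessian for k = 3 parameters
k <- nrow(A)

# Number of well-determined parameters: k - alpha * Trace(A^{-1})
gamma <- k - alpha * sum(diag(solve(A)))
```

Intuitively, directions in weight space where the posterior is sharply curved (large eigenvalues of A) are well determined by the data and contribute nearly 1 to gamma, while flat directions are determined mostly by the prior and contribute little.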
The brnn R package
The brnn package was developed by Paulino Perez Rodriguez and Daniel Gianola, and it implements the two-layer Bayesian regularized neural network described in the previous section. The main function in the package is brnn( ), which can be called using the following command:
>brnn(x,y,neurons,normalize,epochs,…,Monte_Carlo,…)
Here, x is an n x p matrix, where n is the number of data points and p is the number of variables, and y is an n-dimensional vector containing the target values. The number of neurons in the hidden layer of the network can be specified by the variable neurons. If the indicator variable normalize is TRUE (the default option), the input and output are normalized. The maximum number of iterations during model training is specified using epochs. If the indicator binary variable Monte_Carlo is TRUE, then an MCMC method is used to estimate the trace of the inverse of the Hessian matrix A.
Let us try an example with the Auto MPG dataset that we used in Chapter 5, Bayesian
Regression Models. The following R code will import data, create training and test
sets, train a neural network model using training data, and make predictions for the test
set:
>install.packages("brnn") #one time installation
>library(brnn)
>mpgdataall <- read.csv("C:/…/auto-mpg.csv")#give the correct full
path
>mpgdata <- mpgdataall[,c(1,3,5,6)]
>#Fitting Bayesian NN Model
>ytrain <- mpgdata[1:100,1]
>xtrain <- as.matrix(mpgdata[1:100,2:4])
>mpg_brnn <- brnn(xtrain,ytrain,neurons=2,normalize = TRUE,epochs =
1000,Monte_Carlo = TRUE)
>summary(mpg_brnn)
A Bayesian regularized neural network
3 - 2 - 1 with 10 weights,biases and connection strengths
Inputs and output were normalized
Training finished because Changes in F= beta*SCE + alpha*Ew in last 3
iterations less than 0.001
>#Prediction using trained model
>ytest <- mpgdata[101:150,1]
>xtest <- as.matrix(mpgdata[101:150,2:4])
>ypred_brnn <- predict.brnn(mpg_brnn,xtest)
>plot(ytest,ypred_brnn)
>err <- ytest - ypred_brnn
>summary(err)
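Beyond eyeballing the scatter plot, the prediction error can be summarized with a single number such as the root mean squared error. Here is a minimal base R sketch; the vectors below are illustrative toy values, and in the example above one would pass ytest and ypred_brnn instead:

```r
# Hypothetical helper: RMSE between observed and predicted values.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

obs  <- c(18, 15, 24, 30)         # illustrative target values
pred <- c(17.2, 16.1, 23.5, 28.9) # illustrative predictions
rmse(obs, pred)                   # single-number error summary
```

A lower RMSE on the held-out test rows indicates better generalization of the trained network.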
Deep belief networks and deep learning
Some of the pioneering advancements in neural networks research in the last decade
have opened up a new frontier in machine learning, generally known as deep
learning (references 5 and 7 in the References section of this chapter). The
general definition of deep learning is: a class of machine learning techniques in which
many layers of information processing stages in hierarchical supervised
architectures are exploited for unsupervised feature learning and for pattern
analysis/classification. The essence of deep learning is to compute hierarchical
features or representations of the observational data, where the higher-level features
or factors are defined from lower-level ones (reference 8 in the References section of
this chapter). Although there are many similar definitions and architectures for deep
learning, two common elements in all of them are: multiple layers of nonlinear
information processing and supervised or unsupervised learning of feature
representations at each layer from the features learned at the previous layer. The
initial works on deep learning were based on multilayer neural network models.
Recently, many other forms of models have also been used, such as deep kernel
machines and deep Q-networks.
Even in previous decades, researchers had experimented with multilayer neural
networks. However, two reasons limited progress with learning using such
architectures. The first reason is that learning the network parameters is a nonconvex
optimization problem; starting from random initial conditions, one gets stuck at
local minima while minimizing the error. The second reason is that the associated
computational requirements were huge. A breakthrough for the first problem came when
Geoffrey Hinton developed a fast algorithm for learning a special class of neural
networks called deep belief nets (DBN). We will describe DBNs in more detail in
later sections. The high computational power requirements were met with the
advancement in computing using general purpose graphical processing units
(GPGPUs). What made deep learning so popular for practical applications is the
significant improvement in accuracy achieved in automatic speech recognition and
computer vision. For example, the word error rate in automatic speech recognition of a
switchboard conversational speech had reached a saturation of around 40% after years
of research.
However, using deep learning, the word error rate reduced dramatically to close to
10% in a matter of a few years. Another well-known example is how a deep convolutional
neural network achieved the lowest error rate of 15.3% in the 2012 ImageNet Large
Scale Visual Recognition Challenge, compared to state-of-the-art methods whose best
error rate was 26.2% (reference 6 in the References section of this chapter).
In this chapter, we will describe one class of deep learning models called deep belief
networks. Interested readers may wish to read the book by Li Deng and Dong Yu
(reference 9 in the References section of this chapter) for a detailed understanding of
various methods and applications of deep learning. We will follow their notations in the
rest of the chapter. We will also illustrate the use of DBN with the R package darch.
Restricted Boltzmann machines
A restricted Boltzmann machine (RBM) is a two-layer network (a bipartite graph), in
which one layer is a visible layer (v) and the second layer is a hidden layer (h). All
nodes in the visible layer are connected to all nodes in the hidden layer by undirected
edges, and there are no connections between nodes in the same layer:
An RBM is characterized by the joint distribution of the states of all visible units v
and the states of all hidden units h, given by:

P(v, h) = exp(-E(v, h)) / Z

Here, E(v, h) is called the energy function, and Z = Σ_{v,h} exp(-E(v, h)) is the
normalization constant, known by the name partition function from Statistical Physics
nomenclature.
There are mainly two types of RBMs. In the first one, both v and h are Bernoulli random
variables. In the second type, h is a Bernoulli random variable whereas v is a Gaussian
random variable. For the Bernoulli RBM, the energy function is given by:

E(v, h) = -Σ_{i,j} w_ij v_i h_j - Σ_i b_i v_i - Σ_j a_j h_j

Here, w_ij represents the weight of the edge between nodes v_i and h_j; b_i and a_j are the bias
parameters for the visible and hidden layers respectively. For this energy function, the
exact expressions for the conditional probabilities can be derived as follows:

P(h_j = 1 | v) = σ(a_j + Σ_i w_ij v_i)
P(v_i = 1 | h) = σ(b_i + Σ_j w_ij h_j)

Here, σ(x) is the logistic function 1/(1 + exp(-x)).
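These conditional probabilities are cheap to compute. Here is a minimal base R sketch of evaluating p(h_j = 1 | v) and sampling the hidden states, using illustrative random weights rather than a trained model:

```r
# Sketch: Bernoulli RBM hidden-unit conditional, p(h_j = 1 | v).
sigmoid <- function(x) 1 / (1 + exp(-x))

set.seed(1)
W <- matrix(rnorm(3 * 2, sd = 0.1), nrow = 3)  # w_ij: 3 visible x 2 hidden (illustrative)
a <- c(0, 0)                                   # hidden biases a_j
v <- c(1, 0, 1)                                # one visible configuration

p_h <- sigmoid(a + as.vector(t(W) %*% v))      # p(h_j = 1 | v) for each hidden unit
h   <- rbinom(length(p_h), 1, p_h)             # sample binary hidden states
p_h
```

The same pattern with W %*% h gives the visible-unit conditional in the other direction.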
If the input variables are continuous, one can use the Gaussian RBM, whose energy function
is given by:

E(v, h) = Σ_i (v_i - b_i)²/2 - Σ_j a_j h_j - Σ_{i,j} w_ij v_i h_j

In this case, the conditional distribution of v_i given h becomes a normal distribution
with mean b_i + Σ_j w_ij h_j and variance 1, while the conditional distribution of h given
v keeps the same logistic form as before.
Now that we have described the basic architecture of an RBM, how is it trained?
If we try to use the standard approach of taking the gradient of the log-likelihood,
we get the following update rule:

Δw_ij = ε(⟨v_i h_j⟩_data - ⟨v_i h_j⟩_model)

Here, ⟨v_i h_j⟩_data is the expectation of v_i h_j computed using the dataset, and
⟨v_i h_j⟩_model is the same expectation computed using the model. However, one cannot use
this exact expression for updating weights because ⟨v_i h_j⟩_model is difficult to compute.
The first breakthrough came to solve this problem and, hence, to train deep neural
networks, when Hinton and team proposed an algorithm called Contrastive
Divergence (CD) (reference 7 in the References section of this chapter). The essence
of the algorithm is described in the next paragraph.
The idea is to approximate ⟨v_i h_j⟩_model by using values of v and h generated using
Gibbs sampling from the conditional distributions mentioned previously. One scheme of
doing this is as follows:
1. Initialize v0 from the dataset.
2. Find h0 by sampling from the conditional distribution P(h | v0).
3. Find v1 by sampling from the conditional distribution P(v | h0).
4. Find h1 by sampling from the conditional distribution P(h | v1).
Once we have the values of v1 and h1, we use the product of the ith component of v1
and the jth component of h1 as an approximation for ⟨v_i h_j⟩_model. This
is called the CD-1 algorithm. One can generalize this to use the values from the kth step of
Gibbs sampling, which is known as the CD-k algorithm. One can easily see the connection
between RBMs and Bayesian inference. Since the CD algorithm is like a posterior
density estimate, one could say that RBMs are trained using a Bayesian inference
approach.
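The four steps above can be sketched in a few lines of base R. This is a toy illustration of a single CD-1 weight update with random initial weights and bias terms omitted for brevity; it is not the implementation used by any particular package:

```r
# Toy CD-1 update for a Bernoulli RBM (illustrative sizes, no biases).
sigmoid <- function(x) 1 / (1 + exp(-x))

set.seed(42)
nv <- 4; nh <- 3; eps <- 0.1                     # sizes and learning rate (assumed)
W  <- matrix(rnorm(nv * nh, sd = 0.01), nv, nh)  # weights w_ij

v0  <- c(1, 0, 1, 1)                             # step 1: v0 taken from the data
ph0 <- sigmoid(as.vector(t(W) %*% v0))           # p(h | v0)
h0  <- rbinom(nh, 1, ph0)                        # step 2: sample h0
pv1 <- sigmoid(as.vector(W %*% h0))              # p(v | h0)
v1  <- rbinom(nv, 1, pv1)                        # step 3: sample v1
ph1 <- sigmoid(as.vector(t(W) %*% v1))           # step 4: p(h | v1)

# CD-1: <v_i h_j>_data approximated by outer(v0, ph0),
#       <v_i h_j>_model approximated by outer(v1, ph1)
W <- W + eps * (outer(v0, ph0) - outer(v1, ph1))
```

Repeating this update over many data vectors (and using the kth Gibbs step instead of the first) gives the CD-k training loop.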
Although the Contrastive Divergence algorithm looks simple, one needs to be very
careful in training RBMs; otherwise, the model can overfit. Readers who
are interested in using RBMs in practical applications should refer to the technical
report (reference 10 in the References section of this chapter), where this is discussed
in detail.
Deep belief networks
One can stack several RBMs, one on top of another, such that the values of the hidden
units in the (n - 1)th layer become the values of the visible units in the nth layer,
and so on. The resulting network is called a deep belief network. It was one of the
main architectures used in early deep learning networks for pretraining. The idea of
pretraining a NN is the following: in a standard three-layer (input-hidden-output) NN,
one can start with random initial values for the weights and, using the backpropagation
algorithm, find a good minimum of the log-likelihood function. However, when the
number of layers increases, a straightforward application of backpropagation does not
work because, starting from the output layer, as we compute the gradient values for the
layers deep inside, their magnitude becomes very small. This is called the gradient
vanishing problem. As a result, the network will get trapped in some poor local
minima. Backpropagation still works if we are starting from the neighborhood of a good
minimum. To achieve this, a DNN is often pretrained in an unsupervised way, using a
DBN. Instead of starting from random values of weights, train a DBN in an
unsupervised way and use weights from the DBN as initial weights for a corresponding
supervised DNN. It was seen that such DNNs pretrained using DBNs perform much
better (reference 8 in the References section of this chapter).
The layer-wise pretraining of a DBN proceeds as follows. Start with the first RBM and
train it using input data in the visible layer and the CD algorithm (or its latest better
variants). Then, stack a second RBM on top of this. For this RBM, use the values
sampled from the hidden layer of the first RBM as the values for its visible layer.
Continue this process for the desired
number of layers. The outputs of hidden units from the top layer can also be used as
inputs for training a supervised model. For this, add a conventional NN layer at the top
of DBN with the desired number of classes as the number of output nodes. Input for this
NN would be the output from the top layer of DBN. This is called DBN-DNN
architecture. Here, a DBN's role is generating highly efficient features (the output of
the top layer of DBN) automatically from the input data for the supervised NN in the top
layer.
The architecture of a five-layer DBN-DNN for a binary classification task is shown in
the following figure:
The last layer is trained using the backpropagation algorithm in a supervised manner for
the two output classes. We will illustrate the training and classification with such a
DBN-DNN using the darch R package.
The darch R package
The darch package, written by Martin Drees, is one of the R packages with which one
can begin doing deep learning in R. It implements the DBN described in the previous
section (references 5 and 7 in the References section of this chapter). The package can
be downloaded from https://cran.r-project.org/web/packages/darch/index.html.
The main class in the darch package implements deep architectures and provides the
ability to train them with Contrastive Divergence and fine-tune with backpropagation,
resilient backpropagation, and conjugate gradients. The new instances of the class are
created with the newDArch constructor. It is called with the following arguments: a
vector containing the number of nodes in each layer, the batch size, a Boolean variable
to indicate whether to use the ff package for computing weights and outputs, and the
name of the function for generating the weight matrices. Let us create a network having
two input units, four hidden units, and one output unit:
>install.packages("darch") #one time
>library(darch)
>darch <- newDArch(c(2,4,1),batchSize = 2,genWeightFunc =
generateWeights)
INFO [2015-07-19 18:50:29] Constructing a darch with 3 layers.
INFO [2015-07-19 18:50:29] Generating RBMs.
INFO [2015-07-19 18:50:29] Construct new RBM instance with 2 visible
and 4 hidden units.
INFO [2015-07-19 18:50:29] Construct new RBM instance with 4 visible
and 1 hidden units.
Let us train the DBN with a toy dataset. We use a toy example because training any
realistic example would take a long time: hours, if not days. Let us create an input
dataset containing two columns and four rows:
>inputs <- matrix(c(0,0,0,1,1,0,1,1),ncol=2,byrow=TRUE)
>outputs <- matrix(c(0,1,1,0),nrow=4)
Now, let us pretrain the DBN, using the input data:
>darch <- preTrainDArch(darch,inputs,maxEpoch=1000)
We can have a look at the weights learned at any layer using the getLayerWeights( )
function. Let us see how the hidden layer looks:
>getLayerWeights(darch,index=1)
[[1]]
          [,1]        [,2]      [,3]      [,4]
[1,]  8.167022   0.4874743 -7.563470 -6.951426
[2,]  2.024671 -10.7012389  1.313231  1.070006
[3,] -5.391781   5.5878931  3.254914  3.000914
Now, let's do backpropagation for supervised learning. For this, we need to first set
the layer functions to sigmoidUnitDerivative:
>layers <- getLayers(darch)
>for(i in length(layers):1){
layers[[i]][[2]] <- sigmoidUnitDerivative
}
>setLayers(darch) <- layers
>rm(layers)
Finally, the following two lines perform the backpropagation:
>setFineTuneFunction(darch) <- backpropagation
>darch <- fineTuneDArch(darch,inputs,outputs,maxEpoch=1000)
We can see the prediction quality of DBN on the training data itself by running darch as
follows:
>darch <- getExecuteFunction(darch)(darch,inputs)
>outputs_darch <- getExecOutputs(darch)
>outputs_darch[[2]]
[,1]
[1,] 9.998474e-01
[2,] 4.921130e-05
[3,] 9.997649e-01
[4,] 3.796699e-05
Comparing with the actual output, the DBN has predicted the wrong output for the first and
second input rows. Since this example was just to illustrate how to use the darch
package, we are not worried about the 50% accuracy here.
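The 50% figure is easy to verify: threshold the network outputs at 0.5 and compare them with the targets. A small base R check, using the values printed above:

```r
# Threshold the probabilistic outputs and compute the fraction correct.
pred_prob <- c(9.998474e-01, 4.921130e-05, 9.997649e-01, 3.796699e-05)  # from the run above
actual    <- c(0, 1, 1, 0)                                              # the XOR targets
pred_lab  <- as.integer(pred_prob > 0.5)  # hard class labels
mean(pred_lab == actual)                  # fraction of correct predictions -> 0.5
```

Rows three and four are classified correctly, while rows one and two are not, giving the 50% accuracy mentioned above.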
Other deep learning packages in R
Although there are other deep learning packages in R, such as deepnet and RcppDL,
R does not yet have good native libraries for deep learning compared with libraries in
other languages, such as CUDA (C++) and Theano (Python). The only available option
is a wrapper for the Java-based open source deep learning project H2O. This R
package, h2o, allows running H2O via its REST API from within R. Readers who are
interested in serious deep learning projects and applications should use H2O through the
h2o package in R. One needs to install H2O on one's machine to use h2o. We will cover
H2O in the next chapter, when we discuss Big Data and the distributed computing
platform called Spark.
Exercises
1. For the Auto MPG dataset, compare the performance of predictive models using
ordinary regression, Bayesian GLM, and Bayesian neural networks.
References
1. MacKay D. J. C. Information Theory, Inference and Learning Algorithms.
Cambridge University Press. 2003. ISBN-10: 0521642981
2. MacKay D. J. C. "The Evidence Framework Applied to Classification Networks".
Neural Computation. Volume 4(3), 698-714. 1992
3. MacKay D. J. C. "Probable Networks and Plausible Predictions – a review of
practical Bayesian methods for supervised neural networks". Network:
Computation in neural systems
4. Hinton G. E., Rumelhart D. E., and Williams R. J. "Learning Representations by
Back Propagating Errors". Nature. Volume 323, 533-536. 1986
5. MacKay D. J. C. "Bayesian Interpolation". Neural Computation. Volume 4(3),
415-447. 1992
6. Hinton G. E., Krizhevsky A., and Sutskever I. "ImageNet Classification with Deep
Convolutional Neural Networks". Advances In Neural Information Processing
Systems (NIPS). 2012
7. Hinton G., Osindero S., and Teh Y. "A Fast Learning Algorithm for Deep Belief
Nets". Neural Computation. 18:1527–1554. 2006
8. Hinton G. and Salakhutdinov R. "Reducing the Dimensionality of Data with Neural
Networks". Science. 313(5786):504–507. 2006
9. Li Deng and Dong Yu. Deep Learning: Methods and Applications (Foundations
and Trends(r) in Signal Processing). Now Publishers Inc. Vol 7, Issue 3-4. 2014.
ISBN-13: 978-1601988140
10. Hinton G. "A Practical Guide to Training Restricted Boltzmann Machines". UTML
Tech Report 2010-003. Univ. Toronto. 2010
Summary
In this chapter, we learned about an important class of machine learning models, namely
neural networks, and their Bayesian implementation. These models are inspired by the
architecture of the human brain, and they continue to be an area of active research and
development. We also learned about one of the latest advances in neural networks,
called deep learning. It can be used to solve many problems, such as computer vision
and natural language processing, that involve highly cognitive elements. Artificial
intelligence systems using deep learning have been able to achieve accuracies comparable
to human intelligence in tasks such as speech recognition and image classification. With
this chapter, we have covered the important classes of Bayesian machine learning models.
In the next chapter, we will look at a different aspect: large-scale machine learning and
some of its applications in Bayesian models.
Chapter 9. Bayesian Modeling at Big
Data Scale
When we learned the principles of Bayesian inference in Chapter 3, Introducing
Bayesian Inference, we saw that as the amount of training data increases, the contribution
to the parameter estimation from the data outweighs that from the prior distribution. Also,
the uncertainty in parameter estimation decreases. Therefore, you may wonder why one
needs Bayesian modeling in large-scale data analysis. To answer this question, let us
look at one such problem, which is building recommendation systems for e-commerce
products.
In a typical e-commerce store, there will be millions of users and tens of thousands of
products. However, each user would have purchased only a small fraction (less than
10%) of all the products found in the store in their lifetime. Let us say the e-commerce
store is collecting users' feedback for each product sold as a rating on a scale of 1 to 5.
Then, the store can create a user-product rating matrix to capture the ratings of all users.
In this matrix, rows would correspond to users and columns would correspond to
products. The entry of each cell would be the rating given by the user (corresponding to
the row) to the product (corresponding to the column). Now, it is easy to see that
although the overall size of this matrix is huge, fewer than 10% of its entries would have
values, since each user would have bought less than 10% of the products from the store.
So, this is a highly sparse dataset. Whenever there is a machine learning task where,
even though the overall data size is huge, the data is highly sparse, overfitting can
happen and one should rely on Bayesian methods (reference 1 in the References section
of this chapter). Also, many models such as Bayesian networks, Latent Dirichlet
allocation, and deep belief networks are built on the Bayesian inference paradigm.
When these models are trained on a large dataset, such as text corpora from Reuters,
then the underlying problem is large-scale Bayesian modeling. As it is, Bayesian
modeling is computationally intensive since we have to estimate the whole posterior
distribution of parameters and also do model averaging of the predictions. The presence
of large datasets will make the situation even worse. So what are the computing
frameworks that we can use to do Bayesian learning at a large scale using R? In the next
two sections, we will discuss some of the latest developments in this area.
Distributed computing using Hadoop
In the last decade, tremendous progress was made in distributed computing when two
research engineers from Google developed a computing paradigm called the
MapReduce framework and an associated distributed filesystem called Google File
System (reference 2 in the References section of this chapter). Later on, Yahoo
developed an open source version of this distributed filesystem named Hadoop that
became the hallmark of Big Data computing. Hadoop is ideal for processing large
amounts of data, which cannot fit into the memory of a single large computer, by
distributing the data into multiple computers and doing the computation on each node
locally from the disk. An example would be extracting relevant information from log
files, where typically the size of data for a month would be in the order of terabytes.
To use Hadoop, one has to write programs using the MapReduce framework to parallelize
the computing. A Map operation splits the data into multiple key-value pairs and sends
them to different nodes. At each of those nodes, a computation is done on each of the
key-value pairs. Then, there is a shuffling operation, where all the pairs with the same
key are brought together. After this, a Reduce operation sums up all the results
corresponding to the same key from the previous computation step. Typically, these
MapReduce operations can be written using a high-level language called Pig. One can
also write MapReduce programs in R using the RHadoop package, which we will
describe in the next section.
RHadoop for using Hadoop from R
RHadoop is a collection of open source packages using which an R user can manage
and analyze data stored in the Hadoop Distributed File System (HDFS). In the
background, RHadoop translates these operations into MapReduce operations in Java and
runs them on HDFS.
The various packages in RHadoop and their uses are as follows:
rhdfs: Using this package, a user can connect to an HDFS from R and perform
basic actions such as read, write, and modify files.
rhbase: This is the package to connect to an HBase database from R and to read,
write, and modify tables.
plyrmr: Using this package, an R user can do the common data manipulation tasks
such as the slicing and dicing of datasets. This is similar to the function of
packages such as plyr or reshape2.
rmr2: Using this package, a user can write MapReduce functions in R and execute
them in an HDFS.
Unlike the other packages discussed in this book, the packages associated with
RHadoop are not available from CRAN. They can be downloaded from the GitHub
repository at https://github.com/RevolutionAnalytics and are installed from the local
drive.
Here is a sample MapReduce code written using the rmr2 package to count the number
of words in a corpus (reference 3 in the References section of this chapter):
1. The first step involves loading the rmr2 library:
>library(rmr2)
>LOCAL <- T #to execute rmr2 locally
2. The second step involves writing the Map function. This function takes each line in
the text document and splits it into words. Each word is taken as a token. The
function emits key-value pairs where each distinct word is a key and value = 1:
>#map function
>map.wc <- function(k,lines){
words.list <- strsplit(lines,'\\s+')
words <- unlist(words.list)
return(keyval(words,1))
}
3. The third step involves writing a reduce function. This function groups all the same
key from different mappers and sums their value. Since, in this case, each word is
a key and the value = 1, the output of the reduce will be the count of the words:
>#reduce function
>reduce.wc<-function(word,counts){
return(keyval(word,sum(counts) ))
}
4. The fourth step involves writing a word count function combining the map and
reduce functions and executing this function on a file named hdfs.data stored in
the HDFS containing the input text:
>#word count function
>wordcount<-function(input,output=NULL){
mapreduce(input = input,output = output,input.format
= "text",map = map.wc,reduce = reduce.wc,combine = T)
}
>out<-wordcount(hdfs.data,hdfs.out)
5. The fifth step involves getting the output file from HDFS and printing the top five
lines:
>results<-from.dfs(out)
>results.df<-as.data.frame(results,stringsAsFactors=F)
>colnames(results.df)<-c('word','count')
>head(results.df)
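To see the map/shuffle/reduce logic without a Hadoop installation, the same word count can be expressed in plain base R on a toy input. This is only an analogy for intuition, not rmr2 code:

```r
# "Map": split each line into words and emit a value of 1 per word;
# "Reduce": sum the values grouped by word (the key).
lines  <- c("to be or not to be", "that is the question")  # toy corpus
words  <- unlist(strsplit(lines, "\\s+"))                  # map step: tokens
counts <- tapply(rep(1, length(words)), words, sum)        # shuffle + reduce per key
counts[c("to", "be", "question")]
```

Here tapply plays the role of the shuffle (grouping by key) plus the reduce (summing the counts), which is exactly what map.wc and reduce.wc do across the cluster.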
Spark – in-memory distributed
computing
One of the issues with Hadoop is that after a MapReduce operation, the resulting files
are written to the hard disk. Therefore, when there is a large data processing operation,
there will be many read and write operations on the hard disk, which makes
processing in Hadoop very slow. Moreover, network latency, which is the time
required to shuffle data between different nodes, also contributes to this problem.
Another disadvantage is that one cannot make real-time queries from the files stored in
HDFS. For machine learning problems, intermediate results do not persist in memory
across the iterations of the training phase. All this makes Hadoop less than ideal as a
platform for machine learning.
A solution to this problem was invented at UC Berkeley's AMPLab in 2009. It came out
of the PhD work of Matei Zaharia, a Romanian-born computer scientist. His paper
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing (reference 4 in the References section of this chapter) gave rise to the Spark
project, which eventually became a fully open source project under Apache. Spark is an
in-memory distributed computing framework that solves many of the problems of Hadoop
mentioned earlier. Moreover, it supports more types of operations than just MapReduce.
Spark can be used for processing iterative algorithms, interactive data mining, and
streaming applications. It is based on an abstraction called Resilient Distributed
Datasets (RDD). Similar to HDFS, it is also fault-tolerant.
Spark is written in a language called Scala. It has interfaces for Java and Python, and
from the recent version 1.4.0, it also supports R. This is called SparkR, which we
will describe in the next section. The four classes of libraries available in Spark are
SQL and DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph
algorithms). Currently, SparkR supports only SQL and DataFrames; the others are definitely
on the roadmap. Spark can be downloaded from the Apache project page at
http://spark.apache.org/downloads.html. Starting from version 1.4.0, SparkR is included
in Spark and no separate download is required.
SparkR
Similar to RHadoop, SparkR is an R package that allows R users to use Spark APIs
through the RDD class. For example, using SparkR, users can run jobs on Spark from
RStudio. SparkR can be invoked from RStudio. To enable this, include the following
lines in your .Rprofile file that R uses at startup to initialize the environments:
Sys.setenv(SPARK_HOME="/.../spark-1.5.0-bin-hadoop2.6")
#provide the correct path where the downloaded Spark folder is kept for SPARK_HOME
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Once this is done, start RStudio and enter the following commands to start using
SparkR:
>library(SparkR)
>sc <- sparkR.init(master="local")
As mentioned, as of the latest version, 1.5, at the time of writing this chapter, SparkR
supports limited functionalities of R. This mainly includes data slicing and dicing and
summary statistics functions. The current version does not support the use of contributed R
packages; however, this is planned for a future release. For machine learning, SparkR
currently supports the glm( ) function. We will do an example in the next section.
Linear regression using SparkR
In the following example, we will illustrate how to use SparkR for machine learning.
For this, we will use the same dataset of energy efficiency measurements that we used
for linear regression in Chapter 5, Bayesian Regression Models:
>library(SparkR)
>sc <- sparkR.init(master="local")
>sqlContext <- sparkRSQL.init(sc)
#Importing data
>df <- read.csv("/Users/harikoduvely/Projects/Book/Data/ENB2012_data.csv",header = T)
>#Excluding variables Y2, X6, X8 and removing records beyond row 768, which contain mainly null values
>df <- df[1:768,c(1,2,3,4,5,7,9)]
>#Converting to a Spark R Dataframe
>dfsr <- createDataFrame(sqlContext,df)
>model <- glm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X7,data = dfsr,family =
"gaussian")
> summary(model)
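The glm( ) call in SparkR mirrors the formula interface of base R's glm( ). For comparison, here is a toy Gaussian fit in plain base R on simulated data (not the energy-efficiency dataset), showing the same pattern of formula, data, and family:

```r
# Simulate a small regression problem with known coefficients.
set.seed(7)
x1 <- runif(50); x2 <- runif(50)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(50, sd = 0.1)  # true coefficients: 2, 3, -1.5

# Same call shape as the SparkR example above, but on a local data frame.
fit <- glm(y ~ x1 + x2, family = "gaussian")
coef(fit)   # estimates should be close to (2, 3, -1.5)
```

The only difference in the SparkR version is that the data argument is a distributed DataFrame, so the same model can be fitted on data too large for a single machine's memory.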
Computing clusters on the cloud
In order to process large datasets using Hadoop and the associated R packages, one needs a
cluster of computers. In today's world, it is easy to get one using the cloud computing
services provided by Amazon, Microsoft, and others. One needs to pay only for the amount
of CPU and storage used; there is no need for upfront investments in infrastructure. The top four
cloud computing services are AWS by Amazon, Azure by Microsoft, Compute Cloud by
Google, and Bluemix by IBM. In this section, we will discuss running R programs on
AWS. In particular, you will learn how to create an AWS instance; install R, RStudio,
and other packages in that instance; develop and run machine learning models.
Amazon Web Services
Popularly known as AWS, Amazon Web Services started as an internal project in
Amazon in 2002 to meet the dynamic computing requirements of its e-commerce
business. This grew into an infrastructure-as-a-service offering, and in 2006 Amazon
launched two services to the world, Simple Storage Service (S3) and Elastic
Compute Cloud (EC2). From there, AWS grew at an incredible pace. Today, they have
more than 40 different types of services using millions of servers.
Creating and running computing instances on
AWS
The best place to learn how to set up an AWS account and start using EC2 is the freely
available e-book from Amazon Kindle store named Amazon Elastic Compute Cloud
(EC2) User Guide (reference 6 in the References section of this chapter).
Here, we only summarize the essential steps involved in the process:
1. Create an AWS account.
2. Sign in to the AWS management console (https://aws.amazon.com/console/).
3. Click on the EC2 service.
4. Choose an Amazon Machine Image (AMI).
5. Choose an instance type.
6. Create a public-private key pair.
7. Configure the instance.
8. Add storage.
9. Tag the instance.
10. Configure a security group (a policy specifying who can access the instance).
11. Review and launch the instance.
Log in to your instance using SSH (from Linux/Ubuntu), Putty (from Windows), or a
browser using the private key provided at the time of configuring security and the IP
address given at the time of launching. Here, we are assuming that the instance you have
launched is a Linux instance.
Installing R and RStudio
To install R and RStudio, you need to be an authenticated user. So, create a new user
and give the user administrative privilege (sudo). After that, execute the following steps
from the Ubuntu shell:
1. Edit the /etc/apt/sources.list file.
2. Add the following line at the end:
deb http://cran.rstudio.com/bin/linux/ubuntu trusty/
3. Get the keys for the repository to run:
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
4. Update the package list:
sudo apt-get update
5. Install the latest version of R:
sudo apt-get install r-base-core
6. Install gdebi to install Debian packages from the local disk:
sudo apt-get install gdebi-core
7. Download the RStudio package:
wget http://download2.rstudio.org/rstudio-server-0.99.446-amd64.deb
8. Install RStudio:
sudo gdebi rstudio-server-0.99.446-amd64.deb
Once the installation is completed successfully, RStudio running on your AWS instance
can be accessed from a browser. For this, open a browser and enter the URL
<your.aws.ip.no>:8787.
If you are able to use your RStudio running on the AWS instance, you can then install
other packages such as rhdfs, rmr2, and more from RStudio, build any machine learning
models in R, and run them on the AWS cloud.
Apart from R and RStudio, AWS also supports Spark (and hence SparkR). In the
following section, you will learn how to run Spark on an EC2 cluster.
Running Spark on EC2
You can launch and manage Spark clusters on Amazon EC2 using the spark-ec2 script
located in the ec2 directory of Spark in your local machine. To launch a Spark cluster
on EC2, use the following steps:
1. Go to the ec2 directory in the Spark folder in your local machine.
2. Run the following command:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch
<cluster-name>
Here, <keypair> is the name of the keypair you used for launching the EC2
service mentioned in the Creating and running computing instances on AWS
section of this chapter. The <key-file> is the path in your local machine where
the private key has been downloaded and kept. The number of worker nodes is
specified by <num-slaves>.
3. To run your programs in the cluster, first SSH into the cluster using the following
command:
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
After logging into the cluster, you can use Spark as you use on the local machine.
More details on how to use Spark on EC2 can be found in the Spark documentation and
AWS documentation (references 5, 6, and 7 in the References section of the chapter).
Microsoft Azure
Microsoft Azure has full support for R and Spark. Microsoft bought Revolution
Analytics, a company that started building and supporting an enterprise version of R.
Apart from this, Azure has a machine learning service where there are APIs for some
Bayesian machine learning models as well. A nice video tutorial of how to launch
instances on Azure and how to use their machine learning as a service can be found at
the Microsoft Virtual Academy website (reference 8 in the References section of the
chapter).
IBM Bluemix
Bluemix has full support for R through the complete set of R libraries available on its
instances. IBM also has the integration of Spark into its cloud services on its roadmap.
More details can be found on their documentation page (reference 9 in the
References section of the chapter).
Other R packages for large scale
machine learning
Apart from RHadoop and SparkR, there are several other native R packages
specifically built for large-scale machine learning. Here, we give a brief overview of
them. Interested readers should refer to CRAN Task View: High-Performance and
Parallel Computing with R (reference 10 in the References section of the chapter).
Though R is single-threaded, there exist several packages for parallel computation in
R. Some of the well-known ones are Rmpi (an R version of the popular Message
Passing Interface), multicore, snow (for building R clusters), and foreach. Starting with R
2.14.0, a new package called parallel ships with base R. We will discuss
some of its features here.
The parallel R package
The parallel package is built on top of the multicore and snow packages. It is useful for
running a single program on multiple datasets, as in K-fold cross-validation. It can be
used for parallelizing on a single machine over multiple CPUs/cores or across several
machines. For parallelizing across a cluster of machines, it invokes MPI (Message
Passing Interface) through the Rmpi package.
We will illustrate the use of the parallel package with a simple example: computing the
square of each number in the list 1:100000. This example will not work on Windows, since
R on Windows does not support the fork-based parallelism of the multicore package. It
can be tested on any Linux or OS X platform.
The sequential way of performing this operation is to use the lapply function as
follows:
>nsquare <- function(n){return(n*n)}
>range <- c(1:100000)
>system.time(lapply(range,nsquare))
Using the mclapply function of the parallel package, this computation can be achieved
in much less time:
>library(parallel) #included in the base R distribution, no separate installation required
>numCores <- detectCores() #find the number of cores in the machine
>system.time(mclapply(range, nsquare, mc.cores = numCores))
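On Windows, where mclapply cannot use multiple cores, a socket (PSOCK) cluster from the same parallel package can be used instead on any platform. The following is a minimal sketch, reusing the range and nsquare objects defined earlier:

```r
>library(parallel)
>numCores <- detectCores()
>cl <- makeCluster(numCores) #creates a PSOCK cluster by default; works on all platforms, including Windows
>system.time(parLapply(cl, range, nsquare))
>stopCluster(cl) #always release the worker processes when done
```

PSOCK workers are separate R processes, so any packages or variables the function needs must be exported to them with clusterEvalQ or clusterExport.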
If the dataset is so large that it needs a cluster of computers, we can use the parLapply
function to run the program over a cluster. This needs the Rmpi package:
>install.packages("Rmpi") #one time
>library(Rmpi)
>numNodes <- 4 #number of worker nodes
>cl <- makeCluster(numNodes, type = "MPI")
>system.time(parLapply(cl, range, nsquare))
>stopCluster(cl)
>mpi.exit()
The foreach R package
This is a looping construct for R that can be executed in parallel across multiple cores
or clusters. It has two important operators: %do% for executing tasks sequentially and
%dopar% for executing them in parallel.
For example, the squaring function we discussed in the previous section can be
implemented as a one-line command using the foreach package:
>install.packages("foreach") #one time
>install.packages("doParallel") #one time
>library(foreach)
>library(doParallel)
>registerDoParallel(numCores) #register a parallel backend; without this, %dopar% runs sequentially with a warning
>system.time(foreach(i = 1:100000) %do% i^2) #for executing sequentially
>system.time(foreach(i = 1:100000) %dopar% i^2) #for executing in parallel
We will also implement quicksort using the foreach function:
>qsort <- function(x) {
  n <- length(x)
  if (n == 0) {
    x
  } else {
    p <- sample(n, 1)
    smaller <- foreach(y = x[-p], .combine = c) %:% when(y <= x[p]) %do% y
    larger <- foreach(y = x[-p], .combine = c) %:% when(y > x[p]) %do% y
    c(qsort(smaller), x[p], qsort(larger))
  }
}
>qsort(runif(12))
These packages are still under active development and have not yet been widely used
for Bayesian modeling. Nevertheless, they are easy to apply to Bayesian inference
tasks such as Monte Carlo simulations.
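As a quick illustration of this last point, the following sketch uses foreach with %dopar% to run independent Monte Carlo estimates of pi in parallel and combine the results with .combine. The helper function estimatePi is introduced here only for illustration:

```r
>library(foreach)
>library(doParallel)
>registerDoParallel(cores = detectCores()) #register the parallel backend for %dopar%
>estimatePi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1) #fraction of random points inside the quarter circle
}
>estimates <- foreach(seed = 1:8, .combine = c) %dopar% {
  set.seed(seed) #independent, reproducible stream for each task
  estimatePi(100000)
}
>mean(estimates) #pooled estimate of pi
```

The same pattern applies to running parallel MCMC chains: each %dopar% task runs one chain, and .combine collects the posterior samples.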
Exercises
1. Revisit the classification problem in Chapter 6, Bayesian Classification Models.
Repeat the same problem using the glm() function of SparkR.
2. Revisit the linear regression problem we solved in this chapter using SparkR. After
creating the AWS instance, repeat the problem using the RStudio server on AWS.
References
1. "MapReduce Implementation of Variational Bayesian Probabilistic Matrix
Factorization Algorithm". In: IEEE Conference on Big Data. pp 145-152. 2013
2. Dean J. and Ghemawat S. "MapReduce: Simplified Data Processing on Large
Clusters". Communications of the ACM 51 (1). 107-113
3. https://github.com/jeffreybreen/tutorial-rmr2-airline/blob/master/R/1-wordcount.R
4. Chowdhury M., Das T., Dave A., Franklin M.J., Ma J., McCauley M., Shenker S.,
Stoica I., and Zaharia M. "Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing". NSDI 2012. 2012
5. Amazon Elastic Compute Cloud (EC2) User Guide, Kindle e-book by Amazon
Web Services, updated April 9, 2014
6. Spark documentation for EC2 at http://spark.apache.org/docs/latest/ec2-scripts.html
7. AWS documentation for Spark at
http://aws.amazon.com/elasticmapreduce/details/spark/
8. Microsoft Virtual Academy website at
http://www.microsoftvirtualacademy.com/training-courses/getting-started-with-microsoft-azure-machine-learning
9. IBM Bluemix Tutorial at
http://www.ibm.com/developerworks/cloud/bluemix/quick-start-bluemix.html
10. CRAN Task View for contributed packages in R at https://cran.r-project.org/web/views/HighPerformanceComputing.html
Summary
In this last chapter of the book, we covered various frameworks for implementing large-scale machine learning. These are very useful for Bayesian learning too. For example,
to simulate from a posterior distribution, one could run a Gibbs sampling over a cluster
of machines. We learned how to connect to Hadoop from R using the RHadoop package
and how to use R with Spark using SparkR. We also discussed how to set up clusters in
cloud services such as AWS and how to run Spark on them. Native parallelization
frameworks such as the parallel and foreach packages were also covered.
The overall aim of this book was to introduce readers to the area of Bayesian modeling
using R. Readers should have gained a good grasp of theory and concepts behind
Bayesian machine learning models. Since the examples were mainly given for the
purposes of illustration, I urge readers to apply these techniques to real-world problems
to appreciate the subject of Bayesian inference more deeply.
Index
A
Akaike information criterion (AIC) / Laplace approximation
allele frequencies
about / Beta distribution
arm package
about / The arm package
association rule mining
about / An overview of common machine learning tasks
B
Bayesian averaging
about / Bayesian averaging
Bayesian classification models
exercises / Exercises
Bayesian inference
Bayesian view of uncertainty / Bayesian view of uncertainty
exercises / Exercises
for machine learning / Why Bayesian inference for machine learning?
Bayesian information criterion (BIC) / Laplace approximation
Bayesian logistic regression model
about / The Bayesian logistic regression model
BayesLogit R package / The BayesLogit R package
dataset / The dataset
training, preparing for / Preparation of the training and testing datasets
datasets testing, preparing for / Preparation of the training and testing datasets
using / Using the Bayesian logistic model
Bayesian mixture models
about / Bayesian mixture models
bgmm package / The bgmm package for Bayesian mixture models
Bayesian modeling, at Big Data scale
exercises / Exercises
Bayesian models, for unsupervised learning
exercises / Exercises
Bayesian neural networks
exercises / Exercises
Bayesian Output Analysis Program (BOA) / R packages for Gibbs sampling
Bayesian regression models
exercises / Exercises
Bayesian theorem
about / Bayesian theorem
Bayesian treatment, of neural networks
about / Bayesian treatment of neural networks
Bayesian view of uncertainty
about / Bayesian view of uncertainty
prior distribution, selecting / Choosing the right prior distribution
posterior distribution, estimation / Estimation of posterior distribution
future observations, predicting / Prediction of future observations
BayesLogit R package
about / The BayesLogit R package
Beta distribution
about / Beta distribution
bgmm package
about / Bayesian mixture models
fully supervised GMM / The bgmm package for Bayesian mixture models
semi-supervised GMM / The bgmm package for Bayesian mixture models
partially supervised GMM / The bgmm package for Bayesian mixture models
unsupervised GMM / The bgmm package for Bayesian mixture models
bias-variance tradeoff
about / Model overfitting and bias-variance tradeoff
binomial distribution
about / Binomial distribution
binomlogit package / R packages for Gibbs sampling
Bmk / R packages for Gibbs sampling
BoomSpikeSlab package / R packages for Gibbs sampling
brnn R package
about / The brnn R package
C
CD-1 algorithm / Restricted Boltzmann machines
CD-k algorithm / Restricted Boltzmann machines
central limit theorem / Probability distributions
classification
about / An overview of common machine learning tasks
clustering
about / An overview of common machine learning tasks
clusters, computing on cloud
about / Computing clusters on the cloud
Amazon Web Services / Amazon Web Services
computing instances, running on AWS / Creating and running computing
instances on AWS
computing instances, creating / Creating and running computing instances on
AWS
R, installing / Installing R and RStudio
RStudio, installing / Installing R and RStudio
Spark, running on EC2 / Running Spark on EC2
Microsoft Azure / Microsoft Azure
IBM Bluemix / IBM Bluemix
common machine learning tasks
overview / An overview of common machine learning tasks
classification / An overview of common machine learning tasks
regression / An overview of common machine learning tasks
clustering / An overview of common machine learning tasks
association rules / An overview of common machine learning tasks
forecasting / An overview of common machine learning tasks
dimensional reduction / An overview of common machine learning tasks
density estimation / An overview of common machine learning tasks
Comprehensive R Archive Network (CRAN) / Installing R and RStudio
conditional probability
about / Conditional probability
conjugate distributions / Conjugate priors
conjugate prior for the likelihood function / Conjugate priors
Contrastive Divergence (CD) / Restricted Boltzmann machines
Correlated Topic Models (CTM)
about / The topicmodels package
covariance
about / Expectations and covariance
Cuda (C++) / Other deep learning packages in R
curse of dimensionality
about / An overview of common machine learning tasks
D
darch R package / The darch R package
data, managing in R
about / Managing data in R
data types / Data Types in R
data structures / Data structures in R
data, importing into R / Importing data into R
datasets, slicing / Slicing and dicing datasets
datasets, dicing / Slicing and dicing datasets
vectorized operations / Vectorized operations
data structures, R
homogeneous / Data structures in R
heterogeneous / Data structures in R
data types, R
integer / Data Types in R
complex / Data Types in R
numeric / Data Types in R
character / Data Types in R
logical / Data Types in R
data visualization
about / Data visualization
high-level plotting functions / High-level plotting functions
low-level plotting commands / Low-level plotting commands
interactive graphics functions / Interactive graphics functions
DBN-DNN architecture / Deep belief networks
deep belief nets (DBN) / Deep belief networks and deep learning
deep belief networks
about / Deep belief networks and deep learning, Deep belief networks
restricted Boltzmann machine (RBM) / Restricted Boltzmann machines
darch R package / The darch R package
deep learning packages / Other deep learning packages in R
deep learning
about / Deep belief networks and deep learning
deepnet / Other deep learning packages in R
density estimation
about / An overview of common machine learning tasks
dimensional reduction
about / An overview of common machine learning tasks
Dirichlet distribution
about / Dirichlet distribution
distributed computing
with Hadoop / Distributed computing using Hadoop
divergence
about / Variational approximation
E
econometrics
about / Gamma distribution
Elastic Computing Cloud (EC2) / Amazon Web Services
Energy efficiency dataset
about / The Energy efficiency dataset
energy function / Restricted Boltzmann machines
Evolutionary Monte Carlo (EMC) algorithm package / R packages for the
Metropolis-Hasting algorithm
exercises
about / Exercises
expectation-maximization (EM) algorithm / Bayesian mixture models
expectations
about / Expectations and covariance
F
false negative or type II error
about / Performance metrics for classification
false positive or type I error
about / Performance metrics for classification
foreach / Other R packages for large scale machine learning
foreach R package / The foreach R package
forecasting
about / An overview of common machine learning tasks
G
Gamma distribution
about / Gamma distribution
Gaussian mixture model (GMM) / Bayesian mixture models
generalized linear regression
about / Generalized linear regression
general purpose graphical processing units (GPGPUs) / Deep belief networks and
deep learning
ggmcmc package / R packages for Gibbs sampling
ggplot2 / Data visualization
ggplot2 package / Bayesian regression
gibbs.met package
R packages / R packages for Gibbs sampling
GibbsACOV package / R packages for Gibbs sampling
Gibbs sampling
about / Gibbs sampling
R packages / R packages for Gibbs sampling
gradient vanishing problem / Deep belief networks
grid / Data visualization
H
Hadoop
about / Distributed computing using Hadoop
Hadoop Distributed File System (HDFS) / RHadoop for using Hadoop from R
high-level plotting functions
about / High-level plotting functions
I
IBM Bluemix
about / IBM Bluemix
Integrated Development Environment (IDE) / Installing R and RStudio
interactive graphics functions
about / Interactive graphics functions
K
kernel density estimation (KDE)
about / An overview of common machine learning tasks
L
Laplace approximation / Laplace approximation
Latent Dirichlet allocation (LDA) / An overview of common machine learning
tasks
about / Latent Dirichlet allocation, The lda package
R packages / R packages for LDA
lattice / Data visualization
lda package / R packages for Gibbs sampling
about / The lda package
linear regression
using SparkR / Linear regression using SparkR
logit function
using / The Bayesian logistic regression model
loop functions, R programs
about / Loop functions
lapply / lapply
sapply / sapply
mapply / mapply
apply / apply
tapply / tapply
low-level plotting commands
about / Low-level plotting commands
M
MapReduce
about / Distributed computing using Hadoop
marginal distribution
about / Marginal distribution
marginalization
about / Marginal distribution
Markov Chain Monte Carlo (MCMC) simulations
about / Monte Carlo simulations
maximum a posteriori (MAP) estimation / Maximum a posteriori estimation
maximum likelihood estimate / Bayesian view of uncertainty
maximum likelihood method / Bayesian mixture models
MCMCglm package / R packages for Gibbs sampling
mcmc package / R packages for the Metropolis-Hasting algorithm
Metropolis-Hasting algorithm
about / The Metropolis-Hasting algorithm
R packages / R packages for the Metropolis-Hasting algorithm
MHadaptive / R packages for the Metropolis-Hasting algorithm
Microsoft Azure
about / Microsoft Azure
miles per gallon (mpg) / Exercises
mixed membership stochastic block model (MMSB)
about / The lda package
model overfitting
about / Model overfitting and bias-variance tradeoff
model regularization
about / Model regularization
Ridge regression / Model regularization
Lasso / Model regularization
models selection
about / Selecting models of optimum complexity
subset selection / Subset selection
model regularization / Model regularization
Monte Carlo simulations
about / Monte Carlo simulations
Metropolis-Hasting algorithm / The Metropolis-Hasting algorithm
Gibbs sampling / Gibbs sampling
multicore / Other R packages for large scale machine learning
N
Naïve Bayes classifier
about / The Naïve Bayes classifier
text processing, with tm package / Text processing using the tm package
model training and prediction / Model training and prediction
O
OpenBUGS MCMC package / R packages for Gibbs sampling
Open Database Connectivity (ODBC) / Importing data into R
P
parallel / Other R packages for large scale machine learning
parallel R package / The parallel R package
partially supervised GMM
belief( ) function / The bgmm package for Bayesian mixture models
soft( ) function / The bgmm package for Bayesian mixture models
partition function / Restricted Boltzmann machines
PCorpus (permanent corpus) / Text processing using the tm package
performance metrics, for classification
about / Performance metrics for classification
Pig
about / Distributed computing using Hadoop
posterior probability distribution
about / Bayesian view of uncertainty, Estimation of posterior distribution
estimation / Estimation of posterior distribution
maximum a posteriori (MAP) estimation / Maximum a posteriori estimation
Laplace approximation / Laplace approximation
Monte Carlo simulations / Monte Carlo simulations
variational approximation / Variational approximation
simulating / Simulation of the posterior distribution
prior probability distribution
about / Bayesian view of uncertainty
selecting / Choosing the right prior distribution
non-informative priors / Non-informative priors
subjective priors / Subjective priors
conjugate priors / Conjugate priors
hierarchical priors / Hierarchical priors
probability distributions
about / Probability distributions
probability mass function (pmf) / Probability distributions
categorical distribution / Probability distributions
probability density function (pdf) / Probability distributions
binomial distribution / Binomial distribution
Beta distribution / Beta distribution
Gamma distribution / Gamma distribution
Dirichlet distribution / Dirichlet distribution
Wishart distribution / Wishart distribution
R
R
installing / Installing R and RStudio, Installing R and RStudio
program, writing / Your first R program
data, managing / Managing data in R
RBugs / R packages for Gibbs sampling
RcppDL / Other deep learning packages in R
regression
about / An overview of common machine learning tasks
regression of energy efficiency, with building parameters
about / Regression of energy efficiency with building parameters
ordinary regression / Ordinary regression
Bayesian regression / Bayesian regression
R environment
setting up / Setting up the R environment and packages
exercises / Exercises
Resilient Distributed Datasets (RDD)
about / Spark – in-memory distributed computing
restricted Boltzmann machine (RBM) / Restricted Boltzmann machines
Reuter_50_50 dataset
about / The topicmodels package
RHadoop
about / RHadoop for using Hadoop from R
for using Hadoop from R / RHadoop for using Hadoop from R
rhdfs package / RHadoop for using Hadoop from R
rhbase package / RHadoop for using Hadoop from R
plyrmr package / RHadoop for using Hadoop from R
rmr2 package / RHadoop for using Hadoop from R
risk modeling / Subjective priors
Rmpi / Other R packages for large scale machine learning
ROC curve
about / Performance metrics for classification
RODBC package
about / Importing data into R
functions / Importing data into R
Root Mean Square Error (RMSE) / Exercises
R package e1071
about / The Naïve Bayes classifier
R packages
about / Setting up the R environment and packages
R packages, for large scale machine learning
about / Other R packages for large scale machine learning
parallel R package / The parallel R package
foreach R package / The foreach R package
R packages, for LDA
about / R packages for LDA
topicmodels package / The topicmodels package
lda package / The lda package
R programs
writing / Writing R programs
control structures / Control structures
functions / Functions
scoping rules / Scoping rules
loop functions / Loop functions
RStudio
about / Setting up the R environment and packages
URL / Installing R and RStudio
installing / Installing R and RStudio, Installing R and RStudio
S
SamplerCompare package / R packages for Gibbs sampling
sampling
about / Sampling
random uniform sampling, from interval / Random uniform sampling from an
interval
from normal distribution / Sampling from normal distribution
sigmoid function
about / Two-layer neural networks
Simple Storage Service (S3) / Amazon Web Services
snow / Other R packages for large scale machine learning
SnowballC package
about / Text processing using the tm package
softmax
about / Two-layer neural networks
Spark
about / Spark – in-memory distributed computing
URL / Spark – in-memory distributed computing
running, on EC2 / Running Spark on EC2
SparkR
about / SparkR
stocc package / R packages for Gibbs sampling
subsets, of R objects
Single bracket [ ] / Slicing and dicing datasets
Double bracket [[ ]] / Slicing and dicing datasets
Dollar sign $ / Slicing and dicing datasets
use of negative index values / Slicing and dicing datasets
subset selection approach
about / Subset selection
forward selection / Subset selection
backward selection / Subset selection
supervised LDA (sLDA)
about / The lda package
support vector machines (SVM) / An overview of common machine learning tasks
T
Theano / Other deep learning packages in R
tm package
about / Text processing using the tm package
topic modeling, with Bayesian inference
about / Topic modeling using Bayesian inference
Latent Dirichlet allocation / Latent Dirichlet allocation
topicmodels package
about / The topicmodels package
two-layer neural networks
about / Two-layer neural networks
U
Unsupervised( ) function
about / The bgmm package for Bayesian mixture models
parameters / The bgmm package for Bayesian mixture models
V
variational approximation
about / Variational approximation
variational calculus problem
about / Variational approximation
vbdm package
about / Variational approximation
VBmix package
about / Variational approximation
vbsr package
about / Variational approximation
VCorpus (volatile corpus) / Text processing using the tm package
W
Wishart distribution
about / Wishart distribution
word error rate / Deep belief networks and deep learning