VICTORIA UNIVERSITY OF WELLINGTON
Te Whare Wānanga o te Ūpoko o te Ika a Māui

Student ID: . . . . . . . . . . . . . . . . . . . . . . . .

EXAMINATIONS – 2014
TRIMESTER ONE
COMP 307
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
************ WITH SOLUTIONS ************
Time Allowed: THREE HOURS
Instructions:
Closed Book.
There are a total of 180 marks on this exam.
Attempt all questions.
Only silent non-programmable calculators or silent programmable calculators with
their memories cleared are permitted in this examination.
Non-electronic foreign language translation dictionaries may be used.
The appendix on the last sheet can be removed for reference for questions 2-4.
Questions
1. Search [25]
2. Machine Learning Basics [30]
3. Neural Networks [15]
4. Evolutionary Computation and Learning [20]
5. Uncertainty and Belief Nets [28]
6. Inference [17]
7. Modelling sequences [15]
8. Planning [20]
9. Philosophy of AI [10]
Question 1. Search
[25 marks]
Based on the figure below, answer questions (a) and (b).
[Figure: a complete binary tree of depth 3. The root is A; A's children are B and E; B's children are C and D; E's children are F and G; C's children are H and I; D's children are J and K; F's children are L and M; G's children are N and O.]
(a) [2 marks] Assuming that you are using breadth first search, state the search order/path using the
letters in the nodes.
A, B, E, C, D, F, G, H, I, J, K, L, M, N, O
(b) [4 marks] Assuming that you are using iterative deepening search, state the search order/path
using the letters in the nodes.
Limit = 0: A
Limit = 1: A, B, E
Limit = 2: A, B, C, D, E, F, G
Limit = 3: A, B, C, H, I, D, J, K, E, F, L, M, G, N, O
(c) [4 marks] Hill climbing is a basic local search technique.
(i) Describe the main idea of this technique. Draw a figure if necessary.
(ii) State a major limitation of this technique.
(iii) State a solution to avoid (or at least reduce the degree of) the limitation in part (ii).
(i) HC is a local search technique that aims to find the best state according to an objective function. It keeps only one current state and its evaluation/performance, and repeatedly moves to the best successor.
(ii) It can easily get stuck at a local maximum/optimum.
(iii) Add a random/diversity component, e.g. random restarts or simulated annealing.
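As an illustration of (i) and (iii), a minimal hill-climbing sketch in Python; the neighbours and objective functions are placeholders, and random restarts stand in for the diversity component:

def hill_climb(initial_state, neighbours, objective, max_steps=1000):
    """Keep a single current state; repeatedly move to the best neighbour
    while it improves the objective (greedy ascent)."""
    current = initial_state
    for _ in range(max_steps):
        best = max(neighbours(current), key=objective, default=None)
        if best is None or objective(best) <= objective(current):
            return current          # local maximum reached
        current = best
    return current

def hill_climb_with_restarts(random_state, neighbours, objective, restarts=10):
    # Random restarts are one simple way to reduce the local-maximum problem.
    candidates = [hill_climb(random_state(), neighbours, objective) for _ in range(restarts)]
    return max(candidates, key=objective)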
(d) [5 marks] Gradient descent search and (genetic) beam search are two heuristic search methods.
(i) Briefly describe the differences between them.
(ii) State a machine learning paradigm/technique that uses each of the two methods.
(i) Gradient descent is a local search that follows the gradient of the objective function and keeps a single solution at a time; (genetic) beam search is more global and maintains multiple candidate solutions in parallel within a particular run.
(ii) Neural networks (trained with back propagation) use gradient descent; evolutionary algorithms use (genetic) beam search.
The figure below shows a route finding example from a part of Romania's city map, as we discussed during the lectures. The numbers on the map show the distances between each pair of directly connected cities, while the numbers in the table show the straight-line distance from each city to the final goal city Bucharest. Assuming that the initial city is Arad (and the goal city is Bucharest), answer questions (e) and (f).
[Figure: the Romania road map used in the lectures. Road distances: Arad-Zerind 75, Arad-Sibiu 140, Arad-Timisoara 118, Zerind-Oradea 71, Oradea-Sibiu 151, Sibiu-Fagaras 99, Sibiu-Rimnicu Vilcea 80, Rimnicu Vilcea-Pitesti 97, Rimnicu Vilcea-Craiova 146, Timisoara-Lugoj 111, Lugoj-Mehadia 70, Mehadia-Drobeta 75, Drobeta-Craiova 120, Craiova-Pitesti 138, Fagaras-Bucharest 211, Pitesti-Bucharest 101, Bucharest-Giurgiu 90, Bucharest-Urziceni 85, Urziceni-Hirsova 98, Hirsova-Eforie 86, Urziceni-Vaslui 142, Vaslui-Iasi 92, Iasi-Neamt 87.]

Straight-line distances to Bucharest:
Arad 366         Mehadia 241
Bucharest 0      Neamt 234
Craiova 160      Oradea 380
Drobeta 242      Pitesti 100
Eforie 161       Rimnicu Vilcea 193
Fagaras 176      Sibiu 253
Giurgiu 77       Timisoara 329
Hirsova 151      Urziceni 80
Iasi 226         Vaslui 199
Lugoj 244        Zerind 374
(e) [4 marks] Assuming you are using the greedy (best first) search technique, describe your search
path. Show your working. Draw a figure if necessary.
Greedy best-first search always expands the node with the smallest straight-line distance h to Bucharest.
Start at Arad (h = 366). Its successors are Sibiu (253), Timisoara (329) and Zerind (374): expand Sibiu.
Sibiu's successors add Fagaras (176), Oradea (380) and Rimnicu Vilcea (193): expand Fagaras.
Fagaras's successors add Bucharest (0): expand Bucharest, which is the goal.
Search path: Arad, Sibiu, Fagaras, Bucharest (route cost 140 + 99 + 211 = 450, which is not optimal).
(f) [6 marks] Assuming you are using the A* search technique, describe your search path. Show
your working. Draw a figure if necessary.
A* always expands the node with the smallest f = g + h (path cost so far plus straight-line distance).
Expand Arad (f = 0 + 366 = 366). Frontier: Sibiu 140+253=393, Timisoara 118+329=447, Zerind 75+374=449.
Expand Sibiu (393). Add: Rimnicu Vilcea 220+193=413, Fagaras 239+176=415, Oradea 291+380=671.
Expand Rimnicu Vilcea (413). Add: Pitesti 317+100=417, Craiova 366+160=526.
Expand Fagaras (415). Add: Bucharest 450+0=450.
Expand Pitesti (417). Add: Bucharest 418+0=418 (better than the 450 entry).
Expand Bucharest (418), which is the goal.
Final route: Arad, Sibiu, Rimnicu Vilcea, Pitesti, Bucharest, with cost 418.
Question 2. Machine Learning Basics
[30 marks]
(a) [4 marks] There are several different paradigms in machine learning. State the name of an algorithm or approach used in each of the following paradigms:
(i) Case based learning
(ii) Induction learning
(iii) Connectionist learning
(iv) Evolutionary learning
For example: (i) K Nearest Neighbour; (ii) decision tree learning; (iii) neural networks trained with back propagation; (iv) genetic algorithms or genetic programming.
(b) [4 marks] The data sets are typically separated into a training set and a test set in (supervised)
machine learning systems. Briefly describe the role of each of them.
A training set is a collection of instances that are directly used for learning a machine/concept/classifier.
A test set is a collection of instances that are used to measure the performance of the learned machine/concept.
(c) [4 marks] Briefly describe the K Nearest Neighbour method used for classification tasks.
Each unseen instance (in the test set) is compared with all the instances in the training set, calculating the distance (typically Euclidean distance) or similarity to each training instance.
Find the k "nearest neighbours" (instances) from the training set based on that distance/similarity measure.
Then choose the majority class label among those k neighbours (for k = 1, simply the class label of the single nearest neighbour) as the class label of the unseen instance.
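As an illustration, a minimal Python sketch of this procedure (k and the data representation are illustrative; instances are numeric feature vectors):

from collections import Counter
import math

def knn_classify(test_instance, training_set, k=3):
    """training_set: list of (feature_vector, class_label) pairs."""
    # Compute the Euclidean distance to every training instance.
    distances = [(math.dist(test_instance, features), label)
                 for features, label in training_set]
    # Take the k nearest neighbours and return the majority class label.
    k_nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]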
(d) [5 marks] The k-means method is widely used in machine learning.
(i) State whether this method is typically used for classification, clustering, or association rules.
(ii) This method has an important assumption. State this assumption.
(iii) Briefly describe the process of this method.
It is a clustering method, and it assumes the number of clusters k is known in advance. The following steps are typically applied: (1) Set k initial "means", e.g. chosen randomly from the data set. (2) Create k clusters by associating every instance with the nearest mean based on a distance measure. (3) Replace the old means with the centroid of each of the k clusters (as the new means). (4) Repeat steps (2) and (3) until convergence.
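The same steps, as a minimal Python sketch (illustrative only; instances are numeric tuples):

import random

def k_means(instances, k, iterations=100):
    """instances: list of numeric feature vectors (tuples)."""
    means = random.sample(instances, k)               # (1) pick k initial means from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in instances:                           # (2) assign each instance to the nearest mean
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            clusters[nearest].append(x)
        new_means = [tuple(sum(col) / len(c) for col in zip(*c)) if c else means[i]
                     for i, c in enumerate(clusters)]  # (3) recompute each mean as the centroid
        if new_means == means:                         # (4) stop when the means no longer change
            break
        means = new_means
    return means, clusters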
(e) [4 marks] Briefly describe the K-fold cross validation method used in machine learning experiments.
Chop the available data into k equal chunks (folds).
For each chunk in turn, treat it as the test set and the other k-1 chunks as the training set; the classifier trained on the training set is applied to the test set.
The process is repeated k times, with each chunk used exactly once as the test set.
The k results are then "averaged".
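A small Python sketch of the procedure (train_fn and evaluate_fn are placeholders for whatever learner and accuracy measure are used):

def k_fold_cross_validation(data, k, train_fn, evaluate_fn):
    """Split data into k chunks; each chunk is used exactly once as the test set."""
    folds = [data[i::k] for i in range(k)]             # k roughly equal chunks
    scores = []
    for i in range(k):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        classifier = train_fn(training_set)            # train on the other k-1 chunks
        scores.append(evaluate_fn(classifier, test_set))
    return sum(scores) / k                             # "average" the k results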
(f) [9 marks] Consider the following data set with 10 instances describing whether a group of friends want to have a party in a large room at a local community centre or go outside for a BBQ; 5 instances were for InRoomParty (at the community centre) and 5 were for BBQ (a party outside), depending on the weather conditions. They are described by three attributes: whether it is clear or raining, whether the wind is strong or weak, and whether the temperature is hot, good, or cold.
Instance  Weather  Wind    Temp  Class
1         clear    weak    good  BBQ
2         clear    weak    hot   BBQ
3         clear    strong  good  BBQ
4         clear    weak    cold  BBQ
5         raining  weak    hot   BBQ
6         clear    strong  cold  InRoomParty
7         raining  strong  good  InRoomParty
8         raining  strong  cold  InRoomParty
9         raining  weak    cold  InRoomParty
10        clear    strong  hot   InRoomParty
These friends want to build a decision tree to help them decide whether to have the party outside (BBQ) or inside the room at the community centre (InRoomParty). Which attribute should be chosen for the root of the decision tree if the impurity function p(BBQ) × p(InRoomParty) is used? Show your working.
Weather: 6/10 × (4/6 × 2/6) + 4/10 × (1/4 × 3/4) = 5/24 ≈ 20.83%
Wind: 5/10 × (4/5 × 1/5) + 5/10 × (1/5 × 4/5) = 4/25 = 16.0%
Temp: 3/10 × (2/3 × 1/3) + 3/10 × (2/3 × 1/3) + 4/10 × (1/4 × 3/4) = 5/24 ≈ 20.83%
Wind has the lowest (weighted) impurity score, therefore the algorithm will use Wind at the root.
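The working can be checked with a short Python calculation (class counts taken directly from the table above):

def weighted_impurity(splits):
    """splits: list of (num_BBQ, num_InRoomParty) pairs, one per attribute value.
    Impurity of a node = p(BBQ) * p(InRoomParty), weighted by the fraction of instances."""
    total = sum(b + r for b, r in splits)
    return sum(((b + r) / total) * (b / (b + r)) * (r / (b + r)) for b, r in splits)

print(weighted_impurity([(4, 2), (1, 3)]))             # Weather: clear, raining  -> 0.2083...
print(weighted_impurity([(4, 1), (1, 4)]))             # Wind: weak, strong       -> 0.16
print(weighted_impurity([(2, 1), (2, 1), (1, 3)]))     # Temp: good, hot, cold    -> 0.2083...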
Question 3. Neural Networks
[15 marks]
(a) [8 marks] Consider the following feed forward network which uses the sigmoid/logistic transfer function (see Appendix B).

[Figure: a feed forward network with input nodes 1 (x1) and 2 (x2), hidden nodes 3, 4 and 5, and output node 6.]

Weights:
W13 = -1.1    W14 = 3.5     W15 = -2.2
W23 = -3.7    W24 = 0.5     W25 = 1.5
W36 = 1.8     W46 = -2.2    W56 = 8.1

Biases:
b3 = 0.5      b4 = 1.0      b5 = 1.5      b6 = -4.5
(i) What will be the output of node 6 (O6 ) for the input vector (0.0, 0.0)?
(ii) What will be the new value of weight w56 after one epoch of training using the back propagation
algorithm? Assume that the training set consists of only the vector (0.0, 0.0, 0.0) corresponding
to the two input feature values and the target output value, and that the learning rate η is 0.25.
Show your working.
O1 = I1 = O2 = I2 = 0; I3 = b3 = 0.5; O3 = f(0.5) ≈ 0.62; O4 = f(1.0) ≈ 0.73; O5 = f(1.5) ≈ 0.82
I6 = 0.62 × 1.8 + 0.73 × (−2.2) + 0.82 × 8.1 + (−4.5) = 1.652; O6 = f(1.652) ≈ 0.84
β6 = 0 − 0.84 = −0.84
ΔW56 = η O5 O6 (1 − O6) β6 = 0.25 × 0.82 × 0.84 × (1 − 0.84) × (−0.84) ≈ −0.023
(W56)new = (W56)old + ΔW56 = 8.1 − 0.023 = 8.077
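The same calculation as a small Python check (only the weights and biases feeding node 6 matter here, since both inputs are 0):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass for input (0, 0): the hidden-node inputs are just the biases.
o3, o4, o5 = sigmoid(0.5), sigmoid(1.0), sigmoid(1.5)
o6 = sigmoid(o3 * 1.8 + o4 * (-2.2) + o5 * 8.1 + (-4.5))

# Back propagation update for w56 with target d6 = 0 and learning rate 0.25.
beta6 = 0.0 - o6
delta_w56 = 0.25 * o5 * o6 * (1 - o6) * beta6
print(o6, delta_w56, 8.1 + delta_w56)   # approx 0.84, -0.023, 8.077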
(b) [4 marks] Peter Smith has developed a neural network classifier for use in the fruit industry. The classifier is intended to distinguish potentially rotten fruit from good fruit using images of individual fruit items. The process involves the extraction of 4 features from each image, and the use of a standard
multilayer feed forward neural network trained with the back propagation learning algorithm for
classification. There are 500 examples in total, of which 50 are used for network training and
450 for testing. The network architecture he used is 4-23-2. After training for 18,000 epochs, the
network classifier still performed badly on the test set. Suggest four good ways for improving the
performance.
(1) Re-split the data sets and use more examples for training; (2) use fewer hidden nodes; (3) train for fewer epochs; (4) any other reasonable suggestion, e.g. extract more and better features, or use a validation set to control overfitting.
(c) [3 marks] The standard back propagation algorithm does not specify when to stop the network
training process. Briefly state three commonly used termination criteria in network training using
the back propagation algorithm.
(1) Epoch/cycle control strategy; (2) error control strategy; (3) proportion/accuracy control strategy; (4) user control strategy; (5) validation control/early stopping strategy.
Question 4. Evolutionary Computation and Learning
[20 marks]
(a) [3 marks] Genetic algorithms and genetic programming are two techniques in evolutionary
computation and learning. State six additional techniques in evolutionary computation and learning.
Evolution strategies, evolutionary programming, learning classifier systems, particle swarm optimisation, differential evolution, evolutionary multi-objective optimisation, artificial immune systems, ant colony optimisation, etc.
(b) [6 marks] In evolutionary algorithms, the three main genetic operators, elitism, crossover and
mutation, are often used to generate new candidate solutions. Briefly describe what each operator
does and the main purpose for each of them.
Elitism (or reproduction) is the operator that directly copies a small set of the best individuals from the current generation to the next generation. The main purpose is to make sure the evolution does not make the best solution worse.
Crossover is the operator that combines genetic material from different individuals to form new candidate solutions. The main goal is to combine the advantages of existing individual solutions to form better solutions.
Mutation is the operator that randomly changes a part of a selected individual from the population. The main purpose is to maintain the diversity of the population.
(c) [5 marks] Briefly describe the general evolutionary process in Evolutionary Algorithms. Draw
a figure if necessary.
The answer should include: initialise a population of random candidate solutions; evaluate the fitness of each individual; select parents based on fitness; apply the genetic operators (elitism, crossover, mutation) to generate the next generation; repeat the evaluate/select/mate cycle until a stopping criterion is met (e.g. a maximum number of generations, or an acceptably good solution is found). See the sketch below.
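A minimal sketch of that loop in Python (the representation, fitness function and operator details are placeholders):

import random

def evolve(pop_size, random_individual, fitness, crossover, mutate,
           generations=100, elitism=1, p_mutate=0.2):
    population = [random_individual() for _ in range(pop_size)]       # initialisation
    for _ in range(generations):                                      # stop after a fixed budget
        scored = sorted(population, key=fitness, reverse=True)        # evaluation
        next_gen = scored[:elitism]                                   # elitism: keep the best as-is
        while len(next_gen) < pop_size:
            p1, p2 = random.choices(scored[:pop_size // 2], k=2)      # selection from the fitter half
            child = crossover(p1, p2)                                 # crossover / mating
            if random.random() < p_mutate:
                child = mutate(child)                                 # mutation
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)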
(d) [6 marks] The standard tree-based genetic programming (GP) approach has been applied to
many classification tasks. In this approach, each evolved program typically returns a single floating
point number. One of the key issues here is to use a strategy to translate the single output value of
an evolved classifier program into a set of class labels.
(i) In Assignment 2, GP was used to evolve a classifier to categorise the 699 instances in the Wisconsin medical data set into either the benign class or the malignant class. Suggest a strategy
(rule) for translating the single program output into the above two classes.
(ii) Assuming you are going to use tree-based GP for a classification problem with five classes:
class1, class2, class3, class4 and class5. Suggest a mapping strategy that can translate the
single program output value into the five classes above.
(iii) State the advantages and limitations of your strategy for part (ii).
(i) For binary classification, the natural translation would be: if the program output value is positive, then the instance associated with the input terminals is classified as class 1; otherwise, class 2.
(ii) Use four predefined thresholds T1 < T2 < T3 < T4: if progOut < T1, class1; else if progOut < T2, class2; else if progOut < T3, class3; else if progOut < T4, class4; else class5.
(iii) Advantages: easy to set up and use. Limitations: the class boundaries are fixed and need to be predefined; the order of the classes along the number line is fixed, etc.
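The five-class mapping in (ii), as a small Python illustration (the threshold values here are arbitrary examples):

def output_to_class(prog_out, thresholds=(-2.0, -1.0, 1.0, 2.0)):
    """Map a single floating-point program output to one of five class labels
    using four predefined, ordered thresholds (T1 < T2 < T3 < T4)."""
    labels = ["class1", "class2", "class3", "class4", "class5"]
    for label, t in zip(labels, thresholds):
        if prog_out < t:
            return label
    return labels[-1]                  # above the last threshold -> class5

print(output_to_class(0.3))            # falls between T2 and T3 -> class3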
Question 5. Uncertainty and Belief Nets
[28 marks]
(a) [3 marks] Suppose that 5% of people on campus have colds. The likelihood of coughing given
that one has a cold is 0.8 (that is, 80% of people with colds cough), whereas the likelihood of
coughing without having a cold is 0.3.
Alice is coughing: what is the probability that she has a cold? Show your working.
t = true, f = false. Use Bayes' rule:
P(cold = t | cough = t) = P(cough = t | cold = t) P(cold = t) / P(cough = t)
where the denominator is P(cough = t | cold = t) P(cold = t) + P(cough = t | cold = f) P(cold = f).
P(cold = t | cough = t) = (0.8 × 0.05) / (0.8 × 0.05 + 0.3 × 0.95) = 0.04 / 0.325 ≈ 0.12
(b) [6 marks] The questions below relate to the following belief networks:
In each case, answer either true or false.
Reminder: the ⊥⊥ symbol stands for "independent", and the vertical line stands for "given".
• F ⊥⊥ A: True
• F ⊥⊥ A | D: False (easy)
• F ⊥⊥ A | G: False ("explaining away" connection via D)
• F ⊥⊥ G | H: False (F is a cause of G)
• F ⊥⊥ A | G, H: False (G tells you something about D)
• U ⊥⊥ Y | W: False (W makes V and X dependent)
Suppose that a dataset is to be used for classifying email as Spam or not. Each email will be
classified on the basis of four particular words (such as “friend”, “invest”, “lectures” and “mum”).
The spam variable S takes values ∈ {true, false}, and variables W1 , W2 , W3 , W4 are used to denote
the presence (or otherwise) of the words. The W variables take the value 1 if the associated word is
present in the email, and 0 if it isn’t.
Under the "Naïve Bayes" assumption, the data set can be modelled via the following belief net structure (S as the parent of each of W1, W2, W3, W4).
(c) [3 marks] Why do you think this is "Naïve"?
It assumes the words are independent of each other given the class S, which isn't true of most real text.
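As an illustration, classification under this structure only needs P(S) and P(Wi | S); a minimal Python sketch with made-up probability tables (the numbers are purely illustrative):

# P(S = spam) and P(W_i = 1 | S) for the four words; made-up numbers for illustration.
p_spam = 0.4
p_word_given_spam = [0.5, 0.6, 0.1, 0.05]        # "friend", "invest", "lectures", "mum"
p_word_given_ham = [0.3, 0.05, 0.4, 0.3]

def naive_bayes_spam(words):                      # words: e.g. [1, 1, 0, 0]
    """Return P(S = spam | W1..W4) under the Naive Bayes assumption."""
    joint_spam, joint_ham = p_spam, 1 - p_spam
    for w, ps, ph in zip(words, p_word_given_spam, p_word_given_ham):
        joint_spam *= ps if w else (1 - ps)
        joint_ham *= ph if w else (1 - ph)
    return joint_spam / (joint_spam + joint_ham)  # normalise over the two values of S

print(naive_bayes_spam([1, 1, 0, 0]))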
Suppose that the data you have does in fact obey the Naïve Bayes assumption, but that you don't
know this is true, so you set about building the belief net’s structure from scratch. One way to build
a Belief Net structure is to do it incrementally, adding variables one by one and including links as
you go for each dependence that is observed in the data.
(d) [5 marks] Suppose you build a belief net by adding nodes in the following order: W1 first, then
W2 , W3 , S, and finally W4 . Draw the belief net that would be constructed.
If the arrows must be added "top-down" in the build order (as in the AIMA example), then each of the first three W variables depends on all of those added before it, since they are correlated (the data obeys the Naïve Bayes structure from (c)). S then depends on all three of them, but W4 will depend only on S. (Alternatively, some students allowed the arrow directions to be inferred too, giving a different solution.)
(e) [5 marks] Different orderings will give different network structures. Would you suggest using
the network you gave in (d), or a different one, to compute P(S | W1 , W2 , W3 , W4 )? Explain your
reasoning carefully.
(The answer depends on the student's answer to the previous question.) Generally: the Naïve Bayes network from (c) is easy to work with and has the fewest free parameters, which is appealing (especially since a real application would use many more words). Many answered that a net with arrows from the words to S would be best: that might be bearable for this example, but it would not scale up; think about the size of the factor for S given all the words.
(f) [3 marks] Draw the structure of the Belief Net corresponding to the following factorization:
P(A, B, C, D, E, F, G, H, I, J) = P(A) P(B|A) P(C) P(D|C) P(E|B,D) P(F) P(G|E,F) P(H) P(I|F,H) P(J|I)
Too easy! Draw one arrow for each conditioning variable: A→B, C→D, B→E, D→E, E→G, F→G, F→I, H→I, I→J; A, C, F and H have no parents.
(g) [3 marks] List the independence relationship(s) that are enforced between G and J in the above
factorisation.
G ⊥⊥ J | F, and G ⊥⊥ J | I. But note many students mis-interpreted the question, which is about the factorisation (not the net they drew).
Question 6. Inference
[17 marks]
(a) [3 marks] The SUM - PRODUCT algorithm is run on a “factor graph” derived from the underlying
belief net. What operation is carried out by a variable node in a factor graph, in the following
cases:
• If the node's value is observed: it sends a vector of zeros with a single one (at the observed value).
• If the node's value is not observed: it sends the element-wise product of all the OTHER incoming messages (or a vector of ones if there are none).
(b) [4 marks] What operation is carried out by a factor node that is not a leaf (ie. not a terminal
node)? It may help to give a simple example.
It multiplies its factor by the incoming messages from its other neighbours, then sums out all variables except the one it is sending to, and passes the resulting message on.
(c) [2 marks] What major constraint on the structure of the network must be obeyed in order for the
SUM - PRODUCT algorithm to yield the correct result?
No loops: the graph must be a tree.
The lectures discussed two approaches to classification: a discriminative approach and a generative
one.
(d) [4 marks] Describe the essential difference between these approaches, and give one example of
each.
Discriminative: maps features directly to class, e.g. neural nets, kNN or GP. Generative: models how the class causes the features, and inverts that mapping (via Bayes' rule) to get the class from the features, e.g. Naïve Bayes (and others). NB: this is NOT just a probabilistic vs non-probabilistic distinction (discriminative methods can, and should, be probabilistic too).
(e) [4 marks] Suggest what advantages and disadvantages the discriminative approach might have,
compared to the generative one.
Various answers possible. E.g. discriminative methods are simpler to grasp and often simpler to code, and do not depend on modelling how the "input" data is generated. Once trained, they are typically very fast at classifying new instances. But they do not use knowledge of how the features are caused by the class, which (if known) is valuable information, and training can be complex and slow (e.g. NNs, GP).
Question 7. Modelling sequences
[15 marks]
Consider the following PGM, which represents the weather Wt being Rainy (R) vs Sunny (S) on successive days (only the first few are shown here). Suppose that the weather plays a role in determining whether one's mood Mt is Happy (H) vs Grumpy (G) each day.
The transition probabilities for weather (as they would appear in Genie, for instance) are:

          Wt+1 = R   Wt+1 = S
Wt = R      0.7        0.3
Wt = S      0.1        0.9
For example, P(Wt+1 = S | Wt = R) = 0.3, meaning if it is rainy on one day the probability of it
being sunny on the next day is 30%.
(a) [4 marks] What is the probability that the weather on day 2 was Rainy (i.e. W2 = R), given that
it was Rainy on day 1 and Sunny on day 3?
In other words, what is P(W2 = R | W1 = R, W3 = S)? Show your working.
Can do by Bayes directly. An alternative (and simple) way is to look at the ratio, in which the P(W1 = R, W3 = S) terms cancel:

P(W2 = R | W1 = R, W3 = S) / P(W2 = S | W1 = R, W3 = S)
  = P(W1 = R, W2 = R, W3 = S) / P(W1 = R, W2 = S, W3 = S)
  = [P(W1 = R) P(W2 = R | W1 = R) P(W3 = S | W2 = R)] / [P(W1 = R) P(W2 = S | W1 = R) P(W3 = S | W2 = S)]   (P(W1 = R) cancels)
  = (0.7 × 0.3) / (0.3 × 0.9)
  = 7/9

So if that's the ratio, and the two conditional probabilities must sum to one, we have
P(W2 = R | W1 = R, W3 = S) = 7 / (7 + 9) = 7/16 ≈ 0.44
(b) [5 marks] What is the probability of Rain under the stationary distribution?
1/4, as per the calculation done in lecture: the stationary distribution satisfies πR = 0.7 πR + 0.1 πS with πR + πS = 1, so 0.3 πR = 0.1 (1 − πR), giving πR = 0.25.
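Both (a) and (b) can be checked numerically with a few lines of Python (transition matrix as reconstructed above):

# Transition probabilities: P[current][next], with states R (rain) and S (sun).
P = {"R": {"R": 0.7, "S": 0.3},
     "S": {"R": 0.1, "S": 0.9}}

# (a) P(W2 = R | W1 = R, W3 = S), by enumerating W2 and normalising.
num = P["R"]["R"] * P["R"]["S"]                      # W2 = R
den = num + P["R"]["S"] * P["S"]["S"]                # plus the W2 = S case
print(num / den)                                     # 7/16 = 0.4375

# (b) Stationary distribution, by iterating the chain to convergence.
p_rain = 0.5
for _ in range(1000):
    p_rain = p_rain * P["R"]["R"] + (1 - p_rain) * P["S"]["R"]
print(p_rain)                                        # converges to 0.25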
(c) [1 mark] In the case that weather is not directly observed but moods generally are, what is the
name given to this structure of belief net?
Hidden Markov model
(d) [5 marks] Suppose you no longer believed the current transition probabilities for weather were
correct, and you acquired a data set consisting of a long series of mood observations: M1 , M2 , M3 , . . .,
and so on. In words, describe how you could go about improving the transitions probabilities from
this data.
Description of EM learning in an HMM. We assume an initial factor relating weather (as cause) to mood (as effect), and use this to run Sum-Product over the observed sequence and obtain partial (expected) counts. Add Laplace smoothing to those counts to get improved factors, both for the transitions we want and for the "emissions" from weather to mood. This needs to be iterated until the estimates converge.
Question 8. Planning
[20 marks]
In Classical Planning, two ways of deriving a plan are known as forward and backward chaining.
(a) [2 marks] Planning is more difficult if there are several states that “look the same”, meaning they
cannot be distinguished by the sensors available to the agent. This is sometimes called “perceptual
aliasing”. To be able to plan in this case, an agent needs to work with a different representation from
the one it would use if it could always tell exactly what state it is in. What is the key difference
between the two representations?
Need to represent Belief states, not just states.
(b) [3 marks] What are the consequences for the efficiency of classical planning?
Planning has the potential to become exponentially harder, due to the “curse of dimensionality”:
the space of Belief states is exponentially larger than the space of states (unless we can find some
strong valid assumptions to reduce it).
(c) [3 marks] In many scenarios the actual effects of actions may be somewhat stochastic and hence
not certain to match the intended effects. In such a scenario, what general effect do (i) actions and
(ii) sensors have upon Belief states?
Actions may "expand" the belief state to include more possible actual states (although note it is possible for actions to shrink it too), while sensing always "shrinks" it, essentially by ruling out possibilities.
A Markov Decision Process (MDP) is defined by these four quantities:
• states, s1, s2, . . .
• actions, a1, a2, . . .
• rewards for each state, r1, r2, . . .
• transition probabilities P_{s'|s,a}
In an MDP, the "Value" V(s_t) of the state s_t at time t is defined to be the expected sum of future rewards R(s_t), R(s_{t+1}), R(s_{t+2}), . . ., where each of these rewards is discounted according to how far into the future it is. The Value depends on the current policy, usually denoted π; π_{a|s} is the probability of carrying out action a when in state s.
(d) [6 marks] Justify the use of the following equation (known as the “Back-Up” equation):
V^π(s) = R(s) + γ Σ_a π_{a|s} Σ_{s'} P_{s'|s,a} V^π(s')
(Awkwardly, the word "Justify" is ambiguous; "Explain" would have been better. We had to mark both interpretations as fairly as possible. Full marks for picking the equation apart and explaining each component: the immediate reward R(s), the discount factor γ, the expectation over actions under the policy π_{a|s}, and the expectation over next states P_{s'|s,a}, followed by a brief comment on how the equation can be used, e.g. as the repeated update in policy evaluation / value iteration.)
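A small Python sketch of how the equation is typically used: sweeping it repeatedly over all states to evaluate a given policy (the dictionary representations of R, π and P are assumptions for illustration):

GAMMA = 0.9

def policy_evaluation(states, actions, R, policy, P, sweeps=100):
    """R[s]: reward; policy[s][a] = pi(a|s); P[(s, a)][s2] = P(s2 | s, a).
    Repeatedly apply V(s) = R(s) + gamma * sum_a pi(a|s) * sum_s2 P(s2|s,a) * V(s2)."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: R[s] + GAMMA * sum(policy[s][a] *
                                   sum(P[(s, a)][s2] * V[s2] for s2 in states)
                                   for a in actions)
             for s in states}
    return V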
It seems that in order to apply Value Iteration to an MDP we need to know the transition probabilities,
both to discover V (s) for all states via Value Iteration, and then to use those V (s) values to decide
which action to carry out in a given state. However, suppose instead that we start off not knowing
the transition probabilities. Two options might be:
• we could learn the transition probabilities from data, and then do standard Value Iteration; OR
• instead of finding V(s), we could learn the analogous value of carrying out each possible action
in each state, often called a “Q value”. The expected sum of discounted future rewards from
carrying out action a in state s is denoted Q(s, a).
(e) [3 marks] How might you go about learning the transition probabilities of an MDP from data?
Simply count the occurrences of each outcome state s' for each starting state s and action a, add Laplace smoothing, and normalise appropriately to get the probabilities P_{s'|s,a}.
(f) [3 marks] (Hard) Can you suggest a way to learn the Q values from experience?
Various answers are possible - full marks for a cogent argument / suggestions that were on-topic
(not necessarily perfect/correct).
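One standard on-topic answer is Q-learning, sketched here in Python (the step function is a placeholder that supplies an observed reward and next state from experience):

import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """step(s, a) -> (reward, next_state): one observed transition (no model needed)."""
    Q = defaultdict(float)                             # Q[(s, a)], initially 0
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(100):                           # limit episode length
            # epsilon-greedy action choice from the current Q values
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s2 = step(s, a)
            # move Q(s, a) towards the sampled target r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q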
Question 9. Philosophy of AI
[10 marks]
Choose ONE of the below to discuss. For your choice, outline the issue at stake and indicate whether
you find the argument compelling or dubious, and why. Answer at a depth appropriate to the marks
on offer for this question.
• Searle’s “Chinese Room” thought experiment (against functionalism); OR
• the “brain replacement” thought experiment (in favour of functionalism); OR
• In recent years the use of probabilistic inference in learned models has become widespread in
AI. Noam Chomsky has argued that the field’s heavy use of probability-based models to pick
regularities in masses of data is unlikely to yield the explanatory insight that science ought to
offer. (Others such as Peter Norvig have argued the converse, that developing and implementing
such models is actually a promising way to arrive at insight).
Feel free to write your answer on other paper rather than in the box, if necessary.
(a wide variety of answers are possible here - marks awarded for a coherent argument, and ability
to see multiple sides of the issue, indicative of having given it some thought. Regurgitating only
the question + opinion without argument got half marks.)
Appendix for COMP307 exam
(You may tear off this page if you wish.)
A   Some Formulae You Might Find Useful

p(C|D) = p(D|C) p(C) / p(D)                        (1)

f(x_i) = 1 / (1 + e^{-x_i})                        (2)

O_i = f(I_i) = f(Σ_k w_{k→i} · o_k + b_i)          (3)

Δw_{i→j} = η o_i o_j (1 − o_j) β_j                 (4)

β_j = Σ_k w_{j→k} o_k (1 − o_k) β_k                (5)

β_j = d_j − o_j                                    (6)

B   Sigmoid/Logistic Function

[Figure: plot of the sigmoid/logistic function f(x) = 1/(1 + e^{-x}).]