Student ID: . . . . . . . . . . . . . . . . . . . . . . . .

TE WHARE WĀNANGA O TE ŪPOKO O TE IKA A MĀUI
VICTORIA UNIVERSITY OF WELLINGTON

EXAMINATIONS – 2014 TRIMESTER ONE

COMP 307 ************ WITH SOLUTIONS ************

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Time Allowed: THREE HOURS

Instructions: Closed Book.
There are a total of 180 marks on this exam. Attempt all questions.
Only silent non-programmable calculators or silent programmable calculators with their memories cleared are permitted in this examination.
Non-electronic foreign language translation dictionaries may be used.
The appendix on the last sheet can be removed for reference for questions 2-4.

Questions
1. Search [25]
2. Machine Learning Basics [30]
3. Neural Networks [15]
4. Evolutionary Computation and Learning [20]
5. Uncertainty and Belief nets [28]
6. Inference [17]
7. Modelling sequences [15]
8. Planning [20]
9. Philosophy of AI [10]

Question 1. Search [25 marks]

Based on the figure below, answer questions (a) and (b).

[Figure: a complete binary tree. Root A; A's children are B and E; B's children are C and D; E's children are F and G; C's children are H and I; D's children are J and K; F's children are L and M; G's children are N and O.]

(a) [2 marks] Assuming that you are using breadth first search, state the search order/path using the letters in the nodes.

A, B, E, C, D, F, G, H, I, J, K, L, M, N, O

(b) [4 marks] Assuming that you are using iterative deepening search, state the search order/path using the letters in the nodes.

Limit = 0: A
Limit = 1: A, B, E
Limit = 2: A, B, C, D, E, F, G
Limit = 3: A, B, C, H, I, D, J, K, E, F, L, M, G, N, O

COMP 307 Page 2 of 31 continued...

(c) [4 marks] Hill climbing is a basic local search technique.
(i) Describe the main idea of this technique. Draw a figure if necessary.
(ii) State a major limitation of this technique.
(iii) State a solution to avoid (or at least reduce the degree of) the limitation in part (ii).

(i) HC is a local search technique and aims to find the best state according to an objective function.
It only keeps one state and its evaluation/performance, and chooses the best successor.
(ii) It can easily get stuck in a local maximum/optimum.
(iii) Add a random/diversity component, as in simulated annealing.

(d) [5 marks] Gradient descent search and (genetic) beam search are two heuristic search methods.
(i) Briefly describe the differences between them.
(ii) State a machine learning paradigm/technique that uses each of the two methods.

(i) Local vs more global search; a single solution vs multiple (candidate) solutions maintained within a particular run; etc.
(ii) Gradient descent: neural networks (back propagation); beam search: evolutionary algorithms.

COMP 307 Page 3 of 31 continued...

The figure below shows a route finding example from a part of Romania's city map, as we discussed during the lectures. The numbers on the left part (in the box) show the distances between every two cities, while the numbers on the right part show the straight line distance from each city to the final goal city Bucharest. Assuming that the initial city is Arad (and the goal city is Bucharest), answer questions (e) and (f).

[Figure: the Romania road map, with step costs on the edges (e.g. Arad-Zerind 75, Arad-Sibiu 140, Arad-Timisoara 118, Sibiu-Fagaras 99, Sibiu-Rimnicu Vilcea 80, Rimnicu Vilcea-Pitesti 97, Pitesti-Bucharest 101, Fagaras-Bucharest 211), and straight-line distances to Bucharest: Arad 366, Bucharest 0, Craiova 160, Drobeta 242, Eforie 161, Fagaras 176, Giurgiu 77, Hirsova 151, Iasi 226, Lugoj 244, Mehadia 241, Neamt 234, Oradea 380, Pitesti 100, Rimnicu Vilcea 193, Sibiu 253, Timisoara 329, Urziceni 80, Vaslui 199, Zerind 374.]

(e) [4 marks] Assuming you are using the greedy (best first) search technique, describe your search path. Show your working. Draw a figure if necessary.

Greedy best-first search always expands the frontier node with the smallest straight-line distance to Bucharest. From Arad it expands Sibiu (253), then Fagaras (176), then Bucharest (0), giving the path Arad, Sibiu, Fagaras, Bucharest with cost 140 + 99 + 211 = 450 (not optimal).

COMP 307 Page 4 of 31 continued...

(f) [6 marks] Assuming you are using the A* search technique, describe your search path. Show your working. Draw a figure if necessary.

A* always expands the frontier node with the smallest f = g + h. Expansion order: Arad (f = 366), Sibiu (f = 140 + 253 = 393), Rimnicu Vilcea (f = 220 + 193 = 413), Fagaras (f = 239 + 176 = 415), Pitesti (f = 317 + 100 = 417), Bucharest (f = 418 + 0 = 418), giving the optimal path Arad, Sibiu, Rimnicu Vilcea, Pitesti, Bucharest with cost 418.

COMP 307 Page 5 of 31 continued...

Question 2.
Machine Learning Basics [30 marks]

(a) [4 marks] There are several different paradigms in machine learning. State the name of an algorithm or approach used in each of the following paradigms:
(i) Case based learning
(ii) Induction learning
(iii) Connectionist learning
(iv) Evolutionary learning

(i) k-nearest neighbour; (ii) decision tree learning; (iii) neural networks; (iv) genetic algorithms/genetic programming.

(b) [4 marks] The data sets are typically separated into a training set and a test set in (supervised) machine learning systems. Briefly describe the role of each of them.

A training set is a collection of instances that are directly used for learning a machine/concept/classifier. A test set is a collection of instances that are used to measure the performance of the learned machine/concept.

(c) [4 marks] Briefly describe the K Nearest Neighbour method used for classification tasks.

Each unseen instance (in the test set) is compared with all the instances in the training set by computing a distance (typically Euclidean distance) or similarity to every training instance. The k "nearest neighbour" instances in the training set are found according to this measure, and the majority class label among them (simply the neighbour's own label when k = 1) is chosen as the class label of the unseen instance.

COMP 307 Page 6 of 31 continued...

(d) [5 marks] The k-means method is widely used in machine learning.
(i) State whether this method is typically used for classification, clustering, or association rules.
(ii) This method has an important assumption. State this assumption.
(iii) Briefly describe the process of this method.

It is a clustering method, and assumes the number of clusters k is known in advance. The following steps are typically applied:
(1) Pick k initial "means" randomly from the data set.
(2) Create k clusters by associating every instance with the nearest mean based on a distance measure.
(3) Replace the old means with the centroid of each of the k clusters (as the new means).
(4) Repeat steps (2) and (3) until convergence.

(e) [4 marks] Briefly describe the K-fold cross validation method used in machine learning experiments.

Chop the available data into k equal chunks (folds). For each chunk in turn, treat it as the test set and the other k − 1 chunks as the training set; the classifier trained on the training set is applied to the test set. The process is repeated k times, with each chunk used exactly once as the test set, and the k results are averaged.

COMP 307 Page 7 of 31 continued...

(f) [9 marks] Consider the following data set with 10 instances describing whether a group of friends want to have a party in a large room at a local community centre or go outside for a BBQ; 5 instances were for InRoomParty (at the community centre) and 5 were for BBQ (a party outside), depending on the weather conditions. They are described by three attributes: whether it is clear or raining, whether the wind is strong or weak, and whether the temperature is hot, good, or cold.

Instance  Weather  Wind    Temp  Class
1         clear    weak    good  BBQ
2         clear    weak    hot   BBQ
3         clear    strong  good  BBQ
4         clear    weak    cold  BBQ
5         raining  weak    hot   BBQ
6         clear    strong  cold  InRoomParty
7         raining  strong  good  InRoomParty
8         raining  strong  cold  InRoomParty
9         raining  weak    cold  InRoomParty
10        clear    strong  hot   InRoomParty

These friends want to build a decision tree to help decide whether to have the party outside for a BBQ or stay inside a room at the community centre. Which attribute should be chosen for the root of the decision tree if the impurity function p(BBQ) p(InRoomParty) is used? Show your working.

Weather: 6/10 × (4/6 × 2/6) + 4/10 × (1/4 × 3/4) = 5/24 ≈ 20.83%
Wind: 5/10 × (4/5 × 1/5) + 5/10 × (1/5 × 4/5) = 4/25 = 16.0%
Temp: 3/10 × (2/3 × 1/3) + 3/10 × (2/3 × 1/3) + 4/10 × (1/4 × 3/4) = 5/24 ≈ 20.83%

Wind has the lowest score, therefore the algorithm will use Wind at the root.

COMP 307 Page 8 of 31 continued...
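The root-selection working above can be reproduced mechanically. Below is a short Python sketch (not part of the original exam) that computes the weighted impurity, the sum over attribute values of (subset size / 10) × p(BBQ) × p(InRoomParty), for each attribute over the 10-instance table:

```python
# Each row: (weather, wind, temp, class), transcribed from the table above.
DATA = [
    ("clear",   "weak",   "good", "BBQ"),
    ("clear",   "weak",   "hot",  "BBQ"),
    ("clear",   "strong", "good", "BBQ"),
    ("clear",   "weak",   "cold", "BBQ"),
    ("raining", "weak",   "hot",  "BBQ"),
    ("clear",   "strong", "cold", "InRoomParty"),
    ("raining", "strong", "good", "InRoomParty"),
    ("raining", "strong", "cold", "InRoomParty"),
    ("raining", "weak",   "cold", "InRoomParty"),
    ("clear",   "strong", "hot",  "InRoomParty"),
]

def weighted_impurity(attr_index):
    """Sum over attribute values of weight * p(BBQ) * p(InRoomParty)."""
    total = 0.0
    for v in {row[attr_index] for row in DATA}:
        subset = [row for row in DATA if row[attr_index] == v]
        p_bbq = sum(1 for r in subset if r[3] == "BBQ") / len(subset)
        total += len(subset) / len(DATA) * p_bbq * (1 - p_bbq)
    return total

scores = {name: weighted_impurity(i)
          for i, name in enumerate(["Weather", "Wind", "Temp"])}
print(scores)  # Wind has the lowest weighted impurity
```

Running it gives 0.16 for Wind against 5/24 ≈ 0.2083 for Weather and Temp, matching the hand calculation.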
SPARE PAGE FOR EXTRA ANSWERS
Cross out rough working that you do not want marked. Specify the question number for work that you do want marked.

COMP 307 Page 9 of 31 continued...

Question 3. Neural Networks [15 marks]

(a) [8 marks] Consider the following feed forward network which uses the sigmoid/logistic transfer function (see Appendix B).

[Figure: a 2-3-1 feed-forward network. Nodes 1 and 2 are the inputs, nodes 3, 4 and 5 are the hidden layer, and node 6 is the output.]

Weights: W13 = −1.1, W14 = 3.5, W15 = −2.2, W23 = −3.7, W24 = 0.5, W25 = 1.5, W36 = 1.8, W46 = −2.2, W56 = 8.1
Biases: b3 = 0.5, b4 = 1.0, b5 = 1.5, b6 = −4.5

(i) What will be the output of node 6 (O6) for the input vector (0.0, 0.0)?
(ii) What will be the new value of weight W56 after one epoch of training using the back propagation algorithm? Assume that the training set consists of only the vector (0.0, 0.0, 0.0) corresponding to the two input feature values and the target output value, and that the learning rate η is 0.25. Show your working.

O1 = I1 = O2 = I2 = 0; I3 = b3 = 0.5; O3 = f(0.5) = 0.62; O4 = f(1.0) = 0.73; O5 = f(1.5) = 0.82
I6 = 0.62 × 1.8 + 0.73 × (−2.2) + 0.82 × 8.1 + (−4.5) = 1.652; O6 = f(1.652) = 0.83
β6 = 0 − 0.83 = −0.83
ΔW56 = η O5 O6 (1 − O6) β6 = 0.25 × 0.82 × 0.83 × (1 − 0.83) × (−0.83) = −0.024
(W56)new = (W56)old + ΔW56 = 8.1 − 0.024 = 8.076

COMP 307 Page 10 of 31 continued...

(b) [4 marks] Peter Smith has developed a neural network classifier for use in the fruit industry. The classifier is intended to distinguish potentially rotten fruit from good fruit using images of individual fruit items. The process involves the extraction of 4 features from each image, and the use of a standard multilayer feed forward neural network trained with the back propagation learning algorithm for classification. There are 500 examples in total, of which 50 are used for network training and 450 for testing. The network architecture he used is 4-23-2.
After training for 18,000 epochs, the network classifier still performed badly on the test set. Suggest four good ways for improving the performance.

(1) Re-split the data sets and use more examples for training; (2) use fewer hidden nodes; (3) train for fewer epochs; (4) any other reasonable suggestions, e.g. get more and better features, or use a validation set to control overfitting.

(c) [3 marks] The standard back propagation algorithm does not specify when to stop the network training process. Briefly state three commonly used termination criteria in network training using the back propagation algorithm.

Any three of: (1) epoch/cycle control strategy; (2) error control strategy; (3) proportion/accuracy control strategy; (4) user control strategy; (5) validation control/early stopping strategy.

COMP 307 Page 11 of 31 continued...

SPARE PAGE FOR EXTRA ANSWERS
Cross out rough working that you do not want marked. Specify the question number for work that you do want marked.

COMP 307 Page 12 of 31 continued...

Question 4. Evolutionary Computation and Learning [20 marks]

(a) [3 marks] Genetic algorithms and genetic programming are two techniques in evolutionary computation and learning. State six additional techniques in evolutionary computation and learning.

Any six of: evolutionary strategies, evolutionary programming, classifier systems, particle swarm optimisation, differential evolution, evolutionary multi-objective optimisation, artificial immune systems, ant colony optimisation, etc.

(b) [6 marks] In evolutionary algorithms, the three main genetic operators, elitism, crossover and mutation, are often used to generate new candidate solutions. Briefly describe what each operator does and the main purpose for each of them.

Elitism (or reproduction) is the operator that directly copies a small set of the best individuals from the current generation to the next generation.
The main purpose is to make sure the evolution does not make the best solution worse. Crossover is the operator that combines genetic material from different individuals to form new candidate solutions. The main goal is to integrate the advantages of existing individual solutions to form better solutions. Mutation is the operator that randomly changes a part of a selected individual from the population. The main purpose is to maintain the diversity of the population.

COMP 307 Page 13 of 31 continued...

(c) [5 marks] Briefly describe the general evolutionary process in Evolutionary Algorithms. Draw a figure if necessary.

The answer should include initialisation, fitness evaluation, selection, mating (crossover and mutation), and a stopping criterion.

COMP 307 Page 14 of 31 continued...

(d) [6 marks] The standard tree-based genetic programming (GP) approach has been applied to many classification tasks. In this approach, each evolved program typically returns a single floating point number. One of the key issues here is to use a strategy to translate the single output value of an evolved classifier program into a set of class labels.

(i) In Assignment 2, GP was used to evolve a classifier to categorise the 699 instances in the Wisconsin medical data set into either the benign class or the malignant class. Suggest a strategy (rule) for translating the single program output into the above two classes.

(ii) Assuming you are going to use tree-based GP for a classification problem with five classes: class1, class2, class3, class4 and class5. Suggest a mapping strategy that can translate the single program output value into the five classes above.

(iii) State the advantages and limitations of your strategy for part (ii).

(i) For binary classification, the natural translation would be: if the program output value is positive, then the instance associated with the input terminals is classified as class 1; otherwise, class 2.
(ii) Choose four thresholds T1 < T2 < T3 < T4. If progOut ≤ T1, class 1; else if progOut ≤ T2, class 2; else if progOut ≤ T3, class 3; else if progOut ≤ T4, class 4; else class 5.

(iii) Advantage: easy to set up and use. Limitations: the class boundaries are fixed; the thresholds need to be predefined; the class order is fixed; etc.

COMP 307 Page 15 of 31 continued...

SPARE PAGE FOR EXTRA ANSWERS
Cross out rough working that you do not want marked. Specify the question number for work that you do want marked.

COMP 307 Page 16 of 31 continued...

Question 5. Uncertainty and Belief Nets [28 marks]

(a) [3 marks] Suppose that 5% of people on campus have colds. The likelihood of coughing given that one has a cold is 0.8 (that is, 80% of people with colds cough), whereas the likelihood of coughing without having a cold is 0.3. Alice is coughing: what is the probability that she has a cold? Show your working.

t = true, f = false. Use Bayes:
P(cold = t | cough = t) = P(cough = t | cold = t) P(cold = t) / P(cough = t)
The denominator is P(cough = t | cold = t) P(cold = t) + P(cough = t | cold = f) P(cold = f), so
P(cold = t | cough = t) = (0.8 × 0.05) / (0.8 × 0.05 + 0.3 × 0.95) = 0.04 / 0.325 ≈ 0.123

(b) [6 marks] The questions below relate to the following belief networks:

[Figure: two belief net structures, not recoverable from this extraction.]

In each case, answer either true or false. Reminder: the ⊥⊥ symbol stands for "independent", and the vertical line stands for "given".

• F ⊥⊥ A : True
• F ⊥⊥ A | D : False - easy
• F ⊥⊥ A | G : False - "explaining away" connection via D
• F ⊥⊥ G | H : False - F is a cause of G...
• F ⊥⊥ A | G, H : False - G tells you something about D...
• U ⊥⊥ Y | W : False - W makes V and X dependent...

COMP 307 Page 17 of 31 continued...

Suppose that a dataset is to be used for classifying email as Spam or not. Each email will be classified on the basis of four particular words (such as "friend", "invest", "lectures" and "mum"). The spam variable S takes values ∈ {true, false}, and variables W1, W2, W3, W4 are used to denote the presence (or otherwise) of the words.
The W variables take the value 1 if the associated word is present in the email, and 0 if it isn't. Under the "Naïve Bayes" assumption, the data set can be modelled via the belief net structure shown here.

[Figure: S with an arrow to each of W1, W2, W3, W4.]

(c) [3 marks] Why do you think this is "Naïve"?

It assumes the words are independent (given the S value), which isn't true of most text.

Suppose that the data you have does in fact obey the Naïve Bayes assumption, but that you don't know this is true, so you set about building the belief net's structure from scratch. One way to build a Belief Net structure is to do it incrementally, adding variables one by one and including links as you go for each dependence that is observed in the data.

(d) [5 marks] Suppose you build a belief net by adding nodes in the following order: W1 first, then W2, W3, S, and finally W4. Draw the belief net that would be constructed.

If the arrows must be added "top-down" (as in the AIMA example), then each of the first 3 W variables depends on all those added before it, since they will be correlated when the data obeys the graph from (c). S depends on all 3 of them. But W4 will only depend on S. (Alternatively, some students allowed the arrow direction to be inferred too, giving a different solution.)

COMP 307 Page 18 of 31 continued...

(e) [5 marks] Different orderings will give different network structures. Would you suggest using the network you gave in (d), or a different one, to compute P(S | W1, W2, W3, W4)? Explain your reasoning carefully.

(The answer depends on the student's answer to the previous question.) Generally: the (c) network is easy and has the fewest free parameters, which is appealing (especially since a real application would use lots of words). Many answered that the net with arrows from the words to S would be best: that might be bearable for this example, but it would not scale up (think about the size of the factor...).
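As an aside, the Naïve Bayes factorisation P(S) × P(W1|S) × P(W2|S) × P(W3|S) × P(W4|S) discussed above can be sketched in a few lines of Python. All the numbers below are invented for illustration; they are not estimated from any real data:

```python
# Hypothetical parameters for the net S -> W1..W4.
P_SPAM = 0.4                                    # P(S = true)
# P(W_i = 1 | S) for the words "friend", "invest", "lectures", "mum".
P_WORD_GIVEN_SPAM     = [0.50, 0.70, 0.05, 0.02]
P_WORD_GIVEN_NOT_SPAM = [0.30, 0.05, 0.40, 0.30]

def p_spam_given_words(w):
    """w is a 0/1 vector for the four words. Applies Bayes' rule with the
    naive assumption that the words are independent given S."""
    like_spam, like_ham = P_SPAM, 1 - P_SPAM
    for wi, ps, ph in zip(w, P_WORD_GIVEN_SPAM, P_WORD_GIVEN_NOT_SPAM):
        like_spam *= ps if wi else (1 - ps)
        like_ham  *= ph if wi else (1 - ph)
    return like_spam / (like_spam + like_ham)   # normalise

# An email containing "invest" but none of the other three words:
print(p_spam_given_words([0, 1, 0, 0]))
```

With these made-up parameters an email containing "invest" scores as likely spam, while one containing "lectures" and "mum" does not.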
(f) [3 marks] Draw the structure of the Belief Net corresponding to the following factorization:
P(A, B, C, D, E, F, G, H, I, J) = P(A) P(B|A) P(C) P(D|C) P(E|B,D) P(F) P(G|E,F) P(H) P(I|F,H) P(J|I)

too easy!

(g) [3 marks] List the independence relationship(s) that are enforced between G and J in the above factorisation.

G ⊥⊥ J | F, and G ⊥⊥ J | I. But note many students mis-interpreted the question, which is about the factorisation (not the Net they drew).

COMP 307 Page 19 of 31 continued...

Question 6. Inference [17 marks]

(a) [3 marks] The SUM-PRODUCT algorithm is run on a "factor graph" derived from the underlying belief net. What operation is carried out by a variable node in a factor graph, in the following cases:

• If the node's value is observed: it sends out a vector of zeros with a single one, marking the observed value.
• If the node's value is not observed: it sends out the element-wise product of all the OTHER incoming messages (or a vector of ones if there are none).

(b) [4 marks] What operation is carried out by a factor node that is not a leaf (ie. not a terminal node)? It may help to give a simple example.

It multiplies the factor by the incoming messages, then sums out all variables other than the one the outgoing message is sent to.

(c) [2 marks] What major constraint on the structure of the network must be obeyed in order for the SUM-PRODUCT algorithm to yield the correct result?

No loops.

COMP 307 Page 20 of 31 continued...

The lectures discussed two approaches to classification: a discriminative approach and a generative one.

(d) [4 marks] Describe the essential difference between these approaches, and give one example of each.

Discriminative maps features to class, e.g. neural nets or kNN or GP. Generative models how the class causes the features, and inverts that mapping to get the class from the features, e.g. Naive Bayes (and others...). NB: it is NOT just a probabilistic vs non-probabilistic distinction (discriminative methods can (should!) be probabilistic too).
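The distinction can be made concrete with a toy sketch. The following Python is illustrative only (the 1-D data and every parameter are invented): it fits a generative classifier, one Gaussian per class inverted with Bayes' rule, and a discriminative logistic regression, on the same data.

```python
import math
import random

random.seed(0)
# Toy 1-D data: class 0 ~ N(-2, 1), class 1 ~ N(+2, 1) (made up).
data = [(random.gauss(-2, 1), 0) for _ in range(200)] + \
       [(random.gauss(2, 1), 1) for _ in range(200)]

# Generative: model P(x | c) as a Gaussian per class, invert with Bayes.
def fit_generative(data):
    params = {}
    for c in (0, 1):
        xs = [x for x, y in data if y == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        params[c] = (mu, var, len(xs) / len(data))
    return params

def gen_predict(params, x):
    def score(c):  # unnormalised P(x | c) * P(c); the 1/sqrt(2*pi) cancels
        mu, var, prior = params[c]
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(var) * prior
    return 1 if score(1) > score(0) else 0

# Discriminative: logistic regression P(c=1 | x) fit by gradient ascent.
def fit_discriminative(data, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

params = fit_generative(data)
w, b = fit_discriminative(data)
print(gen_predict(params, 3.0), 1 / (1 + math.exp(-(w * 3.0 + b))))
```

Both models classify clearly-positive and clearly-negative points the same way here; the difference is that the generative one explicitly models how each class produces its feature values, while the logistic regression maps the feature straight to a class probability.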
(e) [4 marks] Suggest what advantages and disadvantages the discriminative approach might have, compared to the generative one.

Various answers possible. E.g. the discriminative approach is simpler to grasp and often simpler to code, and does not depend on modelling how the "input" data is generated. Once trained, it is typically very fast at classifying new instances. But it does not use knowledge of how the features are caused by the class, which (if known) is valuable information, and training can be complex and slow (e.g. NNs, GP).

COMP 307 Page 21 of 31 continued...

Question 7. Modelling sequences [15 marks]

Consider the following PGM which represents weather Wt being Rainy (R) vs Sunny (S) on successive days (only the first few are shown here). Suppose that the weather plays a role in determining whether one's mood mt is Happy (H) vs Grumpy (G) each day. The transition probabilities for weather (as they would appear in Genie for instance) are:

P(Wt+1 = R | Wt = R) = 0.7    P(Wt+1 = S | Wt = R) = 0.3
P(Wt+1 = R | Wt = S) = 0.1    P(Wt+1 = S | Wt = S) = 0.9

For example, P(Wt+1 = S | Wt = R) = 0.3, meaning if it is rainy on one day the probability of it being sunny on the next day is 30%.

(a) [4 marks] What is the probability that the weather on day 2 was Rainy (i.e. W2 = R), given that it was Rainy on day 1 and Sunny on day 3? In other words, what is P(W2 = R | W1 = R, W3 = S)? Show your working.

Can do by Bayes directly. An alternative (and simple) way is to look at the ratio:

P(W2 = R | W1 = R, W3 = S) / P(W2 = S | W1 = R, W3 = S)
= P(W1 = R, W2 = R, W3 = S) / P(W1 = R, W2 = S, W3 = S)   [the P(W1 = R, W3 = S) denominators cancel]
= [P(W1 = R) P(W2 = R | W1 = R) P(W3 = S | W2 = R)] / [P(W1 = R) P(W2 = S | W1 = R) P(W3 = S | W2 = S)]   [the P(W1 = R) terms cancel]
= (0.7 × 0.3) / (0.3 × 0.9)
= 7/9

So if that is the ratio, and the two probabilities must sum to one, we have
P(W2 = R | W1 = R, W3 = S) = 7 / (7 + 9) = 7/16

COMP 307 Page 22 of 31 continued...
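The ratio working in (a) can be confirmed by brute-force enumeration over W2, using the transition probabilities stated in the question. A small Python sketch:

```python
# P(next | current) for the two-state weather chain.
T = {"R": {"R": 0.7, "S": 0.3},
     "S": {"R": 0.1, "S": 0.9}}

def p_w2_given(w1, w3):
    """P(W2 = R | W1 = w1, W3 = w3) by summing out W2.
    The prior P(W1) appears in every term, so it cancels and is omitted."""
    joint = {w2: T[w1][w2] * T[w2][w3] for w2 in "RS"}
    return joint["R"] / (joint["R"] + joint["S"])

print(p_w2_given("R", "S"))  # 0.21 / (0.21 + 0.27) = 7/16 = 0.4375
```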
(b) [5 marks] What is the probability of Rain under the stationary distribution? 1/4, as per the calculation done in lecture (c) [1 mark] In the case that weather is not directly observed but moods generally are, what is the name given to this structure of belief net? Hidden Markov model (d) [5 marks] Suppose you no longer believed the current transition probabilities for weather were correct, and you acquired a data set consisting of a long series of mood observations: M1 , M2 , M3 , . . ., and so on. In words, describe how you could go about improving the transitions probabilities from this data. Description of EM learning in an HMM. We assume an initial factor relating weather as cause to mood as effect, and use this to run Sum-Product and get partial counts. Add Laplace smoothing to those, to get improved factors (both for the transitions we want and also the “emissions” from weather to mood). Need for iteration. COMP 307 Page 23 of 31 continued... SPARE PAGE FOR EXTRA ANSWERS Cross out rough working that you do not want marked. Specify the question number for work that you do want marked. COMP 307 Page 24 of 31 continued... Student ID: . . . . . . . . . . . . . . . . . . . . . . . . Question 8. Planning [20 marks] In Classical Planning, two ways of deriving a plan are known as forward and backward chaining. (a) [2 marks] Planning is more difficult if there are several states that “look the same”, meaning they cannot be distinguished by the sensors available to the agent. This is sometimes called “perceptual aliasing”. To be able to plan in this case, an agent needs to work with a different representation from the one it would use if it could always tell exactly what state it is in. What is the key difference between the two representations? Need to represent Belief states, not just states. (b) [3 marks] What are the consequences for the efficiency of classical planning? 
Planning has the potential to become exponentially harder, due to the "curse of dimensionality": the space of Belief states is exponentially larger than the space of states (unless we can find some strong valid assumptions to reduce it).

(c) [3 marks] In many scenarios the actual effects of actions may be somewhat stochastic and hence not certain to match the intended effects. In such a scenario, what general effect do (i) actions and (ii) sensors have upon Belief states?

Actions may "expand" the belief state to include more possible actual states (although note it is possible for actions to shrink it too), while sensing always "shrinks" it, essentially by ruling out possibilities.

COMP 307 Page 25 of 31 continued...

A Markov Decision Process (MDP) is defined by these four quantities:
• states, s1, s2, . . .
• actions, a1, a2, . . .
• rewards for each state, r1, r2, . . .
• transition probabilities P_{s'|s,a}

In an MDP, the "Value" V(s_t) of the state s_t at time t is defined to be the expected sum of future rewards R(s_t), R(s_{t+1}), R(s_{t+2}), . . ., where each of these rewards is discounted according to how far into the future it is. The Value depends on the current policy, usually denoted π. π_{a|s} is the probability of carrying out action a when in state s.

(d) [6 marks] Justify the use of the following equation (known as the "Back-Up" equation):

V^π(s) = R(s) + γ ∑_a π_{a|s} ∑_{s'} P_{s'|s,a} V^π(s')

(Awkwardly, the word "Justify" is ambiguous; "Explain" would have been better. We had to mark both interpretations as fairly as possible. Full marks for picking the equation apart, explaining each component, followed by a brief comment on how it can be used.)

COMP 307 Page 26 of 31 continued...
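Value Iteration repeatedly applies the Back-Up equation, with a max over actions in place of the policy average when optimal values are wanted. Below is a minimal Python sketch on a made-up two-state MDP; the states, rewards and transition probabilities are all invented for illustration.

```python
GAMMA = 0.9
STATES = ["cool", "hot"]
ACTIONS = ["wait", "run"]
REWARD = {"cool": 1.0, "hot": -1.0}
# P[s][a][s'] : transition probabilities (made up for illustration).
P = {
    "cool": {"wait": {"cool": 0.9, "hot": 0.1}, "run": {"cool": 0.5, "hot": 0.5}},
    "hot":  {"wait": {"cool": 0.2, "hot": 0.8}, "run": {"cool": 0.6, "hot": 0.4}},
}

def value_iteration(tol=1e-9):
    V = {s: 0.0 for s in STATES}
    while True:
        # Optimal Back-Up: V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
        V_new = {s: REWARD[s] + GAMMA * max(
                     sum(P[s][a][s2] * V[s2] for s2 in STATES)
                     for a in ACTIONS)
                 for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new

V = value_iteration()
print(V)
```

At convergence the returned values satisfy the Back-Up equation (with the max), and the greedy action in each state can then be read off by taking the argmax instead of the max.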
It seems that in order to apply Value Iteration to an MDP we need to know the transition probabilities, both to discover V(s) for all states via Value Iteration, and then to use those V(s) values to decide which action to carry out in a given state. However, suppose instead that we start off not knowing the transition probabilities. Two options might be:

• we could learn the transition probabilities from data, and then do standard Value Iteration; OR
• instead of finding V(s), we could learn the analogous value of carrying out each possible action in each state, often called a "Q value". The expected sum of discounted future rewards from carrying out action a in state s is denoted Q(s, a).

(e) [3 marks] How might you go about learning the transition probabilities of an MDP from data?

Simply count the occurrences of each outcome state given each starting state and the action, add Laplace smoothing, and normalise appropriately to get the probability.

(f) [3 marks] (Hard) Can you suggest a way to learn the Q values from experience?

Various answers are possible; full marks for a cogent argument / suggestions that were on-topic (not necessarily perfect/correct).

COMP 307 Page 27 of 31 continued...

Question 9. Philosophy of AI [10 marks]

Choose ONE of the below to discuss. For your choice, outline the issue at stake and indicate whether you find the argument compelling or dubious, and why. Answer at a depth appropriate to the marks on offer for this question.

• Searle's "Chinese Room" thought experiment (against functionalism); OR
• the "brain replacement" thought experiment (in favour of functionalism); OR
• In recent years the use of probabilistic inference in learned models has become widespread in AI. Noam Chomsky has argued that the field's heavy use of probability-based models to pick regularities in masses of data is unlikely to yield the explanatory insight that science ought to offer.
(Others such as Peter Norvig have argued the converse, that developing and implementing such models is actually a promising way to arrive at insight.)

Feel free to write your answer on other paper rather than the box, if necessary.

(A wide variety of answers are possible here; marks were awarded for a coherent argument, and the ability to see multiple sides of the issue, indicative of having given it some thought. Regurgitating only the question + opinion without argument got half marks.)

COMP 307 Page 28 of 31 continued...

COMP 307 Page 29 of 31

SPARE PAGE FOR EXTRA ANSWERS
Cross out rough working that you do not want marked. Specify the question number for work that you do want marked.

COMP 307 Page 30 of 31 continued...

Appendix for COMP307 exam
(You may tear off this page if you wish.)

A. Some Formulae You Might Find Useful

p(C | D) = p(D | C) p(C) / p(D)                                 (1)
f(x_i) = 1 / (1 + e^{−x_i})                                     (2)
O_i = f(I_i) = f(∑_k w_{k→i} · o_k + b_i)                       (3)
Δw_{i→j} = η o_i o_j (1 − o_j) β_j                              (4)
β_j = ∑_k w_{j→k} o_k (1 − o_k) β_k    (hidden node)            (5)
β_j = d_j − o_j                        (output node)            (6)

B. Sigmoid/Logistic Function

[Figure: graph of the sigmoid/logistic function.]

COMP 307 Page 31 of 31
