POMDP Planning for Robust Robot Control

Joelle Pineau
School of Computer Science
McGill University
Montreal QC CANADA H3A 2A7
jpineau@cs.mcgill.ca
Geoff Gordon
Center for Automated Learning and Discovery
Carnegie Mellon University
Pittsburgh PA 15232
ggordon@cs.cmu.edu
Abstract
POMDPs provide a rich framework for planning and control in partially
observable domains. Recent algorithms have greatly improved the scalability of POMDP solvers, to the point where they can be used in robot applications.
In this paper, we describe how approximate POMDP solving can be further
improved by the use of a new theoretically-motivated algorithm for selecting
salient information states. We present the algorithm, called PEMA, demonstrate competitive performance on a range of navigation tasks, and show how
this approach is robust to mismatches between the robot’s physical environment and the model used for planning.
1 Introduction
The Partially Observable Markov Decision Process (POMDP) has long been recognized as a rich framework for real-world planning and control problems, especially in robotics. However, exact solutions are typically intractable for all but the
smallest problems. The main obstacle is that POMDPs assume that world states
are not directly observable, therefore plans are expressed over information states.
The space of information states is the space of all beliefs a system might have
about the world state. Information states are easy to calculate from sensor measurements, but planning over them is generally considered intractable, since the
number of information states grows exponentially with planning horizon.
Recent point-based techniques for approximating POMDP solutions have proven
effective for scaling-up planning in partially observable domains [5,10,11]. These
reduce computation by optimizing a value function over a small subset of information states (or beliefs). Often, the quality of the solution depends on which beliefs
were selected, but most techniques use ad-hoc methods for selecting beliefs.
In this paper, we describe a new version of the point-based value approximation
which features a theoretically-motivated approach to belief point selection. The
main insight is to select points which minimize a bound on the error of the value
approximation. This allows us to solve large problems with fewer points than
previous algorithms, which leads to faster planning times. Furthermore, because
a reachability analysis is used to select candidate points, we restrict the search to
relevant dimensions of the belief, thereby alleviating the curse of dimensionality.
The new algorithm is key to the successful control of an indoor mobile service
robot, designed to seek and assist the elderly in residential environments. The
experiments we present show the robustness of the approach to a variety of challenging factors, including limited sensing, sensor noise, and inaccurate models.
2 Background
The Partially Observable Markov Decision Process (POMDP) provides a general
framework for acting optimally in partially observable domains. It is well-suited
to a great number of robotics problems where decision-making must be robust to
sensor noise, stochastic controls, and poor models. This section first establishes
the basic terminology and essential concepts pertaining to POMDPs.
2.1 Basic POMDP Terminology
We assume the standard formulation, whereby a POMDP is defined by the n-tuple:
{S, A, Z, b_0, T, O, R}. The first three components, S, A and Z, denote finite, discrete sets, where S is the set of states, A is the set of actions, and Z is the set of
observations. In general, it is assumed that the state at a given time t, s_t, is not
observable, but can be partially disambiguated through the observation z_t. The
next three quantities, b_0, T and O, define the probabilistic world model that underlies the POMDP: b_0 describes the probability that the domain is in each state
at time t = 0; T(s, a, s') describes the state-to-state transition probabilities (e.g.
the robot motion model); O(s, a, z) describes the observation probability distribution
(e.g. the sensor model). Finally, R(s, a) : S × A → ℝ is a (bounded) reward function
quantifying the utility of each action in each state.
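To fix notation for the sketches that follow, the tuple can be stored as a handful of dense arrays. The snippet below (Python/numpy; the class name, field names and array layout are our own illustrative choices, not the paper's) is the representation assumed by the later code sketches in this section.

```python
# A minimal sketch of the POMDP tuple {S, A, Z, b0, T, O, R} as dense numpy
# arrays. Field names and shapes are illustrative assumptions, not the paper's.
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    b0: np.ndarray    # initial belief, shape (|S|,), entries sum to 1
    T: np.ndarray     # T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A|, |S|)
    O: np.ndarray     # O[s', a, z] = Pr(z | s', a), shape (|S|, |A|, |Z|)
    R: np.ndarray     # R[s, a], bounded reward, shape (|S|, |A|)
    gamma: float      # discount factor, 0 <= gamma < 1
```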
2.2 Belief Computation
POMDPs assume that the state st is not directly observable, but instead the agent
perceives observations {z1 , . . . , zt } which convey information about the state.
From these, the agent can compute a belief, or probability distribution over possible world states: b_t(s) = Pr(s_t = s | z_t, a_{t-1}, z_{t-1}, . . . , a_0). Because POMDPs
are instances of Markov processes, the belief b_t at time t can be calculated recursively, using only the belief one time step earlier, b_{t-1}, along with the most recent
action a_{t-1} and observation z_t:

    b_t(s) = \tau(b_{t-1}, a_{t-1}, z_t) := \frac{O(s, a_{t-1}, z_t) \sum_{s' \in S} T(s', a_{t-1}, s)\, b_{t-1}(s')}{\Pr(z_t \mid b_{t-1}, a_{t-1})}    (1)
This is equivalent to the Bayes filter, and in robotics, its continuous generalization forms the basis of the well-known Kalman filter. In many large robotics applications, tracking the belief can be computationally challenging. However in
POMDPs, the bigger challenge is the generation of an action-selection policy. We
assume throughout this paper that the belief can be computed accurately, and focus
on the problem of finding good policies.
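Equation 1 translates almost directly into code. The sketch below assumes the dense array layout introduced above and is meant only as an illustration, not as the implementation used in the paper.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes filter step of Equation 1: returns tau(b, a, z).

    b: belief, shape (|S|,);  a, z: action and observation indices.
    T[s, a, s'] = Pr(s'|s,a) and O[s', a, z] = Pr(z|s',a), as assumed above."""
    predicted = T[:, a, :].T @ b           # Pr(s' | b, a) = sum_s T(s,a,s') b(s)
    unnormalized = O[:, a, z] * predicted  # weight by the observation likelihood
    norm = unnormalized.sum()              # Pr(z | b, a)
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / norm
```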
2.3 Policy Computation
The POMDP framework’s primary purpose is to optimize an action-selection policy of the form π(b) → a, where b is a belief distribution and a is the action
chosen by the policy π. We say that a policy π*(b_{t_0}) is optimal when the expected
future discounted reward is maximized:

    \pi^*(b_{t_0}) = \operatorname{argmax}_{\pi} E_{\pi}\!\left[ \sum_{t=t_0}^{T} \gamma^{t-t_0} r_t \,\middle|\, b_{t_0} \right]    (2)
Computing an optimal policy over all possible beliefs can be challenging [2], and
so many recent POMDP approximations have been proposed which gain computational advantage by applying value updates at a few specific belief points [5, 7,
10, 11]. These techniques differ in how they select the belief points, but all use the
same procedure for updating the value over a fixed set of points. The key to updating a value function over a fixed set of beliefs, B = {b0 , b1 , ..., bq }, is in realizing
that the value function contains at most one α-vector for each belief point, thus
yielding a fixed-size solution set: Γt = {α0 , α1 , . . . , αq }.
The standard procedure for point-based value update is the following. First we
generate intermediate sets \Gamma_t^{a,*} and \Gamma_t^{a,z}, \forall a \in A, \forall z \in Z:

    \Gamma_t^{a,*} \leftarrow \{\alpha^{a,*}\}, \quad \text{where } \alpha^{a,*}(s) = R(s, a)

    \Gamma_t^{a,z} \leftarrow \{\alpha_i^{a,z} \mid \alpha_i \in \Gamma_{t-1}\}, \quad \text{where } \alpha_i^{a,z}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha_i(s')    (3)
Next, we take the expectation over observations and construct \Gamma_t^b, \forall b \in B:

    \Gamma_t^b \leftarrow \{\alpha^{a,b} \mid a \in A\}, \quad \text{where } \alpha^{a,b} = \Gamma_t^{a,*} + \sum_{z \in Z} \operatorname{argmax}_{\alpha \in \Gamma_t^{a,z}} \sum_{s \in S} \alpha(s) b(s)    (4)
Finally, we find the best action for each belief point:
    \Gamma_t \leftarrow \{\alpha^b \mid b \in B\}, \quad \text{where } \alpha^b = \operatorname{argmax}_{\alpha \in \Gamma_t^b} \sum_{s \in S} \alpha(s) b(s)    (5)
Because the size of the solution set Γt is constant, the point-based value update
can be computed in polynomial time. And while these operations preserve only
the best α-vector at each belief point b ∈ B, an estimate of the value function at
any belief in the simplex (including b ∉ B) can be extracted from the set \Gamma_t:

    V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s) b(s)    (6)
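Under the same assumed array layout, one point-based value update (Equations 3-5) and the value estimate of Equation 6 can be sketched as follows; this is a direct, unoptimized reading of the equations rather than the authors' code.

```python
import numpy as np

def point_based_backup(B, Gamma_prev, T, O, R, gamma):
    """One point-based value update (Equations 3-5) over a fixed belief set B.

    B: beliefs, shape (q, |S|).  Gamma_prev: alpha-vectors, shape (k, |S|).
    T[s,a,s'], O[s',a,z], R[s,a] follow the layout assumed in the earlier sketch."""
    nS, nA, nZ = O.shape
    # Eq. 3: alpha_i^{a,z}(s) = gamma * sum_{s'} T(s,a,s') O(s',a,z) alpha_i(s'),
    # stored as Gamma_az[a][z] with shape (k, |S|).
    Gamma_az = [[gamma * (Gamma_prev * O[:, a, z]) @ T[:, a, :].T
                 for z in range(nZ)] for a in range(nA)]

    Gamma_t = []
    for b in B:
        best_alpha, best_val = None, -np.inf
        for a in range(nA):
            # Eq. 4: alpha^{a,b} = R(.,a) + sum_z argmax_{alpha in Gamma^{a,z}} alpha.b
            alpha_ab = R[:, a].copy()
            for z in range(nZ):
                vals = Gamma_az[a][z] @ b
                alpha_ab += Gamma_az[a][z][int(np.argmax(vals))]
            # Eq. 5: for each b, keep the alpha^{a,b} with the largest value at b.
            val = float(alpha_ab @ b)
            if val > best_val:
                best_alpha, best_val = alpha_ab, val
        Gamma_t.append(best_alpha)
    return np.array(Gamma_t)

def value_at(b, Gamma):
    """Eq. 6: estimate of the value at any belief b from the set Gamma."""
    return float(np.max(Gamma @ b))
```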
2.4 Error Bound on Point-Based Value Updates
The point-based value update operation is an integral part of many approximate
POMDP solvers. As shown in [5], given a fixed belief set B and planning horizon
t, the error over multiple value updates is bounded by¹:

    \|V_t^B - V_t^*\|_\infty \leq \frac{(R_{\max} - R_{\min})}{(1-\gamma)^2} \max_{b' \in \Delta} \min_{b \in B} \|b - b'\|_1
where b' ∈ Δ is the point where the point-based update makes its worst error in
value update, and b ∈ B is the closest (1-norm) sampled belief to b'. Now let α be
the vector that is maximal at b, and α' be the vector that would be maximal at b'.
Then, we can show equivalently that

    \epsilon(b') \leq \alpha' \cdot b' - \alpha \cdot b'
                 \leq (\alpha' - \alpha) \cdot (b' - b)
                 \leq \sum_i \begin{cases} \left(\frac{R_{\max}}{1-\gamma} - \alpha_i\right)(b'_i - b_i) & b'_i \geq b_i \\ \left(\frac{R_{\min}}{1-\gamma} - \alpha_i\right)(b'_i - b_i) & b'_i < b_i. \end{cases}
¹ The error bound proven in [5] depends on the sampling density over the belief simplex Δ. But when the initial belief b_0 is known, it is not necessary to sample all of Δ densely. Instead, we can sample the set of reachable beliefs Δ̄ densely; the error bound then holds on Δ̄.
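Given a candidate belief b', the bound above is cheap to evaluate once the closest selected belief and its maximizing α-vector are found. The following sketch is one possible implementation under the array conventions assumed earlier; it is illustrative only.

```python
import numpy as np

def point_error_bound(b_prime, B, Gamma, R_max, R_min, gamma):
    """Bound epsilon(b') on the value-update error at a candidate belief b'
    (Section 2.4), measured against the closest (1-norm) selected belief in B.

    B: selected beliefs, shape (q, |S|).  Gamma: current alpha-vectors, (k, |S|)."""
    # Closest selected belief under the 1-norm.
    b = B[np.argmin(np.abs(B - b_prime).sum(axis=1))]
    # Alpha-vector maximal at b; the vector maximal at b' is unknown, hence the
    # worst-case per-coordinate coefficients below.
    alpha = Gamma[np.argmax(Gamma @ b)]
    diff = b_prime - b
    coeff = np.where(diff >= 0.0,
                     R_max / (1.0 - gamma) - alpha,
                     R_min / (1.0 - gamma) - alpha)
    return float(coeff @ diff)
```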
3 Error-Minimization Point Selection
Many recent point-based value approximations, which show good empirical success, use poorly informed heuristics to select belief points. We now describe a new
algorithm for selecting provably good belief points. The algorithm directly uses
the error bound above to pick those reachable beliefs b ∈ Δ̄ which most reduce the
error bound. Figure 1a shows the tree of reachable beliefs, starting with the initial
belief (top node). Building the tree (to a finite depth) is easily done by recursively
applying Equation 1.
Figure 1: (a) The set of reachable beliefs: a tree rooted at the initial belief b_0, with a child b^{a,z} for each action-observation pair. Each node corresponds to a specific belief, and increasing depth corresponds to an increasing plan horizon. (b) Pearl the Nursebot interacting with patients in a nursing facility.
Applying point-based value updates to all reachable beliefs would guarantee optimal performance, but at the expense of computational tractability: a planning
problem of horizon t has O(|A|^t |Z|^t) reachable beliefs. So we select from our
reachable beliefs those most likely to minimize the error in our value function.
Given the belief tree in Figure 1a, we consider three sets of nodes. Set 1 includes
all points already in B (in this example b_0 and b^{a_0,z_0}). Set 2 contains the set of
candidates from which we will select new points to be added to B. We call this set
the fringe (denoted B̄). Set 3 contains all other reachable beliefs.²
Now we need to decide which belief b should be removed from the fringe B̄ and
added to the set of active points B. Every new point added to B should improve
our estimate of the value function as much as possible. To find the point most
likely to do this, we consider the theoretical analysis of Section 2.4. Consider
b' ∈ B̄, a belief point candidate, and b ∈ B, some belief which we have already
selected. While one could simply pick the candidate b' ∈ B̄ with the largest error
bound ε(b'), this would go against the most useful insight from earlier work on
point-based approaches: namely that reachability considerations are important. So
we need to factor in the probability of each candidate belief point occurring. We
first note that the error bound at any given belief point b in the tree can be evaluated
from that of its immediate descendants:
    \bar{\epsilon}(b) = \max_{a \in A} \sum_{z \in Z} O(b, a, z)\, \epsilon(\tau(b, a, z))    (7)
² In Figure 1a, the fringe (B̄) is restricted to the immediate descendants of the points in B. The rest of the paper proceeds on this assumption, but we could assume a deeper fringe.
Figure 2: Policy performance (top row) and estimate of the bound on the error (bottom row) for the selected belief points, plotted as a function of the number of belief points, for the Tiger-grid, Hallway, Hallway2 and Tag domains.
where τ(b, a, z) is the belief update equation (Eqn 1), and ε(τ(b, a, z)) is evaluated
as in Section 2.4 (unless τ(b, a, z) ∈ B, in which case ε(τ(b, a, z)) = 0). So we
use Equation 7 to find the existing point b ∈ B with the largest error bound, then
pick as a new point the descendant τ(b, a, z) which has the largest impact on ε̄(b).
Points on the fringe are picked one at a time, allowing us to look deep into the tree; in
the experiments presented below, beliefs at 40+ levels are in fact selected.
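Combining Equation 7 with the per-point bound, one round of PEMA's point selection might look like the sketch below. It reuses the hypothetical helpers belief_update() and point_error_bound() from the earlier sketches, restricts the fringe to immediate descendants of B (as in footnote 2), and is a simplified illustration rather than the authors' implementation.

```python
import numpy as np

def expand_one_point(B, Gamma, T, O, R_max, R_min, gamma):
    """One PEMA expansion step (Section 3): score each b in B with Equation 7,
    then return the descendant tau(b, a, z) that contributes most to the score
    of the worst-scoring b."""
    nS, nA, nZ = O.shape

    def descendant_terms(b):
        # For each action a: list of (Pr(z|b,a) * eps(tau(b,a,z)), tau(b,a,z)).
        per_action = []
        for a in range(nA):
            terms = []
            for z in range(nZ):
                pz = float(O[:, a, z] @ (T[:, a, :].T @ b))   # Pr(z | b, a)
                if pz == 0.0:
                    continue
                b_next = belief_update(b, a, z, T, O)
                in_B = any(np.allclose(b_next, bb) for bb in B)
                eps = 0.0 if in_B else point_error_bound(
                    b_next, B, Gamma, R_max, R_min, gamma)
                terms.append((pz * eps, b_next))
            per_action.append(terms)
        return per_action

    best_new, best_bound = None, -np.inf
    for b in B:
        per_action = descendant_terms(b)
        # Equation 7: eps_bar(b) = max_a sum_z Pr(z|b,a) * eps(tau(b,a,z))
        scores = [sum(t for t, _ in terms) for terms in per_action]
        a_star = int(np.argmax(scores))
        if scores[a_star] > best_bound and per_action[a_star]:
            best_bound = scores[a_star]
            # New point: the descendant with the largest weighted contribution.
            best_new = max(per_action[a_star], key=lambda tb: tb[0])[1]
    return best_new
```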
This concludes the presentation of our new error-minimization point selection
technique. In practice, the addition of new points is always interleaved with the
point-based value updates described in Section 2.3 to form a full POMDP solution. The complete approach, called PEMA (Point-based Error Minimization
Algorithm), is now evaluated empirically in a series of robot control experiments.
4 Empirical evaluation
We begin our empirical evaluation with a few well-studied maze navigation domains. Most have been used strictly in simulation, but feature robot-like assumptions, such as non-deterministic motion and noisy sensors. The Tiger-grid, Hallway and Hallway2 problems are described in [3]. The Tag domain was introduced
in [5]. The goal of these preliminary experiments is simply to compare the performance of PEMA with earlier POMDP approximations on standard problems.
More extensive robot navigation domains are presented in the following section.
Error estimates. A first set of results on PEMA’s performance are shown in
Figure 2. For each problem domain, we first plot PEMA’s reward performance
as a function of the number of belief points (top graphs), and then plot the error
estimate of each point selected according to the order in which points were picked
(bottom graphs). As shown in these, PEMA is able to solve all four problems with
relatively few beliefs (sometimes fewer than the number of states).
Considering the error bound graphs, we see that overall there seems to be reasonably good correspondence between an improvement in performance and a decrease in the error estimates. We can conclude from these plots that the error bound
used by PEMA is quite informative in guiding exploration of the belief simplex.³
Comparative analysis. While the results outlined above show that PEMA is
able to handle a wide spectrum of large-scale POMDP domains, it is also useful to compare its performance to that of alternative approaches, on the same set
of problems. Figure 3 compares both reward performance and policy size⁴ (# of
nodes in controller) for a few recent POMDP algorithms, on the three larger problems (Hallway, Hallway2, and Tag). The algorithms included in this comparison
were selected simply based on the availability of published results for this set of
problems.
Figure 3: Results for standard POMDP domains, comparing reward performance (top row) and log(policy size) (bottom row) for QMDP, BPI, PBUA, PBVI, BBSLS, HSVI, Perseus and PEMA. Left column: Hallway problem. Middle column: Hallway2 problem. Right column: Tag problem.
As is often the case, these results show that there is not a single algorithm that is
best for solving all problems, so it is difficult to draw broad generalizations. But
we can point out a few salient effects. First, the baseline QMDP [3] approximation
is clearly outclassed by other more sophisticated methods. We also observe that
some of the algorithms achieve sub-par performance in terms of expected reward:
BPI [9] (on Hallway2 and Tag)⁵, PBVI [5] (on Tag) and BBSLS [1] (on Tag).
While each of these is theoretically able to reach optimal performance, they would
require larger controllers (and therefore longer computation time) to do so.
³ While the decrease in error over a fixed point (e.g. b_0) is monotonic, the decrease in error over each new point (in the order it was added) is not necessarily monotonic, which explains the large jumps in the bottom graphs. These jumps suggest that PEMA could be improved by maintaining a deeper fringe of candidate belief points, in which case the time spent selecting points would have to be carefully balanced with the time spent planning. Currently, we spend less than 1% of computation time selecting belief points; the rest is spent estimating the value function.
⁴ The results were computed on different platforms, so time comparisons are difficult. The size of the final policy is often a useful indicator of computation time, but should be considered with care.
⁵ Better results for BPI have since been published in [8].

The remaining algorithms (HSVI [10], Perseus [11], and PEMA) offer comparable performance. HSVI offers good control performance on the full range of
tasks, but requires bigger controllers. HSVI and PEMA share many similarities:
both use an error bound to select belief points. HSVI's upper bound is tighter than
PEMA's, but requires costly LP solutions. PEMA solves problems with fewer belief points; we believe this is because it updates all belief points more frequently,
thus generalizing better in poorly explored areas of the belief simplex.
Between Perseus and PEMA, the trade-offs are less clear: the planning time, controller size and performance quality are quite comparable. These two approaches
in fact share many similarities. Perseus uses the same point-based backups as in
PEMA (see Section 2.3), but it differs in both how the set of belief points is selected (Perseus uses random exploration traces), and the order in which it updates
the value at those points (also randomized). The effect of these differences is hard
to pin down. We did experiment informally with Perseus-type random updates within
PEMA, but this did not yield significant speed-up. It is likely that randomizing
value updates is not as beneficial when carefully picking a small set of essential
points. We speculate that PEMA will scale better to higher dimensions because of
the selective nature of the belief sampling. This is the subject of ongoing work.
5 Robotic applications
Much of the algorithmic development described in this paper is motivated by our
need for high-quality robust planning for interactive mobile robots. In particular,
we are concerned with the problem of controlling a nursing assistant robot. This
is an important technical challenge arising from the Nursebot project [6]. This
project aims to develop personalized robotic technology that can improve the level
of personal care and services for elderly individuals. The robot Pearl (Fig. 1b)
is the main experimental platform used in this project. It is equipped with standard indoor navigation abilities and is programmed with the CARMEN toolkit [4].
An important task for this robot is to provide timely cognitive reminders (e.g.
medications to take, appointments to attend, etc.) to its target population. It is
therefore crucial that the robot be able to find the person whenever it is time to
issue a reminder. We model this task as a POMDP, and use PEMA to optimize a
strategy with which the robot can robustly find the person, even under very weak
assumptions over the person’s initial location and ease of mobility.
We begin by considering the environment in which the robot operates. Figure 4
shows a 2D robot-generated map of its physical environment. The goal is for the
robot to navigate in this environment until it finds the patient and then deliver the
appropriate reminder. To successfully find the patient, the robot needs to systematically explore the environment, while reasoning about both its spatial coverage
and the likely motion pattern of the person.
5.1 POMDP Modeling
To model this task as a POMDP, we assume a state space consisting of two features: RobotPosition, and PersonPosition. Each feature is expressed through a
fixed discretization of the environment (roughly 25 cells for each feature, or 625
total states). We assume the person and robot move freely, constrained only by
walls and obstacles. The robot’s motion is deterministic (as a function of the action={North, South, East, West}). A fifth action (DeliverMessage) concludes the
scenario if applied when the robot and person are in the same location. We assume the person's motion is stochastic, and in one of two modes: (1) whenever
the person is far from the robot, s/he moves according to Brownian motion (i.e.
in each cardinal direction with Pr = 0.1, or stays in place); this corresponds to a
random walk and is a conservative assumption regarding people's motion; or (2)
whenever the robot is within sight (< 4m), the person tries to avoid the robot and
moves away from it (with noise), which makes the task more challenging.
The observation function has two parts: what the robot senses about its own position, and what it senses about the person’s position. First we assume that the
robot’s position is fully known; this is reasonable since planning is done at a much
coarser resolution (2m) than the typical localization precision (10cm). When
testing policies, however, probabilistic localization is performed by the CARMEN
toolkit, and the robot’s belief incorporates any positional uncertainty. For the person’s position, we assume that the robot perceives nothing unless the person is
within 2 meters. This is plausible given the robot's sensors. Even at short range,
there is a small probability (Pr = 0.01) that the robot will miss the person.
The reward function is straightforward: R = −1 for any motion, R = 10 when
the robot decides to DeliverMessage and is within range (<2m) of the person, and
R = −100 when the robot decides to DeliverMessage in the person’s absence.
The task terminates when the robot successfully delivers the message. We assume
a discount factor proportional to the map’s resolution (γ = 0.98).
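To make the model concrete, the sketch below shows how its main ingredients (the composite state, the reward, and the two person-motion modes) could be encoded. The cell count, the adjacency structure and the 'move away from the robot' rule are placeholders standing in for the map-derived quantities of the real model.

```python
import numpy as np

# Illustrative constants; the real discretization, adjacency and distances come
# from the robot-generated map and are not reproduced here.
N_CELLS = 25                      # ~25 cells per feature => 625 composite states
ACTIONS = ["North", "South", "East", "West", "DeliverMessage"]
GAMMA = 0.98

def state_index(robot_cell, person_cell):
    """Composite state over RobotPosition x PersonPosition."""
    return robot_cell * N_CELLS + person_cell

def reward(action, person_within_2m):
    """Reward from Section 5.1: -1 per motion action, +10 for a correct
    DeliverMessage (person within 2m), -100 for a false delivery."""
    if action != "DeliverMessage":
        return -1.0
    return 10.0 if person_within_2m else -100.0

def person_transition(person_cell, robot_cell, neighbors, robot_visible):
    """Distribution (length N_CELLS) over the person's next cell.

    Mode 1 (robot far): Brownian motion, each cardinal neighbor with Pr = 0.1,
    remaining mass on staying in place.  Mode 2 (robot within sight, < 4m): the
    person moves away from the robot, with noise; the 'away' test below is a
    placeholder for the map-based rule used in the real model."""
    p = np.zeros(N_CELLS)
    nbrs = neighbors[person_cell]          # assumed adjacency list for the grid
    if not robot_visible:
        for c in nbrs:
            p[c] = 0.1
        p[person_cell] = 1.0 - 0.1 * len(nbrs)
    else:
        away = [c for c in nbrs if abs(c - robot_cell) > abs(person_cell - robot_cell)]
        targets = away if away else [person_cell]
        for c in targets:
            p[c] += 0.9 / len(targets)     # mostly move away from the robot...
        p[person_cell] += 0.1              # ...with some noise (stay in place)
    return p
```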
With these POMDP parameters, we can run PEMA to optimize the robot's control strategy. Given the complexity of POMDP planning, we do assume that PEMA
will be used as an off-line algorithm to optimize the robot’s performance prior
to deployment. The results presented below describe the performance of an optimized control policy when tested onboard the CARMEN simulator.
5.2 Experimental Results
We first consider PEMA’s performance on this task, as a function of planning time.
As shown in Figure 4a, PEMA is in fact able to solve the problem within 1800 seconds, using only 128 belief points. In comparison, an MDP-type approximation
(in this case the QMDP technique [3]) proves to be inadequate for a problem exhibiting such complex uncertainty over the person’s position. Using PEMA, the
patient was found in 100% of trials, compared to 35% for QMDP.
Figure 4 shows PEMA’s policy through five snapshots from one run. The policy
is optimized for any start positions (for both the person and the robot); the execution trace in Figure 4 is one of the longer ones since the robot searches the entire
environment before finding the person. In this scenario, the person starts at the
far end of the left corridor. The person’s location is not shown in the figure since
it is not observable by the robot. The figure instead shows the belief over person
positions, represented by a distribution of point samples (grey dots). We see the
robot starting at the far right end of the corridor (Fig. 4b), moving towards the
left until the room’s entrance (Fig. 4c), and searching the entire room (Fig. 4d).
Once sufficiently certain that the person is not there, it exits the room (Fig. 4e),
and moves towards the left until it finally finds the person at the end of the corridor
(Fig. 4f).
It is interesting to compare snapshots (b) and (d). The robot position in both is
practically identical. Yet in (b) the robot chooses to go up into the room, whereas
in (d) the robot chooses to move toward the left. This is a direct result of planning
over beliefs, rather than over states.

These results show that PEMA is able to handle realistic domains. In particular,
throughout these experiments, the robot simulator was in no way constrained
to behave as described in our POMDP model. For example, the robot's actions
often had stochastic effects, the robot's position was not always fully observable,
and belief tracking had to be performed asynchronously (i.e. not a strict
alternation of actions and observations). Despite this mismatch between the model
assumed for planning and the execution environment, the control policy optimized
by PEMA successfully completed the task.
5.3 Robustness to modeling errors
Like most POMDP solvers, PEMA assumes exact knowledge of the POMDP
model. In reality, this model is often
hand-crafted and may bear substantial
error. In our experience, such a mismatch between the model and the real system does not necessarily render our solution useless. The robustness built into
POMDPs to overcome state uncertainty
often goes a long way towards overcoming model uncertainty. Nonetheless, there
are cases where a poor model can be
catastrophic. In this section, we try to
gain a better understanding of the impact of errors in the model we used for
the Find-the-patient domain.
Our model assumes that the robot can
see the patient with Pr = 0.99 whenever s/he is within 2m. We use this parameter both for solving and tracking.
But it could be that in fact the person is
only detected with Pr = 0.8.
Figure 4: Find-the-patient domain: (a) Performance results (reward as a function of planning time, PEMA vs. QMDP). (b)-(f) Sample trajectory.
                               Pr_real(z)
Pr_model(z)        0.99       0.90       0.80       0.70
0.99               -9.7      -11.3      -13.2      -15.5
0.90              -12.0      -13.1      -15.6      -19.0
0.80               -9.7      -11.5      -13.1      -14.5
0.70              -17.8      -19.4      -22.0      -22.6
Table 1: Sensitivity analysis over observation probabilities. (CI for all: [0.7,1.4])
What would be the loss in performance, compared to planning and tracking with the correct parameter? Table 1 examines the effects of this type of
modeling error. It shows the performance (avg. sum of rewards over 1000 trajectories) when applying PEMA and tracking the belief with the sensor accuracy in the
left column, but testing with the accuracy in the top row. The main diagonal contains cases where the model is correct. These results suggest two things. First, as
expected, performance degrades as the real noise level increases (i.e. a left-to-right
effect for any given row). Second, and this was not anticipated, the dominating
performance factor is in fact the noise in the assumed model: regardless of what
conditions are used for testing, results are better for some values of Prmodel (0.99
and 0.8) and worse for others (0.9 and 0.7). We hypothesize that this happens
because in some models, PEMA did not have sufficient belief points to perform
well (all policies were optimized with |B|=512). When we repeated experiments
for Pr_model(z) = 0.9 with more belief points, the performance improved (for all
Pr_real(z)) to the level of the top row. This suggests that in some domains it may be
best to optimize policies assuming false models (e.g. low sensor noise), because
an equally good policy can be obtained with fewer belief points. We are currently
investigating this, as well as the impact of modeling errors in the transition model.
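In outline, the mismatch experiment of Table 1 amounts to tracking the belief with the assumed observation model while sampling observations from a different one. The sketch below uses the illustrative POMDP container and belief_update() helper from Section 2 and omits episode termination for brevity; it is not the evaluation code used for the paper.

```python
import numpy as np

def evaluate_mismatch(model, O_real, policy, n_traj=1000, horizon=200, seed=0):
    """Average discounted return when the belief is tracked with the assumed
    observation model (model.O) while observations are drawn from O_real, as in
    the Table 1 experiment.  `policy(b) -> action index` is the optimized
    controller."""
    rng = np.random.default_rng(seed)
    nZ = O_real.shape[2]
    returns = []
    for _ in range(n_traj):
        s = rng.choice(len(model.b0), p=model.b0)        # true (hidden) start state
        b = model.b0.copy()                              # belief tracked by the robot
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(b)
            total += discount * model.R[s, a]
            discount *= model.gamma
            s = rng.choice(len(b), p=model.T[s, a, :])   # real dynamics
            z = rng.choice(nZ, p=O_real[s, a, :])        # real sensor noise
            b = belief_update(b, a, z, model.T, model.O) # tracking with assumed model
        returns.append(total)
    return float(np.mean(returns))
```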
6 Conclusion
This paper describes a new algorithm for planning in partially observable domains,
which features a theoretically-motivated technique for selecting salient information states. This improves the scalability of the approach, to the point where it
can be used to control a robot seeking a missing person. We also demonstrate that
the algorithm is robust to noise in the assumed model. Future work focuses on
improving performance under even weaker modeling assumptions.
References
[1] D. Braziunas and C. Boutilier. Stochastic local search for POMDP controllers. In AAAI, 2004.
[2] A. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method
for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference
on Uncertainty in Artificial Intelligence (UAI), pages 54–61, 1997.
[3] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine
Learning, pages 362–370, 1995.
[4] M. Montemerlo, N. Roy, and S. Thrun. Perspectives on standardization in mobile robot programming: The Carnegie Mellon navigation (CARMEN) toolkit. In Proceedings of IROS, 2003.
[5] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for
POMDPs. In Proceedings of IJCAI, 2003.
[6] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun. Towards robotic assistants in
nursing homes: challenges and results. Robotics and Autonomous Systems, 42(3-4), 2003.
[7] K.-M. Poon. A fast heuristic algorithm for decision-theoretic planning. Master’s thesis, Hong Kong Univ. of Science and Technology, 2001.
[8] P. Poupart. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov
Decision Processes. PhD thesis, University of Toronto, 2005.
[9] P. Poupart and C. Boutilier. Bounded finite state controllers. In Advances in Neural Information
Processing Systems (NIPS), volume 16, 2004.
[10] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. of UAI, 2004.
[11] N. Vlassis and M. T. J. Spaan. A fast point-based algorithm for POMDPs. In Belgian-Dutch
Conference on Machine Learning, 2004.