Implementing a POMDP (well, a MOMDP) where actions must be taken in a specific order #492
-
Hi @jmuchovej, I was able to get your MWE working. It appears that NativeSARSOP and SARSOP give different results (@WhiffleFish FYI). More on that below.

I ran

```julia
solver = SARSOP.SARSOPSolver(; verbose=true)
policy = solve(solver, pomdp)
for (b, s, a, r) in stepthrough(pomdp, policy, updater(policy), b1, "b,s,a,r", max_steps=10)
    @show s
    @show a
    @show r
end
```

and got
Is that the desired behavior?

One thing to keep in mind about SARSOP is that it will only calculate policies for beliefs reachable from the initial belief. So when you have a problem with deterministic dynamics, it might be a good idea to do
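For example, if you mostly care about behavior starting from `b1`, one option (just a rough sketch; it assumes `FruitWorld` is the problem type from your gist and that `b1` is already constructed) is to make `b1` the initial belief before solving, so SARSOP explores the beliefs reachable from it:

```julia
using POMDPs, SARSOP

# assumption: FruitWorld is the gist's problem type and b1 is defined in the script.
# Overriding initialstate makes the solver optimize for beliefs reachable from b1.
POMDPs.initialstate(m::FruitWorld) = b1
policy_b1 = solve(SARSOP.SARSOPSolver(; verbose=true), pomdp)
@show action(policy_b1, b1)
```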
A couple of followup questions:

1. Estimates of what for a single action?
2. Why were you expecting it to map to a distribution of actions? Every POMDP has at least one deterministic optimal policy, so POMDPs.jl returns just one action from a call to `action`.

Now, more on why NativeSARSOP apparently may not have worked: if you run
These hashes should be the same.
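A common cause (I haven't checked whether this is what's happening in your MWE) is a state type that holds a `Vector`: two states built from equal but distinct vectors are not `==` by default and get different hashes, so anything keyed on states in a `Dict` can silently treat them as different states. A minimal sketch of the usual fix, with an illustrative `State` type rather than the one from your gist:

```julia
# illustrative only -- not the actual State from the gist
struct State
    loc::Symbol
    contents::Vector{Symbol}
end

# by default, a struct with a Vector field compares and hashes by the vector's
# identity, so equal-looking states get different hashes; define both by value:
Base.:(==)(a::State, b::State) = a.loc == b.loc && a.contents == b.contents
Base.hash(s::State, h::UInt) = hash(s.contents, hash(s.loc, h))
```

Alternatively, storing the contents in a `Tuple` or a StaticArrays `SVector` gives value-based `==` and `hash` for free.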
-
Yeah, unfortunately POMDPXFiles does not have support for this, so you would have to enforce it in another way if you want to use SARSOP.jl.
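One way to enforce it within the model itself is to keep the full action set but make ineligible actions self-loop with a large penalty, so the optimal policy never selects them. A rough sketch, with `FruitWorld`, `Take`, `base_reward`, and `base_transition` as placeholders rather than your gist's actual code:

```julia
using POMDPs
using POMDPTools: Deterministic

const INVALID_PENALTY = -1000.0  # illustrative magnitude

function POMDPs.reward(m::FruitWorld, s, a)
    if a isa Take && s.loc != a.b       # Take(b) is only eligible at box b
        return INVALID_PENALTY
    end
    return base_reward(m, s, a)          # normal reward otherwise (placeholder helper)
end

function POMDPs.transition(m::FruitWorld, s, a)
    if a isa Take && s.loc != a.b
        return Deterministic(s)          # stay put on an ineligible Take
    end
    return base_transition(m, s, a)      # normal dynamics otherwise (placeholder helper)
end
```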
-
Ok, after that long detour down the

I do not think that the

So, after solving the POMDP with

To fix this, you should think of the
-
Ok, I'll think about the beliefs a little more later, but will respond to the action value issues.
This is a bit harder than one would think because most POMDP solvers try very hard to only do the computation they need to find the best policy. So, if actions are not in the best policy, the solver will ignore them.
SARSOP maintains an upper and lower bound on the value function and stops when they are within some epsilon. I think the reason that it has stopped is because it has proven that it has found an optimal policy, not because of an "exploration reward".

I think the easiest hack to get Q-values for every action would be to create a special POMDP that only allows one action on the first step. This would mean augmenting the state to include a first-step flag, which would roughly double the size of the state space (it may be possible to just add one state, but this would be conceptually more confusing), but you can at least get something working on small problems. Then you can do something like:

```julia
values = Dict()
for a in actions(pomdp)
    am = FruitWorld(force_initial_action = a)
    policy = solve(solver, am)
    values[a] = value(policy, initialstate(am), a) # this actually does not work due to issue 504
end
```
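If it helps, here is a rough, untested sketch of what such a wrapper could look like for a generic POMDP (names and details are mine, not from your gist): the state gets a `first` flag, and on the first step every action is executed as the forced action, so the value of the wrapped problem's initial belief is the Q-value of that action at the original initial belief.

```julia
using POMDPs, POMDPTools

# Sketch of a "force the first action" wrapper: on the first step every action
# behaves like the forced one, afterwards the base problem behaves normally.
struct ForcedFirstAction{S,A,O,P<:POMDP{S,A,O}} <: POMDP{Tuple{S,Bool},A,O}
    base::P
    forced::A
    function ForcedFirstAction(base::POMDP{S,A,O}, forced::A) where {S,A,O}
        return new{S,A,O,typeof(base)}(base, forced)
    end
end

# the action that actually gets executed
executed(m::ForcedFirstAction, s, a) = s[2] ? m.forced : a

POMDPs.states(m::ForcedFirstAction) = [(s, f) for f in (true, false) for s in states(m.base)]
POMDPs.actions(m::ForcedFirstAction) = actions(m.base)
POMDPs.observations(m::ForcedFirstAction) = observations(m.base)
POMDPs.discount(m::ForcedFirstAction) = discount(m.base)
POMDPs.isterminal(m::ForcedFirstAction, s) = isterminal(m.base, s[1])
POMDPs.reward(m::ForcedFirstAction, s, a) = reward(m.base, s[1], executed(m, s, a))
POMDPs.observation(m::ForcedFirstAction, a, sp) = observation(m.base, a, sp[1])

function POMDPs.transition(m::ForcedFirstAction, s, a)
    d = transition(m.base, s[1], executed(m, s, a))
    sps = collect(support(d))
    return SparseCat([(sp, false) for sp in sps], [pdf(d, sp) for sp in sps])  # flag drops after step 1
end

function POMDPs.initialstate(m::ForcedFirstAction)
    d = initialstate(m.base)
    ss = collect(support(d))
    return SparseCat([(s, true) for s in ss], [pdf(d, s) for s in ss])
end

# index functions that some solvers need; ordering matches states(m) above,
# assuming stateindex(base, s) follows the order of states(base)
POMDPs.stateindex(m::ForcedFirstAction, s) =
    stateindex(m.base, s[1]) + (s[2] ? 0 : length(states(m.base)))
POMDPs.actionindex(m::ForcedFirstAction, a) = actionindex(m.base, a)
POMDPs.obsindex(m::ForcedFirstAction, o) = obsindex(m.base, o)
```

With a wrapper like this, the loop above would use `am = ForcedFirstAction(pomdp, a)`, and I think `value(policy, initialstate(am))` (no action argument, so issue 504 is sidestepped) should give the Q-value of `a` at the original initial belief.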
-
Hi! So, I'm working on a MOMDP (specified as a POMDP) where agents must move to a given box, open it, then the contents can be "taken".
Right now, there are two problems I've been running into (`RockSample`).

The problem should be set up as follows:
- The action space is `Move(b), Take(b)` for every box (so with 3 boxes, you'd have `[Move(1), Take(1), Move(2), Take(2), Move(3), Take(3)]`).
- The observation space is each of the possible items and a "null item", since you only observe the contents of the box on "move" (so, on `Move(1)` you see the contents of `box1`, on `Move(2)` you see the contents of `box2`, etc.).
- The transition function is... (each of these is deterministic; there's a rough sketch in code below)
  - On a `Move(b)` action, you move to that location and the box contents are updated/revealed. (So if you're at `State(spawn, [🍋, 🍒, 🫐])` and take `Move(1)` (but `box1` actually has 🍒), then you'd end up at `State(box1, [🍒, 🍒, 🫐])`.)
  - On a `Take(b)` action, if you're at that location then you move to the terminal state, otherwise you stay where you are. (So if you're at `box2` and take `Take(1)`, then you stay at `box2` with the box contents unchanged.)
- The state space is a MOMDP: location is fully observable, but box contents are a belief. This is set up as `State(spawn, [🍋, 🍋, 🍋]), ..., State(box3, [🫐, 🫐, 🫐])`, with the cross-product of items in boxes repeated for each location in the world.

Here's a gist with a MWE (since it's far too much to post here).
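To make the dynamics above concrete, here is a rough sketch of the transition logic (illustrative only; the real definitions, with possibly different names, are in the gist):

```julia
# illustrative types; the real ones are in the gist
struct State
    loc::Int                  # 0 = spawn, 1..3 = at box b, -1 = terminal
    contents::Vector{Symbol}  # the contents slot for each box
end

abstract type Action end
struct Move <: Action; b::Int; end
struct Take <: Action; b::Int; end

# `truth[b]` is the item actually in box b (deterministic dynamics, as described above)
function step(s::State, a::Move, truth)
    contents = copy(s.contents)
    contents[a.b] = truth[a.b]          # moving to box b reveals its true contents
    return State(a.b, contents)
end

function step(s::State, a::Take, truth)
    if s.loc == a.b
        return State(-1, s.contents)    # taking at the box you're at ends the episode
    else
        return s                        # otherwise nothing changes
    end
end
```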
In case it's unclear in the script, the two beliefs you see, `b0` and `b1`, are quite distinct. `b0` is the uniform belief over `spawn` locations, while `b1` is heavily biased towards `spawn` locations with 🍋 in `box1` (thus, the intuition goes that we should see `Move(1)` as the most likely move, because the agent believes `box1` has its "desired" fruit).

Additionally, note that in the block where I'm doing 20 steps through the policy, the agent repeats `TakeAction(1)` (usually; if not, it's still a `TakeAction`). There are at least two problems with this behavior:

1. `TakeAction` shouldn't be the most likely action because the agent is not at an eligible location (I tried creating a belief/state-space dependent set of actions, but it seems to go unused).
2. `[Move(1), Move(2), Take(2)]` is `-4 + -4 * 0.99 + 90 * 0.99^2 = 80.249` vs `[Move(3), Take(3)] = 85.1` – though `[Move(3), Take(3)]` is the most efficient action set, I'd still expect the pressure from `b1` to push towards `[Move(1), Move(2), Take(2)]`, since the agent is practically certain 🍋 are in `box1`.

I've also slapped super negative rewards on behavior like this (taking non-target objects, taking when not at the location, etc.), but that usually generates a policy that maps to a single action rather than a proper distribution over actions.
There's a lot going on here, so definitely let me know how I can clarify things! 🙂