Implementing a POMDP (well, a MOMDP) where actions must be taken in a specific order #492

jmuchovej · 2023-05-18T22:00:34Z

jmuchovej
May 18, 2023

Hi! So, I'm working on a MOMDP (specified as a POMDP) where agents must move to a given box, open it, then the contents can be "taken".

Right now, there are two problems I've been running into:

When I use planners like SARSOP I end up with policies that only have estimates for a single action (which seems odd, considering the problem is pretty similar to RockSample).
Beliefs don't impact the policy's estimates (the numeric values change, but the magnitude of change doesn't intuitively check-out).

The problem should be setup as follows:

Place N (3 atm) boxes each with 1 of M (3 atm) items (sampled with replacement).
The agent "searches" the boxes, then takes the most desirable item (each item has a non-negative reward).

The action space is Move(b), Take(b) for every box (so with 3 boxes, you'd have Move(1), Take(1), Move(2), Take(2), Move(3), Take(3)].

The observation space is each of the possible items and a "null item", since you only observe the contents of the box on "move" (so, on Move(1) you see the contents of box1, on Move(2) you see the contents of box2, etc.).

The transition function is... (each of these are deterministic)

on any Move(b) action, you move to that location and the box contents are updated/revealed. (So if you're are State(spawn, [🍋, 🍒, 🫐]) and take Move(1) (but box1 actually has 🍒), then you'd end up at State(box1, [🍒, 🍒, 🫐]).)
on any Take(b) action, if you're at that location then you move to the terminal state, otherwise you stay where you are. (So if you're at box2 and take Take(1), then you stay at box2 with the box contents unchanged.)

The state space is a MOMDP. Location is fully observable, but box contents are a belief. This is setup as State(spawn, [🍋, 🍋, 🍋]), ..., State(box3, [🫐, 🫐, 🫐]) with the cross-product of items in boxes repeated for each location in the world.

Here's a gist with a MWE. (Since it's far too much to post here.)

In case it's unclear in the script, the two beliefs you see b0 and b1 are quite distinct. b0 is the uniform belief over spawn locations while b1 is heavily biased towards spawn locations with 🍋 in box1 (thus, the intuition goes that we should see Move(1) as the most likely move, because the agent believes box1 has it's "desired" fruit.

Additionally, note that the block where I'm doing 20 steps through the policy – the agent repeats TakeAction(1) (usually, if not, it's still a TakeAction). There are at least two problems with this behavior:

TakeAction shouldn't be the most likely action because it's not at an eligible location (I tried creating a belief/state-space dependent set of actions, but it seems to go unused).
The expected reward for [Move(1), Move(2), Take(2)] is -4 + -4 * 0.99 + 90 * 0.99^2 = 80.249 vs [Move(3), Take(3)] = 85.1 – though [Move(3), Take(3)] is the most efficient action set, I'd still expect the pressure from b1 to push towards [Move(1), Move(2), Take(2)] since the agent is practically certain 🍋 are in box1.

I've also slapped super negative rewards for behavior like this (taking non-target objects, taking when not at the location, etc.) but that usually generates a policy that maps to a single action rather than a proper distribution over actions.

There's a lot going on here, so definitely let me know how I can clarify things! 🙂

zsunberg · 2023-05-20T00:09:17Z

zsunberg
May 20, 2023
Maintainer

Hi @jmuchovej,

I was able to get your MWE working. It appears that NativeSARSOP and SARSOP give different results (@WhiffleFish FYI). More on that below.

I ran

solver = SARSOP.SARSOPSolver(; verbose=true)
policy = solve(solver, pomdp)
for (b, s, a, r) in stepthrough(pomdp, policy, updater(policy), b1, "b,s,a,r", max_steps=10)
    @show s
    @show a
    @show r
end

and got

s = State([1, 1], Fruit[🍒, 🫐, 🍋])
a = MoveAction(3)
r = -4.0
s = State([5, 1], Fruit[🍒, 🫐, 🍋])
a = TakeAction(3)
r = 90.0

Is that the desired behavior?

One thing to keep in mind about SARSOP is that it will only calculate policies for beliefs reachable from the initial belief. So when you have a problem with deterministic dynamics, it might be a good idea to do initialstate(p::FruitWorld) = Uniform(states(p)) to assure that it considers all possible states.

A couple of followup questions:

When I use planners like SARSOP I end up with policies that only have estimates for a single action (which seems odd, considering the problem is pretty similar to RockSample).

Estimates of what for a single action?

but that usually generates a policy that maps to a single action rather than a proper distribution over actions.

Why were you expecting it to map to a distribution of actions? Every POMDP has at least one deterministic optimal policy, so POMDPs.jl returns just one action from a call to action(policy, b).

Now, more on why NativeSARSOP apparently may not have worked: If you run has_consistent_distributions(pomdp), you will get some warnings. These can be traced back to the following problem:

julia> hash(State([1,1], Fruit[🍒, 🍋, 🍋]))
0xde400caa2c3ed43b

julia> hash(State([1,1], Fruit[🍒, 🍋, 🍋]))
0xd5b494f1ae9f9e89

These hashes should be the same.

5 replies

jmuchovej May 20, 2023
Author

Thanks for looking into this! 😀

Is that the desired behavior?

Given what b1 is (belief that 🍋 are the most likely item in box1), I would expect the action sequence to be Move(1), Move(2), Take(2). Under the uniform belief (so b0), I would have expected the action sequence we see Move(3), Take(3).

... it might be a good idea to do initialstate(p::FruitWorld) = Uniform(states(p)) to assure that it considers all possible states.

This would end up including states where the agent most certainly couldn't be, though? (The current initialstate(p::FruitWorld) ... gives a uniform distribution over the possible arrangement of fruits in boxes, but only at the spawn location, since the agent always knows its location.) The mixed observability of this comes from always knowing it's location but not the contents.

(Perhaps I'm confusing things, if so, definitely correct me! 🙂)

Estimates of what for a single action?

I might be mixing terminology here (still figuring out POMDPs/whatnot) – originally I was using
QMDP, for which actionvalues(policy, b) = [...] are Q-estimates, right? I turn this into a distribution via softmax and treat the softmax(actionvalues...) as the likelihood of an action given the belief under the policy.

Why were you expecting it to map to a distribution of actions?

Whoops! I misspoke. The "distribution" is the softmax I mentioned above. What I meant was that if I do actionvalues(policy, b), I end up with a Q-estimate (or analog from SARSOP) for each action under the given belief (which could then be turned into a "distribution").

Now, more on why NativeSARSOP apparently may not have worked: If you run has_consistent_distributions(pomdp), you will get some warnings. ...

Ooh. I see. Just to clarify, is the likely an issue in my implementation (e.g., do I need to implement convert_s or something similar?), or do you think this is likely an issue with the NativeSARSOP implementation?

zsunberg May 20, 2023
Maintainer

Given what b1 is (belief that lemon are the most likely item in box1), I would expect the action sequence to be Move(1), Move(2), Take(2). Under the uniform belief (so b0), I would have expected the action sequence we see Move(3), Take(3).

I need to think about this a bit more and read your original post (I did not read it that carefully the first time :))

This would end up including states where the agent most certainly couldn't be, though? (The current initialstate(p::FruitWorld) ... gives a uniform distribution over the possible arrangement of fruits in boxes, but only at the spawn location, since the agent always knows its location.) The mixed observability of this comes from always knowing it's location but not the contents.

Got it. Your understanding here is correct! That should work fine.

I might be mixing terminology here (still figuring out POMDPs/whatnot) – originally I was using
QMDP, for which actionvalues(policy, b) = [...] are Q-estimates, right? I turn this into a distribution via softmax and treat the softmax(actionvalues...) as the likelihood of an action given the belief under the policy.

This isn't typically done in POMDPs because there is always at least one optimal deterministic belief-based policy. Algorithms like SARSOP only focus on the actions that are possibly optimal and may have inaccurate estimates for other actions. For SARSOP, actionvalues will return accurate estimates of the Q-value for the best action and strictly under estimates of the Q-values for all other actions. QMDP will return strict over estimates for the Q values. See this https://algorithmsbook.com/files/chapter-21.pdf chapter (and Chapter 20 of that book) for more info.

Ooh. I see. Just to clarify, is the likely an issue in my implementation (e.g., do I need to implement convert_s or something similar?), or do you think this is likely an issue with the NativeSARSOP implementation?

This is an issue with the problem implementation. I see that you have implemented == for your state type. Whenever you implement ==, you should also consider implementing hash so that they are consistent. Probably a better option is to just use an immutable object like SVector instead of Vector for the boxes field of State. Then the default == and hash will work like you want, and the whole thing will avoid heap allocations completely.

zsunberg May 20, 2023
Maintainer

OK, I have played around with your example a bit more. There are some strange things going on. I have not isolated the problem, and it is too long to explain all my steps here, but there is some evidence that somewhere, the boxes field of a State is getting modified in-place.

Why don't you implement State as

struct State
    pos::Point
    boxes::SVector{Fruit, 3}
end

and get everything working again. Then we can continue to debug.

jmuchovej May 20, 2023
Author

So, in transition(p, s, a::MoveAction), I might be modifying the state in-place, specifically the lines where I'm updating the boxes. However, I've updated things so it uses an SVector instead of a Vector.

I've updated the MWE (same link) to use SVector on the State and transition(p, s, a::MoveAction) now uses MVectors then an SVector, so I believe no other in-place modifications are happening.

This change also should resolve the hash(State(...)) inequalities, too.

zsunberg May 21, 2023
Maintainer

Nice - now POMDPTools.showdistribution is working, so it is alot easier to see things.

zsunberg · 2023-05-20T18:25:14Z

zsunberg
May 20, 2023
Maintainer

I tried creating a belief/state-space dependent set of actions, but it seems to go unused

Yeah, unfortunately POMDPXFiles does not have support for this, so you would have to enforce it in another way if you want to use SARSOP.jl

0 replies

zsunberg · 2023-05-21T02:50:37Z

zsunberg
May 21, 2023
Maintainer

Ok, after that long detour down the Vector -> SVector road, I think I see what is going on. The key problem is

on any Move(b) action, you move to that location and the box contents are updated/revealed.

I do not think that the boxes field of the state should change when Move actions are taken. If I am interpreting correctly, currently, when the Move action is taken, the state is changed so that boxes is replaced by whatever is in p.boxes.

So, after solving the POMDP with p.boxes = [Box1(🍒), Box2(🍋), Box3(🍋)] and TARGET = 🍋, the policy knows that in this POMDP, boxes 2 and 3 will have lemons once they are Moved to regardless of the belief. Thus, it can get the maximum reward with Move(3) then Take(3). Move(3) forces the state to change to one where it can get a 🍋 and then Take(3) collects the reward. It knows that Move(1) will never do any good because it will always be a 🍒.

To fix this, you should think of the boxes part of the state as the truth. That should never change. The only thing that should "reveal" anything to the agent is the observation.

8 replies

zsunberg Jun 1, 2023
Maintainer

what I would (intuitively) expect for b2 is that the highest probability beliefs would have 🍋 in :box2, since that's what we observe.

When I run the mwe here and the code you pasted, I get b2.state_list[maxb2s] = State[State([5, 5], Fruit[🍒, 🍋, 🍋]), State([5, 5], Fruit[🫐, 🍋, 🍋])]. Both of these states have 🍋in box 2. Isn't that what you said you would expect above?

I am more concerned about the following though:

function POMDPs.transition(p::FruitWorld, s::State, a::MoveAction)
    if isterminal(p, s)
        return Deterministic(s)
    end

    box = p.boxes[a.targetbox]
    boxes = MVector{length(s.boxes), eltype(s.boxes)}(s.boxes)
    boxes[a.targetbox] = box.item
    boxes = SVector(boxes)
    sp = State(box.pos, boxes)
    return Deterministic(sp)
end

Intuitively I would expect Move(2) to mean "move to the location of box 2 and peak inside". That is not what you have encoded. You have encoded "move to the location of box 2, replace its contents with whatever is in p.boxes, and then peak inside". Is the replace functionality intended? If the replace functionality is not intended there should not be fruit in the boxes field in FruitWorld and the transition function should be

function POMDPs.transition(p::FruitWorld, s::State, a::MoveAction)
    if isterminal(p, s)
        return Deterministic(s)
    end

    position = p.boxes[a.targetbox].pos
    return Deterministic(State(position, s.boxes))
end

This is similar to how the tiger problem does not contain the door that the tiger is behind in the POMDP definition; that is only represented by the state.

jmuchovej Jun 1, 2023
Author

Both of these states have 🍋in box 2. Isn't that what you said you would expect above?

Yep! This is expected.

Intuitively I would expect Move(2) to mean "move to the location of box 2 and peak inside".

This is indeed what Move(2) should mean.

I guess I didn’t update the MWE when I posted this. When I was doing what you suggested (not updating the boxes within transition(...)), the belief doesn’t update (in terms of max probs) to what I’d intuitively expect (what we saw above).

I’ve updated the MWE to what I recall it being when I made my earlier comment. (Away from computer now, so I can’t test it to verify.)

jmuchovej Jun 1, 2023
Author

I just checked the code on the gist. It should now match what I was describing earlier. (That is, no box content updates in transition, using the initial belief that 🍋 are in :box3 but not [:box1, :box2] and visiting :box2 to find 🍋.)

The examples included within the gist demonstrate no belief updating (wrt box contents). I do get belief updating when using the transition function you were worried about. (This seems indicative that there's something going wrong with observation(...) or perhaps the distribution of transition(...), but I'm not sure which.)

I have seen that if, instead, I use the box contents a a.targetbox from sp instead of from the POMDP, I do get the intended belief updates, but I guess I might be reading the code wrong? This approach (to me) indicates that the observation is dependent on what the agent's sp will be – thus, if Move(3) is taken and the belief is State([5, 1], [🍋, 🍒, 🫐]) then in the world with p.boxes = [🍒, 🍋, 🍋] they will receive an observation of 🫐 rather than 🍋 which are what's truly in the :box3.

I suppose I'm missing something about what observation ought to encode – since I assumed it should provide some observation about the ground-truth of the world, not an observation about the believed state-of-the-world?

Additionally, I used this approach with SARSOP and QMDP. SARSOP continues to generate policies which only have Q-estimates for [Move(3), Take(2), Take(3)] for marginalbeliefs[1] and marginalbeliefs[3] ([1] believes 🍋 are most likely in :box1, while [3] believes 🍋 are most likely in :box3). QMDP provides estimates for all actions (as expected). The estimates for actions aren't what I'm "concerned about" (right now), rather that the estimates don't appear to differ considerably given the initial belief the solvers are given. (The uniform belief, [1], and [3] hardly differ in their estimates – which intuitively seems wrong.)

EDIT: To clarify, from my understanding of QMDP, initial beliefs won't impact the solver's solution (since it just does VI on the underlying MDP, so beliefs don't place a role in the solution) – but it's unclear why alternate initial beliefs would have such a negligible impact on the estimates.

For SARSOP, I know that reachability factors into the Q-estimates, but I guess I'm at a loss to why the initial belief appears to play no (or a minimal) role in the reachability of future beliefs and how actions from the 1st belief could not reach the 2nd belief (this is probably poorly explained, but what I'm thinking of is: Move(1) should be a valid action under any of the beliefs given, yet no SARSOP-based policy ever considers it a useful action).

zsunberg Jun 6, 2023
Maintainer

Ok, thanks for trusting me enough to implement the change :) Note that you had an error in the last block display(b4.state_list[maxb3s]): it says maxb3s when it should have said maxb4s

It seems that you have figured many things out but still have questions. Can you formulate the question concisely in two or three sentences?

It might be easier if you read about the tiger problem here: https://juliapomdp.github.io/POMDPs.jl/stable/def_pomdp/#Object-oriented (the original source is here). This is very similar to a FruitWorld with one box and way simpler to talk about since the state is just "left" or "right". In particular, I don't understand your marginalbeliefs variable and unfortunately cannot invest the time to understand it.

I spent a while trying to type out an answer and here are some snippets that might be helpful, but don't present a complete narrative:

in the world with p.boxes = [🍒, 🍋, 🍋]

This is NOT the right way to think about a POMDP. The right way is "when the state is State([1,1], [🍒, 🍋, 🍋])". In POMDPs.jl, the state is not part of the POMDP struct.

I suppose I'm missing something about what observation ought to encode – since I assumed it should provide some observation about the ground-truth of the world, not an observation about the believed state-of-the-world?

Yes! This is how POMDPs.jl generates observations:
s = State([1,1], [🍒, 🍋, 🍋]) # this is the ground-truth of the world!
rand(observation(m, Move(3), s))
One thing to note is that the initial belief does not have to be the same as the initial state distribution. You could have initialstate(p::FruitWorld) = Deterministic(...), but b = Uniform(states(pomdp)). See also the optional belief and initial state arguments for simulate. In other words, you don't have to use s = rand(b) in a simulation. You can just set s to be the true state that you want.

jmuchovej Jun 6, 2023
Author

I’ll give the TigerPOMDP a think tomorrow.

Re: “… The right way is "when the state is State([1,1], [🍒, 🍋, 🍋]) …”

What I meant here is that the belief update redistributes the existing distribution over the new location – rather than integrating the observation of the ground-truth and the new location. The only way I’ve been able to get the belief distribution to update in this way is to instead “observe” the contents of the state – which is counterintuitive (and wrong, to my knowledge).

Re: “… In particular, I don't understand your marginalbeliefs variable and unfortunately cannot invest the time to understand it.”

This is just intended as a more intuitive way to express the initial belief – e.g., with marginalbeliefs[1] ([.98 .01 .01; .02 .49 .49; .02 .49 .49]) would describe that the agent believes that 🍋 are the most likely contents of :box1 (98% vs 1% & 1%), and 🍒 & 🫐 are the most likely contents of :box2 and :box3 (49% each vs 2% for 🍋).

This gets redistributed across the belief space so that 98% of the beliefs about :box1 contain 🍋, and 49% of the beliefs about :box2 contain 🫐 or 🍒, and likewise from :box2 to :box3.

There are two problems I’m trying to resolve:

I need a policy provides Q-estimates for each action (ideally from a single policy, but covering some portion of the belief-space works too). The problem right now is that changes in beliefs either minimally impact estimates post-policy generation (like on the order of 1e-6) OR don’t lead to different policies even under dramatically different beliefs (like marginalbeliefs[3]).
Belief-updates don’t generate the appropriate updates when using the world-state in the observation.
(Odds are, this is highly related to (1), but I’m not sure at the moment.)

Generally, I would expect that a uniform initial belief would permit all states to be reachable. (Ergo not understanding why SARSOP policies would only generate a limited set of estimates since at least all Move(b) actions sensible to take from the initial state.)

I know that RockSample has an exploration reward (in a sense), so I wonder if that could be why the policy under SARSOP doesn’t “explore”. (Since the current reward structure only provides a single positive reward – thus all exploration will strictly yield worse total rewards than the most efficient action sequence.)

zsunberg · 2023-06-10T17:49:48Z

zsunberg
Jun 10, 2023
Maintainer

Ok, I'll think about the beliefs a little more later, but will respond to the action value issues.

I need a policy provides Q-estimates for each action

This is a bit harder than one would think because most POMDP solvers try very hard to only do the computation they need to find the best policy. So, if actions are not in the best policy, it will ignore them.

I know that RockSample has an exploration reward (in a sense), so I wonder if that could be why the policy under SARSOP doesn’t “explore”.

SARSOP maintains an upper and lower bound on the value function and stops when they are within some epsilon. I think the reason that it has stopped is because it has proven that it has found an optimal policy, not because of an "exploration reward".

I think the easiest hack to get Q-values from every action would be to create a special POMDP that only allows one action on the first step. This would mean augmenting the state to include a first step flag, which would roughly double the size of the state space (it may be possible to just add one state, but this would be conceptually more confusing), but you can at least get something working on small problems. Then you can do something like:

values = Dict() 
for a in actions
     am = FruitWorld(force_initial_action = a)
     policy = solve(solver, am)
     values[a] = value(policy, initialstate(am), a)  # this actually does not work due to issue 504
end

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing a POMDP (well, a MOMDP) where actions must be taken in a specific order #492

{{title}}

Replies: 4 comments 13 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Implementing a POMDP (well, a MOMDP) where actions must be taken in a specific order #492

jmuchovej May 18, 2023

Replies: 4 comments · 13 replies

zsunberg May 20, 2023 Maintainer

jmuchovej May 20, 2023 Author

zsunberg May 20, 2023 Maintainer

zsunberg May 20, 2023 Maintainer

jmuchovej May 20, 2023 Author

zsunberg May 21, 2023 Maintainer

zsunberg May 20, 2023 Maintainer

zsunberg May 21, 2023 Maintainer

zsunberg Jun 1, 2023 Maintainer

jmuchovej Jun 1, 2023 Author

jmuchovej Jun 1, 2023 Author

zsunberg Jun 6, 2023 Maintainer

jmuchovej Jun 6, 2023 Author

zsunberg Jun 10, 2023 Maintainer

jmuchovej
May 18, 2023

Replies: 4 comments 13 replies

zsunberg
May 20, 2023
Maintainer

jmuchovej May 20, 2023
Author

zsunberg May 20, 2023
Maintainer

zsunberg May 20, 2023
Maintainer

jmuchovej May 20, 2023
Author

zsunberg May 21, 2023
Maintainer

zsunberg
May 20, 2023
Maintainer

zsunberg
May 21, 2023
Maintainer

zsunberg Jun 1, 2023
Maintainer

jmuchovej Jun 1, 2023
Author

jmuchovej Jun 1, 2023
Author

zsunberg Jun 6, 2023
Maintainer

jmuchovej Jun 6, 2023
Author

zsunberg
Jun 10, 2023
Maintainer