Stationary Tiger Problem - Suboptimal Solution - Python QuickPOMDPs #432
-
Hi, I'm implementing a stationary version of the Tiger POMDP in which there is no cost for listening and no penalty for opening the wrong door, but the treasure remains in the same location for several turns, meaning it should still be worthwhile to listen in order to locate the reward and harvest it over the remaining steps. This is supposed to be a delayed-reward task in which the reward is only revealed at the end of the problem. Here is the implementation of the problem:
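Roughly, the definition looks like the following sketch using the Python quickpomdps bindings (the observation accuracy, payoff value, and discount shown here are illustrative placeholders rather than my exact numbers):

```python
from juliacall import Main as jl
from quickpomdps import QuickPOMDP

jl.seval('using POMDPs')
jl.seval('using POMDPTools')
jl.seval('using QMDP')

def transition(s, a):
    # The treasure never moves, no matter which action is taken.
    return jl.Deterministic(s)

def observation(a, sp):
    # Listening gives a noisy hint about the treasure's location;
    # opening a door gives no information.
    if a == 'listen':
        if sp == 'left':
            return jl.SparseCat(['left', 'right'], [0.85, 0.15])
        else:
            return jl.SparseCat(['right', 'left'], [0.85, 0.15])
    return jl.Uniform(['left', 'right'])

def reward(s, a):
    # No listening cost and no penalty for the wrong door;
    # only opening the correct door pays out.
    return 10.0 if a == s else 0.0

m = QuickPOMDP(
    states=['left', 'right'],
    actions=['left', 'right', 'listen'],
    observations=['left', 'right'],
    discount=0.95,
    transition=transition,
    observation=observation,
    reward=reward,
    initialstate=jl.Uniform(['left', 'right']),
)

policy = jl.solve(jl.QMDPSolver(), m)
```

With no penalties anywhere, the only reason to listen is to concentrate the belief before harvesting the stationary reward.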
The calculated alpha vectors are as follows:
So it looks like there is no benefit to listening in this solution, which is equivalent to one in which the information from the rewards is used in computing the expected value. However, in the step-throughs the belief is not updated based on the reward (which is the correct behavior), and the run frequently ends with 0 reward, e.g. when the state is "right" and the solver keeps picking "left", as in this example:
Is there a way to implement this problem such that the reward information is not used in the computation of the alpha vectors? I'm expecting a solution where the agent listens once at the beginning and then picks the corresponding door for the remaining steps. I'm using the QMDP solver with the Python QuickPOMDPs implementation. Thanks!
-
Hi @kjsandbrink! I think the issue is that you are using the QMDP solver. QMDP is not guaranteed to give the optimal solution and, in particular, is not good at evaluating actions that involve active information gathering, as described in Section 21.1 of this book: https://algorithmsbook.com/files/dm.pdf. Are you able to run the SARSOP solver?
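For intuition: QMDP's alpha vectors are just the Q-values of the underlying fully observable MDP, i.e. alpha_a(s) = Q_MDP(s, a), so it implicitly assumes the state will be known after the next step and never assigns extra value to a purely information-gathering action like listen. A rough sketch of swapping in SARSOP via the Python bindings might look like the following (this assumes the NativeSARSOP package is installed in your Julia environment; SARSOP.jl, which wraps the original C++ solver, also exports a SARSOPSolver):

```python
from juliacall import Main as jl

# SARSOP plans over beliefs directly, so it can value listening
# purely for the information it provides.
jl.seval('using NativeSARSOP')   # or: jl.seval('using SARSOP')

solver = jl.SARSOPSolver()
policy = jl.solve(solver, m)     # `m` is the QuickPOMDP defined above
```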