Stationary Tiger Problem - Suboptimal Solution - Python QuickPOMDPs #432
-
Hi, I'm implementing a stationary version of the Tiger POMDP in which there is no cost for listening and no penalty for opening the wrong door, but the treasure remains in the same location for several turns, meaning it should still be worthwhile to listen in order to locate the reward and harvest it over the remaining steps. This is supposed to be a delayed-reward task in which the reward is only revealed at the end of the problem. Here is the implementation of the problem:
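Roughly, the definition looks like the following sketch using the Python quickpomdps bindings (the observation accuracy, payoff value, and discount shown here are illustrative placeholders rather than my exact numbers):

```python
from juliacall import Main as jl
from quickpomdps import QuickPOMDP

jl.seval('using POMDPs')
jl.seval('using POMDPTools')
jl.seval('using QMDP')

def transition(s, a):
    # The treasure never moves, no matter which action is taken.
    return jl.Deterministic(s)

def observation(a, sp):
    # Listening gives a noisy hint about the treasure's location;
    # opening a door gives no information.
    if a == 'listen':
        if sp == 'left':
            return jl.SparseCat(['left', 'right'], [0.85, 0.15])
        else:
            return jl.SparseCat(['right', 'left'], [0.85, 0.15])
    return jl.Uniform(['left', 'right'])

def reward(s, a):
    # No listening cost and no penalty for the wrong door;
    # only opening the correct door pays out.
    return 10.0 if a == s else 0.0

m = QuickPOMDP(
    states=['left', 'right'],
    actions=['left', 'right', 'listen'],
    observations=['left', 'right'],
    discount=0.95,
    transition=transition,
    observation=observation,
    reward=reward,
    initialstate=jl.Uniform(['left', 'right']),
)

policy = jl.solve(jl.QMDPSolver(), m)
```

With no penalties anywhere, the only reason to listen is to concentrate the belief before harvesting the stationary reward.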
The calculated alpha vectors are as follows:
So it looks like there is no benefit to listening in this solution, which is equivalent to one in which the information from the rewards is used in computing the expected value. However, in the step-throughs the belief is not updated based on the reward (which is the correct behavior), and the run frequently ends with 0 reward, e.g. when the state is "right" and the solver keeps picking "left", as in this example:
Is there a way to implement this problem such that the reward information is not used in the computation of the alpha vectors? I'm expecting a solution where the agent listens once at the beginning and then picks the corresponding door for the remaining steps. I'm using the QMDP solver with the Python QuickPOMDPs implementation. Thanks!
-
Hi @kjsandbrink! I think the issue is that you are using the QMDP solver. QMDP is not guaranteed to give the optimal solution and, in particular, is not good at evaluating actions that involve active information gathering, as described in Section 21.1 of this book: https://algorithmsbook.com/files/dm.pdf. Are you able to run the SARSOP solver?
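For intuition: QMDP's alpha vectors are just the Q-values of the underlying fully observable MDP, i.e. alpha_a(s) = Q_MDP(s, a), so it implicitly assumes the state will be known after the next step and never assigns extra value to a purely information-gathering action like listen. A rough sketch of swapping in SARSOP via the Python bindings might look like the following (this assumes the NativeSARSOP package is installed in your Julia environment; SARSOP.jl, which wraps the original C++ solver, also exports a SARSOPSolver):

```python
from juliacall import Main as jl

# SARSOP plans over beliefs directly, so it can value listening
# purely for the information it provides.
jl.seval('using NativeSARSOP')   # or: jl.seval('using SARSOP')

solver = jl.SARSOPSolver()
policy = jl.solve(solver, m)     # `m` is the QuickPOMDP defined above
```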