Pure Information Extraction Belief based Rewards #528
-
Thanks for reaching out! I gave this a quick read. My first comment is that while it might be possible to do belief updates with an LSTM, you will likely be better off using a particle filter (easier to implement) or perhaps an array of Kalman filters (more difficult to implement, but more efficient). For the basic particle filter idea, see section 19.6 of algorithmsbook.com. There is a POMDPs-compatible Julia implementation at https://github.com/JuliaPOMDP/ParticleFilters.jl.

I would recommend starting as simple as possible: just have one target, let the radar get range, angle, and doppler for the entire 360 degrees, and have a single dummy action that does nothing. Then see if the belief updater can do what you want. Once you have that, you can think about rewards, etc.

P.S. For more advanced combinations of particle filtering and learning, see https://arxiv.org/abs/2112.09456, https://arxiv.org/abs/2002.09884, https://arxiv.org/abs/2306.00249, and references therein. I think this problem can be solved adequately without any machine learning, but since you mentioned LSTMs, these demonstrate how the two can be combined.

P.P.S. In the future, when you have multiple targets, one difficulty will be data association. If you know which aircraft you are getting returns from, the observation probability density will be straightforward; if the returns could be coming from different aircraft or even a new target, the observation probability density will be much more complex.
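To make the "start simple" suggestion concrete, here is a minimal sketch of the single-target case with ParticleFilters.jl used standalone. The constant-velocity motion model, noise levels, region size, and the fake observation are all made up for illustration, not taken from your problem.

```julia
# Minimal single-target bootstrap particle filter (sketch).
# Assumes ParticleFilters.jl and Distributions.jl; every number here is illustrative.
using ParticleFilters, Distributions

const DT = 1.0  # time step [s]

# State: [x, y, vx, vy]; constant-velocity motion with small process noise.
function dynamics(s, a, rng)
    x, y, vx, vy = s
    [x + vx * DT + 0.1 * randn(rng),
     y + vy * DT + 0.1 * randn(rng),
     vx + 0.05 * randn(rng),
     vy + 0.05 * randn(rng)]
end

# Observation: a single (range, angle, doppler) return over the full 360 degrees.
function obs_likelihood(s_prev, a, s, o)
    x, y, vx, vy = s
    r = hypot(x, y)
    θ = atan(y, x)
    rdot = (x * vx + y * vy) / max(r, 1e-6)   # range rate toward/away from the radar
    pdf(Normal(r, 5.0), o[1]) *
        pdf(Normal(θ, 0.01), o[2]) *
        pdf(Normal(rdot, 1.0), o[3])
end

model = ParticleFilterModel{Vector{Float64}}(dynamics, obs_likelihood)
pf = BootstrapFilter(model, 1_000)

# Initial belief: particles spread over a 100 km × 100 km region.
b = ParticleCollection([[100e3 * rand() - 50e3, 100e3 * rand() - 50e3,
                         100 * randn(), 100 * randn()] for _ in 1:1_000])

a = :do_nothing                  # single dummy action
o = (40_000.0, 0.7, -120.0)      # one fake (range, angle, doppler) return
b = update(pf, b, a, o)          # posterior particle set
```

Once this tracks one target sensibly, the dummy action can be swapped for a real beam-pointing action that restricts which returns are visible.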
-
I'm going to wrap this up. I've learned a lot doing this, but I don't think my initial approaches were very smart. I'm working on a ρPOMDP-style solver that has far fewer hacks than what is going on here; I'll open a new discussion about that once it's playing nicely. :) Thanks again for being so helpful!
-
Hello,
The context is an air traffic control scenario with a beam-agile radar. I'm also very new to Julia and POMDPs.jl, so any "that's not how we do that" is very welcome :)
As I've currently framed it:
State: zero or more targets, each with at least an X and Y coordinate.
Actions: look in some direction; currently Uniform(0, 1), easily discretized, with 0 and 1 as West, 0.25 as North, 0.5 as East, and 0.75 as South.
Observations: an array of zero or more (range, angle, Doppler velocity) tuples.
Belief: ? An LSTM's hidden state? A buffer of previous action-observation pairs? Some other RNN + decoder?
Reward: ? Ideally "distance between belief and reality". I've tried reporting both belief and reality in different ways. My dream representation would be a fairly fine-grained occupancy grid, and then some simple distance between the 'guess' from the belief updater (or the belief updater run through a decoder for this purpose) and 'reality', i.e. the state turned into some sort of N-hot vector for N targets (a rough sketch of this follows the list). Defining Reward in terms of (State, Action) elsewhere has involved pretty ugly hacks; I think here I may be able to just define it in terms of (State, Belief)?
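For what it's worth, here is a rough sketch of that belief-dependent reward, assuming the belief is a particle set (one hypothesized [x, y, vx, vy] per particle in the single-target case) and the state is a vector of target positions. The grid size, region bounds, and helper names (`occupancy`, `nhot`, `cellindex`) are all placeholders, not anything from POMDPs.jl.

```julia
# Sketch of a belief-dependent reward: negative distance between an occupancy
# grid derived from the belief and an N-hot grid derived from the true state.
# Grid resolution, region bounds, and all names are illustrative placeholders.
using ParticleFilters: particles

const NCELLS = 64                        # 64 × 64 occupancy grid
const CELLW  = 100e3 / NCELLS            # cell width for a region spanning ±50 km

cellindex(v) = clamp(floor(Int, (v + 50e3) / CELLW) + 1, 1, NCELLS)

# Fraction of particles in each cell ≈ P(cell occupied) under the belief.
function occupancy(b)
    grid = zeros(NCELLS, NCELLS)
    for p in particles(b)
        grid[cellindex(p[1]), cellindex(p[2])] += 1
    end
    grid ./ length(particles(b))
end

# N-hot grid for the true target positions (`s` is a vector of targets).
function nhot(s)
    grid = zeros(NCELLS, NCELLS)
    for target in s
        grid[cellindex(target[1]), cellindex(target[2])] = 1.0
    end
    grid
end

# Larger reward when the belief's occupancy grid matches reality.
ρ(s, b) = -sum(abs, occupancy(b) .- nhot(s))
```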
The goal is to have some learned world model that reports which cells in a grid it believes are occupied, and then, based on this, solve for a policy that will best update the model.
I've read about ρPOMDPs, where the reward is ρ(s, b) instead of r(s, a). While trying to learn more about these I found Belief Dependent Rewards #387, and my current implementation uses the Belief Markov Decision Process rover at https://github.com/josh0tt/TO_AIPPMS/tree/main/SBO_AIPPMS/GP_BMDP_Rover as a jumping-off point, although currently the actual BMDP is totally ignored.
To escape the chicken-and-egg situation of needing to both model the world and develop a policy, I'm using a random policy to generate sequences for learning a world model: https://github.com/navh/SkySurveillance.jl/blob/main/src/Flat_POMDP/flat_pomdp.jl
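In case it helps to see what I mean by generating sequences, something like the following should do it with POMDPTools (here `FlatPOMDP()` stands in for the problem defined in the linked file, and the episode count and length are arbitrary):

```julia
# Roll out a random policy to collect (s, a, o) sequences for world-model training.
# `FlatPOMDP()` is a placeholder for the problem defined in flat_pomdp.jl.
using POMDPs, POMDPTools

pomdp  = FlatPOMDP()
policy = RandomPolicy(pomdp)

histories = [collect(stepthrough(pomdp, policy, "s,a,o", max_steps=100))
             for _ in 1:1_000]

# Each history is a vector of named tuples, e.g. the actions of the first episode:
first_actions = [step.a for step in histories[1]]
```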
The random-policy setup is somewhat inspired by the documentation mentioning that a belief could be an RNN, but I couldn't find much elaborating on what that looks like in implementation. Is putting learning in a solver like this appropriate?
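For concreteness, here is one way "belief = RNN hidden state" could be phrased against the POMDPs.jl Updater interface. The recurrence below is a hand-rolled tanh RNN with fixed, untrained weights, and `encode_action`/`encode_obs` are placeholder featurizations; training those weights (e.g. with Flux) is a separate problem.

```julia
# Sketch: the "belief" POMDPs.jl passes around is just an RNN hidden vector.
# Weights are random and untrained; encoders are placeholders for real features.
using POMDPs

struct RNNUpdater <: POMDPs.Updater
    W::Matrix{Float64}   # input-to-hidden weights
    U::Matrix{Float64}   # hidden-to-hidden weights
    b::Vector{Float64}   # bias
end

RNNUpdater(in_dim, hidden_dim) =
    RNNUpdater(0.1 * randn(hidden_dim, in_dim),
               0.1 * randn(hidden_dim, hidden_dim),
               zeros(hidden_dim))

encode_action(a) = [Float64(a)]                                  # placeholder featurization
encode_obs(o) = isempty(o) ? zeros(3) : collect(Float64, o[1])   # first return or zeros

POMDPs.initialize_belief(up::RNNUpdater, d) = zeros(size(up.U, 1))

function POMDPs.update(up::RNNUpdater, h::Vector{Float64}, a, o)
    x = vcat(encode_action(a), encode_obs(o))
    tanh.(up.W * x + up.U * h + up.b)
end

# Usage: 1 action feature + 3 observation features -> 32-dimensional hidden state.
up = RNNUpdater(4, 32)
h = initialize_belief(up, nothing)
h = update(up, h, 0.25, [(40_000.0, 0.7, -120.0)])
```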
Can I just put a belief-dependent reward at the bottom of my POMDP, instead of doing the BMDP two-step, if there is no goal other than learning a good world model?
Or do you have totally different ideas on how you would lay these components out?
Thanks, Amos