Pure Information Extraction Belief based Rewards #528
-
Thanks for reaching out! I gave this a quick read. My first comment is that while it might be possible to do belief updates with an LSTM, you will likely be better off using a particle filter (easier to implement) or perhaps an array of Kalman filters (more difficult to implement, but more efficient). For the basic particle filter idea, see section 19.6 of algorithmsbook.com. There is a POMDPs-compatible Julia implementation at https://github.com/JuliaPOMDP/ParticleFilters.jl.

I would recommend starting as simple as possible: just have one target, let the radar get range, angle, and doppler for the entire 360 degrees, and have a single dummy action that does nothing. Then see if the belief updater can do what you want. Once you have that, you can think about rewards, etc.

P.S. For more advanced combinations of particle filtering and learning, see https://arxiv.org/abs/2112.09456, https://arxiv.org/abs/2002.09884, https://arxiv.org/abs/2306.00249, and references therein. I think this problem can be solved adequately without any machine learning, but since you mentioned LSTMs, these demonstrate how the two can be combined.

P.P.S. In the future, when you have multiple targets, one difficulty will be data association. If you know which aircraft you are getting returns from, the observation probability density will be straightforward; if the returns could be coming from different aircraft or even a new target, the observation probability density will be much more complex.
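To make the "start simple" suggestion concrete, here is a minimal sketch of the single-target case with ParticleFilters.jl used standalone. The constant-velocity motion model, noise levels, region size, and the fake observation are all made up for illustration, not taken from your problem.

```julia
# Minimal single-target bootstrap particle filter (sketch).
# Assumes ParticleFilters.jl and Distributions.jl; every number here is illustrative.
using ParticleFilters, Distributions

const DT = 1.0  # time step [s]

# State: [x, y, vx, vy]; constant-velocity motion with small process noise.
function dynamics(s, a, rng)
    x, y, vx, vy = s
    [x + vx * DT + 0.1 * randn(rng),
     y + vy * DT + 0.1 * randn(rng),
     vx + 0.05 * randn(rng),
     vy + 0.05 * randn(rng)]
end

# Observation: a single (range, angle, doppler) return over the full 360 degrees.
function obs_likelihood(s_prev, a, s, o)
    x, y, vx, vy = s
    r = hypot(x, y)
    θ = atan(y, x)
    rdot = (x * vx + y * vy) / max(r, 1e-6)   # range rate toward/away from the radar
    pdf(Normal(r, 5.0), o[1]) *
        pdf(Normal(θ, 0.01), o[2]) *
        pdf(Normal(rdot, 1.0), o[3])
end

model = ParticleFilterModel{Vector{Float64}}(dynamics, obs_likelihood)
pf = BootstrapFilter(model, 1_000)

# Initial belief: particles spread over a 100 km × 100 km region.
b = ParticleCollection([[100e3 * rand() - 50e3, 100e3 * rand() - 50e3,
                         100 * randn(), 100 * randn()] for _ in 1:1_000])

a = :do_nothing                  # single dummy action
o = (40_000.0, 0.7, -120.0)      # one fake (range, angle, doppler) return
b = update(pf, b, a, o)          # posterior particle set
```

Once this tracks one target sensibly, the dummy action can be swapped for a real beam-pointing action that restricts which returns are visible.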
-
I'm going to wrap this up. I've learned a lot doing this, but I don't think my initial approaches were very smart. I'm working on a ρPOMDP-style solver that has far fewer hacks than what is going on here; I'll open a new discussion about that once it's playing nicely. :) Thanks again for being so helpful!
-
Hello,
The context is an air traffic control scenario with a beam-agile radar. I'm also very new to Julia and POMDPs.jl, so any "that's not how we do that" is very welcome :)
As I've currently framed it:
State: zero or more targets, each with at least an X and Y coordinate.
Actions: look in some direction; currently Uniform(0, 1), easily discretized, with 0 and 1 as West, 0.25 as North, 0.5 as East, and 0.75 as South.
Observations: an array of zero or more (range, angle, Doppler velocity) tuples.
Belief: ? An LSTM's hidden state? A buffer of previous action-observation pairs? Some other RNN + decoder?
Reward: ? Ideally "distance between belief and reality". I've tried reporting both belief and reality in different ways. My dream representation would be a fairly fine-grained occupancy grid, and then some simple distance between the 'guess' from the belief updater (or the belief updater run through a decoder for this purpose) and 'reality', i.e. the state turned into some sort of N-hot vector for N targets (a rough sketch of this follows the list). Defining Reward in terms of (State, Action) elsewhere has involved pretty ugly hacks; I think here I may be able to just define it in terms of (State, Belief)?
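For what it's worth, here is a rough sketch of that belief-dependent reward, assuming the belief is a particle set (one hypothesized [x, y, vx, vy] per particle in the single-target case) and the state is a vector of target positions. The grid size, region bounds, and helper names (`occupancy`, `nhot`, `cellindex`) are all placeholders, not anything from POMDPs.jl.

```julia
# Sketch of a belief-dependent reward: negative distance between an occupancy
# grid derived from the belief and an N-hot grid derived from the true state.
# Grid resolution, region bounds, and all names are illustrative placeholders.
using ParticleFilters: particles

const NCELLS = 64                        # 64 × 64 occupancy grid
const CELLW  = 100e3 / NCELLS            # cell width for a region spanning ±50 km

cellindex(v) = clamp(floor(Int, (v + 50e3) / CELLW) + 1, 1, NCELLS)

# Fraction of particles in each cell ≈ P(cell occupied) under the belief.
function occupancy(b)
    grid = zeros(NCELLS, NCELLS)
    for p in particles(b)
        grid[cellindex(p[1]), cellindex(p[2])] += 1
    end
    grid ./ length(particles(b))
end

# N-hot grid for the true target positions (`s` is a vector of targets).
function nhot(s)
    grid = zeros(NCELLS, NCELLS)
    for target in s
        grid[cellindex(target[1]), cellindex(target[2])] = 1.0
    end
    grid
end

# Larger reward when the belief's occupancy grid matches reality.
ρ(s, b) = -sum(abs, occupancy(b) .- nhot(s))
```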
The goal is to have some learned world model that reports which cells in a grid it believes are occupied, and then, based on this, solve for a policy that will best update the model.
I've read about ρPOMDPs, where the reward is ρ(s, b) instead of r(s, a). While trying to learn more about these I found Belief Dependent Rewards #387, and my current implementation uses the Belief Markov Decision Process rover at https://github.com/josh0tt/TO_AIPPMS/tree/main/SBO_AIPPMS/GP_BMDP_Rover as a jumping-off point, although currently the actual BMDP is totally ignored.
To escape the chicken-and-egg situation of needing to both model the world and develop a policy, I'm using a random policy to generate sequences for learning a world model: https://github.com/navh/SkySurveillance.jl/blob/main/src/Flat_POMDP/flat_pomdp.jl
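In case it helps to see what I mean by generating sequences, something like the following should do it with POMDPTools (here `FlatPOMDP()` stands in for the problem defined in the linked file, and the episode count and length are arbitrary):

```julia
# Roll out a random policy to collect (s, a, o) sequences for world-model training.
# `FlatPOMDP()` is a placeholder for the problem defined in flat_pomdp.jl.
using POMDPs, POMDPTools

pomdp  = FlatPOMDP()
policy = RandomPolicy(pomdp)

histories = [collect(stepthrough(pomdp, policy, "s,a,o", max_steps=100))
             for _ in 1:1_000]

# Each history is a vector of named tuples, e.g. the actions of the first episode:
first_actions = [step.a for step in histories[1]]
```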
The random-policy setup is somewhat inspired by the documentation mentioning that a belief could be an RNN, but I couldn't find much elaborating on what that looks like in implementation. Is putting learning in a solver like this appropriate?
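For concreteness, here is one way "belief = RNN hidden state" could be phrased against the POMDPs.jl Updater interface. The recurrence below is a hand-rolled tanh RNN with fixed, untrained weights, and `encode_action`/`encode_obs` are placeholder featurizations; training those weights (e.g. with Flux) is a separate problem.

```julia
# Sketch: the "belief" POMDPs.jl passes around is just an RNN hidden vector.
# Weights are random and untrained; encoders are placeholders for real features.
using POMDPs

struct RNNUpdater <: POMDPs.Updater
    W::Matrix{Float64}   # input-to-hidden weights
    U::Matrix{Float64}   # hidden-to-hidden weights
    b::Vector{Float64}   # bias
end

RNNUpdater(in_dim, hidden_dim) =
    RNNUpdater(0.1 * randn(hidden_dim, in_dim),
               0.1 * randn(hidden_dim, hidden_dim),
               zeros(hidden_dim))

encode_action(a) = [Float64(a)]                                  # placeholder featurization
encode_obs(o) = isempty(o) ? zeros(3) : collect(Float64, o[1])   # first return or zeros

POMDPs.initialize_belief(up::RNNUpdater, d) = zeros(size(up.U, 1))

function POMDPs.update(up::RNNUpdater, h::Vector{Float64}, a, o)
    x = vcat(encode_action(a), encode_obs(o))
    tanh.(up.W * x + up.U * h + up.b)
end

# Usage: 1 action feature + 3 observation features -> 32-dimensional hidden state.
up = RNNUpdater(4, 32)
h = initialize_belief(up, nothing)
h = update(up, h, 0.25, [(40_000.0, 0.7, -120.0)])
```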
Can I just put a belief-dependent reward at the bottom of my POMDP, instead of doing the BMDP two-step, if there is no goal other than learning a good world model?
Or do you have totally different ideas on how you would lay these components out?
Thanks, Amos