First version of lock-based parallelization for vanilla MCTS. #86
base: master
Conversation
…e with the vanilla version.
Thank you for the contribution, @kykim0! I won't be able to review this until later this week at the earliest. @WhiffleFish, @himanshugupta1009, since you have been working on similar parallelization for POMDP MCTS, would you be able to give this a quick review in the meantime? (I know you guys haven't done many reviews before, so we can talk about what a review for this entails.) @kykim0, I assume that the new code is always active, even when there is only one thread; is that correct? Can you report the run time for the current master branch to see if there is a significant slowdown for the single-threaded case? Thanks!
…lso, add a few optimizations.
Thanks for the feedback! I actually discovered something quite interesting while trying to compare the run time for the single-thread case: when using a single thread, the Channel approach to implementing a producer-consumer interaction can be quite slow (though this is likely dependent on the MDP). This is presumably because, in the single-thread case, a lot of time is spent context switching between the producer and consumer, whereas with multiple threads they can run in parallel. On the other hand, perhaps not surprisingly, the Channel approach was quite crucial for performance in the multi-threaded case. I ended up adding the old iteration code back so that it runs when the number of threads is 1. I didn't see a good way to combine the two cases, so please let me know if you have a suggestion. It doesn't seem so bad the way I currently have them, though. In terms of run time for the single-thread case, the test case I mentioned above runs in about 1.09 s, whereas the old code takes about 0.81 s, so the new code is a bit slower. I think the difference mainly comes from the logic to support lock-based parallel MCTS, where we do something like …
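A minimal sketch of the producer-consumer pattern described above, for illustration only; the function names (`run_sims`, `simulate!`) are made up and this is not the actual PR code:

```julia
using Base.Threads

# A producer feeds iteration indices into a Channel and consumer tasks
# (one per thread) pull from it. With a single thread, the runtime context
# switches between producer and consumers instead of running them in parallel.
function run_sims(n_iterations::Int, simulate!::Function)
    channel = Channel{Int}(n_iterations) do ch
        for i in 1:n_iterations
            put!(ch, i)             # producer side
        end
    end
    tasks = [Threads.@spawn begin
                 for i in channel   # consumers drain the channel until it closes
                     simulate!(i)
                 end
             end for _ in 1:Threads.nthreads()]
    foreach(wait, tasks)
end
```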
It's also worth mentioning that what is implemented in this PR is in one sense node-level locking (e.g., backpropagation is done independently for different nodes), but in another sense is more akin to tree-level locking (very bad), e.g., having to lock the whole vectors when adding a new node. The latter is somewhat unavoidable given the current design of using vectors to represent tree nodes. This is in fact likely what's slowing things down when many threads are used; see the sketch below. I think to really improve performance we might need to redesign the tree so that each node is represented as a distinct instance. That way we can access different parts of the tree in a more asynchronous manner.
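A hedged illustration of the problem; the field names (`tree.lock`, `s_labels`, `total_n`, `children`) are hypothetical, not the actual MCTS.jl tree fields:

```julia
# With a vector-backed tree, a node is just an index into several parallel
# vectors, so inserting a node mutates shared arrays and the whole tree must
# be locked at once, which is effectively tree-level locking.
function insert_node!(tree, s)
    lock(tree.lock)                    # a single tree-wide lock
    try
        push!(tree.s_labels, s)        # all the push!es must be protected together
        push!(tree.total_n, 0)
        push!(tree.children, Int[])
        return length(tree.s_labels)   # the new node's index
    finally
        unlock(tree.lock)
    end
end
```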
…w (much) better parallelism and simplify the code.
I decided to bite the bullet and try the alternative design of representing each node in the tree as a separate object. This made the code (much) simpler, especially the new logic to support multithreading. It also gives much better multithreading performance since, e.g., we can lock a specific state/action node when backpropagating without locking the whole vectors. In fact, I saw about a 2x improvement with multithreading enabled using the new code for one of the test cases I tried. This was a fairly extensive change, but I'm quite happy with the new code, which is simpler and easier to reason about. Please let me know if you have any feedback!
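A minimal sketch of the node-as-object idea under assumed field names; the actual PR types may differ:

```julia
using Base.Threads

# Each node owns its own lock, so backpropagation only locks the nodes it touches.
mutable struct StateNode{S}
    s_label::S
    total_n::Int
    q::Float64
    lock::ReentrantLock
end
StateNode(s) = StateNode(s, 0, 0.0, ReentrantLock())

function backprop!(node::StateNode, q::Float64)
    lock(node.lock)
    try
        node.total_n += 1
        node.q += (q - node.q) / node.total_n   # running average of returns
    finally
        unlock(node.lock)
    end
end
```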
@kykim0 Awesome! I am excited to look at it.
Hey @kykim0, have you looked into using …?
Thanks @WhiffleFish for the review! I do remember reading about …
@kykim0 In my understanding (which is admittedly quite limited, so correct me if I'm wrong), reentrant locks are safe when there's a chance that a single thread will try to acquire a lock more than once without unlocking, i.e., trying to reenter the already-locked lock (hence "reentrant"). This seems to be more common in parallel cyclic graph traversals; but seeing as we're traversing an acyclic tree, for which our locking/unlocking operations do not span multiple visited nodes, reentrant locks might be unnecessarily safe if the code is correct. In contrast, a spin lock loops in place checking for the lock to open, effectively decreasing the time between a lock opening and a waiting thread engaging the lock. This would be inefficient if it were left spinning doing nothing for a long time, but seeing as the locks here are only engaged for a very short period (about the duration of a few push operations), I would imagine it would be worth it to switch away from reentrant locks. I originally tried using spin locks for parallel POMCPOW because they were mentioned in this parallel MCTS paper: … In regards to using concurrent collections, I've never used the package, so it could very well be that all this talk about locks is entirely unnecessary; let me know if that works out!
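A hedged sketch of the trade-off described above; `add_child!` and `child_lock` are illustrative names, and in practice the lock would likely live on a node:

```julia
using Base.Threads

const child_lock = SpinLock()

# A SpinLock busy-waits instead of sleeping, which is cheap when the
# critical section is only a few push operations long.
function add_child!(children::Vector{Int}, child::Int)
    lock(child_lock)
    try
        push!(children, child)  # very short critical section
    finally
        unlock(child_lock)
    end
end
```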
@WhiffleFish Thanks for the explanation! The difference makes good sense. I'm wondering, though, whether we can generally assume that a given MDP is acyclic. In the paper you linked, they seem to have only used the game of Go for testing, which indeed is acyclic, but in something like grid world it is possible to revisit the same state during a rollout. It is still interesting to know that you saw a noticeable difference in performance using …
@kykim0 Thanks a bunch for your hard work on this! This is awesome work; however, I don't feel comfortable merging such a big change just yet since this is a stable package that people rely on. In particular, when I originally programmed this, I tried mutable node objects and they were much slower because of the additional heap allocations they required. That being said, I am definitely open to this change.
There are a few options for moving forward:
- Perform a thorough benchmark to compare the old tree approach with the new mutable node object approach. If the mutable node object approach is faster or the same, we can accept this PR nearly as-is. It would be best to make a commit to master that has performance tests in it and then merge that into this branch so we can easily compare master to this. (We would want to use a very computationally light MDP for this, even lighter than SimpleGridWorld - something where the state is an Int and the dynamics are sp = s+a; see the sketch after this list.)
- Add a new solver in MCTS.jl. Later, if we realize this new approach is much better, we could deprecate the original solver. (Note: we can skip some secondary things like `@POMDP_require`.)
- Create a new package, MonteCarloTreeSearch.jl, focused on readable, hackable, and parallelizable code. I'd recommend building this on CommonRLInterface.jl.
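A minimal sketch of such an ultra-light benchmark MDP under the POMDPs.jl explicit interface; the type name `LineMDP` and the reward/termination choices are assumptions, not part of the suggestion above:

```julia
using POMDPs, POMDPModelTools  # Deterministic comes from POMDPModelTools

# The state is an Int and the dynamics are sp = s + a.
struct LineMDP <: MDP{Int,Int} end

POMDPs.actions(::LineMDP) = (-1, 1)
POMDPs.transition(::LineMDP, s::Int, a::Int) = Deterministic(s + a)
POMDPs.reward(::LineMDP, s::Int, a::Int) = s + a == 10 ? 1.0 : 0.0
POMDPs.discount(::LineMDP) = 0.95
POMDPs.isterminal(::LineMDP, s::Int) = abs(s) >= 10
POMDPs.initialstate(::LineMDP) = Deterministic(0)
```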
Sorry for all the extra work - doing a thorough job on this will be worth it in the end!
```julia
return new(Dict{S, Int}(),
# ...
# Accessors for state nodes
@inline state(n::StateNode) = n.s_label
```
From the `@inline` docstring: "Small functions typically do not need the `@inline` annotation." We should remove it because it just clutters up the code.
```diff
@@ -37,7 +38,7 @@ policy = solve(solver, mdp)
 state = GridWorldState(1,1)

 @testset "basic" begin
-    a = @inferred action(policy, state)
+    a = @inferred Symbol action(policy, state)
```
Is there a reason for this? It seems unnecessary and might cause a bit of pain if someone wants to change the problem type or copy and paste it in the future. In general, we don't want unnecessary code.
@kykim0 @WhiffleFish, regarding …
Thanks @zsunberg for the review! I agree with your concern, and it does make sense to be more careful with a stable package that people already use. I don't think the new design will be faster than the vector approach for the single-thread case because of, as you also mentioned, the additional heap allocations and the extra logic to support multithreading. But the new design is quite crucial for multithreading performance, as it allows more fine-grained use of locks, and there is also a clear benefit to having one API support both cases. So, one question to ask ourselves is how much performance loss we are willing to live with in the single-thread case for the benefit of having one API. For instance, would 0.81 s vs. 1.09 s for 100,000 iterations with max depth 20 on a SimpleGridWorld MDP be tolerable? This is rather a tricky question. Having said all that, I'm starting to think your suggestion #3 is a better way forward. I don't necessarily need to submit this change to MCTS.jl for my specific use case (AST), but I wanted to make the feature more broadly available. People who are already using MCTS.jl are likely content with the current set of features (e.g., single-threaded execution), so it might make more sense to keep that code optimized for their use case. I took a brief look at CommonRLInterface.jl, and it seems to support AlphaZero.jl, so there is also the benefit of being able to test the parallel MCTS code on a broader set of environments. So, unless you feel strongly about the other options, I'll hold off on this PR and explore #3. Is there a good place for the new package? Regarding …
```julia
end
end
for sim_task in sim_channel
    CPUtime_us() > timeout_us && break
```
This should not be CPUtime anymore because we are multithreading - see how @WhiffleFish did it here: JuliaPOMDP/POMCPOW.jl#32
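A hedged sketch of a wall-clock timeout in the spirit of that fix; `solver.max_time` and the loop body are assumptions, not the actual PR code:

```julia
# Wall-clock time keeps advancing while worker threads run in parallel,
# whereas CPU time accumulates across all threads and expires too early.
timeout = time() + solver.max_time   # hypothetical solver field, in seconds
for sim_task in sim_channel
    time() > timeout && break        # wall clock is valid across threads
    # ... process sim_task ...
end
```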
Thanks! That looks like a nice find.
As far as (2) vs (3), I don't really have a strong preference. I'd say go with your gut. The main advice I have is to take baby steps and be prepared to scrap large portions of what you do and start those portions over :) We are still learning a lot as we go. If you want to, it would certainly be nice to have (2) as an intermediate step, but we may delete it if (3) is successful.
Do you have permission to create JuliaPOMDP/MonteCarloTreeSearch.jl ? I'd recommend using PkgTemplates.jl to create all the structure.
I think we should investigate this more. I think we can avoid it. I think the issue is that right now you are just spawning … Excited to make this work! :)
Actually, the pattern I was thinking of won't work because you don't have any control over which threads do which tasks. So, as of this moment, I think we need to use reentrant locks. I am still curious, though, whether only having …
My intuition about several things has been wrong: it is slightly better to launch a large number of individual tasks, and it doesn't seem like SpinLock offers any performance advantage in this test: https://gist.github.com/zsunberg/9743519bfd1d9bb58e5c5c55a6e759b7
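A minimal sketch of the "large number of individual tasks" pattern being compared; `simulate!` is a hypothetical stand-in for a single MCTS iteration:

```julia
using Base.Threads

# Spawn one small task per iteration and let the scheduler spread them
# across threads, rather than spawning one long-running task per thread.
function run_parallel(n_iterations::Int, simulate!::Function)
    tasks = [Threads.@spawn simulate!() for _ in 1:n_iterations]
    foreach(wait, tasks)
end
```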
Thanks for the feedback (also for the nice comparison test)! I'll go ahead and start a new package, JuliaPOMDP/MonteCarloTreeSearch.jl, and also see if we can do something quick for #2. I'll keep this PR open, though, so I can come back to the code while porting things over :-)
Looks like I do have permission to create a package under JuliaPOMDP, but I wasn't quite sure what the canonical settings are for JuliaPOMDP packages. Could you share with me sample code using …
I was using the same tasking method that @kykim0 used when I saw the performance improvement with SpinLocks. In fact, I fixed the deadlocks on my own MCTS fork, and SpinLocks again offered a ~2x speed improvement over ReentrantLocks.

```julia
using POMDPs, POMDPModels, StaticArrays, BenchmarkTools, MCTS

mdp = SimpleGridWorld()
sol = MCTSSolver(n_iterations=100_000, depth=20, exploration_constant=5.0)
planner = solve(sol, mdp)
@benchmark action(planner, SA[1,2])
```

With 4 threads... That being said, seeing as this was quickly cobbled together with minimal testing, I can't attest to how safe my code is.
@kykim0, sorry, I do not have example code. In general, we want the following things: …
Most things can be added later if you don't get it exactly right the first time. |
Hmmm... I guess we should figure out exactly when deadlocks will happen, i.e., what does "Recursive use" mean? This seems to always cause a deadlock if run on a single task:

```julia
l = Threads.SpinLock()
lock(l)
lock(l)
```

But that seems pretty easy to avoid. It also seems that trying to lock from different tasks on the same thread will cause a deadlock:

```julia
l = Threads.SpinLock()
function f(l)
    lock(l)
    sleep(5)
    unlock(l)
end
Threads.@spawn f(l)
sleep(1)
lock(l)
println("done") # never prints this if run with one thread on my machine
```

Then the question is how could you ever safely use a …
I went ahead and asked about it here: https://discourse.julialang.org/t/when-can-tasks-yield-or-how-to-use-spinlocks-safely/73740
I spent some time reading up on this today, and another architecture we should consider for tree parallelism is: …
I tried the fix by @WhiffleFish, which was to not unnecessarily grab the node lock in the …
The difference between (b) and (c), i.e., ReentrantLock vs. SpinLock, is interesting, but I find the difference between (a) and (b) equally interesting; it suggests that lock contention can have quite an impact on performance (sounds obvious, but interesting to see with actual numbers). Perhaps what all this means is that we should go with a lock-free approach if we want to optimize for performance. Ideally, I think it'd be nice to support both an approach that maintains 'data integrity' at the expense of (hopefully) a bit of performance loss and an approach that optimizes for performance. Based on our discussion so far, I think we want to try …
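A hedged sketch of one lock-free ingredient mentioned above, atomic visit and value statistics; the `NodeStats` type and field names are hypothetical:

```julia
using Base.Threads

# Lock-free updates: threads increment node statistics atomically instead of
# serializing on a lock, at the cost of weaker consistency guarantees.
mutable struct NodeStats
    n::Atomic{Int}
    q_sum::Atomic{Float64}
end
NodeStats() = NodeStats(Atomic{Int}(0), Atomic{Float64}(0.0))

function record_visit!(stats::NodeStats, q::Float64)
    atomic_add!(stats.n, 1)
    atomic_add!(stats.q_sum, q)   # the mean value is q_sum[] / n[]
end
```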
On a related note, I created a new package: https://github.com/JuliaPOMDP/MonteCarloTreeSearch.jl. I'm looking into implementing something like what I've done in this PR on top of CommonRLInterface. It'll be exciting to try our code on a broader set of environments, e.g., via AlphaZero.jl!
Just wanted to share this very interesting thread on lock performance: https://discourse.julialang.org/t/poor-performance-of-lock-l-do-closures-caused-by-poor-performance-of-capturing-closures/42067/13. It might be worth trying the `Base.@lock` approach to see if that makes much difference in our use case. On a related note, I've been a bit occupied with other things, so I wasn't able to make more progress on this (i.e., porting code over to MonteCarloTreeSearch.jl), but I hope to revisit it in the next couple of weeks :-)
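A minimal sketch of the difference discussed in that thread; `counter` is illustrative:

```julia
using Base.Threads

l = ReentrantLock()
counter = Ref(0)

# Closure form: lock(l) do ... end allocates a capturing closure,
# which the linked thread reports as the source of the slowdown.
lock(l) do
    counter[] += 1
end

# Macro form: Base.@lock expands to lock/try/finally/unlock with no closure.
Base.@lock l begin
    counter[] += 1
end
```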
@kykim0, thanks for sharing the thread. That is interesting. Excited for when you are able to get back to the MCTS!
This is an initial implementation of lock-based parallelization for vanilla MCTS. It is similar to tree parallelization with node-level locking, though there may be some differences due to how tree nodes are implemented.
I've done fairly extensive testing to check program correctness as well as performance. For this initial change, though, I focused more on correctness than performance. The latter can be improved by, e.g., more fine-grained use of locks or a lock-free approach, which will be looked into next.
On an Ubuntu machine with an i7-6700K CPU @ 4.00GHz (4 cores), the time to solve SimpleGridWorld with n_iterations=100_000, depth=20, exploration_constant=5.0 was roughly:

| Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Time (s) | 3.137 | 2.015 | 1.708 | 1.966 | 2.698 | 5.545 | 6.670 |

In other words, using 3-4 threads finished quickest, and using too many threads took longer than using a single thread, presumably because of thread contention.