MDP Generation #46
Merged
Conversation
This gets rid of any protocol data stored on the block, which enables future bit-packing of the state. Also, parents are no longer ordered, which enables future merging of isomorphic DAGs.
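A minimal sketch of what a block could look like after this change, assuming a hypothetical `Block` dataclass (the actual class in the repo may differ): no protocol data is attached, and parents are an unordered set.

```python
# Hypothetical illustration, not the PR's actual data model.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    # No protocol data lives on the block; the state can later be bit-packed.
    # Parents are an unordered frozenset, so DAGs that differ only in parent
    # order compare equal, which is what enables merging isomorphic DAGs.
    parents: frozenset = field(default_factory=frozenset)
```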
Ran into an issue with the pynauty certificate: two non-isomorphic graphs yield the same certificate (pdobsan/pynauty#33). I'll now try applying the canonical relabelling within State.compress() instead.
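For reference, a sketch of the canonical-relabelling idea with pynauty, assuming `lab[i]` returned by `canon_label` is the vertex at canonical position `i`; the graph construction below is illustrative, not the repo's State.compress().

```python
import pynauty


def canonical_key(adjacency, n_vertices):
    """Return an order-independent key for a directed graph.

    adjacency: dict mapping vertex -> list of successor vertices.
    Uses pynauty.canon_label() rather than certificate() alone, because of
    the collision reported in pdobsan/pynauty#33.
    """
    g = pynauty.Graph(n_vertices, directed=True, adjacency_dict=adjacency)
    lab = pynauty.canon_label(g)             # canonical vertex ordering
    pos = {v: i for i, v in enumerate(lab)}  # vertex -> canonical position
    return frozenset(
        (pos[v], frozenset(pos[w] for w in adjacency.get(v, ())))
        for v in range(n_vertices)
    )
```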
Next step is to implement rewards
The selfish mining model was hard-coded in mdp.py. I looked there and reimplemented it against the new model-spec API in sm.py. The next step is to make the exploration (mdp.py) work for generic model specifications, then start thinking about rewards.
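As a rough sketch of what a generic model-spec API could look like (the names below are assumptions; sm.py may use a different interface):

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical model-specification interface."""

    @abstractmethod
    def start(self):
        """Return the initial state."""

    @abstractmethod
    def actions(self, state):
        """Return the actions available in `state`."""

    @abstractmethod
    def apply(self, action, state):
        """Return a list of (probability, successor_state) pairs."""
```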
The model-independent parts of the exploration now live in compiler.py. The reimplementation lacks two important features relative to mdp.py but improves elsewhere:
- State compression is missing; nothing is truncated so far.
- Termination is missing; exploration continues forever.
- MDP matrix generation was certainly broken in mdp.py and I tried to fix it. Needs testing though.
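A minimal sketch of what the generic exploration could look like, assuming the hypothetical `Model` interface above; like the current reimplementation it has no state compression, and a crude state cap stands in for the missing termination:

```python
from collections import deque


def explore(model, max_states=None):
    """Breadth-first enumeration of reachable states; collects sparse
    transition entries (src_id, action, probability, dst_id)."""
    start = model.start()
    state_ids = {start: 0}
    transitions = []
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for a in model.actions(s):
            for p, t in model.apply(a, s):
                if t not in state_ids:
                    state_ids[t] = len(state_ids)
                    queue.append(t)
                transitions.append((state_ids[s], a, p, state_ids[t]))
        # Termination is still missing; without a cap this loops forever
        # on models with unbounded state spaces.
        if max_states is not None and len(state_ids) >= max_states:
            break
    return state_ids, transitions
```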
Inspired by old state compression in sm.py
To avoid giving rewards twice, common history truncation becomes mandatory.
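A deliberately simplified, self-contained sketch of the idea (the real state is a DAG, not a list): rewards are paid out exactly once, at the moment blocks are truncated into common history.

```python
def truncate_common_history(per_block_rewards, common_len):
    """per_block_rewards: rewards of blocks on the current chain, oldest
    first; common_len: length of the prefix both parties agree on.
    Returns (remaining_rewards, paid_reward)."""
    finalized = per_block_rewards[:common_len]
    remaining = per_block_rewards[common_len:]
    paid_reward = sum(finalized)  # paid once, then gone from the state
    return remaining, paid_reward
```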
The traditional and proposed models do not agree for gamma 0.5 ... 0.9 and alpha 0.25 ... 0.35: PTO revenue is higher for our model, while reward per progress is higher in the traditional model. In fact, the PTO-optimal policy against the proposed model performs worse than honest behaviour w.r.t. reward per progress.

In this commit I added steady-state weighted PTO revenues to the pipeline. First results on small problems suggest that the PTO transformation, value iteration, steady-state calculation, and reward-per-progress calculation all do what they should for the traditional model. I guess the most likely explanation now is that the proposed model violates some assumption of PTO.
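For context, a sketch of the reward-per-progress calculation under a fixed policy, assuming dense numpy arrays (P is the row-stochastic transition matrix of the induced Markov chain, r and g are per-state expected reward and progress); the pipeline's actual implementation may differ:

```python
import numpy as np


def steady_state(P, tol=1e-12, max_iter=100_000):
    """Stationary distribution of a row-stochastic matrix via power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi


def reward_per_progress(P, r, g):
    # Steady-state weighted reward divided by steady-state weighted progress.
    pi = steady_state(P)
    return (pi @ r) / (pi @ g)
```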
Results confirm value iteration. Speed is about the same for both algorithms: PI is a bit faster for small traditional problems, VI is a bit faster for our own model. Most importantly, switching to policy iteration does not solve the old problem that PTO produces higher revenue for our model than for the traditional model, while reward per progress is lower (sub-honest) in our model for e.g. alpha = 0.33, gamma = 0.75.
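For comparison, a generic policy-iteration sketch for a finite MDP given as dense arrays (P[a] is the transition matrix under action a, R[a] the expected immediate reward, disc a discount factor close to 1); this is the textbook algorithm being compared against value iteration, not the repo's implementation:

```python
import numpy as np


def policy_iteration(P, R, disc=0.999):
    """P: (A, S, S) transition tensor, R: (A, S) reward matrix."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - disc * P_pi) v = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]
        R_pi = R[policy, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - disc * P_pi, R_pi)
        # Policy improvement: greedy one-step lookahead.
        q = R + disc * (P @ v)
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```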
Following a5a5a0 and 4cafea, I thought the last remaining source of error was the calculation of reward per progress. To rule this out I now tried to:
1. Use the new policy_evaluation(reachable_only=True) on the PTO MDP for a small theta and note down the number of iterations.
2. Do backpropagation in the ARR MDP for that many steps.
3. Calculate the steady state in the ARR MDP.
4. Divide the steady-state weighted reward by the steady-state weighted progress.

With this, the problem seems to be gone, at least on one difficult instance in a notebook. I still have to integrate this into the pipeline; a sketch of the procedure is below.

Referenced commits:
- commit 4cafead (HEAD -> mdp-gen, origin/mdp-gen), Patrik Keller <git@pkel.dev>, Thu Sep 14 13:33:58 2023 +0200: mdp. draft policy_iteration
- commit a5a5a09, Patrik Keller <git@pkel.dev>, Wed Sep 13 21:06:19 2023 +0200: mdp. investigate unexpected results
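A sketch of the four-step check above, under strong simplifying assumptions: P_pto is the (substochastic, terminating) transition matrix of the PTO chain under the fixed policy with per-state reward r_pto, and P_arr, r_arr, g_arr are the ARR chain's transition matrix, reward, and progress under the same policy. The repo's policy_evaluation(reachable_only=True) is stood in for by a plain iterative evaluation that merely counts its sweeps.

```python
import numpy as np


def evaluation_sweeps(P_pto, r_pto, theta=1e-6):
    """Iterative policy evaluation; returns the number of sweeps to converge."""
    v = np.zeros(P_pto.shape[0])
    n_iter = 0
    while True:
        n_iter += 1
        v_new = r_pto + P_pto @ v
        if np.abs(v_new - v).max() < theta:
            return n_iter
        v = v_new


def reward_per_progress_check(P_pto, r_pto, P_arr, r_arr, g_arr, theta=1e-6):
    # 1. Evaluate on the PTO MDP for a small theta; note the iteration count.
    n_iter = evaluation_sweeps(P_pto, r_pto, theta)
    # 2. Backpropagate reward and progress in the ARR MDP for that many steps.
    rew = np.zeros_like(r_arr)
    prog = np.zeros_like(g_arr)
    for _ in range(n_iter):
        rew = r_arr + P_arr @ rew
        prog = g_arr + P_arr @ prog
    # 3. Steady state of the ARR chain (power iteration).
    pi = np.full(P_arr.shape[0], 1.0 / P_arr.shape[0])
    for _ in range(100_000):
        pi = pi @ P_arr
    # 4. Steady-state weighted reward over steady-state weighted progress.
    return (pi @ rew) / (pi @ prog)
```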
Delta to traditional model is gone, finally.
Co-authored-by: roibarzur <roi.barzur@gmail.com>
https://arxiv.org/abs/2309.11924