MDP Generation #46

Merged: 125 commits merged into master on Apr 29, 2024
Conversation

@pkel (Owner) commented Oct 31, 2023

This gets rid of any protocol data stored on the block, which enables
future bit-packing of the state. Also, parents are no longer ordered;
this enables future merging of isomorphic DAGs.
Ran into an issue with the pynauty certificate: two non-isomorphic graphs
yield the same certificate (pdobsan/pynauty#33).

I'll now try applying the canonical relabelling within State.compress()
instead.
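
For context, here is a minimal sketch of what canonical relabelling inside State.compress() could look like, assuming pynauty exposes canon_label() and assuming an illustrative parent-set representation of the DAG; function names and data layout are not the PR's actual code.

```python
import pynauty

def canonical_relabelling(parents):
    """Map each block id (assumed 0..n-1) to its canonical position.

    `parents` is an illustrative dict: block id -> set of parent block ids.
    """
    n = len(parents)
    g = pynauty.Graph(
        n,
        directed=True,
        adjacency_dict={block: sorted(ps) for block, ps in parents.items()},
    )
    # Assumption: canon_label(g)[i] is the vertex placed at canonical
    # position i, so we invert it to get an old -> new label mapping.
    # Using the canonical labelling sidesteps relying on certificate(),
    # which pdobsan/pynauty#33 showed can collide.
    order = pynauty.canon_label(g)
    return {old: new for new, old in enumerate(order)}

def compress(parents):
    """Hashable canonical form of the DAG, usable as a state key."""
    relabel = canonical_relabelling(parents)
    return frozenset(
        (relabel[b], frozenset(relabel[p] for p in ps)) for b, ps in parents.items()
    )
```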
Next step is to implement rewards.
The selfish mining model was hard-coded in mdp.py. I looked there and
reimplemented it w.r.t. the new model spec API in sm.py.
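
To make the later exploration sketch concrete, here is an illustrative guess at what such a model spec API might look like; the method names and signatures below are assumptions, not the actual interface in sm.py.

```python
from typing import Hashable, Iterable, Protocol

class ModelSpec(Protocol):
    """Assumed interface a protocol model exposes to the generic explorer."""

    def start(self) -> Iterable[tuple[object, float]]:
        """Distribution over initial states as (state, probability) pairs."""
        ...

    def actions(self, state: object) -> Iterable[object]:
        """Actions available to the attacker in this state."""
        ...

    def apply(self, state: object, action: object) -> Iterable[tuple[object, float, float]]:
        """Stochastic outcomes of an action as (successor, probability, reward) tuples."""
        ...

    def compress(self, state: object) -> Hashable:
        """Canonical, hashable key used to deduplicate isomorphic states."""
        ...
```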

Next step is to make the exploration (mdp.py) work for generic model
specifications. Then start thinking about rewards.
The model-independent parts of the exploration now live in compiler.py.
The reimplementation lacks two important features relative to mdp.py but
improves elsewhere (a rough sketch of the generic exploration loop follows this list):
- state compression is missing; nothing is truncated so far
- termination is missing; exploration continues forever
- MDP matrix generation certainly was broken in mdp.py and I tried to
  fix it; it still needs testing, though
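
Here is a rough sketch of such a generic exploration loop over the illustrative ModelSpec interface above; it is not compiler.py's actual code. Like the state described in this commit, it does no truncation and only stops once no unseen states remain, which never happens on an unbounded state space.

```python
from collections import deque

class Explorer:
    """Breadth-first expansion of a model spec into explicit transition lists."""

    def __init__(self, model):
        self.model = model
        self.state_id = {}     # compressed state key -> integer id
        self.transitions = []  # (src, action, dst, probability, reward)

    def _id(self, state, queue):
        key = self.model.compress(state)
        if key not in self.state_id:
            self.state_id[key] = len(self.state_id)
            queue.append(state)
        return self.state_id[key]

    def explore(self):
        queue = deque()
        for state, _prob in self.model.start():  # probabilities would seed the start vector
            self._id(state, queue)
        while queue:  # no truncation: runs until no unseen states remain
            state = queue.popleft()
            src = self.state_id[self.model.compress(state)]
            for action in self.model.actions(state):
                for successor, prob, reward in self.model.apply(state, action):
                    dst = self._id(successor, queue)
                    self.transitions.append((src, action, dst, prob, reward))
        return self.transitions
```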
Inspired by old state compression in sm.py
To avoid giving rewards twice, common history truncation becomes
mandatory.
pkel and others added 25 commits October 31, 2023 09:45
Traditional and proposed models do not agree for gamma 0.5 ... 0.9,
alpha 0.25 ... 0.35. PTO revenue is higher for our model. Reward per
progress is higher in the traditional model. Actually, PTO-optimal
policy against proposed model performs worse than honest wrt. reward per
progress.

In this commit I added steady-state weighted PTO revenues to the
pipeline. First results on small problems look like PTO transformation,
value iteration, steady state calculation, and reward per progress
calculation do what they should for the traditional model. I guess most
likely option now is that the proposed model violates some assumptions
of PTO.
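
For reference, a bare-bones value iteration on an explicit tabular MDP, roughly the kind of step mentioned above; the dense (P, R) array layout, the discount parameter, and the stopping rule are assumptions for illustration, not the pipeline's actual code.

```python
import numpy as np

def value_iteration(P, R, discount=0.99, theta=1e-6, max_iter=100_000):
    """P[a][s, s'] = transition probability, R[a][s, s'] = immediate reward.

    Returns the value vector and a greedy policy.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Q-values: expected immediate reward plus discounted successor value.
        q = np.stack([(P[a] * (R[a] + discount * v)).sum(axis=1)
                      for a in range(n_actions)])
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < theta:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=0)
```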
Results confirm value iteration. Speed is about the same for both
algorithms. PI is a bit faster for small traditional problems, VI is
a bit faster for own model.

Most importantly, changing to policy iteration does not solve the old
problem that PTO produces higher revenue for own model than traditional
model, while reward per progress is lower (sub-honest) in our model for
e.g. alpha=.33, gamma=0.75.
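
And a correspondingly minimal policy iteration on the same assumed (P, R) layout, evaluating each policy exactly by solving a linear system; again a sketch for context, not the code behind the numbers above.

```python
import numpy as np

def policy_iteration(P, R, discount=0.99):
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - discount * P_pi) v = r_pi exactly.
        P_pi = np.stack([P[policy[s]][s] for s in range(n_states)])
        r_pi = np.array([(P[policy[s]][s] * R[policy[s]][s]).sum()
                         for s in range(n_states)])
        v = np.linalg.solve(np.eye(n_states) - discount * P_pi, r_pi)
        # Policy improvement: act greedily w.r.t. the evaluated values.
        q = np.stack([(P[a] * (R[a] + discount * v)).sum(axis=1)
                      for a in range(n_actions)])
        improved = q.argmax(axis=0)
        if np.array_equal(improved, policy):
            return v, policy
        policy = improved
```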
Following a5a5a0 and 4cafea, I thought the last source of error was in
the calculation of reward per progress.

To rule this out I now tried to
  1. Use the new policy_evaluation(reachable_only=True) on PTO mdp for small
     theta, note down number of iterations.
  2. Do backpropagation in the ARR mdp for that many steps.
  3. Calculate steady state in the ARR mdp
  4. Divide steady-state weighted reward by steady-state weighted progress

With this, the problem seems to be gone. At least on one difficult
instance in a notebook. Still have to integrate this into the pipeline.

commit 4cafead (HEAD -> mdp-gen, origin/mdp-gen)
Author: Patrik Keller <git@pkel.dev>
Date:   Thu Sep 14 13:33:58 2023 +0200

    mdp. draft policy_iteration

    Results confirm value iteration. Speed is about the same for both
    algorithms. PI is a bit faster for small traditional problems, VI is
    a bit faster for own model.

    Most importantly, changing to policy iteration does not solve the old
    problem that PTO produces higher revenue for own model than traditional
    model, while reward per progress is lower (sub-honest) in our model for
    e.g. alpha=.33, gamma=0.75.

commit a5a5a09
Author: Patrik Keller <git@pkel.dev>
Date:   Wed Sep 13 21:06:19 2023 +0200

    mdp. investigate unexpected results

    Traditional and proposed models do not agree for gamma 0.5 ... 0.9,
    alpha 0.25 ... 0.35. PTO revenue is higher for our model. Reward per
    progress is higher in the traditional model. Actually, PTO-optimal
    policy against proposed model performs worse than honest wrt. reward per
    progress.

    In this commit I added steady-state weighted PTO revenues to the
    pipeline. First results on small problems look like PTO transformation,
    value iteration, steady state calculation, and reward per progress
    calculation do what they should for the traditional model. I guess most
    likely option now is that the proposed model violates some assumptions
    of PTO.
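
A small sketch of steps 3 and 4 from the numbered list above: the stationary distribution of the Markov chain induced by the fixed policy, then steady-state weighted reward divided by steady-state weighted progress. The row-stochastic matrix, the per-state expected reward/progress vectors, and the power iteration are assumptions; the ARR-specific steps 1 and 2 are omitted.

```python
import numpy as np

def steady_state(P_pi, tol=1e-12, max_iter=1_000_000):
    """Stationary distribution of a row-stochastic matrix via power iteration."""
    n = P_pi.shape[0]
    dist = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        nxt = dist @ P_pi
        if np.max(np.abs(nxt - dist)) < tol:
            return nxt
        dist = nxt
    return dist

def reward_per_progress(P_pi, step_reward, step_progress):
    """Steady-state weighted reward divided by steady-state weighted progress.

    step_reward[s] / step_progress[s]: expected one-step reward and progress
    when following the fixed policy from state s.
    """
    dist = steady_state(P_pi)
    return (dist @ step_reward) / (dist @ step_progress)
```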
Delta to traditional model is gone, finally.
Co-authored-by: roibarzur <roi.barzur@gmail.com>
@pkel (Owner, Author) commented Apr 29, 2024

I think this is mostly done. The ideas live on in #48 and #49. Going to merge.

@pkel merged commit 13102a1 into master on Apr 29, 2024
6 checks passed