Real-Time Dynamic Programming #49

pkel · 2024-04-29T12:06:15Z

While working on #48 I came to the conclusion that modern RL algorithms might be overkill for my type of problem. I went back to the tabular solving approach kicked off in #46 and came up with a new solving algorithm that is similar to value iteration but:

samples exploration paths from a dynamic environment
builds the tabular state space on the fly
does dynamic programming state-value updates in the meantime

According to Sutton and Barto's book on RL, this falls into the broad category of "Asynchronous Dynamic Programming". After some googling, I think what I've implemented is Real-Time Dynamic Programming (RTDP).
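The three points above can be sketched roughly as follows. This is a minimal illustration, not the code from this branch: the toy `model` (an unbounded chain with `wait`/`cash` actions), the discount factor `GAMMA`, the epsilon-greedy exploration, and the zero initial value for unseen states are all assumptions made for the example.

```python
import random

GAMMA = 0.9                 # assumed discount factor
ACTIONS = ("wait", "cash")  # hypothetical actions of the toy model


def model(state, action):
    """Toy model with an unbounded state space: [(prob, next_state, reward)]."""
    if action == "cash":
        return [(1.0, 0, float(state))]
    return [(0.5, state + 1, 0.0), (0.5, 0, 0.0)]


def q_value(values, state, action):
    # Bellman backup for one action; states never seen before default to 0.
    return sum(p * (r + GAMMA * values.get(nxt, 0.0))
               for p, nxt, r in model(state, action))


def rtdp(episodes=200, horizon=50, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    values = {}  # the tabular state space is built on the fly
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            # Dynamic programming update, only for the state being visited.
            qs = {a: q_value(values, state, a) for a in ACTIONS}
            values[state] = max(qs.values())
            # Mostly follow the current greedy policy, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(qs, key=qs.get)
            # Sample the next state of the exploration path from the model.
            outcomes = model(state, action)
            weights = [p for p, _, _ in outcomes]
            _, state, _ = rng.choices(outcomes, weights=weights)[0]
    return values


if __name__ == "__main__":
    table = rtdp()
    print(len(table), "states discovered")
```

Only states that actually occur on sampled paths end up in `values`, which is why the table stays small even though the model itself is unbounded.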

The results seem promising. I can now handle a non-truncated, and hence infinite, state space instance of the generic DAG model for Nakamoto/Bitcoin.

pkel · 2024-08-25T08:32:41Z

I was initially hyped about this RTDP thing because:

it does exploration on the fly
does not use state approximations
Sutton and Barto provide a proof that it converges to the optimal policy

After implementing the algorithm I:

observed that it does not converge but instead stops exploring new states
tried to fix it and failed
noticed that the convergence is only guaranteed if all states are visited regularly (maybe all states reachable by optimal policy would be enough)
concluded that if all states are visited regularly I could just as well use traditional dynamic programming, e.g. value iteration.
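For comparison, a minimal sketch of traditional value iteration, which updates every state in every sweep and therefore needs a finite, enumerable state space. The toy `model`, the truncation bound `MAX_STATE`, and the discount factor `GAMMA` are assumptions made for this example, not part of the project's code.

```python
GAMMA = 0.9                 # assumed discount factor
ACTIONS = ("wait", "cash")  # hypothetical actions of the toy model
MAX_STATE = 20              # hypothetical truncation of the state space


def model(state, action):
    """Toy truncated chain: [(prob, next_state, reward)]."""
    if action == "cash" or state == MAX_STATE:
        # At the truncation bound, waiting is treated like cashing out.
        return [(1.0, 0, float(state))]
    return [(0.5, state + 1, 0.0), (0.5, 0, 0.0)]


def value_iteration(tol=1e-9):
    values = [0.0] * (MAX_STATE + 1)
    while True:
        # One full sweep: back up every state, whether or not it is reachable.
        new_values = [
            max(
                sum(p * (r + GAMMA * values[nxt]) for p, nxt, r in model(s, a))
                for a in ACTIONS
            )
            for s in range(MAX_STATE + 1)
        ]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:
            return values


if __name__ == "__main__":
    print(value_iteration()[:3])
```

Because each sweep touches all states, the "visited regularly" requirement is met by construction, at the price of having to truncate the model first.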

Merging/closing this now, as I'm about to explore a somewhat separate idea which re-uses parts of the tooling.

This was referenced Apr 29, 2024

MDP Generation #46

Merged

Dynamic environment with probabilistic termination implemented in Rust #48

Merged

pkel force-pushed the rtdp branch from 54e2776 to f7dbf28 Compare April 29, 2024 12:44

pkel added 16 commits April 29, 2024 14:56

Draft monte carlo value iteration

16baa30

No clue whether the algorithm is actually called that. I do value iteration but choose the order of state-value updates randomly, weighted by the current state-value estimates, the estimated optimal policy, and an exploration term.

Record statistics about how often states are visited

b6b6318

Guide exploration along honest policy

329b488

Clearly separate PTO from the solving algorithm

cc63083

Warm the agent up with honest policy

a5ccd8c

Memoization

37207f3

Use "Exploring Starts" method in 50% of the episodes played

3b19ceb

Be smart about choosing random start states.

e0671d6

Store state hashes, not full states

ec68ba7

Speed up sm.py state editor.

bfab49d

Use pickle instead of numpy adjacency matrix. Adjacency-matrix-based load/save was O(n^2) in memory and time!

Take progress from defender chain instead of common chain.

5268dab

We can now handle non-truncated models with monte carlo value iteration!

Implement fair shutdown for sm.py

be20d16

Use fair shutdown for initial state value estimates

558103e

Calculate expected future progress during optimization

80906d6

Add missing dependency

49d6591

It's called real time dynamic programming, I think.

9e7e90b

pkel force-pushed the rtdp branch from f7dbf28 to 9e7e90b Compare April 29, 2024 12:56

pkel added 10 commits May 6, 2024 22:20

Draft RTDP evaluation script

13c051d

compare rtdp start values to vi start values instead of steady state values

b1b55ef

add rtdp generic experiment; reveal bug in sm.SelfishMining.honest()

74b9f62

Fix honest action in generic model

478c48c

Extract working MDP from RTDP agent; report size of policy induced MC

4485d42

Add todo note

6cae85a

Try analysing steady states; fail

7fee1ef

remove steady state related stuff

2ad88c8

rethink exploring starts in rtdp

b3d63f1

improve exploring starts logic

1f0fbb4

pkel added 3 commits May 12, 2024 16:10

Revise shutdown and initial value estimate

837a97e

add rtdp debug notebook

9a9c407

Revise shutdown of aft20 model, add reminder for generic model

7f0ecd5

pkel force-pushed the rtdp branch from 6d7621d to 7f0ecd5 Compare May 17, 2024 09:08

pkel added 3 commits May 17, 2024 15:50

run policy evaluation for rtdp measurements

cd4a8ae

Enumerate available actions in mdp.Compiler;

e130f72

and make this apparent in the mdp.tab data type

Evaluate rtdp policy in tabular mdp

7a6fba9

pkel merged commit f38ddc7 into master Aug 25, 2024
4 checks passed
